Tokenization — Lumo glossary

Tokenization is the silent first step. The model never sees your sentence; it sees a sequence of subword fragments, drawn from a fixed vocabulary, each mapped to an integer. Most strange model behaviors begin here.

In plain language

In AI and machine learning, you will run into this term whenever someone talks about how a model is built or used. If you are new to the field, the simplest mental model is this: tokenization is slicing text into the units a model actually reads. Read the entry once with that frame in mind, then come back and read it again; that is usually enough for the rest to make sense.
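
To make the slicing concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the seven-entry vocabulary is invented, and the greedy longest-match rule is a simplification. Production tokenizers typically learn vocabularies of many thousands of fragments from data and apply their own splitting rules, so treat this as a picture of the idea rather than how any particular model does it.

```python
# Toy vocabulary: fragment -> integer id. Invented for this illustration only.
VOCAB = {
    "token": 0,
    "ization": 1,
    "read": 2,
    "ing": 3,
    "s": 4,
    " ": 5,
    "<unk>": 6,  # fallback id for text the vocabulary cannot cover
}


def tokenize(text: str) -> list[str]:
    """Split text into fragments by greedy longest match against VOCAB."""
    fragments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until one is in VOCAB.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                fragments.append(piece)
                i = end
                break
        else:
            # Nothing in the vocabulary matches here: emit the fallback
            # and move on by one character.
            fragments.append("<unk>")
            i += 1
    return fragments


def encode(text: str) -> list[int]:
    """Map each fragment to its integer id; this is what the model receives."""
    return [VOCAB[fragment] for fragment in tokenize(text)]


print(tokenize("tokenization reads"))  # ['token', 'ization', ' ', 'read', 's']
print(encode("tokenization reads"))    # [0, 1, 5, 2, 4]
```

Two words go in and five integers come out. The model only ever works with the integers, which is why the way the text was cut can matter more than it first appears.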

FIG. 1. Tokenization, seen from a second angle: slicing text into the units a model actually reads.

An everyday picture

Think of Tokenization less like a thinking person and more like someone cutting a sentence into pieces from a fixed box of tiles. Common words tend to get a tile of their own; rarer words are spelled out from several smaller tiles. The cutting involves no understanding; it only decides what the model gets to read.

Where it shows up

Tokenization tends to sit inside products that need to read, write, or recognise text without a hard-coded rule: assistants, search, document tools, voice apps. It is rarely the only moving part, and users rarely see it directly, but its effects are often what they feel.

A small example

Imagine the scene above: a chatbot in a customer service portal reads a question and returns a draft reply. The role Tokenization plays is the one its blurb describes, slicing text into the units a model actually reads. Before anything else happens, the question is cut into fragments and turned into integers. Several other AI ideas (model, prompt, context) are also at work behind the single button you saw.

Common misunderstanding

MYTH

It is easy to assume Tokenization splits text the way a person does, into whole words that carry meaning. It does not. It applies a fixed vocabulary of fragments, so a word can be cut at places no reader would choose, and the model's view of a sentence is not the same thing as yours.

One line to take with you

Tokenization decides what the model gets to read: fragments and integers, not words. When a model behaves strangely, check how the text was cut.

One letter a week, lasting understanding.
