AEO Canon · the reference for answer-engine optimization

What Is Tokenization in AI?

Tokenization is how an AI model breaks text into tokens — words or word-pieces — that it can process numerically. Tokens are the unit LLMs read, predict, and bill by, and they shape cost, limits, and clarity.

Burke Atkerson · 2 min read

Tokenization is how an AI model breaks text into tokens — words or word-pieces — that it can turn into numbers and process. Tokens are the atomic unit of everything an LLM does: it reads in tokens, predicts in tokens, is priced per token, and has limits measured in tokens. Understanding them explains a lot of otherwise-mysterious AI behavior.

Tokenization · what the model reads

Models don't read words; they read tokens (whole words or word-pieces). For example, a simplified splitter breaks one sentence like this:

Answ##er engi##ne opti##miza##tion is resh##apin##g sear##ch.
14 tokens · 47 characters · 3.4 chars / token

Illustrative only; real models use a learned vocabulary. The takeaway holds: token count (not word count) drives context limits and cost, so tight, plainly worded passages go further.

What exactly is a token?

A token is the smallest chunk of text an LLM handles — typically a whole common word or a piece of a less common one. Modern models use subword tokenization: frequent words map to a single token, while rarer words break into familiar fragments. "dog" is one token; "tokenization" might split into "token" and "ization." Each token is then converted into an embedding the model can do math on, the first step in how LLMs work.
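As a rough illustration of the split described above, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary, the IDs, and the `<unk>` placeholder are all hypothetical and chosen only to reproduce the "dog" and "tokenization" examples; real models use a learned vocabulary and their own splitting rules.

```python
# Toy vocabulary: hypothetical pieces and IDs, not taken from any real model.
VOCAB = {"dog": 0, "token": 1, "ization": 2, "s": 3, " ": 4}


def tokenize(text, vocab=VOCAB):
    """Split text into known pieces, always taking the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: emit a placeholder and move on.
            tokens.append("<unk>")
            i += 1
    return tokens


def encode(text, vocab=VOCAB):
    """Map each token to its integer ID (unknown pieces share one extra ID)."""
    unk_id = len(vocab)
    return [vocab.get(tok, unk_id) for tok in tokenize(text, vocab)]


print(tokenize("dog"))           # ['dog']
print(tokenize("tokenization"))  # ['token', 'ization']
print(encode("tokenization"))    # [1, 2]
```

The integer IDs are what a model would look up in its embedding table; that lookup is omitted here.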

Subword tokenization

Instead of a dictionary of whole words, models learn a vocabulary of word-pieces, commonly via an algorithm called byte-pair encoding. This lets them represent any string (new words, names, typos, code, other languages) by composing known pieces, while keeping the vocabulary to a manageable size: typically tens of thousands of tokens, and up to a few hundred thousand in some recent models.
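To make the idea of learning word-pieces concrete, the sketch below implements a toy version of byte-pair encoding: it repeatedly fuses the most frequent adjacent pair of symbols in a small word list. The function names (`learn_bpe`, `merge_pair`, `apply_bpe`) and the sample words are illustrative assumptions; production tokenizers work on bytes, handle whitespace and word boundaries, and train on far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first-seen pair wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)                      # [('l', 'o'), ('lo', 'w')]
print(apply_bpe("lowly", merges))  # ['low', 'l', 'y']
```

Note that "lowly" never appeared in the training words, yet it still splits into known pieces plus single characters, which is the property the paragraph above describes.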

Why not just use whole words?

Models don't use whole words because subword tokens are far more flexible and efficient. A whole-word vocabulary would be enormous, would break on any word it never saw, and would handle misspellings and other languages poorly. Subword tokenization solves all three: it can build "antidisestablishmentarianism" or a brand-new product name out of pieces it already knows, with no special cases. The dominant method, byte-pair encoding, began as a data-compression algorithm and was adapted for subword segmentation in "Neural Machine Translation of Rare Words with Subword Units" (arXiv 1508.07909). Lab tokenizer documentation lets you see exactly how a given string splits, which is useful when a passage behaves unexpectedly.

Why do tokens determine cost and limits?

Tokens determine cost and limits because LLMs measure both in tokens, not words. API pricing is per token (input and output), and a model's context window, the maximum text it can consider at once, is a token count. Three consequences follow:

  1. Verbosity costs money

     Padded, repetitive text uses more tokens to say the same thing, raising API cost and filling context faster.

  2. Unusual text costs more tokens

     Code, emoji, non-English scripts, and odd formatting often use more tokens per unit of meaning than plain prose.

  3. Budgets are finite

     Everything (your prompt, retrieved sources, and the answer) shares one token budget. Efficient text leaves more room for the parts that matter.

As a working estimate, 750 English words is roughly 1,000 tokens, so a typical 1,500-word article is about 2,000 tokens.
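The arithmetic behind that estimate, and the shared budget it feeds into, can be sketched in a few lines. The 0.75 words-per-token ratio is the rule of thumb stated above; the price per million tokens and the context-window size are hypothetical placeholders, not any provider's real figures.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose; real ratios vary


def estimate_tokens(word_count):
    """Approximate token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(tokens, price_per_million):
    """Cost at a given per-million-token rate (the rate is an assumption)."""
    return tokens * price_per_million / 1_000_000


def fits(context_window, *parts):
    """True if prompt, sources, and answer together stay within the window."""
    return sum(parts) <= context_window


article = estimate_tokens(1500)
print(article)                       # 2000
print(estimate_cost(article, 3.0))   # cost at a hypothetical 3.00 per million tokens

# Hypothetical 8,000-token window shared by a prompt, three articles, and an answer.
print(fits(8000, 500, article * 3, 1000))  # True
```

For exact counts, use the tokenizer that matches the model you are calling rather than a word-based estimate.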

How does tokenization affect getting cited?

Tokenization affects citation indirectly, but the effect is real: content that tokenizes cleanly and says more per token is cheaper to process and easier for a model to represent accurately. When an answer engine retrieves passages and fits them into a limited context window, concise, conventionally written passages compete better than bloated ones, because there is more room for them and less noise around the point. That's another angle on why extractability rewards tight, answer-first passages.
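One way to picture the "limited room" argument is a greedy packer that walks a relevance-ranked list of passages and keeps each one only if it still fits the remaining token budget. This is purely an illustrative assumption: answer engines do not publish their selection logic, and the passage names, token counts, and budget below are invented.

```python
def pack_passages(ranked_passages, budget):
    """Keep passages in rank order while they fit within the token budget.

    ranked_passages: list of (name, token_count), most relevant first.
    Returns the chosen names and the total tokens used.
    """
    chosen, used = [], 0
    for name, tokens in ranked_passages:
        if used + tokens <= budget:
            chosen.append(name)
            used += tokens
    return chosen, used


# Hypothetical passages: three tight ones and one bloated one.
tight_first = [("tight_a", 300), ("tight_b", 250), ("tight_c", 400), ("bloated", 900)]
bloated_first = [("bloated", 900), ("tight_a", 300), ("tight_b", 250), ("tight_c", 400)]

print(pack_passages(tight_first, budget=1000))    # (['tight_a', 'tight_b', 'tight_c'], 950)
print(pack_passages(bloated_first, budget=1000))  # (['bloated'], 900)
```

Under these made-up numbers, three concise passages fit where a single padded one would crowd out everything else.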

It also reframes a familiar mistake: keyword stuffing doesn't just fail to help AEO — it wastes tokens, diluting the signal in every passage the model reads. Write for meaning per token.

Next: read about embeddings to see what happens to tokens after they're split, or about the context window for the limit they're measured against.

Frequently asked questions

What is a token in an LLM?
A token is the basic unit of text an LLM reads and generates — usually a word or a word-piece, not always a whole word. Common words are often one token; rarer words split into several. As a rough rule of thumb in English, one token is about four characters, or roughly 0.75 words.
Why don't models just use words?
Because subword tokenization handles any input gracefully. Splitting rare or novel words into known pieces (like 'token' + 'ization') lets the model represent words it never saw in training, handle typos and other languages, and keep its vocabulary a manageable size. Pure whole-word vocabularies would be huge and brittle.
Why does tokenization affect cost and limits?
Because LLMs are priced and bounded by tokens, not words. API costs are per token, and a model's context window is measured in tokens. Verbose or unusual text uses more tokens for the same meaning, raising cost and consuming more of the available context.
How many tokens is a typical page?
Very roughly, 750 words is about 1,000 tokens in English. So a 1,500-word article is around 2,000 tokens. Code, non-English text, and unusual formatting can use more tokens per word than plain prose.
