Tokenization
Tokenization is the process of splitting text into tokens before an AI model can process it, converting human-readable language into the units the model actually operates on.
Tokenization is the step that turns your text into model-readable pieces. Before a model can do anything with a passage, it breaks the text into tokens — splitting words into common sub-pieces so it can handle any input, including names and rare terms it never saw whole during training.
For AEO it's mostly background plumbing, but it has one practical implication: clear, conventional language tokenizes predictably and is represented cleanly, whereas odd formatting, run-together strings, or gimmicky spellings can fragment into messy tokens that are harder for the model to interpret and quote. Plain, well-formed prose — part of the extractability pillar — is the safest input. Tokenization is also the precursor to creating embeddings.
Example. A normal word like "optimization" tokenizes neatly, while a stylized string like "Op-Ti-Miz-Ation!!!" fragments into many awkward tokens — needless friction for a model trying to read and reuse your point.
Relevant pillar
Related terms
- TokenA token is the basic unit of text an AI model processes — typically a word or word-piece — and is how model limits, costs, and context windows are measured.
- EmbeddingsEmbeddings are numerical representations of text that capture its meaning, letting AI systems find passages that are semantically related to a query even when they share no exact keywords.
- Large Language Model (LLM)A large language model is an AI system trained on vast amounts of text to predict and generate language, and is the engine that writes the answers in AI search.