AEO Canon · the reference for answer-engine optimization

How Do Large Language Models Work?

LLMs work by breaking text into tokens, converting them to embeddings, using a transformer's attention mechanism to weigh context, and predicting the next token one at a time — repeated to generate full answers.

Burke Atkerson · 3 min read

LLMs work by turning text into tokens, converting those tokens into numeric vectors, using a transformer's attention mechanism to weigh context, and then predicting the next token one at a time. Repeat that prediction step a few hundred times and you get a full answer. Every capability — writing, reasoning, summarizing — is built on this single loop.

01 Prompt: your text goes in
02 Tokenize: split into tokens
03 Predict: weigh context, pick the next token
04 Repeat: append and predict again
05 Answer: tokens become text
An LLM generates one token at a time, each conditioned on everything before it.

What are the steps from prompt to answer?

An LLM turns your prompt into an answer through a fixed sequence of steps, repeated for every token it produces. The loop is the same whether the model is writing a poem or debugging code.

  1. Tokenize the input

    Your text is split into tokens — words or word-pieces. 'unbelievable' might become 'un', 'believ', 'able'. See tokenization for why this matters.

  2. Embed the tokens

    Each token is mapped to a high-dimensional vector (an embedding) that encodes its meaning, and information about its position in the sequence is added, so the model can do math on language.

  3. Apply attention across layers

    Stacked transformer layers use attention to weigh how strongly each token should influence every other — building a context-aware representation.

  4. Predict the next token

    The model outputs a probability distribution over all possible next tokens and selects one (with some controlled randomness).

  5. Repeat

    The chosen token is appended to the sequence and the whole process runs again — until the answer is finished.

The pieces above each have their own guide: tokenization, embeddings, and the context window that bounds how much text the model can attend to at once.
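
To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `TOY_MODEL` is a hand-written lookup table standing in for a trained network, the tokenizer just splits on spaces, and the embedding and attention steps are collapsed into a single `predict_next` function. It shows the shape of the loop, not how a real model computes its predictions.

```python
# Hypothetical stand-in for a trained model: maps the last token to
# next-token probabilities. A real LLM conditions on the whole sequence
# through embeddings and attention, not just the last token.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}


def tokenize(text):
    """Step 1: split text into tokens (here, simply lowercase words)."""
    return text.lower().split()


def predict_next(tokens):
    """Steps 2-4 collapsed: return a distribution over the next token."""
    last = tokens[-1] if tokens else "<start>"
    return TOY_MODEL.get(last, {"<end>": 1.0})


def generate(prompt, max_new_tokens=10):
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        distribution = predict_next(tokens)
        # Greedy choice keeps this sketch repeatable; real systems usually sample.
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # Step 5: append and predict again
    return " ".join(tokens)


print(generate("The"))  # prints "the cat sat"
```

The point to notice is that `generate` never plans the whole answer. It only ever asks for one more token, given everything produced so far.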

What is attention, and why does it matter?

Attention is the operation that lets a model decide which earlier tokens matter most when predicting the next one — and it is the breakthrough that made modern LLMs possible. Introduced in the 2017 paper "Attention Is All You Need" (arXiv 1706.03762), the transformer architecture replaced slower sequential processing with attention that can relate every token to every other token in parallel.
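
The weighting step can be sketched as scaled dot-product attention, the formulation used in that paper. The sketch below is a simplification under stated assumptions: the two-dimensional token vectors are made up, the same vectors serve as queries, keys, and values (a real transformer derives each from learned projections), and there is no masking or multi-head structure.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dimension).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up vectors: "it" is deliberately placed close to "cat".
words = ["cat", "sat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = attention(vectors, vectors, vectors)
for word, weight in zip(words, weights[2]):
    print(f"'it' attends to '{word}' with weight {weight:.2f}")
```

Because every query is scored against every key independently, the rows can be computed in parallel, which is the practical advantage over step-by-step sequential processing.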

Why attention is the whole game

Attention is how a model links "it" to the noun three sentences back, or a question to the one fact in a long passage that answers it. For content, the practical implication is direct: a clear, self-contained passage is easier for attention to resolve correctly than one riddled with vague references — which is exactly what extractability rewards.

Tokenization · what the model reads

Models don't read words; they read tokens (whole words or word-pieces). For example, the sentence "Answer engine optimization is reshaping search." splits like this:

Answ ##er engi ##ne opti ##miza ##tion is resh ##apin ##g sear ##ch .
14 tokens · 47 characters · 3.4 chars / token

Illustrative only — real models use a learned vocabulary. The takeaway holds: token count (not word count) drives context limits and cost, so tight, plainly-worded passages go further.
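
The split shown above is consistent with cutting each word into four-character pieces and marking continuation pieces with `##`. The sketch below reproduces it under that assumption. The function `toy_tokenize` and its `piece_len` parameter are hypothetical names for this illustration. Production tokenizers use a learned vocabulary (for example byte-pair encoding), so their splits will differ.

```python
import re


def toy_tokenize(text, piece_len=4):
    """Split words into fixed-size pieces; '##' marks a continuation piece."""
    tokens = []
    # Words and single punctuation marks are handled as separate chunks.
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        pieces = [chunk[i:i + piece_len] for i in range(0, len(chunk), piece_len)]
        tokens.append(pieces[0])
        tokens.extend("##" + piece for piece in pieces[1:])
    return tokens


text = "Answer engine optimization is reshaping search."
tokens = toy_tokenize(text)

print(" ".join(tokens))
print(f"{len(tokens)} tokens, {len(text)} characters, "
      f"{len(text) / len(tokens):.1f} chars / token")
# prints: 14 tokens, 47 characters, 3.4 chars / token
```

Counting tokens this way, rather than counting words, is what makes the cost of a passage visible: "optimization" is one word but three tokens here.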

Does an LLM know facts, or just predict words?

An LLM stores a compressed, lossy impression of its training data in its parameters — so it "knows" facts only in the sense that the patterns make them likely to be reproduced. There is no database inside the model you can look up; there are billions of weights that make true-sounding text probable. This is why a model can nail a common fact and fabricate a rare one with equal confidence, the root of hallucination.

It is also why a plain model can't tell you where an answer came from. To make answers current and citable, engines connect the model to external sources via retrieval-augmented generation — covered next, and the mechanism behind every AI answer that shows its sources.

Why does the same prompt give different answers?

The same prompt gives different answers because text generation is probabilistic: at each step the model samples from a distribution of possible next tokens rather than always choosing the single most likely one. A parameter called temperature controls how much randomness is permitted — low temperature produces near-deterministic, repeatable answers; higher temperature produces more varied, creative ones. This is useful to know for AEO measurement: because output varies, tracking whether an engine cites you means testing a fixed prompt set repeatedly, not just once.
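
A small sketch shows how temperature reshapes the distribution before sampling. The candidate tokens and their scores (`logits`) are made-up values for illustration, and dividing scores by the temperature before a softmax is the common formulation rather than a description of any particular engine's implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores for one prediction step.
tokens = ["Paris", "London", "Rome"]
logits = [3.0, 1.5, 1.0]

rng = random.Random(0)  # seeded so repeated runs of this script match
for temperature in (0.2, 1.0, 2.0):
    picks = [sample_token(tokens, logits, temperature, rng) for _ in range(1000)]
    share = picks.count("Paris") / len(picks)
    print(f"temperature={temperature}: top token chosen {share:.0%} of the time")
```

At low temperature the top-scoring token wins almost every draw. At higher temperature the alternatives are picked often enough that two runs of the same prompt can diverge, which is why a single test run says little about whether an engine cites you.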

Why does this matter for getting cited?

How an LLM works matters for AEO because it tells you what the machine can actually use. It reads tokens, not pages; it resolves meaning through attention, not skimming; and it answers from training unless it's been grounded in live sources. Content that is cleanly tokenizable, semantically clear, and self-contained is content a model can represent and reuse accurately. That is the bridge from "how LLMs work" to what is AEO and the practices in The AEO Canon.

Next: see where a model's knowledge comes from in how AI models are trained, or how engines add live knowledge in what is RAG.

Frequently asked questions

How does an LLM generate text?
It generates text one token at a time. The model reads your prompt as tokens, converts them to numeric vectors (embeddings), passes them through transformer layers that use attention to weigh how each token relates to the others, and outputs a probability distribution over the next token. It picks one, appends it, and repeats until the answer is complete.
What is the attention mechanism?
Attention is the core operation of the transformer architecture that lets a model weigh which earlier tokens matter most when predicting the next one. It's how the model keeps track of context, connecting a pronoun to the noun it refers to, or a question to the relevant detail. The transformer was introduced in the 2017 paper "Attention Is All You Need."
Does an LLM look things up while it answers?
Not by default. A base model answers purely from patterns learned during training — it has no live access to the web. Only when an LLM is connected to retrieval (RAG) or a search tool does it pull in external, current information. That difference is why some AI answers cite sources and others can't.
Why does the same prompt give different answers?
Because generation is probabilistic. At each step the model has a distribution over possible next tokens and samples from it; a setting called temperature controls how much randomness is allowed. Higher temperature means more varied output; lower means more deterministic, repeatable answers.
