Knowledge Distillation

Knowledge distillation transfers what a large model has learned into a smaller, faster one. A compact "student" model is trained to reproduce the outputs of a larger "teacher," capturing much of its capability at a fraction of the size and cost. Many of the efficient models that power consumer assistants are built this way.
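For readers who want to see the mechanics, here is a minimal sketch of one common way the student is scored against the teacher: a "soft target" loss that compares the two models' probability distributions and blends that with an ordinary loss on the correct answer. The function names, the `temperature` and `alpha` settings, and the example scores are all illustrative assumptions, not the recipe behind any particular assistant.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend a teacher-matching term with a correct-answer term.

    temperature and alpha are assumed, illustrative hyperparameters.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)

    # KL divergence: how far the student's softened distribution
    # sits from the teacher's softened distribution.
    soft_loss = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_soft, student_soft)
        if t > 0
    )

    # Cross-entropy against the correct label at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[true_label])

    # The temperature**2 factor keeps the soft term on a comparable scale.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss


# Hypothetical scores over three candidate answers.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # already resembles the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different answer

print(distillation_loss(close_student, teacher, true_label=0))
print(distillation_loss(far_student, teacher, true_label=0))
```

The student that already resembles the teacher receives the smaller loss, and training nudges the student's parameters to push that number down. Nothing in the loss introduces information the teacher did not already have.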

For AEO, distillation helps explain why so many engines behave similarly: many smaller models inherit patterns from a few large ones. It also reinforces a strategic point. Because models increasingly learn from each other and from the open web, genuinely original content that exists nowhere else gives engines something they can't get by distilling existing knowledge. Distillation copies what's already known; it can't manufacture your first-hand data.

Example. A lightweight assistant on a phone may be a distilled version of a much larger model. It reflects the teacher's general knowledge, but for anything novel and specific, it still depends on retrieving an original source like yours.
