AEO Canon · the reference for answer-engine optimization

The Multimodal Future of Citation

Burke Atkerson · 3 min read

Citation is text-first today — engines quote transcripts, captions, and alt text — but multimodal models that read video, audio, and images directly are emerging. The durable strategy is to win the text layer now (it loses nothing later) while making your content genuinely strong across formats, because that's what survives the shift.

The honest forecast

Today engines cite the text layer — transcripts, captions, alt text. Multimodal models that read media directly are improving, on an uncertain timeline. The safe play: win the text layer now (it loses nothing) and make your video, audio, and images genuinely good. The pillars don't change — multimodal just widens what "extractable" includes.

Where does citation stand today?

Today, citation is overwhelmingly text-first. Engines retrieve and quote text, so multimodal content is cited through its transcripts, captions, and alt text: the words attached to the media, not the media itself. That is why structured types like schema.org/VideoObject describe media with text fields, and why every guide in this cluster points back to the text layer. It is the citable surface across video, images, voice, and podcasts. The one place media itself already moves the needle is authority. YouTube is the strongest off-site signal (a correlation of roughly 0.737 in Ahrefs' data), but even there it's the mention and presence that count, not the engine watching your footage.
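To make the text-fields point concrete, here is a minimal sketch of VideoObject markup, built in Python and serialized as JSON-LD. The property names (name, description, uploadDate, thumbnailUrl, contentUrl, transcript) come from schema.org; the helper function, the example.com URLs, and every value are hypothetical placeholders, not a recommended template.

```python
import json


def video_object_jsonld(name, description, upload_date, thumbnail_url,
                        content_url, transcript):
    """Build a schema.org/VideoObject description as a JSON-LD dict.

    The helper itself is hypothetical; only the property names are schema.org's.
    """
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "uploadDate": upload_date,       # ISO 8601 date string
        "thumbnailUrl": thumbnail_url,
        "contentUrl": content_url,
        "transcript": transcript,        # the text layer an engine can quote
    }


# Placeholder values for illustration only.
markup = video_object_jsonld(
    name="How to replace a bike chain",
    description="A step-by-step walkthrough of removing and fitting a chain.",
    upload_date="2025-01-15",
    thumbnail_url="https://example.com/thumbs/bike-chain.jpg",
    content_url="https://example.com/videos/bike-chain.mp4",
    transcript="To replace a bike chain, first shift to the smallest cog...",
)

print(json.dumps(markup, indent=2))
```

Every value an engine can quote here is a string, which is the sense in which the markup describes media through text. The serialized output would typically sit inside a script tag of type application/ld+json on the page that hosts the video.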

What's actually changing?

What's changing is that models are becoming genuinely multimodal — able to interpret video frames, audio, and images directly rather than only their text descriptions. As that capability matures, engines will lean less on transcripts and alt text and more on the content itself, especially for visual and audio search — a direction reflected in Google's evolving AI features for search. The mechanism underneath is shared representation: embeddings that place text, images, and audio in one space, so an engine can match a query to a video moment or an image as readily as to a paragraph. It's real, and it's advancing — but the timeline and the degree are uncertain, so it's a direction to prepare for, not a switch that has flipped.
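As a rough illustration of shared representation, the sketch below ranks a paragraph, a video moment, and an image against one query by cosine similarity. The three-dimensional vectors are hand-written toys standing in for what a multimodal encoder would produce, and the item labels are invented; no real model or engine is being reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): in practice these would come from an encoder
# that maps text, video frames, and images into the same space.
index = {
    ("text", "paragraph on replacing a bike chain"): [0.9, 0.1, 0.0],
    ("video", "clip showing the chain being threaded"): [0.8, 0.2, 0.1],
    ("image", "photo of a sourdough loaf"): [0.0, 0.1, 0.9],
}

# Toy embedding of the query "how do I replace a bike chain?"
query_vector = [0.85, 0.15, 0.05]

# Rank every item, regardless of modality, by similarity to the query.
ranked = sorted(
    index.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)

for (modality, label), vector in ranked:
    score = cosine_similarity(query_vector, vector)
    print(f"{score:.3f}  {modality:5s}  {label}")
```

With these toy numbers the paragraph and the video clip both score close to the query and the unrelated image scores far lower. That is the behaviour a shared space is meant to deliver: the match depends on meaning, not on whether the item is text or media.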

How should you prepare without overreacting?

Prepare by winning the text layer now and making your media genuinely good — a strategy that pays off in both the present and the multimodal future:

  1. Win the text layer today

    Transcripts, captions, alt text, and answer-first copy work now and lose nothing later. This is the no-regret move.

  2. Make the media genuinely strong

    Multimodal rewards quality content directly, so a genuinely useful video or clear, original image is an asset that compounds, not filler.

  3. Build authority in media platforms

    YouTube and others already carry strong off-site signals; genuine presence there pays off regardless of how reading evolves.

  4. Measure and adapt

    Track citation share as engines change, and adjust; the adaptability pillar is the meta-skill for a shifting landscape.

Don't chase the future by neglecting the present: the text layer is what's cited today, and it remains useful no matter how multimodal engines become.
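One way to make the measure-and-adapt step concrete is to log a fixed set of prompts per engine and compute how often your domain is cited. The sketch below assumes a definition the article does not give: citation share as the fraction of sampled answers that cite at least one URL on your domain. The log format, the engine labels, and example.com are all hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def citation_share(answers, domain):
    """Fraction of answers citing at least one URL on `domain`.

    This definition of "citation share" is an assumption made for the sketch.
    """
    if not answers:
        return 0.0
    cited = sum(
        1 for answer in answers
        if any(host_of(url) == domain for url in answer["citations"])
    )
    return cited / len(answers)


# Hypothetical log: one entry per sampled answer.
sample_answers = [
    {"engine": "engine-a",
     "citations": ["https://www.example.com/video-guide", "https://other.test/x"]},
    {"engine": "engine-a", "citations": ["https://other.test/y"]},
    {"engine": "engine-b", "citations": ["https://example.com/podcast-notes"]},
    {"engine": "engine-b", "citations": []},
]

share = citation_share(sample_answers, "example.com")
print(f"Citation share: {share:.0%}")
```

Re-running the same prompts on a schedule and comparing the share over time is what turns this into a signal you can adapt to.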

What stays true through the shift?

Originality and credibility stay true through any shift. However engines come to read content — text, transcript, frame, or waveform — they reward what's genuinely yours and verifiably true, and they discount generic, unsupported output. A clear, original, well-evidenced answer in whatever format is what earns the citation, and that won't change as the modalities expand. The constant is quality with a point of view; the variable is which formats engines can read — and adaptability is how you keep up.

Where this fits in the Canon

The multimodal future is the adaptability pillar looking forward, on a foundation of extractability (win the text layer now) and originality (what survives any shift). It ties together the cluster — video, images, voice, podcasts, and how AI reads transcripts — and the YouTube AEO playbook for the authority signal media already carries.

Frequently asked questions

Will AI start citing video and images directly?

Increasingly, yes. Multimodal models that interpret video frames, audio, and images directly are improving, so over time engines will rely less on transcripts and alt text and more on the content itself. But the shift is gradual and the timeline uncertain, and text remains the primary citable surface today. The safe strategy is to win the text layer now while making your multimodal content genuinely good, because both pay off across the transition.

Does multimodal AI change AEO strategy?

Not the fundamentals. The pillars (access, alignment, extractability, authority, credibility, originality, freshness, adaptability) hold regardless of modality; multimodal mainly widens what 'extractable' means to include well-made video, audio, and images, not just text. The biggest change is that genuinely strong content in every format becomes more directly rewarded, so quality and originality matter even more.

Should I wait for multimodal AI before investing in video or images?

No. The text layer (transcripts, captions, alt text) works today and loses nothing as multimodal improves, and genuinely good multimodal content compounds in authority now: YouTube is already the strongest off-site signal. Waiting cedes ground; building citable text around strong media positions you for both the present and the multimodal future.

What stays true no matter how multimodal AI gets?

Originality and credibility. However engines come to read content, they reward what's genuinely yours and verifiably true, and penalize generic, unsupported output. A clear, original, well-evidenced answer, in whatever format, is what gets cited, and that won't change as the modalities engines can read expand. Adaptability is the meta-skill: measure and adjust as the shift unfolds.
