AEO Canon · the reference for answer-engine optimization

Why Per-Engine Measurement Beats a Blended Average

Burke Atkerson · 5 min read

Measure AI visibility per engine, not as one blended average: Profound found only about 11% citation overlap across engines, so a single "AI visibility" number averages together separate universes and hides where you actually win or lose. Each engine is its own game; a blended score is a figure that corresponds to no real surface.

The short answer

Track share of voice per engine. With only ~11% cross-engine overlap (Profound), a blended average can show a comfortable middle number while you're collapsing on the one engine your buyers actually use. Report each engine separately and watch each trend.

Illustrative: one site, very different per engine (why a blended average misleads)

| Prompt          | ChatGPT   | Perplexity | AI Overviews | Gemini    |
|-----------------|-----------|------------|--------------|-----------|
| "best X for Y"  | Often     | Rarely     | Sometimes    | Rarely    |
| "X alternatives"| Sometimes | Often      | Rarely       | Sometimes |
| "is X worth it" | Rarely    | Often      | Sometimes    | Rarely    |

Legend: strong · needs work · critical

Why does a blended average mislead?

A blended average misleads because it combines engines that barely agree on what to cite, producing a number that matches no real surface. Profound found that only about 11% of cited sources are shared across the major engines, so they are effectively separate universes. If you average a 60% share of voice in Perplexity with a 5% share in Google AI Overviews, the resulting 32.5% blended figure describes nothing you can act on: you are not at 32.5% anywhere. Worse, the blend can stay flat while one engine quietly collapses, because a gain elsewhere offsets the loss.

A worked example: how the blend hides a collapse

Watch what a single averaged number conceals. Suppose you track three engines and report a naive average:

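Example: the blended average holds at 40% while one engine halves

| Engine              | Month 1 SOV | Month 2 SOV |
|---------------------|-------------|-------------|
| Perplexity          | 60%         | 75%         |
| ChatGPT             | 30%         | 30%         |
| Google AI Overviews | 30%         | 15%         |
| Blended average     | 40%         | 40%         |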

The blended 40% is identical both months, so a dashboard built on it reports "steady — nothing to do." But Google AI Overviews just halved, masked by a gain in Perplexity. If AI Overviews is where your buyers are, you're losing badly and the average is actively hiding it. The per-engine view makes that collapse impossible to miss — and tells you exactly which engine to investigate.
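The arithmetic in the table can be checked with a short sketch. The share-of-voice figures are the ones from the worked example; the 25% drop threshold and the variable names are assumptions made for illustration, not part of any particular tool.

```python
# Share of voice (%) per engine for (month 1, month 2), from the worked example.
sov_by_month = {
    "Perplexity": (60, 75),
    "ChatGPT": (30, 30),
    "Google AI Overviews": (30, 15),
}

# Naive blended average for each month: sum of engine scores / number of engines.
blended = [
    sum(months[i] for months in sov_by_month.values()) / len(sov_by_month)
    for i in range(2)
]

# Relative month-over-month change per engine.
changes = {
    engine: (m2 - m1) / m1 for engine, (m1, m2) in sov_by_month.items()
}

# Assumed alert threshold: flag any engine that lost a quarter or more of its SOV.
DROP_THRESHOLD = -0.25
collapsing = [engine for engine, change in changes.items() if change <= DROP_THRESHOLD]

print("Blended average by month:", blended)
print("Engines needing investigation:", collapsing)
```

The blended list holds the same value for both months, while the per-engine check flags Google AI Overviews, which is the point the example makes in prose.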

How little do engines actually overlap?

Engines overlap remarkably little: about 11% of cited sources are shared across them, per Profound. The cause is structural — each engine runs a different index, a different retrieval-and-rerank pipeline, and a different weighting of freshness, authority, and community sources. On top of that, citations are volatile within each engine: Semrush found 40–60% of LLM-cited sources change month to month. Ahrefs has watched a single source's presence swing from 76% to 38% across a tracking window, and Semrush found Google keeps a given URL in an AI Overview only about 3.87 days on average. So each engine is not only different from the others but a moving target in its own right. Low overlap across engines and high churn within them is exactly the situation where a single averaged number is least trustworthy.
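To see what a cross-engine overlap figure could look like in practice, here is a minimal sketch. The domains are invented toy data, and the two overlap definitions used (pairwise Jaccard similarity, and the share of all cited sources that every engine cites) are assumptions for illustration; the source does not describe how Profound computed its figure.

```python
from itertools import combinations

# Hypothetical cited domains per engine for one prompt set (toy data only).
cited = {
    "ChatGPT": {"a.com", "b.com", "c.com", "d.com"},
    "Perplexity": {"a.com", "e.com", "f.com", "g.com"},
    "Google AI Overviews": {"a.com", "c.com", "h.com", "i.com"},
}


def jaccard(x: set, y: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(x & y) / len(x | y)


# Overlap for every pair of engines.
pairwise = {
    (a, b): jaccard(cited[a], cited[b]) for a, b in combinations(cited, 2)
}

# Share of all distinct cited sources that appear on every engine.
all_sources = set.union(*cited.values())
shared_by_all = set.intersection(*cited.values())
overlap_all = len(shared_by_all) / len(all_sources)

for pair, score in pairwise.items():
    print(pair, round(score, 2))
print("Shared by all engines:", shared_by_all, round(overlap_all, 2))
```

Running the same calculation on each month's data would also show the within-engine churn described above, since the cited sets themselves change between readings.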

Blended average vs per-engine measurement

|                              | Blended average                      | Per-engine                        |
|------------------------------|--------------------------------------|-----------------------------------|
| What it reports              | One combined "AI visibility" score   | A share of voice for each engine  |
| With ~11% overlap            | Averages separate universes          | Reflects each engine's reality    |
| Hides a one-engine collapse? | Yes                                  | No                                |
| Tells you where to act       | No                                   | Yes, by engine                    |
| Best use                     | A rough exec headline (with caveats) | Every real decision               |

What does per-engine measurement look like?

Per-engine measurement reports a separate share of voice for each engine you track, on the same fixed prompt set, so you can see exactly where you stand on each. The mechanics are the same as any share-of-voice program — you just never collapse the results into one figure.

  1. Run one prompt set across each engine. Use the same fixed questions on every engine so the engines are comparable to each other, not to a moving target.
  2. Score each engine separately. Compute brand and competitive share of voice per engine: ChatGPT, Perplexity, Google AI Overviews, and any others your audience uses.
  3. Prioritize by audience. Weight your attention toward the engines your buyers actually use, not toward whichever inflates the average.
  4. Track each trend over time. Watch the slope per engine; a healthy total can still hide a sharp decline on the one that matters most.
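The scoring step above can be sketched as follows. Everything here is hypothetical: the result rows, the brand names, and the definition of share of voice as a brand's mentions divided by all tracked-brand mentions on that engine. Tools differ on the exact definition, so treat this as one plausible reading rather than a standard.

```python
from collections import Counter, defaultdict

# Hypothetical results from one fixed prompt set: (engine, prompt, brands cited).
results = [
    ("ChatGPT", "best X for Y", ["us", "rival_a"]),
    ("ChatGPT", "X alternatives", ["rival_a"]),
    ("Perplexity", "best X for Y", ["us"]),
    ("Perplexity", "X alternatives", ["us", "rival_b"]),
    ("Google AI Overviews", "best X for Y", ["rival_a", "rival_b"]),
    ("Google AI Overviews", "X alternatives", ["rival_b"]),
]


def share_of_voice_by_engine(rows):
    """Return {engine: {brand: share}} without ever merging engines together."""
    mentions = defaultdict(Counter)
    for engine, _prompt, brands in rows:
        mentions[engine].update(brands)
    return {
        engine: {brand: n / sum(counts.values()) for brand, n in counts.items()}
        for engine, counts in mentions.items()
    }


sov_by_engine = share_of_voice_by_engine(results)
for engine, shares in sov_by_engine.items():
    # A brand that was never cited on an engine has a share of zero there.
    print(engine, "us:", round(shares.get("us", 0.0), 2))
```

Because the same prompts are run on every engine, the per-engine shares are comparable with each other, and the function has no step that collapses them into a single figure.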

Why do engines differ so much in the first place? Because they retrieve and rank differently. The mechanics are covered in "ChatGPT vs Perplexity vs Google AI Overviews: how citation differs" and "How AI engines choose citations".

How should you report per-engine results?

Report per-engine results as small multiples, not one merged line — a compact panel per engine showing its share-of-voice trend, side by side. That layout keeps each engine's reality visible while letting you scan the whole picture at a glance. Three practices keep the report honest:

  • One row per engine, with its trend. Show current SOV and the direction over the last several readings, per engine — never just a single combined figure.
  • Order by audience, not by score. Put the engines your buyers actually use first, so attention follows revenue rather than whichever number looks best.
  • Annotate changes. Note when an engine shipped a known update or when you changed tactics, so a swing has context — the kind of evidence the Adaptability pillar runs on.
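As a rough sketch of the three practices above, the snippet below builds one text row per engine, orders rows by audience share, and attaches an annotation where one exists. The readings, audience shares, note text, and the two-point tolerance for calling a trend "flat" are all assumed values for illustration.

```python
# Hypothetical SOV readings (%) per engine, oldest to newest.
readings = {
    "Perplexity": [60, 68, 75],
    "ChatGPT": [30, 31, 30],
    "Google AI Overviews": [30, 22, 15],
}
# Hypothetical share of your buyers using each engine.
audience_share = {"Google AI Overviews": 0.6, "ChatGPT": 0.3, "Perplexity": 0.1}
# Hypothetical annotation explaining a swing.
notes = {"Google AI Overviews": "engine update shipped mid-period"}


def direction(series, tolerance=2):
    """Classify the trend from first to last reading; tolerance is in points."""
    delta = series[-1] - series[0]
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"


def build_report(readings, audience_share, notes):
    """One row per engine, ordered by audience share rather than by score."""
    ordered = sorted(readings, key=lambda e: audience_share.get(e, 0), reverse=True)
    rows = []
    for engine in ordered:
        row = f"{engine}: {readings[engine][-1]}% ({direction(readings[engine])})"
        if engine in notes:
            row += f" [{notes[engine]}]"
        rows.append(row)
    return rows


report = build_report(readings, audience_share, notes)
for row in report:
    print(row)
```

With this ordering, the engine with the highest score (Perplexity) lands last, and the declining engine that most buyers use leads the report.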

Is a blended number ever OK?

A blended number is acceptable only as a rough executive headline, and only when it's reported alongside the per-engine breakdown so it can't hide a collapse. If leadership wants one figure to glance at, give them one — but make every decision about where to invest from the per-engine view, because that's where the signal is. The moment a blended score starts driving strategy, you're optimizing for an average of universes that don't overlap.

If you must roll up to one figure, a usage-weighted index beats a naive average: weight each engine's share of voice by the share of your audience that actually uses it, so the number at least tracks the surfaces that matter to you rather than treating a niche engine and your buyers' main engine as equals. It still hides per-engine detail, so keep it as a headline sitting on top of the breakdown — never as the figure you optimize against.
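A usage-weighted index of this kind could be computed as in the sketch below. The share-of-voice values are Month 2 from the worked example; the audience-usage weights are hypothetical, and the function name is invented for illustration.

```python
def usage_weighted_index(sov, usage):
    """Weight each engine's share of voice by the audience share that uses it."""
    total_weight = sum(usage[engine] for engine in sov)
    return sum(sov[engine] * usage[engine] for engine in sov) / total_weight


# Month 2 share of voice (%) from the worked example.
month_2 = {"Perplexity": 75, "ChatGPT": 30, "Google AI Overviews": 15}

# Hypothetical audience usage per engine, in percent of buyers.
usage = {"Perplexity": 10, "ChatGPT": 30, "Google AI Overviews": 60}

naive = sum(month_2.values()) / len(month_2)
weighted = usage_weighted_index(month_2, usage)

print("Naive average:", naive)
print("Usage-weighted index:", weighted)
```

Under these assumed weights the weighted index comes out well below the naive average, because most of the audience sits on the engine with the weakest share of voice. It is still a single number, so it belongs on top of the per-engine breakdown, not in place of it.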

The averaging trap

A flat blended score can mean "steady everywhere" or "surging on one engine, sinking on another." You cannot tell which from the average — and they call for opposite responses. Always keep the per-engine breakdown in view, and treat any blended figure as a headline, never a diagnosis.

Where this fits in the Canon

Per-engine measurement is the Adaptability pillar made operational: the engines change monthly and overlap little, so you measure each one and adapt on its own evidence. It's the discipline behind share of voice and the reason a single "AI visibility" number is a trap.

Related: ChatGPT vs Perplexity vs Google AI Overviews, how to measure your AI visibility, and the best AI visibility tools (track each engine separately).

Frequently asked questions

Should I measure AI visibility per engine or as a blended average?
Per engine. The major answer engines cite largely different sources — Profound found only about 11% citation overlap across them — so a single blended "AI visibility" score averages together separate universes and hides where you're actually winning or losing. Track share of voice separately for each engine your audience uses, then watch each trend.
Why does a blended AI visibility score mislead?
Because it can mask large per-engine swings. You might dominate Perplexity and be invisible in Google AI Overviews, yet a blended average shows a mediocre middle number that points you nowhere. With only ~11% cross-engine overlap, the engines aren't measuring the same thing, so averaging them produces a figure that doesn't correspond to any real surface.
How much do AI engines overlap in what they cite?
Very little. Profound found only about 11% of cited sources are shared across the major engines, and citations are volatile within each one (Semrush found 40–60% of cited sources change month to month). Each engine runs a different index, retrieval pipeline, and ranking emphasis, so winning one is no guarantee of winning another.
Is a blended average ever useful?
Only as a rough executive headline, and even then with caveats. If leadership wants one number, report it alongside the per-engine breakdown so the average never hides a collapse on a specific engine. For any decision about where to invest, use the per-engine view — that's where the actionable signal lives.
How many AI engines should I track?
Track every engine your audience actually uses, and no more — usually three to five (ChatGPT, Perplexity, Google AI Overviews, and sometimes Gemini or Copilot). Because tools price by prompts times engines, adding engines your buyers don't use just inflates cost and dilutes attention. Prioritize by where your customers are, then score each engine separately.
Can I roll per-engine scores into one KPI?
Only as a headline over the per-engine breakdown — and a usage-weighted index beats a naive average, weighting each engine's share of voice by the share of your audience that uses it. Even then, never optimize against the rolled-up number; make decisions from the per-engine view, because a single figure can hide a collapse on the engine that matters most.
