AEO Canon · the reference for answer-engine optimization

Which AI Crawlers Should You Allow?

Not all AI crawlers do the same job. Search and user-fetch bots (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, Bingbot) drive citations and should be allowed; training crawlers (GPTBot, ClaudeBot) and Google-Extended are an opt-in choice. Here's what each one does.

Burke Atkerson · 3 min read

Not all AI crawlers do the same job: search and user-fetch bots drive citations and should be allowed, while training crawlers are an opt-in choice that doesn't control citation eligibility. Knowing which is which lets you stay citable everywhere without opting into, or out of, model training by accident.

The short answer

Allow the citation-driving bots: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, and Bingbot. Decide deliberately on the training crawlers (GPTBot, ClaudeBot) and Google-Extended — these govern model-training and generative use, not whether you can be cited.

What does each AI crawler actually do?

Each AI crawler falls into one of three jobs: training (improving models), search indexing (building the index AI answers draw from), or user-fetch (grabbing a page in real time when a user asks). The table maps every major user-agent to its operator, job, and whether allowing it affects your citation eligibility.

The major AI crawlers and what each one does

User-agent | Operator | Job | Allow for citations?
GPTBot | OpenAI | Model training | Optional (training opt-in)
OAI-SearchBot | OpenAI | Search indexing for ChatGPT | Yes
ChatGPT-User | OpenAI | Real-time user-initiated fetch | Yes
ClaudeBot | Anthropic | Model training | Optional (training opt-in)
Claude-SearchBot | Anthropic | Search indexing for Claude | Yes
PerplexityBot | Perplexity | Search index | Yes
Google-Extended | Google | Gemini training & grounding token | Optional (generative opt-in)
Bingbot | Microsoft | Bing index (feeds Copilot/AI answers) | Yes
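
The table can also be kept as a small lookup structure, which makes the "does this bot affect citations?" question a one-line check. This is a minimal sketch: the `Job` enum, the `CRAWLERS` dict, and both helper functions are hypothetical names introduced for illustration, and the fourth category (`USAGE_TOKEN`) is an assumption added here because Google-Extended is a token rather than a crawler.

```python
from enum import Enum


class Job(Enum):
    TRAINING = "model training"
    SEARCH_INDEX = "search indexing"
    USER_FETCH = "user-initiated fetch"
    USAGE_TOKEN = "generative-use token"  # assumption: separate bucket for Google-Extended


# User-agent -> (operator, job), transcribed from the table in this article.
CRAWLERS = {
    "GPTBot": ("OpenAI", Job.TRAINING),
    "OAI-SearchBot": ("OpenAI", Job.SEARCH_INDEX),
    "ChatGPT-User": ("OpenAI", Job.USER_FETCH),
    "ClaudeBot": ("Anthropic", Job.TRAINING),
    "Claude-SearchBot": ("Anthropic", Job.SEARCH_INDEX),
    "PerplexityBot": ("Perplexity", Job.SEARCH_INDEX),
    "Google-Extended": ("Google", Job.USAGE_TOKEN),
    "Bingbot": ("Microsoft", Job.SEARCH_INDEX),
}

# Jobs the table marks "Yes" under "Allow for citations?".
CITATION_JOBS = {Job.SEARCH_INDEX, Job.USER_FETCH}


def drives_citations(user_agent: str) -> bool:
    """Return True if the table lists this user-agent as citation-driving."""
    _operator, job = CRAWLERS[user_agent]
    return job in CITATION_JOBS


def citation_bots() -> list[str]:
    """All citation-driving user-agents, in table order."""
    return [ua for ua in CRAWLERS if drives_citations(ua)]
```

Keeping the job as data rather than hard-coding a list of names means a newly announced crawler only needs one new row.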

Which bots actually drive AI citations?

The bots that drive citations are the search-indexing and user-fetch crawlers, because they build the retrievable index and fetch the live pages that AI answers quote. Concretely: OAI-SearchBot and ChatGPT-User make you eligible in ChatGPT, Claude-SearchBot in Claude, PerplexityBot in Perplexity, and Bingbot in the Bing index that feeds Copilot and some other AI surfaces. If you allow nothing else, allow these: blocking them removes you from the candidate set an engine can cite.

What about the training crawlers?

Training crawlers — GPTBot and ClaudeBot — are a separate, values-based decision, not a citation lever. They crawl content to train and improve the underlying models. Blocking them opts your content out of training but does not, on its own, stop the matching search bot from indexing and citing you. So the choice is about whether you want to contribute to model training, not about visibility:

  • Allow them if you're comfortable contributing to training and want maximum long-term presence in the models themselves.
  • Block them if you want to opt out of training while staying citable via the search bots.

Google-Extended works similarly: it's a robots.txt token (not a crawler) that controls whether content Google has already crawled may be used to train Gemini models and to ground answers in Gemini Apps and Vertex AI. Allowing it opts you into those generative uses; it doesn't affect normal Google Search crawling, which Googlebot handles, or your inclusion in Search.

Allow, or decide deliberately?

Always allow:

  • OAI-SearchBot & ChatGPT-User: to be citable in ChatGPT.
  • Claude-SearchBot & PerplexityBot: to be citable in Claude and Perplexity.
  • Bingbot: to stay in the Bing index that feeds Copilot.

Decide deliberately:

  • GPTBot & ClaudeBot: allow only if you want to contribute to model training.
  • Google-Extended: allow only if you want your content used for Gemini training/grounding.
  • Any of the above: block if you have legal, licensing, or brand reasons to opt out of training.
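
The two decisions above reduce to two yes/no switches, so the resulting robots.txt can be generated rather than hand-edited. The sketch below assumes the bot lists from this article; `build_robots_txt` and its parameter names are hypothetical, not part of any library.

```python
# Bot names are taken from this article; re-check vendor docs before relying on them.
CITATION_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot", "Bingbot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot"]


def _group(agent: str, allowed: bool) -> str:
    """Render one robots.txt group that allows or disallows the whole site."""
    rule = "Allow: /" if allowed else "Disallow: /"
    return f"User-agent: {agent}\n{rule}\n"


def build_robots_txt(allow_training: bool, allow_google_extended: bool) -> str:
    """Always allow citation bots; apply the two deliberate choices to the rest."""
    groups = [_group(bot, True) for bot in CITATION_BOTS]
    groups += [_group(bot, allow_training) for bot in TRAINING_BOTS]
    groups.append(_group("Google-Extended", allow_google_extended))
    # A blank line between groups keeps the file readable.
    return "\n".join(groups)
```

For example, `build_robots_txt(allow_training=False, allow_google_extended=False)` produces the "citable but opted out of training" configuration.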

How do you put this in robots.txt?

Add an explicit group per user-agent you want to allow, exactly as in the robots.txt guide. If you want to allow citation bots but opt out of training, allow the search/user bots and disallow the training ones:

# Citable in AI answers, but opted out of model training
 
User-agent: OAI-SearchBot
Allow: /
 
User-agent: ChatGPT-User
Allow: /
 
User-agent: Claude-SearchBot
Allow: /
 
User-agent: PerplexityBot
Allow: /
 
User-agent: Bingbot
Allow: /
 
# Opt OUT of training/generative use
User-agent: GPTBot
Disallow: /
 
User-agent: ClaudeBot
Disallow: /
 
User-agent: Google-Extended
Disallow: /
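
One way to sanity-check a file like the one above before deploying it is Python's standard-library `urllib.robotparser`. This is a rough sketch: the `check` helper and the `https://example.com/pricing` URL are placeholders, and real crawlers may interpret robots.txt somewhat differently from this parser, so treat the result as a first check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from this article, inlined so the example is self-contained.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def check(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent to whether the parsed rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot",
    "Bingbot", "GPTBot", "ClaudeBot", "Google-Extended",
]
results = check(ROBOTS_TXT, AGENTS, "https://example.com/pricing")
```

In `results`, the five citation bots should map to `True` and the three training/generative tokens to `False`, which mirrors the intent of the file.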

User-agent strings change as vendors add crawlers, so re-check the official docs periodically — keeping this list current is the adaptability pillar applied to crawl management.
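
A periodic re-check can be partly automated by comparing the groups declared in your robots.txt against a list of known user-agents. The sketch below assumes you maintain that list yourself from the vendors' official docs; `declared_agents`, `missing_groups`, and the sample file are hypothetical.

```python
def declared_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_groups(robots_txt: str, known_agents: list[str]) -> list[str]:
    """Return known user-agents that have no explicit group in the file."""
    declared = declared_agents(robots_txt)
    return [agent for agent in known_agents if agent.lower() not in declared]


# Usage: a small sample file that forgot PerplexityBot.
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: bingbot  # Bing\nAllow: /\n"
gaps = missing_groups(sample, ["GPTBot", "Bingbot", "PerplexityBot"])
```

A bot with no explicit group falls back to whatever your `*` group or default behavior says, so the list of gaps is a prompt to decide, not necessarily an error.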

Where this fits in the Canon

Choosing which crawlers to allow is the permission half of the access pillar. Pair it with the copy-paste robots.txt guide, then confirm the bots can actually read your rendered HTML with the guide "How to check if AI crawlers can read your site."

Frequently asked questions

Which AI crawlers should I allow for citations?
Allow the search-indexing and user-fetch crawlers — OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot (Perplexity), and Bingbot (Microsoft) — because these build the indexes and fetch the pages that AI answers cite. These are the ones that directly affect whether you can be surfaced and quoted.
What's the difference between GPTBot and OAI-SearchBot?
GPTBot crawls content to train and improve OpenAI's models; OAI-SearchBot crawls to surface and cite sites in ChatGPT's search results. Allowing OAI-SearchBot (and ChatGPT-User, for real-time fetches) is what makes you eligible to be cited in ChatGPT. GPTBot is about training-data inclusion, which is a separate, values-based choice.
Should I allow training crawlers like GPTBot and ClaudeBot?
That's an opt-in decision, not a citation requirement. Blocking GPTBot or ClaudeBot opts your content out of model training but does not, by itself, stop you from being cited by the corresponding search and user-fetch bots. Allow them if you're comfortable contributing to training and want maximum long-term presence; block them if you want to opt out of training.
What is Google-Extended?
Google-Extended is a robots.txt token, not a separate crawler. It controls whether Google may use content it has already crawled (via Googlebot) to train Gemini models and to ground answers in Gemini Apps and Vertex AI. Allowing it opts your content into those generative AI uses; it doesn't affect normal Google Search crawling, which Googlebot handles, or your inclusion in Search.

