Which AI crawlers should I allow for citations?

Allow the search-indexing and user-fetch crawlers — OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot (Perplexity), and Bingbot (Microsoft) — because these build the indexes and fetch the pages that AI answers cite. These are the ones that directly affect whether you can be surfaced and quoted.

What's the difference between GPTBot and OAI-SearchBot?

GPTBot crawls content to train and improve OpenAI's models; OAI-SearchBot crawls to surface and cite sites in ChatGPT's search results. Allowing OAI-SearchBot (and ChatGPT-User, for real-time fetches) is what makes you eligible to be cited in ChatGPT. GPTBot is about training-data inclusion, which is a separate, values-based choice.

Should I allow training crawlers like GPTBot and ClaudeBot?

That's an opt-in decision, not a citation requirement. Blocking GPTBot or ClaudeBot opts your content out of model training but does not, by itself, stop you from being cited by the corresponding search and user-fetch bots. Allow them if you're comfortable contributing to training and want maximum long-term presence; block them if you want to opt out of training.

What is Google-Extended?

Google-Extended is a robots.txt token, not a separate crawler. It controls whether Google may use content it has already crawled (via Googlebot) for Gemini training and for grounding Gemini's answers. Allowing it opts your content into those generative AI uses; it doesn't change normal Google Search crawling, which Googlebot handles.

Which AI Crawlers Should You Allow?

Not all AI crawlers do the same job: search and user-fetch bots drive citations and should be allowed, while training crawlers are an opt-in choice that doesn't control citation eligibility. Knowing which is which lets you stay citable everywhere without opting in or out of training by accident.

The short answer

Allow the citation-driving bots: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, and Bingbot. Decide deliberately on the training crawlers (GPTBot, ClaudeBot) and Google-Extended — these govern model-training and generative use, not whether you can be cited.

What does each AI crawler actually do?

Each AI crawler falls into one of three jobs: training (improving models), search indexing (building the index AI answers draw from), or user-fetch (grabbing a page in real time when a user asks). The table maps every major user-agent to its operator, job, and whether allowing it affects your citation eligibility.

The major AI crawlers and what each one does
User-agent	Operator	Job	Allow for citations?
GPTBot	OpenAI	Model training	Optional (training opt-in)
OAI-SearchBot	OpenAI	Search indexing for ChatGPT	Yes
ChatGPT-User	OpenAI	Real-time user-initiated fetch	Yes
ClaudeBot	Anthropic	Model training	Optional (training opt-in)
Claude-SearchBot	Anthropic	Search indexing for Claude	Yes
PerplexityBot	Perplexity	Search indexing for Perplexity	Yes
Google-Extended	Google	Gemini training & grounding token	Optional (generative opt-in)
Bingbot	Microsoft	Bing index (feeds Copilot/AI answers)	Yes
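
One way to apply the table is to tag crawler requests in your access logs by job. The sketch below is a minimal illustration in Python, not vendor code: `CRAWLER_JOBS` and `classify_user_agent` are hypothetical names, the job labels come from the table, and the sample headers are simplified stand-ins rather than the exact strings each vendor sends.

```python
from typing import Optional

# Job per user-agent token, copied from the table above.
# Google-Extended is left out on purpose: it is a robots.txt token,
# so it never shows up as a User-Agent header in request logs.
CRAWLER_JOBS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search-indexing",
    "ChatGPT-User": "user-fetch",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search-indexing",
    "PerplexityBot": "search-indexing",
    "Bingbot": "search-indexing",
}


def classify_user_agent(header: str) -> Optional[str]:
    """Return the job of the first known crawler token in a User-Agent header.

    Matching is a case-insensitive substring check; returns None when no
    token from CRAWLER_JOBS is present.
    """
    lowered = header.lower()
    for token, job in CRAWLER_JOBS.items():
        if token.lower() in lowered:
            return job
    return None


# Simplified, illustrative headers (assumptions, not real log lines).
samples = [
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0",
]
for header in samples:
    print(classify_user_agent(header), "<-", header)
```

A header match alone does not prove a request came from the named operator, since user-agent strings can be spoofed; treat the result as a first-pass label.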

Which bots actually drive AI citations?

The bots that drive citations are the search-indexing and user-fetch crawlers, because they build the retrievable index and fetch the live pages that AI answers quote. Concretely: OAI-SearchBot and ChatGPT-User make you eligible in ChatGPT, Claude-SearchBot in Claude, PerplexityBot in Perplexity, and Bingbot in the Bing index that feeds Copilot and some other AI surfaces. If you allow nothing else, allow these: blocking them removes you from the candidate set an engine can cite.

What about the training crawlers?

Training crawlers — GPTBot and ClaudeBot — are a separate, values-based decision, not a citation lever. They crawl content to train and improve the underlying models. Blocking them opts your content out of training but does not, on its own, stop the matching search bot from indexing and citing you. So the choice is about whether you want to contribute to model training, not about visibility:

Allow them if you're comfortable contributing to training and want maximum long-term presence in the models themselves.
Block them if you want to opt out of training while staying citable via the search bots.

Google-Extended works similarly: it's a robots.txt token (not a crawler) that controls whether Google uses already-crawled content for Gemini training and for grounding Gemini's answers. Allowing it opts you into those generative uses; it doesn't change normal Google Search crawling, which Googlebot handles.

Allow, or decide deliberately?

Always allow:

▸OAI-SearchBot & ChatGPT-User: to be citable in ChatGPT.
▸Claude-SearchBot & PerplexityBot: to be citable in Claude and Perplexity.
▸Bingbot: to stay in the Bing index that feeds Copilot.

Decide deliberately:

▸GPTBot & ClaudeBot: allow only if you want to contribute to model training.
▸Google-Extended: allow only if you want your content used for Gemini training and grounding.
▸Any of the above: block if you have legal, licensing, or brand reasons to opt out of training.

How do you put this in robots.txt?

Add an explicit group for each user-agent you want to control, exactly as in the robots.txt guide. To stay citable but opt out of training, allow the search and user-fetch bots and disallow the training ones:

# Citable in AI answers, but opted out of model training
 
User-agent: OAI-SearchBot
Allow: /
 
User-agent: ChatGPT-User
Allow: /
 
User-agent: Claude-SearchBot
Allow: /
 
User-agent: PerplexityBot
Allow: /
 
User-agent: Bingbot
Allow: /
 
# Opt OUT of training/generative use
User-agent: GPTBot
Disallow: /
 
User-agent: ClaudeBot
Disallow: /
 
User-agent: Google-Extended
Disallow: /
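
Before deploying a file like this, it can help to confirm that it parses the way you intend. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; `check_policy` is a hypothetical helper, the `example.com` URL is a placeholder, and Python's matching rules are only an approximation of how each vendor's crawler reads the file, so treat it as a sanity check rather than proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# The same policy as above: citation bots allowed, training tokens blocked.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bingbot
Allow: /

# Opt OUT of training/generative use
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot",
    "Bingbot", "GPTBot", "ClaudeBot", "Google-Extended",
]


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Parse robots.txt text and report whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder URL; substitute a real page from your own site.
result = check_policy(ROBOTS_TXT, AGENTS, "https://example.com/pricing")
for agent, allowed in result.items():
    print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

With the policy above, the five citation bots should report as allowed and the three training tokens as blocked.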

User-agent strings change as vendors add crawlers, so re-check the official docs periodically — keeping this list current is the adaptability pillar applied to crawl management.
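
Because the list changes over time, one option is to keep the user-agent names as data and generate the file from them, so an update is a one-line edit. This is a sketch under stated assumptions: `CITATION_BOTS`, `TRAINING_TOKENS`, and `build_robots_txt` are hypothetical names, and the lists simply mirror the crawlers named in this article, not an authoritative registry.

```python
# User-agent names as listed in this article; update these when vendors
# add or rename crawlers.
CITATION_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "PerplexityBot", "Bingbot",
]
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(allow_training: bool = False) -> str:
    """Build robots.txt text with one explicit group per user-agent.

    Citation bots are always allowed; training tokens are allowed only
    when allow_training is True.
    """
    groups = [f"User-agent: {bot}\nAllow: /" for bot in CITATION_BOTS]
    rule = "Allow: /" if allow_training else "Disallow: /"
    groups += [f"User-agent: {token}\n{rule}" for token in TRAINING_TOKENS]
    return "\n\n".join(groups) + "\n"


output = build_robots_txt(allow_training=False)
print(output)
```

A single flag keeps the sketch short; if you want to treat GPTBot, ClaudeBot, and Google-Extended differently, a per-token setting would be the natural extension.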

Where this fits in the Canon

Choosing which crawlers to allow is the permission half of the access pillar. Pair it with the copy-paste robots.txt guide, then confirm the bots can actually read your rendered HTML by following the guide on how to check if AI crawlers can read your site.

