AEO Canon · the reference for answer-engine optimization

How to Allow AI Crawlers in robots.txt

AI engines can only cite pages their crawlers are allowed to fetch. This guide gives you a verified, copy-paste robots.txt block that explicitly allows GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, and Bingbot — plus the one mistake that silently blocks them all.

Burke Atkerson · 3 min read

Because engines can only cite what their crawlers can fetch, allowing the major AI crawlers in robots.txt is the first technical step in AEO (answer-engine optimization). The fix is short: give each AI user-agent an explicit, permissive group, or at minimum make sure no rule disallows them.

The short answer

Paste the block below into https://yourdomain.com/robots.txt, swap in your real sitemap URL, and confirm the file returns 200 as text/plain. AI crawlers obey the most specific matching user-agent group, so these per-bot groups grant access even if a wildcard rule is restrictive.

What's the copy-paste robots.txt block?

Here is a complete robots.txt that explicitly allows the major AI crawlers. Copy it, replace the sitemap URL with your own, and deploy it to your domain root.

# robots.txt — explicitly allow major AI crawlers
# Replace the Sitemap URL with your own. Deploy at https://yourdomain.com/robots.txt
 
# OpenAI — training crawler
User-agent: GPTBot
Allow: /
 
# OpenAI — surfaces sites in ChatGPT search
User-agent: OAI-SearchBot
Allow: /
 
# OpenAI — user-initiated fetches from ChatGPT
User-agent: ChatGPT-User
Allow: /
 
# Anthropic — training crawler
User-agent: ClaudeBot
Allow: /
 
# Anthropic — search indexing
User-agent: Claude-SearchBot
Allow: /
 
# Perplexity — search index
User-agent: PerplexityBot
Allow: /
 
# Google — controls use of your content for Gemini AI training & grounding
User-agent: Google-Extended
Allow: /
 
# Microsoft Bing — index that feeds Copilot and other AI answers
User-agent: Bingbot
Allow: /
 
# Everyone else (keep your existing rules here)
User-agent: *
Allow: /
 
Sitemap: https://yourdomain.com/sitemap.xml

An empty Disallow: line is equivalent to Allow: / — both mean "crawl everything." Use whichever your team prefers; the block above uses Allow: / for readability.
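The equivalence can be checked offline with Python's standard-library urllib.robotparser. This is a minimal sketch, not a model of any particular crawler: the helper name `allowed` and the /pricing path are illustrative assumptions, and the standard-library parser only approximates how production crawlers read the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text (no network access) and test one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Two spellings of "crawl everything" for the same user-agent.
WITH_ALLOW = "User-agent: GPTBot\nAllow: /\n"
WITH_EMPTY_DISALLOW = "User-agent: GPTBot\nDisallow:\n"

url = "https://yourdomain.com/pricing"  # hypothetical page
print(allowed(WITH_ALLOW, "GPTBot", url))
print(allowed(WITH_EMPTY_DISALLOW, "GPTBot", url))
```

Both calls should report that the URL is fetchable, which is why the choice between the two forms is a matter of team preference.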

How does robots.txt allow or block a crawler?

robots.txt allows or blocks a crawler by matching its user-agent to the most specific group in the file and applying that group's rules. Each AI crawler looks for a group whose User-agent matches its own name; if it finds one, it obeys that group and ignores the wildcard * group entirely (Google: robots.txt). That's why an explicit User-agent: GPTBot group with Allow: / lets GPTBot in even if User-agent: * says Disallow: /.
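The group-selection rule can be seen in a few lines of Python using the standard-library urllib.robotparser. Treat this as a sketch under assumptions: the file contents are an invented example, "SomeOtherBot" is a made-up user-agent, and the standard-library parser is a stand-in for, not a copy of, each vendor's own matching logic.

```python
from urllib.robotparser import RobotFileParser

# A restrictive wildcard group plus one explicit per-bot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in memory, no network fetch

# GPTBot matches its own group; the others fall back to the wildcard group.
results = {
    agent: parser.can_fetch(agent, "https://yourdomain.com/")
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot")
}
for agent, ok in results.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Only GPTBot is let through here, because it is the only agent with a group of its own; every other agent inherits the wildcard's Disallow: /.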

This is the access pillar at its most literal: access is a binary gate, and robots.txt is the lock. For which specific bots to list and what each one does, see the guide "Which AI crawlers should you allow?"

What's the one mistake that blocks every AI crawler?

The mistake is a leftover site-wide Disallow: / with no per-bot exceptions — usually copied from a staging environment that was meant to stay private.

The block that costs you every citation

This single group blocks every crawler that doesn't have its own group — including all AI crawlers:

# DANGER: this blocks every AI crawler that lacks its own group
User-agent: *
Disallow: /

If you see that in production, every crawler that honors robots.txt and lacks its own group is locked out, so you are invisible to AI engines no matter how good your content is. Either remove the Disallow: /, or add the explicit per-bot groups from the copy-paste block so the AI crawlers match their own rules instead of the wildcard.
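A small audit can list which AI crawlers a given robots.txt would lock out of the homepage. The sketch below assumes Python's standard-library urllib.robotparser as the rule interpreter; the function name `blocked_crawlers` and the two sample files are illustrative, and the user-agent list is simply the one used in this guide's copy-paste block.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the copy-paste block in this guide.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "PerplexityBot", "Google-Extended", "Bingbot",
]


def blocked_crawlers(robots_txt: str, site: str = "https://yourdomain.com/") -> list[str]:
    """Return the AI crawlers that may not fetch `site` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # in-memory parse, no network fetch
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, site)]


# The staging leftover, and the same file with one per-bot exception added.
STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
PARTIAL_FIX = STAGING_LEFTOVER + "\nUser-agent: GPTBot\nAllow: /\n"

print(blocked_crawlers(STAGING_LEFTOVER))
print(blocked_crawlers(PARTIAL_FIX))
```

The first call returns every crawler in the list; the second drops only GPTBot, which shows that a per-bot exception rescues that bot alone and each remaining crawler needs its own group.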

How do you test it yourself?

Verify your robots.txt is live and permissive before assuming anything else is wrong.

1. Fetch the file. Open https://yourdomain.com/robots.txt in a browser, or run: curl -s https://yourdomain.com/robots.txt

2. Confirm the status and type. It must return HTTP 200 with Content-Type text/plain. A 404 means "no rules" (crawlable), but a 5xx error can pause crawling.

3. Search for a stray Disallow. Look for a bare "Disallow: /", which covers the whole site, rather than a rule scoped to a path you intend to hide.

4. Simulate a crawler fetch. Run: curl -s -A "GPTBot" https://yourdomain.com/ and confirm you get your real HTML back, not a block page. This sends only the GPTBot token as the User-Agent header, so it catches user-agent-based blocks at your server or CDN, which robots.txt does not control; it does not reproduce the crawler's IP addresses.

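# Fetch your robots.txt and check it
curl -s https://yourdomain.com/robots.txt

# Print only the status code and Content-Type (expect 200 and a text/plain type)
curl -s -o /dev/null -w "%{http_code} %{content_type}\n" https://yourdomain.com/robots.txt

# Fetch your homepage with a GPTBot User-Agent, and confirm real HTML comes back
curl -s -A "GPTBot" https://yourdomain.com/ | head -n 40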
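The status and content-type rules from step 2 can also be scripted. The following is a hedged sketch using only Python's standard library: `classify` and `check_robots` are hypothetical helper names, the "robots-check/0.1" User-Agent is invented for illustration, and the outcome labels paraphrase this guide rather than any crawler's documented behavior.

```python
import urllib.error
import urllib.request


def classify(status: int, content_type: str) -> str:
    """Map a robots.txt response to the outcomes described in this guide."""
    if 500 <= status <= 599:
        return "server error: crawling may pause until this is fixed"
    if status == 404:
        return "not found: treated as no restrictions"
    if status == 200 and content_type.lower().startswith("text/plain"):
        return "ok"
    return "check manually: unexpected status or content type"


def check_robots(url: str) -> str:
    """Fetch the file (network access required) and classify the response."""
    request = urllib.request.Request(url, headers={"User-Agent": "robots-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return classify(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx, so classify those responses here.
        return classify(error.code, error.headers.get("Content-Type", ""))
```

Calling check_robots("https://yourdomain.com/robots.txt") against your own domain should return "ok" when the file is healthy. Connection failures such as DNS errors are not handled and will raise urllib.error.URLError.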


Where this fits in the Canon

Allowing AI crawlers is step one of the access pillar — the binary gate that decides whether an engine can read you at all. Robots.txt only governs permission; the content still has to be in the HTML the crawler fetches, which is the rendering side of access.

Next: "Which AI crawlers should you allow?", "How to check if AI crawlers can read your site", and "Why JavaScript breaks your AI citation eligibility".

Frequently asked questions

How do I allow AI crawlers in robots.txt?
Add an explicit group for each AI user-agent with an empty Disallow: line (or Allow: /), or simply make sure no rule disallows them. AI crawlers obey the most specific matching user-agent group, so a per-bot group granting access overrides a restrictive wildcard. The copy-paste block in this guide allows GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Google-Extended, and Bingbot.

Does robots.txt actually block AI crawlers?
Yes, for the major, well-behaved crawlers. OpenAI, Anthropic, Perplexity, and Google all document that their crawlers respect robots.txt, so a Disallow rule keeps them out. It is not a security control, because it relies on the crawler honoring it, but the major AI crawlers do honor it, which is exactly why an accidental Disallow silently costs you citations.

What's the most common robots.txt mistake that blocks AI?
A site-wide "User-agent: *" group with "Disallow: /" left over from a staging environment, or a blanket block of unfamiliar bots. Because the wildcard group applies to any crawler without its own group, it blocks every AI crawler at once. Always check for a stray "Disallow: /" before assuming an engine just isn't crawling you.

Where does the robots.txt file go?
At the root of your domain, served at https://yourdomain.com/robots.txt. It must be reachable over HTTP(S) at that exact path, return a 200 status, and be served as text/plain. Crawlers fetch it before crawling; a robots.txt that returns 404 is treated as "no restrictions," but one that returns a 5xx error can pause crawling entirely.
