GPTBot
GPTBot is OpenAI's web crawler that gathers content to train its models, identified by the GPTBot user-agent and controllable through your robots.txt file.
GPTBot is OpenAI's crawler for collecting training data. It fetches public web
pages to help train OpenAI's models, announces itself with the GPTBot
user-agent, and obeys robots.txt, so site owners can allow
or block it. It is distinct from OAI-SearchBot, which powers ChatGPT's live search
citations — blocking GPTBot restricts training use, not search visibility.
For answer engine optimization (AEO), the key fact is that crawlers like GPTBot read raw HTML and do not execute JavaScript, so any content assembled in the browser is invisible to them. Being crawlable as server-rendered text is the access pillar, the price of admission. Whether to allow GPTBot specifically is a separate rights decision: allowing it permits training use but doesn't directly earn citations, and blocking it won't remove you from ChatGPT's search results.
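To make the raw-HTML point concrete, here is a minimal Python sketch that approximates what a crawler that does not run JavaScript can read. The helper `visible_text` and the two sample pages are hypothetical and exist only for illustration. This is not how GPTBot itself is implemented, only a rough stand-in for "text present in the HTML as delivered."

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text found in raw HTML, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript reader would find in raw_html."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page whose content is already in the HTML sent by the server.
SERVER_RENDERED = (
    "<html><body><h1>Pricing</h1><p>Plans start at $10.</p></body></html>"
)

# Hypothetical page whose content only appears after a script runs in a browser.
CLIENT_RENDERED = (
    "<html><body><div id='root'></div>"
    "<script>document.getElementById('root').textContent = "
    "'Plans start at $10.';</script>"
    "</body></html>"
)

if __name__ == "__main__":
    print("server-rendered:", repr(visible_text(SERVER_RENDERED)))
    print("client-rendered:", repr(visible_text(CLIENT_RENDERED)))
```

In the second page the sentence exists only inside a script, so a reader that never executes that script finds no body text at all.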
Example. Adding User-agent: GPTBot followed by Disallow: / to your
robots.txt tells OpenAI not to crawl your site for training, while leaving the
search and live-fetch crawlers free to cite you, as long as your robots.txt does not block those too.
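The rule in the example can be checked locally before you publish it. The sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The `can_crawl` helper and the `https://example.com/` URL are assumptions made for illustration, and the standard-library parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block GPTBot everywhere, explicitly allow OAI-SearchBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/guide"  # placeholder URL
    for agent in ("GPTBot", "OAI-SearchBot"):
        print(agent, "allowed:", can_crawl(ROBOTS_TXT, agent, page))
```

A user-agent with no matching group and no `*` group is allowed by default, which is why the explicit `Allow` for OAI-SearchBot is optional here and mainly documents intent.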
Relevant pillar
Related terms
- ClaudeBot: ClaudeBot is Anthropic's web crawler that collects content used to train its Claude models, identified by the ClaudeBot user-agent and controllable via robots.txt.
- PerplexityBot: PerplexityBot is Perplexity's web crawler that indexes pages so they can be retrieved and cited in Perplexity's answers, identified by the PerplexityBot user-agent.
- Google-Extended: Google-Extended is a robots.txt control that lets you opt out of having your content used to train Google's Gemini and Vertex AI models, without affecting Google Search or AI Overviews.
- robots.txt: robots.txt is a plain text file at the root of your domain that tells crawlers which user-agents may access which parts of your site, and is how you allow or block AI crawlers.