CCBot
CCBot is the crawler operated by Common Crawl, whose open dataset is a foundational training source for many AI models, so allowing or blocking it shapes how widely your content trains AI.
CCBot gathers public web pages into Common Crawl, the large open dataset that many AI models have trained on. Because so many downstream models draw on that dataset, CCBot's reach is broad: allowing it means your content can flow into numerous training sets at once.
CCBot respects robots.txt rules addressed to the CCBot user-agent, so the
access decision is yours. As with other training crawlers, blocking
CCBot is a training-rights choice: it won't change
whether a search-grounded engine cites you today. Note that blocking it now
does not retroactively remove pages collected in earlier crawls.
Example. A site that wants to limit its presence in open training corpora adds a
two-line robots.txt rule, User-agent: CCBot followed by Disallow: /. The rule
affects future Common Crawl snapshots, though not copies already distributed.
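To sanity-check a rule like the one in the example before publishing it, a minimal sketch using Python's standard-library robots.txt parser is shown below. It only approximates how a compliant crawler would read the file and is not CCBot's own implementation. The helper name is_allowed, the example.com URL, and the SomeOtherBot user-agent are hypothetical and used purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the example above.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: it parses the text locally instead of
    downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL; substitute a page from your own site.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Because the rule names only the CCBot user-agent, crawlers that no rule matches remain allowed under this parser's default behavior.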
Related terms
- Common Crawl: Common Crawl is a nonprofit that publishes a free, massive archive of crawled web pages, which has served as a foundational training dataset for many large language models.
- robots.txt: robots.txt is a plain text file at the root of your domain that tells crawlers which user-agents may access which parts of your site, and is how you allow or block AI crawlers.
- Training Data: Training data is the body of text and other content an AI model learns from during training, shaping what it knows by default before any live retrieval is involved.