Robots Exclusion Protocol
The Robots Exclusion Protocol is the robots.txt convention for telling automated crawlers which URL paths they are asked not to fetch. It is now central to the fight over AI crawling, answer engines, training data, and agentic browsing.
Definition
The Robots Exclusion Protocol, usually encountered as a robots.txt file, is a machine-readable way for a service owner to publish crawler rules for a site. RFC 9309, published in September 2022, standardized the protocol as an IETF Standards Track document. Its authors are Martijn Koster, Gary Illyes, Henner Zeller, and Lizzi Sassman.
The protocol predates generative AI. Martijn Koster originally defined it in 1994 for crawlers that recursively traverse links for indexing. The modern dispute is that the same small file now sits between search indexing, model training, answer generation, retrieval, and automated agents.
The essential warning is also in RFC 9309: robots.txt rules are not access authorization. They are a request that crawlers are expected to honor, not a security boundary, contract, paywall, authentication layer, or proof of consent.
How It Works
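A robots.txt file is served from the top-level path of a site, such as https://example.com/robots.txt. RFC 9309 defines groups made from one or more User-agent lines followed by rules such as Allow and Disallow. A crawler uses its product token to find the relevant group, then checks the group's rules to determine which path prefixes it may access. When several rules match the same path, the most specific (longest) match applies, and an Allow rule is preferred over an equally specific Disallow rule.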
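The rule matching described above can be sketched in a few lines. This is a minimal illustration, not a full RFC 9309 implementation: it assumes the rules for one group have already been parsed into pairs, it ignores wildcards and percent-encoding, and the names Rule, is_allowed, and the /archive/ paths are hypothetical. It shows the longest matching prefix winning, with Allow preferred on a tie.

```python
from typing import List, Tuple

# Hypothetical representation of one group's rules:
# ("allow" | "disallow", path_prefix)
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if `path` may be fetched under one group's rules."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix:
            continue  # an empty pattern is ignored in this sketch
        if path.startswith(prefix):
            allow = kind == "allow"
            # A longer prefix wins; on equal length, prefer allow.
            if len(prefix) > best_len or (len(prefix) == best_len and allow):
                best_len = len(prefix)
                best_allow = allow
    return best_allow


# Illustrative group: block an archive but re-open one subtree.
group = [("disallow", "/archive/"), ("allow", "/archive/public/")]
```

With this group, is_allowed(group, "/archive/public/a.html") is allowed because the Allow prefix is longer, while is_allowed(group, "/archive/2020/") is not.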
A robots.txt file applies to the host, protocol, and port where it is served; it does not automatically cover every subdomain, scheme, or shared host. RFC 9309 also defines behavior for unavailable files, unreachable files, parsing errors, and caching. A 4xx response means the file is unavailable, and the crawler may access any resource. A server error (5xx) or network error means the file is unreachable, and the crawler must assume complete disallow. If the file stays unreachable for a reasonably long period (the RFC gives 30 days as an example), the crawler may treat it as unavailable or continue to use a cached copy.
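A short sketch of the scoping and failure rules, under stated assumptions: the function names and the returned policy labels are hypothetical, a status of None stands in for a network error, and statuses outside the ranges the RFC discusses are treated conservatively as disallow, which is this sketch's choice rather than the standard's.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url: str) -> str:
    """Build the robots.txt URL for the scheme, host, and port of `url`."""
    parts = urlsplit(url)
    # netloc carries the host and any explicit port, so each
    # scheme/host/port combination gets its own robots.txt URL.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def policy_for_fetch(status: Optional[int]) -> str:
    """Map the result of fetching robots.txt to a crawl policy label."""
    if status is None or 500 <= status <= 599:
        return "disallow_all"      # unreachable: assume complete disallow
    if 200 <= status <= 299:
        return "parse_rules"       # fetched: follow the parseable rules
    if 300 <= status <= 399:
        return "follow_redirect"   # redirect: retry at the new location
    if 400 <= status <= 499:
        return "allow_all"         # unavailable: no restrictions apply
    return "disallow_all"          # anything else: conservative assumption
```

For example, robots_url_for("http://blog.example.com/post") and robots_url_for("https://example.com:8443/a") yield two different files, which is why subdomains and nonstandard ports need their own rules.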
Google's documentation gives a practical interpretation for search operators: robots.txt primarily manages crawler traffic and should not be used to hide pages from Google Search. Google also says its automated crawlers download and parse robots.txt before crawling, while user-controlled or safety-related fetchers can be outside that ordinary crawler model.
Crawler and Agent Context
Robots.txt is now a governance surface for AI Search and Answer Engines. OpenAI documents separate robots.txt controls for OAI-SearchBot and GPTBot, so a site can allow search appearance while disallowing use of crawled content for training OpenAI's generative AI foundation models. Google documents common crawlers that obey robots.txt rules when crawling automatically.
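As an illustration of per-crawler rules, the following sketch parses a hypothetical robots.txt with Python's standard urllib.robotparser. The file contents, the example.com URLs, and the ExampleBot token are assumptions for demonstration; GPTBot and OAI-SearchBot are the tokens named above. Python's parser is a convenience for testing and is not guaranteed to match RFC 9309 or any provider's crawler in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow the search
# crawler, and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, url: str) -> bool:
    """Ask the parsed rules whether `agent` is requested to avoid `url`."""
    return parser.can_fetch(agent, url)


article = "https://example.com/articles/1"
print(may_fetch("GPTBot", article))         # training crawler
print(may_fetch("OAI-SearchBot", article))  # search crawler
```

A check like this belongs in a deployment test, so that a change to one crawler's group does not silently change another's.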
That granularity is useful but incomplete. Robots.txt says whether a crawler should fetch a path. It does not say whether content may be used for model training after access, summarized in an answer engine, retained in an agent memory, sold as a dataset, or used inside a logged-in browser session. Those are content-use, licensing, product, and delegation questions.
This is why robots.txt should be read beside AI Preferences (AIPREF), Web Bot Auth, and AI Data Licensing. Robots.txt is a crawler preference signal. AIPREF is content-use vocabulary. Web Bot Auth is traffic authentication. Licensing is a legal or contractual layer. None of them replaces the others.
Governance and Safety
The strongest robots.txt governance rule is negative: do not put sensitive paths in robots.txt as if the file were private. RFC 9309 warns that listing paths exposes them publicly and says real access control should use appropriate security measures such as HTTP authentication.
Robots.txt is also not enough for compliance evidence. A publisher may need logs showing which crawler fetched which URL, which rule applied at the time, which IP ranges or signatures were verified, which contract or license existed, and which downstream use was permitted. Without records, a preference file becomes a ritual of refusal rather than an enforceable workflow.
The AI-era pressure is that crawling is no longer a single bargain. Search visibility, model training, answer grounding, live retrieval, ad safety checks, and user-requested browsing have different social meanings. A one-file convention can coordinate some crawler behavior, but it cannot carry the full public contract for machine access to the web.
Defense Pattern
- Version the file. Keep robots.txt changes in reviewable source control, with timestamps and reasons for crawler-specific rules.
- Separate crawler tokens. Treat search, training, ads, monitoring, and user-triggered fetchers as distinct where providers document distinct user agents.
- Do not expose secrets. Protect sensitive material with authentication and authorization, and use noindex controls to keep pages out of search results, rather than listing private paths in robots.txt.
- Log the exchange. Preserve request time, user-agent, IP verification, fetched robots.txt version, matched rule, response status, and downstream use category.
- Pair with content-use signals. Use AIPREF, license terms, provenance records, and contracts when the issue is what happens after crawling.
- Test failure modes. Check 4xx, 5xx, redirects, cache behavior, subdomains, nonstandard ports, and stale crawler caches.
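The versioning and logging items above can be sketched as a single structured record. Everything here is an assumption for illustration: the CrawlRecord field names, the use of SHA-256 to identify the served robots.txt version, and JSON lines as the log format. A real deployment would align these with its own logging pipeline.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CrawlRecord:
    """One logged crawler request (hypothetical field names)."""
    request_time: str     # ISO 8601 UTC timestamp
    user_agent: str       # token presented by the crawler
    ip_verified: bool     # whether the source IP matched published ranges
    robots_sha256: str    # digest of the robots.txt served at that time
    matched_rule: str     # rule that applied to the requested path
    response_status: int  # HTTP status returned to the crawler
    use_category: str     # e.g. "search", "training", "user-triggered"


def robots_digest(body: bytes) -> str:
    """Identify a robots.txt version by the SHA-256 of its bytes."""
    return hashlib.sha256(body).hexdigest()


def to_log_line(record: CrawlRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storing a digest rather than the full file keeps each log line small while still tying the request to an exact version held in source control.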
Source Discipline
Claims about robots.txt should distinguish the standard from a provider's crawler profile. RFC 9309 defines the protocol. Google Search Central documents how Google's automated crawlers interpret it. OpenAI documents how its crawler user-agent tokens map to its products.
When auditing AI data access, preserve the robots.txt file as it existed at acquisition time, not only the current file. State which crawler token was involved and whether the claim concerns crawling, indexing, training, search display, retrieval, summarization, or user-triggered browsing.
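One way to answer "what did robots.txt say at acquisition time" is to keep a sorted history of published versions and look up the one in effect at a given fetch time. The sketch below assumes that history exists as (timestamp, version id) pairs sorted by time, and that all timestamps share one fixed-width ISO 8601 UTC format so that string comparison matches chronological order. The names Snapshot and version_in_effect are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (ISO 8601 UTC publish time, version id such as a commit or digest)
Snapshot = Tuple[str, str]


def version_in_effect(snapshots: List[Snapshot], fetch_time: str) -> Optional[str]:
    """Return the version published at or before `fetch_time`, if any.

    `snapshots` must be sorted ascending by timestamp.
    """
    times = [published for published, _ in snapshots]
    index = bisect_right(times, fetch_time)
    return snapshots[index - 1][1] if index else None


history = [
    ("2025-01-01T00:00:00Z", "v1"),
    ("2025-06-01T00:00:00Z", "v2"),
]
```

This identifies the version the site was serving, not necessarily the one the crawler acted on, since crawlers may cache robots.txt for a period; an audit should state which of the two it is claiming.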
Spiralist Reading
Spiralism reads robots.txt as a politeness sign for machine visitors. It is not a wall. It is a sentence nailed to the public gate: this way, not that way.
The agentic web tests whether institutions treat such sentences as meaningful. When a crawler ignores them, the failure is not mystical. It is a governance choice made visible in logs.
Open Questions
- How should robots.txt apply to user-directed browser agents that fetch pages through a person's session?
- Should AI search, answer grounding, model training, summarization, and RAG indexing each have distinct crawler identities?
- How should publishers prove what their robots.txt file said when a crawler fetched content months earlier?
- What should happen when a contract permits use but a robots.txt rule disallows crawling?
- Can crawler preferences stay useful when noncompliant crawlers route through residential proxies, browsers, or third-party caches?
Related Pages
- AI Preferences (AIPREF)
- AI Data Licensing
- AI Search and Answer Engines
- Agent-Native Internet
- Web Bot Auth
- AI Browsers and Computer Use
- Training Data
- AI Data Provenance
- Content Provenance and Watermarking
- Platform Governance
- AI Governance
- AI Copyright Litigation
Sources
- RFC Editor, RFC 9309: Robots Exclusion Protocol, September 2022.
- IETF Datatracker, RFC 9309: Robots Exclusion Protocol, reviewed June 25, 2026.
- Google Search Central, Introduction to robots.txt, reviewed June 25, 2026.
- Google Search Central, How Google interprets the robots.txt specification, reviewed June 25, 2026.
- Google Search Central, Google's common crawlers, reviewed June 25, 2026.
- OpenAI Developers, Overview of OpenAI Crawlers, reviewed June 25, 2026.