How to Configure robots.txt for AI Crawlers in 2026: The Complete Technical Guide
AI crawlers have fundamentally changed how robots.txt needs to be configured. This 2026 guide covers the exact user-agent strings, directive logic, and access strategies for managing both traditional search bots and LLM crawlers.
Eden Clarke · · 4 min read
Technical SEO
The robots.txt file you configured for Googlebot in 2023 is no longer sufficient. At least 14 distinct AI crawler user-agent strings are now active in the wild—each with different compliance behaviors, crawl frequencies, and data-use purposes. This guide gives you the exact configuration logic for managing all of them.
Rafael Mora
Updated May 15, 2026 | 13 min read | Expert Reviewed
robots.txt for AI Crawlers — 2026 Configuration Map
User-agent strings, compliance behaviors, and directive logic for 14+ active AI crawlers
The Core Shift
Traditional robots.txt configuration assumed two categories of bots: search engine crawlers (which you want) and spam bots (which you don't). AI crawlers have created a third category: data-collection bots that may or may not respect your directives, and whose data-use purposes differ fundamentally from indexing. Configuring robots.txt for this new category requires understanding which AI crawlers are active, which user-agent strings they use, and—critically—which ones actually comply with the Robots Exclusion Protocol.
Why AI Crawlers Have Changed the robots.txt Calculus
The Robots Exclusion Protocol was designed in 1994 for a world where the primary reason a bot visited your site was to index it for search. That assumption no longer holds. AI crawlers visit your site for fundamentally different purposes: training large language models, powering real-time AI search responses, generating content summaries, and building knowledge bases that may never surface your site as a source.
According to a Cloudflare bot traffic analysis published May 13, 2026, AI crawler traffic across monitored websites increased by 340% between January 2024 and April 2026. More significantly, the analysis found that 23% of identified AI crawler traffic came from bots using user-agent strings not listed in any public documentation—meaning standard robots.txt configurations based on known user-agent strings miss nearly a quarter of AI crawler activity.
Source: Cloudflare, "The State of AI Bot Traffic: Q1 2026 Analysis," published May 13, 2026.
340%
Increase in AI crawler traffic across monitored websites, Jan 2024 – Apr 2026 (Cloudflare, May 2026)
14+
Distinct AI crawler user-agent strings confirmed active as of May 2026, up from 3 in early 2024
23%
Of AI crawler traffic uses undocumented user-agent strings, bypassing standard robots.txt rules
The practical implication: robots.txt is a necessary but insufficient tool for managing AI crawler access in 2026. It remains the correct first line of configuration—most major AI providers have committed to honoring robots.txt directives—but it must be combined with server-level rate limiting, HTTP header controls, and terms-of-service enforcement for comprehensive coverage.
AI Crawler Traffic Growth 2024–2026
Monthly AI crawler request volume across 1M+ monitored websites, showing the 340% growth trajectory
Fig. 2: Line chart of relative crawler request volume by month, January 2024 to April 2026 (indexed to 100 in January 2024). Three lines: traditional search crawlers (flat), known AI crawlers (steep rise), and unknown or undocumented AI crawlers (moderate rise). The combined AI crawler line reaches 440 by April 2026.
The 2026 AI Crawler User-Agent Reference Table
The following table documents the confirmed user-agent strings for major AI crawlers as of May 2026, along with their compliance behavior and primary data-use purpose. This is the reference you need before writing a single line of robots.txt configuration.
| User-Agent String | Provider | Primary Purpose | REP Compliance |
| --- | --- | --- | --- |
| GPTBot | OpenAI | LLM training data collection | Confirmed |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT responses | Confirmed |
| ClaudeBot | Anthropic | LLM training and Claude AI responses | Confirmed |
| anthropic-ai | Anthropic | Secondary Anthropic crawler identifier | Confirmed |
| PerplexityBot | Perplexity AI | AI search index and answer generation | Confirmed |
| Applebot-Extended | Apple | Apple Intelligence training data | Confirmed |
| Bytespider | ByteDance / TikTok | LLM training and content analysis | Partial |
| cohere-ai | Cohere | Enterprise LLM training | Confirmed |
| Meta-ExternalAgent | Meta AI | Meta AI assistant data collection | Confirmed |
| Diffbot | Diffbot | Structured data extraction for AI knowledge graphs | Partial |
| YouBot | You.com | AI search engine indexing | Confirmed |
| Amazonbot | Amazon | Alexa AI and Amazon LLM training | Confirmed |
| ICC-Crawler | Various AI research orgs | Academic AI research data collection | Unknown |
| AI2Bot | Allen Institute for AI | Open-source AI research datasets | Confirmed |
Sources: Individual provider documentation pages; Dark Visitors AI crawler database, updated May 2026; Cloudflare bot intelligence reports, May 2026.
Compliance ≠ Guaranteed Blocking
"Confirmed" compliance means the provider has publicly committed to honoring robots.txt directives and has demonstrated this in testing. It does not mean every instance of their crawler will comply—particularly if the provider uses third-party crawling infrastructure or if their crawler has been updated since compliance was last verified. Treat robots.txt as a strong signal, not a guarantee.
robots.txt Syntax: The Directives That Matter for AI Access Control
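The Robots Exclusion Protocol, standardized as RFC 9309, defines a small set of directives. For AI crawler management, four directives matter: User-agent, Disallow, Allow, and Crawl-delay. The first three are part of the standard; Crawl-delay is a widely used extension that RFC 9309 does not define, so crawler support for it varies. Understanding how they interact, particularly the precedence rules, is essential for writing configurations that behave as intended.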
Precedence Rules: How Conflicts Are Resolved
When multiple rules could apply to a single URL, robots.txt resolves conflicts using two principles:
Specificity wins: When an Allow rule and a Disallow rule both match a URL, the rule with the longest matching path applies. Allow: /blog/public/ overrides Disallow: /blog/ for URLs within /blog/public/.
Specific user-agent blocks override wildcard blocks: Rules for a named user-agent take precedence over User-agent: * rules for that bot. A bot with a specific block does not inherit wildcard rules.
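The specificity rule can be sketched as a longest-match lookup. This is a minimal illustration, assuming plain path prefixes only (no * or $ wildcards); the function name is_allowed and the rule tuples are hypothetical and not part of any library.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) tuples, e.g. ("disallow", "/blog/").
    The longest matching prefix wins; on a tie, "allow" wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


rules = [("disallow", "/blog/"), ("allow", "/blog/public/")]
print(is_allowed("/blog/public/post-1", rules))  # True: the longer Allow path wins
print(is_allowed("/blog/drafts/post-2", rules))  # False: only Disallow: /blog/ matches
print(is_allowed("/about/", rules))              # True: no rule matches
```

Because the comparison is by path length rather than file order, the result is the same whichever rule is listed first.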
Critical Precedence Misunderstanding
Many site owners write a wildcard block and then add a specific AI bot block, expecting the AI bot to be subject to both. This is wrong. Once a bot matches a specific User-agent block, it only follows that block's rules—not the wildcard block. If you want an AI bot to be blocked from everything the wildcard blocks plus additional paths, you must repeat all the wildcard rules in the specific bot's block.
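The non-inheritance behavior is easy to check locally. The sketch below uses Python's standard-library urllib.robotparser; the robots.txt content, the example.com URLs, and the bot name SomeOtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /api/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group, so the wildcard's /admin/ rule does not apply to it.
gptbot_admin = parser.can_fetch("GPTBot", "https://www.example.com/admin/settings")
gptbot_api = parser.can_fetch("GPTBot", "https://www.example.com/api/v1/users")
# A bot with no named group falls back to the wildcard group.
other_admin = parser.can_fetch("SomeOtherBot", "https://www.example.com/admin/settings")

print(gptbot_admin, gptbot_api, other_admin)  # True False False
```

To close the gap, Disallow: /admin/ would have to be repeated inside the GPTBot group.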
Eight Configuration Scenarios with Complete Code Examples
The following scenarios cover the most common access control requirements for AI crawlers. Each includes a complete, copy-ready robots.txt block with inline annotations.
1
Block All AI Training Crawlers, Allow AI Search Crawlers
Most Requested
You want your content to appear in AI-powered search results (Perplexity, You.com) but don't want it used for LLM training datasets. This is the most nuanced configuration—it requires distinguishing between crawlers by purpose, not just by provider.
robots.txt
# Block LLM training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

# Allow AI search crawlers (no Disallow = full access)
User-agent: PerplexityBot
Allow: /

User-agent: YouBot
Allow: /

# Standard crawlers: full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml
Best for: Publishers, content creators, and media sites that want AI search visibility without training data exposure
2
Block All AI Crawlers Entirely
Maximum Control
You want no AI crawler access—neither for training nor for AI search. This is appropriate for sites with proprietary data, paywalled content, or legal restrictions on automated data collection.
robots.txt
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: AI2Bot
Disallow: /

# Standard search crawlers: full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml
Best for: Paywalled content, proprietary databases, legal/medical sites with data sensitivity requirements
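Because the block list is long and changes often, some teams generate the file from a list rather than editing it by hand. The following is a minimal sketch of that idea; build_robots_txt is a hypothetical helper, and the agent lists simply mirror the scenario above.

```python
BLOCKED_AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "Applebot-Extended", "Bytespider", "cohere-ai", "Meta-ExternalAgent",
    "Diffbot", "YouBot", "Amazonbot", "AI2Bot",
]
ALLOWED_SEARCH_AGENTS = ["Googlebot", "Bingbot"]


def build_robots_txt(blocked_agents, allowed_agents, sitemap_url):
    """Render one group per agent, then the Sitemap line."""
    lines = ["# Block all known AI crawlers"]
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("# Standard search crawlers: full access")
    for agent in allowed_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    BLOCKED_AI_AGENTS,
    ALLOWED_SEARCH_AGENTS,
    "https://www.yourdomain.com/sitemap.xml",
)
print(robots_txt)
```

Keeping the agent list in one place makes the quarterly review a one-line change.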
3
Protect Specific Directories from AI Crawlers Only
Surgical Control
You want AI crawlers to access your public blog and marketing pages, but not your user-generated content, API endpoints, or member-only sections. Standard search crawlers get full access.
robots.txt
# Restrict AI training crawlers to public content only
# Named groups do not inherit the wildcard rules, so /admin/ and /staging/ are repeated here
User-agent: GPTBot
Disallow: /api/
Disallow: /members/
Disallow: /user-content/
Disallow: /private/
Disallow: /admin/
Disallow: /staging/
Allow: /blog/
Allow: /about/

User-agent: ClaudeBot
Disallow: /api/
Disallow: /members/
Disallow: /user-content/
Disallow: /private/
Disallow: /admin/
Disallow: /staging/
Allow: /blog/
Allow: /about/

# All other bots: standard restrictions only
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /staging/

Sitemap: https://www.yourdomain.com/sitemap.xml
Best for: SaaS platforms, community sites, and publishers with mixed public/private content
4
Rate-Limit Aggressive AI Crawlers with Crawl-Delay
Server Protection
Some AI crawlers, particularly those with partial compliance ratings, crawl aggressively and can impact server performance. The Crawl-delay directive requests a minimum wait time, in seconds, between requests. Note: Googlebot ignores this directive, and Google retired the Search Console crawl-rate limiter in January 2024. To slow Googlebot, Google's documentation instead describes temporarily returning 429, 500, or 503 responses.
robots.txt
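# Rate-limit aggressive crawlers
# Named groups do not inherit the wildcard rules, so /admin/ is repeated here
User-agent: Bytespider
Crawl-delay: 10
Disallow: /api/
Disallow: /user-content/
Disallow: /admin/

User-agent: Diffbot
Crawl-delay: 10
Disallow: /api/
Disallow: /admin/

# Standard crawlers: no delay
User-agent: *
Disallow: /admin/

Sitemap: https://www.yourdomain.com/sitemap.xml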
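Best for: Sites experiencing server load from aggressive AI crawlers. A 10-second delay caps a compliant crawler at about 6 requests per minute; that is a ~90% reduction only if the crawler was previously sending about one request per second.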
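The effect of a Crawl-delay value is simple arithmetic, sketched here with Python's standard-library urllib.robotparser to read the value back out of the file. The figures are ceilings for a crawler that honors the directive, not measurements of any real bot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bytespider
Crawl-delay: 10
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("Bytespider")  # seconds between requests, or None if unset
max_per_minute = 60 / delay
max_per_day = 86_400 / delay

print(delay, max_per_minute, max_per_day)  # 10 6.0 8640.0
```

Comparing max_per_day with the crawler's current daily request count in your logs shows how much load the directive could actually remove.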
Step-by-Step: Implementing Your robots.txt Configuration
1
Audit Your Current AI Crawler Traffic
Before writing any rules, check your server access logs for the user-agent strings that are currently visiting your site. Look for the strings in the reference table above. This tells you which AI crawlers are already active on your site and how frequently they're crawling—which determines the urgency and specificity of your configuration.
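Quick method: In your server logs, filter case-insensitively for requests where the user-agent contains "bot," "crawler," or "spider," plus the specific tokens from the reference table (a bare "AI" substring is too broad on its own). Sort by frequency to identify the most active crawlers first.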
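The audit can be sketched in a few lines of Python. This assumes access logs in the common "combined" format, where the user-agent is the last quoted field; the token list is a subset of the reference table and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Subset of the reference table above; extend it to match your own policy.
AI_AGENT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Bytespider"]

# In the combined log format the user-agent is the last quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawlers(log_lines):
    """Count requests per AI crawler token (case-insensitive substring match)."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match is None:
            continue
        user_agent = match.group(1).lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Hypothetical log lines for illustration only.
sample_log = [
    '203.0.113.5 - - [15/May/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [15/May/2026:10:00:04 +0000] "GET /about/ HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [15/May/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.10 - - [15/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

crawler_counts = count_ai_crawlers(sample_log)
for token, hits in crawler_counts.most_common():
    print(token, hits)
```

In practice the list of lines would come from reading the log file; user-agents that match none of the tokens are worth a separate pass to spot undocumented crawlers.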
2
Define Your Access Policy for Each Crawler Category
Before writing directives, make a deliberate decision for each crawler category: (a) full access, (b) restricted access (specific paths blocked), or (c) no access. Your policy should be driven by your content's commercial value, your terms of service, and your appetite for AI search visibility. Document this policy—you'll need it when the configuration needs updating.
Decision framework: If your content's value comes from being discovered (media, marketing), lean toward allowing AI search crawlers. If your content's value comes from exclusivity (research, paywalled data), lean toward blocking all AI crawlers.
3
Write Your Configuration Using the Correct Precedence Logic
Use the scenarios above as templates. Always list specific user-agent blocks before the wildcard block. Remember that specific user-agent blocks do not inherit wildcard rules—if you want an AI bot to be subject to both your standard restrictions and additional AI-specific restrictions, you must include all relevant Disallow directives in the AI bot's specific block.
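Validation: After writing your configuration, test it with a robots.txt parser or validator before deploying. Google retired the standalone robots.txt Tester in late 2023; the robots.txt report in Search Console now shows which file Google fetched and any parsing problems, but it only reflects Google's own crawlers. For each user-agent you configured, test both URLs you want blocked and URLs you want allowed.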
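One way to run this check before deploying is a small expectations table, sketched below with Python's standard-library urllib.robotparser. The robots.txt content is an abbreviated version of Scenario 1 and check_expectations is a hypothetical helper. Note that Python's parser applies the rules inside a group in file order rather than by longest match, so treat it as an approximation for groups that mix Allow and Disallow.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /
"""

# (user-agent, URL, should the crawler be allowed?)
EXPECTATIONS = [
    ("GPTBot", "https://www.yourdomain.com/blog/post", False),
    ("ClaudeBot", "https://www.yourdomain.com/", False),
    ("PerplexityBot", "https://www.yourdomain.com/blog/post", True),
    ("Googlebot", "https://www.yourdomain.com/blog/post", True),
]


def check_expectations(robots_txt, expectations):
    """Return the list of (agent, url, expected, actual) rows that disagree."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, url, expected in expectations:
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            failures.append((agent, url, expected, actual))
    return failures


failures = check_expectations(ROBOTS_TXT, EXPECTATIONS)
print("all rules behave as intended" if not failures else failures)
```

Keeping the expectations table next to the documented access policy makes it easy to rerun after every configuration change.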
4
Deploy to Your Root Directory and Verify
The robots.txt file must be placed at the root of your domain—yourdomain.com/robots.txt. It cannot be placed in a subdirectory. After uploading, verify it's accessible by visiting the URL directly in your browser. The file should display as plain text with no HTML formatting.
Common error: Uploading robots.txt with Windows line endings (CRLF) instead of Unix line endings (LF) can cause parsing errors in some crawlers. Use a plain text editor that allows you to specify line ending format, or verify with a robots.txt validator after upload.
5
Add Your Sitemap and Schedule Quarterly Reviews
Always include your sitemap URL at the end of your robots.txt file. This helps compliant crawlers discover your content structure even when specific paths are blocked. Set a calendar reminder to review your AI crawler configuration quarterly—the list of active AI crawlers and their user-agent strings changes frequently, and a configuration that's current today may be incomplete in three months.
Monitoring resource: The Dark Visitors project (darkvisitors.com) maintains a continuously updated database of AI crawler user-agent strings and compliance behaviors. Subscribe to their changelog to receive notifications when new crawlers are identified.
robots.txt Decision Tree for AI Crawlers
A flowchart for determining the correct access policy for each AI crawler type
Fig. 3: Flowchart. Start: "Is this crawler identified in your server logs?" If yes: "Is it a known AI crawler?", then "Training crawler or search crawler?", then the policy decision. If no: apply wildcard rules. Each terminal node shows the recommended directive.
What robots.txt Cannot Do: The Limits of the Protocol
Understanding what robots.txt cannot do is as important as knowing how to configure it. Three limitations are particularly relevant for AI crawler management in 2026.
It cannot prevent access by non-compliant crawlers. The Robots Exclusion Protocol is a voluntary standard. Crawlers that don't comply with it—including many scrapers and some AI crawlers with partial compliance ratings—will simply ignore your directives. For these crawlers, server-level controls (IP blocking, rate limiting, WAF rules) are the appropriate tool.
It cannot prevent AI from using content it has already collected. If an AI crawler indexed your content before you added blocking rules, that content may already be in a training dataset or knowledge base. robots.txt prevents future crawling; it does not retroactively remove previously collected data. For retroactive removal, you need to contact the AI provider directly—most major providers have content removal request processes.
It cannot distinguish between legitimate and illegitimate uses of the same user-agent string. Any crawler can claim to be Googlebot or GPTBot. Verify that crawlers claiming to be major bots are actually coming from the expected IP ranges. Google and OpenAI publish their crawler IP ranges; for other providers, check the current crawler documentation before relying on an allowlist.
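The IP check can be sketched with Python's standard-library ipaddress module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not real crawler ranges; is_verified_crawler and PUBLISHED_RANGES are hypothetical names, and the real ranges would need to be loaded from each provider's published list.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# Replace them with the ranges each provider actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "Googlebot": ["198.51.100.0/24"],
}


def is_verified_crawler(agent_token, remote_ip, published_ranges=PUBLISHED_RANGES):
    """Return True if `remote_ip` falls inside a range listed for `agent_token`."""
    address = ipaddress.ip_address(remote_ip)
    cidrs = published_ranges.get(agent_token, [])
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


print(is_verified_crawler("GPTBot", "192.0.2.44"))     # True: inside the listed range
print(is_verified_crawler("GPTBot", "203.0.113.9"))    # False: claims GPTBot, wrong network
print(is_verified_crawler("UnknownBot", "192.0.2.44")) # False: no published range on file
```

A request that claims a major crawler's user-agent but fails this check is a candidate for a WAF or rate-limiting rule rather than a robots.txt rule.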
Complementary Controls
For comprehensive AI crawler management, combine robots.txt with: (1) HTTP response headers: the X-Robots-Tag header applies robots directives to non-HTML resources like PDFs; (2) Meta robots tags: <meta name="robots" content="noai, noimageai"> is a proposed, non-standard page-level opt-out for AI training, and crawler support for it is not guaranteed; (3) Terms of service: explicit prohibition of AI training data collection creates legal grounds for enforcement. Implementation details are covered in the companion guide "X-Robots-Tag and Meta Robots: The Complete 2026 Guide".
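As an illustration of the header-based control, here is a minimal WSGI middleware sketch that adds X-Robots-Tag to PDF responses. It assumes a Python WSGI application; add_x_robots_tag and demo_app are hypothetical names, and the noindex value is only an example directive. Most sites would set this header in the web server or CDN configuration instead.

```python
def add_x_robots_tag(app, directive="noindex", suffixes=(".pdf",)):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.lower().endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", directive)]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in application that returns an empty 200 response."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b""]


wrapped_app = add_x_robots_tag(demo_app)
```

The wrapper leaves HTML responses untouched, so page-level meta tags can still be used for those.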
Five Common robots.txt Mistakes That Affect AI Crawler Control
Relying on the Wildcard to Block AI Crawlers
Using User-agent: * with a broad Disallow to block AI crawlers also blocks search engine crawlers—which is almost never the intent. The wildcard applies to every bot that doesn't have a specific block, including Googlebot and Bingbot if they're not explicitly listed with Allow rules.
Fix: Always list your search engine crawlers explicitly with Allow: / before or after your wildcard block. Specific user-agent blocks take precedence over the wildcard for those bots.
Using an Outdated AI Crawler List
A robots.txt configuration written in early 2024 with 3–4 AI user-agent strings is now missing at least 10 active crawlers. Outdated configurations create a false sense of security—you believe you've blocked AI crawlers, but you've only blocked the ones that existed when you wrote the file.
Fix: Review and update your AI crawler list quarterly. Use the reference table in this guide as a starting point, and cross-reference with your server logs to identify any crawlers not on the list.
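The quarterly cross-check can be partly automated. The sketch below compares the User-agent lines in an existing robots.txt against a reference list and reports which tokens have no group of their own; configured_agents and missing_agents are hypothetical helpers, and the reference list is a subset of the table in this guide.

```python
def configured_agents(robots_txt):
    """Return the lowercase set of User-agent tokens that appear in the file."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_agents(robots_txt, reference_agents):
    """List reference tokens that have no group of their own in the file."""
    configured = configured_agents(robots_txt)
    return sorted(a for a in reference_agents if a.lower() not in configured)


OLD_ROBOTS_TXT = """\
# Written in early 2024
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

REFERENCE = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"]
missing = missing_agents(OLD_ROBOTS_TXT, REFERENCE)
print(missing)  # ['Bytespider', 'PerplexityBot']
```

A missing token is not automatically a problem, since the crawler may be intentionally covered by the wildcard group, but each one should map to a deliberate policy decision.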
Blocking CSS and JavaScript Files from AI Crawlers
Some site owners block all non-HTML resources from AI crawlers to reduce data exposure. This can backfire: AI search crawlers (like PerplexityBot) that you want to allow may need CSS and JavaScript to render your pages correctly and understand your content structure. Blocking these resources can result in poor or no representation in AI search results.
Fix: If you want AI search visibility, allow CSS and JavaScript access for AI search crawlers. Apply resource blocking only to AI training crawlers you're blocking entirely.
Forgetting the Sitemap Directive
A robots.txt file without a Sitemap directive forces crawlers to discover your content through link-following alone. For AI search crawlers you want to allow, this means slower and less complete indexing of your content—particularly for newer pages that haven't accumulated many inbound links yet.
Fix: Always include Sitemap: https://www.yourdomain.com/sitemap.xml at the end of your robots.txt file. If you have multiple sitemaps (news, images, video), list each on a separate line.
Treating robots.txt as a Security Measure
The most dangerous mistake: assuming that a Disallow directive actually prevents access to sensitive content. robots.txt is publicly readable—it tells every bot (and every human) exactly which paths you consider sensitive. Malicious actors can use your robots.txt as a map to your most valuable or vulnerable content.
Fix: Never rely on robots.txt to protect sensitive content. Use authentication, access controls, and server-level security for anything that genuinely needs to be protected. robots.txt is for crawl management, not security.
The AI Crawler Control Stack
How robots.txt fits into a layered approach to AI crawler management
Fig. 4: Stack diagram with four layers, bottom to top: Terms of Service (legal layer); Server-Level Controls (WAF, rate limiting, IP blocking); HTTP Headers (X-Robots-Tag, meta robots); robots.txt (top layer). Each layer notes what it controls and what it cannot control, with arrows showing how the layers complement each other.
Frequently Asked Questions
If I block GPTBot, will my content still appear in ChatGPT responses?
Blocking GPTBot prevents OpenAI from crawling your site for future training data, but it does not remove content that was already collected before you added the block. ChatGPT responses draw from training data collected up to the model's knowledge cutoff—blocking GPTBot now affects future training runs, not the current model's knowledge. For ChatGPT's real-time browsing feature (which uses the ChatGPT-User agent), blocking that user-agent will prevent your content from appearing in real-time browsing responses. These are two separate user-agent strings with different functions.
How do I verify that an AI crawler is actually complying with my robots.txt?
The most reliable method is to check your server access logs after adding blocking rules. If a crawler is complying, you should see its requests stop for the blocked paths within 24–48 hours of the robots.txt update (most crawlers re-fetch robots.txt frequently). You can also use a honeypot page—a page that's disallowed in robots.txt but contains a unique tracking pixel or URL. Any request to that page from a blocked crawler confirms non-compliance. According to a Cloudflare analysis published May 13, 2026, the major AI providers (OpenAI, Anthropic, Google, Perplexity) all showed compliance rates above 95% in controlled testing.
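The log check described here can be sketched as follows, using Python's standard-library urllib.robotparser. The honeypot path /trap-7f3a/, the timestamps, and the find_violations helper are all hypothetical; the 48-hour grace period reflects the re-fetch window mentioned above.

```python
from datetime import datetime, timedelta
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /trap-7f3a/
"""


def find_violations(robots_txt, requests, updated_at, grace=timedelta(hours=48)):
    """Return requests to disallowed paths made after the grace period.

    `requests` is an iterable of (agent_token, path, timestamp) tuples.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    cutoff = updated_at + grace
    return [
        (agent, path, when)
        for agent, path, when in requests
        if when > cutoff and not parser.can_fetch(agent, path)
    ]


updated_at = datetime(2026, 5, 1, 0, 0)
requests = [
    ("Bytespider", "/trap-7f3a/", datetime(2026, 5, 1, 12, 0)),  # inside grace period
    ("Bytespider", "/trap-7f3a/", datetime(2026, 5, 4, 9, 0)),   # after grace period
    ("GPTBot", "/blog/", datetime(2026, 5, 4, 9, 0)),            # allowed path
]

violations = find_violations(ROBOTS_TXT, requests, updated_at)
print(violations)
```

Only the second request is reported: it hits a disallowed path after the crawler has had time to re-fetch robots.txt.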
Should I block AI crawlers from my sitemap?
No—and this is a common misconception. Your sitemap is a list of URLs, not the content itself. Blocking AI crawlers from your sitemap doesn't prevent them from discovering those URLs through other means (links, direct crawling). More importantly, if you want AI search crawlers to index your content, they need sitemap access to discover it efficiently. The correct approach is to block AI crawlers from the content paths you don't want accessed, while leaving the sitemap accessible to all crawlers.
What's the difference between blocking GPTBot and blocking ChatGPT-User?
GPTBot is OpenAI's training data crawler—it collects content to train future versions of GPT models. Blocking it prevents your content from being used in future training datasets. ChatGPT-User is the user-agent used when ChatGPT's browsing feature accesses your site in real-time to answer a user's question. Blocking it prevents your content from appearing in ChatGPT's real-time responses. Most site owners who want to block OpenAI should block both. If you want your content to appear in ChatGPT responses but not in training data, block only GPTBot and allow ChatGPT-User.
How often should I update my robots.txt for AI crawlers?
Quarterly reviews are the minimum for sites that actively manage AI crawler access. The AI crawler landscape is changing rapidly: new crawlers emerge, existing crawlers change their user-agent strings, and compliance behaviors evolve. A practical workflow: (1) subscribe to the Dark Visitors changelog for new crawler notifications, (2) review your server logs monthly for unrecognized user-agent strings, (3) do a full configuration audit quarterly against the current reference table. For sites with high-value content or strict data governance requirements, monthly reviews are more appropriate. A complete audit framework is covered in the companion guide "How to Audit Your Technical SEO Configuration in 2026".
Technical SEO Lead & Crawl Architecture Specialist · 11 Years Experience
Rafael specializes in crawl budget optimization, bot management, and technical SEO architecture for enterprise websites and digital publishers. He has audited robots.txt configurations for over 200 websites across 30+ industries, and his AI crawler management framework has been adopted by content teams managing sites with 10M+ monthly organic sessions. He contributes to the W3C Web Crawling Community Group.
Written and reviewed by Rafael Mora. Information current as of May 15, 2026.