If you have not audited your robots.txt in the last six months, you are probably blocking AI crawlers without realizing it, or failing to explicitly allow the bots that matter most.
AI crawlers are the silent infrastructure of generative search. Every time ChatGPT, Claude, Perplexity, or Gemini answers a shopping question with cited sources, those citations come from content their crawlers previously indexed. If your site is blocking those bots in robots.txt — whether intentionally or because an SEO plugin made the call for you — your products cannot be cited. We have audited dozens of ecommerce stores in the last six months and found bot configuration issues on roughly 60 percent of them. This guide is the complete reference list of which AI crawlers exist in 2026, what each one does, and how to configure your robots.txt to maximize citation eligibility while protecting the parts of your site that should stay private.
For the broader picture on AI search, see our AI Search Resource Hub and the platform-specific playbooks for Perplexity citations and ChatGPT Instant Checkout.
An AI crawler is an automated bot operated by an AI company that fetches web content for the purpose of training large language models, building search retrieval indexes, or fulfilling real-time user requests. AI crawlers are distinct from traditional search engine crawlers in that they feed AI products rather than search engine result pages.
What is an AI crawler and why does it matter for ecommerce?
An AI crawler is an automated bot operated by an AI company that fetches web content to train AI models, build search retrieval indexes, or fulfill real-time user requests. For ecommerce brands, AI crawlers matter because they are the gatekeepers to AI citation visibility — if a crawler cannot access your content, that content cannot be referenced in AI-generated answers.
Why ecommerce brands should care more than other industries
- Product recommendation queries dominate AI search. A meaningful portion of ChatGPT and Perplexity usage involves shopping research, where ecommerce content is the primary citation surface
- AI-driven traffic is rapidly growing. Perplexity-referred ecommerce traffic alone grew roughly 7x between January 2025 and Q1 2026, and ChatGPT shopping referrals are scaling even faster
- Citation share compounds. Brands cited frequently early in this curve build long-term authority that gets harder to displace later. The cost of being blocked from AI training corpora today compounds over time
The mechanical link is straightforward: AI crawler access in robots.txt determines whether your content enters the citation pool, which determines whether your brand can be recommended when shoppers ask AI engines for product suggestions.
Which AI crawlers should every ecommerce brand know about?
There are roughly 12 AI crawlers that matter for ecommerce brands in 2026, operated by 7 major AI companies plus Common Crawl. The table below is the complete reference list, covering those 12 plus two secondary crawlers (Bytespider and cohere-ai), with user agents, operators, purposes, and our default recommendations.
The complete 2026 AI crawler reference table
| User Agent | Operator | Purpose | AI Products | Recommendation |
|---|---|---|---|---|
| GPTBot | OpenAI | Training crawler | ChatGPT, GPT models | Allow |
| OAI-SearchBot | OpenAI | Search index | ChatGPT Search, SearchGPT | Allow |
| ChatGPT-User | OpenAI | Real-time user-triggered fetch | ChatGPT live retrieval | Allow |
| ClaudeBot | Anthropic | Training and retrieval | Claude | Allow |
| anthropic-ai | Anthropic | Legacy crawler (older) | Claude | Allow |
| PerplexityBot | Perplexity | Indexing crawler | Perplexity | Allow |
| Perplexity-User | Perplexity | Real-time user fetch | Perplexity live retrieval | Allow |
| Google-Extended | Google | Training opt-in for Gemini | Gemini, Bard | Allow |
| Applebot-Extended | Apple | Training crawler | Apple Intelligence | Allow |
| Amazonbot | Amazon | Indexing crawler | Alexa, Rufus | Allow |
| Meta-ExternalAgent | Meta | Training crawler | Meta AI | Allow |
| CCBot | Common Crawl | Open web corpus | Bootstraps many AI datasets | Allow |
| Bytespider | ByteDance | Training crawler | TikTok AI features, Doubao | Conditional |
| cohere-ai | Cohere | Training crawler | Command models | Allow |
Allowing the first 11 entries in this list (GPTBot through Meta-ExternalAgent) covers roughly 90 percent of AI citation surface area for US-based ecommerce brands. CCBot is also worth allowing because of its role bootstrapping many smaller AI training pipelines.
How do AI crawlers differ from traditional search engine crawlers?
AI crawlers differ from traditional search crawlers in three primary ways: they feed AI products rather than search engine result pages, they include both training and real-time retrieval bots, and they evolve rapidly with frequent new user-agent introductions and renames.
Three structural differences worth understanding
- Output destination. Googlebot feeds Google Search results. GPTBot feeds ChatGPT's knowledge base. These are different products with different audiences and different optimization patterns
- Crawler typology. Search crawlers are primarily indexing bots. AI ecosystems have three types: training crawlers (long-term knowledge accumulation), retrieval/index crawlers (search-style indexing), and real-time user-triggered fetchers (live page retrieval during a chat session)
- Pace of change. Googlebot user-agents have been stable for over a decade. AI crawler user-agents change frequently — new bots launch, old ones get renamed or split, and major operators announce new variants on roughly quarterly cadences
What this means practically for ecommerce SEO
- Traditional SEO best practices for crawler accessibility still apply
- The bot list to optimize for is larger and changes more often
- You may want different rules for different bot types (e.g., allow training crawlers on blog content but restrict on price-sensitive product pages)
- Quarterly review of robots.txt configuration is now table stakes
What is the difference between training crawlers and search/retrieval bots?
Training crawlers fetch content periodically to build the long-term knowledge base used by AI models. Search and retrieval bots index content for live retrieval during AI chat sessions. Real-time user-triggered fetchers retrieve a specific page on-demand when a user asks a question. Each serves a different function and warrants different consideration in your robots.txt strategy.
Training crawlers
Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, CCBot, cohere-ai. These crawl periodically and feed long-term AI training datasets. Content they index becomes part of the model's baseline knowledge and can be referenced for years.
Search and retrieval bots
Examples: OAI-SearchBot, PerplexityBot, Amazonbot. These build searchable indexes that AI systems query at conversation time to retrieve relevant sources. Content they index is freshly retrievable and contributes to citations even before training updates.
Real-time user-triggered fetchers
Examples: ChatGPT-User, Perplexity-User. These activate when a specific user asks a question and the AI system retrieves a particular page to compose an up-to-the-minute answer. Blocking these bots prevents your content from being retrieved live even if other crawlers have indexed it.
For maximum citation eligibility, all three categories should be allowed. Training-only access gives you baseline citation eligibility but no live retrieval. Search-index access without training access means you can be retrieved but may not match natural language patterns as well. Allowing all three gives you full citation surface.
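To see which of the three categories a given robots.txt actually leaves open, the grouping above can be turned into a quick check. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the category lists mirror the examples in this section, while `category_access`, the sample file, and the product URL are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Category membership copied from the examples in this section.
CRAWLER_CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
                 "Meta-ExternalAgent", "CCBot", "cohere-ai"],
    "search_retrieval": ["OAI-SearchBot", "PerplexityBot", "Amazonbot"],
    "real_time": ["ChatGPT-User", "Perplexity-User"],
}


def category_access(robots_txt: str, url: str) -> dict:
    """Return, per category, the user agents that may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = {}
    for category, agents in CRAWLER_CATEGORIES.items():
        blocked[category] = [a for a in agents if not parser.can_fetch(a, url)]
    return blocked


# Hypothetical file: training and index crawlers allowed, live fetchers blocked.
SAMPLE = """\
User-agent: ChatGPT-User
User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(category_access(SAMPLE, "https://yoursite.com/products/widget"))
```

An empty list for every category means the URL is open to all three crawler types. In the sample file only the `real_time` list is non-empty, which is the "indexed but not retrievable live" gap described above.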
How do you configure robots.txt to allow AI crawlers correctly?
The recommended baseline robots.txt configuration for ecommerce brands explicitly allows all major AI crawlers, sets sensible Disallow rules for non-public areas (admin, checkout, cart, account), and includes your sitemap. Below is a ready-to-paste configuration you can use as a starting point.
Recommended robots.txt for most ecommerce sites
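The following is a minimal sketch of a configuration that matches the description in this section, wrapped in a Python string so it can be checked with the standard-library parser before you deploy it. The specific paths (`/admin/`, `/checkout/`, `/cart/`, `/account/`, `/api/`), the `sort` and `filter` query parameters, and the sitemap URL are assumptions; treat the content between the triple quotes as the robots.txt itself and adapt it to your platform.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# AI crawlers: public content allowed, private and filtered URLs blocked
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: CCBot
User-agent: cohere-ai
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

# Default rules for all other crawlers
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
"""


def is_allowed(agent: str, path: str) -> bool:
    """Check one user agent against one path using the file above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://yoursite.com" + path)


if __name__ == "__main__":
    for agent, path in [("GPTBot", "/products/blue-widget"),
                        ("GPTBot", "/checkout/"),
                        ("Perplexity-User", "/")]:
        print(agent, path, is_allowed(agent, path))
```

The Disallow lines sit above `Allow: /` on purpose: parsers that stop at the first matching rule (Python's is one) and parsers that pick the longest matching rule then reach the same answer. Wildcard support varies by crawler, and the standard-library parser only matches plain path prefixes, so the two `/*?` lines are not exercised by `is_allowed`.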
What this configuration does
- Explicitly allows 13 major AI crawlers on all public content
- Blocks private areas (admin, checkout, cart, account, API endpoints) from all bots including AI crawlers
- Blocks filtered URLs that create duplicate content (sort by, filter by) for AI bots
- Sets default rules for all other crawlers
- Declares your sitemap location for discoverability
Replace yoursite.com with your actual domain. Adjust the Disallow patterns to match your platform — the example uses common WooCommerce/Shopify patterns but your URL structure may differ.
Should you block any AI crawlers? When and why?
Most ecommerce brands should not block any major AI crawler because the citation loss outweighs the protection benefit. There are four specific scenarios where blocking makes sense: published copyrighted content you do not want in training datasets, controversial crawlers with poor governance, subdirectories with private content on otherwise public domains, and test or staging subdomains.
When blocking makes sense
- Published copyrighted media you sell as content products. Brands that sell digital content (courses, reports, premium articles) may want to block training crawlers from those specific paths while allowing crawl on marketing pages
- Bytespider for non-TikTok brands. Bytespider has been criticized for aggressive crawling and unclear data governance. Brands not selling on TikTok Shop can reasonably block it
- Private member content. Subdirectories like /members/, /premium/, /paid/ should be blocked from all crawlers regardless of AI status
- Test or staging subdomains. Anything not intended for production traffic should be blocked across the board
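The selective blocking described above can be sketched as follows. The paths `/courses/` and `/reports/` are hypothetical stand-ins for wherever your paid content lives, and only three training crawlers are listed to keep the example short. The file is again embedded in Python so the rules can be checked with the standard-library parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Training crawlers: keep paid content products out of training datasets.
# A crawler with its own group ignores the * group, so the member-area
# rules are repeated here.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /courses/
Disallow: /reports/
Disallow: /members/
Disallow: /premium/
Disallow: /paid/
Allow: /

# Bytespider: blocked entirely.
User-agent: Bytespider
Disallow: /

# Everyone else, including real-time fetchers: member areas stay closed.
User-agent: *
Disallow: /members/
Disallow: /premium/
Disallow: /paid/
Allow: /
"""


def allowed(agent: str, path: str) -> bool:
    """Check one user agent against one path using the file above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://yoursite.com" + path)


if __name__ == "__main__":
    print(allowed("GPTBot", "/blog/buying-guide"))      # marketing page
    print(allowed("GPTBot", "/courses/seo-101"))        # paid content
    print(allowed("ChatGPT-User", "/courses/seo-101"))  # live fetcher
```

Test and staging subdomains are not covered by this file: robots.txt applies per host, so a staging subdomain needs its own file (and ideally authentication, since robots.txt is advisory).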
When blocking is usually a mistake
- Blocking GPTBot because of training concerns. The citation visibility loss in ChatGPT is severe and usually outweighs the training-data concern
- Blocking PerplexityBot or Perplexity-User. These directly determine your Perplexity citation eligibility — blocking them means zero Perplexity citations
- Blocking ClaudeBot. Removes you from Claude's citation pool entirely
- Blocking all AI crawlers by default. Some SEO plugins do this automatically — check yours and override if needed
Several popular WordPress and Shopify SEO plugins added “block AI bots” toggles in 2024 and 2025 with the toggle enabled by default. We have seen ecommerce stores that updated their plugin and unknowingly cut themselves off from ChatGPT, Claude, and Perplexity overnight. Audit your robots.txt manually at least quarterly to catch these.
How do you verify that AI crawlers are actually reaching your site?
Verify AI crawler activity by examining your server logs for known AI user-agent strings, using Shopify or hosting platform analytics tools that surface crawler data, or running periodic prompt audits in AI engines to see which content from your site is being cited. Server logs are the definitive source of truth.
Three methods to check AI crawler activity
- Server log analysis. Most hosting providers offer access to raw server logs. Filter for user-agent strings containing GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, etc. to see crawl frequency
- Platform analytics tools. Shopify has third-party apps like Log File Analyser that visualize crawler activity. WooCommerce sites can use plugins like Slim Stat Analytics or Crawler Stats
- Prompt audit method. Run 20-30 category-relevant prompts in ChatGPT and Perplexity weekly. If your content appears in citations, the crawlers are clearly reaching you. If it never appears despite quality content, investigate crawler access
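For the server log method, a short script is often enough. The sketch below assumes your host exports logs in the common "combined" format, where the user agent is the last double-quoted field on each line; the sample lines are made up for illustration and use simplified user-agent strings. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens and, per Google's and Apple's documentation, do not show up as separate user agents in request logs.

```python
import re
from collections import Counter

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "anthropic-ai", "PerplexityBot", "Perplexity-User", "Amazonbot",
             "Meta-ExternalAgent", "CCBot", "Bytespider"]

# In the combined log format the user agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler, matching case-insensitively."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[agent] += 1
                break
    return hits


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [12/Jan/2026:10:15:32 +0000] "GET /products/widget HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [13/Jan/2026:08:02:11 +0000] "GET /blog/guide HTTP/1.1" 200 9120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [13/Jan/2026:09:40:00 +0000] "GET /products/widget HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [13/Jan/2026:09:41:10 +0000] "GET / HTTP/1.1" 200 3021 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"',
]

if __name__ == "__main__":
    for agent, count in count_ai_hits(SAMPLE_LOG).most_common():
        print(agent, count)
```

To run it on a real export, pass an open file instead of the sample list, for example `count_ai_hits(open("access.log"))`. User-agent strings can be spoofed, so treat the counts as an indicator rather than proof of who visited.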
What healthy AI crawler activity looks like
- GPTBot: Crawls every 1-7 days depending on update frequency. Larger sites get more frequent visits
- PerplexityBot: Crawls every 1-3 days for established sites, less often for newer sites
- ClaudeBot: Crawl frequency varies; typically every 3-14 days
- ChatGPT-User and Perplexity-User: Appear sporadically based on user queries that trigger live fetches
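Once you have visit timestamps per crawler from your logs, the ranges above can be checked mechanically. This is a rough sketch: the upper bounds come from the list above, while `median_gap_days`, `is_healthy`, and the choice of the median gap as the yardstick are assumptions for illustration.

```python
from datetime import datetime
from statistics import median

# Expected gap ranges in days, taken from the list above.
EXPECTED_GAP_DAYS = {
    "GPTBot": (1, 7),
    "PerplexityBot": (1, 3),
    "ClaudeBot": (3, 14),
}


def median_gap_days(visits):
    """Median number of days between consecutive visits, or None."""
    ordered = sorted(visits)
    gaps = [(later - earlier).total_seconds() / 86400
            for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


def is_healthy(agent, visits):
    """True when the median gap is no longer than the expected upper bound."""
    gap = median_gap_days(visits)
    if gap is None:
        return False  # fewer than two visits: not enough data
    _, upper = EXPECTED_GAP_DAYS[agent]
    return gap <= upper


if __name__ == "__main__":
    visits = [datetime(2026, 1, 1), datetime(2026, 1, 4), datetime(2026, 1, 9)]
    print(median_gap_days(visits), is_healthy("GPTBot", visits))
```

A crawler that visits more often than its range is not a problem, which is why only the upper bound is enforced.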
What is the difference between indexing crawlers and real-time fetchers?
Indexing crawlers (like OAI-SearchBot and PerplexityBot) periodically fetch content to build searchable indexes. Real-time fetchers (like ChatGPT-User and Perplexity-User) activate on demand when a user asks a question and the AI system retrieves a specific page for that conversation. Both must be allowed for full citation eligibility; blocking either creates a gap in your AI visibility.
Why both matter for citations
The indexing crawler determines whether your content exists in the AI's source pool. The real-time fetcher determines whether your content can be retrieved when a user asks a relevant question. A brand with only indexing access gets cited from cached snapshots; a brand with both gets cited from fresh, up-to-the-minute content.
The user-agent pairs that matter most
| Operator | Indexing Crawler | Real-Time Fetcher | Combined Effect |
|---|---|---|---|
| OpenAI | GPTBot (training), OAI-SearchBot (index) | ChatGPT-User | Training memory + live fetch |
| Perplexity | PerplexityBot | Perplexity-User | Index access + live retrieval |
| Anthropic | ClaudeBot | (no separate user fetcher) | Training and retrieval combined |
| Google | Google-Extended | (uses standard Googlebot for live) | Training opt-in plus standard search |
| Amazon | Amazonbot | Amazonbot (combined) | Indexing for Alexa and Rufus |
The most common configuration gap we see in audits is sites that allow training crawlers but accidentally block real-time fetchers. This usually happens when a robots.txt was written before real-time fetchers existed and never updated.
What are the most common AI crawler configuration mistakes?
The five most common AI crawler configuration mistakes we see in ecommerce audits are: SEO plugin default-blocks, missing real-time user-agent rules, blocking by accident through wildcard rules, blocking AI bots at the Cloudflare or CDN level, and never updating the configuration after launch. Each is fixable in under an hour.
Mistake 1: SEO plugin default-blocks
Several SEO plugins added “block AI bots” features with the toggle ON by default. The most common offenders are some versions of Yoast, RankMath, and All-in-One SEO. Check your plugin settings and ensure AI bot blocking is OFF unless you have a specific reason for it.
Mistake 2: Missing real-time user-agent rules
Many older robots.txt files include GPTBot, ClaudeBot, and PerplexityBot but not the real-time fetchers ChatGPT-User and Perplexity-User. These newer user-agents must be explicitly allowed. Wildcards or default-allow blocks may not be sufficient depending on how your file is structured.
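A quick way to spot this gap is to list which known user agents have no explicit group in your file. The sketch below reads the `User-agent:` lines directly; `KNOWN_AGENTS` is a shortened list for illustration and `OLDER_FILE` is a hypothetical example of an older configuration.

```python
KNOWN_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
                "ChatGPT-User", "Perplexity-User"]


def explicitly_named_agents(robots_txt: str) -> set:
    """Collect every user agent named in a User-agent line (lowercased)."""
    named = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return named


def missing_agents(robots_txt: str, known=KNOWN_AGENTS) -> list:
    """Known agents that the file never names explicitly."""
    named = explicitly_named_agents(robots_txt)
    return [agent for agent in known if agent.lower() not in named]


# Hypothetical older file that predates the real-time fetchers.
OLDER_FILE = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    print(missing_agents(OLDER_FILE))
```

An agent appearing in the result is not necessarily blocked; it simply inherits whatever the `*` group says, which in this sample file is a full block.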
Mistake 3: Blocking by accident through wildcard rules
A blanket User-agent: * with broad Disallow rules can accidentally apply to AI crawlers when more specific User-agent blocks are not present. AI bots typically fall back to the * rules if no specific rules apply, which can produce surprising results.
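The fallback behavior is easy to reproduce. In this minimal sketch (the file and the `/products/` path are hypothetical), GPTBot has its own group while ClaudeBot does not, so ClaudeBot inherits the broad `*` rule. The check uses Python's standard-library parser; individual crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /products/
"""


def can_fetch(agent: str, path: str) -> bool:
    """Check one user agent against one path using the file above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://yoursite.com" + path)


if __name__ == "__main__":
    print("GPTBot:", can_fetch("GPTBot", "/products/widget"))
    print("ClaudeBot:", can_fetch("ClaudeBot", "/products/widget"))
```

The reverse also holds: a crawler that has its own group follows only that group, so any Disallow rules you rely on in the `*` group must be repeated there.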
Mistake 4: Blocking AI bots at the CDN level
Cloudflare and other CDNs have introduced features to block AI bots at the network level. These overrides take precedence over robots.txt. If you have a Cloudflare WAF rule or Bot Fight Mode setting that blocks AI bots, robots.txt access is irrelevant — the bot never reaches your origin server.
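One rough way to test for an edge-level block is to request the same URL with a browser-like user agent and with a crawler token, then compare status codes. This is only an approximation: the bare tokens used as user-agent strings here are simplifications, and a CDN that verifies crawlers by IP address may treat a request from your own machine differently from a real crawler. `blocked_at_edge` and its `fetch` parameter are hypothetical names.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET sent with the given user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def blocked_at_edge(url, bot_agents, fetch=fetch_status,
                    baseline_agent="Mozilla/5.0"):
    """List (agent, status) pairs that fail while the baseline succeeds."""
    baseline = fetch(url, baseline_agent)
    blocked = []
    for agent in bot_agents:
        status = fetch(url, agent)
        if baseline < 400 <= status:
            blocked.append((agent, status))
    return blocked


if __name__ == "__main__":
    # Stand-in for a CDN that rejects GPTBot; replace with the default
    # fetch_status to test a live site.
    def fake_fetch(url, user_agent):
        return 403 if "GPTBot" in user_agent else 200

    print(blocked_at_edge("https://yoursite.com/", ["GPTBot", "ClaudeBot"],
                          fetch=fake_fetch))
```

Passing the fetch function as a parameter keeps the comparison logic testable without network access. A non-empty result is a prompt to look at your CDN dashboard, not a definitive diagnosis.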
Mistake 5: Never updating after launch
A robots.txt file written in 2022 or 2023 likely does not include newer AI crawlers such as OAI-SearchBot, Applebot-Extended, or Meta-ExternalAgent. Quarterly updates are the minimum cadence to stay current.
If your robots.txt looks correct but AI crawlers still are not reaching you, check Cloudflare or your CDN dashboard. Many sites have invisible AI-blocking rules at the WAF level that override everything else. Cloudflare specifically has a one-click “Block AI Bots” option in some dashboards that is easy to enable by mistake.
How should ecommerce brands maintain their AI crawler configuration over time?
Maintain your AI crawler configuration through quarterly audits, monthly server log monitoring, post-plugin-update verification, and a documented list of known user-agents that your team can update as new bots appear. Maintenance is mostly about catching SEO plugin updates, CDN setting changes, and new AI bot introductions before they create blind spots.
The maintenance schedule
- Monthly: Check server logs for AI crawler activity. Are GPTBot, ClaudeBot, PerplexityBot showing up? If not, investigate
- Quarterly: Full robots.txt audit. Read the file end to end. Verify all major AI crawlers are explicitly allowed. Update for any new bots
- After plugin updates: Verify your SEO plugin has not silently changed your AI bot settings during the update
- After CDN setting changes: If anyone on your team adjusts Cloudflare or CDN settings, verify AI crawler access is preserved
- When new AI products launch: Check whether the new product has a new user-agent that you need to add
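The "after plugin updates" and "after CDN setting changes" checks come down to noticing that the file changed when nobody meant to change it. A minimal sketch, assuming you store a fingerprint of the last reviewed robots.txt somewhere your team can read it; `fingerprint` and `has_changed` are hypothetical helper names.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """SHA-256 of the file, ignoring trailing whitespace differences."""
    normalized = "\n".join(line.rstrip()
                           for line in robots_txt.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(current_text: str, stored_fingerprint: str) -> bool:
    """True when the live file no longer matches the reviewed version."""
    return fingerprint(current_text) != stored_fingerprint


if __name__ == "__main__":
    reviewed = "User-agent: GPTBot\nAllow: /\n"
    stored = fingerprint(reviewed)
    after_update = reviewed + "\nUser-agent: GPTBot\nDisallow: /\n"
    print(has_changed(reviewed, stored), has_changed(after_update, stored))
```

A changed fingerprint only tells you to re-read the file; it does not say whether the change is harmful.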
Tools and resources for ongoing maintenance
- Dark Visitors (darkvisitors.com) maintains an updated list of AI crawler user-agents
- The robotstxt.org documentation covers protocol fundamentals (robotstxt.org)
- The robots.txt report in Google Search Console (which replaced the retired robots.txt Tester) for fetch status and parsing errors
- Individual operator documentation: OpenAI GPTBot docs, Anthropic ClaudeBot pages, Perplexity bot documentation
Most teams can maintain AI crawler configuration with about 30 minutes of work per quarter once the baseline is set up correctly. The first audit takes longer because you are inventorying and fixing accumulated issues.
The 6 Things to Remember About AI Crawlers
- 12 AI crawlers matter for ecommerce brands in 2026, operated by OpenAI, Anthropic, Perplexity, Google, Apple, Amazon, Meta, and Common Crawl
- Three crawler categories exist: training, indexing/retrieval, and real-time user-triggered fetchers — all three should be allowed for full citation eligibility
- The ready-to-paste robots.txt in the "How do you configure robots.txt to allow AI crawlers correctly?" section covers the recommended baseline configuration for most ecommerce sites
- Blocking AI crawlers usually costs more in citation loss than it gains in protection — default to allowing unless you have a specific reason
- The five most common mistakes: SEO plugin default-blocks, missing real-time user-agents, accidental wildcard blocks, CDN-level overrides, and never updating after launch
- Maintain configuration through quarterly audits, monthly server log checks, and post-plugin-update verification — about 30 minutes per quarter once baseline is set

