TECHNICAL REFERENCE · /robots.txt · UPDATED MAY 2026
REFERENCE · PUBLISHED MAY 11, 2026 · UPDATED MAY 21, 2026 · 14 MIN READ · 3,500 WORDS

The AI Crawler List 2026: Every User-Agent Ecommerce Brands Need to Know.

The complete reference of AI crawlers and bots that ecommerce brands need to know about in 2026 — user agents, operators, purposes, and clear robots.txt recommendations for each. Includes ready-to-paste robots.txt configurations.

12: Major AI crawlers that ecommerce brands should configure for
3: Crawler categories (training, indexing, real-time fetchers)
~90%: Share of AI citations covered by allowing the top 10 user-agents
15 min: Time to audit and update your existing robots.txt

Quick Answer

Ecommerce brands need to know roughly 12 AI crawlers in 2026, operated by OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User), Anthropic (ClaudeBot, anthropic-ai), Perplexity (PerplexityBot, Perplexity-User), Google (Google-Extended), Apple (Applebot-Extended), Amazon (Amazonbot), Meta (Meta-ExternalAgent), and Common Crawl (CCBot). For maximum AI citation eligibility, allow all of these in your robots.txt. The full ready-to-paste robots.txt configuration is in section 5 of this guide.

UPDATED FOR ALEXA FOR SHOPPING: Amazon retired the Rufus brand on May 13, 2026 and consolidated the technology into Alexa for Shopping. The optimization principles in this guide still apply to the new system.

If you have not audited your robots.txt in the last six months, you are probably blocking AI crawlers without realizing it — or worse, leaving valuable bots out.

AI crawlers are the silent infrastructure of generative search. Every time ChatGPT, Claude, Perplexity, or Gemini answers a shopping question with cited sources, those citations come from content their crawlers previously indexed. If your site is blocking those bots in robots.txt — whether intentionally or because an SEO plugin made the call for you — your products cannot be cited. We have audited dozens of ecommerce stores in the last six months and found bot configuration issues on roughly 60 percent of them. This guide is the complete reference list of which AI crawlers exist in 2026, what each one does, and how to configure your robots.txt to maximize citation eligibility while protecting the parts of your site that should stay private.

For the broader picture on AI search, see our AI Search Resource Hub and the platform-specific playbooks for Perplexity citations and ChatGPT Instant Checkout.

Definition: AI Crawler

An AI crawler is an automated bot operated by an AI company that fetches web content for the purpose of training large language models, building search retrieval indexes, or fulfilling real-time user requests. AI crawlers are distinct from traditional search engine crawlers in that they feed AI products rather than search engine result pages.

01

What is an AI crawler and why does it matter for ecommerce?

An AI crawler is an automated bot operated by an AI company that fetches web content to train AI models, build search retrieval indexes, or fulfill real-time user requests. For ecommerce brands, AI crawlers matter because they are the gatekeepers to AI citation visibility — if a crawler cannot access your content, that content cannot be referenced in AI-generated answers.

Why ecommerce brands should care more than other industries

  • Product recommendation queries dominate AI search. A meaningful portion of ChatGPT and Perplexity usage involves shopping research, where ecommerce content is the primary citation surface
  • AI-driven traffic is rapidly growing. Perplexity-referred ecommerce traffic alone grew roughly 7x between January 2025 and Q1 2026, and ChatGPT shopping referrals are scaling even faster
  • Citation share compounds. Brands cited frequently early in this curve build long-term authority that gets harder to displace later. The cost of being blocked from AI training corpora today compounds over time

The mechanical link is straightforward: AI crawler access in robots.txt determines whether your content enters the citation pool, which determines whether your brand can be recommended when shoppers ask AI engines for product suggestions.

02

Which AI crawlers should every ecommerce brand know about?

There are roughly 12 AI crawlers that matter for ecommerce brands in 2026, operated by 7 major AI companies plus Common Crawl. The table below lists those 12 along with two further crawlers (Bytespider and cohere-ai), giving user agents, operators, purposes, and our default recommendations.

The complete 2026 AI crawler reference table

User Agent | Operator | Purpose | AI Products | Recommendation
GPTBot | OpenAI | Training crawler | ChatGPT, GPT models | Allow
OAI-SearchBot | OpenAI | Search index | ChatGPT Search, SearchGPT | Allow
ChatGPT-User | OpenAI | Real-time user-triggered fetch | ChatGPT live retrieval | Allow
ClaudeBot | Anthropic | Training and retrieval | Claude | Allow
anthropic-ai | Anthropic | Legacy crawler | Claude | Allow
PerplexityBot | Perplexity | Indexing crawler | Perplexity | Allow
Perplexity-User | Perplexity | Real-time user fetch | Perplexity live retrieval | Allow
Google-Extended | Google | Training opt-in for Gemini | Gemini (formerly Bard) | Allow
Applebot-Extended | Apple | Training crawler | Apple Intelligence | Allow
Amazonbot | Amazon | Indexing crawler | Alexa, Alexa for Shopping (formerly Rufus) | Allow
Meta-ExternalAgent | Meta | Training crawler | Meta AI | Allow
CCBot | Common Crawl | Open web corpus | Bootstraps many AI datasets | Allow
Bytespider | ByteDance | Training crawler | TikTok AI features, Doubao | Conditional
cohere-ai | Cohere | Training crawler | Command models | Allow

Allowing the top 10 in this list (GPTBot through Meta-ExternalAgent, not counting the legacy anthropic-ai) covers roughly 90 percent of AI citation surface area for US-based ecommerce brands. CCBot is also worth allowing because of its role bootstrapping many smaller AI training pipelines.

03

How do AI crawlers differ from traditional search engine crawlers?

AI crawlers differ from traditional search crawlers in three primary ways: they feed AI products rather than search engine result pages, they include both training and real-time retrieval bots, and they evolve rapidly with frequent new user-agent introductions and renames.

Three structural differences worth understanding

  • Output destination. Googlebot feeds Google Search results. GPTBot feeds ChatGPT's knowledge base. These are different products with different audiences and different optimization patterns
  • Crawler typology. Search crawlers are primarily indexing bots. AI ecosystems have three types: training crawlers (long-term knowledge accumulation), retrieval/index crawlers (search-style indexing), and real-time user-triggered fetchers (live page retrieval during a chat session)
  • Pace of change. Googlebot user-agents have been stable for over a decade. AI crawler user-agents change frequently — new bots launch, old ones get renamed or split, and major operators announce new variants on roughly quarterly cadences

What this means practically for ecommerce SEO

  • Traditional SEO best practices for crawler accessibility still apply
  • The bot list to optimize for is larger and changes more often
  • You may want different rules for different bot types (e.g., allow training crawlers on blog content but restrict on price-sensitive product pages)
  • Quarterly review of robots.txt configuration is now table stakes

04

What is the difference between training crawlers and search/retrieval bots?

Training crawlers fetch content periodically to build the long-term knowledge base used by AI models. Search and retrieval bots index content for live retrieval during AI chat sessions. Real-time user-triggered fetchers retrieve a specific page on-demand when a user asks a question. Each serves a different function and warrants different consideration in your robots.txt strategy.

Training crawlers

Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, CCBot, cohere-ai. These crawl periodically and feed long-term AI training datasets. Content they index becomes part of the model's baseline knowledge and can be referenced for years.

Search and retrieval bots

Examples: OAI-SearchBot, PerplexityBot, Amazonbot. These build searchable indexes that AI systems query at conversation time to retrieve relevant sources. Content they index is freshly retrievable and contributes to citations even before training updates.

Real-time user-triggered fetchers

Examples: ChatGPT-User, Perplexity-User. These activate when a specific user asks a question and the AI system retrieves a particular page to compose an up-to-the-minute answer. Blocking these bots prevents your content from being retrieved live even if other crawlers have indexed it.

Why All Three Matter

For maximum citation eligibility, all three categories should be allowed. Training-only access gives you baseline citation eligibility but no live retrieval. Search-index access without training access means you can be retrieved but may not match natural language patterns as well. Allowing all three gives you full citation surface.
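
One way to make this check routine is to keep the three categories as a small lookup table and flag any category in which no bot is allowed. The following is a minimal Python sketch, assuming the category assignments from the example lists in this section; the names CRAWLER_CATEGORIES and missing_categories are illustrative and not part of any tool.

```python
# Category assignments follow the example lists in this section.
CRAWLER_CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
                 "Meta-ExternalAgent", "CCBot", "cohere-ai"],
    "search_index": ["OAI-SearchBot", "PerplexityBot", "Amazonbot"],
    "realtime_fetch": ["ChatGPT-User", "Perplexity-User"],
}


def missing_categories(allowed_agents):
    """Return the categories in which none of the listed bots is allowed."""
    allowed = {agent.lower() for agent in allowed_agents}
    return [
        category
        for category, bots in CRAWLER_CATEGORIES.items()
        if not any(bot.lower() in allowed for bot in bots)
    ]


# A file that names only these three bots has no real-time fetcher covered.
print(missing_categories(["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

With the three agents shown, the only category reported is realtime_fetch, which matches the gap described above: training and index access without live retrieval.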

05

How do you configure robots.txt to allow AI crawlers correctly?

The recommended baseline robots.txt configuration for ecommerce brands explicitly allows all major AI crawlers, sets sensible Disallow rules for non-public areas (admin, checkout, cart, account), and includes your sitemap. Below is a ready-to-paste configuration you can use as a starting point.

Recommended robots.txt for most ecommerce sites

# Recommended robots.txt for ecommerce
# Allow all major AI crawlers, restrict private areas

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: CCBot
User-agent: cohere-ai
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/
Disallow: /*?orderby=
Disallow: /*?filter=

# Default rules for all other crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/

Sitemap: https://yoursite.com/sitemap.xml

What this configuration does

  • Explicitly allows 13 major AI crawlers on all public content
  • Blocks private areas (admin, checkout, cart, account) from all bots including AI crawlers, and additionally blocks API endpoints for the AI crawler group
  • Blocks filtered URLs that create duplicate content (sort by, filter by) for AI bots
  • Sets default rules for all other crawlers
  • Declares your sitemap location for discoverability

Replace yoursite.com with your actual domain. Adjust the Disallow patterns to match your platform — the example uses common WooCommerce/Shopify patterns but your URL structure may differ.
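
If you maintain the file for several stores, it can help to generate it from lists rather than edit it by hand. The sketch below is one possible approach in Python, not a required workflow; the function names group and build_robots_txt and the constants are illustrative, and the agent and path lists simply mirror the configuration above.

```python
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "Applebot-Extended",
    "Amazonbot", "Meta-ExternalAgent", "CCBot", "cohere-ai",
]
# Paths blocked for every crawler; adjust to your platform's URL structure.
PRIVATE_PATHS = ["/admin/", "/checkout/", "/cart/", "/account/"]
# Extra paths blocked only for the AI crawler group in the configuration above.
AI_ONLY_PATHS = ["/api/", "/*?orderby=", "/*?filter="]


def group(agents, disallow):
    """Render one robots.txt group: agent lines, Allow: /, then Disallow lines."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Allow: /")
    lines.extend(f"Disallow: {path}" for path in disallow)
    return "\n".join(lines)


def build_robots_txt(domain):
    """Assemble the AI crawler group, the default group, and the sitemap line."""
    parts = [
        group(AI_CRAWLERS, PRIVATE_PATHS + AI_ONLY_PATHS),
        group(["*"], PRIVATE_PATHS),
        f"Sitemap: https://{domain}/sitemap.xml",
    ]
    return "\n\n".join(parts) + "\n"


print(build_robots_txt("yoursite.com"))
```

Keeping the private paths in one list means the AI group and the default group cannot drift apart when a new private area is added.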

06

Should you block any AI crawlers? When and why?

Most ecommerce brands should not block any major AI crawler because the citation loss outweighs the protection benefit. There are four specific scenarios where blocking makes sense: published copyrighted content you do not want in training datasets, controversial crawlers with poor governance, subdirectories with private content on otherwise public domains, and test or staging subdomains.

When blocking makes sense

  • Published copyrighted media you sell as content products. Brands that sell digital content (courses, reports, premium articles) may want to block training crawlers from those specific paths while allowing crawl on marketing pages
  • Bytespider for non-TikTok brands. Bytespider has been criticized for aggressive crawling and unclear data governance. Brands not selling on TikTok Shop can reasonably block it
  • Private member content. Subdirectories like /members/, /premium/, /paid/ should be blocked from all crawlers regardless of AI status
  • Test or staging subdomains. Anything not intended for production traffic should be blocked across the board
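
The scenarios above can be combined in one file and checked before deployment. The sketch below uses Python's standard-library urllib.robotparser against a hypothetical robots.txt in which /courses/ holds paid content and /members/ is private; the paths and the domain are assumptions for illustration. The sample deliberately avoids mixing Allow and Disallow lines, because this parser applies rules in file order rather than by longest match, which can differ from how crawlers evaluate the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: /courses/ is paid content, /members/ is private.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /courses/
Disallow: /members/

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "/courses/seo-101"),         # training crawler on paid content
    ("GPTBot", "/blog/launch-post"),        # training crawler on marketing page
    ("OAI-SearchBot", "/courses/seo-101"),  # search bot falls back to the * group
    ("OAI-SearchBot", "/members/dashboard"),
    ("Bytespider", "/blog/launch-post"),
]
for agent, path in checks:
    allowed = parser.can_fetch(agent, "https://yoursite.com" + path)
    print(f"{agent:14} {path:22} {'allowed' if allowed else 'blocked'}")
```

Note that the training group repeats the /members/ rule. A bot with its own group does not inherit rules from the User-agent: * group, so anything that must be blocked for everyone has to appear in every group.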

When blocking is usually a mistake

  • Blocking GPTBot because of training concerns. The citation visibility loss in ChatGPT is severe and usually outweighs the training-data concern
  • Blocking PerplexityBot or Perplexity-User. These directly determine your Perplexity citation eligibility — blocking them means zero Perplexity citations
  • Blocking ClaudeBot. Removes you from Claude's citation pool entirely
  • Blocking all AI crawlers by default. Some SEO plugins do this automatically. Check yours and override if needed

The Default-Block Trap

Several popular WordPress and Shopify SEO plugins added “block AI bots” toggles in 2024 and 2025 with the toggle enabled by default. We have seen ecommerce stores that updated their plugin and unknowingly cut themselves off from ChatGPT, Claude, and Perplexity overnight. Audit your robots.txt manually at least quarterly to catch these.

07

How do you verify that AI crawlers are actually reaching your site?

Verify AI crawler activity by examining your server logs for known AI user-agent strings, using Shopify or hosting platform analytics tools that surface crawler data, or running periodic prompt audits in AI engines to see which content from your site is being cited. Server logs are the definitive source of truth.

Three methods to check AI crawler activity

  • Server log analysis. Most hosting providers offer access to raw server logs. Filter for user-agent strings containing GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, etc. to see crawl frequency
  • Platform analytics tools. Shopify has third-party apps like Log File Analyser that visualize crawler activity. WooCommerce sites can use plugins like Slim Stat Analytics or Crawler Stats
  • Prompt audit method. Run 20-30 category-relevant prompts in ChatGPT and Perplexity weekly. If your content appears in citations, the crawlers are clearly reaching you. If it never appears despite quality content, investigate crawler access
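
For the server log method, a short script is usually enough. The following Python sketch counts requests per AI crawler, assuming the common "combined" access log format in which the user-agent is the last double-quoted field; the sample log lines and their user-agent strings are made up for illustration and are not the operators' exact strings. Google-Extended and Applebot-Extended are left out on the understanding that they act as robots.txt control tokens and do not appear as request user-agents.

```python
import re
from collections import Counter

AI_USER_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "Amazonbot", "Meta-ExternalAgent",
    "CCBot", "Bytespider", "cohere-ai",
]

# In the combined log format the user-agent is the last double-quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler token found in the user-agent field."""
    hits = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_USER_AGENTS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Made-up sample lines; real user-agent strings are longer.
SAMPLE_LOG = [
    '203.0.113.5 - - [12/May/2026:10:01:22 +0000] "GET /products/a HTTP/1.1" 200 512 "-" "ExampleUA GPTBot/1.0"',
    '203.0.113.5 - - [12/May/2026:10:01:25 +0000] "GET /products/b HTTP/1.1" 200 734 "-" "ExampleUA GPTBot/1.0"',
    '198.51.100.7 - - [12/May/2026:11:15:02 +0000] "GET /blog/post HTTP/1.1" 200 901 "-" "ExampleUA Perplexity-User/1.0"',
    '192.0.2.44 - - [12/May/2026:11:20:41 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (ordinary browser)"',
]

for bot, count in count_ai_hits(SAMPLE_LOG).most_common():
    print(bot, count)
```

To run it on a real log, pass an open file object instead of SAMPLE_LOG. User-agent strings can be spoofed, so treat the counts as an indication rather than proof of who made the request.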

What healthy AI crawler activity looks like

  • GPTBot: Crawls every 1-7 days depending on update frequency. Larger sites get more frequent visits
  • PerplexityBot: Crawls every 1-3 days for established sites, less often for newer sites
  • ClaudeBot: Crawl frequency varies; typically every 3-14 days
  • ChatGPT-User and Perplexity-User: Appear sporadically based on user queries that trigger live fetches

08

What is the difference between indexing crawlers and real-time fetchers?

Indexing crawlers (like OAI-SearchBot and PerplexityBot) periodically fetch content to build searchable indexes. Real-time fetchers (like ChatGPT-User and Perplexity-User) activate on-demand when a user asks a question and the AI system retrieves a specific page for that conversation. Both must be allowed for full citation eligibility; blocking either creates a gap in your AI visibility.

Why both matter for citations

The indexing crawler determines whether your content exists in the AI's source pool. The real-time fetcher determines whether your content can be retrieved when a user asks a relevant question. A brand with only indexing access gets cited from cached snapshots; a brand with both gets cited from fresh, up-to-the-minute content.

The user-agent pairs that matter most

Operator | Indexing Crawler | Real-Time Fetcher | Combined Effect
OpenAI | GPTBot (training), OAI-SearchBot (index) | ChatGPT-User | Training memory + live fetch
Perplexity | PerplexityBot | Perplexity-User | Index access + live retrieval
Anthropic | ClaudeBot | (no separate user fetcher) | Training and retrieval combined
Google | Google-Extended | (uses standard Googlebot for live) | Training opt-in plus standard search
Amazon | Amazonbot | Amazonbot (combined) | Indexing for Alexa and Alexa for Shopping (formerly Rufus)

The most common configuration gap we see in audits is sites that allow training crawlers but accidentally block real-time fetchers. This usually happens when a robots.txt was written before real-time fetchers existed and never updated.

09

What are the most common AI crawler configuration mistakes?

The five most common AI crawler configuration mistakes we see in ecommerce audits are: SEO plugin default-blocks, missing real-time user-agent rules, accidental blocking through wildcard rules, AI bot blocking at the Cloudflare or CDN level, and never updating the configuration after launch. Each is fixable in under an hour.

Mistake 1: SEO plugin default-blocks

Several SEO plugins added “block AI bots” features with the toggle ON by default. The most common offenders are some versions of Yoast, RankMath, and All-in-One SEO. Check your plugin settings and ensure AI bot blocking is OFF unless you have a specific reason for it.

Mistake 2: Missing real-time user-agent rules

Many older robots.txt files include GPTBot, ClaudeBot, and PerplexityBot but not the real-time fetchers ChatGPT-User and Perplexity-User. A user-agent with no group of its own inherits the User-agent: * rules, so if that default group is restrictive, the real-time fetchers are blocked even though the training crawlers are allowed. Name these newer user-agents explicitly rather than relying on the wildcard.

Mistake 3: Blocking by accident through wildcard rules

A blanket User-agent: * with broad Disallow rules can accidentally apply to AI crawlers when more specific User-agent blocks are not present. AI bots typically fall back to the * rules if no specific rules apply, which can produce surprising results.
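
The fallback behavior is easy to reproduce. The sketch below feeds a hypothetical robots.txt to Python's standard-library urllib.robotparser: GPTBot has its own group, while ChatGPT-User and PerplexityBot do not and therefore receive the broad wildcard rules. The file contents and the URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: GPTBot is named, the real-time fetcher is not.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /admin/

User-agent: *
Disallow: /products/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://yoursite.com/products/weighted-blanket"
for agent in ["GPTBot", "ChatGPT-User", "PerplexityBot"]:
    # Agents without their own group are evaluated against the * group.
    print(agent, parser.can_fetch(agent, url))
```

In this sample only GPTBot may fetch the product page. The other two agents are blocked by a rule that was never written with them in mind.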

Mistake 4: Blocking AI bots at the CDN level

Cloudflare and other CDNs have introduced features to block AI bots at the network level. These overrides take precedence over robots.txt. If you have a Cloudflare WAF rule or Bot Fight Mode setting that blocks AI bots, robots.txt access is irrelevant — the bot never reaches your origin server.

Mistake 5: Never updating after launch

A robots.txt file written in 2022 or 2023 likely does not include newer AI crawlers such as OAI-SearchBot, Applebot-Extended, or Meta-ExternalAgent. Quarterly updates are the minimum cadence to stay current.

The CDN Override Risk

If your robots.txt looks correct but AI crawlers still are not reaching you, check Cloudflare or your CDN dashboard. Many sites have invisible AI-blocking rules at the WAF level that override everything else. Cloudflare specifically has a one-click “Block AI Bots” option in some dashboards that is easy to enable by mistake.

10

How should ecommerce brands maintain their AI crawler configuration over time?

Maintain your AI crawler configuration through quarterly audits, monthly server log monitoring, post-plugin-update verification, and a documented list of known user-agents that your team can update as new bots appear. Maintenance is mostly about catching SEO plugin updates, CDN setting changes, and new AI bot introductions before they create blind spots.

The maintenance schedule

  • Monthly: Check server logs for AI crawler activity. Are GPTBot, ClaudeBot, PerplexityBot showing up? If not, investigate
  • Quarterly: Full robots.txt audit. Read the file end to end. Verify all major AI crawlers are explicitly allowed. Update for any new bots
  • After plugin updates: Verify your SEO plugin has not silently changed your AI bot settings during the update
  • After CDN setting changes: If anyone on your team adjusts Cloudflare or CDN settings, verify AI crawler access is preserved
  • When new AI products launch: Check whether the new product has a new user-agent that you need to add
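
The quarterly audit step can be partly automated by comparing your documented user-agent list against the agents your robots.txt actually names. Below is a minimal Python sketch; KNOWN_AI_AGENTS mirrors the list used in this guide, and the function names declared_agents and undeclared are illustrative.

```python
KNOWN_AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "Applebot-Extended",
    "Amazonbot", "Meta-ExternalAgent", "CCBot", "cohere-ai",
]


def declared_agents(robots_txt):
    """Collect every User-agent value declared in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def undeclared(robots_txt, known=KNOWN_AI_AGENTS):
    """Return known agents that have no explicit User-agent line."""
    declared = declared_agents(robots_txt)
    return [agent for agent in known if agent.lower() not in declared]


# Hypothetical older file that predates the real-time fetchers.
OLD_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /admin/

User-agent: *
Disallow: /
"""

print(undeclared(OLD_ROBOTS_TXT))
```

An agent reported here is not necessarily blocked. It simply falls under the User-agent: * group, so review what that group allows before deciding whether to add an explicit entry.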

Tools and resources for ongoing maintenance

  • Dark Visitors (darkvisitors.com) maintains an updated list of AI crawler user-agents
  • The robotstxt.org documentation covers protocol fundamentals (robotstxt.org)
  • Google Search Console's robots.txt report for syntax validation (it replaced the standalone robots.txt Tester)
  • Individual operator documentation: OpenAI GPTBot docs, Anthropic ClaudeBot pages, Perplexity bot documentation

Most teams can maintain AI crawler configuration with about 30 minutes of work per quarter once the baseline is set up correctly. The first audit takes longer because you are inventorying and fixing accumulated issues.

Key Takeaways

The 6 Things to Remember About AI Crawlers

  • 12 AI crawlers matter for ecommerce brands in 2026, operated by OpenAI, Anthropic, Perplexity, Google, Apple, Amazon, Meta, and Common Crawl
  • Three crawler categories exist: training, indexing/retrieval, and real-time user-triggered fetchers — all three should be allowed for full citation eligibility
  • The ready-to-paste robots.txt in section 5 covers the recommended baseline configuration for most ecommerce sites
  • Blocking AI crawlers usually costs more in citation loss than it gains in protection — default to allowing unless you have a specific reason
  • The five most common mistakes: SEO plugin default-blocks, missing real-time user-agents, accidental wildcard blocks, CDN-level overrides, and never updating after launch
  • Maintain configuration through quarterly audits, monthly server log checks, and post-plugin-update verification — about 30 minutes per quarter once baseline is set

Common Questions

AI Crawler FAQ

How many AI crawlers should I worry about as an ecommerce brand?

There are roughly 12-15 AI crawlers that matter for most ecommerce brands in 2026. The most important are GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, and Meta-ExternalAgent. Allowing these ten bots covers the vast majority of AI citation surface area.

Should I allow GPTBot to crawl my site?

Yes for most ecommerce brands. GPTBot is OpenAI's training crawler, which feeds ChatGPT's knowledge base. Blocking GPTBot removes your content from ChatGPT's training corpus, which significantly reduces citation likelihood for queries where users ask ChatGPT for recommendations in your category. The downside of allowing GPTBot — your content being used in training — is generally outweighed by the citation benefit.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler that fetches content periodically to build the knowledge base used by ChatGPT. ChatGPT-User is the real-time user-agent activated when a ChatGPT user asks a question and the system retrieves a current page to compose an answer. Both should typically be allowed for full citation eligibility.

Will blocking AI crawlers protect my content from being scraped?

Partially. Legitimate AI crawlers from major operators (OpenAI, Anthropic, Google, Apple) respect robots.txt directives. However, blocking does not stop bad actors or unauthorized scrapers that ignore robots.txt entirely. The practical impact is that blocking removes your content from AI citation pools but does not stop content theft. For most ecommerce brands the citation loss outweighs the protection benefit.

How do I check which AI crawlers are visiting my site?

Most ecommerce platforms expose server logs that show user-agent strings for every request. On Shopify, third-party apps like Site Audit Pro or Log File Analyser provide this data. Self-hosted sites can analyze raw server logs. Look for user-agent patterns containing GPTBot, ClaudeBot, PerplexityBot, etc. to confirm AI crawlers are reaching your site.

Why are there separate Perplexity bots and how do they differ?

PerplexityBot is the indexing crawler that builds Perplexity's source pool by periodically fetching web content. Perplexity-User is the real-time fetcher that retrieves a specific page when a Perplexity user asks a question and the system needs current information. Both must be allowed in robots.txt for full Perplexity citation eligibility.

Should I block Bytespider since it is operated by TikTok's parent company?

Bytespider is a controversial case. It is operated by ByteDance (TikTok's parent) and has been criticized for aggressive crawling behavior and unclear data usage policies. Most ecommerce brands that sell on TikTok Shop should allow Bytespider for TikTok product discovery benefits. Brands not selling on TikTok can reasonably block it without significant citation loss.

Does Common Crawl matter for AI citations?

Yes, indirectly. Common Crawl (whose crawler is CCBot) is a large open web archive that many AI companies use to bootstrap training datasets. Allowing CCBot makes your content available to a wide range of AI training pipelines beyond the major operators. Blocking CCBot reduces your content surface area across the AI ecosystem.

How often should I update my robots.txt file?

Quarterly review is sufficient for most ecommerce brands. New AI crawlers appear periodically and major operators occasionally rename or split their user agents. A quarterly audit ensures your configuration stays current. Brands seeing significant AI citation traffic should also monitor monthly for any new bot patterns appearing in server logs.

Can I allow AI crawlers on some pages but block them on others?

Yes. robots.txt supports User-agent specific Disallow rules. You can allow GPTBot to crawl your blog and product pages while disallowing crawl of admin pages, checkout flows, account areas, and other non-public content. Use granular Disallow rules to protect private content while keeping public content available.

Does meta name=robots noindex work for blocking AI crawlers?

Partially. The standard meta robots noindex directive primarily affects search engine indexing. AI crawlers have varying support for meta robots directives — some honor them, others rely primarily on robots.txt. For comprehensive AI crawler control, use robots.txt as the primary mechanism with meta robots as a secondary layer for specific pages.

What happens if my robots.txt has a syntax error?

Most AI crawlers handle robots.txt syntax errors permissively: a line they cannot parse is skipped, and the affected paths are crawled as if the rule were absent. The risk is that strict Disallow rules you intended to apply are not honored. Use Google Search Console's robots.txt report (the successor to the robots.txt Tester) or a third-party robots.txt validator to verify that your file parses correctly. Most AI bots follow similar parsing rules, so a file that validates there is a reasonable proxy.
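
A basic pre-deployment check can catch the most common slips, such as a misspelled directive or a missing colon. The Python sketch below is a simple illustrative linter, not a full implementation of the robots.txt standard; the set of recognized directives is an assumption covering the ones used in this guide plus the widely seen crawl-delay.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return (line_number, message) pairs for lines a parser would likely skip."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        key, sep, _value = line.partition(":")
        key = key.strip().lower()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif key not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{key}'"))
        elif key == "user-agent":
            seen_user_agent = True
        elif key in {"allow", "disallow"} and not seen_user_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems


# Hypothetical file with two typos that would silently disable rules.
SAMPLE = """\
User-agent: GPTBot
Dissallow: /checkout/
Disallow /cart/
Disallow: /account/
"""

for number, message in lint_robots_txt(SAMPLE):
    print(f"line {number}: {message}")
```

In the sample, the checkout and cart rules are flagged, which are exactly the lines a permissive crawler would ignore while still honoring the correctly written /account/ rule.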

Ian Smith
Founder, Evolve Media Agency · Ecommerce & AI Search Specialist

Ian co-founded Evolve Media Agency in 2017 with his wife Megan. Over 9 years he has worked with $1M-$10M ecommerce brands on AI search visibility, technical SEO, and listing optimization. Based in Colorado.
