AI Crawler List 2026: Complete Bot Reference for Ecommerce

Q: How many AI crawlers should I worry about as an ecommerce brand?

There are roughly 12-15 AI crawlers that matter for most ecommerce brands in 2026. The most important are GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, and Meta-ExternalAgent. Allowing these ten bots covers the vast majority of AI citation surface area.

Q: Should I allow GPTBot to crawl my site?

Yes for most ecommerce brands. GPTBot is OpenAI's training crawler, which feeds ChatGPT's knowledge base. Blocking GPTBot removes your content from ChatGPT's training corpus, which significantly reduces citation likelihood for queries where users ask ChatGPT for recommendations in your category. The downside of allowing GPTBot - your content being used in training - is generally outweighed by the citation benefit.

Q: What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler that fetches content periodically to build the knowledge base used by ChatGPT. ChatGPT-User is the real-time user-agent activated when a ChatGPT user asks a question and the system retrieves a current page to compose an answer. Both should typically be allowed for full citation eligibility.

Q: Will blocking AI crawlers protect my content from being scraped?

Partially. Legitimate AI crawlers from major operators (OpenAI, Anthropic, Google, Apple) respect robots.txt directives. However, blocking does not stop bad actors or unauthorized scrapers that ignore robots.txt entirely. The practical impact is that blocking removes your content from AI citation pools but does not stop content theft. For most ecommerce brands the citation loss outweighs the protection benefit.

Q: How do I check which AI crawlers are visiting my site?

Most ecommerce platforms expose server logs that show user-agent strings for every request. On Shopify, third-party apps like Site Audit Pro or Log File Analyser provide this data. Self-hosted sites can analyze raw server logs. Look for user-agent patterns containing GPTBot, ClaudeBot, PerplexityBot, etc. to confirm AI crawlers are reaching your site.

Q: Why are there separate Perplexity bots and how do they differ?

PerplexityBot is the indexing crawler that builds Perplexity's source pool by periodically fetching web content. Perplexity-User is the real-time fetcher that retrieves a specific page when a Perplexity user asks a question and the system needs current information. Both must be allowed in robots.txt for full Perplexity citation eligibility.

Q: Should I block Bytespider since it is operated by TikTok's parent company?

Bytespider is a controversial case. It is operated by ByteDance (TikTok's parent) and has been criticized for aggressive crawling behavior and unclear data usage policies. Most ecommerce brands that sell on TikTok Shop should allow Bytespider for TikTok product discovery benefits. Brands not selling on TikTok can reasonably block it without significant citation loss.

Q: Does Common Crawl matter for AI citations?

Yes, indirectly. Common Crawl (whose crawler is CCBot) is a large open web archive that many AI companies use to bootstrap training datasets. Allowing CCBot makes your content available to a wide range of AI training pipelines beyond the major operators. Blocking CCBot reduces your content surface area across the AI ecosystem.

Q: How often should I update my robots.txt file?

Quarterly review is sufficient for most ecommerce brands. New AI crawlers appear periodically and major operators occasionally rename or split their user agents. A quarterly audit ensures your configuration stays current. Brands seeing significant AI citation traffic should also monitor monthly for any new bot patterns appearing in server logs.

Q: Can I allow AI crawlers on some pages but block them on others?

Yes. robots.txt supports User-agent specific Disallow rules. You can allow GPTBot to crawl your blog and product pages while disallowing crawl of admin pages, checkout flows, account areas, and other non-public content. Use granular Disallow rules to protect private content while keeping public content available.

UPDATED FOR ALEXA FOR SHOPPING

Amazon retired the Rufus brand on May 13, 2026 and consolidated the technology into Alexa for Shopping. The optimization principles in this guide still apply to the new system.

If you have not audited your robots.txt in the last six months, you are probably blocking AI crawlers without realizing it — or worse, leaving valuable bots out.

AI crawlers are the silent infrastructure of generative search. Every time ChatGPT, Claude, Perplexity, or Gemini answers a shopping question with cited sources, those citations come from content their crawlers previously indexed. If your site is blocking those bots in robots.txt — whether intentionally or because an SEO plugin made the call for you — your products cannot be cited. We have audited dozens of ecommerce stores in the last six months and found bot configuration issues on roughly 60 percent of them. This guide is the complete reference list of which AI crawlers exist in 2026, what each one does, and how to configure your robots.txt to maximize citation eligibility while protecting the parts of your site that should stay private.

For the broader picture on AI search, see our AI Search Resource Hub and the platform-specific playbooks for Perplexity citations and ChatGPT Instant Checkout.

Definition: AI Crawler

An AI crawler is an automated bot operated by an AI company that fetches web content for the purpose of training large language models, building search retrieval indexes, or fulfilling real-time user requests. AI crawlers are distinct from traditional search engine crawlers in that they feed AI products rather than search engine result pages.

What is an AI crawler and why does it matter for ecommerce?

An AI crawler is an automated bot operated by an AI company that fetches web content to train AI models, build search retrieval indexes, or fulfill real-time user requests. For ecommerce brands, AI crawlers matter because they are the gatekeepers to AI citation visibility — if a crawler cannot access your content, that content cannot be referenced in AI-generated answers.

Why ecommerce brands should care more than other industries

Product recommendation queries dominate AI search. A meaningful portion of ChatGPT and Perplexity usage involves shopping research, where ecommerce content is the primary citation surface
AI-driven traffic is rapidly growing. Perplexity-referred ecommerce traffic alone grew roughly 7x between January 2025 and Q1 2026, and ChatGPT shopping referrals are scaling even faster
Citation share compounds. Brands cited frequently early in this curve build long-term authority that gets harder to displace later. The cost of being blocked from AI training corpora today compounds over time

The mechanical link is straightforward: AI crawler access in robots.txt determines whether your content enters the citation pool, which determines whether your brand can be recommended when shoppers ask AI engines for product suggestions.

Which AI crawlers should every ecommerce brand know about?

Roughly 12 AI crawlers matter for ecommerce brands in 2026, operated by 7 major AI companies plus Common Crawl. The table below lists those 12 plus two lower-priority entries (Bytespider and cohere-ai), for 14 user agents in total, with operators, purposes, and our default recommendations.

The complete 2026 AI crawler reference table

User Agent	Operator	Purpose	AI Products	Recommendation
GPTBot	OpenAI	Training crawler	ChatGPT, GPT models	Allow
OAI-SearchBot	OpenAI	Search index	ChatGPT Search, SearchGPT	Allow
ChatGPT-User	OpenAI	Real-time user-triggered fetch	ChatGPT live retrieval	Allow
ClaudeBot	Anthropic	Training and retrieval	Claude	Allow
anthropic-ai	Anthropic	Legacy crawler (older)	Claude	Allow
PerplexityBot	Perplexity	Indexing crawler	Perplexity	Allow
Perplexity-User	Perplexity	Real-time user fetch	Perplexity live retrieval	Allow
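Google-Extended	Google	Training-use control token for Gemini (content is fetched by Google's existing crawlers)	Gemini (formerly Bard)	Allow
Applebot-Extended	Apple	Training-use control token (content is fetched by Applebot)	Apple Intelligence	Allow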
Amazonbot	Amazon	Indexing crawler	Alexa, Rufus	Allow
Meta-ExternalAgent	Meta	Training crawler	Meta AI	Allow
CCBot	Common Crawl	Open web corpus	Bootstraps many AI datasets	Allow
Bytespider	ByteDance	Training crawler	TikTok AI features, Doubao	Conditional
cohere-ai	Cohere	Training crawler	Command models	Allow

Allowing the first 11 rows of this list (GPTBot through Meta-ExternalAgent) covers roughly 90 percent of AI citation surface area for US-based ecommerce brands. CCBot is also worth allowing because of its role bootstrapping many smaller AI training pipelines.

How do AI crawlers differ from traditional search engine crawlers?

AI crawlers differ from traditional search crawlers in three primary ways: they feed AI products rather than search engine result pages, they include both training and real-time retrieval bots, and they evolve rapidly with frequent new user-agent introductions and renames.

Three structural differences worth understanding

Output destination. Googlebot feeds Google Search results. GPTBot feeds ChatGPT's knowledge base. These are different products with different audiences and different optimization patterns
Crawler typology. Search crawlers are primarily indexing bots. AI ecosystems have three types: training crawlers (long-term knowledge accumulation), retrieval/index crawlers (search-style indexing), and real-time user-triggered fetchers (live page retrieval during a chat session)
Pace of change. Googlebot user-agents have been stable for over a decade. AI crawler user-agents change frequently — new bots launch, old ones get renamed or split, and major operators announce new variants on roughly quarterly cadences

What this means practically for ecommerce SEO

Traditional SEO best practices for crawler accessibility still apply
The bot list to optimize for is larger and changes more often
You may want different rules for different bot types (e.g., allow training crawlers on blog content but restrict on price-sensitive product pages)
Quarterly review of robots.txt configuration is now table stakes

What is the difference between training crawlers and search/retrieval bots?

Training crawlers fetch content periodically to build the long-term knowledge base used by AI models. Search and retrieval bots index content for live retrieval during AI chat sessions. Real-time user-triggered fetchers retrieve a specific page on-demand when a user asks a question. Each serves a different function and warrants different consideration in your robots.txt strategy.

Training crawlers

Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, CCBot, cohere-ai. These crawl periodically and feed long-term AI training datasets. Content they index becomes part of the model's baseline knowledge and can be referenced for years.

Search and retrieval bots

Examples: OAI-SearchBot, PerplexityBot, Amazonbot. These build searchable indexes that AI systems query at conversation time to retrieve relevant sources. Content they index is freshly retrievable and contributes to citations even before training updates.

Real-time user-triggered fetchers

Examples: ChatGPT-User, Perplexity-User. These activate when a specific user asks a question and the AI system retrieves a particular page to compose an up-to-the-minute answer. Blocking these bots prevents your content from being retrieved live even if other crawlers have indexed it.

Why All Three Matter

For maximum citation eligibility, all three categories should be allowed. Training-only access gives you baseline citation eligibility but no live retrieval. Search-index access without training access means you can be retrieved but may not match natural language patterns as well. Allowing all three gives you full citation surface.

How do you configure robots.txt to allow AI crawlers correctly?

The recommended baseline robots.txt configuration for ecommerce brands explicitly allows all major AI crawlers, sets sensible Disallow rules for non-public areas (admin, checkout, cart, account), and includes your sitemap. Below is a ready-to-paste configuration you can use as a starting point.

Recommended robots.txt for most ecommerce sites

# Recommended robots.txt for ecommerce
# Allow all major AI crawlers, restrict private areas

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: CCBot
User-agent: cohere-ai
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/
Disallow: /*?orderby=
Disallow: /*?filter=

# Default rules for all other crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/

Sitemap: https://yoursite.com/sitemap.xml

What this configuration does

Explicitly allows 13 major AI crawlers on all public content
Blocks private areas (admin, checkout, cart, account, API endpoints) from all bots including AI crawlers
Blocks filtered URLs that create duplicate content (sort by, filter by) for AI bots
Sets default rules for all other crawlers
Declares your sitemap location for discoverability

Replace yoursite.com with your actual domain. Adjust the Disallow patterns to match your platform — the example uses common WooCommerce/Shopify patterns but your URL structure may differ.
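
Before deploying a file like this, it can help to check what a parser actually concludes for each bot and path. The sketch below is one possible way to do that with Python's standard-library urllib.robotparser. The sample file, the bot list, the paths, and the audit() helper are all illustrative assumptions, not part of any official tool. Two caveats: the stdlib parser applies the first matching rule in file order rather than the longest-match rule most major crawlers use, so the sample lists Disallow lines before Allow, and it does not interpret * wildcards inside paths, so patterns like /*?orderby= are left out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened sample. Disallow lines come first because
# urllib.robotparser returns the first rule that matches, in file order.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /admin/
Disallow: /checkout/
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

# Illustrative inputs; replace with your own bots and URL paths.
AI_BOTS = ["GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot"]
PATHS = ["/products/widget", "/admin/settings", "/checkout/step-1"]


def audit(robots_text, bots, paths, base="https://yoursite.com"):
    """Return {bot: {path: allowed}} according to the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        bot: {path: parser.can_fetch(bot, base + path) for path in paths}
        for bot in bots
    }


if __name__ == "__main__":
    for bot, result in audit(SAMPLE_ROBOTS, AI_BOTS, PATHS).items():
        print(bot, result)
```

In this sample ClaudeBot has no group of its own, so its results come from the User-agent: * rules. Treat the output as a sanity check rather than a guarantee, since each crawler implements its own matching.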

Should you block any AI crawlers? When and why?

Most ecommerce brands should not block any major AI crawler because the citation loss outweighs the protection benefit. There are three specific scenarios where blocking makes sense: published copyrighted content you do not want in training datasets, controversial crawlers with poor governance, and specific subdirectories with private content even on otherwise public domains.

When blocking makes sense

Published copyrighted media you sell as content products. Brands that sell digital content (courses, reports, premium articles) may want to block training crawlers from those specific paths while allowing crawl on marketing pages
Bytespider for non-TikTok brands. Bytespider has been criticized for aggressive crawling and unclear data governance. Brands not selling on TikTok Shop can reasonably block it
Private member content. Subdirectories like /members/, /premium/, /paid/ should be blocked from all crawlers regardless of AI status
Test or staging subdomains. Anything not intended for production traffic should be blocked across the board

When blocking is usually a mistake

Blocking GPTBot because of training concerns. The citation visibility loss in ChatGPT is severe and usually outweighs the training-data concern
Blocking PerplexityBot or Perplexity-User. These directly determine your Perplexity citation eligibility — blocking them means zero Perplexity citations
Blocking ClaudeBot. Removes you from Claude's citation pool entirely
Blocking all AI crawlers by default. Some SEO plugins do this automatically — check yours and override if needed

The Default-Block Trap

Several popular WordPress and Shopify SEO plugins added “block AI bots” toggles in 2024 and 2025 with the toggle enabled by default. We have seen ecommerce stores that updated their plugin and unknowingly cut themselves off from ChatGPT, Claude, and Perplexity overnight. Audit your robots.txt manually at least quarterly to catch these.

How do you verify that AI crawlers are actually reaching your site?

Verify AI crawler activity by examining your server logs for known AI user-agent strings, using Shopify or hosting platform analytics tools that surface crawler data, or running periodic prompt audits in AI engines to see which content from your site is being cited. Server logs are the definitive source of truth.

Three methods to check AI crawler activity

Server log analysis. Most hosting providers offer access to raw server logs. Filter for user-agent strings containing GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, etc. to see crawl frequency
Platform analytics tools. Shopify has third-party apps like Log File Analyser that visualize crawler activity. WooCommerce sites can use plugins like Slim Stat Analytics or Crawler Stats
Prompt audit method. Run 20-30 category-relevant prompts in ChatGPT and Perplexity weekly. If your content appears in citations, the crawlers are clearly reaching you. If it never appears despite quality content, investigate crawler access
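
The server log method above can be sketched in a few lines of Python. This is a minimal example that assumes logs in the common "combined" format, where the user agent is the last quoted field on each line. The token list, the sample lines, and the count_ai_hits() helper are illustrative assumptions. Google-Extended and Applebot-Extended are left out of the token list because they are robots.txt control tokens and do not appear as user agents in request logs.

```python
import re
from collections import Counter

# Tokens to look for inside the user-agent field (case-insensitive).
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "Amazonbot", "Meta-ExternalAgent",
    "CCBot", "Bytespider", "cohere-ai",
]

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per AI crawler token found in the user-agent field."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /products/widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [02/Mar/2026:11:30:00 +0000] "GET /blog/guide HTTP/1.1" 200 8040 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [02/Mar/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 3000 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.9 - - [02/Mar/2026:12:05:00 +0000] "GET / HTTP/1.1" 200 3000 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_hits(SAMPLE_LINES).most_common():
        print(token, hits)
```

User-agent strings can be spoofed, so a match here shows that something identifying itself as the crawler reached the site. Confirming it against the operator's published IP ranges is a separate step.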

What healthy AI crawler activity looks like

GPTBot: Crawls every 1-7 days depending on update frequency. Larger sites get more frequent visits
PerplexityBot: Crawls every 1-3 days for established sites, less often for newer sites
ClaudeBot: Crawl frequency varies; typically every 3-14 days
ChatGPT-User and Perplexity-User: Appear sporadically based on user queries that trigger live fetches
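
To compare your own logs against these ranges, one option is to compute the typical gap between the days on which a crawler visits. The sketch below assumes combined-format log lines with a bracketed timestamp such as [01/Mar/2026:10:00:00 +0000]. The median_gap_days() helper and the sample data are hypothetical.

```python
import re
from datetime import datetime
from statistics import median

# Capture the bracketed timestamp and the last quoted field (user agent).
LINE_PATTERN = re.compile(r'\[([^\]]+)\].*"([^"]*)"\s*$')


def median_gap_days(lines, token):
    """Median number of days between distinct visit days for one crawler.

    Returns None when the crawler was seen on fewer than two days.
    """
    visit_days = set()
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match and token.lower() in match.group(2).lower():
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            visit_days.add(stamp.date())
    ordered = sorted(visit_days)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


# Made-up sample: GPTBot seen on 1, 3 (twice), and 8 March.
SAMPLE_VISITS = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 100 "-" "GPTBot/1.2"',
    '203.0.113.5 - - [03/Mar/2026:09:00:00 +0000] "GET / HTTP/1.1" 200 100 "-" "GPTBot/1.2"',
    '203.0.113.5 - - [03/Mar/2026:18:00:00 +0000] "GET /a HTTP/1.1" 200 100 "-" "GPTBot/1.2"',
    '203.0.113.5 - - [08/Mar/2026:07:00:00 +0000] "GET / HTTP/1.1" 200 100 "-" "GPTBot/1.2"',
]

if __name__ == "__main__":
    print(median_gap_days(SAMPLE_VISITS, "GPTBot"))  # gaps are 2 and 5 days
```

A median gap far outside the ranges listed above is a prompt to investigate, not proof of a problem, since crawl frequency varies by site.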

What is the difference between indexing crawlers and real-time fetchers?

Indexing crawlers (like OAI-SearchBot and PerplexityBot) periodically fetch content to build searchable indexes. Real-time fetchers (like ChatGPT-User and Perplexity-User) activate on-demand when a user asks a question and the AI system retrieves a specific page for that conversation. Both must be allowed for full citation eligibility; blocking either creates a gap in your AI visibility.

Why both matter for citations

The indexing crawler determines whether your content exists in the AI's source pool. The real-time fetcher determines whether your content can be retrieved when a user asks a relevant question. A brand with only indexing access gets cited from cached snapshots; a brand with both gets cited from fresh, up-to-the-minute content.

The user-agent pairs that matter most

Operator	Indexing Crawler	Real-Time Fetcher	Combined Effect
OpenAI	GPTBot (training), OAI-SearchBot (index)	ChatGPT-User	Training memory + live fetch
Perplexity	PerplexityBot	Perplexity-User	Index access + live retrieval
Anthropic	ClaudeBot	(no separate user fetcher)	Training and retrieval combined
Google	Google-Extended	(uses standard Googlebot for live)	Training opt-in plus standard search
Amazon	Amazonbot	Amazonbot (combined)	Indexing for Alexa and Rufus

The most common configuration gap we see in audits is sites that allow training crawlers but accidentally block real-time fetchers. This usually happens when a robots.txt was written before real-time fetchers existed and never updated.

What are the most common AI crawler configuration mistakes?

The five most common AI crawler configuration mistakes we see in ecommerce audits are: SEO plugin default-blocks, missing real-time user-agent rules, blocking by accident through wildcard rules, blocking AI bots on Cloudflare or CDN level, and never updating the configuration after launch. Each is fixable in under an hour.

Mistake 1: SEO plugin default-blocks

Several SEO plugins added “block AI bots” features with the toggle ON by default. The most common offenders are some versions of Yoast, RankMath, and All-in-One SEO. Check your plugin settings and ensure AI bot blocking is OFF unless you have a specific reason for it.

Mistake 2: Missing real-time user-agent rules

Many older robots.txt files name GPTBot, ClaudeBot, and PerplexityBot but not the real-time fetchers ChatGPT-User and Perplexity-User. A user agent with no group of its own falls back to the User-agent: * rules, so if those rules are restrictive the fetchers are blocked even though the training crawlers are allowed. Give these newer user agents their own explicit rules rather than relying on the wildcard group.

Mistake 3: Blocking by accident through wildcard rules

A blanket User-agent: * with broad Disallow rules can accidentally apply to AI crawlers when more specific User-agent blocks are not present. AI bots typically fall back to the * rules if no specific rules apply, which can produce surprising results.
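
The fallback behavior is easy to reproduce. The sketch below feeds a hypothetical legacy robots.txt to Python's standard-library urllib.robotparser. The file contents and the allowed() helper are made up for illustration, and real crawlers may differ in detail from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical legacy file: GPTBot has its own group,
# every other bot falls through to the restrictive "*" group.
LEGACY_ROBOTS = """\
User-agent: GPTBot
Disallow: /checkout/

User-agent: *
Disallow: /
"""


def allowed(robots_text, bot, path, base="https://yoursite.com"):
    """Return True if the stdlib parser lets `bot` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(bot, base + path)


if __name__ == "__main__":
    # GPTBot matches its own group, which only blocks /checkout/.
    print(allowed(LEGACY_ROBOTS, "GPTBot", "/products/widget"))        # True
    # ChatGPT-User has no group, so it inherits "Disallow: /" from "*".
    print(allowed(LEGACY_ROBOTS, "ChatGPT-User", "/products/widget"))  # False
```

Adding a User-agent: ChatGPT-User line to the first group would give the fetcher the same access as GPTBot.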

Mistake 4: Blocking AI bots at the CDN level

Cloudflare and other CDNs have introduced features to block AI bots at the network level. These overrides take precedence over robots.txt. If you have a Cloudflare WAF rule or Bot Fight Mode setting that blocks AI bots, robots.txt access is irrelevant — the bot never reaches your origin server.

Mistake 5: Never updating after launch

A robots.txt file written in 2022 or 2023 likely does not include 2025 and 2026 AI crawlers like OAI-SearchBot, Applebot-Extended, or Meta-ExternalAgent. Quarterly updates are the minimum cadence to stay current.

The CDN Override Risk

If your robots.txt looks correct but AI crawlers still are not reaching you, check Cloudflare or your CDN dashboard. Many sites have invisible AI-blocking rules at the WAF level that override everything else. Cloudflare specifically has a one-click “Block AI Bots” option in some dashboards that is easy to enable by mistake.
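
One rough way to spot a network-level block is to request the same URL with a browser-style user agent and with AI crawler user agents, then compare the HTTP status codes. The sketch below uses only Python's standard library. The helper names, the simplified user-agent strings, and the list of status codes treated as a block are assumptions. Because many CDNs verify crawlers by IP address, a spoofed user agent can be rejected even when the real bot is allowed, so treat a difference as a hint to check the dashboard, not as proof.

```python
import urllib.error
import urllib.request

# Status codes treated as "probably blocked" in this sketch (an assumption).
BLOCK_CODES = (401, 403, 429, 503)


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for a GET sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return error.code


def find_blocked_agents(url, user_agents, baseline_agent, fetch=fetch_status):
    """Compare each agent's status with a baseline (browser-like) agent.

    Returns (baseline_status, {agent: status}) for agents that get a
    blocking status different from the baseline.
    """
    baseline = fetch(url, baseline_agent)
    blocked = {}
    for agent in user_agents:
        status = fetch(url, agent)
        if status != baseline and status in BLOCK_CODES:
            blocked[agent] = status
    return baseline, blocked
```

A call such as find_blocked_agents("https://yoursite.com/", ["GPTBot/1.0", "ClaudeBot/1.0"], "Mozilla/5.0") returns the baseline status and any agents that were turned away. The fetch parameter exists so the comparison logic can be exercised without network access.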

How should ecommerce brands maintain their AI crawler configuration over time?

Maintain your AI crawler configuration through quarterly audits, monthly server log monitoring, post-plugin-update verification, and a documented list of known user-agents that your team can update as new bots appear. Maintenance is mostly about catching SEO plugin updates, CDN setting changes, and new AI bot introductions before they create blind spots.

The maintenance schedule

Monthly: Check server logs for AI crawler activity. Are GPTBot, ClaudeBot, PerplexityBot showing up? If not, investigate
Quarterly: Full robots.txt audit. Read the file end to end. Verify all major AI crawlers are explicitly allowed. Update for any new bots
After plugin updates: Verify your SEO plugin has not silently changed your AI bot settings during the update
After CDN setting changes: If anyone on your team adjusts Cloudflare or CDN settings, verify AI crawler access is preserved
When new AI products launch: Check whether the new product has a new user-agent that you need to add
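
The quarterly audit can be partly automated by comparing your documented list of known user agents against the User-agent lines actually present in robots.txt. The sketch below is a minimal text-based check. The KNOWN_BOTS list, the sample file, and both helper functions are illustrative assumptions, and the check only looks for explicit groups, not for what the rules inside them allow.

```python
def explicit_agents(robots_text):
    """Return the lowercase set of names on User-agent lines."""
    agents = set()
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


def missing_agents(robots_text, known_bots):
    """List known bots that have no explicit User-agent line."""
    present = explicit_agents(robots_text)
    return [bot for bot in known_bots if bot.lower() not in present]


# Illustrative team-maintained list; extend it as new bots appear.
KNOWN_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
              "PerplexityBot", "Perplexity-User"]

# Hypothetical older file that predates the real-time fetchers.
OLD_ROBOTS = """\
# written before real-time fetchers existed
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    print(missing_agents(OLD_ROBOTS, KNOWN_BOTS))
```

Bots reported as missing are not necessarily blocked, since they fall back to the User-agent: * group, but they are the ones worth reviewing first.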

Tools and resources for ongoing maintenance

Dark Visitors (darkvisitors.com) maintains an updated list of AI crawler user-agents
The robotstxt.org documentation covers protocol fundamentals (robotstxt.org)
The robots.txt report in Google Search Console (which replaced the retired robots.txt Tester) for fetch and syntax checks
Individual operator documentation: OpenAI GPTBot docs, Anthropic ClaudeBot pages, Perplexity bot documentation

Most teams can maintain AI crawler configuration with about 30 minutes of work per quarter once the baseline is set up correctly. The first audit takes longer because you are inventorying and fixing accumulated issues.

Key Takeaways

The 6 Things to Remember About AI Crawlers

12 AI crawlers matter for ecommerce brands in 2026, operated by OpenAI, Anthropic, Perplexity, Google, Apple, Amazon, Meta, and Common Crawl
Three crawler categories exist: training, indexing/retrieval, and real-time user-triggered fetchers — all three should be allowed for full citation eligibility
The ready-to-paste robots.txt under "Recommended robots.txt for most ecommerce sites" covers the recommended baseline configuration for most ecommerce sites
Blocking AI crawlers usually costs more in citation loss than it gains in protection — default to allowing unless you have a specific reason
The five most common mistakes: SEO plugin default-blocks, missing real-time user-agents, accidental wildcard blocks, CDN-level overrides, and never updating after launch
Maintain configuration through quarterly audits, monthly server log checks, and post-plugin-update verification — about 30 minutes per quarter once baseline is set
