AI Crawlers in 2026: The 18 Bots Your Website Should Welcome
GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and fourteen more. A field guide to the AI crawlers that decide whether your site shows up in generative answers, and the robots.txt that lets them in.
Two years ago, most robots.txt files contained a single directive: User-agent: *, followed by whatever you wanted to disallow. That world is over. In 2026 there are at least 18 distinct AI crawlers visiting the open web every day. They are run by ten different operators, each respecting (or ignoring) different rules, and each gating a different audience of users who might be asking ChatGPT, Claude, Perplexity, or Gemini about your industry right now.
A site that doesn’t explicitly allow these bots is making a bet. Sometimes the bet works (the bot crawls anyway, on a generic permission). Sometimes it doesn’t (the bot follows a strict opt-in policy and skips your domain entirely). The cheap fix is to stop betting: list every bot you want, allow them by name, and be sure.
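One way to stop betting is to check which of the bots you care about are actually named in your current file. The sketch below is a minimal audit under stated assumptions: `named_agents` and `audit` are hypothetical helper names, the `WANTED` list is only a sample taken from this guide, and the code reads User-agent lines only rather than implementing full robots.txt semantics.

```python
# Sample of bots from this guide; extend with whichever ones you care about.
WANTED = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def named_agents(robots_txt):
    """Return the lowercased token of every User-agent line in the file."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def audit(robots_txt, wanted=WANTED):
    """Label each wanted bot as explicit, wildcard only, or not mentioned."""
    named = named_agents(robots_txt)
    report = {}
    for bot in wanted:
        if bot.lower() in named:
            report[bot] = "explicit"
        elif "*" in named:
            report[bot] = "wildcard only"
        else:
            report[bot] = "not mentioned"
    return report
```

Calling `audit(open("robots.txt").read())` on a local copy of your file gives a quick picture of where you are relying on the wildcard. It says nothing about whether the matching rules allow or disallow anything; it only reports naming.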
The 18 bots, what they do, and why they matter
OpenAI: three bots, three jobs
- GPTBot: ingests pages into ChatGPT’s training and knowledge layer. The big one. Blocking it removes your site from the foundation of every future ChatGPT answer.
- ChatGPT-User: the live fetch when a user asks ChatGPT to read a specific URL. Lower volume, but each fetch corresponds to a real user actively interested in your content.
- OAI-SearchBot: powers ChatGPT’s search feature. Crawls more aggressively than GPTBot but for a shorter retention window.
Anthropic: four bots
- ClaudeBot: the main training and retrieval crawler. Behaves similarly to GPTBot.
- Claude-Web: legacy identifier still used for some retrieval, kept allowed for backwards compatibility.
- Claude-User: live fetches from Claude users who pass URLs into a conversation. Equivalent to ChatGPT-User.
- Claude-SearchBot: Claude’s search feature crawler, introduced 2025.
Perplexity: two bots, very high traffic
- PerplexityBot: the citation engine. Perplexity’s entire product is grounded in real-time crawls, so blocking this bot quite literally removes your site from the answer.
- Perplexity-User: live fetches initiated by individual users.
Google & Microsoft: the AI-specific overlays on classical search
- Google-Extended: a robots.txt control token rather than a separate crawler. It governs whether content Google has already crawled may be used to train Gemini (formerly Bard) models and to ground their answers. It is separate from regular Googlebot and does not affect Search inclusion or ranking. Google documents it as an opt-out, so an explicit Allow mostly records your intent and guards against a stray Disallow added later.
- GoogleOther: the umbrella user-agent Google uses for non-Search research and product testing.
- Bingbot: powers Bing search, which in turn powers Microsoft Copilot and ChatGPT’s web search fallback.
- Applebot-Extended: Apple’s AI-training control token, the equivalent of Google-Extended for Apple Intelligence. It does not crawl on its own; it tells Apple whether pages fetched by Applebot may be used for training.
Meta, Apple, X, ByteDance, Common Crawl
- Meta-ExternalAgent: the bot powering Meta’s Llama / Meta AI products.
- Applebot: the classical Spotlight / Siri crawler.
- Twitterbot: renders link previews when your URL is posted on X.
- Bytespider: ByteDance’s crawler; powers Doubao and other Chinese-market AI products. Block this if your audience is exclusively EU/US.
- CCBot: Common Crawl. Not an AI itself, but its public datasets are a training-data source for many open-source LLMs. Allowing CCBot helps you appear in downstream models you can’t individually target.
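When these names turn up in server logs, it helps to map a raw user-agent header back to the operator behind it. The following is a rough sketch, not a verification tool: `BOT_OPERATORS` and `classify` are hypothetical names, the mapping simply restates the list above, and the match is a case-insensitive substring test. User-agent headers can be spoofed, so treat a match as a hint only.

```python
# Token -> operator, restating the list in this guide.
BOT_OPERATORS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-Web": "Anthropic",
    "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Perplexity-User": "Perplexity",
    "Google-Extended": "Google",
    "GoogleOther": "Google",
    "Bingbot": "Microsoft",
    "Applebot-Extended": "Apple",
    "Applebot": "Apple",
    "Meta-ExternalAgent": "Meta",
    "Twitterbot": "X",
    "Bytespider": "ByteDance",
    "CCBot": "Common Crawl",
}


def classify(user_agent):
    """Return (token, operator) for a known bot, or None if nothing matches."""
    ua = user_agent.lower()
    # Try longer tokens first so "Applebot-Extended" wins over "Applebot".
    for token in sorted(BOT_OPERATORS, key=len, reverse=True):
        if token.lower() in ua:
            return token, BOT_OPERATORS[token]
    return None
```

For example, `classify("Mozilla/5.0 (compatible; GPTBot/1.1)")` uses a simplified, illustrative header rather than the exact string any vendor sends, and resolves to the OpenAI entry. An ordinary browser header returns `None`.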
The robots.txt template
The default that ships on every SOSEI-generated site:
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: GoogleOther
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Applebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: CCBot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
What about “just allow everything”?
A generic User-agent: * with no explicit per-bot block does implicitly allow the bots. But that’s not the failure mode to worry about. The failure mode is the WordPress plugin, the SEO consultant, or the well-meaning developer who, two years from now, sees an unfamiliar bot in server logs and adds a Disallow for it without understanding what they’ve cut off. Explicit per-bot allows survive that kind of casual maintenance.
They also give you the option to selectively block when you genuinely need to (Bytespider for EU-only sites, for example) without breaking the rest of the AI ecosystem in a single line.
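The template and the selective-block idea can be sketched as a small generator plus a sanity check. This is an illustration under assumptions, not how SOSEI builds its files: `build_robots_txt` and `is_allowed` are hypothetical helpers, and the check relies on Python's standard `urllib.robotparser`, whose matching rules are simpler than those of some production crawlers.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "GoogleOther", "Applebot-Extended", "Bingbot",
    "Meta-ExternalAgent", "Applebot", "Twitterbot", "CCBot", "Bytespider",
]


def build_robots_txt(bots, blocked=(), sitemap_url=None):
    """Emit one group per bot: Allow by default, Disallow if listed in blocked."""
    blocked_lower = {name.lower() for name in blocked}
    lines = []
    for bot in bots:
        rule = "Disallow: /" if bot.lower() in blocked_lower else "Allow: /"
        lines += [f"User-agent: {bot}", rule]
    lines += ["User-agent: *", "Allow: /"]  # fallback for everything unnamed
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url="https://yourdomain.com/"):
    """Ask the stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example: an EU-only site that blocks Bytespider and nothing else.
robots = build_robots_txt(
    AI_BOTS,
    blocked={"Bytespider"},
    sitemap_url="https://yourdomain.com/sitemap.xml",
)
```

With this file, `is_allowed(robots, "Bytespider")` is false while `is_allowed(robots, "GPTBot")` and any unnamed bot stay true, so the block touches one group and leaves the rest of the allowlist alone.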
Beyond robots.txt
A welcoming robots.txt is necessary but not sufficient. The other signals AI crawlers care about:
- A valid sitemap.xml declared in robots.txt so the bot can find all your pages, not just the homepage.
- An llms.txt at the root summarising your site in markdown (see llms.txt explained).
- Server-rendered HTML the bot can parse without executing JavaScript. Many bots do not run JS.
- Schema.org JSON-LD on every page so the bot knows what entity your site is about.
- Reasonable rate limits — aggressive 429 responses train the bot to crawl you less often.
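Two of the signals above, server-rendered HTML and JSON-LD, can be checked together by looking at what a bot that never runs JavaScript would see. The sketch below parses raw HTML with the standard library and collects any Schema.org JSON-LD blocks. `JsonLdCollector` and `extract_json_ld` are hypothetical names, and the check assumes the markup is present in the initial HTML response rather than injected by scripts.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON-LD from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped, not repaired


def extract_json_ld(html):
    """Return every JSON-LD object found in the raw, unexecuted HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks
```

Feeding this the body of a plain HTTP response is a reasonable first test: an empty list on a page you believe has structured data suggests the markup is only added client-side.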
What SOSEI ships by default
Every site SOSEI generates ships with the 18-bot allowlist above, plus a valid sitemap referenced from robots.txt, plus an llms.txt at the root, plus server-rendered HTML with no JS required for the marketing pages. The whole AI-discoverability layer is one of seven dimensions in the free analyzer. Run yours to see which bots your current site is accidentally locking out.