By Meiko Neuman, Founder | May 14, 2026 | 7 min read | AI Discoverability, GEO, robots.txt

AI Crawlers in 2026: The 18 Bots Your Website Should Welcome

GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and fourteen more. A field guide to the AI crawlers that decide whether your site shows up in generative answers, and the robots.txt that lets them in.

Two years ago, most robots.txt files contained a single directive: User-agent: *, followed by whatever you wanted to disallow. That world is over. In 2026 there are at least 18 distinct AI crawlers visiting the open web every day. They are run by ten different operators, they respect (or ignore) different rules, and each one gates a different audience of users who might be asking ChatGPT, Claude, Perplexity, or Gemini about your industry right now.

A site that doesn’t explicitly allow these bots is making a bet. Sometimes the bet works (the bot crawls anyway, on a generic permission). Sometimes it doesn’t (the bot follows a strict opt-in policy and skips your domain entirely). The cheap fix is to stop betting: list every bot you want, allow them by name, and be sure.

The 18 AI crawlers SOSEI's default robots.txt explicitly allows, grouped by parent company.

The 18 bots, what they do, and why they matter

OpenAI: three bots, three jobs

GPTBot: ingests pages into ChatGPT’s training and knowledge layer. The big one. Blocking it removes your site from the foundation of every future ChatGPT answer.
ChatGPT-User: the live fetch when a user asks ChatGPT to read a specific URL. Lower volume, but each fetch corresponds to a real user actively interested in your content.
OAI-SearchBot: powers ChatGPT’s search feature. Crawls more aggressively than GPTBot but for a shorter retention window.

Anthropic: four bots

ClaudeBot: the main training and retrieval crawler. Behaves similarly to GPTBot.
Claude-Web: legacy identifier still used for some retrieval, kept allowed for backwards compatibility.
claude-user: live fetches from Claude users who pass URLs into a conversation. Equivalent to ChatGPT-User.
claude-searchbot: Claude’s search feature crawler, introduced in 2025.

Perplexity: two bots, very high traffic

PerplexityBot: the citation engine. Perplexity’s entire product is grounded in real-time crawls, so blocking this bot quite literally removes your site from the answer.
Perplexity-User: live fetches initiated by individual users.

Google, Microsoft & Apple: the AI-specific overlays on classical search

Google-Extended: a robots.txt control token, not a separate crawler, that governs whether content Google has already crawled may be used to train Gemini (formerly Bard) models and ground their answers. It is separate from regular Googlebot rules and does not affect inclusion in Google Search. When no rule names it, that use is permitted by default, so the explicit Allow records your intent and guards against a stray Disallow later.
GoogleOther: the umbrella user-agent Google uses for non-Search research and product testing.
Bingbot: powers Bing search, which in turn powers Microsoft Copilot and ChatGPT’s web search fallback.
Applebot-Extended: Apple’s AI-training control token, the equivalent of Google-Extended for Apple Intelligence.

Meta, Apple, X, ByteDance, Common Crawl

Meta-ExternalAgent: the bot powering Meta’s Llama / Meta AI products.
Applebot: the classical Spotlight / Siri crawler.
Twitterbot: renders link previews when your URL is posted on X.
Bytespider: ByteDance’s crawler; powers Doubao and other Chinese-market AI products. The default template allows it; consider blocking it if your audience is exclusively EU/US.
CCBot: Common Crawl. Not an AI product itself, but its public datasets are a training-data source for many open-source LLMs. Allowing CCBot helps you appear in downstream models you can’t individually target.
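To see which of these crawlers already visit a site, the user-agent strings in an access log can be matched against the names above. The following is a minimal Python sketch rather than a definitive detector. The sample strings are invented for illustration, and real user-agent strings vary by vendor. Google-Extended and Applebot-Extended are left out on the assumption that they act as robots.txt control tokens and do not appear in request headers. User-agent strings can also be spoofed, so a match is a claim, not a verified identity.

```python
from collections import Counter

# Crawler names from the list above. The two "-Extended" tokens are omitted
# on the assumption that they never appear in request headers.
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "claude-user", "claude-searchbot",
    "PerplexityBot", "Perplexity-User",
    "GoogleOther", "Bingbot",
    "Meta-ExternalAgent", "Applebot", "Twitterbot", "Bytespider", "CCBot",
]

# Longest tokens first, so a longer name would win over any shorter name
# it happens to contain.
_ORDERED = sorted(AI_BOT_TOKENS, key=len, reverse=True)


def count_ai_bots(user_agents):
    """Count how many user-agent strings mention each known bot token."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()  # match case-insensitively
        for token in _ORDERED:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one bot
    return counts


# Hypothetical user-agent strings, invented for illustration only.
sample_log = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://example.com/bot)",
    "Mozilla/5.0 (compatible; gptbot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
]

print(count_ai_bots(sample_log))
```

A bot that never shows up in the counts is not necessarily blocked; it may simply not have visited during the log window.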

The robots.txt template

The default that ships on every SOSEI-generated site:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: claude-user
Allow: /

User-agent: claude-searchbot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Applebot
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
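Maintaining nineteen near-identical groups by hand invites copy-paste slips, so one option is to generate the file from a list. The sketch below assumes a hypothetical helper named build_robots_txt; it is not part of SOSEI or of any library, and the blocked_bots parameter is an illustrative addition for sites that want to exclude a specific crawler.

```python
def build_robots_txt(allowed_bots, sitemap_url, blocked_bots=()):
    """Render a robots.txt with one explicit group per named bot.

    Hypothetical helper for illustration; names and layout follow the
    template above.
    """
    overlap = set(allowed_bots) & set(blocked_bots)
    if overlap:
        raise ValueError(f"bots listed as both allowed and blocked: {sorted(overlap)}")

    groups = [f"User-agent: {bot}\nAllow: /" for bot in allowed_bots]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in blocked_bots]
    groups.append("User-agent: *\nAllow: /")   # generic fallback group
    groups.append(f"Sitemap: {sitemap_url}")   # sitemap declaration goes last
    return "\n\n".join(groups) + "\n"


# Usage: the three OpenAI bots allowed, everything else on the fallback.
print(build_robots_txt(
    ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
    "https://yourdomain.com/sitemap.xml",
))
```

Keeping the bot list in one place means adding or retiring a crawler is a one-line change, and the generated file stays consistent.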

What about “just allow everything”?

A generic User-agent: * with no explicit per-bot group does implicitly allow the bots. But that’s not the failure mode to worry about. The failure mode is the WordPress plugin, the SEO consultant, or the well-meaning developer who, two years from now, sees an unfamiliar bot in server logs and adds a Disallow for it without understanding what they’ve cut off. Explicit per-bot allows survive that kind of casual maintenance.

They also give you the option to selectively block when you genuinely need to (Bytespider for EU-only sites, for example) without cutting off every other AI crawler with a single overly broad line.
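One way to catch the accidental Disallow described above is to audit the live file against the list of bots you expect to be welcome. This is a minimal sketch using Python’s standard-library urllib.robotparser, whose matching rules may differ in detail from those of any individual crawler. The locked_out helper and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def locked_out(robots_txt, bots, path="/"):
    """Return the bots from `bots` that may not fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


# A hypothetical file after some casual maintenance: Bytespider is blocked
# on purpose, GPTBot by an edit nobody remembers making.
EDITED = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

# Bots this site wants to welcome. PerplexityBot has no group of its own,
# so it falls through to the generic "*" group.
WANTED = ["GPTBot", "ClaudeBot", "PerplexityBot"]

print(locked_out(EDITED, WANTED))  # prints ['GPTBot']
```

Run as part of a deploy check, a non-empty result flags a bot that was meant to be allowed, while an intentional block such as Bytespider stays out of the report because it is not in the wanted list.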

Beyond robots.txt

A welcoming robots.txt is necessary but not sufficient. The other signals AI crawlers care about:

A valid sitemap.xml declared in robots.txt so the bot can find all your pages, not just the homepage.
An llms.txt at the root summarising your site in markdown (see llms.txt explained).
Server-rendered HTML the bot can parse without executing JavaScript. Many bots do not run JS.
Schema.org JSON-LD on every page so the bot knows what entity your site is about.
Reasonable rate limits: aggressive 429 responses train the bot to crawl you less often.
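As an illustration of the JSON-LD item in the list above, the following Python sketch renders a Schema.org Organization entity as a script tag that a server-side template could place in the page head. The function name, the choice of the Organization type, and the example values are assumptions for illustration; the right type and properties depend on what the site is actually about.

```python
import json


def organization_jsonld(name, url, description):
    """Return a <script> tag containing Schema.org Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
    }
    # Escape "</" so a value can never close the script tag early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical example values.
print(organization_jsonld(
    "Example Co",
    "https://yourdomain.com",
    "Example Co builds widgets for small teams.",
))
```

Because the tag is plain text in the server-rendered HTML, a crawler that does not execute JavaScript can still read it.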

What SOSEI ships by default

Every site SOSEI generates ships with the 18-bot allowlist above, a valid sitemap referenced from robots.txt, an llms.txt at the root, and server-rendered HTML with no JS required for the marketing pages. The whole AI-discoverability layer is one of seven dimensions in the free analyzer; run yours to see which bots your current site is accidentally locking out.