llms.txt Explained: What AI Crawlers Actually Read on Your Site
A markdown file at the root of your site decides whether ChatGPT, Claude, and Perplexity summarize you correctly. Here's what llms.txt is, what to put in it, and the common mistakes that make it useless.
A single markdown file at https://yourdomain.com/llms.txt decides whether AI engines understand your business correctly when they cite it. The file is small, the convention is loose, and most sites still don’t have one, which means the few that do get a disproportionate share of the citations.
What llms.txt is
llms.txt is a plain-text markdown file at the root of your site, modeled loosely after robots.txt and the way a developer-friendly README is structured. Its job is to give a large language model a fast, deterministic way to understand:
- What your site is about, in one sentence
- The handful of URLs that are actually worth reading
- How to contact the people behind it
It is not a replacement for sitemap.xml. Sitemap is for search engine crawlers and lists every indexable URL. llms.txt is for LLMs and lists the few pages that compress well into context. We compare them in detail in Sitemap.xml vs llms.txt: do you need both? (yes).
Where it came from
The convention was proposed by Jeremy Howard (fast.ai, Answer.AI) in September 2024 and adopted within months by Anthropic, Vercel, Mintlify, Cloudflare, and a long tail of SaaS docs sites. There is no IETF RFC and no W3C blessing; it is a de facto standard that LLM tooling now expects to find.
The anatomy of a good llms.txt
The format is loose, but the four-section pattern below is what every reference implementation lands on:
# Site Name
> One-sentence description. Be specific. The model reads this first.
## What this site does
- Bullet list of 3-6 items.
- Concrete capabilities, not marketing adjectives.
## Key pages
- [Page name](https://example.com/page): short note about what's there.
- [Another page](https://example.com/other)
## Contact
- Email: [email protected]
The mistakes that make it useless
1. Marketing fluff at the top
“We are a forward-thinking team passionate about empowering businesses to…” is the kind of opener that wastes the only sentence the LLM is guaranteed to read. Lead with what you are and what you do. Specific nouns and verbs only.
2. A 200-line file
The whole point is that an LLM can read it in one shot. If your llms.txt requires multiple context calls to digest, you have rebuilt sitemap.xml in markdown. Keep it under 60 lines. Link out to llms-full.txt for full content.
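A small lint step can enforce the size budget and the basic shape of the file. The sketch below is one possible check, not part of any specification: the function name `lint_llms_txt` is hypothetical, the 60-line limit is simply the figure suggested above, and the structural rules (an H1 title, a `> ` description, at least one `## ` section) are assumptions based on the template shown earlier.

```python
MAX_LINES = 60  # budget suggested in this article; an assumption, not a spec rule


def lint_llms_txt(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return human-readable problems found in an llms.txt body (empty if none)."""
    problems = []
    lines = text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]

    # Size check: the file should be readable in one pass.
    if len(lines) > max_lines:
        problems.append(f"file has {len(lines)} lines, budget is {max_lines}")

    # Shape checks, following the template: title, description, sections.
    if not non_blank or not non_blank[0].startswith("# "):
        problems.append("first non-blank line should be an H1 title ('# Site Name')")
    if len(non_blank) < 2 or not non_blank[1].startswith("> "):
        problems.append("second non-blank line should be a '> ' description")
    if not any(ln.startswith("## ") for ln in lines):
        problems.append("no '## ' sections found")
    return problems
```

Running this in CI and failing the build on a non-empty result keeps the file from growing unnoticed.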
3. Linking to internal-only or paywalled URLs
LLMs follow the links. If they hit a 404 or a login wall, they get nothing usable and may discount the rest of the file. Every URL listed should be publicly readable.
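One way to catch dead or gated links before a crawler does is to extract every markdown link from the file and request it without credentials. This is a minimal sketch using only the standard library; the helper names (`extract_links`, `http_status`, `unreadable_links`) and the `llms-txt-link-check` user agent string are hypothetical, and treating anything other than 200 as unreadable is an assumption.

```python
import re
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Matches markdown links of the form [label](http...) as used in the template.
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_links(text: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs for every markdown link in the text."""
    return LINK_RE.findall(text)


def http_status(url: str, timeout: float = 10.0) -> int:
    """Status of an unauthenticated GET; 0 means the request never completed."""
    req = Request(url, headers={"User-Agent": "llms-txt-link-check"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code
    except OSError:  # DNS failures, refused connections, timeouts
        return 0


def unreadable_links(text: str, get_status=http_status) -> list[tuple[str, str, int]]:
    """Return (label, url, status) for every link that did not answer 200."""
    bad = []
    for label, url in extract_links(text):
        status = get_status(url)
        if status != 200:
            bad.append((label, url, status))
    return bad
```

Note that `urlopen` follows redirects, so a login wall that redirects to a 200 login page will pass this check; catching that case would need a comparison of the final URL or the page content.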
4. Forgetting it exists
Out-of-date llms.txt files are worse than missing ones. If your pricing changed in March and your llms.txt still references the old plans, the LLM will confidently quote stale numbers. Tie regeneration of the file to your deploy pipeline.
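Regenerating the file on deploy is easiest when it is rendered from the same structured data the site already uses. The sketch below assumes a hypothetical `SiteInfo` record and a `public_dir` build output folder; neither comes from a real framework, and the section names simply mirror the template shown earlier.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Page:
    title: str
    url: str
    note: str = ""


@dataclass
class SiteInfo:  # hypothetical shape; adapt to your own site data
    name: str
    summary: str
    capabilities: list[str]
    pages: list[Page]
    email: str


def render_llms_txt(site: SiteInfo) -> str:
    """Render the four-section llms.txt layout from structured site data."""
    out = [f"# {site.name}", "", f"> {site.summary}", "", "## What this site does", ""]
    out += [f"- {item}" for item in site.capabilities]
    out += ["", "## Key pages", ""]
    for page in site.pages:
        suffix = f": {page.note}" if page.note else ""
        out.append(f"- [{page.title}]({page.url}){suffix}")
    out += ["", "## Contact", "", f"- Email: {site.email}"]
    return "\n".join(out) + "\n"


def write_llms_txt(site: SiteInfo, public_dir: str) -> Path:
    """Write llms.txt into the build output; call this from the deploy step."""
    path = Path(public_dir) / "llms.txt"
    path.write_text(render_llms_txt(site), encoding="utf-8")
    return path
```

Because pricing pages and plan names come from the same `SiteInfo` data on each build, the file cannot lag behind the live site by more than one deploy.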
llms-full.txt: the bigger sibling
Where llms.txt is a navigation index, llms-full.txt is the full corpus of marketing content, concatenated as plain markdown. Useful when an AI agent wants to ingest your entire pitch in one fetch rather than crawling page-by-page. We generate it dynamically from the same data modules our pages render, so the LLM-readable copy can never drift from what humans see; you can read ours here.
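The concatenation itself can be very small. The following is an illustrative sketch, not the generator described above: `render_llms_full` is a hypothetical name, each document is assumed to be a `(title, url, markdown_body)` tuple, and the `Source:` line is an invented convention for pointing back to the page.

```python
def render_llms_full(site_name: str, docs: list[tuple[str, str, str]]) -> str:
    """Concatenate page bodies into a single markdown corpus.

    Each entry in docs is (title, url, markdown_body).
    """
    parts = [f"# {site_name}"]
    for title, url, body in docs:
        # One H2 per page, followed by its source URL and trimmed content.
        parts.append(f"## {title}\n\nSource: {url}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"
```

Feeding it the same content records that render the HTML pages is what keeps the two copies in step.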
How to verify it’s working
- curl https://yourdomain.com/llms.txt should return 200 with content-type: text/plain or text/markdown.
- Ask Perplexity or ChatGPT a question whose answer is in your llms.txt. Compare the cited summary to your file.
- Check your server logs for hits from GPTBot, ClaudeBot, and PerplexityBot; they should be reaching /llms.txt regularly. (Google-Extended is a robots.txt control token rather than a separate user agent, so it will not show up in logs.)
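The response check and the log check can both be scripted. The sketch below is a rough starting point under stated assumptions: the function names are hypothetical, log lines are assumed to be in a common access-log format where the request line (`"GET /llms.txt HTTP/1.1"`) and the user agent appear on the same line, and bots are matched by plain substring. Google-Extended is left out because it is a robots.txt control token, not a user agent string.

```python
from collections import Counter

# Substrings that identify the crawlers' user agents in access logs.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def response_ok(status: int, content_type: str) -> bool:
    """True if the status is 200 and the media type is plain text or markdown."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type in {"text/plain", "text/markdown"}


def count_bot_hits(log_lines, path: str = "/llms.txt") -> Counter:
    """Count requests for `path` per AI crawler across access-log lines."""
    hits = Counter()
    for line in log_lines:
        # Surrounding spaces keep /llms-full.txt from matching /llms.txt.
        if f" {path} " not in line:
            continue
        for token in AI_BOT_TOKENS:
            if token in line:
                hits[token] += 1
    return hits
```

For example, `count_bot_hits(open("access.log", encoding="utf-8"))` (with `access.log` standing in for your own log path) gives a per-bot tally; an empty result after a few weeks suggests the file is not being fetched.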
Every site SOSEI rebuilds ships llms.txt and llms-full.txt by default, both regenerated on every deploy from the live site content. Want to see how your current site scores on AI discoverability? Run the free 40-point audit: the GEO category covers llms.txt presence, format, and freshness directly.