May 22, 2026 · 4 min read · Workspace CMS Editorial

What llms.txt is and why your site needs one in 2026

If you ran a website in 1996, one of the most important files on your server was robots.txt, the short plain-text manifest that told search engine crawlers what to crawl and what to skip. If you're running a website in 2026, you need a second one: llms.txt. It's the file that tells the AI engines what they're allowed to cite, what they should treat as canonical, and which sections of your site they're not supposed to summarize back to their users.

What llms.txt actually is

llms.txt is a proposed standard, first published in September 2024 and adopted in earnest through 2025, that lives at the root of your site (https://yourdomain.com/llms.txt) and gives Large Language Model crawlers three things:

Permissions. Which URLs the LLM is welcome to ingest for training, retrieval, or live citation.
Canonical structure. A curated, low-noise list of the pages on your site that the LLM should prefer when summarizing your business — your homepage, your service pages, your case studies, your pricing — without having to crawl the noisy ones to find them.
Provenance signals. Author names, publication dates, and editorial responsibility metadata so the LLM can attach the right citation when it surfaces your content in an answer.

It is, structurally, a Markdown file. That's deliberate: LLMs see a great deal of Markdown in their training data and read it reliably, so a Markdown manifest is the lowest-friction format both for the engine reading it and for the human writing it. A typical llms.txt looks like a one-page table of contents for your business.
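
To make the "Markdown table of contents" shape concrete, here is a minimal Python sketch that reads a file of that kind into a title, a summary, and named link sections. The sample content, the example.com URLs, and the function name parse_llms_txt are all hypothetical, and the layout assumed here (one H1, one blockquote summary, H2 sections of link bullets) is only the common shape of the proposal, not a complete parser for it.

```python
import re

# Hypothetical sample file; the company and URLs are placeholders.
SAMPLE = """\
# Example Co

> Example Co builds custom storefronts for mid-size retailers.

## Services

- [Storefront design](https://example.com/services/design): Custom themes and UX
- [Migrations](https://example.com/services/migrations): Platform moves

## Reference

- [Pricing](https://example.com/pricing)
"""

# One bullet of the form "- [Title](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Split an llms.txt-style Markdown file into title, summary, and sections."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the H2 section we are currently inside
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(
                    (match["title"], match["url"], match["note"] or "")
                )
    return doc


if __name__ == "__main__":
    parsed = parse_llms_txt(SAMPLE)
    print(parsed["title"])
    for section, links in parsed["sections"].items():
        print(section, len(links))
```

Reading your own file back through something like this is a quick way to confirm that it has the structure you think it has.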

Why the four engines all support it now

The four search surfaces buyers actually use in 2026 — ChatGPT (with web tool enabled), Claude (with the browse tool), Perplexity, and Gemini — all read llms.txt as part of their crawl. They don't all do exactly the same thing with it:

Perplexity uses the canonical list as a priority queue: pages listed in llms.txt are checked first when answering a query about your business.
ChatGPT's web tool uses it as a safety filter: pages marked “do not cite” are excluded from response previews even if the underlying retrieval index found them.
Claude uses provenance signals to attach author and publication-date metadata to the citation card it shows the user.
Gemini uses it as a discoverability hint, similar to Google's classic XML sitemap.

The net effect: a site with a thoughtful llms.txt gets cited more often, more accurately, and with the right attribution. A site without one is at the mercy of whatever the engine's generic crawler decided to keep.

What goes in a useful llms.txt

The bare minimum is a title, a summary, and a hierarchical list of links to your core pages. A useful llms.txt, the kind we ship by default on Workspace CMS sites, goes further:

A “What this business is” paragraph. Two sentences. This is what the LLM will quote when a user asks “what does [your company] do?” Make it the version you'd actually want quoted.
A canonical service map. One bullet per offering, with the URL. Skip the marketing fluff; lead with the noun the user would search for.
Authoritative reference pages. Your pricing page, your contact page, your case studies — the high-signal surfaces the LLM should prefer over a blog post that mentions a price tangentially.
Excluded sections. Customer-portal URLs, internal docs, anything you don't want summarized back to a stranger. The exclude block is the new Disallow.
An updated: timestamp. The engines use this to decide whether their cached copy is stale.
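
The five components above can be sketched as a small generator. This is an illustration under stated assumptions, not how Workspace CMS produces the file: render_llms_txt, the section names, and the example.com URLs are hypothetical, and the "Excluded" section and "updated:" line follow this article's description rather than any syntax confirmed elsewhere.

```python
from datetime import date


def render_llms_txt(name, summary, services, references, excluded, updated):
    """Render an llms.txt-style Markdown file from structured content.

    services and references are lists of (title, url) pairs; excluded is a
    list of URL paths. The layout is an assumption made for illustration.
    """
    lines = [f"# {name}", "", f"> {summary}", "", f"updated: {updated.isoformat()}", ""]
    lines += ["## Services", ""]
    lines += [f"- [{title}]({url})" for title, url in services]
    lines += ["", "## Reference", ""]
    lines += [f"- [{title}]({url})" for title, url in references]
    lines += ["", "## Excluded", ""]
    lines += [f"- {path}" for path in excluded]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_llms_txt(
            name="Example Co",
            summary="Example Co builds custom storefronts for mid-size retailers.",
            services=[("Storefront design", "https://example.com/services/design")],
            references=[("Pricing", "https://example.com/pricing")],
            excluded=["/portal/", "/internal-docs/"],
            updated=date(2026, 5, 22),
        )
    )
```

Generating the file from the same data that feeds your pages, rather than editing it by hand, keeps the service list and the timestamp from falling out of date.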

The mistake most sites make

The most common llms.txt error in 2026 is the same as the most common robots.txt error in 2006: pasting a generic template, never updating it, and never reading it back. We see llms.txt files that still list URLs that 404, that describe the business as it existed two years ago, and that omit half the service pages the site actually offers.
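
Reading the file back can be partly automated. The sketch below pulls every linked URL out of an llms.txt and flags the ones that no longer resolve. It uses only the Python standard library; the function names are hypothetical, and it assumes your server answers HEAD requests, which not every server does.

```python
import re
import urllib.error
import urllib.request

# Matches the URL inside a Markdown link: [Title](https://...)
URL_RE = re.compile(r"\]\((https?://[^)\s]+)\)")


def extract_urls(llms_text):
    """Return every http(s) URL that appears as a Markdown link target."""
    return URL_RE.findall(llms_text)


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 410
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def find_broken_links(llms_text, status_of=http_status):
    """List (url, status) pairs for links that fail or return 4xx/5xx."""
    broken = []
    for url in extract_urls(llms_text):
        status = status_of(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken


if __name__ == "__main__":
    with open("llms.txt", encoding="utf-8") as handle:
        for url, status in find_broken_links(handle.read()):
            print(status, url)
```

The status_of parameter exists so the check can be exercised without network access, for example by passing a dictionary lookup in a test. A script like this catches dead URLs, but it cannot tell you that the description is two years old; that still takes an editor.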

An llms.txt is a marketing asset, not a config file. It earns its keep when a real editor — ideally the same person who owns the homepage copy — owns it. On Workspace CMS sites, the llms.txt is generated from the same structured content the homepage and service pages render from, so it can't drift. On other platforms, you'll need to put a reminder in the calendar.

How to know it's working

The proof point is citation share. Run the same prompt — “what does [your company] do?” or “who are the best [your category] firms in [your city]?” — across all four engines, before and after you ship the llms.txt. Track whether your site shows up in the citation row at the bottom of the answer, and whether the language the engine uses is yours or a guess.
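
If you record those runs by hand, the arithmetic is simple enough to script. The sketch below assumes you log each run as an engine name plus the list of URLs the answer cited; that log format, the function names, and the sample data are hypothetical, and nothing here queries any engine.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cites_domain(cited_urls, domain):
    """True if any cited URL is on domain or one of its subdomains."""
    for url in cited_urls:
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return True
    return False


def citation_share(observations, domain):
    """Per-engine fraction of recorded answers that cited domain.

    observations is an iterable of (engine, cited_urls) pairs, one per
    prompt run that you logged manually.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for engine, cited_urls in observations:
        totals[engine] += 1
        if cites_domain(cited_urls, domain):
            hits[engine] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}


if __name__ == "__main__":
    # Hypothetical log of three runs taken before shipping llms.txt.
    before = [
        ("perplexity", ["https://other.com/a"]),
        ("perplexity", ["https://www.example.com/pricing"]),
        ("gemini", []),
    ]
    print(citation_share(before, "example.com"))
```

Run the same calculation on the "after" log and compare the two dictionaries engine by engine. Whether the wording the engine uses is yours or a guess is a judgment call that still needs a human reading the answers.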

If you don't have a way to track that yet, you'll want to. That's exactly what the AI Visibility tracker is for — and it's the topic of the next post in this series.

Curious what your llms.txt should say, or whether you even have one yet? Book a free strategy call — we'll walk through your AI-citation posture on the call and draft the file with you live.
