By SitemapFixer Team
Updated April 2026

llms.txt: The Emerging Standard for AI Crawler Guidance


llms.txt is a plain-text file placed at the root of a website — similar to robots.txt — that provides guidance to large language model crawlers about the site's content. The format was proposed in September 2024 by Jeremy Howard and is gaining adoption as AI companies launch their own web crawlers for training and retrieval-augmented generation (RAG) systems.

Unlike robots.txt, which uses a binary allow/block permission model, llms.txt is designed to give AI systems richer context: what the site contains, which content is most useful for AI, and what the site owner's preferences are for AI usage.

The Three-File Stack: sitemap.xml, robots.txt, llms.txt

These three files serve different but complementary roles for web crawlers:

File          Purpose                                       Audience          Standard?
sitemap.xml   Lists all URLs for discovery and indexing     Search crawlers   sitemaps.org protocol
robots.txt    Controls which bots can access which paths    All crawlers      RFC 9309
llms.txt      Describes content and AI usage preferences    AI/LLM crawlers   Proposed (not yet standardized)

The key distinction: robots.txt tells crawlers where they can go; llms.txt tells AI systems what they will find and how the owner wants the content used. A crawler that respects both files gets permission from robots.txt and context from llms.txt.

llms.txt File Format

The proposed format uses Markdown. The file lives at https://yoursite.com/llms.txt. A minimal example:

# SitemapFixer

> SitemapFixer is a technical SEO tool that analyzes XML sitemaps and
> identifies crawling and indexing issues. Audience: SEO professionals,
> developers, and site owners.

## Docs

- [Sitemap Guide](/learn/what-is-an-xml-sitemap): Complete XML sitemap reference
- [GSC Errors](/gsc-errors): Google Search Console error explanations

## Optional

- [Blog](/blog): Technical SEO articles

## Blocked

- Do not train on pricing or checkout pages
- Do not reproduce content verbatim without attribution

The format has four main sections:

  • Header — Site name and a blockquote description of what the site is and who it serves
  • Docs — Key documentation URLs with descriptions. These are the pages most useful for AI to index.
  • Optional — Secondary content the AI may include if needed (blog posts, supplementary guides)
  • Blocked — Content the site owner requests AI not use (paywalled content, pricing pages, proprietary data). This section is a common extension; it does not appear in the original proposal.

llms-full.txt: The Extended Version

Alongside llms.txt, some sites publish llms-full.txt at https://yoursite.com/llms-full.txt. This file includes the full text of key documentation pages — not just links — formatted for direct ingestion by AI systems. This is particularly useful for tools that want to index your documentation without crawling individual pages. It functions similarly to a sitemap, but instead of listing URLs, it provides the actual content.
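
There is no strict specification for llms-full.txt beyond "the full content, inline." A minimal sketch of what one might look like (the headings and text below are illustrative, not required fields):

# SitemapFixer

> SitemapFixer is a technical SEO tool that analyzes XML sitemaps and
> identifies crawling and indexing issues.

## Sitemap Guide

An XML sitemap is a file that lists a site's URLs so crawlers can
discover them... (the full page text continues inline)

## GSC Errors

Google Search Console reports crawl and indexing problems such as... (the
full page text continues inline)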

Not all sites need llms-full.txt. It is most valuable for documentation-heavy sites (developer tools, SaaS products, API references) where AI-powered coding assistants frequently look up information.

Which AI Crawlers Read llms.txt?

As of this writing (April 2026), llms.txt is a proposed convention, not an enforced standard. Adoption among AI companies is voluntary. Some crawlers that have indicated awareness of the format include Perplexity (which has documented the format on its developer site) and various RAG-based search engines. OpenAI's GPTBot and Anthropic's ClaudeBot have not officially committed to reading llms.txt, but both companies have stated they respect robots.txt.

The practical situation: llms.txt is not meaningfully enforced. It communicates intent; it does not enforce access. For hard access control, robots.txt remains the authoritative mechanism. llms.txt is better thought of as a good-faith signal to cooperative AI systems.

Should You Create an llms.txt?

Consider creating llms.txt if:

  • Your site contains documentation that AI tools or coding assistants commonly look up
  • You want to clearly communicate to AI companies which content is safe to train on or cite
  • You publish proprietary research or paywalled content and want to signal boundaries
  • You want to be early to a standard that may become more formally adopted

Do not expect llms.txt alone to prevent AI training on your content. Only robots.txt with a Disallow directive for specific bots (GPTBot, ClaudeBot, PerplexityBot, etc.) provides a technical signal that crawlers should honor. llms.txt is complementary, not a replacement.
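
For example, a robots.txt that opts out of the AI crawlers named above could look like the following (a sketch; the user-agent tokens shown are the ones these companies publish, but verify current names in each vendor's documentation):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /

Each named group applies only to that bot; the final wildcard group leaves all other crawlers, including search engines, unaffected.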

Creating Your llms.txt: Step by Step

  1. Create the file — Add a plain-text file named llms.txt at your site root. In Next.js, place it in /public/llms.txt so it serves at /llms.txt (a dynamic alternative is sketched after this list).
  2. Write the header — Site name as H1, followed by a blockquote description of what the site is and who it serves.
  3. List your key pages in Docs — These should be your most comprehensive, useful pages. For a technical SEO tool, this would be major guides and tool pages.
  4. Add Optional content — Blog posts, supplementary articles, or any content that is useful but not your primary documentation.
  5. Specify Blocked content — List any content type or URL pattern you do not want AI systems to use.
  6. Reference in robots.txt — Some sites add a comment in robots.txt pointing to llms.txt: # llms.txt: https://yoursite.com/llms.txt
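
If you would rather generate the file at request time than commit a static file, a minimal Next.js App Router sketch could be used (a hypothetical app/llms.txt/route.ts; the page list here is illustrative):

// app/llms.txt/route.ts — serves GET /llms.txt as plain text
export function GET() {
  const body = [
    '# SitemapFixer',
    '',
    '> SitemapFixer is a technical SEO tool that analyzes XML sitemaps and',
    '> identifies crawling and indexing issues.',
    '',
    '## Docs',
    '- [Sitemap Guide](/learn/what-is-an-xml-sitemap): Complete XML sitemap reference',
    '- [GSC Errors](/gsc-errors): Google Search Console error explanations',
  ].join('\n');

  return new Response(body, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

Either approach works; the static /public file is simpler, while the route handler is useful if the page list comes from a CMS or database.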

llms.txt vs. robots.txt: Which One Controls AI Access?

This is the most common source of confusion:

  • robots.txt is the mechanism for blocking crawlers. AI crawlers like GPTBot, ClaudeBot, and PerplexityBot respect robots.txt Disallow directives. If you want to block a specific bot, use robots.txt.
  • llms.txt is a contextual signal. It tells cooperative AI systems what your site contains and what you prefer. It is not a technical block — uncooperative scrapers will ignore it.
  • sitemap.xml is for search indexing. It tells Googlebot and other search crawlers which URLs to index. AI training crawlers may or may not read it.

Use all three files in concert. robots.txt for access control, sitemap.xml for search indexing, llms.txt for AI context. Each serves a different layer of web crawler communication.
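
A single robots.txt can reference all three layers. A sketch (the Sitemap directive is part of the sitemaps protocol; the llms.txt line is only a comment, since robots.txt defines no directive for it):

# llms.txt: https://yoursite.com/llms.txt

User-agent: GPTBot
Disallow: /pricing/
Disallow: /checkout/

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml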
