By SitemapFixer Team
Updated April 2026

GPTBot: How to Control OpenAI's Web Crawler


GPTBot is OpenAI's web crawler. It was first documented by OpenAI in August 2023 and crawls publicly available web content to improve future versions of OpenAI's AI models — including GPT-4 and its successors. GPTBot is distinct from the ChatGPT browsing feature (which uses a different bot user agent) and from the search crawlers used for ChatGPT search.

Understanding GPTBot is important for any site owner who wants control over whether their content is used for AI training, and what the implications of blocking it are for visibility in OpenAI's products.

GPTBot Technical Details

  • User agent token: GPTBot
  • Full user agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
  • Purpose: AI model training data collection
  • Operator: OpenAI
  • robots.txt support: Yes — respects Disallow directives
  • IP ranges documentation: Published at openai.com/gptbot
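If you want to detect GPTBot requests server-side (for logging or analytics), matching on the user agent token is more robust than comparing the full string, since the version number can change. A minimal sketch in Python; the `is_gptbot` helper is illustrative, not an OpenAI API:

```python
def is_gptbot(user_agent: str) -> bool:
    """Return True if a request's User-Agent header identifies GPTBot.

    Matches the bot token case-insensitively rather than the full string,
    so a future version bump (e.g. GPTBot/1.2) still matches.
    """
    return "gptbot" in user_agent.lower()

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
      "compatible; GPTBot/1.1; +https://openai.com/gptbot")
print(is_gptbot(ua))  # True
```

Keep in mind that user agent strings can be spoofed; for strict verification, cross-check the requesting IP against the ranges OpenAI publishes.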

What GPTBot Crawls and What It Skips

OpenAI states that GPTBot is designed to skip content that:

  • Requires payment or login to access (paywalled content)
  • Violates OpenAI's usage policies
  • Collects private information or personally identifiable information (PII)

In practice, GPTBot crawls publicly accessible pages much like Googlebot does — it follows links, reads HTML, and extracts text content. Unlike Googlebot, however, GPTBot does not render JavaScript, so dynamically rendered content (React, Vue, or Angular apps that rely on client-side rendering) may not be captured unless you serve pre-rendered HTML.

GPTBot does not read your XML sitemap automatically. It discovers pages primarily through link following. If you want to ensure certain pages are crawled, they must be linked from other crawlable pages — or accessible through other discovery mechanisms.

How to Block GPTBot

To block GPTBot entirely, add this to your robots.txt:

User-agent: GPTBot
Disallow: /

To allow GPTBot access to some pages but block others, use path-specific rules:

# Allow GPTBot on public guides, block on private/premium content
User-agent: GPTBot
Allow: /learn/
Allow: /blog/
Disallow: /dashboard/
Disallow: /checkout/
Disallow: /api/

OpenAI has confirmed that GPTBot respects robots.txt directives. You can verify GPTBot traffic in your server access logs by filtering for the GPTBot user agent string. Some CDNs (Cloudflare, Fastly) also have built-in GPTBot blocking options in their bot management dashboards.
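As a sketch of the log-checking step, the following Python tallies the paths GPTBot requested from access-log lines in the common/combined format (the `count_gptbot_hits` helper and the sample lines are illustrative; your server's log format may differ):

```python
import re
from collections import Counter
from typing import Iterable

def count_gptbot_hits(log_lines: Iterable[str]) -> Counter:
    """Tally requested paths for access-log lines whose UA mentions GPTBot."""
    hits = Counter()
    # Combined log format: the request line is the first quoted field.
    request_re = re.compile(r'"(?:GET|POST|HEAD) (\S+)')
    for line in log_lines:
        if "GPTBot" not in line:
            continue
        m = request_re.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

# Fabricated example lines for illustration:
sample = [
    '1.2.3.4 - - [01/Apr/2026] "GET /blog/post HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; '
    'GPTBot/1.1; +https://openai.com/gptbot"',
    '5.6.7.8 - - [01/Apr/2026] "GET /about HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(count_gptbot_hits(sample))  # only /blog/post is counted
```

In practice you would feed it the lines of your real access log, e.g. `count_gptbot_hits(open("/var/log/nginx/access.log"))`.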

ChatGPT Search vs. GPTBot: What's the Difference?

GPTBot is for training data collection. Separately, OpenAI's ChatGPT search feature uses a different crawler called OAI-SearchBot (user agent: OAI-SearchBot) to retrieve real-time web content when users ask ChatGPT to search the web. If you block GPTBot but not OAI-SearchBot, your content can still appear in ChatGPT's live web browsing responses.

This distinction matters for SEO strategy. Blocking GPTBot affects AI training but not real-time ChatGPT search results. If you want visibility in ChatGPT's web search answers, you need to allow OAI-SearchBot (or at minimum, not Disallow it). To block real-time ChatGPT search specifically, add a separate robots.txt rule for OAI-SearchBot.
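For example, to opt out of model training while staying visible in ChatGPT's live search results, give each bot its own rule group (the Allow: / line is technically optional, since a bot with no matching rules is allowed by default, but it makes the intent explicit):

```text
# Opt out of AI training, stay visible in ChatGPT search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```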

  • GPTBot (user agent token: GPTBot): training data for future GPT models
  • ChatGPT-User (user agent token: ChatGPT-User): browsing during live ChatGPT conversations
  • OAI-SearchBot (user agent token: OAI-SearchBot): the ChatGPT search index

What Happens When You Block GPTBot?

Blocking GPTBot means your content will not be used to train future versions of GPT models. It does not mean your content disappears from existing ChatGPT models — if your content was already crawled before you added the Disallow rule, it may already be in training data from earlier crawls.

Blocking GPTBot also does not affect your Google search rankings. Googlebot and GPTBot are separate bots with completely independent crawl budgets and robots.txt rule sets. A Disallow for GPTBot has zero effect on Googlebot.

The practical trade-off: allowing GPTBot means your content could help train future OpenAI models, which may (in theory) result in more accurate responses about your topic from ChatGPT — though this causal link is very indirect. Blocking it means you retain full control over your content's use in AI training, at the cost of that potential future influence.

GPTBot and Your XML Sitemap

Unlike Googlebot, GPTBot does not read your XML sitemap as part of its crawl workflow. GPTBot discovers pages primarily through link following — it starts from a seed set of URLs and follows hyperlinks to find new content, much like a traditional web crawler. If a page is not linked from anywhere GPTBot can reach, it may never be crawled regardless of whether the URL is in your sitemap.

This means your sitemap strategy has no direct effect on GPTBot crawling. There is, however, an indirect relationship: pages with strong internal linking and external backlinks are easier to reach by following links, so they are also more likely to be discovered by GPTBot through its link-following process.

If you want to proactively signal your content to AI crawlers, an emerging standard called llms.txt serves a similar purpose to sitemaps but for AI models. This is a plain text file at the root of your domain that lists your most important content with brief descriptions, formatted for AI consumption. It is not yet universally adopted, but major AI companies including Anthropic and some OpenAI tooling have begun supporting it.
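As an illustration only (the llms.txt format is a draft and still evolving; the site name and URLs below are placeholders), a minimal file starts with a title, a one-line summary, and annotated links to key pages:

```text
# Example Site

> Guides on sitemaps, robots.txt, and crawler management.

## Guides

- [GPTBot guide](https://example.com/learn/gptbot): Controlling OpenAI's crawler
- [Robots.txt basics](https://example.com/learn/robots-txt): Directive syntax with examples
```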

For most sites, the practical takeaway is: focus your sitemap and robots.txt efforts on Googlebot, since that directly impacts search rankings. For GPTBot specifically, the robots.txt rules you set are the most direct control mechanism available.

Verifying GPTBot in Your robots.txt

To verify your robots.txt is correctly configured for GPTBot:

  1. Open your robots.txt file at https://yoursite.com/robots.txt
  2. Look for a User-agent: GPTBot section
  3. Confirm the Disallow or Allow rules match your intent
  4. Validate the syntax with a robots.txt testing tool (Search Console's robots.txt report tests against Googlebot, but the directive syntax is the same for GPTBot)
  5. Check server logs for GPTBot user agent hits to confirm actual crawl behavior

Note: robots.txt rules are parsed per user agent token, case-insensitively. User-agent: GPTBot and User-agent: gptbot are equivalent.
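These checks can also be scripted. A minimal sketch using Python's standard-library urllib.robotparser; the rules and example.com URLs below are placeholders, so substitute your own robots.txt and paths:

```python
from urllib import robotparser

# Parse rules directly from lines. In practice, call rp.set_url(...) and
# rp.read() to fetch https://yoursite.com/robots.txt over the network.
rules = """\
User-agent: GPTBot
Allow: /learn/
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/learn/sitemaps"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/dashboard"))       # False
# No rules target Googlebot (and there is no User-agent: * group),
# so Googlebot is allowed by default:
print(rp.can_fetch("Googlebot", "https://example.com/dashboard"))    # True
```

Note that can_fetch matches the user agent token case-insensitively, consistent with how robots.txt groups are parsed.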

