GPTBot: How to Control OpenAI's Web Crawler
GPTBot is OpenAI's web crawler. It was first documented by OpenAI in August 2023 and crawls publicly available web content to improve future versions of OpenAI's AI models — including GPT-4 and its successors. GPTBot is distinct from the ChatGPT browsing feature (which uses a different bot user agent) and from the search crawlers used for ChatGPT search.
Understanding GPTBot is important for any site owner who wants to control whether their content is used for AI training, and who wants to know what blocking it means for visibility in OpenAI's products.
GPTBot Technical Details
| Property | Value |
|---|---|
| User agent token | GPTBot |
| Full user agent string | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot |
| Purpose | AI model training data collection |
| Operator | OpenAI |
| robots.txt support | Yes — respects Disallow directives |
| IP ranges documentation | Published at openai.com/gptbot |
What GPTBot Crawls and What It Skips
OpenAI states that GPTBot is designed to skip content that:
- Requires payment or login to access (paywalled content)
- Violates OpenAI's usage policies
- Collects private information or personally identifiable information (PII)
In practice, GPTBot crawls publicly accessible pages much like Googlebot does: it follows links, reads HTML, and extracts text content. Unlike Googlebot, however, GPTBot does not render JavaScript, which means dynamically rendered content (React, Vue, or Angular apps with client-side rendering) may not be captured unless you serve pre-rendered HTML.
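One common workaround is user-agent-based pre-rendering. The sketch below is hypothetical and simplified (the snapshot and SPA shell strings are placeholders): it serves static HTML to GPTBot while normal visitors get the JavaScript app shell.

```python
# Hypothetical sketch: because GPTBot does not render JavaScript, a site with
# client-side rendering can serve crawlers a pre-rendered HTML snapshot.
# PRERENDERED and SPA_SHELL are illustrative placeholders, not a real site.
PRERENDERED = "<html><body><h1>Full article text</h1></body></html>"
SPA_SHELL = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'

def response_body(user_agent: str) -> str:
    """Return the pre-rendered snapshot for GPTBot, the normal SPA shell otherwise."""
    return PRERENDERED if "GPTBot" in user_agent else SPA_SHELL

gptbot_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
             "compatible; GPTBot/1.1; +https://openai.com/gptbot")
print("Full article" in response_body(gptbot_ua))  # True
```

In a real deployment this branch would live in your web server or edge middleware, keyed on the `User-Agent` request header.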
GPTBot does not read your XML sitemap automatically. It discovers pages primarily through link following. If you want to ensure certain pages are crawled, they must be linked from other crawlable pages — or accessible through other discovery mechanisms.
How to Block GPTBot
To block GPTBot entirely, add this to your robots.txt:
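```
User-agent: GPTBot
Disallow: /
```

This two-line rule tells GPTBot not to crawl any path on the site.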
To allow GPTBot access to some pages but block others, use path-specific rules:
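For example, the following rules (the directory paths are placeholders) block GPTBot from two sections while leaving the rest of the site crawlable:

```
User-agent: GPTBot
Disallow: /private/
Disallow: /drafts/
```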
OpenAI has confirmed that GPTBot respects robots.txt directives. You can verify GPTBot traffic in your server access logs by filtering for the GPTBot user agent string. Some CDNs (Cloudflare, Fastly) also have built-in GPTBot blocking options in their bot management dashboards.
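As a quick way to inspect that traffic, the sketch below counts GPTBot requests per URL from combined-format access-log lines. The sample lines are illustrative; in practice you would read them from your server's log file.

```python
# Count GPTBot hits per request path from access-log lines
# (combined log format). The sample lines below are illustrative.
from collections import Counter

log_lines = [
    '20.15.240.64 - - [10/May/2024:10:01:00 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"',
    '66.249.66.1 - - [10/May/2024:10:01:05 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '20.15.240.64 - - [10/May/2024:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"',
]

hits = Counter(
    line.split('"')[1].split()[1]  # request path from the quoted "GET /path HTTP/1.1" field
    for line in log_lines
    if "GPTBot" in line
)
print(hits)  # Counter({'/blog/post-1': 1, '/pricing': 1})
```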
ChatGPT Search vs. GPTBot: What's the Difference?
GPTBot is for training data collection. Separately, OpenAI's ChatGPT search feature uses a different crawler called OAI-SearchBot (user agent: OAI-SearchBot) to retrieve real-time web content when users ask ChatGPT to search the web. If you block GPTBot but not OAI-SearchBot, your content can still appear in ChatGPT's live web browsing responses.
This distinction matters for SEO strategy. Blocking GPTBot affects AI training but not real-time ChatGPT search results. If you want visibility in ChatGPT's web search answers, you need to allow OAI-SearchBot (or at minimum, not Disallow it). To block real-time ChatGPT search specifically, add a separate robots.txt rule for OAI-SearchBot.
| Bot | User Agent | Purpose |
|---|---|---|
| GPTBot | GPTBot | Training data for future GPT models |
| ChatGPT-User | ChatGPT-User | Browsing during live ChatGPT conversations |
| OAI-SearchBot | OAI-SearchBot | ChatGPT search index |
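The difference in the table above can be simulated in code. Python's standard-library `urllib.robotparser` applies per-agent groups the same way these crawlers do, so you can check the effect of blocking GPTBot for training while leaving OAI-SearchBot open (the robots.txt body below is an example):

```python
# Simulate a robots.txt that blocks training (GPTBot) but allows
# ChatGPT search (OAI-SearchBot), using the standard-library parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))         # False: excluded from training
print(rp.can_fetch("OAI-SearchBot", "https://example.com/article"))  # True: still searchable
```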
What Happens When You Block GPTBot?
Blocking GPTBot means your content will not be used to train future versions of GPT models. It does not mean your content disappears from existing ChatGPT models — if your content was already crawled before you added the Disallow rule, it may already be in training data from earlier crawls.
Blocking GPTBot also does not affect your Google search rankings. Googlebot and GPTBot are separate bots with completely independent crawl budgets and robots.txt rule sets. A Disallow for GPTBot has zero effect on Googlebot.
The practical trade-off: allowing GPTBot means your content could help train future OpenAI models, which may (in theory) result in more accurate responses about your topic from ChatGPT — though this causal link is very indirect. Blocking it means you retain full control over your content's use in AI training, at the cost of that potential future influence.
GPTBot and Your XML Sitemap
Unlike Googlebot, GPTBot does not read your XML sitemap as part of its crawl workflow. GPTBot discovers pages primarily through link following — it starts from a seed set of URLs and follows hyperlinks to find new content, much like a traditional web crawler. If a page is not linked from anywhere GPTBot can reach, it may never be crawled regardless of whether the URL is in your sitemap.
This means your sitemap strategy has no direct effect on GPTBot crawling. However, there is an indirect relationship: pages with strong internal linking and external backlinks (which signal importance to Googlebot and influence how well-linked a page is in general) are also more likely to be discovered by GPTBot through its link-following process.
If you want to proactively signal your content to AI crawlers, an emerging standard called llms.txt serves a similar purpose to sitemaps but for AI models. This is a plain text file at the root of your domain that lists your most important content with brief descriptions, formatted for AI consumption. It is not yet universally adopted, but major AI companies including Anthropic and some OpenAI tooling have begun supporting it.
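A minimal llms.txt following the llmstxt.org proposal might look like this (the site name, section, and URLs are placeholders):

```
# Example Site

> A one-line summary of what the site covers.

## Guides

- [GPTBot overview](https://example.com/gptbot): How to control OpenAI's crawler
```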
For most sites, the practical takeaway is: focus your sitemap and robots.txt efforts on Googlebot, since that directly impacts search rankings. For GPTBot specifically, the robots.txt rules you set are the most direct control mechanism available.
Verifying GPTBot in Your robots.txt
To verify your robots.txt is correctly configured for GPTBot:
- Open your robots.txt file at https://yoursite.com/robots.txt
- Look for a `User-agent: GPTBot` section
- Confirm the Disallow or Allow rules match your intent
- Validate the syntax with a robots.txt checker (Search Console's robots.txt report shows how Google parses the file; the same syntax rules apply to GPTBot)
- Check server logs for GPTBot user agent hits to confirm actual crawl behavior
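The first checks above can be sketched as a small script. The parser below is deliberately simplified for illustration (it handles one `User-agent` line per group and is not a full robots.txt implementation):

```python
# Scan a robots.txt body for the GPTBot group (token matched
# case-insensitively) and list its Allow/Disallow rules.
# Simplified: assumes one User-agent line per group.
def gptbot_rules(robots_txt: str) -> list[str]:
    rules, in_group = [], False
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("user-agent:"):
            token = line.split(":", 1)[1].strip().lower()
            in_group = token == "gptbot"
        elif in_group and line.lower().startswith(("allow:", "disallow:")):
            rules.append(line)
    return rules

sample = """User-agent: gptbot
Disallow: /private/
Allow: /blog/

User-agent: *
Disallow:
"""
print(gptbot_rules(sample))  # ['Disallow: /private/', 'Allow: /blog/']
```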
Note: robots.txt rules are parsed per user agent token, case-insensitively. `User-agent: GPTBot` and `User-agent: gptbot` are equivalent.
Related Guides
- ClaudeBot: Anthropic's Three-Bot Crawling Framework
- PerplexityBot: What It Is and How to Block It
- Sitemaps and Google AI Overviews: What You Need to Know
- llms.txt: The Emerging Standard for AI Crawler Guidance
- robots.txt Complete Guide: Syntax, Testing, and Best Practices
- Google Crawlers: Complete List and User Agents