PerplexityBot: What It Crawls and How to Block It
PerplexityBot is the web crawler operated by Perplexity AI, the AI-powered answer engine. It crawls the web to build the index that Perplexity uses to answer user queries with cited sources. Unlike training-focused crawlers like GPTBot or ClaudeBot, PerplexityBot is primarily a search indexing crawler — its purpose is closer to Googlebot than to AI training bots.
PerplexityBot became controversial in 2024 when researchers and publishers discovered it was crawling websites that had explicitly blocked it in robots.txt. This raised serious questions about Perplexity's robots.txt compliance that persist into 2026. This guide covers the facts, the controversy, and what you can actually do to control PerplexityBot's access.
PerplexityBot Technical Details
| Property | Value |
|---|---|
| User agent token (for robots.txt) | PerplexityBot |
| Full user agent string | Mozilla/5.0 ... (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) |
| Purpose | Search index for Perplexity AI answer engine |
| Operator | Perplexity AI |
| robots.txt compliance | Disputed (see controversy section below) |
What PerplexityBot Crawls
PerplexityBot crawls publicly accessible web pages to index their content for Perplexity's answer engine. When a user asks Perplexity a question, Perplexity retrieves relevant pages from its index and synthesizes an answer with citations. Your page being in Perplexity's index means your content could appear as a source in Perplexity responses — with attribution links back to your site.
Unlike pure training crawlers, PerplexityBot has a direct and visible effect on whether your site appears as a cited source in Perplexity answers. This is analogous to Googlebot's role in determining whether your page ranks in Google search — except Perplexity displays fewer results and credits sources more prominently.
PerplexityBot crawls text content, metadata, and structured data. It handles JavaScript rendering to varying degrees, so content served as static HTML is more reliably indexed. An XML sitemap may help Perplexity discover your pages, but whether PerplexityBot follows sitemaps is not officially confirmed for all versions of the bot.
The robots.txt Compliance Controversy
In June 2024, Wired and other publications reported that PerplexityBot was crawling websites that had explicitly blocked it in robots.txt. The reports included technical evidence: server logs showing PerplexityBot user agent strings on sites whose robots.txt contained a "User-agent: PerplexityBot" group with the rule "Disallow: /".
Perplexity initially denied the reports, claiming that its bots respected robots.txt. Researchers pushed back with log evidence showing that the crawling continued after the robots.txt blocks were in place. A second mechanism was also identified: Perplexity appeared to be using third-party infrastructure (including some residential IP address pools) to fetch pages in ways that bypassed standard user agent checks.
Perplexity subsequently updated its robots.txt policy documentation and committed to stronger compliance. As of 2026, however, the practical situation is mixed: PerplexityBot's compliance is better than it was in 2024 but less uniformly reliable than that of Googlebot, GPTBot, or ClaudeBot, and some site owners still report crawling despite Disallow directives.
How to Block PerplexityBot
Add this to your robots.txt to block PerplexityBot:
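The rule is the same pair of directives quoted in the controversy section above, written one per line:

```
User-agent: PerplexityBot
Disallow: /
```

The file needs to be served as /robots.txt at the root of each host you want covered, and the token must match the user agent token listed in the technical details table.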
If you want to allow some paths but block others:
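A minimal sketch, assuming hypothetical /premium/ and /account/ directories stand in for whatever paths you want kept out of the index. Anything not listed remains crawlable:

```
User-agent: PerplexityBot
Disallow: /premium/
Disallow: /account/
```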
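Given the documented compliance issues, some site owners add IP-level blocks at their CDN or firewall in addition to robots.txt rules. Perplexity publishes its crawler IP ranges, and blocking those ranges is a harder technical barrier than robots.txt alone because it does not depend on the crawler choosing to comply. It only covers requests that originate from the published ranges, though: fetches routed through third-party infrastructure, as reported in 2024, would not be stopped by it.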
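Most CDNs and firewalls accept CIDR ranges directly, but the matching logic is easy to sketch in Python if you need it in application code or a log script. The ranges below are placeholders taken from the reserved documentation blocks, not Perplexity's real ranges, and the names `CRAWLER_RANGES` and `is_in_crawler_ranges` are invented for this example. Substitute the list Perplexity publishes.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges.
# These are NOT Perplexity's ranges; replace them with the published list.
CRAWLER_RANGES = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]

# Parse the ranges once so each lookup is only a membership test.
NETWORKS = [ipaddress.ip_network(cidr) for cidr in CRAWLER_RANGES]


def is_in_crawler_ranges(ip: str) -> bool:
    """Return True if the given IP string falls inside any configured range."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4 or IPv6 literal, so it cannot match a range.
        return False
    return any(address in network for network in NETWORKS)
```

The same check works in reverse for verification: a request that claims to be PerplexityBot but comes from outside the published ranges is either spoofed or arriving through other infrastructure.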
Should You Block PerplexityBot?
The decision depends on your content type and business goals:
Reasons to allow PerplexityBot
- Your site can appear as a cited source in Perplexity answers, which can drive referral traffic. Perplexity displays source attribution more prominently than Google AI Overviews.
- Perplexity has a large and growing user base of technical and research-oriented users — the same audience many B2B and SaaS sites want to reach.
- Being indexed by multiple AI search engines (Perplexity, ChatGPT search, Google) diversifies your traffic sources.
Reasons to block PerplexityBot
- You have paywalled or subscription content that should not be summarized freely in AI answers.
- You are concerned about content being reproduced in AI-generated answers without driving click-through (the "zero-click" concern).
- You object on principle to your content being used to power an AI service without compensation or opt-in.
- You have verified via logs that PerplexityBot is consuming significant crawl budget without commensurate referral traffic benefit.
PerplexityBot vs. Other AI Crawlers
| Bot | Primary use | robots.txt reliability | Citation in results |
|---|---|---|---|
| Googlebot | Search index | Excellent | Yes (AI Overviews) |
| GPTBot | AI training | Good | Indirect (future models) |
| ClaudeBot | AI training | Good | Indirect (future models) |
| PerplexityBot | AI search index | Variable | Yes (prominent attribution) |
Verifying PerplexityBot in Your Server Logs
To check if PerplexityBot is crawling your site:
- Access your server access logs (Apache, Nginx, or via your CDN dashboard)
- Filter for user agent strings containing "PerplexityBot"
- Check the source IP addresses against Perplexity's published IP ranges to confirm authenticity
- If your robots.txt has a Disallow for PerplexityBot and you still see crawl activity, you may be seeing the compliance issue documented in 2024
- Consider adding an IP-level block at your CDN or firewall if robots.txt-based blocking is insufficient
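The log check described in the steps above can be sketched in Python. This is an illustration under assumptions, not a tool Perplexity provides: it assumes the "combined" access log layout that Apache and Nginx commonly write, and the function names `perplexitybot_hits` and `disallowed_hits` are made up for this example. It finds requests whose user agent mentions PerplexityBot, then uses the standard library's `urllib.robotparser` to report which of them requested paths that your robots.txt disallows for that token.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches the common "combined" access log layout (an assumption about your logs).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def perplexitybot_hits(lines):
    """Yield (ip, path) for each request whose user agent names PerplexityBot."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "perplexitybot" in match["agent"].lower():
            yield match["ip"], match["path"]


def disallowed_hits(hits, robots_txt):
    """Return the hits that the given robots.txt text disallows for PerplexityBot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (ip, path)
        for ip, path in hits
        if not parser.can_fetch("PerplexityBot", path)
    ]
```

Passing in the lines of an access log and the text of your robots.txt returns the (IP, path) pairs that should not have been requested. Those IPs can then be compared against Perplexity's published ranges. Note that `urllib.robotparser` applies rules in file order, which can differ from the longest-match rule when Allow and Disallow lines overlap, so keep test files simple.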
Cloudflare's Bot Fight Mode and similar CDN bot management tools can identify and block PerplexityBot traffic at the infrastructure level, which is more reliable than robots.txt alone if you want hard enforcement.