How to Block Bad Bots: robots.txt, CDN, and IP Methods
Bad bots consume server bandwidth, scrape content for resale or AI training without consent, probe for vulnerabilities, and inflate analytics numbers. Unlike legitimate search crawlers (Googlebot, Bingbot), bad bots often ignore robots.txt, spoof user agents, and operate from residential IP pools to evade detection.
This guide covers the three layers of bot blocking: robots.txt for cooperative bots, CDN-level rules for volume control, and IP-level blocks for the most aggressive scrapers.
Layer 1: robots.txt (for Cooperative Bots)
robots.txt is a convention that well-behaved bots voluntarily follow. Legitimate crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, and most major AI crawlers — will honor Disallow rules. Bad bots (content scrapers, vulnerability scanners, spam bots) typically ignore robots.txt entirely. robots.txt is the right tool for cooperative bots, not adversarial ones.
Block all unknown bots, allow only known crawlers
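A minimal robots.txt sketch of the allowlist approach, assuming Googlebot and Bingbot are the only crawlers you want to admit; extend the allow list with every search or AI crawler you actually rely on:

```
# Crawlers you trust get their own groups.
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Everyone else (cooperative bots only) is asked to stay out.
User-agent: *
Disallow: /
```

A crawler obeys the most specific group matching its user agent token, so the named groups override the wildcard for those bots.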
Known bad bot user agents to block
These user agents are commonly associated with scrapers and low-quality crawlers. Add them to robots.txt as a first layer:
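The exact tokens are a judgment call; the group below is an illustrative starting point built from user agents commonly cited in scraper block lists, not a definitive roster:

```
# Commonly cited bulk crawlers / scrapers (verify each one against your own logs)
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PetalBot
Disallow: /
```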
Note: AhrefsBot, SemrushBot, and Moz crawlers are legitimate SEO tools, not malicious. Whether to block them is a business decision: allowing them keeps your pages and backlinks visible in their indexes, which matters if you or your partners rely on those tools for SEO analysis, while many site owners block them simply to reduce crawl overhead.
Layer 2: CDN and WAF Rules (for Volume Control)
CDN-level bot management operates at the network edge before requests reach your server. This is the right layer for bots that ignore robots.txt, high-volume scrapers, and bots that disguise themselves with common browser user agents.
Cloudflare Bot Management
Cloudflare's Bot Fight Mode (free tier) automatically challenges known bad bots. The Pro plan and above offer Bot Analytics showing which bot categories are hitting your site. Super Bot Fight Mode allows you to block verified bots, likely automated traffic, or specific crawler categories independently.
For targeted bot blocking via Cloudflare Firewall Rules (WAF):
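A sketch of a custom rule expression, assuming the user agent tokens above are the ones you want to target; create it under Security → WAF → Custom rules and set the action to Block or Managed Challenge:

```
(http.user_agent contains "Bytespider")
or (http.user_agent contains "PetalBot")
or (http.user_agent contains "python-requests")
```

On plans that expose the cf.client.bot field, appending "and not cf.client.bot" keeps verified crawlers out of the match.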
Cloudflare WAF rules are enforced regardless of what robots.txt says. A bot that ignores robots.txt will still hit the CDN firewall, which makes CDN rules the correct enforcement mechanism for non-cooperative bots.
Rate Limiting
Aggressive scrapers often make hundreds of requests per minute from a single IP. Rate limiting caps the request rate per IP address. In Cloudflare, set rate limits under Security → Rate Limiting. A common threshold: 100 requests per minute to your main content paths triggers a 60-second challenge. Googlebot and other legitimate crawlers respect crawl rate limits and will not trip well-configured rate rules.
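If you also want a backstop at the origin, here is a minimal nginx sketch of the same 100-requests-per-minute idea; the zone name, path, and numbers are illustrative:

```nginx
# In the http {} context: track request rate per client IP, capped at 100 requests/minute.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/m;

server {
    location / {
        # Allow short bursts, then answer 429 Too Many Requests instead of serving the page.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
    }
}
```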
Layer 3: IP Blocks (for Persistent Scrapers)
IP-level blocking is the bluntest and most absolute enforcement layer. Once a specific IP or CIDR range is blocked at the firewall or server level, the request never reaches your application. This is effective against persistent scrapers operating from known hosting providers or data center IP ranges.
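At the operating system firewall, the block can be as simple as the commands below; the addresses are documentation ranges (203.0.113.0/24, 198.51.100.0/24), so substitute the ranges you actually observe:

```
# iptables: drop everything from a CIDR range before it reaches the web server
sudo iptables -A INPUT -s 203.0.113.0/24 -j DROP

# ufw equivalent
sudo ufw deny from 198.51.100.0/24
```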
Blocking via Nginx
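A minimal sketch, assuming you have already identified the offending ranges and user agents; the IPs below are documentation addresses:

```nginx
server {
    listen 80;
    server_name example.com;

    # Hard IP blocks: matching requests never reach the application.
    deny 203.0.113.0/24;
    deny 198.51.100.7;
    allow all;

    # Secondary signal only: user agent strings are trivially spoofed.
    if ($http_user_agent ~* "(Bytespider|python-requests)") {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}
```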
Blocking AI crawler IP ranges
Major AI companies publish their crawler IP ranges. You can block specific AI crawlers at the IP level rather than relying on user agent matching (which can be spoofed). OpenAI publishes GPTBot IP ranges at openai.com/gptbot, and Anthropic publishes ClaudeBot IP ranges. Subscribe to these lists and update your firewall rules when they change — AI company IP ranges change more frequently than Googlebot ranges.
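A hedged Python sketch of turning a published range list into nginx deny rules; the URL and JSON shape below are assumptions, so adapt them to whatever format the vendor actually documents:

```python
import json
import urllib.request

# Hypothetical endpoint and schema - check the vendor's documentation for the real ones.
RANGES_URL = "https://openai.com/gptbot.json"

def fetch_cidrs(url: str) -> list[str]:
    """Download a published IP range list and return its CIDR strings."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    # Assumed shape: {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, ...]}
    return [p["ipv4Prefix"] for p in data.get("prefixes", []) if "ipv4Prefix" in p]

if __name__ == "__main__":
    for cidr in fetch_cidrs(RANGES_URL):
        print(f"deny {cidr};")  # redirect the output into an nginx include file
```

Run it on a schedule (cron or a CI job) and reload nginx when the generated file changes.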
How to Identify Bad Bots in Your Logs
Check your server access logs or CDN analytics for these signals (a short log-scanning sketch follows the list):
- High request rate from a single IP — legitimate crawlers are slower; scrapers hit dozens of pages per second
- No referrer, unusual user agent — bots often have blank referer headers and generic or oddly versioned user agents
- Requests for non-existent paths — vulnerability scanners probe for /wp-admin, /phpinfo.php, /.env, etc.
- Sequential page crawling — going through pages in exact numeric order (/?page=1, /?page=2...) suggests a scraper not following normal link structure
- IP from known hosting/datacenter ranges — crawls originating from DigitalOcean, AWS, Hetzner, or Linode IP space are commonly scrapers; Googlebot and Bingbot publish their own documented ranges
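As referenced above, a short log-scanning sketch; it assumes the common/combined access log format in which the client IP is the first field on each line:

```python
from collections import Counter
import sys

# Count requests per client IP in an access log (combined format: IP is the first field).
counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        counts[line.split(" ", 1)[0]] += 1

# Print the noisiest clients; anything far above a normal per-visitor rate deserves a closer look.
for ip, hits in counts.most_common(20):
    print(f"{hits:>8}  {ip}")
```

Usage: python top_ips.py /var/log/nginx/access.log, then cross-check the top IPs against hosting-provider ranges and your CDN analytics.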
What Not to Block: Keeping Good Crawlers Happy
Be careful not to over-block. Blocking Googlebot will tank your search rankings. Blocking Bingbot removes you from Bing. Even blocking some AI crawlers has implications — blocking OAI-SearchBot prevents your content from appearing in ChatGPT search results. Always verify which user agent you are targeting before adding a Disallow or firewall rule.
The safest pattern: use specific user agent tokens rather than broad patterns. Disallow: / under User-agent: BadBot is safer than a wildcard pattern that might accidentally match legitimate crawlers.
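As a concrete illustration of the specific-token pattern (BadBot is a placeholder name, not a real crawler):

```
# Only the named crawler is affected; every other user agent keeps its normal access.
User-agent: BadBot
Disallow: /
```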