How to Block Bad Bots: robots.txt, CDN, and IP Methods
Bad bots consume server bandwidth, scrape content for resale or AI training without consent, probe for vulnerabilities, and inflate analytics numbers. Unlike legitimate search crawlers (Googlebot, Bingbot), bad bots often ignore robots.txt, spoof user agents, and operate from residential IP pools to evade detection.
This guide covers the three layers of bot blocking: robots.txt for cooperative bots, CDN-level rules for volume control, and IP-level blocks for the most aggressive scrapers.
Layer 1: robots.txt (for Cooperative Bots)
robots.txt is a convention that well-behaved bots voluntarily follow. Legitimate crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, and most major AI crawlers — will honor Disallow rules. Bad bots (content scrapers, vulnerability scanners, spam bots) typically ignore robots.txt entirely. robots.txt is the right tool for cooperative bots, not adversarial ones.
Block all unknown bots, allow only known crawlers
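A minimal robots.txt sketch of the allowlist approach, assuming Googlebot and Bingbot are the only crawlers you want to admit; extend the allow list with every search or AI crawler you actually rely on:

```
# Crawlers you trust get their own groups.
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Everyone else (cooperative bots only) is asked to stay out.
User-agent: *
Disallow: /
```

A crawler obeys the most specific group matching its user agent token, so the named groups override the wildcard for those bots.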
Known bad bot user agents to block
These user agents are commonly associated with scrapers and low-quality crawlers. Add them to robots.txt as a first layer:
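The exact tokens are a judgment call; the group below is an illustrative starting point built from user agents commonly cited in scraper block lists, not a definitive roster:

```
# Commonly cited bulk crawlers / scrapers (verify each one against your own logs)
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PetalBot
Disallow: /
```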
Note: AhrefsBot, SemrushBot, and Moz crawlers are legitimate SEO tools, not malicious. Whether to block them is a business decision: allowing them keeps your pages and backlinks visible in their indexes, which matters if you or your partners rely on those tools for SEO analysis, while many site owners block them simply to reduce crawl overhead.
Layer 2: CDN and WAF Rules (for Volume Control)
CDN-level bot management operates at the network edge before requests reach your server. This is the right layer for bots that ignore robots.txt, high-volume scrapers, and bots that disguise themselves with common browser user agents.
Cloudflare Bot Management
Cloudflare's Bot Fight Mode (free tier) automatically challenges known bad bots. The Pro plan and above offer Bot Analytics showing which bot categories are hitting your site. Super Bot Fight Mode allows you to block verified bots, likely automated traffic, or specific crawler categories independently.
For targeted bot blocking via Cloudflare Firewall Rules (WAF):
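A sketch of a custom rule expression, assuming the user agent tokens above are the ones you want to target; create it under Security → WAF → Custom rules and set the action to Block or Managed Challenge:

```
(http.user_agent contains "Bytespider")
or (http.user_agent contains "PetalBot")
or (http.user_agent contains "python-requests")
```

On plans that expose the cf.client.bot field, appending "and not cf.client.bot" keeps verified crawlers out of the match.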
Cloudflare WAF rules are enforced regardless of what robots.txt says. A bot that ignores robots.txt will still hit the CDN firewall, which makes CDN rules the correct enforcement mechanism for non-cooperative bots.
Rate Limiting
Aggressive scrapers often make hundreds of requests per minute from a single IP. Rate limiting caps the request rate per IP address. In Cloudflare, set rate limits under Security → Rate Limiting. A common threshold: 100 requests per minute to your main content paths triggers a 60-second challenge. Googlebot and other legitimate crawlers respect crawl rate limits and will not trip well-configured rate rules.
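If you also want a backstop at the origin, here is a minimal nginx sketch of the same 100-requests-per-minute idea; the zone name, path, and numbers are illustrative:

```nginx
# In the http {} context: track request rate per client IP, capped at 100 requests/minute.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/m;

server {
    location / {
        # Allow short bursts, then answer 429 Too Many Requests instead of serving the page.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
    }
}
```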
Layer 3: IP Blocks (for Persistent Scrapers)
IP-level blocking is the bluntest and most absolute enforcement layer. Once a specific IP or CIDR range is blocked at the firewall or server level, the request never reaches your application. This is effective against persistent scrapers operating from known hosting providers or data center IP ranges.
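At the operating system firewall, the block can be as simple as the commands below; the addresses are documentation ranges (203.0.113.0/24, 198.51.100.0/24), so substitute the ranges you actually observe:

```
# iptables: drop everything from a CIDR range before it reaches the web server
sudo iptables -A INPUT -s 203.0.113.0/24 -j DROP

# ufw equivalent
sudo ufw deny from 198.51.100.0/24
```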
Blocking via Nginx
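A minimal sketch, assuming you have already identified the offending ranges and user agents; the IPs below are documentation addresses:

```nginx
server {
    listen 80;
    server_name example.com;

    # Hard IP blocks: matching requests never reach the application.
    deny 203.0.113.0/24;
    deny 198.51.100.7;
    allow all;

    # Secondary signal only: user agent strings are trivially spoofed.
    if ($http_user_agent ~* "(Bytespider|python-requests)") {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}
```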
Blocking AI crawler IP ranges
Major AI companies publish their crawler IP ranges. You can block specific AI crawlers at the IP level rather than relying on user agent matching (which can be spoofed). OpenAI publishes GPTBot IP ranges at openai.com/gptbot, and Anthropic publishes ClaudeBot IP ranges. Subscribe to these lists and update your firewall rules when they change — AI company IP ranges change more frequently than Googlebot ranges.
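A hedged Python sketch of turning a published range list into nginx deny rules; the URL and JSON shape below are assumptions, so adapt them to whatever format the vendor actually documents:

```python
import json
import urllib.request

# Hypothetical endpoint and schema - check the vendor's documentation for the real ones.
RANGES_URL = "https://openai.com/gptbot.json"

def fetch_cidrs(url: str) -> list[str]:
    """Download a published IP range list and return its CIDR strings."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    # Assumed shape: {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, ...]}
    return [p["ipv4Prefix"] for p in data.get("prefixes", []) if "ipv4Prefix" in p]

if __name__ == "__main__":
    for cidr in fetch_cidrs(RANGES_URL):
        print(f"deny {cidr};")  # redirect the output into an nginx include file
```

Run it on a schedule (cron or a CI job) and reload nginx when the generated file changes.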
How to Identify Bad Bots in Your Logs
Check your server access logs or CDN analytics for these signals (a short log-scanning sketch follows the list):
- High request rate from a single IP — legitimate crawlers are slower; scrapers hit dozens of pages per second
- No referrer, unusual user agent — bots often have blank referer headers and generic or oddly versioned user agents
- Requests for non-existent paths — vulnerability scanners probe for /wp-admin, /phpinfo.php, /.env, etc.
- Sequential page crawling — going through pages in exact numeric order (/?page=1, /?page=2...) suggests a scraper not following normal link structure
- IP from known hosting/datacenter ranges — crawls originating from DigitalOcean, AWS, Hetzner, or Linode IP space are commonly scrapers; Googlebot and Bingbot publish their own documented ranges
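As referenced above, a short log-scanning sketch; it assumes the common/combined access log format in which the client IP is the first field on each line:

```python
from collections import Counter
import sys

# Count requests per client IP in an access log (combined format: IP is the first field).
counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        counts[line.split(" ", 1)[0]] += 1

# Print the noisiest clients; anything far above a normal per-visitor rate deserves a closer look.
for ip, hits in counts.most_common(20):
    print(f"{hits:>8}  {ip}")
```

Usage: python top_ips.py /var/log/nginx/access.log, then cross-check the top IPs against hosting-provider ranges and your CDN analytics.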
What Not to Block: Keeping Good Crawlers Happy
Be careful not to over-block. Blocking Googlebot will tank your search rankings. Blocking Bingbot removes you from Bing. Even blocking some AI crawlers has implications — blocking OAI-SearchBot prevents your content from appearing in ChatGPT search results. Always verify which user agent you are targeting before adding a Disallow or firewall rule.
The safest pattern: use specific user agent tokens rather than broad patterns. Disallow: / under User-agent: BadBot is safer than a wildcard pattern that might accidentally match legitimate crawlers.
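As a concrete illustration of the specific-token pattern (BadBot is a placeholder name, not a real crawler):

```
# Only the named crawler is affected; every other user agent keeps its normal access.
User-agent: BadBot
Disallow: /
```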