Robots.txt Guide: Write, Test, and Fix Your Robots.txt
What Is robots.txt and Why It Matters
The robots.txt file is a plain text file that lives at the root of your domain — always at https://yoursite.com/robots.txt. It uses a simple directive syntax to tell web crawlers which pages they are allowed to access and which they should skip. Every major search engine bot — Googlebot, Bingbot, and dozens of others — checks this file before crawling your site.
robots.txt matters for two reasons. First, it protects your crawl budget: Googlebot has a limited amount of time it allocates to crawling your site per day. If you waste that budget on admin pages, parameter variations, and internal search results, fewer of your important pages get crawled. Second, it prevents confusing signals: if Googlebot crawls near-duplicate pages created by URL parameters, it may struggle to determine which version to rank.
One critical point that many site owners misunderstand: robots.txt controls crawling, not indexing. If another website links to a URL you have blocked in robots.txt, Google can still index that URL based on the external link — it just cannot crawl and read the page content. This distinction is important and covered in detail below.
robots.txt Syntax: The Complete Reference
A robots.txt file is made up of groups. Each group starts with one or more User-agent lines identifying which crawler the rules apply to, followed by Disallow and Allow directives. Here is a complete example that shows all the key directives:
# Apply to all crawlers
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /cart/
Disallow: /checkout/
Allow: /admin/public/

# Block GPTBot from training on your content
User-agent: GPTBot
Disallow: /

# Point crawlers to your sitemap
Sitemap: https://yoursite.com/sitemap.xml
Key directives explained:
- User-agent: * — applies to all crawlers. Use a specific user agent name (like Googlebot or GPTBot) to target individual bots.
- Disallow: /path/ — blocks access to that path and all its sub-paths. An empty Disallow: means allow everything.
- Allow: /path/ — overrides a broader Disallow for a specific sub-path. When rules conflict, the longest (most specific) matching rule wins, and Allow takes precedence over Disallow at the same path length (illustrated after this list).
- Sitemap: URL — tells crawlers where your XML sitemap lives. You can have multiple Sitemap lines.
- # — comment lines are ignored by crawlers and are useful for documentation.
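To make the precedence rule concrete, here is a minimal sketch based on the example above:

User-agent: *
# Blocks /admin/ and every path beneath it
Disallow: /admin/
# Longer, more specific match, so this sub-path stays crawlable
Allow: /admin/public/

A URL like /admin/public/help.html matches both rules; because the Allow rule is the longer match, Googlebot is permitted to crawl it, while /admin/settings remains blocked.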
What to Block in robots.txt
Good robots.txt files block pages that should not be crawled but keep the doors open for everything that should be indexed. Here is what to block:
- Admin and login pages — /admin/, /wp-admin/, /login. These have no indexing value and waste crawl budget.
- Internal search results — /search?. Every unique search query creates a unique URL with near-duplicate or thin content.
- Cart and checkout pages — /cart/, /checkout/. These are transactional and should never be indexed.
- URL parameter variations — tracking parameters (?utm_source=), session IDs (?sessionid=), and sorting/filtering parameters that create duplicate versions of the same page (a sketch of the wildcard rules for these appears after this list).
- Staging and development directories — /staging/, /dev/. These should ideally be on separate domains with password protection, but robots.txt adds a secondary layer.
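As a sketch of the parameter case, the wildcard rules below block the tracking and session parameters named above; the parameter names are illustrative, so substitute whatever your own URLs actually use:

User-agent: *
# Block any URL containing a tracking parameter or session ID anywhere in its query string
Disallow: /*utm_source=
Disallow: /*sessionid=
# Block sorting variations of listing pages (first-parameter form shown)
Disallow: /*?sort=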
What you must never block: CSS files, JavaScript files, images used in content, and your sitemap. Googlebot needs CSS and JavaScript to render pages correctly; blocking them causes Google to see a broken version of your page, which can hurt rankings. Googlebot also cannot fetch a sitemap whose URL is itself blocked by a Disallow rule, even if a Sitemap directive points to it.
robots.txt and Your XML Sitemap
Your sitemap and robots.txt must be consistent. The rule is simple: never include a URL in your sitemap that is blocked by robots.txt. When Google sees a URL in your sitemap but cannot crawl it due to a robots.txt Disallow rule, it creates a conflicting signal. The URL is in the sitemap (suggesting it should be indexed) but is blocked (preventing Google from reading it). The result is wasted crawl budget and confused indexing signals.
The most common version of this conflict happens with URL parameter pages. A site might block Disallow: /?sort= to prevent sorting parameter pages from being crawled, but then have some of those URLs slip into the sitemap via an automated sitemap generator. SitemapFixer checks your sitemap against your robots.txt and flags any URLs that appear in the sitemap but are blocked by your robots.txt rules, so you can catch this conflict before it causes indexing problems.
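If you want to run this sitemap-versus-robots.txt check yourself, here is a minimal sketch using Python's standard-library urllib.robotparser. The domain and sitemap URL are placeholders, it assumes a single sitemap file rather than a sitemap index, and urllib.robotparser follows the original robots exclusion rules rather than Google's * and $ wildcard extensions, so treat the result as a rough first pass:

import urllib.request
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"          # placeholder domain
SITEMAP_URL = f"{SITE}/sitemap.xml"    # assumes a single, non-index sitemap

# Load and parse the live robots.txt
robots = RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()

# Fetch the sitemap and collect every <loc> URL
with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", ns) if loc.text]

# Flag any sitemap URL that Googlebot is not allowed to crawl
blocked = [url for url in urls if not robots.can_fetch("Googlebot", url)]
for url in blocked:
    print(f"Blocked by robots.txt but listed in sitemap: {url}")
print(f"{len(blocked)} of {len(urls)} sitemap URLs are blocked")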
robots.txt Does Not Prevent Indexing
This is one of the most important and most misunderstood facts about robots.txt. Blocking a URL in robots.txt prevents Googlebot from crawling that URL. It does not prevent Google from indexing it.
If another website links to your blocked page, Google can still add that URL to its index based on the external link signal alone — it just cannot read the page content since crawling is blocked. The result is a URL that appears in Google Search results with no title or description (Google shows a generic snippet instead), which is a poor user experience.
If you want to prevent a page from being indexed, use a noindex meta tag in the page's <head> or an X-Robots-Tag: noindex HTTP response header. The page must be crawlable for Google to see and respect the noindex signal — so you cannot use robots.txt and noindex together on the same page. If you block crawling with robots.txt, Google cannot read the noindex tag.
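For reference, the two noindex forms look like this. In the page's <head>:

<meta name="robots" content="noindex">

Or as an HTTP response header, which is useful for PDFs and other non-HTML files:

X-Robots-Tag: noindex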
How Google Processes robots.txt
Googlebot fetches and caches your robots.txt file approximately every 24 hours. Changes you make to robots.txt may not be reflected in Googlebot's behavior for up to a day. If you need immediate effect, you can request a robots.txt recrawl through Google Search Console.
Google supports wildcard matching with * and end-of-URL matching with $. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf. The * in User-agent: * applies to all crawlers that do not have a more specific rule.
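A few more pattern examples, using hypothetical paths:

User-agent: *
# Block every URL that ends in .pdf
Disallow: /*.pdf$
# Block any /temp/ directory that sits below another directory (e.g. /files/temp/)
Disallow: /*/temp/
# Block exactly /draft, but not /draft/ or /draft-post
Disallow: /draft$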
If your robots.txt returns a 404 (not found), Google treats it as if the file is empty — meaning all pages are crawlable. If your robots.txt returns a 5xx server error, Googlebot treats your site as temporarily inaccessible and will back off crawling until the file becomes accessible again. This means a broken robots.txt can effectively pause crawling of your entire site.
Testing Your robots.txt
Google Search Console includes a robots.txt tester under Settings. It lets you enter a URL and check whether Googlebot can crawl it based on your current robots.txt rules. Use this tool after any change to verify that you have not accidentally blocked important pages.
You can also test your robots.txt manually:
- Navigate to https://yoursite.com/robots.txt in a browser to confirm it is accessible and returns 200.
- Use the URL Inspection tool in Google Search Console for specific URLs — it shows whether Googlebot can access the URL and exactly which robots.txt rule is blocking it if access is denied.
- Use curl -I https://yoursite.com/robots.txt to check the HTTP status code from the command line.
Common errors to look for: syntax errors (missing colon after directive), a trailing space after a path, inconsistent capitalization, and Disallow rules that are too broad and match more paths than intended.
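For example, each of these lines looks plausible but does something different from what was probably intended (the paths are illustrative):

# Missing colon: the line is invalid and ignored, so /admin/ stays crawlable
Disallow /admin/
# No trailing slash: this also matches /administrator and /admin-news
Disallow: /admin
# One character too broad: this blocks the entire site
Disallow: /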
Common robots.txt Mistakes to Avoid
- Disallow: / — blocks your entire site from all crawlers. This is catastrophically bad and will cause all your pages to fall out of the index over time. Double-check your rules before saving.
- Blocking CSS and JavaScript — prevents Google from rendering your pages correctly. Google may misclassify well-designed pages as thin content if it cannot access the styling and scripts.
- Case sensitivity errors — on Linux servers, /Admin/ and /admin/ are different paths. Your robots.txt rules must match the exact case of your URL paths (see the sketch after this list).
- Missing Sitemap directive — not including your sitemap URL means search engines have to discover it through other means (like GSC submission). Always include Sitemap: https://yoursite.com/sitemap.xml.
- Using robots.txt as a security measure — robots.txt is publicly visible. Any URL you list in a Disallow rule is exposed to anyone who reads the file. Never use robots.txt to protect sensitive content — use proper server-side authentication instead.
- Canonical conflicts — disallowing a URL that is the canonical target of other pages breaks the canonical chain. Googlebot must be able to crawl the canonical URL to process the canonical signal.
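As a sketch of the case-sensitivity point, a site that serves both /Admin/ and /admin/ URLs needs a rule for each form:

User-agent: *
# Case-sensitive: this line does not block /admin/
Disallow: /Admin/
# Add the lowercase variant explicitly
Disallow: /admin/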
Related Guides
- Robots.txt Examples: WordPress, Shopify & More
- Crawl Budget: What It Is and How to Optimize It
- Google Not Crawling My Site? Here Are the Fixes
- Crawl Errors: Types, Causes, and How to Fix Each One
- Mobile-First Indexing: How to Prepare Your Site
- How to Find the Sitemap of Any Website
- X-Robots-Tag: HTTP Header for Non-HTML File Indexing Control
- WordPress robots.txt: The Complete Guide
- Robots.txt Disallow All: Block Every Crawler Safely
- Robots.txt Noindex: Why It No Longer Works
- Crawl Budget SEO: How to Optimize for Googlebot