htaccess noindex: Block Indexing via X-Robots-Tag in Apache
The HTML <meta name="robots" content="noindex"> tag only works on HTML pages. If you need to keep PDFs out of Google's index, block staging environments without a template change, or apply noindex rules to whole directories of files you cannot edit individually, the answer is the X-Robots-Tag HTTP response header — set in Apache via .htaccess. This guide covers the syntax, the modules it depends on, conditional patterns for files and paths, environment-aware rules for staging, every supported directive, how to verify with curl, the nginx equivalent, and the troubleshooting steps for the two failure modes that catch nearly everyone.
How X-Robots-Tag in .htaccess Works
X-Robots-Tag is an HTTP response header. When Apache returns a file — any file, HTML or otherwise — it sends a set of headers ahead of the body. Adding X-Robots-Tag: noindex to those headers tells crawlers that support the spec (Google, Bing, Yandex, DuckDuckGo) to drop the URL from their index, exactly as if a meta robots noindex tag were present in an HTML <head>. The crawler does not need to parse the body to find the directive — it sees the header during the HTTP response and applies the rule before deciding whether to download the rest of the file.
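To make this concrete, here is roughly what the response headers for a hypothetical /reports/q3.pdf would look like once a rule like the one below is in place (illustrative output, not captured from a real server):
HTTP/1.1 200 OK
Date: Tue, 01 Oct 2024 09:15:00 GMT
Server: Apache
Content-Type: application/pdf
Content-Length: 482113
X-Robots-Tag: noindex, nofollow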
Setting headers from .htaccess requires Apache's mod_headers module. mod_headers is bundled with Apache 2.x but is not always enabled by default — on Debian/Ubuntu you may need a2enmod headers && systemctl restart apache2; on shared hosts the module is usually pre-enabled but worth confirming. The basic syntax uses the Header directive:
# .htaccess — noindex every response from this directory and below
# Requires: mod_headers enabled, AllowOverride FileInfo or All
<IfModule mod_headers.c>
Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
The <IfModule mod_headers.c> wrapper is defensive: if the module is not loaded, the directive is skipped instead of throwing a 500 error. Header set writes the header unconditionally on every response — overwriting any previous value. Use Header append if you need to preserve existing X-Robots-Tag values written elsewhere; use Header always set if you want the header included on error responses (4xx, 5xx) too, which Apache otherwise omits for non-2xx replies.
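For reference, the three forms side by side. This is a sketch to show the differences; in practice you pick the one that matches your situation rather than stacking all three:
# Overwrite any existing X-Robots-Tag value (applied to successful responses)
Header set X-Robots-Tag "noindex, nofollow"
# Append to a value set elsewhere instead of replacing it
Header append X-Robots-Tag "noarchive"
# Set the header on error responses (4xx, 5xx) as well
Header always set X-Robots-Tag "noindex, nofollow"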
Why Use .htaccess Instead of an HTML Meta Tag
The meta robots tag is fine when you have full control over the HTML and the URL serves an HTML document. There are four common situations where it does not work, and X-Robots-Tag in .htaccess is the right answer:
1. Non-HTML files. PDFs, DOCX files, images, ZIP archives, RSS feeds, and JSON endpoints have no <head>. The only way to noindex them is via the HTTP response header. A common scenario: a documentation site that exposes /manuals/*.pdf versions of every help page. Without an X-Robots-Tag rule, the PDFs end up competing with the HTML versions in search results.
2. No edit access to HTML templates. Vendor-provided applications, legacy systems, or hosts where you have FTP access but cannot modify the deployed code — you still control .htaccess, so X-Robots-Tag becomes the only available lever.
3. Server-rendered output without an SEO layer. A custom CGI script or a small framework that generates HTML but has no convenient hook to inject meta tags into <head>. Adding a five-line .htaccess rule is faster than refactoring the rendering layer.
4. Bulk rules across hundreds of files. Noindex every file matching a pattern (every ?print=1 URL, every file in /internal/, every .zip download) without editing each one individually. .htaccess with FilesMatch or LocationMatch handles this in a single block.
Conditional Noindex: FilesMatch for File Types
Apply the X-Robots-Tag only to specific file extensions using <FilesMatch>. This is the standard pattern for keeping PDFs and other downloadables out of the index while leaving the surrounding HTML pages indexable:
# .htaccess — noindex PDFs, DOCX, and ZIP downloads only
# HTML pages in the same directory remain indexable
<IfModule mod_headers.c>
<FilesMatch "\.(pdf|docx?|xlsx?|zip|rar|tar\.gz)$">
Header set X-Robots-Tag "noindex, nofollow, noarchive"
</FilesMatch>
# Also noindex print-friendly variants
<FilesMatch "print\.html?$">
Header set X-Robots-Tag "noindex, follow"
</FilesMatch>
</IfModule>
The pattern inside <FilesMatch> is a regular expression matched against the filename only — not the full path. Escape literal dots with \., group alternatives with parentheses and |, and end with $ to anchor the match at the end of the filename. Using nofollow on PDFs is usually correct (you do not want crawlers consuming budget following links inside PDFs), and noarchive prevents Google from showing a cached copy of the file.
For images you want kept out of Google Images but still rendered on the page, use noimageindex instead — applied to the image file responses, this prevents the image from appearing as a standalone result while leaving the embedding page unaffected.
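A minimal sketch of that pattern, assuming common raster formats; adjust the extension list to whatever your site actually serves:
<IfModule mod_headers.c>
<FilesMatch "\.(png|jpe?g|gif|webp)$">
Header set X-Robots-Tag "noimageindex"
</FilesMatch>
</IfModule>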
Whole-Directory Noindex
To noindex every file in a directory and its subdirectories, drop a minimal .htaccess file at the directory root. This is the right pattern for internal tools at /admin/, customer dashboards at /account/, or any path you do not want appearing in Google's index:
# /var/www/example.com/admin/.htaccess
# Apply to every response served from /admin/ and below
<IfModule mod_headers.c>
Header set X-Robots-Tag "noindex, nofollow, nosnippet"
</IfModule>
# Belt-and-braces: also block at the robots.txt layer for crawl-budget savings
# (robots.txt blocks crawling; X-Robots-Tag handles already-discovered URLs)
A subtlety worth knowing: robots.txt Disallow alone does not remove URLs from the index — it only prevents fresh crawling. URLs already in Google's index, or URLs Google discovers via external links, can remain indexed even when blocked by robots.txt (showing up as "Indexed, though blocked by robots.txt"). For genuine de-indexing you must let Google crawl the URL and see the X-Robots-Tag noindex header. So either rely on X-Robots-Tag alone, or add the robots.txt block only after Google has had time to recrawl and process the noindex.
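If you do add the crawl block later, it is a two-line entry in robots.txt, shown here for the /admin/ path used above and appropriate only once the noindex has been crawled and processed:
User-agent: *
Disallow: /admin/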
LocationMatch and Path-Based Patterns
<LocationMatch> matches URL paths rather than filesystem filenames. It is the right tool when the URL path does not map directly to a file — typical for sites that route everything through index.php or use URL rewriting:
# Note: <LocationMatch> only works in httpd.conf or VirtualHost context
# In .htaccess, use <If> with a request URI test instead:
<IfModule mod_headers.c>
# Noindex all URLs containing a search query parameter
<If "%{QUERY_STRING} =~ /(^|&)q=/">
Header set X-Robots-Tag "noindex, follow"
</If>
# Noindex any URL under /tag/ or /author/ (typical thin-content patterns)
<If "%{REQUEST_URI} =~ m#^/(tag|author)/#">
Header set X-Robots-Tag "noindex, follow"
</If>
</IfModule>
The <If> directive (Apache 2.4+) supports expression syntax with =~ for regex matching against any request variable — %{REQUEST_URI}, %{QUERY_STRING}, %{HTTP_HOST}, headers, and more. This is more flexible than FilesMatch and works inside .htaccess, where LocationMatch does not.
Staging Server Noindex via .htaccess
Indexed staging environments are one of the most common ways internal data leaks into Google. .htaccess in the staging document root is a robust fix because it does not depend on the application or any environment variable inside it — Apache itself adds the header to every response, regardless of what the application returns:
# /var/www/staging.example.com/.htaccess
# Block every staging response from indexing — including 404s, redirects, JSON
<IfModule mod_headers.c>
Header always set X-Robots-Tag "noindex, nofollow, nosnippet, noarchive"
</IfModule>
# Optional: also require HTTP Basic Auth so the staging site is private
AuthType Basic
AuthName "Staging — Internal Only"
AuthUserFile /etc/apache2/.staging-htpasswd
Require valid-user
The always keyword is important here: without it, Apache only adds the header to successful (2xx) responses, so error pages and redirects from staging would go out without the noindex directive. Combining the X-Robots-Tag with HTTP Basic Auth is the safest belt-and-braces configuration — Basic Auth blocks anyone (including crawlers) from seeing the content, and the noindex header ensures that if the auth is ever removed or misconfigured, indexing stays blocked.
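The AuthUserFile referenced above must exist before the auth rules will admit anyone. A sketch of creating it with the htpasswd utility (part of apache2-utils on Debian/Ubuntu; the usernames are placeholders):
# Create the password file with a first user; -c creates the file
sudo htpasswd -c /etc/apache2/.staging-htpasswd staginguser
# Add further users without -c, which would otherwise overwrite the file
sudo htpasswd /etc/apache2/.staging-htpasswd anotheruser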
Environment-Based Rules with SetEnvIf
For more nuanced rules — only noindex when accessed from a particular hostname, only on dev branches, only for certain user agents — combine SetEnvIf (or SetEnvIfNoCase) with the env= conditional on Header:
# .htaccess — noindex only for hostnames matching staging/dev/preview patterns
# Production traffic on www.example.com is unaffected
<IfModule mod_headers.c>
# Mark any non-production hostname
SetEnvIfNoCase Host "^(staging|dev|preview|test)\." NONPROD=1
SetEnvIfNoCase Host "\.staging\.example\.com$" NONPROD=1
SetEnvIfNoCase Host "\.vercel\.app$" NONPROD=1
SetEnvIfNoCase Host "\.netlify\.app$" NONPROD=1
# Apply X-Robots-Tag only when NONPROD env var is set
Header set X-Robots-Tag "noindex, nofollow" env=NONPROD
</IfModule>
This pattern is particularly useful when staging, dev, and production all share the same codebase and document root configuration — the same .htaccess file is deployed everywhere, and the hostname check decides whether to apply the noindex. No need to remember to edit a separate config per environment, and no risk of accidentally shipping a noindex into production by checking in the wrong file.
You can also condition on user agent — for example, allowing only your monitoring crawler to index pages you would otherwise hide — though this is a rare requirement.
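If you ever do need it, the same env= mechanism works against the User-Agent header. A hypothetical sketch (the bot name is invented; env=! inverts the condition so every other client gets the noindex):
<IfModule mod_headers.c>
SetEnvIfNoCase User-Agent "InternalMonitorBot" ALLOWED_BOT=1
Header set X-Robots-Tag "noindex, nofollow" env=!ALLOWED_BOT
</IfModule>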
X-Robots-Tag Directive Variants
X-Robots-Tag supports the same directive vocabulary as the meta robots tag. The most useful values:
noindex — drop the URL from the search index. Existing index entries are removed on the next crawl-and-process cycle (typically days to weeks).
nofollow — do not follow any links found on the page (or in the file). Useful on PDFs and tag archives where the linked content is already crawled via other paths.
nosnippet — do not show a text snippet or video preview in search results. The URL can still appear; only the snippet is suppressed.
noarchive — do not show a cached copy. Note that Google Cache is largely deprecated, but the directive still suppresses any remaining cached views.
noimageindex — do not index images on the page (when applied to an HTML response) or do not index this image (when applied to an image response).
max-snippet:N — limit the text snippet to at most N characters. Common values: max-snippet:0 (no snippet) or max-snippet:160 (limit to one meta-description-equivalent line).
max-image-preview:none|standard|large — control image preview size in results.
unavailable_after:DATE — automatically deindex after a given RFC 850 date. Useful for time-limited content like event pages.
Multiple directives are comma-separated in a single header value. You can also target specific crawlers by prefixing the value with the crawler name: X-Robots-Tag: googlebot: noindex applies to Google only; without a prefix it applies to all bots.
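In .htaccess terms, a sketch of both ideas; the targeting split here is illustrative rather than a recommendation:
<IfModule mod_headers.c>
# Googlebot only: keep the URL out of Google's index
Header add X-Robots-Tag "googlebot: noindex, nofollow"
# All crawlers: suppress snippets and cached copies
Header add X-Robots-Tag "nosnippet, noarchive"
</IfModule>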
How to Verify with curl
After deploying any .htaccess change, verify the header is actually present in the response. curl -I shows response headers without downloading the body:
# Check headers on a single URL
curl -I https://example.com/admin/dashboard
# Look for: X-Robots-Tag: noindex, nofollow
# Check a PDF
curl -I https://example.com/docs/manual.pdf | grep -i x-robots-tag
# Bypass any CDN cache and hit the origin directly
curl -I --resolve example.com:443:203.0.113.42 https://example.com/admin/
# Sweep a list of URLs and report which lack X-Robots-Tag
while read URL; do
HDR=$(curl -sI "$URL" | grep -i x-robots-tag)
echo "$URL: ${HDR:-MISSING}"
done < urls.txt
# Test what Googlebot specifically sees
curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
https://example.com/admin/
If curl -I shows the header but Google Search Console's URL Inspection tool reports the URL as indexed without a noindex directive, the cause is usually a CDN caching the response from before the fix. Purge the CDN cache, then re-run URL Inspection's "Test live URL" — that bypasses Google's own cache and fetches the page fresh, which should show the X-Robots-Tag.
Nginx Equivalent
Nginx does not read .htaccess files — its configuration model is centralized in nginx.conf and included site files. The equivalent of the Apache rules above uses the add_header directive inside a server, location, or if block:
# /etc/nginx/sites-available/staging.example.com
server {
listen 443 ssl;
server_name staging.example.com;
# Whole-server noindex — applies to every response
add_header X-Robots-Tag "noindex, nofollow, nosnippet, noarchive" always;
# Per-location override for PDFs only
location ~* \.(pdf|docx?|zip)$ {
add_header X-Robots-Tag "noindex, nofollow, noarchive" always;
}
# Conditional on hostname (analogous to SetEnvIf). Note: nginx only allows
# add_header inside an "if" when that "if" sits in a location block, so wrap it:
location / {
if ($host ~* "^(staging|dev|preview)\.") {
add_header X-Robots-Tag "noindex, nofollow" always;
}
}
}
The always argument is the nginx equivalent of Apache's Header always set — without it, the header is only added on 2xx and 3xx responses. Note that nginx's add_header is inherited from outer blocks only if the inner block has no add_header of its own — defining one inside location replaces the parent's headers entirely. This is the most common nginx X-Robots-Tag bug; if you have headers at the server level and override them in a location, repeat them in the inner block.
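A minimal sketch of the trap, using placeholder paths and headers:
server {
add_header X-Robots-Tag "noindex, nofollow" always;
location /downloads/ {
# This location now sends only Cache-Control; the server-level
# X-Robots-Tag above is no longer added for /downloads/ responses
add_header Cache-Control "no-store";
}
}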
Troubleshooting: Two Failure Modes That Catch Everyone
If you deployed the .htaccess rule and curl -I does not show the X-Robots-Tag header, one of two things has gone wrong.
1. mod_headers is not loaded. The Header directive is only available when mod_headers is active. Confirm with:
# Check if mod_headers is loaded
apachectl -M | grep headers
# Should show: headers_module (shared)
# If not present, enable it (Debian/Ubuntu):
sudo a2enmod headers
sudo systemctl restart apache2
# Or on RHEL/CentOS, edit /etc/httpd/conf.modules.d/00-base.conf:
# Uncomment: LoadModule headers_module modules/mod_headers.so
sudo systemctl restart httpd
# Verify the .htaccess is being parsed at all by checking Apache logs:
tail -f /var/log/apache2/error.log
# Hit the URL and watch for any AH-prefixed warnings
The <IfModule mod_headers.c> wrapper means a missing module silently no-ops rather than throwing 500 — useful for resilience but it can mask the real problem. When debugging, temporarily remove the wrapper. If you then see a 500 error, the module is missing; if the rule still does not work, the problem is the second failure mode.
2. AllowOverride does not permit Header directives. Apache's .htaccess processing is gated by the AllowOverride setting in the parent VirtualHost or Directory block. If AllowOverride None is set, .htaccess files are ignored entirely. If AllowOverride permits some directive groups but not FileInfo (which includes Header), your rule is silently skipped:
# /etc/apache2/sites-available/example.com.conf
<VirtualHost *:443>
ServerName example.com
DocumentRoot /var/www/example.com
<Directory /var/www/example.com>
# Required for .htaccess Header directives to take effect
AllowOverride All
# Or, more narrowly:
# AllowOverride FileInfo Indexes Limit
Require all granted
</Directory>
</VirtualHost>
# After editing, reload Apache:
sudo apachectl configtest && sudo systemctl reload apache2
On shared hosting, you usually cannot edit the VirtualHost — but most shared hosts ship with AllowOverride All by default. If your X-Robots-Tag rule isn't taking effect on a shared host, contact support and ask them to confirm AllowOverride includes FileInfo for your account's directory.
Other gotchas worth checking: a CDN like Cloudflare can rewrite or strip headers — log in and verify under Rules > Transform Rules that no rule is removing X-Robots-Tag, and compare the response served through the CDN against one fetched straight from the origin (the curl --resolve technique shown earlier). A reverse proxy (nginx in front of Apache, or HAProxy) may rewrite or drop headers — test directly against the Apache port. And Header set in a parent .htaccess can be overridden by a child .htaccess, so if you have nested .htaccess files, audit them all.
Confirming Deindexation in Google Search Console
Once the X-Robots-Tag header is verified in curl -I, the next step is confirming Google sees it. Open Google Search Console, run URL Inspection on a representative page, and click "Test live URL". The HTTP Response section should list X-Robots-Tag: noindex among the response headers, and the indexing verdict should change to "Excluded by ‘noindex’ tag" on the next crawl.
For URLs already indexed, expect 1–4 weeks for them to drop from the index — Google needs to recrawl, see the new header, and process the deindex. To accelerate removal of high-priority URLs, use Search Console's Removals tool to request a temporary 6-month suppression while the noindex propagates. For wholesale audits of which pages are still indexed and which have been successfully deindexed, run a sitemap-wide check with SitemapFixer to flag any URL still returning index-eligible status.
Related Guides
- X-Robots-Tag: The Complete Guide to HTTP-Header SEO Directives
- Noindex Directives: Meta, HTTP Header, and robots.txt Compared
- robots.txt Noindex: Why It Stopped Working and What to Use Instead
- Canonical and Noindex Together: Conflicting Signals to Avoid
- De-indexing Pages: Step-by-Step Removal from Google's Index