By SitemapFixer Team
Updated April 2026

Robots Noarchive: When to Block Google's Cached Page


The noarchive directive is one of the most misunderstood robots meta tags. Half the developers who reach for it think it prevents indexing. The other half think it blocks AI training data scraping. It does neither. This guide covers exactly what noarchive controls, why Google's 2024 removal of the public "Cached" link changed (but did not eliminate) its usefulness, and where it actually belongs in a robots strategy. If you want a broader view of HTTP-level robots directives, see the X-Robots-Tag guide.

What Noarchive Actually Does

The noarchive directive tells search engines not to display a cached copy of the page in search results. That is the entire scope. When Googlebot crawls a page with noarchive set, it still:

Crawls the page normally and follows links from it.
Renders JavaScript and processes structured data.
Includes the page in its index.
Ranks the page using all normal signals.
Generates a snippet for the SERP listing using on-page text and meta description.

What changes is one specific user-facing element: historically, Google's search results listings included a small dropdown arrow next to each result with a "Cached" link, which loaded a stored snapshot of the page hosted at webcache.googleusercontent.com. The noarchive directive suppressed that link. Without the directive, Google decided based on its own heuristics whether to expose a cached copy.

For Bing, the behavior is parallel: a cached link appears in the result's context menu, and noarchive suppresses it. Bing has not deprecated this feature — it remains live in 2026.

What Noarchive Does NOT Do

Most of the questions about noarchive are really questions about other directives that have been confused with it. Here is the explicit list of things noarchive does not do:

It does not prevent indexing. Use noindex for that. A page can be cached-blocked and indexed simultaneously, and that is the most common use case.

It does not prevent crawling. Use robots.txt with Disallow for that. Googlebot must crawl the page to even see the noarchive directive — by definition, the page is being read.

It does not prevent the snippet from showing. Use nosnippet, max-snippet:0, or the data-nosnippet attribute for that (examples follow this list). Google still extracts and displays text from the page in SERPs even with noarchive set.

It does not prevent AI training data ingestion. This is the most common misconception in 2026. The standard noarchive directive has no defined effect on Google-Extended, GPTBot, ClaudeBot, CCBot, or any AI training crawler. Those crawlers respect their own dedicated user-agent rules in robots.txt, not noarchive. See the section below on AI extensions for what actually works.

It does not prevent the Internet Archive's Wayback Machine from archiving your site. The Wayback Machine historically honored robots.txt exclusions for its ia_archiver user-agent, but announced in 2017 that it would stop relying on robots.txt; exclusion today goes through its own removal-request process, and noarchive has no effect on it.

It does not prevent third-party caching services like Google Translate's pass-through, Cloudflare's public-facing cache, or screenshot services from rendering your page.
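For reference, here is what the snippet controls mentioned above look like in practice. A minimal sketch; the text content is illustrative. Google documents data-nosnippet as valid on span, div, and section elements:

<!-- Suppress the SERP snippet entirely -->
<meta name="robots" content="nosnippet">

<!-- Same effect via a zero-length snippet cap -->
<meta name="robots" content="max-snippet:0">

<!-- Exclude only a specific passage from snippets -->
<div>
  Public summary text that may appear in snippets.
  <span data-nosnippet>Subscriber-only teaser that should never surface in a SERP.</span>
</div>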

The 2024 Google Cached Link Removal

In January 2024, Google announced the removal of the "Cached" link from search results. Danny Sullivan confirmed via X that the feature, which had existed for over 25 years, was being deprecated because "it was meant for helping people access pages when way back, you often couldn't depend on a page loading." The dropdown arrow next to result URLs no longer exposes a cached version on web search results.

This change has three implications for the noarchive directive:

The primary use case for noarchive is gone for Google. If your reason for setting noarchive was to prevent searchers from seeing your old content via the cached link, that link no longer exists in the consumer-facing SERP. The directive is functionally inert for that scenario in 2026.

The webcache.googleusercontent.com endpoint outlived the link only briefly. For a few months, power users could still construct https://webcache.googleusercontent.com/search?q=cache:yourdomain.com/page manually and retrieve cached copies, but Google retired the endpoint, along with the cache: search operator, later in 2024. There is no longer a public Google cache for noarchive to suppress.

Internal Google features still use cached copies. The "About this result" panel, snippets in Discover, and certain Bard/Gemini features reference Google's stored copies of pages. Some of these surfaces respect noarchive, though Google has not published a definitive list. If you have legal reasons to suppress cached snapshots in any Google-owned surface, setting the directive remains the documented way to signal intent.

Bing, Yandex, and DuckDuckGo are unaffected. Bing's cached pages feature is fully active in 2026 and respects noarchive. Yandex respects it identically. DuckDuckGo does not maintain its own cache — it pulls from Bing — so the directive's effect there is downstream of Bing's behavior.

HTML Meta Tag Syntax

The standard implementation is a single <meta> tag inside the <head> of the page. There are two forms — a generic robots directive that applies to all crawlers that respect it, and bot-specific directives that apply only to a named crawler.

<!-- Apply noarchive to all compliant crawlers -->
<meta name="robots" content="noarchive">

<!-- Apply only to Googlebot -->
<meta name="googlebot" content="noarchive">

<!-- Apply only to Bingbot -->
<meta name="bingbot" content="noarchive">

<!-- Combine with other directives (comma-separated) -->
<meta name="robots" content="index, follow, noarchive">

<!-- Index, follow links, no cache, no snippet -->
<meta name="robots" content="noarchive, nosnippet">

<!-- Pair with max-snippet for fine control -->
<meta name="robots" content="noarchive, max-snippet:160">

A few syntax notes that catch developers out: directives are case-insensitive, but stick to lowercase by convention. The content attribute takes a comma-separated list — never use semicolons or pipe characters. The default behavior when no robots tag is present is equivalent to index, follow with caching allowed, so you only need to specify what you want to change.

If you set both a generic robots tag and a bot-specific tag (e.g., googlebot) on the same page, do not assume every engine reconciles them the same way. Google documents that it combines the directives from both tags and applies the most restrictive set, so a generic noarchive still binds Googlebot even if the googlebot tag omits it; Bing, by contrast, treats the bot-specific tag as taking precedence for that bot. The portable pattern is to repeat every directive you still want in the bot-specific tag, so the outcome never depends on the reconciliation rule.
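A sketch of the risky form next to the portable one (hypothetical page head):

<!-- RISKY: whether this page stays cache-blocked for a given bot
     depends on that engine's reconciliation rule -->
<meta name="robots" content="noarchive">
<meta name="googlebot" content="index, follow">

<!-- PORTABLE: repeat the directive in the bot-specific tag -->
<meta name="robots" content="noarchive">
<meta name="googlebot" content="index, follow, noarchive">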

X-Robots-Tag HTTP Header (Non-HTML Files)

Meta tags only work in HTML. For PDFs, images, videos, JSON endpoints, and any other non-HTML resource, use the X-Robots-Tag response header. The directive syntax is identical to the meta tag content.

# Apache .htaccess — apply noarchive to all PDFs
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noarchive"
</FilesMatch>

# Apply to PDFs and DOCX with multiple directives
<FilesMatch "\.(pdf|docx)$">
  Header set X-Robots-Tag "noarchive, noindex"
</FilesMatch>

# Bot-specific via X-Robots-Tag
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "googlebot: noarchive"
</FilesMatch>

Nginx equivalent — note the use of add_header at the location or server block level. The always parameter ensures the header is sent even on non-200 responses (though for noarchive specifically, you only care about 200s):

# nginx.conf — apply noarchive to PDFs
location ~* \.pdf$ {
  add_header X-Robots-Tag "noarchive" always;
}

# Apply site-wide (overridden per-location if needed)
server {
  listen 443 ssl;
  server_name example.com;
  add_header X-Robots-Tag "noarchive" always;

  location /public/ {
    # Override: cacheable for marketing pages
    add_header X-Robots-Tag "all" always;
  }
}

# Bot-specific directive
location /financial-reports/ {
  add_header X-Robots-Tag "googlebot: noarchive, noindex" always;
  add_header X-Robots-Tag "bingbot: noarchive, noindex" always;
}

Verify the header is actually being sent — many configurations look correct but fail because of add_header inheritance rules in nginx (any add_header in a more specific block silently replaces all parent add_header directives unless you re-declare them). Run curl -I https://example.com/file.pdf and confirm X-Robots-Tag: noarchive appears in the response.
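A minimal sketch of that inheritance trap and its fix (the path is hypothetical):

server {
  listen 443 ssl;
  server_name example.com;
  add_header X-Robots-Tag "noarchive" always;

  location /downloads/ {
    # PITFALL: any add_header here discards ALL inherited add_header
    # directives from the server block, including X-Robots-Tag
    add_header Cache-Control "private" always;

    # FIX: re-declare every inherited header you still need
    add_header X-Robots-Tag "noarchive" always;
  }
}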

Next.js Metadata API Implementation

In Next.js App Router (13.4+), the canonical way to set robots directives is via the metadata export or generateMetadata function in page.tsx or layout.tsx. The framework will inject the corresponding <meta> tag into the SSR HTML automatically.

// app/legal/contract/page.tsx
import type { Metadata } from 'next';

export const metadata: Metadata = {
  title: 'Service Agreement',
  robots: {
    index: true,
    follow: true,
    noarchive: true,
    nosnippet: false,
    googleBot: {
      index: true,
      follow: true,
      noarchive: true,
      'max-snippet': 160,
    },
  },
};

export default function Page() {
  return <article>{/* contract content */}</article>;
}

For dynamic routes where the directive depends on data (e.g., paywalled articles), use generateMetadata:

// app/articles/[slug]/page.tsx
import type { Metadata } from 'next';
import { getArticle } from '@/lib/articles';

export async function generateMetadata(
  { params }: { params: { slug: string } }
): Promise<Metadata> {
  const article = await getArticle(params.slug);
  return {
    title: article.title,
    robots: {
      index: true,
      follow: true,
      noarchive: article.isPaywalled,
    },
  };
}

For non-HTML responses served via Next.js Route Handlers (e.g., generated PDFs from app/api/report/route.ts), set the X-Robots-Tag header on the NextResponse directly:

// app/api/report/[id]/route.ts
import { NextResponse } from 'next/server';
import { generatePdf } from '@/lib/pdf';

export async function GET(
  _: Request,
  { params }: { params: { id: string } }
) {
  const pdfBuffer = await generatePdf(params.id);
  return new NextResponse(pdfBuffer, {
    headers: {
      'Content-Type': 'application/pdf',
      'X-Robots-Tag': 'noarchive, noindex',
    },
  });
}

When to Use Noarchive

The directive is worth setting in a small number of well-defined scenarios. Outside these, it adds noise to your robots configuration without providing benefit.

Paywalled content. If your articles are behind a metered or hard paywall, you do not want a search engine cached copy bypassing the wall. Newspapers and SaaS knowledge bases routinely set noarchive on subscriber-only pages while still allowing indexing — combined with Google's Flexible Sampling (formerly First Click Free) signals, this is the standard pattern. Note: noarchive does not enforce the paywall, it just removes one bypass route.

Frequently changing data. Stock tickers, cryptocurrency prices, sports scores, real-time inventory pages — anything where a stale cached copy could mislead a user into making a decision based on out-of-date information. Setting noarchive avoids the situation where a user clicks the (now-removed for Google, still-present for Bing) cached link and sees yesterday's price.

Legal and compliance constraints. Pages displaying regulated financial disclosures, medical product information, or jurisdiction-restricted content sometimes have legal requirements that the "official" version is the one served live. A cached snapshot — which the regulator might pull months later in an audit — could show outdated terms. Setting noarchive documents intent and reduces the surface area of stale snapshots. Confirm with counsel whether this satisfies the specific regulatory requirement; it usually does not by itself.

Internal beta or staging-adjacent content. If a page must remain crawlable for some reason (perhaps it is a public preview deliberately not noindexed) but you want to minimize how widely a snapshot is preserved, noarchive reduces persistence somewhat — though, again, it does not stop the Internet Archive from grabbing it.

When NOT to Use Noarchive

The vast majority of sites should never set noarchive sitewide. Here are the wrong reasons to use it that I see most often:

To prevent AI training scraping. It does not work. GPTBot, ClaudeBot, Google-Extended, and similar crawlers ignore noarchive as a directive. What they do respect is robots.txt rules addressed to their specific user-agent; the proposed noai/noimageai meta tags are, at best, a voluntary signal (see next section).

To "hide" a page from Google. Hiding a page is what noindex does. noarchive still allows the page to be indexed and to appear in search results — just without the historical cached link. If your goal is removal, this is the wrong tool.

As a perceived ranking signal. There is no documented or observed ranking effect from setting noarchive. It is purely a display-control directive.

For copyright protection. The cached copy that noarchive blocks was a copy already created by Google during indexing. The directive only affects whether that copy is exposed in the SERP UI — Google still has the copy internally for its ranking and quality systems. If you have a copyright concern, the path is a DMCA takedown, not noarchive.

On marketing landing pages. Setting noarchive on pages where you want maximum visibility and shareability is purely a downside — Bing users searching for your brand may prefer the cached version when your site is briefly down, and you have removed that fallback for no benefit.

AI Training Implications and Noai/Noimageai

The directives that actually affect AI training crawlers in 2026 are not noarchive. They are bot-specific robots.txt rules and a small set of proposed extensions. Here is what each does:

GPTBot, ClaudeBot, Google-Extended, CCBot, PerplexityBot — the dominant AI training and inference crawlers — each have their own user-agent strings and are blocked exclusively via robots.txt:

# robots.txt — block major AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow normal search crawling to continue
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Note that Google-Extended is distinct from Googlebot — blocking the former opts you out of Gemini (formerly Bard) training without affecting search ranking. Similarly, Applebot-Extended controls Apple Intelligence training while regular Applebot continues to crawl for Spotlight and Siri.

The proposed noai and noimageai meta directives were introduced by DeviantArt in late 2022 and subsequently promoted by a coalition of artist platforms. They look like:

<meta name="robots" content="noai, noimageai">

Adoption is partial. As of 2026, no major AI lab has formally committed to honoring these tags as part of their default crawler behavior, although some research crawlers and ethically-conscious scrapers do. Setting them costs nothing and signals intent, but should not be relied on as enforcement. The combined approach — robots.txt for the named bots plus noai, noimageai meta as a layered signal — is what platforms publishing original creative work typically deploy.

For deeper specifics on AI crawler control, see the GPTBot guide and the ClaudeBot guide.

Interaction With Other Robots Directives

The robots directive vocabulary has roughly a dozen members and they combine in ways that occasionally surprise developers. Here is how noarchive interacts with the most commonly-paired directives:

noindex + noarchive — redundant but harmless. If a page is noindexed, it will not appear in search results, so there is nowhere for a cached link to appear in the first place. Some teams add it for documentation purposes (signaling that the page is sensitive); the parser will not complain.

noarchive + nosnippet — useful pair. Removes both the cached link (where it still exists) and the SERP description text. Common on legal terms pages where you want the page indexed (so it ranks for the company's "terms of service" query) but do not want either a snippet or cached snapshot.

noarchive + max-snippet:N — fine-grained control. max-snippet:160 caps SERP description length at 160 characters while noarchive independently blocks the cache. They do not conflict.

noarchive + max-image-preview:none — common for paywalled news. Suppresses both the cached page and any preview image in SERPs while still allowing indexing (see the combined example after this list).

noarchive + unavailable_after:[date] — useful for time-bound content like event pages. The page is indexed and cached-blocked until the date, after which Google treats it as noindexed.

robots.txt Disallow + noarchive meta — does not work as expected. If robots.txt blocks the URL, Googlebot never crawls it and never reads the meta tag. The page may still appear in SERPs as a URL-only result with no snippet, no cached link by default — and your noarchive directive is invisible. Always pair meta directives such as noindex or noarchive with URLs that crawlers are allowed to fetch, never with a robots.txt Disallow.
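Putting the pairings above into concrete tags. A sketch; the date value is illustrative (Google accepts RFC 822, RFC 850, or ISO 8601 formats for unavailable_after):

<!-- Legal terms page: indexed, but no snippet and no cached copy -->
<meta name="robots" content="noarchive, nosnippet">

<!-- Paywalled news article: capped snippet, no preview image, no cache -->
<meta name="robots" content="noarchive, max-snippet:160, max-image-preview:none">

<!-- Event page that should drop out of the index after the event -->
<meta name="robots" content="noarchive, unavailable_after: 2026-09-01">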

Verifying Noarchive Is Working

After deploying the directive, verify it is actually being delivered. Four checks to run:

# 1. Confirm meta tag is in raw HTML (not just JS-injected)
curl -s https://example.com/page | grep -i 'name="robots"'
# Expected: <meta name="robots" content="noarchive">

# 2. Confirm X-Robots-Tag header for non-HTML files
curl -I https://example.com/document.pdf | grep -i 'x-robots-tag'
# Expected: X-Robots-Tag: noarchive

# 3. Check what Googlebot sees (via Google Search Console)
# URL Inspection > View Crawled Page > More Info > HTTP Response
# Confirm both the meta tag and any X-Robots-Tag header are present

# 4. Audit at scale across a sitemap (extract <loc> URLs only,
#    so namespace URLs in the XML preamble are not crawled)
curl -s https://example.com/sitemap.xml | \
  grep -oE '<loc>[^<]+' | sed 's/<loc>//' | \
  while read -r url; do
    HAS_NOARCH=$(curl -s "$url" | grep -ci 'noarchive')
    echo "$url: noarchive=$HAS_NOARCH"
  done

The most common failure mode is client-side JavaScript injection: a single-page application sets the meta tag via document.head.appendChild after hydration, which means Googlebot's first-pass crawl of the raw HTML does not see it. The directive may still register on the second-pass render, but until then Google treats the page as cacheable. Always set robots directives in the server-rendered HTML — in Next.js, that means the metadata export, not useEffect.
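As an illustration of that failure mode, a sketch of the anti-pattern (the component name is hypothetical) next to the server-rendered fix:

// ANTI-PATTERN: the tag exists only after hydration, so the raw
// HTML that Googlebot fetches first carries no robots directive
'use client';
import { useEffect } from 'react';

export default function ClientRobots() {
  useEffect(() => {
    const meta = document.createElement('meta');
    meta.name = 'robots';
    meta.content = 'noarchive';
    document.head.appendChild(meta);
  }, []);
  return null;
}

// FIX: declare it server-side in page.tsx so it ships in the
// initial HTML:
//   export const metadata: Metadata = { robots: { noarchive: true } };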

SitemapFixer's bulk crawl flags pages where noarchive appears inconsistently (e.g., set on some product pages but missing on others in the same template), which is the most common rollout bug — a CMS template variant that is missing the metadata block.
