By SitemapFixer Team
Updated April 2026

Crawling and Indexing in SEO: How They Differ and How to Control Each


Most SEO problems people describe as "Google is ignoring my page" are actually one of two distinct failures: a crawling failure or an indexing failure. The fixes are different, the GSC reports are different, and the controls are different. Treating them as one problem is why so many sites stay stuck — they noindex pages they meant to disallow, or disallow pages they meant to noindex, and the symptoms get worse instead of better. This guide separates the two cleanly: what each means, how Google does each, and how to control each independently.

Crawling vs Indexing: The Core Distinction

Crawling is fetching. Googlebot makes an HTTP request to a URL, downloads the HTML (and optionally the JavaScript, CSS, and images), and stores the raw response. Crawling is bounded by your server's response capacity, your robots.txt rules, and Google's allocated crawl budget for your domain. A successful crawl produces a 200 OK response with the page's HTML in Google's temporary storage.
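
A quick way to isolate the fetch step is to request the URL yourself with a crawler-like user agent and look at the status code and timing. This is a rough check, not a reproduction of Googlebot; the user agent string and URL are illustrative.

# Simulate the fetch half of the pipeline with curl
curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     -s -o /dev/null -w "HTTP %{http_code}  total %{time_total}s\n" \
     https://example.com/blog/new-post
# A fast 200 means the crawl side is healthy; rendering and indexing are separate steps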

Indexing is processing and storing. Google takes the crawled HTML, renders any JavaScript, parses the content, evaluates quality and duplication, applies meta directives (noindex, canonical), and decides whether to add the URL to its searchable database. Indexing is bounded by perceived content quality, duplicate detection, and Google's indexing capacity — which Google has openly stated is a finite resource it cannot spend on every URL it crawls.

This means four states are possible for any URL: not crawled and not indexed (Google has not seen it), crawled but not indexed (Google fetched but rejected), indexed but not recently crawled (in the database from an old fetch), and crawled and indexed (the working state). Each state has different fixes.

The Google Pipeline: Discover, Crawl, Index, Rank

Google describes its pipeline in four stages, and understanding which stage your URL is stuck at is the first step in any debugging session.

Stage 1 — Discovery. Google learns the URL exists. Sources include sitemap submissions, internal links from already-known pages, external backlinks, and direct submission via the URL Inspection tool. A URL that has not been discovered cannot be crawled — there is nothing for Googlebot to fetch.

Stage 2 — Crawling. Googlebot fetches the URL. The HTTP response code, response time, and HTML size are recorded. If the URL returns 4xx or 5xx, the crawl fails and is retried later with backoff. If robots.txt disallows the path, the URL is not fetched at all.
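
If you have access to server logs, you can see what requests identifying as Googlebot actually received at this stage. The sketch below assumes a standard combined-format access log at a typical nginx path; adjust both for your setup, and remember the user agent can be spoofed, so treat the output as indicative.

# Count the response codes served to requests identifying as Googlebot
grep "Googlebot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
# A spike in 5xx or 404 here explains retries, backoff, and reduced crawl rates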

Stage 3 — Indexing. Google processes the crawled HTML. JavaScript is rendered (in a second wave for JS-heavy pages). Canonical tags are evaluated, content quality is scored, duplicates are clustered, and a final decision is made: add to the index, or discard. The noindex directive is enforced here.

Stage 4 — Ranking. When a query is issued, indexed pages compete for positions in the SERP based on relevance, authority, freshness, and hundreds of other signals. A page that did not reach Stage 3 cannot appear in Stage 4 results — period.

How to Control Crawling: Robots.txt and Crawl Rate

Crawling is controlled at the path level via robots.txt, served from the root of your domain. A Disallow directive tells compliant crawlers (Googlebot, Bingbot) not to fetch matching paths. It does not affect indexing on its own — only the network fetch.

# https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Disallow: /*?sessionid=
# Keep JS and CSS crawlable so Google can render pages.
# Put these Allow rules in the same group: if Googlebot gets its own
# User-agent group, it follows only that group and ignores the Disallows above.
Allow: /*.js$
Allow: /*.css$

# Sitemap reference (helps with discovery, not crawl control)
Sitemap: https://example.com/sitemap.xml

Three rules to remember: robots.txt is path-pattern based, not URL-list based — one rule can cover thousands of URLs. It is advisory, not enforceable — major search engines respect it but malicious scrapers ignore it. And it does not remove URLs from the index — if you disallow a URL that is already indexed, it stays indexed but with a generic snippet, because Google can no longer fetch it to see updates.

How to Control Indexing: Noindex, Canonical, X-Robots-Tag

Indexing is controlled at the URL level via meta directives. Unlike robots.txt, these directives must be read by Google — which means the page must be crawlable. A page that is both disallowed and noindexed can stay indexed indefinitely: Google never crawls it, so it never sees the noindex tag.

Meta robots tag (HTML pages): Place inside <head>. Tells Google whether to index the page and whether to follow its outgoing links.

<!-- Block indexing, allow link following (most common pattern) -->
<meta name="robots" content="noindex, follow">

<!-- Block indexing AND link following -->
<meta name="robots" content="noindex, nofollow">

<!-- Default: index and follow (no tag needed, but explicit is fine) -->
<meta name="robots" content="index, follow">

<!-- Specific to Googlebot only -->
<meta name="googlebot" content="noindex, nosnippet">

X-Robots-Tag (any file type): Sent as an HTTP response header. Use this for non-HTML files like PDFs, images, or generated reports where you cannot inject a meta tag. Configure it at the server or CDN level.

# Apache .htaccess: noindex all PDF files
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>

# Nginx: noindex a specific path
location /reports/ {
  add_header X-Robots-Tag "noindex, follow" always;
}

# Verify with curl
curl -I https://example.com/reports/q4.pdf
# Look for: X-Robots-Tag: noindex, noarchive

Canonical tag (influences which version is indexed): When multiple URLs serve similar content, the rel="canonical" tag tells Google which one to keep in the index and consolidate signals to. It does not block indexing of duplicates — Google may still index a non-canonical version if signals conflict — but it strongly influences the choice. See the canonical tags guide for the full mechanics.
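
For reference, the tag sits in the <head> of every variant and points at the version you want indexed; the URLs here are illustrative.

<!-- On /products/widget?color=blue, /products/widget?ref=nav, and the clean URL itself -->
<link rel="canonical" href="https://example.com/products/widget">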

What Happens When Robots.txt Blocks an Already-Indexed Page

This is one of the most common SEO mistakes: a page is in the index, the SEO wants it removed, so they add a Disallow rule to robots.txt. The page does not get removed. Instead, it stays indexed in a half-broken state — Google retains the URL and basic metadata it had from the previous crawl, but cannot refresh the snippet, see content changes, or detect a noindex tag added later.

The page may show in search results with the message "A description for this result is not available because of this site's robots.txt" — which looks worse than not appearing at all. The fix sequence to actually remove an indexed page:

Step 1: Allow crawling — remove the Disallow rule from robots.txt.
Step 2: Add <meta name="robots" content="noindex"> to the page.
Step 3: Wait for Google to recrawl (1–4 weeks for low-priority URLs, 1–3 days for high-priority). Once Google sees the noindex tag, the URL is removed from the index.
Step 4 (optional): Re-add the robots.txt disallow to save crawl budget on the now-deindexed page.
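
You can verify steps 1 and 2 from the command line before waiting on Google; the page URL below is a placeholder for the one you are removing.

# Step 1 check: the path should no longer appear in a Disallow rule
curl -s https://example.com/robots.txt | grep -i "disallow"
# Step 2 check: the noindex directive should be visible in the raw HTML or the headers
curl -s https://example.com/old-page | grep -i '<meta name="robots"'
curl -sI https://example.com/old-page | grep -i "x-robots-tag"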

For urgent removals (leaked private data, accidentally exposed admin pages), use the GSC Removals tool to suppress the URL from search results within hours while you complete the noindex steps.

Reading GSC Status Messages: "Discovered" vs "Crawled" — Currently Not Indexed

Google Search Console's Pages report (Indexing → Pages) categorizes non-indexed URLs into specific statuses. Two of the most common — and most confused — are:

"Discovered - currently not indexed" means Google has heard about the URL (from your sitemap or an internal link) but has not yet crawled it. The URL is sitting in the discovery queue waiting for a crawl slot. Causes: Google deprioritizes the URL because the site has weak authority, the URL appears low-importance based on internal linking, or your site has crawl capacity issues (slow responses, frequent 5xx errors). Fix: improve internal linking to the URL from authoritative pages, fix server response times, and request indexing manually via URL Inspection for high-priority URLs.

"Crawled - currently not indexed" is a different problem entirely. Google fetched the URL successfully and read its content — then chose not to add it to the index. The decision is editorial: Google decided the page does not meet its quality bar, is too similar to a canonical version, or is not valuable enough to spend index slots on. Fix: improve content depth (the page may be thin or templated), check for duplication with other site pages, and add internal links signaling the page's importance. Requesting indexing here usually does not help — Google has already decided.

Other common statuses worth knowing: "Excluded by ‘noindex’ tag" (working as intended if you added the tag deliberately), "Blocked by robots.txt" (the URL is disallowed — confirm this is intentional), "Page with redirect" (the URL redirects elsewhere — Google indexes the destination), and "Soft 404" (the page returns 200 but looks empty or error-like to Google).

Debugging a Not-Indexed Page Step by Step

When a specific URL is not indexed, run it through GSC's URL Inspection tool first. The output tells you exactly which stage the URL is stuck at, and the next move depends entirely on what it says.

# Sample URL Inspection output and what each line means

URL: https://example.com/blog/new-post

Coverage:
  Submitted and indexed              -> Working. No action needed.
  URL is not on Google               -> Not indexed. Check the reason below.

Discovery:
  Sitemaps: https://example.com/sitemap.xml
  Referring page: https://example.com/blog/   -> Internal link found

Crawl:
  Last crawl: Apr 22, 2026, 3:14 PM
  Crawled as: Googlebot smartphone
  Crawl allowed? Yes
  Page fetch: Successful
  Indexing allowed? Yes               -> No noindex tag detected

Indexing:
  User-declared canonical: https://example.com/blog/new-post
  Google-selected canonical: Inspected URL  -> Self-canonical confirmed

# If "Indexing allowed?" says "No: noindex detected" -> Remove the noindex tag.
# If "Crawl allowed?" says "No: blocked by robots.txt" -> Edit robots.txt.
# If "Page fetch" says "Failed" -> Check 4xx/5xx response code.
# If "Google-selected canonical" differs from yours -> Canonical signal conflict.

If URL Inspection shows the page is technically eligible (crawl allowed, indexing allowed, fetch successful, canonical correct) but it is still not indexed, you are in "Crawled - currently not indexed" territory. That is a quality and authority problem, not a technical one — the fixes are content depth, internal links, and removing thin/duplicate variants from the same site. See why pages are not indexed for the full diagnostic flow.

Sitemaps: How They Help Discovery and Crawling

An XML sitemap is the most direct signal you can send Google about which URLs you want crawled and indexed. It does not guarantee indexing — Google still applies its quality filters — but it short-circuits the discovery stage and gives Google a prioritized URL list to work through.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-25</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/new-post</loc>
    <lastmod>2026-04-28</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2026-04-20</lastmod>
  </url>
</urlset>

Three rules: every URL in the sitemap should return 200 OK and canonicalize to itself; listing URLs that redirect, are noindexed, or return 4xx triggers GSC sitemap warnings. Update the <lastmod> field accurately when content actually changes — Google uses it to prioritize recrawls. And reference the sitemap in robots.txt with Sitemap: https://example.com/sitemap.xml so any crawler that reads robots.txt also discovers it.
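
A quick way to audit the first rule is to pull every <loc> out of the sitemap and check its status code. The sketch below assumes GNU grep for the -P flag and takes URLs from your live sitemap.

# Report the HTTP status of every URL listed in the sitemap
curl -s https://example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  | while read -r url; do
      echo "$(curl -o /dev/null -s -w '%{http_code}' "$url")  $url"
    done
# Anything other than 200 (redirects, 404s) should be fixed or dropped from the sitemap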

Crawl Budget vs Indexing Budget

Crawl budget and indexing budget are different resources. Confusing them leads to optimizing the wrong thing.

Crawl budget is the number of requests Googlebot is willing to make to your site in a given period. It is determined by your server's capacity to respond (host load) and your site's perceived importance (crawl demand). Crawl budget matters for sites with hundreds of thousands of URLs — a small site of 500 pages will rarely run out. To optimize: reduce server response time, fix soft 404s and crawl errors, block low-value parameter URLs in robots.txt, and consolidate duplicate URL patterns. See the crawl budget guide for the full breakdown.
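
To see where crawl budget is actually going, group Googlebot requests in your access logs by path. As with the earlier log check, the log path and combined format are assumptions; adapt to your server.

# Top 20 paths requested by Googlebot, with query strings stripped
grep "Googlebot" /var/log/nginx/access.log \
  | awk '{print $7}' | cut -d'?' -f1 \
  | sort | uniq -c | sort -rn | head -20
# If parameter and faceted URLs dominate this list, that is crawl budget your key pages never get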

Indexing budget is the number of URLs from your site Google is willing to keep in its index. This is not a published number, but Google has stated indexing capacity is finite and quality-gated. A site with 10,000 thin pages will see Google index a fraction; a site with 1,000 high-quality pages may see all of them indexed. To optimize indexing budget: prune thin and duplicate content (canonical or noindex it), strengthen the topical depth of pages you want indexed, and avoid programmatic content generation that produces low-effort variations.
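
One crude way to shortlist pruning candidates is a word-count pass over sitemap URLs. It is a rough proxy for thinness, not a quality score, and it ignores JavaScript-rendered content.

# Approximate word count per page (tags stripped), shortest first
curl -s https://example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  | while read -r url; do
      echo "$(curl -s "$url" | sed 's/<[^>]*>/ /g' | wc -w)  $url"
    done | sort -n | head -10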

The interaction matters: if your crawl budget is wasted on parameter URLs and faceted navigation, Google never crawls your high-value pages, which then never compete for indexing budget. Fixing crawl budget often unlocks indexing improvements that look unrelated.

Recovery Timelines After Fixes

How long until a fix shows up in GSC depends on which stage the fix targets and how often Google revisits your URLs.

Robots.txt change: Google fetches robots.txt approximately every 24 hours. A new Disallow or Allow rule takes effect on the next crawl after that fetch — usually within 1–2 days. You can request a fresh fetch from the robots.txt report in GSC Settings (the old standalone robots.txt Tester has been retired).

Noindex tag removal (page should now index): Google must recrawl the page and observe the missing noindex. For high-priority pages, this happens within 1–3 days. For low-priority pages buried in a large site, it can take 2–6 weeks. Submit the URL via URL Inspection > Request Indexing to accelerate.
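
Before requesting indexing, confirm the tag really is gone from what the server serves; check both the HTML and the response headers (the URL is illustrative).

# No output from either command means no noindex directive is being served
curl -s https://example.com/blog/new-post | grep -i "noindex"
curl -sI https://example.com/blog/new-post | grep -i "x-robots-tag"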

Noindex tag added (page should now de-index): Same recrawl timeline. Once Google sees the noindex tag, the URL is removed from the index within hours of that crawl. GSC will reflect the change in the Pages report within 1–2 days of the deindex.

Quality improvements (lifting "Crawled - currently not indexed"): The slowest fix. Google may not re-evaluate the URL for weeks even after a recrawl, because indexing decisions are not made on every crawl. Expect 4–12 weeks for systemic content quality improvements to translate into indexing gains. The signal that it is working: pages start moving from "Crawled - currently not indexed" to "Submitted and indexed" in batches, not one at a time.

Monitoring Crawling and Indexing With the GSC Coverage Report

The Pages report (formerly Coverage report) is the single best dashboard for monitoring crawl and index health. Three views matter:

Indexed pages count over time. The headline number on the Pages report. A steady or rising count is healthy. A sudden drop of more than 5% in a week is an alert — investigate immediately for accidental noindex deployment, robots.txt regression, or sitemap corruption.

Non-indexed reasons breakdown. Click into the "Why pages aren't indexed" section. Each reason is a separate diagnostic category. Track the count per reason month over month: an increase in "Discovered - currently not indexed" suggests crawl capacity issues; an increase in "Crawled - currently not indexed" suggests content quality drift; an increase in "Duplicate without user-selected canonical" suggests the canonical configuration broke.

Sitemap-specific view. Filter the Pages report to only URLs from your submitted sitemap. This is your most reliable signal for whether the URLs you actively care about are being indexed. The ratio of "indexed" to "total submitted" should be above 90% for a healthy site. Below 70% means a substantial chunk of your intended index is being rejected — usually for quality, canonical, or duplication reasons.

For more frequent monitoring than GSC's daily refresh, run a recurring crawl of your sitemap with SitemapFixer or Screaming Frog and compare against the previous run. Catching a noindex regression or robots.txt mistake within hours instead of days prevents weeks of recovery work.
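
A minimal version of that recurring check reuses the sitemap loop from earlier: snapshot the status code and noindex state per URL each day and diff against the previous run. The paths and schedule are assumptions, and a dedicated crawler gives far more detail.

# Daily snapshot of status code and noindex count per sitemap URL
curl -s https://example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  | while read -r url; do
      status=$(curl -o /dev/null -s -w '%{http_code}' "$url")
      noindex=$(curl -s "$url" | grep -ci 'name="robots"[^>]*noindex')
      echo "$status noindex=$noindex $url"
    done > "index-check-$(date +%F).txt"
# diff yesterday's and today's files to catch a noindex or robots.txt regression within hours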
