How Google Crawler Works: The Full Pipeline Explained
Most SEO advice treats Googlebot as a single black box: pages go in, rankings come out. The reality is a multi-stage pipeline with at least seven distinct subsystems — URL discovery, scheduling, robots fetching, HTTP fetching, parsing, rendering, and the indexing handoff — and a problem in any one of them surfaces as a different symptom in Search Console. This guide walks the entire pipeline end to end, with the actual HTTP requests Googlebot makes, the headers it sends, and the directives it honors at each step. By the end you should be able to look at any indexing problem and identify which stage it broke at.
Stage 1: URL Discovery
Before Googlebot can crawl a URL, it has to know the URL exists. Discovery feeds a queue called the crawl frontier. There are five canonical sources Google uses to populate that frontier:
XML sitemaps. Sitemaps are the most direct discovery channel. When you submit a sitemap in Search Console or reference it in robots.txt via Sitemap:, Google fetches it on its own schedule (usually within hours of a change), parses every <loc> entry, and adds new URLs to the frontier. The <lastmod> hint affects scheduling priority, but only if Google trusts your sitemap to set it accurately. A short parsing sketch follows this list.
Internal and external links. Every page Googlebot already knows about is a discovery surface. When a parsed page contains an <a href>, Google extracts the URL, normalizes it (resolving relative paths, decoding percent-encodings, dropping fragments), and adds it to the frontier. External links from other indexed sites work the same way and carry stronger discovery signals.
Manual submission via URL Inspection. The "Request Indexing" button in Search Console adds a URL directly to a high-priority crawl queue. There is a daily quota (around 10–12 requests per property) so this is a tactical tool, not a primary discovery channel.
IndexNow. Google does not participate in IndexNow; it announced tests of the protocol but has never adopted it. Pinging Bing's or Yandex's endpoint can still help indirectly, because faster indexing on those engines tends to produce links and mentions that Google then discovers through the ordinary link graph, but IndexNow should not be treated as a direct Google discovery channel.
RSS/Atom feeds and pubsub. Google still consumes RSS/Atom feeds for news and blog discovery. If your CMS exposes a feed and you reference it in <link rel="alternate" type="application/rss+xml">, Googlebot will poll it for new entries.
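To make the sitemap path concrete, here is a minimal Python sketch of the extraction step: fetch a sitemap, read every <loc> and <lastmod>, and feed the URLs into a frontier. The sitemap URL is a placeholder, and this is an illustration of the data Google reads, not Google's own parser.

# Minimal sketch: extract <loc> and <lastmod> from an XML sitemap.
# The sitemap URL is a placeholder; swap in your own.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
    root = ET.fromstring(resp.read())

frontier = []
for url_el in root.findall("sm:url", NS):
    loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
    lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=NS)
    if loc:
        # A real scheduler would treat lastmod as a freshness hint,
        # not a guarantee of a recrawl.
        frontier.append((loc, lastmod))

print(f"discovered {len(frontier)} URLs")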
Stage 2: URL Scheduling and Crawl Budget
Discovery only adds URLs to the frontier — it does not guarantee a fetch. The scheduler decides which URLs to crawl, in what order, and how often. For small sites (under 10,000 URLs) the scheduler is rarely a constraint. For large sites it is the primary bottleneck and is what people mean when they say "crawl budget."
Two values determine crawl budget: crawl capacity (how many concurrent connections and how quickly Google can fetch from your servers without degrading them for real users) and crawl demand (how much Google wants to crawl your URLs, driven mainly by popularity and by how stale its stored copy of each URL has become). Google describes crawl budget as the set of URLs Googlebot can and wants to crawl, so whichever of the two values is lower acts as the ceiling on your actual crawl rate.
You can influence capacity by serving fast 200s consistently — Google ramps up parallel connections when response times stay under 500ms. You influence demand by linking, updating, and being cited externally. URLs Google considers low-value (parameterized duplicates, soft-404s, thin pages) get pushed to the back of the queue and may be crawled only every few months.
The crawl-delay directive in robots.txt is honored by Bing and Yandex but ignored by Google. The legacy Search Console "Crawl rate" setting was retired in early 2024, so the only Google-supported way to slow Googlebot down is to return 503 Service Unavailable (or 429) with a Retry-After header while your servers are overloaded.
# Bing/Yandex respect this; Googlebot does not
User-agent: bingbot
Crawl-delay: 5

# Google-only way to throttle: return 503 with Retry-After
HTTP/1.1 503 Service Unavailable
Retry-After: 3600
Content-Type: text/html

# Sitemap reference (Googlebot reads this on every robots.txt fetch)
Sitemap: https://example.com/sitemap.xml
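On the application side, the 503-plus-Retry-After pattern above can be wired up as a simple overload guard. A minimal Flask sketch, where is_overloaded() is a hypothetical placeholder for whatever health check you already have:

# Minimal sketch: serve 503 + Retry-After while the origin is overloaded.
# is_overloaded() is a hypothetical placeholder for your own health check.
from flask import Flask, Response

app = Flask(__name__)

def is_overloaded() -> bool:
    # Placeholder: check load average, queue depth, upstream health, etc.
    return False

@app.before_request
def shed_load():
    if is_overloaded():
        # Googlebot backs off and retries later instead of hammering the host.
        return Response(
            "Temporarily overloaded, please retry later.",
            status=503,
            headers={"Retry-After": "3600"},
        )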
Stage 3: Fetching robots.txt
Before any URL on a host gets fetched, Googlebot fetches /robots.txt. The result is cached for up to 24 hours. If robots.txt returns a 5xx or times out, Google treats the entire host as "disallow all" for the cache duration — which is why a flaky robots.txt response can silently halt crawling for a full day.
Status codes Googlebot honors: 200 — parse and apply the rules. 404 / 410 — assume "allow all". 403 — also treated as "allow all" (counterintuitive but documented; every 4xx except 429 behaves this way). 5xx / 429 — treated as "disallow all" while the errors persist; if robots.txt stays unreachable for more than 30 days, Google falls back to the last cached copy, and if there is none it assumes there are no crawl restrictions at all.
Inside robots.txt, Google honors User-agent, Disallow, Allow, Sitemap, and wildcard patterns (* and $). The most specific group wins — if you have rules for Googlebot and *, Googlebot uses only the Googlebot group.
# Check what Googlebot would see for robots.txt
curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -I https://example.com/robots.txt

# Example robots.txt that Google parses correctly
User-agent: Googlebot
Disallow: /admin/
Disallow: /*?sort=
Allow: /admin/public/

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Note that Disallow blocks crawling but does not block indexing. A URL that is linked from elsewhere can still appear in search results as a URL-only listing even if robots.txt forbids fetching it. The only way to keep a URL out of the index entirely is to allow crawling and return an X-Robots-Tag: noindex header or a meta robots noindex tag, which Google can only see if it is permitted to fetch the page.
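To make the group-selection and longest-match behavior concrete, here is a toy Python checker for the Googlebot rules in the example above. It deliberately skips wildcard patterns and is not Google's parser (Google has open-sourced its real C++ parser); it only illustrates that the longest matching path wins and that Allow beats Disallow on a tie.

# Toy illustration of robots.txt precedence: the longest matching rule wins,
# and Allow wins over Disallow when the matches are equally specific.
# Simplified sketch only: no wildcard (*) or end-anchor ($) handling.

RULES = [  # rules from the Googlebot group in the example above
    ("disallow", "/admin/"),
    ("disallow", "/*?sort="),   # wildcard rule, out of scope for this toy matcher
    ("allow", "/admin/public/"),
]

def allowed(path: str) -> bool:
    best = ("allow", "")  # default: allowed when nothing matches
    for kind, rule in RULES:
        if "*" in rule or "$" in rule:
            continue  # wildcards ignored in this sketch
        if path.startswith(rule):
            if len(rule) > len(best[1]) or (len(rule) == len(best[1]) and kind == "allow"):
                best = (kind, rule)
    return best[0] == "allow"

print(allowed("/admin/settings"))     # False: /admin/ disallows it
print(allowed("/admin/public/page"))  # True: the longer Allow rule wins
print(allowed("/products/widget"))    # True: no rule matches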
Stage 4: HTTP Fetch with Googlebot User Agents
Once robots.txt permits a URL, Googlebot fetches it. Different Googlebot variants exist for different content types and purposes:
Googlebot Smartphone — the default crawler for nearly all sites since mobile-first indexing rolled out fully. It identifies as Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). The Chrome version (W.X.Y.Z) is updated to match the current evergreen Googlebot Chromium build.
Googlebot Desktop — used for legacy desktop-only properties and as a comparison crawler for parity checks. Identifies with the desktop Chrome UA suffixed with the Googlebot token.
Googlebot Image — fetches images discovered in <img> tags and image sitemaps. Sends Accept: image/avif,image/webp,image/*.
Googlebot News — crawls publishers approved for Google News with stricter freshness expectations.
Googlebot Video, AdsBot-Google, Mediapartners-Google — specialized fetchers for video sitemaps, ads landing page quality checks, and AdSense respectively. AdsBot ignores User-agent: * rules — you must target it explicitly.
# Simulate a Googlebot Smartphone request
curl -v -A "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile Safari/537.36 \
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  -H "Accept-Encoding: gzip, deflate, br" \
  https://example.com/page

# Verify a request actually came from Google (reverse DNS check)
host 66.249.66.1
# Should return crawl-66-249-66-1.googlebot.com

# Forward DNS confirms
host crawl-66-249-66-1.googlebot.com
# Should return the same 66.249.66.1
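The same reverse-plus-forward DNS check is easy to script for log analysis. A minimal sketch using Python's standard library; googlebot.com and google.com are the hostname suffixes Google documents for its crawlers.

# Verify that an IP claiming to be Googlebot really belongs to Google:
# reverse DNS must end in googlebot.com or google.com, and the forward
# lookup of that hostname must resolve back to the same IP.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward DNS
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_real_googlebot("66.249.66.1"))   # True for a genuine Googlebot IP
print(is_real_googlebot("203.0.113.50"))  # False for a spoofed user agent

Google also publishes its crawler IP ranges as JSON, which avoids per-request DNS lookups when you are filtering logs at scale.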
Stage 5: HTTP Response Handling
How Googlebot handles the response depends entirely on the status code returned. Each code maps to a specific downstream behavior:
200 OK — body is passed to the parser. Cache headers (ETag, Last-Modified) are stored to enable conditional requests on the next fetch via If-None-Match or If-Modified-Since. A sketch of this revalidation flow follows this list.
301 Moved Permanently / 308 Permanent Redirect — the target URL replaces the original in the index over time, and ranking signals consolidate to the target. Google follows up to 10 redirect hops in a single fetch; longer chains are reported as redirect errors and re-queued for a later attempt.
302 Found / 307 Temporary Redirect — followed for crawling, but the original URL stays in the index. If Google detects a 302 has been in place for many months, it may treat it as a 301.
304 Not Modified — confirms the cached body is still current. The URL is marked as recently checked but not re-parsed.
404 Not Found / 410 Gone — URL is dropped from the index after repeated confirmation. 410 is processed slightly faster than 404 as a stronger removal signal, but both reach the same end state.
429 Too Many Requests / 503 Service Unavailable — Googlebot backs off and retries later. Persistent 5xx for 30+ days causes URLs to drop from the index.
5xx other — treated as transient. Repeated 5xx reduces crawl rate for the host until responses recover.
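The revalidation flow referenced under 200 OK is easy to observe yourself. A minimal sketch using the requests library: the first fetch captures ETag and Last-Modified, the second replays them and should come back as 304 Not Modified if your server supports conditional requests.

# Minimal sketch of the revalidation flow Googlebot uses on recrawls:
# replaying ETag / Last-Modified should yield 304 Not Modified when the
# page has not changed, saving the body transfer.
import requests

URL = "https://example.com/page"

first = requests.get(URL, timeout=10)
validators = {}
if "ETag" in first.headers:
    validators["If-None-Match"] = first.headers["ETag"]
if "Last-Modified" in first.headers:
    validators["If-Modified-Since"] = first.headers["Last-Modified"]

second = requests.get(URL, headers=validators, timeout=10)
print(first.status_code, second.status_code)            # 200 then 304 if revalidation works
print("body bytes on second fetch:", len(second.content))  # 0 for a 304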
Stage 6: HTML Parsing and Link Extraction
For 200 responses with HTML content, Googlebot runs an HTML parser before any rendering. The parser extracts:
Meta robots and X-Robots-Tag headers — controlling indexing (noindex), link following (nofollow), snippet length (max-snippet:), image preview (max-image-preview:), and translation (notranslate).
Canonical link — the <link rel="canonical"> hint that influences (but does not guarantee) which URL Google selects as the canonical version of duplicate content.
Hreflang annotations — for international content alternate signaling.
Structured data — JSON-LD, microdata, and RDFa for rich result eligibility.
All <a href> links — added to the discovery frontier with their rel attributes (nofollow, ugc, sponsored) preserved for ranking-signal classification.
<!-- Meta robots: parsed during HTML pre-parse -->
<meta name="robots" content="noindex, follow, max-snippet:160, max-image-preview:large">
<meta name="googlebot" content="index, follow">

<!-- Canonical: applied after duplicate clustering -->
<link rel="canonical" href="https://example.com/canonical-version">

<!-- Hreflang for international variants -->
<link rel="alternate" hreflang="en-gb" href="https://example.com/en-gb/page">
<link rel="alternate" hreflang="de-de" href="https://example.com/de-de/page">

<!-- Or via response headers (works for any content type) -->
HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow
Link: <https://example.com/canonical-version>; rel="canonical"
Crucially, the meta robots tag is read at the pre-parse stage, before any rendering. If JavaScript injects noindex into the rendered DOM, the rendered version wins and the page is kept out of the index. The reverse is more dangerous: when the static HTML already contains noindex, Googlebot skips rendering and JavaScript execution entirely, so a script that later removes the tag is never seen and the page stays deindexed.
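As a rough illustration of what the pre-parse extracts, here is a sketch using Python's standard html.parser that pulls meta robots, the canonical link, and anchor hrefs with their rel attributes out of raw HTML. It operates on static markup only, which is exactly why tags added or removed by JavaScript are invisible at this stage. This is an illustration, not Google's parser.

# Sketch of a pre-render pass over raw HTML: meta robots, rel=canonical,
# and <a href> links with their rel attributes.
from html.parser import HTMLParser

class PreParse(HTMLParser):
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            self.robots = a.get("content")
        elif tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")
        elif tag == "a" and a.get("href"):
            self.links.append((a["href"], a.get("rel", "")))

html = """<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/canonical-version">
</head><body><a href="/pricing" rel="nofollow">Pricing</a></body></html>"""

p = PreParse()
p.feed(html)
print(p.robots)      # noindex, follow
print(p.canonical)   # https://example.com/canonical-version
print(p.links)       # [('/pricing', 'nofollow')]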
Stage 7: Rendering with the Web Rendering Service (WRS)
For pages that depend on JavaScript to assemble content, Google queues the URL for its Web Rendering Service, which executes the scripts and produces the rendered DOM used for indexing. WRS runs an evergreen Chromium build that tracks stable Chrome releases with a short lag, so a page that renders in a current Chrome is a reasonable proxy for what WRS can render.
WRS is not a full-featured user browser. Several APIs are stubbed or disabled: localStorage and sessionStorage are functional but ephemeral (cleared between page loads), service workers are skipped, the requestPermission APIs (notifications, camera, geolocation) auto-deny, and IndexedDB is in-memory only. WebGL, WebSockets, and WebRTC may execute but are not used for indexing signals.
The rendering budget is limited. Google has not published a hard timeout, but a commonly cited rule of thumb is around 5 seconds for content-affecting JavaScript. Anything that arrives after the budget is spent (lazy-loaded images that depend on scroll events, content fetched via setTimeout with long delays, etc.) may not be in the rendered DOM Google sees.
Google's indexing model is sometimes called two-pass indexing: the pre-parse runs immediately on raw HTML, and the rendered version is queued separately. For sites that rely heavily on client-rendered content, the rendered pass can lag the pre-parse by minutes to days, depending on the rendering queue depth. This is why JavaScript-only canonicals, structured data, or content can take longer to be reflected in Search results than equivalent server-rendered signals.
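One practical way to see the gap between the two passes is to compare the raw HTML against a headless-browser render of the same URL. A rough sketch using requests and Playwright; a local Chromium render only approximates WRS, which applies its own timeouts and feature restrictions.

# Rough sketch: compare raw (pre-parse) HTML with a headless render to spot
# content that only exists after JavaScript runs. A local Chromium render
# approximates, but does not reproduce, Google's WRS.
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com/page"

raw_html = requests.get(URL, timeout=10).text

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

print("raw HTML bytes:     ", len(raw_html))
print("rendered DOM bytes: ", len(rendered_html))
# A large gap means important content may depend on the slower render pass.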
Stage 8: Mobile-First Indexing and Signals to the Index
After rendering, Google compiles a feature vector for the URL containing every signal the indexing pipeline cares about: title, headings, body text, internal link anchors, structured data, canonical resolution, language detection, content quality classifiers, page experience signals (Core Web Vitals from CrUX), and freshness markers.
Under mobile-first indexing, the version Googlebot Smartphone fetched is the version used to compute these signals. If your mobile rendering omits content that the desktop version contains, that content effectively does not exist for ranking purposes. Common mobile-first issues: hidden navigation tabs that fail to render server-side, lazy-loaded sections that depend on viewport size, structured data only output to desktop user agents.
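A quick parity check is to fetch the same URL with the Googlebot Smartphone and Desktop user agents and compare what comes back. A minimal sketch; comparing sizes and spot-checking a few phrases will not catch every difference, but it flags obvious UA-conditional markup. The phrases to check are placeholders for content you expect on the page.

# Minimal sketch: fetch a URL as Googlebot Smartphone and Desktop and
# compare the responses for obvious mobile/desktop content gaps.
import requests

URL = "https://example.com/page"
MOBILE_UA = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile "
             "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
DESKTOP_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
              "Googlebot/2.1; +http://www.google.com/bot.html) Chrome/121.0.0.0 Safari/537.36")

mobile = requests.get(URL, headers={"User-Agent": MOBILE_UA}, timeout=10).text
desktop = requests.get(URL, headers={"User-Agent": DESKTOP_UA}, timeout=10).text

print("mobile bytes: ", len(mobile))
print("desktop bytes:", len(desktop))
for phrase in ("Pricing", "Contact", "FAQ"):  # placeholder phrases to spot-check
    print(phrase, "mobile" if phrase in mobile else "-", "desktop" if phrase in desktop else "-")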
The compiled signals enter the indexing pipeline, where deduplication, canonical clustering, and quality classification happen. The URL may be admitted to the served index, sidelined as a non-canonical duplicate, or held in "Discovered, currently not indexed" if the quality classifier scores it below the threshold for that site's budget.
Common Crawl Issues and Their Fixes
Pages stuck at "Discovered – currently not indexed". The URL is in the frontier but the scheduler has deprioritized it. Cause is usually low quality signals or weak internal linking. Fix: link to the page from a higher-authority page on your site, ensure the page has unique content, and keep it included in your sitemap with a fresh <lastmod>.
"Crawled – currently not indexed". Googlebot fetched, parsed, and chose not to index. This is a quality-classifier verdict. Fix: improve content depth and uniqueness, remove near-duplicates, and consolidate thin pages.
Soft 404 detection. Googlebot decided your 200 response is actually a not-found page (empty search results, generic "no items" templates). Fix: return a real 404 status, or add substantive unique content to the template that confirms the URL is a valid resource. A small server-side sketch follows this list.
Robots.txt blocking unintentionally. A wildcard rule like Disallow: /*? can block more than intended. Test with the URL Inspection tool's "Test Live URL" to see exactly what robots.txt evaluation Google performs.
JavaScript-modified noindex. Tag managers and consent scripts sometimes inject a meta robots noindex into the rendered DOM; Google honors the rendered tag, so the page drops out even though the static HTML looks fine. The opposite pattern, a static noindex that JavaScript is supposed to remove, is worse, because Googlebot skips rendering once it sees noindex in the raw HTML. Audit your rendered HTML in the URL Inspection "View Rendered HTML" output.
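For the soft-404 case above, the server-side fix is small: return a real 404 when a template has nothing to show. A minimal Flask sketch in which find_products() is a hypothetical placeholder for your own lookup:

# Minimal sketch: return a real 404 instead of a "no results" page with a
# 200 status, so Google does not classify the URL as a soft 404.
from flask import Flask, abort, request

app = Flask(__name__)

def find_products(query):
    # Placeholder: query your catalog or database here.
    return []

@app.route("/search")
def search():
    results = find_products(request.args.get("q", ""))
    if not results:
        abort(404)  # real 404, not an empty template served with a 200
    items = "".join(f"<li>{name}</li>" for name in results)
    return f"<ul>{items}</ul>"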
Recovery Timelines After Fixes
Robots.txt unblock — fastest. Google re-fetches robots.txt within 24 hours of the cache expiring, then re-queues affected URLs. Visible recovery in 1–3 days for high-priority URLs.
Noindex removal — same timeline as the page's natural recrawl interval. Top pages: 1–3 days. Mid-tier: 1–2 weeks. Long tail: 4–8 weeks. "Request Indexing" in URL Inspection accelerates this for individual URLs.
Canonical correction — slowest of the common fixes. Google re-evaluates clustering decisions over 2–6 weeks. Do not interpret slow movement as a failed fix; check URL Inspection for individual URLs to confirm the corrected canonical is registered.
Quality-related deindexing — slowest overall. Quality re-evaluation happens on long cycles (3–6 months for site-wide signals). Bulk content improvements typically begin to show in indexed page counts at the 8–12 week mark.
Monitoring Crawler Behavior with GSC Crawl Stats
Search Console's Crawl Stats report (Settings → Crawl stats) is the single best telemetry surface for understanding what Googlebot is actually doing on your site. Key views to watch:
Total crawl requests over time. A sudden drop usually points to robots.txt or 5xx issues. A sudden rise often correlates with new sitemap submissions or accidental URL-space explosion (parameter handling, faceted navigation).
Average response time. Trending up means crawl capacity is shrinking. Anything sustained above 1000ms will start to throttle Googlebot's parallelism.
Crawl requests by response code. A spike in 4xx after a release usually means broken internal linking. A spike in 5xx is a server health problem.
Requests by file type. If image or JS requests dwarf HTML requests, you may have a discovery imbalance — Google is rendering pages but spending most of its budget on resources rather than new content.
Requests by Googlebot type. Confirms whether smartphone vs. desktop ratios match expectations. A site fully on mobile-first should see Smartphone dominate; if Desktop is large, mobile-first hasn't fully migrated.
Combine Crawl Stats with the URL Inspection API for programmatic monitoring. The API exposes the same data the Inspection UI shows, so you can build dashboards that flag URLs slipping out of the index before they affect traffic.
# URL Inspection API request (returns the same JSON the GSC UI uses)
POST https://searchconsole.googleapis.com/v1/urlInspection/index:inspect
Authorization: Bearer YOUR_OAUTH_TOKEN
Content-Type: application/json
{
"inspectionUrl": "https://example.com/page",
"siteUrl": "https://example.com/"
}
# Example response excerpt
{
"inspectionResult": {
"indexStatusResult": {
"verdict": "PASS",
"coverageState": "Submitted and indexed",
"robotsTxtState": "ALLOWED",
"indexingState": "INDEXING_ALLOWED",
"lastCrawlTime": "2026-04-28T14:32:18Z",
"googleCanonical": "https://example.com/page",
"userCanonical": "https://example.com/page",
"crawledAs": "MOBILE"
}
}
}

If you operate at scale, SitemapFixer continuously cross-references your sitemap against URL Inspection data and Crawl Stats so that drift between what you submit and what Google actually crawls is surfaced in one dashboard rather than buried across multiple GSC views.
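If you would rather script the check yourself, the request shown above can be sent with any HTTP client and an OAuth 2.0 token carrying the Search Console scope. A minimal sketch with requests; obtaining the token (service account or OAuth flow) is left out:

# Minimal sketch: call the URL Inspection API directly. ACCESS_TOKEN must be
# an OAuth 2.0 token with the https://www.googleapis.com/auth/webmasters scope.
import requests

ACCESS_TOKEN = "ya29.placeholder-token"
ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "inspectionUrl": "https://example.com/page",
        "siteUrl": "https://example.com/",
    },
    timeout=30,
)
resp.raise_for_status()
status = resp.json()["inspectionResult"]["indexStatusResult"]
print(status["coverageState"], status.get("lastCrawlTime"))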
Related Guides
- Google Not Crawling My Site: Diagnostics and Fixes
- Crawl Budget: How It Works and How to Optimize It
- Why Pages Are Not Indexed: A Diagnostic Guide
- AJAX Crawling: How Google Handles Dynamic Content
- How to Force Google to Crawl Your Site Faster
- Crawling and Indexing in SEO: The Complete Guide
- What Is Search Engine Indexing? A Complete Guide