By SitemapFixer Team
Updated April 2026

What Is Search Engine Indexing?


Search engine indexing is the process by which a search engine fetches web pages, analyses their content and signals, and stores a structured copy of them in a massive database called the index. When you type a query into Google, the engine does not crawl the web in real time — it queries this pre-built index, ranks the matching documents, and returns a list of results. Indexing is the bridge between a page existing on the web and a page being findable. Without it, your content is invisible to searchers no matter how well it is written.

The Definition: What Indexing Actually Is

Indexing in the context of search is the storage and organisation of web documents in a way that makes them retrievable for arbitrary queries in milliseconds. The technical foundation is the inverted index — a data structure that maps each unique word (or token) to the list of documents containing it, along with positional and frequency information. When you search for "blue running shoes", the engine intersects the document lists for each token, scores the candidates, and returns the top matches. None of this would scale without indexing.
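A toy sketch of the core data structure, in Python, makes this concrete. The three documents and whitespace tokenisation are illustrative simplifications; production indexes add stemming, compression, and ranking weights:

from collections import defaultdict

docs = {
    1: "blue running shoes for trail running",
    2: "red running shoes on sale",
    3: "blue hiking boots",
}

# Inverted index: token -> {doc_id: [positions]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        index[token].setdefault(doc_id, []).append(pos)

def search(query):
    """AND query: intersect the posting lists of every token."""
    postings = [set(index.get(t, {})) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("blue running shoes"))  # {1}: the only doc with all three tokens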

The web has somewhere north of 50 billion indexed pages on Google alone. Real-time crawling for every query would be impossible — both for the engine and for every server on the internet. The index is what makes search feasible.

The Five-Stage Pipeline: Discovery to Ranking

Indexing is one stage in a longer pipeline. Understanding where it sits clarifies why some pages get indexed and others do not.

1. Discovery. Google finds out a URL exists. Sources include sitemaps you submit, internal links from already-indexed pages, external backlinks, and direct submission via the URL Inspection tool. A URL Google has never heard of cannot be crawled.

2. Crawling. Googlebot fetches the URL, respecting robots.txt and crawl-rate budgets. The output is raw HTML (and any HTTP headers). Crawling does not imply indexing — it just means the page was fetched.

3. Rendering. For pages with JavaScript, Google runs a headless Chromium instance to execute scripts and produce the final DOM. Server-rendered pages skip most of this stage. Render queues are a known bottleneck on large JS-heavy sites.

4. Indexing. Google parses the rendered HTML, extracts content and signals (canonical, hreflang, structured data), and decides whether to store the page. Pages that pass quality and uniqueness checks are added to the index. The rest are dropped or flagged.

5. Ranking. At query time, the engine retrieves matching indexed documents and orders them. Ranking only operates on indexed pages — it is downstream of everything above.
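As a toy model only, not Google's implementation: the pipeline behaves like a chain of filters, and a URL can drop out at every stage before ranking ever sees it. The URLs and stub predicates below are invented for illustration:

SITEMAP_URLS = {"https://example.com/", "https://example.com/thin-page"}

def discovered(url):      # stage 1: Google must learn the URL exists
    return url in SITEMAP_URLS

def fetch_allowed(url):   # stage 2: robots.txt and crawl budget gate the fetch
    return "/admin" not in url

def passes_quality(url):  # stage 4: quality/uniqueness checks gate storage
    return "thin-page" not in url

def pipeline(url):
    if not discovered(url):
        return "unknown to Google"
    if not fetch_allowed(url):
        return "never crawled"
    # Stage 3, rendering, would execute JavaScript here.
    if not passes_quality(url):
        return "crawled - currently not indexed"
    return "indexed - eligible to rank"

for url in ["https://example.com/", "https://example.com/thin-page",
            "https://example.com/orphan"]:
    print(url, "->", pipeline(url))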

What an Index Entry Actually Contains

An indexed document is far more than a stored copy of HTML. Each entry typically contains:

The URL, in canonical form after duplicates are consolidated.
The parsed text content: title, headings, body, and anchor text from incoming links.
Extracted entities: the people, places, and products the page mentions.
Structured data: Schema.org markup parsed into a knowledge graph.
Page-level signals: language, country, freshness, mobile-friendliness, Core Web Vitals.
Site-level signals: overall authority, topical focus, trust.
The inverted-index tokens themselves, with positions and weights.

This is why you can search for a phrase and get an exact match: the positional inverted index records that "running shoes" appeared as adjacent tokens at offset 47 of the body. It is also why Google can answer "who is the CEO of Anthropic?" even when the best page never states it in those exact words: the entity-extraction layer linked the page to the relevant entities at index time.
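Building on the toy index sketched earlier, a phrase match simply checks that the tokens' recorded positions are consecutive (again, a simplification of production phrase handling):

def phrase_search(phrase):
    """Match documents where the phrase tokens occupy adjacent positions."""
    tokens = phrase.lower().split()
    hits = set()
    for doc_id in search(phrase):          # AND intersection from above
        for start in index[tokens[0]][doc_id]:
            # Token i of the phrase must sit exactly i places after the first.
            if all(start + i in index[t][doc_id]
                   for i, t in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits

print(phrase_search("running shoes"))  # {1, 2}: adjacent tokens in both docs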

Crawled vs Indexed: The Critical Distinction

This is the most misunderstood concept in technical SEO. Crawling and indexing are separate stages, and a page can be crawled hundreds of times without ever being indexed. In Google Search Console, the status "Crawled — currently not indexed" explicitly reports this state.

Reasons a crawled page may not be indexed:

Low quality. Thin content, duplicated boilerplate, or pages that exist mainly for SEO with no clear user value. Google's quality systems explicitly drop these.

Duplicate content. Google clusters near-duplicates and indexes only the canonical representative. The rest are crawled, recognised as duplicates, and dropped.

Noindex directives. A meta robots tag or X-Robots-Tag header explicitly telling Google not to index the page.

Soft 404. The page returns HTTP 200 but its content (empty results page, "product not found") suggests the URL has no real content. Google demotes these to soft-404 status and excludes them from the index.

Crawl-budget triage. On very large sites, low-importance pages may be crawled but deprioritised for indexing because the cost of indexing them outweighs the search value.

For a deeper look at this specific failure mode, see why pages are not indexed.

What Controls Indexing: The Directives That Matter

You influence indexing through a small set of directives. Knowing exactly what each does — and what it does not do — prevents the most common mistakes.

Meta robots tag. An HTML tag in the <head> that tells search engines whether to index the page and follow its links:

<!-- Allow indexing (default — no tag needed) -->
<meta name="robots" content="index, follow">

<!-- Block indexing for this page -->
<meta name="robots" content="noindex, follow">

<!-- Block indexing AND tell Google not to follow links -->
<meta name="robots" content="noindex, nofollow">

<!-- Target only Googlebot -->
<meta name="googlebot" content="noindex">

X-Robots-Tag HTTP header. Same effect as the meta tag, but delivered via the response header — essential for non-HTML resources like PDFs and images, and useful for applying rules at the server level:

# nginx — noindex all PDFs
location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

# Apache .htaccess: noindex matching draft files (requires mod_headers)
<Files "draft-*.html">
  Header set X-Robots-Tag "noindex"
</Files>

# Raw HTTP response example
HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex, nofollow
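To confirm which of these headers a URL actually serves, check the response yourself. A minimal sketch using the Python requests library (the URL is a placeholder; note that some servers only attach the header to GET responses, not HEAD):

import requests

def check_robots_header(url):
    """Print the X-Robots-Tag header, if any, that a URL serves."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    tag = resp.headers.get("X-Robots-Tag")  # header lookup is case-insensitive
    print(url)
    print(f"  status:       {resp.status_code}")
    print(f"  X-Robots-Tag: {tag or '(not set; no header-level restriction)'}")

check_robots_header("https://example.com/whitepaper.pdf")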

Canonical tag. Does not block indexing — it tells Google which URL among a duplicate cluster should be the indexed representative. Google may still pick a different canonical if your signals contradict each other.

Hreflang. Does not block indexing — it tells Google which language/region variant of a page to show to which user. All hreflang variants get indexed; hreflang only affects which one is served at query time.

Robots.txt. Blocks crawling, not indexing. A URL blocked in robots.txt can still appear in search results (often as a bare URL with no snippet) if external links point to it. Use noindex, not robots.txt, when you want a page out of the index.

XML sitemap. A hint for discovery, not a guarantee of indexing. Including a URL in your sitemap signals you want it indexed, but Google still applies its own quality and uniqueness checks. See what is an XML sitemap.

Sitemap and Robots Examples

A valid sitemap. The only required child of each <url> entry is <loc>; lastmod is optional but useful, while Google ignores changefreq and priority entirely:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-29</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog/indexing-guide</loc>
    <lastmod>2026-04-30</lastmod>
  </url>
</urlset>
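If your sitemap is generated rather than hand-maintained, the format is simple enough to build with Python's standard library alone. A sketch, with placeholder URLs and dates:

import xml.etree.ElementTree as ET
from datetime import date

def build_sitemap(entries):
    """entries: iterable of (loc, lastmod-or-None) pairs. Returns XML bytes."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)

print(build_sitemap([
    ("https://example.com/", date(2026, 4, 29)),
    ("https://example.com/blog/indexing-guide", date(2026, 4, 30)),
]).decode())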

A robots.txt that allows crawling of everything except a few low-value paths and points to the sitemap, a sensible default for most sites:

User-agent: *
Allow: /

# Block crawler from low-value paths (still indexable if linked externally!)
Disallow: /search
Disallow: /cart
Disallow: /admin

Sitemap: https://example.com/sitemap.xml
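Before deploying changes like the above, you can verify that a live robots.txt behaves as intended with Python's standard-library parser (Googlebot here stands in for any user agent you care about):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

for path in ["/", "/blog/indexing-guide", "/search?q=shoes", "/cart"]:
    ok = rp.can_fetch("Googlebot", "https://example.com" + path)
    print(f"{path}: {'crawlable' if ok else 'blocked'}")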

How Other Search Engines Differ from Google

Google dominates global search but is not the only index that matters. Each engine has quirks worth knowing.

Bing. Operates its own crawler (Bingbot) and index, used by Bing Search, Yahoo, DuckDuckGo, Ecosia, and ChatGPT's web search feature. Bing's index is roughly an order of magnitude smaller than Google's and covers the long tail less aggressively. Bing weighs social signals and exact-match keywords more heavily than Google. Submit via Bing Webmaster Tools and the IndexNow API for instant indexing notifications.

DuckDuckGo. Does not maintain its own full web index. It blends Bing results with its own crawler (DuckDuckBot) for spam filtering and instant-answer features. Optimising for Bing effectively optimises for DuckDuckGo.

Yandex. Russia's dominant engine. Maintains its own index with strong emphasis on regional relevance, behavioural signals (click patterns from Yandex Metrica), and more aggressive duplicate detection than Google's. Submit via Yandex Webmaster.

Baidu. The Chinese market leader. Strict requirements: an ICP licence for hosting in mainland China, simplified-Chinese content, and Baidu-specific structured data. Baidu's crawler ignores many Google-standard signals.

A Brief History: From Inverted Index to Modern Search

The inverted index predates the web. Information retrieval researchers in the 1950s and 1960s used it to search bibliographic records on mainframes. The web simply scaled the problem by many orders of magnitude.

Early web indexes (WebCrawler in 1994, Lycos, AltaVista, Excite) used straightforward keyword-frequency scoring. Google's 1998 innovation was PageRank — using the link graph as a quality signal layered on top of the inverted index. The quality of the index entry, not just its existence, became the differentiator.

Modern indexing layers on entity extraction (Knowledge Graph), neural embeddings (BERT-style query and document understanding), and quality classifiers trained on user behaviour. The fundamental data structure — words mapped to documents — is unchanged since the 1960s.

How AI Search Engines Build Their Indexes

AI-native search engines (Perplexity, You.com, ChatGPT search, Claude with web access) face a build-vs-buy decision on indexing.

Perplexity runs its own crawler (PerplexityBot) and maintains its own index. It supplements with third-party APIs (notably Bing) for queries where its own coverage is thin. Allowing PerplexityBot in robots.txt is what determines whether your pages appear as cited sources.

ChatGPT search is powered primarily by Bing's index, with OpenAI's own crawler (OAI-SearchBot) layering on top for content prioritisation. To be cited, you need both Bingbot and OAI-SearchBot allowed.

Claude uses partner search APIs rather than maintaining its own index. The ClaudeBot crawler is for training data collection, not query-time retrieval — a distinction that matters when configuring access.

Google AI Overviews and Gemini reuse Google's existing index. Anything indexed by Google can appear as an AI Overview citation. There is no separate "AI index" to be added to.

Practical implication: if you want to be cited by AI engines, you need a robots.txt that explicitly allows the relevant crawlers, and you need standard SEO indexing fundamentals to be solid. AI search builds on classical search infrastructure; it does not replace it.
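As a sketch of that practical implication, the robots.txt rules below explicitly allow the crawlers named above. User-agent tokens do change, so verify them against each vendor's current documentation before relying on this:

# Allow AI search crawlers so pages can surface as cited sources
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Bingbot matters twice: Bing itself and the engines built on its index
User-agent: Bingbot
Allow: /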

Monitoring Index Status

Two reliable methods for checking whether a page is indexed by Google.

Method 1: the site: operator. Type site:example.com/your-page into Google. If the URL appears, it is indexed. If it does not appear, it is most likely not indexed or has been deindexed, though site: results are not exhaustive, so confirm in Search Console before acting on an absence. For a domain-wide count, site:example.com returns an approximate indexed-page total, useful for tracking trends rather than exact figures.

Method 2: GSC URL Inspection. The authoritative source. Sample output for a healthy URL:

URL Inspection — https://example.com/blog/indexing-guide

Presence on Google
  URL is on Google                              Indexed
  Coverage                                      Submitted and indexed
  Sitemaps                                      https://example.com/sitemap.xml
  Referring page                                https://example.com/blog/

Last crawl
  Last crawl                                    Apr 28, 2026, 14:32 UTC
  Crawled as                                    Googlebot smartphone
  Crawl allowed?                                Yes
  Page fetch                                    Successful
  Indexing allowed?                             Yes

Canonicals
  User-declared canonical                       https://example.com/blog/indexing-guide
  Google-selected canonical                     Inspected URL

The two lines that matter most: "URL is on Google" (Indexed vs Not indexed) and "Google-selected canonical" (should match your declared canonical — if it does not, Google chose a different URL as the index representative). The Google Search Console tutorial walks through the full report.
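For checking more than a handful of URLs, the Search Console URL Inspection API exposes the same report programmatically. A sketch using google-api-python-client; it assumes you hold OAuth credentials for a verified property, and the API's daily quota limits how many URLs you can inspect:

from googleapiclient.discovery import build

def inspect(service, site_url, page_url):
    """Fetch the URL Inspection result for one URL of a verified property."""
    body = {"inspectionUrl": page_url, "siteUrl": site_url}
    result = service.urlInspection().index().inspect(body=body).execute()
    status = result["inspectionResult"]["indexStatusResult"]
    print(page_url)
    print("  verdict:  ", status.get("verdict"))         # e.g. PASS
    print("  coverage: ", status.get("coverageState"))   # e.g. Submitted and indexed
    print("  canonical:", status.get("googleCanonical"))

# creds = ...  # google.oauth2 credentials for the property owner (setup not shown)
# service = build("searchconsole", "v1", credentials=creds)
# inspect(service, "https://example.com/", "https://example.com/blog/indexing-guide")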

Common Indexing Problems and Fixes

"Discovered — currently not indexed". Google knows the URL exists but has not crawled it yet. Causes: weak internal linking, low site authority, or crawl-budget pressure on large sites. Fix: improve internal links to the page, submit via URL Inspection, ensure the page is in your sitemap.

"Crawled — currently not indexed". The page was fetched but not added to the index. Almost always a quality or duplicate issue. Fix: improve content depth, remove boilerplate, ensure the page provides value not already covered elsewhere on the site.

"Duplicate, Google chose different canonical than user". Your canonical declaration was overridden because other signals (internal links, sitemaps, redirects) point elsewhere. Fix: align all signals to the canonical you want indexed.

"Excluded by ‘noindex’ tag". A meta robots or X-Robots-Tag is explicitly blocking indexing. Fix: remove the directive — but first confirm the page should be indexed (sometimes the noindex was correct and a process is auto-submitting the URL).

"Soft 404". The page returns 200 but looks empty or error-like to Google. Fix: either add real content, or return a proper 404/410 status if the URL should not exist.

For systematic diagnosis across many URLs, see crawling and indexing in SEO.
