Fix Your Sitemap for Large Sites (10k+ pages)

Updated April 2026 · By SitemapFixer Team

Sitemap strategy for sites over 10,000 pages: sitemap indexes, crawl budget optimization, priority tiers, and monitoring at scale. Everything gets harder when the sitemap stops fitting in a single file.


At 10k URLs you can still get away with a single file. At 50k you hit the hard cap and need an index. At 500k you need to rethink regeneration: full rebuilds take longer than your content update cycle, and you start shipping stale sitemaps. At 5M+, sitemap architecture becomes a serious engineering problem.

We worked with a news publisher that had 4.2M article URLs. Full sitemap regeneration took 47 minutes. They were running it hourly, so the previous run was still finishing when the next one started, and CPU stayed pegged 24/7 for a job that should complete in minutes. Switching to shard-level regeneration (rewriting only the files that contain changed articles) dropped the per-update time to under 30 seconds.

Hard limits to know

Per sitemap file: 50,000 URLs or 50MB uncompressed, whichever you hit first. Per sitemap index: up to 50,000 sub-sitemaps.

Recommended sharding strategy

Shard by content type first, then by ID range within each type:

/sitemap.xml                 # sitemap index
  /sitemap-static.xml         # ~50 URLs, static marketing pages
  /sitemap-categories.xml     # ~1,500 URLs, category pages
  /sitemap-articles-0.xml     # articles ID 1-40000
  /sitemap-articles-1.xml     # articles ID 40001-80000
  ...
  /sitemap-articles-104.xml   # most recent articles
  /sitemap-users-0.xml        # public user profiles
  /sitemap-images-0.xml       # image sitemap for galleries

Why this structure: Google Search Console (GSC) shows coverage per file. When articles shard 47 drops from 100% indexed to 62%, you know exactly which ID range has the problem. You wouldn't get that signal from one giant file.
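
As a rough illustration of the index file that ties the shards together, here is a minimal Python sketch using only the standard library. The function name `build_sitemap_index`, the `example.com` base URL, and the lastmod values are assumptions made for the example, not part of any particular generator.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, shards):
    """Return sitemap index XML as UTF-8 bytes.

    shards: iterable of (filename, lastmod) pairs; lastmod is a
    W3C date string or None (hypothetical input shape).
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for filename, lastmod in shards:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{filename}"
        if lastmod:
            # Only emit lastmod when the shard's last change is known.
            ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap_index(
        "https://example.com",
        [
            ("sitemap-static.xml", None),
            ("sitemap-articles-0.xml", "2026-04-01"),
        ],
    )
    print(xml_bytes.decode("utf-8"))
```

Because the index only lists shard filenames, it needs rewriting only when a shard is added or removed, or when you choose to refresh a shard's lastmod.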

Lastmod is the most important field

Google has said publicly that it ignores priority and changefreq, but it does use lastmod to prioritize re-crawls when the values are consistently accurate. On a large site, accurate lastmod is the single highest-leverage optimization. If every URL carries the same lastmod (the generator's run time), Google can't tell which pages changed and falls back to its own scheduling, which is less efficient than your ground truth. Derive lastmod from the actual content update timestamp, not the generation time.
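
To make the distinction concrete, the sketch below writes each URL's lastmod from its own update timestamp rather than from the time the generator runs. It assumes each page is available as a `(url, updated_at)` pair with a timezone-aware `datetime`; the function names and that input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def format_lastmod(updated_at):
    """Format a timezone-aware datetime as a W3C datetime in UTC."""
    if updated_at.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse rather than guess.
        raise ValueError("updated_at must be timezone-aware")
    return updated_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_urlset(pages):
    """Return one sitemap shard as UTF-8 bytes.

    pages: iterable of (url, updated_at) pairs, where updated_at is
    the content's own last-modified time, not the generation time.
    """
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, updated_at in pages:
        node = ET.SubElement(root, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = format_lastmod(updated_at)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    cest = timezone(timedelta(hours=2))
    shard = build_urlset(
        [("https://example.com/articles/1", datetime(2026, 3, 14, 11, 30, tzinfo=cest))]
    )
    print(shard.decode("utf-8"))
```

Normalizing to UTC is a choice made for this example; a W3C datetime with an explicit offset is equally valid.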

Incremental regeneration

Stop regenerating the whole sitemap on every content change. Track which shard a given URL lives in (ID-based sharding makes this trivial: with IDs starting at 1 and 40,000 IDs per shard, the shard number is floor((post_id - 1) / 40000)), and rewrite only that shard when a post changes. Regenerate the index only when shard filenames change (new shard added, old one removed). For very large sites, a queue-based approach works well: content changes enqueue "shard N dirty" jobs, and a worker batches them into regenerations.
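
A minimal sketch of the shard lookup and the dirty-shard batching, assuming article IDs start at 1 and shard 0 covers IDs 1-40000 as in the layout above (hence the `- 1` offset). The names `shard_for`, `collect_dirty_shards`, and `regenerate_dirty` are illustrative, and `rewrite_shard` stands in for whatever function your generator uses to rewrite a single file.

```python
SHARD_SIZE = 40_000  # IDs per shard, matching the layout above


def shard_for(post_id):
    """Map a 1-based post ID to its shard number (IDs 1-40000 -> shard 0)."""
    if post_id < 1:
        raise ValueError("post IDs are assumed to start at 1")
    return (post_id - 1) // SHARD_SIZE


def collect_dirty_shards(changed_post_ids):
    """Collapse a batch of changed post IDs into a sorted list of dirty shards."""
    return sorted({shard_for(post_id) for post_id in changed_post_ids})


def regenerate_dirty(changed_post_ids, rewrite_shard):
    """Rewrite each dirty shard exactly once, however many posts changed in it."""
    dirty = collect_dirty_shards(changed_post_ids)
    for shard in dirty:
        rewrite_shard(shard)  # hypothetical callback that rewrites one file
    return dirty


if __name__ == "__main__":
    # Three edits in shard 0 and one in shard 2 trigger only two rewrites.
    regenerate_dirty(
        [5, 12, 39_999, 80_001],
        lambda shard: print(f"rewriting sitemap-articles-{shard}.xml"),
    )
```

Deduplicating through a set is what turns a burst of edits to the same ID range into a single rewrite of that shard.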

Monitoring at scale

Common large-site mistakes

Step-by-Step Fix Guide

  1. Shard by content type, then by ID range. Target ~40k URLs per file
  2. Implement accurate lastmod from content update timestamps
  3. Move to incremental regeneration: rewrite only the shard containing changed content
  4. Block low-value URL patterns in robots.txt (filters, search, sessions, trackers)
  5. Submit the sitemap index (not individual shards) to GSC for per-shard coverage data
  6. Monitor per-shard coverage weekly. Flag drops below 80%
  7. Audit Googlebot server logs monthly to confirm crawl budget is being spent in line with your priorities
  8. Set alerts on generation time, file size, and coverage drops
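
Steps 6 and 8 can be sketched as a small per-shard check. This is an illustration under assumptions: the URL count, file size, and indexed count for each shard are taken to be already available (for example, copied from GSC's per-sitemap report), and `shard_alerts` is a hypothetical helper, not an existing tool.

```python
MAX_URLS_PER_FILE = 50_000
MAX_BYTES_PER_FILE = 50 * 1024 * 1024  # 50MB uncompressed
MIN_COVERAGE = 0.80  # flag shards that drop below 80% indexed


def shard_alerts(name, url_count, size_bytes, indexed_count):
    """Return a list of human-readable alerts for one sitemap shard."""
    alerts = []
    if url_count > MAX_URLS_PER_FILE:
        alerts.append(f"{name}: {url_count} URLs exceeds the {MAX_URLS_PER_FILE} limit")
    if size_bytes > MAX_BYTES_PER_FILE:
        alerts.append(f"{name}: {size_bytes} bytes exceeds the 50MB limit")
    if url_count > 0:
        coverage = indexed_count / url_count
        if coverage < MIN_COVERAGE:
            alerts.append(
                f"{name}: coverage {coverage:.0%} is below {MIN_COVERAGE:.0%}"
            )
    return alerts


if __name__ == "__main__":
    # The shard-47 scenario from earlier: 62% of 40,000 URLs indexed.
    for alert in shard_alerts("sitemap-articles-47.xml", 40_000, 8_000_000, 24_800):
        print(alert)
```

Generation time could be checked the same way by timing each shard rewrite and comparing it against a threshold you choose.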

Frequently Asked Questions

What's the sitemap file size and URL limit?
50,000 URLs or 50MB uncompressed per file. A sitemap index can reference up to 50,000 sub-sitemaps. Keep each shard at around 40,000 URLs for headroom and faster regeneration.

How do I optimize crawl budget on a large site?
Accurate lastmod is the single biggest lever, because Google uses it to prioritize re-crawls. Remove dead URLs promptly, shard by content type so coverage reports are readable, and block low-value URL patterns in robots.txt.

Should I regenerate the entire sitemap on every change?
On sites over 500k URLs, no. Regenerate only the shards containing changed content. Structure your generator to rewrite one shard at a time, triggered by content events.