Fix Your Sitemap for Large Sites (10k+ pages)
Sitemap strategy for sites over 10,000 pages: sitemap indexes, crawl budget optimization, priority tiers, and monitoring at scale. Everything gets harder when the sitemap stops fitting in a single file.
At 10k URLs you can still get away with a single file. At 50k you hit the hard cap and need an index. At 500k you need to rethink regeneration: full rebuilds take longer than your content update cycle, and you start shipping stale sitemaps. At 5M+, sitemap architecture becomes a serious engineering problem.
I worked with a news publisher that had 4.2M article URLs. Full sitemap regeneration took 47 minutes. They were running it hourly, and the previous run was still finishing when the next one started. CPU was pegged 24/7 for a job that should complete in minutes. Switching to shard-level regeneration (only rewrite files containing changed articles) dropped the per-update time to under 30 seconds.
Hard limits to know
- 50,000 URLs per sitemap file. Hard cap. File is rejected if exceeded.
- 50 MB uncompressed per file. Usually hit first on sites with long URLs or lots of image entries.
- 50,000 sub-sitemaps per index. So the theoretical max is 2.5 billion URLs from one index.
- Indexes can't reference other indexes. You get one level of nesting: one index → up to 50k leaf files.
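It's worth checking every generated shard against those caps before it ships. A minimal sketch in Python; the warning thresholds and file paths are arbitrary choices here, not part of the protocol:

```python
# Sketch: pre-publish guard for a single sitemap shard. The hard caps come from the
# sitemap protocol; the softer warning thresholds are arbitrary headroom.
import os
import xml.etree.ElementTree as ET

MAX_URLS = 50_000               # protocol cap per sitemap file
MAX_BYTES = 50 * 1024 * 1024    # 50 MB uncompressed

def check_shard(path, warn_urls=45_000, warn_bytes=40 * 1024 * 1024):
    size = os.path.getsize(path)                      # assumes path is the uncompressed XML
    urls = sum(1 for _, el in ET.iterparse(path)
               if el.tag.split("}")[-1] == "url")     # count <url> entries, namespace-agnostic
    if urls > MAX_URLS or size > MAX_BYTES:
        raise ValueError(f"{path}: {urls} URLs / {size} bytes exceeds sitemap limits")
    if urls > warn_urls or size > warn_bytes:
        print(f"warning: {path} is close to the caps ({urls} URLs, {size} bytes) - split it")
```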
Recommended sharding strategy
Shard by content type first, then by ID range within each type:
```
/sitemap.xml               # sitemap index
/sitemap-static.xml        # ~50 URLs, static marketing pages
/sitemap-categories.xml    # ~1,500 URLs, category pages
/sitemap-articles-0.xml    # articles ID 1-40000
/sitemap-articles-1.xml    # articles ID 40001-80000
...
/sitemap-articles-104.xml  # most recent articles
/sitemap-users-0.xml       # public user profiles
/sitemap-images-0.xml      # image sitemap for galleries
```
Why this structure: GSC shows coverage per file. When articles shard 47 drops from 100% indexed to 62%, you know exactly which ID range has the problem. You wouldn't get that signal from one giant file.
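For reference, the index itself is just a flat list of the shard files in the standard sitemapindex format. A trimmed example with a placeholder domain and illustrative timestamps; each entry's lastmod should reflect when that shard file was last rewritten:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-static.xml</loc>
    <lastmod>2024-05-01T08:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-0.xml</loc>
    <lastmod>2024-04-12T14:30:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-104.xml</loc>
    <lastmod>2024-05-18T09:15:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```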
Lastmod is the most important field
Google has publicly said it ignores priority and changefreq, but it does use lastmod to prioritize re-crawls. On a large site, accurate lastmod is the single highest-leverage optimization. If every URL carries the same lastmod (the generator's run time), Google can't tell which pages changed and falls back to its own scheduling - which is less efficient than your ground truth. Derive lastmod from the actual content update timestamp, not the generation time.
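A minimal sketch of the difference, assuming each post record carries its own updated_at timestamp (the field names and URL layout are hypothetical):

```python
# Sketch: stamp each <url> with the content's own update time, not the generator's run time.
# post["updated_at"], the slug field, and the URL layout are illustrative assumptions.
from datetime import timezone

def url_entry(post) -> str:
    loc = f"https://example.com/articles/{post['slug']}"
    lastmod = post["updated_at"].astimezone(timezone.utc).isoformat()  # W3C datetime
    return (
        "  <url>\n"
        f"    <loc>{loc}</loc>\n"
        f"    <lastmod>{lastmod}</lastmod>\n"
        "  </url>\n"
    )

# Anti-pattern: lastmod = datetime.now() for every URL. That tells crawlers
# "everything changed at generation time" and throws away the re-crawl signal.
```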
Incremental regeneration
Stop regenerating the whole sitemap on every content change. Track which shard a given URL lives in (ID-based sharding makes this trivial: floor(post_id / 40000)), and rewrite only that shard when a post changes. Regenerate the index only when shard filenames change (new shard added, old one removed). For very large sites, a queue-based approach works well: content changes enqueue "shard N dirty" jobs that batch into regenerations.
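A sketch of that flow, assuming 40,000-article shards; regenerate_shard and the scheduling (cron, worker queue) are placeholders for whatever your stack provides:

```python
# Sketch: route a changed article to its shard and batch regenerations.
# SHARD_SIZE matches the ID-range sharding above.
SHARD_SIZE = 40_000
dirty_shards: set[int] = set()

def shard_for(post_id: int) -> int:
    return post_id // SHARD_SIZE              # floor(post_id / 40000)

def on_post_changed(post_id: int) -> None:
    dirty_shards.add(shard_for(post_id))      # enqueue "shard N is dirty"

def flush_dirty_shards(regenerate_shard) -> None:
    # Run periodically: rewrite only the touched shard files, then rebuild the
    # index only if a shard file was added or removed.
    for shard in sorted(dirty_shards):
        regenerate_shard(shard)               # rewrites sitemap-articles-{shard}.xml
    dirty_shards.clear()
```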
Monitoring at scale
- GSC coverage per sub-sitemap - check weekly; drops below 80% warrant investigation
- Server log analysis - what's Googlebot actually crawling? If it's burning budget on filter URLs, your robots.txt needs work
- Generation time - if your sitemap build starts taking longer than your update cycle, something will break silently
- File size per shard - if any shard approaches 40MB, split it
- Lastmod distribution - plot the histogram. If everything has today's date, your generator is lying
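The lastmod check is easy to script against the generated files. A sketch, assuming the shards are readable from local disk (the glob pattern is a placeholder):

```python
# Sketch: histogram of lastmod dates across all article shards. A healthy large site
# shows a spread of dates; a single spike at today's date means the generator is
# stamping its own run time on every URL.
import glob
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def lastmod_histogram(pattern="sitemaps/sitemap-articles-*.xml") -> Counter:
    counts = Counter()
    for path in glob.glob(pattern):
        for _, el in ET.iterparse(path):
            if el.tag == NS + "lastmod":
                counts[el.text[:10]] += 1     # bucket by date (YYYY-MM-DD)
    return counts

if __name__ == "__main__":
    for day, n in sorted(lastmod_histogram().items()):
        print(f"{day}  {n}")
```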
Common large-site mistakes
- Single sitemap file over 50MB or 50k URLs - Google rejects it entirely
- Full regeneration on every update when incremental would work
- No lastmod, or generator-time lastmod on every URL
- Sub-sitemap URLs mixed across domains (a sitemap must reference only URLs on the same host)
- Robots.txt not updated to point at the sitemap index
- CDN caching the sitemap too aggressively so crawlers see stale content - a quick edge-vs-origin check is sketched after this list
- Sharding by hash instead of by ID range, making incremental regen impossible
- No monitoring - first sign of trouble is GSC rankings dropping weeks later
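For the CDN-staleness item above, one cheap check is to compare the sitemap index the edge serves with the copy at origin. A sketch with placeholder hostnames; adjust to however your origin is reachable directly:

```python
# Sketch: detect a CDN serving a stale sitemap index by comparing edge vs. origin bytes.
import hashlib
import urllib.request

def body_hash(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

edge = body_hash("https://example.com/sitemap.xml")          # what crawlers see
origin = body_hash("https://origin.example.com/sitemap.xml") # what you just generated
if edge != origin:
    print("CDN copy of sitemap.xml differs from origin - check cache TTL / purge on regeneration")
```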
Step-by-Step Fix Guide
- Shard by content type, then by ID range. Target ~40k URLs per file
- Implement accurate lastmod from content update timestamps
- Move to incremental regeneration: rewrite only the shard containing changed content
- Block low-value URL patterns in robots.txt (filters, search, sessions, trackers) - an example is sketched after this list
- Submit the sitemap index (not individual shards) to GSC for per-shard coverage data
- Monitor per-shard coverage weekly. Flag drops below 80%
- Audit Googlebot server logs monthly to confirm crawl budget spends match your priorities
- Set alerts on generation time, file size, and coverage drops
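To make the robots.txt steps concrete, here is an illustrative file; the disallowed patterns and domain are placeholders - block only what your own Googlebot log analysis shows is wasting crawl budget:

```
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?utm_

# Point crawlers at the index, not at individual shard files
Sitemap: https://example.com/sitemap.xml
```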