Fix Your Sitemap for Large Sites (10k+ pages)
Sitemap strategy for sites over 10,000 pages: sitemap indexes, crawl budget optimization, priority tiers, and monitoring at scale. Everything gets harder when the sitemap stops fitting in a single file.
At 10k URLs you can still get away with a single file. At 50k you hit the hard cap and need an index. At 500k you need to rethink regeneration: full rebuilds take longer than your content update cycle, and you start shipping stale sitemaps. At 5M+, sitemap architecture becomes a serious engineering problem.
I worked with a news publisher that had 4.2M article URLs. Full sitemap regeneration took 47 minutes. They were running it hourly, and the previous run was still finishing when the next one started. CPU was pegged 24/7 for a job that should complete in minutes. Switching to shard-level regeneration (only rewrite files containing changed articles) dropped the per-update time to under 30 seconds.
Hard limits to know
- 50,000 URLs per sitemap file. Hard cap. File is rejected if exceeded.
- 50 MB uncompressed per file. Usually hit first on sites with long URLs or lots of image entries.
- 50,000 sub-sitemaps per index. So the theoretical max is 2.5 billion URLs from one index.
- Indexes can't reference other indexes. You get one level of nesting: one index → up to 50k leaf files.
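It's worth checking every generated shard against those caps before it ships. A minimal sketch in Python; the warning thresholds and file paths are arbitrary choices here, not part of the protocol:

```python
# Sketch: pre-publish guard for a single sitemap shard. The hard caps come from the
# sitemap protocol; the softer warning thresholds are arbitrary headroom.
import os
import xml.etree.ElementTree as ET

MAX_URLS = 50_000               # protocol cap per sitemap file
MAX_BYTES = 50 * 1024 * 1024    # 50 MB uncompressed

def check_shard(path, warn_urls=45_000, warn_bytes=40 * 1024 * 1024):
    size = os.path.getsize(path)                      # assumes path is the uncompressed XML
    urls = sum(1 for _, el in ET.iterparse(path)
               if el.tag.split("}")[-1] == "url")     # count <url> entries, namespace-agnostic
    if urls > MAX_URLS or size > MAX_BYTES:
        raise ValueError(f"{path}: {urls} URLs / {size} bytes exceeds sitemap limits")
    if urls > warn_urls or size > warn_bytes:
        print(f"warning: {path} is close to the caps ({urls} URLs, {size} bytes) - split it")
```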
Recommended sharding strategy
Shard by content type first, then by ID range within each type:
```
/sitemap.xml               # sitemap index
/sitemap-static.xml        # ~50 URLs, static marketing pages
/sitemap-categories.xml    # ~1,500 URLs, category pages
/sitemap-articles-0.xml    # articles ID 1-40000
/sitemap-articles-1.xml    # articles ID 40001-80000
...
/sitemap-articles-104.xml  # most recent articles
/sitemap-users-0.xml       # public user profiles
/sitemap-images-0.xml      # image sitemap for galleries
```
Why this structure: GSC shows coverage per file. When articles shard 47 drops from 100% indexed to 62%, you know exactly which ID range has the problem. You wouldn't get that signal from one giant file.
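For reference, the index itself is just a flat list of the shard files in the standard sitemapindex format. A trimmed example with a placeholder domain and illustrative timestamps; each entry's lastmod should reflect when that shard file was last rewritten:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-static.xml</loc>
    <lastmod>2024-05-01T08:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-0.xml</loc>
    <lastmod>2024-04-12T14:30:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-104.xml</loc>
    <lastmod>2024-05-18T09:15:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```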
Lastmod is the most important field
Google has publicly said it ignores priority and changefreq, but it does use lastmod to prioritize re-crawls. On a large site, accurate lastmod is the single highest-leverage optimization. If every URL carries the same lastmod (the generator's run time), Google can't tell which pages changed and falls back to its own scheduling - which is less efficient than your ground truth. Derive lastmod from the actual content update timestamp, not the generation time.
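A minimal sketch of the difference, assuming each post record carries its own updated_at timestamp (the field names and URL layout are hypothetical):

```python
# Sketch: stamp each <url> with the content's own update time, not the generator's run time.
# post["updated_at"], the slug field, and the URL layout are illustrative assumptions.
from datetime import timezone

def url_entry(post) -> str:
    loc = f"https://example.com/articles/{post['slug']}"
    lastmod = post["updated_at"].astimezone(timezone.utc).isoformat()  # W3C datetime
    return (
        "  <url>\n"
        f"    <loc>{loc}</loc>\n"
        f"    <lastmod>{lastmod}</lastmod>\n"
        "  </url>\n"
    )

# Anti-pattern: lastmod = datetime.now() for every URL. That tells crawlers
# "everything changed at generation time" and throws away the re-crawl signal.
```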
Incremental regeneration
Stop regenerating the whole sitemap on every content change. Track which shard a given URL lives in (ID-based sharding makes this trivial: floor(post_id / 40000)), and rewrite only that shard when a post changes. Regenerate the index only when shard filenames change (new shard added, old one removed). For very large sites, a queue-based approach works well: content changes enqueue "shard N dirty" jobs that batch into regenerations.
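A sketch of that flow, assuming 40,000-article shards; regenerate_shard and the scheduling (cron, worker queue) are placeholders for whatever your stack provides:

```python
# Sketch: route a changed article to its shard and batch regenerations.
# SHARD_SIZE matches the ID-range sharding above.
SHARD_SIZE = 40_000
dirty_shards: set[int] = set()

def shard_for(post_id: int) -> int:
    return post_id // SHARD_SIZE              # floor(post_id / 40000)

def on_post_changed(post_id: int) -> None:
    dirty_shards.add(shard_for(post_id))      # enqueue "shard N is dirty"

def flush_dirty_shards(regenerate_shard) -> None:
    # Run periodically: rewrite only the touched shard files, then rebuild the
    # index only if a shard file was added or removed.
    for shard in sorted(dirty_shards):
        regenerate_shard(shard)               # rewrites sitemap-articles-{shard}.xml
    dirty_shards.clear()
```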
Monitoring at scale
- GSC coverage per sub-sitemap - check weekly; drops below 80% warrant investigation
- Server log analysis - what's Googlebot actually crawling? If it's burning budget on filter URLs, your robots.txt needs work
- Generation time - if your sitemap build starts taking longer than your update cycle, something will break silently
- File size per shard - if any shard approaches 40MB, split it
- Lastmod distribution - plot the histogram. If everything has today's date, your generator is lying
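The lastmod check is easy to script against the generated files. A sketch, assuming the shards are readable from local disk (the glob pattern is a placeholder):

```python
# Sketch: histogram of lastmod dates across all article shards. A healthy large site
# shows a spread of dates; a single spike at today's date means the generator is
# stamping its own run time on every URL.
import glob
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def lastmod_histogram(pattern="sitemaps/sitemap-articles-*.xml") -> Counter:
    counts = Counter()
    for path in glob.glob(pattern):
        for _, el in ET.iterparse(path):
            if el.tag == NS + "lastmod":
                counts[el.text[:10]] += 1     # bucket by date (YYYY-MM-DD)
    return counts

if __name__ == "__main__":
    for day, n in sorted(lastmod_histogram().items()):
        print(f"{day}  {n}")
```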
Common large-site mistakes
- Single sitemap file over 50MB or 50k URLs - Google rejects it entirely
- Full regeneration on every update when incremental would work
- No lastmod, or generator-time lastmod on every URL
- Sub-sitemap URLs mixed across domains (a sitemap must reference only URLs on the same host)
- Robots.txt not updated to point at the sitemap index
- CDN caching the sitemap too aggressively so crawlers see stale content - a quick edge-vs-origin check is sketched after this list
- Sharding by hash instead of by ID range, making incremental regen impossible
- No monitoring - first sign of trouble is GSC rankings dropping weeks later
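For the CDN-staleness item above, one cheap check is to compare the sitemap index the edge serves with the copy at origin. A sketch with placeholder hostnames; adjust to however your origin is reachable directly:

```python
# Sketch: detect a CDN serving a stale sitemap index by comparing edge vs. origin bytes.
import hashlib
import urllib.request

def body_hash(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

edge = body_hash("https://example.com/sitemap.xml")          # what crawlers see
origin = body_hash("https://origin.example.com/sitemap.xml") # what you just generated
if edge != origin:
    print("CDN copy of sitemap.xml differs from origin - check cache TTL / purge on regeneration")
```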
Step-by-Step Fix Guide
- Shard by content type, then by ID range. Target ~40k URLs per file
- Implement accurate lastmod from content update timestamps
- Move to incremental regeneration: rewrite only the shard containing changed content
- Block low-value URL patterns in robots.txt (filters, search, sessions, trackers) - an example is sketched after this list
- Submit the sitemap index (not individual shards) to GSC for per-shard coverage data
- Monitor per-shard coverage weekly. Flag drops below 80%
- Audit Googlebot server logs monthly to confirm crawl budget spends match your priorities
- Set alerts on generation time, file size, and coverage drops
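To make the robots.txt steps concrete, here is an illustrative file; the disallowed patterns and domain are placeholders - block only what your own Googlebot log analysis shows is wasting crawl budget:

```
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?utm_

# Point crawlers at the index, not at individual shard files
Sitemap: https://example.com/sitemap.xml
```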