Fix Your Sitemap for Ruby on Rails
Rails apps almost always use the sitemap_generator gem to emit sitemap.xml.gz. The usual failure points are an incorrect default_host, missing updated_at on ActiveRecord models, and forgetting to upload the files to S3 or a CDN on ephemeral-filesystem hosts.
The canonical Rails sitemap pattern: a config/sitemap.rb file that uses find_each over your ActiveRecord models, a rake sitemap:refresh task run on cron (or Sidekiq Cron), and the output stored either on disk or in S3. Each piece is simple. Skip one and you ship a broken sitemap.
I debugged a Rails SaaS on Heroku last year. The sitemap was effectively empty: the index was served, but it pointed at two shards that didn't exist. The rake task was running fine, but in the release dyno, which got torn down immediately afterward. The generated files went with it. I switched to the S3 adapter, kept the files in a bucket, and put CloudFront in front. Problem vanished.
Working config/sitemap.rb
SitemapGenerator::Sitemap.default_host = 'https://example.com'
SitemapGenerator::Sitemap.sitemaps_host = 'https://d1234.cloudfront.net'
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(
  ENV['S3_BUCKET'],
  aws_access_key_id: ENV['AWS_KEY'],
  aws_secret_access_key: ENV['AWS_SECRET'],
  aws_region: 'us-east-1',
  acl: 'public-read'
)

SitemapGenerator::Sitemap.create do
  add '/', changefreq: 'daily', priority: 1.0
  add '/about', changefreq: 'monthly'
  add '/pricing', changefreq: 'monthly'

  # Blog posts
  Post.where(published: true)
      .where('published_at <= ?', Time.current)
      .find_each(batch_size: 1000) do |post|
    add post_path(post),
        lastmod: post.updated_at,
        changefreq: 'weekly',
        priority: 0.7
  end

  # Products with images
  Product.active.find_each(batch_size: 1000) do |product|
    add product_path(product),
        lastmod: product.updated_at,
        images: product.images.map { |img|
          { loc: img.url, title: product.name }
        }
  end
end

Common Rails Sitemap Issues
- default_host not set, sitemap emits relative URLs
- Missing updated_at so lastmod falls back to nil or Time.now
- Heroku deploys losing sitemap.xml.gz on dyno restart (ephemeral FS)
- Scoped queries missing published_at <= Time.current - drafts leak out
- Admin/internal routes exposed because they weren't excluded from the sitemap
- Non-batched queries (.each instead of .find_each) causing OOM on large tables
- Sitemap generator crashing because an image URL was nil and the gem expected a string
- Robots.txt pointing at /sitemap.xml but actual file served at /sitemaps/sitemap.xml.gz
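The nil image URL crash in the list above can be avoided by filtering images before they reach add. The following is a minimal sketch, not the gem's own API: sitemap_images is a hypothetical helper name, and the Image and Product structs are stand-ins for the ActiveRecord models so the example runs in plain Ruby.

```ruby
# Hypothetical stand-ins for the ActiveRecord models, so the sketch runs without Rails.
Image   = Struct.new(:url)
Product = Struct.new(:name, :images)

# Build the value for the images: option, skipping any image whose URL is nil or blank.
# sitemap_images is an illustrative helper name, not part of sitemap_generator.
def sitemap_images(product)
  product.images.filter_map do |img|
    url = img.url.to_s.strip
    next if url.empty? # filter_map drops the nil returned by `next`

    { loc: url, title: product.name }
  end
end

product = Product.new(
  "Desk Lamp",
  [Image.new("https://cdn.example.com/lamp.jpg"), Image.new(nil), Image.new("  ")]
)

# Only the first image survives; the nil and whitespace-only URLs are dropped.
p sitemap_images(product)
```

In config/sitemap.rb the call would then read images: sitemap_images(product) in place of the inline map, assuming the helper is defined somewhere the config file can see it.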
Heroku, Fly.io, and ephemeral filesystems
If you're on any platform where the filesystem doesn't persist between restarts (Heroku, Fly.io, Render without a volume), you must upload generated files somewhere durable. sitemap_generator has built-in adapters for AWS S3 and Google Cloud Storage, and its S3 adapter can be pointed at S3-compatible services such as Wasabi through a custom endpoint. Configure one of them, point sitemaps_host at your CDN, and add a redirect in routes.rb from /sitemap.xml to the CDN URL so Google can find it at the canonical location.
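The redirect described above might look like the fragment below. It is a sketch under assumptions: the CDN host is the d1234.cloudfront.net placeholder from the config earlier, and the index is assumed to be uploaded under the gem's default sitemap.xml.gz filename at the bucket root. Adjust both to match your setup.

```ruby
# config/routes.rb (fragment; merge these lines into your existing routes block)
Rails.application.routes.draw do
  # Assumed CDN location of the generated index; the host is a placeholder.
  cdn_sitemap = "https://d1234.cloudfront.net/sitemap.xml.gz"

  # Send both the plain and the gzipped canonical paths to the CDN copy.
  get "/sitemap.xml",    to: redirect(cdn_sitemap, status: 301)
  get "/sitemap.xml.gz", to: redirect(cdn_sitemap, status: 301)
end
```

Keeping the redirect in the app means robots.txt and Search Console can keep pointing at your own domain even if the bucket or CDN host changes later.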
Scheduling
# Sidekiq Cron (config/schedule.yml or initializer)
sitemap_refresh:
  cron: "0 * * * *" # hourly
  class: "SitemapRefreshJob"
  queue: low

# Or with whenever gem (config/schedule.rb)
every 1.hour do
  rake "sitemap:refresh"
end

# Or a plain system cron on a worker dyno / server
0 * * * * cd /app && bundle exec rake sitemap:refresh
Step-by-Step Fix Guide
- Add gem 'sitemap_generator', bundle, run rake sitemap:install
- Set default_host in config/sitemap.rb
- Ensure every model has updated_at; pass lastmod: obj.updated_at to every add
- Scope queries to published records and use find_each(batch_size: 1000)
- On ephemeral hosts, configure the S3 (or GCS) adapter and a CDN
- Schedule rake sitemap:refresh hourly via Sidekiq Cron, whenever, or system cron
- Verify with curl -L https://yoursite.com/sitemap.xml.gz | zcat | head
- Submit to Google Search Console
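The verify step can also be scripted, which catches relative URLs that are easy to miss when skimming head output. The sketch below uses only Ruby's standard library. The helper names sitemap_locs and relative_locs are made up for illustration, and the regex-based extraction is an assumption that holds for simple generated sitemaps rather than a full XML parse.

```ruby
require "zlib"

# Decompress a gzipped sitemap and return every <loc> value as a string.
# Matches only plain <loc> tags, so <image:loc> entries are not included.
def sitemap_locs(gz_bytes)
  xml = Zlib.gunzip(gz_bytes).force_encoding(Encoding::UTF_8)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
end

# Return the entries that are not absolute http(s) URLs.
def relative_locs(locs)
  locs.reject { |loc| loc.match?(%r{\Ahttps?://}) }
end

# Stand-in data built in memory; in practice, read the file saved by
# `curl -L https://yoursite.com/sitemap.xml.gz -o sitemap.xml.gz`
# with File.binread("sitemap.xml.gz").
sample = Zlib.gzip(<<~XML)
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>/pricing</loc></url>
  </urlset>
XML

locs = sitemap_locs(sample)
puts "#{locs.size} URLs found, #{relative_locs(locs).size} relative"
relative_locs(locs).each { |loc| puts "relative: #{loc}" }
```

The same helpers work on a sitemap index, since index entries also use loc tags, so each shard URL can be fetched and checked in turn.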