Fix Your Sitemap for Webflow
Webflow auto-generates a sitemap, but CMS collections, paginated collection lists, and staging subdomains routinely cause issues that hurt indexing and crawl efficiency.
Webflow's sitemap comes in one of two modes: auto-generated (every public page) or custom (paste your own XML and maintain it by hand). The only granularity in auto mode is a per-page "Exclude from sitemap" toggle - there are no filter patterns or URL rules. Beyond that, you either accept what Webflow produces or you hand-write the file.
Fixed a Webflow agency site with 210 pages published but 680 URLs in the auto sitemap. The extras were paginated CMS Collection List URLs (?page=2 through ?page=14) that Webflow had emitted as separate URLs, plus four Collection template pages the team had duplicated and forgotten. Switching to a custom sitemap with just canonical URLs fixed coverage in GSC within three weeks.
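An audit like that one is easy to script. Here's a minimal sketch in Python that splits a sitemap's URLs into canonical and paginated buckets - the example.com sitemap is a stand-in; in practice you'd fetch your own sitemap.xml first:

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(xml_text):
    """Return (canonical, paginated) URL lists from a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    # Collection List pagination shows up as ?page=N query strings.
    paginated = [u for u in urls if re.search(r"[?&]page=\d+", u)]
    canonical = [u for u in urls if u not in paginated]
    return canonical, paginated

# Sample sitemap with one paginated Collection List URL mixed in.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog</loc></url>
  <url><loc>https://example.com/blog?page=2</loc></url>
</urlset>"""

canonical, paginated = audit_sitemap(sample)
print(len(canonical), len(paginated))  # 2 1
```

On the agency site above, the same two-line tally made the 210-vs-680 gap obvious at a glance.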
Common Webflow Sitemap Issues
- Staging subdomain (yoursite.webflow.io) URLs leaking into indexing on free plans
- CMS Collection pagination URLs (?page=2) included inconsistently
- Draft CMS items appearing if set to Published before content was ready
- Utility pages (404, password, style guide) listed despite being marked noindex
- Missing lastmod values - Webflow uses the site's publish time, not item update time
- Collection template pages appearing without their parent list
- Site published without sitemap regeneration (Webflow only rebuilds on publish)
- Custom sitemap getting out of sync with actual pages over time
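The last issue - a hand-maintained sitemap drifting away from the real page set - can be caught mechanically. A hypothetical drift check (function name and URLs are illustrative; you'd feed it the URL lists from your sitemap and your published pages):

```python
def sitemap_drift(sitemap_urls, published_urls):
    """Compare a hand-maintained sitemap against the real published URL set."""
    sitemap, published = set(sitemap_urls), set(published_urls)
    return {
        "stale": sorted(sitemap - published),    # listed in sitemap, page is gone
        "missing": sorted(published - sitemap),  # published, but never listed
    }

drift = sitemap_drift(
    ["https://example.com/", "https://example.com/old-page"],
    ["https://example.com/", "https://example.com/new-page"],
)
print(drift)  # {'stale': ['https://example.com/old-page'], 'missing': ['https://example.com/new-page']}
```

Run it after every publish if you're in custom-sitemap mode; an empty result on both keys means the file is still in sync.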
The webflow.io staging leak
Free-plan Webflow sites expose the *.webflow.io staging subdomain without noindex. Paid plans add the noindex header automatically. Either way, you should connect a custom domain as soon as possible. Once the custom domain is primary, Webflow serves the sitemap from the custom domain and drops .webflow.io from search results over a few weeks.
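To confirm which case you're in, inspect the staging response for an X-Robots-Tag header. A small sketch - the header-dict check is the testable part; the live request is commented out because it needs network access and a real staging host:

```python
def staging_is_noindexed(headers):
    """True if an X-Robots-Tag header value contains 'noindex'."""
    return "noindex" in headers.get("X-Robots-Tag", "").lower()

# Live check (requires network; yoursite.webflow.io is a placeholder):
# import urllib.request
# resp = urllib.request.urlopen("https://yoursite.webflow.io")
# print(staging_is_noindexed(dict(resp.headers)))

print(staging_is_noindexed({"X-Robots-Tag": "noindex"}))    # True
print(staging_is_noindexed({"Content-Type": "text/html"}))  # False
```

`curl -I https://yoursite.webflow.io` gives you the same answer from the command line.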
Robots.txt workaround
Webflow lets you edit robots.txt at Project Settings > SEO > robots.txt. Keep in mind that robots.txt only blocks crawling, not indexing - URLs Google already knows about can stay in results even when disallowed - so treat these rules as a complement to the staging noindex header, not a replacement for it:
# Custom domain robots.txt
User-agent: *
Disallow: /401
Disallow: /404
Disallow: /style-guide
Disallow: /detail_*
Disallow: /*?page=
Sitemap: https://yourdomain.com/sitemap.xml
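Before publishing rules like these, it's worth sanity-checking which paths they actually match. Python's built-in robotparser doesn't understand Google's * and $ extensions, so this sketch implements the wildcard matching by hand (the rule list mirrors the file above):

```python
import re

def rule_matches(rule, path):
    """Google-style robots rule match: '*' is a wildcard, '$' anchors the end."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

RULES = ["/401", "/404", "/style-guide", "/detail_*", "/*?page="]

def is_disallowed(path):
    return any(rule_matches(r, path) for r in RULES)

print(is_disallowed("/blog?page=2"))    # True
print(is_disallowed("/blog"))           # False
print(is_disallowed("/detail_item-7"))  # True
```

Note that robots rules are prefix matches, so /404 also blocks paths like /404-page - fine here, but worth knowing before you add shorter rules.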
CMS Collections and pagination
Webflow paginates Collection Lists at 100 items by default. The pagination URLs (?page=2, ?page=3) get crawled but shouldn't be indexed - they're essentially duplicate list pages with different contents. Block *?page= in robots.txt (above); adding rel=next/prev via Custom Code is optional at best, since Google stopped using those signals for indexing in 2019, though other crawlers may still read them. For CMS item detail pages, use the per-item "Exclude from sitemap" toggle only on items that really shouldn't be indexed (drafts, internal resources).
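If you do fall back to custom-sitemap mode, generating the XML from a canonical URL list keeps the file reproducible instead of hand-edited. A minimal sketch - the URLs and the single shared lastmod value are illustrative; a real site would pull per-item update times from the Webflow CMS API:

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls, lastmod=None):
    """Serialize canonical URLs into a sitemap.xml string."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = u
        if lastmod:
            # Real item update time beats Webflow's publish-time lastmod.
            ET.SubElement(node, "lastmod").text = lastmod
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(urlset, encoding="unicode")

xml_out = build_sitemap(
    ["https://example.com/", "https://example.com/blog"],
    lastmod="2024-05-01",
)
print(xml_out.count("<url>"))  # 2
```

Regenerate and re-paste this output on every publish so the custom sitemap can't drift.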
Step-by-Step Fix Guide
- Connect a custom domain and set it as primary in Project Settings > Hosting
- In Project Settings > SEO, enable Auto-generate sitemap.xml and verify base URL
- Add robots.txt rules blocking ?page=, /401, /404, /style-guide
- Mark utility pages as "Exclude from sitemap" in Page Settings > SEO
- Per CMS item: toggle "Exclude from sitemap" for anything not ready
- Publish the site (Webflow only regenerates sitemap.xml on publish)
- Verify with curl https://yourdomain.com/sitemap.xml - spot-check URL count
- Confirm staging returns noindex: curl -I https://yoursite.webflow.io
- Submit the custom-domain sitemap to Google Search Console
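The verification steps can be folded into one script that flags anything that shouldn't be in the final sitemap. A hedged sketch - yourdomain.com is a placeholder, and you'd feed it the loc values parsed from your live sitemap:

```python
import re

def verify(sitemap_urls, primary_host="yourdomain.com"):
    """Return human-readable problems found in the final sitemap URL list."""
    problems = []
    for u in sitemap_urls:
        if ".webflow.io" in u:
            problems.append(f"staging URL leaked: {u}")
        if re.search(r"[?&]page=\d+", u):
            problems.append(f"pagination URL present: {u}")
        if primary_host not in u:
            problems.append(f"off-domain URL: {u}")
    return problems

print(verify(["https://yourdomain.com/", "https://yourdomain.com/blog"]))  # []
```

An empty list means the sitemap is clean and ready to submit to Search Console.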