The Sitemap Pre-Launch Checklist: 12 Things to Verify Before Going Live
A sitemap error discovered after launch costs significantly more to fix than one caught before. Google may have already crawled and cached incorrect information, canonical signals may have been diluted, and Google Search Console (GSC) error reports lag behind fixes, so problems take time to clear even after they are resolved. This checklist is built around the most common and most impactful sitemap problems — the ones that regularly appear in post-migration audits and post-launch technical reviews.
Run through these 12 items before every new site launch and before every major content restructure, CMS migration, or domain move. Each item is specific and checkable in under five minutes.
Sitemap Is Declared in robots.txt
Open your robots.txt file and confirm it contains a Sitemap: directive with the full absolute URL to your sitemap index. Example: Sitemap: https://example.com/sitemap.xml. This is how search engines discover your sitemap without being told about it directly. If you have multiple sitemap indexes (e.g., one for the main site and one for a subdomain), each needs its own Sitemap: line. Verify the URL in robots.txt actually resolves — a typo here means Googlebot finds nothing.
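This check is easy to script. Below is a minimal sketch using only the Python standard library; example.com is a placeholder for your production domain:

    import urllib.request

    ROBOTS_URL = "https://example.com/robots.txt"  # placeholder domain

    # Collect every Sitemap: directive (the field name is case-insensitive).
    with urllib.request.urlopen(ROBOTS_URL) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    declared = [
        line.split(":", 1)[1].strip()
        for line in lines
        if line.lower().startswith("sitemap:")
    ]

    assert declared, "No Sitemap: directive found in robots.txt"

    # Each declared URL must actually resolve; urlopen raises on 4xx/5xx.
    for sitemap_url in declared:
        status = urllib.request.urlopen(sitemap_url).status
        print(f"{sitemap_url} -> HTTP {status}")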
No Staging or Development URLs in the Sitemap
Fetch your sitemap and scan every URL for domain mismatches. Your production sitemap should contain only production domain URLs. Common failure modes: a database restore that carried over staging absolute URLs, a find-and-replace during migration that missed URL fields in certain post types, or a multi-environment CMS that writes environment-specific URLs into the sitemap at generation time. Run a simple grep for your staging domain (e.g., staging., .dev., localhost) across your full sitemap output. Zero occurrences is the only acceptable result.
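The grep can run against the live sitemap output rather than source files. A sketch of that scan; the staging markers listed are illustrative, so extend them with your own environment hostnames:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
    MARKERS = ("staging.", ".dev.", "localhost", "127.0.0.1")  # extend as needed
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    # Iterating over <loc> works for plain sitemaps and sitemap indexes alike.
    locs = [el.text.strip() for el in tree.iter(f"{NS}loc")]

    leaks = [u for u in locs if any(m in u for m in MARKERS)]
    print(f"{len(locs)} URLs scanned, {len(leaks)} staging leaks found")
    for u in leaks:
        print("LEAK:", u)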
All Sitemap URLs Return HTTP 200
Fetch every URL listed in your sitemap and verify the final response code is 200. Any URL returning 301 (redirect), 404 (not found), 410 (gone), or 5xx (server error) should be removed from the sitemap. Redirects waste crawl budget and create ambiguity about which URL Google should assign ranking signals to. 404s tell Googlebot your sitemap is unreliable. If a redirect destination is the intended indexable page, update the sitemap to point directly to that final URL.
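When scripting this check, the important detail is not following redirects silently, so a 301 is reported as a 301 rather than as its destination's 200. A sketch along those lines, with placeholder URLs:

    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap-pages.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        """Surface 3xx responses as errors instead of following them."""
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    opener = urllib.request.build_opener(NoRedirect)

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    for url in (el.text.strip() for el in tree.iter(f"{NS}loc")):
        try:
            code = opener.open(url).status
        except urllib.error.HTTPError as e:
            code = e.code  # 301/404/410/5xx all land here
        if code != 200:
            print(f"REMOVE OR FIX: {url} -> HTTP {code}")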
No Noindex Conflict — Every Sitemap URL Is Indexable
For each URL in your sitemap, check its robots meta tag. Any page with <meta name="robots" content="noindex"> or an X-Robots-Tag: noindex response header must be removed from the sitemap. Sitemap + noindex is a direct contradiction. Google resolves it by respecting noindex and ignoring your sitemap instruction — but it still spends crawl budget discovering and re-evaluating those pages. This conflict is especially common after CMS migrations where noindex settings were applied to temporary pages and never cleaned up.
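Both signals are checkable in a single request per page: the X-Robots-Tag response header and the robots meta tag in the HTML. A sketch using the standard library's HTMLParser; the sitemap URL is a placeholder:

    import urllib.request
    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    SITEMAP_URL = "https://example.com/sitemap-pages.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    class RobotsMetaFinder(HTMLParser):
        """Collects content= values from <meta name="robots"> tags."""
        def __init__(self):
            super().__init__()
            self.directives = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name", "").lower() == "robots":
                self.directives.append(a.get("content", "").lower())

    def is_noindexed(url: str) -> bool:
        resp = urllib.request.urlopen(url)
        # Header form: X-Robots-Tag: noindex blocks indexing like the meta tag.
        if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
            return True
        finder = RobotsMetaFinder()
        finder.feed(resp.read().decode("utf-8", errors="replace"))
        return any("noindex" in d for d in finder.directives)

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    for url in (el.text.strip() for el in tree.iter(f"{NS}loc")):
        if is_noindexed(url):
            print("CONFLICT (noindexed URL in sitemap):", url)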
Canonical Tags Are Consistent with Sitemap URLs
Every URL in your sitemap should have a self-referential canonical — a canonical tag pointing to itself. If a sitemap URL has a canonical pointing to a different URL, you are telling Google: "crawl this page, but index that one." Either update the canonical to be self-referential, or remove the URL from the sitemap and replace it with the canonical destination. Also check for protocol mismatches: if your sitemap lists https://example.com/page but the canonical on that page reads http://example.com/page, that is a conflict.
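Scripted, the comparison is one request per page: extract the canonical href and compare it character for character against the sitemap entry. A sketch, with placeholder URLs:

    import urllib.request
    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    SITEMAP_URL = "https://example.com/sitemap-pages.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    class CanonicalFinder(HTMLParser):
        """Captures the href of the first <link rel="canonical"> tag."""
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("rel", "").lower() == "canonical":
                if self.canonical is None:
                    self.canonical = a.get("href")

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    for url in (el.text.strip() for el in tree.iter(f"{NS}loc")):
        finder = CanonicalFinder()
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        finder.feed(html)
        # Self-referential means an exact match: protocol, host, path,
        # and trailing slash all count.
        if finder.canonical != url:
            print(f"MISMATCH: sitemap={url} canonical={finder.canonical}")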
Sitemap Index Used If Total URLs Exceed 50,000
A single XML sitemap file cannot contain more than 50,000 URLs or exceed 50MB uncompressed. If your site is approaching or exceeding this limit, you need a sitemap index file that references multiple child sitemaps. Structure it by content type (e.g., sitemap-posts.xml, sitemap-pages.xml, sitemap-products.xml) rather than by arbitrary URL ranges — this makes maintenance easier and gives Google cleaner signals about content segmentation. Submit the sitemap index itself to GSC rather than each individual child sitemap.
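A quick validation pass over the index can confirm every child sitemap stays within both limits. A sketch, assuming uncompressed XML children and a placeholder index URL:

    import urllib.request
    import xml.etree.ElementTree as ET

    INDEX_URL = "https://example.com/sitemap.xml"  # placeholder sitemap index
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    index = ET.parse(urllib.request.urlopen(INDEX_URL))
    children = [el.text.strip() for el in index.iter(f"{NS}loc")]

    for child in children:
        # Assumes uncompressed XML; decompress .xml.gz children with gzip first.
        raw = urllib.request.urlopen(child).read()
        count = len(ET.fromstring(raw).findall(f"{NS}url"))
        size_mb = len(raw) / (1024 * 1024)
        ok = count <= 50_000 and size_mb <= 50
        print(f"{child}: {count} URLs, {size_mb:.1f} MB -> {'OK' if ok else 'OVER LIMIT'}")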
lastmod Dates Are Accurate and Meaningful
The lastmod field should reflect when the page content was meaningfully last changed — not when a plugin regenerated the sitemap, not when a minor widget was updated, not the current timestamp applied to every URL on generation. If your CMS writes today's date as lastmod for every URL every time the sitemap regenerates, remove the lastmod field entirely. Inaccurate lastmod dates train Google to distrust your sitemap signals, which undermines recrawl prioritization for pages that are genuinely updated.
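One useful heuristic: count how many URLs share the same lastmod date. If a single date covers nearly everything, the field is almost certainly stamped at generation time. A sketch, with a placeholder sitemap URL and an arbitrary 95% threshold:

    import urllib.request
    import xml.etree.ElementTree as ET
    from collections import Counter

    SITEMAP_URL = "https://example.com/sitemap-posts.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    # Keep only the YYYY-MM-DD portion of each W3C datetime value.
    dates = Counter(el.text.strip()[:10] for el in tree.iter(f"{NS}lastmod"))

    if dates:
        top_date, top_count = dates.most_common(1)[0]
        total = sum(dates.values())
        # One date shared by (nearly) all URLs usually means the CMS stamps
        # lastmod on every regeneration; in that case drop the field.
        if top_count / total > 0.95:
            print(f"SUSPICIOUS: {top_count}/{total} URLs share lastmod {top_date}")
    else:
        print("No lastmod fields present (acceptable).")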
Image and Video Sitemaps Are Present If Applicable
If your site has image-heavy pages (photography, e-commerce product images, editorial images that should appear in Google Image search), an image sitemap or image sitemap extensions help Google discover images that may not be crawlable from HTML alone. Similarly, if you publish video content you want indexed in Google Video search, a video sitemap carrying the required metadata (title, description, thumbnail URL, and a content URL or player URL) is necessary. These are separate sitemaps or extensions on your existing page sitemap — confirm they are referenced from your sitemap index before launch.
Sitemap Is Submitted to Google Search Console
Publishing a sitemap is not the same as submitting it. In GSC, navigate to Sitemaps under the Indexing section and submit your sitemap index URL explicitly. This triggers an immediate fetch and gives you access to error reports, URL count data, and last-read timestamps. After submission, check back within 48 hours: a status of "Success" with a URL count close to your expected total means GSC can read and parse your sitemap correctly. A "Couldn't fetch" or "Has errors" status means something is wrong with the file itself.
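If you want submission scripted as part of a deploy pipeline, the Search Console API exposes sitemap submission and status. A sketch assuming google-api-python-client and a service account already granted access to the property; the key file, property URL, and sitemap URL are placeholders:

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    SITE = "https://example.com/"                # GSC property (placeholder)
    SITEMAP = "https://example.com/sitemap.xml"  # sitemap index (placeholder)

    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",  # placeholder key file
        scopes=["https://www.googleapis.com/auth/webmasters"],
    )
    service = build("webmasters", "v3", credentials=creds)

    # Submit the sitemap index, then read back its last-fetch status.
    service.sitemaps().submit(siteUrl=SITE, feedpath=SITEMAP).execute()
    status = service.sitemaps().get(siteUrl=SITE, feedpath=SITEMAP).execute()
    print(status.get("lastDownloaded"), status.get("errors"), status.get("warnings"))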
No Critical Resources Blocked in robots.txt
A common pre-launch mistake: robots.txt is set to Disallow: / during development, and nobody removes it before go-live. Check your robots.txt allows Googlebot to crawl your pages, CSS files, JavaScript files, and images. Blocking CSS or JS prevents Google from rendering your pages correctly, which affects how your content is understood. Use GSC's URL Inspection tool on a sample of key pages and check the rendered screenshot — if it looks broken or blank, something Googlebot needs to render the page is blocked.
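The standard library's robotparser can confirm that the live robots.txt leaves Googlebot free to fetch sample pages and render-critical assets. A sketch; the sample paths are illustrative:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder
    rp.read()

    # A few key pages plus the CSS/JS Google needs to render them.
    samples = [
        "https://example.com/",
        "https://example.com/products/",
        "https://example.com/assets/site.css",
        "https://example.com/assets/app.js",
    ]
    for url in samples:
        verdict = "OK" if rp.can_fetch("Googlebot", url) else "BLOCKED"
        print(f"{verdict}: {url}")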
HTTPS Consistency — All URLs Use the Same Protocol
Every URL in your sitemap must use HTTPS if your site is on HTTPS. Mixed-protocol sitemaps (some http://, some https://) create duplicate-content signals and dilute ranking authority. Beyond the sitemap, verify that your canonical tags, internal links, hreflang attributes (if applicable), and Open Graph tags all use HTTPS consistently. A quick way to check: curl -I http://example.com/any-page — it should return a 301 to the HTTPS version. If it returns 200, your HTTP URLs are live and may be indexed separately.
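The curl spot-check generalizes to every sitemap URL: request the http:// variant without following redirects and expect a 301 pointing at the exact https:// twin. A sketch:

    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap-pages.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    opener = urllib.request.build_opener(NoRedirect)

    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    for url in (el.text.strip() for el in tree.iter(f"{NS}loc")):
        http_url = url.replace("https://", "http://", 1)
        try:
            resp = opener.open(http_url)
            # 200 on plain HTTP means a live, separately indexable duplicate.
            print(f"LIVE HTTP DUPLICATE: {http_url} -> {resp.status}")
        except urllib.error.HTTPError as e:
            if e.code != 301 or e.headers.get("Location") != url:
                print(f"CHECK: {http_url} -> {e.code} {e.headers.get('Location')}")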
URL Count in Sitemap Matches Expected Page Count
Before launch, establish your expected indexable page count: total published pages minus any intentionally noindexed pages, paginated pages beyond page 1 (if canonicalized to page 1), utility pages (login, thank-you, etc.), and any pages excluded by policy. Compare this to the URL count GSC reports for your submitted sitemap. A significantly lower count means your CMS is excluding or filtering out pages you may want included. A significantly higher count means content you did not intend to expose is being included — often draft pages, auto-generated archive URLs, or attachment pages.
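The sitemap side of the comparison is mechanical once the index is parsed: tally loc entries across every child sitemap. A sketch; the expected count is a placeholder for your own figure:

    import urllib.request
    import xml.etree.ElementTree as ET

    INDEX_URL = "https://example.com/sitemap.xml"  # placeholder index
    EXPECTED = 1240  # your computed indexable page count (placeholder)
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    index = ET.parse(urllib.request.urlopen(INDEX_URL))
    total = 0
    for child in (el.text.strip() for el in index.iter(f"{NS}loc")):
        tree = ET.parse(urllib.request.urlopen(child))
        total += len(tree.findall(f"{NS}url"))

    drift = (total - EXPECTED) / EXPECTED * 100
    print(f"sitemap={total} expected={EXPECTED} drift={drift:+.1f}%")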
This checklist covers the most common sitemap problems encountered across CMS migrations, site relaunches, and plugin updates. The items that cause the most indexing damage in practice are the ones that are easiest to overlook: staging URLs leaking into production sitemaps, noindex flags left on from development, and robots.txt Disallow rules that were set during build and never removed.
For ongoing monitoring rather than one-time pre-launch verification, the most valuable investment is connecting your sitemap to Google Search Console and reviewing the Coverage report weekly. New errors in GSC almost always indicate something changed — a plugin update, a content editor action, or a server configuration change — and catching them quickly limits the indexing damage.