Page Finder: How to Discover Every URL on a Website
A "page finder" is any tool or technique that produces the complete list of URLs that exist on a website. The output is a flat list — one URL per line — that you can then use for an SEO audit, a content inventory, a migration plan, or competitor research. Page finders sit at the intersection of three older categories: sitemap parsers, web crawlers, and search-operator lookups. Modern page finders combine all three because no single source captures every URL a site actually serves.
What Counts as a "Page"
The definition matters because it changes what your page finder should look for.
Indexable HTML pages. The default meaning — pages a search engine would consider for the index. Excludes redirects, 404s, soft-404s, noindex pages, and admin URLs. This is the right list for an SEO audit.
All responding URLs. Includes everything that returns HTTP 200, regardless of whether it's indexable. Useful for a security audit (what surface area is exposed?) or a migration (what URLs do we need to redirect?).
All declared URLs. Everything in the sitemap plus everything internally linked, regardless of current status. Used for finding dead links and stale references.
Historical URLs. URLs the site used to serve, captured via Wayback Machine, server logs, or backlink databases. Critical for migration planning — you don't want to break URLs that still get traffic from old backlinks.
Pick the definition before you pick the tool. A page finder optimised for "indexable pages" will miss orphaned URLs that a crawler-based tool would catch; a crawler-based tool will miss URLs only declared in the sitemap if no internal link points to them.
The Three Sources Every Page Finder Combines
No single data source is complete. A page finder that's worth using merges these three:
1. The XML sitemap. What the site explicitly declares as its public URL set. Fastest and cleanest source. Misses anything intentionally omitted from the sitemap (drafts, low-priority pages, orphans). Coverage depends entirely on how well the site maintains its sitemap.
2. Internal-link crawl. Following every <a href> from the homepage breadth-first. Finds pages that have internal links but aren't in the sitemap. Slow on large sites and misses orphan pages (pages with no inbound link from anywhere on the site). A minimal crawl sketch appears after this list.
3. Search-engine reverse lookup. Querying Google with site:domain.com, plus path-scoped variants (site:domain.com/blog/, site:domain.com/products/), plus the GSC Pages export when you own the site. Catches indexed orphans that the sitemap omits.
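To make the crawl source concrete, here is a minimal breadth-first sketch in Python. It assumes the third-party requests and BeautifulSoup libraries, caps discovery at a max_pages limit, and skips everything a production crawler needs (robots.txt handling, politeness delays, retries). Treat it as an illustration of the technique, not a ready-made tool.

```python
# Minimal breadth-first internal-link crawl (illustrative sketch only:
# no robots.txt handling, politeness delays, or retries).
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4


def crawl_internal_links(start_url: str, max_pages: int = 500) -> set[str]:
    """Return the set of same-host URLs reachable by following <a href> links."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    found = set()

    while queue and len(found) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if resp.status_code != 200 or "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        found.add(url)

        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))   # resolve relative links, drop #fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    return found
```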
The union of these three minus duplicates is your complete URL list. The intersection is your "definitely live, definitely declared, definitely findable" URL list — useful for prioritising audit work.
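In set terms, the merge is straightforward. The URL sets below are stand-ins rather than real data; in practice they come from the sitemap parse, the crawl, and the GSC export or site: lookups.

```python
# Illustrative merge of the three sources (the example URLs are stand-ins).
sitemap_urls = {"https://example.com/", "https://example.com/pricing"}
crawled_urls = {"https://example.com/", "https://example.com/blog/post-1"}
indexed_urls = {"https://example.com/", "https://example.com/pricing"}

complete_list = sitemap_urls | crawled_urls | indexed_urls   # union: every URL any source knows about
core_list     = sitemap_urls & crawled_urls & indexed_urls   # intersection: declared, linked, and indexed
```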
Page Finder vs Crawler vs Sitemap Tool
The three labels overlap and get used interchangeably in practice, but they emphasise different things: a sitemap tool reads only what the site declares, a crawler discovers only what is internally linked, and a page finder merges both (plus search-engine data) into one deduplicated list.
The right tool depends on the question. Auditing a competitor? A combined page finder is the fastest path. Looking for orphan pages on your own site? You need the crawler diff against the sitemap — see orphan pages for the full workflow.
When to Use a Page Finder
Pre-migration inventory. Before changing your URL structure or moving to a new platform, you need a complete list of current URLs to build the 301 redirect map. Missing pages here cost SEO traffic for years afterwards.
SEO content audit. Before deciding which pages to keep, merge, or noindex, you need to see the whole landscape. Page finders produce the input list that downstream audit tools score and rank.
Competitor research. Knowing exactly which pages your competitor has indexed reveals their content strategy: which clusters they invest in, which topics they neglect, how deep their programmatic SEO goes.
Diagnosing indexing gaps. When GSC shows fewer indexed pages than you expect, a page finder lets you compare the "exists" list to the "indexed" list and find the delta. The pages in the delta are where indexing work needs to happen (see the sketch after this list).
Acquiring a site. Due diligence on a website purchase requires verifying the URL inventory matches what the seller claims. A page finder catches missing or undocumented sections quickly.
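As a sketch of that indexing-gap comparison: assuming the page finder output is a plain text file with one URL per line and the GSC Pages report is exported as a CSV with a URL column (the file names and the column header here are assumptions; adjust to your own export), the delta is a simple set difference.

```python
# Hypothetical comparison of the "exists" list against a GSC Pages export.
import csv

def load_url_list(path: str) -> set[str]:
    """One URL per line: the flat list a page finder produces."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def load_gsc_export(path: str) -> set[str]:
    """URL column of a Search Console Pages export saved as CSV (column name assumed)."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["URL"] for row in csv.DictReader(fh)}

existing = load_url_list("page_finder_urls.txt")
indexed  = load_gsc_export("gsc_pages_export.csv")

indexing_gap = existing - indexed   # exists but not indexed: where indexing work is needed
stale_index  = indexed - existing   # indexed but no longer served: candidates for cleanup
```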
Limits Every Page Finder Hits
JavaScript-rendered routes. Pages that exist only as client-side routes (e.g. some React or Vue SPAs without server-rendered HTML for each route) are often missing from the sitemap, invisible to an HTML-only crawl because there are no server-rendered links to follow, and absent from site: queries because Google itself struggles to index them. A JS-rendering crawler is required.
Auth-gated pages. Anything behind a login wall is invisible to public crawlers and to Google's index by design. A page finder for your own site needs to be run inside the authenticated session if you want logged-in pages included.
Query-string variants. A URL with ?utm_source=x may serve identical content to the base URL. Most page finders deduplicate by canonical, so query-string variants disappear — usually correct, occasionally misleading. A normalisation sketch appears after this list.
Old URLs that 301 elsewhere. A current page finder reports the destination, not the source. To find the historical URLs, use the Wayback Machine or your backlink database's URL list.
Robots.txt-blocked paths. Pages disallowed in robots.txt can still exist and serve content but won't appear in crawl-based finders. The sitemap source catches them if the site declares them anyway (a common misconfiguration: listing a URL in the sitemap while blocking it in robots.txt sends contradictory signals).
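For the query-string point above, one common approach is to normalise URLs by stripping known tracking parameters before deduplicating. The parameter list below is illustrative rather than exhaustive, and real page finders usually prefer the page's declared canonical when one exists.

```python
# Sketch of canonical-style deduplication by stripping common tracking
# parameters; the parameter list is illustrative, not exhaustive.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}

def normalize(url: str) -> str:
    """Drop tracking query parameters so variants collapse onto one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))        # fragment dropped as well

print(normalize("https://example.com/pricing?utm_source=x&plan=pro"))
# -> https://example.com/pricing?plan=pro
```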
Running SitemapFixer as a Page Finder
SitemapFixer combines sitemap parsing, robots.txt parsing, and live HTTP checks. Enter a domain on the homepage and it does the following in order:
1. Locates the sitemap by checking 20+ standard paths and the robots.txt declaration.
2. Parses every URL from the sitemap, including nested sitemap-index files.
3. Groups URLs by section based on URL pattern.
4. Runs live HTTP checks to confirm each URL responds with 200.
The output is a complete, deduplicated, status-checked URL list ready to export. For sites with up to 500 URLs the analysis is free and takes under 60 seconds. Beyond that we offer paid plans for larger sites.
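For readers who want to see the shape of that pipeline, here is a minimal Python sketch of the same steps. It is not SitemapFixer's actual code: it checks only two fallback sitemap paths rather than 20+, does no grouping or rate limiting, and assumes the requests library.

```python
# Minimal sketch: find the sitemap, expand nested sitemap indexes,
# then status-check every URL. Illustrative only, not SitemapFixer's code.
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

import requests   # third-party: pip install requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def find_sitemaps(domain: str) -> list[str]:
    """Prefer robots.txt Sitemap: declarations, then fall back to common paths."""
    base = f"https://{domain}/"
    try:
        robots = requests.get(urljoin(base, "robots.txt"), timeout=10)
        declared = [line.split(":", 1)[1].strip()
                    for line in robots.text.splitlines()
                    if line.lower().startswith("sitemap:")]
    except requests.RequestException:
        declared = []
    return declared or [urljoin(base, "sitemap.xml"), urljoin(base, "sitemap_index.xml")]

def parse_sitemap(url: str) -> set[str]:
    """Return all page URLs, recursing into <sitemapindex> files."""
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return set()
    root = ET.fromstring(resp.content)
    if root.tag.endswith("sitemapindex"):
        urls = set()
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls |= parse_sitemap(loc.text.strip())
        return urls
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)}

def status_check(urls: set[str]) -> dict[str, int]:
    """HEAD each URL and record the response code (0 on network failure)."""
    results = {}
    for url in sorted(urls):
        try:
            results[url] = requests.head(url, allow_redirects=False, timeout=10).status_code
        except requests.RequestException:
            results[url] = 0
    return results

all_urls = set()
for sm in find_sitemaps("example.com"):
    all_urls |= parse_sitemap(sm)
statuses = status_check(all_urls)
live = [u for u, code in statuses.items() if code == 200]
```

HEAD requests keep the check fast, but some servers answer HEAD differently from GET, which is one reason a real checker falls back to a GET request for anything that doesn't return 200.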
If you need crawl-based discovery (orphans, JS-rendered routes), pair SitemapFixer with a desktop crawler like Screaming Frog — start with our page list, then run the crawler to add anything we missed. The two sources are complementary, and SitemapFixer is much faster as the first pass.