How to Perform a Technical SEO Audit
A technical SEO audit is not a checklist you tick through in order; it is an investigation that follows the data. The workflow below is the one I run on any new site, in roughly this sequence, because each step produces information the next step needs. By the end you will have a prioritised list of issues, ticket-ready evidence for engineering, and a one-page summary you can hand to a non-technical stakeholder. Total time for a mid-size site (5,000-50,000 URLs) is one to two working days.
Step 1: Crawl the Site
Start with a full crawl. Without this, every other step is guesswork. The two industry-standard crawlers are Screaming Frog SEO Spider and Sitebulb; either works. Configure the crawler to respect robots.txt the way Googlebot does, render JavaScript if your site is client-rendered, and follow internal links to depth 10 or more.
What you are looking for in the crawl output: total URL count by status code, response time distribution, redirect chains, orphan pages (URLs in the sitemap but not linked internally, or vice versa), and basic on-page issues like missing titles, duplicate titles, and empty meta descriptions. Export every report to CSV so the rest of the audit has a single source of truth.
Compare the crawl URL count to three other numbers: the number of URLs in your XML sitemap, the number of pages indexed in Google Search Console, and the number of URLs receiving organic traffic in the last 90 days. Discrepancies between these four numbers are the most reliable single indicator of where the audit will find issues.
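A quick way to line the four numbers up, assuming each list has been exported to a plain-text file with one URL per line (the file names here are placeholders; sitemap-urls.txt is the same file the later steps script against):
# Placeholder inputs: crawler export, sitemap URL list, GSC indexed export, analytics organic-landing-page export
for f in crawl-urls.txt sitemap-urls.txt gsc-indexed.txt organic-urls.txt; do
printf "%-20s %s\n" "$f" "$(sort -u "$f" | wc -l)"
done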
Step 2: Reconcile GSC Indexation vs Crawl
Open Google Search Console and go to Indexing > Pages. Note the "Indexed" count and the "Not indexed" count. Compare each to the crawl total:
Indexed > Crawl: Google has discovered URLs your crawler did not. Likely causes: parameter URLs, old URLs that still exist, or URLs linked from external sites that you forgot about. Export the indexed list and diff it against the crawl (see the sketch after this list).
Indexed < Crawl: Google has chosen not to index some of your pages. Open the "Not indexed" reasons in GSC: "Crawled - currently not indexed", "Discovered - currently not indexed", and "Duplicate without user-selected canonical" are the three most common categories. Each maps to a different fix.
Indexed roughly equals Crawl: Coverage is healthy. Move on, but keep an eye on the "Not indexed" reason categories - even if the totals match, individual category spikes signal trouble.
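For the Indexed > Crawl and Indexed < Crawl cases, the diff itself is two commands - a minimal sketch reusing the crawl-urls.txt and gsc-indexed.txt placeholders from Step 1, each reduced to one URL per line:
# URLs Google has indexed that the crawler never reached - parameter URLs, legacy URLs, externally linked pages
comm -23 <(sort -u gsc-indexed.txt) <(sort -u crawl-urls.txt) > indexed-not-crawled.txt
# URLs the crawler found that Google has not indexed - cross-check against the "Not indexed" reasons in GSC
comm -13 <(sort -u gsc-indexed.txt) <(sort -u crawl-urls.txt) > crawled-not-indexed.txt
wc -l indexed-not-crawled.txt crawled-not-indexed.txt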
Step 3: Verify robots.txt and sitemap.xml
Both files must return HTTP 200, parse without errors, and contain only the directives you intend. Most issues here are not file presence but file content - an over-aggressive Disallow rule blocking a section that should be crawled, or a sitemap full of 404s and redirected URLs.
# Confirm robots.txt and sitemap.xml exist and return 200
curl -o /dev/null -s -w "robots.txt: %{http_code}\n" \
https://example.com/robots.txt
curl -o /dev/null -s -w "sitemap.xml: %{http_code}\n" \
https://example.com/sitemap.xml
# Pull every URL from the sitemap and check status codes
curl -s https://example.com/sitemap.xml \
| grep -oE '<loc>[^<]+</loc>' \
| sed -E 's/<\/?loc>//g' \
| while read url; do
code=$(curl -o /dev/null -s -w "%{http_code}" "$url")
echo "$code $url"
done | sort | uniq -c -w3
# Find Disallow rules in robots.txt
curl -s https://example.com/robots.txt | grep -E '^Disallow:'
The sitemap should contain only canonical, indexable URLs that return 200. Any other state is a sitemap error. If even 5% of your sitemap URLs return non-200, Google reduces trust in the entire file - meaning your healthy URLs get crawled less frequently.
Step 4: Audit Canonicals
Canonicals are the silent killer of mid-size sites. Every page should have exactly one self-referencing canonical (or a canonical pointing to a deliberate alternative). The main failure modes are zero canonicals, multiple canonicals on the same page, and canonicals pointing at the wrong URL (an HTTP variant, a redirect, a 404) - each leaves Google to guess which version to index.
# Count canonical tags per page (expected: 1)
for url in $(cat sitemap-urls.txt); do
count=$(curl -s "$url" | grep -o 'rel="canonical"' | wc -l)  # count occurrences, not matching lines (handles minified HTML)
echo "$count $url"
done | awk '$1 != 1'
# Find HTTP canonicals on HTTPS pages (a common bug)
for url in $(cat sitemap-urls.txt); do
curl -s "$url" \
| grep -oE 'rel="canonical"[^>]*href="[^"]+"' \
| grep 'href="http://' \
&& echo " on: $url"
done
# Find pages whose canonical points to a redirect or 404
for url in $(cat sitemap-urls.txt); do
canonical=$(curl -s "$url" \
| grep -oE 'rel="canonical"[^>]*href="[^"]+"' \
| grep -oE 'https?://[^"]+' | head -1)
code=$(curl -o /dev/null -s -w "%{http_code}" "$canonical")
[ "$code" != "200" ] && echo "$code $url -> $canonical"
done
For sites larger than a few thousand URLs, scripting this against the full sitemap becomes unwieldy. SitemapFixer is a free sitemap audit tool that pulls canonical, status, and indexability data across the entire sitemap in one pass and flags inconsistencies by pattern - useful for the audit deliverable because it groups issues by root cause rather than by URL.
Step 5: Check hreflang Errors
Skip this section if your site is monolingual. For international sites, hreflang errors are the most under-diagnosed source of lost traffic - users in the wrong country see the wrong page, bounce, and the entire region underperforms.
Check three things: every hreflang annotation must be reciprocal (if /en/ points to /de/, then /de/ must point back to /en/); every targeted URL must return 200; and language-region codes must follow the ISO 639-1 plus ISO 3166-1 alpha-2 format (e.g. en-GB, not en-UK). GSC's International Targeting report (still available in legacy GSC) flags reciprocity and code errors directly. Screaming Frog has a dedicated hreflang tab that exports the same data more cleanly.
If your site has hreflang but no x-default, add one. It tells Google which page to show users whose locale does not match any of your specific targets - usually the global English version.
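Screaming Frog's export is the practical route at scale, but a single template can be spot-checked from the command line. A minimal sketch, assuming hreflang is implemented as <link> elements in the HTML head (not via sitemap or HTTP headers) and taking one representative URL:
# For each hreflang annotation on the page: language code, target status code, and whether the target links back
URL="https://example.com/en/"
curl -s "$URL" \
| grep -oE '<link[^>]+hreflang="[^"]+"[^>]*>' \
| while read -r tag; do
lang=$(echo "$tag" | grep -oE 'hreflang="[^"]+"' | cut -d'"' -f2)
target=$(echo "$tag" | grep -oE 'href="[^"]+"' | cut -d'"' -f2)
code=$(curl -o /dev/null -s -w "%{http_code}" "$target")
reciprocal=$(curl -s "$target" | grep -c "href=\"$URL\"")
echo "$lang status=$code points-back=$reciprocal $target"
done
Treat points-back=0 as a prompt to open the target page, not as proof the return tag is missing - the return annotation may use a trailing-slash or parameter variant this grep will not match.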
Step 6: Core Web Vitals via PageSpeed Insights API
Manual PageSpeed Insights tests are fine for one URL. For an audit covering many templates, use the API and pull field data (real Chrome user data) rather than lab data, because field data is what Google ranks on.
# Pull CWV for representative URLs from the PSI API
API_KEY="your-google-api-key"
URLS=(
"https://example.com/"
"https://example.com/category/laptops"
"https://example.com/products/laptop-x1"
"https://example.com/blog/buyers-guide"
)
for url in "${URLS[@]}"; do
echo "=== $url ==="
curl -s "https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=$url&strategy=MOBILE&key=$API_KEY" \
| jq '.loadingExperience.metrics | {
LCP: .LARGEST_CONTENTFUL_PAINT_MS.percentile,
INP: .INTERACTION_TO_NEXT_PAINT.percentile,
CLS: .CUMULATIVE_LAYOUT_SHIFT_SCORE.percentile
}'
done
# Run Lighthouse CLI for lab data
npm install -g lighthouse
lighthouse https://example.com/ \
--only-categories=performance \
--form-factor=mobile \
--output=json --output-path=./lh-report.json
cat lh-report.json | jq '.audits["largest-contentful-paint"].numericValue'
Targets to remember: LCP under 2.5 seconds, INP under 200 milliseconds, CLS under 0.1, all measured at the 75th percentile of mobile users. If a template fails any of these, that template gets flagged. Audit by template, not by URL - all 5,000 product pages share the same template, so fixing one fixes them all.
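To turn the field-data pull into a pass/fail flag against those thresholds, the jq filter can do the comparison directly. A sketch for a single URL, reusing $API_KEY from above (it assumes field data exists for the URL; the API reports the CLS percentile scaled by 100, hence the division):
# PASS/FAIL against LCP <= 2500 ms, INP <= 200 ms, CLS <= 0.1 (75th-percentile field data)
curl -s "https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://example.com/&strategy=MOBILE&key=$API_KEY" \
| jq -r '.loadingExperience.metrics
| {lcp: .LARGEST_CONTENTFUL_PAINT_MS.percentile,
inp: .INTERACTION_TO_NEXT_PAINT.percentile,
cls: (.CUMULATIVE_LAYOUT_SHIFT_SCORE.percentile / 100)}
| (if .lcp <= 2500 and .inp <= 200 and .cls <= 0.1 then "PASS" else "FAIL" end)
+ " LCP=\(.lcp)ms INP=\(.inp)ms CLS=\(.cls)"'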
Step 7: Mobile-Friendly Check
Google's standalone Mobile-Friendly Test was retired in late 2023, but the underlying signals remain in Lighthouse and the URL Inspection tool inside GSC. Three things to verify per template:
Viewport meta tag: every page should have <meta name="viewport" content="width=device-width, initial-scale=1">. Missing or fixed-width viewports cause Google to mark a page as not mobile-friendly (a spot-check sketch follows this list).
Tap target size: interactive elements (links, buttons) should be at least 48 by 48 CSS pixels with at least 8 pixels of spacing. Lighthouse audits this directly.
Content parity: the mobile version must contain the same content as desktop. Since Google indexes the mobile version exclusively (mobile-first indexing has been the default since 2023), any content hidden or removed on mobile is content Google does not see at all. Screaming Frog can crawl with a mobile user agent - run a second crawl with that setting and diff the results against the desktop crawl.
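The viewport check is easy to script across a sample of URLs. A rough sketch reusing sitemap-urls.txt from Step 4 - it greps the raw HTML, so a MISSING result on a client-rendered page is a prompt to inspect, not proof:
# Flag pages whose raw HTML contains no viewport meta tag
for url in $(head -20 sitemap-urls.txt); do
curl -s "$url" | grep -qiE '<meta[^>]+name="viewport"' \
|| echo "MISSING viewport: $url"
done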
Step 8: HTTPS and Security Audit
HTTPS has been a confirmed ranking signal since 2014, and Chrome marks non-HTTPS pages as "Not Secure". The audit needs to confirm three things: certificate validity, HSTS configuration, and the absence of mixed content.
# Inspect certificate and expiry
echo | openssl s_client -servername example.com \
-connect example.com:443 2>/dev/null \
| openssl x509 -noout -dates -issuer -subject
# Check that HTTP redirects to HTTPS with 301
curl -o /dev/null -s -w "%{http_code} -> %{redirect_url}\n" \
http://example.com/
# Check HSTS header
curl -sI https://example.com/ | grep -i 'strict-transport-security'
# Find mixed content - HTTP resources on HTTPS pages
curl -s https://example.com/ \
| grep -oE '(src|href)="http://[^"]+"' \
| sort -u
If HSTS is missing, add it with a max-age of at least 31536000 (one year). If mixed content exists, every http:// resource reference is a flag - update them all to https:// or to protocol-relative //.
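The mixed-content grep above only inspects the homepage; the same pattern is worth sweeping across a sample of template URLs, since hard-coded http:// references usually live in templates. A sketch reusing sitemap-urls.txt:
# Count hard-coded http:// references per page across a sample of URLs
for url in $(head -50 sitemap-urls.txt); do
hits=$(curl -s "$url" | grep -oE '(src|href)="http://[^"]+"' | wc -l)
[ "$hits" -gt 0 ] && echo "$hits http:// references: $url"
done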
Step 9: Structured Data Validation
Structured data does not directly improve rankings, but it controls eligibility for rich results - which materially affects click-through rate. Audit per template, not per URL: every product page shares one Product schema, every article shares one Article schema. Test one URL per template and any error generalizes.
Use two tools in sequence. First, Schema Markup Validator (validator.schema.org) to confirm syntactic correctness. Second, Rich Results Test (search.google.com/test/rich-results) to confirm Google specifically can use it. The two tools disagree often - the Validator passes generic schema, but Google requires specific properties (like aggregateRating on Product, or author with a sub-type on Article) before it will display rich results.
Common findings: missing required properties, schema referencing prices in the wrong format, dates in non-ISO formats, and review schema applied at the page level when it should be per-product. Each is a one-line template fix that propagates to thousands of URLs.
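Before pasting URLs into the validators, a rough command-line pass shows which schema types each template actually emits. A heuristic sketch that assumes JSON-LD markup written as <script type="application/ld+json"> (it will not see Microdata or RDFa, and the URLs are illustrative):
# List the @type values found in JSON-LD blocks for one URL per template
for url in "https://example.com/products/laptop-x1" "https://example.com/blog/buyers-guide"; do
echo "=== $url ==="
curl -s "$url" \
| tr -d '\n' \
| grep -oE '<script type="application/ld\+json">[^<]*</script>' \
| sed -E 's/<\/?script[^>]*>//g' \
| jq -r '.. | objects | select(has("@type")) | ."@type" | if type == "array" then .[] else . end' 2>/dev/null \
| sort | uniq -c
done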
Step 10: Log File Analysis
Log file analysis is the step most audits skip and the one that produces the most surprising findings. Server access logs tell you what Googlebot actually does - which URLs it requests, how often, with what response codes. The crawler tells you what Googlebot could do; logs tell you what it does.
# Filter Googlebot requests from the last 30 days of access logs
grep -i 'Googlebot' access.log* > googlebot-requests.txt
# Verify the IP actually belongs to Google (reverse DNS check)
awk '{print $1}' googlebot-requests.txt | sort -u | while read ip; do
host=$(host "$ip" | awk '{print $NF}' | sed 's/\.$//')
# reverse DNS must land in a Google-owned domain, and the forward lookup must return the same IP
case "$host" in
*.googlebot.com|*.google.com) ;;
*) echo "FAKE $ip"; continue ;;
esac
forward=$(host "$host" | awk '/has address/ {print $NF}')
[ "$ip" = "$forward" ] && echo "VERIFIED $ip" || echo "FAKE $ip"
done
# Top 20 URLs by Googlebot request count
awk '{print $7}' googlebot-requests.txt \
| sort | uniq -c | sort -rn | head -20
# Status code distribution for Googlebot
awk '{print $9}' googlebot-requests.txt \
| sort | uniq -c | sort -rn
# Crawl waste - URLs Googlebot hits that return 404 or 301
awk '$9 == "404" || $9 == "301" {print $7}' googlebot-requests.txt \
| sort | uniq -c | sort -rn | head -20
What to look for: Googlebot spending crawl budget on parameter URLs, paginated pages no one searches for, old redirected URLs, or 404s that still receive crawl traffic. Every Googlebot request to a non-canonical URL is a request not made to a canonical one. On large sites, recovering 30-50% of wasted crawl budget by blocking or redirecting low-value URLs is common.
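Before deciding what to block, it helps to quantify how much of the waste is parameter URLs and which parameters dominate. A small sketch against the same googlebot-requests.txt extract, assuming the request path sits in field 7 as in the commands above:
# Share of Googlebot requests hitting parameterised URLs
total=$(wc -l < googlebot-requests.txt)
params=$(awk '$7 ~ /\?/' googlebot-requests.txt | wc -l)
echo "$params of $total Googlebot requests were for parameter URLs"
# Most-requested query parameters - candidates for robots.txt rules or canonicalisation
awk '$7 ~ /\?/ {print $7}' googlebot-requests.txt \
| grep -oE '[?&][^=&]+=' | sort | uniq -c | sort -rn | head -10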
Step 11: Prioritise Findings by Impact and Effort
By this point the audit has produced 20-50 findings. Without prioritisation, engineering will fix the easiest ones and ignore the hardest, which is rarely the right order. Score every finding on two axes:
Impact (1-5): how many URLs are affected, how much traffic those URLs represent, and how directly the fix affects rankings or rich-result eligibility. A canonical bug across 5,000 URLs ranks 5; a missing alt text on one page ranks 1.
Effort (1-5): developer hours to implement, plus dependencies on other teams. A robots.txt edit is effort 1; a JavaScript framework migration to fix client-side canonical injection is effort 5.
Plot the findings on a 2x2: high-impact + low-effort is the "quick wins" quadrant - ship these in week one. High-impact + high-effort is the "strategic" quadrant - get them on the roadmap. Low-impact + low-effort goes into the backlog. Low-impact + high-effort gets dropped.
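If the findings already live in a CSV, the quadrant assignment can be scripted rather than plotted by hand. A sketch assuming a hypothetical findings.csv with finding,impact,effort columns, treating impact of 4 or more as high and effort of 2 or less as low:
# Assign each finding to a quadrant (thresholds are illustrative; naive CSV parsing - keep commas out of finding names)
awk -F, 'NR > 1 {
if ($2 >= 4 && $3 <= 2) q = "1 QUICK WIN"
else if ($2 >= 4) q = "2 STRATEGIC"
else if ($3 <= 2) q = "3 BACKLOG"
else q = "4 DROP"
print q "\t" $1 " (impact " $2 ", effort " $3 ")"
}' findings.csv | sort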
Step 12: Deliverable Templates
An audit that lives in a Google Doc never gets implemented. Two deliverables make implementation likely:
Issue tracker file (for engineering): one row per finding, with columns for severity, affected URL count, evidence (link to the crawl export or screenshot), suggested fix, and acceptance criteria. Most teams use Linear, Jira, or GitHub Issues - export the audit as CSV and import directly. Each finding should be implementable from the row alone, without rereading the audit narrative.
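A starting header for that file, with one hypothetical row to show the level of detail each finding needs (column names are suggestions, not a Jira or Linear requirement):
# Write an import-ready findings file; the example row is illustrative
cat > audit-findings.csv <<'EOF'
finding,severity,affected_urls,evidence,suggested_fix,acceptance_criteria
HTTP canonicals on HTTPS product pages,high,5200,crawl-export.csv,Point the canonical template at the https URL,Every product URL returns one https self-referencing canonical
EOF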
Executive summary (for leadership): one page. Three sections: top three findings (with one-sentence business impact each), expected timeline to fix, and projected traffic recovery once fixed. Avoid technical detail - the audience is the person approving engineering time, not the person doing the work. If they cannot read it in 90 seconds, rewrite it.
The audit is complete when both deliverables are shipped. Re-run the audit quarterly - sites accumulate new technical debt at roughly the same rate they ship features, and a 6-month-old audit is already partly obsolete.