UTF-8 Encoding Issues in Your Sitemap
Sitemaps must be UTF-8 encoded with percent-encoded URLs for any non-ASCII characters. When the file is actually ISO-8859-1 masquerading as UTF-8, or when it starts with a hidden BOM, or when URLs contain raw accented characters, Googlebot either rejects the file or silently skips malformed entries - costing you indexing coverage without any obvious symptom.
What is this error?
UTF-8 encoding issues split into three categories: (1) the sitemap file itself declares encoding="UTF-8" but contains bytes that aren't valid UTF-8 sequences, (2) the file starts with a byte-order mark (BOM: EF BB BF) before the XML declaration, or (3) URL values contain raw non-ASCII characters that should be percent-encoded (e.g., https://example.com/café instead of https://example.com/caf%C3%A9).
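The third category is the easiest one to see in code. A minimal sketch in Python, using the café URL from above (urllib.parse.quote is the standard-library percent-encoder):

```python
from urllib.parse import quote

# Raw accented character in the path: not valid in a sitemap <loc>.
raw_url = "https://example.com/café"

# Percent-encode non-ASCII bytes while leaving URL delimiters intact.
print(quote(raw_url, safe=":/?&="))  # https://example.com/caf%C3%A9
```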
Why does it happen?
The classic cause is a database that stores URLs as Latin-1 (ISO-8859-1) while the sitemap generator assumes UTF-8. Older versions of Notepad and Visual Studio on Windows add a BOM when saving UTF-8 files. PHP applications often emit raw Unicode when the developer forgets urlencode(). Sites in French, Spanish, Chinese, Japanese, Korean, Arabic, and Cyrillic-script languages are most affected.
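The database mismatch is easy to reproduce. A sketch, assuming the kind of Latin-1 byte string such a database would return:

```python
# A Latin-1 database returns "café" with é as the single byte 0xE9.
db_bytes = "café".encode("latin-1")  # b'caf\xe9'

# A generator that assumes UTF-8 fails: 0xE9 opens a three-byte UTF-8
# sequence, and no valid continuation bytes follow.
try:
    db_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe9 in position 3: ...
```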
Why does it hurt SEO?
A BOM typically causes the entire sitemap to fail parsing: zero URLs processed. Invalid UTF-8 sequences cause Google to either reject the file or drop the individual malformed entries. Unencoded characters in URLs often get rewritten by Google into forms that don't match your canonical structure, creating duplicate-content problems. Localized sites (French, Spanish, Chinese, Japanese) often lose 10-30% of indexing coverage when these issues go uncaught.
How to detect it
Run `file sitemap.xml` on the command line; it reports the encoding and whether a BOM is present. Use `iconv -f UTF-8 -t UTF-8 sitemap.xml > /dev/null` to validate UTF-8 sequences; iconv stops with an error at the first invalid byte. Sitemap Fixer combines all three checks (BOM detection, UTF-8 sequence validation, and URL percent-encoding compliance) in a single scan.
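If you'd rather script it, the same three checks fit in a few lines of Python (a sketch; the file path and the <loc> regex are simplifications for illustration):

```python
import re
import sys

data = open("sitemap.xml", "rb").read()

# Check 1: a UTF-8 BOM before the XML declaration.
if data.startswith(b"\xef\xbb\xbf"):
    print("BOM found: strip the leading EF BB BF bytes")

# Check 2: the byte stream must be valid UTF-8 throughout.
try:
    text = data.decode("utf-8")
except UnicodeDecodeError as err:
    sys.exit(f"invalid UTF-8: {err}")

# Check 3: <loc> values must be pure ASCII, i.e. already percent-encoded.
for url in re.findall(r"<loc>(.*?)</loc>", text):
    if not url.isascii():
        print("raw non-ASCII characters in:", url)
```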
How to fix it
1. Strip the BOM: `sed -i '1s/^\xEF\xBB\xBF//' sitemap.xml` (or re-save the file as UTF-8 without BOM).
2. Percent-encode all non-ASCII characters in URLs: use your language's equivalent of encodeURI() or urlencode().
3. Verify your database connection uses UTF-8: `SET NAMES utf8mb4` in MySQL, `SET client_encoding TO 'UTF8'` in Postgres.
4. Configure your sitemap generator to emit UTF-8 without BOM (most XML libraries have a writer option for this; see the sketch after this list).
5. Validate with `iconv -f UTF-8 -t UTF-8 sitemap.xml > /dev/null`: no errors means valid UTF-8.
6. Resubmit the sitemap in Search Console and check coverage over the next 2 weeks.
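Steps 2 and 4 together look like this in Python (a minimal sketch; the source URL list is hypothetical, and quote/escape come from the standard library):

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

urls = ["https://example.com/articles/café-parisien"]  # hypothetical input

entries = "\n".join(
    # quote() percent-encodes non-ASCII path bytes; escape() turns
    # XML-reserved characters such as & into entities like &amp;.
    f"  <url><loc>{escape(quote(u, safe=':/?&='))}</loc></url>"
    for u in urls
)
xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n</urlset>\n"
)

# .encode("utf-8") - not "utf-8-sig" - writes the file without a BOM.
with open("sitemap.xml", "wb") as f:
    f.write(xml.encode("utf-8"))
```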
Real-world example
A French blog with article URLs like /articles/café-parisien saw only 12% of articles indexed. Its sitemap listed URLs with raw accented characters. After percent-encoding them (/articles/caf%C3%A9-parisien) and adding a matching server-level redirect from the raw-character variants to the encoded URLs, indexed pages rose from 340 to 2,600 over 4 weeks.
Common mistakes
- Saving the sitemap in a text editor that silently adds a UTF-8 BOM
- Mixing percent-encoded and raw Unicode URLs in the same sitemap (see the double-encoding sketch after this list)
- Forgetting to urlencode() URLs built from database strings in PHP/Python
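The second mistake often compounds into double-encoding: running an encoder over an already-encoded URL encodes the % signs themselves. A quick sketch:

```python
from urllib.parse import quote

already_encoded = "https://example.com/caf%C3%A9"

# quote() re-encodes the % signs: %C3 becomes %25C3, yielding a URL
# that points at a different, nonexistent path.
print(quote(already_encoded, safe=":/"))
# https://example.com/caf%25C3%25A9
```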