UTF-8 Encoding Issues in Your Sitemap
Sitemaps must be UTF-8 encoded with percent-encoded URLs for any non-ASCII characters. When the file is actually ISO-8859-1 masquerading as UTF-8, or when it starts with a hidden BOM, or when URLs contain raw accented characters, Googlebot either rejects the file or silently skips malformed entries - costing you indexing coverage without any obvious symptom.
What is this error?
UTF-8 encoding issues split into three categories: (1) the sitemap file itself declares encoding="UTF-8" but contains bytes that aren't valid UTF-8 sequences, (2) the file starts with a byte-order mark (BOM: EF BB BF) before the XML declaration, or (3) URL values contain raw non-ASCII characters that should be percent-encoded (e.g., https://example.com/café instead of https://example.com/caf%C3%A9).
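The third category is the easiest one to see in code. A minimal sketch in Python, using the café URL from above (urllib.parse.quote is the standard-library percent-encoder):

```python
from urllib.parse import quote

# Raw accented character in the path: not valid in a sitemap <loc>.
raw_url = "https://example.com/café"

# Percent-encode non-ASCII bytes while leaving URL delimiters intact.
print(quote(raw_url, safe=":/?&="))  # https://example.com/caf%C3%A9
```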
Why does it happen?
The classic cause is a database that stores URLs as Latin-1 (ISO-8859-1) while the sitemap generator assumes UTF-8. Older versions of Notepad and Visual Studio on Windows add a BOM when saving UTF-8 files. PHP applications often emit raw Unicode when the developer forgets urlencode(). Sites in French, Spanish, Chinese, Japanese, Korean, Arabic, and Cyrillic-script languages are most affected.
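The database mismatch is easy to reproduce. A sketch, assuming the kind of Latin-1 byte string such a database would return:

```python
# A Latin-1 database returns "café" with é as the single byte 0xE9.
db_bytes = "café".encode("latin-1")  # b'caf\xe9'

# A generator that assumes UTF-8 fails: 0xE9 opens a three-byte UTF-8
# sequence, and no valid continuation bytes follow.
try:
    db_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe9 in position 3: ...
```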
Why does it hurt SEO?
A BOM typically causes the entire sitemap to fail parsing: zero URLs processed. Invalid UTF-8 sequences cause Google to either reject the file or drop the individual malformed entries. Unencoded characters in URLs often get rewritten by Google into forms that don't match your canonical structure, creating duplicate-content problems. Localized sites (French, Spanish, Chinese, Japanese) often lose 10-30% of indexing coverage when these issues go uncaught.
How to detect it
Run `file sitemap.xml` on the command line; it reports the encoding and whether a BOM is present. Use `iconv -f UTF-8 -t UTF-8 sitemap.xml > /dev/null` to validate UTF-8 sequences; iconv stops with an error at the first invalid byte. Sitemap Fixer combines all three checks (BOM detection, UTF-8 sequence validation, and URL percent-encoding compliance) in a single scan.
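If you'd rather script it, the same three checks fit in a few lines of Python (a sketch; the file path and the <loc> regex are simplifications for illustration):

```python
import re
import sys

data = open("sitemap.xml", "rb").read()

# Check 1: a UTF-8 BOM before the XML declaration.
if data.startswith(b"\xef\xbb\xbf"):
    print("BOM found: strip the leading EF BB BF bytes")

# Check 2: the byte stream must be valid UTF-8 throughout.
try:
    text = data.decode("utf-8")
except UnicodeDecodeError as err:
    sys.exit(f"invalid UTF-8: {err}")

# Check 3: <loc> values must be pure ASCII, i.e. already percent-encoded.
for url in re.findall(r"<loc>(.*?)</loc>", text):
    if not url.isascii():
        print("raw non-ASCII characters in:", url)
```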
How to fix it
1. Strip the BOM: `sed -i '1s/^\xEF\xBB\xBF//' sitemap.xml` (or re-save the file as UTF-8 without BOM).
2. Percent-encode all non-ASCII characters in URLs: use your language's equivalent of encodeURI() or urlencode().
3. Verify your database connection uses UTF-8: `SET NAMES utf8mb4` in MySQL, `SET client_encoding TO 'UTF8'` in Postgres.
4. Configure your sitemap generator to emit UTF-8 without BOM (most XML libraries have a writer option for this; see the sketch after this list).
5. Validate with `iconv -f UTF-8 -t UTF-8 sitemap.xml > /dev/null`: no errors means valid UTF-8.
6. Resubmit the sitemap in Search Console and check coverage over the next 2 weeks.
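Steps 2 and 4 together look like this in Python (a minimal sketch; the source URL list is hypothetical, and quote/escape come from the standard library):

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

urls = ["https://example.com/articles/café-parisien"]  # hypothetical input

entries = "\n".join(
    # quote() percent-encodes non-ASCII path bytes; escape() turns
    # XML-reserved characters such as & into entities like &amp;.
    f"  <url><loc>{escape(quote(u, safe=':/?&='))}</loc></url>"
    for u in urls
)
xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n</urlset>\n"
)

# .encode("utf-8") - not "utf-8-sig" - writes the file without a BOM.
with open("sitemap.xml", "wb") as f:
    f.write(xml.encode("utf-8"))
```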
Real-world example
A French blog with article URLs like /articles/café-parisien saw only 12% of articles indexed. Its sitemap listed URLs with raw accented characters. After percent-encoding them (/articles/caf%C3%A9-parisien) and adding a matching server-level redirect from the raw-character variants to the encoded URLs, indexed pages rose from 340 to 2,600 over 4 weeks.
Common mistakes
- Saving the sitemap in a text editor that silently adds a UTF-8 BOM
- Mixing percent-encoded and raw Unicode URLs in the same sitemap (see the double-encoding sketch after this list)
- Forgetting to urlencode() URLs built from database strings in PHP/Python
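The second mistake often compounds into double-encoding: running an encoder over an already-encoded URL encodes the % signs themselves. A quick sketch:

```python
from urllib.parse import quote

already_encoded = "https://example.com/caf%C3%A9"

# quote() re-encodes the % signs: %C3 becomes %25C3, yielding a URL
# that points at a different, nonexistent path.
print(quote(already_encoded, safe=":/"))
# https://example.com/caf%25C3%25A9
```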