Unicode, UTF-8 encoding, and Windows-1252 encoding
I've seen many posts where typographic quotation marks are rendered as junk characters. This is a typical example: "a â€˜scoreâ€™ can be worked out"
That could be the result of Unicode characters being stored using UTF-8 encoding but interpreted and displayed using Windows-1252 encoding. Working backwards from what's displayed:
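The effect is easy to reproduce. Here is a minimal Python sketch, assuming Python's cp1252 codec as a stand-in for Windows-1252; the variable names are mine and purely illustrative:

```python
# Text containing left and right single quotation marks (U+2018, U+2019).
original = "a \u2018score\u2019 can be worked out"

# Store it as UTF-8 bytes.
stored = original.encode("utf-8")

# Read the same bytes back as if they were Windows-1252.
misread = stored.decode("cp1252")

print(stored[2:5].hex(" "))  # e2 80 98 (the three bytes of the opening quote)
print(misread)               # a â€˜scoreâ€™ can be worked out
```

Each quotation mark turns into three junk characters because each of its three UTF-8 bytes is displayed as a separate Windows-1252 character.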
â€˜ in Windows-1252 encoding maps to the bytes E2 80 98 = 1110 0010 1000 0000 1001 1000
In UTF-8 encoding those three bytes represent a single character (more properly, a Unicode code point) using the 16 bits within the curly brackets:
1110 {0010} 10{00 0000} 10{01 1000}
The other bits are part of the UTF-8 encoding overhead and are ignored here.
That leaves: 0010 0000 0001 1000 = 20 18 (hex) = code point 8216 (U+2018) = left single quotation mark.
Similarly, â€™ maps to the bytes E2 80 99, which decode to Unicode code point 8217 (U+2019), right single quotation mark.
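The same bit arithmetic can be sketched in Python. The function name decode_three_byte_utf8 is just something I made up for illustration, and it only handles the three-byte form of UTF-8 discussed here:

```python
def decode_three_byte_utf8(b0, b1, b2):
    """Return the code point held in one three-byte UTF-8 sequence."""
    # The lead byte must look like 1110xxxx, the other two like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Strip the overhead bits and join the remaining 4 + 6 + 6 payload bits.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


left = decode_three_byte_utf8(0xE2, 0x80, 0x98)
right = decode_three_byte_utf8(0xE2, 0x80, 0x99)
print(left, hex(left))    # 8216 0x2018
print(right, hex(right))  # 8217 0x2019
```

In practice bytes([0xE2, 0x80, 0x98]).decode("utf-8") does the same job; the manual version is only there to show where the 16 bits come from.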
I wish I could say there's a trivial fix for this, but if your database has data encoded both ways there's probably not one that's going to work perfectly.
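For data that is already mixed, one imperfect approach is to try reversing the mistake string by string and leave alone anything that doesn't convert cleanly. This is only a rough sketch under that assumption; repair_mojibake is a hypothetical helper of mine, not a standard function:

```python
def repair_mojibake(text):
    """Try to undo 'UTF-8 bytes read as Windows-1252'; otherwise return text unchanged."""
    try:
        # Recover the original bytes, then decode them the way they were meant to be.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean round trip, so assume the text was fine as it was.
        return text


garbled = "a \u00e2\u20ac\u02dcscore\u00e2\u20ac\u2122 can be worked out"
print(repair_mojibake(garbled))      # a ‘score’ can be worked out
print(repair_mojibake("caf\u00e9"))  # café (left unchanged)
```

It's a heuristic, not a guarantee. A string that legitimately contains a sequence like â€˜ would be "repaired" wrongly, and some garbled characters won't convert at all (the right double quotation mark, for example, ends in byte 9D, which Python's cp1252 codec leaves undefined), so test it on a copy of the data first.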
You might try changing this (browsers treat a declared ISO-8859-1 charset as Windows-1252):
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
to this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
UTF-8 encoding is the standard now, and has been for some years. But Windows-1252 bytes above 7F (mostly a handful of accented letters) are generally not valid UTF-8 sequences, so any data that really was stored as Windows-1252 might then display as replacement characters or something else.
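To see what happens to genuine Windows-1252 data after the switch, here is a small Python sketch using an accented letter. The sample word is my own, and errors="replace" only approximates what a browser does with bytes it can't decode:

```python
# "café" stored as Windows-1252: the é is the single byte E9.
legacy_bytes = "caf\u00e9".encode("cp1252")

# A strict UTF-8 decoder rejects the lone E9 byte.
try:
    legacy_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD, the replacement character.
lenient = legacy_bytes.decode("utf-8", errors="replace")

print(legacy_bytes.hex(" "))  # 63 61 66 e9
print(strict_ok)              # False
print(lenient)                # caf followed by the replacement character
```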