Character encoding driving me crazy

Azala

I've decided to leave latin1 behind and do everything in utf8, which I now realize I should have done from the beginning, because this is a pain.

I've set apache2, php5, mysql and the meta tag to use utf-8.

Upon making a connection to the mysql database I immediately call:
"SET NAMES 'utf8'"

Now I've got two sets of weirdness in my database. Problem one: if I don't call SET NAMES, æøå for instance will look like ? in about half of my database.

Problem two: if I do call SET NAMES, an å looks like Ã¥ when echoed from the other half of the database.

Basically, half my database prints one thing and the other half another, depending on whether I make this SET NAMES call to the mysql db or not.

So I figured ok I got 2 character sets in my DB and I should try to convert the part that fails when I don't call SET NAMES.

On the mysql site I found this to try to get the server to convert the data:

ALTER TABLE myTable MODIFY myColumn BINARY(255);
ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;

That didn't work.

Then we have this from the w3.org site:

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

   // From http://w3.org/International/questions/qa-forms-utf-8.html
   return preg_match('%^(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
   )*$%xs', $string);

} // function is_utf8

Returns true on both halves of the database, so it says it's all valid UTF8??
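One possible explanation, offered as a guess rather than a diagnosis: text that has been encoded to UTF-8 twice is still well-formed UTF-8, so a validity check cannot tell the two halves apart. The sketch below uses the mbstring functions instead of the regex above, and the variable names are only illustrative.

```php
<?php
// "å" correctly encoded as UTF-8 is the two bytes C3 A5.
$correct = "\xC3\xA5";

// Treat those two bytes as latin1 characters and encode them to UTF-8 again.
// The result is the four bytes C3 83 C2 A5, which a browser shows as "Ã¥".
$double = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

// Both strings are well-formed UTF-8, so a validity check passes for each.
$correctIsValid = mb_check_encoding($correct, 'UTF-8');
$doubleIsValid  = mb_check_encoding($double, 'UTF-8');
```

If that is what happened here, "valid UTF-8" and "the right characters" are two separate questions, and only the second one distinguishes the two halves.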

And lastly I've run:

/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

from the php.net site.

This gives me a mix of UTF8 and ASCII. I ran the above code on the result of a mysql_query on specific columns.
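The mix of ASCII and UTF8 is probably not a sign of trouble by itself: a string with no bytes above 0x7F is valid in both encodings and gets reported as ASCII when ASCII comes first in the candidate list. A small sketch, where detect_ascii_or_utf8() is a made-up helper name and the candidate list is an assumption about what you want to check for:

```php
<?php
// Hypothetical helper: detect with an explicit candidate list in strict mode,
// instead of relying on whatever detect_order the server is configured with.
function detect_ascii_or_utf8($str) {
    return mb_detect_encoding($str, array('ASCII', 'UTF-8'), true);
}

// Only bytes below 0x80, so the first candidate (ASCII) already fits.
$plain = detect_ascii_or_utf8('foo-bar');

// "blåbær" written out as UTF-8 bytes; it is not ASCII, so UTF-8 is reported.
$accented = detect_ascii_or_utf8("bl\xC3\xA5b\xC3\xA6r");
```

So the ASCII results most likely just mark the columns or rows that contain no accented characters at all.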

So from what I can see, I've got two sets of valid UTF8 which display incorrectly depending on whether SET NAMES is called or not. Which part is valid, which part isn't, and how do I convert it?

This is such a pain...

Another thing I've noticed is that Firefox and IE will display ? and □ in place of - in some cases. I've noticed some of the - are longer than others; I assume this happens when text is pasted from Microsoft Word or something directly into form fields, and that some invisible chars are included in these cases. How can I make sure this doesn't happen?

Did I mention this was a pain :\

Thanks for any help!

Azala

– // Weird one?
- // Hyphen-minus

I can't find the weird one in the character map, and unicode utf-8 shows it as a ?, meaning it doesn't know it, but it is still stored and shown correctly in the form fields...

Any ideas?

Azala

I can reproduce the - in Microsoft Word as expected.

If you type foo-bar in Word it's all good. But if you type foo - bar and then hit enter, it converts the - to one that is slightly longer. Has anyone got any good info on how to deal with this?
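One way to deal with it, sketched under the assumption that the form data arrives as UTF-8: swap the longer dashes for a plain hyphen-minus before storing. normalize_dashes() is a made-up name, not a built-in.

```php
<?php
// Hypothetical helper: replace the en dash (U+2013) and em dash (U+2014)
// with a plain hyphen-minus. Assumes $str is already UTF-8.
function normalize_dashes($str) {
    $enDash = "\xE2\x80\x93"; // UTF-8 bytes for U+2013
    $emDash = "\xE2\x80\x94"; // UTF-8 bytes for U+2014
    return str_replace(array($enDash, $emDash), '-', $str);
}
```

If the whole chain really is UTF-8 from form to database to page, the longer dashes can also simply be stored as they are; replacing them is a choice, not a requirement.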

MarkR

Azala wrote:
If I call SET NAMES an å would look like Ã¥ when echoed in the other half of the database.

When you say "half of the database", do you mean half the tables or half the rows?

What encoding are the tables set to (I would assume that they are utf-8 too throughout?)

What encoding is the HTML set to in headers and/or html meta http-equiv?

What encoding is the browser actually interpreting the pages as?

If any of these are not utf-8, then you have a problem.

Another problem may be that some legacy data are rubbish in the database; this is not soluble if you didn't identify it at the time as you won't be able to fix it without breaking newer data.

Mark

Azala

About half of the tables, sorry.

All tables are set to UTF8, HTML Meta tag is UTF8 and the browsers are using UTF8.

I've made some interesting finds though.

Someone suggested putting some Arabic letters into the database to see whether it's storing characters above 128 correctly or not. Without calling SET NAMES it didn't! They ended up as ?, meaning it was not true unicode. Calling SET NAMES displayed the Arabic characters, so now I at least know which "half" is right and which isn't.

That leaves me with two problems still: one is how to convert the corrupt data to true UTF8 Unicode, the other is how to deal with those Microsoft Word characters mentioned above.

I looked up a few functions on php.net and came across the comment section of "strtr" which had quite a few references to exactly Microsoft Word.

/*Latin1 (iso-8859-1) DONT define chars \x80-\x9f (128-159),
but Windows charset 1252 defines _some_ of them
-- like the infamous msoffice 'magic quotes' (\x92 146).
Dont use those invalid control chars in webpages,
but their html (unicode) entities. See ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
or http://www.microsoft.com/typography/unicode/1252.htm
PS: a '?' in the code means the win-cp1252 dont define the given char.*/

// Keys must be double-quoted: PHP does not expand \x escapes in single-quoted strings.
// Apply this only to single-byte latin1/cp1252 text. In UTF-8 text the bytes
// \x80-\x9F occur inside multibyte sequences and must not be replaced.
$badlatin1_cp1252_to_htmlent =
  array(
   "\x80"=>'&#x20AC;', "\x81"=>'?', "\x82"=>'&#x201A;', "\x83"=>'&#x0192;',
   "\x84"=>'&#x201E;', "\x85"=>'&#x2026;', "\x86"=>'&#x2020;', "\x87"=>'&#x2021;',
   "\x88"=>'&#x02C6;', "\x89"=>'&#x2030;', "\x8A"=>'&#x0160;', "\x8B"=>'&#x2039;',
   "\x8C"=>'&#x0152;', "\x8D"=>'?', "\x8E"=>'&#x017D;', "\x8F"=>'?',
   "\x90"=>'?', "\x91"=>'&#x2018;', "\x92"=>'&#x2019;', "\x93"=>'&#x201C;',
   "\x94"=>'&#x201D;', "\x95"=>'&#x2022;', "\x96"=>'&#x2013;', "\x97"=>'&#x2014;',
   "\x98"=>'&#x02DC;', "\x99"=>'&#x2122;', "\x9A"=>'&#x0161;', "\x9B"=>'&#x203A;',
   "\x9C"=>'&#x0153;', "\x9D"=>'?', "\x9E"=>'&#x017E;', "\x9F"=>'&#x0178;'
  );
$str = strtr($str, $badlatin1_cp1252_to_htmlent);

The dashes I talked about further up are called the en dash and the em dash. The em dash is the one which isn't defined and ends up as a ? in Firefox or a square in IE.

Obviously all I have to do here is write a function to deal with these characters should they exist in the form data I receive.
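A rough idea of what such a function could look like, assuming that anything which is not already valid UTF-8 was submitted as Windows-1252 (that is a guess about the clients, not something the form tells you). form_input_to_utf8() is a made-up name.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, and treat everything else
// as Windows-1252 (the Word "smart" characters live in \x80-\x9F there).
function form_input_to_utf8($str) {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already UTF-8, nothing to do
    }
    // Assumption: non-UTF-8 input is cp1252, so \x96 becomes U+2013 and so on.
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}
```

Unlike a byte-for-byte strtr() table, this checks validity first, so it does not touch the continuation bytes of text that is already UTF-8.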

If you want to learn a bit about Unicode, I found this article good reading:
http://www.joelonsoftware.com/articles/Unicode.html

About the convert problem, I guess I have to write a PHP script to replace all occurrences of the bad characters with correct ones. Shouldn't take forever, I hope.
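A sketch of what the per-string part of that script might be, assuming the bad half is UTF-8 that was read as latin1 (cp1252 in MySQL terms) and then encoded to UTF-8 a second time. That would match the Ã¥ symptom, but it is an assumption. fix_double_encoded() is a made-up name, and it should be tried on a copy of the data first.

```php
<?php
// Hypothetical helper: undo one round of latin1 -> UTF-8 double encoding.
// Returns the input unchanged when it does not look double-encoded.
function fix_double_encoded($str) {
    // Map each character back to the single cp1252 byte it was mistaken for.
    $bytes = mb_convert_encoding($str, 'Windows-1252', 'UTF-8');

    // Characters outside cp1252 (Arabic, for example) cannot be mapped back,
    // so require a lossless round trip before trusting the result.
    if (mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') !== $str) {
        return $str;
    }

    // Double-encoded text collapses to valid UTF-8; correct text does not.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $str;
}
```

The checks are there so that rows which are already correct pass through untouched. A string that mixes good and double-encoded characters, or one that legitimately contains something like "Ã¥", would still be handled wrongly, so spot-check the output before writing anything back.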