About half of the tables, sorry.
All tables are set to UTF8, HTML Meta tag is UTF8 and the browsers are using UTF8.
I've made some interesting finds though.
Someone suggested putting some Arabic letters into the database to see whether it actually stores characters above 128 correctly. Without SET NAMES it didn't! They ended up as ?, meaning the data was not being stored as true Unicode. After calling SET NAMES the Arabic characters displayed correctly, so now I at least know which "half" is right and which isn't.
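Here is a rough way to reproduce that ? outcome in plain PHP, without touching the database. It assumes the connection falls back to latin1 when SET NAMES is not issued, which depends on the server defaults and is only a guess for this setup. The function name survives_latin1 is made up for illustration, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: does a UTF-8 string survive a trip through Latin-1?
// Assumption: this mimics a latin1 connection; the real server may differ.
function survives_latin1(string $utf8): bool
{
    // Characters with no Latin-1 equivalent are replaced by mbstring's
    // substitute character (a question mark by default).
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $utf8;
}

var_dump(survives_latin1("abc"));       // plain ASCII
var_dump(survives_latin1("\u{00E9}"));  // e with acute accent, exists in Latin-1
var_dump(survives_latin1("\u{0627}"));  // Arabic letter alef, not in Latin-1
```

Anything that comes back false here is the kind of text that would be lost on a latin1 connection.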
That still leaves me with two problems: how do I convert the corrupt data to true UTF-8, and how do I deal with those Microsoft Word characters mentioned above?
I looked up a few functions on php.net and came across the comment section of "strtr", which has quite a few references to exactly this Microsoft Word problem.
/*Latin1 (iso-8859-1) DONT define chars \x80-\x9f (128-159),
but Windows charset 1252 defines _some_ of them
-- like the infamous msoffice 'magic quotes' (\x92 146).
Dont use those invalid control chars in webpages,
but their html (unicode) entities. See ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
or http://www.microsoft.com/typography/unicode/1252.htm
PS: a '?' in the code means the win-cp1252 dont define the given char.*/
$badlatin1_cp1252_to_htmlent =
array(
// Keys must be double-quoted: in single quotes PHP does not expand \x80 to a byte.
"\x80"=>'€', "\x81"=>'?', "\x82"=>'‚', "\x83"=>'ƒ',
"\x84"=>'„', "\x85"=>'…', "\x86"=>'†', "\x87"=>'‡',
"\x88"=>'ˆ', "\x89"=>'‰', "\x8A"=>'Š', "\x8B"=>'‹',
"\x8C"=>'Œ', "\x8D"=>'?', "\x8E"=>'Ž', "\x8F"=>'?',
"\x90"=>'?', "\x91"=>'‘', "\x92"=>'’', "\x93"=>'“',
"\x94"=>'”', "\x95"=>'•', "\x96"=>'–', "\x97"=>'—',
"\x98"=>'˜', "\x99"=>'™', "\x9A"=>'š', "\x9B"=>'›',
"\x9C"=>'œ', "\x9D"=>'?', "\x9E"=>'ž', "\x9F"=>'Ÿ'
);
$str = strtr($str, $badlatin1_cp1252_to_htmlent);
The dashes I talked about further up are called the en dash and the em dash. The em dash (\x97 in Windows-1252) is the one that isn't defined in Latin-1 and ends up as a ? in Firefox or a square in IE.
Obviously all I have to do here is write a function to deal with these characters should they exist in the form data I receive.
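A minimal sketch of such a function, assuming that any submitted value which is not valid UTF-8 was sent as Windows-1252 (that assumption is mine and may not hold for every client). The name form_value_to_utf8 is made up, and mbstring is required. Checking for valid UTF-8 first matters, because bytes \x80-\x9f also occur inside correct multi-byte UTF-8 sequences and a blind byte-for-byte replacement would damage those.

```php
<?php
// Hypothetical helper: normalize one submitted form value to UTF-8.
function form_value_to_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave it untouched
    }
    // Assumption: invalid UTF-8 means the bytes are Windows-1252,
    // e.g. Word's curly quotes and dashes in the \x80-\x9f range.
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// Example: a Word-style apostrophe (\x92) and an em dash (\x97).
echo form_value_to_utf8("It\x92s here \x97 finally"), "\n";
```

It could be applied to each field with something like array_map('form_value_to_utf8', $_POST), as long as the form has no nested array fields.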
If you want to learn something about Unicode, I found this article good reading:
http://www.joelonsoftware.com/articles/Unicode.html
As for the conversion problem, I guess I have to write a PHP script to replace all occurrences of the bad characters with correct ones. It shouldn't take forever, I hope.
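One possible building block for that script, sketched under a big assumption: that the corruption is the common "double encoding" case, where UTF-8 bytes were read as Windows-1252/latin1 and encoded to UTF-8 a second time. Whether that matches the data here is untested. Characters already stored as a literal ? cannot be recovered by any script, because the original information is gone. The function name undo_double_encoding is made up, and mbstring is required.

```php
<?php
// Hypothetical repair step for one stored value.
// Assumption: the value may be UTF-8 that was encoded twice.
function undo_double_encoding(string $stored): string
{
    // Step back one encoding level: UTF-8 -> Windows-1252 bytes.
    $candidate = mb_convert_encoding($stored, 'Windows-1252', 'UTF-8');
    // Only accept the result if no character was lost on the way...
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252');
    if ($roundTrip !== $stored) {
        return $stored;
    }
    // ...and if the recovered bytes are themselves valid UTF-8.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $stored;
    }
    return $candidate;
}

// Example: "Ã©" is what a double-encoded e-acute looks like.
echo undo_double_encoding("\u{00C3}\u{00A9}"), "\n";
```

Values that are already correct, or that do not fit the double-encoding pattern, are returned unchanged, so running it first on a copy of the table and comparing the results would be a sensible precaution.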