I've decided to leave latin1 behind and do everything in utf8, which I now realize I should have done from the beginning, because this is a pain.
I've set apache2, php5, mysql and the meta tag to use utf-8.
Upon making a connection to the mysql database I immediately call:
"SET NAMES 'utf8'"
Now I've got two sets of weirdness in my database. Problem one: if I don't call SET NAMES, æøå for instance will show up as ? in about half of my database.
Problem two: if I do call SET NAMES, an å shows up as Ã¥ when echoed from the other half of the database.
Basically, the two halves of my database print differently depending on whether or not I make this SET NAMES call.
So I figured I've got two character encodings in my DB, and I should try to convert the part that fails when I don't call SET NAMES.
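To make the Ã¥ symptom concrete, here is a minimal sketch of what I suspect is going on. The assumption (mine, not confirmed) is that some rows were stored by treating UTF-8 bytes as latin1 and encoding them to UTF-8 a second time. It relies on the mbstring extension.

```php
<?php
// "å" as correct UTF-8 is the two bytes C3 A5.
$utf8 = "\xC3\xA5";

// Assumption: somewhere the UTF-8 bytes were treated as latin1
// and converted to UTF-8 a second time.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a5
echo bin2hex($double), "\n"; // c383c2a5, which renders as "Ã¥"
```

If that is what happened, comparing bin2hex() output for a row from each half of the database should show the difference.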
On the mysql site I found this to try to get the server to convert the data:
ALTER TABLE myTable MODIFY myColumn BINARY(255);
ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;
That didn't work.
Then we have this from the w3.org site:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    // preg_match() returns 1 or 0, so compare against 1 to get a real boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string) === 1;
} // function is_utf8
This returns true for both halves of the database, so it says it's all valid UTF-8??
And lastly I ran:
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);
from the php.net site.
This gives me a mix of UTF-8 and ASCII. I ran the above code on the result of a mysql_query on specific columns.
So from what I can see I've got two sets of valid UTF-8 which display incorrectly depending on whether I call SET NAMES or not. Which part is valid and which part isn't, and how do I convert this?
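The mix of ASCII and UTF-8 may not mean much by itself, since any pure ASCII string is also valid UTF-8. A small sketch of that, using mb_check_encoding() on two made-up sample strings (not actual rows from my database):

```php
<?php
$plain    = "abc";        // no accented characters
$accented = "\xC3\xA5bc"; // "åbc" as UTF-8

var_dump(mb_check_encoding($plain, 'ASCII'));    // bool(true)
var_dump(mb_check_encoding($plain, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($accented, 'ASCII')); // bool(false)
var_dump(mb_check_encoding($accented, 'UTF-8')); // bool(true)
```

So the rows reported as ASCII are probably just the ones without any æøå in them.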
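One conversion I could imagine trying in PHP, sketched below under the assumption that the broken half is double-encoded UTF-8. The helper name undo_double_utf8 is made up for illustration, and ISO-8859-1 is a guess: MySQL's latin1 is closer to Windows-1252, so some characters might need that encoding instead.

```php
<?php
// Hypothetical helper: undo one layer of double encoding when it looks safe.
function undo_double_utf8($s) {
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s; // not UTF-8 at all, leave it alone
    }
    // Map each decoded character back to a single latin1 byte.
    $once = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Accept the result only if it is valid UTF-8 and contains non-ASCII bytes.
    if (preg_match('/[\x80-\xFF]/', $once) && mb_check_encoding($once, 'UTF-8')) {
        return $once;
    }
    return $s; // looks like it was encoded only once
}

echo bin2hex(undo_double_utf8("\xC3\x83\xC2\xA5")), "\n"; // c3a5 (repaired)
echo bin2hex(undo_double_utf8("\xC3\xA5")), "\n";          // c3a5 (unchanged)
```

This would wrongly "repair" text that was really meant to read Ã¥, so it would have to be run on a copy of the data first.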
This is such a pain...
Another thing I've noticed is that Firefox and IE will display ? or □ in place of some - characters. I've noticed that some of the - are longer than others; I assume this happens when text is pasted from Microsoft Word or something similar directly into form fields, and that some invisible chars are included in these cases. How can I make sure this doesn't happen?
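For the long dashes, a minimal sketch of one thing I could do on form input, assuming the input arrives as UTF-8 and the long dashes are the en dash (U+2013) and em dash (U+2014). The function name normalize_dashes is made up for illustration.

```php
<?php
// Hypothetical helper: replace Word-style dashes with a plain hyphen.
function normalize_dashes($s) {
    $dashes = array(
        "\xE2\x80\x93", // U+2013 EN DASH as UTF-8
        "\xE2\x80\x94", // U+2014 EM DASH as UTF-8
    );
    return str_replace($dashes, '-', $s);
}

echo normalize_dashes("2001\xE2\x80\x932005"), "\n"; // 2001-2005
```

This only covers those two characters; other Word punctuation such as curly quotes would need their own entries.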
Did I mention this was a pain :\
Thanks for any help!