When I grab the data from the database and output to xml sheet, I got strange characters.
Some of these strange characters unicode characters for european characters. Some of them are windows symbols, such as long "-", strange quotes etc. Some of them are already in html_entity_encode, and some of them are in html_entity_encode (unix format).
First of all, I used these functions on the xml data generated from mysql.
1) htmlspecialchars
It works for some cases such as windows long "-" symbols. etc.
2) utf8_encode
It works on special european characters in unicode format.
But sometimes it double encoded (if the value is saved as encoded already), so I use the following instead,
<?php
// Fixes the encoding to uf8
function fixEncoding($in_str)
{
$cur_encoding = mb_detect_encoding($in_str) ;
if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
return $in_str;
else
return utf8_encode($in_str);
} // fixEncoding
?>
But the above code sometimes missing to encode some values that it is supposed to encode.
Buy the way, it seems this function works on the value (longtext not varchar) in the 3rd party database I got. I am not sure if it is supposed to be this way or just due to the data in the longtext column works fine and in the varchar column not.
3)
<?php
function xml_character_encode($string, $trans='') {
$trans = (is_array($trans)) ? $trans : get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
foreach ($trans as $k=>$v)
$trans[$k]= "&#".ord($k).";";
return strtr($string, $trans);
}
?>
This is only working on some characters, but not on the windows symbols etc. Not on the unicode European characters.
4)
<?php
function convert_smart_quotes($string) {
//converts smart quotes to normal quotes.
$search = array(chr(145), chr(146), chr(147), chr(148), chr(151));
$replace = array("'", "'", '"', '"', '-');
return str_replace($search, $replace, $string);
}
?>
This only solve part of xml parser issue.
5) And some times, the data in the database already in html entities. When I call htmlspecialchars, utf8_encode on them, it get double encoded. So I have to call html_entity_decode on these values first before I call htmlspecialchars, utf8_encode.
But then some times html_entity_decode default set up doesn't work, I have to call html_entity_decode on utf8.
So you see, depends on what the 3rd party have saved in the latin1 charset mysql tables. I have to use all the above solutions. The worse case is that some functions conflict with each other, sometimes I double encoded etc.
I "humanly" look the xml output data, and hand code in the functions (try another if one doesn't work). That is not a computer solution. I will never know what the 3rd party will input into the database next time.
So in simple words.
1) the original database is in latin1 character set. there are windows symbols, European unicode characters, html entity values etc. already in the database.
2) I am looking for a solution that cover all these case without me hand code in one function or another I mentioned above.
3) I can set up my database in utf8 and select these data from 3rd party and insert these data into my new database and output data from my new database to xml pages. (will this be the solution? If that is the case, how should I do it right without missing any issues?)
There got be a solution for this case already.
Thanks!