Summary: I am building a multilingual site, so why shouldn't I use UTF-8 encoding on all my pages when it works? And if I use UTF-8, why convert odd characters into entities?
For months I have been working on a community-based site, built on MySQL using PHP with the Sablotron extension. XML/XSLT has been a great experience, and I don't regret choosing it for a second.
The site will eventually be in around 10 different languages (the most important are English, French, Chinese, Japanese and Korean). Some pages should be able to contain text in several different character sets (e.g. Japanese, English and Chinese on the same page).
1) What encoding should I use?
I have tested everything with UTF-8, and it works like a charm. I can see any number of different character sets on the same page, and text I send using forms is stored correctly and displayed correctly when retrieved. I.e. it works.
The thing is, UTF-8 has only relatively recently been supported by browsers, and if I view the site in Windows encoding (i.e. iso-8859-1) I can browse fine, but all the foreign chars are displayed as '??????'. This I could live with, but the worst problem is that when users submit data in a form, there is a good chance Sablotron will barf at the sight of special chars like æ, ï, $ etc. I cannot get around this by converting the string to entities, because htmlentities() takes the encoding as a parameter and I can only assume that the user is viewing the page as UTF-8. So...
2) Is there a reliable way of knowing what encoding the user is currently using?
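The closest I have come is to look at the submitted bytes themselves instead of asking the browser. Below is a minimal sketch of that idea, assuming the mbstring extension is available. The helper name to_utf8() is made up for illustration, and the fallback to iso-8859-1 is purely my guess about what a non-UTF-8 browser would send.

```php
<?php
// Hypothetical helper: normalise submitted form text to UTF-8
// before it is handed to the XSLT processor.
function to_utf8($raw)
{
    // Bytes that already form valid UTF-8 are passed through untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Assumption: anything else was typed in a browser set to iso-8859-1.
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}
```

I would call it on each form field, e.g. to_utf8($_POST['comment']) (the field name is only an example). It is a heuristic, not a real detection: some iso-8859-1 byte sequences also happen to be valid UTF-8 and would slip through unchanged.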
However, when browsing with UTF-8 I can view both ø AND &oslash; without any problems. So...
3) IF I can assume it's safe to use UTF-8, why use htmlentities() at all to convert odd chars?
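To make the question concrete, here is a small sketch of the two options as I understand them. It assumes the script file itself is saved as UTF-8 and the page is served as UTF-8; the sample string is just an example.

```php
<?php
// Sample input containing non-ASCII letters and some markup.
$input = 'Smørrebrød <b>& æbler</b>';

// Option 1: convert every character that has a named HTML entity.
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Option 2: escape only the markup-significant characters
// (<, >, &, quotes) and leave ø and æ as raw UTF-8 bytes.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $entities, "\n";
echo $escaped, "\n";
```

With option 2 the special letters stay as they are and only the characters that could be mistaken for markup are escaped, which is what makes me wonder whether htmlentities() is needed at all.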
One solution could be to use iso-8859-1 for the Latin character set languages (e.g. English and French), and require UTF-8 for the languages with other character sets (e.g. Chinese).
I have tested the site with the newest IE, Mozilla and Opera browsers without any problems. They detect UTF-8 automatically and show all the character sets fine (actually Opera had some problems displaying Korean). My guess is that about 3-5% of my users use browsers which don't support UTF-8 at the moment.
Any comments, links or answers?
Thanks in advance,
Jens