displaying Chinese, Spanish and English characters all on the same screen... together

perpetualshaun · Jul 13, 2010

I have a client that is reporting comments from a survey that they've collected online to their clients. One of these reports has multiple languages, and I'm trying to find some combination of the <meta http-equiv="Content-Type" content="text/html; charset=??"> tag and some kind of function that I would wrap all the text in so that I can show all three on the screen, side-by-side.

Right now, I have the meta charset set to ISO-8859-1, and I have the text that's pulled out from the database wrapped in htmlentities(). That seems to display everything properly EXCEPT for the Chinese characters. Check out these screen shots:

http://designchemistry.net/i/improvements_iso_encoding_screen_1.png
http://designchemistry.net/i/improvements_iso_encoding_screen_2.png

If I change the character encoding on the whole page to utf-8, and take out the htmlentities(), MOST of the Chinese characters appear correctly, but then some of the other characters aren't displayed. Looks like most of them are apostrophes (which were probably copied and pasted from Word) and those darn en dashes. And check out what it did to the Spanish characters!

http://designchemistry.net/i/improvements_utf_encoding_screen_1.png
http://designchemistry.net/i/improvements_utf_encoding_screen_2.png

So... I'm sure there is some combination of the page encoding and some kind of code that I can put around the text so that all three languages will appear side-by-side on the screen. It seems like leaving the character encoding at ISO-8859-1 and writing some kind of function that I can place around the text for the Chinese characters would be the path of least resistance here, but that may not be an option. I'm open to any and all suggestions on how I can complete this for our client (and I can learn something new also).

Thanks,

Shaun Worcester | senior webologist

design chemistry, llc | next generation web site design
p. 800 640 0424 x 105 | http://www.designchemistry.com/

Connect to me: http://www.google.com/profiles/shaun.daytonwebheads
Follow us on Twitter: http://twitter.com/daytonwebdesign
Become a fan on Facebook: http://www.facebook.com/daytonwebdesign

sneakyimp · Jul 13, 2010

This may or may not be solvable depending on how you have acquired your data, how it is stored, and the declarations of your various charsets in your html pages and in your database.

When you capture your data, you should always try to capture it as utf-8.

I'm not certain, but I believe ISO-8859-1 (Latin 1) is useless for Chinese because Chinese has tens of thousands of glyphs and Latin-1 only uses 8 bits per char (which gives a limit of 256 total chars). If you are permitting Chinese input into an HTML form declared as Latin 1, you are going to have problems. Your capture form should always be UTF-8 if you want to support multi-byte charsets like Chinese.

Generally speaking, chars with spanish accents are encoded as one byte in Latin 1 and two bytes when encoded in UTF-8. This is why the spanish chars look fine one way and broken the other.
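To make the one-byte versus two-byte point concrete, here is a small sketch. It assumes the mbstring extension is available and uses Ñ as the sample character; the variable names are only for illustration.

```php
<?php
// "Ñ" as a Latin-1 form would submit it: a single byte, 0xD1.
$latin1 = "\xD1";

// The same character re-encoded as UTF-8 (requires the mbstring extension).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo strlen($latin1), "\n";  // 1 byte in Latin-1
echo strlen($utf8), "\n";    // 2 bytes in UTF-8
echo bin2hex($utf8), "\n";   // c391
```

A page declared as UTF-8 expects the two-byte form, so a lone 0xD1 byte is invalid there and the browser shows a replacement character instead.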

And then there's your data storage. If you are storing chinese chars, you should declare a charset for your database and its tables that supports chinese. Choose one that's UTF-8.

And then there's PHP. By default, PHP functions like strlen and str_replace assume that you have one byte per character. If you are dealing with multibyte charsets, you should be checking string lengths and such with the multibyte string functions (mb_strlen and friends) or you might get incorrect lengths for Chinese strings.
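A quick sketch of the difference, assuming the mbstring extension is loaded. The sample string is the two Chinese characters 中文, written as escaped UTF-8 bytes so the example does not depend on how the file itself is saved.

```php
<?php
// "中文" in UTF-8: two characters, three bytes each.
$chinese = "\xE4\xB8\xAD\xE6\x96\x87";

$byteCount = strlen($chinese);              // counts bytes
$charCount = mb_strlen($chinese, 'UTF-8');  // counts characters

// substr() could cut a character in half; mb_substr() works on whole characters.
$firstChar = mb_substr($chinese, 0, 1, 'UTF-8');

echo $byteCount, "\n";           // 6
echo $charCount, "\n";           // 2
echo bin2hex($firstChar), "\n";  // e4b8ad
```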

I hope this starts to make sense. You shouldn't be capturing chinese user input as Latin-1. You shouldn't be storing chinese data as Latin-1. You shouldn't be outputting chinese data as Latin-1. You should be using UTF-8 from start to finish if you are dealing with chinese.

You can deal with English, French and Spanish using Latin 1, and your PHP code will work fine with the friendly old string functions like strlen, but your data won't play nice with Chinese. You might be able to utf8_encode data stored as Latin-1 at the output phase and have it play nice with Chinese. However, if you store Spanish and Chinese in the same system, the Chinese may look crappy when it gets utf8-encoded. I don't really know.

Weedpacket took a great deal of time to explain some of this to me.

perpetualshaun · Jul 14, 2010

Thank you for your direction, sneakyimp! I am a little closer to a "final answer" here, and I'm hoping that I can provide a little more information, ask you to take another look, and see if we can come any closer to a final solution.

I verified that everything in the MySQL database is saved with the utf8_general_ci collation. And when I looked up some of the text in question directly through phpMyAdmin, I found some side-by-side Spanish and Chinese responses that looked like this: http://designchemistry.net/i/multiple-languages_in_mysql.jpg

So... I tried just leaving the charset on the page set to ISO and changing the htmlentities() to utf8_encode(), and it probably won't surprise you to hear that didn't work. But after I changed the charset to utf-8, then MOST of the characters appeared correctly on the screen. I tried again wrapping the text I'm pulling out from MySQL in the utf8_encode(), but that didn't seem to do anything. So I am just echoing the values from MySQL directly to the screen with the utf-8 charset on the page, and the majority of the responses are appearing correctly.

However, I'm still seeing some of the pesky � characters on the screen... how can I write a function that will COMPLETELY eliminate those?? I know the language that was selected for each corresponding comment, so if I could pass the language selection into a function and either do something or nothing with the text before it's echoed to the screen - that would be great. If there's a way to take the human element out of it and write a function that does those things based on the bit length of the characters in the string... that would be even better!

Here's what I'm seeing now:
http://designchemistry.net/i/multiple-languages_screen_1.jpg <-- clearly, that first paragraph should have an apostrophe there
http://designchemistry.net/i/multiple-languages_screen_2.jpg

Any further help?

sneakyimp · Jul 14, 2010

If you're displaying chinese chars, your charset in your output HTML must be utf-8 or some other multibyte charset. Latin-1 just won't work.

If you make your HTML output charset utf-8 and the spanish chars still look funny, this is probably because they were captured as Latin 1. This would be the case if someone entered spanish text with accents into a <form> displayed in an html page with charset of latin-1.

In that case, someone is entering chars that are encoded differently in Latin 1 than in UTF-8. An example of such a char is Ñ. In Latin 1 encoding, it's one byte. In UTF-8, it's two bytes. Basically any Latin 1 char with an ord value greater than 127 is going to be a two-byte char in UTF-8.

I think this is what happens. Someone types Ñ in a <form> displayed in their browser and clicks submit. The browser sees that the charset of the page is Latin-1 and says "ok Ñ in latin-1 encoding is a single byte 11010001" and the single-byte value of 11010001 is what the browser sends for the char Ñ. PHP receives it on the other end and stuffs it into a database.

Had you declared your input form as UTF-8, the Ñ char would have been sent from your browser to your server as two bytes. If your DB stores them as UTF-8 and the output page displays them as UTF-8 then they would probably look fine.

If you captured spanish chars as Latin-1, stored them as UTF-8, and displayed them as UTF-8, they are probably broken. You could have captured them as Latin-1, utf8_encoded them before inserting them into the db, and they'd probably look fine. It's too late for that. You might try utf8_encode on the spanish chars after they come out of the DB and they might display properly in a page with charset utf-8, but maybe not. That's pretty kludgey if you ask me.
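One reason the "just encode it on the way out" approach is kludgey: if the text is already UTF-8, encoding it a second time mangles it. A small sketch, assuming the mbstring extension (mb_convert_encoding from ISO-8859-1 does the same job utf8_encode does):

```php
<?php
// "Ñ" that is ALREADY valid UTF-8 (two bytes).
$alreadyUtf8 = "\xC3\x91";

// Treating those two bytes as Latin-1 and encoding again turns each byte
// into its own character, so "Ñ" becomes "Ã" plus an invisible control char.
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($alreadyUtf8), "\n";    // c391
echo bin2hex($doubleEncoded), "\n";  // c383c291
```

So a blanket conversion is only safe on rows that really hold Latin-1 bytes. Applied to the Chinese rows that came in as UTF-8, it would break them.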

If you captured chinese chars in a Latin-1 form (which I believe is more or less pointless) then the browser might have gone "whatever dude" and just passed on the info as-is. Makes me wonder what a chinese browser does with data entered in a Latin-1 form. This is probably why the chinese chars appear fine when stored as utf-8 and displayed as utf-8 because chinese chars don't make any sense at all in Latin-1 encoding.

I linked the wrong thread in my last post. Weedpacket spent lots and lots of time helping me understand this whole thing in this thread. Skip over the first few posts and you'll find some interesting info.

I hope this helps. I know it can be confusing. The best practice is declare everything utf-8 and use the multi-byte string functions.

perpetualshaun · Jul 15, 2010

Thanks again for your input, sneakyimp. I spent a little more time on tests and trials today, and I have a little more information to share with you.

First, I can confirm that for our client BOTH the English and Spanish data was recorded to the database from a form where the meta charset=ISO-8859-1. For the Chinese data, that form is set to utf-8. So what I'm left with is a "mixed bag" of text that's all stored in MySQL with the utf8_general_ci collation, but recorded from two different types of charsets on the forms.

So what options are available to me here? I took a look at the other post that you referred me to and there may be some applications there that I could use for my particular problem, but I'm not seeing it.

I've put together a small "test" file: http://www.designchemistry.net/encoding_test.html

You'll see that most of this displays properly, except for that pesky apostrophe. So I tried to write a function that would take care of it, but this isn't really doing anything:

function fixChars($text,$lang) {
	$fixed = $text;
	if ($lang == 'en') {
		//echo 'before: '.$fixed.'<br />'."\n";
		$fixed = htmlspecialchars($text, ENT_QUOTES, "ISO-8859-1");
		//echo 'after: '.$fixed.'<br />'."\n";
	}
	return $fixed;
}

Ideas?

Thanks,

sneakyimp · Jul 15, 2010

If I were you, I would try to fix your database by querying for all the stuff captured with a Latin-1 form and then using utf8_encode on it and storing it back to the database. I'm not certain this will work -- I could be neglecting some other step of the input/php/database/php/output phase. I'm not certain, but I believe a connection from php to mysql may have a charset associated with that as well. I have no idea if any translation takes place between php and mysql. Fixing it might be worth a try though.
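A rough sketch of what that cleanup pass could look like for a single value. It assumes the mbstring extension and assumes the column hands back the raw bytes that were submitted, which has not been verified here. The function name toUtf8 is made up for this example. It converts from Windows-1252 rather than strict ISO-8859-1, on the assumption that browsers treat a Latin-1 form as Windows-1252, which is where Word's curly quotes live as single bytes.

```php
<?php
/**
 * Hypothetical helper: return $raw as UTF-8.
 * Bytes that already form valid UTF-8 are left alone; anything else is
 * assumed (not known!) to be Windows-1252 and converted.
 */
function toUtf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already fine, e.g. the Chinese rows
    }
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

echo bin2hex(toUtf8("\xD1")), "\n";          // c391   (Ñ from a Latin-1 form)
echo bin2hex(toUtf8("\x92")), "\n";          // e28099 (curly apostrophe)
echo bin2hex(toUtf8("\xE4\xB8\xAD")), "\n";  // e4b8ad (中, left unchanged)
```

The validity check is a heuristic: a few Latin-1 byte pairs also happen to be valid UTF-8, so test on a copy of the table and spot-check the results before writing anything back.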

And once you are certain that all of your data in the DB is utf8, you can change your input form and declare it as utf8 as well. If you are utf8 at every step of the way, you will probably get better results.

As for those pesky apostrophes, I'd be willing to bet they are the infernal 'curly quotes' that word processors like to use -- MS Word being one particular offender. If you look at a Latin-1 character table, you will see that there are double quotes and single quotes -- neither of which is curly. If your input form is declared as latin-1, then someone may have posted input in there from MS Word or something and when the form was submitted, I have no idea what happened to this char. I find myself wondering if it's a single byte char or a double byte char. You might have to try a cleanup job to replace all curly quotes with regular ones. Not really sure how to do that.
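For the cleanup job itself, one possible approach is a plain byte-sequence replacement with strtr, sketched below. It assumes the text is already valid UTF-8 at that point; the function name straightenQuotes and the choice of which characters to replace are just for illustration.

```php
<?php
/**
 * Hypothetical helper: swap Word-style punctuation (as UTF-8) for ASCII.
 */
function straightenQuotes(string $utf8Text): string
{
    return strtr($utf8Text, [
        "\xE2\x80\x98" => "'",  // U+2018 left single quote
        "\xE2\x80\x99" => "'",  // U+2019 right single quote / apostrophe
        "\xE2\x80\x9C" => '"',  // U+201C left double quote
        "\xE2\x80\x9D" => '"',  // U+201D right double quote
        "\xE2\x80\x93" => '-',  // U+2013 en dash
    ]);
}

echo straightenQuotes("It\xE2\x80\x99s fine"), "\n";  // It's fine
```

If the page is UTF-8 from end to end the curly quotes display fine as they are, so this step is optional.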