Multilingual Blues

sol01

Here is the deal: in the Chinese (Big5) code set, each Chinese character takes 2 bytes (the space of two English characters). If you are viewing a Chinese article in the browser and one of a character's two bytes is missing, the remaining byte displays as a question mark (?).

Now, what I need to do is extract a string of 120 bytes (60 characters) from the database and display it in the browser. If the entire string had no spaces or punctuation, all would be well: 120 bytes at 2 bytes per character is exactly 60 characters, so no leftover byte shows up as a question mark (?) in the browser.

However, if there is a comma, the selection comes out uneven and I get a question mark, because a comma is only 1 byte: it shifts everything after it by one byte, so the 120-byte cut lands in the middle of a 2-byte character. The byte value that "powers" the comma, so to speak, is also used as part of other characters. So I can't do a simple string replace on it, as I would be taking chunks out of other characters and thereby making more question marks (?) 🙂!!
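To make the failure concrete, here is a minimal Python sketch of what I think is happening. The sample text, the `naive_cut` name, and the use of Python's built-in `big5` codec are all my own assumptions for illustration; the real data comes from the database.

```python
def naive_cut(text: str, limit: int = 120) -> str:
    """Hypothetical helper: cut the Big5 bytes at a fixed length, then decode."""
    raw = text.encode("big5")
    chunk = raw[:limit]  # blind byte cut, ignores character boundaries
    # errors="replace" stands in for what the browser does with a stray byte
    return chunk.decode("big5", errors="replace")


# Made-up sample: 30 Chinese characters, one 1-byte comma, 30 more characters.
sample = "中" * 30 + "," + "文" * 30
raw = sample.encode("big5")   # 30*2 + 1 + 30*2 = 121 bytes
shown = naive_cut(sample)     # the cut falls inside the last 2-byte character

print(len(raw))
print(shown)
```

With the comma in place, the 120th byte is the first half of a character, so the decoded text ends in a replacement mark instead of a real character. Without the comma, the same cut falls cleanly between characters.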

Anyone have any powerful suggestions???