Special characters driving me crazy - converting UTF-8 to HTML

bunner_bob · Nov 24, 2009

I'm trying to convert text that's entered in a form into HTML characters (either numeric or alpha - I don't care which at this point) before storing in a database. Specifically things like "curly" quotes, curly apostrophe, bullet, accented characters - that sort of thing.

I had it all working great (using a character map array) when the page my form was in was encoded as ISO-8859-1, but now the page is UTF-8 and I can't get anything to work. I tried

mb_convert_encoding($s,'HTML-ENTITIES','UTF-8');

which resulted in one kind of gibberish. Then I tried the class I found here (http://mikolajj.*********.pl/), using

$cc = new ConvertCharset('utf-8', 'iso-8859-1', 1);
$result = $cc->Convert($imploded_string);

with different gibberish as the result (e.g. left double quote turns into â�&#65533;). I also tried running my character map on $result, which resulted in even more gibberish.
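A minimal sketch of where this kind of gibberish tends to come from, assuming the mbstring extension is available: a left double quote is three bytes in UTF-8, and if something treats those bytes as three separate ISO-8859-1 characters you get "â" followed by two invisible control characters. The variable names are only for illustration.

```php
<?php
// UTF-8 bytes for the left double quote (U+201C).
$utf8Quote = "\xE2\x80\x9C";

// One character, but three bytes.
$charCount = mb_strlen($utf8Quote, 'UTF-8');
$byteCount = strlen($utf8Quote);

// Simulate the mistake: treat each byte as an ISO-8859-1 character
// and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8Quote, 'UTF-8', 'ISO-8859-1');

// The quote has become three characters: U+00E2 (â), U+0080, U+009C.
echo $charCount, ' char, ', $byteCount, ' bytes, misread as ', bin2hex($misread), "\n";
```

Running a single-byte character map over UTF-8 input hits the same problem, since the map matches individual bytes rather than whole characters.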

I'm tempted to go back to ISO-8859-1...

Any ideas? Obviously I'm clueless regarding character sets...

NogDog · Nov 24, 2009

Of possible interest from my blog:
Filtering MS Word Text (including "smart quotes")
UTF8 in PHP and MySQL

bunner_bob · Nov 24, 2009

Very helpful! Wow - I had to do pretty much everything to get it to work:
- UTF-8 for the db (fortunately already done)
- UTF-8 setting in forms
- UTF-8 in htmlentities
- and your filter function
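The form and database items in that checklist could look roughly like this. It is only a sketch: render_comment_form() and is_valid_utf8() are hypothetical helpers invented for this example, the field name is made up, and the database lines are left as comments because they need real connection details.

```php
<?php
// Hypothetical helper: builds a form that asks the browser to submit UTF-8.
function render_comment_form(string $action): string
{
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="' . $safeAction . '" accept-charset="UTF-8">'
         . '<textarea name="body"></textarea>'
         . '<input type="submit" value="Save">'
         . '</form>';
}

// Hypothetical helper: checks that submitted text really is valid UTF-8
// before it goes anywhere near the database.
function is_valid_utf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

// In a real page you would also declare the charset before any output
// and tell the MySQL connection to use UTF-8 (placeholders, not run here):
//   header('Content-Type: text/html; charset=utf-8');
//   $db = new mysqli($host, $user, $pass, $name);
//   $db->set_charset('utf8mb4');
```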

What's weird is that when I view the database with phpMyAdmin, the special characters all show up as themselves - NOT numeric or other HTML items. As if they're not being converted to anything at all. Maybe they aren't?

Everything displays fine on Mac and Windows.

However when I view source all the characters are still themselves - not visibly encoded as html entities of any kind. Is that just how UTF-8 works? We no longer need HTML entities?

bunner_bob · Nov 24, 2009

Huh - well, apparently I was doing something wrong. For one thing, I misplaced the UTF-8 argument in htmlentities. Now that it's in the right place, it seems to be successfully converting Word curly quotes and everything else I can throw at it - without the need for additional filtering.
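For reference, the encoding goes in the third argument of htmlentities(); the second argument is the flags bitmask. A minimal sketch, with the sample text made up for illustration:

```php
<?php
// UTF-8 bytes for: “Café” • it’s
$text = "\xE2\x80\x9CCaf\xC3\xA9\xE2\x80\x9D \xE2\x80\xA2 it\xE2\x80\x99s";

// Argument order: string, flags, encoding.
$encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n"; // prints: &ldquo;Caf&eacute;&rdquo; &bull; it&rsquo;s
```

Putting 'UTF-8' in the flags position instead is an easy slip to make, and it leaves the function working with whatever default encoding that PHP version uses.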

Whew!

NogDog · Nov 25, 2009

bunner_bob wrote:
...
What's weird is that when I view the database with phpMyAdmin, the special characters all show up as themselves - NOT numeric or other HTML items. As if they're not being converted to anything at all. Maybe they aren't?
...

The characters should be stored as themselves in the DB. As long as whatever you are using to view them displays UTF-8 (or is doing its own filtering) you'll see them OK.

...However when I view source all the characters are still themselves - not visibly encoded as html entities of any kind. Is that just how UTF-8 works? We no longer need HTML entities?

When outputting a web page as UTF-8, most special characters do not need to be converted to HTML entities. The only ones that have to be escaped are the ones with special meaning in HTML: <, >, & and, inside attribute values, quotes. (HTML entities were essentially invented to cope with characters outside the basic ASCII set that could not be represented reliably in a single byte. Since UTF-8 uses multiple bytes as needed, that is no longer an issue.)
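In PHP terms that usually means htmlspecialchars() rather than htmlentities() when the page is served as UTF-8. A small sketch with made-up sample text:

```php
<?php
// UTF-8 bytes for: “Tom & Jerry” <b>café</b>
$text = "\xE2\x80\x9CTom & Jerry\xE2\x80\x9D <b>caf\xC3\xA9</b>";

// Escapes only &, <, > and quotes; the curly quotes and the é stay as UTF-8.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // prints: “Tom &amp; Jerry” &lt;b&gt;café&lt;/b&gt;
```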

bunner_bob · Nov 25, 2009

Huh - that's some scary newfangledness. I'm so used to making everything HTML.

So your function catches the items that must be converted, leaving the rest?

NogDog · Nov 25, 2009

bunner_bob wrote:
Huh - that's some scary newfangledness. I'm so used to making everything HTML.

So your function catches the items that must be converted, leaving the rest?

The one for the MS Word characters is there because M$ uses its own character set (Windows-1252), so the function converts the characters that don't translate well into UTF-8 (or ISO-8859-1, for that matter).
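This is not the filter function from the blog, just a rough sketch of the same idea under two assumptions: the stray bytes are Windows-1252, and the mbstring extension is installed. word_text_to_utf8() is a hypothetical name.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone, otherwise assumes the
// input is Windows-1252 (where curly quotes live at 0x91-0x94) and converts it.
function word_text_to_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// "\x93" and "\x94" are the Windows-1252 curly double quotes.
$fixed = word_text_to_utf8("\x93Hi\x94");

echo bin2hex($fixed), "\n"; // prints: e2809c4869e2809d
```

The UTF-8 check comes first so that text which is already correct does not get converted a second time.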
