weird utf8_encode problem

nubs

the php version on the server isn't up to date, tho' i can't check the version (phpinfo() is blocked), and i have no control over its updates

i've got a string encoded in iso-8859-1 that i call utf8_encode on before echoing out to html, but while most of it looks fine (umlauts), i only get a – instead of a dash (–)
0096 is iso-8859-1 for dash and when i look at the echoed html without the utf8_encode and as iso-8859-1, that's exactly what i see

so why isn't utf8_encode encoding this character? is it a bug in older php, or am i missing something here?

Weedpacket

Sorry; what are you getting instead of a dash? I see a dash there: –.

Codepoint 96 is for a backtick character ` (it's plain ASCII and therefore the same in plain ASCII, ISO-8859-1, and UTF-8).

nubs wrote:
i can't check the version (phpinfo() is blocked)

Can you use the phpversion() function?

nubs

hm, let's try that again with code tags... this is what i see

text – more text

phpversion(); doesn't give me anything either

nubs

in case the code tag didn't work (pending moderator check), here's a cropped screenshot of it in firefox:
http://i44.tinypic.com/alni3r.png

nubs

oh and 96 in hex -> 150 in dec
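The hex/decimal mix-up is easy to check. A tiny sketch (the variable names are just for illustration):

```php
<?php
// "96" read as hexadecimal is 150 in decimal.
$byte = hexdec('96');

// Decimal 96 (hex 60) is the plain ASCII backtick.
$backtick = chr(96);

printf("0x96 = %d, chr(96) = %s\n", $byte, $backtick);
```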

Weedpacket

(0x96==150, d'oh! I'm getting old....)
If phpversion() is blocked as well, then my next port of call would be the person who's responsible for the installation (and the blocking) to find out what the current version is. That said, I'm not aware of any versioning issues with utf8_encode.

Given the screenshot, I understand what you mean now: an en-dash encoded according to ISO-8859-1 sitting in the middle of a page that is supposed to be encoded in UTF-8.

The page is clearly not ISO-encoded, or the character would be rendered as –. I notice in that screenshot an a-umlaut on an earlier line. That's encoded correctly - it should be represented in the document by two bytes, 195-164 (rendered in ISO as Ã¤). It may have got into the page by a different route, but even if it has it's still a sanity check on the encoding for the rest of the page.
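That sanity check can be sketched in a few lines. This assumes nothing about the real page: the bytes are typed in by hand, and the `//u` trick is just one common way to test a string for UTF-8 validity without extra extensions.

```php
<?php
// "ä" (U+00E4) written as explicit UTF-8 bytes, so this file's own encoding doesn't matter.
$aUmlaut = "\xC3\xA4";

// The two byte values in decimal.
$bytes = array_map('ord', str_split($aUmlaut));

// An empty pattern with the /u modifier only matches if the subject is valid UTF-8.
$umlautIsUtf8 = preg_match('//u', $aUmlaut) === 1;

// A lone byte 150 (0x96) is a continuation byte with no lead byte, so it is not valid UTF-8.
$loneByteIsUtf8 = preg_match('//u', "\x96") === 1;

echo implode('-', $bytes), "\n";   // 195-164
var_dump($umlautIsUtf8, $loneByteIsUtf8);
```

So a page that shows the a-umlaut correctly but chokes on the dash is consistent with a UTF-8 page that has a stray single byte in it.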

But like I said, I don't know of any version issues with utf8_encode, and without knowing the mechanics of how the text is stored with what encoding, and how it gets on to the page, I'm at a bit of a loss about what to suggest. You might want to look at the iconv extension to see if that offers more control.

(On a bit of a side note, personally I store and render all text as UTF-8; one downside to it is that I need to know in advance the encoding of any text I'm going to store, because if it's in UTF-8 and I treat it as ISO-8859-1 I'll end up storing and saying "tenttejÃ¤".) I'm also looking forward to PHP 6, which is supposed to have much better Unicode support.

nubs

i intentionally left the ä in that crop - they come from the same place. problem is, i don't really know where from
i'm doing a website for a department in a university and the (news-)content is retrieved by an included php class (which i have no read access to)

when i remove the utf8_encode($string), only echo the plain $string and render the page as iso-8859-1, the dash shows fine (but umlauts and such are obviously messed up), so it's definitely encoded with that; why utf8_encode doesn't convert it, i've no idea

i'm guessing all i can do is try to communicate with the webmaster...

nubs

meant to say umlauts and such are messed up for the other parts, which are encoded in utf8

Weedpacket

I thought that was why you had the ä there.

Just running on guesswork and making stuff up as I go along here, now (my usual approach to this kind of problem, I'm afraid).

I agree with your original post: a lone chr(150) should not come out of utf8_encode(); either that function is broken (and I know of no bugs in that regard), or something is converting it back before it goes in the page.

Something like this:

The text contains an "en-dash"
Which has been encoded according to ISO-8859-1
In other words, chr(150)
That byte goes into utf8_encode
Which assumes it's ISO-8859-1
And outputs two bytes chr(194) chr(150)
Which is the UTF-8 encoding for "en-dash"
Something weird happens (this is the important bit and this is where I'm not being much help)
And that "en-dash" is back to being encoded as chr(150)
Which is rendered according to UTF-8
And the renderer says a rude thing.
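The first few steps can be tried out directly. This is only a sketch: it uses iconv() rather than utf8_encode() (for ISO-8859-1 input the two should give the same bytes), and it assumes the iconv extension knows the name 'CP1252'. One thing it shows is worth flagging: strictly speaking, byte 150 in ISO-8859-1 is a C1 control character, and the en-dash at that position is a Windows-1252 convention. Browsers commonly treat a page labelled ISO-8859-1 as Windows-1252, which may be why the raw byte shows up as a dash in that mode.

```php
<?php
$raw = chr(150);   // the single byte 0x96 from the source text

// Treat the byte as ISO-8859-1, which is the assumption utf8_encode() makes.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Treat the same byte as Windows-1252 instead.
$asCp1252 = iconv('CP1252', 'UTF-8', $raw);

echo bin2hex($asLatin1), "\n";   // c296   = U+0096, a control character
echo bin2hex($asCp1252), "\n";   // e28093 = U+2013, EN DASH
```

If that is what is going on here, chr(194) chr(150) would be a faithful conversion of the byte, just not to the character anyone wanted, and converting from Windows-1252 might be the thing to try.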

Why should the ä come through intact, though? If that text was already in UTF-8 when it was stored, then utf8_encode (which assumes its input is in ISO-8859-1) would convert it again (outputting bytes representing "A-tilde currency-sign"). Then the weirdness happens and those bytes get turned back into the UTF-8 encoding for "a-umlaut" and are rendered without any trouble.
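A rough sketch of that round trip, with the "weirdness" modelled as a plain UTF-8 to ISO-8859-1 conversion. That is purely a guess for illustration, not something known about the server, and iconv() stands in for utf8_encode() again.

```php
<?php
$utf8 = "\xC3\xA4";   // "ä" already encoded as UTF-8

// Encode again as if the bytes were ISO-8859-1: this gives the bytes for "Ã¤".
$double = iconv('ISO-8859-1', 'UTF-8', $utf8);

// A later UTF-8 -> ISO-8859-1 step (the hypothetical weirdness) undoes it.
$restored = iconv('UTF-8', 'ISO-8859-1', $double);

// The same step would also shrink chr(194) chr(150) back to a lone chr(150).
$dashBack = iconv('UTF-8', 'ISO-8859-1', "\xC2\x96");

echo bin2hex($double), "\n";     // c383c2a4
echo bin2hex($restored), "\n";   // c3a4
echo bin2hex($dashBack), "\n";   // 96
```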

I reckon that what you have stored in the database is encoded according to ISO-8859-1 (hence the need for utf8_encode in the first place); but UTF-8-encoded text has been put in it (valid UTF-8 is also valid ISO-8859-1).

Another possibility: incoming text is UTF-8 for the most part, but there are some invalid characters in there (say, a text file originally encoded as UTF-8, but later edits used ISO-8859-1 and the writer didn't notice things like "Ã¤" instead of "ä"). That would be messy, and would probably require sentient intervention to tidy up. Definitely time for a 🙁.
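If the input really is a mix, one pragmatic approach is to decide per string: leave it alone if it is already valid UTF-8, otherwise convert it. A minimal sketch, where to_utf8() is a made-up helper name and Windows-1252 as the fallback encoding is an assumption:

```php
<?php
/**
 * Hypothetical helper: return $s as UTF-8, guessing per string.
 * Valid UTF-8 passes through untouched; anything else is assumed to be Windows-1252.
 */
function to_utf8(string $s): string
{
    // An empty pattern with the /u modifier only matches valid UTF-8 subjects.
    if (preg_match('//u', $s) === 1) {
        return $s;
    }
    // A few byte values are undefined in Windows-1252, so iconv() can fail;
    // @ silences its notice and the input is returned unchanged in that case.
    $converted = @iconv('CP1252', 'UTF-8', $s);
    return $converted === false ? $s : $converted;
}

echo bin2hex(to_utf8("\xC3\xA4")), "\n";   // c3a4 (already UTF-8, unchanged)
echo bin2hex(to_utf8("\x96")), "\n";       // e28093 (converted to an en-dash)
```

This only helps when each string is in one encoding. A single string that mixes valid UTF-8 sequences with stray single bytes would be converted wholesale and the UTF-8 parts mangled, so that case still needs someone to look at it.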

Inconsistently-encoded input would explain inconsistently-encoded output. But it doesn't explain why the effect is the exact opposite of what one'd expect.

nubs

is there a way to see the binary of the returned html?

laserlight

You should be able to copy and paste it into a hex editor.

nubs

would that keep the binary intact tho'?
i've no idea what the browser does to the data and then it goes through the (win xp) clipboard?
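One way to sidestep the browser and the clipboard would be to dump the bytes as hex on the server, before echoing. Hex digits are plain ASCII, so they should survive whatever happens to the page afterwards. A sketch, where $string is a stand-in for whatever the news class returns:

```php
<?php
// Stand-in for the real content, with a raw byte 0x96 in the middle.
$string = "text \x96 more text";

// Two hex digits per byte, separated by spaces, taken before any utf8_encode() call.
$hex = implode(' ', str_split(bin2hex($string), 2));

echo '<pre>', $hex, '</pre>', "\n";
// 74 65 78 74 20 96 20 6d 6f 72 65 20 74 65 78 74
```

Running the same dump on the result of utf8_encode($string) would show whether the function is producing two bytes for the dash, or whether something changes them later.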