UTF-8 conversions

Arkane

Hey guys,

I'm having some problems trying to convert stuff to access an external api.
The site I'm using has access by querying with a title formatted as UTF-8, however I can't seem to get the right results in php.

I have been messing around all day, and have made no progress basically.

What I have so far, is

urlencode(utf8_encode($game))

This half works, but not fully.
What I'm trying to get is this:

Game = The Incredible Hulk™, Actual Result = The+Incredible+Hulk%99, Desired Result = The+Incredible+Hulk%E2%84%A2

Game = Unreal Tournament® 3, Actual Result = Unreal+Tournament%C2%AE+3, Desired Result = Unreal+Tournament%C2%AE+3

Game = LEGO® Indiana Jones™, Actual Result = LEGO%C2%AE+Indiana+Jones%C2%99, Desired Result = LEGO%C2%AE+Indiana+Jones%E2%84%A2

As you can see, it works for some symbols, but not others. I've been searching, and according to the letter database, the desired results do match what the UTF-8 codes should be; I'm just not getting them.

I'm pulling the game names out of a mysql database, but I have also tried it as a constant, and still have the same problem.
Can anyone give any suggestions as to why this is happening, or what might make it go right?

Cheers
Arkane

Weedpacket

Why would you want ™ to come out as \xe2\x84\xa2? Encoded as UTF-8 it's \xc2\x99, not \xe2\x84\xa2.

Arkane

Weedpacket;10907322 wrote:
Why would you want ™ to come out as \xe2\x84\xa2? Encoded as UTF-8 it's \xc2\x99, not \xe2\x84\xa2.

That's the result I keep getting, but the api I'm trying to access has it as e2,84,a2, so that's what I need. I just assumed it was UTF-8 (well, not entirely assumed; I got it from this page during my hunting for information, so naturally assumed it was right).

Perhaps it's actually another encoding or something that I may be able to find?

Cheers
Arkane

Weedpacket

Oh; I see what's going on. The text you've got (wherever it's come from) is encoded (and you're displaying it) according to the Windows-1252 character set, which puts ™ at code point 153.

Unicode puts that glyph at code point 8482. For the most part the two agree, but this is one place where they don't.

iconv() returns the proper byte sequence.
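
A minimal sketch of that difference, assuming the title arrives as Windows-1252 bytes (the sample string and the variable names are only for illustration, and `CP1252` is the alias I'd expect most iconv builds to accept for that code page). It avoids calling utf8_encode() directly and uses the equivalent ISO-8859-1 conversion to show where %C2%99 comes from:

```php
<?php
// Assumption: ™ is stored as the single Windows-1252 byte 0x99.
$game = "The Incredible Hulk\x99";

// Reading the bytes as ISO-8859-1 (what utf8_encode() does) maps 0x99 to
// the control character U+0099, which UTF-8 encodes as C2 99.
$asLatin1 = urlencode(iconv('ISO-8859-1', 'UTF-8', $game));

// Naming the real source encoding maps 0x99 to U+2122, which UTF-8
// encodes as E2 84 A2.
$asCp1252 = urlencode(iconv('CP1252', 'UTF-8', $game));

echo $asLatin1, "\n"; // The+Incredible+Hulk%C2%99
echo $asCp1252, "\n"; // The+Incredible+Hulk%E2%84%A2
```

® comes out the same either way because ISO-8859-1 and Windows-1252 agree on byte 0xAE; they only differ in the 0x80 to 0x9F range.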

Curiously, if I look at the source code of this page, your trademark signs appear as numeric entities &#8482; and mine appear as ™ - even though this page is allegedly ISO-8859-1. That is weird.

Arkane

OK, this just seems a very odd situation.

I guess I'm going to have to write a workaround or something. I tried the lazy way, and just used iconv to translate the string, but now I get â„¢ as the TM using that same method. Even if that's the only thing in the script.

I see what you mean about the source, it seems very odd.

Now I'm just stumped. 🙁

Weedpacket

Arkane wrote:
but now I get â„¢

Of course; that's what \xe2\x84\xa2 looks like before you urlencode it (and if you display it as Windows-1252 instead of UTF-8).
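
To see that for yourself, here is a rough sketch that deliberately misreads the three UTF-8 bytes as Windows-1252, the code page in which they render as â„¢. The variable names are made up for the example, and the hex dump is only there so the result doesn't depend on how your terminal or browser decodes the output:

```php
<?php
// The correct UTF-8 encoding of the trademark sign (U+2122).
$tm = "\xE2\x84\xA2";

// Pretend those bytes are Windows-1252 text and re-encode as UTF-8:
// 0xE2 is "â", 0x84 is "„" and 0xA2 is "¢", i.e. the "â„¢" seen on screen.
$misread = iconv('CP1252', 'UTF-8', $tm);
echo bin2hex($misread), "\n"; // c3a2e2809ec2a2

// The original bytes are fine; urlencode() gives what the API wants.
$encoded = urlencode($tm);
echo $encoded, "\n"; // %E2%84%A2
```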

Arkane

Weedpacket;10907444 wrote:
Of course; that's what \xe2\x84\xa2 looks like before you urlencode it (and if you display it as Windows-1252 instead of UTF-8).

OK. It's working now... couple of fails, but for the most part it's great, so thanks very much for all your help Weedpacket.

Oddly, I actually discovered the ™ situation cropping up again in mysql. Most of the results use the actual symbols, but recent ones are using the codes instead.
Could it be somehow browser related? (Not claiming to know, just looking at possibilities)

Weedpacket

Arkane wrote:
Could it be somehow browser related? (Not claiming to know, just looking at possibilities)

Don't quote me on this, but there might be some possibilities. One is that if you don't specify an accept-charset attribute for a form, submitted character data will be encoded in the same way that the page itself is encoded. Is this coming from a form? I don't think people would be typing ™ as part of their search criteria 🙂.

In the database, there is the question of which character set is being used to store the data there as well. I've actually instituted a policy of "UTF-8 everywhere" locally - down to and including source files.
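
As a rough illustration of pushing everything towards UTF-8 on the PHP side, here is a sketch of a normalising helper. `to_utf8()` is a hypothetical name, not something from PHP itself, and it rests on a loud assumption: anything that isn't already valid UTF-8 is Windows-1252. That guess can be wrong for other legacy data, and a few Windows-1252 strings happen to be valid UTF-8 by coincidence.

```php
<?php
// Hypothetical helper (not a PHP built-in): normalise a string to UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $s): string
{
    // With the /u modifier, preg_match() only matches valid UTF-8 subjects.
    if (preg_match('//u', $s) === 1) {
        return $s; // already valid UTF-8, leave untouched
    }

    // iconv() returns false (with a notice) on bytes it cannot map.
    $converted = @iconv('CP1252', 'UTF-8', $s);
    if ($converted === false) {
        throw new InvalidArgumentException('Input is neither UTF-8 nor Windows-1252');
    }
    return $converted;
}

echo urlencode(to_utf8("LEGO\xAE Indiana Jones\x99")), "\n";
// LEGO%C2%AE+Indiana+Jones%E2%84%A2
```

Running the same string through the helper twice gives the same result, so it should be safe to apply to rows that are already UTF-8.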

I've even had my browser pick UTF-8 as its preferred encoding. One of the reasons why I'm replying here is because I've just changed it to ISO-8859-1 and I want to see how the "™" ends up - since that's not a valid ISO-8859-1 character I'm expecting to see a numeric entity in the source code 🙂 ....

...And there it is! It's even escaped '"' as well!
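
If some of your rows have ended up holding the numeric entity rather than the symbol itself, one possible cleanup (a sketch only, assuming the stored value looks like the sample below) is to decode the entities to UTF-8 before you urlencode:

```php
<?php
// Assumption: a title stored with a numeric entity in place of the ™ symbol.
$stored = 'The Incredible Hulk&#8482;';

// Decode entities, including quotes, into real UTF-8 characters.
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

$encoded = urlencode($decoded);
echo $encoded, "\n"; // The+Incredible+Hulk%E2%84%A2
```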