[RESOLVED] php and sockets: utf8 capable?

sneakyimp · Feb 24, 2009

So I'm wondering if PHP can handle double-byte character sets. It doesn't seem to be default. I ran this script on an iMac with MAMP installed and it resulted in a file with a size of 3 bytes:

<?php

file_put_contents('foo.txt', '123')
  or die('file put failed');
?>

It's obviously using a single byte for each char which is fine for ASCII text but not good for Kanji or something.

I also noticed that the [man]socket_read[/man] function returns a string. Does this string consist of single-byte or double-byte characters? What if the data I'm reading comes from a Java or Flash client which in turn is reading multibyte chars from a user form?

I'd really like to know more about how this might work.

scrupul0us · Feb 24, 2009

See the man page...

Option Flags:

"FILE_TEXT" data is written in text mode. If unicode semantics are enabled, the default encoding is UTF-8. You can specify a different encoding by creating a custom context or by using the stream_default_encoding() to change the default. This flag cannot be used with FILE_BINARY. This flag is only available since PHP 6.

Weedpacket · Feb 24, 2009

PHP5 doesn't know about character encodings; it just writes bytes. So a double-byte character would be two bytes, and a string containing a double-byte character would be described as having a length of two.

sneakyimp · Feb 24, 2009

Thanks for the posts, but I'm not sure I understand at all.

scrupul0us, what does it mean to say "If unicode semantics are enabled, the default encoding is UTF-8" ?? Or for that matter, what does it mean to say "You can specify a different encoding by creating a custom context" ?? Also, it says the flag is only available since PHP 6 which doesn't help me at all.

Weedpacket, I don't really follow your post. Suppose I change my script to this:

<?php

file_put_contents('foo.txt', '寿司はおいしい')
  or die('file put failed');
?>

I don't know if this will post properly here, but the string I'm putting in 'foo.txt' contains Japanese characters. Is this proper PHP ?? Will it run? I can't seem to get the chars in there using Notepad2, Dreamweaver, nano, or OpenOffice Writer 2.4.

EDIT: Japanese chars do work if you have Japanese fonts.

Weedpacket · Feb 24, 2009

sneakyimp wrote:
I don't know if this will post properly here, but the string I'm putting in 'foo.txt' contains Japanese characters. Is this proper PHP ?? Will it run? I can't seem to get the chars in there using Notepad2, Dreamweaver, nano, or OpenOffice Writer 2.4.

As far as PHP is concerned the string literal is just a string of bytes. No magic there. Whether that string of bytes will represent a UTF-8-encoded string depends on whether whatever you edited the PHP file in saved it as such.

Incidentally, you don't have to go all the way out into Japanese. Just having something like "ΦФ" would be enough and then the font being used to render the page would only have to support Greek and Cyrillic. It would still involve characters that take up more than one byte in UTF-8.

For example:

<?php echo strlen("FΦФ"); ?>

outputs 5 when the file is saved as UTF-8 (one byte for 'F' and two each for "Φ" and "Ф"). If you try saving it as ASCII you lose the extra characters (obviously) and they probably get replaced by '?', resulting in a string length of 3 or 2 (depending on how consecutive invalid characters are handled).

OpenOffice Writer 2.4.

Did you try saving it as "Text Encoded" and selecting UTF-8? Generally speaking, if you want your string literals to be handled as UTF-8 your editor has to save it as such. Not surprising: it's the editor's job to translate between displayed characters and bit patterns. Even bog-standard Windows Notepad can do this.

sneakyimp · Feb 28, 2009

OK I tried again with your greek chars. I'm not so sure what was so difficult about it last time...perhaps the chinese characters are somehow more foreign than those greek ones (and this doesn't help my confidence about the whole UTF8 question we are considering here).

I made this file in Notepad and did a 'Save As' using UTF-8 encoding:

<?php
file_put_contents('foo.txt', 'ΦФ')
  or die('file put failed');
?>

I uploaded it to my dev machine (a Mac) via FTP. Interestingly, if I use nano on the mac to edit my file, I get this:

ï»¿<?php
file_put_contents('foo.txt', 'Î¦Ð¤')
  or die('file put failed');
?>

I have no idea what the funky chars are at the beginning. Perhaps something to signify UTF-8 encoding? Also worth noting are the weird chars I get instead of the ΦФ I was expecting. I suppose I'm not surprised that nano would have trouble with greek chars. Should I be?

When I execute the file using MAMP or the version 4.x PHP installed on OSX by default, those chars reappear (look carefully at the beginning of the second line):

MyMac:Desktop sneakyimp$ php chump.php
ï»¿MyMac:Desktop sneakyimp$

This doesn't look like it's PHP's fault to me -- probably some weird file formatting issue between macs and PCs, either notepad or FTP or something like that.

If I open the file using TextEdit on the mac, I don't see those funny chars at all. The file looks fine and dandy with the greek chars and all.

If I open the file in TextEdit and save it without changing anything, the funny chars aren't there when I subsequently open it with nano:

<?php
file_put_contents('foo.txt', 'Î¦Ð¤')
  or die('file put failed');
?>

And this is where things get pretty darn confusing to me. The PHP code on the mac looks awesome when I open it using text edit...exactly as I want it to:

<?php
file_put_contents('foo.txt', 'ΦФ')
  or die('file put failed');
?>

I run it with no complaints or weird chars appearing and the resulting file is 4 bytes as weedpacket predicted:

MyMac:Desktop sneakyimp$ php chump.php
MyMac:Desktop sneakyimp$ ls -l foo.txt
-rw-r--r--  1 sneakyimp  staff  4 Feb 28 14:00 foo.txt

HOWEVER, when I open the output file foo.txt in TextEdit on the Mac, it doesn't contain two greek chars but four chars that look nothing like the ones I put in there.

Œ¶–§

Further boggling my brain is the fact that when I download foo.txt using Dreamweaver to my WinXP machine, I can open it in notepad or dreamweaver and there are my two greek chars in all their naked greek glory:

ΦФ

I guess this weird-char-thing illustrates that I'm not the only person confused by utf8 encoding. This is reassuring -- or maybe not?

I will tentatively say that PHP seems to do its job properly. When you run PHP on some file, it would appear to interpret the PHP file properly regardless of whether it is UTF-8 or ASCII encoded. What puzzles me is what happens when I run this script (named 'bar.php') on foo.txt:

<?php
echo "file foo.txt is " . filesize('foo.txt') . " bytes\n";
$str = file_get_contents('foo.txt');
echo "the contents of foo.txt have a string length of " . strlen($str) . " characters\n";

for($i=0; $i<strlen($str); $i++) {
	echo "\t$i=" . $str[$i] . "\n";
}
?>

The output in my terminal window is this:

MyMac:Desktop sneakyimp$ php bar.php
file foo.txt is 4 bytes
the contents of foo.txt have a string length of 4 characters
        0=Î
        1=¦
        2=Ð
        3=¤

I'm perfectly willing to accept that foo.txt has a filesize of 4 bytes but I must take issue with the thought that it is 4 characters long. It's only got two greek chars, not 4.

This also brings me round to my original question about information coming off a socket. [man]socket_read[/man] returns a string value. If I attempt to deal with that string value in the wrong way, I could be getting gibberish (and twice as much of it!) rather than the original greek or japanese or whatever that was coming off the socket.

There seems to be a bit more to this than I am understanding.

Weedpacket · Mar 1, 2009

sneakyimp wrote:
I have no idea what the funky chars are at the beginning. Perhaps something to signify UTF-8 encoding?

Your editor chose to put a Unicode Byte-Order-Mark (BOM) at the start of the file. The three funky characters are its UTF-8 bytes (EF BB BF) displayed as if they were Latin-1. A BOM isn't needed in UTF-8 (it only makes sense in encodings with multi-byte code units, like UTF-16, so that the recipient knows what order the bytes of each unit are written in), but some editors think that sticking a BOM into UTF-8 documents is somehow helpful.
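If you can't stop the editor from adding the BOM, you can at least detect and drop it when reading such data. This is only a rough sketch: strip_utf8_bom() is a made-up helper name, not something PHP provides, and it handles the UTF-8 BOM only.

```php
<?php
// Hypothetical helper (not a PHP built-in): drop a leading UTF-8 BOM if present.
function strip_utf8_bom($bytes)
{
    $bom = "\xEF\xBB\xBF"; // the UTF-8 encoding of U+FEFF
    if (strncmp($bytes, $bom, 3) === 0) {
        // Cast because substr() can return false on older PHP versions
        // when nothing is left after the BOM.
        return (string) substr($bytes, 3);
    }
    return $bytes;
}

$withBom = "\xEF\xBB\xBF" . "abc";
echo strlen($withBom), "\n";                  // 6
echo strlen(strip_utf8_bom($withBom)), "\n";  // 3
echo bin2hex(strip_utf8_bom("abc")), "\n";    // 616263
```

Note this helps with data you read in. In the script above the BOM sits in front of "<?php", so PHP treats it as ordinary output and echoes it, which is what turned up in the terminal. The fix there is to re-save the file without a BOM.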

Also worth noting are the weird chars I get instead of the ΦФ I was expecting. I suppose I'm not surprised that nano would have trouble with greek chars. Should I be?

Depends, is nano supposed to be able to read UTF-8-encoded files?

I'm perfectly willing to accept that foo.txt has a filesize of 4 bytes but I must take issue with the thought that it is 4 characters long. It's only got two greek chars, not 4.

Like I said originally, PHP doesn't recognise character encodings; to it strings are just sequences of bytes, and that's what it counts.

This also brings me round to my original question about information coming off a socket. socket_read returns a string value. If I attempt to deal with that string value in the wrong way, I could be getting gibberish (and twice as much of it!) rather than the original greek or japanese or whatever that was coming off the socket.

Sockets carry bytes, not text. "Character encoding" doesn't have any meaning until later, when you try to interpret the sequence of bytes as text.

If you ask for 8192 bytes you'll get 8192 bytes (or less if you hit the end of the file!). The resulting string will have a string length of 8192. How many characters that represents (assuming you interpret it as representing text) depends on the character encoding that was used.
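One consequence worth spelling out: since a read is counted in bytes, a chunk can end in the middle of a multi-byte character. Here is a minimal sketch of one way to cope, assuming the sender really is sending UTF-8. utf8_split_complete() is a hypothetical helper, and the two hard-coded chunks stand in for what successive socket_read() calls might return. It does not validate the data; it only avoids cutting the final character in half.

```php
<?php
// Hypothetical helper: split a byte buffer into (complete UTF-8 prefix, trailing
// bytes of an unfinished character). Assumes the input is meant to be UTF-8.
function utf8_split_complete($buf)
{
    $len = strlen($buf);
    $i = $len - 1;
    $back = 0;
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    while ($i >= 0 && $back < 3 && (ord($buf[$i]) & 0xC0) === 0x80) {
        $i--;
        $back++;
    }
    if ($i < 0) {
        return array($buf, '');
    }
    // The lead byte says how many bytes the character needs in total.
    $lead = ord($buf[$i]);
    if ($lead >= 0xF0) {
        $need = 4;
    } elseif ($lead >= 0xE0) {
        $need = 3;
    } elseif ($lead >= 0xC0) {
        $need = 2;
    } else {
        $need = 1; // ASCII (or a stray continuation byte, passed through as-is)
    }
    if ($len - $i < $need) {
        // Last character is incomplete: hold its bytes back for the next read.
        return array((string) substr($buf, 0, $i), (string) substr($buf, $i));
    }
    return array($buf, '');
}

// "FΦФ" arriving in two chunks, with Φ (CE A6) cut in half.
$chunks = array("F\xCE", "\xA6\xD0\xA4");
$pending = '';
foreach ($chunks as $chunk) {
    list($text, $pending) = utf8_split_complete($pending . $chunk);
    echo bin2hex($text), "\n"; // 46, then cea6d0a4
}
```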

If you want a function that will interpret a string as UTF-8-encoded text and give you the number of characters, you want mb_strlen() from the [man]mbstring[/man] extension.
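For instance, here is a small sketch of the byte count versus the character count for the same string. The bytes are written as escapes so the result doesn't depend on how the script file was saved, and it assumes the mbstring extension is loaded.

```php
<?php
$str = "\xCE\xA6\xD0\xA4"; // the UTF-8 bytes for ΦФ

echo strlen($str), "\n";             // 4 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n"; // 2 (characters, if read as UTF-8)

// Loop over characters rather than bytes: the /u modifier makes PCRE
// treat the subject as UTF-8, so the empty pattern splits between characters.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
foreach ($chars as $i => $ch) {
    echo "\t$i=", bin2hex($ch), "\n"; // 0=cea6, 1=d0a4
}
```

Compare this with the loop in bar.php, which indexes $str[$i] and therefore steps through the four bytes one at a time.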

it would appear to interpret the PHP file properly regardless of whether it is UTF-8 or ASCII encoded.

Yes; because ASCII is a proper subset of UTF-8. All of PHP's language constructs (including, crucially, "<?php") use only ASCII characters and therefore result in identical bytes being written to the file whether it is saved as ASCII or as UTF-8.

sneakyimp · Mar 4, 2009

Thanks for the guidance, Weedpacket. I still insist that a function like strlen should tell you the number of characters in a string and not the number of bytes. Seems to me like this all comes down to a protocol of some kind. Either my socket code specifies that data needs to be ASCII or UTF-8 or binary or whatever OR I run the risk of confusion when data comes off it.

Thanks especially for the mbstring tip.

Weedpacket · Mar 5, 2009

Sockets are binary.

sneakyimp wrote:
I still insist that a function like strlen should tell you the number of characters in a string and not the number of bytes.

How can it tell you that if it doesn't know what the encoding is? What's the length in characters of a JPEG file?

PHP 6 is supposed to have a Binary/Unicode switch.

sneakyimp · Mar 5, 2009

I was under the mistaken impression until your last post that the string type in PHP somehow would know what its own encoding is. I perhaps mistakenly thought that PHP might have some other data type to hold binary data of any kind. I was completely puzzled why you kept insisting that 'sockets are binary' when I could clearly go and look at the documentation for [man]socket_read[/man] where it clearly says that the result is of type string.

Now I see that [man]pack[/man] also returns type string. It is a revelation to me (forgive me for being stupid) that strings can contain raw binary data. I suppose this is what I get for learning on a loosely typed data language and ignoring all that typing stuff.
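To make that concrete, here is a rough sketch of the kind of protocol I have in mind: each message is prefixed with its length in bytes using pack(), so the reader knows how many bytes belong to it before trying to interpret them as text. frame_message() and unframe_message() are names I made up, and the 4-byte big-endian prefix is just one possible convention.

```php
<?php
// Hypothetical framing helpers: 4-byte big-endian length prefix + payload.
function frame_message($payload)
{
    // 'N' = unsigned 32-bit big-endian. strlen() counts bytes, which is
    // exactly what is needed on the wire.
    return pack('N', strlen($payload)) . $payload;
}

// Returns array(payload, remaining bytes), or null if the buffer does not
// yet hold a complete message.
function unframe_message($buf)
{
    if (strlen($buf) < 4) {
        return null;
    }
    $header = unpack('Nlen', $buf);
    $len = $header['len'];
    if (strlen($buf) < 4 + $len) {
        return null;
    }
    return array((string) substr($buf, 4, $len), (string) substr($buf, 4 + $len));
}

$wire = frame_message("\xCE\xA6\xD0\xA4"); // the UTF-8 bytes for ΦФ
echo bin2hex($wire), "\n";                 // 00000004cea6d0a4

list($payload, $rest) = unframe_message($wire . "extra");
echo bin2hex($payload), "\n";              // cea6d0a4
echo $rest, "\n";                          // extra
```

What encoding the payload is in still has to be agreed separately; the length prefix only says where a message ends.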

I am left with this vague mistrust of input strings of any kind: form input, databases, sockets, files, etc.

Weedpacket · Mar 6, 2009

sneakyimp wrote:
I was under the mistaken impression until your last post that the string type in PHP somehow would know what its own encoding is. I perhaps mistakenly thought that PHP might have some other data type to hold binary data of any kind.

See the first paragraph of:
http://www.php.net/manual/en/language.types.string.html
which defines both "string" and "character" as they are used in PHP. Like terms in other branches of mathematics, terms in programming mean only what they are defined to mean.

sneakyimp · Mar 10, 2009

I've been reading the man page for [man]utf8_encode[/man] over and over again. I tried nearly all of those is_utf8 functions and they don't really agree with each other.

I also have checked out the mb_string stuff and have found it interesting that regular ASCII chars, although only 1 byte in length each, somehow come out as valid UTF-8 chars. Here's a little informative script:

echo 'mb_internal_encoding:' . mb_internal_encoding() . "\n\n";

$str1 = 'ΦФ';
echo 'length:' . strlen($str1) . "\n";
echo 'mb length:' . mb_strlen($str1, 'utf-8') . "\n";
echo 'serialized length:' . strlen(serialize($str1)) . "\n";
echo "mb_check_encoding utf8:" . mb_check_encoding($str1, 'utf-8') . "\n";
echo "mb_check_encoding latin 1:" . mb_check_encoding($str1, 'ISO-8859-1') . "\n";
echo "mb_detect_encoding:" . mb_detect_encoding($str1) . "\n";

echo "\n";

$str2 = 'aa';
echo 'length:' . strlen($str2) . "\n";
echo 'mb length:' . mb_strlen($str2, 'utf-8') . "\n";
echo 'serialized length:' . strlen(serialize($str2)) . "\n";
echo "mb_check_encoding utf8:" . mb_check_encoding($str2, 'utf-8') . "\n";
echo "mb_check_encoding latin 1:" . mb_check_encoding($str2, 'ISO-8859-1') . "\n";
echo "mb_detect_encoding:" . mb_detect_encoding($str2) . "\n";

The output

mb_internal_encoding:ISO-8859-1

length:4
mb length:2
serialized length:11
mb_check_encoding utf8:1
mb_check_encoding latin 1:1
mb_detect_encoding:UTF-8

length:2
mb length:2
serialized length:9
mb_check_encoding utf8:1
mb_check_encoding latin 1:1
mb_detect_encoding:ASCII

I'm wondering now what the easiest way is to actually inspect the binary data in a string...i.e., what function should I use to determine what the actual zeros and ones are?

Weedpacket · Mar 11, 2009

sneakyimp wrote:
I also have checked out the mb_string stuff and have found it interesting that regular ASCII chars, although only 1 byte in length each, somehow come out as valid UTF-8 chars.

I referred to this fact at the end of Post #7. It was a deliberate design decision on Ken Thompson's part to make UTF-8 backward-compatible with ASCII.

sneakyimp · Mar 11, 2009

That would explain why 'aa' is valid UTF-8, but why does this return true:

echo "mb_check_encoding latin 1:" . mb_check_encoding($str1, 'ISO-8859-1') . "\n";

Is Φ also an ASCII char? I don't recall seeing it in the ASCII table.

Any tips on how to get a look at the individual bits and bytes stored in a string? I'm pretty eager to maybe take a stab at [man]pack[/man]ing some binary data and would like to get more familiar with the actual bytes.

Weedpacket · Mar 11, 2009

sneakyimp wrote:
Is Φ also an ASCII char? I don't recall seeing it in the ASCII table.

No, but mb_check_encoding is only going to see a string of bytes and determine whether it would be a legal string of bytes if interpreted using the given encoding.

$str1 = "Φ";
var_export(array_map('ord', str_split($str1)));

That's a perfectly legitimate sequence of bytes according to ISO-8859-1, which would interpret it as the string "Î¦" (assuming that the original file was UTF-8-encoded).

Then again, any sequence of bytes would be valid according to ISO-8859-1.

That string would not, however, be valid according to ASCII or Base64 encodings, for example.
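As for looking at the individual bits, here is a short sketch that builds on the ord()/str_split() approach above. The bytes are written as escapes so it behaves the same however the script file is saved; for "Φ" in UTF-8 those are 206 and 166, i.e. 0xCE and 0xA6.

```php
<?php
$str = "\xCE\xA6"; // the UTF-8 bytes for Φ

// Hex dump of the whole string.
echo bin2hex($str), "\n"; // cea6

// One group of eight bits per byte.
$bits = array();
foreach (str_split($str) as $byte) {
    $bits[] = sprintf('%08b', ord($byte));
}
echo implode(' ', $bits), "\n"; // 11001110 10100110
```

The leading 110 on the first byte and 10 on the second are the UTF-8 markers for "start of a two-byte character" and "continuation byte". Both bytes have the high bit set, which is why the string can't be ASCII.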
