OK, I tried again with your Greek chars. I'm not sure what was so difficult about it last time...perhaps the Chinese characters are somehow more foreign than the Greek ones (and this doesn't help my confidence about the whole UTF-8 question we are considering here).
I made this file in Notepad and did a 'Save As' using UTF-8 encoding:
<?php
file_put_contents('foo.txt', 'ΦФ')
or die('file put failed');
?>
I uploaded it to my dev machine (a Mac) via FTP. Interestingly, if I use nano on the mac to edit my file, I get this:
<?php
file_put_contents('foo.txt', 'ΦФ')
or die('file put failed');
?>
I have no idea what the funky chars at the beginning are. Perhaps something to signify UTF-8 encoding? Also worth noting are the weird chars I get instead of the ΦФ I was expecting. I suppose I'm not surprised that nano would have trouble with Greek chars. Should I be?
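One guess worth checking: Notepad's UTF-8 'Save As' writes a three-byte byte order mark (EF BB BF) at the very start of the file, and PHP echoes anything that sits before the opening `<?php` tag. Here is a minimal sketch for testing that guess. The helper names `has_utf8_bom` and `strip_utf8_bom` are made up for illustration, and the sample string stands in for the real file contents (in practice it would come from something like `file_get_contents('chump.php')`).

```php
<?php
// Hypothetical helper: true if $bytes starts with the UTF-8 BOM (EF BB BF).
function has_utf8_bom($bytes) {
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: returns $bytes without a leading BOM, if one is present.
function strip_utf8_bom($bytes) {
    return has_utf8_bom($bytes) ? (string) substr($bytes, 3) : $bytes;
}

// Stand-in for the uploaded file: a BOM followed by ordinary content.
$sample = "\xEF\xBB\xBF" . "hello";

echo "first three bytes: " . bin2hex(substr($sample, 0, 3)) . "\n"; // efbbbf
echo "has BOM: " . (has_utf8_bom($sample) ? "yes" : "no") . "\n";
echo "length with BOM: " . strlen($sample) . "\n";
echo "length without BOM: " . strlen(strip_utf8_bom($sample)) . "\n";
```

If the check says yes for the uploaded file, the funky chars are not damage from FTP; they are just those three bytes shown by an editor or terminal that isn't treating them as a BOM.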
When I execute the file using MAMP or the version 4.x PHP installed on OSX by default, those chars reappear (look carefully at the beginning of the second line):
MyMac:Desktop sneakyimp$ php chump.php
MyMac:Desktop sneakyimp$
This doesn't look like it's PHP's fault to me -- probably some weird file formatting issue between macs and PCs, either notepad or FTP or something like that.
If I open the file using TextEdit on the mac, I don't see those funny chars at all. The file looks fine and dandy with the greek chars and all.
If I open the file in TextEdit and save it without changing anything, the funny chars aren't there when I subsequently open it with nano:
<?php
file_put_contents('foo.txt', 'ΦФ')
or die('file put failed');
?>
And this is where things get pretty darn confusing to me. The PHP code on the Mac looks awesome when I open it using TextEdit...exactly as I want it to:
<?php
file_put_contents('foo.txt', 'ΦФ')
or die('file put failed');
?>
I run it with no complaints or weird chars appearing and the resulting file is 4 bytes as weedpacket predicted:
MyMac:Desktop sneakyimp$ php chump.php
MyMac:Desktop sneakyimp$ ls -l foo.txt
-rw-r--r-- 1 sneakyimp staff 4 Feb 28 14:00 foo.txt
HOWEVER, when I open the output file foo.txt in TextEdit on the Mac, it doesn't contain two greek chars but four chars that look nothing like the ones I put in there.
ΦФ
Further boggling my brain is the fact that when I download foo.txt using Dreamweaver to my WinXP machine, I can open it in Notepad or Dreamweaver and there are my two Greek chars in all their naked Greek glory:
ΦФ
I guess this weird-char-thing illustrates that I'm not the only person confused by utf8 encoding. This is reassuring -- or maybe not?
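A possible explanation is that the four bytes in foo.txt never change, and only the encoding the viewing program assumes does. The sketch below takes the bytes that ΦФ becomes in UTF-8 and deliberately misreads them as ISO-8859-1, one character per byte. That choice is an assumption: I don't know which single-byte encoding TextEdit actually guessed, so the exact gibberish it shows may differ. The script assumes the iconv extension is available.

```php
<?php
// The UTF-8 bytes for the two characters: CE A6 and D0 A4.
$bytes = "\xCE\xA6\xD0\xA4";
echo "raw bytes: " . bin2hex($bytes) . "\n"; // cea6d0a4

// Pretend each byte is one ISO-8859-1 character, then re-encode the result
// as UTF-8 so that a UTF-8 terminal can display the misreading.
$misread = iconv('ISO-8859-1', 'UTF-8', $bytes);
echo "misread as ISO-8859-1: " . $misread . "\n";
echo "misread bytes: " . bin2hex($misread) . "\n"; // c38ec2a6c390c2a4
```

Read as UTF-8, the same four bytes are two characters; read as a single-byte encoding, they are four. That would fit Notepad and Dreamweaver picking UTF-8 and TextEdit picking something else.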
I will tentatively say that PHP seems to do its job properly. When you run PHP on some file, it would appear to interpret the PHP file properly regardless of whether it is UTF-8 or ASCII encoded. What puzzles me is what happens when I run this script (named 'bar.php') on foo.txt:
<?php
echo "file foo.txt is " . filesize('foo.txt') . " bytes\n";
$str = file_get_contents('foo.txt');
echo "the contents of foo.txt have a string length of " . strlen($str) . " characters\n";
for ($i = 0; $i < strlen($str); $i++) {
    echo "\t$i=" . $str[$i] . "\n";
}
?>
The output in my terminal window is this:
MyMac:Desktop sneakyimp$ php bar.php
file foo.txt is 4 bytes
the contents of foo.txt have a string length of 4 characters
0=Î
1=¦
2=Ð
3=¤
I'm perfectly willing to accept that foo.txt has a filesize of 4 bytes but I must take issue with the thought that it is 4 characters long. It's only got two greek chars, not 4.
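As far as I can tell, the mismatch comes from strlen() counting bytes rather than characters, and from $str[$i] handing back one byte at a time. Below is a minimal sketch of counting both ways, assuming foo.txt holds the UTF-8 bytes CE A6 D0 A4 (hard-coded here so the script runs on its own). It uses preg_split() with the /u modifier to split on UTF-8 character boundaries; mb_strlen($str, 'UTF-8') should give the same count if the mbstring extension is installed.

```php
<?php
// Stand-in for file_get_contents('foo.txt'): the two characters in UTF-8.
$str = "\xCE\xA6\xD0\xA4";

// strlen() reports the number of bytes.
$byte_count = strlen($str);

// The /u modifier makes PCRE treat the subject as UTF-8, so an empty
// pattern splits between whole characters instead of between bytes.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
$char_count = count($chars);

echo "bytes: $byte_count\n";
echo "characters: $char_count\n";
foreach ($chars as $i => $ch) {
    // Show each character next to the bytes that encode it.
    echo "\t$i=" . $ch . " (" . bin2hex($ch) . ")\n";
}
```

This reports 4 bytes and 2 characters, which matches both the file size and what I typed in.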
This also brings me round to my original question about information coming off a socket. [man]socket_read[/man] returns a string value. If I attempt to deal with that string value in the wrong way, I could be getting gibberish (and twice as much of it!) rather than the original Greek or Japanese or whatever that was coming off the socket.
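One specific way this could go wrong: a read might end in the middle of a multi-byte character, so each chunk on its own looks broken even though the stream as a whole is fine. Here is a rough sketch of holding back an incomplete trailing sequence until the rest arrives. The helper `split_complete_utf8` is a made-up name, the chunks are simulated with an array rather than real socket_read() calls, and the sketch assumes the sender really is sending UTF-8.

```php
<?php
// Hypothetical helper: returns array(complete, remainder), where remainder is
// an unfinished UTF-8 sequence at the end of $buffer (or '' if there is none).
function split_complete_utf8($buffer) {
    $len = strlen($buffer);
    $i = $len - 1;
    $back = 0;
    // Step back over up to 3 trailing continuation bytes (bit pattern 10xxxxxx).
    while ($i >= 0 && $back < 3 && (ord($buffer[$i]) & 0xC0) === 0x80) {
        $i--;
        $back++;
    }
    if ($i < 0) {
        return array($buffer, ''); // empty, or no lead byte found: pass through
    }
    // The lead byte says how many bytes its sequence should have.
    $lead = ord($buffer[$i]);
    if (($lead & 0xE0) === 0xC0) {
        $need = 2;
    } elseif (($lead & 0xF0) === 0xE0) {
        $need = 3;
    } elseif (($lead & 0xF8) === 0xF0) {
        $need = 4;
    } else {
        $need = 1; // ASCII, or an invalid byte that is passed through untouched
    }
    if ($len - $i < $need) {
        return array((string) substr($buffer, 0, $i), (string) substr($buffer, $i));
    }
    return array($buffer, '');
}

// Simulated reads that cut both characters in half: CE | A6 D0 | A4.
$chunks = array("\xCE", "\xA6\xD0", "\xA4");
$pending = '';
$decoded = '';
foreach ($chunks as $chunk) {
    list($complete, $pending) = split_complete_utf8($pending . $chunk);
    $decoded .= $complete; // only whole characters are appended
    echo "got " . strlen($chunk) . " byte(s), passed on: " . bin2hex($complete) . "\n";
}
echo "result: " . $decoded . "\n";
```

The helper only looks at the tail of the buffer. It does not validate the rest of the data, so truly malformed input would still need separate handling.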
There seems to be a bit more to this than I am understanding.