socket message protocol?

Weedpacket · Mar 6, 2009

So.... maybe ....
$str = "has \0!"; // bogus! a mesg containing null char
$str = str_replace(chr(0), '\0', $str); // note the single quotes.

sneakyimp · Mar 10, 2009

Weedpacket;10905853 wrote:
So.... maybe ....
$str = "has \0!"; // bogus! a mesg containing null char
$str = str_replace(chr(0), '\0', $str); // note the single quotes.

The problem with that is that when I turn \0 back into chr(0) on the other side, any text that was literally \0 in the original message also becomes a null char, so this:

To enter a null char, type \0

becomes this:

To enter a null char, type

Weedpacket · Mar 10, 2009

So escape '\' as well

sneakyimp · Mar 10, 2009

OK this is starting to sound more and more costly. So far I have two str_replace calls (a first one for backslashes, a second for null chars) as well as a pass to [man]serialize[/man] or [man]json_encode[/man] the data and then I've got to reverse the process on the way out the other side. On top of that, I am trying to grok this thread which suggests there is still a bag of hurt in trying to deal with multibyte chars.

All this str_replace baloney is necessary because I'm using a delimiter char which may exist in my data. I can appreciate that using a delimiter that might exist in your data is common practice (as evidenced by CSV files, tab-delimited text, escape sequences in PHP string declarations, etc) but I cannot help but wonder if there might be some better way for a network communication protocol.

I found it interesting that the docs on [man]utf8_encode[/man] say that UTF-8 is self-synchronizing.

sneakyimp · Mar 10, 2009

So I tried a little script to test the performance of base64 encoding against the find-and-replace approach. I was unable to figure out the right regex to unescape my backslashes and null chars as Weedpacket recommended. Any help figuring that out would be much appreciated. It looks like base64 is faster at encoding (though slower at decoding in the run below); HOWEVER, it results in encoded messages that are about a third longer, which could be a significant penalty where bandwidth is tight.

I would, however, like to test with a correct other_decode function. The current scheme produces incorrect results in about 14% of the messages (1423 of 10000 in the run below).

<?php

define('MESG_LENGTH', 10000);
define('ITERATIONS', 10000);

$stats = array();

$stats['base64']['encode_time'] = 0;
$stats['base64']['decode_time'] = 0;
$stats['base64']['avg_char_length'] = 0;
$stats['base64']['null_chars'] = 0;
$stats['base64']['bad_mesgs'] = 0;

$stats['other']['encode_time'] = 0;
$stats['other']['decode_time'] = 0;
$stats['other']['avg_char_length'] = 0;
$stats['other']['null_chars'] = 0;
$stats['other']['bad_mesgs'] = 0;

for($i=0; $i<ITERATIONS; $i++) {
	// create a message
	$msg = '';
	for($j=0; $j<MESG_LENGTH; $j++) {
		$char = chr(rand(0,255));
		$msg .= $char;
	}

	// encode it using base64
	$start = microtime_float();
	$coded = base64_encode($msg);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for base64 encode
	$stats['base64']['encode_time'] += $elapsed;
	$stats['base64']['avg_char_length'] += strlen($coded)/ITERATIONS;
	if (strpos($coded, "\0") !== FALSE) {
		$stats['base64']['null_chars']++;
	}

	// decode
	$start = microtime_float();
	$decoded = base64_decode($coded);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for base64 decode
	$stats['base64']['decode_time'] += $elapsed;
	if ($decoded !== $msg) {
		$stats['base64']['bad_mesgs']++;
	}

	// encode it using other
	$start = microtime_float();
	$coded = other_encode($msg);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for other encode
	$stats['other']['encode_time'] += $elapsed;
	$stats['other']['avg_char_length'] += strlen($coded)/ITERATIONS;
	if (strpos($coded, "\0") !== FALSE) {
		$stats['other']['null_chars']++;
	}

	// decode using other
	$start = microtime_float();
	$decoded = other_decode($coded);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for other decode
	$stats['other']['decode_time'] += $elapsed;
	if ($decoded !== $msg) {
		$stats['other']['bad_mesgs']++;
		// find_string_difference($msg, $decoded);
	}
}

print_r($stats);

function other_encode($str) {
	$result = str_replace("\\", "\\\\", $str);
	$result = str_replace(chr(0), "\\0", $result);
	return $result;
}
function other_decode($str) {
	$pattern = "/(?!\\\\\\\\)\\\\0/";
	$result = preg_replace($pattern, chr(0), $str);
	$result = str_replace("\\\\", "\\", $result);
	return $result;
}

function find_string_difference($str1, $str2) {
	$len1 = strlen($str1);
	$len2 = strlen($str2);

	if ($len1 !== $len2) {
		die("strings differ in length: $len1, $len2");
	}

	for($l=0; $l<$len1; $l++) {
		$c1 = $str1[$l];
		$c2 = $str2[$l];
		if ($c1 !== $c2) {
			echo "1st different char is $c1 (ord=" . ord($c1) . ")\n";
			echo "2nd different char is $c2 (ord=" . ord($c2) . ")\n";
		}
	}
}

function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
?>

the output:

Array
(
    [base64] => Array
        (
            [encode_time] => 0.27830028533936
            [decode_time] => 1.0116159915924
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 0.82219839096069
            [decode_time] => 0.68969488143921
            [avg_char_length] => 10078.1106
            [null_chars] => 0
            [bad_mesgs] => 1423
        )

)

EDIT: fixed array def. in code
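
The bad messages come from one specific input: a literal backslash followed by the character 0. A minimal reproduction, assuming the other_encode and other_decode functions from the script above (copied here under the hypothetical names regex_encode and regex_decode so the snippet runs on its own):

```php
<?php
// Copies of other_encode()/other_decode() from the benchmark script,
// renamed (hypothetical names) so this snippet is self-contained.
function regex_encode($str) {
	$result = str_replace("\\", "\\\\", $str);
	return str_replace(chr(0), "\\0", $result);
}

function regex_decode($str) {
	// (?!...) is a lookahead, so it never inspects the character *before*
	// the match; every backslash-zero pair ends up replaced by a null char.
	$result = preg_replace("/(?!\\\\\\\\)\\\\0/", chr(0), $str);
	return str_replace("\\\\", "\\", $result);
}

$original = "\\0";   // two bytes: a backslash, then the digit 0
$decoded  = regex_decode(regex_encode($original));

echo ($decoded === $original) ? "round trip ok\n" : "round trip BROKEN\n";
echo bin2hex($original) . " -> " . bin2hex($decoded) . "\n";
```

A random 10000-byte message contains that two-byte pair somewhere with probability of roughly 1 - (1 - 1/65536)^9999, about 14%, which is consistent with the 1423 bad messages out of 10000 reported above.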

sneakyimp · Mar 11, 2009

OK. Thanks to Weedpacket's post in the other thread, I now have this:

function other_encode($string) {
    return strtr($string, array(chr(0)=>'\\0', '\\'=>'\\\\'));
}

function other_decode($string) {
    return strtr($string, array('\\\\'=>'\\', '\\0' => chr(0)));
}
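
A quick round-trip check of this approach, as a sketch. The names escape_nul and unescape_nul are hypothetical stand-ins for the two functions above. strtr works here because it scans the input once, left to right, and never re-examines text it has already replaced, so an escaped backslash cannot be mistaken for the start of an escaped null:

```php
<?php
function escape_nul($string) {
    return strtr($string, array(chr(0) => '\\0', '\\' => '\\\\'));
}

function unescape_nul($string) {
    return strtr($string, array('\\\\' => '\\', '\\0' => chr(0)));
}

$samples = array(
    'To enter a null char, type \\0',   // literal backslash-zero in the data
    "has \0!",                          // a real null char
    "\\\0\\\\0",                        // backslashes and nulls mixed together
);

foreach ($samples as $sample) {
    $encoded = escape_nul($sample);
    // The encoded form must be free of null chars and must decode exactly.
    $ok = (strpos($encoded, "\0") === false) && (unescape_nul($encoded) === $sample);
    echo bin2hex($sample) . ': ' . ($ok ? 'ok' : 'FAILED') . "\n";
}
```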

It is now working correctly, but the base64 encoding is an order of magnitude faster. The downside of base64 is that the resulting encoded messages are about a third longer:

Array
(
    [base64] => Array
        (
            [encode_time] => 0.27819752693176
            [decode_time] => 1.0042262077332
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 3.390506029129
            [decode_time] => 3.5554423332214
            [avg_char_length] => 10078.0814
            [null_chars] => 0
            [bad_mesgs] => 0
        )

)
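
The two avg_char_length figures can be sanity-checked with a little arithmetic, sketched below. It assumes uniformly random bytes, as in the benchmark; the variable names are illustrative only. Base64 emits four characters for every three input bytes (rounding up to a whole group), while the escaping scheme adds one extra byte for each backslash or null char, which are two of the 256 possible byte values:

```php
<?php
$n = 10000;  // MESG_LENGTH in the benchmark

// base64: 4 output chars per group of 3 input bytes, padded to a whole group
$base64_length = 4 * (int) ceil($n / 3);

// escaping: each of the 2 special byte values (out of 256) costs 1 extra byte
$expected_escaped_length = $n * (1 + 2 / 256);

echo "base64 (computed): $base64_length\n";
echo "base64 (measured): " . strlen(base64_encode(str_repeat('x', $n))) . "\n";
echo "escaped (expected average): $expected_escaped_length\n";
```

That gives 13336 and 10078.125, in line with the measured 13335.999999999 and 10078.0814: roughly 33% overhead for base64 on random data against under 1% for escaping.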

Interestingly, just using addslashes (thanks to weedpacket for this suggestion) results in a faster overall performance than either approach and has significantly shorter mesg length than base64 encoding:

function other_encode($string) {
    return addslashes($string);
}

function other_decode($string) {
    return stripslashes($string);
}

Array
(
    [base64] => Array
        (
            [encode_time] => 0.26593828201294
            [decode_time] => 0.96766448020935
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 0.76142621040344
            [decode_time] => 0.23549246788025
            [avg_char_length] => 10156.1982
            [null_chars] => 0
            [bad_mesgs] => 0
        )

)

sneakyimp · Mar 12, 2009

Alrighty I'm thinking I might brave trying to write my own AMF serializer class.

I've been reading the AMFPHP source code. I found this little bit of code to detect a system's Endianness and on every machine I've tested it on, it always defines with a value of 1:

$tmp = pack("d", 1); // determine the multi-byte ordering of this machine temporarily pack 1
define("AMFPHP_BIG_ENDIAN", $tmp == "\0\0\0\0\0\0\360\77");

echo 'AMFPHP_BIG_ENDIAN:' . AMFPHP_BIG_ENDIAN . "\n";

I'm totally confused as to why packing a simple 1 results in this binary string:

00000000 00000000 00000000 00000000 00000000 00000000 11110000 00111111

Do I totally misunderstand what's in the string? I am DYING to know how to convert a PHP string (binary or otherwise) into its binary representation.

Also, am I right in thinking that AMFPHP_BIG_ENDIAN is TRUE or 1 if the current machine uses big-endian byte order? Does anyone have a system on which this code would define AMFPHP_BIG_ENDIAN as FALSE/0/empty?

Lastly, when I read the AMF3 Specification, it says a double is:

AMF3 Specification wrote:
8 byte IEEE-754 double precision floating point value in network byte order (sign bit in low memory).

Doesn't 'network byte order' mean 'big endian' ?

The reason I ask is that there appears to be a switch in the AMFPHP serialization code which will reverse the byte order of a double created with [man]pack[/man] depending on what AMFPHP_BIG_ENDIAN says:

	function writeDouble($d) {
		$b = pack("d", $d); // pack the bytes
		if ($this->isBigEndian) { // if we are a big-endian processor
			$r = strrev($b);
		} else { // add the bytes to the output
			$r = $b;
		}

		$this->outBuffer .= $r;
	}

I'm having trouble understanding why AMFPHP_BIG_ENDIAN would be true on my machine and then I have to reverse the results of a simple pack operation.

Weedpacket · Mar 13, 2009

sneakyimp wrote:
I am DYING to know how to convert a PHP string (binary or otherwise) into its binary representation.

A PHP string is already in its binary representation: one character == one byte. If you want to see the bytes in a string, array_map('ord',str_split($string)), or pack('H').

I'm totally confused as to why packing a simple 1 results in this binary string:

Because you're telling [man]pack[/man] to treat 1 as a double: the number 1.0. What you're looking at is the IEEE-754-standard binary representation of 1.0 (in sixty-four bits).

Doesn't 'network byte order' mean 'big endian' ?

Yes; AMFPHP has got its test back to front. AMFPHP_BIG_ENDIAN is true iff the machine is little-endian (e.g., Intel-based).

The IEEE-754 representation (network byte order, hence big-endian) of 1.0 would be

00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000
*------- ----==== ======== ======== ======== ======== ======== ========

* Sign bit: positive
- Exponent: 1023
= Mantissa: 1.0000000000000000000000000000000000000000000000000000 (in binary)

Value: (1-sign*2) * mantissa * 2^(exponent-1023)
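
The field layout can be checked from PHP itself. The following is a sketch only: describe_double is a hypothetical helper, and the 'E' format code (big-endian double) needs a newer PHP (7.0.15 / 7.1.1 or later) than was current when this thread was written. It assumes the platform uses IEEE-754 doubles.

```php
<?php
// Split a double into its IEEE-754 fields (hypothetical helper).
function describe_double($value) {
    // 'E' packs a double in big-endian byte order, whatever the machine uses.
    $bytes = pack('E', $value);

    // Render all 8 bytes as one 64-character string of 0s and 1s.
    $bits = '';
    foreach (str_split($bytes) as $byte) {
        $bits .= sprintf('%08b', ord($byte));
    }

    return array(
        'bits'     => $bits,
        'sign'     => (int) $bits[0],               // 1 bit
        'exponent' => bindec(substr($bits, 1, 11)), // 11 bits, biased by 1023
        'fraction' => substr($bits, 12),            // 52 bits after the implicit "1."
    );
}

$fields = describe_double(1.0);
echo trim(chunk_split($fields['bits'], 8, ' ')) . "\n";
echo "sign={$fields['sign']} exponent={$fields['exponent']}\n";
```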

sneakyimp · Mar 13, 2009

Weedpacket;10906864 wrote:
A PHP string is already in its binary representation: one character == one byte. If you want to see the bytes in a string, array_map('ord',str_split($string)), or pack('H').

This script outputs the ASCII ordinals of the characters in a string.

$str = 'abc';
$v = array_map('ord',str_split($str));
print_r($v);

outputs this:

Array
(
    [0] => 97
    [1] => 98
    [2] => 99
)

I understand what that function is doing. $v contains an array of integers that correspond to the ASCII ordinals of a, b, and c.

This I don't understand at all:

$str = pack('H', 'ABC');
echo 'len:' . strlen($str) . "\n";
$v = array_map('ord',str_split($str));
print_r($v);

It outputs this:

len:1
Array
(
    [0] => 160
)

If I put a * after the H, then I get a string of length 2:

len:2
Array
(
    [0] => 171
    [1] => 192
)

I have also tried pack with C and c instead but that just returns an array with zero as its only member. Now I know that with those ordinals above, I can use [man]base_convert[/man] and get something like this:

function myfunc($s) {
        $ord = ord($s);
        $s2 = strval($ord);
        return base_convert($s2, 10, 2);
}
$str = 'abc';
$v = array_map('myfunc',str_split($str));
print_r($v);

which outputs this:

Array
(
    [0] => 1100001
    [1] => 1100010
    [2] => 1100011
)

I also know that what is happening here is that we are grabbing the ordinals (integers) and converting them to base-2 integers. I hope I'm not being totally obtuse here when I point out that those binary numbers have only 7 digits and a byte has 8 digits. I know that if PHP has a byte in memory somewhere that it has one more bit. Can I assume that the missing bit is a leading zero or is there some kind of two's complement thing going on? Or some Endian bushwhacking?

I'm certainly feeling obtuse here. I was just hoping for a way to check exactly what bits and bytes I've managed to [man]pack[/man] up in my string to determine if it matches the bit-and-byte order descriptions described in the Adobe AMF3 Specification. Without being able to actually look at the ones and zeros, I feel like I'm working in the dark. I did manage to cook up a function to show me the bits in a number value, but it returns 00000000 whenever you feed it a string:

function my2bin($c) {
        $result = '';
        for($i=0; $i<8; $i++) {
                $p2 = pow(2, $i);
                if ($p2 & $c) {
                        $result = '1' . $result;
                } else {
                        $result = '0' . $result;
                }
        }
        return $result;
}

Also, I'm not sure how to detect the 'bit length' of a given argument so I can do doubles or floats or whatever.

Weedpacket;10906864 wrote:
Because you're telling [man]pack[/man] to treat 1 as a double: the number 1.0. What you're looking at is the IEEE-754-standard binary representation of 1.0 (in sixty-four bits).

Sadly, I went looking for the IEEE-754 standard and was prompted to purchase it at the IEEE website. I was naively expecting that a big-endian 8-byte (that's 64 bits) representation of the number 1 would look like this:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001

I went and looked some more and there's an article on wikipedia that I'll be attempting to digest. I'm still rather choking on the AMF3 spec and Augmented Backus-Naur Form.

Weedpacket;10906864 wrote:
Yes; AMFPHP has got its test back to front. AMFPHP_BIG_ENDIAN is true iff the machine is little-endian (e.g., Intel-based).

Aha! Just as I suspected. A bad constant name. Or something like that. Definitely confusing. There aren't many comments in the amfphp code either. Also, I don't think their AMF serialization supports args being passed by reference. I could be wrong. I have increasing confidence that my crusade to write this AMFPHP class is justified. BTW, I think it's worth noting that this constant defines as true on both an Intel Mac and also on a dual-core AMD machine running CentOS. So PHP on both of these machines is little-endian? Or is that a function of the OS?

Weedpacket;10906864 wrote:

The IEEE-754 representation (network byte order, hence big-endian) of 1.0 would be

00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000
*------- ----==== ======== ======== ======== ======== ======== ========

* Sign bit: positive
- Exponent: 1023
= Mantissa: 1.0000000000000000000000000000000000000000000000000000 (in binary)

Value: (1-sign*2) * mantissa * 2^(exponent-1023)

I like your notation and thank you very very very much for that. I do think I'm getting somewhere here. I think I now understand that my machine, when packing doubles, will choose some machine-specific endianness for it and that is why I must detect the endianness of my machine if I am to reliably pack doubles for this protocol.
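
Putting that together, a sketch of packing a double in network byte order with the flag named for what it actually detects. Both function names are hypothetical (they are not AMFPHP's), and the code assumes the machine stores doubles in IEEE-754 format, either fully little-endian or fully big-endian:

```php
<?php
// True when pack('d') lays doubles out least-significant byte first.
// This is the same comparison AMFPHP makes, under a matching name.
function machine_is_little_endian() {
    return pack('d', 1) === "\0\0\0\0\0\0\xF0\x3F";
}

// Return the 8 bytes of $d in network (big-endian) byte order.
function double_to_network_order($d) {
    $bytes = pack('d', $d);   // machine byte order
    return machine_is_little_endian() ? strrev($bytes) : $bytes;
}

echo bin2hex(double_to_network_order(1.0)) . "\n";
```

On newer PHP versions pack('E', $d) produces a big-endian double directly, which would make the detection step unnecessary.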

Weedpacket · Mar 13, 2009

sneakyimp wrote:
This I don't understand at all:
$str = pack('H', 'ABC');
echo 'len:' . strlen($str) . "\n";
$v = array_map('ord',str_split($str));
print_r($v);
It outputs this:

Sorry; typo on my part. unpack('H*'). (Notice that pack('H') spits the dummy if the second argument does not consist of hex digits).
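
For reference, a sketch of the corrected call, and of why the earlier pack('H', 'ABC') experiment gave the numbers it did. pack('H') goes the other way: it reads hex digits and produces bytes, and without a repeat count it consumes a single digit.

```php
<?php
// unpack('H*') turns bytes into hex digits; the result array is 1-indexed.
$hex = unpack('H*', 'abc');
echo $hex[1] . "\n";                     // same string as bin2hex('abc')

// pack('H') reads ONE hex digit: 'A' becomes the high nibble of a single byte.
echo ord(pack('H', 'ABC')) . "\n";

// pack('H*') reads all three digits: "AB", then "C" padded with a zero nibble.
echo bin2hex(pack('H*', 'ABC')) . "\n";
```

The last two lines reproduce the 160 (0xA0) and the 171, 192 (0xAB, 0xC0) seen in the quoted experiment.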

function myfunc($s) {
        $ord = ord($s);
        $s2 = strval($ord);
        return base_convert($s2, 10, 2);
}

Or

function myfunc($s)
{
 return decbin(ord($s));
}

sneakyimp wrote:
I also know that what is happening here is that we are grabbing the ordinals (an integer) and converting them to base-2 integers. I hope I'm not being totally obtuse here when I point out that those binary numbers have only 7 digits and a byte has 8 digits.

Sorry to point this out then, but it's normal to not write the leading digit of an integer if it's a zero. Do you write 42 or 042? Or 0042? Or 00042? Or..... After all, decbin(1942) will return the correct result (one thousand, nine hundred and forty-two, written in binary).
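
If fixed-width output is wanted, the leading zeros can be put back when formatting. A minimal sketch (byte_to_bits is a hypothetical name):

```php
<?php
function byte_to_bits($char) {
    // %08b formats the integer in binary, left-padded with zeros to 8 digits.
    return sprintf('%08b', ord($char));
}

echo implode(' ', array_map('byte_to_bits', str_split('abc'))) . "\n";
```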

I was naively expecting that a big-endian 8-byte (that's 64 bits) representation of the number 1 would look like this:

It would, if it was a (64-bit) integer. But pack('d') packs it as a double-precision floating-point number, which is why I was writing it throughout as 1.0, with an explicit decimal point. Floating-point is not just raw binary, it's encoded (as I illustrated); otherwise "$googol = pow(10,100);" wouldn't fit in 64 bits.

You sound like you're getting wildly tangled up in what you're doing. If specific bit sequences and data types and endianness and other C-level stuff are so crucial to the job, then a language like PHP that alters the types of values depending on what you're doing with them may not be appropriate. To use some code of yours as an example:

        $ord = ord($s);
        $s2 = strval($ord);

Let's say that $s is the string "Foo". ord() only works on one character, so the string will be truncated to 'F' (stored as one byte: 01000110). ord() will take that byte and return an integer, 70 (stored as four bytes, 00000000 00000000 00000000 01000110). Then strval() will take that integer, 70, and turn it into a string, "70" (stored as two bytes, 00110111 00110000).

If you hadn't called strval(), then base_convert would have called it instead, because it wants a string as its first argument. You'd have given it four bytes (00000000 00000000 00000000 01000110) and it would have turned it into two (00110111 00110000) without saying. And, in this case, base_convert($s2, 10, 2) returned a string made up of seven bytes (00110001 00110000 00110000 00110000 00110001 00110001 00110000).

I'm wondering if there isn't a serious category error going on here. In PHP, characters are defined (in the second sentence of the strings page) to be the same thing as bytes.

sneakyimp · Mar 13, 2009

I apologize for what is obviously muddled thinking on my part here. A double is not an integer, it's a float. If an ASCII char is a byte then there can be no use of something like two's complement because we have values from 0 to 255 and we need all 8 bits.

My objective at this point is to create an AMF serializer/deserializer class and this requires that I create binary strings. I think you can appreciate that the type juggling that goes on in PHP makes this task confusing. It certainly seems difficult to inspect the actual bits that you've packed into your binary string. I still have no idea how this might be accomplished. Perhaps it's not necessary, but I would like to be able to do it so I can get some idea of what my binary strings really look like.

This whole AMF serializer in PHP has been done in amfphp and in SabreAMF. The latest beta of amfphp tries first to make use of a PHP extension written in C called AMFEXT. My issue with those other PHP implementations is that they appear to involve dozens of files and are pretty intimately connected with the other bits and pieces of the projects to which they belong -- there's a lot of inheritance and crosstalk between classes so it's difficult to extract the AMF serialization bits. I love the idea of AMFEXT (written in C, faster) but it requires that one install an extension to PHP.

I would like to offer something super simple for my project -- something that just pops right in. I believe those other PHP projects have lessons for me but I don't know if they are strictly adhering to AMF3 specs. AMF3, for instance, supports passing args by reference and uses a variable length integer encoding scheme. I could at least begin to test these notions if I could convert the binary strings they contain into the actual bits so I could get a look at them. It doesn't need to all happen in PHP. Maybe I need a little C utility I can run from the command line or something? Instantiate a dynamic library?

I truly appreciate your patience and help with this. I am pretty wildly tangled up in it.

Weedpacket · Mar 14, 2009

sneakyimp wrote:
and this requires that I create binary strings.

Strings are strings (PHP6 adds a new type of string, but ordinary strings are still the same things as ever). Strings are made of characters. Characters are bytes.

It certainly seems difficult to inspect the actual bits that you've packed into your binary string. I still have no idea how this might be accomplished.

If you want to see the bytes in a string, array_map('ord',str_split($string))

You can even wrap [man]decbin[/man] around that, if you like.

If you like you can work with arrays of integers [0..255] for everything. To turn that into a string, one integer per byte/character, join(array_map('chr', $array)).

echo join(array_map('chr', array(72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33)));

echo "\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21";

$png = "\x89\x50\x4e\x47\xd\xa\x1a\xa\x0\x0\x0\xd\x49\x48\x44\x52\x0\x0\x0\x45\x0\x0\x0\x16\x4\x3\x0\x0\x0\x64".
"\xb9\x4c\x43\x0\x0\x0\x24\x50\x4c\x54\x45\x0\x0\x0\xff\xff\xff\x8c\x8c\x8c\xf0\xf0\xf0\xa7\xa7\xa7\xf8\xf8\xf8".
"\x7c\x7c\x7c\xe1\xe1\xe1\xb2\xb2\xb2\xd0\xd0\xd0\xbd\xbd\xbd\xd9\xd9\xd9\x22\xb2\x64\xe2\x0\x0\x0\x9a\x49\x44".
"\x41\x54\x78\x5e\xed\x8e\x3d\xe\xc2\x30\xc\x46\x6d\x92\xc2\x6a\xb3\x30\xb0\xf0\x73\x81\x16\x89\x9d\xe\xdc\xa0".
"\xaa\xc4\x96\xa9\xac\x85\x85\x15\x6e\x2\x37\xe5\xab\x95\x66\x49\x4f\x80\xf0\xf4\xf2\xf2\x64\x99\xfe\xc3\x4a\x2a".
"\x99\x3\x44\x29\xc\xe6\xf4\x4\x44\x64\xfb\x9c\x6e\xee\xb3\x3e\x6f\xe6\xc7\x93\x35\xe7\xc6\xc4\x63\xd5\x83\xd6\xf8".
"\xf6\x97\xd8\xa8\xba\xae\xb4\xe6\xb0\x37\xf1\xb9\xbd\x41\x4b\x62\x72\xbb\xb4\x87\x55\xac\x89\xe2\x55\x3f\x41\x82".
"\x86\x29\x35\x2e\x50\x11\xa8\x1d\xf7\x6c\x79\x3\x12\x38\xec\xa1\xf1\x9e\x4a\xbc\x52\x2d\xb8\x62\x70\xd7\x45\x30".
"\xf2\x6a\xf7\x64\x23\x99\x99\x6a\x7e\x63\xbe\x2b\xfd\x10\xab\xcc\x97\x3a\x59\x0\x0\x0\x0\x49\x45\x4e\x44\xae\x42\x60\x82";
file_put_contents('do.png', $png);
$new_png = file_get_contents('do.png');
echo $png == $new_png ? 'same': 'different';

sneakyimp · Mar 14, 2009

Thanks so much for your persistence. I'm sorry that you've had to repeat yourself so much. I guess I was just finding it hard to swallow that one had to use [man]ord[/man] on a raw, binary string to get a look at its raw bit representation because ord just returns the ASCII value of a character. Upon reflection, I realize that the ASCII value of a character requires all 8 bits of a byte because it is a value between 0 and 255. That doesn't leave any room for two's complement or any other IEEE-754 type encoding weirdness. It's a straight conversion from an ASCII ordinal to the underlying binary bits stored by PHP. Additionally, when you call str_split on a string (binary or otherwise) it just splits the string into bytes and loads those bytes in an array. ord doesn't care if you have used a non-printing control character or anything like that, it basically just tells you the decimal equivalent of your stored byte.

So I finally have my string-to-binary-representation function:

function str2bin($str) {
        if (!is_string($str)) {
                die("str2bin works only on string values");
        }
        $chars = str_split($str);
        $result = array();
        $len = count($chars);
        for($i=0; $i<$len; $i++) {
                $result[$i] = str_pad(decbin(ord($chars[$i])), 8, '0', STR_PAD_LEFT);
        }
        return $result;
}

$arr = str2bin(pack('d', 1));
echo join(' ', $arr) . "\n";

output is a LITTLE-ENDIAN representation of the IEEE-754 encoding of 1 as a double:

00000000 00000000 00000000 00000000 00000000 00000000 11110000 00111111

This is a tremendous relief. However, it re-introduces the basic delimiter question:

$binstr = pack('d', 1);
$x = strpos($binstr, "\0");
if ($x === FALSE) {
        echo "phew!  no null char\n";
} else {
        echo "NULL CHAR FOUND\n";
}

This one comes out 'NULL CHAR FOUND'. I therefore cannot use a null char as a delimiter for these AMF3 encoded objects unless I run addslashes or something similar on them too. Without a delimiter, I can't really be sure when one message has finished and the next has begun. Any suggestions about how to tackle that problem would be appreciated.

thanks weedpacket.

Weedpacket · Mar 15, 2009

I still smell a category error. What do you mean by "character"? To PHP, a character and a byte are the same thing. "a" is a one-character-long string.

sneakyimp wrote:
It's a straight conversion from an ASCII ordinal to the underlying binary bits stored by PHP.

It's the exact opposite, taking the underlying bit representation of a character/byte and returning a (four-byte) integer. [man]chr[/man] turns an integer into a byte.

However, it re-introduces the basic delimiter question:

[man]json_encode[/man] turns NUL characters (and other weirdnesses) into Unicode escape sequences: \u0000 for NUL. It also expects UTF-8 encoding. So $encoded_data = json_encode(utf8_encode($raw_data)) ought to produce something safe for transport. At the other end you'll need to $raw_data = utf8_decode(json_decode($encoded_data)) (however that's done in what you're doing).

Base64 is still an option, even though it does increase data size: json_encode() also causes data inflation. Base64 turns three bytes into four; but the above encoding has the potential to turn some single bytes into twelve.

echo strlen(json_encode(utf8_encode(join(array_map('chr', range(0,255))))))

produces a number considerably larger than 256, though base64 is still worse in this situation.

Which is more efficient depends on whether or not your data consists largely of "printable ASCII" characters (bytes 32..127) or not.

But there will always be some inflation: you're trying to pack eight-bit bytes into packages that are less than eight bits in size - you're going to need more than one package per byte.

sneakyimp · Mar 15, 2009

I'm not sure what you mean by 'category error' ? Are you referring to the use of PHP for this binary manipulation or are you referring to some other instance of my obtuseness?

As for json_encode, it's been my experience that it doesn't escape NULL chars -- see post #4 in this thread.

My bad on the description of ord's workings. I've got it fairly straight in my head now. PHP stores bytes in memory, and this function tells me the decimal equivalent of each raw byte. I'm pretty sure I understand it. Consider that dead horse sufficiently beaten.

I'm thinking I will put some more thought into the AMF3 approach. It's very helpful on the Actionscript side of things because it's native there. It's also supposedly compact and harder for script kiddies to decode. If I take one of these other approaches, I'll have a bigger task decoding the data once it gets into Actionscript.

Weedpacket · Mar 15, 2009

sneakyimp wrote:
As for json_encode, it's been my experience that it doesn't escape NULL chars -- see post #4 in this thread.

I just tried it and it turned a NUL into "\u0000".

echo json_encode(chr(0));

Literally - quotes and all!

The point about JSON is of course that it's literal ECMAScript source code, so that part of the decoding is trivial.

As for UTF8->byte decoding.... The ECMAScript specification defines a character as being 16 bits (I had to look it up; I hope ECMAScript 4 will be written better). But that won't be a problem: the ordinal values (in Unicode terminology, the "code points") after decoding will still be the same as before encoding (otherwise it wouldn't be very useful encoding!).
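
A small check along these lines can be run on each machine to see which behaviour a given install has. It is a sketch only; what it reports depends on the installed json extension:

```php
<?php
$encoded = json_encode(chr(0));

echo "encoded form: $encoded\n";
echo "contains a raw null char: " . (strpos($encoded, "\0") === false ? 'no' : 'yes') . "\n";
echo "round trip intact: " . (json_decode($encoded) === chr(0) ? 'yes' : 'no') . "\n";
```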

sneakyimp · Mar 15, 2009

OK I've tried json_encode on 3 different machines now. The iMac returns the "\u0000" for null chars when I use either pre-installed php4 or MAMP (php 5.something). HOWEVER, the debian etch machine running php 5.2 returns the null char stuff in post #4. All installs report the same version of json_encode (1.2.1). WTF?

As for decoding JSON in Actionscript, it's one thing to type some JSON in your source code, it's another to turn a string into a valid object. Here's a script to do that which has about 300 lines of code. I have not yet seen any way to 'natively' encode/decode JSON in Actionscript. There's another one here.

Weedpacket wrote:
the ordinal values (in Unicode terminology, the "code points") after decoding will still be the same as before encoding (otherwise it wouldn't be very useful encoding!).

I didn't follow that.

sneakyimp · Mar 15, 2009

Can I be sure that pack('c', $int) will always result in the least significant bits of $int or does the behavior depend on the machine's endianness? I would test this myself but I don't have access to a big-endian machine.
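
One way to sidestep the question is to select the low byte arithmetically, since integer operators in PHP work on values rather than on memory layout. A sketch, with low_byte as a hypothetical helper name:

```php
<?php
function low_byte($int) {
    // & 0xFF keeps the 8 least significant bits of the value;
    // chr() then turns that number (0..255) into a one-byte string.
    return chr($int & 0xFF);
}

echo bin2hex(low_byte(0x1234)) . "\n";  // low byte of 0x1234
echo bin2hex(low_byte(-1)) . "\n";      // every bit set
```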

sneakyimp · Mar 25, 2009

OK so I broke down and wrote my own AMF3 Serializer class. Please find it attached. It doesn't support references in the AMF partly because I'm lazy and partly because it would result in enormous data structures in a daemon environment. It seems to work. I would not have been able to do it without the SabreAMF and AMFPHP examples to study, but I believe mine is actually more faithful to the sort of AMF3 serialization done by Flash.

It is copiously commented and I welcome any input. I haven't thoroughly tested it in final form so it could still have bugs. I'll run some benchmarks shortly to see how it compares to JSON or PHP serialization.

Given that I can't think of any byte sequence that might not exist in an AMF3 serialized object, I'm guessing my protocol will have to send a length header first to indicate how many bytes are in the forthcoming message. I worry about what happens if a message loses parts or we get 'out of sync' somehow. Unlike UTF8, AMF3 is not self-synchronizing.
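
A minimal sketch of such a length header, under stated assumptions: a 4-byte unsigned big-endian length (pack format 'N') precedes every message, and the receiver appends whatever arrives to a buffer. The function names frame_message and extract_messages are hypothetical and not part of the attached class.

```php
<?php
// Prefix a payload with its byte length so the payload may contain any bytes.
function frame_message($payload) {
    return pack('N', strlen($payload)) . $payload;
}

// Pull every complete message off the front of $buffer.
// Incomplete trailing data stays in $buffer until more bytes arrive.
function extract_messages(&$buffer) {
    $messages = array();
    while (strlen($buffer) >= 4) {
        $header = unpack('N', substr($buffer, 0, 4));
        $length = $header[1];
        if (strlen($buffer) < 4 + $length) {
            break;                       // body not fully received yet
        }
        $messages[] = substr($buffer, 4, $length);
        $buffer = (string) substr($buffer, 4 + $length);
    }
    return $messages;
}

$stream = frame_message("first\0message") . frame_message('second');

$buffer = substr($stream, 0, 10);        // pretend only 10 bytes have arrived
echo count(extract_messages($buffer)) . " complete message(s)\n";

$buffer .= substr($stream, 10);          // the rest arrives
echo count(extract_messages($buffer)) . " complete message(s)\n";
```

A length header does nothing to resynchronize after corruption. Over TCP, though, bytes either arrive in order or the connection fails, so the usual recovery from a nonsensical length is to drop the connection and reconnect.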
