socket message protocol?

sneakyimp

I've written a tool for PHP to Flash socket communication (see my signature). The basic idea is that it serializes function calls to let you make Remote Procedure Calls between Flash and PHP. It's not perfect by any means, but the one problem that's really bugging me now is the socket message protocol. It basically involves putting some values into an array, calling [man]serialize[/man] on it, and then tacking a null character onto the end. The reason for the null character [the same thing you get from chr(0) or "\0"] is the way Flash handles its XMLSocket input. A message never actually 'arrives' in your Flash app until it is terminated with a null character. Until Flash sees that null char, it assumes the data is still coming in. It's kind of like EOF or the word STOP in a telegram.

As you might imagine, this presents a problem if your data has null chars in it. Serializing a string that contains a null char puts a spurious delimiter in the middle of the message. Flash thinks it has received a complete message, tries to unserialize it (using this Actionscript class) and fails, either throwing an error or discarding the partial message. It then discards the remainder of the message when that comes off the socket, because it doesn't represent a properly serialized object either.

To illustrate:

<?php
$str = "has \0!"; // bogus!  a mesg containing null char
$ser = serialize($str) . "\0"; // we add a null char 'STOP' to all messages
echo $ser . "\n";
$len = strlen($ser);
echo "length:" . $len . "\n";
for($i=0; $i<$len; $i++) {
  $c = $ser[$i];
  echo '$ser[' . $i . ']=' . $c . ', ord=' . ord($c) . "\n";
}
?>

Check the output: the character at index 9 (the 10th character) is a spurious null char, which results in the message being interpreted as two messages.

[user@server my_dir]# php foo.php
s:6:"has !";
length:14
$ser[0]=s, ord=115
$ser[1]=:, ord=58
$ser[2]=6, ord=54
$ser[3]=:, ord=58
$ser[4]=", ord=34
$ser[5]=h, ord=104
$ser[6]=a, ord=97
$ser[7]=s, ord=115
$ser[8]= , ord=32
$ser[9]=, ord=0
$ser[10]=!, ord=33
$ser[11]=", ord=34
$ser[12]=;, ord=59
$ser[13]=, ord=0

Obviously, encoding these serialized strings using something like [man]base64_encode[/man] might eliminate any null chars within (would it?) but would also result in a message 33% longer and would be a fairly costly operation.

Another possibility is to establish a protocol where each message begins with a message length value. However, when message pieces continue to come in, I can easily imagine a situation where what I believe is a message length indicator (The start of a new message) is actually the message itself (the middle of some other message?).
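A minimal sketch of the length-prefix idea, assuming the client reads raw bytes (XMLSocket's null-terminated model does not allow this) and assuming a 4-byte big-endian length header. The names frame_encode and frame_extract are hypothetical and not part of the tool described here. Because TCP delivers bytes in order, a receiver that always consumes one header and then exactly that many payload bytes cannot mistake payload for a header, as long as it never skips or drops bytes.

```php
<?php
// Hypothetical helpers: each frame is a 4-byte big-endian length header
// followed by exactly that many payload bytes.
function frame_encode($payload) {
    // 'N' packs an unsigned 32-bit integer in big-endian (network) byte order.
    return pack('N', strlen($payload)) . $payload;
}

// Removes every complete frame from the front of $buffer (modified in place)
// and returns the payloads. A partial frame stays in $buffer for the next read.
function frame_extract(&$buffer) {
    $messages = array();
    while (strlen($buffer) >= 4) {
        $header = unpack('Nlen', substr($buffer, 0, 4));
        $len = $header['len'];
        if (strlen($buffer) < 4 + $len) {
            break; // payload not fully received yet
        }
        // The casts guard against substr() returning false on older PHP versions.
        $messages[] = (string) substr($buffer, 4, $len);
        $buffer = (string) substr($buffer, 4 + $len);
    }
    return $messages;
}

// Simulate two messages arriving in two arbitrary chunks.
$wire = frame_encode("has \0!") . frame_encode('second');
$buffer = substr($wire, 0, 7);
$early = frame_extract($buffer);   // nothing complete yet
$buffer .= substr($wire, 7);
$late = frame_extract($buffer);    // both messages, embedded null char intact
echo count($early), ' then ', count($late), " message(s)\n";
```

The payload can be any byte string, so the serialization step (serialize, JSON, AMF3) stays independent of the framing.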

Can anyone suggest an approach to this protocol which is fast and compact? Keep in mind the basic idea is to create a 3-element nested array (serviceName, methodName, argumentArray), serialize it somehow, and send it over the network in such a way that the beginning and end of the message are obvious.

Any thoughts would be much appreciated.

dalecosp

Nice question.

Can you imagine any need to have a NULL in the data? If not, can you call some replace() method on the data prior to the concatenation of the NULL terminator? A space perhaps?

But then, that might cause problems on the receiving end.

What about [man]pack[/man]ing it first?

Hmm. I think I'm in over my head. Congrats ... you've received your first ignorant reply to a nifty thread 🙂

Weedpacket

I haven't done anything involving this, so this is just a wild thing.... But since ActionScript is just a dialect of ECMAScript, JSON should work as a serialisation protocol (seeing as JSON doesn't involve any JavaScript-specific semantics).

In which case literal NUL characters would be enclosed in literal ECMAScript/Actionscript/JavaScript strings and hence escaped appropriately, with the JSON decoder (probably just the Actionscript engine itself) unescaping it as appropriate.

sneakyimp

Thanks a lot for the responses.

dalecosp: I'm trying to make this tool as generally useful as possible. I originally thought that surely nobody would be sending any null chars, but recently tried to apply this code to a situation where it would be receiving input from a worldwide audience. I still don't know who might be using a null char in their data, but I'd like my system to handle it properly. It did occur to me to do some kind of escaping replacement which is subsequently reversed, but I got lost when it occurred to me that my replacement string might exist in the transmitted data by sheer coincidence and would subsequently get wrongly replaced with a null char.

weedpacket: I have considered using JSON and went so far as to incorporate a JSON serialization class should a user opt for that serialization technique instead. My ultimate goal is to support a variety of serialization techniques as well as encryption. Unfortunately, null chars appear in JSON-serialized data too:

<?php
$str = "has \0!"; // bogus!  a mesg containing null char
$ser = json_encode($str) . "\0"; // we add a null char 'STOP' to all messages
echo $ser . "\n";
$len = strlen($ser);
echo "length:" . $len . "\n";
for($i=0; $i<$len; $i++) {
  $c = $ser[$i];
  echo '$ser[' . $i . ']=' . $c . ', ord=' . ord($c) . "\n";
}
?>

output:

sneakyimp@server:~/test$ php chump.php
"has !"
length:9
$ser[0]=", ord=34
$ser[1]=h, ord=104
$ser[2]=a, ord=97
$ser[3]=s, ord=115
$ser[4]= , ord=32
$ser[5]=, ord=0
$ser[6]=!, ord=33
$ser[7]=", ord=34
$ser[8]=, ord=0

Behold the null char at index 5.

This null char delimiter issue is a problem on both the server and client side. If on the server I create a string containing a null char and serialize it into an RPC message terminated by a null char and send it off to the client, it will be interpreted by a Flash XMLSocket as two messages. Flash is hard-wired to interpret the null char as the end of an XML message. On the server side, I have mirrored this approach by reading data off each socket and storing it in a buffer. I then split the buffer at each null char, treating the resulting pieces as individual messages.
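The buffer-splitting approach described above might look roughly like the sketch below. The helper name nul_extract is hypothetical. It also shows the failure mode: a message with an embedded null char comes out as two broken pieces.

```php
<?php
// Hypothetical helper: split a receive buffer at each null terminator.
// Complete messages are returned; a trailing partial message stays in $buffer.
function nul_extract(&$buffer) {
    $parts = explode("\0", $buffer);
    // The last element is whatever followed the final null char (possibly '').
    $buffer = array_pop($parts);
    return $parts;
}

$buffer = serialize("has \0!") . "\0";
$pieces = nul_extract($buffer);
// One message was sent, but the embedded null char splits it in two.
echo count($pieces), " piece(s)\n";
```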

Flash also supports a more basic Socket class which deals in binary data and I am thinking that this is what I should be using instead. Being unsure about how a binary socket communication protocol might work, I came here for help. I have written code to read and write binary data but it's been YEARS (lots of them).

I recently learned that Adobe has published the Action Message Format 3 specification (AMF3). The doc is here but is somewhat impenetrable for me. I'm working at it but it's slow going. I did find some interesting code in AMFPHP.

Weedpacket

So.... maybe ....
$str = "has \0!"; // bogus! a mesg containing null char
$str = str_replace(chr(0), '\0', $str); // note the single quotes.

sneakyimp

Weedpacket;10905853 wrote:
So.... maybe ....
$str = "has \0!"; // bogus! a mesg containing null char
$str = str_replace(chr(0), '\0', $str); // note the single quotes.

The problem with that is that when I try to turn \0 back into chr(0) when I'm pulling everything out the other side, then I turn bits that are supposed to be \0 into null chars and this:

To enter a null char, type \0

becomes this:

To enter a null char, type

Weedpacket

So escape '\' as well 🙂

sneakyimp

OK this is starting to sound more and more costly. So far I have two str_replace calls (a first one for backslashes, a second for null chars) as well as a pass to [man]serialize[/man] or [man]json_encode[/man] the data and then I've got to reverse the process on the way out the other side. On top of that, I am trying to grok this thread which suggests there is still a bag of hurt in trying to deal with multibyte chars.

All this str_replace baloney is necessary because I'm using a delimiter char which may exist in my data. I can appreciate that using a delimiter that might exist in your data is common practice (as evidenced by CSV files, tab-delimited text, escape sequences in PHP string declarations, etc) but I cannot help but wonder if there might be some better way for a network communication protocol.

I found it interesting that the docs on [man]utf8_encode[/man] say that UTF-8 is self-synchronizing.

sneakyimp

So I tried a little script to test the performance of base64 encoding against the find-and-replace approach. I was unable to figure out the right regex to unescape my backslashes and null chars as weedpacket recommended. Any help figuring that out would be much appreciated. It looks like base64 is faster at encoding (though slower at decoding); HOWEVER, it results in encoded messages that are about 33% longer, which could be a significant penalty where bandwidth is tight.

I would, however, like to test with a correct other_decode function. The current scheme produces incorrect results for about 14% of the messages (1423 of the 10000 in the run below).

<?php

define('MESG_LENGTH', 10000);
define('ITERATIONS', 10000);

$stats = array();

$stats['base64']['encode_time'] = 0;
$stats['base64']['decode_time'] = 0;
$stats['base64']['avg_char_length'] = 0;
$stats['base64']['null_chars'] = 0;
$stats['base64']['bad_mesgs'] = 0;

$stats['other']['encode_time'] = 0;
$stats['other']['decode_time'] = 0;
$stats['other']['avg_char_length'] = 0;
$stats['other']['null_chars'] = 0;
$stats['other']['bad_mesgs'] = 0;

for($i=0; $i<ITERATIONS; $i++) {
	// create a message
	$msg = '';
	for($j=0; $j<MESG_LENGTH; $j++) {
		$char = chr(rand(0,255));
		$msg .= $char;
	}

	// encode it using base64
	$start = microtime_float();
	$coded = base64_encode($msg);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for base64 encode
	$stats['base64']['encode_time'] += $elapsed;
	$stats['base64']['avg_char_length'] += strlen($coded)/ITERATIONS;
	if (strpos($coded, "\0") !== FALSE) {
		$stats['base64']['null_chars']++;
	}

	// decode
	$start = microtime_float();
	$decoded = base64_decode($coded);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for base64 decode
	$stats['base64']['decode_time'] += $elapsed;
	if ($decoded !== $msg) {
		$stats['base64']['bad_mesgs']++;
	}

	// encode it using other
	$start = microtime_float();
	$coded = other_encode($msg);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for other encode
	$stats['other']['encode_time'] += $elapsed;
	$stats['other']['avg_char_length'] += strlen($coded)/ITERATIONS;
	if (strpos($coded, "\0") !== FALSE) {
		$stats['other']['null_chars']++;
	}

	// decode using other
	$start = microtime_float();
	$decoded = other_decode($coded);
	$end = microtime_float();
	$elapsed = $end - $start;

	// track stats for other decode
	$stats['other']['decode_time'] += $elapsed;
	if ($decoded !== $msg) {
		$stats['other']['bad_mesgs']++;
		# find_string_difference($msg, $decoded);
	}
}

print_r($stats);

function other_encode($str) {
	$result = str_replace("\\", "\\\\", $str);
	$result = str_replace(chr(0), "\\0", $result);
	return $result;
}
function other_decode($str) {
	$pattern = "/(?!\\\\\\\\)\\\\0/";
	$result = preg_replace($pattern, chr(0), $str);
	$result = str_replace("\\\\", "\\", $result);
	return $result;
}

function find_string_difference($str1, $str2) {
	$len1 = strlen($str1);
	$len2 = strlen($str2);

	if ($len1 !== $len2) {
		die("strings differ in length: $len1, $len2");
	}

	for($l=0; $l<$len1; $l++) {
		$c1 = $str1[$l];
		$c2 = $str2[$l];
		if ($c1 !== $c2) {
			echo "1st different char is $c1 (ord=" . ord($c1) . ")\n";
			echo "2nd different char is $c2 (ord=" . ord($c2) . ")\n";
		}
	}
}

function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
?>

the output:

Array
(
    [base64] => Array
        (
            [encode_time] => 0.27830028533936
            [decode_time] => 1.0116159915924
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 0.82219839096069
            [decode_time] => 0.68969488143921
            [avg_char_length] => 10078.1106
            [null_chars] => 0
            [bad_mesgs] => 1423
        )

)

EDIT: fixed array def. in code

sneakyimp

OK, thanks to Weedpacket's post in the other thread, I now have this:

function other_encode($string) {
    return strtr($string, array(chr(0)=>'\\0', '\\'=>'\\\\'));
}

function other_decode($string) {
    return strtr($string, array('\\\\'=>'\\', '\\0' => chr(0)));
}

It is now working correctly, but base64 is much faster: roughly an order of magnitude for encoding and about 3.5 times for decoding. The downside of base64 is that the resulting encoded messages are about 33% longer:

Array
(
    [base64] => Array
        (
            [encode_time] => 0.27819752693176
            [decode_time] => 1.0042262077332
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 3.390506029129
            [decode_time] => 3.5554423332214
            [avg_char_length] => 10078.0814
            [null_chars] => 0
            [bad_mesgs] => 0
        )

)

Interestingly, just using addslashes (thanks to weedpacket for this suggestion) results in a faster overall performance than either approach and has significantly shorter mesg length than base64 encoding:

function other_encode($string) {
    return addslashes($string);
}

function other_decode($string) {
    return stripslashes($string);
}

Array
(
    [base64] => Array
        (
            [encode_time] => 0.26593828201294
            [decode_time] => 0.96766448020935
            [avg_char_length] => 13335.999999999
            [null_chars] => 0
            [bad_mesgs] => 0
        )

    [other] => Array
        (
            [encode_time] => 0.76142621040344
            [decode_time] => 0.23549246788025
            [avg_char_length] => 10156.1982
            [null_chars] => 0
            [bad_mesgs] => 0
        )

)

sneakyimp

Alrighty I'm thinking I might brave trying to write my own AMF serializer class.

I've been reading the AMFPHP source code. I found this little bit of code to detect a system's Endianness and on every machine I've tested it on, it always defines with a value of 1:

$tmp = pack("d", 1); // determine the multi-byte ordering of this machine temporarily pack 1
define("AMFPHP_BIG_ENDIAN", $tmp == "\0\0\0\0\0\0\360\77");

echo 'AMFPHP_BIG_ENDIAN:' . AMFPHP_BIG_ENDIAN . "\n";

I'm totally confused as to why packing a simple 1 results in this binary string:

00000000 00000000 00000000 00000000 00000000 00000000 11110000 00111111

Do I totally misunderstand what's in the string? I am DYING to know how to convert a PHP string (binary or otherwise) into its binary representation.

Also, am I right in thinking that AMFPHP_BIG_ENDIAN is TRUE or 1 if the current machine uses big-endian byte order? Does anyone have a system on which this code would define AMFPHP_BIG_ENDIAN as FALSE/0/empty?

Lastly, when I read the AMF3 Specification, it says a double is:

AMF3 Specification wrote:
8 byte IEEE-754 double precision floating point value in network byte order (sign bit in low memory).

Doesn't 'network byte order' mean 'big endian' ?

The reason I ask is that there appears to be a switch in the AMFPHP serialization code which will reverse the byte order of a double created with [man]pack[/man] depending on what AMFPHP_BIG_ENDIAN says:

	function writeDouble($d) {
		$b = pack("d", $d); // pack the bytes
		if ($this->isBigEndian) { // if we are a big-endian processor
			$r = strrev($b);
		} else { // add the bytes to the output
			$r = $b;
		}

		$this->outBuffer .= $r;
	}

I'm having trouble understanding why AMFPHP_BIG_ENDIAN would be true on my machine and then I have to reverse the results of a simple pack operation.

Weedpacket

sneakyimp wrote:
I am DYING to know how to convert a PHP string (binary or otherwise) into its binary representation.

A PHP string is already in its binary representation: one character == one byte. If you want to see the bytes in a string, array_map('ord',str_split($string)), or pack('H').

I'm totally confused as to why packing a simple 1 results in this binary string:

Because you're telling [man]pack[/man] to treat 1 as a double: the number 1.0. What you're looking at is the IEEE-754-standard binary representation of 1.0 (in sixty-four bits).

Doesn't 'network byte order' mean 'big endian' ?

Yes; AMFPHP has got its test back to front. AMFPHP_BIG_ENDIAN is true iff the machine is little-endian (e.g., Intel-based).

The IEEE-754 representation (network byte order, hence big-endian) of 1.0 would be

00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000 
*------- ----==== ======== ======== ======== ======== ======== ========

* Sign bit: positive
- Exponent: 1023
= Mantissa: 1.00000000000000000000000000000000000000000000000000000 (in binary)

Value: (1-sign*2) * mantissa * 2^(exponent-1023)
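The sign/exponent/mantissa breakdown can be checked from PHP itself. This is only a sketch: double_fields is a hypothetical helper, and it assumes the machine stores floats as IEEE-754 doubles with the same byte order it uses for integers (true on common platforms, but an assumption all the same).

```php
<?php
// Hypothetical helper: split a float into the three IEEE-754 double fields.
function double_fields($value) {
    $bytes = pack('d', (float) $value);      // 8 bytes, machine byte order
    // pack('S', 1) yields "\x01\x00" on a little-endian machine.
    if (pack('S', 1) === "\x01\x00") {
        $bytes = strrev($bytes);             // most significant byte first
    }
    $bits = '';
    foreach (str_split($bytes) as $byte) {
        $bits .= str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return array(
        'sign'     => (int) $bits[0],                  // 1 bit
        'exponent' => bindec(substr($bits, 1, 11)),    // 11 bits, biased by 1023
        'fraction' => substr($bits, 12),               // 52 stored mantissa bits
    );
}

$f = double_fields(1.0);
echo 'sign=', $f['sign'], ' exponent=', $f['exponent'], ' fraction=', $f['fraction'], "\n";
```

The leading 1 of the mantissa is implied rather than stored, which is why only 52 fraction bits come back.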

sneakyimp

Weedpacket;10906864 wrote:
A PHP string is already in its binary representation: one character == one byte. If you want to see the bytes in a string, array_map('ord',str_split($string)), or pack('H').

This script outputs the ASCII ordinals of the characters in a string.

$str = 'abc';
$v = array_map('ord',str_split($str));
print_r($v);

outputs this:

Array
(
    [0] => 97
    [1] => 98
    [2] => 99
)

I understand what that function is doing. $v contains an array of integers that correspond to the ASCII ordinals of a, b, and c.

This I don't understand at all:

$str = pack('H', 'ABC');
echo 'len:' . strlen($str) . "\n";
$v = array_map('ord',str_split($str));
print_r($v);

It outputs this:

len:1
Array
(
    [0] => 160
)

If I put a * after the H, then I get a string of length 2:

len:2
Array
(
    [0] => 171
    [1] => 192
)

I have also tried pack with C and c instead but that just returns an array with zero as its only member. Now I know that with those ordinals above, I can use [man]base_convert[/man] and get something like this:

function myfunc($s) {
        $ord = ord($s);
        $s2 = strval($ord);
        return base_convert($s2, 10, 2);
}
$str = 'abc';
$v = array_map('myfunc',str_split($str));
print_r($v);

which outputs this:

Array
(
    [0] => 1100001
    [1] => 1100010
    [2] => 1100011
)

I also know that what is happening here is that we are grabbing the ordinals (an integer) and converting them to base-2 integers. I hope I'm not being totally obtuse here when I point out that those binary numbers have only 7 digits and a byte has 8 digits. I know that if PHP has a byte in memory somewhere that it has one more bit. Can I assume that the missing bit is a leading zero or is there some kind of two's complement thing going on? Or some Endian bushwhacking?

I'm certainly feeling obtuse here. I was just hoping for a way to check exactly what bits and bytes I've managed to [man]pack[/man] up in my string to determine if it matches the bit-and-byte order descriptions described in the Adobe AMF3 Specification. Without being able to actually look at the ones and zeros, I feel like I'm working in the dark. I did manage to cook up a function to show me the bits in a number value, but it returns 00000000 whenever you feed it a string:

function my2bin($c) {
        $result = '';
        for($i=0; $i<8; $i++) {
                $p2 = pow(2, $i);
                if ($p2 & $c) {
                        $result = '1' . $result;
                } else {
                        $result = '0' . $result;
                }
        }
        return $result;
}

Also, I'm not sure how to detect the 'bit length' of a given argument so I can do doubles or floats or whatever.

Weedpacket;10906864 wrote:
Because you're telling [man]pack[/man] to treat 1 as a double: the number 1.0. What you're looking at is the IEEE-754-standard binary representation of 1.0 (in sixty-four bits).

Sadly, I went looking for the IEEE-754 standard and was prompted to purchase it at the IEEE website. I was naively expecting that a big-endian 8-byte (that's 64 bits) representation of the number 1 would look like this:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001

I went and looked some more and there's an article on wikipedia that I'll be attempting to digest. I'm still rather choking on the AMF3 spec and Augmented Backus-Naur Form.

Weedpacket;10906864 wrote:
Yes; AMFPHP has got its test back to front. AMFPHP_BIG_ENDIAN is true iff the machine is little-endian (e.g., Intel-based).

Aha! Just as I suspected. A bad constant name. Or something like that. Definitely confusing. There are few comments in the amfphp code, too. Also, I don't think their AMF serialization supports args being passed by reference. I could be wrong. I have increasing confidence that my crusade to write this AMFPHP class is justified. BTW, I think it's worth noting that this constant defines as true on both an intel mac and also on a dual-core AMD machine running CentOS. So PHP on both of these machines is little-endian? Or is that a function of the OS?

Weedpacket;10906864 wrote:

The IEEE-754 representation (network byte order, hence big-endian) of 1.0 would be

00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000 
*------- ----==== ======== ======== ======== ======== ======== ========

* Sign bit: positive
- Exponent: 1023
= Mantissa: 1.00000000000000000000000000000000000000000000000000000 (in binary)

Value: (1-sign*2) * mantissa * 2^(exponent-1023)

I like your notation and thank you very very very much for that. I do think I'm getting somewhere here. I think I now understand that my machine, when packing doubles, will choose some machine-specific endianness for it and that is why I must detect the endianness of my machine if I am to reliably pack doubles for this protocol.
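A minimal sketch of that detect-then-reverse step. The function names are hypothetical and this is not AMFPHP's actual API. It assumes floats share the byte order the machine uses for integers.

```php
<?php
// Hypothetical helpers for writing a double in network (big-endian) byte order.
function machine_is_little_endian() {
    // 'S' packs an unsigned 16-bit integer in machine byte order,
    // so the low byte comes first only on a little-endian machine.
    return pack('S', 1) === "\x01\x00";
}

function double_to_network_order($d) {
    $bytes = pack('d', (float) $d);   // machine byte order
    // Reverse only when the machine order differs from network order.
    return machine_is_little_endian() ? strrev($bytes) : $bytes;
}

echo bin2hex(double_to_network_order(1.0)), "\n"; // 3ff0000000000000
```

Naming the test after what it actually detects avoids the back-to-front confusion discussed above.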

Weedpacket

sneakyimp wrote:
This I don't understand at all:
$str = pack('H', 'ABC');
echo 'len:' . strlen($str) . "\n";
$v = array_map('ord',str_split($str));
print_r($v);
It outputs this:

Sorry; typo on my part. unpack('H*'). (Notice that pack('H') spits the dummy if the second argument does not consist of hex digits).
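For illustration, a short sketch of viewing a string's bytes as hex digits with unpack('H*'). bin2hex is not mentioned above, but it is a standard PHP function that gives the same digits. The sample string is arbitrary.

```php
<?php
// Two equivalent ways to view the bytes of a string as hex digits.
$str = "AB\0";
$viaUnpack = unpack('H*', $str);   // array with a single element at key 1
echo $viaUnpack[1], "\n";          // 414200
echo bin2hex($str), "\n";          // 414200
```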

function myfunc($s) {
        $ord = ord($s);
        $s2 = strval($ord);
        return base_convert($s2, 10, 2);
}

function myfunc($s)
{
 return decbin(ord($s));
}

sneakyimp wrote:
I also know that what is happening here is that we are grabbing the ordinals (an integer) and converting them to base-2 integers. I hope I'm not being totally obtuse here when I point out that those binary numbers have only 7 digits and a byte has 8 digits.

Sorry to point this out then, but it's normal to not write the leading digit of an integer if it's a zero. Do you write 42 or 042? Or 0042? Or 00042? Or..... After all, decbin(1942) will return the correct result (one thousand, nine hundred and forty-two, written in binary).

I was naively expecting that a big-endian 8-byte (that's 64 bits) representation of the number 1 would look like this:

It would, if it was a (64-bit) integer. But pack('d') packs it as a double-precision floating-point number, which is why I was writing it throughout as 1.0, with an explicit decimal point. Floating-point is not just raw binary, it's encoded (as I illustrated); otherwise "$googol = pow(10,100);" wouldn't fit in 64 bits.

You sound like you're getting wildly tangled up in what you're doing. If specific bit sequences and data types and endianness and other C-level stuff are so crucial to the job, then a language like PHP that alters the types of values depending on what you're doing with them may not be appropriate. To use some code of yours as an example:

        $ord = ord($s);
        $s2 = strval($ord);

Let's say that $s is the string "Foo". ord() only works on one character, so the string will be truncated to 'F' (stored as one byte: 01000110). ord() will take that byte and return an integer, 70 (stored as four bytes, 00000000 00000000 00000000 01000110). Then strval() will take that integer, 70, and turn it into a string, "70" (stored as two bytes, 00110111 00110000).

If you hadn't called strval(), then base_convert would have called it instead, because it wants a string as its first argument. You'd have given it four bytes (00000000 00000000 00000000 01000110) and it would have turned them into two (00110111 00110000) without saying. And, in this case (base_convert($s2, 10, 2)), it returned a string made up of seven bytes (00110001 00110000 00110000 00110000 00110001 00110001 00110000).

I'm wondering if there isn't a serious category error going on here. In PHP, characters are defined (in the second sentence of the strings page) to be the same thing as bytes.

sneakyimp

I apologize for what is obviously muddled thinking on my part here. A double is not an integer, it's a float. If an ASCII char is a byte then there can be no use of something like two's complement because we have values from 0 to 255 and we need all 8 bits.

My objective at this point is to create an AMF serializer/deserializer class and this requires that I create binary strings. I think you can appreciate that the type juggling that goes on in PHP makes this task confusing. It certainly seems difficult to inspect the actual bits that you've packed into your binary string. I still have no idea how this might be accomplished. Perhaps it's not necessary, but I would like to be able to do it so I can get some idea of what my binary strings really look like.

This whole AMF serializer in PHP has been done in amfphp and in SabreAMF. The latest beta of amfphp tries first to make use of a PHP extension written in C called AMFEXT. My issue with those other PHP implementations is that they appear to involve dozens of files and are pretty intimately connected with the other bits and pieces of the projects to which they belong -- there's a lot of inheritance and crosstalk between classes so it's difficult to extract the AMF serialization bits. I love the idea of AMFEXT (written in C, faster) but it requires that one install an extension to PHP.

I would like to offer something super simple for my project -- something that just pops right in. I believe those other PHP projects have lessons for me but I don't know if they are strictly adhering to AMF3 specs. AMF3, for instance, supports passing args by reference and uses a variable length integer encoding scheme. I could at least begin to test these notions if I could convert the binary strings they contain into the actual bits so I could get a look at them. It doesn't need to all happen in PHP. Maybe I need a little C utility I can run from the command line or something? Instantiate a dynamic library?
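As a rough sketch of the variable-length integer scheme (U29 in the AMF3 spec): values up to 0x1FFFFFFF take one to four bytes. The first three bytes each carry 7 bits with the high bit as a continuation flag, and a fourth byte, if present, carries a full 8 bits. The function names are hypothetical. The sketch handles only non-negative values, and the decoder assumes well-formed input.

```php
<?php
// Hypothetical helper: encode an integer in 0..0x1FFFFFFF as an AMF3 U29.
function u29_encode($n) {
    if ($n < 0 || $n > 0x1FFFFFFF) {
        return false; // outside the 29-bit range
    }
    if ($n < 0x80) {
        return chr($n);
    }
    if ($n < 0x4000) {
        return chr(($n >> 7) | 0x80) . chr($n & 0x7F);
    }
    if ($n < 0x200000) {
        return chr(($n >> 14) | 0x80)
             . chr((($n >> 7) & 0x7F) | 0x80)
             . chr($n & 0x7F);
    }
    // Four-byte form: three 7-bit groups, then a final byte carrying 8 bits.
    return chr(($n >> 22) | 0x80)
         . chr((($n >> 15) & 0x7F) | 0x80)
         . chr((($n >> 8) & 0x7F) | 0x80)
         . chr($n & 0xFF);
}

// Hypothetical helper: decode a U29 from the start of $bytes (well-formed input assumed).
function u29_decode($bytes) {
    $n = 0;
    for ($i = 0; $i < 4; $i++) {
        $b = ord($bytes[$i]);
        if ($i === 3) {
            return ($n << 8) | $b;        // fourth byte contributes all 8 bits
        }
        $n = ($n << 7) | ($b & 0x7F);
        if (($b & 0x80) === 0) {
            return $n;                    // high bit clear: this was the last byte
        }
    }
}

echo bin2hex(u29_encode(0x80)), "\n"; // 8100
```

Note that encoded bytes can still be zero (for example, the second byte of 0x80), so this encoding does not remove the need for a framing scheme that tolerates null chars.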

I truly appreciate your patience and help with this. I am pretty wildly tangled up in it.

Weedpacket

sneakyimp wrote:
and this requires that I create binary strings.

Strings are strings (PHP6 adds a new type of string, but ordinary strings are still the same things as ever). Strings are made of characters. Characters are bytes.

It certainly seems difficult to inspect the actual bits that you've packed into your binary string. I still have no idea how this might be accomplished.

If you want to see the bytes in a string, array_map('ord',str_split($string))

You can even wrap [man]decbin[/man] around that, if you like.

If you like you can work with arrays of integers [0..255] for everything. To turn that into a string, one integer per byte/character, join(array_map('chr', $array)).

echo join(array_map('chr', array(72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33)));

echo "\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21";

$png = "\x89\x50\x4e\x47\xd\xa\x1a\xa\x0\x0\x0\xd\x49\x48\x44\x52\x0\x0\x0\x45\x0\x0\x0\x16\x4\x3\x0\x0\x0\x64".
"\xb9\x4c\x43\x0\x0\x0\x24\x50\x4c\x54\x45\x0\x0\x0\xff\xff\xff\x8c\x8c\x8c\xf0\xf0\xf0\xa7\xa7\xa7\xf8\xf8\xf8".
"\x7c\x7c\x7c\xe1\xe1\xe1\xb2\xb2\xb2\xd0\xd0\xd0\xbd\xbd\xbd\xd9\xd9\xd9\x22\xb2\x64\xe2\x0\x0\x0\x9a\x49\x44".
"\x41\x54\x78\x5e\xed\x8e\x3d\xe\xc2\x30\xc\x46\x6d\x92\xc2\x6a\xb3\x30\xb0\xf0\x73\x81\x16\x89\x9d\xe\xdc\xa0".
"\xaa\xc4\x96\xa9\xac\x85\x85\x15\x6e\x2\x37\xe5\xab\x95\x66\x49\x4f\x80\xf0\xf4\xf2\xf2\x64\x99\xfe\xc3\x4a\x2a".
"\x99\x3\x44\x29\xc\xe6\xf4\x4\x44\x64\xfb\x9c\x6e\xee\xb3\x3e\x6f\xe6\xc7\x93\x35\xe7\xc6\xc4\x63\xd5\x83\xd6\xf8".
"\xf6\x97\xd8\xa8\xba\xae\xb4\xe6\xb0\x37\xf1\xb9\xbd\x41\x4b\x62\x72\xbb\xb4\x87\x55\xac\x89\xe2\x55\x3f\x41\x82".
"\x86\x29\x35\x2e\x50\x11\xa8\x1d\xf7\x6c\x79\x3\x12\x38\xec\xa1\xf1\x9e\x4a\xbc\x52\x2d\xb8\x62\x70\xd7\x45\x30".
"\xf2\x6a\xf7\x64\x23\x99\x99\x6a\x7e\x63\xbe\x2b\xfd\x10\xab\xcc\x97\x3a\x59\x0\x0\x0\x0\x49\x45\x4e\x44\xae\x42\x60\x82";
file_put_contents('do.png', $png);
$new_png = file_get_contents('do.png');
echo $png == $new_png ? 'same': 'different';

sneakyimp

Thanks so much for your persistence. I'm sorry that you've had to repeat yourself so much. I guess I was just finding it hard to swallow that one had to use [man]ord[/man] on a raw, binary string to get a look at its raw bit representation because ord just returns the ASCII value of a character. Upon reflection, I realize that the ASCII value of a character requires all 8 bits of a byte because it is a value between 0 and 255. That doesn't leave any room for two's complement or any other IEEE-754 type encoding weirdness. It's a straight conversion from an ASCII ordinal to the underlying binary bits stored by PHP. Additionally, when you call str_split on a string (binary or otherwise) it just splits the string into bytes and loads those bytes in an array. ord doesn't care if you have used a non-printing control character or anything like that, it basically just tells you the decimal equivalent of your stored byte.

So I finally have my string-to-binary-representation function:

function str2bin($str) {
        if (!is_string($str)) {
                die("str2bin works only on string values");
        }
        $chars = str_split($str);
        $result = array();
        $len = count($chars);
        for($i=0; $i<$len; $i++) {
                // one array element per byte, left-padded to 8 binary digits
                $result[$i] = str_pad(decbin(ord($chars[$i])), 8, '0', STR_PAD_LEFT);
        }
        return $result;
}

$arr = str2bin(pack('d', 1));
echo join(' ', $arr) . "\n"; // separator goes first; PHP 8 rejects join($arr, ' ')

output is a LITTLE-ENDIAN representation of the IEEE-754 encoding of 1 as a double:

00000000 00000000 00000000 00000000 00000000 00000000 11110000 00111111

This is a tremendous relief. However, it re-introduces the basic delimiter question:

$binstr = pack('d', 1);
$x = strpos($binstr, "\0");
if ($x === FALSE) {
        echo "phew!  no null char\n";
} else {
        echo "NULL CHAR FOUND\n";
}

This one comes out 'NULL CHAR FOUND'. I therefore cannot use a null char as a delimiter for these AMF3 encoded objects unless I run addslashes or something similar on them too. Without a delimiter, I can't really be sure when one message has finished and the next has begun. Any suggestions about how to tackle that problem would be appreciated.

thanks weedpacket.

Weedpacket

I still smell a category error. What do you mean by "character"? To PHP, a character and a byte are the same thing. "a" is a one-character-long string.

sneakyimp wrote:
It's a straight conversion from an ASCII ordinal to the underlying binary bits stored by PHP.

It's the exact opposite, taking the underlying bit representation of a character/byte and returning a (four-byte) integer. [man]chr[/man] turns an integer into a byte.

However, it re-introduces the basic delimiter question:

[man]json_encode[/man] turns NUL characters (and other weirdnesses) into Unicode escape sequences: \u0000 for NUL. It also expects UTF-8 encoding. So $encoded_data = json_encode(utf8_encode($raw_data)) ought to produce something safe for transport. At the other end you'll need to $raw_data = utf8_decode(json_decode($encoded_data)) (however that's done in what you're doing).

Base64 is still an option, even though it does increase data size: json_encode() also causes data inflation. Base64 turns three bytes into four; but the above encoding has the potential to turn some single bytes into twelve.

echo strlen(json_encode(utf8_encode(join(array_map('chr', range(0,255))))))

produces a number considerably larger than 256, though base64 is still worse in this situation.

Which is more efficient depends on whether or not your data consists largely of "printable ASCII" characters (bytes 32..127) or not.

But there will always be some inflation: you're trying to pack eight-bit bytes into packages that are less than eight bits in size - you're going to need more than one package per byte.

sneakyimp

I'm not sure what you mean by 'category error' ? Are you referring to the use of PHP for this binary manipulation or are you referring to some other instance of my obtuseness?

As for json_encode, it's been my experience that it doesn't escape NULL chars -- see post #4 in this thread.

My bad on the description of ord's workings. I've got it fairly straight in my head. PHP stores bytes in some binary spots in memory. This function tells me what the decimal equivalent of each raw byte is. I'm pretty sure I understand it. Consider that dead horse sufficiently beaten.

I'm thinking I will put some more thought into the AMF3 approach. It's very helpful on the Actionscript side of things because it's native there. It's also supposedly compact and harder for script kiddies to decode. If I take one of these other approaches, I'll have a bigger task decoding the data once it gets into Actionscript.

Weedpacket

sneakyimp wrote:
As for json_encode, it's been my experience that it doesn't escape NULL chars -- see post #4 in this thread.

I just tried it and it turned a NUL into "\u0000".

echo json_encode(chr(0));

Literally - quotes and all!

The point about JSON is of course that it's literal ECMAScript source code, so that part of the decoding is trivial.

As for UTF8->byte decoding.... The ECMAScript specification defines a character as being 16 bits (I had to look it up; I hope ECMAScript 4 will be written better). But that won't be a problem: the ordinal values (in Unicode terminology, the "code points") after decoding will still be the same as before encoding (otherwise it wouldn't be very useful encoding!).