I've written a class method that implements AMF3 serialization of an array object. I've posted documented source code but here it is again:

    public function write_array($d) {

    // Arrays go onto the object_table if we haven't seen them before
    // NOTE:  strict equality will return true if two distinct arrays
    // have identical keys and values...if we can find a way to check for
    // a real reference, that would rule
    // DISABLED FOR NOW because we don't want arrays to be artificially referencing each other
#        $key = array_search($d, $this->object_table, TRUE);
#        if ($key !== FALSE) {
        if (FALSE) {
            // we have already written an indentical array to the object table
            // write a ref and be done with it
            $ref = $key << 1; // low bit is zero to indicate ref, other bits are 29-bit variable int to specify index
            $this->write_integer($ref);
            return;
        }

    // not a ref, store reference to this object in object_table if there is room
    if (count($this->object_table) < self::AMF_OBJECT_MAX_REFERENCES) {
#            $this->object_table[] =& $d
            $this->object_table[] = ''; // we have to store empty strings so that when
                                        // we do add an object reference that the ref index
                                        // will match what flash expects
        }

    // divide the array into two arrays, one for numeric keys, one for associative keys
    $num = array();
    $assoc = array();

    $hi_key = 0;
    $prev_num_key = NULL;
    $nums_ordered_and_dense = TRUE;
    foreach($d as $key => $val) {
        if (is_int($key) && ($key >= 0)) {
            $num[$key] = $val;
            $hi_key = max($hi_key, $key);
            // track whether the numeric values are both ordered and dense
            // this will help with performance for write_array when we are
            // serializing large dense numeric arrays
            if (is_null($prev_num_key)) {
                // we haven't checked anything yet...first key in dense, ordered array must be zero
                if ($key !== 0) $nums_ordered_and_dense = FALSE;
            } else {

                // current key *must* equal previous key+1
                if (($prev_num_key+1) !== $key) $nums_ordered_and_dense = FALSE;
            }
            $prev_num_key = $key;
        } else {
            $assoc[$key] = $val;
        } 
    }

    if ($nums_ordered_and_dense) {
        $contig_len = count($num);
    } else {
        // either the array is non contiguous or out of order or both
        // check to see how much of the numeric part is contiguous
        $num_len = count($num);
        if ($hi_key == ($num_len - 1)) {
            // all numeric values are contiguous
            $contig_len = $num_len;
        } else {
            // the array either has negative indices or has non-contiguous indices
            for($i=0; $i<$hi_key; $i++) {
                if (!isset($num[$i])) {
                    $contig_len = $i;
                    break;
                }
            }
        }
    }
    // due to the way array lengths are encoded, we have a maximum upper length for arrays        
    if ($contig_len > self::AMF_ARRAY_MAX_CONTIG_LENGTH) {
        throw new Exception('Array contiguous length exceeded the maximum allowed of ' . self::AMF_ARRAY_MAX_CONTIG_LENGTH);
    }

    // write the length of the contiguous part to the buffer 
    // as length packed together with the value/reference flag bit
    // living in the lowest bit of a variable length integer
    $len_val_ref = $contig_len << 1 | 0x01;
    $this->write_integer($len_val_ref);

    // output the key/value pairs
    // e.g., arr['AB'] = 1
    // first the key, a string:
    // 00000101 - first bytes indicate key length, here it says length is 2, low bit is 'val' flag
    // 01000001 - UTF code for upper case A
    // 01000010 - UTF code for upper case B
    // next, the value
    // 00000100 - first byte is type indicator, here an integer
    // 01111111 - the value of the integer, here 127

    // first, the non-contiguous numeric indices
    foreach($num as $key => $val) {
        if (($key < 0) || ($key > $contig_len)) { // NOTE that $num[$contig_len] should never contain a value
                                                  // as it is either greater than the array length for
                                                  // contiguous arrays and for non-contiguous arrays, it
                                                  // is the index of the lowest empty slot
            $this->write_string($key);
            $this->serialize_r($val);

        }
    }
    // then the associative values
    foreach($assoc as $key => $val) {
        $this->write_string($key);
        $this->serialize_r($val);
    }


    // after name-value pairs are finished, output an empty string
    $this->write_string('');

    // now output just the values of the contiguous part of the num array (if any);
    for($i=0; $i<$contig_len; $i++) {
        $this->serialize_r($num[$i]);
    }

}

Note that AMF3 serialization treats associative arrays quite differently than numeric ones and also that it considers numeric arrrays with missing indices to be associative arrays -- at least the part of the array after the missing index. This contributes a lot of complexity to the function because it requires checks to make sure that numeric array indices are both ordered and dense.

It seems to me it would be nice to reduce the amount of looping through the array values. The efficiency of this serialization is extremely important for a project I've been working on. I also ask because I aspire to write this code as a PECL extension (i.e., in C) and I'm dreading converting this code to C.

    Does it have to be AMF3, rather than some other serialization scheme, such as msgpack?

    Possible benefits of msgpack
    1. Variable-length encoding (if size matters)
    2. Already exists as a PECL extension for PHP
    3. You can use it together with other things you are (probably) using, such as redis

      I've now looked at the array handling code and there are possible improvements.

          public function write_array_opt($d) {
      
          // Arrays go onto the object_table if we haven't seen them before
          // NOTE:  strict equality will return true if two distinct arrays
          // have identical keys and values...if we can find a way to check for
          // a real reference, that would rule
          // DISABLED FOR NOW because we don't want arrays to be artificially referencing each other
      #        $key = array_search($d, $this->object_table, TRUE);
      #        if ($key !== FALSE) {
              if (FALSE) {
                  // we have already written an indentical array to the object table
                  // write a ref and be done with it
                  $ref = $key << 1; // low bit is zero to indicate ref, other bits are 29-bit variable int to specify index
                  $this->write_integer($ref);
                  return;
              }
      
          // not a ref, store reference to this object in object_table if there is room
          if (count($this->object_table) < self::AMF_OBJECT_MAX_REFERENCES) {
      #            $this->object_table[] =& $d
                  $this->object_table[] = ''; // we have to store empty strings so that when
                  // we do add an object reference that the ref index
                  // will match what flash expects
              }
      
          // divide the array into two arrays, one for numeric keys, one for associative keys
          # To be precise, this comment should be:
          # divide the array into two parts, the dense part and the non-dense part.
          # Note that
          # The dense part consists of 0-based, ordered and contiguous keys: all numeric
          # The non-dense part consists of keys that are not dense. They could start above 0,
          # be unordered and/or non-contiguous: all may still be numeric.
          $num = array();
          # I kept using $assoc to make the rewrite more evident. See comment below code block on how
          # to optimize further. When that is done, $assoc is no longer needed.
          $assoc = array();
      
          $array_is_dense = true;
          $next_dense_key_would_be = 0;
      
          foreach($d as $key => $val) {
              # From this point, all values will go into $assoc
              if (!$array_is_dense) {
                  $assoc[$key] = $val;
              } else {
                  # AMF3 uses 28 bits (3,5 bytes) to encode the length of the dense array.
                  # PHP does not run on 16 bits systems so I suppose a PHP int is never
                  # less than 4 bytes and this is therefor safe to compare without bounds checking
                  if (is_int($key) && $key == $next_dense_key_would_be) {
                      $num[] = $val;
                      $next_dense_key_would_be += 1;
      
                      # The code you posted thew on: $contig_len > self::AMF_ARRAY_MAX_CONTIG_LENGTH
                      # But it is possible to send the first AMF_ARRAY_MAX_CONTIG_LENGTH elements as
                      # a dense array and the rest of it in the associative part. But I do not know if the spec
                      # prohibits this.
                      if ($next_dense_key_would_be > self::AMF_ARRAY_MAX_CONTIG_LENGTH) {
                          $array_is_dense = false;
      
                          # if the spec prohibits sending the "overflow" elements as an associative array,
                          # simply reinstate the throw here
                          # throw new Exception('Array contiguous length exceeded the maximum allowed of ' . self::AMF_ARRAY_MAX_CONTIG_LENGTH);
                      }
                  } else {
                      $array_is_dense = false;
                      $assoc[$key]    = $val;
                  }
              }
          }
      
          // write the length of the contiguous part to the buffer
          // as length packed together with the value/reference flag bit
          // living in the lowest bit of a variable length integer
          # $len_val_ref = $contig_len << 1 | 0x01;
          # now becomes either
          # $len_val_ref = count($num) << 1 | 0x01;
          # or
          $len_val_ref = ($next_dense_key_would_be - 1)  << 1 | 0x01;
      
          $this->write_integer($len_val_ref);
      
          // output the key/value pairs
          // e.g., arr['AB'] = 1
          // first the key, a string:
          // 00000101 - first bytes indicate key length, here it says length is 2, low bit is 'val' flag
          // 01000001 - UTF code for upper case A
          // 01000010 - UTF code for upper case B
          // next, the value
          // 00000100 - first byte is type indicator, here an integer
          // 01111111 - the value of the integer, here 127
      
          # I have changed the order in which the non-dense part is sent from the original code. The new
          # ordering brings it in-line with the spec (as I understand it). Your original code sent, in order
          # - numeric keys and their values, from the non-dense part of the array
          # - non-numeric keys and their values,
          # - the dense part of the array
          /*
          // first, the non-contiguous numeric indices
          foreach($num as $key => $val) {
              if (($key < 0) || ($key > $contig_len)) { // NOTE that $num[$contig_len] should never contain a value
                  // as it is either greater than the array length for
                  // contiguous arrays and for non-contiguous arrays, it
                  // is the index of the lowest empty slot
                  $this->write_string($key);
                  $this->serialize_r($val);
      
              }
          }
          // then the associative values
          foreach($assoc as $key => $val) {
              $this->write_string($key);
              $this->serialize_r($val);
          }
          */
          # I do not re-order the associative array
          foreach ($assoc as $key => $val) {
              $this->write_string($key);
              $this->serialize_r($val);
          }
      
          // after name-value pairs are finished, output an empty string
          $this->write_string('');
      
          // now output just the values of the contiguous part of the num array (if any);
          for($i = 0; $i < $next_dense_key_would_be; $i++) {
              $this->serialize_r($num[$i]);
          }
      
      }
      

      It is now rather easy to optimize away the remaining unnecessary looping. As soon as $is_dense is set to false, write the length of the dense array. Then replace the part inside the initial foreach loop that assigns to $assoc by the part that loops over $assoc to write its key/values, so that those key/values are written on the fly. Finally, outside the initial foreach-loop, first write the empty string delimiter, then loop over $num to write the dense part.

        14 days later

        I appreciate your valuable effort optimizing this. I'm don't think it properly captures the dense part of an array. Consider how I define this array:

        <?php
        
        $arr = array();
        $arr["foo"] = "foo";
        $arr[1] = "one";
        $arr["bar"] = "bar";
        $arr[0] = "zero";
        
        foreach($arr as $key => $value) {
        	echo $key . "=" . $value . "\n";
        }
        
        ?>
        

        The array obviously is dense from 0 to 1, but it's out of order when we use a foreach loop:

        $ php foo.php
        foo=foo
        1=one
        bar=bar
        0=zero
        

          I failed to consider that.

          However, during first pass, you can find all of the first dense elements that appear in order (amongst themselves) in the original array.

          $a => [
            'a' => 1,
             0  => ,
             3  => ,
             5  => ,
             1  => ,
             2  => ,
          ];
          

          Here you can immediately recognize that elements 0, 1 and 2 belong in the dense array. Thus, you do not need to bundle these elements with the other numerically indexed elements. Instead of pushing these to $num, you could put them separately in $dense. That element with key 3 also is part of the dense array, or that element with key 5 is not part of it does not matter. Putting those two elements in $num, just like before, would still allow you to deal with these elements after first pass has been completed. The benefits from optimization is application dependent, and relates to how well ordered the numerical keys are. Worst case is when the first numerical index encountered is 1. Best case is when the entire array is dense and ordered where you'd go from 2N iterations to N iterations.
          I believe that most of my own php code has been dealing with well ordered numerical keys. Perhaps this is commonplace?

          You could also serialize all elements with non-numerical indices on the fly as well as those elements that are recognized as dense during firs pass, instead of using the intermediate $assoc array. I am however uncertain of the efficiency gains in this regard. You would reduce the number of needed iterations, but I believe you'd still be performing the same amount of data-copying.

            I managed to think of a way that doesn't involve copying data, but it may involve more looping iterations. Some pseudo:

            assume the highest int key is -1
            loop through ALL array keys. if an array key is an int, check to see if it's the highest we've yet seen.
            we now know what the highest int key is
            
            loop from i=0 to hiKey-1 (if no int keys are found this loop will not run even once)
            if i doesn't correspond to an element in the array, then i represents the size of our dense array, [b]contiguousLength[/b]. Break out of this loop.
            
            write contiguousLength to the serialization buffer
            
            loop through ALL elements of the array, serializing any associative keys and any int keys that are less than zero or >= contiguousLength along with their corresponding value
            
            write the NULL char to the serialization buffer
            
            loop from 0 to contigLength, serializing the corresponding values
            
            

            It would be interesting to do a comparison. I've written a PHP extension in C (amfext) which implements this array serialization algorithm and it's anywhere from 20 to 200 times faster than my original PHP algorithm. Not sure if that's because it's written in C or due to a better algorithm that copies less data.

              Write a Reply...