I'm working on a spam filter that uses machine learning. I have a vocabulary, $vocab, of words that I use for my training set. These are words that appear often enough that I can use their presence/absence in an incoming message to make a ham/spam assessment. My vocabulary currently has 1966 words in it, e.g.:
$vocab = [
    0 => 'dear',
    1 => 'sir',
    2 => 'ma',
    3 => 'am',
    4 => 'hajiya',
    5 => 'maryam',
    6 => 'abacha',
    7 => 'wife',
    8 => 'late',
    9 => 'nigeria',
    10 => 'head',
    // etc.
];
I also have a $words array for any given message. This is an array, with no duplicates, of the words that appear in that particular message. NOTE that $words may contain words that are not in $vocab:
$words = [
    0 => 'sir',
    1 => 'hajiya',
    2 => 'extraword'
];
'extraword' is not in my vocabulary.
What I want to do is create an array with the $vocab words as keys -- SEQUENCE MATTERS and must be perfectly consistent between the original $vocab array and every array derived here -- with a one as the value if the particular message contains that vocabulary word, and a zero if it does not. For the $vocab and $words arrays above, I'd probably generate this first:
$out1 = [
    'dear' => 0,
    'sir' => 1,
    'ma' => 0,
    'am' => 0,
    'hajiya' => 1,
    'maryam' => 0,
    'abacha' => 0,
    'wife' => 0,
    'late' => 0,
    'nigeria' => 0,
    'head' => 0,
];
And then use array_values($out1) to get my final 'feature vector'.
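For the $vocab excerpt and $words above, the final vector should come out as:
array_values($out1);
// [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
// only 'sir' (index 1) and 'hajiya' (index 4) are set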
This can be done fairly straightforwardly with this simple code, but I'm concerned about performance:
$out1 = array_fill_keys($vocab, 0);
foreach ($words as $w) {
    if (array_key_exists($w, $out1)) {
        $out1[$w] = 1;
    } else {
        // word is not in vocab, ignore it
    }
}
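Another variant I have considered (just a sketch, not benchmarked) flips $words once and uses isset() for the lookup, since I understand isset() is usually cheaper than array_key_exists():
$present = array_flip($words);               // word => index, used only as a lookup set
$out1 = [];
foreach ($vocab as $v) {
    $out1[$v] = isset($present[$v]) ? 1 : 0; // iterating $vocab keeps the key order by construction
}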
array_merge might work if I combine it with array_intersect, but I'm worried that the sequence of the associative keys from $vocab might not be preserved. The docs don't mention anything about preservation of associative key ordering.
$a = array_fill_keys($vocab, 0);
$b = array_fill_keys(array_intersect($vocab, $words), 1);
$out1 = array_merge($a, $b);
array_replace also requires array_intersect, to avoid adding 'extraword' to the result:
$a = array_fill_keys($vocab, 0);
$b = array_fill_keys(array_intersect($vocab, $words), 1);
$out1 = array_replace($a, $b);
Both of these seem to preserve the associative key ordering, but I worry that this might be undocumented behavior. The sequence of associative keys must be preserved so that, when I use array_values to get the output, I can be sure the zeros and ones are in exactly the same order as the original $vocab.
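For what it's worth, this is the quick check I ran on the sample data above; it comes back true, but that only shows it works here, not that it is guaranteed:
// do the keys of the merged/replaced array still match $vocab exactly?
var_dump(array_keys($out1) === $vocab);      // bool(true) on my sample data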
Main question: can I be certain that the order of the keys will be preserved in the array_merge/array_replace operations?
My concern about performance is because I have thousands of words in my vocabulary and about the same number of thousands of files in my data corpus. With about 2k words per file and about 2k files (5k? 10k?), we are talking about millions of iterations quite quickly.
Secondary question: Can anyone suggest the fastest way to generate these feature vectors, given $vocab and $words?
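In case it matters for benchmarking suggestions, I plan to time candidates roughly like this, where buildVector() is just a placeholder for whichever approach is being tested and $corpus is my array of per-message word lists:
$start = microtime(true);
foreach ($corpus as $words) {
    $vector = buildVector($vocab, $words);   // placeholder for the approach under test
}
printf("%.3f seconds for %d messages\n", microtime(true) - $start, count($corpus));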