I'm working on a spam filter that uses machine learning. I have a vocabulary, $vocab, of words that I use for my training set. These are words that appear often enough that I can use their presence/absence in an incoming message to make a ham/spam assessment. My vocabulary currently has 1966 words in it, e.g.:
$vocab = [
    0 => 'dear',
    1 => 'sir',
    2 => 'ma',
    3 => 'am',
    4 => 'hajiya',
    5 => 'maryam',
    6 => 'abacha',
    7 => 'wife',
    8 => 'late',
    9 => 'nigeria',
    10 => 'head',
    // etc.
];
I also have a $words array for any given message. This is an array, with no duplicates, of the words that appear in that particular message. NOTE that $words may contain words that are not in $vocab:
$words = [
    0 => 'sir',
    1 => 'hajiya',
    2 => 'extraword'
];
'extraword' is not in my vocabulary.
What I want to do is create an array with the $vocab words as keys -- SEQUENCE MATTERS and must be perfectly consistent between the original $vocab array and every array derived here -- with a one as the value if the particular message contains that vocabulary word, and a zero if it does not. For the $vocab and $words arrays above, I'd probably generate this first:
$out1 = [
    'dear' => 0,
    'sir' => 1,
    'ma' => 0,
    'am' => 0,
    'hajiya' => 1,
    'maryam' => 0,
    'abacha' => 0,
    'wife' => 0,
    'late' => 0,
    'nigeria' => 0,
    'head' => 0,
];
And then use array_values($out1) to get my final 'feature vector'.
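For the $vocab excerpt and $words above, the final vector should come out as:
array_values($out1);
// [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
// only 'sir' (index 1) and 'hajiya' (index 4) are set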
This can be done fairly straightforwardly with this simple code, but I'm concerned about performance:
$out1 = array_fill_keys($vocab, 0);
foreach ($words as $w) {
    if (array_key_exists($w, $out1)) {
        $out1[$w] = 1;
    } else {
        // word is not in vocab, ignore it
    }
}
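Another variant I have considered (just a sketch, not benchmarked) flips $words once and uses isset() for the lookup, since I understand isset() is usually cheaper than array_key_exists():
$present = array_flip($words);               // word => index, used only as a lookup set
$out1 = [];
foreach ($vocab as $v) {
    $out1[$v] = isset($present[$v]) ? 1 : 0; // iterating $vocab keeps the key order by construction
}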
array_merge might work if I combine it with array_intersect, but I'm worried that the sequence of the associative keys from $vocab might not be preserved. The docs don't mention anything about preservation of associative key ordering.
$a = array_fill_keys($vocab, 0);
$b = array_fill_keys(array_intersect($vocab, $words), 1);
$out1 = array_merge($a, $b);
array_replace also requires array_intersect, to avoid adding 'extraword' to the result:
$a = array_fill_keys($vocab, 0);
$b = array_fill_keys(array_intersect($vocab, $words), 1);
$out1 = array_replace($a, $b);
Both of these seem to preserve the associative key ordering, but I worry that this might be undocumented behavior. The sequence of associative keys must be preserved so that, when I use array_values to get the output, I can be sure the zeros and ones are in exactly the same order as the original $vocab.
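For what it's worth, this is the quick check I ran on the sample data above; it comes back true, but that only shows it works here, not that it is guaranteed:
// do the keys of the merged/replaced array still match $vocab exactly?
var_dump(array_keys($out1) === $vocab);      // bool(true) on my sample data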
Main question: can I be certain that the order of the keys will be preserved in the array_merge/array_replace operations?
My concern about performance is because I have thousands of words in my vocabulary and about the same number of thousands of files in my data corpus. With about 2k words per file and about 2k files (5k? 10k?), we are talking about millions of iterations quite quickly.
Secondary question: Can anyone suggest the fastest way to generate these feature vectors, given $vocab and $words?
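In case it matters for benchmarking suggestions, I plan to time candidates roughly like this, where buildVector() is just a placeholder for whichever approach is being tested and $corpus is my array of per-message word lists:
$start = microtime(true);
foreach ($corpus as $words) {
    $vector = buildVector($vocab, $words);   // placeholder for the approach under test
}
printf("%.3f seconds for %d messages\n", microtime(true) - $start, count($corpus));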