spam filter findings?

sneakyimp · Oct 6, 2023

If anyone here is curious, I'd be happy to share my findings from working on a machine learning spam filter. A few broad strokes:

The ideas behind the supervised learning filter were taught in Andrew Ng's class on machine learning, which is available via Coursera. I don't think the original class (coded in Matlab/Octave) is available any more, but they have a sort of diluted 3-course regimen now, coded in Python.
You need a data corpus of ham and spam messages.
The algorithm first munges all the ham and spam to develop a vocabulary of common words. The algorithm works by weighting each of these words and scoring a novel message solely based on the absence or presence of these words in that message. The number of words in this vocabulary can vary widely, but you need a fair number of them (hundreds?) to make an effective filter. It all depends on the peculiarities of your data corpus.
Once you have the vocabulary, you train your algorithm by feeding it ham and spam messages. You need a lot of ham and spam messages (thousands?). Too few messages and you don't get a useful filter. Too many and training becomes impractical. You need roughly equal amounts of ham and spam and, if it's not obvious, you have to have classified them as either ham or spam. Best practices involve dividing your message corpus into training/test/cross-validation in a ratio of 60/20/20.
After the training is complete, you end up with a weighted array that can be used to assess a novel message. In my experience, it's never perfect: it's common to see success rates between 90 and 95%, with some false positives and false negatives.
One of the biggest challenges has been getting decent performance on the matrix multiplication operations. IIRC, I got the best results if I was able to use a PHP approach that harnesses some external library based on BLAS or CBLAS. I did a whole big thread on my efforts to get some speed.
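
To make the scoring step above a bit more concrete, here is a minimal sketch of how a weighted array might be applied to one novel message. The function name score_message, the weights, and the three-word vocabulary are all made up for illustration; a real filter would have hundreds of weights produced by training.

```php
<?php
// Hypothetical helper: score one message from its 0/1 feature array.
// $weights has one entry per vocabulary word; $bias is an extra constant term.
function score_message(array $features, array $weights, float $bias): float {
    if (count($features) !== count($weights)) {
        throw new InvalidArgumentException('feature and weight arrays must be the same length');
    }
    $score = $bias;
    foreach ($features as $i => $present) {
        // each word that is present (1) contributes its weight; absent words (0) add nothing
        $score += $present * $weights[$i];
    }
    return $score;
}

// Made-up weights for a three-word vocabulary: ['free', 'meeting', 'pprocurrency']
$weights = [1.5, -2.0, 1.0];
$bias = -0.5;

// a message containing 'free' and a currency symbol, but not 'meeting'
$features = [1, 0, 1];
$score = score_message($features, $weights, $bias);
$is_spam = $score > 0;
echo "score: $score, spam: " . ($is_spam ? 'yes' : 'no') . "\n";
```

A positive score is treated as spam here; where exactly to put the cutoff is a design choice that trades false positives against false negatives.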

Let me know if I should share any code? Happy to talk through it. It would probably help me solidify my understanding of the subject matter, and I'd very much value the insights of the skilled coders on this forum.

Weedpacket · Oct 6, 2023

sneakyimp Too few messages and you don't get a useful filter.

For example, after the first message, say it's ham, it will guess that the next one is ham as well, because that's all it's seen.

sneakyimp you have to have classified them as either ham or spam.

And when training, it may be a good idea to decide at random whether the next message to provide is one you've classified as ham or as spam. I've seen at least one case where the training was done by strictly alternating ham vs. spam messages (okay, that wasn't the context but YKWIM) and the discriminator got very good at distinguishing them because it learned the rule "odd-numbered messages are ham, even-numbered messages are spam".

I'm wondering about the vocabulary. At what point does this become part of the network? And also, how does it help the network identify spammy features that aren't part of the vocabulary? Wouldn't the network itself develop triggers for certain sequences of characters (and correlations between the occurrences of such sequences) that are more predominantly spam than ham?

sneakyimp · Oct 6, 2023

Weedpacket discriminator got very good at distinguishing them because it learned the rule "odd-numbered messages are ham, even-numbered messages are spam".

This is really funny. I feel like generative AI is probably absolutely full of problems like this, but everyone seems hell bent on having it generate all our content and pedagogical materials now.

EDIT: regarding sequence of training input: part of the algorithm involves randomly shuffling your data corpus.
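
A tiny sketch of the shuffling idea, assuming each message is stored together with its label so that shuffling can never separate the two. The array keys text and is_spam are just illustrative names.

```php
<?php
// Hypothetical corpus: each entry keeps the message and its label together
$corpus = [
    ['text' => 'ham 1',  'is_spam' => 0],
    ['text' => 'spam 1', 'is_spam' => 1],
    ['text' => 'ham 2',  'is_spam' => 0],
    ['text' => 'spam 2', 'is_spam' => 1],
    ['text' => 'ham 3',  'is_spam' => 0],
    ['text' => 'spam 3', 'is_spam' => 1],
];

// shuffle whole entries, so the ham/spam order carries no information
shuffle($corpus);

// only after shuffling do we split into parallel arrays
$messages = array_column($corpus, 'text');
$labels = array_column($corpus, 'is_spam');

foreach ($messages as $i => $message) {
    echo "$message => {$labels[$i]}\n";
}
```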

Weedpacket I'm wondering about the vocabulary. At what point does this become part of the network?

I'm not sure what you mean by this. Broadly speaking, I can say that your vocabulary should be specific to your application. I.e., it would not be effective to use a vocabulary derived from Spanish messages to train your machine on messages written in English. I imagine this website's vocabulary would be different than that for a forum devoted to old Throbbing Gristle records.

EDIT: I think by 'network' you mean 'Artificial Neural Network' here. Firstly, I believe I implemented a Support Vector Machine and got similar results to a FANN. The vocabulary is generated first -- you do a sort of pre-munge of all the messages to try and identify commonly used words. It is not helpful to examine words that only appear in a single message -- or no messages at all. The class instructed us that any word in the vocabulary should probably appear in at least 10-15% of the messages in your overall corpus to be useful. Once you have the vocabulary, you train the SVM/FANN by generating an array of features for each message. This feature array indicates, for each word in your vocabulary, whether that word is present in the current message. Any currency expressions are converted to some string like EXPRCURRENCY and URLs to EXPRHTTPLINK or something like that, so the absence or presence of currency expressions and links is examined too. The SVM/FANN ends up with an array that assigns a weight to each word in the vocabulary (plus one extra baseline constant). A new message gets its own feature array by comparing it to the vocabulary, then you use your weighted SVM/FANN array to perform the ham/spam classification.
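
To give a feel for how feature arrays turn into "a weight for each word plus a baseline constant", here is a rough sketch of a linear SVM trained by plain subgradient descent on the hinge loss. To be clear, this is NOT my actual training code and not necessarily what the class used; it's a simpler technique swapped in for illustration, and the function names, learning rate, and toy data are all assumptions.

```php
<?php
// Dot product of two equal-length numeric arrays
function dot(array $a, array $b): float {
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += $v * $b[$i];
    }
    return $sum;
}

// Hypothetical trainer: linear SVM fitted by subgradient descent on the hinge loss.
// $X is an array of 0/1 feature arrays, $y an array of 0 (ham) / 1 (spam) labels.
// Returns ['w' => weights, 'b' => baseline constant].
function train_linear_svm(array $X, array $y, float $lr = 0.1, float $lambda = 0.01, int $epochs = 50): array {
    $w = array_fill(0, count($X[0]), 0.0);
    $b = 0.0;
    for ($epoch = 0; $epoch < $epochs; $epoch++) {
        foreach ($X as $i => $x) {
            $target = $y[$i] ? 1 : -1; // SVMs conventionally use -1/+1 labels
            $margin = $target * (dot($w, $x) + $b);
            foreach ($w as $j => $wj) {
                // the L2 penalty always shrinks the weights a little; the hinge
                // term only applies when the example is inside the margin
                $grad = $lambda * $wj - ($margin < 1 ? $target * $x[$j] : 0);
                $w[$j] = $wj - $lr * $grad;
            }
            if ($margin < 1) {
                $b += $lr * $target;
            }
        }
    }
    return ['w' => $w, 'b' => $b];
}

// Classify one feature array: 1 = spam, 0 = ham
function predict_spam(array $model, array $x): int {
    return (dot($model['w'], $x) + $model['b']) > 0 ? 1 : 0;
}

// Toy data for a two-word vocabulary ['free', 'meeting']
$X = [[1, 0], [1, 0], [0, 1], [0, 1]];
$y = [1, 1, 0, 0];
$model = train_linear_svm($X, $y);
echo "weights: " . json_encode($model['w']) . ", baseline: {$model['b']}\n";
```

With real data you would shuffle the examples between passes and tune the learning rate and penalty on the cross-validation set.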

Weedpacket how does it help the network identify spammy features that aren't part of the vocabulary?

It doesn't! I've provided the description above expressly to point out what this filter does. I don't think it would be hard to expand beyond a word vocabulary to other features (this is a technical term used in the class to describe the input variables). I can think of a couple of other types of features:

a series of binary indicators regarding the language (or charset) in which the message is written. e.g., isChinese, isEnglish, isSpanish, isRussian. I expect isChinese values of 1/true would be a strong indicator that a message is spam.
a series of binary indicators that the message contains one or more links to a particular TLD. E.g., hasCOM, hasNET, hasUK, hasRU, hasCN, hasBIZ, hasINFO, hasTV, etc.
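
Here's a rough sketch of how the TLD indicators might be computed. I haven't implemented this; tld_features is a hypothetical function, the URL regex is deliberately simple, and the TLD list is just the examples above.

```php
<?php
// Hypothetical feature extractor: one 0/1 indicator per TLD of interest,
// always in the same order so it could be appended to the vocabulary features.
function tld_features(string $message, array $tlds): array {
    $found = [];
    if (preg_match_all('@https?://[^\s<>"\']+@i', $message, $matches)) {
        foreach ($matches[0] as $url) {
            $host = parse_url($url, PHP_URL_HOST);
            if (!is_string($host)) {
                continue; // malformed URL, skip it
            }
            // the TLD is whatever follows the last dot in the host name
            $dot = strrpos($host, '.');
            if ($dot !== false) {
                $found[strtolower(substr($host, $dot + 1))] = true;
            }
        }
    }
    return array_map(fn($tld) => (int) isset($found[$tld]), $tlds);
}

$tlds = ['com', 'net', 'uk', 'ru', 'cn', 'biz', 'info', 'tv'];
$message = 'Cheap pills at http://example.ru/buy and https://shop.example.biz/?id=1';
$features = tld_features($message, $tlds);
echo json_encode($features), "\n";
```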

NOTE that the feature arrays we generate while munging messages don't use these associative keys I'm suggesting here. I just use the associative key names to try and suggest what's in a feature array. When the heavy lifting gets done, feature arrays are numerically indexed arrays with the exact same number of features (hundreds or thousands of them). The training phase involves a huge matrix multiplication.
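
A small sketch of what I mean by that, assuming a hypothetical to_feature_vector function and a made-up feature order. The named features only exist for readability; what reaches the training step is a plain numerically indexed array whose columns always mean the same thing.

```php
<?php
// Hypothetical helper: turn named features into a plain numerically indexed
// array. $feature_order fixes the position of every feature, so every message
// yields an array of the same length with the same meaning per column.
function to_feature_vector(array $named, array $feature_order): array {
    $vector = [];
    foreach ($feature_order as $name) {
        // a missing or falsy feature counts as absent
        $vector[] = empty($named[$name]) ? 0 : 1;
    }
    return $vector;
}

// Made-up feature order mixing vocabulary words with the indicators suggested above
$feature_order = ['free', 'messag', 'isEnglish', 'hasCOM', 'hasRU'];

$message_a = ['hasRU' => true, 'free' => 1];
$message_b = ['isEnglish' => 1, 'messag' => 1, 'hasCOM' => 1];

// each row of $X is one message; this is the matrix the training step multiplies
$X = [
    to_feature_vector($message_a, $feature_order),
    to_feature_vector($message_b, $feature_order),
];
echo json_encode($X), "\n";
```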

I welcome any suggestions about other features.

Weedpacket Wouldn't the network itself develop triggers for certain sequences of characters (and correlations between the occurrences of such sequences) that are more predominantly spam than ham?

The particular algorithm I've implemented for this is an example from class, and probably quite simple compared to some of the more advanced possibilities out there today. I didn't bother to implement it on my personal website's contact form because I deployed recaptcha V3 and the spam stopped instantly. Disconcertingly, I get no correspondence at all now, and I wonder if any human being has tried to send me some actual ham.

I will note that this type of classifier is offered by Amazon as a Binary Classification Model and they charge pretty stiff rates for this sort of simple classifier. I'm not sure what, if anything, might make their classifier fancier than my algorithm here. Anecdotally, I experimented with one of these binary classifiers for a day or two and then shut it down using the AWS console. Something went wrong, however, and the classifier was not terminated and I got a bill several weeks later for over $200USD. I complained and Amazon refunded the money, but this sort of capability can be fairly flexible and the tech cos are charging a fair penny for it.

sneakyimp · Oct 11, 2023

If anyone is remotely interested in this, I'm happy to check some code into github and start sharing. LMK.

IdealCleaning · Nov 4, 2023

Thanks for sharing this info sir@sneakyimp#11099876

sneakyimp · Dec 7, 2023

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. I'd love to hear from the phpbuilder community about this spam filter concept. I'll start posting some code here, later today I hope.

sneakyimp · Dec 7, 2023

I hope to relay the basic techniques I learned, originally in Matlab/Octave, to implement a machine learning spam filter. I'll briefly mention that this approach is simple, and works by checking for the presence or absence of the words in some vocabulary generated from a corpus of messages. For this approach to generalize well, you must only consider words that are fairly common in your message corpus. It's not good to consider any words that appear in only one message -- or don't appear in any messages at all. You also need a lot of messages for it to work well. A few thousand at least.

To start, we must have a corpus of messages, each classified as ham or spam and split somewhat evenly between the two. In my case, I didn't have several thousand messages, so I ended up padding out my own message corpus with the old SpamAssassin corpus.

Because this data set is fairly large, and because it is computationally intensive, the approach here typically reads messages file by file from several directories, and each stage of processing tends to generate JSON data of its computation.

I have this config file which specifies some file locations with brief descriptions:

// === config.php ===
/**
 * contains JSON-encoded array of the files we have chosen for our corpus
 * along with an is_spam indication and a strip_headers flag if we need to strip
 * email headers preceding the message body before processing
 */
define('JSON_FILE_CHOSEN_CORPUS_FILES', 'chosen_corpus_files.json');

/**
 * array with all stemmed/normalized/adjusted words in our chosen corpus as keys
 * and the values being the number of files in our corpus that contain each word
 * NOTE: the frequency values are NOT the number of times the word appears in the corpus
 * the frequency values are the number of FILES that contain that word one or more times
 */
define('JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES', 'chosen_corpus_word_frequencies.json');

/**
 * This file defines an array of all the words we will consider when training our algorithm
 * The number of words in it varies depending on various parameters we choose
 */
define('JSON_FILE_CHOSEN_CORPUS_VOCAB', 'chosen_corpus_vocab.json');

/**
 * An array listing all the files in the SpamAssassin corpus, along with is_spam flag
 * e.g., each element is associative array ['file' => '/full/path/to/file', 'is_spam' => 1]
 */
define('JSON_FILE_SA_CORPUS_FILES', 'sa_corpus_files.json');

/**
 * An array listing all the files in MY corpus, along with is_spam flag
 * e.g., each element is associative array ['file' => '/full/path/to/file', 'is_spam' => 1]
 */
define('JSON_FILE_MY_CORPUS_FILES', 'my_corpus_files.json');

We need to decide what our vocabulary will be, so I have this script, analyze-corpus-word-frequency.php, which looks at our corpus and analyzes what words appear in the messages, and, for each word, how many message files contain that word at least once. As you'll see, this script does a bit more because I'm trying to pad out my own modest message corpus with files from the SpamAssassin corpus:

$current_dir = dirname(__FILE__) . '/';


// CONFIG
echo "this script has configurable parameters. See config.php\n";
require_once $current_dir . 'config.php';

// TODO move this to config.php
$sa_file_count = 2000;



// === FIRST, randomly pick this many files from the SA corpus
// this number should be big enough to get a nice big training sample
// but not so large that we let the SA corpus totally dwarf MY corpus
echo "we will choose $sa_file_count files from the SA corpus\n";



// we need fns in here to retrieve paths to corpus files
require_once 'list-fns.php';

$start = microtime(TRUE);

// retrieves ~6k files as of this writing as array of associative arrays with file and is_spam keys
// e.g. ["file" => "/full/path/to/file", "is_spam" => 1]
$sa_corpus_files = list_sa_corpus();
echo sizeof($sa_corpus_files), " files found in SA corpus\n";
// the file system may not always return the list of files in the same order, so i figured i'd
// store the last list of files returned
$sa_corpus_loc = $current_dir . JSON_FILE_SA_CORPUS_FILES;
file_put_contents($sa_corpus_loc, json_encode($sa_corpus_files));
echo "SA corpus file list written to $sa_corpus_loc\n";


// I originally thought shuffling just the keys would be faster, but
// shuffling the entire sa_corpus_files array is just as fast so
// deprecated this
// create a randomly shuffled array of the sa corpus indexes
//$random_sa_keys = array_keys($sa_corpus_files);
//shuffle($random_sa_keys);
//echo sizeof($random_sa_keys), " keys randomly shuffled, saving to random_sa_keys.json\n";
//file_put_contents($current_dir . 'random_sa_keys.json', json_encode($random_sa_keys));

// randomly shuffle the sa corpus files
shuffle($sa_corpus_files);
// slice off the first $sa_file_count of them
$sa_subset = array_slice($sa_corpus_files, 0,$sa_file_count);
echo sizeof($sa_subset) . " files sliced off SA corpus\n";
// free up memory, don't need entire corpus any more
unset($sa_corpus_files);

// The SA corpus files have email headers, so we set a flag that tells us
// to strip out the mail headers in the SA files in later stages of processing
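// NOTE: loop over the subset itself rather than counting to $sa_file_count,
// because the SA corpus might contain fewer than $sa_file_count files
foreach($sa_subset as $i => $sa_file) {
	$sa_subset[$i]['strip_headers'] = 1;
}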


// fetch MY corpus
$my_corpus_files = list_my_corpus();
$my_corpus_file_count = sizeof($my_corpus_files);
echo "$my_corpus_file_count files found in MY corpus\n";
$my_corpus_loc = $current_dir . JSON_FILE_MY_CORPUS_FILES;
file_put_contents($my_corpus_loc, json_encode($my_corpus_files));
echo "MY corpus file list written to $my_corpus_loc\n";

// mark a flag that tells us NOT to strip out the mail headers in the MY files
for($i=0; $i<$my_corpus_file_count; $i++) {
	$my_corpus_files[$i]['strip_headers'] = 0;
}

$chosen_corpus_files = array_merge($sa_subset, $my_corpus_files);
echo sizeof($chosen_corpus_files), " files in chosen corpus.\n";
echo "shuffling chosen corpus files...";
shuffle($chosen_corpus_files);
$chosen_filename = $current_dir . JSON_FILE_CHOSEN_CORPUS_FILES;
file_put_contents($chosen_filename, json_encode($chosen_corpus_files));
echo "chosen corpus files written to $chosen_filename\n";

// free up memory/resources
unset($sa_subset);
unset($my_corpus_files);

// how many ham/spam
$ham = 0;
$spam = 0;
foreach($chosen_corpus_files as $cf) {
	if ($cf['is_spam']) {
		$spam++;
	} else {
		$ham++;
	}
}
echo "$ham ham files\n";
echo "$spam spam files\n";
echo ($ham/($ham+$spam))*100 . " percent of " . ($spam+$ham) .  " files are ham\n";


// we will need these functions
require_once 'text-fns.php';

// this associative array will contain our word frequencies
// word => number_of_files_containing_the_word
$word_frequencies = [];
$chosen_corpus_file_count = sizeof($chosen_corpus_files);
echo "processing $chosen_corpus_file_count files from chosen corpus\n";
foreach($chosen_corpus_files as $i => $cf) {
	if (($i % 100) == 0) {
		echo "processing $i of $chosen_corpus_file_count\n";
		echo "word_frequencies length: ", sizeof($word_frequencies), "\n";
	}

	$file = $cf['file'];

	$contents = file_get_contents($file);
	// we have to strip_headers for SA corpus, so second optional param is TRUE
	// this returns an array of the massaged, unique words, no duplicates
	$words = pre_process_message($contents, (bool)$cf['strip_headers']);
	unset($contents);

	// since $words contains no duplicates, it indicates simply that the word
	// exists in this file, that lets us do a simple increment operation
	foreach($words as $word) {
		if (array_key_exists($word, $word_frequencies)) {
			// increment it
			$word_frequencies[$word]++;
		} else {
			// set it to 1
			$word_frequencies[$word] = 1;
		}
	}
}


// SORT the word frequencies by freq desc, keeping the word keys
arsort($word_frequencies);
//var_dump($word_frequencies);
$wf_filename = $current_dir . JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES;
file_put_contents($wf_filename, json_encode($word_frequencies));
echo "word frequencies written to $wf_filename\n";
echo sizeof($word_frequencies) . " words in word_frequencies\n";
echo "COMPLETED in " . (microtime(TRUE) - $start) . " seconds\n";

Some notes:

This script selects $sa_file_count messages at random from the SpamAssassin corpus, adds my entire message corpus, and randomly shuffles the combined list. It saves JSON files indicating the file paths for these files and their ham/spam classification.
It counts how many ham/spam messages there are and reports that. To generalize well, this proportion should not be too lopsided.
It uses a function, pre_process_message, to reduce each message to just an array of the words in it. I'll post this function below, but it does various things: eliminates trivial stop words, changes all currency expressions to something like pprocurrency, converts all URLs to pprohttpaddr, etc. This is to avoid overfitting.
It counts how many message files contain each word it encounters, sorts the array of these words+frequencies, and writes it to a JSON file.
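
The script above calls list_sa_corpus() and list_my_corpus() from list-fns.php, which I haven't posted. As a stand-in, here is a hypothetical sketch of how such a listing function might work, assuming each corpus sub-directory holds only ham or only spam. The function names and the directory-to-label mapping are assumptions, not the real list-fns.php.

```php
<?php
// Hypothetical stand-in for list-fns.php (the real file is not shown).
// Lists every regular file directly inside $dir and tags it with $is_spam.
function list_corpus_dir(string $dir, int $is_spam): array {
    $entries = scandir($dir);
    if ($entries === false) {
        throw new RuntimeException("could not read $dir");
    }
    $out = [];
    foreach ($entries as $entry) {
        $path = rtrim($dir, '/') . '/' . $entry;
        // is_file() also skips the '.' and '..' entries
        if (is_file($path)) {
            $out[] = ['file' => $path, 'is_spam' => $is_spam];
        }
    }
    return $out;
}

// Hypothetical layout: map each corpus sub-directory to its classification
function list_corpus(string $base_dir, array $dir_labels): array {
    $files = [];
    foreach ($dir_labels as $subdir => $is_spam) {
        $files = array_merge($files, list_corpus_dir("$base_dir/$subdir", $is_spam));
    }
    return $files;
}

// Usage (the path and directory names are assumptions; adjust to your own layout):
// $sa_corpus_files = list_corpus('/path/to/sa-corpus', [
//     'easy_ham' => 0, 'hard_ham' => 0, 'spam' => 1, 'spam_2' => 1,
// ]);
```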

Any suggestions or observations welcome.

sneakyimp · Dec 7, 2023

Here is text-fns.php, which defines our pre_process_message fn, among other things.

EDIT: it looks like the PorterStemmer code uses curly brace syntax to access chars in a string, which was deprecated in PHP 7.4 and removed in PHP 8.0.
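
For anyone wanting to patch the stemmer themselves, the fix should just be a matter of swapping curly braces for square brackets wherever a string offset is read. A minimal illustration (the variable names here are mine, not taken from PorterStemmer.php):

```php
<?php
$word = 'spam';

// PHP 7.3 and earlier allowed $word{0}; in PHP 8 that is a parse error.
// Square brackets work in every PHP version:
$first = $word[0];
$last = $word[strlen($word) - 1];

echo "first: $first, last: $last\n";
```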

// === text-fns.php ===
define('MIN_WORD_LEN', 2);
define('MAX_WORD_LEN', strlen('Supercalifragilisticexpialidocious'));

// massaged this from the MySQL default stopwords
// @see https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html
$stop_words = ['a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','i','in','is','it','la','no','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und'];


// porter stemmer downloaded from here:
// https://tartarus.org/martin/PorterStemmer/php.txt
require_once dirname(__FILE__) . '/PorterStemmer.php';

// function which decides which words to keep
function keep_word($word) {
	global $stop_words;

	// get rid of words too short or too long
	$len = strlen($word);
	if ($len < MIN_WORD_LEN || $len > MAX_WORD_LEN) {
		return FALSE;
	}

	// get rid of stop words
	if (in_array($word, $stop_words)) {
		return FALSE;
	}

	// otherwise, keep it
	return TRUE;
}


// this is the function we use to pre-process each message,
// reducing it to just important words. it removes HTML,
// optionally email headers, changes everything to lowercase,
// converts any hyperlinks to the same tag (this presumably
// to prevent overfitting)
// this based on coursera/ex6 assignment except it returns an
// array of unique, porter-stemmed words, no duplicates
function pre_process_message($email_contents, $strip_headers=FALSE) {

	global $stop_words;


// html
//$f = 'sa-corpus/spam_2/01277.6763a79fad1f1b39cb7d5b7faf92ea98';
// email
//$f = 'sa-corpus/spam_2/00410.fb7b31cdd9d053f8b446da7ce89383fa';
// $$$$
//$f = 'sa-corpus/spam_2/00830.079ed7d24f78024e023b82417a6fe2ca';
// £ / GBP char
//$f = 'sa-corpus/hard_ham/00220.11f0abf371588687744d151b46903087';
//	$email_contents = file_get_contents($f);

	if ($strip_headers) {
		// i did a grep search for \r\n as suggested here:
		// https://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-on-linu
		// only one file in sa corpus: sa-corpus/spam_2/00083.1aead789d4b4c7022c51bc632e4f2445
		// and it looks like the \r\n endings in that are NOT in the headers so i'll keep this simple
		// to trim email headers, just look for first double newline and take everything after
		$body = strstr($email_contents, "\n\n");

		if ($body === FALSE) {
			// no blank line found, so there is no header to strip: just use entire email
			// (trim(FALSE) returns '', so this check has to happen before trimming)
			$body = $email_contents;
		}
		$clean_contents = trim($body);
	} else {
		$clean_contents = trim($email_contents);
	}


	// strip all NULL bytes, HTML and PHP tags
	$clean_contents = strip_tags($clean_contents);

	// to avoid overfitting, change all numbers to just PREPRONUMSEQUENCE or
	// something unlikely to appear in the corpus
	$clean_contents = preg_replace('/\d+/', 'ppronumseq', $clean_contents);

	// to avoid overfitting, change all urls to just PREPROHTTPADDR or
	// something unlikely to appear in the corpus
	// TODO this could be improved a lot
	$url_pattern = '@(http|https)://[^\s]+@';
//	$matches = NULL;
//	$count = preg_match_all('@(http|https)://[^\s]+@', $clean_contents, $matches);
//var_dump($matches);
//die();
	$clean_contents = preg_replace($url_pattern, 'pprohttpaddr', $clean_contents);

	// to avoid overfitting, replace email addresses with PREPROEMAILADDR
	$clean_contents = preg_replace('/[^\s]+@[^\s]+/', 'pproemailaddr', $clean_contents);

	// dollar signs are a common spam element
	// TODO it might be worthwhile to apply some kind of linear or log scale to
	// multiple currency chars, i.e., more chars would be counted rather than binary yes/no
	$clean_contents = preg_replace('/(\$|£)+/', 'pprocurrency', $clean_contents);


	// we prob won't have any numbers at this point, but to get rid of punctuation,
	// replace any non-alphanumeric chars with a space
	$clean_contents = trim(preg_replace('/[^a-zA-Z0-9]+/', ' ', $clean_contents));

	// convert to lowercase...this yields a pretty clean representation of the mail,
	// just spaces and alphanumeric strings
	$clean_contents = strtolower($clean_contents);

	// split into words
	$clean_words = preg_split('/\s+/', $clean_contents);
	// don't need this var any more, save memory
	unset($clean_contents);

	// get rid of duplicate words
	$clean_words = array_unique($clean_words);

	// filter out stop words and too long / too short ones
	$clean_words = array_values(array_filter($clean_words, 'keep_word'));

	// apply the porter stemmer
	$clean_words = array_map('PorterStemmer::Stem', $clean_words);
	// may introduce duplicates so get rid of duplicate words again
	$clean_words = array_unique($clean_words);


	return $clean_words;
}


// the functions that follow all do the same, but using a different approach
function feature_vector_1($vocab, $words) {
	$out = array_fill_keys($vocab, 0);
	foreach($words as $w) {
		if (array_key_exists($w, $out)) {
			$out[$w] = 1;
		} else {
			// word is not in vocab, ignore it
		}
	}
	return array_values($out);
}

function feature_vector_2($vocab, $words) {
	$a = array_fill_keys($vocab, 0);
	$b = array_fill_keys(array_intersect($vocab, $words), 1);
	return array_values(array_merge($a, $b));
}

function feature_vector_3($vocab, $words) {
	$a = array_fill_keys($vocab, 0);
	$b = array_fill_keys(array_intersect($vocab, $words), 1);
	return array_values(array_replace($a, $b));
}

function feature_vector_4($vocab, $words) {
	return array_map(fn($v) => (int)in_array($v, $words), $vocab);
}

function feature_vector_5($vocab, $words) {
	$flipped_words = array_flip($words); // This would also uniquify the "flipped" word list automatically
	return array_map(fn($v) => (int)isset($flipped_words[$v]), $vocab);
}

function feature_vector_6($vocab, $words) {
	return array_map(fn($v) => (int)in_array($v, $words), $vocab);
}

NOTE: feature_vector is a compact array of binary values (there are no associative keys) that indicates the presence/absence of each of our vocabulary words in a single message. These are useful for later stages of processing.
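
A quick usage example with a made-up five-word vocabulary, in case the idea isn't obvious. The function here repeats the feature_vector_5 approach so the snippet runs on its own; the vocabulary and word list are invented.

```php
<?php
// Same idea as feature_vector_5 above, repeated here so the example runs on its own
function feature_vector(array $vocab, array $words): array {
    $present = array_flip($words);
    return array_map(fn($v) => (int) isset($present[$v]), $vocab);
}

// Made-up, alphabetically sorted vocabulary of stemmed words
$vocab = ['free', 'list', 'messag', 'pprocurrency', 'pprohttpaddr'];

// Words as pre_process_message might return them for one message;
// 'viagra' is not in the vocabulary, so it is ignored
$words = ['pprocurrency', 'free', 'viagra', 'pprohttpaddr'];

$x = feature_vector($vocab, $words);
echo json_encode($x), "\n"; // prints [1,0,0,1,1]
```

Note that the position of each 0/1 depends entirely on the order of the vocabulary, so the same vocabulary file has to be used for training and for classifying new messages.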

sneakyimp · Dec 7, 2023

By running the analyze-corpus-word-frequency.php script above, you get chosen_corpus_word_frequencies.json, which contains an array of all words appearing in your corpus as keys and the value of each key indicates how many message files that word appears in. E.g.:

    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
...
   [ahg] => 1
    [wdppronumseqfiiv] => 1
    [cevtr] => 1
    [wcrtegxypdpppronumseq] => 1
    [wdppronumseqnv] => 1
    [wcppronumseqfppronumseq] => 1
    [dpppronumseqf] => 1
    [bjdrenx] => 1
    [svatshenk] => 1

sneakyimp · Dec 7, 2023

Once we've analyzed the corpus to determine which words appear and how often, we run this script to select the most popular words for our vocabulary. NOTE that this vocabulary will dictate the feature vector that we use to train our classifier on existing messages and also assess the ham/spam classification of incoming, novel messages:

// === generate-vocab.php ===

<?php

// this script loads the word frequency data (generated at significant computational expense)
// from analysis of the thousands of files in the SA and MY corpus and tells us the most commonly
// appearing words and how many files of the chosen corpus that they appear in
// lastly, it generates a vocab of words used with sufficient frequency to serve
// as our vocab

// IMPORTANT: You'll need to make sure you have the correct, latest word
// frequency analysis data in the file chosen_corpus_word_frequencies.json
// we generate/store that in a separate script because it involves munging
// thousands of text documents -- computationally expensive

// you'll need to adjust MIN_FREQ so that you end up
// with about 1-2k words in your vocab

// our vocabulary will only include words that appear
// in at least this many files in the chosen corpus
// TODO move this to config.php
define('MIN_FREQ', 20);

$dest_dir = dirname(__FILE__) . '/';

// CONFIG
echo "this script has configurable parameters. See config.php\n";
require_once $dest_dir . 'config.php';


// TODO put this filename in some config somewhere, shared with the freq analysis script
$freq_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES;
echo "loading $freq_file\n";
$word_frequencies = json_decode(file_get_contents($freq_file), TRUE);


echo "corpus frequencies loaded, found " . sizeof($word_frequencies) . " words\n";

// should already be sorted by freq desc
$top = array_slice($word_frequencies, 0, 40);
echo "TOP WORDS\n";
print_r($top);

$vocab = [];
foreach($word_frequencies as $word => $freq) {
        if ($freq < MIN_FREQ) {
                // that's it! no more
                break;
        }

        $vocab[] = $word;
}

// sort alphabetically
// TODO a bit of performance profiling suggests we should NOT sort this
// array alphabetically, but rather by popular word first for better
// lookup performance?
sort($vocab);

echo sizeof($vocab) . " words in the vocab\n";

$vocab_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
echo "Saving vocab to $vocab_file\n";
file_put_contents($vocab_file, json_encode($vocab));

Sample output in my case:

$ php generate-vocab.php
this script has configurable parameters. See config.php
loading /home/jaith/biz/machine-learning/chosen_corpus_word_frequencies.json
corpus frequencies loaded, found 28084 words
TOP WORDS
Array
(
    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
    [if] => 942
    [pproemailaddr] => 930
    [can] => 914
    [mail] => 874
    [all] => 840
    [get] => 763
    [do] => 757
    [but] => 752
    [we] => 739
    [so] => 674
    [more] => 674
    [just] => 673
    [here] => 663
    [on] => 663
    [out] => 642
    [time] => 633
    [my] => 631
    [email] => 628
    [new] => 620
    [there] => 611
    [up] => 605
    [our] => 587
    [onli] => 577
    [ani] => 577
    [ha] => 576
    [now] => 556
    [like] => 551
    [work] => 536
    [messag] => 536
    [thei] => 533
    [inform] => 530
    [free] => 510
)
1955 words in the vocab
Saving vocab to /home/jaith/biz/machine-learning/chosen_corpus_vocab.json

This writes our vocab to chosen_corpus_vocab.json.
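
Since MIN_FREQ has to be tuned by hand to land on a vocabulary of roughly 1-2k words, a small helper like this might save some trial and error. It's a sketch only: vocab_size_for is a hypothetical function and the frequency table is a tiny made-up sample, not my real data.

```php
<?php
// Hypothetical helper: how many words would survive a given minimum file frequency?
function vocab_size_for(array $word_frequencies, int $min_freq): int {
    return count(array_filter($word_frequencies, fn($freq) => $freq >= $min_freq));
}

// Tiny made-up frequency table; in practice this would be loaded with
// json_decode(file_get_contents($freq_file), TRUE)
$word_frequencies = [
    'ppronumseq' => 1784,
    'you' => 1344,
    'free' => 510,
    'offer' => 45,
    'ahg' => 1,
];

// try a few candidate thresholds and report the resulting vocabulary size
foreach ([1, 20, 100, 1000] as $min_freq) {
    echo "MIN_FREQ $min_freq keeps " . vocab_size_for($word_frequencies, $min_freq) . " words\n";
}
```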

sneakyimp · Dec 7, 2023

Once we have our JSON files containing our corpus of chosen message files and our vocabulary, we can generate our training, cross validation, and test arrays.

NOTE: once again, PorterStemmer will barf in php 8, so run this with php 7 or fix it.

// === generate-training-sets.php ===
/**
 * This script loads the previously generated list of our corpus
 * files (which has already been randomly shuffled) and generates
 * the matrix of vectors we use to train our machine. That 
 * matrix will be a PHP array, containing one entry for each
 * file in our corpus. Each entry will itself be an array with
 * one entry for each word in our vocabulary.  We will then
 * break up this ALL matrix into train, validation, and test
 * subsets (60/20/20) and store those in a JSON file for use
 * by our training algorithm.
*/

$dest_dir = dirname(__FILE__) . '/';

require_once $dest_dir . 'config.php';

// load our file corpus
// NOTE each element should be an associative array specifying file, is_spam, and strip_headers
// e.g.:  ['file' => '/full/path/to/file', 'is_spam' => 1, 'strip_headers' => 0]
$corpus_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_FILES;
$corpus_files = json_decode(file_get_contents($corpus_json_file), TRUE);
$corpus_file_count = sizeof($corpus_files);
echo "$corpus_file_count files loaded from $corpus_json_file\n";

$vocab_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
$vocab = json_decode(file_get_contents($vocab_json_file), TRUE);
$vocab_word_count = sizeof($vocab);
echo "$vocab_word_count words loaded from $vocab_json_file\n";

// we encountered some character encoding problems with regex in octave/matlab
// do we need to declare a charset for these files or perform some kind of charset conversion?
// before we start processing these emails, we need to set the charset
// or it can barf doing some regex
// NOTE: this does not appear to cause trouble in PHP
//__mfile_encoding__ ("iso-8859-1");


// we will need these functions to analyze each message
// TODO give this file a more meaningful name
require_once 'text-fns.php';


// this is our master array of feature vectors
$Xall = [];
// this is our master array of y scores
$yall = [];

echo "processing $corpus_file_count files from chosen corpus\n";
$start = microtime(TRUE);
foreach($corpus_files as $i => $cf) {
//	echo "processing " . $cf['file'] . "\n";

	if (($i % 100) == 0) {
		echo "processing $i of $corpus_file_count\n";
	}

	$file = $cf['file'];

	$contents = file_get_contents($file);
	if (!$contents) {
		throw new Exception($file . ' could not be fetched, false or empty returned');
	}

	// this returns an array of the massaged, unique words, in a message -- no duplicates
	// TODO we will probably want to remove 2nd strip_headers param when we port this to
	// a website application, it's just a quirk of the SA data corpus
	$words = pre_process_message($contents, (bool)$cf['strip_headers']);
	unset($contents);

	// take our $vocab and $words to generate a feature vector
	// which is just 0s and 1s, indicating which vocab words are in the current message
	$Xall[] = feature_vector_1($vocab, $words);
	$yall[] = $cf['is_spam'];

}
$elapsed = microtime(TRUE) - $start;
echo "all $corpus_file_count messages processed in $elapsed seconds\n";

echo sizeof($Xall) . " records in Xall\n";
echo sizeof($yall) . " records in yall\n";


// save processed training data
$data_file_all = $dest_dir . 'training_data_all.json';
echo "saving Xall and yall to $data_file_all\n";
file_put_contents($data_file_all, json_encode(['Xall' => $Xall, 'yall' => $yall]));

// divide up the entire set into subsets for training/validation/testing
$row_count = sizeof($yall);

if ($row_count < 10) {
	die("you don't even have 10 training examples. I refuse to finish\n");
}

// to split in 3 groups of 60/20/20, we need 2 cuts
$cut1 = (int) round($row_count * .6);
echo "cut1 $cut1\n";
$cut2 = (int) round($row_count * .8);
echo "cut2 $cut2\n";


$Xtrain = array_slice($Xall, 0, $cut1);
echo sizeof($Xtrain) . " elements in Xtrain\n";

$ytrain = array_slice($yall, 0, $cut1);
echo sizeof($ytrain) . " elements in ytrain\n";


$Xval = array_slice($Xall, $cut1, ($cut2-$cut1));
echo sizeof($Xval) . " elements in Xval\n";

$yval = array_slice($yall, $cut1, ($cut2-$cut1));
echo sizeof($yval) . " elements in yval\n";

$Xtest = array_slice($Xall, $cut2);
echo sizeof($Xtest) . " elements in Xtest\n";

$ytest = array_slice($yall, $cut2);
echo sizeof($ytest) . " elements in ytest\n";

echo "X element total = " . (sizeof($Xtrain) + sizeof($Xval) + sizeof($Xtest)) . "\n";
echo "y element total = " . (sizeof($ytrain) + sizeof($yval) + sizeof($ytest)) . "\n";

// save data sets
$data_file_sets = $dest_dir . 'training_data_sets.json';
echo "saving training/validation/test sets to $data_file_sets\n";
file_put_contents($data_file_sets, json_encode([
	'Xtrain' => $Xtrain,
	'ytrain' => $ytrain,
	'Xval' => $Xval,
	'yval' => $yval,
	'Xtest' => $Xtest,
	'ytest' => $ytest,
]));
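
The slices above are cut in corpus order, and train-svm.php expects the rows to be randomly shuffled. If that shuffle doesn't already happen earlier in the script (that part isn't shown here), a minimal sketch of shuffling X and y together before the 60/20/20 cut might look like this. The function name shuffle_and_split and the $seed parameter are made up for illustration; they are not part of the script.

```php
<?php
// Hypothetical helper (not from the original script): applies one random
// permutation to both the feature vectors and the labels, then cuts 60/20/20.
function shuffle_and_split(array $X, array $y, int $seed = 42): array {
	$n = count($y);
	if ($n === 0 || count($X) !== $n) {
		throw new InvalidArgumentException('X and y must be non-empty and the same length');
	}

	// build a single permutation of row indexes so X and y stay aligned
	$order = range(0, $n - 1);
	mt_srand($seed); // seeded so the split is repeatable between runs
	shuffle($order);

	$Xs = [];
	$ys = [];
	foreach ($order as $idx) {
		$Xs[] = $X[$idx];
		$ys[] = $y[$idx];
	}

	// to split in 3 groups of 60/20/20, we need 2 cuts
	$cut1 = (int) round($n * 0.6);
	$cut2 = (int) round($n * 0.8);

	return [
		'Xtrain' => array_slice($Xs, 0, $cut1),
		'ytrain' => array_slice($ys, 0, $cut1),
		'Xval'   => array_slice($Xs, $cut1, $cut2 - $cut1),
		'yval'   => array_slice($ys, $cut1, $cut2 - $cut1),
		'Xtest'  => array_slice($Xs, $cut2),
		'ytest'  => array_slice($ys, $cut2),
	];
}
```

Shuffling an index array rather than $Xall itself is the point here: calling shuffle() separately on $Xall and $yall would scramble them differently and break the pairing between each feature vector and its label.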

sneakyimp · Dec 7, 2023

OK so once we have the training/test/validation set feature vectors in a big fat JSON file, we can train a Support Vector Machine. Glossing over a bunch of details, this is the top-level script. I'm wondering if I should maybe just put this in a repo on github? If anyone is curious, LMK.

// === train-svm.php ===

/**
 * this script trains a Support Vector Machine (i.e., a machine
 * learning classifier algorithm) to determine if an email message
 * is spam or ham
*/

// Load the Spam Email dataset
// You will have Xtrain, ytrain, Xval, yval, Xtest, ytest
// we expect this data to be randomly shuffled, and to contain
// feature vectors from both the old SA corpus and MY corpus
$data_file = __DIR__ . '/training_data_sets.json';
echo "loading data from $data_file\n";
$data = json_decode(file_get_contents($data_file), TRUE);

// output some data points
$data_keys = ['Xtrain', 'ytrain', 'Xval', 'yval', 'Xtest', 'ytest'];
foreach($data_keys as $key) {
	if (!array_key_exists($key, $data)) {
		throw new Exception("data file did not define key=$key");
	}
	echo "$key has " . sizeof($data[$key]) . " elements\n";
}

// we will be using ghostjat/np for matrix operations
// NOTE: i vaguely recall having modified this library to get it to return the same
// results as octave for some matrix operation
require_once __DIR__ . '/np/vendor/autoload.php';
use Np\matrix;
use Np\vector;

$x_matrix = matrix::ar($data['Xtrain']);
$y_matrix = vector::ar($data['ytrain']);

// trial and error suggested C=1 would be best, but *shrug*
//$C = 3;
// testing
$C = 0.1;
echo "training model with C=$C\n";


// FIXME before we can call our svm-fns we need to load BLAS because Np depends on it
// need to put it somewhere correct? this is awkward
Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/np/vendor/ghostjat/np/src/core/blas.h');
require_once 'svm-fns.php';

echo "Training SVM (Spam Classification)\n";
echo "(this may take 1 to 2 minutes) ...\n";
// train the model
// best practice is to use separate training/validation/test sets
$start = microtime(TRUE);
$model = svm_train($x_matrix, $y_matrix, $C, 'linear_kernel');
echo "training completed in " . (microtime(TRUE) - $start) . " seconds\n";

// $model will be an associative array with these keys:
// kernel_fn => string specifying kernel function
// b =>  float
// x => matrix of input x vectors (0s/1s in spam example) where alpha > 0
// y => vector of input y classifications (remapped to -1/1 by svm_train) where alpha > 0
// alphas => vector of alpha values calculated during training
// w => vector of weights, one float per input feature

// convert $model to something we can JSON_ENCODE, i.e., only
// basic data types instead of Np\vector or Np\matrix
$to_array = ['x', 'y', 'alphas', 'w'];
$model_to_save = [];
foreach($model as $key => $val) {
	if (in_array($key, $to_array)) {
		$model_to_save[$key] = $val->asArray();
	} else {
		$model_to_save[$key] = $val;
	}
}

// save the model so we can crack it open and test it
// without having to retrain the entire model all over again.
$model_data_file = __DIR__ . '/trained-svm-model.json';
echo "Writing trained SVM model params to $model_data_file\n";
file_put_contents($model_data_file, json_encode($model_to_save));

echo "\nEvaluating the trained Linear SVM on TRAINING set ...\n";
$x = Np\matrix::ar($data['Xtrain']);
$y = Np\vector::ar($data['ytrain']);
$results = svm_assess($model, $x, $y);
print_r($results);

echo "\nEvaluating the trained Linear SVM on Xtest set ...\n";
$x = Np\matrix::ar($data['Xtest']);
$y = Np\vector::ar($data['ytest']);
$results = svm_assess($model, $x, $y);
print_r($results);


die("DONE\n");

I'll post the svm-fns.php in the next post.
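
Since the saved model includes w and b, scoring one message with the linear kernel boils down to p = x · w + b, with p >= 0 treated as spam. Here is a rough sketch of that step using plain PHP arrays instead of ghostjat/np. The function name linear_predict and the toy weights are invented for illustration, and this is not how svm_predict is actually written.

```php
<?php
// Hypothetical helper, not part of svm-fns.php: scores one 0/1 feature
// vector against linear-kernel weights using plain PHP arrays.
function linear_predict(array $features, array $w, float $b): int {
	if (count($features) !== count($w)) {
		throw new InvalidArgumentException('feature vector and weights differ in length');
	}

	// p = x . w + b
	$p = $b;
	foreach ($features as $k => $present) {
		$p += $present * $w[$k];
	}

	// same threshold as svm_predict: p >= 0 means spam (1), otherwise ham (0)
	return $p >= 0 ? 1 : 0;
}

// toy weights for a 3-word vocabulary, made up for illustration
$w = [2.0, -3.0, 0.5];
$b = -1.0;
echo linear_predict([1, 0, 1], $w, $b), "\n"; // 2.0 + 0.5 - 1.0 = 1.5, so 1
echo linear_predict([0, 1, 0], $w, $b), "\n"; // -3.0 - 1.0 = -4.0, so 0
```

For a single message this is one pass over the vocabulary, so the BLAS dependency only really matters for training and for batch assessment.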

sneakyimp · Dec 7, 2023

But first, sample output. As you can see, with C=1, the trained SVM assesses the messages in our training set with 99.9% accuracy, and in the test data set (i.e., new messages it has not seen) with 93.0% accuracy:

loading data from /home/sneakyimp/biz/machine-learning/training_data_sets.json
Xtrain has 1375 elements
ytrain has 1375 elements
Xval has 458 elements
yval has 458 elements
Xtest has 458 elements
ytest has 458 elements
training model with C=1
Training SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
train with C=1 and kernel_fn=linear_kernel
x matrix m=1375, n=1955
y vector size 1375
CALCULATING KERNEL
KERNEL CALC COMPLETE

Training.......................................................................
..........................................................................
MAX_PASSES (5) REACHED, training done
training completed in 82.92901802063 seconds
Writing trained SVM model params to /home/sneakyimp/biz/machine-learning/trained-svm-model.json

Evaluating the trained Linear SVM on TRAINING set ...
Array
(
    [x_samples] => 1375
    [x_features] => 1955
    [y_samples] => 1375
    [p_size] => 1375
    [correct_predictions] => 1373
    [true_positives] => 548
    [true_negatives] => 825
    [false_positives] => 2
    [false_negatives] => 0
    [precision] => 0.99636363636364
    [recall] => 1
    [f_score] => 0.99817850637523
    [correct_decimal] => 0.99854545454545
    [correct_percent] => 99.854545454545
    [elapsed_time] => 0.29263210296631
)

Evaluating the trained Linear SVM on Xtest set ...
Array
(
    [x_samples] => 458
    [x_features] => 1955
    [y_samples] => 458
    [p_size] => 458
    [correct_predictions] => 426
    [true_positives] => 155
    [true_negatives] => 271
    [false_positives] => 18
    [false_negatives] => 14
    [precision] => 0.89595375722543
    [recall] => 0.91715976331361
    [f_score] => 0.90643274853801
    [correct_decimal] => 0.93013100436681
    [correct_percent] => 93.013100436681
    [elapsed_time] => 0.097591161727905
)
DONE
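
As a sanity check, the summary numbers for the test set can be recomputed from the four confusion-matrix counts alone. This is just a sketch of the arithmetic; the function name confusion_metrics is made up here, and the zero-denominator guards are my own addition.

```php
<?php
// Recompute precision, recall, F-score and accuracy from the four
// confusion-matrix counts. Hypothetical helper, for illustration only.
function confusion_metrics(int $tp, int $tn, int $fp, int $fn): array {
	$total = $tp + $tn + $fp + $fn;

	// precision: of everything flagged as spam, how much really was spam
	$precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
	// recall: of all the real spam, how much did we flag
	$recall = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;
	// F-score: harmonic mean of precision and recall
	$f_score = ($precision + $recall) > 0
		? (2 * $precision * $recall) / ($precision + $recall)
		: 0.0;
	$accuracy = $total > 0 ? ($tp + $tn) / $total : 0.0;

	return [
		'precision' => $precision,
		'recall' => $recall,
		'f_score' => $f_score,
		'accuracy' => $accuracy,
	];
}

// counts from the Xtest run above
$m = confusion_metrics(155, 271, 18, 14);
printf("precision %.4f\n", $m['precision']); // 155 / 173 = 0.8960
printf("recall    %.4f\n", $m['recall']);    // 155 / 169 = 0.9172
printf("f_score   %.4f\n", $m['f_score']);   // 0.9064
printf("accuracy  %.4f\n", $m['accuracy']);  // 426 / 458 = 0.9301
```

Note that the 18 false positives are ham messages that would have been thrown out as spam, which is why precision is arguably the number to watch for a filter like this.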

sneakyimp · Dec 7, 2023

Here are the contents of svm-fns.php, which do the training.

// === svm-fns.php ===
/**
 * Defines functions to train and use a Support Vector Machine
 */

/**
 * LINEARKERNEL returns a linear kernel between x1 and x2
 * NOTE that the incoming vectors x1 and x2 were originally both column vectors
 * of dimensions (vocab size) x 1
 */
function linear_kernel($x1, $x2) {
	
	// Ensure that x1 and x2 are column vectors
	// while this conversion may be necessary for a broad use of
	// this function, it is unnecessary in the svmTrain context
	// and probably hampers performance a tiny bit
	// x1 = x1(:); x2 = x2(:);

	// Compute the kernel
	// this should return a 1x1 matrix (scalar value?)

	return $x1->dot($x2);  // dot product, should yield scalar
}

/**
 * returns a radial basis function kernel between x1 and x2
 * sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2
 * and returns the value
 */
function gaussian_kernel(Np\vector $x1, Np\vector $x2, $sigma) {

	// NOTE the incoming vectors x1 and x2 are column vectors
	// of dimension (vocab_size) x 1

	// orig octave:
	//sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma^2));
	// NOTE sim will be a 1x1 result (a scalar value)

	return exp(-$x1->subtract($x2)->square()->sum() / (2 * $sigma*$sigma));
}

function get_gaussian_predict_k(Np\matrix $x, array $model) {

	// orig octave code in svmPredict
	// Vectorized RBF Kernel
	// This is equivalent to computing the kernel on every pair of examples
	//X1 = sum(X.^2, 2);
	//X2 = sum(model.X.^2, 2)';
	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
	//K = model.kernelFunction(1, 0) .^ K;
	//K = bsxfun(@times, model.y', K);
	//K = bsxfun(@times, model.alphas', K);
	//p = sum(K, 2);

	$x1 = $x->square()->sumRows();
	//echo "x1 ", $x1, "\n";

	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
	$x2 = $model['x']->square()->sumRows();
	//echo "x2 ", $x2, "\n";

	// need to build K.
	$K = $x->dot($model['x']->transpose())->multiply(-2);

	// do the inner bsxfun(plus...)
	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
	$kshape = $K->getShape();
	$km = $kshape->m;
	$kn = $kshape->n;
	$x2size = $x2->getSize();
	// $kn should match the dimensions of $x2
	// sanity check
	if ($x2size !== $kn) {
		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$x2size; $i++) {
		$x2val = $x2->data[$i];
		for($j=0; $j<$km; $j++) {
			// add the ith x2 value to the ith column of the jth row
			$K->data[($j * $kn) + $i] += $x2val;
		}
	}

	// do the outer bsxfun(plus...)
	// ghostjat has no means to add a COLUMN vector soooo we fake it
	$x1size = $x1->getSize();
	// $km should match the dimensions of $x1
	// sanity check
	if ($x1size !== $km) {
		throw new \Exception("x1 size ($x1size) does not match km ($km)");
	}
	// i are rows, j are columns
	for($i=0; $i<$x1size; $i++) {
		$x1val = $x1->data[$i];
		for($j=0; $j<$kn; $j++) {
			// add the ith x1 value to the jth column of the ith row
			//$offset = ($i * $kn) + $j;
			$K->data[($i * $kn) + $j] += $x1val;
		}
	}

	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
	//echo "kf ", $kf, "\n";
	$K = $K->map(fn($v) => (pow($kf, $v)));

	$mysize = $model['y']->getSize();
	// $kn should match the dimensions of $model['y']
	// sanity check
	if ($mysize !== $kn) {
		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$mysize; $i++) {
		$yval = $model['y']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith y value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $yval;
		}
	}


	$alphasize = $model['alphas']->getSize();
	// $kn should match the dimensions of $model['alphas']
	// sanity check
	if ($alphasize !== $kn) {
		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$alphasize; $i++) {
		$aval = $model['alphas']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith alpha value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $aval;
		}
	}

	return $K;
}


/**
 * trains an SVM classifier and returns trained model. X is the matrix of
 * training examples.  Each row is a training example, and the jth column
 * holds the jth feature.  Y is a column matrix containing 1 for positive
 * examples and 0 for negative examples.  C is the standard SVM regularization
 * parameter.  tol is a tolerance value used for determining equality of
 * floating point numbers. max_passes controls the number of iterations
 * over the dataset (without changes to alpha) before the algorithm quits.
 *
 * Note: This is a simplified version of the SMO algorithm for training
 * SVMs. In practice, if you want to train an SVM classifier, we
 * recommend using an optimized package such as:
 * 	LIBSVM   (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
 * 	SVMLight (http://svmlight.joachims.org/)
 */
function svm_train($x_matrix, $y_vector, $C, $kernel_fn, $sigma=null, $tol=0.001, $max_passes=5) {
	echo "train with C=$C";
	if (!is_null($sigma)) {
		echo ", sigma=$sigma";
	}
	echo " and kernel_fn=$kernel_fn\n";
	$shape = $x_matrix->getShape();
	echo "x matrix m={$shape->m}, n={$shape->n}\n";
	$size = $y_vector->getSize();
	echo "y vector size ", $size, "\n";

	$m = $shape->m;


	// map the 0s in y to -1; note this appears to be faster than vector->map() stuff
	// BIG FAT WARNING we have to make a copy of the $y_vector object because
	// changing it here apparently propagates those changes back up to the calling scope
	$yvec = Np\vector::ones($y_vector->getSize());
	$yndim = $y_vector->ndim;
	for($i=0; $i<$yndim; $i++) {
        	if ($y_vector->data[$i] == 0) {
                	$yvec->data[$i] = -1;
        	}
	}


	// Pre-compute the Kernel Matrix since our dataset is small
	// (in practice, optimized SVM packages that handle large datasets
	// gracefully will _not_ do this)

	echo "CALCULATING KERNEL\n";
	// We have implemented optimized, vectorized versions of the kernels here so
	// that the svm training will run faster.
	if ($kernel_fn === 'linear_kernel') {
		// Vectorized computation for the Linear Kernel
		// This is equivalent to computing the kernel on every pair of examples
		$K = $x_matrix->dot($x_matrix->transpose());
	} elseif ($kernel_fn === 'gaussian_kernel') {

		if (is_null($sigma)) {
			throw new Exception('You must provide a sigma value for gaussian kernel training');
		}

		// Vectorized RBF Kernel
		// This is equivalent to computing the kernel on every pair of examples

		// orig octave:
		// X2 = sum(X.^2, 2);
		// K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
		// K = kernelFunction(1, 0) .^ K;

		// a vector of size n, orig X2 was a column vector
		$x2 = $x_matrix->square()->sumRows();
		// need to build K. this gets us pretty far, calculating inner bsxfun somehow
		$K = $x_matrix->dot($x_matrix->transpose())->multiply(-2)->sum($x2);
		// ghostjat has no means to add a column vector soooo we fake it
		$kshape = $K->getShape();
		$km = $kshape->m;
		$kn = $kshape->n;
		$x2size = $x2->getSize();
		// $km should match the dimensions of $x2
		// sanity check
		if ($x2size !== $km) {
			throw new \Exception("x2 size ($x2size) does not match km ($km)");
		}
		for($i=0; $i<$x2size; $i++) {
			$x2val = $x2->data[$i];
			for($j=0; $j<$kn; $j++) {
				// add the ith x2 value to the jth column of each row
				//$offset = ($i * $kn) + $j;
				$K->data[($i * $kn) + $j] += $x2val;
			}
		}
		// free memory
		unset($x2);

		$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $sigma);
		$K = $K->map(fn($v) => (pow($kf, $v)));

	} else {
		// Pre-compute the Kernel Matrix
		// The following can be slow due to the lack of vectorization
		echo "NON-VECTORIZED, SLOW\n";
		$K = Np\matrix::zeros($m, $m);
		for ($i=0; $i<$m; $i++) {
			if ($i >0 && ($i % 10 == 0)) {
				echo "\tloop $i\n";
			}
			for ($j=0; $j<$m; $j++) {

				// original matlab/octave code
				//K(i,j) = kernelFunction(X(i,:)', X(j,:)');
				//K(j,i) = K(i,j); %the matrix is symmetric

				// FIXME define a set() fn for matrix class rather than awkwardly calculating offset
				$kernel_val = $kernel_fn($x_matrix->rowAsVector($i), $x_matrix->rowAsVector($j));
				// location of $i, $j
				$offset1 = ($i * $K->col) + $j;
				$K->data[$offset1] = $kernel_val;
				// K matrix is symmetric, location of $j, $i
				$offset2 = ($j * $K->col) + $i;
				$K->data[$offset2] = $kernel_val;
			} // j loop

		} // i loop
	} // if linear/gaussian/slow
	echo "KERNEL CALC COMPLETE\n";

	// Variables
	$alphas = Np\vector::zeros($m);
	$b = 0;
	$E = Np\vector::zeros($m);
	$passes = 0;
	$eta = 0;
	$L = 0;
	$H = 0;

	// Train
	echo "\nTraining...";
	$dots = 11;
	while ($passes < $max_passes) {

		$num_changed_alphas = 0;
		for ($i=0; $i<$m; $i++) {
			// comments from original coursera class octave source:
			// Calculate Ei = f(x(i)) - y(i) using (2).
			// this line commented out in coursera source
			// E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);

			// we want to calculate this octave expression from coursera source
			//E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);

			// considerable trial and error yielded this for the sum, returns a scalar/float
			//$sum = $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum();

			$E->data[$i] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum() - $yvec->data[$i];

			// orig octave if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0)),
			if (
				($yvec->data[$i] * $E->data[$i] < -$tol && $alphas->data[$i] < $C)
				|| ($yvec->data[$i] * $E->data[$i] > $tol && $alphas->data[$i] > 0)
			) {

				// In practice, there are many heuristics one can use to select
				// the i and j. In this simplified code, we select them randomly.
				do {
					$j = mt_rand(0, ($m-1));
				} while ($j === $i);

// TESTING
//$j = ($i + 1) % $m;
//echo "j: $j\n";

				// Calculate Ej = f(x(j)) - y(j) using (2).
				// orig octave calc: E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);
				$E->data[$j] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($j))->sum() - $yvec->data[$j];

				// Save old alphas
				$alpha_i_old = $alphas->data[$i];
				$alpha_j_old = $alphas->data[$j];

				// Compute L and H by (10) or (11).
				$ai = $alphas->data[$i]; // grab these to prevent costly lookups any more than necessary
				$aj = $alphas->data[$j];
				if ($yvec->data[$i] == $yvec->data[$j]) {
					$L = max(0, $aj + $ai - $C);
					$H = min($C, $aj + $ai);
				} else {
					$L = max(0, $aj - $ai);
					$H = min($C, $C + $aj - $ai);
				}


				if ($L == $H) {
					// continue to next i.
					continue;
				}
            
				// Compute eta by (14).
				$eta = 2 * $K->at($i,$j) - $K->at($i,$i) - $K->at($j,$j);
            
				if ($eta >= 0) {
					// continue to next i.
					continue;
				}

				// Compute and clip new value for alpha j using (12) and (15).
				// orig octave: alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;
				// to avoid costly lookups, lets use the $aj var we just set above
				$aj = $aj - ($yvec->data[$j] * ($E->data[$i] - $E->data[$j])) / $eta;
				// Clip
				//alphas(j) = min (H, alphas(j));
				//alphas(j) = max (L, alphas(j));
				$aj = min($H, $aj);
				$aj = max($L, $aj);
				// make sure we put the new $aj value back into $alphas
				$alphas->data[$j] = $aj;

				// Check if change in alpha is significant
				if (abs($aj - $alpha_j_old) < $tol) {
					// continue to next i.
					// replace anyway
					$alphas->data[$j] = $alpha_j_old;
					continue;
				}

				// Determine value for alpha i using (16).
				// alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));
				$ai = $ai + $yvec->data[$i] * $yvec->data[$j] * ($alpha_j_old - $aj);
				// be sure to put new $ai back in $alphas
				$alphas->data[$i] = $ai;

				//  Compute b1 and b2 using (17) and (18) respectively.
				//b1 = b - E(i) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(i,j)';
				$b1 = $b - $E->data[$i]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($i, $j);

				//b2 = b - E(j) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(j,j)';
				$b2 = $b - $E->data[$j]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($j, $j);

				// Compute b by (19).
				if (0 < $ai && $ai < $C) {
					$b = $b1;
				} elseif (0 < $aj && $aj < $C) {
					$b = $b2;
				} else {
					$b = ($b1+$b2)/2;
				}
				$num_changed_alphas = $num_changed_alphas + 1;


			} //  if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0))


		} // for loop

		if ($num_changed_alphas == 0) {
			$passes++;
		} else {
			$passes = 0;
		}

		echo '.';
		$dots++;
		if ($dots > 78) {
			$dots = 0;
			echo "\n";
		}

	} // while passes < max_passes
	echo "\nMAX_PASSES ($max_passes) REACHED, training done\n";


	// NOTE: alphas is a m x 1 column vector containing some floats and
	// many near-zero values and a few floats a tiny bit less than zero

	// idx is a size-m vector of ones and zeros indicating which alphas are > 0
	// while this is convenient & readable in octave, it's gratuitous in PHP
	// FIXME remove this
	//$idx =  $alphas->map(fn($v) => ($v > 0));

	// b value calculated from our training, a float, e.g. 0.9990
	// FIXME we don't need this extra ret_b var, move comment below?
	$ret_b = $b;

	// GENERATE THESE WITH LOOP which is actually faster/simpler than vector::map()
	// X subset matrix of orig feature vectors who end up with alpha > 0
	// size typically 500 x n (where m x n is size of orig training set X)
	$ret_x = [];
	// subset (column) vector indicating original classification for our
	// new subset model.X. size same as model.X, e.g. 500 x 1
	$ret_y = [];
	// subset (column) vector same size as our model.X, e.g., 500 x 1 containing
	// float alpha values calculated by our training
	$ret_alphas = [];

	// only include x/y/alphas with value greater than zero
	for($i=0; $i<$m; $i++) {
		$alpha = $alphas->data[$i];
		if ($alpha > 0) {
			// sadly ghostjat/np offers no efficient methods to construct new matrix from vectors
			// so we have to convert to native PHP arrays
			// TODO this would probably be faster if we looped directly in $x_matrix->data
			$ret_x[] = $x_matrix->rowAsVector($i)->asArray();
			$ret_y[] = $yvec->data[$i];
			$ret_alphas[] = $alpha;
		}
	}
	$ret_x = Np\matrix::ar($ret_x);
	$ret_y = Np\vector::ar($ret_y);
	$ret_alphas = Np\vector::ar($ret_alphas);
	

	// column vector containing our weights for each feature, size
	// is n x 1 (where m x n is size of orig training set X)
	// the orig octave
	// model.w = ((alphas.*Y)'*X)';
	// getting the correct output required much trial and error, produced weird sumRows thing
	$ret_w = $alphas->multiply($yvec)->multiply($x_matrix->transpose())->sumRows();
	// Return the model
	return [
		'kernel_fn' => $kernel_fn, // string specifying kernel function
		'b' => $ret_b, // float
		'x' => $ret_x, // matrix
		'y' => $ret_y, // vector
		'alphas' => $ret_alphas, // vector
		'w' => $ret_w, // vector
		'sigma' => $sigma,
		'c' => $C
	];

} // svm_train()

/**
 * returns a vector of predictions using a SVM trained by svm_train
 * @param $x is either a m x n matrix or a vector of size n
 * @param $model is an associative array svm model returned from svm_train()
 * @return size m vector of predictions
 */
function svm_predict($model, $x) {
	if ($x instanceof Np\matrix) {
		// matrix is acceptable
	} elseif ($x instanceof Np\vector) {
		// FIXME work up a variant of this fn to predict for a vector
		die("is vector\n");
	} else {
		throw new Exception(gettype($x) . ' is not a valid type for $x');
	}

	$shape = $x->getShape();
	$m = $shape->m;
	$features = $shape->n;

	if ($model['kernel_fn'] == 'linear_kernel') {
		// We can use the weights and bias directly if working with the
		// linear kernel
		// original octave:
		// p = X * model.w + model.b;
		// WARNING this seems to return the right result, but
		// the order of operands is reversed, there's a sum, etc. real kludgy.
		$p = $model['w']->multiply($x)->sumRows()->add($model['b']);

	} elseif ($model['kernel_fn'] == 'gaussian_kernel') {
		$K = get_gaussian_predict_k($x, $model);
		//p = sum(K, 2);
		$p = $K->sumRows();

	} else {
		// Other kernel fn -- THIS WILL PROB BE SLOW
		$shape = $model['x']->getShape();
		$model_x_m = $shape->m;
		$p = Np\vector::zeros($m);

		for($i=0; $i<$model_x_m; $i++) {
			$prediction = 0;
			for($j=0; $j<$features; $j++) {
				throw new Exception("NOT YET IMPLEMENTED");
				// we want to do this original octave stuff here:

				//prediction = prediction + ...
				//model.alphas(j) * model.y(j) * ...
				//model.kernelFunction(X(i,:)', model.X(j,:)');
			} // for j
			$p->data[$i] = $prediction + $model['b'];
		} // for i
	} // if kernel_fn is linear/gaussian/other

	// change calculated ranges to zero or one
	return $p->map(fn($v) => ($v >= 0));
} // svm_predict()

/**
 * Runs the specified model on the $x and $y provided and
 * returns details about the time and accuracy
 */
function svm_assess(array $model, Np\matrix $x, Np\vector $y) {
	$start = microtime(TRUE);

	$retval = [];

	$shape = $x->getShape();
	$retval['x_samples'] = $shape->m;
	$retval['x_features'] = $shape->n;

	$y_size = $y->getSize();
	$retval['y_samples'] = $y_size;

	$p = svm_predict($model, $x);
	$p_size = $p->getSize();
	$retval['p_size'] = $p_size;

	// sanity check
	if ($p_size !== $y_size) {
        	throw new Exception("p size $p_size does not match y size $y_size");
	}

	// calculate what percentage of the time our model's prediction
	// matches y. $p is full of predictions, $y is full of answers
	$correct = 0;
	$true_positives = 0;
	$true_negatives = 0;
	$false_positives = 0;
	$false_negatives = 0;
	for($i=0; $i<$p_size; $i++){
		// FIXME modify this logic to calculate true & false positives/negatives
	        // if prediction matches training set value, it's CORRECT
		$pval = $p->data[$i];
	        if ($pval == $y->data[$i]) {
	                $correct++;
			if ($pval == 1) {
				$true_positives++;
			} else {
				$true_negatives++;
			}
	        } else {
			if ($pval == 1) {
				$false_positives++;
			} else {
				$false_negatives++;
			}
		}
	}
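	// guard against division by zero when nothing was predicted positive or no positives exist
	$precision = ($true_positives + $false_positives) > 0 ? $true_positives / ($true_positives + $false_positives) : 0.0;
	$recall = ($true_positives + $false_negatives) > 0 ? $true_positives / ($true_positives + $false_negatives) : 0.0;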
	
	$retval['correct_predictions'] = $correct;
	$retval['true_positives'] = $true_positives;
	$retval['true_negatives'] = $true_negatives;
	$retval['false_positives'] = $false_positives;
	$retval['false_negatives'] = $false_negatives;
	$retval['precision'] = $precision;
	$retval['recall'] = $recall;
	$retval['f_score'] = ($precision + $recall) > 0 ? (2 * $precision * $recall) / ($precision + $recall) : 0.0;
	


	$accuracy = ($correct/$p_size);
	$retval['correct_decimal'] = $accuracy;
	$retval['correct_percent'] = $accuracy * 100;

	$retval['elapsed_time'] = microtime(TRUE) - $start;

	return $retval;
}


/**
 * returns optimal C by training numerous SVM classifiers with varying
 * values of C and returning the one that performs best
 *
 */
function svm_linear_optimal_c($xtrain, $ytrain, $xval, $yval) {

	// exponentially spaced values: powers of 1.4 from 1.4^-10 to 1.4^10
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin linear sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";

	$best_c = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		echo "== Training SVM with C=$cval ==\n";

		$start = microtime(TRUE);
		// train the model on the training set
		$model = svm_train($xtrain, $ytrain, $cval, 'linear_kernel', null, 0.0001);
		$elapsed = microtime(TRUE) - $start;
		echo "training completed in $elapsed seconds\n";

		// assess the model with the xtrain set
		$results = svm_assess($model, $xtrain, $ytrain);
		echo "XTRAIN\n";
		print_r($results);
		$train_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results			
		);

		// assess the model with the xval set
		$results = svm_assess($model, $xval, $yval);
		echo "XVAL\n";
		print_r($results);
		$val_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results
		);

		// maybe optimize for f_score?
		$correct_percent = $results['correct_percent'];
		if (is_null($best_c) || $correct_percent > $best_correct_percent) {
			$best_c = $cval;
			$best_results = $results;
			$best_correct_percent = $correct_percent;
			$best_model = $model;
		}

		$result_idx++;

	}

	echo "\n=====\n";
	echo "Best value for C is $best_c, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];

}


/**
 * returns optimal C and sigma by training numerous gaussian SVM classifiers with varying
 * values of C and sigma, returning the one that performs best
 *
 */
function svm_gaussian_optimal_c($xtrain, $ytrain, $xval, $yval) {
	// good, sort of hand picked
//	$cvals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
//	$sigma_vals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
	// exponentially spaced values: powers of 1.4 from 1.4^-10 to 1.4^10
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	$sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin gaussian sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";
	echo "\tvalues of sigma: ", implode(", ", $sigma_vals), "\n";

	$best_c = null;
	$best_sigma = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		foreach($sigma_vals as $s_i => $sigma) {
			echo "== Training SVM with C=$cval, sigma=$sigma ==\n";

			$start = microtime(TRUE);
			// train the model on the training set
			$model = svm_train($xtrain, $ytrain, $cval, 'gaussian_kernel', $sigma);
			$elapsed = microtime(TRUE) - $start;
			echo "training completed in $elapsed seconds\n";

			// assess the model with the xtrain set
			$results = svm_assess($model, $xtrain, $ytrain);
			echo "XTRAIN\n";
			print_r($results);
			$train_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results				
			);

			// assess the model with the xval set
			$results = svm_assess($model, $xval, $yval);
			echo "XVAL\n";
			print_r($results);
			$val_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results
			);

			// TODO maybe optimize for f_score instead?
			// find a way to punish false positives more or we risk throwing out good messages as spam
			$correct_percent = $results['correct_percent'];
			if (is_null($best_c) || $correct_percent > $best_correct_percent) {
				$best_c = $cval;
				$best_sigma = $sigma;
				$best_results = $results;
				$best_correct_percent = $correct_percent;
				$best_model = $model;
			}

			$result_idx++;

		} // foreach sigma
	} // foreach c

	echo "\n=====\n";
	echo "Best C is $best_c, best sigma is $best_sigma, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'sigma' => $best_sigma,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];
}
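
The hard-coded $cvals and $sigma_vals lists in the two sweep functions are the powers of 1.4 from 1.4^-10 up to 1.4^10, so they could be generated rather than pasted in. A small sketch, assuming a helper named geometric_grid (my name, not something in svm-fns.php):

```php
<?php
// Hypothetical helper: returns base^-steps ... base^steps, i.e. values that
// are evenly spaced on a log scale and centered on 1.
function geometric_grid(float $base = 1.4, int $steps = 10): array {
	$vals = [];
	for ($e = -$steps; $e <= $steps; $e++) {
		$vals[] = $base ** $e;
	}
	return $vals;
}

$cvals = geometric_grid();
echo count($cvals), " values from ", $cvals[0], " to ", $cvals[count($cvals) - 1], "\n";
```

With the defaults this yields 21 values, which means the gaussian sweep trains 21 x 21 = 441 models. A smaller $steps or a larger $base gives a coarser, faster first pass.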

Steve_R_Jones · Dec 8, 2023

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> The username ":could be" from a spammer.... So could any user name that starts with SNEAKY.

sneakyimp · Dec 8, 2023

Steve_R_Jones Because it isn't SPAM

Are you sure? Because it's their only post ever, and completely unrelated to this thread. I think it's quite likely that this user's second post, if it ever appears, will have a spam link in it.

And @Steve_R_Jones, there appears to be an off-by-one error in the logic that sends response notifications for this site. Your response #11100226 prompted the site to send me this email:

Hey sneakyimp!

sneakyimp replied to your post (#6) in spam filter findings?.

(LINK WAS HERE)

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> The username "could be" from a spammer... So could any username that starts with SNEAKY.

Steve_R_Jones · Dec 9, 2023

Around here, everyone is "innocent until proven guilty." Otherwise, no one could last as long as you have.

The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

Weedpacket · Dec 12, 2023

I'd just appreciate an effective spam filter...

sneakyimp · Dec 14, 2023

Steve_R_Jones Around here, everyone is "innocent until proven guilty." Otherwise, no one could last as long as you have.

I like to think my posts have been topical, and a reasonable person (or spam filter) could easily distinguish them from posts that contribute precisely nothing to the conversation at hand.

Steve_R_Jones The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

I'd be happy to donate some free time to examine the problem. I was also imagining that the spam filter discussion here might help facilitate content moderation on this forum.

Weedpacket I'd just appreciate an effective spam filter...

@Weedpacket agreed! The ratio of spam to ham here is somewhat lamentable. Even the ham we get seems confused and poorly expressed.
