spam filter findings?

IdealCleaning · Nov 4, 2023

Thanks for sharing this info sir@sneakyimp#11099876

sneakyimp · Dec 7, 2023

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. I'd love to hear from the phpbuilder community about this spam filter concept. I'll start posting some code here, later today I hope.

sneakyimp · Dec 7, 2023

I hope to relay the basic techniques I learned, originally in Matlab/Octave, to implement a machine learning spam filter. I'll briefly mention that this approach is simple, and works by checking for the presence or absence of the words in some vocabulary generated from a corpus of messages. For this approach to generalize well, you must only consider words that are fairly common in your message corpus. It's not good to consider any words that appear in only one message -- or don't appear in any messages at all. You also need a lot of messages for it to work well. A few thousand at least.

To start, we must have a corpus of messages, each labeled as ham or spam, and ideally split somewhat evenly between the two. In my case, I didn't have several thousand messages, so I ended up padding out my own message corpus with the old SpamAssassin corpus.

Because this data set is fairly large, and because it is computationally intensive, the approach here typically reads messages file by file from several directories, and each stage of processing tends to generate JSON data of its computation.

I have this config file which specifies some file locations with brief descriptions:

// === config.php===
/**
 * contains JSON-encoded array of the files we have chosen for our corpus
 * along with an is_spam indication and a strip_headers flag if we need to strip
 * email headers preceding the message body before processing
 */
define('JSON_FILE_CHOSEN_CORPUS_FILES', 'chosen_corpus_files.json');

/**
 * array with all stemmed/normalized/adjusted words in our chosen corpus as keys
 * and the values being the number of files in our corpus that contain each word
 * NOTE: the frequency values are NOT the number of times the word appears in the corpus
 * the frequency values are the number of FILES that contain that word one or more times
 */
define('JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES', 'chosen_corpus_word_frequencies.json');

/**
 * This file defines an array of all the words we will consider when training our algorithm
 * The number of words in it varies depending on various parameters we choose
 */
define('JSON_FILE_CHOSEN_CORPUS_VOCAB', 'chosen_corpus_vocab.json');

/**
 * An array listing all the files in the SpamAssassin corpus, along with is_spam flag
 * e.g., each element is associative array ['file' => '/full/path/to/file', 'is_spam' => 1]
 */
define('JSON_FILE_SA_CORPUS_FILES', 'sa_corpus_files.json');

/**
 * An array listing all the files in MY corpus, along with is_spam flag
 * e.g., each element is associative array ['file' => '/full/path/to/file', 'is_spam' => 1]
 */
define('JSON_FILE_MY_CORPUS_FILES', 'my_corpus_files.json');

We need to decide what our vocabulary will be, so I have this script, analyze-corpus-word-frequency.php, which looks at our corpus and analyzes what words appear in the messages, and, for each word, how many message files contain that word at least once. As you'll see, this script does a bit more because I'm trying to pad out my own modest message corpus with files from the SpamAssassin corpus:

$current_dir = dirname(__FILE__) . '/';


// CONFIG
echo "this script has configurable parameters. See config.php\n";
require_once $current_dir . 'config.php';

// TODO move this to config.php
$sa_file_count = 2000;



// === FIRST, randomly pick this many files from the SA corpus
// this number should be big enough to get a nice big training sample
// but not so large that we let the SA corpus totally dwarf MY corpus
echo "we will choose $sa_file_count files from the SA corpus\n";



// we need fns in here to retrieve paths to corpus files
require_once 'list-fns.php';

$start = microtime(TRUE);

// retrieves ~6k files as of this writing as array of associative arrays with file and is_spam keys
// e.g. ["file" => "/full/path/to/file", "is_spam" => 1]
$sa_corpus_files = list_sa_corpus();
echo sizeof($sa_corpus_files), " found in SA corpus\n";
// the file system may not always return the list of files in the same order, so i figured i'd
// store the last list of files returned
$sa_corpus_loc = $current_dir . JSON_FILE_SA_CORPUS_FILES;
file_put_contents($sa_corpus_loc, json_encode($sa_corpus_files));
echo "SA corpus file list written to $sa_corpus_loc\n";


// I originally thought shuffling just the keys would be faster, but
// shuffling the entire sa_corpus_files array is just as fast so
// deprecated this
// create a randomly shuffled array of the sa corpus indexes
//$random_sa_keys = array_keys($sa_corpus_files);
//shuffle($random_sa_keys);
//echo sizeof($random_sa_keys), " keys randomly shuffled, saving to random_sa_keys.json\n";
//file_put_contents($current_dir . 'random_sa_keys.json', json_encode($random_sa_keys));

// randomly shuffle the sa corpus files
shuffle($sa_corpus_files);
// slice off the first $sa_file_count of them
$sa_subset = array_slice($sa_corpus_files, 0,$sa_file_count);
echo sizeof($sa_subset) . " files sliced off SA corpus\n";
// free up memory, don't need entire corpus any more
unset($sa_corpus_files);

// The SA corpus files have email headers, so we set a flag that tells us
// to strip out the mail headers in the SA files in later stages of processing
for($i=0; $i<sizeof($sa_subset); $i++) {
	$sa_subset[$i]['strip_headers'] = 1;
}


// fetch MY corpus
$my_corpus_files = list_my_corpus();
$my_corpus_file_count = sizeof($my_corpus_files);
echo "$my_corpus_file_count files found in MY corpus\n";
$my_corpus_loc = $current_dir . JSON_FILE_MY_CORPUS_FILES;
file_put_contents($my_corpus_loc, json_encode($my_corpus_files));
echo "MY corpus file list written to $my_corpus_loc\n";

// mark a flag that tells us NOT to strip out the mail headers in the MY files
for($i=0; $i<$my_corpus_file_count; $i++) {
	$my_corpus_files[$i]['strip_headers'] = 0;
}

$chosen_corpus_files = array_merge($sa_subset, $my_corpus_files);
echo sizeof($chosen_corpus_files), " files in chosen corpus.\n";
echo "shuffling chosen corpus files...";
shuffle($chosen_corpus_files);
$chosen_filename = $current_dir . JSON_FILE_CHOSEN_CORPUS_FILES;
file_put_contents($chosen_filename, json_encode($chosen_corpus_files));
echo "chosen corpus files written to $chosen_filename\n";

// free up memory/resources
unset($sa_subset);
unset($my_corpus_files);

// how many ham/spam
$ham = 0;
$spam = 0;
foreach($chosen_corpus_files as $cf) {
	if ($cf['is_spam']) {
		$spam++;
	} else {
		$ham++;
	}
}
echo "$ham ham files\n";
echo "$spam spam files\n";
echo ($ham/($ham+$spam))*100 . " percent of " . ($spam+$ham) .  " files are ham\n";


// we will need these functions
require_once 'text-fns.php';

// this associative array will contain our word frequencies
// word => number_of_files_containing_the_word
$word_frequencies = [];
$chosen_corpus_file_count = sizeof($chosen_corpus_files);
echo "processing $chosen_corpus_file_count files from chosen corpus\n";
foreach($chosen_corpus_files as $i => $cf) {
	if (($i % 100) == 0) {
		echo "processing $i of $chosen_corpus_file_count\n";
		echo "word_frequencies length: ", sizeof($word_frequencies), "\n";
	}

	$file = $cf['file'];

	$contents = file_get_contents($file);
	// we have to strip_headers for SA corpus, so second optional param is TRUE
	// this returns an array of the massaged, unique words, no duplicates
	$words = pre_process_message($contents, (bool)$cf['strip_headers']);
	unset($contents);

	// since $words contains no duplicates, it indicates simply that the word
	// exists in this file, that lets us do a simple increment operation
	foreach($words as $word) {
		if (array_key_exists($word, $word_frequencies)) {
			// increment it
			$word_frequencies[$word]++;
		} else {
			// set it to 1
			$word_frequencies[$word] = 1;
		}
	}
}


// SORT the word frequencies by freq desc
function wfsort($a, $b) {
    if ($a == $b) {
        return 0;
    }
    return ($a > $b) ? -1 : 1;
}
uasort($word_frequencies, "wfsort");
//var_dump($word_frequencies);
$wf_filename = $current_dir . JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES;
file_put_contents($wf_filename, json_encode($word_frequencies));
echo "word frequencies written to $wf_filename\n";
echo sizeof($word_frequencies) . " words in word_frequencies\n";
echo "COMPLETED in " . (microtime(TRUE) - $start) . " seconds\n";

Some notes:

This script selects $sa_file_count messages at random from the SpamAssassin corpus, combines them with my entire message corpus, and randomly shuffles the combined list. It saves JSON files listing the file paths and their ham/spam classification.
It counts how many ham and spam messages there are and reports the proportion. To generalize well, this proportion should not be too lopsided.
It uses a function, pre_process_message, to reduce each message to just an array of the words in it. I'll post this function below, but it does various things: it eliminates trivial stop words, changes all currency symbols to pprocurrency, converts all URLs to pprohttpaddr, etc. This is to avoid overfitting.
It counts how many message files contain each word it encounters, sorts the array of these words and frequencies, and writes it to a JSON file.
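
To make the "number of files, not number of occurrences" point concrete, here is a minimal sketch on three toy messages. The helper names (toy_unique_words, toy_document_frequencies) are hypothetical and only stand in for pre_process_message and the counting loop in the script above; there is no stemming or stop-word removal here.

```php
<?php
// Toy illustration of document frequency: each word is counted at most
// once per message, no matter how often it appears in that message.
// toy_unique_words() is a hypothetical stand-in for pre_process_message().
function toy_unique_words(string $message): array {
	$words = preg_split('/[^a-z0-9]+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);
	return array_unique($words);
}

// Returns word => number of messages containing that word, highest first.
function toy_document_frequencies(array $messages): array {
	$freq = [];
	foreach ($messages as $message) {
		foreach (toy_unique_words($message) as $word) {
			$freq[$word] = ($freq[$word] ?? 0) + 1;
		}
	}
	arsort($freq); // sort by frequency descending, keeping the word keys
	return $freq;
}

$toy_messages = [
	'free free free money',
	'free lunch today',
	'lunch meeting today',
];
print_r(toy_document_frequencies($toy_messages));
```

Note that "free" ends up with a count of 2 even though it occurs four times in total, because it appears in only two of the three messages.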

Any suggestions or observations welcome.

sneakyimp · Dec 7, 2023

Here is text-fns.php, which defines our pre_process_message fn, among other things.

EDIT: it looks like the PorterStemmer code uses curly brace syntax to access chars in a string, which was deprecated in PHP 7.4 and removed in PHP 8.0.
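
If you want to patch PorterStemmer.php by hand rather than stay on PHP 7, the change should be mechanical: each $str{$i} becomes $str[$i]. A minimal sketch of the bracket syntax follows; last_char() is a hypothetical helper for illustration and is not part of the stemmer.

```php
<?php
// Old syntax (a parse error in PHP 8): $first = $word{0};
// Bracket syntax does the same thing and works in every PHP version.

// Hypothetical helper, for illustration only; not part of PorterStemmer.php.
function last_char(string $word): string {
	// guard against empty strings, where an offset read would raise a warning
	return $word === '' ? '' : $word[strlen($word) - 1];
}

$word = 'stemming';
echo $word[0], "\n";          // first character of the word
echo last_char($word), "\n";  // last character of the word
```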

text-fns.php
define('MIN_WORD_LEN', 2);
define('MAX_WORD_LEN', strlen('Supercalifragilisticexpialidocious'));

// massaged this from the MySQL default stopwords
// @see https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html
$stop_words = ['a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','i','in','is','it','la','no','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und'];


// porter stemmer downloaded from here:
// https://tartarus.org/martin/PorterStemmer/php.txt
require_once dirname(__FILE__) . '/PorterStemmer.php';

// function which decides which words to keep
function keep_word($word) {
	global $stop_words;

	// get rid of words too short or too long
	$len = strlen($word);
	if ($len < MIN_WORD_LEN || $len > MAX_WORD_LEN) {
		return FALSE;
	}

	// get rid of stop words
	if (in_array($word, $stop_words)) {
		return FALSE;
	}

	// otherwise, keep it
	return TRUE;
}


// this is the function we use to pre-process each message,
// reducing it to just important words. it removes HTML,
// optionally email headers, changes everything to lowercase,
// converts any hyperlinks to the same tag (this presumably
// to prevent overfitting)
// this based on coursera/ex6 assignment except it returns an
// array of unique, porter-stemmed words, no duplicates
function pre_process_message($email_contents, $strip_headers=FALSE) {

	global $stop_words;


// html
//$f = 'sa-corpus/spam_2/01277.6763a79fad1f1b39cb7d5b7faf92ea98';
// email
//$f = 'sa-corpus/spam_2/00410.fb7b31cdd9d053f8b446da7ce89383fa';
// $$$$
//$f = 'sa-corpus/spam_2/00830.079ed7d24f78024e023b82417a6fe2ca';
// £ / GBP char
//$f = 'sa-corpus/hard_ham/00220.11f0abf371588687744d151b46903087';
//	$email_contents = file_get_contents($f);

	if ($strip_headers) {
		// i did a grep search for \r\n as suggested here:
		// https://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-on-linu
		// only one file in sa corpus: sa-corpus/spam_2/00083.1aead789d4b4c7022c51bc632e4f2445
		// and it looks like the \r\n endings in that are NOT in the headers so i'll keep this simple
		// to trim email headers, just look for first double newline and take everything after
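		$body = strstr($email_contents, "\n\n");

		// check for FALSE before trimming: trim(FALSE) returns an empty string,
		// so testing the trimmed value would never detect a missing header break
		if ($body === FALSE) {
			// FIXME remove this exception and just set $clean_contents = trim($email_contents)
			throw new Exception("no email header found, just use entire email");
		}
		$clean_contents = trim($body);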
	} else {
		$clean_contents = trim($email_contents);
	}


	// strip all NULL bytes, HTML and PHP tags
	$clean_contents = strip_tags($clean_contents);

	// to avoid overfitting, change all numbers to just PREPRONUMSEQUENCE or
	// something unlikely to appear in the corpus
	$clean_contents = preg_replace('/\d+/', 'ppronumseq', $clean_contents);

	// to avoid overfitting, change all urls to just PREPROHTTPADDR or
	// something unlikely to appear in the corpus
	// TODO this could be improved a lot
	$url_pattern = '@(http|https)://[^\s]+@';
//	$matches = NULL;
//	$count = preg_match_all('@(http|https)://[^\s]+@', $clean_contents, $matches);
//var_dump($matches);
//die();
	$clean_contents = preg_replace($url_pattern, 'pprohttpaddr', $clean_contents);

	// to avoid overfitting, replace email addresses with PREPROEMAILADDR
	$clean_contents = preg_replace('/[^\s]+@[^\s]+/', 'pproemailaddr', $clean_contents);

	// dollar signs are a common spam element
	// TODO it might be worthwhile to apply some kind of linear or log scale to
	// multiple currency chars, i.e., more chars would be counted rather than binary yes/no
	$clean_contents = preg_replace('/(\$|£)+/', 'pprocurrency', $clean_contents);


	// we prob won't have any numbers at this point, but to get rid of punctuation,
	// replace any non-alphanumeric chars with a space
	$clean_contents = trim(preg_replace('/[^a-zA-Z0-9]+/', ' ', $clean_contents));

	// convert to lowercase...this yields a pretty clean representation of the mail,
	// just spaces and alphanumeric strings
	$clean_contents = strtolower($clean_contents);

	// split into words
	$clean_words = preg_split('/\s+/', $clean_contents);
	// don't need this var any more, save memory
	unset($clean_contents);

	// get rid of duplicate words
	$clean_words = array_unique($clean_words);

	// filter out stop words and too long / too short ones
	$clean_words = array_values(array_filter($clean_words, 'keep_word'));

	// apply the porter stemmer
	$clean_words = array_map('PorterStemmer::Stem', $clean_words);
	// may introduce duplicates so get rid of duplicate words again
	$clean_words = array_unique($clean_words);


	return $clean_words;
}


// the functions that follow all do the same, but using a different approach
function feature_vector_1($vocab, $words) {
	$out = array_fill_keys($vocab, 0);
	foreach($words as $w) {
		if (array_key_exists($w, $out)) {
			$out[$w] = 1;
		} else {
			// word is not in vocab, ignore it
		}
	}
	return array_values($out);
}

function feature_vector_2($vocab, $words) {
	$a = array_fill_keys($vocab, 0);
	$b = array_fill_keys(array_intersect($vocab, $words), 1);
	return array_values(array_merge($a, $b));
}

function feature_vector_3($vocab, $words) {
	$a = array_fill_keys($vocab, 0);
	$b = array_fill_keys(array_intersect($vocab, $words), 1);
	return array_values(array_replace($a, $b));
}

function feature_vector_4($vocab, $words) {
	return array_map(fn($v) => (int)in_array($v, $words), $vocab);
}

function feature_vector_5($vocab, $words) {
	$flipped_words = array_flip($words); // This would also uniquify the "flipped" word list automatically
	return array_map(fn($v) => (int)isset($flipped_words[$v]), $vocab);
}

function feature_vector_6($vocab, $words) {
	return array_map(fn($v) => (int)in_array($v, $words), $vocab);
}

NOTE: feature_vector is a compact array of binary values (there are no associative keys) that indicates the presence/absence of each of our vocabulary words in a single message. These are useful for later stages of processing.
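
As a quick illustration of what such a vector looks like, here is a minimal sketch using the same idea as feature_vector_5. The function name toy_feature_vector and the five-word vocabulary are made up for the example; a real vocabulary would have a thousand or more entries.

```php
<?php
// Minimal sketch of the presence/absence encoding (same idea as feature_vector_5).
// toy_feature_vector() and the toy vocabulary are hypothetical, for illustration.
function toy_feature_vector(array $vocab, array $words): array {
	// flip so each word becomes a key, giving constant-time lookups
	$present = array_flip($words);
	// one 0/1 entry per vocab word, in vocab order
	return array_map(fn($v) => (int) isset($present[$v]), $vocab);
}

$toy_vocab = ['free', 'lunch', 'meet', 'monei', 'todai'];
$toy_words = ['free', 'monei', 'zzz']; // 'zzz' is not in the vocab, so it is ignored

echo json_encode(toy_feature_vector($toy_vocab, $toy_words)), "\n"; // prints [1,0,0,1,0]
```

The vector always has exactly as many entries as the vocabulary, regardless of how long the message is.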

sneakyimp · Dec 7, 2023

By running the analyze-corpus-word-frequency.php script above, you get chosen_corpus_word_frequencies.json, which contains an array of all words appearing in your corpus as keys and the value of each key indicates how many message files that word appears in. E.g.:

    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
...
    [ahg] => 1
    [wdppronumseqfiiv] => 1
    [cevtr] => 1
    [wcrtegxypdpppronumseq] => 1
    [wdppronumseqnv] => 1
    [wcppronumseqfppronumseq] => 1
    [dpppronumseqf] => 1
    [bjdrenx] => 1
    [svatshenk] => 1

sneakyimp · Dec 7, 2023

Once we've analyzed the corpus to determine which words appear and how often, we run this script to select the most popular words for our vocabulary. NOTE that this vocabulary will dictate the feature vector that we use to train our classifier on existing messages and also to assess the ham/spam classification of incoming, novel messages:

// generate-vocab.php

<?php

// this script loads the word frequency data (generated at significant computational expense)
// from analysis of the thousands of files in the SA and MY corpus and tells us the most commonly
// appearing words and how many files of the chosen corpus that they appear in
// lastly, it generates a vocab of words used with sufficient frequency to serve
// as our vocab

// IMPORTANT: You'll need to make sure you have the correct, latest word
// frequency analysis data in the file chosen_corpus_word_frequencies.json
// we generate/store that in a separate script because it involves munging
// thousands of text documents -- computationally expensive

// you'll need to adjust MIN_FREQ so that you end up
// with about 1-2k words in your vocab

// our vocabulary will only include words that appear
// in at least this many messages in the chosen corpus
// TODO move this to config.php
define('MIN_FREQ', 20);

$dest_dir = dirname(__FILE__) . '/';

// CONFIG
echo "this script has configurable parameters. See config.php\n";
require_once $dest_dir . 'config.php';


// TODO put this filename in some config somewhere, shared with the freq analysis script
$freq_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES;
echo "loading $freq_file\n";
$word_frequencies = json_decode(file_get_contents($freq_file), TRUE);


echo "corpus frequencies loaded, found " . sizeof($word_frequencies) . " words\n";

// should already be sorted by freq desc
$top = array_slice($word_frequencies, 0, 40);
echo "TOP WORDS\n";
print_r($top);

$vocab = [];
foreach($word_frequencies as $word => $freq) {
        if ($freq < MIN_FREQ) {
                // that's it! no more
                break;
        }

        $vocab[] = $word;
}

// sort alphabetically
// TODO a bit of performance profiling suggests we should NOT sort this
// array alphabetically, but rather by popular word first for better
// lookup performance?
sort($vocab);

echo sizeof($vocab) . " words in the vocab\n";

$vocab_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
echo "Saving vocab to $vocab_file\n";
file_put_contents($vocab_file, json_encode($vocab));

Sample output in my case:

$ php generate-vocab.php
this script has configurable parameters. See config.php
loading /home/jaith/biz/machine-learning/chosen_corpus_word_frequencies.json
corpus frequencies loaded, found 28084 words
TOP WORDS
Array
(
    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
    [if] => 942
    [pproemailaddr] => 930
    [can] => 914
    [mail] => 874
    [all] => 840
    [get] => 763
    [do] => 757
    [but] => 752
    [we] => 739
    [so] => 674
    [more] => 674
    [just] => 673
    [here] => 663
    [on] => 663
    [out] => 642
    [time] => 633
    [my] => 631
    [email] => 628
    [new] => 620
    [there] => 611
    [up] => 605
    [our] => 587
    [onli] => 577
    [ani] => 577
    [ha] => 576
    [now] => 556
    [like] => 551
    [work] => 536
    [messag] => 536
    [thei] => 533
    [inform] => 530
    [free] => 510
)
1955 words in the vocab
Saving vocab to /home/jaith/biz/machine-learning/chosen_corpus_vocab.json

This writes our vocab to chosen_corpus_vocab.json.
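
The selection step boils down to a frequency threshold. Here is a minimal sketch on a handful of the frequencies shown earlier; toy_select_vocab is a hypothetical name, and unlike the script above it checks every entry rather than stopping at the first word below the threshold, so it does not depend on the input being sorted.

```php
<?php
// Minimal sketch of vocabulary selection by document-frequency threshold.
// toy_select_vocab() is a hypothetical name used only for this example.
function toy_select_vocab(array $word_frequencies, int $min_freq): array {
	$vocab = [];
	foreach ($word_frequencies as $word => $freq) {
		if ($freq >= $min_freq) {
			// cast because PHP turns numeric-looking array keys into integers
			$vocab[] = (string) $word;
		}
	}
	sort($vocab); // alphabetical, as in generate-vocab.php
	return $vocab;
}

$toy_frequencies = [
	'ppronumseq' => 1784,
	'you'        => 1344,
	'free'       => 510,
	'ahg'        => 1,
	'cevtr'      => 1,
];

echo json_encode(toy_select_vocab($toy_frequencies, 20)), "\n"; // prints ["free","ppronumseq","you"]
```

Raising the threshold shrinks the vocabulary and lowering it grows the vocabulary, which is how you would tune MIN_FREQ toward the 1-2k word target.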

sneakyimp · Dec 7, 2023

Once we have our JSON files containing our corpus of chosen message files and our vocabulary, we can generate our training, cross validation, and test arrays.

NOTE: once again, PorterStemmer will barf in php 8, so run this with php 7 or fix it.

// === generate-training-sets.php ===
/**
 * This script loads the previously generated list of our corpus
 * files (which has already been randomly shuffled) and generates
 * the matrix of vectors we use to train our machine. That 
 * matrix will be a PHP array, containing one entry for each
 * file in our corpus. Each entry will itself be an array with
 * one entry for each word in our vocabulary.  We will then
 * break up this ALL matrix into train, validation, and test
 * subsets (60/20/20) and store those in a JSON file for use
 * by our training algorithm.
*/

$dest_dir = dirname(__FILE__) . '/';

require_once $dest_dir . 'config.php';

// load our file corpus
// NOTE each element should be an associative array specifying file, is_spam, and strip_headers
// e.g.:  ['file' => '/full/path/to/file', 'is_spam' => 1, 'strip_headers' => 0]
$corpus_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_FILES;
$corpus_files = json_decode(file_get_contents($corpus_json_file), TRUE);
$corpus_file_count = sizeof($corpus_files);
echo "$corpus_file_count files loaded from $corpus_json_file\n";

$vocab_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
$vocab = json_decode(file_get_contents($vocab_json_file), TRUE);
$vocab_word_count = sizeof($vocab);
echo "$vocab_word_count words loaded from $vocab_json_file\n";

// we encountered some character encoding problems with regex in octave/matlab
// do we need to declare a charset for these files or perform some kind of charset conversion?
// before we start processing these emails, we need to set the charset
// or it can barf doing some regex
// NOTE: this does not appear to cause trouble in PHP
//__mfile_encoding__ ("iso-8859-1");


// we will need these functions to analyze each message
// TODO give this file a more meaningful name
require_once 'text-fns.php';


// this is our master array of feature vectors
$Xall = [];
// this is our master array of y scores
$yall = [];

echo "processing $corpus_file_count files from chosen corpus\n";
$start = microtime(TRUE);
foreach($corpus_files as $i => $cf) {
//	echo "processing " . $cf['file'] . "\n";

        if (($i % 100) == 0) {
                echo "processing $i of $corpus_file_count\n";
        }

        $file = $cf['file'];

        $contents = file_get_contents($cf['file']);
	if (!$contents) {
		throw new Exception($cf['file'] . ' could not be fetched, false or empty returned');
	}

        // this returns an array of the massaged, unique words, in a message -- no duplicates
	// TODO we will probably want to remove 2nd strip_headers param when we port this to
	// a website application, it's just a quirk of the SA data corpus
        $words = pre_process_message($contents, (bool)$cf['strip_headers']);
        unset($contents);

	// take our $vocab and $words to generate a feature vector
	// which is just 0s and 1s, indicating which vocab words are in the current message
	$Xall[] = feature_vector_1($vocab, $words);
	$yall[] = $cf['is_spam'];

}
$elapsed = microtime(TRUE) - $start;
echo "all $corpus_file_count messages processed in $elapsed seconds\n";

echo sizeof($Xall) . " records in Xall\n";
echo sizeof($yall) . " records in yall\n";


// save processed training data
$data_file_all = $dest_dir . 'training_data_all.json';
echo "saving Xall and yall to $data_file_all\n";
file_put_contents($data_file_all, json_encode(['Xall' => $Xall, 'yall' => $yall]));

// divide up the entire set into subsets for training/validation/testing
$row_count = sizeof($yall);

if ($row_count < 10) {
	die("you don't even have 10 training examples. I refuse to finish\n");
}

// to split in 3 groups of 60/20/20, we need 2 cuts
$cut1 = round($row_count * .6);
echo "cut1 $cut1\n";
$cut2 = round($row_count * .8);
echo "cut2 $cut2\n";


$Xtrain = array_slice($Xall, 0, $cut1);
echo sizeof($Xtrain) . " elements in Xtrain\n";

$ytrain = array_slice($yall, 0, $cut1);
echo sizeof($ytrain) . " elements in ytrain\n";


$Xval = array_slice($Xall, $cut1, ($cut2-$cut1));
echo sizeof($Xval) . " elements in Xval\n";

$yval = array_slice($yall, $cut1, ($cut2-$cut1));
echo sizeof($yval) . " elements in yval\n";

$Xtest = array_slice($Xall, $cut2);
echo sizeof($Xtest) . " elements in Xtest\n";

$ytest = array_slice($yall, $cut2);
echo sizeof($ytest) . " elements in ytest\n";

echo "X element total = " . (sizeof($Xtrain) + sizeof($Xval) + sizeof($Xtest)) . "\n";
echo "y element total = " . (sizeof($ytrain) + sizeof($yval) + sizeof($ytest)) . "\n";

// save data sets
$data_file_sets = $dest_dir . 'training_data_sets.json';
echo "saving training, validation, and test sets to $data_file_sets\n";
file_put_contents($data_file_sets, json_encode([
	'Xtrain' => $Xtrain,
	'ytrain' => $ytrain,
	'Xval' => $Xval,
	'yval' => $yval,
	'Xtest' => $Xtest,
	'ytest' => $ytest,
]));
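
The 60/20/20 arithmetic can be checked in isolation. This is a minimal sketch with a hypothetical toy_split function and a row count of 2291 chosen for illustration; it assumes the rows have already been shuffled, as the script above does.

```php
<?php
// Minimal sketch of a 60/20/20 split using two cut points.
// toy_split() is a hypothetical helper; it assumes $rows is already shuffled.
function toy_split(array $rows, float $train = 0.6, float $val = 0.2): array {
	$n = count($rows);
	// cast to int because round() returns a float
	$cut1 = (int) round($n * $train);
	$cut2 = (int) round($n * ($train + $val));
	return [
		'train' => array_slice($rows, 0, $cut1),
		'val'   => array_slice($rows, $cut1, $cut2 - $cut1),
		'test'  => array_slice($rows, $cut2),
	];
}

$sets = toy_split(range(1, 2291));
foreach ($sets as $name => $subset) {
	echo $name, ': ', count($subset), "\n";
}
```

Because the slices are taken from consecutive, non-overlapping ranges, every row lands in exactly one subset and the three counts always add up to the original row count.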

sneakyimp · Dec 7, 2023

OK so once we have the training/test/validation set feature vectors in a big fat JSON file, we can train a Support Vector Machine. Glossing over a bunch of details, this is the top-level script. I'm wondering if I should maybe just put this in a repo on github? If anyone is curious, LMK.

// === train-svm.php ===

/**
 * this script trains a Support Vector Machine (i.e., a machine
 * learning classifier algorithm) to determine if an email message
 * is spam or ham
*/

// Load the Spam Email dataset
// You will have Xtrain, ytrain, Xval, yval, Xtest, ytest
// we expect this data to be randomly shuffled, and to contain
// feature vectors from both the old SA corpus and MY corpus
$data_file = __DIR__ . '/training_data_sets.json';
echo "loading data from $data_file\n";
$data = json_decode(file_get_contents($data_file), TRUE);

// output some data points
$data_keys = ['Xtrain', 'ytrain', 'Xval', 'yval', 'Xtest', 'ytest'];
foreach($data_keys as $key) {
	if (!array_key_exists($key, $data)) {
		throw new Exception("data file did not define key=$key");
	}
	echo "$key has " . sizeof($data[$key]) . " elements\n";
}

// we will be using ghostjat/np for matrix operations
// NOTE: i vaguely recall having modified this library to get it to return the same
// results as octave for some matrix operation
require_once __DIR__ . '/np/vendor/autoload.php';
use Np\matrix;
use Np\vector;

$x_matrix = matrix::ar($data['Xtrain']);
$y_matrix = vector::ar($data['ytrain']);

// trial and error suggested C=1 would be best, but *shrug*
//$C = 3;
// testing
$C = 0.1;
echo "training model with C=$C\n";


// FIXME before we can call our svm-fns we need to load BLAS because Np depends on it
// need to put it somewhere correct? this is awkward
Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/np/vendor/ghostjat/np/src/core/blas.h');
require_once 'svm-fns.php';

echo "Training SVM (Spam Classification)\n";
echo "(this may take 1 to 2 minutes) ...\n";
// train the model
// best practice is to use separate training/validation/test sets
$start = microtime(TRUE);
$model = svm_train($x_matrix, $y_matrix, $C, 'linear_kernel');
echo "training completed in " . (microtime(TRUE) - $start) . " seconds\n";

// $model will be an associative array with these keys:
// kernel_fn => string specifying kernel function
// b =>  float
// x => matrix of input x vectors (0s/1s in spam example) where alpha > 0
// y => vector of input y classifications (0/1 in spam example) where alpha > 0
// alphas => vector of alpha values calculated during training
// w => vector of weights, one float for each input feature

// convert $model to something we can JSON_ENCODE, i.e., only
// basic data types instead of Np\vector or Np\matrix
$to_array = ['x', 'y', 'alphas', 'w'];
$model_to_save = [];
foreach($model as $key => $val) {
	if (in_array($key, $to_array)) {
		$model_to_save[$key] = $val->asArray();
	} else {
		$model_to_save[$key] = $val;
	}
}

// save the model so we can crack it open and test it
// without having to retrain the entire model all over again.
$model_data_file = __DIR__ . '/trained-svm-model.json';
echo "Writing trained SVM model params to $model_data_file\n";
file_put_contents($model_data_file, json_encode($model_to_save));

echo "\nEvaluating the trained Linear SVM on TRAINING set ...\n";
$x = Np\matrix::ar($data['Xtrain']);
$y = Np\vector::ar($data['ytrain']);
$results = svm_assess($model, $x, $y);
print_r($results);

echo "\nEvaluating the trained Linear SVM on Xtest set ...\n";
$x = Np\matrix::ar($data['Xtest']);
$y = Np\vector::ar($data['ytest']);
$results = svm_assess($model, $x, $y);
print_r($results);


die("DONE\n");

I'll post the svm-fns.php in the next post.

sneakyimp · Dec 7, 2023

But first, sample output. As you can see, with C=1 (this run used C=1 rather than the C=0.1 test value in the script above), the trained SVM classifies the messages in our training set with 99.9% accuracy, and the messages in the test set (i.e., new messages it has not seen) with 93.0% accuracy:

loading data from /home/sneakyimp/biz/machine-learning/training_data_sets.json
Xtrain has 1375 elements
ytrain has 1375 elements
Xval has 458 elements
yval has 458 elements
Xtest has 458 elements
ytest has 458 elements
training model with C=1
Training SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
train with C=1 and kernel_fn=linear_kernel
x matrix m=1375, n=1955
y vector size 1375
CALCULATING KERNEL
KERNEL CALC COMPLETE

Training.......................................................................
..........................................................................
MAX_PASSES (5) REACHED, training done
training completed in 82.92901802063 seconds
Writing trained SVM model params to /home/sneakyimp/biz/machine-learning/trained-svm-model.json

Evaluating the trained Linear SVM on TRAINING set ...
Array
(
    [x_samples] => 1375
    [x_features] => 1955
    [y_samples] => 1375
    [p_size] => 1375
    [correct_predictions] => 1373
    [true_positives] => 548
    [true_negatives] => 825
    [false_positives] => 2
    [false_negatives] => 0
    [precision] => 0.99636363636364
    [recall] => 1
    [f_score] => 0.99817850637523
    [correct_decimal] => 0.99854545454545
    [correct_percent] => 99.854545454545
    [elapsed_time] => 0.29263210296631
)

Evaluating the trained Linear SVM on Xtest set ...
Array
(
    [x_samples] => 458
    [x_features] => 1955
    [y_samples] => 458
    [p_size] => 458
    [correct_predictions] => 426
    [true_positives] => 155
    [true_negatives] => 271
    [false_positives] => 18
    [false_negatives] => 14
    [precision] => 0.89595375722543
    [recall] => 0.91715976331361
    [f_score] => 0.90643274853801
    [correct_decimal] => 0.93013100436681
    [correct_percent] => 93.013100436681
    [elapsed_time] => 0.097591161727905
)
DONE
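
The precision, recall, f_score, and correct_decimal figures in the test-set output are consistent with the standard definitions computed from the four confusion counts. Here is a minimal sketch that reproduces them; toy_metrics is a hypothetical function, not the svm_assess implementation, and it assumes none of the denominators are zero.

```php
<?php
// Minimal sketch: standard classification metrics from confusion counts.
// toy_metrics() is hypothetical and is not the svm_assess() implementation.
// Assumes non-zero denominators (at least one positive prediction and one actual positive).
function toy_metrics(int $tp, int $tn, int $fp, int $fn): array {
	$precision = $tp / ($tp + $fp); // of the messages flagged as spam, how many really were
	$recall    = $tp / ($tp + $fn); // of the real spam messages, how many were flagged
	return [
		'precision' => $precision,
		'recall'    => $recall,
		'f_score'   => 2 * $precision * $recall / ($precision + $recall),
		'accuracy'  => ($tp + $tn) / ($tp + $tn + $fp + $fn),
	];
}

// counts taken from the test-set output above
print_r(toy_metrics(155, 271, 18, 14));
```

On a spam filter, false positives (ham flagged as spam) are usually the costlier mistake, so precision is worth watching alongside overall accuracy.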

sneakyimp · Dec 7, 2023

Here are the contents of svm-fns.php, which do the training.

// === svm-fns.php ===
/**
 * Defines functions to train and use a Support Vector Machine
 */

/**
 * LINEARKERNEL returns a linear kernel between x1 and x2
 * NOTE that the incoming vectors x1 and x2 were originally both column vectors
 * of dimensions (vocab size) x 1
 */
function linear_kernel($x1, $x2) {
	
	// Ensure that x1 and x2 are column vectors
	// while this conversion may be necessary for a broad use of
	// this function, it is unnecessary in the svmTrain context
	// and probably hampers performance a tiny bit
	// x1 = x1(:); x2 = x2(:);

	// Compute the kernel
	// this should return a 1x1 matrix (scalar value?)

	return $x1->dot($x2);  // dot product, should yield scalar
}

/**
 * returns a radial basis function kernel between x1 and x2
 * sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2
 * and returns the value
 */
function gaussian_kernel(Np\vector $x1, Np\vector $x2, $sigma) {

	// NOTE the incoming vectors x1 and x2 are column vectors
	// of dimension (vocab_size) x 1

	// orig octave:
	//sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma^2));
	// NOTE sim will be a 1x1 result (a scalar value)

	return exp(-$x1->subtract($x2)->square()->sum() / (2 * $sigma*$sigma));
}

function get_gaussian_predict_k(Np\matrix $x, array $model) {

	// orig octave code in svmPredict
	// Vectorized RBF Kernel
	// This is equivalent to computing the kernel on every pair of examples
	//X1 = sum(X.^2, 2);
	//X2 = sum(model.X.^2, 2)';
	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
	//K = model.kernelFunction(1, 0) .^ K;
	//K = bsxfun(@times, model.y', K);
	//K = bsxfun(@times, model.alphas', K);
	//p = sum(K, 2);

	$x1 = $x->square()->sumRows();
	//echo "x1 ", $x1, "\n";

	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
	$x2 = $model['x']->square()->sumRows();
	//echo "x2 ", $x2, "\n";

	// need to build K.
	$K = $x->dot($model['x']->transpose())->multiply(-2);

	// do the inner bsxfun(plus...)
	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
	$kshape = $K->getShape();
	$km = $kshape->m;
	$kn = $kshape->n;
	$x2size = $x2->getSize();
	// $kn should match the dimensions of $x2
	// sanity check
	if ($x2size !== $kn) {
		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$x2size; $i++) {
		$x2val = $x2->data[$i];
		for($j=0; $j<$km; $j++) {
			// add the ith x2 value to the ith column of the jth row
			$K->data[($j * $kn) + $i] += $x2val;
		}
	}

	// do the outer bsxfun(plus...)
	// ghostjat has no means to add a COLUMN vector soooo we fake it
	$x1size = $x1->getSize();
	// $km should match the dimensions of $x1
	// sanity check
	if ($x1size !== $km) {
		throw new \Exception("x1 size ($x1size) does not match km ($km)");
	}
	// i are rows, j are columns
	for($i=0; $i<$x1size; $i++) {
		$x1val = $x1->data[$i];
		for($j=0; $j<$kn; $j++) {
			// add the ith x1 value to the jth column of the ith row
			//$offset = ($i * $kn) + $j;
			$K->data[($i * $kn) + $j] += $x1val;
		}
	}

	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
	//echo "kf ", $kf, "\n";
	$K = $K->map(fn($v) => (pow($kf, $v)));

	$mysize = $model['y']->getSize();
	// $kn should match the dimensions of $model['y']
	// sanity check
	if ($mysize !== $kn) {
		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$mysize; $i++) {
		$yval = $model['y']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith y value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $yval;
		}
	}


	$alphasize = $model['alphas']->getSize();
	// $kn should match the dimensions of $model['alphas']
	// sanity check
	if ($alphasize !== $kn) {
		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$alphasize; $i++) {
		$aval = $model['alphas']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith alpha value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $aval;
		}
	}

	return $K;
}


/**
 * trains an SVM classifier and returns trained model. X is the matrix of
 * training examples.  Each row is a training example, and the jth column
 * holds the jth feature.  Y is a column matrix containing 1 for positive
 * examples and 0 for negative examples.  C is the standard SVM regularization
 * parameter.  tol is a tolerance value used for determining equality of
 * floating point numbers. max_passes controls the number of iterations
 * over the dataset (without changes to alpha) before the algorithm quits.
 *
 * Note: This is a simplified version of the SMO algorithm for training
 * SVMs. In practice, if you want to train an SVM classifier, we
 * recommend using an optimized package such as:
 * 	LIBSVM   (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
 * 	SVMLight (http://svmlight.joachims.org/)
 */
function svm_train($x_matrix, $y_vector, $C, $kernel_fn, $sigma=null, $tol=0.001, $max_passes=5) {
	echo "train with C=$C";
	if (!is_null($sigma)) {
		echo ", sigma=$sigma";
	}
	echo " and kernel_fn=$kernel_fn\n";
	$shape = $x_matrix->getShape();
	echo "x matrix m={$shape->m}, n={$shape->n}\n";
	$size = $y_vector->getSize();
	echo "y vector size ", $size, "\n";

	$m = $shape->m;


	// map the 0s in y to -1; note this appears to be faster than vector->map() stuff
	// BIG FAT WARNING we have to make a copy of the $y_vector object because
	// changing it here apparently propagates those changes back up to the calling scope
	$yvec = Np\vector::ones($y_vector->getSize());
	$yndim = $y_vector->ndim;
	for($i=0; $i<$yndim; $i++) {
		if ($y_vector->data[$i] == 0) {
			$yvec->data[$i] = -1;
		}
	}


	// Pre-compute the Kernel Matrix since our dataset is small
	// (in practice, optimized SVM packages that handle large datasets
	// gracefully will _not_ do this)

	echo "CALCULATING KERNEL\n";
	// We have implemented optimized vectorized version of the Kernels here so
	// that the svm training will run faster.
	if ($kernel_fn === 'linear_kernel') {
		// Vectorized computation for the Linear Kernel
		// This is equivalent to computing the kernel on every pair of examples
		$K = $x_matrix->dot($x_matrix->transpose());
	} elseif ($kernel_fn === 'gaussian_kernel') {

		if (is_null($sigma)) {
			throw new Exception('You must provide a sigma value for gaussian kernel training');
		}

		// Vectorized RBF Kernel
		// This is equivalent to computing the kernel on every pair of examples

		// orig octave:
		// X2 = sum(X.^2, 2);
		// K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
		// K = kernelFunction(1, 0) .^ K;

		// a vector of size m (one entry per row of X), orig X2 was a column vector
		$x2 = $x_matrix->square()->sumRows();
		// need to build K. this gets us pretty far, calculating inner bsxfun somehow
		$K = $x_matrix->dot($x_matrix->transpose())->multiply(-2)->sum($x2);
		// ghostjat has no means to add a column vector soooo we fake it
		$kshape = $K->getShape();
		$km = $kshape->m;
		$kn = $kshape->n;
		$x2size = $x2->getSize();
		// $km should match the dimensions of $x2
		// sanity check
		if ($x2size !== $km) {
			throw new \Exception("x2 size ($x2size) does not match km ($km)");
		}
		for($i=0; $i<$x2size; $i++) {
			$x2val = $x2->data[$i];
			for($j=0; $j<$kn; $j++) {
				// add the ith x2 value to the jth column of the ith row
				//$offset = ($i * $kn) + $j;
				$K->data[($i * $kn) + $j] += $x2val;
			}
		}
		// free memory
		unset($x2);

		$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $sigma);
		$K = $K->map(fn($v) => (pow($kf, $v)));

	} else {
		// Pre-compute the Kernel Matrix
		// The following can be slow due to the lack of vectorization
		echo "NON-VECTORIZED, SLOW\n";
		$K = Np\matrix::zeros($m, $m);
		for ($i=0; $i<$m; $i++) {
			if ($i >0 && ($i % 10 == 0)) {
				echo "\tloop $i\n";
			}
			for ($j=0; $j<$m; $j++) {

				// original matlab/octave code
				//K(i,j) = kernelFunction(X(i,:)', X(j,:)');
				//K(j,i) = K(i,j); %the matrix is symmetric

				// FIXME define a set() fn for matrix class rather than awkwardly calculating offset
				$kernel_val = $kernel_fn($x_matrix->rowAsVector($i), $x_matrix->rowAsVector($j));
				// location of $i, $j
				$offset1 = ($i * $K->col) + $j;
				$K->data[$offset1] = $kernel_val;
				// K matrix is symmetric, location of $j, $i
				$offset2 = ($j * $K->col) + $i;
				$K->data[$offset2] = $kernel_val;
			} // j loop

		} // i loop
	} // if linear/gaussian/slow
	echo "KERNEL CALC COMPLETE\n";

	// Variables
	$alphas = Np\vector::zeros($m);
	$b = 0;
	$E = Np\vector::zeros($m);
	$passes = 0;
	$eta = 0;
	$L = 0;
	$H = 0;

	// Train
	echo "\nTraining...";
	$dots = 11;
	while ($passes < $max_passes) {

		$num_changed_alphas = 0;
		for ($i=0; $i<$m; $i++) {
			// comments from original coursera class octave source:
			// Calculate Ei = f(x(i)) - y(i) using (2).
			// this line commented out in coursera source
			// E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);

			// we want to calculate this octave expression from coursera source
			//E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);

			// considerable trial and error yielded this for the sum, returns a scalar/float
			//$sum = $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum();

			$E->data[$i] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum() - $yvec->data[$i];

			// orig octave if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0)),
			if (
				($yvec->data[$i] * $E->data[$i] < -$tol && $alphas->data[$i] < $C)
				|| ($yvec->data[$i] * $E->data[$i] > $tol && $alphas->data[$i] > 0)
			) {

				// In practice, there are many heuristics one can use to select
				// the i and j. In this simplified code, we select them randomly.
				do {
					$j = mt_rand(0, ($m-1));
				} while ($j === $i);

// TESTING
//$j = ($i + 1) % $m;
//echo "j: $j\n";

				// Calculate Ej = f(x(j)) - y(j) using (2).
				// orig octave calc: E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);
				$E->data[$j] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($j))->sum() - $yvec->data[$j];

				// Save old alphas
				$alpha_i_old = $alphas->data[$i];
				$alpha_j_old = $alphas->data[$j];

				// Compute L and H by (10) or (11).
				$ai = $alphas->data[$i]; // grab these to prevent costly lookups any more than necessary
				$aj = $alphas->data[$j];
				if ($yvec->data[$i] == $yvec->data[$j]) {
					$L = max(0, $aj + $ai - $C);
					$H = min($C, $aj + $ai);
				} else {
					$L = max(0, $aj - $ai);
					$H = min($C, $C + $aj - $ai);
				}


				if ($L == $H) {
					// continue to next i.
					continue;
				}
            
				// Compute eta by (14).
				$eta = 2 * $K->at($i,$j) - $K->at($i,$i) - $K->at($j,$j);
            
				if ($eta >= 0) {
					// continue to next i.
					continue;
				}

				// Compute and clip new value for alpha j using (12) and (15).
				// orig octave: alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;
				// to avoid costly lookups, lets use the $aj var we just set above
				$aj = $aj - ($yvec->data[$j] * ($E->data[$i] - $E->data[$j])) / $eta;
				// Clip
				//alphas(j) = min (H, alphas(j));
				//alphas(j) = max (L, alphas(j));
				$aj = min($H, $aj);
				$aj = max($L, $aj);
				// make sure we put the new $aj value back into $alphas
				$alphas->data[$j] = $aj;

				// Check if change in alpha is significant
				if (abs($aj - $alpha_j_old) < $tol) {
					// continue to next i.
					// replace anyway
					$alphas->data[$j] = $alpha_j_old;
					continue;
				}

				// Determine value for alpha i using (16).
				// alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));
				$ai = $ai + $yvec->data[$i] * $yvec->data[$j] * ($alpha_j_old - $aj);
				// be sure to put new $ai back in $alphas
				$alphas->data[$i] = $ai;

				//  Compute b1 and b2 using (17) and (18) respectively.
				//b1 = b - E(i) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(i,j)';
				$b1 = $b - $E->data[$i]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($i, $j);

				//b2 = b - E(j) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(j,j)';
				$b2 = $b - $E->data[$j]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($j, $j);

				// Compute b by (19).
				if (0 < $ai && $ai < $C) {
					$b = $b1;
				} elseif (0 < $aj && $aj < $C) {
					$b = $b2;
				} else {
					$b = ($b1+$b2)/2;
				}
				$num_changed_alphas = $num_changed_alphas + 1;


			} //  if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0))


		} // for loop

		if ($num_changed_alphas == 0) {
			$passes++;
		} else {
			$passes = 0;
		}

		echo '.';
		$dots++;
		if ($dots > 78) {
			$dots = 0;
			echo "\n";
		}

	} // while passes < max_passes
	echo "\nMAX_PASSES ($max_passes) REACHED, training done\n";


	// NOTE: alphas is a m x 1 column vector containing some floats and
	// many near-zero values and a few floats a tiny bit less than zero

	// idx is a size m vector with ones or zeros indicating which alphas are > 0
	// while this is convenient & readable in octave, it's gratuitous in PHP
	// FIXME remove this
	//$idx =  $alphas->map(fn($v) => ($v > 0));

	// b value calculated from our training, a float, e.g. 0.9990
	// FIXME we don't need this extra ret_b var, move comment below?
	$ret_b = $b;

	// GENERATE THESE WITH LOOP which is actually faster/simpler than vector::map()
	// X subset matrix of orig feature vectors who end up with alpha > 0
	// size typically 500 x n (where m x n is size of orig training set X)
	$ret_x = [];
	// subset (column) vector indicating original classification for our
	// new subset model.X. size same as model.X, e.g. 500 x 1
	$ret_y = [];
	// subset (column) vector same size as our model.X, e.g., 500 x 1 containing
	// float alpha values calculated by our training
	$ret_alphas = [];

	// only include x/y/alphas with value greater than zero
	for($i=0; $i<$m; $i++) {
		$alpha = $alphas->data[$i];
		if ($alpha > 0) {
			// sadly ghostjat/np offers no efficient methods to construct new matrix from vectors
			// so we have to convert to native PHP arrays
			// TODO this would probably be faster if we looped directly in $x_matrix->data
			$ret_x[] = $x_matrix->rowAsVector($i)->asArray();
			$ret_y[] = $yvec->data[$i];
			$ret_alphas[] = $alpha;
		}
	}
	$ret_x = Np\matrix::ar($ret_x);
	$ret_y = Np\vector::ar($ret_y);
	$ret_alphas = Np\vector::ar($ret_alphas);
	

	// column vector containing our weights for each feature, size
	// is n x 1 (where m x n is size of orig training set X)
	// the orig octave
	// model.w = ((alphas.*Y)'*X)';
	// getting the correct output required much trial and error, produced weird sumRows thing
	$ret_w = $alphas->multiply($yvec)->multiply($x_matrix->transpose())->sumRows();
	// Return the model
	return [
		'kernel_fn' => $kernel_fn, // string specifying kernel function
		'b' => $ret_b, // float
		'x' => $ret_x, // matrix
		'y' => $ret_y, // vector
		'alphas' => $ret_alphas, // vector
		'w' => $ret_w, // vector
		'sigma' => $sigma,
		'c' => $C
	];

} // svm_train()

/**
 * returns a vector of predictions using a SVM trained by svm_train
 * @param $model is an associative array svm model returned from svm_train()
 * @param $x is either a m x n matrix or a vector of size n (vector input is not yet implemented)
 * @return size m vector of predictions
 */
function svm_predict($model, $x) {
	if ($x instanceof Np\matrix) {
		// matrix is acceptable
	} elseif ($x instanceof Np\vector) {
		// FIXME work up a variant of this fn to predict for a vector
		die("is vector\n");
	} else {
		throw new Exception(gettype($x) . ' is not a valid type for $x');
	}

	$shape = $x->getShape();
	$m = $shape->m;
	$features = $shape->n;

	if ($model['kernel_fn'] == 'linear_kernel') {
		// We can use the weights and bias directly if working with the
		// linear kernel
		// original octave:
		// p = X * model.w + model.b;
		// WARNING this seems to return the right result, but
		// the order of operands is reversed, there's a sum, etc. real kludgy.
		$p = $model['w']->multiply($x)->sumRows()->add($model['b']);

	} elseif ($model['kernel_fn'] == 'gaussian_kernel') {
		$K = get_gaussian_predict_k($x, $model);
		//p = sum(K, 2);
		$p = $K->sumRows();

	} else {
		// Other kernel fn -- THIS WILL PROB BE SLOW
		$shape = $model['x']->getShape();
		$model_x_m = $shape->m;
		$p = Np\vector::zeros($m);

		for($i=0; $i<$model_x_m; $i++) {
			$prediction = 0;
			for($j=0; $j<$features; $j++) {
				throw new Exception("NOT YET IMPLEMENTED");
				// we want to do this original octave stuff here:

				//prediction = prediction + ...
				//model.alphas(j) * model.y(j) * ...
				//model.kernelFunction(X(i,:)', model.X(j,:)');
			} // for j
			$p->data[$i] = $prediction + $model['b'];
		} // for i
	} // if kernel_fn is linear/gaussian/other

	// change calculated ranges to zero or one
	return $p->map(fn($v) => ($v >= 0));
} // svm_predict()

/**
 * Runs the specified model on the $x and $y provided and
 * returns details about the time and accuracy
 */
function svm_assess(array $model, Np\matrix $x, Np\vector $y) {
	$start = microtime(TRUE);

	$retval = [];

	$shape = $x->getShape();
	$retval['x_samples'] = $shape->m;
	$retval['x_features'] = $shape->n;

	$y_size = $y->getSize();
	$retval['y_samples'] = $y_size;

	$p = svm_predict($model, $x);
	$p_size = $p->getSize();
	$retval['p_size'] = $p_size;

	// sanity check
	if ($p_size !== $y_size) {
		throw new Exception("p size $p_size does not match y size $y_size");
	}

	// calculate what percentage of the time our model's prediction
	// matches y. $p is full of predictions, $y is full of answers
	$correct = 0;
	$true_positives = 0;
	$true_negatives = 0;
	$false_positives = 0;
	$false_negatives = 0;
	for($i=0; $i<$p_size; $i++){
		// if prediction matches the expected value in $y, it's CORRECT
		$pval = $p->data[$i];
		if ($pval == $y->data[$i]) {
			$correct++;
			if ($pval == 1) {
				$true_positives++;
			} else {
				$true_negatives++;
			}
		} else {
			if ($pval == 1) {
				$false_positives++;
			} else {
				$false_negatives++;
			}
		}
	}
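	// guard the divisions: a model that predicts no positives (or a set with
	// no positive samples) would otherwise trigger a division by zero
	$predicted_positives = $true_positives + $false_positives;
	$actual_positives = $true_positives + $false_negatives;
	$precision = ($predicted_positives > 0) ? $true_positives / $predicted_positives : 0.0;
	$recall = ($actual_positives > 0) ? $true_positives / $actual_positives : 0.0;

	$retval['correct_predictions'] = $correct;
	$retval['true_positives'] = $true_positives;
	$retval['true_negatives'] = $true_negatives;
	$retval['false_positives'] = $false_positives;
	$retval['false_negatives'] = $false_negatives;
	$retval['precision'] = $precision;
	$retval['recall'] = $recall;
	$retval['f_score'] = (($precision + $recall) > 0) ? (2 * $precision * $recall) / ($precision + $recall) : 0.0;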
	


	$accuracy = ($correct/$p_size);
	$retval['correct_decimal'] = $accuracy;
	$retval['correct_percent'] = $accuracy * 100;

	$retval['elapsed_time'] = microtime(TRUE) - $start;

	return $retval;
}


/**
 * returns optimal C by training numerous SVM classifiers with varying
 * values of C and returning the one that performs best
 *
 */
function svm_linear_optimal_c($xtrain, $ytrain, $xval, $yval) {

	// evenly spaced (exponentially) generated from powers of 1.4
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin linear sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";

	$best_c = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		echo "== Training SVM with C=$cval ==\n";

		$start = microtime(TRUE);
		// train the model on the training set
		$model = svm_train($xtrain, $ytrain, $cval, 'linear_kernel', null, 0.0001);
		$elapsed = microtime(TRUE) - $start;
		echo "training completed in $elapsed seconds\n";

		// assess the model with the xtrain set
		$results = svm_assess($model, $xtrain, $ytrain);
		echo "XTRAIN\n";
		print_r($results);
		$train_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results			
		);

		// assess the model with the xval set
		$results = svm_assess($model, $xval, $yval);
		echo "XVAL\n";
		print_r($results);
		$val_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results
		);

		// maybe optimize for f_score?
		$correct_percent = $results['correct_percent'];
		if (is_null($best_c) || $correct_percent > $best_correct_percent) {
			$best_c = $cval;
			$best_results = $results;
			$best_correct_percent = $correct_percent;
			$best_model = $model;
		}

		$result_idx++;

	}

	echo "\n=====\n";
	echo "Best value for C is $best_c, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];

}


/**
 * returns optimal C and sigma by training numerous gaussian SVM classifiers with varying
 * values of C and sigma, returning the one that performs best
 *
 */
function svm_gaussian_optimal_c($xtrain, $ytrain, $xval, $yval) {
	// good, sort of hand picked
//	$cvals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
//	$sigma_vals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
	// evenly spaced (exponentially) generated from powers of 1.4
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	$sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin gaussian sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";
	echo "\tvalues of sigma: ", implode(", ", $sigma_vals), "\n";

	$best_c = null;
	$best_sigma = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		foreach($sigma_vals as $s_i => $sigma) {
			echo "== Training SVM with C=$cval, sigma=$sigma ==\n";

			$start = microtime(TRUE);
			// train the model on the training set
			$model = svm_train($xtrain, $ytrain, $cval, 'gaussian_kernel', $sigma);
			$elapsed = microtime(TRUE) - $start;
			echo "training completed in $elapsed seconds\n";

			// assess the model with the xtrain set
			$results = svm_assess($model, $xtrain, $ytrain);
			echo "XTRAIN\n";
			print_r($results);
			$train_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results				
			);

			// assess the model with the xval set
			$results = svm_assess($model, $xval, $yval);
			echo "XVAL\n";
			print_r($results);
			$val_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results
			);

			// TODO maybe optimize for f_score instead?
			// find a way to punish false positives more or we'll be throwing out good messages as spam
			$correct_percent = $results['correct_percent'];
			if (is_null($best_c) || $correct_percent > $best_correct_percent) {
				$best_c = $cval;
				$best_sigma = $sigma;
				$best_results = $results;
				$best_correct_percent = $correct_percent;
				$best_model = $model;
			}

			$result_idx++;

		} // foreach sigma
	} // foreach c

	echo "\n=====\n";
	echo "Best C is $best_c, best sigma is $best_sigma, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'sigma' => $best_sigma,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];
}
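
The vectorized Gaussian code above (the `$kf` / `pow($kf, $v)` steps in svm_train and get_gaussian_predict_k) relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b). The sketch below checks that identity on plain PHP arrays, with no ghostjat/np dependency. The function names `rbf_direct` and `rbf_expanded` and the sample vectors are invented for illustration and are not part of svm-fns.php.

```php
<?php
// Direct RBF kernel: exp(-||a - b||^2 / (2 * sigma^2))
function rbf_direct(array $a, array $b, float $sigma): float {
	$sq_dist = 0.0;
	foreach ($a as $i => $v) {
		$d = $v - $b[$i];
		$sq_dist += $d * $d;
	}
	return exp(-$sq_dist / (2 * $sigma * $sigma));
}

// Same value via the expansion used by the vectorized code:
// ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b),
// then raise kf = kernel(1, 0) = exp(-1 / (2 * sigma^2)) to that power.
function rbf_expanded(array $a, array $b, float $sigma): float {
	$aa = 0.0; // ||a||^2
	$bb = 0.0; // ||b||^2
	$ab = 0.0; // a . b
	foreach ($a as $i => $v) {
		$aa += $v * $v;
		$bb += $b[$i] * $b[$i];
		$ab += $v * $b[$i];
	}
	$kf = exp(-1 / (2 * $sigma * $sigma));
	return pow($kf, $aa + $bb - 2 * $ab);
}

// two made-up binary feature vectors (word present / absent)
$a = [1, 0, 1, 1, 0];
$b = [0, 0, 1, 0, 1];
$sigma = 0.5;

$direct = rbf_direct($a, $b, $sigma);
$expanded = rbf_expanded($a, $b, $sigma);
printf("direct=%.10f expanded=%.10f\n", $direct, $expanded);
```

Here the squared distance is 3, so both functions should return exp(-6), roughly 0.0024787522, up to floating point rounding. In the matrix version, the -2(a . b) term for every pair comes from a single X * X' product and the squared norms are added row-wise and column-wise, which is what the nested loops over `$K->data` are doing.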

Steve_R_Jones · Dec 8, 2023

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> The username ":could be" from a spammer.... So could any user name that starts with SNEAKY.

sneakyimp · Dec 8, 2023

Steve_R_Jones Because it isn't SPAM

Are you sure? Because it's their only post ever, and completely unrelated to this thread. I think it's quite likely that this user's second post, if it ever appears, will have a spam link in it.

And @Steve_R_Jones, there appears to be an off-by-one error in the logic that sends response notifications for this site. Your response #11100226 prompted the site to send me this email:

Hey sneakyimp!

sneakyimp replied to your post (#6) in spam filter findings?.

(LINK WAS HERE)

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> >The username ":could be" from a spammer.... So could any user name that starts with SNEAKY.

Steve_R_Jones · Dec 9, 2023

Around here - everyone is "Innocent until proven guilty." Otherwise, no one could last as long as you have.

The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

Weedpacket · Dec 12, 2023

I'd just appreciate an effective spam filter...

sneakyimp · Dec 14, 2023

Steve_R_Jones Around here - everyone is "Innocent until proven guilty." Otherwise, no one could last as long as you have

I like to think my posts have been topical, and a reasonable person (or spam filter) could easily distinguish them from posts that contribute precisely nothing to the conversation at hand.

Steve_R_Jones The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

I'd be happy to donate some free time to examine the problem. I was also imagining the spam filter discussion here might help facilitate content moderation on this forum.

Weedpacket I'd just appreciate an effective spam filter...

@Weedpacket agreed! The ratio of spam to ham here is somewhat lamentable. Even the ham we get seems confused and poorly expressed.
