spam filter findings?

sneakyimp · Dec 7, 2023

Running the analyze-corpus-word-frequency.php script above produces chosen_corpus_word_frequencies.json. It contains an array whose keys are all the words appearing in your corpus; the value for each key is the number of message files that word appears in. E.g.:

    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
...
    [ahg] => 1
    [wdppronumseqfiiv] => 1
    [cevtr] => 1
    [wcrtegxypdpppronumseq] => 1
    [wdppronumseqnv] => 1
    [wcppronumseqfppronumseq] => 1
    [dpppronumseqf] => 1
    [bjdrenx] => 1
    [svatshenk] => 1
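
For anyone who wants the counting idea in miniature, here's a rough sketch. It is NOT the actual analyze-corpus-word-frequency.php; the function name count_document_frequencies() and the toy messages are made up for illustration, and it assumes each message has already been boiled down to an array of words.

```php
<?php
// Hypothetical helper (not from the original scripts): count how many
// messages each word appears in, given one array of words per message.
function count_document_frequencies(array $messages): array
{
    $freq = [];
    foreach ($messages as $words) {
        // array_unique() makes sure a word counts only once per message
        foreach (array_unique($words) as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
        }
    }
    // most common words first, which is the order generate-vocab.php assumes
    arsort($freq);
    return $freq;
}

// toy data, three tiny "messages"
$messages = [
    ['you', 'have', 'mail', 'you'],
    ['you', 'list'],
    ['mail', 'list', 'you'],
];
$freq = count_document_frequencies($messages);
print_r($freq);
```

Note the value is a count of messages, not a count of total occurrences, so 'you' appearing twice in the first toy message still only adds 1.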

sneakyimp · Dec 7, 2023

Once we've analyzed the corpus to determine which words appear and how often, we run this script to select the most popular words for our vocabulary. NOTE that this vocabulary dictates the feature vector that we use both to train our classifier on existing messages and to assess the ham/spam classification of incoming, novel messages:

// generate-vocab.php

<?php

// this script loads the word frequency data (generated at significant computational expense)
// from analysis of the thousands of files in the SA and MY corpus and tells us the most commonly
// appearing words and how many files of the chosen corpus that they appear in
// lastly, it generates a vocab of words used with sufficient frequency to serve
// as our vocab

// IMPORTANT: You'll need to make sure you have the correct, latest word
// frequency analysis data in the file chosen_corpus_word_frequencies.json
// we generate/store that in a separate script because it involves munging
// thousands of text documents -- computationally expensive

// you'll need to adjust MIN_FREQ so that you end up
// with about 1-2k words in your vocab

// our vocabulary will only include words that appear
// in at least this many emails in the SA corpus
// TODO move this to config.php
define('MIN_FREQ', 20);

$dest_dir = dirname(__FILE__) . '/';

// CONFIG
echo "this script has configurable parameters. See config.php\n";
require_once $dest_dir . 'config.php';


// TODO put this filename in some config somewhere, shared with the freq analysis script
$freq_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_WORD_FREQUENCIES;
echo "loading $freq_file\n";
$word_frequencies = json_decode(file_get_contents($freq_file), TRUE);


echo "corpus frequencies loaded, found " . sizeof($word_frequencies) . " words\n";

// should already be sorted by freq desc
$top = array_slice($word_frequencies, 0, 40);
echo "TOP WORDS\n";
print_r($top);

$vocab = [];
foreach($word_frequencies as $word => $freq) {
        if ($freq < MIN_FREQ) {
                // that's it! no more
                break;
        }

        $vocab[] = $word;
}

// sort alphabetically
// TODO a bit of performance profiling suggests we should NOT sort this
// array alphabetically, but rather by popular word first for better
// lookup performance?
sort($vocab);

echo sizeof($vocab) . " words in the vocab\n";

$vocab_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
echo "Saving vocab to $vocab_file\n";
file_put_contents($vocab_file, json_encode($vocab));

Sample output in my case:

$ php generate-vocab.php
this script has configurable parameters. See config.php
loading /home/jaith/biz/machine-learning/chosen_corpus_word_frequencies.json
corpus frequencies loaded, found 28084 words
TOP WORDS
Array
(
    [ppronumseq] => 1784
    [pprohttpaddr] => 1609
    [you] => 1344
    [have] => 1043
    [us] => 1028
    [list] => 996
    [not] => 968
    [your] => 948
    [if] => 942
    [pproemailaddr] => 930
    [can] => 914
    [mail] => 874
    [all] => 840
    [get] => 763
    [do] => 757
    [but] => 752
    [we] => 739
    [so] => 674
    [more] => 674
    [just] => 673
    [here] => 663
    [on] => 663
    [out] => 642
    [time] => 633
    [my] => 631
    [email] => 628
    [new] => 620
    [there] => 611
    [up] => 605
    [our] => 587
    [onli] => 577
    [ani] => 577
    [ha] => 576
    [now] => 556
    [like] => 551
    [work] => 536
    [messag] => 536
    [thei] => 533
    [inform] => 530
    [free] => 510
)
1955 words in the vocab
Saving vocab to /home/jaith/biz/machine-learning/chosen_corpus_vocab.json

This writes our vocab to chosen_corpus_vocab.json.
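
To make the "vocabulary dictates the feature vector" point concrete, here's a minimal sketch of turning a message's words into a vector of 0s and 1s. The real feature_vector_1() lives in text-fns.php, which isn't shown in this thread, so sketch_feature_vector() below is a hypothetical stand-in and may well differ from it.

```php
<?php
// Hypothetical stand-in for feature_vector_1(); the real function is in
// text-fns.php and may be implemented differently.
function sketch_feature_vector(array $vocab, array $words): array
{
    // flip the message's words into keys so each membership test is a
    // hash lookup instead of a scan through the array
    $present = array_flip($words);
    $vector = [];
    foreach ($vocab as $vocab_word) {
        // 1 if this vocab word is in the message, otherwise 0
        $vector[] = isset($present[$vocab_word]) ? 1 : 0;
    }
    return $vector;
}

// toy vocab and message, made up for illustration
$vocab = ['free', 'list', 'mail', 'you'];
$words = ['you', 'mail', 'offer'];
$vector = sketch_feature_vector($vocab, $words);
echo json_encode($vector), "\n";
```

The vector always has one entry per vocab word, in vocab order, and words outside the vocab ('offer' here) are simply ignored. With hash lookups like this, the sort order of the vocab shouldn't matter much for lookup speed, but I haven't profiled that against the real code.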

sneakyimp · Dec 7, 2023

Once we have our JSON files containing our corpus of chosen message files and our vocabulary, we can generate our training, cross validation, and test arrays.

NOTE: once again, PorterStemmer will barf in php 8, so run this with php 7 or fix it.

// === generate-training-sets.php ===
/**
 * This script loads the previously generated list of our corpus
 * files (which has already been randomly shuffled) and generates
 * the matrix of vectors we use to train our machine. That 
 * matrix will be a PHP array, containing one entry for each
 * file in our corpus. Each entry will itself be an array with
 * one entry for each word in our vocabulary.  We will then
 * break up this ALL matrix into train, validation, and test
 * subsets (60/20/20) and store those in a JSON file for use
 * by our training algorithm.
*/

$dest_dir = dirname(__FILE__) . '/';

require_once $dest_dir . 'config.php';

// load our file corpus
// NOTE each element should be an associative array specifying file, is_spam, and strip_headers
// e.g.:  ['file' => '/full/path/to/file', 'is_spam' => 1, 'strip_headers' => 0]
$corpus_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_FILES;
$corpus_files = json_decode(file_get_contents($corpus_json_file), TRUE);
$corpus_file_count = sizeof($corpus_files);
echo "$corpus_file_count files loaded from $corpus_json_file\n";

$vocab_json_file = $dest_dir . JSON_FILE_CHOSEN_CORPUS_VOCAB;
$vocab = json_decode(file_get_contents($vocab_json_file), TRUE);
$vocab_word_count = sizeof($vocab);
echo "$vocab_word_count words loaded from $vocab_json_file\n";

// we encountered some character encoding problems with regex in octave/matlab
// do we need to declare a charset for these files or perform some kind of charset conversion?
// before we start processing these emails, we need to set the charset
// or it can barf doing some regex
// NOTE: this does not appear to cause trouble in PHP
//__mfile_encoding__ ("iso-8859-1");


// we will need these functions to analyze each message
// TODO give this file a more meaningful name
require_once 'text-fns.php';


// this is our master array of feature vectors
$Xall = [];
// this is our master array of y scores
$yall = [];

echo "processing $corpus_file_count files from chosen corpus\n";
$start = microtime(TRUE);
foreach($corpus_files as $i => $cf) {
//	echo "processing " . $cf['file'] . "\n";

        if (($i % 100) == 0) {
                echo "processing $i of $corpus_file_count\n";
        }

        $file = $cf['file'];

        $contents = file_get_contents($file);
	if (!$contents) {
		throw new Exception($file . ' could not be fetched, false or empty returned');
	}

        // this returns an array of the massaged, unique words, in a message -- no duplicates
	// TODO we will probably want to remove 2nd strip_headers param when we port this to
	// a website application, it's just a quirk of the SA data corpus
        $words = pre_process_message($contents, (bool)$cf['strip_headers']);
        unset($contents);

	// take our $vocab and $words to generate a feature vector
	// which is just 0s and 1s, indicating which vocab words are in the current message
	$Xall[] = feature_vector_1($vocab, $words);
	$yall[] = $cf['is_spam'];

}
$elapsed = microtime(TRUE) - $start;
echo "all $corpus_file_count messages processed in $elapsed seconds\n";

echo sizeof($Xall) . " records in Xall\n";
echo sizeof($yall) . " records in yall\n";


// save processed training data
$data_file_all = $dest_dir . 'training_data_all.json';
echo "saving Xall and yall to $data_file_all\n";
file_put_contents($data_file_all, json_encode(['Xall' => $Xall, 'yall' => $yall]));

// divide up the entire set into subsets for training/validation/testing
$row_count = sizeof($yall);

if ($row_count < 10) {
	die("you don't even have 10 training examples. I refuse to finish\n");
}

// to split in 3 groups of 60/20/20, we need 2 cuts
// round() returns a float, so cast the cut points to int before slicing
$cut1 = (int) round($row_count * .6);
echo "cut1 $cut1\n";
$cut2 = (int) round($row_count * .8);
echo "cut2 $cut2\n";


$Xtrain = array_slice($Xall, 0, $cut1);
echo sizeof($Xtrain) . " elements in Xtrain\n";

$ytrain = array_slice($yall, 0, $cut1);
echo sizeof($ytrain) . " elements in ytrain\n";


$Xval = array_slice($Xall, $cut1, ($cut2-$cut1));
echo sizeof($Xval) . " elements in Xval\n";

$yval = array_slice($yall, $cut1, ($cut2-$cut1));
echo sizeof($yval) . " elements in yval\n";

$Xtest = array_slice($Xall, $cut2);
echo sizeof($Xtest) . " elements in Xtest\n";

$ytest = array_slice($yall, $cut2);
echo sizeof($ytest) . " elements in ytest\n";

echo "X element total = " . (sizeof($Xtrain) + sizeof($Xval) + sizeof($Xtest)) . "\n";
echo "y element total = " . (sizeof($ytrain) + sizeof($yval) + sizeof($ytest)) . "\n";

// save data sets
$data_file_sets = $dest_dir . 'training_data_sets.json';
echo "saving train/val/test sets to $data_file_sets\n";
file_put_contents($data_file_sets, json_encode([
	'Xtrain' => $Xtrain,
	'ytrain' => $ytrain,
	'Xval' => $Xval,
	'yval' => $yval,
	'Xtest' => $Xtest,
	'ytest' => $ytest,
]));
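
Here's the 60/20/20 cut arithmetic from the script above, pulled out into a tiny standalone sketch. The function name split_60_20_20() is made up for illustration. The row count of 2291 is the total implied by the sample training output in this thread (1375 + 458 + 458).

```php
<?php
// Hypothetical helper that mirrors the cut arithmetic in
// generate-training-sets.php: two cut points give three subsets.
function split_60_20_20(array $rows): array
{
    $n = count($rows);
    // round() returns a float, so cast before using as an offset/length
    $cut1 = (int) round($n * 0.6);
    $cut2 = (int) round($n * 0.8);
    return [
        'train' => array_slice($rows, 0, $cut1),
        'val'   => array_slice($rows, $cut1, $cut2 - $cut1),
        'test'  => array_slice($rows, $cut2),
    ];
}

// stand-in rows; in the real script each row is a feature vector
$sets = split_60_20_20(range(1, 2291));
foreach ($sets as $name => $subset) {
    echo $name, ': ', count($subset), "\n";
}
```

This only gives a fair split because the corpus file list was randomly shuffled beforehand. If the list were ordered (say, all spam first), you'd need to shuffle X and y together before slicing.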

sneakyimp · Dec 7, 2023

OK so once we have the training/test/validation set feature vectors in a big fat JSON file, we can train a Support Vector Machine. Glossing over a bunch of details, this is the top-level script. I'm wondering if I should maybe just put this in a repo on github? If anyone is curious, LMK.

// === train-svm.php ===

/**
 * this script trains a Support Vector Machine (i.e., a machine
 * learning classifier algorithm) to determine if an email message
 * is spam or ham
*/

// Load the Spam Email dataset
// You will have Xtrain, ytrain, Xval, yval, Xtest, ytest
// we expect this data to be randomly shuffled, and to contain
// feature vectors from both the old SA corpus and MY corpus
$data_file = __DIR__ . '/training_data_sets.json';
echo "loading data from $data_file\n";
$data = json_decode(file_get_contents($data_file), TRUE);

// output some data points
$data_keys = ['Xtrain', 'ytrain', 'Xval', 'yval', 'Xtest', 'ytest'];
foreach($data_keys as $key) {
	if (!array_key_exists($key, $data)) {
		throw new Exception("data file did not define key=$key");
	}
	echo "$key has " . sizeof($data[$key]) . " elements\n";
}

// we will be using ghostjat/np for matrix operations
// NOTE: i vaguely recall having modified this library to get it to return the same
// results as octave for some matrix operation
require_once __DIR__ . '/np/vendor/autoload.php';
use Np\matrix;
use Np\vector;

$x_matrix = matrix::ar($data['Xtrain']);
$y_matrix = vector::ar($data['ytrain']);

// trial and error suggested C=1 would be best, but *shrug*
// NOTE: the sample output posted for this script was produced with C=1
//$C = 3;
// testing
$C = 0.1;
echo "training model with C=$C\n";


// FIXME before we can call our svm-fns we need to load BLAS because Np depends on it
// need to put it somewhere correct? this is awkward
Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/np/vendor/ghostjat/np/src/core/blas.h');
require_once 'svm-fns.php';

echo "Training SVM (Spam Classification)\n";
echo "(this may take 1 to 2 minutes) ...\n";
// train the model
// best practice is to train, validate, and test with separate data sets
$start = microtime(TRUE);
$model = svm_train($x_matrix, $y_matrix, $C, 'linear_kernel');
echo "training completed in " . (microtime(TRUE) - $start) . " seconds\n";

// $model will be an associative array with these keys:
// kernel_fn => string specifying kernel function
// b =>  float
// x => matrix of input x vectors (0s/1s in spam example) where alpha > 0
// y => vector of input y classifications (0/1 in spam example) where alpha > 0
// alphas => vector of alpha values calculated during training
// w => vector of weights, a float, for each input feature

// convert $model to something we can JSON_ENCODE, i.e., only
// basic data types instead of Np\vector or Np\matrix
$to_array = ['x', 'y', 'alphas', 'w'];
$model_to_save = [];
foreach($model as $key => $val) {
	if (in_array($key, $to_array)) {
		$model_to_save[$key] = $val->asArray();
	} else {
		$model_to_save[$key] = $val;
	}
}

// save the model so we can crack it open and test it
// without having to retrain the entire model all over again.
$model_data_file = __DIR__ . '/trained-svm-model.json';
echo "Writing trained SVM model params to $model_data_file\n";
file_put_contents($model_data_file, json_encode($model_to_save));

echo "\nEvaluating the trained Linear SVM on TRAINING set ...\n";
$x = Np\matrix::ar($data['Xtrain']);
$y = Np\vector::ar($data['ytrain']);
$results = svm_assess($model, $x, $y);
print_r($results);

echo "\nEvaluating the trained Linear SVM on Xtest set ...\n";
$x = Np\matrix::ar($data['Xtest']);
$y = Np\vector::ar($data['ytest']);
$results = svm_assess($model, $x, $y);
print_r($results);


die("DONE\n");

I'll post the svm-fns.php in the next post.
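
In the meantime, here's a rough idea of what the saved linear model lets you do once training is finished. For a linear kernel, classifying one message boils down to a dot product of its 0/1 feature vector with the weights, plus the bias, thresholded at zero. This sketch uses plain PHP arrays rather than ghostjat/np; sketch_linear_predict() and the toy numbers are made up for illustration.

```php
<?php
// Hypothetical helper: classify one 0/1 feature vector using a linear
// model's weights ($w) and bias ($b), with plain PHP arrays.
function sketch_linear_predict(array $w, float $b, array $x): int
{
    if (count($w) !== count($x)) {
        throw new InvalidArgumentException('w and x must be the same length');
    }
    // score = b + sum of w[i] * x[i]
    $score = $b;
    foreach ($w as $i => $weight) {
        $score += $weight * $x[$i];
    }
    // a score of zero or more is classified as spam (1), otherwise ham (0)
    return ($score >= 0) ? 1 : 0;
}

// made-up toy weights, NOT values from a trained model
$w = [0.8, -0.5, 0.3];
$b = -0.4;
echo sketch_linear_predict($w, $b, [1, 0, 1]), "\n"; // score is 0.7
echo sketch_linear_predict($w, $b, [0, 1, 0]), "\n"; // score is -0.9
```

With the real model you'd presumably json_decode trained-svm-model.json and pass its 'w' and 'b' entries, with the feature vector built from the same vocab used in training.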

sneakyimp · Dec 7, 2023

But first, sample output. As you can see, with C=1, the trained SVM assesses the messages in our training set with 99.9% accuracy, and in the test data set (i.e., new messages it has not seen) with 93.0% accuracy:

loading data from /home/sneakyimp/biz/machine-learning/training_data_sets.json
Xtrain has 1375 elements
ytrain has 1375 elements
Xval has 458 elements
yval has 458 elements
Xtest has 458 elements
ytest has 458 elements
training model with C=1
Training SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
train with C=1 and kernel_fn=linear_kernel
x matrix m=1375, n=1955
y vector size 1375
CALCULATING KERNEL
KERNEL CALC COMPLETE

Training.......................................................................
..........................................................................
MAX_PASSES (5) REACHED, training done
training completed in 82.92901802063 seconds
Writing trained SVM model params to /home/sneakyimp/biz/machine-learning/trained-svm-model.json

Evaluating the trained Linear SVM on TRAINING set ...
Array
(
    [x_samples] => 1375
    [x_features] => 1955
    [y_samples] => 1375
    [p_size] => 1375
    [correct_predictions] => 1373
    [true_positives] => 548
    [true_negatives] => 825
    [false_positives] => 2
    [false_negatives] => 0
    [precision] => 0.99636363636364
    [recall] => 1
    [f_score] => 0.99817850637523
    [correct_decimal] => 0.99854545454545
    [correct_percent] => 99.854545454545
    [elapsed_time] => 0.29263210296631
)

Evaluating the trained Linear SVM on Xtest set ...
Array
(
    [x_samples] => 458
    [x_features] => 1955
    [y_samples] => 458
    [p_size] => 458
    [correct_predictions] => 426
    [true_positives] => 155
    [true_negatives] => 271
    [false_positives] => 18
    [false_negatives] => 14
    [precision] => 0.89595375722543
    [recall] => 0.91715976331361
    [f_score] => 0.90643274853801
    [correct_decimal] => 0.93013100436681
    [correct_percent] => 93.013100436681
    [elapsed_time] => 0.097591161727905
)
DONE
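
If you want to double-check those numbers, the summary figures follow directly from the four confusion-matrix counts. Here's a small sketch; sketch_metrics() is a made-up name, and it assumes f_score is the usual harmonic mean of precision and recall, which is consistent with the output above.

```php
<?php
// Hypothetical helper: derive summary metrics from confusion-matrix counts.
function sketch_metrics(int $tp, int $tn, int $fp, int $fn): array
{
    $total = $tp + $tn + $fp + $fn;
    // guard each ratio against a zero denominator
    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall    = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;
    // harmonic mean of precision and recall
    $f_score   = ($precision + $recall) > 0
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0.0;
    $accuracy  = $total > 0 ? ($tp + $tn) / $total : 0.0;
    return compact('precision', 'recall', 'f_score', 'accuracy');
}

// counts copied from the test-set results above
$metrics = sketch_metrics(155, 271, 18, 14);
printf(
    "precision %.4f recall %.4f f_score %.4f accuracy %.4f\n",
    $metrics['precision'],
    $metrics['recall'],
    $metrics['f_score'],
    $metrics['accuracy']
);
```

On the test set that works out to 155/173 for precision and 155/169 for recall, i.e. 18 ham messages flagged as spam and 14 spam messages let through.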

sneakyimp · Dec 7, 2023

Here are the contents of svm-fns.php, which do the training.

// === svm-fns.php ===
/**
 * Defines functions to train and use a Support Vector Machine
 */

/**
 * LINEARKERNEL returns a linear kernel between x1 and x2
 * NOTE that the incoming vectors x1 and x2 were originally both column vectors
 * of dimensions (vocab size) x 1
 */
function linear_kernel($x1, $x2) {
	
	// Ensure that x1 and x2 are column vectors
	// while this conversion may be necessary for a broad use of
	// this function, it is unnecessary in the svmTrain context
	// and probably hampers performance a tiny bit
	// x1 = x1(:); x2 = x2(:);

	// Compute the kernel
	// this should return a 1x1 matrix (scalar value?)

	return $x1->dot($x2);  // dot product, should yield scalar
}

/**
 * returns a radial basis function (gaussian) kernel between x1 and x2
 * as a scalar value; the octave original was
 * sim = gaussianKernel(x1, x2)
 */
function gaussian_kernel(Np\vector $x1, Np\vector $x2, $sigma) {

	// NOTE the incoming vectors x1 and x2 are column vectors
	// of dimension (vocab_size) x 1

	// orig octave:
	//sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma^2));
	// NOTE sim will be a 1x1 result (a scalar value)

	return exp(-$x1->subtract($x2)->square()->sum() / (2 * $sigma*$sigma));
}

function get_gaussian_predict_k(Np\matrix $x, array $model) {

	// orig octave code in svmPredict
	// Vectorized RBF Kernel
	// This is equivalent to computing the kernel on every pair of examples
	//X1 = sum(X.^2, 2);
	//X2 = sum(model.X.^2, 2)';
	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
	//K = model.kernelFunction(1, 0) .^ K;
	//K = bsxfun(@times, model.y', K);
	//K = bsxfun(@times, model.alphas', K);
	//p = sum(K, 2);

	$x1 = $x->square()->sumRows();
	//echo "x1 ", $x1, "\n";

	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
	$x2 = $model['x']->square()->sumRows();
	//echo "x2 ", $x2, "\n";

	// need to build K.
	$K = $x->dot($model['x']->transpose())->multiply(-2);

	// do the inner bsxfun(plus...)
	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
	$kshape = $K->getShape();
	$km = $kshape->m;
	$kn = $kshape->n;
	$x2size = $x2->getSize();
	// $kn should match the dimensions of $x2
	// sanity check
	if ($x2size !== $kn) {
		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$x2size; $i++) {
		$x2val = $x2->data[$i];
		for($j=0; $j<$km; $j++) {
			// add the ith x2 value to the ith column of the jth row
			$K->data[($j * $kn) + $i] += $x2val;
		}
	}

	// do the outer bsxfun(plus...)
	// ghostjat has no means to add a COLUMN vector soooo we fake it
	$x1size = $x1->getSize();
	// $km should match the dimensions of $x1
	// sanity check
	if ($x1size !== $km) {
		throw new \Exception("x1 size ($x1size) does not match km ($km)");
	}
	// i are rows, j are columns
	for($i=0; $i<$x1size; $i++) {
		$x1val = $x1->data[$i];
		for($j=0; $j<$kn; $j++) {
			// add the ith x1 value to the jth column of the ith row
			//$offset = ($i * $kn) + $j;
			$K->data[($i * $kn) + $j] += $x1val;
		}
	}

	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
	//echo "kf ", $kf, "\n";
	$K = $K->map(fn($v) => (pow($kf, $v)));

	$mysize = $model['y']->getSize();
	// $kn should match the dimensions of $model['y']
	// sanity check
	if ($mysize !== $kn) {
		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$mysize; $i++) {
		$yval = $model['y']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith y value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $yval;
		}
	}


	$alphasize = $model['alphas']->getSize();
	// $kn should match the dimensions of $model['alphas']
	// sanity check
	if ($alphasize !== $kn) {
		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
	}
	// i are columns, j are rows
	for($i=0; $i<$alphasize; $i++) {
		$aval = $model['alphas']->data[$i];
		for($j=0; $j<$km; $j++) {
			// multiply the ith alpha value by the ith column of the jth row
			$K->data[($j * $kn) + $i] *= $aval;
		}
	}

	return $K;
}


/**
 * trains an SVM classifier and returns trained model. X is the matrix of
 * training examples.  Each row is a training example, and the jth column
 * holds the jth feature.  Y is a column matrix containing 1 for positive
 * examples and 0 for negative examples.  C is the standard SVM regularization
 * parameter.  tol is a tolerance value used for determining equality of
 * floating point numbers. max_passes controls the number of iterations
 * over the dataset (without changes to alpha) before the algorithm quits.
 *
 * Note: This is a simplified version of the SMO algorithm for training
 * SVMs. In practice, if you want to train an SVM classifier, we
 * recommend using an optimized package such as:
 * 	LIBSVM   (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
 * 	SVMLight (http://svmlight.joachims.org/)
 */
function svm_train($x_matrix, $y_vector, $C, $kernel_fn, $sigma=null, $tol=0.001, $max_passes=5) {
	echo "train with C=$C";
	if (!is_null($sigma)) {
		echo ", sigma=$sigma";
	}
	echo " and kernel_fn=$kernel_fn\n";
	$shape = $x_matrix->getShape();
	echo "x matrix m={$shape->m}, n={$shape->n}\n";
	$size = $y_vector->getSize();
	echo "y vector size ", $size, "\n";

	$m = $shape->m;


	// map the 0s in y to -1; note this appears to be faster than vector->map() stuff
	// BIG FAT WARNING we have to make a copy of the $y_vector object because
	// changing it here apparently propagates those changes back up to the calling scope
	$yvec = Np\vector::ones($y_vector->getSize());
	$yndim = $y_vector->ndim;
	for($i=0; $i<$yndim; $i++) {
        	if ($y_vector->data[$i] == 0) {
                	$yvec->data[$i] = -1;
        	}
	}


	// Pre-compute the Kernel Matrix since our dataset is small
	// (in practice, optimized SVM packages that handle large datasets
	// gracefully will _not_ do this)

	echo "CALCULATING KERNEL\n";
	// We have implemented optimized vectorized version of the Kernels here so
	// that the svm training will run faster.
	if ($kernel_fn === 'linear_kernel') {
		// Vectorized computation for the Linear Kernel
		// This is equivalent to computing the kernel on every pair of examples
		$K = $x_matrix->dot($x_matrix->transpose());
	} elseif ($kernel_fn === 'gaussian_kernel') {

		if (is_null($sigma)) {
			throw new Exception('You must provide a sigma value for gaussian kernel training');
		}

		// Vectorized RBF Kernel
		// This is equivalent to computing the kernel on every pair of examples

		// orig octave:
		// X2 = sum(X.^2, 2);
		// K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
		// K = kernelFunction(1, 0) .^ K;

		// a vector of size n, orig X2 was a column vector
		$x2 = $x_matrix->square()->sumRows();
		// need to build K. this gets us pretty far, calculating inner bsxfun somehow
		$K = $x_matrix->dot($x_matrix->transpose())->multiply(-2)->sum($x2);
		// ghostjat has no means to add a column vector soooo we fake it
		$kshape = $K->getShape();
		$km = $kshape->m;
		$kn = $kshape->n;
		$x2size = $x2->getSize();
		// $km should match the dimensions of $x2
		// sanity check
		if ($x2size !== $km) {
			throw new \Exception("x2 size ($x2size) does not match km ($km)");
		}
		for($i=0; $i<$x2size; $i++) {
			$x2val = $x2->data[$i];
			for($j=0; $j<$kn; $j++) {
				// add the ith x2 value to the jth column of the ith row
				//$offset = ($i * $kn) + $j;
				$K->data[($i * $kn) + $j] += $x2val;
			}
		}
		// free memory
		unset($x2);

		$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $sigma);
		$K = $K->map(fn($v) => (pow($kf, $v)));

	} else {
		// Pre-compute the Kernel Matrix
		// The following can be slow due to the lack of vectorization
		echo "NON-VECTORIZED, SLOW\n";
		$K = Np\matrix::zeros($m, $m);
		for ($i=0; $i<$m; $i++) {
			if ($i >0 && ($i % 10 == 0)) {
				echo "\tloop $i\n";
			}
			for ($j=0; $j<$m; $j++) {

				// original matlab/octave code
				//K(i,j) = kernelFunction(X(i,:)', X(j,:)');
				//K(j,i) = K(i,j); %the matrix is symmetric

				// FIXME define a set() fn for matrix class rather than awkwardly calculating offset
				$kernel_val = $kernel_fn($x_matrix->rowAsVector($i), $x_matrix->rowAsVector($j));
				// location of $i, $j
				$offset1 = ($i * $K->col) + $j;
				$K->data[$offset1] = $kernel_val;
				// K matrix is symmetric, location of $j, $i
				$offset2 = ($j * $K->col) + $i;
				$K->data[$offset2] = $kernel_val;
			} // j loop

		} // i loop
	} // if linear/gaussian/slow
	echo "KERNEL CALC COMPLETE\n";

	// Variables
	$alphas = Np\vector::zeros($m);
	$b = 0;
	$E = Np\vector::zeros($m);
	$passes = 0;
	$eta = 0;
	$L = 0;
	$H = 0;

	// Train
	echo "\nTraining...";
	$dots = 11;
	while ($passes < $max_passes) {

		$num_changed_alphas = 0;
		for ($i=0; $i<$m; $i++) {
			// comments from original coursera class octave source:
			// Calculate Ei = f(x(i)) - y(i) using (2).
			// this line commented out in coursera source
			// E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);

			// we want to calculate this octave expression from coursera source
			//E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);

			// considerable trial and error yielded this for the sum, returns a scalar/float
			//$sum = $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum();

			$E->data[$i] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($i))->sum() - $yvec->data[$i];

			// orig octave if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0)),
			if (
				($yvec->data[$i] * $E->data[$i] < -$tol && $alphas->data[$i] < $C)
				|| ($yvec->data[$i] * $E->data[$i] > $tol && $alphas->data[$i] > 0)
			) {

				// In practice, there are many heuristics one can use to select
				// the i and j. In this simplified code, we select them randomly.
				do {
					$j = mt_rand(0, ($m-1));
				} while ($j === $i);

// TESTING
//$j = ($i + 1) % $m;
//echo "j: $j\n";

				// Calculate Ej = f(x(j)) - y(j) using (2).
				// orig octave calc: E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);
				$E->data[$j] = $b + $alphas->multiply($yvec)->multiply($K->rowAsVector($j))->sum() - $yvec->data[$j];

				// Save old alphas
				$alpha_i_old = $alphas->data[$i];
				$alpha_j_old = $alphas->data[$j];

				// Compute L and H by (10) or (11).
				$ai = $alphas->data[$i]; // grab these to prevent costly lookups any more than necessary
				$aj = $alphas->data[$j];
				if ($yvec->data[$i] == $yvec->data[$j]) {
					$L = max(0, $aj + $ai - $C);
					$H = min($C, $aj + $ai);
				} else {
					$L = max(0, $aj - $ai);
					$H = min($C, $C + $aj - $ai);
				}


				if ($L == $H) {
					// continue to next i.
					continue;
				}
            
				// Compute eta by (14).
				$eta = 2 * $K->at($i,$j) - $K->at($i,$i) - $K->at($j,$j);
            
				if ($eta >= 0) {
					// continue to next i.
					continue;
				}

				// Compute and clip new value for alpha j using (12) and (15).
				// orig octave: alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;
				// to avoid costly lookups, lets use the $aj var we just set above
				$aj = $aj - ($yvec->data[$j] * ($E->data[$i] - $E->data[$j])) / $eta;
				// Clip
				//alphas(j) = min (H, alphas(j));
				//alphas(j) = max (L, alphas(j));
				$aj = min($H, $aj);
				$aj = max($L, $aj);
				// make sure we put the new $aj value back into $alphas
				$alphas->data[$j] = $aj;

				// Check if change in alpha is significant
				if (abs($aj - $alpha_j_old) < $tol) {
					// continue to next i.
					// replace anyway
					$alphas->data[$j] = $alpha_j_old;
					continue;
				}

				// Determine value for alpha i using (16).
				// alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));
				$ai = $ai + $yvec->data[$i] * $yvec->data[$j] * ($alpha_j_old - $aj);
				// be sure to put new $ai back in $alphas
				$alphas->data[$i] = $ai;

				//  Compute b1 and b2 using (17) and (18) respectively.
				//b1 = b - E(i) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(i,j)';
				$b1 = $b - $E->data[$i]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($i, $j);

				//b2 = b - E(j) ...
				//- Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
				//- Y(j) * (alphas(j) - alpha_j_old) *  K(j,j)';
				$b2 = $b - $E->data[$j]
					- $yvec->data[$i] * ($ai - $alpha_i_old) * $K->at($i, $j)
					- $yvec->data[$j] * ($aj - $alpha_j_old) * $K->at($j, $j);

				// Compute b by (19).
				if (0 < $ai && $ai < $C) {
					$b = $b1;
				} elseif (0 < $aj && $aj < $C) {
					$b = $b2;
				} else {
					$b = ($b1+$b2)/2;
				}
				$num_changed_alphas = $num_changed_alphas + 1;


			} //  if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0))


		} // for loop

		if ($num_changed_alphas == 0) {
			$passes++;
		} else {
			$passes = 0;
		}

		echo '.';
		$dots++;
		if ($dots > 78) {
			$dots = 0;
			echo "\n";
		}

	} // while passes < max_passes
	echo "\nMAX_PASSES ($max_passes) REACHED, training done\n";


	// NOTE: alphas is a m x 1 column vector containing some floats and
	// many near-zero values and a few floats a tiny bit less than zero

	// idx is a size m vector with ones or zeros indicating which alphas are > 0
	// while this is convenient & readable in octave, it's gratuitous in PHP
	// FIXME remove this
	//$idx =  $alphas->map(fn($v) => ($v > 0));

	// b value calculated from our training, a float, e.g. 0.9990
	// FIXME we don't need this extra ret_b var, move comment below?
	$ret_b = $b;

	// GENERATE THESE WITH LOOP which is actually faster/simpler than vector::map()
	// X subset matrix of orig feature vectors who end up with alpha > 0
	// size typically 500 x n (where m x n is size of orig training set X)
	$ret_x = [];
	// subset (column) vector indicating original classification for our
	// new subset model.X. size same as model.X, e.g. 500 x 1
	$ret_y = [];
	// subset (column) vector same size as our model.X, e.g., 500 x 1 containing
	// float alpha values calculated by our training
	$ret_alphas = [];

	// only include x/y/alphas with value greater than zero
	for($i=0; $i<$m; $i++) {
		$alpha = $alphas->data[$i];
		if ($alpha > 0) {
			// sadly ghostjat/np offers no efficient methods to construct new matrix from vectors
			// so we have to convert to native PHP arrays
			// TODO this would probably be faster if we looped directly in $x_matrix->data
			$ret_x[] = $x_matrix->rowAsVector($i)->asArray();
			$ret_y[] = $yvec->data[$i];
			$ret_alphas[] = $alpha;
		}
	}
	$ret_x = Np\matrix::ar($ret_x);
	$ret_y = Np\vector::ar($ret_y);
	$ret_alphas = Np\vector::ar($ret_alphas);
	

	// column vector containing our weights for each feature, size
	// is n x 1 (where m x n is size of orig training set X)
	// the orig octave
	// model.w = ((alphas.*Y)'*X)';
	// getting the correct output required much trial and error, produced weird sumRows thing
	$ret_w = $alphas->multiply($yvec)->multiply($x_matrix->transpose())->sumRows();
	// Return the model
	return [
		'kernel_fn' => $kernel_fn, // string specifying kernel function
		'b' => $ret_b, // float
		'x' => $ret_x, // matrix
		'y' => $ret_y, // vector
		'alphas' => $ret_alphas, // vector
		'w' => $ret_w, // vector
		'sigma' => $sigma,
		'c' => $C
	];

} // svm_train()

/**
 * returns a vector of predictions using a SVM trained by svm_train
 * @param $x is either a m x n matrix or a vector of size n
 * @param model is an associative array svm model returned from svm_train()
 * @return size m vector of predictions
 */
function svm_predict($model, $x) {
	if ($x instanceof Np\matrix) {
		// matrix is acceptable
	} elseif ($x instanceof Np\vector) {
		// FIXME work up a variant of this fn to predict for a vector
		die("is vector\n");
	} else {
		throw new Exception(gettype($x) . ' is not a valid type for $x');
	}

	$shape = $x->getShape();
	$m = $shape->m;
	$features = $shape->n;

	if ($model['kernel_fn'] == 'linear_kernel') {
		// We can use the weights and bias directly if working with the
		// linear kernel
		// original octave:
		// p = X * model.w + model.b;
		// WARNING this seems to return the right result, but
		// the order of operands is reversed, there's a sum, etc. real kludgy.
		$p = $model['w']->multiply($x)->sumRows()->add($model['b']);

	} elseif ($model['kernel_fn'] == 'gaussian_kernel') {
		$K = get_gaussian_predict_k($x, $model);
		//p = sum(K, 2);
		$p = $K->sumRows();

	} else {
		// Other kernel fn -- THIS WILL PROB BE SLOW
		$shape = $model['x']->getShape();
		$model_x_m = $shape->m;
		$p = Np\vector::zeros($m);

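		// loop over the $m rows of $x being classified and, for each one,
		// over the $model_x_m rows of model.x, matching the octave indexes below
		for($i=0; $i<$m; $i++) {
			$prediction = 0;
			for($j=0; $j<$model_x_m; $j++) {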
				throw new Exception("NOT YET IMPLEMENTED");
				// we want to do this original octave stuff here:

				//prediction = prediction + ...
				//model.alphas(j) * model.y(j) * ...
				//model.kernelFunction(X(i,:)', model.X(j,:)');
			} // for j
			$p->data[$i] = $prediction + $model['b'];
		} // for i
	} // if kernel_fn is linear/gaussian/other

	// change calculated ranges to zero or one (explicit ints rather than booleans)
	return $p->map(fn($v) => ($v >= 0) ? 1 : 0);
} // svm_predict()

/**
 * Runs the specified model on the $x and $y provided and
 * returns details about the time and accuracy
 */
function svm_assess(array $model, Np\matrix $x, Np\vector $y) {
	$start = microtime(TRUE);

	$retval = [];

	$shape = $x->getShape();
	$retval['x_samples'] = $shape->m;
	$retval['x_features'] = $shape->n;

	$y_size = $y->getSize();
	$retval['y_samples'] = $y_size;

	$p = svm_predict($model, $x);
	$p_size = $p->getSize();
	$retval['p_size'] = $p_size;

	// sanity check
	if ($p_size !== $y_size) {
		throw new Exception("p size $p_size does not match y size $y_size");
	}

	// calculate what percentage of the time our model's prediction
	// matches y. $p is full of predictions, $y is full of answers
	$correct = 0;
	$true_positives = 0;
	$true_negatives = 0;
	$false_positives = 0;
	$false_negatives = 0;
	for($i=0; $i<$p_size; $i++){
		// if prediction matches the labeled value, it's CORRECT;
		// a prediction of 1 counts as a positive, anything else as a negative
		$pval = $p->data[$i];
		if ($pval == $y->data[$i]) {
			$correct++;
			if ($pval == 1) {
				$true_positives++;
			} else {
				$true_negatives++;
			}
		} else {
			if ($pval == 1) {
				$false_positives++;
			} else {
				$false_negatives++;
			}
		}
	}
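	// guard the denominators: a model that never predicts 1, or a set with
	// no positive samples, would otherwise cause a division by zero
	$predicted_positives = $true_positives + $false_positives;
	$actual_positives = $true_positives + $false_negatives;
	$precision = ($predicted_positives > 0) ? $true_positives / $predicted_positives : 0;
	$recall = ($actual_positives > 0) ? $true_positives / $actual_positives : 0;

	$retval['correct_predictions'] = $correct;
	$retval['true_positives'] = $true_positives;
	$retval['true_negatives'] = $true_negatives;
	$retval['false_positives'] = $false_positives;
	$retval['false_negatives'] = $false_negatives;
	$retval['precision'] = $precision;
	$retval['recall'] = $recall;
	$retval['f_score'] = (($precision + $recall) > 0)
		? (2 * $precision * $recall) / ($precision + $recall)
		: 0;
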
	$accuracy = ($correct/$p_size);
	$retval['correct_decimal'] = $accuracy;
	$retval['correct_percent'] = $accuracy * 100;

	$retval['elapsed_time'] = microtime(TRUE) - $start;

	return $retval;
}


/**
 * returns optimal C by training numerous SVM classifiers with varying
 * values of C and returning the one that performs best
 *
 */
function svm_linear_optimal_c($xtrain, $ytrain, $xval, $yval) {

	// evenly spaced (exponentially) generated from powers of 1.4
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin linear sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";

	$best_c = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		echo "== Training SVM with C=$cval ==\n";

		$start = microtime(TRUE);
		// train the model on the training set
		$model = svm_train($xtrain, $ytrain, $cval, 'linear_kernel', null, 0.0001);
		$elapsed = microtime(TRUE) - $start;
		echo "training completed in $elapsed seconds\n";

		// assess the model with the xtrain set
		$results = svm_assess($model, $xtrain, $ytrain);
		echo "XTRAIN\n";
		print_r($results);
		$train_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results
		);

		// assess the model with the xval set
		$results = svm_assess($model, $xval, $yval);
		echo "XVAL\n";
		print_r($results);
		$val_results[$result_idx] = array_merge(
			['c' => $cval, 'training_time' => $elapsed],
			$results
		);

		// maybe optimize for f_score?
		$correct_percent = $results['correct_percent'];
		if (is_null($best_c) || $correct_percent > $best_correct_percent) {
			$best_c = $cval;
			$best_results = $results;
			$best_correct_percent = $correct_percent;
			$best_model = $model;
		}

		$result_idx++;

	}

	echo "\n=====\n";
	echo "Best value for C is $best_c, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];

}


/**
 * returns optimal C and sigma by training numerous gaussian SVM classifiers with varying
 * values of C and sigma, returning the one that performs best
 *
 */
function svm_gaussian_optimal_c($xtrain, $ytrain, $xval, $yval) {
	// good, sort of hand picked
//	$cvals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
//	$sigma_vals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
	// evenly spaced (exponentially) generated from powers of 1.4
	$cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	$sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
	echo "Begin gaussian sweep\n";
	echo "\tvalues of c: ", implode(", ", $cvals), "\n";
	echo "\tvalues of sigma: ", implode(", ", $sigma_vals), "\n";

	$best_c = null;
	$best_sigma = null;
	$best_results = null;
	$best_correct_percent = null;
	$best_model = null;

	$train_results = [];
	$val_results = [];
	$result_idx = 0;
	foreach($cvals as $c_i => $cval) {
		foreach($sigma_vals as $s_i => $sigma) {
			echo "== Training SVM with C=$cval, sigma=$sigma ==\n";

			$start = microtime(TRUE);
			// train the model on the training set
			$model = svm_train($xtrain, $ytrain, $cval, 'gaussian_kernel', $sigma);
			$elapsed = microtime(TRUE) - $start;
			echo "training completed in $elapsed seconds\n";

			// assess the model with the xtrain set
			$results = svm_assess($model, $xtrain, $ytrain);
			echo "XTRAIN\n";
			print_r($results);
			$train_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results
			);

			// assess the model with the xval set
			$results = svm_assess($model, $xval, $yval);
			echo "XVAL\n";
			print_r($results);
			$val_results[$result_idx] = array_merge(
				['c' => $cval, 'sigma' => $sigma, 'training_time' => $elapsed],
				$results
			);

			// TODO maybe optimize for f_score instead?
			// find a way to punish false positives more, or we'll be throwing out good messages as spam
			$correct_percent = $results['correct_percent'];
			if (is_null($best_c) || $correct_percent > $best_correct_percent) {
				$best_c = $cval;
				$best_sigma = $sigma;
				$best_results = $results;
				$best_correct_percent = $correct_percent;
				$best_model = $model;
			}

			$result_idx++;

		} // foreach sigma
	} // foreach c

	echo "\n=====\n";
	echo "Best C is $best_c, best sigma is $best_sigma, with correct_percent of $best_correct_percent\n";
	print_r($best_results);

	return [
		'c' => $best_c,
		'sigma' => $best_sigma,
		'model' => $best_model,
		'train_results' => $train_results,
		'val_results' => $val_results
	];
}
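
As a rough sketch of the TODOs above (optimizing for something other than correct_percent, and punishing false positives more), the sweep could score each model with an F-beta value computed from the counts svm_assess() already returns. None of this is in the original scripts: make_exponential_grid() and f_beta_score() are hypothetical helper names, and beta = 0.5 is only an assumed starting point. The first helper also shows how the hard-coded C and sigma lists can be regenerated from powers of 1.4.

```php
<?php

/**
 * Hypothetical helper: exponentially spaced grid base^-steps .. base^steps.
 * With base 1.4 and 10 steps this gives 21 values, from 1.4^-10 up to 1.4^10.
 */
function make_exponential_grid(float $base = 1.4, int $steps = 10): array {
	$grid = [];
	for ($k = -$steps; $k <= $steps; $k++) {
		$grid[] = $base ** $k;
	}
	return $grid;
}

/**
 * Hypothetical helper: F-beta score from confusion-matrix counts.
 * A beta below 1 weights precision more heavily than recall, so a false
 * positive lowers the score more than a false negative does.
 * Returns 0.0 when there are no true positives (including the all-zero case).
 */
function f_beta_score(int $tp, int $fp, int $fn, float $beta = 0.5): float {
	$b2 = $beta * $beta;
	$denominator = (1 + $b2) * $tp + $b2 * $fn + $fp;
	return ($denominator > 0) ? ((1 + $b2) * $tp) / $denominator : 0.0;
}

// Assumed usage inside a sweep loop, after $results = svm_assess($model, $xval, $yval):
// $score = f_beta_score(
//	$results['true_positives'],
//	$results['false_positives'],
//	$results['false_negatives']
// );
// ...then keep the model with the highest $score instead of the highest correct_percent.
```

With beta = 1 this reduces to the ordinary f_score that svm_assess() reports, so the comparison in the sweep only changes once beta is lowered.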

Steve_R_Jones · Dec 8, 2023

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> The username "could be" from a spammer... So could any username that starts with SNEAKY.

sneakyimp · Dec 8, 2023

Steve_R_Jones Because it isn't SPAM

Are you sure? Because it's their only post ever, and completely unrelated to this thread. I think it's quite likely that this user's second post, if it ever appears, will have a spam link in it.

And @Steve_R_Jones, there appears to be an off-by-one error in the logic that sends response notifications for this site. Your response #11100226 prompted the site to send me this email:

Hey sneakyimp!

sneakyimp replied to your post (#6) in spam filter findings?.

(LINK WAS HERE)

sneakyimp

This post from IdealCleaning is a good example of a post that is hard to recognize as spam. -> Because it isn't SPAM.

Seems to be spam to me (the username) but the post content itself might easily appear as a genuine response. -> The username "could be" from a spammer... So could any username that starts with SNEAKY.

Steve_R_Jones · Dec 9, 2023

Around here - everyone is "Innocent until proven guilty." Otherwise, no one could last as long as you have.

The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

Weedpacket · Dec 12, 2023

I'd just appreciate an effective spam filter...

sneakyimp · Dec 14, 2023

Steve_R_Jones Around here - everyone is "Innocent until proven guilty." Otherwise, no one could last as long as you have

I like to think my posts have been topical, and a reasonable person (or spam filter) could easily distinguish them from posts that contribute precisely nothing to the conversation at hand.

Steve_R_Jones The off-by-one issue will probably have to remain an issue. The resources for the site are a bit on the low side.

I'd be happy to donate some free time to examine the problem. I was also imagining the spam filter discussion here might help facilitate content moderation on this forum.

Weedpacket I'd just appreciate an effective spam filter...

@Weedpacket agreed! The ratio of spam to ham here is somewhat lamentable. Even the ham we get seems confused and poorly expressed.
