Thanks very much for the detail. Regarding the uci.edu data:
Date Donated 1999-07-01
I think I'll try and cook up a data set from one of my gmail accounts. I'm having some luck using the PHP IMAP extension.
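In case it's useful, here's roughly the shape of what I'm doing -- very much a sketch, with the mailbox label, credentials, and output file as placeholders rather than my actual setup:

<?php
// Sketch: pull messages from a Gmail label via the PHP IMAP extension
// and append them to a CSV of label,message rows. Credentials, label
// name, and output file are placeholders.
$mailbox  = '{imap.gmail.com:993/imap/ssl}[Gmail]/Spam';
$user     = 'me@example.com';
$password = 'app-password-here';

$imap = imap_open($mailbox, $user, $password);
if ($imap === false) {
    die('imap_open failed: ' . imap_last_error());
}

$csv  = fopen('spam_corpus.csv', 'a');
$uids = imap_search($imap, 'ALL', SE_UID);

foreach ((array) $uids as $uid) {
    // Fetch the plain-text body (section 1); real-world messages may need
    // multipart handling and charset/transfer-encoding conversion.
    $body = imap_fetchbody($imap, $uid, '1', FT_UID);
    $body = trim(preg_replace('/\s+/', ' ', $body));
    fputcsv($csv, ['spam', $body]);
}

fclose($csv);
imap_close($imap);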
NogDog Should have put some time restrictions on the google query.
I'm curious what your query was, exactly. I've been googling for 'spam message training set' and 'spambase dataset' and UCI comes up again and again -- and they only have a few spam datasets. I've seen the old '97 dataset mentioned in a few academic papers. Looks to me like good spam filter training datasets are quite rare.
Also, could you explain that spambase dataset's format to me? The data doesn't contain any actual spam messages from what I can tell. The spambase.data file is just a CSV of numeric values.
I created a custom classifier model using AWS Comprehend and trained it with the SMS spam dataset here. I was quite surprised at how difficult it is to find spam filter training sets.
I then wrote a script to use this classifier model to test the various examples we've received on a contact form. The AWS model matched our manual (i.e., human-entered) spam/ham assessments 66% of the time. I wasn't sure whether to be disappointed at these results or surprised at how good they were, given that the dataset I used is for SMS text messages, which are quite short.
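For what it's worth, the test script boils down to something roughly like this (a sketch using the AWS SDK for PHP; the endpoint ARN, region, and sample text are placeholders, and credentials are assumed to come from the environment):

<?php
use Aws\Comprehend\ComprehendClient;

require 'vendor/autoload.php';

// Sketch: classify one contact-form message against a Comprehend custom
// classifier endpoint. The region and ARN below are placeholders.
$client = new ComprehendClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

$message = 'digital marketing assistance';

$result = $client->classifyDocument([
    'EndpointArn' => 'arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/spam-ham',
    'Text'        => $message,
]);

// Each class comes back with a confidence score between 0 and 1,
// e.g. spam: 0.49..., ham: 0.50...
foreach ($result['Classes'] as $class) {
    printf("%s: %s\n", $class['Name'], $class['Score']);
}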
I then appended our contact form ham/spam records to the end of that SMS dataset and trained another classifier model. I then tested our contact form entries using the new model, which matched our human ham/spam assessments 95% of the time. I'd be delighted to get such results on incoming novel ham/spam contact form entries.
I would point out that we only had about 41 contact form entries to add and test. I found it quite interesting that even when we added our own 41 entries to the model's training set, the model still failed to classify two of those entries correctly. One entry was falsely classified as ham when it should have been spam, but only by an extremely narrow margin:
spam: 0.49563866853714
ham: 0.50436133146286
The other failure was a false positive for spam for the exceedingly terse message "digital marketing assistance."
Very interesting that the model would fail to properly classify records included as part of its training set.
sneakyimp Very interesting that the model would fail to properly classify records included as part of its training set.
Probably not unexpected, since in theory it's iterating through various neural network combinations based on the content and then evaluating against the expected outcomes; but the spam/ham indicator is not part of the data it's examining -- just used to evaluate the results of each iteration. I.e., if "cars for sale" is flagged as spam but "carts for sails" and "ears four pale" are not (not really a good example, but you get the idea), it may lump all 3 together as being essentially the same, and since it's 2:1 ham, classify all 3 as ham. Okay, a horrible example, but the general idea is that by its very nature, it can be hard to discern how the learning model parsed, categorized, and grouped everything -- but 95% is pretty darned good, IMHO.
NogDog the spam/ham indicator is not part of the data it's examining
Strictly speaking, the spam/ham indication is most definitely part of the training set. The CSV I provided has two columns: the message and spam/ham indication. I don't recall the exact mechanism by which a neural network gets trained, but I vaguely recall that training involves supplying a set of inputs (the message or 'document') and also an expected output. The 'training' involves adjustment of neural network weights (a matrix/array of nodes and connections) according to some algorithm. I think I vaguely understand why it might fail to get the right classification for an item from its training set -- I think your explanation is a decent one: the algorithm's action is approximate and applies broadly, in a 'fuzzy logic' sense. We are training it with broad, imprecise notions in the hope that it can properly deal with unpredictable or novel input.
A few observations about AWS in particular:
Has anyone worked with a PHP neural network library? I see various options when I google: PHP-ML, Rubix, FANN.
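I haven't committed to any of them yet, but if I'm reading the PHP-ML docs correctly, a minimal supervised-training example looks roughly like this. Note that the spam/ham label goes in as the expected output ($targets), not as part of the feature vector; the numeric features here are invented purely for illustration:

<?php
use Phpml\Classification\MLPClassifier;

require 'vendor/autoload.php';

// Three made-up numeric features per message (e.g., word counts).
$samples = [
    [12, 0, 3],
    [2, 5, 0],
    [11, 1, 4],
    [1, 6, 0],
];
$targets = ['spam', 'ham', 'spam', 'ham'];

// 3 input features, one hidden layer of 4 neurons, two output classes.
$mlp = new MLPClassifier(3, [4], ['spam', 'ham']);
$mlp->train($samples, $targets);

print_r($mlp->predict([[10, 0, 2], [0, 7, 1]])); // hopefully ['spam', 'ham']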
I've just realized that AWS Comprehend endpoints are exorbitantly expensive. If I'm not mistaken, my endpoint cost me $9/hr.
sneakyimp Strictly speaking, the spam/ham indication is most definitely part of the training set.
If my understanding is correct (not a 100% given in anything to do with ML/AI), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.
NogDog If my understanding is correct (not a 100% given in anything to do with ML/AI), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.
I believe you have described it precisely.
WARNING: Amazon Comprehend is EXPENSIVE. I set up my endpoint with a mere 5 inference unit capability and decided I'd leave it running just in case I wanted to run another test and ended up carelessly leaving it running for 24 hours. Just checked my billing dashboard and this apparently cost me $267.
Yikes! I think the guy working on our stuff right now does a lot of things locally now (probably mainly Python-based?), and when running on AWS it's some sort of thing where it spins up a container just for that process. But then we already have all sorts of things running on AWS full-time, so that's probably a different use case for us.
Given my experience with AWS, I'm looking into FANN, which has PHP bindings. It looks quite complicated and only has one trivial example showing XOR training. If anyone around here has any experience with coding neural networks, I'd appreciate whatever broad advice or suggestions you may have. For now, I'm starting here.
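To make sure I understand that XOR example, I tried adapting its shape to the spam problem. This is only a sketch: the feature layout and the spam.data file are hypothetical, since I haven't decided what the input features should actually be.

<?php
// Sketch adapted from the PHP FANN extension's XOR example. Assumes a
// training file "spam.data" in FANN's format: a header line of
// "<num_pairs> <num_inputs> <num_outputs>", then alternating lines of
// inputs and expected outputs (1 = spam, -1 = ham).
$num_input              = 100;   // hypothetical: 100 word-frequency features per message
$num_hidden             = 20;
$num_output             = 1;
$desired_error          = 0.001;
$max_epochs             = 100000;
$epochs_between_reports = 100;

$ann = fann_create_standard(3, $num_input, $num_hidden, $num_output);

if ($ann) {
    fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);

    if (fann_train_on_file($ann, 'spam.data', $max_epochs, $epochs_between_reports, $desired_error)) {
        fann_save($ann, 'spam_float.net');
    }

    fann_destroy($ann);
}

// Later, to classify a new message:
// $ann = fann_create_from_file('spam_float.net');
// $out = fann_run($ann, $featureVector); // $featureVector: 100 floats; $out: array with 1 value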
SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.
Are you experiencing stuff getting past a Captcha?
dalecosp SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.
I feel (perhaps incorrectly) that SpamAssassin is widely used around the world and that it makes use of coordinated/centralized spam-tracking data/logic. I'd also point out that its sa-learn functionality allows you to 'train' it. I did like the fact that it seems to check the Spamhaus DB for domains that appear in the message's URLs. That said, it seems very specifically designed to parse email messages, and doesn't seem useful if you are parsing contact form data.
dalecosp Are you experiencing stuff getting past a Captcha?
YES. Most contact form entries are spam. I'm currently thinking either a) reCAPTCHA has been cracked by somebody, or b) this spam comes from human workers, probably in some low-wage spam sweatshop.
I know that no one has asked, but I've been taking a Machine Learning course from Coursera presented by Andrew Ng of Stanford University. Although they use MATLAB/Octave for the programming exercises, the course provides a pretty extensive explanation of the underlying mathematics. There's a lot of matrix multiplication and algebra, some borderline calculus stuff.
So, having seen the underlying operations of logistic regression and artificial neural networks, it occurs to me that a simple ordered sequence of bytes -- the spam message one receives as UTF-8 chars -- doesn't seem especially useful as an input for a neural network spam detector. I started thinking about what might actually be useful for describing the essence of a mail message or contact form submission, and it seemed that a ranked word frequency list might be quite helpful: a list of the words in a given message, ordered by decreasing frequency, seems like it might be quite apt for this sort of thing. It then occurred to me that an ordered list of word hashes might be better, because then each word (normalized to upper or lowercase) would be reduced to just a few bytes, regardless of length. If anyone has thoughts about what distillations of a free-form text input might be useful for a spam detector, I'd be delighted to hear them.
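To make that concrete, here's the sort of distillation I have in mind -- purely a sketch, and crc32 is just a stand-in for whatever hash turns out to be suitable:

<?php
// Sketch: distill a message into a ranked word-frequency list, then
// into a list of fixed-size word hashes.
function rankedWordFrequencies(string $message): array
{
    $words  = str_word_count(mb_strtolower($message), 1);
    $counts = array_count_values($words);
    arsort($counts);   // most frequent words first
    return $counts;    // e.g. ['free' => 3, 'viagra' => 2, 'now' => 1]
}

function rankedWordHashes(string $message): array
{
    // Replace each word with a 4-byte hash so word length no longer matters.
    return array_map('crc32', array_keys(rankedWordFrequencies($message)));
}

print_r(rankedWordFrequencies('Free free FREE viagra viagra now'));
print_r(rankedWordHashes('Free free FREE viagra viagra now'));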
Once again, no one has asked, but I'm chuffed to report that the week 7 programming assignment involved creating a spam filter. The weekly lectures explained how to create a Support Vector Machine (SVM) spam filter. Broadly speaking, the process works like this:
1. Pre-process/normalize each message (the specific steps are listed below).
2. Analyze the whole corpus to build a vocabulary of the most frequently used words.
3. Convert each message into a binary feature vector x, where x_i = 1 if the i-th vocabulary word appears in the message and 0 otherwise.
4. Train an SVM classifier on those feature vectors and their known spam/ham labels.
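The course exercise does all of this in Octave with matrix operations. Translated into PHP terms -- purely as a sketch using PHP-ML, with a toy vocabulary and toy messages rather than a real corpus -- the shape of the pipeline is roughly this:

<?php
use Phpml\Classification\SVC;
use Phpml\SupportVectorMachine\Kernel;

require 'vendor/autoload.php';

// Toy vocabulary; the real one comes from the most frequent words
// observed across the whole (pre-processed) training corpus.
$vocabulary = ['buy', 'cheap', 'meds', 'meeting', 'tomorrow', 'invoice'];

// Convert a pre-processed message into a binary vector: 1 if the
// vocabulary word appears in the message, 0 otherwise.
function toFeatureVector(string $message, array $vocabulary): array
{
    $words = str_word_count(strtolower($message), 1);
    return array_map(fn ($w) => in_array($w, $words, true) ? 1 : 0, $vocabulary);
}

$messages = [
    'buy cheap meds now'          => 'spam',
    'meeting tomorrow at noon'    => 'ham',
    'cheap meds buy today'        => 'spam',
    'invoice attached for review' => 'ham',
];

$samples = [];
$labels  = [];
foreach ($messages as $text => $label) {
    $samples[] = toFeatureVector($text, $vocabulary);
    $labels[]  = $label;
}

$svm = new SVC(Kernel::LINEAR, 1.0);   // linear kernel, C = 1.0
$svm->train($samples, $labels);

print_r($svm->predict([
    toFeatureVector('cheap meds for sale', $vocabulary),
    toFeatureVector('see you at the meeting tomorrow', $vocabulary),
])); // hopefully ['spam', 'ham']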
As part of the assignment, I trained an SVM classifier using the entire SpamAssassin public corpus, which seems pretty old. It has some 10,750 messages in it. It took about 10 hours for my Octave code to analyze the words in use and generate a vocabulary. It took 2-3 hours to convert all the messages into their binary x-vectors, and then it took maybe a couple of minutes to train a fresh SVM classifier on the result. My code claims a 99.6% success rate on the cross-validation data and a 99.1% success rate on the test data (i.e., novel messages not used in training or cross-validation).
HOWEVER, I attempted to use this classifier on some contact form submissions from a website and, testing recent submissions, we got 60% false positives -- i.e., it declared 3 out of 5 ham messages to be spam. This is not good. I'm currently working on incorporating some of my own ham messages into the training set to see if that improves matters.
If anyone is curious, these are the specific pre-processing steps:
- lower-case the entire message
- strip out HTML tags
- replace every URL with the token 'httpaddr'
- replace every email address with the token 'emailaddr'
- replace every number with the token 'number'
- replace every dollar sign with the token 'dollar'
- stem each word (e.g., 'discount', 'discounts', and 'discounted' all become 'discount')
- strip out punctuation and any remaining non-word characters, and collapse whitespace
It might seem counterintuitive to remove so much useful information that might indicate whether a message is spam or not, but this reduction is intended to eliminate information that is too specific.
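In PHP terms, that normalization amounts to something like the following -- a rough sketch of those steps, not the course's actual Octave code, and the stemming step is omitted for brevity:

<?php
// Sketch of the pre-processing/normalization steps (stemming omitted).
function preprocessMessage(string $message): string
{
    $message = strtolower($message);                                       // lower-case
    $message = preg_replace('/<[^<>]+>/', ' ', $message);                  // strip HTML tags
    $message = preg_replace('#(https?|ftp)://\S+#', 'httpaddr', $message); // normalize URLs
    $message = preg_replace('/\S+@\S+/', 'emailaddr', $message);           // normalize email addresses
    $message = preg_replace('/[0-9]+/', 'number', $message);               // normalize numbers
    $message = str_replace('$', 'dollar ', $message);                      // normalize dollar signs
    $message = preg_replace('/[^a-z ]/', ' ', $message);                   // drop punctuation/non-words
    return trim(preg_replace('/\s+/', ' ', $message));                     // collapse whitespace
}

echo preprocessMessage('Visit http://example.com NOW and send $100 to me@spam.com!');
// visit httpaddr now and send dollar number to emailaddr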