I think tomorrow I will try to train an AWS Comprehend document classifier model to classify text strings as SPAM or NOT-SPAM. Anyone have suggestions about where I could get a great big sample of spam and not-spam plaintext messages? I was thinking I might use my own email accounts. I suspect gmail download will exclude the junk mail, which sorta defeats the purpose.

Any thoughts on training an AWS Comprehend document classifier would also be welcome. It's not clear from the documentation, but it kinda seems like you can only train your model once.

sneakyimp thoughts on training an AWS Comprehend document classifier

One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

NogDog One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

That's an extremely helpful bit of info. So you've worked with AWS Comprehend?

sneakyimp So you've worked with AWS Comprehend?

Not specifically. A couple/three years ago I helped out on an ML project trying to identify problematic comments in review submissions. We used Amazon ML tools, though I don't recall which specific ones. My role was mainly helping to provide data for training/evaluation. I learned some stuff about ML by osmosis and a bit of tinkering around, but haven't really touched it since then.

We got some pretty promising results, but then the business moved in a different direction and it all got put on hold. We're now using ML for a very different purpose (more for predicting things, versus evaluating things), but I'm not closely involved, other than helping out with a few ancillary support things. I think we're using a mix of AWS tools along with some open-source libraries and such. 🤷

NogDog My role was mainly helping to provide data for training/evaluation.

It's extremely interesting that a 50/50 mix is important. Do you have any thoughts on how many records are required to train? I'd also be very curious about how you cooked up your training set. I'm sure I can initialize the model, get PHP talking to it, etc., but I'm struggling a bit to find an efficient way to cook up the training set. I have perhaps 30 existing contact form submissions, most of which are spam. Also, I believe the contact form submissions will be plain text rather than HTML. I was considering trying to export data from some mail account and using strip_tags on the raw email. Or maybe using some mail-parsing function to see if there's any text-only form of the message body embedded in the original message.
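For the strip_tags idea, something like this minimal sketch is what I have in mind (it assumes I've already extracted the decoded HTML body from a message; a real script would prefer a text/plain part when one exists):

    <?php
    // Minimal sketch: turn an HTML email body into plain text for the training set.
    // Assumes $htmlBody already holds the decoded HTML part of a message.
    function htmlToPlainText(string $htmlBody): string
    {
        // Decode entities like &amp; before stripping markup.
        $text = html_entity_decode($htmlBody, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Drop the markup itself.
        $text = strip_tags($text);
        // Collapse the whitespace left behind by block elements.
        return trim(preg_replace('/\s+/u', ' ', $text));
    }

    echo htmlToPlainText('<p>Buy <b>cheap</b> meds &amp; more!</p>'); // "Buy cheap meds & more!"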

sneakyimp how many records are required to train?

AFAIK, the more, the better. Not sure if there are any best practice minimums?

sneakyimp It's extremely interesting that a 50/50 mix is important.

Newer models may handle unbalanced sets better, possibly? We definitely noticed in our case that it seemed not to be catching enough bad reviews, then realized we probably had something like an 80:20 balance of good vs. bad; when we went 50/50 we got much better results (and confirmed from other sources that that's usually best).

sneakyimp I'd also be very curious about how you cooked up your training set.

In that respect, we already had several years' worth of reviews to use, all of which had gone through human moderation; so we just picked all the reviews that were rejected, then an equivalent number that were approved.
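In rough PHP terms, the balancing boiled down to something like this (placeholder data, not the actual project code):

    <?php
    // Keep every rejected (bad) example and randomly sample an equal number of
    // approved (good) ones so the training set ends up roughly 50/50.
    $rejected = ['awful spammy review one', 'awful spammy review two'];
    $approved = ['nice review one', 'nice review two', 'nice review three', 'nice review four'];

    shuffle($approved);
    $training = [];
    foreach ($rejected as $text) {
        $training[] = ['rejected', $text];
    }
    foreach (array_slice($approved, 0, count($rejected)) as $text) {
        $training[] = ['approved', $text];
    }
    shuffle($training); // mix the classes before splitting off an evaluation set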

Anyway, a bit of googling led me to this, which perhaps could be useful: https://archive.ics.uci.edu/ml/datasets/spambase

Thanks very much for the detail. Regarding the uci.edu data:

Date Donated 1999-07-01

I think I'll try and cook up a data set from one of my gmail accounts. I'm having some luck using the PHP IMAP extension.
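In case it's useful, this is roughly the sort of script I'm tinkering with (the mailbox name, credentials, and message limit are placeholders; it assumes the IMAP extension is enabled and the account allows IMAP access):

    <?php
    // Rough sketch: pull junk messages from Gmail's spam folder as plain text.
    // Assumes a local ./spam directory exists for the output files.
    $mailbox = '{imap.gmail.com:993/imap/ssl}[Gmail]/Spam';
    $imap = imap_open($mailbox, 'you@example.com', 'app-password');
    if ($imap === false) {
        die('IMAP connect failed: ' . imap_last_error());
    }

    $count = imap_num_msg($imap);
    for ($i = 1; $i <= min($count, 100); $i++) {
        // Section "1" is usually the first body part; a more careful script would walk
        // imap_fetchstructure() to locate the text/plain part and its encoding.
        $body = imap_fetchbody($imap, $i, '1');
        $body = quoted_printable_decode($body); // crude -- the encoding varies per part
        file_put_contents(sprintf('spam/%05d.txt', $i), $body);
    }
    imap_close($imap);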

NogDog Should have put some time restrictions on the google query.

I'm curious what your query was, exactly. I've been googling for 'spam message training set' and 'spambase dataset' and UCI comes up again and again -- and they only have a few spam datasets. I've seen the old '97 dataset mentioned in a few academic papers. Looks to me like good spam filter training datasets are quite rare.

Also, could you explain that spambase dataset's format to me? The data doesn't contain any actual spam messages from what I can tell. The spambase.data file is just a CSV of numeric values.

    I created a custom classifier model using AWS Comprehend and trained it with the SMS spam dataset here. I was quite surprised at how difficult it is to find spam filter training sets.

    I then wrote a script to use this classifier model to test the various examples we've received on a contact form. The AWS model matched our manual (i.e., human-entered) spam/ham assessments 66% of the time. I wasn't sure whether to be disappointed at these results or surprised at how good they were, given that the dataset I used is for SMS text messages, which are quite short.

    I then appended our contact form ham/spam records to the end of that SMS dataset and trained another classifier model. I then tested our contact form entries using the new model, which matched our human ham/spam assessments 95% of the time. I'd be delighted to get such results on incoming novel ham/spam contact form entries.

    I would point out that we only had about 41 contact form entries to add + test. I found it quite interesting that even when we added our own 41 entries to the model's training set, the model still failed to classify two of these entries correctly. One entry was falsely classified as ham when it should have been spam, but only by an extremely narrow margin:
    spam: 0.49563866853714
    ham: 0.50436133146286

    The other failure was a false positive for spam for the exceedingly terse message "digital marketing assistance."

    Very interesting that the model would fail to properly classify records included as part of its training set.

    sneakyimp Very interesting that the model would fail to properly classify records included as part of its training set.

    Probably not unexpected, since in theory it's iterating through various neural network combinations based on the content and then evaluating against the expected outcomes; but the spam/ham indicator is not part of the data it's examining -- it's just used to evaluate the results of each iteration. E.g., if "cars for sale" is flagged as spam but "carts for sails" and "ears four pale" are not (not really a good example, but you get the idea), it may lump all 3 together as being essentially the same, and since it's 2:1 ham, classify all 3 as ham. Okay, a horrible example, but the general idea is that by its very nature, it can be hard to discern how the learning model parsed, categorized, and grouped everything -- but 95% is pretty darned good, IMHO.

    NogDog the spam/ham indicator is not part of the data it's examining

    Strictly speaking, the spam/ham indication is most definitely part of the training set. The CSV I provided has two columns: the message and spam/ham indication. I don't recall the exact mechanism by which a neural network gets trained, but I vaguely recall that training involves supplying a set of inputs (the message or 'document') and also an expected output. The 'training' involves adjustment of neural network weights (a matrix/array of nodes and connections) according to some algorithm. I think I vaguely understand why it might fail to get the right classification for an item from its training set -- I think your explanation is a decent one: the algorithm's action is approximate and applies broadly, in a 'fuzzy logic' sense. We are training it with broad, imprecise notions in the hope that it can properly deal with unpredictable or novel input.
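    For what it's worth, here's roughly how I generated that two-column CSV (placeholder data; check the Comprehend docs for the exact column order it expects):

        <?php
        // Sketch: write the two-column training file (label, document) for the classifier.
        $labeled = [
            ['spam', 'You have won a free cruise! Click to claim your prize.'],
            ['ham',  'Hi, I have a question about your business hours.'],
        ];

        $fh = fopen('training.csv', 'w');
        foreach ($labeled as [$label, $message]) {
            // One document per row: flatten newlines so each record stays on a single line.
            fputcsv($fh, [$label, str_replace(["\r", "\n"], ' ', $message)]);
        }
        fclose($fh);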

    A few observations about AWS in particular:

    • when creating an API endpoint for your model, you have to choose its desired performance level in 'inference units.' I somewhat arbitrarily chose 5 inference units and haven't the foggiest notion how much this might cost. I'd also point out that it was very, very easy to exceed the allowed usage, which triggers throttling exceptions from the API (there's a sketch of calling the endpoint from PHP after this list).
    • you can't continue to train a given model once you create it. If you want new training, you create a new model and provide it with a new training dataset.
    • I haven't seen any explanation from AWS about how these models work internally, whether they are neural networks or use Bayesian analysis, or what sort of application logic might be in use. Unless I'm missing something, they are a total black box in terms of functionality.
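    For reference, hitting the endpoint from PHP looks roughly like this (using the aws/aws-sdk-php Comprehend client; the region and endpoint ARN are placeholders, and throttling/exception handling is omitted):

        <?php
        // Sketch: classify one document against a custom Comprehend endpoint
        // (composer: aws/aws-sdk-php).
        require 'vendor/autoload.php';

        use Aws\Comprehend\ComprehendClient;

        $client = new ComprehendClient([
            'region'  => 'us-east-1',
            'version' => 'latest',
        ]);

        $result = $client->classifyDocument([
            'EndpointArn' => 'arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/spam-endpoint',
            'Text'        => 'You have won a free cruise! Click here to claim your prize.',
        ]);

        // Each class comes back with a confidence score, e.g. ham vs. spam.
        foreach ($result['Classes'] as $class) {
            printf("%s: %.4f\n", $class['Name'], $class['Score']);
        }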

    Has anyone worked with a PHP neural network library? I see various options when I google: PHP-ML, Rubix, FANN.

    I've just realized that AWS Comprehend endpoints are exorbitantly expensive. If I'm not mistaken, my endpoint cost me $9/hr.

    sneakyimp Strictly speaking, the spam/ham indication is most definitely part of the training set.

    If my understanding is correct (not a 100% given in anything to do with ML/AI 😉 ), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.

    NogDog If my understanding is correct (not a 100% given in anything to do with ML/AI 😉 ), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.

    I believe you have described it precisely.

    🚨🚨🚨WARNING🚨🚨🚨: Amazon Comprehend is EXPENSIVE. I set up my endpoint with a mere 5 inference units and decided I'd leave it running in case I wanted to run another test, then ended up carelessly leaving it up for 24 hours. I just checked my billing dashboard, and this apparently cost me $267.

    sneakyimp

    Yikes! I think the guy working on our stuff right now does a lot of things locally now (probably mainly Python-based?), and when running on AWS it's some sort of thing where it spins up a container just for that process. But then we already have all sorts of things running on AWS full-time, so that's probably a different use case for us. 😐

      5 days later

      SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.

      Are you experiencing stuff getting past a Captcha?

      dalecosp SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.

      I feel (perhaps incorrectly) that SpamAssassin is widely used around the world and that it makes use of coordinated/centralized spam-tracking data/logic. I'd also point out that its sa-learn functionality allows you to 'train' it. I did like the fact that it seems to check the Spamhaus DB for domains that appear in the message's URLs. That said, it seems very specifically designed to parse email messages, and doesn't seem useful if you are parsing contact form data.

      dalecosp Are you experiencing stuff getting past a Captcha?

      YES. Most contact form entries are spam. I'm currently thinking either a) reCAPTCHA has been cracked by somebody or b) this spam comes from human workers, probably in some low-wage spam sweatshop.

        a month later

        I know that no one has asked, but I've been taking a Machine Learning course from Coursera presented by Andrew Ng of Stanford University. Although they use MATLAB/Octave for the programming exercises, the course provides a pretty extensive explanation of the underlying mathematics. There's a lot of matrix multiplication and algebra, some borderline calculus stuff.

        So, having seen the underlying operations of logistic regression and artificial neural networks, it occurs to me that a simple ordered sequence of bytes -- the raw UTF-8 characters of the spam message one receives -- doesn't seem especially useful as an input for a neural network spam detector. I started thinking about what might actually be useful for describing the essence of a mail message or contact form submission, and it seemed that a ranked word frequency list might be quite helpful: a list of the words in a given message, ordered by decreasing frequency, seems like it might be quite apt for this sort of thing. It then occurred to me that an ordered list of word hashes might be better, because then each word (normalized to upper or lowercase) would be reduced to just a few bytes, regardless of length. If anyone has thoughts about what distillations of a free-form text input might be useful for a spam detector, I'd be delighted to hear them.
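        Just to make the idea concrete, here's a quick sketch of the sort of distillation I have in mind (ASCII-only word splitting; the crc32() bit is just the 'few bytes per word' notion, not a recommendation):

            <?php
            // Sketch: ranked word-frequency list for a message, plus a hashed variant.
            function rankedWordFrequencies(string $message): array
            {
                // Normalize to lowercase and split on anything that isn't a letter (ASCII only).
                $words = preg_split('/[^a-z]+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);
                $counts = array_count_values($words);
                arsort($counts); // most frequent words first
                return $counts;  // e.g. ['free' => 4, 'click' => 2, ...]
            }

            function hashedWordList(array $rankedCounts): array
            {
                // Reduce each word to a fixed-size integer, regardless of its length.
                return array_map('crc32', array_keys($rankedCounts));
            }

            print_r(rankedWordFrequencies('FREE money! Claim your free prize now -- free, free, FREE'));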

          16 days later

          Once again, no one has asked, but I'm chuffed to report that the week 7 programming assignment involved creating a spam filter. The weekly lectures explained how to create a Support Vector Machine (SVM) spam filter. Broadly speaking, the process works like this:

          • create a corpus of ham & spam messages.
          • pre-process messages: convert to lowercase, replace specific URLs with a placeholder string like httpaddr and email addresses with emailaddr, and use a word-stemming library to reduce word variants to some core representation.
          • After pre-processing the messages, analyze them all to determine the words in use. From this, choose a vocabulary of all words that appear in a significant number of messages (e.g., at least 100).
          • Run through all the messages again, reducing each to a vector x of ones and zeros, where x(i) is 1 if the i-th vocabulary word is used in the message and 0 if it is not (see the PHP sketch after this list).
          • Randomly split all the messages into a training set (60% of messages), a cross-validation set (20%), and a test set (20%).
          • 'Train' your model by running a linear classifier algorithm on the training set, using the cross-validation set to evaluate its progress as a gradient-descent algorithm searches for the minimum of a cost function.
          • Estimate the efficacy of your trained model by running it on the test set.
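          In PHP terms (the assignment itself is in Octave, and I've left out the word-stemming step), the pre-processing and feature-vector steps look roughly like this:

              <?php
              // Sketch of the pre-processing + binary feature vector described above.
              function preprocess(string $msg): array
              {
                  $msg = strtolower($msg);
                  $msg = preg_replace('#(https?|ftp)://\S+#', 'httpaddr', $msg); // normalize URLs
                  $msg = preg_replace('/\S+@\S+/', 'emailaddr', $msg);           // normalize email addresses
                  return preg_split('/[^a-z]+/', $msg, -1, PREG_SPLIT_NO_EMPTY);
              }

              // $vocabulary maps word => index, built from the words that occur in enough messages.
              function featureVector(string $msg, array $vocabulary): array
              {
                  $x = array_fill(0, count($vocabulary), 0);
                  foreach (preprocess($msg) as $word) {
                      if (isset($vocabulary[$word])) {
                          $x[$vocabulary[$word]] = 1; // 1 if the i-th vocabulary word appears
                      }
                  }
                  return $x;
              }

              $vocab = ['free' => 0, 'click' => 1, 'httpaddr' => 2, 'meeting' => 3];
              print_r(featureVector('FREE offer, click http://spam.example now', $vocab)); // [1, 1, 1, 0]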

          As part of the assignment, I trained an SVM classifier using the entire SpamAssassin public corpus, which seems pretty old. It has some 10,750 messages in it. It took about 10 hours for my Octave code to analyze the words in use and generate a vocabulary, 2-3 hours to convert all the messages into their binary x-vectors, and then maybe a couple of minutes to train a fresh SVM classifier on the result. My code claims a 99.6% success rate on the cross-validation data and a 99.1% success rate on the test data (i.e., novel messages not used in training or cross-validation).

          HOWEVER, I attempted to use this classifier on some contact form submissions from a website and, testing recent submissions, got 60% false positives -- i.e., it declared 3 out of 5 ham messages to be spam. This is not good. I'm currently working on incorporating some of my own ham messages into the training set to see if that improves matters.
