I am also sad to report that SpamAssassin is quite disappointing (so far) at recognizing spam in an online form submission. It appears to be designed entirely around parsing a mail message format, and most of its functionality hinges on making sure that all the necessary mail headers exist (To, From, Received, Subject, etc.) and that they obey various header-related rules.

I tested SpamAssassin on Ubuntu by installing it with sudo apt install spamassassin and then running it on a text file containing a spam message. The output includes a brief section describing its spam findings. Running it in local-only mode (with the -L option), which doesn't check online resources like Spamhaus, yields this bit of analysis:

Content analysis details:   (6.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 0.9 MISSING_HEADERS        Missing To: header
 1.0 PP_MIME_FAKE_ASCII_TEXT BODY: MIME text/plain claims to be ASCII
                             but isn't
 2.7 MISSING_DATE           Missing Date: header
 1.0 MISSING_FROM           Missing From: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 0.6 MISSING_MID            Missing Message-Id: header
 0.0 MISSING_SUBJECT        Missing Subject: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                            headers
 0.0 T_FILL_THIS_FORM_SHORT Fill in a short form with personal
                            information

While missing headers certainly seem like a problem for messages received via SMTP or some mail protocol, these checks are hardly applicable to contact form spam. Also, most of these headers can easily be added to the spam message text, after which the message will no longer be considered spam, even if its content is extremely spammy.
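
To illustrate the point: wrap the form text in a handful of boilerplate headers before handing it to SpamAssassin and nearly all of those MISSING_* points disappear. Here's a rough PHP sketch (the header values are invented, the form field name is just an example, and it assumes the spamassassin binary is installed and on the PATH):

<?php
// Dress a contact-form submission up as a mail message so SpamAssassin
// will score it. All header values below are placeholders.
$formText = $_POST['message'] ?? 'Fill in a short form with personal information...';

$fakeMessage = "From: someone@example.com\r\n"
             . "To: webmaster@example.com\r\n"
             . "Subject: Contact form submission\r\n"
             . "Date: " . date('r') . "\r\n"
             . "Message-Id: <" . uniqid('', true) . "@example.com>\r\n"
             . "\r\n"
             . $formText . "\r\n";

// Pipe the fake message to `spamassassin -L` (local checks only) and read
// back the processed message (X-Spam-Status header etc.) from stdout.
$proc = proc_open('spamassassin -L', [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], $pipes);
fwrite($pipes[0], $fakeMessage);
fclose($pipes[0]);
echo stream_get_contents($pipes[1]);
fclose($pipes[1]);
proc_close($proc);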

If you drop the -L option, the spamassassin check is slower:
spamassassin spam.txt

and the only additional check I've noticed is the Spamhaus check, which does recognize a spammy URL in the message text, but that barely adds anything to the spam score.

Content analysis details:   (6.6 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.2 MISSING_HEADERS        Missing To: header
 1.0 PP_MIME_FAKE_ASCII_TEXT BODY: MIME text/plain claims to be ASCII
                             but isn't
 0.1 URIBL_SBL_A            Contains URL's A record listed in the Spamhaus SBL
                            blocklist
                            [URIs: talkwithwebvisitors.com]
 1.0 MISSING_FROM           Missing From: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 1.4 MISSING_DATE           Missing Date: header
 1.8 MISSING_SUBJECT        Missing Subject: header
 0.1 MISSING_MID            Missing Message-Id: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                            headers
 0.0 T_FILL_THIS_FORM_SHORT Fill in a short form with personal
                            information

I noticed that these checks don't appear to mention DKIM or SPF. I guess those might be handled by plugins. Sadly, those checks are also not applicable to contact form spam.

    I think tomorrow I will try to train an AWS Comprehend document classifier model to classify text strings as SPAM or NOT-SPAM. Anyone have suggestions about where I could get a great big sample of spam and not-spam plaintext messages? I was thinking I might use my own email accounts, but I suspect a Gmail download will exclude the junk mail, which sort of defeats the purpose.

    Any thoughts on training an AWS Comprehend document classifier would also be welcome. It's not clear from the documentation, but it kinda seems like you can only train your model once.

    sneakyimp thoughts on training an AWS Comprehend document classifier

    One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

    NogDog One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

    That's an extremely helpful bit of info. So you've worked with AWS Comprehend?

    sneakyimp So you've worked with AWS Comprehend?

    Not specifically. A couple/three years ago I helped out on an ML project trying to identify problematic comments in review submissions. We used Amazon ML tools, though I don't recall which specific ones. My role was mainly helping to provide data for training/evaluation. I learned some stuff about ML by osmosis and a bit of tinkering around, but haven't really touched it since then.

    We got some pretty promising results, but then the business moved in a different direction and it all got put on hold. We're now using ML for a very different purpose (more for predicting things, versus evaluating things), but I'm not closely involved, other than helping out with a few ancillary support things. I think we're using a mix of AWS tools along with some open-source libraries and such. 🤷

    NogDog My role was mainly helping to provide data for training/evaluation.

    It's extremely interesting that a 50/50 mix is important. Do you have any thoughts on how many records are required to train? I'd also be very curious about how you cooked up your training set. I'm sure I can initialize the model, get PHP talking to it, etc., but I'm struggling a bit to find an efficient way to cook up the training set. I have perhaps 30 existing contact form submissions, most of which are spam. Also, I believe the contact form submissions will be plain text rather than HTML. I was considering trying to export data from some mail account and using strip_tags on the raw email, or maybe using some mail-parsing function to see if there's a text-only form of the message body embedded in the original message.

    sneakyimp how many records are required to train?

    AFAIK, the more, the better. Not sure if there are any best practice minimums?

    sneakyimp It's extremely interesting that a 50/50 mix is important.

    Newer models may handle unbalanced sets better, possibly? We definitely noticed that in our case it seemed not to be catching enough bad reviews, then realized we probably had something like an 80/20 balance of good vs. bad, and when we went 50/50 we got much better results (and confirmed from other sources that that's usually best).

    sneakyimp I'd also be very curious about how you cooked up your training set.

    In that respect, we already had several years' worth of reviews to use, all of which had gone through human moderation; so we just picked all the reviews that were rejected, then an equivalent number that were approved.

    Anyway, a bit of googling led me to this, which perhaps could be useful: https://archive.ics.uci.edu/ml/datasets/spambase

    Thanks very much for the detail. Regarding the uci.edu data:

    Date Donated 1999-07-01

    I think I'll try to cook up a data set from one of my Gmail accounts. I'm having some luck using the PHP IMAP extension.
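
    In case it helps anyone, the rough shape of that is below. It's only a sketch: the mailbox spec, credentials, folder name, and label-first CSV layout are placeholders/assumptions, and it glosses over transfer-encoding (base64/quoted-printable) entirely.

    <?php
    // Sketch: pull message bodies from a Gmail folder via the PHP IMAP
    // extension, strip any HTML, and append label,text rows to a CSV.
    $mailbox = '{imap.gmail.com:993/imap/ssl}[Gmail]/Spam'; // or INBOX for ham
    $imap = imap_open($mailbox, 'me@example.com', 'app-password-here');
    if ($imap === false) {
        die('imap_open failed: ' . imap_last_error());
    }

    $out = fopen('training.csv', 'a');
    $total = imap_num_msg($imap);
    for ($i = 1; $i <= $total; $i++) {
        // Section "1" is often the text/plain part; fall back to the whole body.
        $body = imap_fetchbody($imap, $i, '1') ?: imap_body($imap, $i);
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($body)));
        if ($text !== '') {
            fputcsv($out, ['spam', $text]); // label first, then the message text
        }
    }
    fclose($out);
    imap_close($imap);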

    NogDog Should have put some time restrictions on the google query.

    I'm curious what your query was, exactly. I've been googling for 'spam message training set' and 'spambase dataset' and UCI comes up again and again -- and they only have a few spam datasets. I've seen the old '97 dataset mentioned in a few academic papers. Looks to me like good spam filter training datasets are quite rare.

    Also, could you explain that spambase dataset's format to me? The data doesn't contain any actual spam messages from what I can tell. The spambase.data file is just a CSV of numeric values.

      I created a custom classifier model using AWS Comprehend and trained it with the SMS spam dataset here. I was quite surprised at how difficult it is to find spam filter training sets.
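
      For reference, the training step boils down to putting a CSV of label,text rows in S3 and making a single SDK call. A rough sketch with the PHP SDK; the bucket, IAM role, region, and names below are placeholders, not my actual values:

      <?php
      require 'vendor/autoload.php';

      use Aws\Comprehend\ComprehendClient;

      // Credentials/region come from the usual SDK configuration (env vars, ~/.aws, etc.).
      $client = new ComprehendClient(['region' => 'us-east-1', 'version' => 'latest']);

      // Kick off training of a custom classifier; training runs asynchronously on
      // AWS's side. The training file is a CSV with one "label,document text" row
      // per example, already uploaded to S3.
      $result = $client->createDocumentClassifier([
          'DocumentClassifierName' => 'contact-form-spam',                                   // placeholder
          'DataAccessRoleArn'      => 'arn:aws:iam::111122223333:role/comprehend-s3-access', // placeholder
          'LanguageCode'           => 'en',
          'InputDataConfig'        => ['S3Uri' => 's3://my-bucket/spam-training.csv'],       // placeholder
      ]);

      echo $result['DocumentClassifierArn'], "\n"; // identifies the trained model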

      I then wrote a script to use this classifier model to test the various examples we've received on a contact form. The AWS model matched our manual (i.e., human-entered) spam/ham assessments 66% of the time. I wasn't sure whether to be disappointed at these results or surprised at how good they were, given that the dataset I used is for SMS text messages, which are quite short.
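
      The core of that script is really just one SDK call per message, roughly like this (the region, endpoint ARN, and sample text are placeholders):

      <?php
      require 'vendor/autoload.php';

      use Aws\Comprehend\ComprehendClient;

      $client = new ComprehendClient(['region' => 'us-east-1', 'version' => 'latest']);

      // Classify one contact-form message against the custom classifier's endpoint.
      $result = $client->classifyDocument([
          'Text'        => 'Fill in a short form with personal information...',
          'EndpointArn' => 'arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/contact-form-spam', // placeholder
      ]);

      // Each class comes back with a confidence score, e.g. spam vs. ham.
      foreach ($result['Classes'] as $class) {
          printf("%s: %s\n", $class['Name'], $class['Score']);
      }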

      I then appended our contact form ham/spam records to the end of that SMS dataset and trained another classifier model. I then tested our contact form entries using the new model, which matched our human ham/spam assessments 95% of the time. I'd be delighted to get such results on incoming novel ham/spam contact form entries.

      I would point out that we only had about 41 contact form entries to add and test. I found it quite interesting that even when we added our own 41 entries to the model's training set, the model still failed to classify two of these entries correctly. One entry was falsely classified as ham when it should have been spam, but only by an extremely narrow margin:
      spam: 0.49563866853714
      ham: 0.50436133146286

      The other failure was a false positive for spam for the exceedingly terse message "digital marketing assistance."

      Very interesting that the model would fail to properly classify records included as part of its training set.

      sneakyimp Very interesting that the model would fail to properly classify records included as part of its training set.

      Probably not unexpected, since in theory it's iterating through various neural network combinations based on the content and then evaluating against the expected outcomes; but the spam/ham indicator is not part of the data it's examining -- it's just used to evaluate the results of each iteration. I.e., if "cars for sale" is flagged as spam but "carts for sails" and "ears four pale" are not (not really a good example, but you get the idea), it may lump all 3 together as being essentially the same, and since it's 2:1 ham, classify all 3 as ham. Okay, a horrible example, but the general idea is that by its very nature, it can be hard to discern how the learning model parsed, categorized, and grouped everything -- but 95% is pretty darned good, IMHO.

      NogDog the spam/ham indicator is not part of the data it's examining

      Strictly speaking, the spam/ham indication is most definitely part of the training set. The CSV I provided has two columns: the message and the spam/ham indication. I don't recall the exact mechanism by which a neural network gets trained, but I vaguely recall that training involves supplying a set of inputs (the message, or 'document') along with an expected output. The 'training' involves adjusting the neural network's weights (a matrix/array of nodes and connections) according to some algorithm. I think I vaguely understand why it might fail to get the right classification for an item from its own training set -- your explanation is a decent one: the algorithm's action is approximate and applies broadly, in a 'fuzzy logic' sense. We are training it with broad, imprecise notions in the hope that it can properly deal with unpredictable or novel input.

      A few observations about AWS in particular:

      • when creating an API endpoint for your model, you have to choose its desired performance level in 'inference units' (see the sketch after this list). I somewhat arbitrarily chose 5 inference units and haven't the foggiest notion how much this might cost. I'd also point out that it was very, very easy to exceed the allowed usage, which triggers throttling exceptions from the API.
      • you can't continue to train a given model once you create it. If you want new training, you create a new model and provide it with a new training dataset.
      • I haven't seen any explanation from AWS about how these models work internally: whether they are neural networks, use Bayesian analysis, or rely on some other sort of application logic. Unless I'm missing something, they are a total black box in terms of functionality.
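
      Here, roughly, is what creating (and tearing down) an endpoint looks like with the PHP SDK -- the names, ARNs, region, and inference-unit count are placeholders:

      <?php
      require 'vendor/autoload.php';

      use Aws\Comprehend\ComprehendClient;

      $client = new ComprehendClient(['region' => 'us-east-1', 'version' => 'latest']);

      // Create an endpoint for the trained model and pick its capacity in
      // inference units. You are billed for as long as the endpoint exists.
      $result = $client->createEndpoint([
          'EndpointName'          => 'contact-form-spam-endpoint', // placeholder
          'ModelArn'              => 'arn:aws:comprehend:us-east-1:111122223333:document-classifier/contact-form-spam', // placeholder
          'DesiredInferenceUnits' => 1,
      ]);
      $endpointArn = $result['EndpointArn']; // this is what classifyDocument() needs

      // ...and delete it as soon as you're done testing, to stop the meter:
      $client->deleteEndpoint(['EndpointArn' => $endpointArn]);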

      Has anyone worked with a PHP neural network library? I see various options when I google: PHP-ML, Rubix, FANN.

      I've just realized that AWS Comprehend endpoints are exorbitantly expensive. If I'm not mistaken, my endpoint cost me $9/hr.

      sneakyimp Strictly speaking, the spam/ham indication is most definitely part of the training set.

      If my understanding is correct (not a 100% given in anything to do with ML/AI 😉 ), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.

      NogDog If my understanding is correct (not a 100% given in anything to do with ML/AI 😉 ), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.

      I believe you have described it precisely.

      🚨🚨🚨WARNING🚨🚨🚨: Amazon Comprehend is EXPENSIVE. I set up my endpoint with a mere 5 inference units and decided I'd leave it running just in case I wanted to run another test, then ended up carelessly leaving it running for 24 hours. I just checked my billing dashboard and this apparently cost me $267.

      sneakyimp

      Yikes! I think the guy working on our stuff right now does a lot of things locally now (probably mainly Python-based?), and when running on AWS it's some sort of thing where it spins up a container just for that process. But then we already have all sorts of things running on AWS full-time, so that's probably a different use case for us. 😐

        5 days later

        SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.

        Are you experiencing stuff getting past a Captcha?

        dalecosp SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.

        I feel (perhaps incorrectly) that SpamAssassin is widely used around the world and that it makes use of coordinated/centralized spam-tracking data and logic. I'd also point out that its sa-learn functionality allows you to 'train' it. I did like the fact that it seems to check the Spamhaus DB for domains that appear in the message's URLs. That said, it seems very specifically designed to parse email messages, and doesn't seem useful if you are parsing contact form data.

        dalecosp Are you experiencing stuff getting past a Captcha?

        YES. Most contact form entries are spam. I'm currently thinking either a) reCAPTCHA has been cracked by somebody or b) this spam comes from human workers, probably in some low-wage spam sweatshop.

          a month later

          I know that no one has asked, but I've been taking a Machine Learning course from Coursera presented by Andrew Ng of Stanford University. Although they use MATLAB/Octave for the programming exercises, the course provides a pretty extensive explanation of the underlying mathematics. There's a lot of matrix multiplication and algebra, some borderline calculus stuff.

          So, having seen the underlying operations of logistic regression and artificial neural networks, it occurs to me that a simple ordered sequence of bytes -- the spam message one receives, as UTF-8 chars -- doesn't seem especially useful as an input for a neural network spam detector. I started thinking about what might actually be useful for describing the essence of a mail message or contact form submission, and it seemed that a ranked word-frequency list might be quite helpful. A list of the words in a given message, ordered by decreasing frequency, seems like it might be quite apt for this sort of thing. It then occurred to me that an ordered list of word hashes might be better, because then each word (normalized to upper or lowercase) would be reduced to just a few bytes, regardless of length. If anyone has thoughts about what distillations of a free-form text input might be useful for a spam detector, I'd be delighted to hear them.
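
          To make that concrete, here is a quick PHP sketch of the sort of distillation I mean -- a ranked word-frequency list, plus a crude hashed variant. The function names and the 8-character hash length are just illustrative choices, and the normalization is ASCII-only for simplicity:

          <?php
          // Distill a free-form message into a word => count map, ordered by
          // decreasing frequency.
          function word_frequencies(string $message): array
          {
              // normalize case and strip everything but letters/digits/whitespace
              $clean = preg_replace('/[^a-z0-9\s]+/', ' ', strtolower($message));
              $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
              $counts = array_count_values($words);
              arsort($counts); // highest-frequency words first
              return $counts;  // e.g. ['fill' => 2, 'form' => 2, ...]
          }

          // Variant: reduce each word to a short fixed-length hash so every
          // "feature" is the same size regardless of word length.
          function word_hashes(string $message): array
          {
              $hashed = [];
              foreach (word_frequencies($message) as $word => $count) {
                  $hashed[substr(md5($word), 0, 8)] = $count;
              }
              return $hashed;
          }

          print_r(word_frequencies('Fill in a short form with personal information. Fill the form now!'));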