Thanks very much for the detail. Regarding the uci.edu data:
Date Donated 1999-07-01
I think I'll try and cook up a data set from one of my gmail accounts. I'm having some luck using the PHP IMAP extension.
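In case it's useful, here's roughly the shape of what I'm doing -- very much a sketch, with the mailbox label, credentials, and output file as placeholders rather than my actual setup:

<?php
// Sketch: pull messages from a Gmail label via the PHP IMAP extension
// and append them to a CSV of label,message rows. Credentials, label
// name, and output file are placeholders.
$mailbox  = '{imap.gmail.com:993/imap/ssl}[Gmail]/Spam';
$user     = 'me@example.com';
$password = 'app-password-here';

$imap = imap_open($mailbox, $user, $password);
if ($imap === false) {
    die('imap_open failed: ' . imap_last_error());
}

$csv  = fopen('spam_corpus.csv', 'a');
$uids = imap_search($imap, 'ALL', SE_UID);

foreach ((array) $uids as $uid) {
    // Fetch the plain-text body (section 1); real-world messages may need
    // multipart handling and charset/transfer-encoding conversion.
    $body = imap_fetchbody($imap, $uid, '1', FT_UID);
    $body = trim(preg_replace('/\s+/', ' ', $body));
    fputcsv($csv, ['spam', $body]);
}

fclose($csv);
imap_close($imap);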
NogDog Should have put some time restrictions on the google query.
I'm curious what your query was, exactly. I've been googling for 'spam message training set' and 'spambase dataset' and UCI comes up again and again -- and they only have a few spam datasets. I've seen the old '97 dataset mentioned in a few academic papers. Looks to me like good spam filter training datasets are quite rare.
Also, could you explain that spambase dataset's format to me? The data doesn't contain any actual spam messages from what I can tell. The spambase.data file is just a CSV of numeric values.
I created a custom classifier model using AWS Comprehend and trained it with the SMS spam dataset here. I was quite surprised at how difficult it is to find spam filter training sets.
I then wrote a script to use this classifier model to test the various examples we've received on a contact form. The AWS model matched our manual (i.e., human-entered) spam/ham assessments 66% of the time. I wasn't sure whether to be disappointed at these results or surprised at how good they were, given that the dataset I used is for SMS text messages, which are quite short.
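For what it's worth, the test script boils down to something roughly like this (a sketch using the AWS SDK for PHP; the endpoint ARN, region, and sample text are placeholders, and credentials are assumed to come from the environment):

<?php
use Aws\Comprehend\ComprehendClient;

require 'vendor/autoload.php';

// Sketch: classify one contact-form message against a Comprehend custom
// classifier endpoint. The region and ARN below are placeholders.
$client = new ComprehendClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

$message = 'digital marketing assistance';

$result = $client->classifyDocument([
    'EndpointArn' => 'arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/spam-ham',
    'Text'        => $message,
]);

// Each class comes back with a confidence score between 0 and 1,
// e.g. spam: 0.49..., ham: 0.50...
foreach ($result['Classes'] as $class) {
    printf("%s: %s\n", $class['Name'], $class['Score']);
}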
I then appended our contact form ham/spam records to the end of that SMS dataset and trained another classifier model. I then tested our contact form entries using the new model, which matched our human ham/spam assessments 95% of the time. I'd be delighted to get such results on incoming novel ham/spam contact form entries.
I would point out that we only had about 41 contact form entries to add and test. I found it quite interesting that even when we added our own 41 entries to the model's training set, the model still failed to classify two of those entries correctly. One entry was falsely classified as ham when it should have been spam, but only by an extremely narrow margin:
spam: 0.49563866853714
ham: 0.50436133146286
The other failure was a false positive for spam for the exceedingly terse message "digital marketing assistance."
Very interesting that the model would fail to properly classify records included as part of its training set.
sneakyimp Very interesting that the model would fail to properly classify records included as part of its training set.
Probably not unexpected, since in theory it's iterating through various neural network combinations based on the content and then evaluating against the expected outcomes; but the spam/ham indicator is not part of the data it's examining -- just used to evaluate the results of each iteration. I.e., if "cars for sale" is flagged as spam but "carts for sails" and "ears four pale" are not (not really a good example, but you get the idea), it may lump all 3 together as being essentially the same, and since it's 2:1 ham, classify all 3 as ham. Okay, a horrible example, but the general idea is that by its very nature, it can be hard to discern how the learning model parsed, categorized, and grouped everything -- but 95% is pretty darned good, IMHO.
NogDog the spam/ham indicator is not part of the data it's examining
Strictly speaking, the spam/ham indication is most definitely part of the training set. The CSV I provided has two columns: the message and spam/ham indication. I don't recall the exact mechanism by which a neural network gets trained, but I vaguely recall that training involves supplying a set of inputs (the message or 'document') and also an expected output. The 'training' involves adjustment of neural network weights (a matrix/array of nodes and connections) according to some algorithm. I think I vaguely understand why it might fail to get the right classification for an item from its training set -- I think your explanation is a decent one: the algorithm's action is approximate and applies broadly, in a 'fuzzy logic' sense. We are training it with broad, imprecise notions in the hope that it can properly deal with unpredictable or novel input.
A few observations about AWS in particular:
Has anyone worked with a PHP neural network library? I see various options when I google: PHP-ML, Rubix, FANN.
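I haven't committed to any of them yet, but if I'm reading the PHP-ML docs correctly, a minimal supervised-training example looks roughly like this. Note that the spam/ham label goes in as the expected output ($targets), not as part of the feature vector; the numeric features here are invented purely for illustration:

<?php
use Phpml\Classification\MLPClassifier;

require 'vendor/autoload.php';

// Three made-up numeric features per message (e.g., word counts).
$samples = [
    [12, 0, 3],
    [2, 5, 0],
    [11, 1, 4],
    [1, 6, 0],
];
$targets = ['spam', 'ham', 'spam', 'ham'];

// 3 input features, one hidden layer of 4 neurons, two output classes.
$mlp = new MLPClassifier(3, [4], ['spam', 'ham']);
$mlp->train($samples, $targets);

print_r($mlp->predict([[10, 0, 2], [0, 7, 1]])); // hopefully ['spam', 'ham']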
I've just realized that AWS Comprehend endpoints are exorbitantly expensive. If I'm not mistaken, my endpoint cost me $9/hr.
sneakyimp Strictly speaking, the spam/ham indication is most definitely part of the training set.
If my understanding is correct (not a 100% given in anything to do with ML/AI), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.
NogDog If my understanding is correct (not a 100% given in anything to do with ML/AI), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.
I believe you have described it precisely.
WARNING: Amazon Comprehend is EXPENSIVE. I set up my endpoint with a mere 5 inference unit capability and decided I'd leave it running just in case I wanted to run another test and ended up carelessly leaving it running for 24 hours. Just checked my billing dashboard and this apparently cost me $267.
Yikes! I think the guy working on our stuff right now does a lot of things locally now (probably mainly Python-based?), and when running on AWS it's some sort of thing where it spins up a container just for that process. But then we already have all sorts of things running on AWS full-time, so that's probably a different use case for us.
Given my experience with AWS, I'm looking into FANN, which has PHP bindings. It looks quite complicated and only has one trivial example showing XOR training. If anyone around here has any experience with coding neural networks, I'd appreciate whatever broad advice or suggestions you may have. For now, I'm starting here.
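To make sure I understand that XOR example, I tried adapting its shape to the spam problem. This is only a sketch: the feature layout and the spam.data file are hypothetical, since I haven't decided what the input features should actually be.

<?php
// Sketch adapted from the PHP FANN extension's XOR example. Assumes a
// training file "spam.data" in FANN's format: a header line of
// "<num_pairs> <num_inputs> <num_outputs>", then alternating lines of
// inputs and expected outputs (1 = spam, -1 = ham).
$num_input              = 100;   // hypothetical: 100 word-frequency features per message
$num_hidden             = 20;
$num_output             = 1;
$desired_error          = 0.001;
$max_epochs             = 100000;
$epochs_between_reports = 100;

$ann = fann_create_standard(3, $num_input, $num_hidden, $num_output);

if ($ann) {
    fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);

    if (fann_train_on_file($ann, 'spam.data', $max_epochs, $epochs_between_reports, $desired_error)) {
        fann_save($ann, 'spam_float.net');
    }

    fann_destroy($ann);
}

// Later, to classify a new message:
// $ann = fann_create_from_file('spam_float.net');
// $out = fann_run($ann, $featureVector); // $featureVector: 100 floats; $out: array with 1 value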
SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.
Are you experiencing stuff getting past a Captcha?
dalecosp SpamAssassin dates back to a time when contact-form spam was pretty much an unknown malfeasance, so that finding isn't surprising ... you could play with scores in the CONF directories/files to massage that, but it's kind of like using a hammer to turn a screw.
I feel (perhaps incorrectly) that SpamAssassin is widely used around the world and that it makes use of coordinated/centralized spam-tracking data/logic. I'd also point out that its sa-learn functionality allows you to 'train' it. I did like the fact that it seems to check the Spamhaus DB for domains that appear in the message's URLs. That said, it seems very specifically designed to parse email messages, and doesn't seem useful if you are parsing contact form data.
dalecosp Are you experiencing stuff getting past a Captcha?
YES. Most contact form entries are spam. I'm currently thinking either a) reCAPTCHA has been cracked by somebody, or b) this spam comes from human workers, probably in some low-wage spam sweatshop.
I know that no one has asked, but I've been taking a Machine Learning course from Coursera presented by Andrew Ng of Stanford University. Although they use MATLAB/Octave for the programming exercises, the course provides a pretty extensive explanation of the underlying mathematics. There's a lot of matrix multiplication and algebra, some borderline calculus stuff.
So, having seen the underlying operations of logistic regression and artificial neural networks, it occurs to me that a simple ordered sequence of bytes -- the spam message one receives as UTF-8 chars -- doesn't seem especially useful as an input for a neural network spam detector. I started thinking about what might actually be useful for describing the essence of a mail message or contact form submission, and it seemed that a ranked word frequency list might be quite helpful: a list of the words in a given message, ordered by decreasing frequency, seems like it might be quite apt for this sort of thing. It then occurred to me that an ordered list of word hashes might be better, because then each word (normalized to upper or lowercase) would be reduced to just a few bytes, regardless of length. If anyone has thoughts about what distillations of a free-form text input might be useful for a spam detector, I'd be delighted to hear them.
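To make that concrete, here's the sort of distillation I have in mind -- purely a sketch, and crc32 is just a stand-in for whatever hash turns out to be suitable:

<?php
// Sketch: distill a message into a ranked word-frequency list, then
// into a list of fixed-size word hashes.
function rankedWordFrequencies(string $message): array
{
    $words  = str_word_count(mb_strtolower($message), 1);
    $counts = array_count_values($words);
    arsort($counts);   // most frequent words first
    return $counts;    // e.g. ['free' => 3, 'viagra' => 2, 'now' => 1]
}

function rankedWordHashes(string $message): array
{
    // Replace each word with a 4-byte hash so word length no longer matters.
    return array_map('crc32', array_keys(rankedWordFrequencies($message)));
}

print_r(rankedWordFrequencies('Free free FREE viagra viagra now'));
print_r(rankedWordHashes('Free free FREE viagra viagra now'));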
Once again, no one has asked, but I'm chuffed to report that the week 7 programming assignment involved creating a spam filter. The weekly lectures explained how to create a Support Vector Machine (SVM) spam filter. Broadly speaking, the process works like this:
1. Pre-process/normalize each message (the specific steps are listed below).
2. Analyze the whole corpus to build a vocabulary of the most frequently used words.
3. Convert each message into a binary feature vector x, where x_i = 1 if the i-th vocabulary word appears in the message and 0 otherwise.
4. Train an SVM classifier on those feature vectors and their known spam/ham labels.
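The course exercise does all of this in Octave with matrix operations. Translated into PHP terms -- purely as a sketch using PHP-ML, with a toy vocabulary and toy messages rather than a real corpus -- the shape of the pipeline is roughly this:

<?php
use Phpml\Classification\SVC;
use Phpml\SupportVectorMachine\Kernel;

require 'vendor/autoload.php';

// Toy vocabulary; the real one comes from the most frequent words
// observed across the whole (pre-processed) training corpus.
$vocabulary = ['buy', 'cheap', 'meds', 'meeting', 'tomorrow', 'invoice'];

// Convert a pre-processed message into a binary vector: 1 if the
// vocabulary word appears in the message, 0 otherwise.
function toFeatureVector(string $message, array $vocabulary): array
{
    $words = str_word_count(strtolower($message), 1);
    return array_map(fn ($w) => in_array($w, $words, true) ? 1 : 0, $vocabulary);
}

$messages = [
    'buy cheap meds now'          => 'spam',
    'meeting tomorrow at noon'    => 'ham',
    'cheap meds buy today'        => 'spam',
    'invoice attached for review' => 'ham',
];

$samples = [];
$labels  = [];
foreach ($messages as $text => $label) {
    $samples[] = toFeatureVector($text, $vocabulary);
    $labels[]  = $label;
}

$svm = new SVC(Kernel::LINEAR, 1.0);   // linear kernel, C = 1.0
$svm->train($samples, $labels);

print_r($svm->predict([
    toFeatureVector('cheap meds for sale', $vocabulary),
    toFeatureVector('see you at the meeting tomorrow', $vocabulary),
])); // hopefully ['spam', 'ham']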
As part of the assignment, I trained an SVM classifier using the entire SpamAssassin public corpus, which seems pretty old. It has some 10,750 messages in it. It took about 10 hours for my Octave code to analyze the words in use and generate a vocabulary. It took 2-3 hours to convert all the messages into their binary x-vectors, and then it took maybe a couple of minutes to train a fresh SVM classifier on the result. My code claims a 99.6% success rate on the cross-validation data and a 99.1% success rate on the test data (i.e., novel messages not used in training or cross-validation).
HOWEVER, I attempted to use this classifier on some contact form submissions from a website and, testing recent submissions, we got 60% false positives -- i.e., it declared 3 out of 5 ham messages to be spam. This is not good. I'm currently working on incorporating some of my own ham messages into the training set to see if that improves matters.
If anyone is curious, these are the specific pre-processing steps:
- lower-case the entire message
- strip out HTML tags
- replace every URL with the token 'httpaddr'
- replace every email address with the token 'emailaddr'
- replace every number with the token 'number'
- replace every dollar sign with the token 'dollar'
- stem each word (e.g., 'discount', 'discounts', and 'discounted' all become 'discount')
- strip out punctuation and any remaining non-word characters, and collapse whitespace
It might seem counterintuitive to remove so much useful information that might indicate whether a message is spam or not, but this reduction is intended to eliminate information that is too specific.
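In PHP terms, that normalization amounts to something like the following -- a rough sketch of those steps, not the course's actual Octave code, and the stemming step is omitted for brevity:

<?php
// Sketch of the pre-processing/normalization steps (stemming omitted).
function preprocessMessage(string $message): string
{
    $message = strtolower($message);                                       // lower-case
    $message = preg_replace('/<[^<>]+>/', ' ', $message);                  // strip HTML tags
    $message = preg_replace('#(https?|ftp)://\S+#', 'httpaddr', $message); // normalize URLs
    $message = preg_replace('/\S+@\S+/', 'emailaddr', $message);           // normalize email addresses
    $message = preg_replace('/[0-9]+/', 'number', $message);               // normalize numbers
    $message = str_replace('$', 'dollar ', $message);                      // normalize dollar signs
    $message = preg_replace('/[^a-z ]/', ' ', $message);                   // drop punctuation/non-words
    return trim(preg_replace('/\s+/', ' ', $message));                     // collapse whitespace
}

echo preprocessMessage('Visit http://example.com NOW and send $100 to me@spam.com!');
// visit httpaddr now and send dollar number to emailaddr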