As we have seen on this forum, spam from human worker bees continues to be a difficult problem to solve. The CAPTCHA offered by Google seems to be a decent way to prevent robots -- and a heavyweight like Google can surely solve this problem better than some homespun solution -- but it doesn't prevent actual humans from spamming one's contact form. I am looking into a way to prevent this sort of spam.

I've installed SpamAssassin on servers in the past, but it's been quite a while. IIRC, its configuration has quite a few different values you can tweak, and the munging of spam email requires a fair amount of computing resources. There are whitelists and blacklists for senders and recipients. I think it might also check spam blacklists like Spamhaus.

We've also seen things like DMARC, SPF, and DKIM to try to check whether a machine is authorized to send email for a particular domain. Sadly, such tools don't seem applicable to a web contact form. A user on their phone is unlikely to have credentials we can check, like domain registration records.

As for AI approaches that might serve this purpose, I'm only familiar with a couple of broad categories of machine learning:

  • Genetic Algorithm
  • Neural Network

A genetic algorithm doesn't really seem applicable. It works by repeatedly applying random tweaks to a population of candidate solutions and using some fitness algorithm to evaluate the SCORE of each one. This approach seems to presuppose that we already have some algorithm to generate a spam score.
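
To make that concrete, here's a minimal sketch of that loop (hypothetical PHP; every function and data point in it is made up for illustration). The sticking point is evaluateFitness() -- it presumes we can already score candidate solutions against labeled spam/ham examples, which is precisely the piece we're missing.

<?php
// Hypothetical genetic-algorithm loop: evolve a weight vector over some spam "features".
// The labeled examples and features below are made up.

// Tiny placeholder dataset: [featureVector, isSpam] pairs.
$examples = [
    [[0.9, 0.1, 0.8], true],
    [[0.1, 0.7, 0.2], false],
];

function randomCandidate(int $numFeatures): array {
    return array_map(fn() => mt_rand() / mt_getrandmax(), range(1, $numFeatures));
}

function mutate(array $weights): array {
    return array_map(fn($w) => $w + (mt_rand() / mt_getrandmax() - 0.5) * 0.1, $weights);
}

// Fitness = fraction of labeled examples a candidate weight vector classifies correctly.
// This is the part that presupposes we already know how to score spam.
function evaluateFitness(array $weights, array $examples): float {
    $correct = 0;
    foreach ($examples as [$features, $isSpam]) {
        $score = 0.0;
        foreach ($features as $i => $value) {
            $score += $weights[$i] * $value;
        }
        if (($score > 0.5) === $isSpam) {
            $correct++;
        }
    }
    return $correct / count($examples);
}

$population = array_map(fn() => randomCandidate(3), range(1, 50));
for ($gen = 0; $gen < 100; $gen++) {
    // Rank by fitness, keep the best half, refill the population via random mutation.
    usort($population, fn($a, $b) => evaluateFitness($b, $examples) <=> evaluateFitness($a, $examples));
    $survivors  = array_slice($population, 0, 25);
    $population = array_merge($survivors, array_map('mutate', $survivors));
}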

A neural network might be applicable, if one artfully identifies useful inputs and is able to 'train' the network with some dataset of positive and negative -- i.e., spam and not-spam -- examples. Assuming we have some useful distillation of inputs related to the contact form submission, we might be able to train a neural net with an initial dataset and then continue training it over time by allowing trusted users to apply additional training via some simple UI.

A contact form has some obvious data points (a rough sketch of collecting them follows this list):

  • any data in $_SERVER, $_GET, $_POST. The REMOTE_ADDR seems of particular interest.
  • reCAPTCHA data acquired from the form. Google reCAPTCHA v3 offers an invisible method that scores the request without hassling users with a challenge.
  • return address, submitted message
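
Here's a rough sketch of gathering those signals at submission time. The siteverify URL is Google's documented reCAPTCHA verification endpoint; the form field names ('email', 'message', 'g-recaptcha-response') and the RECAPTCHA_SECRET_KEY constant are just assumptions about how the form might be wired up.

<?php
// Gather the obvious signals from a contact form POST (field names are assumptions).
$signals = [
    'remote_addr' => $_SERVER['REMOTE_ADDR'] ?? '',
    'user_agent'  => $_SERVER['HTTP_USER_AGENT'] ?? '',
    'return_addr' => trim($_POST['email'] ?? ''),
    'message'     => trim($_POST['message'] ?? ''),
];

// Verify the reCAPTCHA v3 token server-side; v3 returns a 0.0-1.0 score rather than pass/fail.
$context = stream_context_create(['http' => [
    'method'  => 'POST',
    'header'  => 'Content-Type: application/x-www-form-urlencoded',
    'content' => http_build_query([
        'secret'   => RECAPTCHA_SECRET_KEY, // placeholder constant for your secret key
        'response' => $_POST['g-recaptcha-response'] ?? '',
        'remoteip' => $signals['remote_addr'],
    ]),
]]);
$response  = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context);
$recaptcha = json_decode($response, true);
$signals['recaptcha_score'] = !empty($recaptcha['success']) ? ($recaptcha['score'] ?? 0.0) : 0.0;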

It seems to me that the origin IP address and the contents of the submitted message need particular scrutiny. Checking the IP address against a spam blocklist like Spamhaus seems helpful. As for the message content itself, I haven't really got any useful thoughts about how to assess its 'spamminess' beyond keeping a dataset of historical spam messages and calculating some kind of statistical correlation between incoming messages and prior spam messages -- sorta like how a vaccine works: "hey, now the surface of this incoming message feels a lot like SPAM-EXAMPLE-123 in my database of spam".
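
For the IP check specifically, a DNSBL like Spamhaus can be queried straight from PHP by reversing the address's octets and doing a DNS lookup against the blocklist zone -- no API integration required. A minimal sketch (zen.spamhaus.org is their combined zone, their usage terms apply, and this only handles IPv4):

<?php
// Check an IPv4 address against a DNS blocklist: a listed address resolves under
// <reversed-octets>.<zone>, an unlisted one does not.
function isListedOnDnsbl(string $ip, string $zone = 'zen.spamhaus.org'): bool
{
    if (!filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
        return false; // this sketch only handles IPv4
    }
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    return checkdnsrr($reversed . '.' . $zone, 'A');
}

// Example: treat a listing as one strong input to an overall spam score.
if (isListedOnDnsbl($_SERVER['REMOTE_ADDR'] ?? '')) {
    // bump the spam score, or reject outright, depending on your tolerance
}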

Any thoughts, discussion, or anecdotes would be much appreciated.

    No great technical thoughts, but something to consider when weighing options: what is the acceptable amount of false positives? I.e., if whatever you come up with rejects 99+% of the spam inputs, but also rejects 1% of all the valid inputs as well, is that good or bad overall for the product/service you are providing? Presumably the answer will vary from product to product, anywhere from zero tolerance to some non-zero amount. 🤷

    NogDog what is the acceptable amount of false positives? I.e., if whatever you come up rejects 99+% of the spam inputs, but also rejects 1% of all the valid inputs as well...

    I had definitely wondered this myself, and it's important to consider. Seems to me that any potential user of this AI tool should have some ability to tweak settings and choose their own tolerance level for spam in some way. I'd point out that reCaptcha V3 has moved toward returning a score instead of true/false.

      Yeah, in my experience with ML (mainly as an interested team member but not closely involved), they usually have had some 0.0 - 0.999... result returned. In one case, if the score was over one threshold, the action was simply taken; if it was less than that but greater than another threshold, then it was flagged for "human intervention".
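
      Roughly this shape, with made-up thresholds and placeholder function names:

      // Illustrative two-threshold handling of a 0.0 - 1.0 "ham confidence" score.
      // Thresholds and handler functions are placeholders -- tune to your own
      // tolerance for false positives.
      if ($score >= 0.90) {
          acceptSubmission();        // confident enough: just do the action
      } elseif ($score >= 0.50) {
          queueForHumanReview();     // gray area: flag for "human intervention"
      } else {
          rejectSubmission();        // confident it's spam
      }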

        9 days later

        Just an experiential aside: if you need to scale your AI, better make it a couple orders of magnitude smarter than you think is acceptable.

          Anyone at all interested in this sort of spam-block-via-machine-learning thread might be interested to know that Amazon Comprehend (Natural Language Processing) does not offer any specific spam-recognition API. It does, however, offer a variety of pre-trained 'models' and also functionality to let you 'train' your own. As best I can tell, these models appear to be a sort of black-box implementation of a neural network. The pre-trained, out-of-the-box models listed here include key phrases, sentiment, PII, syntax, etc. I thought these might be useful, but the data I've seen them return would need some additional algorithmic work, or possibly another layer of machine learning, to distinguish SPAM vs NOT-SPAM.
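
          For anyone curious, calling those pre-trained detectors from PHP looks roughly like this with the AWS SDK for PHP (composer require aws/aws-sdk-php; credentials setup elided, sample text made up):

          <?php
          require 'vendor/autoload.php';

          use Aws\Comprehend\ComprehendClient;

          $client = new ComprehendClient(['region' => 'us-east-1', 'version' => 'latest']);

          $text = 'Fill in this short form and win a free cruise!'; // made-up sample

          // Pre-trained models: no training required, but the results still need another
          // layer of logic before they say anything about SPAM vs NOT-SPAM.
          $sentiment = $client->detectSentiment(['Text' => $text, 'LanguageCode' => 'en']);
          $phrases   = $client->detectKeyPhrases(['Text' => $text, 'LanguageCode' => 'en']);

          echo $sentiment['Sentiment'], PHP_EOL; // POSITIVE / NEGATIVE / NEUTRAL / MIXED
          foreach ($phrases['KeyPhrases'] as $phrase) {
              echo $phrase['Text'], ' (', round($phrase['Score'], 2), ')', PHP_EOL;
          }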

          They also let you create and train your own models to classify documents by assigning your own classes. You can also train a model to do entity recognition. Training involves submitting pre-classified documents/entities either as CSV or in some other, more elaborate JSON format. I might do some testing to see how I can train one of these models.
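
          If I'm reading the docs right, the CSV for a custom classifier (in multi-class mode) is just one row per training document: the class label in the first column and the document text in the second. Something like this, with made-up rows:

          SPAM,"Congratulations! You have been selected to receive a free cruise. Click here to claim."
          NOT-SPAM,"Hi, I saw your portfolio and wanted to ask about your rates for a small website project."
          SPAM,"Make $5000 a week working from home, no experience needed!"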

            5 days later

            I am also sad to report that spamassassin has (so far) been quite disappointing at recognizing spam in an online form submission. It appears to be designed entirely around parsing a mail message format, and most of its functionality hinges on making sure that all the necessary mail headers exist (To, From, Received, Subject, etc.) and that they obey various header-related rules.

            I tested spamassassin on Ubuntu by installing it with sudo apt install spamassassin and then running it on a text file containing a spam message. The output has a brief section describing its spam findings. Running it in local-only mode (the -L option, which skips online checks like Spamhaus) yields this bit of analysis:

            Content analysis details:   (6.2 points, 5.0 required)
            
             pts rule name              description
            ---- ---------------------- --------------------------------------------------
            -0.0 NO_RELAYS              Informational: message was not relayed via SMTP
             0.9 MISSING_HEADERS        Missing To: header
             1.0 PP_MIME_FAKE_ASCII_TEXT BODY: MIME text/plain claims to be ASCII
                                         but isn't
             2.7 MISSING_DATE           Missing Date: header
             1.0 MISSING_FROM           Missing From: header
            -0.0 NO_RECEIVED            Informational: message has no Received headers
             0.6 MISSING_MID            Missing Message-Id: header
             0.0 MISSING_SUBJECT        Missing Subject: header
             0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                                        headers
             0.0 T_FILL_THIS_FORM_SHORT Fill in a short form with personal
                                        information

            While missing headers certainly seem like a problem for messages received via SMTP or some mail protocol, these checks are hardly applicable to contact form spam. Also, most of those headers can easily be added to the spam message text, at which point the message will no longer be considered spam, even if it is extremely spammy.
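
            To illustrate the point: wrapping the form submission in a synthetic RFC-822-style envelope before piping it to spamassassin silences the missing-header rules and leaves only the body rules, which is why the score collapses. A rough, untested sketch (the header values are fabricated placeholders):

            <?php
            // Wrap a contact form submission in a minimal synthetic mail envelope so that
            // spamassassin's missing-header rules no longer fire.
            $message = trim($_POST['message'] ?? '');
            $from    = trim($_POST['email'] ?? 'anonymous@example.com');

            $envelope = "From: {$from}\r\n"
                      . "To: webmaster@example.com\r\n"
                      . "Subject: Contact form submission\r\n"
                      . "Date: " . date('r') . "\r\n"
                      . "Message-Id: <" . uniqid('', true) . "@example.com>\r\n"
                      . "\r\n"
                      . $message . "\r\n";

            // Pipe the synthetic message to spamassassin (-t adds the report even below threshold).
            $proc = proc_open('spamassassin -L -t', [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], $pipes);
            fwrite($pipes[0], $envelope);
            fclose($pipes[0]);
            $report = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            proc_close($proc);
            echo $report;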

            If you drop the -L option, the spamassassin check is slower:
            spamassassin spam.txt

            and the only additional check I've noticed is the Spamhaus check, which does recognize a spammy URL in the message text, but that barely adds anything to the spam score.

            Content analysis details:   (6.6 points, 5.0 required)
            
             pts rule name              description
            ---- ---------------------- --------------------------------------------------
            -0.0 NO_RELAYS              Informational: message was not relayed via SMTP
             1.2 MISSING_HEADERS        Missing To: header
             1.0 PP_MIME_FAKE_ASCII_TEXT BODY: MIME text/plain claims to be ASCII
                                         but isn't
             0.1 URIBL_SBL_A            Contains URL's A record listed in the Spamhaus SBL
                                        blocklist
                                        [URIs: talkwithwebvisitors.com]
             1.0 MISSING_FROM           Missing From: header
            -0.0 NO_RECEIVED            Informational: message has no Received headers
             1.4 MISSING_DATE           Missing Date: header
             1.8 MISSING_SUBJECT        Missing Subject: header
             0.1 MISSING_MID            Missing Message-Id: header
             0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                                        headers
             0.0 T_FILL_THIS_FORM_SHORT Fill in a short form with personal
                                        information

            I noticed that these checks don't appear to mention DKIM or SPF. I guess those might be plugins. Sadly, these checks are also not applicable to contact form spam.

              I think tomorrow I will try to train an AWS Comprehend document classifier model to classify text strings as SPAM or NOT-SPAM. Anyone have suggestions about where I could get a great big sample of spam and not-spam plaintext messages? I was thinking I might use my own email accounts. I suspect a Gmail download will exclude the junk mail, which sorta defeats the purpose.

              Any thoughts on training an AWS Comprehend document classifier would also be welcome. It's not clear from the documentation, but it kinda seems like you can only train your model once.

              sneakyimp thoughts on training an AWS Comprehend document classifier

              One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

              NogDog One thing I learned is that you want your training set to be close to a 50/50 mix of spam and no-spam examples. If it's skewed too much in either direction, then the developed algorithm tends to be biased in that direction.

              That's an extremely helpful bit of info. So you've worked with AWS Comprehend?

              sneakyimp So you've worked with AWS Comprehend?

              Not specifically. A couple/three years ago I helped out on a ML project trying to identify problematic comments in review submissions. We used Amazon ML tools, though I don't recall which specific ones. My role mainly was helping to provide data for training/evaluation. I learned some stuff about ML by osmosis and a bit of tinkering around, but haven't really touched it since then.

              We got some pretty promising results, but then the business moved in a different direction and it all got put on hold. We're now using ML for a very different purpose (more for predicting things, versus evaluating things), but I'm not closely involved, other than helping out with a few ancillary support things. I think we're using a mix of AWS tools along with some open-source libraries and such. 🤷

              NogDog My role was helping to provide data for training/evaluation for the main part.

              It's extremely interesting that a 50/50 mix is important. Do you have any thoughts on how many records are required to train? I'd also be very curious about how you cooked up your training set. I'm sure I can initialize the model, get PHP talking to it, etc., but I'm struggling a bit to find an efficient way to cook up the training set. I have perhaps 30 existing contact form submissions, most of which are spam. Also, I believe the contact form submissions will be plain text rather than HTML. I was considering trying to export data from some mail account and using strip_tags on the raw email. Or maybe using some mail-parsing function to see if there's any text-only form of the message body embedded in the original message.

              sneakyimp how many records are required to train?

              AFAIK, the more, the better. Not sure if there are any best practice minimums?

              sneakyimp It's extremely interesting that a 50/50 mix is important.

              Newer models may handle unbalanced sets better, possibly? We definitely noticed that in our case it seemed not to be catching enough bad reviews, then realized we probably had something like an 80:20 balance of good vs. bad; when we went 50/50 we got much better results (and confirmed from other sources that that's usually best).

              sneakyimp I'd also be very curious about how you cooked up your training set.

              In that respect, we already had several years' worth of reviews to use, all of which had gone through human moderation; so we just picked all the reviews that were rejected, then an equivalent number that were approved.

              Anyway, a bit of googling led me to this, which perhaps could be useful: https://archive.ics.uci.edu/ml/datasets/spambase

              Thanks very much for the detail. Regarding the uci.edu data:

              Date Donated 1999-07-01

              I think I'll try and cook up a data set from one of my gmail accounts. I'm having some luck using the PHP IMAP extension.
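
              The gist of the IMAP approach is something like this (untested sketch; the '[Gmail]/Spam' folder name, the credentials, and the assumption that part 1 of a multipart message is the text/plain body are all placeholders/guesses):

              <?php
              // Pull message bodies out of a Gmail folder via the PHP IMAP extension to build
              // labeled training rows. Credentials and folder name are placeholders.
              $mbox = imap_open('{imap.gmail.com:993/imap/ssl}[Gmail]/Spam', 'me@example.com', 'app-password');

              $rows = [];
              for ($i = 1, $count = imap_num_msg($mbox); $i <= $count; $i++) {
                  $structure = imap_fetchstructure($mbox, $i);
                  // Naive: take part 1 of multipart messages (often text/plain), else the whole body.
                  $body = isset($structure->parts) ? imap_fetchbody($mbox, $i, '1') : imap_body($mbox, $i);
                  $text = trim(strip_tags(quoted_printable_decode($body)));
                  if ($text !== '') {
                      $rows[] = ['SPAM', $text]; // everything in the Spam folder gets labeled spam
                  }
              }
              imap_close($mbox);

              // Append label,document rows to the training CSV.
              $fh = fopen('training.csv', 'a');
              foreach ($rows as $row) {
                  fputcsv($fh, $row);
              }
              fclose($fh);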

              NogDog Should have put some time restrictions on the google query.

              I'm curious what your query was, exactly. I've been googling for 'spam message training set' and 'spambase dataset' and UCI comes up again and again -- and they only have a few spam datasets. I've seen the old '97 dataset mentioned in a few academic papers. Looks to me like good spam filter training datasets are quite rare.

              Also, could you explain that spambase dataset's format to me? The data doesn't contain any actual spam messages from what I can tell. The spambase.data file is just a CSV of numeric values.

                I created a custom classifier model using AWS Comprehend and trained it with the SMS spam dataset here. I was quite surprised at how difficult it is to find spam filter training sets.

                I then wrote a script to use this classifier model to test the various examples we've received on a contact form. The AWS model matched our manual (i.e., human-entered) spam/ham assessments 66% of the time. I wasn't sure whether to be disappointed at these results or surprised at how good they were, given that the dataset I used is for SMS text messages, which are quite short.
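
                For reference, classifying a single message against the endpoint with the AWS SDK for PHP looks roughly like this (the endpoint ARN below is a fabricated placeholder):

                <?php
                require 'vendor/autoload.php';

                use Aws\Comprehend\ComprehendClient;

                $client = new ComprehendClient(['region' => 'us-east-1', 'version' => 'latest']);

                // Classify one contact form message against the custom classifier's real-time endpoint.
                $result = $client->classifyDocument([
                    'Text'        => 'We can get your site to the first page of Google, guaranteed!',
                    'EndpointArn' => 'arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/example',
                ]);

                // Each class comes back with a confidence score, e.g. spam: 0.98, ham: 0.02.
                foreach ($result['Classes'] as $class) {
                    printf("%s: %.5f\n", $class['Name'], $class['Score']);
                }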

                I then appended our contact form ham/spam records to the end of that SMS dataset and trained another classifier model. I then tested our contact form entries using the new model, which matched our human ham/spam assessments 95% of the time. I'd be delighted to get such results on incoming novel ham/spam contact form entries.

                I would point out that we only had about 41 contact form entries to add and test. I found it quite interesting that even when we added our own 41 entries to the model's training set, the model still failed to classify two of these entries correctly. One entry was falsely classified as ham when it should have been spam, but only by an extremely narrow margin:
                spam: 0.49563866853714
                ham: 0.50436133146286

                The other failure was a false positive for spam for the exceedingly terse message "digital marketing assistance."

                Very interesting that the model would fail to properly classify records included as part of its training set.

                sneakyimp Very interesting that the model would fail to properly classify records included as part of its training set.

                Probably not unexpected, since in theory it's iterating through various neural network combinations based on the content and then evaluating against the expected outcomes; but the spam/ham indicator is not part of the data it's examining -- just used to evaluate the results of each iteration. I.e., if "cars for sale" is flagged as spam but "carts for sails" and "ears four pale" are not (not really a good example, but you get the idea), it may lump all 3 together as being essentially the same, and since it's 2:1 ham, classify all 3 as ham. Okay, a horrible example, but the general idea is that by its very nature, it can be hard to discern how the learning model parsed, categorized, and grouped everything -- but 95% is pretty darned good, IMHO.

                NogDog the spam/ham indicator is not part of the data it's examining

                Strictly speaking, the spam/ham indication is most definitely part of the training set. The CSV I provided has two columns: the message and spam/ham indication. I don't recall the exact mechanism by which a neural network gets trained, but I vaguely recall that training involves supplying a set of inputs (the message or 'document') and also an expected output. The 'training' involves adjustment of neural network weights (a matrix/array of nodes and connections) according to some algorithm. I think I vaguely understand why it might fail to get the right classification for an item from its training set -- I think your explanation is a decent one: the algorithm's action is approximate and applies broadly, in a 'fuzzy logic' sense. We are training it with broad, imprecise notions in the hope that it can properly deal with unpredictable or novel input.

                A few observations about AWS in particular:

                • when creating an API endpoint for your model, you have to choose its desired performance level in 'inference units.' I somewhat arbitrarily chose 5 inference units, and haven't the foggiest notion how much this might cost. I'd also point out that it was very, very easy to exceed the allowed usage, which triggers throttling exceptions from the API.
                • you can't continue to train a given model once you create it. If you want new training, you create a new model and provide it with a new training dataset.
                • I haven't seen any explanation from AWS about how these models work internally, whether they are neural networks or use Bayesian analysis, or what sort of application logic might be in use. Unless I'm missing something, they are a total black box in terms of functionality.

                Has anyone worked with a PHP neural network library? I see various options when I google: PHP-ML, Rubix, FANN.
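
                From a skim of the PHP-ML docs, the basic train/predict workflow looks something like this (untested sketch; it uses a Naive Bayes classifier rather than a neural network, but the shape is the same, and the training rows are made up):

                <?php
                // Untested sketch with PHP-ML (composer require php-ai/php-ml): bag-of-words + Naive Bayes.
                require 'vendor/autoload.php';

                use Phpml\Classification\NaiveBayes;
                use Phpml\FeatureExtraction\TokenCountVectorizer;
                use Phpml\Tokenization\WhitespaceTokenizer;

                // Tiny made-up training set; in practice this would come from real ham/spam data.
                $samples = [
                    'win a free cruise click here now',
                    'question about your consulting rates',
                    'cheap pills no prescription required',
                    'following up on the invoice from last week',
                ];
                $labels = ['spam', 'ham', 'spam', 'ham'];

                // Turn each message into a word-count feature vector.
                $vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
                $vectorizer->fit($samples);
                $vectorizer->transform($samples); // $samples now holds numeric vectors

                $classifier = new NaiveBayes();
                $classifier->train($samples, $labels);

                // Classify a new message using the same vocabulary.
                $incoming = ['we can put your site on the first page of google'];
                $vectorizer->transform($incoming);
                var_dump($classifier->predict($incoming)); // e.g. ["spam"]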

                I've just realized that AWS Comprehend endpoints are exorbitantly expensive. If I'm not mistaken, my endpoint cost me $9/hr.

                sneakyimp Strictly speaking, the spam/ham indication is most definitely part of the training set.

                If my understanding is correct (not a 100% given in anything to do with ML/AI 😉 ), the spam/ham flag should not be part of the data that is being parsed, tokenized, evaluated, whatever the heck the terms are. It should only be used to evaluate the result of an iteration to determine how well/poorly it did, so that it can then hopefully determine what it should try for the next iteration.