As we have seen on this forum, spam from human worker bees continues to be a difficult problem to solve. The CAPTCHA offered by Google seems to be a decent way to prevent robots -- and a heavyweight like Google can surely solve this problem better than some homespun solution -- but it doesn't prevent actual humans from spamming one's contact form. I am looking into a way to prevent this sort of spam.
I've installed SpamAssassin on servers in the past, but it's been quite a while. IIRC, it has quite a few configuration values you can tweak, and the munging of spam email requires a fair amount of computing resources. There are whitelists and blacklists for senders and recipients. I think it can also check DNS blocklists like Spamhaus.
We've also seen things like DMARC, SPF, and DKIM, which try to check whether a machine is authorized to send email for a particular domain. Sadly, such tools don't seem applicable to a web contact form. Some user on their phone is unlikely to have anything we can check, like the DNS records a sending domain publishes.
As for AI approaches that might serve this purpose, I'm only familiar with a couple of broad categories of machine learning:
- Genetic Algorithm
- Neural Network
A Genetic Algorithm doesn't really seem applicable. It works by repeatedly making random tweaks to a pool of candidate solutions and applying some scoring function (usually called the fitness function) to evaluate the SCORE of each candidate. This approach seems to assume that we already have some algorithm to generate a spam score.
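To make that dependency concrete, here is a minimal sketch of a genetic algorithm in Python (chosen only for brevity). Everything in it is hypothetical: the `KEYWORDS` list, the tiny `LABELED` set, and the 0.5 threshold are made up for illustration. Note that `evolve` can't do anything until someone hands it a `fitness` function; in this sketch the fitness is just "how many labeled examples did these keyword weights classify correctly", which is one possible way to supply that score.

```python
import random

# Hypothetical keywords and labeled examples, made up for illustration.
KEYWORDS = ["seo", "free", "winner"]
LABELED = [
    ("cheap seo services", True),
    ("free cruise winner", True),
    ("question about my order", False),
    ("is delivery free on weekends", False),
]

def fitness(weights):
    """Count how many labeled messages these keyword weights classify correctly."""
    correct = 0
    for text, is_spam in LABELED:
        words = text.split()
        score = sum(w for w, kw in zip(weights, KEYWORDS) if kw in words)
        correct += (score > 0.5) == is_spam
    return correct

def evolve(fitness, genome_length, generations=30, population_size=20, seed=0):
    """Evolve a list of weights; the caller must supply the fitness function."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [[rng.uniform(-1, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        for _ in range(population_size - len(survivors)):
            mom, dad = rng.sample(survivors, 2)
            # Crossover (pick each gene from one parent) plus a random tweak.
            child = [rng.choice(pair) + rng.gauss(0, 0.1) for pair in zip(mom, dad)]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best = evolve(fitness, len(KEYWORDS))
```

Because the fitter half survives each generation untouched, the best score never gets worse from one generation to the next, but the result is only as good as the fitness function and the labeled data behind it.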
A Neural Network might be applicable, if one artfully identifies useful inputs and is able to 'train' the network with some data set of positive and negative -- i.e., spam and not-spam -- examples. Assuming we have some useful distillation of inputs related to the contact form submission, we might be able to train a neural net with an initial dataset and then keep training it over time by letting our trusted users label new submissions through some simple UI.
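As a rough sketch of that train-then-keep-training idea, here is about the smallest thing that could be called a neural network: a single sigmoid neuron (equivalent to logistic regression), written in plain Python. The three inputs and all the numbers in `INITIAL_DATA` are assumptions I made up for illustration; a real network would have more inputs, more layers, and real data.

```python
import math

class SpamNeuron:
    """A single sigmoid neuron, trained one example at a time."""

    def __init__(self, n_inputs, learning_rate=0.5):
        self.weights = [0.0] * n_inputs
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, inputs):
        """Return a spam probability between 0 and 1."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, inputs))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, inputs, is_spam):
        """One gradient step on one labeled example."""
        error = self.predict(inputs) - (1.0 if is_spam else 0.0)
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, inputs)]
        self.bias -= self.learning_rate * error

# Made-up inputs: [ip_on_blocklist, link_density, 1 - recaptcha_score]
INITIAL_DATA = [
    ([1.0, 1.0, 0.9], True),
    ([1.0, 0.5, 0.8], True),
    ([0.0, 1.0, 0.7], True),
    ([0.0, 0.0, 0.1], False),
    ([0.0, 0.1, 0.2], False),
    ([0.0, 0.0, 0.3], False),
]

# Initial training pass over the starting dataset.
neuron = SpamNeuron(n_inputs=3)
for _ in range(200):
    for inputs, is_spam in INITIAL_DATA:
        neuron.train(inputs, is_spam)

# Later: a trusted user clicks "this is spam" on a new submission.
neuron.train([0.0, 0.9, 0.6], True)
```

The point of the `train` method taking one example at a time is that the "simple UI" for trusted users only needs to call it once per click; no full retraining is required for each correction.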
A contact form has some obvious data points:
- any data in $_SERVER, $_GET, $_POST. REMOTE_ADDR seems of particular interest.
- reCAPTCHA data acquired from the form. Google reCAPTCHA v3 offers an invisible method that scores each request without hassling users with a challenge.
- return address, submitted message
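For the REMOTE_ADDR data point, one common check is a DNS blocklist (DNSBL) lookup: reverse the IPv4 octets, append the blocklist zone, and do an ordinary DNS lookup. Here is a minimal sketch in Python. The function names are my own, and the handling of return codes is an assumption based on how I understand Spamhaus to work (answers in 127.0.0.x mean "listed", answers in 127.255.255.x signal a query problem such as going through a public resolver), so check their current documentation and usage terms before relying on it.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone="zen.spamhaus.org"):
    """Return True if the blocklist answers with a 'listed' code for this IP."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No such name (or the lookup failed): treat as not listed.
        return False
    # Assumption: 127.255.255.x answers are error codes, not listings.
    return answer.startswith("127.") and not answer.startswith("127.255.255.")
```

Something like `is_listed(remote_addr)` would then give one yes/no input for whatever scoring approach sits on top. A lookup failure is treated as "not listed" here so that a DNS outage doesn't block every visitor.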
It seems to me that the origin IP address and the contents of the submitted message need particular scrutiny. Checking the IP address against some spam blocklist like Spamhaus seems helpful. As for the message content itself, I haven't really got any useful thoughts about how to assess its 'spamminess' beyond keeping a dataset of historical spam messages and calculating some kind of statistical similarity between each incoming message and prior spam messages -- sorta like how a vaccine works: "hey now, the surface of this incoming message feels a lot like SPAM-EXAMPLE-123 in my database of spam".
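One simple way to put a number on "feels a lot like SPAM-EXAMPLE-123" is to break each message into overlapping three-word chunks (shingles) and measure the overlap with each stored spam message using Jaccard similarity. This is only a sketch of that one technique in Python, not a claim that it is the best measure; the example messages and IDs are made up, and any cutoff you apply to the score would need tuning on real data.

```python
import re

def shingles(text, k=3):
    """Return the set of overlapping k-word chunks in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def closest_spam(message, spam_examples, k=3):
    """Return (id, score) of the most similar stored spam, or (None, 0.0)."""
    target = shingles(message, k)
    best_id, best_score = None, 0.0
    for spam_id, text in spam_examples.items():
        score = jaccard(target, shingles(text, k))
        if score > best_score:
            best_id, best_score = spam_id, score
    return best_id, best_score

# Made-up database of previously seen spam.
SPAM_DB = {
    "SPAM-EXAMPLE-123": "Boost your website ranking with our cheap SEO services today",
    "SPAM-EXAMPLE-456": "You have won a free cruise click here to claim",
}

match_id, match_score = closest_spam(
    "Hello, boost your website ranking with our cheap SEO services now", SPAM_DB
)
```

Shingles catch copy-and-paste spam with small edits, but a human who rewrites each message from scratch will slip past this kind of check, so it would be one signal among several rather than a verdict on its own.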
Any thoughts, discussion, or anecdotes would be much appreciated.