NogDog My role was helping to provide data for training/evaluation for the main part.
It's extremely interesting that a 50/50 mix is important. Do you have any thoughts on how many records are required to train? I'd also be very curious about how you cooked up your training set. I'm sure I can initialize the model, get PHP talking to it, etc., but I'm struggling to find an efficient way to cook up the training set. I have perhaps 30 existing contact form submissions, most of which are spam. Also, I believe the contact form submissions will be plain text rather than HTML. I was considering exporting data from a mail account and running strip_tags on the raw email, or maybe using a mail-parsing function to see whether a text-only version of the message body is embedded in the original message.
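
To make the mail-parsing idea concrete, here is a rough sketch of what I have in mind, written in Python rather than PHP only because its standard library already ships a MIME parser and this would be a one-off prep step. It assumes each exported message is available as a raw RFC 822 string; the function name extract_plain_text is made up for illustration. It prefers the text/plain part when one exists and only falls back to crude tag stripping (roughly what strip_tags would do) when the message is HTML-only.

```python
import email
import re
from email import policy
from html import unescape


def extract_plain_text(raw_message: str) -> str:
    """Return the best plain-text body for one raw email (hypothetical helper)."""
    # policy.default gives us the modern EmailMessage API, including get_body().
    msg = email.message_from_string(raw_message, policy=policy.default)

    # Prefer a text/plain part if the sender included one.
    part = msg.get_body(preferencelist=("plain",))
    if part is not None:
        return part.get_content().strip()

    # Otherwise fall back to the HTML part and strip the markup.
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        return ""
    html = part.get_content()
    # Drop script/style blocks entirely, then remove the remaining tags.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(text)).strip()
```

The regex fallback is deliberately naive and will mangle unusual markup, so I would treat it as a starting point and spot-check the output before using it as training data.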