[RESOLVED] Sanitize and censor $_POST message

nrg_alpha

OK, I will give some sample code below. Its role is to take the user's message from a form textarea, first sanitize it to clear out any potentially harmful scripting tags, then scan for and replace bad words. The current code does work, but I can't help thinking it could be streamlined / optimized.

But first, the code:

if (isset($_POST['textarea'])){

// first remove any potential html / scripting tags...
$_POST['textarea'] = preg_replace('#<[^>]+>#' , '' , $_POST['textarea']);

// now check and censor out any profanity...
$_POST['textarea'] = preg_replace('/badWord-01/i', 'spoofWord-01', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-02/i', 'spoofWord-02', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-03/i', 'spoofWord-03', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-04/i', 'spoofWord-04', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-05/i', 'spoofWord-05', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-06/i', 'spoofWord-06', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-07/i', 'spoofWord-07', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-08/i', 'spoofWord-08', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-09/i', 'spoofWord-09', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-10/i', 'spoofWord-10', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-11/i', 'spoofWord-11', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-12/i', 'spoofWord-12', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-13/i', 'spoofWord-13', $_POST['textarea']);
$_POST['textarea'] = preg_replace('/badWord-14/i', 'spoofWord-14', $_POST['textarea']);

// finally, remove any added slashes due to get_magic_quotes_gpc()...
if(get_magic_quotes_gpc()){
	$_POST['textarea'] = stripslashes($_POST['textarea']);
}
}

Now I realise this could go different routes, like perhaps storing the bad words and spoof words in a multi-dimensional array and then using foreach to plug them into a common preg_replace structure. But I wonder if that is slower than simply doing what I have done (which is just to keep passing $_POST['textarea'] down a chain of checks to replace with spoofed words).

What faster alternatives are at my disposal?

Cheers,

NRG

P.S. As for the removal of harmful tags, I am aware of htmlentities(), but I don't want to be left with all those sappy alt characters left over.

Weedpacket

preg_replace() can take arrays for its search and replace parameters.

If the reason for using preg_replace() is for case insensitivity, then str_ireplace() would do that without needing regular expressions (it, too, can take array arguments). Or, since you're probably not interested in searching for bad words in replacement text you've just inserted, the two-argument form of strtr() may be effective (it makes one pass over the text, instead of as many passes as there are search/replace pairs).

On the other hand, using preg_replace() will allow you to match word boundaries.

Regardless, the list of bad words will need to be chosen carefully, or you could end up making the clbuttic mistake.
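
To make those trade-offs concrete, here is a minimal sketch of the three approaches side by side. The word lists and the sample sentence are made up for illustration; they are not taken from the original script.

```php
<?php
// Hypothetical word lists, for illustration only.
$bad   = array('ass', 'darn');
$spoof = array('butt', 'drat');

$text = 'A classic Darn ass';

// str_ireplace(): case-insensitive, no regex needed,
// but it also matches inside longer words.
$a = str_ireplace($bad, $spoof, $text);

// strtr() with an array of pairs: a single pass that never rescans
// text it has just inserted, but it is case-sensitive.
$b = strtr($text, array_combine($bad, $spoof));

// preg_replace() with arrays: \b limits each match to a whole word.
$patterns = array();
foreach ($bad as $word) {
    $patterns[] = '/\b' . preg_quote($word, '/') . '\b/i';
}
$c = preg_replace($patterns, $spoof, $text);

echo $a, "\n"; // A clbuttic drat butt
echo $b, "\n"; // A clbuttic Darn butt
echo $c, "\n"; // A classic drat butt
```

Only the word-boundary version leaves "classic" alone, which is the point of the warning above.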

nrg_alpha

OK, I have managed to get it working with str_ireplace():

$badWord = array('badWord-01','badWord-02','badWord-03','badWord-04','badWord-05','badWord-06','badWord-07','badWord-08','badWord-09','badWord-10');
$spoofWord = array('spoofWord-01','spoofWord-02','spoofWord-03','spoofWord-04','spoofWord-05','spoofWord-06','spoofWord-07','spoofWord-08','spoofWord-09','spoofWord-10');

// first remove any potential html / scripting tags... 
$_POST['textarea'] = preg_replace('#<[^>]+>#' , '' , $_POST['textarea']);

// now check and censor out any profanity...
$_POST['textarea'] = str_ireplace($badWord, $spoofWord, $_POST['textarea']);

// finally, remove any added slashes due to get_magic_quotes_gpc()...
if(get_magic_quotes_gpc()){
   $_POST['textarea'] = stripslashes($_POST['textarea']);
}

The code definitely looks more 'refined'. When tested, the performance seems a bit faster (but I think the difference is not significant). But I gather that what goes on under the hood of str_ireplace() is quicker than preg_replace() (especially as more entries come into play)?

Cheers,

NRG

P.S. I'm not overly concerned about making a clbuttic mistake. Just covering some basics is all.

nrg_alpha

OK, so I have refined things somewhat. In light of Weedpacket's (and Brad's) illustration of clbuttic, I have rewritten the routine to censor out potty-mouth talk from bad little users who have nothing better to do than swear at people in forms. 😃

$badWord = array('badWord-01','badWord-02','badWord-03','badWord-04','badWord-05','badWord-06','badWord-07','badWord-08','badWord-09','badWord-10');
$censor = array('f--k'=>'*'); // the f--k is actually written as the f word in full.

// first remove any potential html / scripting tags... 
$_POST['textarea'] = preg_replace('#<[^>]+>#' , '' , $_POST['textarea']); 

// now check and censor out any profanity...
$profanityStatus = false; // innocent till proven guilty...

foreach($badWord as $remove){  // first begin by checking for complete words that match complete profanity words...
   $_POST['textarea'] = preg_replace('#\b'.$remove.'\b#i' , '*' , $_POST['textarea']);
}
// and then check for words that are partial in profanity... example: f--ker, f--king or f--k-adoodaaday
$_POST['textarea'] = strtr($_POST['textarea'], $censor);

if(preg_match('/[*]/', $_POST['textarea'])){
   $profanityStatus = true; // this will deny the form from being sent as an email...
}

// finally, remove any added slashes due to get_magic_quotes_gpc()...
if(get_magic_quotes_gpc()){ 
   $_POST['textarea'] = stripslashes($_POST['textarea']); 
}

So basically, I search for complete words first and change those to asterisks. Then I go through again and search for parts of words that contain profanity.
I felt it necessary to do this because of our beloved clbuttic situation. Using this method, the word 'ass' is censored out, but not where it appears inside the word 'classified'. So this lets me fine-comb 'hybrid words' and tell the difference between, say, 'ass' vs 'classified' and f--k vs f--ker (which in this case both need to be dealt with).
Any hybrid words I am missing can easily be added to the $censor array.
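
The two-list idea can be sketched roughly as follows. This is only an illustration under assumptions: censorTwoPass() is a hypothetical helper name, 'darn' stands in for a word that should be caught even inside longer words, and both passes use preg_replace() rather than the strtr() call in the code above.

```php
<?php
// Hypothetical helper: whole-word list first, then a partial-match list.
function censorTwoPass($text, array $wholeWords, array $partials)
{
    $patterns = array();

    // Whole words only: \b keeps 'ass' from matching inside 'classified'.
    foreach ($wholeWords as $word) {
        $patterns[] = '#\b' . preg_quote($word, '#') . '\b#i';
    }

    // Partial matches: caught anywhere, even inside longer words.
    foreach ($partials as $fragment) {
        $patterns[] = '#' . preg_quote($fragment, '#') . '#i';
    }

    return preg_replace($patterns, '*', $text);
}

echo censorTwoPass(
    'That ass left it classified, the darned fool',
    array('ass'),   // censored only as a complete word
    array('darn')   // stand-in for a word censored in any form
);
// That * left it classified, the *ed fool
```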

Am I on the right track?

Cheers,

NRG

bradgrafelman

One change I would suggest is that you utilize preg_replace()'s 5th parameter and introduce a "count" variable. Use this to gauge whether $profanityStatus should be set to true (i.e. if the count is >0, then some profanity was found). Otherwise, legitimate text can lead to a false positive if the user wanted to use an asterisk in his/her message.
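
A minimal sketch of what that could look like, assuming placeholder patterns and a made-up message (none of these values come from the original script):

```php
<?php
// Placeholder patterns, for illustration only.
$patterns = array('/\bdarn\b/i', '/\bheck\b/i');

$message = 'What the heck? I rate it 5 * out of 5.';

// The 4th argument (-1) means "no limit"; the 5th receives the total
// number of replacements made across all patterns in this call.
$clean = preg_replace($patterns, '*', $message, -1, $count);
$profanityStatus = ($count > 0);

// A message that merely contains an asterisk is not flagged.
$harmless = 'I rate it 5 * out of 5.';
preg_replace($patterns, '*', $harmless, -1, $harmlessCount);

echo $clean, "\n";          // What the *? I rate it 5 * out of 5.
echo $count, "\n";          // 1
echo $harmlessCount, "\n";  // 0
```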

Also, you mentioned in the inline comment that if there was profanity found then the entire form would be nullified and no e-mail sent. What, then, is the point of replacing the profanity when it's found? Why not just search for it and, if found, halt the process there rather than have PHP do some find-and-replace, additional censoring, and then finally die?
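
If the form is going to be rejected anyway, a detection-only check might look something like this sketch. containsProfanity() is a hypothetical helper name and the word list is a placeholder.

```php
<?php
// Hypothetical helper: reports whether any listed word appears as a
// whole word, without modifying the text.
function containsProfanity($text, array $badWords)
{
    $quoted = array();
    foreach ($badWords as $word) {
        $quoted[] = preg_quote($word, '/');
    }

    // One alternation pattern, so the text is scanned once and
    // preg_match() stops at the first hit.
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    return preg_match($pattern, $text) === 1;
}

$badWords = array('darn', 'heck'); // placeholder list

if (containsProfanity('Oh DARN it', $badWords)) {
    // halt here: show an error instead of sending the e-mail
}
```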

nrg_alpha

bradgrafelman;10879896 wrote:
One change I would suggest is that you utilize preg_replace()'s 5th parameter and introduce a "count" variable. Use this to gauge whether $profanityStatus should be set to true (i.e. if the count is >0, then some profanity was found). Otherwise, legitimate text can lead to a false positive if the user wanted to use an asterisk in his/her message.

Ah, nice suggestion Brad! I want that! Unfortunately, I could not get it working. I looked at the manual here:
http://us.php.net/manual/en/function.preg-replace.php

and emulated this snippet:

$count = 0;
echo preg_replace(array('/\d/', '/\s/'), '*', 'xp 4 to', -1 , $count);
echo $count; //3

Of course, I used my previous example's preg_replace instead of the sample shown here. I had a $count = 0; line just above it, I inserted -1, $count as the 4th and 5th arguments, and I had echo $count just after the preg_replace (just like the snippet), yet $count kept reporting 0. So I tried passing it by reference (&$count), and I even tried ++$count, but I just could not get the count variable to record properly. 0 count, it seems 🙁 If I can get help on that one thing, it will all come together beautifully.
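
One possible explanation, offered as a guess rather than a diagnosis: if the preg_replace() call sits inside the earlier foreach loop, $count is overwritten on every iteration, so echoing it after the loop only shows the result for the last word in the list. The sketch below reproduces that with placeholder words and keeps a separate running total ($total is a name introduced here for illustration).

```php
<?php
$badWords = array('darn', 'heck'); // placeholder list
$text     = 'darn it';

$count = 0;
$total = 0; // running total across all iterations

foreach ($badWords as $word) {
    $pattern = '#\b' . preg_quote($word, '#') . '\b#i';

    // preg_replace() resets $count on each call.
    $text = preg_replace($pattern, '*', $text, -1, $count);
    $total += $count;
}

echo $count, "\n"; // 0 (the last word, 'heck', matched nothing)
echo $total, "\n"; // 1
```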

bradgrafelman;10879896 wrote:
Also, you mentioned in the inline comment that if there was profanity found then the entire form would be nullified and no e-mail sent. What, then, is the point of replacing the profanity when it's found? Why not just search for it and, if found, halt the process there rather than have PHP do some find-and-replace, additional censoring, and then finally die?

OK, change of plan here. The idea was to get the users to 'fix' the message by clearing the asterisks themselves, but after your comment I re-thought it. So now I'll allow the asterisks to pass along in the message, and I'll see who swore and where 🙂 That way there is no inconvenience to the users in this regard.

So now my code is as follows:

$badWord = array('badWord-01','badWord-02','badWord-03','badWord-04','badWord-05','badWord-06','badWord-07','badWord-08','badWord-09','badWord-10'); 
$censor = array('f--k','sh-t'); // actual swear words in the real script.

// first remove any potential html / scripting tags... 
$_POST['textarea'] = preg_replace('#<[^>]+>#' , '' , $_POST['textarea']); 

// now check and censor out any profanity...
$profanityStatus = false; // innocent till proven guilty... 

foreach($badWord as $remove){  // first begin by checking for complete words that match complete profanity words... 
   $_POST['textarea'] = preg_replace('#\b'.$remove.'\b#i' , '*' , $_POST['textarea']); 
} 
// and then check for words that are partial in profanity... example: f--ker, f--king or f--k-adoodaaday 
$_POST['textarea'] = str_ireplace($censor, '*', $_POST['textarea']);

if(preg_match('/[*]/', $_POST['textarea'])){ 
   $profanityStatus = true; // if I can get that 5th $count variable working in preg, I can convert str_ireplace to preg as well and once again check for this $count variable and if it is still 0, profanityStatus = false...
} 

// finally, remove any added slashes due to get_magic_quotes_gpc()...
if(get_magic_quotes_gpc()){ 
   $_POST['textarea'] = stripslashes($_POST['textarea']); 
}

It is pretty solid now. I played around with it for a while. It is by no means perfect of course, as I don't think I can code against EVERYTHING, but in any case, go ahead, amuse yourself 😃

So Brad, if you can show me what I am doing wrong with regards to that 5th $count variable in preg_replace, that would be much appreciated!

Thanks for the feedback so far! Much appreciated.

Cheers,

NRG

nrg_alpha

OK, so I think I've got it now (third time's a charm?)

Here is the update. I got the $count parameter in preg_replace working (in my case relabeled as $profanityCount1 and $profanityCount2, though I suppose I could have simply used an array like $profanityCount[] and stored the full and partial profanity counts in it). So now it keeps track of how many replacements have been made for both full words and partial words, and those variables are then checked: if either is non-zero, profanity has been entered.

// first remove any potential html / scripting tags... 
$_POST['textarea'] = preg_replace('#<[^>]+>#' , '' , $_POST['textarea']); 

// set variables and check for any profanity...
$profanityCount1 = 0; // set complete-word profanity count to 0
$profanityCount2 = 0; // set partial-word profanity count to 0

$postProccess = preg_replace(array('#\bbadWord1\b#i', '#\bbadWord2\b#i', '#\bbadWord3\b#i', '#\bbadWord4\b#i'), '*', $_POST['textarea'], -1 , $profanityCount1);
$_POST['textarea'] = $postProccess;

// and then check for words that are partial in profanity...
$postProccess = preg_replace(array('#badWord1#i', '#badWord2#i', '#badWord3#i'), '*', $_POST['textarea'], -1 , $profanityCount2);
$_POST['textarea'] = $postProccess;
unset($postProccess);
//echo $profanityCount1 . '   ' . $profanityCount2;

// finally, remove any added slashes due to get_magic_quotes_gpc()...
if(get_magic_quotes_gpc()){ 
	$_POST['textarea'] = stripslashes($_POST['textarea']); 
}

Then simply check to see if profanity has been used...

if($profanityCount1 || $profanityCount2){
   // insert profanity message here...
}

This all seems to work reasonably well. Users can now enter asterisks and the system will know the difference and not call it profanity (it only counts the asterisks the system itself inserts). So this means a user may enter something like ' I swear on my on my life... ' and this will be processed as a clean, non-profanity entry.
C&C welcome.

Cheers,

NRG