preg_match to remove words AND delimiters properly

nrg_alpha · Jul 31, 2008

sneakyimp;10879815 wrote:
You'd get double spaces: "I think erego I am" would return "I think  I am" (with two spaces).

Ah yes, I see what you mean. My bad.

Cheers,

NRG

Drakla · Jul 31, 2008

This pattern's ugly, but does it in one shot - just an extra option that will suck the spaces before and after if it's at the end of the line. Note YourWord is what you'd put in with the preg_quote bit, aye?

/(\bYourWord\b\s?|\s+\bYourWord\b\s?$)/i

I'd like a nicer pattern without the or, but can't muster it at the mo.

bradgrafelman · Jul 31, 2008

nrg_alpha;10879817 wrote:
Hi Brad..

Maybe I'm misunderstanding something.. going by the OP's goal:
'I've been instructed to write a routine that strips certain words from a text string',

would the str_ireplace routine not suffice?

Cheers,

NRG

The key word in the quote there is words.

The "clbuttic" mistake is that you decide to remove the vulgar word "ass" from a user's input, but instead of just stripping it out you decide to be humorous and replace it with the less offensive word "butt". So you simply do a str_ireplace() to replace "ass" with "butt".

Hence the word classic becomes "clbuttic." Search Google for this term, or any other word that has "ass" in it replaced with "butt" - it's a clbuttic, I mean classic, mistake.

(Thanks to Weedpacket in a previous thread for this example - it vaguely sounded familiar when he mentioned the word, but after a quick Google it all came back to me... very funny example of programming-gone-awry.)

nrg_alpha · Jul 31, 2008

bradgrafelman;10879822 wrote:
The key word in the quote there is words.

The "clbuttic" mistake is that you decide to remove the vulgar word "ass" from a user's input, but instead of just stripping it out you decide to be humorous and replace it with the less offensive word "butt". So you simply do a str_ireplace() to replace "ass" with "butt".

Hence the word classic becomes "clbuttic." Search Google for this term, or any other word that has "ass" in it replaced with "butt" - it's a clbuttic, I mean classic, mistake.

(Thanks to Weedpacket in a previous thread for this example - it vaguely sounded familiar when he mentioned the word, but after a quick Google it all came back to me... very funny example of programming-gone-awry.)

Ohh.. I thought you were referring to Sneakyimp's problem in THIS thread.. sorry..

Cheers,

NRG

nrg_alpha · Jul 31, 2008

Sneakyimp, my solution does not create two spaces when I tested it.. the '' that does the replacing does not create an extra space.. it is 'nothing'.. when I try the following code, it works:

function remove_words($str) {
   $remove = array('SOMNAMBULIST', 'EREGO', 'HERETOFORE'); 
   $str = str_ireplace($remove, '', $str); 
   return $str; 
}

$test = 'I am SOMNAMBULIST therefore I am!';
echo $test . '<br />';
echo remove_words($test);
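One thing worth checking here, offered as a hedged aside: a browser collapses runs of whitespace when it renders HTML, so echoing the result to a page can hide a doubled space. A minimal sketch that makes the spaces visible, reusing the same str_ireplace() call (the underscore marker and the variable names are just illustrative assumptions):

```php
<?php
// Same removal as above, inlined so the snippet stands alone.
$remove = array('SOMNAMBULIST', 'EREGO', 'HERETOFORE');
$result = str_ireplace($remove, '', 'I am SOMNAMBULIST therefore I am!');

// Swap each space for a visible marker so doubled spaces stand out.
$visible = str_replace(' ', '_', $result);
echo $visible, "\n"; // I_am__therefore_I_am!

// Count the runs of two or more spaces left behind.
$runs = preg_match_all('/ {2,}/', $result);
echo $runs, "\n"; // 1
```

Run from the command line, or viewed through "view source", the two spaces where the word used to be show up.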

Cheers,

NRG

bradgrafelman · Jul 31, 2008

nrg_alpha;10879823 wrote:
Ohh.. I thought you were referring to Sneakyimp's problem in THIS thread.. sorry..

I was/am - it's the same concept.

nrg_alpha · Jul 31, 2008

Brad, I don't think sneakyimp is trying to replace one word for another...if I understand correctly, he just wants to strip out specific words...

so this is not the same as in the link I initially posted (which yes, I agree with what you and Weedpacket are saying).

But in THIS case (this thread), I fail to see the usage of 'clbuttic' when the goal here is simply to remove words.. not replace them with other words.. Unless I am very seriously misunderstanding everything here..

Cheers,

NRG

m_tt · Jul 31, 2008

nrg_alpha,

It's pretty simple.. you want to remove full words. Your solution will deform one word into another. For example:

remove_words('I think neweregoword I am');

Would yield "I think newword I am", deforming my word "neweregoword". Hence clbuttic = classic.

nrg_alpha · Jul 31, 2008

m@tt;10879829 wrote:
nrg_alpha,

It's pretty simple.. you want to remove full words. Your solution will deform one word into another.

In the test from the last code I posted, it deforms (replaces) 'one successfully found criteria' for another if that's what you mean.. in this case, due to an array of 'words' from the $remove array, this is replaced with ''.

Is this not in essence 'removing' a word? When I examine the string after it has been passed through the function, the word that is not supposed to be there isn't there.

m@tt;10879829 wrote:
For example:
remove_words('I think neweregoword I am');
Would yield "I think newword I am", deforming my word "neweregoword". Hence clbuttic = classic.

Perhaps I'm new to this clbuttic thing.. but doesn't the replacement word have to actually be a word instead of ''? Or do you mean that by doing any form of replacement, the 'context' of the string is altered which can result in a clbuttic situation?

Cheers,

NRG

bradgrafelman · Aug 1, 2008

The whole point we're trying to make in suggesting regular expressions over a simple str_ireplace() is that regular expressions can make the distinction between words and a string of characters within a word.

For example, using str_ireplace(), try to remove the offensive word "ass" from this string: Calling someone an ass can be classified as quite offensive. -- and then ask yourself what "clified" means.
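A minimal sketch of that comparison, using the sentence above; the variable names are only for illustration. The \b anchors make the pattern match "ass" only when it stands as a whole word:

```php
<?php
$text = 'Calling someone an ass can be classified as quite offensive.';

// Plain substring replacement: also hits the "ass" inside "classified".
$naive = str_ireplace('ass', '', $text);

// Word-boundary regex: only the standalone word matches.
$pattern = '/\b' . preg_quote('ass', '/') . '\b/i';
$regex = preg_replace($pattern, '', $text);

echo $naive, "\n"; // Calling someone an  can be clified as quite offensive.
echo $regex, "\n"; // Calling someone an  can be classified as quite offensive.
```

Note that the regex version leaves "classified" alone but still leaves two spaces where the word was, which is the delimiter problem the rest of the thread is about.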

nrg_alpha · Aug 1, 2008

Point taken. I now understand completely. Just by the example words the OP used, str_ireplace() worked perfectly. But given your latest example, it does not.

So I concede.. preg it is. (and now I realise fully the definition of clbuttic).

Sometimes the simplest examples hammer home the point the hardest.

Cheers,

NRG

Drakla · Aug 1, 2008

$patterns[] = '/(^)?\s?\b'.preg_quote($rm, '/').'\b(?(1)\s?)/i';

Just for the fun of it. My previous pattern left spaces at the end of lines if the word occurred at the end of a line, which you could get rid of by modifying the pattern, but the one shown above doesn't have the repeat. The only side effect of this new pattern is that it essentially does a left trim.

sneakyimp · Aug 1, 2008

ok so let me see...

function remove_words($str) {
    $remove = array('SOMNAMBULIST', 'EREGO', 'HERETOFORE');

    $patterns = array();
    foreach ($remove as $rm) {
        $quoted = preg_quote($rm, '/');
        $patterns[] = '/(\b' . $quoted . '\b\s?|\s?\b' . $quoted . '\b\s?(?=$|\n))/i';
    }

    return preg_replace($patterns, '', $str);
} // remove_words()

echo "'" . remove_words('I think erego I am') . "'\n"; // needs a space!
echo "'" . remove_words('Heretofore unknown') . "'\n"; // works great 
echo "'" . remove_words('I think heretofore erego i am') . "'\n"; // works great 
echo "'" . remove_words('I was satisfied heretofore') . "'\n"; // works great

I believe I've covered all the boundary conditions in the examples and it seems to work:

'I think I am'
'unknown'
'I think i am'
'I was satisfied'

I think you nailed it, Drakla. The expression itself is a bit scary to me (this is to be expected from the undead, I suppose) so I think I'll stick with my previous function, which I can grasp a little better.

The scariest parts are all those question marks. I'm not really sure what they do.

Thanks for the valiant effort guys!

Drakla · Aug 1, 2008

I was thinking of putting an explanation of the pattern in for this one

(^)?\s?\bYourWord\b(?(1)\s?)

The basics of it are that you test if you're at the start of a line, and to do that you use the ^ and put it into brackets, which those who've done a bit of regex will know also creates a capturing pattern with 1 as its id, but it must be optional, and that's what the question mark does.

(^)? check if we're at the start of the line [the ^], do a capture so we can check it later [the brackets], but make it optional [the question mark]

\s?\bYourWord\b is grab a space before the word if it's there [\s?], and the word itself

The last bit tests whether the optional start of line pattern actually did capture anything, and if so also says take whitespace from after the word

(?(1)\s?) means did 1 capture anything? If so optionally scoop up another space.

So the logic of the whole expression is always grab a space from in front of the word, but if you're at the start of a line then also grab a space after.

That's going to be unintelligible drivel, isn't it. This will probably help more:
http://www.regular-expressions.info/conditional.html
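
To see the conditional in action, here is a small sketch that runs the pattern described above over some of the test strings from earlier in the thread. The function name strip_words_conditional is a hypothetical one made up for this example. One assumption to flag: without the /m modifier, ^ matches only at the very start of the whole string, not at the start of every line.

```php
<?php
// Hypothetical helper; the pattern is the conditional one explained above.
function strip_words_conditional($str, array $remove) {
    $patterns = array();
    foreach ($remove as $rm) {
        // (^)?        optionally capture start-of-string as group 1
        // \s?\b...\b  an optional leading space, then the whole word
        // (?(1)\s?)   only if group 1 matched, also take one trailing space
        $patterns[] = '/(^)?\s?\b' . preg_quote($rm, '/') . '\b(?(1)\s?)/i';
    }
    return preg_replace($patterns, '', $str);
}

$remove = array('SOMNAMBULIST', 'EREGO', 'HERETOFORE');
$tests = array('I think erego I am', 'Heretofore unknown', 'I was satisfied heretofore');
foreach ($tests as $s) {
    echo "'" . strip_words_conditional($s, $remove) . "'\n";
}
// 'I think I am'
// 'unknown'
// 'I was satisfied'
```

In the middle or at the end of the string only the space in front of the word is taken; at the start, where there is no space in front, the conditional takes the one behind instead.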
