Pattern syntax for use with preg_replace()

Anon

I am trying to generate a list of keywords from a page of HTML content for entry into a search engine DB. I have a list of noise words (e.g. and, it, while, the) that I need to strip out of the content. I have converted the content to a string and intend to create a comma-separated list from it using preg_replace().

I need some help creating a pattern with the correct syntax that will match every occurrence of a noise word in the content string EXCEPT where it is part of a larger word, for example:

If the noise word is 'and',

(and
and)
,and
and,
and'
"and

and so on should be removed / replaced, but

England
understand
handy

should be untouched.

I haven't really got used to the pattern syntax yet and am struggling, so any help would be greatly appreciated!!!

You can e-mail me at alex@suboceanic.net

Thanks,

AlexT

[deleted]

Sometimes life is so much simpler than you think. 🙂

Put the noise words in an array.
Use a regexp or str_replace() to get rid of all the dots, commas, etc. in the text.
explode() the text into an array, then use array_diff() to remove the noise words.
Et presto: all the noise words are gone from your text.
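
For anyone finding this later, here is a rough sketch of those steps. The noise-word list and the sample text are made up for illustration, and the text is lowercased first because array_diff() compares strings case-sensitively. The last part shows the regexp route the original question asked about, using the \b word-boundary anchor; treat both as a starting point rather than a finished solution.

```php
<?php
// Hypothetical sample data: the noise words and text are assumptions.
$noiseWords = ['and', 'it', 'while', 'the'];
$text = 'England and Wales (and, it seems, "and" the rest) understand handy tools.';

// Step 1: replace anything that is not a letter, digit or whitespace
// with a space, so punctuation no longer clings to the words.
$clean = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', strtolower($text));

// Step 2: split on runs of whitespace, dropping empty strings.
$words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);

// Step 3: array_diff() compares whole elements, so 'england' and
// 'understand' survive even though they contain 'and'.
$keywords = array_values(array_unique(array_diff($words, $noiseWords)));

echo implode(',', $keywords), "\n";
// england,wales,seems,rest,understand,handy,tools

// Alternative: a single pattern with \b (word boundary) on both sides,
// which matches 'and' next to punctuation but not inside 'England'.
$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $noiseWords);
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
$stripped = preg_replace($pattern, '', $text);
```

The array route gives you the comma-separated list directly; the \b pattern is handier if you want to keep the rest of the string intact.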

Anon

Thanks, that worked a treat!!!

Now, do you know of a quick method of stripping the HTML tags from a string?? I have played with the get_html_translation_table() and several of the other HTML functions but to no avail!! Short of using the str_replace function for each individual tag (v.tedious) I can't think of anything else...Help!!!

Thanks,

AlexT

[deleted]

strip_tags()
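
A minimal sketch of how that might look for keyword extraction, using a made-up HTML snippet. One thing to watch for: strip_tags() removes the tags without adding whitespace, so text from adjacent elements can run together. Padding each tag with a space first, as below, is just one workaround and an assumption on my part, not something the function does for you.

```php
<?php
// Hypothetical input: any HTML string would do here.
$html = '<html><body><h1>England</h1><p>We <b>understand</b> handy tools &amp; more.</p></body></html>';

// Put a space in front of every tag so words in neighbouring
// elements do not get glued together once the tags are removed.
$spaced = str_replace('<', ' <', $html);

// Remove the tags themselves.
$text = strip_tags($spaced);

// strip_tags() leaves entities such as &amp; untouched; decode them.
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Collapse the extra whitespace introduced above.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, "\n";
// England We understand handy tools & more.
```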