I am trying to generate a list of keywords from a page of HTML content for entry into a search engine database. I have a list of noise words (e.g. and, it, while, the) that I need to strip out of the content. I have converted the content to a string and intend to create a comma-separated list from it using preg_replace.
I need some help creating a pattern with the correct syntax that will match every occurrence of a noise word in the content string EXCEPT where it is part of a larger word, e.g.:
If the noise word is 'and',
(and
and)
,and
and,
and'
"and
and so on should be removed / replaced, but
England
understand
handy
should be untouched.
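To make the requirement concrete, here is a minimal sketch of the kind of call I am after. It assumes the PCRE word-boundary assertion \b is the right tool for this. The names $noiseWords, $content, $stripped and $keywords, and the sample string, are made up for illustration.

```php
<?php
// Hypothetical noise-word list and sample content, for illustration only.
$noiseWords = ['and', 'it', 'while', 'the'];
$content = '(and England, and "and understand while it is handy';

// Escape each word so any regex metacharacters in it are matched literally.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $noiseWords);

// \b matches only at a word boundary, so "and" inside "England" is skipped.
// The i flag makes the match case-insensitive.
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

// Replace each standalone noise word with a space.
$stripped = preg_replace($pattern, ' ', $content);

// Split on runs of non-word characters and drop empty pieces.
$keywords = preg_split('/\W+/', $stripped, -1, PREG_SPLIT_NO_EMPTY);
$list = implode(',', $keywords);

echo $list, "\n"; // England,understand,is,handy
```

Because \b treats punctuation such as ( ) , ' and " as non-word characters, the noise word is matched next to any of them. One side effect of splitting on \W+ is that a word containing an apostrophe would be broken into two pieces.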
I haven't really got used to the pattern syntax yet and am struggling, so any help would be greatly appreciated!
You can e-mail me at alex@suboceanic.net
Thanks,
AlexT