strip out tag and content with regular expression

mr_slava

Hi everyone,

I've got a little problem I can't resolve myself and would like to ask for some help. I use regular expressions to remove some dangerous tags from HTML-formatted email, and I also remove everything before the <BODY..> tag and after </body>. The regular expression that removes everything from <HTML> to <BODY...> looks like this:

eregi_replace("^(.|\s)*<body[^>]*>", "", $htmlstring);

and it works well with regular HTML-formatted emails. My problems begin when a user receives an email with several HTML documents in the body of the letter (e.g. the best company in the world, Microsoft, sends its newsletter in HTML format with several instances of HTML in the body). The raw body looks like:
<html>....<body>...</body></html><html>...<body..>...</body></html>
Of course, my regular expression returns just the last part of this email.
So the question is: what regular expression will strip every instance of <html>...any tags (but not another HTML or BODY)...<body...> from the given HTML?

Thanks.
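
One possible approach to the question above, offered as a minimal sketch rather than a definitive answer. It assumes PCRE (preg_replace) instead of eregi_replace(), since the ereg functions were removed in PHP 7, and it assumes every <html> part contains its own <body...> tag. The sample string and the variable names $greedy and $clean are made up for illustration. The idea is to make the quantifier lazy (.*?) and drop the ^ anchor, so each <html>...<body...> header is matched separately instead of one match running to the last <body...>.

```php
<?php
// Hypothetical sample: two HTML documents concatenated in one message body.
$htmlstring = '<html><head><title>A</title></head><body bgcolor="#fff"><p>one</p></body></html>'
            . '<HTML><BODY><p>two</p></BODY></HTML>';

// Greedy version, similar in spirit to the eregi_replace() call above:
// ".*" runs to the LAST <body...>, so only the final part survives.
$greedy = preg_replace('/^.*<body[^>]*>/is', '', $htmlstring);

// Lazy version: ".*?" stops at the nearest <body...>, and without the "^"
// anchor every <html>...<body...> header in the string is removed.
$clean = preg_replace('/<html[^>]*>.*?<body[^>]*>/is', '', $htmlstring);

// Remove every </body>...</html> trailer the same way.
$clean = preg_replace('/<\/body\s*>.*?<\/html\s*>/is', '', $clean);

echo $greedy, "\n"; // <p>two</p></BODY></HTML>
echo $clean, "\n";  // <p>one</p><p>two</p>
```

The "i" modifier makes the match case-insensitive and "s" lets "." match newlines. If some part has an <html> tag but no <body...> tag, the lazy match will run on into the next part, so real mail would need an extra check for that case.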

Mordecai

Try looking at strip_tags(), it should be easier than a regex.

mr_slava

Originally posted by Mordecai
Try looking at strip_tags(), it should be easier than a regex.

strip_tags() is not what I am looking for, because that function only accepts "allowable_tags". I am not sure which tags to allow: actually everything, but nothing from one particular place, namely between <html> and <body..>. I know it's a complicated regex, especially with multiple HTML-BODY parts, and I'm not asking anyone to spend time writing it for me. I'm just asking whether someone has something similar already written, or some advice.
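
To illustrate the limitation described above, here is a small sketch of what strip_tags() does with an allow-list. The sample message and the allow-list "<p><b>" are assumptions chosen only for the example. The function removes tags by name wherever they occur and keeps the text between them, so text that sits between <html> and <body> (here the title) stays in the output.

```php
<?php
// Hypothetical sample message; the allow-list "<p><b>" is for illustration only.
$htmlstring = '<html><head><title>Newsletter</title></head>'
            . '<body><p>Hello <b>world</b></p></body></html>';

// Tags not in the allow-list are removed, but their text content is kept.
$result = strip_tags($htmlstring, '<p><b>');

echo $result, "\n"; // Newsletter<p>Hello <b>world</b></p>
```

The word "Newsletter" from the <title> element survives, which is why strip_tags() on its own cannot remove everything that comes before the <body..> tag.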