removing MS Word HTML from a file

martincrumlish

Hi,

I have a problem converting Word documents to HTML. As you probably know, when Word generates HTML it adds a lot of needless tags. Is there a way I can cut these out and leave only basic formatting tags such as <b>, <br>, <ol>, <li>, etc.?

I found this code to remove Word HTML:

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript 
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out html tags 
                 "'([\r\n])[\s]+'",                 // Strip out white space 
                 "'&(quot|#34);'i",                 // Replace html entities 
                 "'&(amp|#38);'i", 
                 "'&(lt|#60);'i", 
                 "'&(gt|#62);'i", 
                 "'&(nbsp|#160);'i", 
                 "'&(iexcl|#161);'i", 
                 "'&(cent|#162);'i", 
                 "'&(pound|#163);'i", 
                 "'&(copy|#169);'i", 
                 "'&#(\d+);'e");                    // evaluate as php 

$replace = array ("", 
                  "", 
                  "\\1", 
                  "\"", 
                  "&", 
                  "<", 
                  ">", 
                  " ", 
                  chr(161), 
                  chr(162), 
                  chr(163), 
                  chr(169), 
                  "chr(\\1)"); 

$content = preg_replace ($search, $replace, $content);

I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of plain text with no line breaks or anything. I'm afraid I don't fully understand the code above, so I was hoping someone here could help me out.

Basically, I need help modifying the code above to remove all the crap but still leave certain tags.
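
To make it clearer what I am after, here is a rough sketch of the direction I have in mind. It is not a tested solution. It assumes PHP's built-in strip_tags() with its allowed-tags argument is enough to keep the basic formatting; the function name clean_word_html and the exact list of tags are just placeholders of mine, and the attribute-stripping regex would break on attribute values that contain a ">" character.

```php
<?php
// Hypothetical helper (my own name, not from the snippet above):
// keep a small whitelist of formatting tags and drop everything else.
function clean_word_html($html)
{
    // Remove comments, including Word's conditional comments, with their contents.
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // Remove script, style and xml blocks together with whatever is inside them.
    $html = preg_replace('#<(script|style|xml)\b[^>]*>.*?</\1>#si', '', $html);

    // Keep only the listed tags. strip_tags() leaves their attributes untouched.
    $html = strip_tags($html, '<b><i><br><p><ol><ul><li>');

    // Drop attributes such as class="MsoNormal" from the opening tags that remain.
    $html = preg_replace('/<(\w+)\b[^>]*>/', '<$1>', $html);

    return trim($html);
}
```

I would then call it as $content = clean_word_html($content); in place of the preg_replace() line. The idea is a whitelist of tags to keep, rather than a regex that strips every tag.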

Thanks in advance,
Martin