I am working with a company that is publishing its documents online. We decided it would be easiest for them to create the docs in Word, save each one as an HTML file, then upload it to a MySQL database. Below is the script that was written to strip all the goofy MS Office tags that Word adds, so that only the basic HTML formatting is kept. However, we are having trouble escaping single quotes, double quotes and the '&'. These characters show up on the display page as a '?'. We have a pretty good handle on the rest of the situation; it is only these three characters that cause problems. Here is the script:
function cleanWordHTML($fileName)
{
//Open the file and read all of its contents into a variable
$fh = fopen($fileName, 'r');
$contents = fread($fh, filesize($fileName));
fclose($fh);
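//Keep only what sits between the body tags (i = case-insensitive, s = . also matches newlines)
preg_match('/(<body.*?>)(.*?)(<\/body>)/is', $contents, $matches);
//Fall back to the whole file if no body tags were found
$contents = isset($matches[2]) ? $matches[2] : $contents;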
//Remove word logic tags
$regs[0] = '/(<\!\[)(.*?)(\]>)/s';
//The pattern has only three groups, so '$4' was always empty; an empty string states the intent
$replaces[0] = '';
//Remove word xml tags
$regs[1] = '/(<|<\/)(o:)(.*?)(>)/';
$replaces[1] = '';
//Remove class attributes
$regs[2] = '/( class=.*?)( |>)/';
$replaces[2] = '$2';
$contents = preg_replace($regs, $replaces, $contents);
return $contents;
}
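One possible cause, offered as an assumption rather than a diagnosis: Word often saves HTML in the Windows-1252 encoding and swaps plain quotes for "smart" quotes, which a page served in another character set can render as '?'. The sketch below maps those bytes back to plain ASCII punctuation. The function name normalizeWordPunctuation is hypothetical and is not part of the script above, and the mapping is only safe if the file really is Windows-1252 (on UTF-8 input it would corrupt multi-byte characters).

```php
<?php
// Hypothetical helper: replaces Windows-1252 "smart" punctuation bytes
// with plain ASCII equivalents. Assumes $text is Windows-1252, not UTF-8.
function normalizeWordPunctuation($text)
{
    $map = array(
        "\x91" => "'",   // left single quote
        "\x92" => "'",   // right single quote / apostrophe
        "\x93" => '"',   // left double quote
        "\x94" => '"',   // right double quote
        "\x96" => '-',   // en dash
        "\x97" => '-',   // em dash
        "\x85" => '...', // ellipsis
    );
    // strtr with an array replaces each key with its value in one pass
    return strtr($text, $map);
}

// Usage: prints It's "quoted"
echo normalizeWordPunctuation("It\x92s \x93quoted\x94"), "\n";
```

This only addresses how the characters display. Quotes still need to be escaped when the text is written to MySQL, for example with mysqli_real_escape_string or a prepared statement, and a bare '&' that should appear as text in the HTML needs to be written as &amp;.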
If anyone has come across this before and made it work, please let me know.
Thanks
Matt -
Another hurricane weekend in Florida!