Hi
I am having a nightmare trying to come up with a regular expression to clean up Word-generated HTML.
Here's an example of the kind of thing I am trying to deal with:
<p>
<b style="font-weight: normal"><span size="5" style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB" xml:lang="EN-GB"><font color="#000000"><font face="Times New Roman">BREAKING NEWS!/></font></font></span></b>
</p>
I have already run this through the wonderful html_tidy extension so I can be sure all attributes are quoted, etc.
What I want to do is remove any attribute whose value contains "mso" (e.g. the entirety of style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB"), as it will certainly be Word-generated.
So it should end up reading:
<p>
<b style="font-weight: normal"><span size="5" xml:lang="EN-GB"><font color="#000000"><font face="Times New Roman">BREAKING NEWS!/></font></font></span></b>
</p>
Can you think of any regex to do this?
The closest I have got is:
[ ](.+)(mso)+([^"]+)"
Although this stops at the right place, it doesn't start at the beginning of the attribute declaration (the style=). Instead it starts at the first space in the line, because (.+) happily matches across quotes and other attributes, so far too much would be deleted.
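For anyone reading along, here is a rough sketch of one way this could be tackled, written in Python purely for illustration (the pattern itself should carry over to most regex engines). The idea is to anchor the match on the attribute name and to forbid quotes inside the value, so the match can neither start too early nor run past the closing quote. The function name strip_mso_attributes is made up for this example, and the sketch assumes every attribute value is double-quoted, as it should be after tidy has run.

```python
import re

# Assumption: attributes look like  name="value"  with double quotes only.
#   \s+        leading whitespace before the attribute
#   [\w:-]+    attribute name (letters, digits, underscore, colon, hyphen)
#   ="         equals sign and opening quote
#   [^"]*mso   anything except a quote, up to "mso"
#   [^"]*"     rest of the value, then the closing quote
MSO_ATTRIBUTE = re.compile(r'\s+[\w:-]+="[^"]*mso[^"]*"', re.IGNORECASE)


def strip_mso_attributes(html):
    """Remove every double-quoted attribute whose value contains 'mso'."""
    return MSO_ATTRIBUTE.sub("", html)


sample = (
    '<b style="font-weight: normal"><span size="5" '
    'style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB" xml:lang="EN-GB">'
    '<font color="#000000"><font face="Times New Roman">'
    'BREAKING NEWS!/></font></font></span></b>'
)

print(strip_mso_attributes(sample))
```

Because [^"]* cannot cross a quote, style="font-weight: normal" is left alone while the second style attribute is removed whole. One caveat: the pattern does not check that it is inside a tag, so ordinary text that happens to look like name="...mso..." would be stripped as well.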
voidstate