Regex Help - Stripping Word-generated HTML

voidstate

I am having a nightmare trying to come up with a regular expression to strip Word-generated HTML.

Here's an example of the kind of thing I am trying to deal with:

<p>
<b style="font-weight: normal"><span size="5" style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB" xml:lang="EN-GB"><font color="#000000"><font face="Times New Roman">BREAKING NEWS!/&gt;</font></font></span></b>
</p>

I have already run this through the wonderful html_tidy extension so I can be sure all attributes are quoted, etc.

What I want to do is remove any attribute with "mso" in its value (e.g. the entirety of style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB"), as it will certainly be Word-generated.

So it should end up reading:

<p>
<b style="font-weight: normal"><span size="5"  xml:lang="EN-GB"><font color="#000000"><font face="Times New Roman">BREAKING NEWS!/&gt;</font></font></span></b>
</p>

Can you think of any regex to do this?

The closest I have got is:

[ ](.+)(mso)+([^"]+)"

Although this stops at the right place, it doesn't start at the beginning of the attribute declaration (the style=). Instead, the match starts at the first space in the line, because the greedy .+ is free to run across earlier attributes and tags, so too much is deleted.
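To make the over-match concrete, here is a small Python sketch (Python's re module is used purely for illustration; the variable names are made up for this example). It runs the pattern above against a shortened version of the sample line and shows that the match begins at the very first space, inside the <b> tag:

```python
import re

# Shortened version of the sample line from this post.
html = ('<b style="font-weight: normal"><span size="5" '
        'style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB" xml:lang="EN-GB">')

# The attempted pattern, unchanged.
attempt = re.compile(r'[ ](.+)(mso)+([^"]+)"')

m = attempt.search(html)
matched = m.group(0)

# The match starts at the first space (right after "<b"), not at the
# space in front of the second style= attribute.
print(m.start())
print(matched)

# Deleting the match therefore also removes the first style attribute
# and the opening of the <span> tag.
damaged = attempt.sub('', html)
print(damaged)
```

The last line printed is the damaged markup, with everything between `<b` and ` xml:lang` gone.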

voidstate

mrhappiness

%style\s*=\s*"[^"]*mso[^"]*"%i

Replace every match of this with an empty string and see what happens.
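For anyone who wants to try the idea outside PHP, below is a minimal Python sketch, not a definitive implementation. It assumes all attribute values are double-quoted (as they are after running the markup through tidy) and uses [^"]* instead of .* so that a match can never run past the closing quote of an attribute. The names MSO_ATTR and strip_mso_attributes are hypothetical, and the pattern is widened to any attribute name rather than only style. It is a plain substring test, so any quoted value that happens to contain "mso" is removed as well.

```python
import re

# Leading whitespace, an attribute name, "=", then a double-quoted value
# that contains "mso" somewhere. [^"]* keeps the match inside one value.
MSO_ATTR = re.compile(r'\s+[\w:-]+\s*=\s*"[^"]*mso[^"]*"', re.IGNORECASE)


def strip_mso_attributes(html):
    """Remove every double-quoted attribute whose value contains 'mso'."""
    return MSO_ATTR.sub('', html)


sample = ('<b style="font-weight: normal"><span size="5" '
          'style="FONT-SIZE: 48pt; mso-ansi-language: EN-GB" xml:lang="EN-GB">'
          '<font color="#000000"><font face="Times New Roman">'
          'BREAKING NEWS!/&gt;</font></font></span></b>')

cleaned = strip_mso_attributes(sample)
print(cleaned)
```

Because the pattern also consumes the whitespace in front of the attribute, no double space is left behind in the tag. The same character-class approach should carry over to PHP's preg_replace with little change, though that has not been tested here.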