I am working on a site with a CMS back end. The people writing the articles typically use Word and then copy and paste into (I think) FCKeditor (maybe TinyMCE) on the back end. The issue is that, depending on what they did in Word, I often get a lot of extra, unwanted, or unnecessary HTML tags and attributes in the article.
For instance, an article may start like this and/or contain similar strings within the field:
<p class="ecxMsoNormal" style="text-indent: 13.5pt"> Where is...
or this
<div style="margin: 0in 0in 0pt">
I mostly want to remove the attributes from the tags, but I want to keep them on a tag such as 'a', where the 'href' is important. Of course, I also need to keep the formatting tags like 'p' and 'div' (maybe convert div to p).
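To make the goal concrete, here is a rough sketch of the behavior I am after, written in Python with only the standard library's html.parser. The language is just for illustration (it is not necessarily what the CMS runs), and the names KEEP_ATTRS, RENAME, AttributeStripper, and clean are made up for this example. It is a sketch under those assumptions, not a hardened sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: attributes to keep, per tag. All others are dropped.
KEEP_ATTRS = {"a": {"href"}}
# Hypothetical tag renames (the "maybe convert div to p" idea).
RENAME = {"div": "p"}


class AttributeStripper(HTMLParser):
    """Re-emits the markup it parses, minus non-whitelisted attributes."""

    def __init__(self):
        # Keep entities such as &amp; as-is instead of decoding them in text.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open_tag(self, tag, attrs, closing=">"):
        allowed = KEEP_ATTRS.get(tag, set())
        # The parser hands over unescaped attribute values, so re-escape them.
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        return f"<{RENAME.get(tag, tag)}{kept}{closing}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{RENAME.get(tag, tag)}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean(html):
    """Return html with every attribute removed except those in KEEP_ATTRS."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean('<p class="ecxMsoNormal" style="text-indent: 13.5pt"> Where is...</p>'))
    print(clean('<div style="margin: 0in 0in 0pt">Some text</div>'))
```

Because the comment handler is not overridden, HTML comments (including the conditional comments Word tends to paste in) are silently dropped. Renaming div to p blindly can also produce nested p tags when a div wraps paragraphs, so that part would need more care in practice.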
I've looked for some scripts to strip tags/attributes but nothing I have come across has worked well or at all. I'd really like to be able to clean these articles better.
Any tips?