Grabbing the words is easy, but looking for closing tags is a different matter.
For one, how would you decide which tags 'must' be closed, and which tags can be left open? You mention the HREF tag (strictly speaking, HREF is an attribute of the A tag), which obviously should be closed.
But what about the B tag, or TD? font?
If you want to go through with it, a simple way would be to scan your text for HTML tags. Every time you find a new opening tag, add it to the end of an array of 'opened' tags.
Whenever you find a closing tag, check if the last tag in the 'opened' array is of the same type. If so, that tag is now closed, and you remove the opening tag from the array (pop the little bastard off the end of the array, making the array shorter).
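If the array is empty once you've gone through the whole text, every tag you opened has been closed and your document should be complete. Anything still left in the array is a tag that never got closed.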
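To make the array idea above concrete, here is a rough sketch in Python (my choice of language, not something from the original question). The function name unclosed_tags, the regex, and the short list of tags that never get a closing tag are all assumptions for illustration. It does not handle comments, self-closing syntax like <b/>, or a '>' inside an attribute value, and it simply ignores a closing tag that doesn't match the last opened one.

```python
import re

# Tags that never take a closing tag. A partial list, assumed for illustration.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Matches an opening or closing tag and captures the optional slash and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def unclosed_tags(text):
    """Return the tags still open at the end of text, outermost first."""
    opened = []  # the 'opened' array, used as a stack
    for match in TAG_RE.finditer(text):
        is_closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in VOID_TAGS:
            continue  # nothing to close for these
        if not is_closing:
            opened.append(name)  # new opening tag: push it on the end
        elif opened and opened[-1] == name:
            opened.pop()  # matching closing tag: pop the last opened tag
        # A closing tag that does not match the last opened tag is ignored here.
    return opened


if __name__ == "__main__":
    print(unclosed_tags('<b>bold <a href="x">link</a>'))  # prints ['b']
```

An empty list back means everything that was opened got closed. How you decide which tags 'must' be closed is still up to you; here it is just whatever you leave out of VOID_TAGS.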
There's also a regexp example of finding matching tags somewhere in there... but I forgot where. :-)