You should not attempt to parse an HTML document using regular expressions. It won't work in the general case.
I recommend that you feed the document into the DOMDocument::loadHTML function, which will parse it correctly provided it has an HTML meta content-type tag.
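As a minimal sketch of that approach: the sample page, its links, and the variable names below are made up for illustration, and the page is assumed to declare its charset in a meta tag.

```php
<?php
// Hypothetical sample page; the meta tag declares the encoding (UTF-8 here).
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . "<title>Caf\xC3\xA9</title></head>"
      . '<body><a href="/one">One</a> <a href="/two">Two</a></body></html>';

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk the parsed tree instead of pattern-matching on the markup.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
```

Here $links should hold the two href values and $title the page title as a UTF-8 string.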
If the page has no meta content-type tag, you should read its encoding from the HTTP Content-Type header and somehow get loadHTML to treat that as the correct encoding. This is not straightforward, however, as loadHTML seems to always fall back to latin1 (ISO-8859-1) as the default.
What I ended up doing was having some code manually insert a meta http-equiv content-type tag if the document didn't already have one, copying the charset from the HTTP header.
This is a hack, however.
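Something along these lines would do it. This is a rough sketch rather than the exact code I used: ensureMetaCharset is a hypothetical helper name, and it assumes you already have the raw Content-Type header value as a string. It patches the markup at the string level (it does not try to parse it), which is exactly why it counts as a hack.

```php
<?php
/**
 * Hypothetical helper: if $html declares no charset in a meta tag,
 * inject one built from the HTTP Content-Type header value.
 */
function ensureMetaCharset(string $html, string $contentTypeHeader): string
{
    // Leave the document alone if a meta tag already mentions a charset.
    if (preg_match('/<meta[^>]+charset\s*=/i', $html)) {
        return $html;
    }
    // Pull the charset out of a value such as "text/html; charset=ISO-8859-2".
    if (!preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return $html; // the header gives no charset either
    }
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $m[1] . '">';
    // Insert right after the opening <head> tag if present, otherwise prepend.
    $patched = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    return $count > 0 ? $patched : $meta . $html;
}
```

You would call it just before loadHTML, e.g. $doc->loadHTML(ensureMetaCharset($body, $contentTypeHeader)), where $body and $contentTypeHeader come from whatever HTTP client you are using.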
All DOM methods return their string values in UTF-8, regardless of the original encoding, so from that point forward there is no problem.
Mark