If I were working with just one news feed this wouldn't be so bad, but I have 20 or so now and more will likely be added. Right now this is what I have to do to fix very poorly built news feeds, and it will likely get worse:
$trans = array('“' => '"', '”' => '"', '…' => '...', '`' => "'", '´' => "'", '‘' => "'", '’' => "'", '–' => '-', '–' => '', '’' => "'", '“' => '"', '”' => '"',
'—' => '-', '—' => '-', 'Â' => '', 'Â' => '', ' ' => ' ', '’' => "'", '’' => "'", '‘' => "'",
'‘' => "'", '—' => '-', '—' => '-', '–' => '-', '–' => '-', '&' => '&',
'…' => '...', '…' => '...', 'Ã' => '', 'Ã' => '', '©' => '', ' ' => ' ', "\r\n" => ' ', "\n" => ' ', "\r" => ' '
);
// ...
$description = (isset($item->description)) ? trim($this->db->real_escape_string(htmlentities(strtr(strip_tags(html_entity_decode($item->description, ENT_QUOTES)), $trans), ENT_QUOTES))) : '';
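Several keys in the $trans table above (such as 'Â' and 'Ã') look like what is left over when UTF-8 bytes are read as Windows-1252, so part of the table may be papering over an encoding mismatch rather than genuinely odd characters. A minimal sketch of reversing one round of that, assuming the feed text reaches PHP as UTF-8 and that the iconv extension is available; repair_mojibake is a hypothetical helper name, not something from the code above:

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 bytes read as Windows-1252".
// Heuristic only: the text is changed only when the result is valid UTF-8.
function repair_mojibake(string $text): string
{
    // Map each character back to the single byte it originally stood for.
    // @ suppresses the notice iconv raises for unrepresentable characters.
    $bytes = @iconv('UTF-8', 'Windows-1252', $text);
    if ($bytes === false || $bytes === $text) {
        return $text; // not representable in Windows-1252, or plain ASCII
    }
    // Keep the recovered bytes only if they form valid UTF-8.
    return preg_match('//u', $bytes) === 1 ? $bytes : $text;
}
```

Text that was never mangled (for example a correctly encoded "café") fails the validity check and comes back unchanged, so the helper should be safe to run over every feed, though it is still a guess about what went wrong upstream.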
Yes, you guessed it: one of the feeds appears to be running "htmlentities" (or a similar function) twice on the feed, and doubling up "html_entity_decode" doesn't seem to fix it. Some feeds wrap HTML in a CDATA section, but all the HTML entities inside it are converted, so you have to decode just to strip the tags and then encode again so quotes don't cause problems. It only gets worse, as many feeds use several, if not every, variation of single and double quotes, and then they go further and include all kinds of other odd characters (I have to wonder where people get their keyboards from). I used CDATA to overcome some problems, but not all, in the "description" tag, but I am seeing problems in the "title" tag too and will likely have to use CDATA there as well, which seems very odd to me.
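For the double-encoded feeds, one option is to decode in a loop until the text stops changing, with the charset given explicitly instead of left to the PHP default. This is only a sketch under those assumptions; decode_fully and its $maxPasses parameter are hypothetical names, and it will over-decode text that legitimately contains a literal entity such as "&amp;amp;":

```php
<?php
// Hypothetical helper: decode HTML entities repeatedly until the string
// is stable, capped at $maxPasses so odd input cannot loop for long.
function decode_fully(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $text) {
            break; // nothing left to decode
        }
        $text = $decoded;
    }
    return $text;
}

// Usage: decode first, then strip the tags that the decoding exposed.
$raw   = '&amp;lt;b&amp;gt;Tom &amp;amp; Jerry&amp;lt;/b&amp;gt;';
$clean = strip_tags(decode_fully($raw)); // "Tom & Jerry"
```

Re-encoding with "htmlentities" would then happen once, at output time, rather than before the value is stored.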
It's getting crazy what I have to do to use some news feeds. So I have to ask: is there some better technique, one I have not seen, for dealing with odd characters and character codes?