I am trying to import news articles in XML form into a mysql table for a news site. This works fine except there are Unicode characters that are not being imported and are instead turning into "?" characters and other oddball characters. The most frequent is the use of — which is an EM Dash used by news agencies.
The news story is separated into paragraphs in the XML using <CPStoryPara></CPStoryPara> as the tag. The following code is what is used to take the paragraphs, wrap them in HTML Paragraph tags, build a string and then update the MySQL record.
$string=file_get_contents($url);
$xml = new SimpleXMLElement($string);
$data=$xml->NewsItem->NewsComponent->ContentItem[0]->DataContent->CPOnlineFile->CPStory;
$newsstory="";
foreach ($xml->NewsItem->NewsComponent->ContentItem[0]->DataContent->CPOnlineFile->CPStory->CPStoryPara as $index) {
$newsstory.="<p>".$index."</p>";
}
$update="UPDATE canadianpress_briefs SET story='".addslashes($newsstory)."' WHERE id='".$row_story['id']."'";
mysql_query($update);
When I look at the MySQL Record, it has instead of — there is a ? in it's place.
I am running Apache 2, MySQL 5 and PHP 5. What solutions are out there to fix this problem and convert the XML record properly.
Thank you for your help.
Phil