I need to parse XML files which contain a lot of 8859-1 extended characters (basically names in Irish Gaelic).

When I run this through the parser I currently use for similar processing (new SimpleXMLElement($xmlstr)), the encodings come out wrong.

If I have an item:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
    <item>
        <type>Memorial</type>
        <name>SURNAME Éadaoín</name>
        <text>In memory of Éadaoín.
Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
        </text>
        <date>2008-09-20</date>
    </item>
</all>

Once it has been through SimpleXMLElement, it comes out as:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
    <item>
        <type>Memorial</type>
        <name>SURNAME ÉadaoÃ*n</name>
        <text>In memory of ÉadaoÃ*n.
Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
        </text>
        <date>2008-09-20</date>
    </item>
</all>

I've tried to convert the characters before it gets put through the function:

$rawxml = file_get_contents($file);

$xmlstr = str_replace($chars, $replace, $rawxml);

$xml = new SimpleXMLElement($xmlstr);

and the correct translations are there.

But when the string is then parsed, it seems to decode them again and produce the same errors.

Where am I going wrong? Has anyone any ideas?

For info, this will be parsing approximately 150 records a week, and characters with acute and grave accents could appear anywhere in the text.

    Sounds a lot like this thread. So far no solution there, but you might want to keep an eye on it or check to see if you can try running with the latest libxml version.

      Thanks for taking the time to point me to that thread, NogDog. I'm a newbie here, and I tried a search for SimpleXML, but nothing really matched up with what's happening.

      Unfortunately I am on a virtual server and so have no control over what versions of software are running there.

      Cheers,

      H

        I've got an answer (of sorts, it works but seems pretty hardcore).

        XML Encoding in file:

        <?xml version="1.0" encoding="UTF-8" ?>

        Data in:

        $rawxml = file_get_contents($file);
        
        $xmlstr = utf8_encode(str_replace($chars, $replace, $rawxml));
        
        $xml = new SimpleXMLElement($xmlstr);

        Create Variables:

        $name = utf8_decode(str_replace($characters, $entities, $item->name));
        $text = utf8_decode(str_replace($characters, $entities, $item->text));

        What appears to happen here is that $xmlstr, with the character substitutions applied, is encoded as UTF-8 and is then parsed correctly when it hits SimpleXMLElement.
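
        For anyone wondering where the "Ã" sequences come from, here is a minimal sketch of one plausible reading: SimpleXML hands strings back as UTF-8, and the two UTF-8 bytes for "í" look like "Ã" plus a soft hyphen when they are displayed as ISO-8859-1. It assumes the mbstring extension is available (I haven't checked that on my host), and the variable names are just for illustration.

        ```php
        <?php
        // "í" as UTF-8: the two bytes C3 AD.
        $iAcuteUtf8 = "\xC3\xAD";

        // Simulate a viewer that treats each byte as one ISO-8859-1 character
        // by re-encoding the bytes as if they were ISO-8859-1 text.
        $misread = mb_convert_encoding($iAcuteUtf8, 'UTF-8', 'ISO-8859-1');

        // The misread text is U+00C3 ("Ã") followed by U+00AD (soft hyphen).
        var_dump($misread === "\u{00C3}\u{00AD}");

        // Converting to ISO-8859-1 before output gives the single byte ED instead.
        $iAcuteLatin1 = mb_convert_encoding($iAcuteUtf8, 'ISO-8859-1', 'UTF-8');
        var_dump(bin2hex($iAcuteLatin1));
        ```

        If that reading is right, it would explain why decoding the values on the way out makes the output look correct on an ISO-8859-1 page.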

        When the variables are then generated they come out correctly. There's every chance that I don't have to use utf8_decode on these strings, but since it works and performance isn't an issue (it runs as a twice-weekly cron job), I'm not going to tempt fate and play about with it too much.
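
        For later readers: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, so here is a rough sketch of the same round trip using mb_convert_encoding() instead. It is not what I actually run; it assumes mbstring is installed and that the file really is ISO-8859-1 and says so in its XML declaration. The sample document and variable names are made up for the example.

        ```php
        <?php
        // Hypothetical sample input built from explicit bytes, so the example does
        // not depend on how this script is saved. In ISO-8859-1, C9 is É and ED is í.
        $latin1Xml = '<?xml version="1.0" encoding="ISO-8859-1" ?>'
            . "<all><item><name>SURNAME \xC9adao\xEDn</name></item></all>";

        // The parser reads the encoding declaration, so the raw bytes go in as-is.
        $xml = new SimpleXMLElement($latin1Xml);

        // SimpleXML returns string values as UTF-8, whatever the input encoding was.
        $nameUtf8 = (string) $xml->item->name;

        // Convert back only if the destination (page, database, file) is ISO-8859-1.
        $nameLatin1 = mb_convert_encoding($nameUtf8, 'ISO-8859-1', 'UTF-8');

        // Hex dumps make the difference visible regardless of terminal encoding.
        echo bin2hex($nameUtf8), "\n";
        echo bin2hex($nameLatin1), "\n";
        ```

        If the output is going to a UTF-8 page or database anyway, the last conversion can probably be dropped and $nameUtf8 used directly.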
