[RESOLVED] utf8 and xml pain.

doublehops · Dec 2, 2009

Yes. sorry.

I am creating the instance with

$this->xml = new DOMDocument( "1.0", "UTF-8" );

and then adding

<?xml version="1.0" encoding="utf-8" ?>

to the start of the document. I'm not sure if the second part is necessary but I seem to get the same result with or without it.

blackhorse · Dec 2, 2009

I don't think addattribute changed it for you. I think your original data (from the database?) has already been run through htmlentities().

You could use html_entity_decode($string, ENT_COMPAT, 'utf-8') to decode it back to real UTF-8 characters (not number-based entities); your XML will accept Unicode characters.
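A minimal sketch of that suggestion, assuming the stored value contains named HTML entities. The sample string, the element name "person" and the attribute name "name" are made up for illustration and are not taken from the original poster's code:

```php
<?php
// Hypothetical sample value: a name stored with named HTML entities.
$stored = 'Tom&aacute;&scaron;';

// Decode the entities into real UTF-8 characters before handing the string to DOM.
$decoded = html_entity_decode($stored, ENT_COMPAT, 'UTF-8');

$doc  = new DOMDocument('1.0', 'UTF-8');
$node = $doc->appendChild($doc->createElement('person'));
$node->setAttribute('name', $decoded);

// With the document encoding declared as UTF-8, saveXML() writes the
// decoded characters directly instead of entity references.
$xml = $doc->saveXML();
echo $xml;
```

If the attribute still comes out as entities after this, the decoding step is probably not where the problem lies.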

Weedpacket · Dec 2, 2009

There's no reason I can think of for DOMDocument to use named entities - they're simply not part of XML. In fact it wouldn't need to use numeric ones either since Á isn't special to XML.

I've tried duplicating your problem, but without success.

<?php
$doc = new DOMDocument("1.0","UTF-8");
$node = $doc->createElement("para");
$newnode = $doc->appendChild($node);
$newnode->setAttribute("align", utf8_encode('&#8364;'));

echo $doc->saveXML();
?>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<para align="&#8364;"/>

Could you post some code that does fail, that we may see it for ourselves?

doublehops · Dec 8, 2009

Thanks guys,

Sorry for the late reply, I have been stuck on more urgent issues.

You both gave me some ideas to work with.

Response to blackhorse:

You are right. The string is already UTF-8 encoded in the database. However, running the string through html_entity_decode() as you've described before adding it as an attribute still results in the string being added as Tomá&scaron.

Response to Weedpacket:

I have tried your exact code on two separate servers and received two different strings, neither matching what you found.

Local dev box: Ubuntu 9.04, PHP 5.2.6. Resulting string: <para align="â‚¬"/>
Remote server: Ubuntu 8.10, PHP 5.2.6. Resulting string: <para align="\u20ac"/>

Neither of which seems to be what I expect. Could this be an environment issue?

scrupul0us · Dec 8, 2009

Not sure if it makes any difference, but are you encapsulating that data within CDATA tags?

Also, what collation/charset are you using in your database?

doublehops · Dec 8, 2009

Wrapping the string inside CDATA tags would not change the string. The output still validates as xml as it is.

A mysqldump gave me the following settings for the database in relation to collation and charset.

-- Server version 5.0.75-0ubuntu10.2-log

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

blackhorse · Dec 9, 2009

Just try plain html_entity_decode() first, to see if it works.

If in some cases html_entity_decode() still cannot get rid of some entity codes, it may be because the entities were encoded from Unicode text, in which case html_entity_decode($string, ENT_COMPAT, 'utf-8') will solve that problem.

For data in another character set, pass that character set to html_entity_decode() instead.

The problem is that working with merged 3rd party data can be a headache, because some of it may not be entity-encoded at all, some of it is entity-encoded from Unicode, some of it is Unicode and some of it is not.

blackhorse · Dec 9, 2009

Try this

http://drupal.org/files/issues/212130-decode-entities-support-all-entities.patch

blackhorse · Dec 9, 2009

1) Set my own database and its tables (which I read the 3rd party data into) to character set utf8.

2) Use a header like this in the PHP page that exports the XML:

header('Content-Type: text/xml; charset=utf-8');
echo "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";

3) Use utf8_encode, html_entity_decode etc. to clean up the data.

That is what I do when I have to read 3rd party data and export it to XML.

doublehops · Dec 9, 2009

I will look into your latest suggestions shortly, blackhorse, but I'll add what I've been up to this afternoon.

The issue is caused by the data format in the database or the way in which the data comes out. I have been able to get the string to output properly within a standalone file by sending the query 'SET NAMES utf8' to MySQL before requesting the data. After that the data is successfully added as an attribute in the way that I expect.

I found the SET NAMES trick here: http://www.marteydodoo.com/2005/08/02/wordpress-utf-8-charset-woes/

I then found that the database was set to latin1. I therefore removed the database, re-created it as utf8 and inserted the data back in again, ensuring that each table is also set to utf8.

However, the data will not format correctly without the 'SET NAMES utf8' query before retrieving the data, regardless of whether I run the data through html_entity_decode() or not.

Also, putting this workaround into my WordPress plugin using WordPress's database object (i.e. $wpdb->query( "SET NAMES 'utf8'" ); ) does not resolve my problem there. WordPress may be converting the data in the background in a way that I have no control over.

doublehops · Dec 10, 2009

I have just solved my issue.

Ultimately my issue was using the method saveHTML() rather than saveXML(). This sounds obvious, but after following a DOMDocument tutorial I was unaware that there was a different method for saving, and after the script had been working this way for some time I didn't think to look back at the save method.
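For anyone landing here later, a small sketch of the difference between the two methods. The element name and sample value are only illustrative, and the exact saveHTML() output is an assumption that can vary with the installed libxml2 version:

```php
<?php
$doc  = new DOMDocument('1.0', 'UTF-8');
$node = $doc->appendChild($doc->createElement('para'));
$node->setAttribute('align', 'Tomáš');

// XML serializer: emits the XML declaration and, because the document
// encoding is UTF-8, keeps the non-ASCII characters as they are.
$asXml = $doc->saveXML();

// HTML serializer: no XML declaration, and non-ASCII characters may be
// replaced with HTML entities such as &scaron; (depends on libxml2).
$asHtml = $doc->saveHTML();

echo $asXml, "\n", $asHtml;
```

Comparing the two outputs side by side is a quick way to check which serializer a script is actually using.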

I also needed to apply html_entity_decode() to the required fields, as pointed out by blackhorse.

Thanks all for your help.
