As I understand it, the PHP5 DOM classes are supposed to use UTF-8. However, to get the correct display results, I have to call utf8_decode() on Node::nodeValue. Can someone please explain why?
I also have an issue with not showing up as it should, but perhaps that's a side effect of the character encoding problem.
I have tried sending the HTML code as plain text along with the DOM-processed code to rule out other places where the encoding might not be handled correctly.
HTML code used (saved as UTF-8, no byte order mark):
<ul>
<li>å</li>
<li>ü</li>
<li>ê</li>
</ul>
The DOM tree is used to create an array structure which is sent as a JSON string and then used in javascript to recreate the DOM structure and append it as needed to an existing node in the page.
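For reference, here is a rough sketch of the kind of structure I mean, written in JavaScript for brevity. The `type` and `content` fields match the client code further down; `children` is a made-up field name used only for illustration. The sketch also checks that a string survives the JSON round trip whether a non-ASCII character is sent literally or as a `\uXXXX` escape, so the JSON layer itself should not care which form the server emits.

```javascript
// Hypothetical node-object structure for the sample list.
// `children` is an assumed field name, not taken from the real code.
const tree = {
  type: "ul",
  children: ["å", "ü", "ê"].map((ch) => ({
    type: "li",
    children: [{ type: "#text", content: ch }],
  })),
};

// Serialize and parse back, roughly as the server and client would.
const json = JSON.stringify(tree);
const roundTripped = JSON.parse(json);

// A \u escape and the literal character parse to the same string.
const escaped = JSON.parse('"\\u00e5"');
const literal = JSON.parse('"å"');
```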
Text content (nodes of type #text) is handled by one line of code (shown here with utf8_decode, which I believe should not be necessary):
else if ($node->nodeName === "#text") {
return utf8_decode($node->nodeValue);
}
When the response gets to the client, the HTML code that was sent as plain text is simply used to create a textNode which is added to the page. The only reason I did this was to see if I got the same character encoding issues here, which I didn't:
var d = document;
d.childNodes[1].insertBefore(d.createTextNode(meta.plaintext), d.childNodes[1].firstChild);
The array structure sent in the response is javascript-evaled into an array/object structure, where each object represents a node. This is used to rebuild the DOM tree so that it can be inserted in the page as needed. The text nodes are once again processed in a single line of code:
if (o.type == "#text" || o.type == "#cdata-section") {
return d.createTextNode(o.content);
}
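A slightly fuller sketch of that rebuild step, in case the details matter. The `buildNode` name and the `children` field are placeholders of mine, and the document object is passed in as `d` so the function can be exercised outside a browser. `JSON.parse` stands in for the eval call here; it is a substitution for illustration, not what the page currently does.

```javascript
// Rebuild a DOM subtree from a parsed node object.
// `type` and `content` are as in the snippet above; `children` is an assumed field name.
function buildNode(o, d) {
  if (o.type === "#text" || o.type === "#cdata-section") {
    return d.createTextNode(o.content);
  }
  const el = d.createElement(o.type);
  for (const child of o.children || []) {
    el.appendChild(buildNode(child, d));
  }
  return el;
}

// Parse the response body without eval (responseText is a placeholder value).
const responseText = '{"type":"li","children":[{"type":"#text","content":"\\u00e5"}]}';
const nodeObject = JSON.parse(responseText);
```

In the page this would be called as `buildNode(nodeObject, document)`.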
Output...
The HTML sent as plain text shows up as:
<ul> <li>å</li> <li>ü</li> <li>ê</li> </ul>
Processed HTML, without using utf8_decode on Node::nodeValue, shows up as:
* Ã¥
* Ã¼
* Ãª
Processed HTML, using utf8_decode on Node::nodeValue, shows up as:
* å
* ü
* ê
The received headers, as reported by the client, include the line:
Content-Type: text/plain; charset=utf-8
The HTML code is saved as UTF-8, the header is set to UTF-8, and the part of the response that is sent as plain text shows up correctly, so it seems to me that the problem does indeed originate with Node::nodeValue. But why? The same issue arises with Node::textContent.
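To make the symptom concrete, the sketch below reproduces it in JavaScript under one assumption that I have not confirmed: that somewhere along the way the UTF-8 bytes are read as ISO-8859-1 and then converted to UTF-8 a second time. `misreadUtf8AsLatin1` and `undoMisread` are illustrative names, not part of any library; the second one mimics the effect of utf8_decode followed by a UTF-8 display.

```javascript
// Simulate UTF-8 bytes being misread as ISO-8859-1: one character per byte.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Reverse the misreading: treat each character (code point <= 0xFF)
// as a single byte, then decode the byte sequence as UTF-8.
function undoMisread(text) {
  const bytes = Uint8Array.from(text, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const garbled = misreadUtf8AsLatin1("å"); // two characters: "Ã¥"
const repaired = undoMisread(garbled);    // back to "å"
```

If this is what is going on, the open question is which step performs the extra conversion.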
Grateful for any help.