[RESOLVED] DOM and character encoding

johanafm · Jul 3, 2009

As I understand it, the PHP5 DOM classes are supposed to use UTF-8. However, to get the correct display results, I need to utf8_decode(Node::nodeValue). Can someone please explain why?
I also have an issue with &nbsp; not showing up as it should, but perhaps that's a side effect of the character encoding problem.

I have tried sending the HTML code as plain text along with the DOM-processed code to rule out other places where the encoding might not be handled correctly.

HTML code used (saved as UTF-8, no byte order mark)

<ul>
	<li>å</li>
	<li>ü</li>
	<li>ê</li>
</ul>

The DOM tree is used to create an array structure which is sent as a JSON string and then used in javascript to recreate the DOM structure and append it as needed to an existing node in the page.

Text nodes (type #text) are handled by one line of code, shown here with utf8_decode, which I believe should not be necessary:

else if ($node->nodeName === "#text") {
	return utf8_decode($node->nodeValue);
}

When the response gets to the client, the HTML code that was sent as plain text is simply used to create a textNode which is added to the page. The only reason I did this was to see if I got the same character encoding issues here, which I didn't:

var d = document;
d.childNodes[1].insertBefore(d.createTextNode(meta.plaintext), d.childNodes[1].firstChild);

The array structure sent in the response is javascript-evaled into an array/object structure, where each object represents a node. This is used to rebuild the DOM tree so that it can be inserted in the page as needed. The text nodes are once again processed in a single line of code:

	if (o.type == "#text" || o.type == "#cdata-section") {
		return d.createTextNode(o.content);
	}

Output...

The HTML sent as plain text shows up as:

<ul>  <li>å</li>  <li>ü</li>  <li>ê</li> </ul>

Processed HTML, without using utf8_decode on Node::nodeValue, shows up as:

    * Ã¥
    * Ã¼
    * Ãª

Processed HTML, using utf8_decode on Node::nodeValue, shows up as:

    * å
    * ü
    * ê

The received headers, as reported by the client, include the line

Content-Type: text/plain; charset=utf-8

Since the HTML code is saved as UTF-8, the header is set to UTF-8 and the part of the response that is sent as plain text shows up correctly, it seems to me that the problem does indeed originate with the Node::nodeValue. But why? And the same issue arises with the use of Node::textContent.

Grateful for any help

Bjom · Jul 3, 2009

Maybe setting PHP's default charset will help?

Locate the default_charset keyword in php.ini and change the line to this:

default_charset = "UTF-8"

johanafm · Jul 6, 2009

Thanks for the suggestion, but unfortunately it does not help.

Shrike · Jul 6, 2009

How are you sending the JSON string? Presumably just a simple print. You might find some mileage in sending a content type header first.

johanafm · Jul 8, 2009

Thanks, but I'm already sending headers. The response class just makes sure the response always has the same structure and carries the same basic information. Stripped down for testing purposes, sending the response looks like this:

class response {
	public $data;
	public $meta;

	function __construct($data, $meta) {
		$this->data = $data;
		$this->meta = $meta;
	}
}

header("Content-type: text/plain; charset=utf-8");
echo json_encode(
	new response(
		array($harr->GetArray()),
		array('status' => 'html',
			'plaintext' => $harr->GetString())
	)
);

GetArray() returns an array of objects representing DOM nodes. To build this array, the PHP 5 DOM classes are used to process the HTML code (see last code block in this post).
GetString() returns the html code as a string.

The only difference between the text in the objects and the plain string is that the object strings come from $node->nodeValue.

The string 'plaintext' displays as it should and the browser reports the response headers to contain "Content-Type ... =utf-8". Thus, I'm certain the issue is not with file character encoding, response character encoding or browser encoding detection.

Also, I have found a way around the problem, even though I still do not understand the why and how: converting the html code from utf-8 to windows-1252!

	$html = iconv("UTF-8", "Windows-1252", $html);

$html comes from a file most definitely saved as UTF-8 (no BOM). If I instead save it as Western (Windows Latin-1), which presumably is my editor's name for "Windows-1252", it works without the call to iconv(). If the file is saved as UTF-8, then the above conversion is needed. Also, please do note that our development server is running on OSX.
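A sketch of one possible explanation (an assumption at this point, not something the thread has confirmed): if the UTF-8 bytes are read as a single-byte Latin-1 style encoding and then re-encoded as UTF-8, å (bytes C3 A5) turns into Ã¥, which matches the output shown earlier. Converting the input to Windows-1252 first, or calling utf8_decode afterwards, would each cancel that misreading. The function names below are hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helpers illustrating the suspected double encoding.
// Assumption: the mbstring extension is loaded.

// Treat UTF-8 bytes as if they were ISO-8859-1 and re-encode them as UTF-8.
// This reproduces the kind of garbling seen in the processed output.
function misreadAsLatin1($utf8)
{
	return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Reverse the misreading: convert UTF-8 back to ISO-8859-1 bytes,
// which is the same conversion utf8_decode() performs.
function undoMisread($garbled)
{
	return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}
```

Passing the UTF-8 string "å" through misreadAsLatin1() gives the two-character string "Ã¥", and undoMisread() turns that back into the original bytes.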

The class dealing with the "html->dom->array" conversion looks like this, apart from the simple getter functions for private data members.

<?php
class HtmlArray {
	private $dom;
	private $arr;
	private $str;

	public function __construct($html) {
		// Comment out this line for strange character encoding...
		$html = iconv("UTF-8", "Windows-1252", $html);

		$this->str = $html;
		$this->dom = new DOMDocument('1.0', 'utf-8');
		$this->dom->loadHTML($html);
		$this->arr = array();

		/*	<html>
				<head></head>		- item(0)
				<body>actual content</body>		- item(1)
			</html>*/
		// Start building from the body node
		$this->arr = $this->HtmlToArray($this->dom->childNodes->item(1));
		$this->arr = $this->arr[0];		// Keep only actual content. Ditch the body node
	}

	// Go through all of a node's childnodes. Text nodes never have childnodes: just return the actual text
	private function HtmlToArray($node) {
		if ($node === null)
			return null;

		$nodeArr = array();
		// Go deeper
		if ($node->childNodes) {
			for ($i = 0; $i < $node->childNodes->length; ++$i) {
				$nodeArr[] = $this->HtmlToArrayTail($node->childNodes->item($i));
			}
			return $nodeArr;
		}

		/*** ***
		 * This is where the strange encoding issues appear without iconv() above.
		 *** ***/
		else if ($node->nodeName === "#text") {
			return $node->nodeValue;
		}
		else
			return null;
	}

	// Go through one node's data. If node->content is another node, proceed deeper
	private function HtmlToArrayTail($node) {
		if ($node === null)
			return null;

		$nodeArr = array();
		$nodeArr['type'] = $node->nodeName;
		$nodeArr['content'] = $this->HtmlToArray($node);
		$nodeArr['attrs'] = array();
		if ($node->attributes != null) {
			foreach ($node->attributes as $attrName => $attrNode) {
				$nodeArr['attrs'][$attrName] = $attrNode->value;
			}
		}

		return $nodeArr;
	}
}

Shrike · Jul 8, 2009

Seems very odd; I don't have a lot to offer. Are you specifying the encoding in the HTML input file? I note that DOMDocument::encoding will tell you what the DOM library thinks the encoding is.
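For reference, a minimal sketch of that check. The helper name is hypothetical; it loads a string with loadHTML and returns whatever DOMDocument::encoding reports, which may be null when the parser did not pick up any encoding.

```php
<?php
// Hypothetical helper: report the encoding the DOM library settled on
// after parsing an HTML string.
function reportedHtmlEncoding($html)
{
	$dom = new DOMDocument();

	// Collect parser warnings internally instead of printing them.
	$previous = libxml_use_internal_errors(true);
	$dom->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	// A string such as "utf-8", or null if nothing was detected.
	return $dom->encoding;
}
```

Comparing the result for the input with and without an encoding declaration should show whether the declaration is being picked up.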

johanafm · Jul 8, 2009

Thanks for that one! It seems the character encoding is based on

	<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Anything else is apparently disregarded.
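A minimal sketch of a workaround that follows from this, assuming the input is a UTF-8 fragment with no charset declaration of its own: prepend the meta tag before calling loadHTML, so that neither utf8_decode nor iconv is needed. Both function names are hypothetical and not part of the HtmlArray class above.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment so that loadHTML()
// reads its bytes as UTF-8.
function loadUtf8Html($html)
{
	$dom = new DOMDocument();

	// The declaration the parser bases its encoding on.
	$meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

	// Collect parser warnings internally instead of printing them.
	$previous = libxml_use_internal_errors(true);
	$dom->loadHTML($meta . $html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	return $dom;
}

// Hypothetical helper: collect the text of every <li> element.
function listItemTexts(DOMDocument $dom)
{
	$texts = array();
	foreach ($dom->getElementsByTagName('li') as $li) {
		$texts[] = $li->nodeValue;
	}
	return $texts;
}
```

With the test list from the first post, listItemTexts(loadUtf8Html($html)) should return the three characters as plain UTF-8 strings, ready for json_encode.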
