Hi all,

My apologies if my DOM/XML vocabulary is a bit off, but this is my first attempt....

I'm trying to convert a 50MB XML file (containing 35,000 <$prod_tag> tags) into a CSV file, with each line representing the tags within a <$prod_tag></$prod_tag> pair.

The script parses 100 <$prod_tag> elements at a time, saving the data in an array, then dumps it to file. The array is then reset to save memory.

However, the script starts off parsing 100 tags in ~1 sec, but as time goes by it gets slower and slower, eventually taking ~5 secs per 100 tags, until the script times out.

Firstly, why does tree traversal slow down? I've echoed memory_get_usage() every time the script dumps data to file, and memory usage looks fine (it maxes out at ~1.1MB around the 100th tag).

Is it all the writing to disk, or is it the way DOM trees work? Can I not unset DOM nodes once they've been parsed?

I'm lost! Anyway, here's my bare parsing script....

$doc = new DOMDocument();
$doc->load($feed_filename);

$items = $doc->getElementsByTagName($prod_tag);

// $handle is assumed to be an already-open output file handle;
// $prod_tag and $column_names are set earlier in the script.
$numItems = 0;
$todump = "";

foreach($items as $item) { // no "&" here: a DOMNodeList can't be iterated by reference

	foreach($column_names as $column) { //load up the data for each column tag
		$allnodes = $item->getElementsByTagName($column);
		$i=0;

		foreach($allnodes as $node) { //there can be >1 instance of the column tag (esp. categories path)
			if($i==0) $data[$column] = $node->nodeValue;
			else $data[$column] .= "##".$node->nodeValue; //yup, there were >1 instances. Delimit with ##
			$i++;
		}
	}

	$numItems++;

	$k=0;
	foreach($data as $val) {
		if( ($numItems>1) && ($k===0) ) $todump .= "\n";
		if($k===0) $todump .= $val;
		else $todump .= "\t".$val;
		$k++;
	}
	unset($data);
	unset($allnodes);

	if($numItems % 100 == 0) {
		print "<code>_</code>"; //.memory_get_usage();
		if(fwrite($handle, $todump) === FALSE) return FALSE;
		$todump = "";
	}
	if($numItems % 10000 == 0) print("<br>");
	flush();
}

if($todump !== "") { //dump whatever is left over after the last full batch of 100
	if(fwrite($handle, $todump) === FALSE) return FALSE;
}
fclose($handle);

    I suspect that memory_get_usage excludes memory used by libraries (e.g. libxml2).

    Loading a 50 Meg file into a DOM is not something that you should do; it is likely to be very inefficient and use loads of memory.

    Have you looked at an OS memory usage tool to analyse the memory use?

    I am assuming you're running this from the PHP CLI, as it would be ridiculous to do something like this from a web server script. Consider putting in a sleep(), then looking at the memory usage with a tool appropriate to your OS.

    I think you should really be using a stream-based parser such as expat instead (I think the PHP xml_ functions use the expat stream parser).
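
    Just to give a flavour of it, a minimal SAX-style sketch with the xml_ functions could look something like this (untested; the <product> element name, the output file name and the tab-separated output are placeholders rather than anything taken from your script):

    // Stream the feed through expat via PHP's xml_* functions.
    // Only the element currently being parsed is held in memory.
    $state = array('in' => false, 'tag' => '', 'row' => array());
    $out   = fopen('output.csv', 'w'); // placeholder output file

    function start_tag($parser, $name, $attrs) {
        global $state;
        if ($name == 'product') { $state['in'] = true; $state['row'] = array(); }
        elseif ($state['in'])   { $state['tag'] = $name; }
    }
    function end_tag($parser, $name) {
        global $state, $out;
        if ($name == 'product') { // one CSV line per product
            fwrite($out, implode("\t", $state['row']) . "\n");
            $state['in'] = false;
        }
        $state['tag'] = '';
    }
    function cdata($parser, $data) {
        global $state;
        if ($state['in'] && $state['tag'] != '') {
            if (!isset($state['row'][$state['tag']])) $state['row'][$state['tag']] = '';
            $state['row'][$state['tag']] .= $data; // cdata can arrive in pieces
        }
    }

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
    xml_set_element_handler($parser, 'start_tag', 'end_tag');
    xml_set_character_data_handler($parser, 'cdata');

    $fp = fopen($feed_filename, 'r');
    while (!feof($fp)) { // feed the parser in 8KB chunks
        xml_parse($parser, fread($fp, 8192), feof($fp));
    }
    xml_parser_free($parser);
    fclose($fp);
    fclose($out);

    However big the feed gets, only the current element's text is ever in memory at once.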

    Mark


      It is a webserver script. And then again, it's not: it's a once-a-day parser called only by the server or an admin. You're right though, at the system level it hogs the RAM and httpd.

      Anyway, I found a very quick and resource-light solution, which was to load the XML into a SimpleXML object, then iterate through the major nodes with a DOMDocument (this last bit seemed essential to me, as there were 100+ irrelevant child nodes under the parent node).

      The script, mean and lean, converts the 50MB XML document to CSV (11MB final) in 4.5 secs and uses 55MB of system RAM. That'll do me for a once-a-day script!

      The code, if anyone's interested:
      ($column_names is an array of node names to collect)

      $handle = fopen($feed, "a"); // output CSV file, opened for append
      $doc = simplexml_load_file($feed_filename);
      $numItems = 0;
      $todump = "";

      foreach($doc->item_data as $item_node) {

      	// import just this item into its own small DOMDocument
      	// (a separate variable, so the SimpleXML object in $doc isn't overwritten)
      	$domnode = dom_import_simplexml($item_node);
      	$dom = new DOMDocument();
      	$item = $dom->importNode($domnode, true);

      	foreach($column_names as $column) { //load up the data for each column tag
      		$allnodes = $item->getElementsByTagName($column);
      		$i=0;

      		foreach($allnodes as $node) { //there can be >1 instance of the column tag (esp. categories path)
      			$tmp = mb_convert_encoding($node->nodeValue, "HTML-ENTITIES", "UTF-8"); //gotta do this because feeds containing hex HTML entities get mangled
      			if($i==0) $data[$column] = $tmp;
      			else $data[$column] .= "##".$tmp; //yup, there were >1 instances. Delimit with ##
      			$i++;
      		}
      	}

      	$numItems++;

      	$k=0;
      	foreach($data as $val) {
      		if( ($numItems>1) && ($k===0) ) $todump .= "\n";
      		if($k===0) $todump .= $val;
      		else $todump .= "\t".$val;
      		$k++;
      	}
      	unset($data);
      	unset($allnodes);

      	if($numItems % 100 == 0) {
      		if(fwrite($handle, $todump) === FALSE) return FALSE;
      		$todump = "";
      	}
      }
      if($todump !== "") { //dump the remaining data and return
      	if(fwrite($handle, $todump) === FALSE) return FALSE;
      }
      fclose($handle);

        Well, I'm glad it works for you. But using a stream-based parser is definitely the right way to go for large documents.

        simplexml is just another DOM-based parser, except it doesn't use a standardised DOM. This may be more efficient on memory, but it's not any better scalability-wise.

        A stream-based parser can parse an arbitrarily large file using only a fixed amount of memory. This makes it massively superior, in terms of scalability, as there is then no limit on the file size (imposed by the XML parser, anyway).

        Mark

          I think XMLReader would be a good candidate for this. The docs are not exactly clear, but I believe XMLReader uses a stream, and is probably the most efficient XML component in PHP 5 for traversing an XML document.
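
          Something along these lines would keep only one product element in memory at a time. It's a rough, untested sketch: it reuses the item_data element name and the $column_names array from the script above, and $csv_filename is just a placeholder for the output file.

          $reader = new XMLReader();
          $reader->open($feed_filename);
          $out  = fopen($csv_filename, "w"); // placeholder output file
          $rows = 0;

          while ($reader->read()) {
              if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == "item_data") {
                  // expand() returns a DOMNode for just this element, so the
                  // per-item getElementsByTagName() logic above still applies
                  $dom  = new DOMDocument();
                  $item = $dom->importNode($reader->expand(), true);

                  $fields = array();
                  foreach ($column_names as $column) {
                      $values = array();
                      foreach ($item->getElementsByTagName($column) as $node) {
                          $values[] = $node->nodeValue;
                      }
                      $fields[] = implode("##", $values); // same ## delimiter as before
                  }
                  fwrite($out, ($rows++ ? "\n" : "") . implode("\t", $fields));
              }
          }
          $reader->close();
          fclose($out);

          That way you keep the convenience of getElementsByTagName() per item without ever loading the whole 50MB tree.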
