I am attempting to parse a rather large .htm file (~5.6 MB)

Per my thread from this morning, I'm pulling all the tr records that match a condition... the result is about 3800 nodes.

When I parse a small subset (~20 nodes), the script runs just fine... when I point it at the full file, the browser spins for about, oh, five minutes or so and then just displays a white page.

I've done the boilerplate stuff:

    set_time_limit(0);
    ini_set('display_errors', 1);
    ini_set('max_execution_time', 0);
    error_reporting(E_ERROR | E_WARNING | E_PARSE | E_NOTICE);

and nothing is reported

I have 5 XPath queries and 6 preg_match statements within this loop.

I tried an ob_start() and ob_end_flush() at the top and bottom of the for loop, but that gives me nothing either.
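
For reference, a minimal sketch of the usual pattern for progressive output: the flushing has to happen inside the loop, not just a buffer wrapped around it (whether anything actually shows up still depends on server and browser buffering):

    ob_start();
    for ($i = 0; $i < $length; $i++) {
        // ... per-item work ...
        ob_flush(); // push PHP's output buffer to the web server
        flush();    // ask the server to send it on to the browser
    }
    ob_end_flush();

Anyway, here's the loop itself: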

    <?php
    for ($i = 0; $i < $length; $i++)
    {
        echo '<item>'."\n";

        #*** company name *********************
        $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']/b");
        $companyName = $elements->item($i)->nodeValue;
        //echo 'Company: '.$companyName.'<br/>';
        echo '<companyName>'.$companyName.'</companyName>'."\n";

        #*** company URL **********************
        // two <a> per item (site link + mailto link), hence $i*2
        $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']/a");
        $companyURL = $elements->item($i * 2)->nodeValue;
        //echo 'Link: '.$companyURL.'<br/>';
        echo '<companyURL>'.$companyURL.'</companyURL>'."\n";

        #*** company address ******************
        // two 49%-wide <td> per item (left/right column), hence $i*2
        $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']");
        $companyAddress = $elements->item($i * 2)->nodeValue;

        // cut the name off the front and the URL off the back of the cell text
        $nameLen = strlen(strip_tags($companyName)) + 12;
        $urlLen = strlen($companyURL);
        $addressLen = strlen($companyAddress);

        $companyAddress = substr($companyAddress, $nameLen, ($addressLen - $urlLen - $nameLen));
        // strip the two bytes of a mis-decoded UTF-8 &nbsp; (0xC2 0xA0)
        $companyAddress = trim(str_replace('Â', '', $companyAddress));
        $companyAddress = trim(str_replace(urldecode('%A0'), ' ', $companyAddress));
        //echo 'Address: '.$companyAddress.'<br/>';
        echo '<companyAddress>'.$companyAddress.'</companyAddress>'."\n";

        #*** industry *************************
        $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']/i");
        $industry = $elements->item($i)->nodeValue;
        //echo 'Industry: '.$industry.'<br/>';
        echo '<companyIndustry>'.$industry.'</companyIndustry>'."\n";

        #*** contact block ********************
        // the parent of <i> is the right-hand <td>; pull the labelled lines out of its text
        $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']/i/..");
        $contact = $elements->item($i)->nodeValue;
        $contact = str_replace('Â', '', $contact);
        $contact = str_replace(urldecode('%A0'), ' ', $contact);
        preg_match('/(Contact:)(.*)/', $contact, $contactName);
        preg_match('/(Phone:)(.*)/', $contact, $contactPhone);
        preg_match('/(Toll Free:)(.*)/', $contact, $contactToll);
        preg_match('/(Intl:)(.*)/', $contact, $contactIntl);
        preg_match('/(Fax:)(.*)/', $contact, $contactFax);
        preg_match('/(Email:)(.*)/', $contact, $contactEmail);

        $contactName = trim($contactName[2]);
        //echo 'Contact Name: '.$contactName.'<br/>';
        echo '<contactName>'.$contactName.'</contactName>'."\n";

        $contactPhone = trim($contactPhone[2]);
        //echo 'Contact Phone: '.$contactPhone.'<br/>';
        echo '<contactPhone>'.$contactPhone.'</contactPhone>'."\n";

        $contactToll = trim($contactToll[2]);
        //echo 'Toll Free: '.$contactToll.'<br/>';
        echo '<contactToll>'.$contactToll.'</contactToll>'."\n";

        $contactIntl = trim($contactIntl[2]);
        //echo 'Intl: '.$contactIntl.'<br/>';
        echo '<contactIntl>'.$contactIntl.'</contactIntl>'."\n";

        $contactFax = trim($contactFax[2]);
        //echo 'Fax: '.$contactFax.'<br/>';
        echo '<contactFax>'.$contactFax.'</contactFax>'."\n";

        $contactEmail = trim($contactEmail[2]);
        //echo 'Email: '.$contactEmail.'<br/>';
        echo '<contactEmail>'.$contactEmail.'</contactEmail>'."\n";

        echo '</item>'."\n";
    }
    ?>
    

      You might want to consider using a SAX or pull parser rather than DOM if file size becomes an issue. Fast and memory efficient, but quite a lot less flexible.
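
      A rough sketch of what the pull approach looks like with PHP's XMLReader, assuming a well-formed source (the file name and element test here are just placeholders):

          // Pull parsing: the reader walks the document one node at a time,
          // so memory use stays flat no matter how big the file is.
          $reader = new XMLReader();
          $reader->open('companies.xml'); // hypothetical file name

          while ($reader->read()) {
              // react only to the elements you care about
              if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'td') {
                  echo $reader->readString(), "\n"; // text content of the current subtree
              }
          }
          $reader->close();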

        Shrike wrote:

        You might want to consider using a SAX or pull parser rather than DOM if file size becomes an issue. Fast and memory efficient, but quite a lot less flexible.

        My source is HTML, not XML... I was attempting to convert HTML -> XML.

        I ended up getting it to work... some of my XPath lookups returned 'unknown method' errors when there was no node value; I used @ to get around it. In the end it took about six hours to process, but it did work.

        Undoubtedly it could be streamlined... I'll have to toy with it to see if I can boost the speed. I can probably get rid of the preg_match calls in favor of some different string manipulation, which may cost less.
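
        For reference, the @ workaround presumably looked something like the first line below; an explicit null check on item() avoids suppressing unrelated errors (sketch, using names from the loop above):

            // The @ approach: silence the error when item($i) returns null.
            $companyName = @$elements->item($i)->nodeValue;

            // Safer: check for the missing node explicitly.
            $node = $elements->item($i);
            $companyName = ($node !== null) ? $node->nodeValue : '';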

          There's no real semantic difference between HTML and XML; they are both just subsets of SGML. With the exception of HTML/Transitional allowances (e.g. "empty" tags, processing instructions), you should be able to parse HTML with any XML parser.
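
          In PHP the practical route is libxml's HTML mode, which tolerates non-XML bits like an unclosed <br>; a minimal sketch (the file name is hypothetical):

              // libxml's HTML parser accepts tag soup; suppress the
              // validity warnings it would otherwise raise for old markup.
              libxml_use_internal_errors(true);

              $doc = new DOMDocument();
              $doc->loadHTMLFile('companies.htm'); // hypothetical file name

              $xpath = new DOMXPath($doc); // then query as usual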

            <tr>
            <td valign="top" width="49%">
            <b>COMPANY NAME</b><br>
            ADDRESS 1<br>
            
            ADDRESS 2<br>
            
            ADDRESS 3<br>
            <a href="http://www.somedomain.com" target="_blank">www.somedomain.com</a><br>
            </td>
            <td width="2%">&nbsp;</td>
            <td valign="top" width="49%">
            <i>INDUSTRY</i><br>
            Contact: NAME<br>
            
            Phone: NUMBER<br>
            Toll Free: NUMBER<br>
            Intl: NUMBER<br>
            Fax: NUMBER<br>
            Email: <a href="mailto:email@somedomain.com">email@somedomain.com</a><br>
            
            </td>
            </tr>
            

            That's one iteration of about 4000 or so... I'd love to see an alternate implementation 🙂

            Obviously, just about any of the pieces of data can be missing.

              You are right, it might be a tad hard with XMLReader, as it's not really aware of anything other than the current "node".

              A possible improvement to your DOM-based solution might be in how you are fetching each element. Assuming a document has 100 items, this XPath query:

              $elements = $xpath->query("//table[@width='580']/tr/td[@width='49%']/b");
              

              would select 100 <b> nodes. The problem, as I see it, is that you then use only one node out of that 100 (the one referenced by $i), and the query is repeated for every $i from 0 to $length. It would make more sense to run each XPath query once, process all 100 of its nodes, and then move on to the next query (i.e. do away with the re-querying inside the loop), as sketched below.
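
              Untested sketch of that restructuring, reusing the queries from your loop (each query now runs once, so the work is linear instead of quadratic):

                  // Run each XPath query once, up front, and cache the NodeLists.
                  $names      = $xpath->query("//table[@width='580']/tr/td[@width='49%']/b");
                  $links      = $xpath->query("//table[@width='580']/tr/td[@width='49%']/a");
                  $industries = $xpath->query("//table[@width='580']/tr/td[@width='49%']/i");

                  for ($i = 0; $i < $names->length; $i++) {
                      $name     = $names->item($i)->nodeValue;
                      $url      = $links->item($i * 2)->nodeValue; // two <a> per item (site + mailto)
                      $industry = $industries->item($i)->nodeValue;
                      // ... build the <item> element as before ...
                  }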

                Forgive the cliché, but "oh snap", you are right... I'm loading every node that matches the XPath on every single iteration! No wonder it's so fast with a small subset and took six hours on the full file; the work grows quadratically with the node count.

                I guess I could run each XPath once into an array, then just step through them and build my XML at the end.

                  Just an update:

                  I moved the XPath queries outside the for loop, and the processing time is now, oh, 25 seconds...

                  Thanks for the advice... I can't believe I made such an obvious error.
