My latest struggle is attempting to parse a HUGE XML file (about 10MB). Using PHP, it usually times out or runs out of memory just opening the file. Heck, I can't even open it in a text editor without a huge wait. I was wondering if you had any advice on what language or method I could use to do this. Basically, I need to be able to make quick searches of this XML info and retrieve the matched records. The file changes often and I need to retrieve it from an FTP server every few days, so converting it into a MySQL database didn't seem practical (maybe I'm wrong). Ideally, this all has to happen on my web server (I think).

Maybe PHP is not the right solution here. If not, another server-side solution?

Any ideas? I'm not necessarily looking for code; just point me in the right direction (like "xyz scripting language can parse through that XML data fast..." or "this PHP function will do it...").

I've found xquery Lite, but without downloading it, installing it, learning it and running it, I can't be sure it can handle a 10MB XML file.

    If the XML format is that simple (simple enough to consider converting it into a database), then that may well be the way to go. A cron job that retrieves the XML, parses it and refreshes the database on a regular basis would avoid needing to mess with the whole 10MB every time you want to use the data - just the once, when you do an update. If the PHP interpreter is run from the command line (which would be the ideal method when specifying the cron job), it ignores the configuration limits on runtime and memory consumption. Ideally, when scheduling the update, pick a time that is fairly quiet, so that the server can devote as many resources to the update as possible without affecting other tasks.
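    To make the cron idea concrete, here is a sketch of a crontab entry. The schedule, the path to the PHP binary, the script path and the log file are all assumptions; adjust them to your host.

    ```shell
    # Hypothetical crontab entry: run the fetch-and-import script at 3:30 am
    # every second day, a typically quiet time. Paths are placeholders.
    30 3 */2 * * /usr/bin/php /home/user/scripts/update_feed.php >> /home/user/logs/feed_update.log 2>&1
    ```

    Redirecting stdout and stderr to a log file means you can check afterwards whether the import ran cleanly, without cron mailing you the output.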

    On the subject, I found a post through Google by someone who had faced a similar situation to yours:
    http://archives.postgresql.org/pgsql-novice/2005-03/msg00178.php
    Which suggests that reading the XML line by line and updating the database as you read would be better than reading the file as a whole and then parsing it ([man]xml_parse[/man] can work this way, so you shouldn't have to write your own XML parser).
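    A minimal sketch of that chunk-at-a-time approach with [man]xml_parse[/man] might look like this. The function name, the file name and the <record> element are assumptions for illustration; a real script would replace the counting with database writes.

    ```php
    <?php
    // Sketch: stream-parse a large XML file in 8 KB chunks with the SAX-style
    // xml_parse() API, so the whole 10MB document never has to fit in memory.
    // count_records() and the <record> element name are assumptions.

    function count_records(string $path): int
    {
        $records = 0;
        $parser  = xml_parser_create();

        xml_set_element_handler(
            $parser,
            function ($parser, $name, $attrs) use (&$records) {
                if ($name === 'RECORD') {  // element names are uppercased by default
                    $records++;            // a real script would write to the DB here
                }
            },
            function ($parser, $name) {}   // end-element handler (unused here)
        );

        $fp = fopen($path, 'rb');
        while (!feof($fp)) {
            $chunk = fread($fp, 8192);     // small buffer keeps memory use flat
            if (!xml_parse($parser, $chunk, feof($fp))) {
                trigger_error(sprintf(
                    'XML error: %s at line %d',
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)
                ), E_USER_ERROR);
            }
        }
        fclose($fp);
        xml_parser_free($parser);

        return $records;
    }
    ```

    Because only one 8 KB buffer is in memory at a time, this stays well under any sane memory_limit no matter how large the file grows.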

      Well, if you're not already running a cron job, you should set one up to run every couple of days. And separate the jobs: have one download the file and another read the XML file. My suggestion is to store the data in a database once you have read the XML document. How you do that is up to you, since I'm not sure what the XML data looks like or how much it changes, but you may wish to check whether an XML record already exists in the database before storing the record.
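      A sketch of that "check before storing" step, using PDO. SQLite is used here only so the sketch is self-contained; the same two statements work against MySQL. The function name, table name and columns are assumptions for illustration.

      ```php
      <?php
      // Sketch: insert a record only if its external id is new, otherwise
      // update it in place. Table/column names are placeholders.

      function store_record(PDO $pdo, string $externalId, string $title): void
      {
          $check = $pdo->prepare('SELECT 1 FROM records WHERE external_id = ?');
          $check->execute([$externalId]);

          if ($check->fetchColumn() === false) {
              // not seen before: insert a new row
              $insert = $pdo->prepare(
                  'INSERT INTO records (external_id, title) VALUES (?, ?)'
              );
              $insert->execute([$externalId, $title]);
          } else {
              // already there: refresh the stored data
              $update = $pdo->prepare(
                  'UPDATE records SET title = ? WHERE external_id = ?'
              );
              $update->execute([$title, $externalId]);
          }
      }
      ```

      On MySQL you could also fold the check and the write into a single INSERT ... ON DUPLICATE KEY UPDATE statement, provided external_id has a unique index.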

      I unfortunately don't have code, but hopefully you can interpret some logic from what I've said to develop a solution.

      [edit]Appears weed beat me to the punch[/edit]

        Thanks to you both. I haven't tried to read the XML file line by line, and xml_parse is an unfamiliar function to me. (I did look at it, but I tried file() and similar functions first.)

        I did already have a separate script to download the file for me. So thanks for confirming that strategy.

        One thing I don't understand is how to run PHP from the command line on the web server. I know the manual has a section on this. Hopefully, it's fairly self-evident and my lame webhost allows this.
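        For what it's worth, a minimal sketch of what "running PHP from the command line" looks like over SSH. The script path is a placeholder, and whether your host gives you shell access or a CLI PHP binary at all is an assumption you'd need to confirm.

        ```shell
        # Confirm a command-line PHP interpreter is available and see its version
        php -v

        # Check what memory limit the CLI's own php.ini applies
        php -i | grep memory_limit

        # Run a script directly, outside the web server entirely
        php /home/user/scripts/update_feed.php
        ```

        On the CLI, max_execution_time defaults to 0 (no limit), which is why the earlier advice about cron jobs avoids the timeout problem.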

        Thanks again.
