Here's what it needs to do:
1) Open the large (900 MB+) RDF file
2) Read a small part of it (a buffer)
3) Look for complete containers (everything from <ExternalPage about="http://someurl.com/whatever"> SOME TAGS IN BETWEEN </ExternalPage> is one complete container)
4) Parse one container at a time and put that information into a MySQL database
5) Continue reading the file from the end of the last complete container
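A minimal sketch of steps 1 to 5, assuming the file is read in fixed-size chunks with fread() and that containers never nest. The names extractContainers() and eachContainer() are made up for illustration, and the callback stands in for the "parse and insert" step. The idea is to append each chunk to a buffer, cut out every complete container, and carry the unfinished tail over to the next read:

```php
<?php
// Hypothetical helper: pulls every complete <ExternalPage ...>...</ExternalPage>
// container out of $buffer and returns them. $buffer is passed by reference and
// is left holding only the unparsed tail (a partial container, if any).
function extractContainers(&$buffer)
{
    $open  = '<ExternalPage';
    $close = '</ExternalPage>';
    $containers = array();
    $offset = 0;

    while (true) {
        $start = strpos($buffer, $open, $offset);
        if ($start === false) {
            // No further opening tag: keep only a few trailing bytes, in case
            // "<ExternalPage" itself is split across two chunks.
            $offset = max($offset, strlen($buffer) - (strlen($open) - 1));
            break;
        }
        $end = strpos($buffer, $close, $start);
        if ($end === false) {
            // Incomplete container: keep it for the next read.
            $offset = $start;
            break;
        }
        $end += strlen($close);
        $containers[] = substr($buffer, $start, $end - $start);
        $offset = $end; // the "mark" right after the last complete container
    }

    $buffer = (string) substr($buffer, $offset);
    return $containers;
}

// Hypothetical driver: reads $fp chunk by chunk and hands each complete
// container to $callback exactly once.
function eachContainer($fp, $callback, $chunkSize = 8192)
{
    $buffer = '';
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $buffer .= $chunk;
        foreach (extractContainers($buffer) as $container) {
            call_user_func($callback, $container);
        }
    }
}
```

With something like this, the callback would receive one whole <ExternalPage> container at a time, and the buffer only ever grows to roughly one chunk plus one unfinished container rather than the whole file.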
My problems:
I don't know how to get the URL from:
<ExternalPage about="http://someurl.com/whatever">
because if I use:
$start= strpos($buffer, "<ExternalPage about=\"");
$finish= strpos($buffer, "\">");
$length= $finish-$start;
$url=Substr($buffer, $start, $length);
echo "url= $url<BR>";
It reads the whole buffer until it finds a "> (in other words, it doesn't stop inside the <ExternalPage> tag).
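One possible way around this, as a sketch: strpos() takes an optional third argument, the offset at which the search starts. Searching for the closing quote only from just after the opening marker keeps the match inside the tag. The function name externalPageUrl() is made up for illustration, and it assumes the about attribute always comes first and uses double quotes:

```php
<?php
// Hypothetical helper: returns the value of the about="..." attribute of the
// first <ExternalPage> tag in $buffer, or null if no complete tag is found.
function externalPageUrl($buffer)
{
    $marker = '<ExternalPage about="';
    $start = strpos($buffer, $marker);
    if ($start === false) {
        return null;
    }
    $start += strlen($marker);              // skip past the marker itself
    $finish = strpos($buffer, '"', $start); // first quote AFTER the marker
    if ($finish === false) {
        return null;
    }
    return substr($buffer, $start, $finish - $start);
}
```

Because $start is moved past the marker before substr() is called, the result contains only the URL and not the tag name in front of it.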
My second problem:
How would I get it to:
1) Look for and parse only complete containers, then mark the end of the last complete one in the buffer (just after the last </ExternalPage>) and continue reading from right after that mark, so that it parses ALL the <ExternalPage> tags but never picks up partial information? Here is what I have so far (I know it doesn't mark and trace, so if someone could help...):
<? $dbhost='localhost'; $dbname='---------'; $dbport=''; $dbuname='------'; $dbpass='-----'; ?>
<?
mysql_connect($dbhost . ":" .$dbport, $dbuname, $dbpass);
@mysql_select_db("$dbname") or die ("Unable to select database.");
?>
<?
mysql_query("CREATE TABLE links (id BIGINT not null AUTO_INCREMENT, title TEXT not null , url TEXT not null , description TEXT not null , PRIMARY KEY (id))");
$fp = fopen ("content.rdf.u8.ascii", "r");
while (!feof ($fp)){
$fd= fgets($fp, 1024);
$start= strpos($fd, "<ExternalPage");
$finish= strpos($fd, "</ExternalPage>");
$length= $finish-$start;
$buffer=Substr($fd, $start, $length);
if ($buffer) { // a "while" here would loop forever, since $buffer never changes inside the block
$start= strpos($buffer, "<d:Title>");
$finish= strpos($buffer, "</d:Title>");
$length= $finish-$start;
$title=Substr($buffer, $start, $length);
echo "Title= $title<BR>";
$start= strpos($buffer, "<d:Description>");
$finish= strpos($buffer, "</d:Description>");
$length= $finish-$start;
$description=Substr($buffer, $start, $length);
echo "Description= $description<BR>";
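// Double-quoted SQL so the values are actually inserted (single quotes do not expand variables), and escaped so quotes in the text cannot break the query.
mysql_query("INSERT INTO links VALUES('', '" . mysql_real_escape_string($title) . "', '', '" . mysql_real_escape_string($description) . "')");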
}
}
fclose($fp);
?>
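A side note on the title and description extraction: substr() starting at the position of the opening tag returns the tag itself along with the text. A small sketch of a helper that skips past the opening tag first (tagText() is a made-up name, and it assumes the tags carry no attributes, as with <d:Title> and <d:Description>):

```php
<?php
// Hypothetical helper: returns the text between <$tag> and </$tag> inside
// $container, or null when either tag is missing.
function tagText($container, $tag)
{
    $open  = '<' . $tag . '>';
    $close = '</' . $tag . '>';

    $start = strpos($container, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);                      // begin after the opening tag
    $finish = strpos($container, $close, $start); // closing tag after that point
    if ($finish === false) {
        return null;
    }
    return trim(substr($container, $start, $finish - $start));
}
```

It could be called as tagText($buffer, 'd:Title') and tagText($buffer, 'd:Description') once $buffer holds one complete container.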
Thanks, I really would like to get this working, and I just don't know how to solve those problems... I'm new to advanced PHP (I mostly just use echo, if, mysql_query, and while). It will be cool to have written a dmoz parser; I can distribute the database and the parser to everyone!