I have a complex web page that I'm trying to get info from--and for anyone who is concerned, yes, I have the right to access the info. I'm a librarian and it's a page from our catalog.

Here's my code ($result is the output from the cURL):

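$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;   // must be set before the document is loaded
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML off the page
$doc->loadHTML($result);
$tables = $doc->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row) {
    /*** get each column by tag name ***/
    $cols = $row->getElementsByTagName('td');
    /*** echo the values (rows made only of <th> cells have no <td>) ***/
    if ($cols->length > 0) {
        echo $cols->item(0)->nodeValue;
    }
    echo '<hr />';
}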

My page has a huge header containing table markup, which gets picked up too. The whole page uses a table layout.

<table>[I]This is the last table that gets picked up by the above code[/I]
<tr> <td> <!-- container cell -->
<form> <input> <table>[B]THIS IS THE TABLE I WANT TO SCRAPE [/B]

Basically, there is this one table that I want to scrape out of the entire page. How do I get that table only? Am I on the right track?

thanks

spivey

    Your question is really more about how DOMDocument works... and that I don't know.

    But I do have two alternate ways to do this (unless you absolutely must use DOMDocument).

    Use a regular expression on $result which looks for:

    <table> some stuff <input some stuff <table> interesting stuff </table> some stuff </table>

    preg_match('/<table[^>]*>.+?<input.+?<table[^>]*>(.+?)<\/table>.+?<\/table>/s', $result, $regs);

    Now $regs[1] will contain the body of the table you are interested in. You can use preg_match again to find the <td> tags inside of it to find your content.
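
    As a rough illustration of that two-step approach, here is a minimal sketch. The $result string is a made-up sample, not the real catalog page, and the pattern uses lazy quantifiers (.+?) and [^>]* so that tags with attributes still match; treat it as a starting point rather than a robust parser.

    ```php
    <?php
    // Hypothetical sample standing in for the cURL output in $result.
    $result = '<table><tr><td><form><input type="hidden" name="x">'
            . '<table border="0"><tr><td>First</td><td> Second </td></tr></table>'
            . '</form></td></tr></table>';

    // outer table ... input ... inner table (captured) ... close of outer table
    $pattern = '/<table[^>]*>.+?<input.+?<table[^>]*>(.+?)<\/table>.+?<\/table>/s';

    $cells = [];
    if (preg_match($pattern, $result, $regs)) {
        // $regs[1] holds the body of the inner table; collect each <td>.
        preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $regs[1], $m);
        foreach ($m[1] as $cell) {
            $cells[] = trim(strip_tags($cell));
        }
    }

    print_r($cells);
    ```

    If the page has other inputs or tables ahead of the one you want, the pattern will need a more specific landmark, such as a class name or a form name.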

    But here's the thing... scraping is usually reserved as a last resort for when the only place you can get the data is from someone else's web site. It's slow and unnecessary. Since the content is your own catalog and is constructed with an include command, you can use that same include method to obtain the data here whether or not the data is on the same server. This is a normal case for reuse of data and even if it takes you a little extra time to get it working, it's really worth it because then you will have mastered the art of multi-purposing your content. There's no sense in having one of your servers read in the content, wrap it up in table tags, have another server read the result, and then strip out the content from the table tags. If you skip all those steps, the system will be fast and efficient and done right.

      Interesting thoughts on our catalog! How would that include method work? That sounds like a much better approach!

      Thanks

      spivey

        The actual implementation is going to depend on a half dozen factors. The people who set up the catalog will be able to tell you how the relevant portion of content is acquired... and you must acquire it the same way.

        For example, let's say that the two sites are on the same server and the catalog is constructed like this:

        display header
        display table wrapper
        open database, read content from table where ID = ####
        display content
        display close of table
        display footer

        If that's the case, then just re-use those two lines (read content / display content) in your script. You're getting the raw content instead of getting the formatted content.

        But maybe the catalog is on a different server. Not a problem. The catalog gets its content from somewhere. So think of the catalog web pages as content displayers. They read the content, add some format to it, and display it. What you're going to do on the other server with the catalog is build a new displayer that simply doesn't add the formatting.

        For example, imagine on the catalog server you have a script called fancy_display.php whose job is to read some content and display it wrapped in headers and tables and whatnot. What you're going to do on that server is make a new script called plain_display.php that does exactly the same job as fancy_display except that it leaves out the parts about headers and tables. So now, instead of scraping from a page called fancy_display, you do the job the same way you were trying to do it earlier (using cURL) except you "scrape" from plain_display instead and you can skip the steps about parsing the content out of $result. In this new model, $result IS the content itself. (No real users would ever see plain_display.php... this is just a script you've made for internal use only for transporting raw, unformatted content from one server to another.) And I put "scrape" in quotes because it's not really scraping when the content you're gathering from the other server is exactly what you need and doesn't need to be teased out of a messy $result string.
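
        To make the fancy/plain split concrete, here is a minimal sketch. Everything in it is hypothetical: read_content() stands in for however the catalog really fetches its data, and the function names only mirror the made-up script names above.

        ```php
        <?php
        // Hypothetical stand-in for the catalog's real data access.
        function read_content(int $id): string
        {
            $items = [1 => 'Example item'];   // placeholder data
            return $items[$id] ?? '';
        }

        // What fancy_display.php does: read the content, then wrap it in markup.
        function fancy_display(int $id): string
        {
            return '<table><tr><td>'
                 . htmlspecialchars(read_content($id))
                 . '</td></tr></table>';
        }

        // What plain_display.php does: the same read, with no wrapping at all.
        function plain_display(int $id): string
        {
            return read_content($id);
        }

        echo plain_display(1), "\n";
        ```

        With a script like this on the catalog server, the cURL call on the other server stays exactly as it is, but $result comes back as the bare content and needs no parsing.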

        Yet another way to do it would be to make a database user on the old catalog server that is allowed to access the database from the new server. That is, on your new site, instead of using cURL to get the formatted page from the catalog server, you will use a database command to connect from new server -> catalog server and acquire the data.

        This message is filled with tons of speculation because I don't really know how your catalog was set up, how many servers you are using, if the content exists in a database or in static text files, or which database you are using, or even which languages, operating system(s), or display tools you might be using. The general idea is to develop a more direct approach for getting the content. Exactly how you do that will depend on what you already have built for the catalog.

          The catalog is an elaborate modular construct of C, JavaScript, HTML, and maybe some other programming languages. Teasing out which module delivers which content is tricky business. Fortunately, I'm not the systems admin, whom I will consult. But if scraping is how I must do it, you've given me an easy way to strip out a single table. I will give it a shot.

          Thanks

          spivey

            There are all of these damn <table statements in the header, written in JS, and my script doesn't seem to be getting out of the header. Is there any way to start this at the <body> element?
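
            For what it's worth, one possible way to do this with DOMDocument is to grab the <body> element first and search for tables only inside it. A minimal sketch, where the $result string is a made-up sample rather than the real page:

            ```php
            <?php
            // Hypothetical sample standing in for the cURL output in $result.
            $result = '<html><head><title>Catalog</title></head>'
                    . '<body><table><tr><td>Body cell</td></tr></table></body></html>';

            libxml_use_internal_errors(true);   // silence warnings about sloppy HTML
            $dom = new DOMDocument();
            $dom->loadHTML($result);

            // Limit the search to descendants of <body>.
            $body   = $dom->getElementsByTagName('body')->item(0);
            $tables = $body->getElementsByTagName('table');

            $text = trim($tables->item(0)->textContent);
            echo $tables->length, ' table(s), first one reads: ', $text, "\n";
            ```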

              I got somewhere with DOM and it's working. But, there are several input statements that I am trying to include. They are check boxes on the original form, especially the "input" statement for each item. Look at the input before the label that reads "Spooky Sooga Village".

              <table border="0" cellpadding="3" cellspacing="0" align="center">
              
              <tr>
                  <td class="header" colspan="3">Select Items to Renew</td>
              </tr>
              <tr>
                      <td class="subheader" colspan="3">
              	  <strong>7</strong> items eligible for renewal. Use check boxes below to mark list items for Renew.
                      </td>
              </tr>
              <tr>
              	<td class="defaultstyle" align="center" colspan="3">
              	    <input type="radio" name="selection_type" id="renew_selected" value="selected" checked="checked">&nbsp;<label for="renew_selected">Renew Selected Items</label>&nbsp;&nbsp;&nbsp;&nbsp;
              	    <input type="radio" name="selection_type" id="renew_all" value="all">&nbsp;<label for="renew_all">Renew all</label>	 
              	</td>
              </tr>
              <tr>
              	<td class="itemlisting2">
                          <input type="checkbox" name="RENEW^IPPL000838636.^JD PUCCA : SPOOKY^1^^Spooky Sooga Village [DVD]^" id="RENEW1">
                      </td>
              	<td class="itemlisting2">
              	    <label for="RENEW1">
              		  <!-- Print the title, if it exists -->
                        Spooky Sooga Village [DVD]&nbsp;&nbsp;
              
              	  </label>
              	 </td>
              
              	 <td class="itemlisting2" align="left">
              	 Due:
                   <!-- Print the date due -->
                    <strong>6/30/2009,23:59</strong><br>
              
                  </td>
                 </tr>

              The "label" appears in my output, which is nice, but I want the check box, the full input statement. I tried adding another getElementsByTagName call for "input" and working that array into another foreach statement, but I get a white screen. Also, I don't know why it isn't processing the input statement when it grabbed the label for that same input.

              Here's my dom code, which works for everything but the input statement.

              $dom = new DOMDocument();
              $dom->loadHTML($result);

              // find all tables
              $tables = $dom->getElementsByTagName('table');

              // get all rows from the table at index 8 (zero-based, so the ninth table)
              $rows = $tables->item(8)->getElementsByTagName('tr');

              foreach ($rows as $row) {
                  echo "<fieldset>";
                  $cols = $row->getElementsByTagName('td');
                  // echoing some values, etc.
              }
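
              A likely reason the label shows up while the check box does not: an <input> element has no text inside it, so its nodeValue is an empty string. The minimal sketch below reads the attributes instead and uses saveHTML() on the node to get the full tag back. The $result string is a shortened, made-up version of the markup above.

              ```php
              <?php
              // Hypothetical, shortened stand-in for the cURL output in $result.
              $result = '<html><body><table><tr>'
                      . '<td class="itemlisting2"><input type="checkbox" name="RENEW^1" id="RENEW1"></td>'
                      . '<td class="itemlisting2"><label for="RENEW1">Spooky Sooga Village [DVD]</label></td>'
                      . '</tr></table></body></html>';

              libxml_use_internal_errors(true);   // silence warnings about sloppy HTML
              $dom = new DOMDocument();
              $dom->loadHTML($result);

              $inputs = [];
              foreach ($dom->getElementsByTagName('tr') as $row) {
                  foreach ($row->getElementsByTagName('input') as $input) {
                      $inputs[] = [
                          'name' => $input->getAttribute('name'),   // attribute value
                          'html' => $dom->saveHTML($input),         // the tag itself
                      ];
                  }
              }

              foreach ($inputs as $item) {
                  echo $item['html'], "\n";
              }
              ```

              A white screen usually means a fatal error with error display turned off, so switching on error reporting while testing should show what the extra foreach is tripping over.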
              