I am trying to build a database of items that already exist on HTML pages. The issue is fairly simple: I can access an item on a page and get all of its data, but parsing out the correct values is not easy.

The problem I have with DOMX is that the data sits in a generic div inside a named div. I can of course grab the named div's data and parse its elements without any problem. Here is an example:

<html>
<body>
<div id="page">
 <div id="header">BLA BLA</div>
 <div id="body">
  <div id="item1">
   <div class="spacer">
    <div class="value">DATA I WANT</div>
   </div>
  </div>
  <div id="item2">
   <div class="spacer">
    <div class="value">DATA I WANT</div>
   </div>
  </div>
  <div id="item3">
   <div class="spacer">
    <div class="value">DATA I DONT WANT</div>
   </div>
  </div>
  <div id="item4">
   <div class="spacer">
    <div class="value">DATA I WANT</div>
   </div>
  </div>
</div>
</div>
</body>
</html>

I can of course get a named div, but when all the "data" divs are unnamed and only share the class "value", that approach does not work well.

Thanks,
Chris

    Not sure about DOMX (whatever that is), but with the DOM extension you should be able to do a getElementById(), and on the resulting object do a getElementsByTagName(), grab the first (and in this case probably only) resulting object and do another getElementsByTagName() on it, and get that first (and presumably only) result.

      Thanks, I think I am just missing the syntax.

      I have this so far, but it returns all of the child data, i.e. blank lines as well as the data section.

      $elements = $dom_xpath->query('//*[contains(@id, \'item1\')]');

      if (!is_null($elements)) {
          foreach ($elements as $element) {
              echo "\n[" . $element->nodeName . "]";

              $nodes = $element->childNodes;
              foreach ($nodes as $node) {
                  echo $node->nodeValue . "<br />";
              }
          }
      }
      

      So the output is:

      Blank Line
      Blank Line
      Data I Want

      I am not seeing how I can first select the element by its div id and then pull only the class="value" div from it.

      Thanks

        With no defensive coding:

        <?php
        
        $text = <<<EOD
        <html>
        <body>
        <div id="page">
         <div id="header">BLA BLA</div>
         <div id="body">
          <div id="item1">
           <div class="spacer">
            <div class="value">DATA I WANT</div>
           </div>
          </div>
          <div id="item2">
           <div class="spacer">
            <div class="value">DATA I WANT</div>
           </div>
          </div>
          <div id="item3">
           <div class="spacer">
            <div class="value">DATA I DONT WANT</div>
           </div>
          </div>
          <div id="item4">
           <div class="spacer">
            <div class="value">DATA I WANT</div>
           </div>
          </div>
        </div>
        </div>
        </body>
        </html>
        EOD;
        
        $dom = new DOMDocument();
        $dom->loadHTML($text);
        $value = $dom->getElementById('item1')->getElementsByTagName('div')->item(0)
                 ->getElementsByTagName('div')->item(0)->textContent;
        var_dump($value);
        

        Instead of chaining all those methods together, you could break them up and assign results to separate variables, and test each result along the way -- especially if the source HTML is unreliable.
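        As a rough sketch of that step-by-step version: the helper name findValueById and the sample markup are made up for illustration, and it assumes the same item/spacer/value nesting as the example above. Each lookup is checked before the next one runs, so a missing element yields null instead of a fatal error.

```php
<?php
// Hypothetical helper (not part of the DOM extension): follows the same
// id -> first div -> first div path as the chained version, but stops
// and returns null as soon as one step finds nothing.
function findValueById(DOMDocument $dom, string $id): ?string
{
    $item = $dom->getElementById($id);
    if ($item === null) {
        return null; // no element with that id
    }

    $spacer = $item->getElementsByTagName('div')->item(0);
    if ($spacer === null) {
        return null; // the item contains no div at all
    }

    $value = $spacer->getElementsByTagName('div')->item(0);
    if ($value === null) {
        return null; // the first div has no nested div
    }

    return trim($value->textContent);
}

// Shortened sample markup, assumed to follow the structure in the question.
$html = '<html><body><div id="item1"><div class="spacer">'
      . '<div class="value">DATA I WANT</div></div></div></body></html>';

$dom = new DOMDocument();
// Collect libxml warnings internally, since scraped HTML is often malformed.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

var_dump(findValueById($dom, 'item1')); // the text of the value div
var_dump(findValueById($dom, 'item9')); // null, no such id
```

        The libxml_use_internal_errors() call is optional; it only keeps parser warnings about sloppy markup out of the output.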

          I'd probably use either of these two approaches, where the first one is pretty much the XPath equivalent of NogDog's method chaining. I simply find the XPath easier to read. The second one uses getElementById to get a context node for a shorter XPath.

          $d = new DOMDocument();
          $d->loadHTML($html);
          $xp = new DOMXPath($d);
          
          $ids = array('item1', 'item2', 'item4', 'item5');
          foreach ($ids as $id)
          {
          	$q = sprintf('/html/body/div[@id="page"]/div[@id="body"]/div[@id="%s"]/div/div[@class="value"]',
          		$id
          	);
          	$nl = $xp->query($q);
          	if ($nl->length)
          	{
          		printf('Found "%s"<br>', $nl->item(0)->nodeValue);
          	}
          	else
          	{
          		printf('Failed matching for id="%s"<br>', $id);
          	}
          }
          
          $q = 'div/div[@class="value"]';
          foreach ($ids as $id)
          {
          	if ($el = $d->getElementById($id))
          	{
          		$nl = $xp->query($q, $el);
          		if ($nl->length)
          		{
          			printf('Found "%s"<br>', $nl->item(0)->nodeValue);
          		}
          		else
          		{
          			printf('No value for id="%s"<br>', $id);
          		}
          	}
          	else
          	{
          		printf('No element with id="%s"<br>', $id);
          	}
          }
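          One caveat with [@class="value"]: it compares the whole attribute string, so it stops matching if the real pages put more than one class on the div. A possible workaround, sketched below, is the usual XPath 1.0 token test. The extra "highlight" class and the shortened markup are invented for the example.

```php
<?php
// Sample markup with a second, made-up class on the value div.
$html = '<html><body><div id="item1"><div class="spacer">'
      . '<div class="value highlight">DATA I WANT</div></div></div></body></html>';

$d = new DOMDocument();
$d->loadHTML($html);
$xp = new DOMXPath($d);

$el = $d->getElementById('item1');

// Exact comparison: no match, because the attribute is "value highlight".
$exact = $xp->query('div/div[@class="value"]', $el);

// Token match: pad the normalized class list with spaces, then look for " value ".
$token = $xp->query(
    'div/div[contains(concat(" ", normalize-space(@class), " "), " value ")]',
    $el
);

printf("exact: %d, token: %d\n", $exact->length, $token->length);
// prints "exact: 0, token: 1"
```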
          

            Thanks a ton, I am seeing the basic chain with objects now. Not sure why I didn't catch on to this after years of Domino programming 🙂
