Hi,

I'm writing a script that will parse an HTML file containing a table of data and enter the data into a MySQL database.

I'm trying to parse all the data in the cells '<td></td>' using the preg_match_all function.

 // match all <td> </td> tags and get the data
 preg_match_all("|<[td>]+(.*)</[td>]+>|U", $content, $regs);

 $datacnt = count($regs[0]);
 $databasedata="";	//combine all the data

 echo "\n";

 for ($i=0; $i<count($regs[0]); $i++) {

  // insert the data into the database
  if ($i%16 ==0) { 
    echo "\n";
    if ($rowcount < $codeopen-1) {
    	echo $databasedata . "\n\n";
    }
    if ($rowcount > 1) --$rowcount;
    $databasedata="";
  }

  // remove all html tags and decode html entities
  $data = html_entity_decode(strip_tags($regs[1][$i]));

The problem I have is that I can't work out the correct pattern for the preg_match_all function that will pick up both '<td>' and '<td align="left">'. 😕

The code as it stands only picks up '<td>'.

I'm not that familiar with the patterns.

Any ideas?
Thanks 😕

    Try this:

    <td[^>]*>

    The [^>]* part will match anything that isn't a greater-than sign, zero or more times.

    By the way, you may want to use the "s" modifier, so matches can cross newlines.
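
    For what it's worth, here is a minimal sketch of that pattern in use. The sample markup in $content is made up for illustration, and the U modifier is swapped for a lazy (.*?) group, which is just one way to keep the match ungreedy.

```php
<?php
// Made-up sample markup; in the real script $content holds the HTML file.
$content = "<table><tr><td>one</td><td align=\"left\">two\nlines</td></tr></table>";

// [^>]* skips any attributes inside the opening tag.
// (.*?) lazily captures the cell body; the "s" modifier lets "." match
// newlines, and "i" also accepts <TD>.
preg_match_all('|<td[^>]*>(.*?)</td>|si', $content, $regs);

// Strip nested tags and decode entities, as the original script does.
$cells = array_map(
    function ($raw) {
        return html_entity_decode(strip_tags($raw));
    },
    $regs[1]
);

print_r($cells);
```

    Both the plain <td> cell and the <td align="left"> cell end up in $cells, including the one whose content spans two lines.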

      Thanks,

      I've tried that and it still isn't picking up any of the <td align="left"> data.

        Originally posted by robinstott
        Thanks,

        I've tried that and it still isn't picking up any of the <td align="left"> data.

        It should (at least, it should pick up the stuff between <td align="left"> and </td>); how are you using it?

          Basically I have an HTML form with a table 16 cells wide and an undefined number of rows.

          The code below is run:-

          // match all <td> </td> tags and get the data 
           preg_match_all("|<td[^>]*>+(.*)</[td>]+>|U", $content, $regs); 
          
           $datacnt = count($regs[0]); 
           $databasedata="";    //combine all the data 
           echo "\n"; 
          
           for ($i=0; $i<$datacnt; $i++) { 
          
            // insert the data into the database 
            if ($i%16 ==0) {
              echo "\n";
              if ($rowcount < $codeopen-1) {
                echo $databasedata . "\n\n";
              }
              if ($rowcount > 1) --$rowcount;
              $databasedata="";
            }

            // remove all html tags and decode html entities
            $data = html_entity_decode(strip_tags($regs[1][$i]));
            $databasedata = $databasedata . "," . $data;
           }

          preg_match_all should return everything between <td align="left"> and </td> or <td> and </td>.

          I do not save the first 16 cells because these are the table header. I save the values to a variable $databasedata, with each cell's data separated by a comma. I then explode it by its delimiter and stick it into a database, which all works fine if I ignore the <td align="left"> </td> data.

          If I run the script as it is above I get all the required data except two cells which have the align set.
          😕
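
          A possible alternative to joining the cells with commas, sketched under assumptions: cells_to_rows() is a hypothetical helper and the sample data is invented. Grouping the flat list with array_chunk keeps each row as an array, so a cell that itself contains a comma is not split later by explode.

```php
<?php
// Hypothetical helper: group a flat list of cell values into table rows
// and drop the first row, which holds the column headers.
function cells_to_rows(array $cells, int $width = 16): array
{
    $rows = array_chunk($cells, $width);   // one sub-array per table row
    return array_slice($rows, 1);          // skip the header row
}

// Made-up data, three columns wide to keep the example short.
$cells = ['Code', 'Name', 'Qty', 'A1', 'Widget, large', '4', 'B2', 'Gadget', '7'];
$rows = cells_to_rows($cells, 3);

foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";
}
```

          With the real table the width would stay at its default of 16, and each $row could be handed to the database insert directly.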

            Actually I think the regexp is slightly wrong. I don't think you can say 'not this' 'any number of times' (e.g. [^>]* )

            preg_match_all("/<td[^>].*?>?(.*?)<\\/td>/si", $string, $matches);
            

            That would match a <td> tag with any attributes, once (the ?), plus the contents, which are put into the $matches array, then a closing </td>. Using ? after a quantifier is equivalent to the non-greedy flag, which can be used at the end. The s flag lets . match newlines, so matches can span multiple lines, and the i flag makes the match case-insensitive.

            With regard to my post in the EL, I think that the above regexp is quite unrefined; formats such as .*? should probably be broken down into match groups.

              Why incur the overhead of a regexp when a simple explode will do the job?

                <?php 
              $t=array("</TD>","</TR>","</span>","</SPAN>","</a>","</A>","</div>","</DIV>","</b>","</B>","</center>","</CENTER>",); 
              $r=array("</td>","</tr>",""); 
              $table=str_replace($t,$r,$table); 
              $rows=explode("</tr>",$table); 
              foreach($rows as $row) 
                  { 
                  $cells=explode("</td>",$row); 
                  foreach($cells as $cell) 
                      { 
                      $parts=explode(">",$cell); // end() takes its argument by reference, so it needs a variable
                      $cell=trim(end($parts));
                      echo $cell."<br>\n"; // individual cell content
                      } 
                  }?>  

                Thanks for the help.

                I used sidney's option in the end, but I still need to learn regular expressions.

                😃 :p

                  Originally posted by Shrike
                  I don't think you can say 'not this' 'any number of times' (e.g. [^>]* )

                  You certainly can. I used it all the time before discovering the ungreedy modifier.

                  Think of it not as "not this" but as "any character except this/these". So it's just like "." but with an "except these" clause.
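
                  A quick way to check this for yourself, as a rough sketch (the $html string is made up): the negated class and the lazy dot find the same opening tags, while a plain greedy dot runs on to the last > in the string.

```php
<?php
$html = '<td align="left">a</td><td>b</td>';

// Negated class: "any character except >", repeated zero or more times.
preg_match_all('/<td[^>]*>/', $html, $negated);

// Lazy dot: "any character", as few as possible, up to the next >.
preg_match_all('/<td.*?>/', $html, $lazy);

// Greedy dot for contrast: keeps going to the last > it can reach.
preg_match_all('/<td.*>/', $html, $greedy);

print_r($negated[0]);
print_r($lazy[0]);
print_r($greedy[0]);
```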
