Using preg_match_all to parse html table

robinstott · Mar 29, 2004

Hi,

I'm writing a script that will parse an HTML file containing a table of data and enter the data into a MySQL database.

I'm trying to parse all the data in the cells '<td></td>' using the preg_match_all function.

 // match all <td> </td> tags and get the data
 preg_match_all("|<[td>]+(.*)</[td>]+>|U", $content, $regs);

 $datacnt = count($regs[0]);
 $databasedata="";	//combine all the data

 echo "\n";

 for ($i=0; $i<count($regs[0]); $i++) {

  // insert the data into the database
  if ($i%16 ==0) { 
    echo "\n";
    if ($rowcount < $codeopen-1) {
    	echo $databasedata . "\n\n";
    }
    if ($rowcount > 1) --$rowcount;
    $databasedata="";
  }

  // remove all html tags and decode html entities
  $data = html_entity_decode(strip_tags($regs[1][$i]));
 }

The problem I have is that I can't work out the correct pattern for preg_match_all that will pick up both '<td>' and '<td align="left">'.

The code as it stands only picks up '<td>'.

I'm not that familiar with the patterns.

Any ideas?
Thanks

swr · Mar 29, 2004

Try this:

<td[^>]*>

The [^>]* part will match anything that isn't a greater-than sign, zero or more times.

By the way, you may want to use the "s" modifier, so matches can cross newlines.
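
A minimal sketch of both suggestions together, assuming a made-up two-cell sample in $content (the real file is not shown in this thread). The second cell carries an attribute and spans two lines, so it is only captured when the "s" modifier is present:

```php
<?php
// Hypothetical sample input: the second cell has an attribute and a newline inside it.
$content = "<tr><td>one</td><td align=\"left\">two\nlines</td></tr>";

// [^>]* skips any attributes, U makes (.*) ungreedy, s lets . match newlines.
preg_match_all('|<td[^>]*>(.*)</td>|Us', $content, $regs);

// Same pattern without s: . stops at the newline, so the second cell is missed.
preg_match_all('|<td[^>]*>(.*)</td>|U', $content, $noS);

print_r($regs[1]);          // both cells
echo count($noS[1]), "\n";  // number of cells found without s
```

If cells go missing in a real file, a newline inside the cell or upper-case tags (add the "i" modifier) are two things worth checking.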

robinstott · Mar 29, 2004

Thanks,

I've tried that and it still isn't picking up any of the <td align="left"> data.

Weedpacket · Mar 29, 2004

Originally posted by robinstott
Thanks,

I've tried that and it still isn't picking up any of the <td align="left"> data.

It should (at least, it should pick up the stuff between <td align="left"> and </td>); how are you using it?

robinstott · Mar 29, 2004

Basically I have an HTML form with a table 16 cells wide and an undefined number of rows.

The code below is run:

// match all <td> </td> tags and get the data 
 preg_match_all("|<td[^>]*>+(.*)</[td>]+>|U", $content, $regs); 

 $datacnt = count($regs[0]); 
 $databasedata="";    //combine all the data 
 echo "\n"; 

 for ($i=0; $i<$datacnt; $i++) { 

  // insert the data into the database 
  if ($i%16 ==0) {  

    echo "\n"; 
    if ($rowcount < $codeopen-1) { 
        echo $databasedata . "\n\n"; 
    } 
    if ($rowcount > 1) --$rowcount; 
    $databasedata=""; 
  } 

  // remove all html tags and decode html entities 
  $data = html_entity_decode(strip_tags($regs[1][$i]));

  $databasedata = $databasedata . "," . $data;
}

preg_match_all should return everything between <td align="left"> and </td> or <td> and </td>.

I do not save the first 16 cells because these are the table header. I save the values to a variable $databasedata, with each cell's data separated by a comma. I then explode it by its delimiter and stick it into a database, which all works fine if I ignore the <td align="left"> </td> data.

If I run the script as it is above I get all the required data except two cells which have the align set.
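
One way to sketch the row handling described above without the modulo bookkeeping is array_chunk(). This is only an illustration: $cells is a hypothetical stand-in for the cleaned values from $regs[1], and it uses 3 columns rather than 16 to keep the sample short.

```php
<?php
// Hypothetical data: header row followed by two data rows, 3 columns wide.
$columns = 3;
$cells = ['Name', 'Qty', 'Note', 'Bolt', '4', 'steel, zinc', 'Nut', '9', ''];

// Split the flat list of cells into rows of $columns cells each.
$rows = array_chunk($cells, $columns);

// Drop the first row, which holds the table header.
array_shift($rows);

foreach ($rows as $row) {
    // Each $row is already an array, ready to be bound to an INSERT statement.
    echo implode(' | ', $row), "\n";
}
```

Keeping each row as an array avoids the join-then-explode round trip, which would split a cell such as 'steel, zinc' that itself contains a comma.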

Shrike · Mar 29, 2004

Actually I think the regexp is slightly wrong. I don't think you can say 'not this' 'any number of times' (e.g. [^>]*)

preg_match_all("/<td[^>].*?>?(.*?)<\\/td>/si", $string, $matches);

That would match a <td> tag with any attributes, once (the ?), plus the contents, which are put into the $matches array, then a closing </td>. Using ? after a quantifier in the regexp is equivalent to the non-greedy flag, which can be used at the end. The s flag allows for multi-line matches and the i flag for case-insensitivity.

With regard to my post in the EL, I think that the above regexp is quite unrefined; constructs such as .*? should probably be broken down into match groups.
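
To illustrate the point about ? versus the non-greedy flag, here is a small sketch on a made-up two-cell string (not the poster's data). A ? after a quantifier and the U modifier give the same result here; note that combining them flips the quantifier back to greedy.

```php
<?php
$html = '<td>a</td><td>b</td>'; // hypothetical sample

// Greedy: .* runs to the last </td>, so there is a single match.
preg_match_all('/<td>(.*)<\/td>/', $html, $greedy);

// Lazy quantifier: .*? stops at the first </td> each time.
preg_match_all('/<td>(.*?)<\/td>/', $html, $lazy);

// U modifier: every quantifier in the pattern becomes ungreedy.
preg_match_all('/<td>(.*)<\/td>/U', $html, $ungreedy);

print_r($greedy[1]);   // one element: a</td><td>b
print_r($lazy[1]);     // a, b
print_r($ungreedy[1]); // a, b
```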

sidney · Mar 29, 2004

Why incur the overhead of regular expressions when a simple explode will do the job?

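<?php
// $table is assumed to already hold the HTML of the table.
// Lower-case the closing cell/row tags and strip the other closing tags;
// search entries beyond the end of $r are replaced with an empty string.
$t=array("</TD>","</TR>","</span>","</SPAN>","</a>","</A>","</div>","</DIV>","</b>","</B>","</center>","</CENTER>");
$r=array("</td>","</tr>","");
$table=str_replace($t,$r,$table);
$rows=explode("</tr>",$table);
foreach($rows as $row)
    {
    $cells=explode("</td>",$row);
    foreach($cells as $cell)
        {
        // end() takes its argument by reference, so store the pieces in a variable first
        $parts=explode(">",$cell);
        $cell=trim(end($parts));
        echo $cell."<br>\n"; // individual cell content
        }
    }?>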

robinstott · Mar 29, 2004

Thanks for the help.

I used sidney's option in the end, but I still need to learn regular expressions.

:p

swr · Mar 29, 2004

Originally posted by Shrike
I don't think you can say 'not this' 'any number of times' (e.g. [^>]* )

You certainly can. I used it all the time before discovering the ungreedy modifier.

Think of it not as "not this" but as "any character except this/these". So it's just like "." but with an "except these" clause.
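
A quick sketch of the difference, using a made-up one-cell string: a greedy "." runs on to the last ">" in the input, while the negated class cannot step over the first one.

```php
<?php
$html = '<td align="left">x</td>'; // hypothetical sample

// Greedy dot: .* swallows everything up to the last > in the string.
preg_match('/<td.*>/', $html, $dot);

// Negated class: [^>]* matches any character except >, so it stops at the tag's own >.
preg_match('/<td[^>]*>/', $html, $negated);

echo $dot[0], "\n";     // the whole string
echo $negated[0], "\n"; // just the opening tag
```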
