preg_match between <tr> tags - why is this code acting greedy?

m5laser · Feb 21, 2010

Intended Usage:
Read an HTML table and return everything between the <tr> tags if one of the words "bedroom/studio/penthouse" is present.

Example:

$text = '
<table>
<tr>
<th>Unit Type</th>
<th>Availability</th>
<th>Rates</th>
</tr>

<tr>
<td><b>One Bedroom</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>
</tr>
</table>';
preg_match_all('/<tr>(.{0,200})(Bedroom|Studio|Penthouse)(.{0,200})<\/tr>/isU',$text,$match, PREG_SET_ORDER);
print_r($match);

The Problem:
For some reason the above code seems to be acting 'greedy' by returning non-matching <tr> tags. For example, $match[0][0] would return:

<tr>
<th>Unit Type</th>
<th>Availability</th>
<th>Rates*</th>
</tr>

<tr>
<td>One Bedroom </td>
<td>Call for Availability</td>

<td>$830-$855</td>
</tr>

Any idea what I'm doing wrong? Any help would be GREATLY appreciated!

sneakyimp · Feb 21, 2010

I'm not familiar enough with the vague notion of 'greediness' in regular expressions to explain why it's matching the whole thing instead of the shorter string, but humbly offer this modified version of your regex that uses an assertion (?!<tr>) which says that your first wildcard cannot match a character followed by a <tr> tag. Maybe this will help?

$pattern = '/<tr>((.(?!<tr>)){0,200})(Bedroom|Studio|Penthouse)(.{0,200})<\/tr>/isU';

This should require that no <tr> tag exists between the first <tr> and the word bedroom/studio/penthouse. You may need to alter the other wildcard with the same assertion to prevent additional greedy behavior in it as well.

If this approach proves frustrating, you may want to use a DOM approach instead, where you parse the content into a collection of nodes and traverse the node structure.
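
For reference, here is a minimal sketch of what the DOM approach could look like. It assumes PHP's bundled DOM extension is enabled; the helper name findRowsByKeyword and its parameters are made up for illustration and are not from any library.

```php
<?php
// Hypothetical helper: returns the inner HTML of every <tr> whose text
// contains at least one of the given keywords (case-insensitive).
function findRowsByKeyword(string $html, array $keywords): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely perfect; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        foreach ($keywords as $keyword) {
            if (stripos($tr->textContent, $keyword) !== false) {
                // Serialize the row's children to get "everything between the <tr> tags".
                $inner = '';
                foreach ($tr->childNodes as $child) {
                    $inner .= $doc->saveHTML($child);
                }
                $rows[] = trim($inner);
                break; // one keyword is enough for this row
            }
        }
    }
    return $rows;
}

$text = '
<table>
<tr>
<th>Unit Type</th>
<th>Availability</th>
<th>Rates</th>
</tr>

<tr>
<td><b>One Bedroom</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>
</tr>
</table>';

$rows = findRowsByKeyword($text, ['Bedroom', 'Studio', 'Penthouse']);
print_r($rows);
```

Because each row is its own node, a match can never spill over into a neighbouring row, which is the problem the regex version runs into.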

halojoy · Feb 21, 2010

I think I made it! Even though I'm no preg expert.

Output of my script

<td><b>One Bedroom</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>

<td><b>One Penthouse</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>

I think you added too many parentheses in your original pattern.
Mine is a bit shorter and simpler.
As you can see, I made it an even trickier test
by having both 'Bedroom' and 'Penthouse' in the text.
This gives 2 matches.

<?php

$text = '
<table>
<tr>
<th>Unit Type</th>
<th>Availability</th>
<th>Rates</th>
</tr>

<tr>
<td><b>One Bedroom</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>
</tr>

<tr>
<td><b>One Penthouse</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>
</tr>
</table>';

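// (?:Bedroom|Studio|Penthouse) is an alternation of whole words;
// [Bedroom|Studio|Penthouse] would be a character class matching one letter.
// (?:(?!<\/?tr>).)* matches a run of characters that never crosses a <tr>
// or </tr>, so every match stays inside a single row.
preg_match_all( '/<tr>((?:(?!<\/?tr>).)*(?:Bedroom|Studio|Penthouse)(?:(?!<\/?tr>).)*)<\/tr>/is',
					$text,$match,PREG_SET_ORDER);
//echo '<pre>'; print_r($match);
// With PREG_SET_ORDER, $match[n][1] is capture group 1 of match number n.
$bed  = trim($match[0][1]);
$pent = trim($match[1][1]);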
//Display to see the resulting html source code
echo nl2br(htmlspecialchars($bed));
echo '<br /><br />';
echo nl2br(htmlspecialchars($pent));

?>

m5laser · Feb 22, 2010

Thanks, that works perfectly!

I'm still curious why the expression was bypassing the first set of tags, but this is a very nice solution for what I need.

Thanks!

johanafm · Feb 22, 2010

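What the first pattern matched had nothing to do with greediness. It matched <tr>, followed by 0 to 200 characters of any kind (the s modifier lets the dot match newlines too), followed by one of the three words, another 0 to 200 characters, and </tr>; [0][0] contains the whole pattern match. The engine tries starting positions from left to right and takes the first one where the whole pattern can succeed, which here is the first <tr>. Ungreedy quantifiers only affect how far the match extends from that starting point, not where it starts. One way to NOT match the first characters would be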

$p = '#<tr>\s*<td>(Bedroom|...#isu';

This only allows white space between <tr> and <td>, and the first row contains <th> instead of <td>, so the match cannot start there. Or you could start the pattern with

$p = '#</tr>\K';
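
To make the <tr>\s*<td> idea concrete, here is one possible complete pattern. The (.{0,200}) parts are borrowed from the original question and are only an assumption about how the elided pattern might continue, not the actual pattern from this post.

```php
<?php
$text = '
<table>
<tr>
<th>Unit Type</th>
<th>Availability</th>
<th>Rates</th>
</tr>

<tr>
<td><b>One Bedroom</b> </td>
<td><p>Call for Availability</p></td>
<td><p>$830-$855</p></td>
</tr>
</table>';

// Assumed completion: only white space may sit between <tr> and <td>,
// so a row that opens with <th> cannot be the start of a match.
$pattern = '#<tr>\s*<td>(.{0,200})(Bedroom|Studio|Penthouse)(.{0,200})</tr>#isU';

preg_match_all($pattern, $text, $match, PREG_SET_ORDER);

// $match[0][0] is the full text of the first (and here only) match.
echo htmlspecialchars($match[0][0]), "\n";
```

This only protects against header rows. A <td> row without any of the three words, placed before a row that has one, could still be swallowed into the match.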

An example of how greediness works:

$text = '
<tr>
<td>one</td>
<td>two</td>
<td>three</td>
</tr>
</table>';
preg_match_all('#one.*</td>#sU',$text,$match);
echo '<b>Ungreedy</b><pre>';
echo htmlentities(print_r($match,1));
echo '</pre>';

preg_match_all('#one.*</td>#s',$text,$match);
echo '<b>Greedy</b><pre>';
echo htmlentities(print_r($match,1));
echo '</pre>';

As the example shows, a greedy match will match the maximum number of characters possible while the rest of the pattern still matches. Ungreedy means matching the fewest characters possible while the rest of the pattern still matches.
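
A small sketch that ties this back to the original question: greedy and ungreedy versions of the same pattern start at the same place and differ only in where they end. The three-row $text below is a shortened, made-up stand-in for the table in the first post.

```php
<?php
// Three rows on purpose: the keyword sits in the middle one.
$text = '<tr><th>Unit Type</th></tr>' . "\n"
      . '<tr><td>One Bedroom</td></tr>' . "\n"
      . '<tr><td>Studio</td></tr>';

// Same pattern twice; only the U (ungreedy) modifier differs.
preg_match('#<tr>.*Bedroom.*</tr>#sU', $text, $lazy, PREG_OFFSET_CAPTURE);
preg_match('#<tr>.*Bedroom.*</tr>#s', $text, $greedy, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE, $m[0] is [matched string, byte offset].
printf("ungreedy: offset %d, length %d\n", $lazy[0][1], strlen($lazy[0][0]));
printf("greedy:   offset %d, length %d\n", $greedy[0][1], strlen($greedy[0][0]));
```

Both matches begin at the header row's <tr>. The ungreedy one stops at the first </tr> after "Bedroom", while the greedy one runs on to the last </tr> in the text.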
