[RESOLVED] RegEx Being Greedy

Joseph_Witchard · Mar 15, 2009

Earlier today I had a problem with my BBCode function messing up the text formatting when bold tags or url tags were used more than once. I was told to use ? after the quantifier in the expression in order to make preg_replace() "not greedy". That fixed it.

I'm afraid I don't really understand what "being greedy" means. The PHP website says that if an expression is greedy, it will match as much as possible. If it's not, it will match as little as possible. To me, that sounds like being greedy should have worked just fine if a bold or url tag was used more than once, since being greedy matches as much as possible.

Could someone correct my thinking, please?

bradgrafelman · Mar 15, 2009

Take this pattern:

@\[b\](.*)\[/b\]@

Now, take a look at this string:

[b]This[/b] word is bold, but what exactly is [b]bold[/b]?

Since * is a "greedy" quantifier by default, it will try to consume as much text that matches "." (so any character) as possible. Thus, it starts before the word "This" and consumes everything up until the last ending bold tag after "bold". Why didn't it stop at the first [/b] tag? Because it found one farther down the line, and as we've said, it's a "greedy" lil' quantifier! :p
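To see this in action, here is a minimal sketch. It assumes the pattern is meant to have its square brackets escaped and that both tags in the sample string are lowercase; the variable names are only illustrative.

```php
<?php
// Sample text from the post, with both [b] tags in lowercase (an assumption).
$text = '[b]This[/b] word is bold, but what exactly is [b]bold[/b]?';

// Greedy: (.*) grabs as much as it can while still leaving a final [/b] to match.
preg_match('@\[b\](.*)\[/b\]@', $text, $m);

echo $m[1], "\n";
// prints: This[/b] word is bold, but what exactly is [b]bold
```

The captured group swallows the first closing tag and the second opening tag as if they were ordinary text.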

Adding a '?' right after a * or + will effectively negate the greediness of the quantifier. Note that you can also add the 'U' modifier at the end of your pattern; this 'U' stands for Ungreedy, meaning that all *'s and +'s will be non-greedy by default (so a '?' after one would then make it greedy again).
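As a rough illustration of the two options, the sketch below runs the same sample string through a pattern with a '?' after the quantifier and through a pattern with the 'U' modifier. The string and variable names are assumptions made for the example.

```php
<?php
$text = '[b]This[/b] word is bold, but what exactly is [b]bold[/b]?';

// Lazy quantifier: the ? after * makes this one quantifier stop as early as possible.
preg_match('@\[b\](.*?)\[/b\]@', $text, $lazy);

// U modifier: flips the default for the whole pattern, so a plain * is now lazy.
preg_match('@\[b\](.*)\[/b\]@U', $text, $ungreedy);

echo $lazy[1], "\n";      // prints: This
echo $ungreedy[1], "\n";  // prints: This
```

Both forms stop at the first closing tag; the '?' applies to a single quantifier, while 'U' changes every quantifier in the pattern.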

NogDog · Mar 15, 2009

Also note that if you use the U modifier and also use the ? after the * or +, then you'll be right back where you started, as what the U modifier really does is invert the greediness of any repetition. I just wanted to point this out lest you think that using both would make it doubly sure that it is ungreedy, whereas in fact they would cancel each other out and you'd be back to greedy capturing again.
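A small sketch of that cancelling effect, assuming a sample string with two bold tags (the string and variable names are made up for the example):

```php
<?php
$text = 'This [b]is[/b] a test. It is [b]only[/b] a test.';

// U flips * to lazy, then ? flips it back: the net effect is a greedy match.
preg_match('#\[b\](.*?)\[/b\]#U', $text, $both);

echo $both[1], "\n";
// prints: is[/b] a test. It is [b]only
```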

Joseph_Witchard · Mar 16, 2009

All right. But I still don't understand why consuming every character would cause an error like that, or why making it ungreedy would fix the problem.

Weedpacket · Mar 16, 2009

Being greedy, it tries to match as much as possible.

[b]This[/b] word is bold, but what exactly is [b]bold[/b]?

It's going to see the first "[b]" and then match as much as possible, then match "[/b]". There are two "[/b]" that can serve as the end of the match. So it has a choice:

This

and

This[/b] word is bold, but what exactly is [b]bold

It's greedy, so it picks the longer one.
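The choice between the two candidates can be made visible with offsets. This is only a sketch: it assumes the pattern escapes its brackets, and the variable names are hypothetical. It locates every closing tag, then checks where the greedy and lazy captures end.

```php
<?php
$text = '[b]This[/b] word is bold, but what exactly is [b]bold[/b]?';

// Every position where a closing tag could end the match.
preg_match_all('@\[/b\]@', $text, $closers, PREG_OFFSET_CAPTURE);
$offsets = array_column($closers[0], 1);

// Same pattern twice: once greedy, once lazy.
preg_match('@\[b\](.*)\[/b\]@', $text, $greedy, PREG_OFFSET_CAPTURE);
preg_match('@\[b\](.*?)\[/b\]@', $text, $lazy, PREG_OFFSET_CAPTURE);

// Offset at which each capture stops, i.e. where its closing tag begins.
$greedyEnd = $greedy[1][1] + strlen($greedy[1][0]);
$lazyEnd   = $lazy[1][1] + strlen($lazy[1][0]);

echo 'closing tags found: ', count($offsets), "\n";
echo 'greedy stops at the last one: ', var_export($greedyEnd === end($offsets), true), "\n";
echo 'lazy stops at the first one: ', var_export($lazyEnd === $offsets[0], true), "\n";
```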

Joseph_Witchard · Mar 17, 2009

Weedpacket;10907359 wrote:
Being greedy, it tries to match as much as possible.
[b]This[/b] word is bold, but what exactly is [b]bold[/b]?
It's going to see the first "[b]" and then match as much as possible, then match "[/b]". There are two "[/b]" that can serve as the end of the match. So it has a choice:
This
and
This[/b] word is bold, but what exactly is [b]bold
It's greedy, so it picks the longer one.

Okay. So, by your last example, there's only one opening bold tag and only one closing bold tag. Does that mean if it's greedy, it's only going to match one of each?

bradgrafelman · Mar 17, 2009

That was all one single example; the first code block showed the raw string, and the second and third code blocks were examples of what the (.*) in the pattern would match based on that string.

Joseph_Witchard · Mar 17, 2009

Okay. If a greedy match goes all the way to the last [/b], what exactly happens to the first [/b]? Where does it go?

NogDog · Mar 17, 2009

It does not go anywhere: it's included as part of the matched text between the first [b] tag and the last [/b] tag.

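$text = "This [b]is[/b] a test. It is [b]only[/b] a test.";
// The square brackets must be escaped; otherwise [b] is a character class, not a literal tag.
$greedy = preg_replace('#\[b\](.*)\[/b\]#is', "<strong>\\1</strong>", $text);
// Same pattern with the U modifier, which makes the * ungreedy.
$ungreedy = preg_replace('#\[b\](.*)\[/b\]#isU', "<strong>\\1</strong>", $text);
echo "<pre>" . htmlspecialchars($greedy) . "</pre>\n";
echo "<pre>" . htmlspecialchars($ungreedy) . "</pre>\n";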

Output:

This <strong>is[/b] a test. It is [b]only</strong> a test.

This <strong>is</strong> a test. It is <strong>only</strong> a test.

Joseph_Witchard · Mar 17, 2009

This <strong>is[/b] a test. It is [b]only</strong> a test.

Okay, I think I've finally got it. Correct me if I'm wrong, please. Since it's matching as much as possible, it's treating the inner [/b] and [b] tags as part of .* rather than as tags that are supposed to be replaced with HTML tags via preg_replace()?

bradgrafelman · Mar 17, 2009

Right. Those middle two tags, [/b] and [b], are simply treated as any other text would be.

Joseph_Witchard · Mar 17, 2009

Cool. So, since being ungreedy corrects the error that's output if you have more than one bold tag in the text, does that mean it will replace each one, no matter how many you have?

bradgrafelman · Mar 17, 2009

Right. It's ungreedy, so it stops as soon as it can (meaning the first [/b] it can find) and that's all it matches. Once that replacement is done, it begins looking for more replacements to make.
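One way to check that every pair gets replaced is to ask preg_replace() how many replacements it made, via its optional fifth argument. The sample string with three bold tags and the variable names are assumptions for this sketch.

```php
<?php
$text = 'This [b]is[/b] a test. It is [b]only[/b] a test, [b]really[/b].';

// The fifth argument receives the number of replacements made (-1 means no limit).
$html = preg_replace('#\[b\](.*?)\[/b\]#is', '<strong>$1</strong>', $text, -1, $count);

echo $html, "\n";
// prints: This <strong>is</strong> a test. It is <strong>only</strong> a test, <strong>really</strong>.
echo $count, "\n";
// prints: 3
```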

Joseph_Witchard · Mar 17, 2009

This has cleared up a lot. Thanks!

Are all language RegEx syntaxes greedy by default, or is being greedy something unique to PHP? I'm also trying to learn JavaScript RegEx in addition to PHP RegEx, so I'm curious.

bradgrafelman · Mar 17, 2009

A great place to learn about regular expressions, I've found, is here: http://www.regular-expressions.info. The syntax of the pattern shouldn't vary much whether you're using PHP or JavaScript, either.

Joseph_Witchard · Mar 17, 2009

I appreciate everyone's help with this.

And just so I know (because this has been bugging me since I saw this), why does your user title say "Banned"?
