Earlier today I had a problem with my BBCode function messing up the text formatting when bold tags or url tags were used more than once. I was told to use ? after the quantifier in the expression in order to make preg_replace() "not greedy". That fixed it.

I'm afraid I don't really understand what "being greedy" means. The PHP website says that if an expression is greedy, it will match as much as possible. If it's not, it will match as little as possible. To me, that sounds like being greedy should have worked just fine if a bold or url tag was used more than once, since being greedy matches as much as possible.

Could someone correct my thinking, please? 😕

    Take this pattern:

    @\[b\](.*)\[/b\]@

    Now, take a look at this string:

    [b]This[/b] word is bold, but what exactly is [b]bold[/b]?

    Since * is a "greedy" quantifier by default, it will try to consume as much text matching "." (that is, any character) as possible. Thus, it starts right after the first [b], before the word "This", and consumes everything up to the last closing bold tag after "bold". Why didn't it stop at the first [/b] tag? Because it found one farther down the line, and as we've said, it's a "greedy" lil' quantifier! :p

    Adding a '?' right after a * or + will effectively negate the greediness of the quantifier. Note that you can also add the 'U' modifier at the end of your pattern; this 'U' stands for Ungreedy, meaning that all *'s and +'s will be non-greedy by default (so a '?' after one would then make it greedy again).

      Also note that if you use the U modifier and also use the ? after the * or +, then you'll be right back where you started, as what the U modifier really does is invert the greediness of any repetition. I just wanted to point this out lest you think that using both would make it doubly sure that it is ungreedy, whereas in fact they would cancel each other out and you'd be back to greedy capturing again.
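To make the four combinations concrete, here is a small sketch using the sample string from this thread. The `captured()` helper is just a hypothetical convenience for this example, not something from the original posts; it returns whatever the first capture group matched.

```php
<?php
// Sample string from the thread.
$text = '[b]This[/b] word is bold, but what exactly is [b]bold[/b]?';

// Hypothetical helper: returns what the first capture group matched, or null.
function captured(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$greedy   = captured('@\[b\](.*)\[/b\]@', $text);   // default: * is greedy
$lazy     = captured('@\[b\](.*?)\[/b\]@', $text);  // ? makes this * non-greedy
$ungreedy = captured('@\[b\](.*)\[/b\]@U', $text);  // U inverts the default
$both     = captured('@\[b\](.*?)\[/b\]@U', $text); // U and ? cancel out

echo $greedy, "\n";   // This[/b] word is bold, but what exactly is [b]bold
echo $lazy, "\n";     // This
echo $ungreedy, "\n"; // This
echo $both, "\n";     // This[/b] word is bold, but what exactly is [b]bold
```

The last line is the "back where you started" case: with both U and ?, the capture is greedy again.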

        All right. But I still don't understand why consuming every character would cause an error like that. 😕 I don't understand why making it ungreedy would fix the problem.

          Being greedy, it tries to match as much as possible.

          [b]This[/b] word is bold, but what exactly is [b]bold[/b]?

          It's going to see the first "[b]" and then match as much as possible, then match "[/b]". There are two "[/b]" that can serve as the end of the match. So it has a choice:

          This

          and

          This[/b] word is bold, but what exactly is [b]bold

          It's greedy, so it picks the longer one.
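The "choice" described above can be spelled out by hand. This is only a sketch of the idea, not how the regex engine is actually implemented: it collects every stretch of text that runs from the first "[b]" to some "[/b]", then checks which one the greedy and non-greedy patterns capture. The variable names are made up for this example.

```php
<?php
$text = '[b]This[/b] word is bold, but what exactly is [b]bold[/b]?';

// Position just after the first opening tag.
$start = strpos($text, '[b]') + strlen('[b]');

// Every "[/b]" after that point is a possible end of the match.
$candidates = [];
$offset = $start;
while (($end = strpos($text, '[/b]', $offset)) !== false) {
    $candidates[] = substr($text, $start, $end - $start);
    $offset = $end + 1;
}

print_r($candidates);
// [0] => This
// [1] => This[/b] word is bold, but what exactly is [b]bold

// Greedy takes the longest candidate, non-greedy the shortest.
preg_match('@\[b\](.*)\[/b\]@', $text, $greedy);
preg_match('@\[b\](.*?)\[/b\]@', $text, $lazy);

var_dump($greedy[1] === end($candidates)); // bool(true)
var_dump($lazy[1] === $candidates[0]);     // bool(true)
```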

            Weedpacket wrote:

            Being greedy, it tries to match as much as possible.

            [b]This[/b] word is bold, but what exactly is [b]bold[/b]?

            It's going to see the first "[b]" and then match as much as possible, then match "[/b]". There are two "[/b]" that can serve as the end of the match. So it has a choice:

            This

            and

            This[/b] word is bold, but what exactly is [b]bold

            It's greedy, so it picks the longer one.

            Okay. So, by your last example, there's only one opening bold tag and only one closing bold tag. Does that mean if it's greedy, it's only going to match one of each?

              That was all one single example; the first code block showed the raw string, and the second and third code blocks were examples of what the (.*) in the pattern would match based on that string.

                Okay. If it's going to match the last [ /b] if it's greedy, what exactly happens to the first [ /b]? Where does it go?

                  It does not go anywhere: it's included as part of the matched text between the first [b] tag and the last [/b] tag.

                  $text = "This [b]is[/b] a test. It is [b]only[/b] a test.";
                  // The square brackets must be escaped; unescaped, [b] is a character class.
                  $greedy = preg_replace('#\[b\](.*)\[/b\]#is', "<strong>\\1</strong>", $text);
                  $ungreedy = preg_replace('#\[b\](.*)\[/b\]#isU', "<strong>\\1</strong>", $text);
                  echo "<pre>" . htmlspecialchars($greedy) . "</pre>\n";
                  echo "<pre>" . htmlspecialchars($ungreedy) . "</pre>\n";
                  

                  Output:

                  This <strong>is[/b] a test. It is [b]only</strong> a test.
                  
                  This <strong>is</strong> a test. It is <strong>only</strong> a test.
                  
                    This <strong>is[/b] a test. It is [b]only</strong> a test.
                    

                    Okay, I think I've finally got it. Correct me if I'm wrong, please. Since it's matching as much as possible, it's treating the unformatted [ /b] and [ b] tags as part of .* rather than [ /b] and [ b] that are supposed to be replaced with HTML tags via preg_replace()?

                      Right. Those middle two tags, [/b] and [b], are simply treated as any other text would be.

                        Cool 🙂 So, since being ungreedy corrects the error that's output if you have more than one bold tag in the string, does that mean it will replace each one, no matter how many you have?

                          Right. It's ungreedy, so it stops as soon as it can (meaning the first [/b] it can find) and that's all it matches. Once that replacement is done, it begins looking for more replacements to make.
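One way to see "it keeps looking for more replacements" is the optional fifth argument of preg_replace(), which receives the number of replacements made. A minimal sketch using the test string from earlier in the thread (the variable names are only for illustration):

```php
<?php
$text = "This [b]is[/b] a test. It is [b]only[/b] a test.";

// Greedy: one big match that runs from the first [b] to the last [/b].
$greedy = preg_replace('#\[b\](.*)\[/b\]#is', '<strong>$1</strong>', $text, -1, $greedyCount);

// Non-greedy: each [b]...[/b] pair is matched and replaced separately.
$lazy = preg_replace('#\[b\](.*?)\[/b\]#is', '<strong>$1</strong>', $text, -1, $lazyCount);

echo $greedyCount, ": ", $greedy, "\n";
// 1: This <strong>is[/b] a test. It is [b]only</strong> a test.

echo $lazyCount, ": ", $lazy, "\n";
// 2: This <strong>is</strong> a test. It is <strong>only</strong> a test.
```

With more bold tags in the string, the non-greedy count goes up accordingly, while the greedy version still makes a single replacement.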

                            This has cleared up a lot. Thanks 🙂

                            Are all language RegEx syntaxes greedy by default, or is being greedy something unique to PHP? I'm also trying to learn JavaScript RegEx in addition to PHP RegEx, so I'm curious.

                              I appreciate everyone's help with this 🙂

                              And just so I know (because this has been bugging me since I saw this), why does your user title say "Banned"?
