My script is simple:

<?php
$str = "mem: 9 334 23423343 3433434";

$num_matches = preg_match_all("/^mem:(\s+\d+)+$/", $str, $matches);
if (!$num_matches) {
        throw new Exception("no match");
}

echo "$num_matches matches\n";
var_dump($matches);

I was expecting that the pattern (\s+\d+)+ should match all of the numbers in $str but the output only shows the last match for some reason:

1 matches
array(2) {
  [0] =>
  array(1) {
    [0] =>
    string(27) "mem: 9 334 23423343 3433434"
  }
  [1] =>
  array(1) {
    [0] =>
    string(8) " 3433434"
  }
}

As you can see, $matches[1] contains only the last \s+\d+ occurrence in $str. I expected it to contain all of the matches: 9, 334, 23423343, 3433434.

Is there some way to alter my pattern so that it returns all of these numbers for a string that may contain an arbitrary number of numbers? Am I correct in thinking this is incorrect behavior by preg_match_all? Should I report it to the PHP devs?

    You may be entering the twilight zone of recursive patterns, but whenever I try to read that, my head starts to hurt, and I find a different solution. 🙂

      I think anchoring might be at fault: the matches returned by preg_match_all don't overlap and once you anchor your pattern to the ends of the string, there can only be one match running from ^ to $. All the parenthesised sub-matches get drawn from that, so with only one match found, there's only one space-digit group found (which one it is and why - who knows? Maybe as the regexp engine matches more and more of the string, the group 1 submatch gets repeatedly overwritten, with only the last one written surviving).

      There's probably a regexp solution, but to get on with things I'd just check the overall format of the line and then match \s+\d+.

      $num_matches = preg_match_all("/^mem:(\s+\d+)+$/", $str) and
          $num_matches = preg_match_all("/\s+\d+/", $str, $matches);

      And then come back to it later.

      Or if you don't actually need the spaces maybe $matches = preg_split('/\s+/', $str) and check that the 0th element returned is mem: and the rest are \d+.
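The preg_split idea above could look roughly like the sketch below. The function name parse_mem_line is hypothetical, and the checks are stricter than the original pattern in one respect: any leading or trailing whitespace, including a trailing newline, makes the line invalid.

```php
<?php
// Hypothetical helper, named parse_mem_line for illustration only.
function parse_mem_line(string $line): array
{
    // Split on runs of whitespace, as suggested in the post.
    $parts = preg_split('/\s+/', $line);

    // The 0th element must be the literal label "mem:".
    if ($parts === false || array_shift($parts) !== 'mem:') {
        throw new InvalidArgumentException('line does not start with "mem:"');
    }

    // At least one number is required, mirroring the outer + in the pattern.
    if ($parts === []) {
        throw new InvalidArgumentException('no numbers after "mem:"');
    }

    // Every remaining element must consist of digits only.
    foreach ($parts as $part) {
        if (!preg_match('/^\d+$/D', $part)) {
            throw new InvalidArgumentException("not a number: '$part'");
        }
    }

    return $parts;
}

$numbers = parse_mem_line("mem: 9 334 23423343 3433434");
echo implode(',', $numbers), "\n"; // prints 9,334,23423343,3433434
```

The numbers come back as strings without the leading spaces; cast them with (int) if integers are needed.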

      Weedpacket

      I goofed a bit in my post. The output is actually slightly different. I'll edit the output momentarily.

      I believe I was right to question whether things were working correctly. $matches[0] correctly matches the entire expression. The anchoring seemed necessary to make sure the entire line was consumed to get all the space/digit sections (i.e., to make it "greedy"), and it's clear that the parenthetical would have to be matched multiple times for the pattern to work with those anchors in place. The entire pattern would not match the entire string if the parenthetical matched only one space/digit section. You will note that the number returned is the last space/digit section of that line.

      It turns out this is documented behavior: when a capturing subpattern matches repeatedly, only the last match is returned:

      If a capturing subpattern is matched repeatedly, it is the last portion
      of the string that it matched that is returned.
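A small sketch of that behavior, followed by one possible single-pattern workaround. The \G pattern is an assumption on my part, not something suggested earlier in the thread: it produces one match per number, and each match must begin exactly where the previous one ended. It does not verify that the whole line is well formed, so the anchored pattern is still useful as a separate check.

```php
<?php
$str = "mem: 9 334 23423343 3433434";

// A repeated capture group keeps only its final iteration.
preg_match('/^mem:(\s+\d+)+$/', $str, $m);
echo "group 1: '{$m[1]}'\n"; // prints group 1: ' 3433434'

// Assumed alternative: the first match must start with "mem:" at the
// beginning of the string; every later match must start where the previous
// one ended (\G), and (?!^) stops \G from matching at offset 0.
$count = preg_match_all('/(?:^mem:|\G(?!^))\s+(\d+)/', $str, $all);

echo "$count matches\n";            // prints 4 matches
echo implode(',', $all[1]), "\n";   // prints 9,334,23423343,3433434
```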

      Thanks, weedpacket, for the suggestions.
