Matching LARGE regex strings?

impactdni

I have a large HTML string that is multiple webpages attached to each other (one long string). Because the headers and things are the same on every page (comments, navbar, sidebars), I would like to get rid of them, since I know there are multiple copies (I'm indexing the text, so multiple copies throw off the data).

I would like to match the LARGEST thing that it finds repeated first, then the next largest, then the next, etc. However, I can't use

/(.*).*?\1/

because I don't want it matching small words like "the", so there needs to be a minimum. I've tried things like

/(.{15,}).*?\1/

but this normally grabs the first thing over 15 that it sees (not the largest above 15). The only way I've been able to semi-simulate the behaviour I want is with the following function, but it is not at all efficient...

What is a better way?

for ($x = 100; $x >= 20; $x -= 10) {
  while (preg_match('@(.{'.$x.',}).*?\1@S', $page_data, $dupe)) {
    // Pass the delimiter so that any '|' inside the match is escaped too
    $dupe[1] = preg_quote($dupe[1], '|');
    $page_data = preg_replace('|'.$dupe[1].'|', '', $page_data);
  }
}

Also, I've tried adding the "s" flag to make it match newlines (which it hasn't been doing as far as I can tell), but as soon as I add the "s" flag, it basically skips over everything (matches nothing). It does so VERY fast compared to not having the "s" flag (under a second vs. a few minutes).

impactdni

Anyone?

Weedpacket

Don't post content-free posts like

Anyone?

It's rude and makes it look like you're just sitting around doing nothing in the meantime. Also, five hours is a pretty short time to wait: give people time to live their own lives first.

/(.*)\1/

In other words, don't use ungreedy matching.

impactdni

But that doesn't allow there to be any text between the match and the repeat, correct?

These are very possibly spaced out throughout the string (text between them).

Weedpacket

Then use

/(.*).*\1/

and, if the text spans more than one line,

/(.*).*\1/s

impactdni

But the first one matches small strings (under 5 characters or so), which gets rid of many of the things that I want to keep.

And for some reason, the 2nd option (with the s flag) skips everything. It literally matches NOTHING.

Weedpacket

Finally I understand the problem. I'm being very slow today.

The regexp engine is eager as well as greedy, and it's the eagerness that's getting in the way. Once it finds a match it won't look to see if there are any more. Only if it finds more than one match starting from the same position does it need to choose which one to return, and that's when greediness comes into play. It has no reason to keep searching after that (in case there's a longer one to be found further in).
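To make the eagerness concrete, here is a small sketch. The sample text is made up for illustration: a short repeat ("abc") starts before a longer one ("LONGHEADER"), and the pattern is the minimum-length style from the original question with the minimum dropped to 3.

```php
<?php
// Hypothetical sample text: "abc" repeats early, "LONGHEADER" repeats later.
$text = 'abc-abc LONGHEADER x-LONGHEADER';

// The engine tries position 0 first. Greediness only picks the longest group
// that works *at that position*; the first position with any match wins.
preg_match('/(.{3,}).*?\1/s', $text, $m);

echo $m[1], "\n"; // prints "abc", not "LONGHEADER"
```

The longer duplicate is never considered, because a match was already found starting at position 0.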

So there's that, and there's the fact that regular expressions can't count.

preg_match_all() came to mind, but I can think of cases where it would fail, due to patterns overlapping.

The following may not be ideal for your purposes. You mention wanting to be able to find the longest, then the next longest, etc. But it will at least find the longest, and that gives you an upper limit on how long matches can be. If it finds a 14-character match, you can see how to then embark on a 13-character match, and so on.

Oh, and (due to eagerness) it finds the first match of the longest possible length. But then, how many matches of the form "e…e" would there be in a typical block of English text?

The text being searched is in $string.

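$strlen = strlen($string);
$longest_match = '';
$longest_matchlen = 0;
for($i=0; $i<$strlen; $i++)
{
	// Stop once two copies as long as the current best can no longer fit.
	if($i+2*$longest_matchlen>$strlen) break;
	// Largest gap that still leaves room for both copies.
	$maxgap = $strlen-$i-2*$longest_matchlen;
	// Skip $i characters, capture greedily, then look for the same text later on.
	if(preg_match('/^.{'.$i.'}(.+).{0,'.$maxgap.'}\\1/s', $string, $match))
	{
		if($longest_matchlen<strlen($match[1]))
		{
			$longest_matchlen = strlen($match[1]);
			$longest_match = $match[1];
		}
	}
}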

The '(.+)' might be expanded to '(.{'.$longest_matchlen.',}.+)' so that once it matches a 5-character dupe, it won't be satisfied until it matches a 6-character dupe. Then the next test on match lengths would not be needed.
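As a rough sketch of that expanded group, the loop could be wrapped up like this. The function name longest_dupe() and the sample text are made up for illustration, and the gap calculation is adjusted on the assumption that each new match has to be at least one character longer than the current best. PCRE caps counted repeats such as .{n} at 65535, so this is only suitable for fairly short strings.

```php
<?php
/**
 * Sketch: longest substring that occurs twice without overlapping.
 * longest_dupe() is a hypothetical name; the loop follows the post above,
 * with the capture group forced to beat the current best length.
 */
function longest_dupe(string $string): string
{
    $strlen = strlen($string);
    $longest = '';
    for ($i = 0; $i < $strlen; $i++) {
        $len = strlen($longest);
        // A longer duplicate needs room for two copies of at least $len + 1 bytes.
        $maxgap = $strlen - $i - 2 * ($len + 1);
        if ($maxgap < 0) {
            break;
        }
        // .{len,}.+ means "at least len + 1 characters".
        $pattern = '/^.{' . $i . '}(.{' . $len . ',}.+).{0,' . $maxgap . '}\\1/s';
        if (preg_match($pattern, $string, $match)) {
            $longest = $match[1]; // always longer than the previous best
        }
    }
    return $longest;
}

echo longest_dupe('abc-abc LONGHEADER x-LONGHEADER'), "\n"; // prints "LONGHEADER"
```

Because the group can only match something longer than the current best, no length comparison is needed inside the loop.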

That's something to look into, anyway.

So is the possibility of getting creative with lookahead assertions. Oh, and preg_match() has had an offset parameter since PHP 4.3.3. That would allow dropping the "^.{$i}" business.
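A minimal sketch of the offset idea, with a made-up sample string and start position. One thing to watch: with an offset, ^ still refers to the start of the whole subject, so this sketch assumes \G is used to pin the match to the offset instead.

```php
<?php
// Hypothetical sample text, chosen so the duplicate does not start at position 0.
$string = 'abc-abc LONGHEADER x-LONGHEADER';
$i = 8; // start position to test, in bytes

// \G anchors the match at the offset passed as the fifth argument,
// which replaces the "^.{$i}" prefix in the pattern.
$found = preg_match('/\G(.{4,}).*\1/s', $string, $match, 0, $i);

echo $found, ' ', $match[1], "\n"; // prints "1 LONGHEADER"
```

Without \G (or the A modifier) the engine would be free to slide forward from the offset and report a match that starts at some later position.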