Problem using regex to locate php elements

captaingrasshol

Hi,

I am trying to isolate PHP elements from the source code of a webpage using the following:

$arr = split("<\?[^(?>)]+\?>",$text);

This works fine if $text does not contain HTML tags between the <? and ?> tags. Otherwise it does not treat ^(?>) as excluding only a ? followed by a >; it excludes any portion containing a > on its own (i.e. HTML closing tags). What am I doing wrong? I am a comparative regex newbie, but I have used regexes for processing HTML elements successfully in the past, and everything I read suggests that the above code should work (I have tried lots of variants as well, none of which work).

Any suggestions greatly appreciated!

Thanks,

Chris

Weedpacket

[^(?>)]+

That just says "any sequence of one or more characters that is not a '(', a '?', a '>', or a ')'". As soon as the regexp machine sees any of those four characters that bit will fail to match.

/<\?(?:(?!\?>).)+\?>/

Note that that expression will not work with split(); it would need preg_split(). I think you'd need

<\?([^?]|\?[^>])+\?>

to use split().
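A minimal sketch of the PCRE version in use, for illustration only. It uses preg_match_all() to collect the blocks and preg_split() to get the text around them. The sample $text and the added "s" modifier (so that . also matches newlines inside a block) are assumptions, not part of the pattern quoted above.

```php
<?php
// Hypothetical sample input: HTML with two PHP blocks, one containing an HTML tag.
$text = '<p>Hi</p><? echo "<b>x</b>"; ?><hr><? $a = 1; ?>';

// Consume one character at a time, but only while the closing tag
// does not start at the current position.
// The "s" modifier is an addition so that "." also matches newlines.
$pattern = '/<\?(?:(?!\?>).)+\?>/s';

// Collect the PHP blocks themselves.
preg_match_all($pattern, $text, $matches);
$blocks = $matches[0];

// Split on the same pattern to get the surrounding non-PHP text.
$rest = preg_split($pattern, $text);

print_r($blocks);
print_r($rest);
```

Note that preg_split() returns a trailing empty string here because the input ends with a PHP block.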

Note also split() will be getting moved out of PHP 6 into a separate extension.

Oh, and consider

<?php
echo "This strips out all text that lies between <? ... ?> tags";
?>

What I'm hinting at here is that regexp is probably the wrong way to go. It would be easier to write a parser that recognises just enough PHP syntax to do the job.
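To make the hint concrete, here is a small sketch that runs the lookahead pattern against that very block. The nowdoc wrapper and the "s" modifier are assumptions added so the example is self-contained.

```php
<?php
// The example block from above, held in a nowdoc so nothing is interpolated.
$source = <<<'SRC'
<?php
echo "This strips out all text that lies between <? ... ?> tags";
?>
SRC;

preg_match('/<\?(?:(?!\?>).)+\?>/s', $source, $m);

// The match stops at the closing tag that sits inside the string literal,
// so the "block" is cut off in the middle of the echo statement.
echo $m[0], "\n";
```

The regex has no idea that the first closing tag is inside a quoted string, which is exactly the kind of thing a small parser would keep track of.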

nrg_alpha

Weedpacket wrote:
Note also split() will be getting moved out of PHP 6 into a separate extension.

Very interesting (and nice to know). Could you please provide a link to this info so that I may bookmark for future reference?

While I don't use split often, from here on in I think I will avoid it altogether given this info (well, also in conjunction with the statement on the split page in the PHP manual that preg_split is 'often faster' than split).

And as a bonus, this does make me wonder out loud, what exactly does make preg_split faster than split?

Cheers,

NRG

Weedpacket

nrg_alpha wrote:
Could you please provide a link to this info so that I may bookmark for future reference?

http://www.php.net/~derick/meeting-notes.html#move-ereg-to-pecl

And as a bonus, this does make me wonder out loud, what exactly does make preg_split faster than split?

More syntax that allows greater control over how and what gets matched (using ?: to avoid caching subexpressions that don't need it; being able to write ((?!thing).) instead of ([^t]|t([^h]|h([^i]|i([^n]|n[^g])))) makes it clearer to the compiler what is to be done and hence makes it easier to develop an efficient execution plan); more active development (the most recent version of PCRE was released last May; I don't think the POSIX implementation has been touched since 1994); and a smarter compiler (we're not running on 486s these days, so we can spend more clock cycles on optimisation before beginning the search and still come out ahead of an implementation that doesn't do the extra work).
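As a small illustration of the two bits of syntax mentioned, the sketch below matches the same text with and without ?: and shows what ends up in the match array. The $subject string is made up, and the example says nothing about speed; it only shows what gets remembered.

```php
<?php
// Hypothetical subject string.
$subject = 'nothing but the thing itself';

// Capturing version: the last character consumed by the group is kept in [1].
preg_match('/^((?!thing).)*/', $subject, $withCapture);

// Non-capturing version: same overall match, nothing extra is remembered.
preg_match('/^(?:(?!thing).)*/', $subject, $withoutCapture);

print_r($withCapture);
print_r($withoutCapture);
```

Both patterns stop just before the first "thing" (inside "nothing"), but only the capturing one carries a second array entry.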

nrg_alpha

Thanks for the info, packet.

captaingrasshol

Many thanks Weedpacket, 🙂

I'll let you know how I get on.

Weedpacket

Righty-ho. I just looked at that and reckon that one idea that might help with the parser is to use preg_match() to locate the next "interesting bit".

Say you're currently inside generic PHP code (for($i=0; $i<42; $i++) sort of stuff). The interesting bits when you're in this state are: " that starts a double-quoted string; ' that starts a single-quoted string; <<< that starts a heredoc/nowdoc-quoted string; // that starts a rest-of-line comment; /* that starts a block comment; and ?> that ends a PHP block. You need to know which of these appears first, so that you know which state to go into next. Assuming that your source code is in $source, and that you're currently scanning the $offset'th character:

preg_match('!("|\'|<<<|//|/\*|\?>)!', $source, $match, PREG_OFFSET_CAPTURE, $offset)

(I think I've got that right) will, assuming it finds anything, find the first interesting bit in $source that appears at or after $offset; depending on what it is, the parser would then go on with scanning a string, a comment, or non-PHP stuff.
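A minimal sketch of that call wrapped in a function. The name next_interesting_bit() and the sample source are hypothetical; the pattern and flags are the ones from the post.

```php
<?php
/**
 * Hypothetical helper: returns [token, position] for the first
 * "interesting bit" at or after $offset, or null if there is none.
 */
function next_interesting_bit(string $source, int $offset): ?array
{
    $pattern = '!("|\'|<<<|//|/\*|\?>)!';
    if (preg_match($pattern, $source, $match, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null;
    }
    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    return $match[0];
}

// Made-up input; single quotes so that $i is not interpolated.
$source = '<?php for ($i = 0; $i < 42; $i++) { /* loop */ echo "hi"; } ?>';

// Start scanning just after the opening tag.
[$token, $pos] = next_interesting_bit($source, 5);
echo $token, ' at ', $pos, "\n";
```

A parser built on this would switch on $token, scan to the end of the string or comment it introduces, and then call the helper again from that new offset.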

When scanning a heredoc-quoted string, you'd need to capture the delimiter that follows the "<<<", so you know when the string ends.
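One way this could look, as a sketch only. It assumes the classic rule that the string ends at a line made up of the label, optionally followed by a semicolon; PHP 7.3 and later also accept an indented closing label, which this does not handle. The sample source and variable names are made up.

```php
<?php
// Sample source held in a nowdoc so the inner heredoc is plain text here.
$source = <<<'SRC'
$s = <<<EOT
some text
EOT;
echo $s;
SRC;

$offset = strpos($source, '<<<');

// Capture the label after <<<, allowing optional quotes around it
// (nowdoc or quoted heredoc), up to the end of that line.
preg_match('/<<<[ \t]*([\'"]?)(\w+)\1\R/', $source, $m, 0, $offset);
$label = $m[2];
$start = $offset + strlen($m[0]);

// Assumed rule: the string ends at a line consisting of the label,
// optionally followed by a semicolon.
$end = '/^' . preg_quote($label, '/') . ';?\r?$/m';
preg_match($end, $source, $e, PREG_OFFSET_CAPTURE, $start);

$body = substr($source, $start, $e[0][1] - $start);
echo $label, ': ', $body;
```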

When scanning a quoted string, the next quote you're interested in must be preceded by an even number of backslashes (possibly zero); if there are an odd number, then the quote character itself is part of the string. That can be done with a pattern along the lines of (\\\\)*' for single-quoted strings (it says "an apostrophe, but only if it's preceded by zero or more repetitions of '\\'"; four backslashes because PCRE uses the backslash as its escape character, so they both need to be escaped).
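A sketch of one way to find that closing quote. Note this is a slightly different technique from counting backslashes in front of the quote: it starts right after the opening quote and steps over each backslash together with the character it escapes, which comes to the same thing as requiring an even number of backslashes before the closing quote. The function name and sample source are hypothetical.

```php
<?php
/**
 * Hypothetical helper: offset of the apostrophe that closes a single-quoted
 * string whose opening quote is at $open, or null if it is unterminated.
 */
function closing_single_quote(string $source, int $open): ?int
{
    // \G anchors the match at the starting offset. From there, repeatedly
    // skip either a backslash plus the character after it, or any character
    // that is neither a backslash nor an apostrophe, then expect the quote.
    $pattern = '/\G(?:\\\\.|[^\\\\\'])*\'/s';
    if (preg_match($pattern, $source, $m, 0, $open + 1) !== 1) {
        return null;
    }
    return $open + strlen($m[0]);
}

// Made-up source: the first string contains an escaped quote and ends
// with an escaped backslash.
$source = <<<'SRC'
echo 'it\'s a \\', 'b';
SRC;

$open  = strpos($source, "'");
$close = closing_single_quote($source, $open);
echo substr($source, $open, $close - $open + 1), "\n";
```

The same idea works for double-quoted strings by swapping the apostrophe for a double quote in the pattern.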