Can't understand when to escape special characters in regular expressions. Please help!

Paul_help_

Hi Guys.

I am completely confused about when to escape special characters in regular expressions.

I am working my way through a php book.

And below is the code I am having trouble with understanding:

<body>
<h1>Find Linked URLs in a Web Page</h1>
<?php
displayForm();
if ( isset( $_POST["submitted"] ) ) {
  processForm();
}
function displayForm() {
?>
<h2>Enter a URL to scan:</h2>
<form action="" method="post" style="width: 30em;">
<div>
<input type="hidden" name="submitted" value="1" />
<label for="url">URL:</label>
<input type="text" name="url" id="url" value="" />
<label> </label>
<input type="submit" name="submitButton" value="Find Links" />
</div>
</form>
<?php
}
function processForm() {
  $url = $_POST["url"];
  if ( !preg_match( '|^http(s)?\://|', $url ) ) $url = "http://$url";
  $html = file_get_contents( $url );
  preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
  echo '<div style="clear: both;"> </div>';
  echo "<h2>Linked URLs found at " . htmlspecialchars( $url ) . ":</h2>";
  echo "<ul>";
  for ( $i = 0; $i < count( $matches[1] ); $i++ ) {
    echo "<li>" . htmlspecialchars( $matches[1][$i] ) . "</li>";
  }
  echo "</ul>";

}
?>
</body>

The above code reads the contents of the URL submitted in the text field.

It then finds and displays every URL linked from that page.

The thing is, I understand how it works but I cannot get my head around why certain special characters are escaped on certain occasions (whilst reading through my book) and are not escaped on other occasions.

To be more specific, below is the code snippet I am having the main trouble getting my head around:

preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

You see, in the above snippet, the '<' character is not escaped and neither is the equals sign (=), and I don't understand why, because up until now I have been used to escaping these characters.

Also, another thing the book does not explain is:

['\"]

I understand what the above means. It means 'a single quote or a double quote'. But I always thought the vertical bar (|) communicated 'or'.

Also, the book does not explain what the square brackets are for.

Can someone explain these things I am having trouble with?

Thanks.

Paul.

Weedpacket

You can get PHP's description of the regular expression language it uses from the PCRE section of the manual. For example, square brackets are described under Character Classes.

The main complication is that some characters need to be escaped because they have a special meaning in regexps, and some need to be escaped because they have a special meaning in PHP strings. In ['\"], the double quote is escaped because it appears in a double-quoted PHP string; not escaping it would end the string too early and cause a parse error (try it).
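To make the two layers of escaping concrete, here is a minimal sketch that writes the book's pattern once as a double-quoted and once as a single-quoted PHP string. The variable names are only illustrative.

```php
<?php
// The same regex source text, written as two different PHP string literals.
$doubleQuoted = "/<a\s*href=['\"](.+?)['\"].*?>/i"; // \" is there for PHP, not for the regex
$singleQuoted = '/<a\s*href=[\'"](.+?)[\'"].*?>/i'; // now \' is there for PHP instead

// Once PHP has parsed the literals, both variables hold identical pattern text.
var_dump( $doubleQuoted === $singleQuoted ); // bool(true)

// This is what the regex engine actually receives: no backslash before either quote.
echo $doubleQuoted, "\n";
```

The backslash before the quote never reaches the regex engine; PHP consumes it while parsing the string literal.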

The thing to remember is that the regexp language is a language in its own right; it's not PHP, and it's not HTML. For example, < and = don't mean anything in regular expressions.

NogDog

Weedpacket;11062575 wrote:
...For example, < and = don't mean anything in regular expressions.

Unless you use either as the regex delimiter. 😉
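A small sketch of the delimiter point, assuming = is picked as the delimiter purely for demonstration (it is an unusual choice in practice):

```php
<?php
$subject = 'a=b';

// With the usual / delimiter, = is an ordinary character.
$withSlash = preg_match( '/a=b/', $subject );

// With = as the delimiter, a literal = inside the pattern must be escaped,
// otherwise PHP would take it as the end of the pattern.
$withEquals = preg_match( '=a\=b=', $subject );

// preg_quote() can escape arbitrary text for use inside a pattern;
// its optional second argument names the delimiter in use.
$quoted = preg_quote( 'a=b', '=' );

var_dump( $withSlash, $withEquals, $quoted );
```

Choosing a delimiter that does not occur in the pattern avoids the extra escaping altogether.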

Paul_help_

Thanks guys.

But the book I am reading explains how '=' and '<' have special meanings.

There are 19 characters in total that have special meanings when working with regular expressions.

And it explains how they need to be escaped when used literally.

So I can't understand why these characters are not escaped in the code below:

preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

Weedpacket

Paul help! wrote:
But the book I am reading explains how '=' and '<' have special meanings.

How? They have special meanings in HTML (one links attribute names and values, the other opens a tag), but we're not talking about HTML here.

Regexps as they appear in PHP have fourteen/fifteen special characters (listed, as ever, in the manual). The fact they appear inside PHP strings adds another one or two on top (' or " and $).
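As a rough illustration of why $ counts as a PHP-string complication, the sketch below uses a hypothetical variable $bar that exists only to show interpolation:

```php
<?php
$bar = 'XYZ'; // hypothetical variable, only here to demonstrate interpolation

$interpolated  = "/foo$bar/";  // PHP substitutes $bar before the regex engine sees anything
$escapedForPhp = "/foo\$bar/"; // \$ keeps a real dollar sign in the string
$singleQuoted  = '/foo$bar/';  // single-quoted strings never interpolate

var_dump( $interpolated );                    // string(8) "/fooXYZ/"
var_dump( $escapedForPhp === $singleQuoted ); // bool(true)
```

This only shows what text ends up in the string. Whether the $ then needs a second, regex-level escape depends on whether you want it to mean end of subject or a literal dollar sign.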

Paul_help_

Thanks Weedpacket.

So this must mean that the author of the book I have been reading made a mistake when he said there are 19 special characters in regular expressions?

Here is an extract from the book (so you can see what I am referring to):

Matching Literal Characters

The simplest form of regular expression pattern is a literal string. In this situation, the string stored in the pattern matches the same string of characters in the target string, with no additional rules applied.

As you've already seen, alphabetical words such as "hello" are treated as literal strings in regular expressions. The string "hello" in a regular expression matches the text "hello" in the target string. Similarly, many other characters — such as digits, spaces, single and double quotes, and the %, &, @, and # symbols — are treated literally by the regular expression engine.

However, as you see later, some characters have special meanings within regular expressions. These
nineteen special characters are:

. \ + * ? [ ^ ] $ ( ) { } = ! < > | :

If you want to include any character from this list literally within your expression, you need to escape it
by placing a backslash ( \ ) in front of it, like so:

echo preg_match( "/love\?/", "What time is love?" ); // Displays "1"

Can you tell me how accurate the above extract is?

Weedpacket

<, >, =, and ! do appear in assertions; < and > can be used when naming subpatterns and referring to them (and so can '); and : is used to save recording unnecessary partial matches or when using POSIX named character classes.

You might possibly want to have a subpattern that matches ?! (maybe the idea is to replace it with ‽). Since (?! looks like the start of an assertion, it would need to be escaped. But the ? would need to be escaped anyway, so even without taking assertions into account, the subpattern would have to start (\?! and there is no reason to escape the exclamation mark.

The same goes for the other spurious characters in that list: any time one of them appears, it's always part of a sequence that also includes one of the regexp metacharacters, so that is the character that needs escaping.
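A quick way to check this for yourself is sketched below; the sample strings are made up for the purpose.

```php
<?php
// Outside of (?...) constructs, these five characters are plain literals.
$literal = preg_match( '/a=b!<c>:d/', 'a=b!<c>:d' );

// Escaping them anyway does no harm: PCRE reads a backslash in front of a
// non-alphanumeric character as "take this character literally".
$escaped = preg_match( '/a\=b\!\<c\>\:d/', 'a=b!<c>:d' );

// A group matching the two characters "?!". Only the ? needs the backslash.
$replaced = preg_replace( '/(\?!)/', '‽', 'You did what?!' );

var_dump( $literal, $escaped, $replaced );
```

So the book's habit of escaping all nineteen is safe, just more cautious than the regex engine requires.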

What does the book say is so special about them? It does promise that you'll see later.

Paul help! wrote:
Can you tell me how accurate the above extract is?

Well, can you compare it with what the manual says? The manual is more likely to be accurate.

Oh, and the author of the book missed that when the EXTENDED option is turned on, # is the comment character.

Weedpacket

NogDog;11062577 wrote:
Unless you use either as the regex delimiter. 😉

Well, yeah; but you can use almost anything as the delimiter. I'm reluctant to count them as part of the regexp (like the way strlen("foo") is 3, not 5; the delimiters aren't counted as part of the string).

Paul_help_

Thanks.

When you say:

Oh, and the author of the book missed that when the EXTENDED option is turned on, # is the comment character.

...what is the extended option?

And what exactly are you referring to when you say "comment character"?

dalecosp

An extended regular expression is something like:

$regex = "@\w(.+)
# that was the first thing, now do this other thing
\s[A-Za-z0-9]
#another comment here
@";

The gist being that the regexp extends over multiple lines of text and contains commentary delineated by the "#" symbol. This commentary is not part of the regular expression itself.

Paul_help_

Thanks.

But what does weedpacket mean when he says:

Oh, and the author of the book missed that when the EXTENDED option is turned on, # is the comment character.

...with regard to this code:

preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

NogDog

It's only "extended" if you add the "x" modifier after the closing regex delimiter, e.g.:

$regex = '/(foo|bar)/x';

http://php.net/manual/en/reference.pcre.pattern.modifiers.php :

x (PCRE_EXTENDED)

If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
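As a sketch of what that looks like in practice, here is one pattern written compactly and then spread out with comments under the x modifier. The date-like pattern and sample text are invented for the example.

```php
<?php
// The same pattern, first compact and then laid out under the x modifier.
$compact  = '/(\d{4})-(\d{2})/';
$extended = '/
    (\d{4})   # four-digit year
    -         # literal hyphen
    (\d{2})   # two-digit month
/x';

preg_match( $compact,  'Posted 2013-07', $a );
preg_match( $extended, 'Posted 2013-07', $b );
var_dump( $a === $b ); // bool(true)

// With x set, a literal # or a literal space has to be escaped.
$hash = preg_match( '/item\ \#(\d+)/x', 'item #42', $c );
var_dump( $c[1] ); // string(2) "42"
```

Without the trailing x, the whitespace and the # comments would be treated as literal text to match, so the multi-line version would no longer find anything in the sample string.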

NogDog

PS: Just to recap from your original post:

preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
                 12        232 43     32 43   21

1 - regex delimiter
2 - no special meaning (at least in this context)
3 - delimiters of a character class (http://php.net/manual/en/regexp.reference.character-classes.php)
4 - escaped for PHP string quoting reasons, not for regex reasons (if you had used single quotes around the entire regex string, then the single quote would be the one that needs escaping instead)
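To tie the recap back to the earlier question about | versus square brackets, here is a short sketch run against a made-up HTML fragment. It shows the character class and an equivalent alternation finding the same links.

```php
<?php
// Sample markup invented for this example.
$html = '<p><a href="one.html">One</a> and <A HREF=\'two.html\' class="x">Two</a></p>';

// Character class: a single character that is either ' or ".
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $withClass );

// The same thing spelled with alternation; | works too, it is just wordier.
preg_match_all( "/<a\s*href=(?:'|\")(.+?)(?:'|\").*?>/i", $html, $withBar );

print_r( $withClass[1] );                    // one.html, two.html
var_dump( $withClass[1] === $withBar[1] );   // bool(true)
```

A character class is usually the tidier choice when every alternative is a single character; alternation earns its keep when the alternatives are longer sequences.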