Hi Guys.
I am completely confused about when to escape special characters in regular expressions.
I am working my way through a PHP book.
Below is the code I am having trouble understanding:
<body>
<h1>Find Linked URLs in a Web Page</h1>
<?php
displayForm();
if ( isset( $_POST["submitted"] ) ) {
processForm();
}
function displayForm() {
?>
<h2>Enter a URL to scan:</h2>
<form action="" method="post" style="width: 30em;">
<div>
<input type="hidden" name="submitted" value="1" />
<label for="url">URL:</label>
<input type="text" name="url" id="url" value="" />
<label> </label>
<input type="submit" name="submitButton" value="Find Links" />
</div>
</form>
<?php
}
function processForm() {
$url = $_POST["url"];
if ( !preg_match( '|^http(s)?\://|', $url ) ) $url = "http://$url";
$html = file_get_contents( $url );
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
echo '<div style="clear: both;"> </div>';
echo "<h2>Linked URLs found at " . htmlspecialchars( $url ) . ":</h2>";
echo "<ul>";
for ( $i = 0; $i < count( $matches[1] ); $i++ ) {
echo "<li>" . htmlspecialchars( $matches[1][$i] ) . "</li>";
}
echo "</ul>";
}
?>
</body>
The code above reads the contents of the URL submitted in the text field.
It then finds and displays every linked URL on that page.
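To show what I mean, here is a cut-down version I put together myself (it is not from the book). It runs the same pattern on a hard-coded string instead of a downloaded page. The sample HTML is made up purely for this test.

```php
<?php
// Sample HTML made up for this test; the book's code downloads a real page instead.
$html = '<p><a href="http://example.com/one">One</a> and <a href=\'/two.html\' class="x">Two</a></p>';

// Same pattern as in the book's processForm() function.
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

// $matches[1] holds whatever the first set of parentheses captured,
// which here is the text between the quotes after href=.
foreach ( $matches[1] as $link ) {
    echo htmlspecialchars( $link ) . "\n";
}
```

For me this lists http://example.com/one and /two.html, one per line.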
The thing is, I understand how it works, but I cannot get my head around why certain special characters are escaped on some occasions in my book and not on others.
To be more specific, this is the snippet I am having the most trouble with:
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
You see, in the above snippet the '<' character is not escaped, and neither is the equals sign (=). I don't understand why, because up until now I have got used to escaping these characters.
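As a quick experiment of my own (the test string is made up, not from the book), I tried the pattern both ways. Both versions seem to match, so the backslashes appear to be optional for these two characters, but I would like to know the actual rule.

```php
<?php
// Made-up test string, not from the book.
$subject = '<a href="page.html">';

// Pattern as the book writes it: '<' and '=' are left unescaped.
$unescaped = preg_match( '/<a\s*href=/i', $subject );

// Pattern the way I have been writing it: '<' and '=' escaped with a backslash.
$escaped = preg_match( '/\<a\s*href\=/i', $subject );

// preg_match() returns 1 when the pattern matches and 0 when it does not.
var_dump( $unescaped, $escaped );
```

Both calls return 1 when I run this.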
Also, another thing the book does not explain is:
['\"]
I understand what the above means: 'a single quote or a double quote'. But I always thought the vertical bar (|) was what communicated 'or'.
Also, the book does not explain what the square brackets are for.
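In case it helps, this is the small comparison I tried (my own test strings and variable names, not from the book). As far as I can tell the square brackets and the vertical bar give the same result here, which is why I am unsure when to use which.

```php
<?php
// Made-up test strings, not from the book: one per quote style.
$subjects = array( "href='a.html'", 'href="b.html"' );
$results  = array();

foreach ( $subjects as $subject ) {
    // Square-bracket form, as in the book.
    $withBrackets = preg_match( "/href=['\"]/", $subject );

    // Vertical-bar form, which is what I would have written.
    $withBar = preg_match( "/href=('|\")/", $subject );

    // Record whether both forms matched this string.
    $results[] = ( $withBrackets === 1 && $withBar === 1 );
}

var_dump( $results );
```

Both entries come out as true for me, so the two forms match the same strings in this test.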
Can someone explain these things I am having trouble with?
Thanks.
Paul.