Ok guys, so this is my pattern for finding URLs:
#(https?://)?([a-z][a-z0-9-]*\.)+([a-z]{2,6})(/[^\s]*)?#is
It works perfectly with the tests I've written. However, I'm wondering whether you can see cases where it will produce false positives or fail to find valid URLs. I'm less concerned about the second, but I'd like to hear your thoughts on both. Below is the code I used to test and develop the pattern.
<?php
$tests = array(
'google.com' => 'Garbage found on Google.com there is a lot more!',
'www.google.com'=>'some www.google.com text',
'http://www.google.com'=>'some http://www.google.com text',
'http://www.google.com/?q=some+search'=>'random http://www.google.com/?q=some+search text',
'http://www.experts-exchange.com/Database/MySQL/Q_27471906.html?cid=1572#a37215895'=>'this test contains a very long url http://www.experts-exchange.com/Database/MySQL/Q_27471906.html?cid=1572#a37215895 and a bunch of text',
'mail.cs.michigan.edu'=>'multiple subdomain mail.cs.michigan.edu will it work?',
'http://mail.cs.michigan.edu'=>'protocol with http://mail.cs.michigan.edu multiple subdomain',
'www.mech.eng.school.edu'=>'some text www.mech.eng.school.edu is here',
'http://www.mech.eng.school.edu'=>'some text http://www.mech.eng.school.edu',
'http://phpbuilder.com/board/newthread.php?do=newthread&f=24'=>'some text http://phpbuilder.com/board/newthread.php?do=newthread&f=24 this won\'t work',
);
$tests['all'] = implode(' ',$tests);
$LINKPattern = '#' // OPENING DELIMITER
.'(' // BEGIN PROTOCOL PATTERN
.'http' // PROTOCOL BEGINS WITH HTTP
.'s?' // PROTOCOL HAS AN OPTIONAL S
.'://' // PROTOCOL ENDS WITH ://
.')?' // PROTOCOL IS OPTIONAL
.'(' // BEGIN SUBDOMAIN PATTERN
.'[a-z]' // SUBDOMAINS MUST BEGIN WITH A LETTER
.'[a-z0-9-]*' // SUBDOMAINS MAY THEN CONTAIN 0 OR MORE LETTERS, NUMBERS OR DASH
.'\.' // SUBDOMAINS MUST BE FOLLOWED BY A LITERAL DOT
.')+' // ONE OR MORE SUBDOMAINS
.'(' // BEGIN TLD PATTERN
.'[a-z]{2,6}' // TLD MUST BE BETWEEN 2 AND 6 LETTERS
.')' // END TLD PATTERN
.'(' // BEGIN DIRECTORY/FILE/QUERY STRING PATTERN
.'/' // DIRECTORY/FILE/QUERY STRING STARTS WITH A SLASH
.'[^\s]*' // DIRECTORY/FILE/QUERY STRING MAY CONTAIN ANYTHING BUT WHITESPACE CHARACTERS
.')?' // DIRECTORY/FILE/QUERY STRING IS OPTIONAL
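.'#is'; // CLOSING DELIMITER, THEN FLAGS: i (CASE-INSENSITIVE) AND s (DOTALL, NO EFFECT HERE SINCE THE PATTERN HAS NO UNESCAPED DOT)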
echo '<pre>';
echo 'Pattern: '.$LINKPattern.PHP_EOL.PHP_EOL;
foreach( $tests as $ans => $orig ) {
    echo 'Original: '.$orig.PHP_EOL;
    echo 'Expected: '.$ans.PHP_EOL;
    if( preg_match_all($LINKPattern,$orig,$matches) ) {
        echo 'Match(es): '.PHP_EOL;
        foreach( $matches[0] as $k => $v ) {
            echo "\tTLD: ".$matches[3][$k]."\tURL: ".$v.PHP_EOL;
        }
        echo PHP_EOL;
    } else {
        echo 'Matches: None'.PHP_EOL.PHP_EOL;
    }
}
echo '</pre>';
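
To make the false-positive question more concrete, here is a rough sketch of a few extra probes that could be run against the same pattern. The probe strings and the `firstMatch()` helper are hypothetical additions for illustration, not part of the test set above, and the list is certainly not exhaustive.

```php
<?php
// Same pattern as in the post; only the probe inputs below are assumptions.
$pattern = '#(https?://)?([a-z][a-z0-9-]*\.)+([a-z]{2,6})(/[^\s]*)?#is';

// Hypothetical probe strings, each chosen to poke at one suspected weak spot.
$probes = array(
    'filename'          => 'open readme.txt first',
    'email address'     => 'mail bob@example.com today',
    'trailing period'   => 'see http://example.com/page.',
    'long TLD'          => 'visit example.technology now',
    'digit-first label' => 'go to 4chan.org tonight',
    'port number'       => 'try http://example.com:8080/x ok',
);

// Hypothetical helper: return the first match in $text, or null if there is none.
function firstMatch($pattern, $text) {
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

foreach ($probes as $label => $text) {
    $hit = firstMatch($pattern, $text);
    echo $label . ' => ' . ($hit === null ? '(none)' : $hit) . PHP_EOL;
}
```

Tracing these by hand, I'd expect `readme.txt` and the domain part of the email address to be reported as URLs, the sentence-ending period to be swallowed into the path, `example.technology` to be cut off at `example.techno` by the `{2,6}` limit, `4chan.org` to lose its leading digit, and the port plus path to be dropped from the last one. Whether each of these counts as a problem depends on what the matches are used for.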