RegEx to find URLs

Derokorian · Dec 2, 2011

Ok guys, so this is my pattern for finding URLs:

#(https?://)?([a-z][a-z0-9-]*\.)+([a-z]{2,6})(/[^\s]*)?#is

It works on every test I've written, but can you see cases where it will find false positives or fail to find valid URLs? I'm less concerned about the second, but I'd like to hear your thoughts on both. Below is the code I used to test and develop the pattern.

<?php

$tests = array(
		'google.com' => 'Garbage found on Google.com there is a lot more!',
		'www.google.com'=>'some www.google.com text',
		'http://www.google.com'=>'some http://www.google.com text',
		'http://www.google.com/?q=some+search'=>'random http://www.google.com/?q=some+search text',
		'http://www.experts-exchange.com/Database/MySQL/Q_27471906.html?cid=1572#a37215895'=>'this test contains a very long url http://www.experts-exchange.com/Database/MySQL/Q_27471906.html?cid=1572#a37215895 and a bunch of text',
		'mail.cs.michigan.edu'=>'multiple subdomain mail.cs.michigan.edu will it work?',
		'http://mail.cs.michigan.edu'=>'protocol with http://mail.cs.michigan.edu multiple subdomain',
		'www.mech.eng.school.edu'=>'some text www.mech.eng.school.edu is here',
		'http://www.mech.eng.school.edu'=>'some text http://www.mech.eng.school.edu',
		'http://phpbuilder.com/board/newthread.php?do=newthread&f=24'=>'some text http://phpbuilder.com/board/newthread.php?do=newthread&f=24 this won\'t work',
	);
$tests['all'] = implode(' ',$tests);


$LINKPattern = '#'   // OPENING DELIMITER
   .'('              // BEGIN PROTOCOL PATTERN
   .'http'           // PROTOCOL BEGINS WITH HTTP
   .'s?'             // PROTOCOL HAS AN OPTIONAL S
   .'://'            // PROTOCOL ENDS WITH ://
   .')?'             // PROTOCOL IS OPTIONAL
   .'('              // BEGIN SUBDOMAIN PATTERN
   .'[a-z]'          // SUBDOMAINS MUST BEGIN WITH A LETTER
   .'[a-z0-9-]*'     // SUBDOMAINS MAY THEN CONTAIN 0 OR MORE LETTERS, NUMBERS OR DASH
   .'\.'             // SUBDOMAINS MUST BE FOLLOWED BY A LITERAL DOT
   .')+'             // ONE OR MORE SUBDOMAINS
   .'('              // BEGIN TLD PATTERN
   .'[a-z]{2,6}'     // TLD MUST BE BETWEEN 2 AND 6 LETTERS
   .')'              // END TLD PATTERN
   .'('              // BEGIN DIRECTORY/FILE/QUERY STRING PATTERN
   .'/'              // DIRECTORY/FILE/QUERY STRING STARTS WITH A SLASH
   .'[^\s]*'         // DIRECTORY/FILE/QUERY STRING MAY CONTAIN ANYTHING BUT WHITESPACE CHARACTERS
   .')?'             // DIRECTORY/FILE/QUERY STRING IS OPTIONAL
   .'#is';           // CLOSING DELIMITER

echo '<pre>';
echo 'Pattern: '.$LINKPattern.PHP_EOL.PHP_EOL;

foreach( $tests as $ans => $orig ) {
	echo 'Original: '.$orig.PHP_EOL;
	echo 'Expected: '.$ans.PHP_EOL;
	if( preg_match_all($LINKPattern,$orig,$matches) ) {
		echo 'Match(es): '.PHP_EOL;
		foreach( $matches[0] as $k => $v ) {
			echo "\tTLD: ".$matches[3][$k]."\tURL: ".$v.PHP_EOL;
		}
		echo PHP_EOL;
	} else {
		echo 'Matches: None'.PHP_EOL.PHP_EOL;
	}
}

echo '</pre>';
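
As a rough illustration of the false-positive question, the sketch below runs the same pattern over a few strings that are not meant to be links. The helper name find_urls and the sample strings are made up for this example and are not part of the test script above.

```php
<?php
// Same pattern as in the post, wrapped in a hypothetical helper.
function find_urls($text)
{
    $pattern = '#(https?://)?([a-z][a-z0-9-]*\.)+([a-z]{2,6})(/[^\s]*)?#is';
    preg_match_all($pattern, $text, $m);
    return $m[0]; // full matches only
}

$samples = array(
    'mail me at user@example.com',   // domain part of an e-mail address
    'open readme.txt first',         // file name shaped like host.tld
    'see http://example.com/page.',  // sentence-ending period after the URL
);

foreach ($samples as $text) {
    echo $text . ' => ' . implode(', ', find_urls($text)) . PHP_EOL;
}
```

With these inputs the pattern reports example.com, readme.txt and http://example.com/page. (trailing period included), so anything of the form word.letters is picked up and the path part swallows closing punctuation.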

Weedpacket · Dec 2, 2011

RFC 3986 offers the regex ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? which on its own would no doubt be far too broad for your purposes, but it could be used to isolate things that look like URLs; those can be passed to parse_url to (obviously) parse them into things like scheme and authority and whatnot, which can be examined independently for suitability.
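
One way this two-step idea might look in practice is sketched below. It assumes candidates are whitespace-separated tokens and that only http and https links with a host are wanted; the helper name extract_http_urls and those acceptance rules are assumptions for illustration, not something the RFC prescribes.

```php
<?php
// RFC 3986, Appendix B: splits a URI reference into its components.
$rfc3986 = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

// Hypothetical helper: keep tokens that have a scheme and an authority,
// then let parse_url() break them up so each part can be checked.
function extract_http_urls($text, $rfc3986)
{
    $found = array();
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Groups 2 and 4 are the scheme and the authority in the RFC regex.
        if (!preg_match($rfc3986, $token, $m) || empty($m[2]) || empty($m[4])) {
            continue;
        }
        $parts = parse_url($token);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            continue;
        }
        // Assumed rule: accept only http and https.
        if (in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
            $found[] = $token;
        }
    }
    return $found;
}

print_r(extract_http_urls(
    'random http://www.google.com/?q=some+search text mailto:me@example.com',
    $rfc3986
));
```

Because this version insists on a scheme, bare hosts such as www.google.com are skipped; handling those would need an extra rule on top.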

// TLD MUST BE BETWEEN 2 AND 6 LETTERS

Not any more: TLD registration is open. The longest TLDs might only be six letters now (museum and travel) but there's no guarantee that will remain the case in the future.

Strictly speaking, there are two more TLDs that are six letters long, but they're சிங்கப்பூர் and فلسطين; and if you're going to count them, then you're already into the realm of TLDs that are more than six letters long.
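
A possible adjustment along these lines is sketched below: the TLD part accepts up to 63 letters (the DNS limit for a single label) or the ASCII "xn--" form used for internationalized TLDs. The helper name find_urls_relaxed and the sample domains are illustrative assumptions, and loosening the TLD this way also lets through more non-URLs.

```php
<?php
// Variant of the original pattern with the 2-6 letter TLD limit relaxed.
// Hypothetical helper name; the rest of the pattern is unchanged.
function find_urls_relaxed($text)
{
    $pattern = '#(https?://)?([a-z][a-z0-9-]*\.)+'
             . '(xn--[a-z0-9-]{1,59}|[a-z]{2,63})' // Punycode TLD or plain letters
             . '(/[^\s]*)?#i';
    preg_match_all($pattern, $text, $m);
    return $m[0]; // full matches only
}

$samples = array(
    'visit example.photography today', // TLD longer than six letters
    'see example.xn--p1ai now',        // internationalized TLD in ASCII form
    'some http://www.google.com text', // still matches as before
);

foreach ($samples as $text) {
    echo $text . ' => ' . implode(', ', find_urls_relaxed($text)) . PHP_EOL;
}
```

For comparison, the original {2,6} limit would cut the first sample off at example.photog, since nothing in the pattern requires the TLD to end at a word boundary.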