Here are some routines I used in my spider:
function HasTrailingSlash(& $str)
{
if ($str == '') { return false; }
return ($str[strlen($str) - 1] == '/');
}
function FixPath(& $path)
{
$bits = explode('/', $path);
$newbits = array();
foreach ($bits as $bit) {
// Ignore empty bits, and '.'
if (($bit == '') || ($bit =='.')) {
continue;
}
// If we see '..', chop the previous bit off.
if ($bit == '..') {
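// array_pop() always removes the real last element. unset() on
// index count()-1 misses once an earlier unset has left a gap in the keys
// (e.g. 'a/b/../c/../d'). array_pop() on an empty array is a no-op.
array_pop($newbits);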
} else {
// Otherwise, add it.
$newbits[] = $bit;
}
}
// Stick them back together
$path = '/' . implode('/', $newbits);
}
// Does the inverse of parse_url
// Ignores username and password.
// Ignores fragment.
function ConstructUrl($bits)
{
// Check we actually have a scheme...
assert(isset($bits['scheme']));
$url = $bits['scheme'] . '://';
if (isset($bits['host'])) {
$url = $url . $bits['host'];
}
if (isset($bits['port'])) {
$url = $url . ':' . $bits['port'];
}
// Add path (mandatory).
if (! isset($bits['path'])) {
$bits['path'] = '/';
}
FixPath($bits['path']);
$url = $url . $bits['path'];
if (isset($bits['query'])) {
$url = $url . '?' . $bits['query'];
}
return $url;
}
/*
* This is a work-around for the fact that parse_url fails
* for URLs like
* /blah?add=http://www.example.com
*
*/
function MyParseUrl($url)
{
$qpos = strpos($url, '?');
if ($qpos !== FALSE) {
$baseurl = substr($url, 0, $qpos);
$bits = parse_url($baseurl);
$bits['query'] = substr($url, $qpos+1);
return $bits;
} else {
return parse_url($url);
}
}
function IsAbsoluteUrl($url)
{
return (preg_match('/^[a-z]+:/i', $url));
}
function FindRelativeUrl($url, $relative)
{
// If it's an absolute URL already, return it unchanged.
if (IsAbsoluteUrl($relative)) {
return $relative;
}
// chop the filename off the original url
$original_url_bits = parse_url($url);
if (! isset($original_url_bits['path'])) {
$original_url_bits['path'] = '';
}
// If path *does not* end in a /, use dirname on it.
$path = $original_url_bits['path'];
if (! HasTrailingSlash($path)) {
$path = dirname($path);
// If we don't have a trailing slash, add one.
if (! HasTrailingSlash($path)) {
$path = $path . '/';
}
}
try {
$rel_url_bits = MyParseUrl($relative);
} catch (Exception $e) {
echo "WARNING: found a duff relative URL that we can't parse.\n";
echo "Relative URL:$relative Linked from:$url\n";
return "malformed:";
}
// If it has no path at all, then it's something like a bare query string,
// an anchor, or invalid. Reuse the path of the original URL.
if (! isset($rel_url_bits['path']) || ($rel_url_bits['path'] == '')) {
$rel_url_bits['path'] = $original_url_bits['path'];
}
// If it has an absolute path, use that
$new_url_bits = $original_url_bits;
if (substr($rel_url_bits['path'],0,1) == '/') {
$new_url_bits['path'] = $rel_url_bits['path'];
} else {
// Otherwise, stick it on to the original path.
$new_url_bits['path'] = $path . $rel_url_bits['path'];
}
// Blank the previous query string
unset($new_url_bits['query']);
// If set, use the new one.
if (isset($rel_url_bits['query'])) {
$new_url_bits['query'] = $rel_url_bits['query'];
}
return ConstructUrl($new_url_bits);
}
function StrStartsWith(& $str, $start)
{
return (substr($str, 0, strlen($start)) == $start);
}
function IsValidUrl($url)
{
try {
// MAX_URL_LENGTH is a constant defined elsewhere in the application.
if (strlen($url) > MAX_URL_LENGTH) {
return false;
}
$bits = MyParseUrl($url);
// Check host for validity.
if (! isset($bits['host'])) {
return false;
}
$host = $bits['host'];
// NOTE: we require at least one dot in the hostname.
return preg_match('/^[a-z0-9\.\-]+\.[a-z]+$/i', $host);
} catch (Exception $e) {
echo "Really broken URL has caused an exception in IsValidUrl\n";
echo $e . "\n";
return false;
}
}
function CanonicaliseUrl($url)
{
try {
$bits = MyParseUrl($url);
if (isset($bits['host'])) {
$bits['host'] = strtolower($bits['host']);
}
if (! isset($bits['path'])) {
$bits['path'] = '/';
}
return ConstructUrl($bits);
} catch (Exception $e) {
echo "Cannot canonicalise URL: $url\n";
return FALSE;
}
}
I want to stress that this is a VERY SMALL part of a VERY COMPLICATED application. The above contains one of the key routines, FindRelativeUrl(), which takes a base URL and a relative URL and returns the resolved URL. This is not as easy as you may think, as there are a lot of strange cases: '.' and '..' segments, links that are only a query string, absolute paths, and query strings that themselves contain URLs.
Here are some test cases for the above:
function TestUtils()
{
echo FindRelativeUrl("http://www.example.com/testing/qwer", 'asdf') . "\n";
echo FindRelativeUrl("http://www.example.com/testing/qwer", '/asdf') . "\n";
echo FindRelativeUrl("http://www.example.com/testing/qwer", 'http://blah.example.com') . "\n";
echo FindRelativeUrl("http://www.example.com/testing/qwer", '../asdf') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", '././/../e') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", 'test.html?x=42') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", '?x=99') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", '/blah?add=http://www.somewhere.com/') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", '?add=http://www.example.com/blah') . "\n";
echo FindRelativeUrl("http://www.example.com/a/b/c/d", 'javascript://') . "\n";
echo "\n\n";
}
All of which it handled correctly last time I checked.
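To make the '.' and '..' handling easier to check on its own, here is a minimal standalone sketch of the same idea as FixPath(). The name NormalisePath() is made up for this example and is not part of the spider. It returns the cleaned path instead of modifying its argument, and the expected results in the comments are what I would expect from reading the code, not output copied from the real application.

```php
<?php
// Hypothetical helper for illustration only (not part of the spider).
// Collapses empty segments, '.' and '..' in a URL path.
function NormalisePath($path)
{
    $out = array();
    foreach (explode('/', $path) as $bit) {
        // Ignore empty bits (from '//' or a leading '/') and '.'
        if ($bit === '' || $bit === '.') {
            continue;
        }
        if ($bit === '..') {
            // Drop the previous segment, if there is one.
            array_pop($out);
        } else {
            $out[] = $bit;
        }
    }
    // The result is always rooted at '/'.
    return '/' . implode('/', $out);
}

echo NormalisePath('/a/b/c/././/../e'), "\n"; // /a/b/e
echo NormalisePath('/testing/../asdf'), "\n"; // /asdf
echo NormalisePath('/a/b/../c/../d'), "\n";   // /a/d
echo NormalisePath('/../x'), "\n";            // /x
```

The third case is the interesting one: two '..' segments separated by a real segment. A '..' above the root is simply dropped, which matches what most browsers do.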
Mark