Extracting every URL from a page?

alexcg

Trying to extract every URL from a page, but not having much luck. Can get the page okay, but think there's an issue with my regexes when it comes to matching URLs -- I've tried searching for regex examples but they don't seem to work so well. Code follows:

$url = $_GET["url"];
$ch = curl_init(); // initialize curl handle
curl_setopt($ch, CURLOPT_URL,$url); // set the URL to fetch
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
curl_setopt($ch, CURLOPT_TIMEOUT, 20); // times out after 20s
//This gets the page
$result = curl_exec($ch); // run the whole process

// This is the part that doesn't seem to work
ereg('https?://([-\w.]+)+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?',$result,$eventurl);

When I print_r($eventurl) I get all the data from the first URL to the end of the page, all contained in the first array item.

Can anyone shed any light on this newbie-troubling issue?

Thanks!

NogDog

If you can assume that all urls will be quoted (as attribute values of link or image tags, for example), this seems to work:

<?php
$text = file_get_contents('http://www.phpbuilder.com/');
preg_match_all('#([\'"])(https?://[^\'"]+)\\1#iU', $text, $matches);
// $matches[2] will be an array of (unquoted) URLs:
printf("<pre>%s</pre>", print_r($matches[2], 1));

christo16

Check out this link http://www.merchantos.com/makebeta/php/scraping-links-with-php/
I believe it's what you're looking for.

alexcg

Thanks NogDog, worked perfectly. christo16, that page looks really useful. Checking it out now.

MarkR

I am assuming you aren't at all interested in relative URLs, which are extremely common and require a lot more work to process.

Firstly, you probably need to parse the HTML properly to find them.
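
For the parsing step, something along these lines might do as a rough sketch. It assumes PHP's DOM extension is available, and it only looks at href and src attributes. The function name extract_link_attributes and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collect the raw href/src attribute values from an HTML string.
function extract_link_attributes(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $xpath = new DOMXPath($doc);
    // Any element carrying an href or src attribute (a, link, img, script, ...).
    foreach ($xpath->query('//@href | //@src') as $attr) {
        $urls[] = trim($attr->nodeValue);
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

// Made-up sample markup with quoted, unquoted and relative values.
$html = '<html><body><a href="/blah">x</a><img src=pic.png>'
      . '<a href=\'http://something.fake/blah\'>y</a></body></html>';
$links = extract_link_attributes($html);
print_r($links);
```

Unlike the regex, this also picks up unquoted and relative values, which then still have to be turned into absolute URLs.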

Secondly, calculating a new absolute URL from a previous absolute URL and a relative one is NOT particularly easy. Consider:

http://something.fake/blah
/blah
/
/blah/
#something
/blah#something
../blah
./blah
../.././././blah
something?query
?query
./?query#blah

Etc.

In order to correctly generate every link from a page, your implementation should handle all of those correctly.
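
To give an idea of what is involved, here is a minimal sketch of the resolution steps described in RFC 3986 (section 5.2). It assumes the base is an absolute http(s) URL and does no validation. The names split_uri, remove_dot_segments and resolve_url are made up for this example, they are not built-in PHP functions.

```php
<?php
// Split a URI reference into its parts, delimiters included, using the
// generic regex from RFC 3986 appendix B. Missing parts come back as ''.
function split_uri(string $uri): array
{
    preg_match('~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~s', $uri, $m);
    $m = array_pad($m, 10, '');
    return [
        'scheme'    => $m[1], // e.g. "http:"
        'authority' => $m[3], // e.g. "//something.fake"
        'path'      => $m[5],
        'query'     => $m[6], // e.g. "?query"
        'fragment'  => $m[8], // e.g. "#blah"
    ];
}

// Collapse "." and ".." segments; ".." never climbs above the root.
function remove_dot_segments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $seg) {
        if ($seg === '.' || $seg === '..') {
            $atRoot = count($out) === 1 && $out[0] === '';
            if ($seg === '..' && count($out) > 0 && !$atRoot) {
                array_pop($out);
            }
            if ($i === $last) {
                $out[] = ''; // a trailing dot segment leaves a trailing slash
            }
            continue;
        }
        $out[] = $seg;
    }
    return implode('/', $out);
}

// Resolve $ref against the absolute URL $base.
function resolve_url(string $base, string $ref): string
{
    $b = split_uri($base);
    $r = split_uri($ref);
    if ($r['scheme'] !== '') { // already absolute
        return $r['scheme'] . $r['authority'] . remove_dot_segments($r['path']) . $r['query'] . $r['fragment'];
    }
    if ($r['authority'] !== '') { // protocol-relative, e.g. //host/path
        return $b['scheme'] . $r['authority'] . remove_dot_segments($r['path']) . $r['query'] . $r['fragment'];
    }
    if ($r['path'] === '') { // only a query and/or fragment
        $path = $b['path'];
        $query = $r['query'] !== '' ? $r['query'] : $b['query'];
    } else {
        if ($r['path'][0] === '/') {
            $merged = $r['path'];
        } elseif ($b['authority'] !== '' && $b['path'] === '') {
            $merged = '/' . $r['path'];
        } else { // replace everything after the last slash of the base path
            $pos = strrpos($b['path'], '/');
            $merged = ($pos === false ? '' : substr($b['path'], 0, $pos + 1)) . $r['path'];
        }
        $path = remove_dot_segments($merged);
        $query = $r['query'];
    }
    return $b['scheme'] . $b['authority'] . $path . $query . $r['fragment'];
}

$base = 'http://something.fake/blah';
foreach (['/blah/', '#something', '../.././././blah', 'something?query', '?query', './?query#blah'] as $ref) {
    echo $ref, ' => ', resolve_url($base, $ref), "\n";
}
```

With that base, '?query' comes out as http://something.fake/blah?query and './?query#blah' as http://something.fake/?query#blah. A real crawler would also have to honour a base tag in the page and decide what to do with things like mailto: and javascript: links.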

Mark