<?php
$url = "forum url";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
curl_setopt($ch, CURLOPT_TIMEOUT, 3); // times out after 3s
$result = curl_exec($ch); // run the whole process
curl_close($ch);
$file = fopen("og.txt", "w+"); // save the fetched page to a local file
fwrite($file, $result);
fclose($file);
$file = "og.txt"; // re-read the saved copy ($result already holds the page, so this round trip is optional)
if (file_exists($file)) {
$html = file_get_contents($file);
$urls = '(http|file|ftp)'; // schemes to look for
$ltrs = '\w';              // word characters
$gunk = '/#~:.?+=&%@!\-';  // other characters that can appear inside a URL
$punc = '.:?\-';           // punctuation that may trail a URL
$any = "$ltrs$gunk$punc";
preg_match_all("{
\b
$urls :
[$any] +?
(?=
[$punc] *
[^$any]
|
$
)
}x", $html, $matches);
printf("Found %d URLs<P>\n", count($matches[0]));
foreach ($matches[0] as $u) {
echo "<A HREF='$u'>$u</A><BR>\n";
}
}
?>
My goal: read a remote forum page and get the post titles, authors, views, and replies.
So far I've got:
fetch the remote URL, save it as a file, and pull the URLs out of it.
The script above works, but it leaves me wondering what's going on: I get a bunch of URLs, but I don't get the text that goes with each URL.
How can I get the whole HTML between <a and </a>, not just the http:// part?
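One possible approach (a rough sketch, untested against real forum markup): match the whole anchor element and capture both the href and the link text. This assumes og.txt already holds the fetched page, as in the script above:

<?php
// capture the whole <a ...>...</a> element, not just the bare URL
// group 1 = the href value, group 2 = the link text between the tags
$html = file_get_contents("og.txt");
preg_match_all('{<a\s[^>]*href=["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>}is',
    $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // $m[0] is the full element, $m[1] the URL, $m[2] the link text
    echo "<A HREF='{$m[1]}'>{$m[2]}</A><BR>\n";
}
?>

The (.*?) capture keeps any markup inside the link too, so you get everything between the opening <a ...> and the closing </a>.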
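Alternatively, if PHP 5's DOM extension is available, parsing the page avoids the regex edge cases entirely. A sketch, again assuming og.txt holds the saved page; which elements carry the post titles, authors, views, and replies depends on the forum software, so you would check the page source and walk the matching nodes. Here it just pulls every link and its text:

<?php
// parse the saved page with the DOM extension instead of a regex
$dom = new DOMDocument();
@$dom->loadHTMLFile("og.txt"); // @ hides warnings about sloppy forum HTML
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    $text = trim($a->textContent); // the text between <a> and </a>
    echo "<A HREF='$href'>$text</A><BR>\n";
}
?>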