Anon wrote: I am trying to get all information between the <head> and </head> tags of a user-submitted URL.
Here's what I have -- this gets the HTTP response header only (where $address is the variable from the submitted form):
$openaddress = eregi_replace("http://", "", $address);
$web = fsockopen("$openaddress", 80);
fputs($web, "GET / HTTP/1.0\n\n");
while (!feof($web)) {
$header = fgets($web, 1024);
if(!strcmp($header, "\n"))
break;
echo $header;
}
fclose($web);
How could this be altered to read the string I want?
What I ultimately want to do is get the title and return it, plus mine the keywords in the meta tag for certain strings.
Thx in advance!
--ph
Seems like this message is quite old, though it may be useful to someone.
The way to get this (modifying the code supplied by Anon above) is to fetch the whole page and then match the <title></title> element in the result, like so:
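$openaddress = preg_replace('#^http://#i', '', $address); // eregi_replace() was removed in PHP 7
$web = fsockopen($openaddress, 80, $errno, $errstr, 10);
if (!$web) {
die("Could not connect: $errstr ($errno)");
}
fputs($web, "GET / HTTP/1.0\r\nHost: $openaddress\r\n\r\n"); // HTTP lines end in CRLF
$buffer = ''; // initialise the buffer before appending to it
while (!feof($web)) {
$buffer .= fgets($web, 1024); // get the whole response and not just the HTTP headers
}
fclose($web);
$title = ''; // stays empty if no title is found
if (preg_match('/<title>([^<]*)<\/title>/i', $buffer, $matches)) { // match the title in the returned data ($buffer)
$title = trim($matches[1]); // save the title to $title
}
// now you can create a link to the page using the title
$href = '<a href="http://'.$openaddress.'">'.htmlspecialchars($title).'</a>'; // $openaddress has no scheme, so http:// is not doubled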
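The same $buffer can also cover the rest of the original question (the <head> section and the meta keywords). What follows is only a rough sketch: extract_head_info() is a made-up helper name, the sample response is invented so the snippet runs without a network connection, and the keywords regex assumes the name attribute comes before content. Real-world pages vary a lot, so treat it as a starting point rather than a robust parser.

```php
<?php
// Hypothetical helper (not from the original post): pulls the <head> section,
// the title and the meta keywords out of a raw response string.
function extract_head_info($buffer)
{
    $info = array('head' => '', 'title' => '', 'keywords' => array());

    // Everything between <head> and </head>; /s lets . span newlines, /i ignores case.
    if (preg_match('/<head(?:\s[^>]*)?>(.*?)<\/head>/is', $buffer, $m)) {
        $info['head'] = $m[1];
    }

    // The title, searched for inside the head section only.
    if (preg_match('/<title(?:\s[^>]*)?>(.*?)<\/title>/is', $info['head'], $m)) {
        $info['title'] = trim($m[1]);
    }

    // Assumption: attributes appear as <meta name="keywords" content="a, b, c">.
    $pattern = '/<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)["\']/i';
    if (preg_match($pattern, $info['head'], $m)) {
        // Split on commas, trim each keyword, drop empty entries, reindex.
        $parts = array_map('trim', explode(',', $m[1]));
        $info['keywords'] = array_values(array_filter($parts, 'strlen'));
    }

    return $info;
}

// Invented sample standing in for what the fsockopen() loop collects in $buffer.
$sample = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
    . "<html><head>\n<title>Example Page</title>\n"
    . "<meta name=\"keywords\" content=\"php, sockets, parsing\">\n"
    . "</head><body>Hello</body></html>";

$info = extract_head_info($sample);
echo $info['title'], "\n";
echo implode('|', $info['keywords']), "\n";
```

To mine the keywords for a certain string, something like in_array('php', $info['keywords']) should be enough once the list has been split out this way.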
Matching against all of the returned data, rather than against each 1024-byte chunk as we read it, makes sure we don't miss the title with our regex when a single read returns only part of it. In this case we would probably be safe, because the whole title should be within the first 1024 bytes, but if we wanted to match something further down the page, we might run into the problem.
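To make the chunk problem concrete, here is a small self-contained illustration. It does not touch the network: the $page string, its padding and the 64-byte chunk size are all made up for the demo, chosen so that the title happens to straddle a chunk boundary.

```php
<?php
// Invented page content: 50 bytes of padding push the title across the
// boundary between the first and second 64-byte chunk.
$page = str_repeat('x', 50) . '<title>My Example Page</title>' . str_repeat('y', 50);
$pattern = '/<title>([^<]*)<\/title>/i';

// Approach 1: test each chunk on its own, as if matching inside the read loop.
$foundPerChunk = false;
foreach (str_split($page, 64) as $chunk) {
    if (preg_match($pattern, $chunk)) {
        $foundPerChunk = true;
    }
}

// Approach 2: append every chunk to a buffer first, then match once.
$buffer = '';
foreach (str_split($page, 64) as $chunk) {
    $buffer .= $chunk;
}
$foundInBuffer = preg_match($pattern, $buffer, $matches) === 1;

var_dump($foundPerChunk);  // bool(false): no single chunk holds the full title
var_dump($foundInBuffer);  // bool(true)
echo $matches[1], "\n";    // My Example Page
```

The first chunk ends in the middle of the title text and the second one starts with the remainder, so only the buffered match sees the opening and closing tags together.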