fsockopen() to get <head>?

Anon

I am trying to get all information between the <head> and </head> tags of a user-submitted URL.

Here's what I have -- this gets the HTTP response header only (where $address is the variable from the submitted form):

$openaddress = eregi_replace("http://", "", $address);
$web = fsockopen("$openaddress", 80);
fputs($web, "GET / HTTP/1.0\n\n");
while (!feof($web)) {
$header = fgets($web, 1024);
if(!strcmp($header, "\n"))
break;
echo $header;
}
fclose($web);

How could this be altered to read the string I want?

What I ultimately want to do is get the title and return it, plus mine the keywords in the meta tag for certain strings.

Thx in advance!
--ph

Anon

Peter, the HTTP headers and the HTML <head> are two different things. Reading only the HTTP headers won't get you the information you want. Instead, you will need to get the entire file and then use regular expressions to grab the title, if it exists.

Here is a brute-force example. Somebody with a better handle on regular expressions might offer a superior alternative.

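<?php
$foo = file($openaddress);   // read the remote page into an array of lines
$foo = implode("", $foo);    // join the lines into one string
$foo = preg_replace('/.*<title>/is', "", $foo);    // drop everything up to and including <title>
$foo = preg_replace('/<\/title>.*/is', "", $foo);  // drop </title> and everything after it
echo $foo;
?>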

If $openaddress is coming from a user submission, you should check to see if it includes "http://" and if not, prepend it.
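
A minimal sketch of that check, assuming a hypothetical helper named normalize_url() (the name and the decision to accept https:// as well are mine, not part of the original code):

```php
<?php
// Hypothetical helper: make sure a user-submitted address carries a scheme.
function normalize_url(string $address): string
{
    $address = trim($address);
    // Leave addresses that already start with http:// or https:// alone.
    if (preg_match('#^https?://#i', $address)) {
        return $address;
    }
    // Otherwise assume plain HTTP and prepend the scheme.
    return 'http://' . $address;
}

echo normalize_url('www.example.com'), "\n";        // http://www.example.com
echo normalize_url('http://www.example.com'), "\n"; // http://www.example.com
```

The result could then be passed to file() or fopen() in place of the raw form value.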

Anon

Thanks for your help, Steve. After screwing around with this for a while, I have this:

$web = fopen("$address", "r");
do {
$title = fgets($web, 10);
$title = eregi_replace(".*<title>","",$title);
$title = eregi_replace("</title>.*","",$title);
echo $title;
} while (!feof($web));
fclose($web);

and it prints the title alright, but below it displays the ENTIRE site, which I do not want to do. How would you alter this to only print the string?

Anon

Well, let's walk through your code.

First you open a virtual file/connection to the remote Web server. OK so far, although you didn't check the return value to ensure that it did not fail.

Then you have a do-while loop that continues until the end of the file. It pulls in the remote file in tiny pieces: fgets($web, 10) returns at most 9 bytes per call, and fewer if it reaches a newline. For each piece, it looks for the beginning and end of <title>...</title> and deletes anything before or after that target. The contents of the buffer are then printed.

Then, after the loop terminates, the file is closed.

The problem is in your do/while loop. There are two cases where it will fail.

1. The <title>...</title> pair may not, and in fact probably will not, fall neatly inside a single chunk.
2. After the <title>...</title> portion of the file, you continue to read chunks and look for pieces to delete. Since those pieces are not found, nothing is deleted, and you are going to print out a whole lot of HTML.

I'm not sure why you are reading the file in little 10-byte chunks. I would just pull the whole file into memory and work on it.
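
To illustrate the first failure case, here is a small sketch that runs without a network connection. The sample HTML and the find_title() helper are made up for the example; str_split() stands in for the repeated fgets($web, 10) calls, which return at most 9 bytes each.

```php
<?php
$html = "<html><head><title>Example Domain</title></head><body>Hello</body></html>";

// Hypothetical helper: return the title found in $text, or null if there is none.
function find_title(string $text): ?string
{
    return preg_match('/<title>(.*?)<\/title>/is', $text, $m) ? trim($m[1]) : null;
}

// Chunk by chunk: the <title>...</title> pair cannot fit inside a 9-byte piece,
// so no single chunk ever matches.
$found_in_chunks = null;
foreach (str_split($html, 9) as $chunk) {
    $found_in_chunks = $found_in_chunks ?? find_title($chunk);
}

// Whole document at once: the match succeeds.
$found_in_whole = find_title($html);

var_dump($found_in_chunks); // NULL
var_dump($found_in_whole);  // string(14) "Example Domain"
```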

Anon

I really appreciate your help here, Steve. I have been working on this and actually included a couple of your suggestions before I saw this reply. The code below is all acceptable -- the only thing I need more help with is the $title var -- I need to insert it into the database to display as the link...it does not seem to be available later (see the last 4 lines), and I don't know why since I am not in a user-defined function.

$full_path = getenv("REQUEST_URI");
$path = dirname($full_path);
$section = ereg_replace("/sections/", "", $path);
// yet to do ereg to make sure the first 4 chars start with http...
$web = @fopen("$address", "r");
if(!($web=@fopen("$address", "r")))
{
echo "Sorry, the site you suggested was not found. Either you did not include the \"http://\", or the site is temporarily out of service. <input type=\"button\" name=\"submit\" value=\"<< Go Back and Check Entry\" onClick=\"window.history.go(-1)\">";
exit;
}
echo "The site you have suggested is displayed below. The link to your suggestion will be displayed in the <a href=\"http://www.domain.com/sections/$section\">$section</a> section and the title will be: ";
do {
$title = @fgets($web, 1024);
$title = eregi_replace(".*<title>","",$title);
$title = eregi_replace("</title>.*","",$title);
echo $title;
} while (!feof($web));
fclose($web);
echo "";
// this will not display!!!
if (isset($title)) {
echo " Valid URL - here is the site you suggested: ";
}

Sorry to keep bugging you about this, I'm sure I'll "get it" soon. One last hurdle. Thanks again!

--ph(p)

Anon

The problem continues to be your do-while loop. You don't need it, and it is overwriting the contents of $title after you successfully find the title. If you don't want to simply read the entire remote HTML into a variable, you could guess that the title is likely to be in the first 1024 characters and stop there.

At any rate, if you remove the

do{

and the

}while (!feof($web));

the code you have should work just fine.
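
One caveat: fgets() stops at the first newline, so a single fgets() call only sees the first line of the page. If the title sits on a later line, fread() is the safer way to grab the first 1024 bytes. A rough sketch, assuming a hypothetical helper title_from_stream() and using a php://memory stream as a stand-in for the remote page:

```php
<?php
// Hypothetical helper: read at most $limit bytes from an open stream and
// return the <title> text found there, or null if none is found.
function title_from_stream($stream, int $limit = 1024): ?string
{
    $chunk = '';
    // fread() on a network stream may return fewer bytes than requested, so loop.
    while (strlen($chunk) < $limit && !feof($stream)) {
        $piece = fread($stream, $limit - strlen($chunk));
        if ($piece === false || $piece === '') {
            break;
        }
        $chunk .= $piece;
    }
    return preg_match('/<title>(.*?)<\/title>/is', $chunk, $m) ? trim($m[1]) : null;
}

// Stand-in for the remote page so the sketch runs without a network connection.
$web = fopen('php://memory', 'w+');
fwrite($web, "<html>\n<head>\n<title>Example Domain</title>\n</head>\n<body>...</body></html>");
rewind($web);

$title = title_from_stream($web);
fclose($web);

if ($title !== null) {
    echo "Valid URL - the title is: $title\n";
}
```

Because the helper returns the title instead of echoing inside a loop, $title is still set afterwards and can be inserted into the database.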

bucabay

Seems like this message is quite old, though it may be useful to someone.

The way to get this (modifying the code supplied by Anon above) would be to just get the whole page and then match the <title></title> element from the result, like so:


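$openaddress = preg_replace('#^http://#i', '', $address); // fsockopen() wants a bare host name
$web = fsockopen($openaddress, 80);
if (!$web) {
  exit("Could not connect to $openaddress");
}
fputs($web, "GET / HTTP/1.0\r\nHost: $openaddress\r\n\r\n"); // HTTP line endings are CRLF
$buffer = '';
while (!feof($web)) {
  $buffer .= fgets($web, 1024); // get the whole response and not just the HTTP headers
}
fclose($web);

// match the title in the returned data ($buffer); the i flag also accepts <TITLE>
$title = preg_match('/<title>([^<]*)<\/title>/i', $buffer, $matches) ? trim($matches[1]) : '';

// now you can create a link to the page using the title
// ($openaddress has no scheme, so prepending "http://" does not double it)
$href = '<a href="http://'.$openaddress.'">'.$title.'</a>';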

Matching against all the returned data, rather than against each 1024-byte read, makes sure we don't miss the title when a single read returns only part of it. In this case we would probably be safe, because the whole title should fall within the first 1024 bytes, but if we wanted to match something further down the page, we might run into that problem.
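
Note that $buffer holds the whole HTTP response, headers included. For the original goal (the title plus the keywords in the meta tag), a possible next step is to split off the headers and search only the <head> element. The sketch below uses a made-up sample response and a hypothetical helper head_info(); the meta regex is deliberately simplistic and assumes name= comes before content=.

```php
<?php
// Sample raw HTTP response standing in for $buffer (made up for this sketch).
$buffer = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
        . "<html><head><title>Example Domain</title>"
        . "<meta name=\"keywords\" content=\"php, sockets, html\">"
        . "</head><body>Hello</body></html>";

// The HTTP headers end at the first blank line; everything after it is the HTML.
$parts = explode("\r\n\r\n", $buffer, 2);
$body  = $parts[1] ?? '';

// Hypothetical helper: pull the title and the keywords list out of the HTML.
function head_info(string $html): array
{
    $info = ['title' => null, 'keywords' => []];
    // Restrict the search to the <head> element when there is one.
    $head = preg_match('/<head\b[^>]*>(.*?)<\/head>/is', $html, $m) ? $m[1] : $html;
    if (preg_match('/<title>(.*?)<\/title>/is', $head, $m)) {
        $info['title'] = trim($m[1]);
    }
    // Simplistic: assumes name= comes before content= inside the meta tag.
    if (preg_match('/<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)["\']/i', $head, $m)) {
        $info['keywords'] = array_map('trim', explode(',', $m[1]));
    }
    return $info;
}

$info = head_info($body);
echo $info['title'], "\n";                  // Example Domain
echo implode('|', $info['keywords']), "\n"; // php|sockets|html
```

PHP also ships get_meta_tags(), which reads the meta tags straight from a file or URL and may be simpler when you do not need the raw socket.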