Reading a url redirect, parsing remote web pages.

heygrady · Feb 23, 2004

I am writing a script that reads a remote webpage, parses the links and displays the page with altered links. The idea of the script is to allow my users to surf the web from my website while I record what they do as part of a strange game I have concocted.

The problem is this: I can get the page to load from a remote source with altered links, but once I follow a link I have no idea how to find out where I am. This is especially true if the link I followed is a redirected URL, as is common on the Yahoo homepage.

So, if I click on a link like www.yahoo.com/s/43481, it will take me to sports.yahoo.com/ncaab. Is there a way in PHP to determine that www.yahoo.com/s/43481 redirects to sports.yahoo.com/ncaab?

I need to be able to follow a redirect and return the destination programmatically using PHP. I need to write a function that works like this:

function followRedirect ($given_url){
[some php magic]
return $redirected_url;
}

Can anyone help me??

softsolvers · Feb 23, 2004

You can use JavaScript to redirect by setting window.location from your PHP code,
like window.location="Your path"
Hope this works.
Otherwise, send me more of your requirements.

AstroTeg · Feb 23, 2004

I'm not sure if you're already using curl or not. If not, curl can follow redirects automatically. It might be able to make this problem completely transparent to the rest of your code.

	curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);				// 0 = no follow, 1 = follow

Don't know if that will help or not...

heygrady · Feb 23, 2004

I am already able to transparently follow the redirect. I am not able to know where I've been redirected to.

I am not able to use curl, it is not built into PHP by default which means it is not useful for this project. I am (sadly) limited to a default install of PHP because I want to be able to share my script with others and I don't feel comfortable requiring people to add components.

You guys were unable to really address the issue.

I can use either:

$handle = fopen($link, "rb");
$contents = "";
do {
   $data = fread($handle, 8192);
   if (strlen($data) == 0) {
       break;
   }
   $contents .= $data;
} while (true);
fclose($handle);
echo $contents;

or

$contents = file_get_contents($link);
echo $contents;

In the middle I parse the links on the page and change every "a href" to point back to my script using myscript.php?link=[some link]. Then the $link variable points to [some link] when the page is refreshed, and the script loads the remote page at that address.
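For illustration, the link rewriting described above might look roughly like the sketch below. The function name rewriteLinks and the regular expression are assumptions made for this example, not the code actually used in the script; it needs PHP 5.3 or later for the closure, and a regex like this will miss unquoted href values.

```php
<?php
// Hypothetical helper: point every <a href="..."> back at a proxy script.
// $scriptUrl (e.g. "myscript.php") is the script name mentioned in the post.
function rewriteLinks($html, $scriptUrl)
{
    // Match the href attribute of <a> tags, quoted with " or '.
    $pattern = '/(<a\b[^>]*?\bhref\s*=\s*)(["\'])(.*?)\2/is';

    return preg_replace_callback($pattern, function ($m) use ($scriptUrl) {
        // Undo HTML entities such as &amp; before encoding the target.
        $target = html_entity_decode($m[3]);
        $proxied = $scriptUrl . '?link=' . urlencode($target);
        // Keep the original quote character around the new value.
        return $m[1] . $m[2] . htmlspecialchars($proxied) . $m[2];
    }, $html);
}

$page = '<p><a href="/up/one/directory/">Up</a></p>';
echo rewriteLinks($page, 'myscript.php'), "\n";
```

Note that a relative target such as /up/one/directory/ is passed along unchanged here, so the script still has to know which page it came from in order to resolve it.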

For relative links, like <a href="/up/one/directory/">, I need to know the address of the current page being loaded. That is extremely easy for the first page because I chose it. But, in the case of yahoo (and probably numerous other sites) some relative links like "/s/42183" actually redirect to another site, like http://sports.yahoo.com/ncaab .
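To make the dependency on the current page address concrete, here is a simplified sketch of resolving a link against the page it was found on. resolveLink is a made-up name for this example, and the sketch deliberately ignores "../" segments, "//host" links and query-only links.

```php
<?php
// Hypothetical helper: turn a link found on a page into an absolute URL,
// given the URL the page was really loaded from (after any redirect).
function resolveLink($baseUrl, $link)
{
    // Already absolute (has a scheme such as http://): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $base = parse_url($baseUrl);
    $root = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $root .= ':' . $base['port'];
    }

    // Host-relative link such as "/up/one/directory/".
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Path-relative link: resolve against the directory of the base path.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

// The same relative link lands on a different host depending on the base.
echo resolveLink('http://www.yahoo.com/s/42183', '/up/one/directory/'), "\n";
echo resolveLink('http://sports.yahoo.com/ncaab', '/up/one/directory/'), "\n";
```

The two echo lines show why the redirect matters: with the wrong base, every relative link on the redirected page points at the wrong host.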

The tricky part comes in here. I am actually able to load the page with this script:

$link = "http://www.yahoo.com/s/42183/";
$contents = file_get_contents($link);
echo $contents;

The fopen method works as well. Those pages load fine. But the problem is I have no way to know that I was redirected, and so I have no way of knowing how to handle relative links like <a href="/up/one/directory/"> on any subsequent pages.

I need to be able to write a function that can take a url and return the redirected link.

The trail has led me to think I need to mess with sockets and wrappers to read the headers returned when I access the remote page. But the fact that fopen and file_get_contents are able to easily follow the redirect makes me think I could somehow force those functions to tell me where they actually got the data from.

I have no experience with the process needed to extract an http header from a remote request and I would like some information of where to start.
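One possible starting point, as far as I know: after file_get_contents() or fopen() on an http:// URL, PHP's HTTP wrapper fills a local variable called $http_response_header with the response header lines, including those of any redirects it followed. The sketch below assumes that behaviour; finalUrlFromHeaders is a made-up name, and a relative Location value is returned as-is rather than resolved.

```php
<?php
// Sketch: given the response header lines of a request that followed
// redirects, return the URL the data finally came from.
// $headers is assumed to look like PHP's $http_response_header array.
function finalUrlFromHeaders($startUrl, array $headers)
{
    $url = $startUrl;
    foreach ($headers as $line) {
        // Header names are case-insensitive, so match "location:" loosely.
        if (preg_match('/^Location:\s*(\S+)/i', $line, $m)) {
            $url = $m[1];   // each redirect hop overrides the previous one
        }
    }
    return $url;
}

// Usage with the HTTP wrapper (needs network access, so left commented out):
// $contents = file_get_contents($link);
// $whereAmI = finalUrlFromHeaders($link, $http_response_header);
```

If no Location header is present, the function simply hands back the URL it was given, which is the right answer when there was no redirect.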

Any help would be fantastic.

AstroTeg · Feb 23, 2004

It's a shame you can't use curl - the header info is just an additional curl option setting.

I'm thinking you're going to have to resort to sockets (which I believe can fall under the same PHP config restrictions as curl).

heygrady · Feb 23, 2004

http://us2.php.net/manual/en/function.fsockopen.php

I found the answer in a reply from a user in the php manual.

the user that posted as:

webmaster at elcurriculum dot com
16-Oct-2003 12:41

had the answer to my problem, but I don't think he knew he did.

at the bottom of his GetHTML function there is a section as follows:

// Follow Location...
$ret = strpos($header, "Location:", 0);
if ($ret !== false) {
    $fin = strpos($header, "\r\n", $ret + 9);
    $nueva = substr($header, $ret + 9, $fin - $ret - 9);
    $body = GetHTML($nueva, $delta, $corto, $complet);
} else {
    $delta = $url;
}

the variable $nueva is the address of the redirect.
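Built around that idea, a followRedirect() function might look something like the sketch below. This is not the GetHTML function from the manual comment; extractLocation and the $maxHops parameter are my own assumptions. It only handles plain HTTP on port 80, sends a HEAD request so no page body is downloaded, and stops at a relative Location instead of resolving it. Compared with the snippet above, it trims the space after "Location:" and anchors the search at a line start so that a "Content-Location:" header is not picked up by mistake.

```php
<?php
// Pull the Location value out of a raw header block, using the same
// strpos/substr approach as the snippet above, plus trim().
function extractLocation($header)
{
    $needle = "\r\nLocation:";
    $ret = stripos($header, $needle);
    if ($ret === false) {
        return false;
    }
    $start = $ret + strlen($needle);
    $fin = strpos($header, "\r\n", $start);
    if ($fin === false) {
        $fin = strlen($header);
    }
    return trim(substr($header, $start, $fin - $start));
}

// Hypothetical followRedirect(): ask for headers over a plain socket and
// follow Location headers, up to $maxHops times.
function followRedirect($given_url, $maxHops = 5)
{
    $url = $given_url;
    for ($i = 0; $i < $maxHops; $i++) {
        $parts = parse_url($url);
        if (empty($parts['host'])) {
            break;   // cannot connect without a host name
        }
        $path = isset($parts['path']) ? $parts['path'] : '/';
        if (isset($parts['query'])) {
            $path .= '?' . $parts['query'];
        }
        $fp = @fsockopen($parts['host'], 80, $errno, $errstr, 10);
        if (!$fp) {
            break;   // connection failed: return what we have so far
        }
        fwrite($fp, "HEAD $path HTTP/1.0\r\nHost: {$parts['host']}\r\nConnection: close\r\n\r\n");
        $header = '';
        while (!feof($fp)) {
            $header .= fgets($fp, 1024);
        }
        fclose($fp);

        $next = extractLocation($header);
        if ($next === false || !parse_url($next, PHP_URL_HOST)) {
            break;   // no redirect, or a relative Location this sketch skips
        }
        $url = $next;
    }
    return $url;
}
```

Calling followRedirect("http://www.yahoo.com/s/42183/") would then return whatever address the server's Location header points to, or the original URL if there is no redirect.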

I am in the process of reading through the function and changing it for my exact needs.

heygrady · Feb 23, 2004

PHP supports libcurl, a library created by Daniel Stenberg also known as cURL.

Daniel Stenberg did not write all the socket stuff that is native to PHP. He wrote a library that deals with URL stuff. His library would probably be helpful to me, but the fsockopen() function and the related functions have been available since PHP 3, are native to every install of PHP and are vital to PHP's operation.

The socket functions are not under the maniacal control of Daniel Stenberg.
