[RESOLVED] Retrieving links from a remote web site

padanaram · May 3, 2007

I am using a class called Snoopy, which fetches links from a remote web page. It works on other pages, but when I try to fetch links from the page I actually need, I get the following error.

** Question

Does anyone know what the http response code unauthorized means? Could this have something to do with the referrer of the page or the long string in the URL? I have no problem fetching links from a domain such as http://www.domain.com.

I am also wondering why it says the variable that was passed is not an array or object.

response code: HTTP/1.1 401 Unauthorized

Warning: Variable passed to each() is not an array or object in /home/content/a/l/e/directory/html/gregsCode/test.php on line 58

https://www.domain.com/members/powersearch/control/interresults/https://www.domain.com/register/https://www.domain.com/register/https://www.domain.com/register/https://www.domain.com/register/

This is the code that I am using to call the snoopy class.

include "Snoopy.class.php";
$snoopy = new Snoopy;

if($snoopy->fetchlinks("https://www.domain.com/members/powersearch/control/interresults?fromyear=2007&toyear=2008&region=&make=&modeltext=&auction=&numresultsperpage=50&x=33&y=4&mileage=&interior=&engine=&top=&transmission=&radio=&certification=&consignor=&presalechannel=1&dealerexchangechannel=5&cyberlotchannel=2&cyberauctionchannel=4&encorechannel=3&numresultsperpage=50")) 
{
    echo "response code: ".$snoopy->response_code."<br>\n";

    while(list($key,$val) = each($snoopy->results))
        echo $key.": ".$val."<br>\n";
    echo "<p>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
    echo "error fetching document: ".$snoopy->error."\n";
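
The each() warning quoted above fires when $snoopy->results is not an array at the moment the loop runs. A minimal sketch of a guarded loop, assuming only that the results may hold either an array of links or a single string. formatResults is a hypothetical helper written for illustration, not part of Snoopy:

```php
<?php
// Hypothetical helper (not part of Snoopy): formats whatever a fetch left
// in $results, whether that is an array of links or a single value.
function formatResults($results)
{
    if (is_array($results)) {
        $lines = array();
        foreach ($results as $key => $val) {
            // Escape each link before it is echoed into an HTML page.
            $lines[] = $key . ": " . htmlspecialchars((string) $val);
        }
        return implode("<br>\n", $lines);
    }
    // Not an array: escape the raw value instead of iterating over it.
    return "<PRE>" . htmlspecialchars((string) $results) . "</PRE>";
}

echo formatResults(array("https://www.domain.com/register/")), "\n";
```

In the script above it would be called as echo formatResults($snoopy->results); in place of the while loop and the final htmlspecialchars() call.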

The source of the Snoopy class is available at http://snoopy.sourceforge.net.

There is another issue that I am dealing with when I am retrieving links.

I have noticed that some of the links I have to retrieve are not normal links. They are written with JavaScript and look like this.

<a href="javascript:goToPresaleResults('18', 'ALBA', '05/02/2007', '05%2F02%2F2007+Albuquerque+AA+-+GM+SALE', '1');">82 Found</a>

** Question
Is there a particular PHP function that can fetch this link, or at least fetch each value that is in the link? I can build the URL in a PHP script.

There is also a function that these links call. Here is the function.

function goToPresaleResults(saleID, auctionID, saleDate, iResultName, saleChannel)
{
    document.PSPageSelector.action = "/members/presale/control/powersearchList";  // this is the url of the script.
    document.PSPageSelector.saleID.value = saleID;
    document.PSPageSelector.saleNumber.value = saleID;
    document.PSPageSelector.saleDate.value = saleDate;
    document.PSPageSelector.auctionID.value = auctionID;
    if (auctionID == null || auctionID == '') {
        loadOriginalAuctions();
    } else {
        clearAuctions();
    }
    document.PSPageSelector.salechannel.value = saleChannel;

    document.PSPageSelector.irname.value = iResultName;
    checkEngineValue();
    document.PSPageSelector.submit();
    return;
}
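
There is no single PHP function for this, but the quoted arguments can be pulled out of the href with a regular expression and mapped to the form fields that goToPresaleResults fills in. A minimal sketch under stated assumptions: presaleFieldsFromHref is a hypothetical helper, the PSPageSelector form may carry other fields that are not shown in this thread, and whether the form submits by GET or POST is not shown either:

```php
<?php
// Hypothetical helper: extracts the five quoted arguments from a
// javascript:goToPresaleResults('...', ...) href and maps them to the
// form fields that the JavaScript function sets.
function presaleFieldsFromHref($href)
{
    if (!preg_match('/goToPresaleResults\((.*)\)/', $href, $call)) {
        return null; // not a goToPresaleResults link
    }
    // Collect every single-quoted argument, in order.
    preg_match_all("/'([^']*)'/", $call[1], $args);
    if (count($args[1]) !== 5) {
        return null;
    }
    list($saleID, $auctionID, $saleDate, $iResultName, $saleChannel) = $args[1];

    return array(
        'saleID'      => $saleID,
        'saleNumber'  => $saleID,      // the function sets both from saleID
        'saleDate'    => $saleDate,
        'auctionID'   => $auctionID,
        'salechannel' => $saleChannel,
        'irname'      => $iResultName,
    );
}

$href = "javascript:goToPresaleResults('18', 'ALBA', '05/02/2007', "
      . "'05%2F02%2F2007+Albuquerque+AA+-+GM+SALE', '1');";
$fields = presaleFieldsFromHref($href);

// Encoded field string to send to /members/presale/control/powersearchList.
echo http_build_query($fields), "\n";
```

The field values are encoded the same way a browser would encode them when the form is submitted, so the literal % characters in irname are escaped again.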

Kudose · May 3, 2007

padanaram wrote:
Does anyone know what the http response code unauthorized means?

10.4.2 401 Unauthorized

The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8). If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication" [43].

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

Basically, it is telling you that the web site you are trying to scrape requires an authenticated user.
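
If the challenge in the WWW-Authenticate header is Basic, the Authorization header mentioned in the quoted text is the word Basic followed by the base64 encoding of user:password. A minimal sketch, assuming Basic authentication (the site in question may use a login form instead, in which case this does not apply). The helper name and the credentials are made up for illustration:

```php
<?php
// Hypothetical helper: builds the request header for HTTP Basic auth.
function basicAuthHeader($user, $password)
{
    return "Authorization: Basic " . base64_encode($user . ":" . $password);
}

echo basicAuthHeader("Aladdin", "open sesame"), "\n";
```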

padanaram · May 3, 2007

I have been logging into the site using a user name and password. Snoopy does that for me and all I have to do is put my user name and password into a variable in the class.

Installer · May 3, 2007

HTTP Status Code Definitions

Edit: I just saw this is posted twice (by a member with over 100 posts, no less).:mad:

padanaram · May 3, 2007

I accidentally posted this twice.

padanaram · May 13, 2007

I marked my duplicate post as resolved so we don't have to deal with duplicate posts.

I figured out the login. I replaced the remote login in Snoopy with a cURL login, which works.
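
The actual login code was not posted. A minimal sketch of what a cURL form login with a cookie file can look like, assuming the site uses a login form and a session cookie. The function names, the login URL, and the field names are all hypothetical:

```php
<?php
// Hypothetical helper: cURL options for posting a login form while keeping
// the session cookie in a file so later requests stay logged in.
function loginCurlOptions($loginUrl, array $fields, $cookieFile)
{
    return array(
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here on close
        CURLOPT_COOKIEFILE     => $cookieFile, // and read from here on later runs
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect after login
    );
}

// Logs in, then fetches a members-only page on the same handle so the
// session cookie from the login is sent with the second request.
function fetchAfterLogin($loginUrl, array $fields, $pageUrl, $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, loginCurlOptions($loginUrl, $fields, $cookieFile));
    curl_exec($ch);

    // Switch back to GET and request the target page.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

A call would look like fetchAfterLogin($loginUrl, array('username' => $user, 'password' => $pass), $resultsUrl, '/tmp/cookies.txt'), with the real field names taken from the site's login form.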

The next problem is whether I can use cURL to follow the JavaScript links.

Is there a way to follow specific links with cURL, such as links that have a particular link text?

The links I have to follow call a JavaScript function in the href attribute, but their link text is almost always the same, such as "34 found" or "123 found".
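
cURL only transfers the page, so picking links by their text is an HTML parsing step done on the downloaded markup. A minimal sketch using PHP's DOM extension, assuming the link text always has the form of a number followed by "found". hrefsByLinkText is a hypothetical helper:

```php
<?php
// Hypothetical helper: returns the href of every <a> whose visible text
// looks like "34 found" or "123 Found".
function hrefsByLinkText($html, $pattern = '/^\d+\s+found$/i')
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; silence the parser warnings.
    @$doc->loadHTML($html);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (preg_match($pattern, trim($a->textContent))) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

$html = '<p><a href="javascript:goToPresaleResults(\'18\', \'ALBA\', \'05/02/2007\', '
      . '\'05%2F02%2F2007+Albuquerque+AA+-+GM+SALE\', \'1\');">82 Found</a> '
      . '<a href="/register/">Register</a></p>';
print_r(hrefsByLinkText($html));
```

The hrefs it returns are still javascript: calls, so they cannot be requested directly. Their arguments have to be turned into a form submission first.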

Another possibility may be to work out the numbering used by the script that actually serves the info when the page displays, and then retrieve the specific info I need from that page.

Another possibility is to retrieve the links using Snoopy, which I can do, but then I will have to use the info to somehow build a URL, call it, and retrieve the info I need from the page.

Please help or give me any direction that you can. If anyone has any code that already does this, that would be appreciated too.

padanaram · May 13, 2007

I found a hack where I can get the info from the remote page without having to follow all these JavaScript links.

I had to re-write some parts of the Snoopy class.

Thanks for your help.

Kudose · May 13, 2007

Please post the entire hacked class. I would like to see it.
