How's it going?

I'm trying to index (robot/spider) a site in PHP for our search engine.

However, when I extract the links, some of them aren't complete and therefore don't work.

e.g. at Yahoo, if I extract all the links, some are complete (and thus work immediately) but others are of the form:

<a href=r/gr>Greetings</a>

...which results in a dead link unless http://google.yahoo.com/ is added to the front.

Is there a way around this, please, that ensures I can capture the entire active link each time?

I'm looking for a foolproof, generic way so there are no problems following any link at any site dynamically (i.e. without having to configure the script for each site individually).

Maybe PHP isn't the language to use for a robot?

Thank you very much if you can assist...

Regards,

Jason

    No, I think your script will have to generate the prefixes itself. In fact, it's actually a little tougher than you describe. If the link starts with a slash, you can just prepend the scheme and host of the URL you're spidering. Ex:

    Spidering- http://www.x.com
    Link encountered- href=/fred.htm
    Add to spider list- http://www.x.com/fred.htm

    Spidering- http://www.x.com/folder
    Link encountered- href="subfolder/y.htm"
    Add to spider list- http://www.x.com/folder/subfolder/y.htm

    Note that if the link doesn't begin with a slash, you need to take your current location (the directory of the page you're on) and append the link to it (with correct slashing).
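    Here's a rough sketch of that prefixing logic in PHP (the function name is mine, and it deliberately ignores things like "../" segments, query strings and mailto: links, so it's not a full resolver):

    <?php
    // Sketch: turn an href found on $pageUrl into an absolute URL.
    // Handles the three cases above: already absolute, starts with "/",
    // or relative to the current directory.
    function resolve_link($pageUrl, $href)
    {
        // Already absolute? Use it as-is.
        if (preg_match('#^https?://#i', $href)) {
            return $href;
        }

        $parts  = parse_url($pageUrl);
        $scheme = $parts['scheme'];
        $host   = $parts['host'];
        $path   = isset($parts['path']) ? $parts['path'] : '/';

        // Starts with a slash: prepend the scheme and host of the page we're on.
        if (substr($href, 0, 1) == '/') {
            return "$scheme://$host$href";
        }

        // Otherwise, append it to the directory of the current page.
        $dir = (substr($path, -1) == '/') ? $path : dirname($path) . '/';
        return "$scheme://$host$dir$href";
    }

    // resolve_link('http://www.x.com/folder/', 'subfolder/y.htm')
    //   gives http://www.x.com/folder/subfolder/y.htm
    ?>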

    Also, you will need to keep a map/vector of links so that you don't continually spider the same links. This is where it starts to get interesting.
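    For the visited list, a PHP associative array keyed on the URL works fine as a simple "seen" set (again only a sketch; the variable names are mine):

    <?php
    // Sketch: crawl queue plus visited set.
    $queue   = array('http://www.x.com/');  // URLs still to fetch
    $visited = array();                     // URLs already fetched

    while (count($queue) > 0) {
        $url = array_shift($queue);

        if (isset($visited[$url])) {
            continue;              // already spidered this one, skip it
        }
        $visited[$url] = true;

        // ... fetch $url, extract its links, run them through
        // resolve_link() above, and push any new ones onto $queue ...
    }
    ?>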

    PHP can do this, but industrial-strength spiders are usually written in C/C++ (for performance) or, less often, Java.
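    That said, for a basic spider PHP is certainly workable. A minimal fetch-and-extract step might look something like this (the regex is deliberately crude and will miss some href forms, and file_get_contents() only fetches URLs if allow_url_fopen is enabled):

    <?php
    // Sketch: fetch a page and pull out href values with a simple regex.
    // Real HTML is messy, so a serious spider needs something more robust.
    function extract_links($pageUrl)
    {
        $html = @file_get_contents($pageUrl);   // needs allow_url_fopen
        if ($html === false) {
            return array();
        }

        $links = array();
        if (preg_match_all('#<a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)#i', $html, $m)) {
            foreach ($m[1] as $href) {
                $links[] = resolve_link($pageUrl, $href);  // from the sketch above
            }
        }
        return $links;
    }
    ?>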

    Hope that helps, Dave

    ===========================================
    http://badblue.com
    Small footprint P2P web server for Windows:
    file-sharing, PHP, wireless apps & more
