How's it going?

I'm trying to index (robot/spider) a site in PHP for our search engine.

However, when I extract the links, some of them aren't complete and therefore don't work.

e.g. at Yahoo, if I extract all the links, some are complete (and thus work immediately) but others are of the form:

<a href=r/gr>Greetings</a>

...which results in a dead link unless http://google.yahoo.com/ is added to the front.

Is there a way around this, please, that ensures I can capture the entire active link each time?

I'm looking for a foolproof, generic way so there are no problems following any link at any site dynamically (i.e. without having to configure the script for each site individually).

Maybe PHP isn't the language to use for a robot?

Thank you very much if you can assist...

Regards,

Jason

    No, I think your script will have to generate the prefixes itself. In fact, it's actually a little tougher than you describe. If the link starts with a slash, you can just prepend the scheme and host of the URL you're spidering. Ex:

    Spidering- http://www.x.com
    Link encountered- href=/fred.htm
    Add to spider list- http://www.x.com/fred.htm

    Spidering- http://www.x.com/folder
    Link encountered- href="subfolder/y.htm"
    Add to spider list- http://www.x.com/folder/subfolder/y.htm

    Note that if the link doesn't begin with a slash, you need to take your current location (the directory of the page you're on) and append the link to it (with correct slashing).
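    Here's a rough sketch of that prefixing logic in PHP (the function name is mine, and it deliberately ignores things like "../" segments, query strings and mailto: links, so it's not a full resolver):

    <?php
    // Sketch: turn an href found on $pageUrl into an absolute URL.
    // Handles the three cases above: already absolute, starts with "/",
    // or relative to the current directory.
    function resolve_link($pageUrl, $href)
    {
        // Already absolute? Use it as-is.
        if (preg_match('#^https?://#i', $href)) {
            return $href;
        }

        $parts  = parse_url($pageUrl);
        $scheme = $parts['scheme'];
        $host   = $parts['host'];
        $path   = isset($parts['path']) ? $parts['path'] : '/';

        // Starts with a slash: prepend the scheme and host of the page we're on.
        if (substr($href, 0, 1) == '/') {
            return "$scheme://$host$href";
        }

        // Otherwise, append it to the directory of the current page.
        $dir = (substr($path, -1) == '/') ? $path : dirname($path) . '/';
        return "$scheme://$host$dir$href";
    }

    // resolve_link('http://www.x.com/folder/', 'subfolder/y.htm')
    //   gives http://www.x.com/folder/subfolder/y.htm
    ?>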

    Also, you will need to keep a map/vector of links so that you don't continually spider the same links. This is where it starts to get interesting.
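    For the visited list, a PHP associative array keyed on the URL works fine as a simple "seen" set (again only a sketch; the variable names are mine):

    <?php
    // Sketch: crawl queue plus visited set.
    $queue   = array('http://www.x.com/');  // URLs still to fetch
    $visited = array();                     // URLs already fetched

    while (count($queue) > 0) {
        $url = array_shift($queue);

        if (isset($visited[$url])) {
            continue;              // already spidered this one, skip it
        }
        $visited[$url] = true;

        // ... fetch $url, extract its links, run them through
        // resolve_link() above, and push any new ones onto $queue ...
    }
    ?>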

    PHP can do this, but industrial-strength spiders are usually written in C/C++ (for performance) or, less often, Java.
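    That said, for a basic spider PHP is certainly workable. A minimal fetch-and-extract step might look something like this (the regex is deliberately crude and will miss some href forms, and file_get_contents() only fetches URLs if allow_url_fopen is enabled):

    <?php
    // Sketch: fetch a page and pull out href values with a simple regex.
    // Real HTML is messy, so a serious spider needs something more robust.
    function extract_links($pageUrl)
    {
        $html = @file_get_contents($pageUrl);   // needs allow_url_fopen
        if ($html === false) {
            return array();
        }

        $links = array();
        if (preg_match_all('#<a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)#i', $html, $m)) {
            foreach ($m[1] as $href) {
                $links[] = resolve_link($pageUrl, $href);  // from the sketch above
            }
        }
        return $links;
    }
    ?>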

    Hope that helps, Dave

    ===========================================
    http://badblue.com
    Small footprint P2P web server for Windows:
    file-sharing, PHP, wireless apps & more
