Hi There,

Your humble ethical programmer here, wanting to know if anyone out there knows of a suite of PHP files that will fetch a web page (the home page), spider its links to build a LIST of pages, and check for 404s. This would help me check my own sites through a cron job.
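For the 404 side of this, a minimal starting point (the function names here are my own sketch, not from an existing package) could lean on PHP's built-in get_headers():

```php
<?php
// Sketch: flag a URL as broken when its first response line reports a 404.
// is_broken() is pure so the status logic can be tested without a network.
function is_broken(array $headers): bool
{
    return isset($headers[0]) && strpos($headers[0], ' 404 ') !== false;
}

// get_headers() performs the HTTP request; fine for a simple cron check.
function check_url(string $url): void
{
    $headers = @get_headers($url);
    if ($headers === false) {
        echo "NO RESPONSE: $url\n";
    } elseif (is_broken($headers)) {
        echo "BROKEN: $url\n";
    }
}
```

A cron job would just loop check_url() over the list of pages the spider built.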

I know there are commercial tools out there, but I'd prefer to run this type of software myself on my own server. I do NOT intend to use this to try to game Google, nor as a tool to spam anyone.

Thanks,
Samuel

    I have a script, but it takes a long time to run. It can grab links from any page you enter, and also grab links from the links it finds on the main page. I used the file() function and ran a preg_match() in a foreach statement; crawling one page takes about 15 seconds.
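    That approach can be sketched roughly like this (untested outline, not the poster's actual script). Running one preg_match_all() over the whole document, instead of preg_match() per line from file(), should also help with the 15-second crawl time:

```php
<?php
// Sketch: fetch a page and pull out its <a href="..."> targets in one pass.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    // Drop duplicate targets and reindex the result.
    return array_values(array_unique($m[1]));
}

// Fetch the whole page as one string rather than line-by-line with file().
$html = file_get_contents('http://example.com/');
print_r(extract_links($html));
```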

      I wrote a fairly comprehensive spider in PHP, and I have to say the language isn't terribly well suited to the job:

      1. You don't have full control over threads / processes
      2. It doesn't have a thorough garbage collector
      3. Its built-in HTTP implementation isn't flexible enough for use on a large-scale spider (I gave up and wrote my own)
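      On point 1, one common workaround (not real threads, but it helps) is the curl_multi family from the standard cURL extension, which drives several transfers concurrently. A rough sketch, assuming ext/curl is loaded:

```php
<?php
// Sketch: fetch several URLs concurrently with curl_multi, a common
// substitute for real threads in a PHP spider. Requires ext/curl.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still active.
    do {
        curl_multi_exec($mh, $active);
        curl_multi_select($mh);
    } while ($active > 0);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```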

      Mark

        To both of you guys,

        post some of your code and I'll show you some functions I developed; maybe this will spawn a project. The idea is to:

        1. build a list of pages, plus links on those pages
        2. perhaps create a copy on a "local disk" which I could burn to a CD (so I could give my client a static copy of their website)
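        For item 2, a bare-bones sketch (the function name and directory layout are assumptions of mine) that maps a URL path onto a file under a local mirror directory:

```php
<?php
// Sketch: save a fetched page to a local mirror directory, mapping the
// URL path to a file path so the copy can be burned to a CD as-is.
function save_local_copy(string $url, string $html, string $root = 'mirror'): string
{
    // Directory-style URLs (trailing slash or no path) become index.html.
    $path = parse_url($url, PHP_URL_PATH);
    if (!$path || substr($path, -1) === '/') {
        $path .= 'index.html';
    }

    $file = rtrim($root, '/') . '/' . ltrim($path, '/');
    if (!is_dir(dirname($file))) {
        mkdir(dirname($file), 0755, true);
    }
    file_put_contents($file, $html);
    return $file;
}
```

        Links inside the saved pages would still need rewriting to relative paths before the copy is truly standalone.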

        Thanks,
        Samuel
