Spider help

slimjim

Been whacking at this for days and I can't get it to work correctly.
Can someone give it a shot for me, please?
It spiders nicely, just one thing: after it spiders, it starts over again.
It ends up putting doubles into the database because of it.

I need it to stop at some point after it gathers information from all crawled links.

I know it doesn't look pretty, but hey..

Any help would be nice, and yeah, I know there are many spiders out there, but I can learn more this way, with some help of course.

<?php
   ini_set('max_execution_time',0); 
   set_time_limit( 0 );

#
require "./conf/db_connect.inc.php";
require_once "./conf/display.functions.inc.php";

include("tagretrieval.php"); //gets title

   // Where should we start searching from?
   $start = "http://somesite.com/";
   // Build information about the site we're going to search.
   if($url = parse_url($start))
   {
      if(isset($url['scheme']))
      {
         $b_scheme = $url['scheme'];
         $b_url = $b_scheme."://";
      }
      if(isset($url['host']))
      {
         $b_host = $url['host'];
         $b_url = $b_url.$b_host;
      }
      if(isset($url['path']))
      {
         $b_path = dirname($url['path']);
         $b_url = $b_url.$b_path;
      }
   }
   else
   {
      echo("\nError!\n");
      echo("Description: Unable to parse starting URL. ");
      echo("Please enter a different URL to start from.\n");
      echo("Starting URL: " .$start. "\n\n");
      exit;
   }
   // Initialize our array of links.
   $links = array($start => "0");
   // Keep crawling until we run out of links.
   while($p_link = array_search("0", $links))
   {
      // Mark this link as having been seen.
      $links[$p_link] = "1";

  // Get the contents of the link we're currently looking at.
  // If we fail this, there's no point in going further.
  // We're going to suppress PHP's warning messages here as well.
  if(@ $contents = file_get_contents($p_link))
  {
// Escape the URL before putting it into the query.
$query1 = "SELECT * FROM keyword2 where url='" . mysql_real_escape_string($p_link) . "'";
$result1= mysql_query($query1) or die( "ERROR: " . mysql_error() . "\n"); 
$num1 = mysql_num_rows($result1);

$meta = get_meta_tags($p_link);
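// Quote the array keys and fall back to an empty string when a tag is missing.
$description = isset($meta['description']) ? $meta['description'] : "";
$keyword = isset($meta['keywords']) ? $meta['keywords'] : "";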
$keyword = str_replace(".", ",", $keyword);

$kwords2 = explode(",", $keyword);
$description = str_replace("'", "", $description);
srand((double)microtime() * 10000000);
// Take the first 21 keywords (index 14 was skipped before); missing slots stay empty.
$originalArray = array_slice(array_pad($kwords2, 21, ""), 0, 21);

$pickOne = array_rand($originalArray, 1);
$aRandomSelection = $originalArray[$pickOne ];

if($num1<=0){ // if the link isn't found in the db, then we will insert
    // Reuse the page we already downloaded instead of fetching it a second time.
    $title = get_doc_title($contents);
    if(!empty($title[0])){
        $title = strip_tags(trim($title[0]));
        $title = str_replace(array("'", "-", "*"), "", $title);
        $title = trim($title);
    }else{
        // No usable title, so fall back to the URL.
        $title = $p_link;
    }
if(($title!=="") && ($aRandomSelection!="") && ($description!="")){ 
$aRandomSelection2 = strtolower($aRandomSelection); 
$title = strtolower($title);
$description = strtolower($description);

//$shortppctitle = substr($title, 0, 25) . "..."; 
if((preg_match("/^[a-zA-Z\.\,\_\-\'\ ]+$/u", $title)) && (preg_match("/^[a-zA-Z\.\,\_\-\'\ ]+$/u", $description)) && (preg_match("/^[a-zA-Z\.\,\_\-\'\ ]+$/u", $aRandomSelection2))) {

//here is where I insert into database the title, description and one random keyword

}		
         // What link are we following?
     echo("Following link: <b>" .$p_link. "</b><BR>" .$title. "<BR>" .$description. "<BR><BR>");
      }
//
}  

         // Build information about the link we're currently looking at.
         if($url = parse_url($p_link))
         {
            $p_url = $p_link;
            if(isset($url['scheme']))
            {
               $p_scheme = $url['scheme'];
               $p_url = $p_scheme."://";
            }
            if(isset($url['host']))
            {
               $p_host = $url['host'];
               $p_url = $p_url.$p_host;
            }
            if(isset($url['path']))
            {
               $p_path = dirname($url['path']);
               $p_url = $p_url.$p_path;
            }
         }
         // Extract the links from the current page.
         preg_match_all("/href=\"(.*?)\"/", $contents, $link_results);  

         // Loop through our extracted links and manipulate them.
         for($i = 0; $i < count($link_results[1]); $i++)
         {

        // Get an extracted link from our list and assume it's good.
        $c_link = $link_results[1][$i];
        $c_valid = true;


        // Trim any whitespace that might be on our link.
        $c_link = trim($c_link);


        // Build information about our extracted link.
        // If we can't parse the URL, don't continue.
        // Suppress all PHP warnings here as well.
        if(@ $url = parse_url($c_link))
        {
           if(isset($url['host']))
           {
              $c_host = $url['host'];
           }
           if(isset($url['query']))
           {
              $c_query = $url['query'];
           }
           if(isset($url['fragment']))
           {
              $c_fragment = $url['fragment'];
           }
        }
        else
        {
           // If we won't be able to follow it, mark it as bad.
           $c_valid = false;
        }
           if(preg_match("/\.(jpg|gif|png|ico)$/i", $c_link))
        {
           $c_valid = false;
        }
        elseif(preg_match("/\.(zip|rar|tar|gz)$/i", $c_link))
        {
           $c_valid = false;
        }
        elseif(preg_match("/\.(c|pl|py|js|reg|orig)$/i", $c_link))
        {
           $c_valid = false;
        }
        elseif(preg_match("/\.(exe|java|class)$/i", $c_link))
        {
           $c_valid = false;
        }
        elseif(preg_match("/\.(css|xml|txt|doc|pdf|lit)$/i", $c_link))
        {
           $c_valid = false;
        }
        elseif(preg_match("/\.(mp3|wav|ra|pm)$/i", $c_link))
        {
           $c_valid = false;
        }
        // If our link's made it this far, it's good, so let's keep it.
        if($c_valid)
        {

           // Remove queries from the end of a link.
           if(isset($c_query))
           {
              $c_link = preg_replace("/\?(.*?)$/", "", $c_link);
           }


           // Remove fragments from the end of a link.
           if(isset($c_fragment))
           {
              $c_link = preg_replace("/#(.*?)$/", "", $c_link);
           }


           // Case 1: The URL is of the form: /directory/file
           if(preg_match("/^\//", $c_link))
           {
              $c_link = $b_scheme."://".$b_host.$c_link;
           }


           // Case 2: The URL is of the form: ../directory/file
           if(preg_match("/^\.\.\//", $c_link))
           {

              // How many directories will we have to backtrack into?
              preg_match_all("/\.\.\//", $c_link, $count);
              $count = count($count[0]);

              // Remove the relative bits from our link.
              $c_link = preg_replace("/\.\.\//", "", $c_link);

              // Remove leading and trailing slashes from our path.
              $p_path = preg_replace("/^\//", "", $p_path);
              $p_path = preg_replace("/\/$/", "", $p_path);

              // Backtrack the required number of directories.
              $path_array = explode("/", $p_path);
              $new_path = "";
              for($j = $count; $j > 0; $j--)
              {
                 array_pop($path_array);
              }
              for($j = 0; $j < count($path_array); $j++)
              {
                 $new_path = $new_path.$path_array[$j]."/";
              }

              // Tack our new path onto the beginning of our link.
              $c_link = $p_scheme."://".$p_host."/".$new_path.$c_link;
           }


           // Case 3: The URL is of the form: ./directory/file
           $c_link = preg_replace("/^\.\//", "", $c_link);


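           // Case 4: The URL is of the form: file
           // Anything that already has a scheme (http:, https:, mailto:, javascript:) is not relative.
           if(!preg_match("/^[a-z][a-z0-9+.\-]*:/i", $c_link))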
           {
              if(preg_match("/\/$/", $p_url))
              {
                 $c_link = $p_url.$c_link;
              }
              else
              {
                 $c_link = $p_url."/".$c_link;
              }
           }


           // Remove any www. stuff from the start of our link.
           $c_link = preg_replace("/^http:\/\/www\./", "http://", $c_link);

           // Add our extracted list to our list of links to look at.
           if(!array_key_exists($c_link, $links))
           {
              $links[$c_link] = "0";
           }
        }
     }
  }
   }          

?>

scross

I don't have time to go through your code and see what's wrong with it. However, I can suggest how you should be doing this so you can see if you've made an error in the logical process of the script.

You need two arrays: one that records the URLs you have already indexed, and one that is used like a queue and holds the URLs you still need to crawl. The first prevents the same URL from being indexed twice. At the start the first array should be empty and the second array should hold your starting URLs (it's fine to have more than one).

On each crawl you get all of the URLs on the page. For each URL, scan through the first array to check that you haven't already indexed it; if you haven't, add it to the second array. Once you have finished processing the URLs for a page, you need to:
1) Remove the URL you just crawled from the second array and add it to the first array.
2) Run array_unique on the second array to remove duplicates.

On the next crawl you then select a URL from the second array (I assume the order doesn't matter in this case) and repeat the process. For each crawl you of course need to add the data you obtain to the database, and you finish crawling when the second array is empty. Chances are not everything I have said here is perfectly correct and I have probably missed something out, but this should be your basic template.

As you say, building a crawler like this serves a good educational purpose. If you feel like it, you could also change it so that it performs the crawling over multiple requests and uses a page refresh, in case your setup doesn't allow you to let it run forever. It's also probably safer that way, since it avoids running into an infinite loop.
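
To give a rough idea of the multiple-request approach, here is a minimal sketch. Everything in it is an assumption for illustration: the function names crawlBatch, loadState and saveState are made up, the state is kept in a JSON file, and $getLinks stands for whatever function you write to return the full URLs found on a page. It is untested against a real site.

```php
<?php
// Hypothetical helper: crawl at most $limit pages, then hand back the updated state.
// $state['visited'] is the "first array" and $state['queue'] is the "second array".
// $getLinks is any callable that takes a URL and returns an array of full URLs.
function crawlBatch(array $state, callable $getLinks, $limit)
{
    $done = 0;
    while ($done < $limit && !empty($state['queue'])) {
        $url = array_shift($state['queue']);
        if (isset($state['visited'][$url])) {
            continue; // already indexed on an earlier pass
        }
        $state['visited'][$url] = true;
        $done++;
        // This is where the page data would be written to the database.
        foreach ($getLinks($url) as $link) {
            // Only queue links that are neither indexed nor already waiting.
            if (!isset($state['visited'][$link]) && !in_array($link, $state['queue'], true)) {
                $state['queue'][] = $link;
            }
        }
    }
    return $state;
}

// Load the state saved by the previous request, or start fresh from $startUrls.
function loadState($file, array $startUrls)
{
    if (is_file($file)) {
        $state = json_decode(file_get_contents($file), true);
        if (is_array($state) && isset($state['queue'], $state['visited'])) {
            return $state;
        }
    }
    return array('queue' => $startUrls, 'visited' => array());
}

// Write the state to disk so the next request can pick up where this one stopped.
function saveState($file, array $state)
{
    file_put_contents($file, json_encode($state));
}
```

Each request would then call loadState, run crawlBatch with a small limit, call saveState, and refresh the page until the queue comes back empty.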

slimjim

I don't want you to write the code for me, but can you give me an example of the coding?
All of these arrays are a bit confusing, and I'm not sure how to integrate it all together.

I feel like I need to start clean from scratch.

Also, yeah, having more than one start URL is a good idea.

scross

Ok. In the code I'll make references to the following function which you will have to define:

array getURLs(string $url) - gets all the URLs which are linked to from the provided URL. All the URLs returned by this function are expected to be in full form (http://domain/path/file.extension), any short domains are expanded to www (http://somesite.com to http://www.somesite.com) and all URLs end with a forward slash (http://www.somesite.com to http://www.somesite.com/). This is required to ensure the same URL does not get searched twice simply because it is in a different form.

<?php
include 'spider_functions.php';

$urls = array('http://www.somesite.com/');
$visited = array();

while(!empty($urls)){
    $this_url = array_pop($urls); //faster to take the last element from the array
    $visited[] = $this_url; //add it to the visited array
    //you will probably want to get the page data here
    $this_urls = getURLs($this_url);
    foreach($this_urls as $u){ //loop over the URLs found on the page, not the page URL itself
        if(in_array($u, $visited)){
            continue;
        }
        $urls[] = $u;
    }
    $urls = array_unique($urls); //remove duplicates
}

echo 'Crawling finished';
?>

That should be all you need, though you'll probably want to change it so it makes a single request and then the URLs are obtained from that request along with the page title, contents etc. The code is untested so it's likely I've missed something out or made an error.
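
As a starting point for the single-request idea, here is a rough sketch that takes the HTML of one downloaded page and returns both the title and the links. The names normalizeUrl and parsePage are made up for this example, and it relies on PHP's DOM extension. The normalization follows the rules given for getURLs, with some assumptions of my own: query strings, fragments and ports are dropped, the trailing slash is only added when the last path segment does not look like a file, and relative links are simply skipped rather than resolved.

```php
<?php
// Hypothetical helper: put a URL into one consistent form.
// Returns null for anything that is not an absolute http(s) URL.
// Assumption: port, query string and fragment are dropped.
function normalizeUrl($url)
{
    $p = parse_url(trim($url));
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null;
    }
    $scheme = strtolower($p['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }
    $host = strtolower($p['host']);
    // Expand a bare two-part domain to its www form (somesite.com -> www.somesite.com).
    if (substr_count($host, '.') === 1) {
        $host = 'www.' . $host;
    }
    $path = isset($p['path']) ? $p['path'] : '/';
    // Assumption: add the trailing slash only when the last segment has no file extension.
    if (substr($path, -1) !== '/' && strpos(basename($path), '.') === false) {
        $path .= '/';
    }
    return $scheme . '://' . $host . $path;
}

// Hypothetical helper: pull the title and the absolute links out of one page.
function parsePage($html)
{
    $result = array('title' => '', 'urls' => array());
    if (!is_string($html) || $html === '') {
        return $result; // nothing was downloaded
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so silence the parser warnings.
    @$doc->loadHTML($html);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    $seen = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $u = normalizeUrl($a->getAttribute('href'));
        if ($u !== null) {
            $seen[$u] = true; // using keys keeps the list free of duplicates
        }
    }
    $result['urls'] = array_keys($seen);
    return $result;
}
```

In the loop above you would call something like parsePage(@file_get_contents($this_url)) once per page, store the title in the database, and loop over the 'urls' entry in place of the getURLs result.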