Hey everyone,

I was wondering whether it's possible to build a variable containing every word on a web page, given its URL (e.g. http://roblox.com/).
If so, how?

Thanks.

    Imperialoutpost;10958607 wrote:

    The easiest way I can think of is to load the page with file_get_contents, and then use strip_tags to remove the markup.

    You could then use a regex to remove punctuation, and explode the string into an array of individual words.
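As a rough illustration of those steps, here is a minimal sketch that assumes the page's HTML is already in a string. The function name `extract_words` and the sample markup are made up for this example, and it uses preg_split rather than explode so that punctuation and whitespace are handled in one pass. Treating apostrophes as part of a word is also just an assumption.

```php
<?php

// Hypothetical helper: turn an HTML string into an array of words.
function extract_words(string $html): array
{
    // Drop <script> and <style> blocks first; strip_tags() removes the tags
    // themselves but would leave their contents behind as text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space before every tag so adjacent elements don't fuse together
    // (e.g. "<p>a</p><p>b</p>" would otherwise become "ab").
    $html = str_replace('<', ' <', (string) $html);

    // Remove the markup and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Split on runs of anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (e.g. invalid UTF-8 input).
    return $words === false ? [] : $words;
}

$sample = '<html><body><h1>Hello, world!</h1><p>It\'s a <b>test</b> page.</p></body></html>';
print_r(extract_words($sample));
```

If duplicates or case don't matter for your purposes, array_unique and strtolower can be applied to the result afterwards.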

    file_get_contents doesn't work for me on other websites. fopen and fread do.
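For reference, the fopen/fread route might look something like the sketch below. The helper name `read_all` and the 8192-byte chunk size are arbitrary choices for this example. Note that fopen on an http:// URL goes through the same URL wrapper as file_get_contents, so it also depends on the allow_url_fopen setting.

```php
<?php

// Hypothetical helper: read everything from a URL or file path in chunks.
function read_all(string $location): string
{
    $handle = @fopen($location, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open $location");
    }

    $contents = '';
    // Network streams hand back data as it arrives, so a single fread()
    // may return less than requested; keep reading until end of stream.
    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        $contents .= $chunk;
    }
    fclose($handle);

    return $contents;
}

// Example call (needs network access and allow_url_fopen enabled):
// $html = read_all('http://roblox.com/');
```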

      cURL gives you a lot more control...

      This is a basic, chopped-down version of what I use for similar tasks:

      <?php

      $ch = curl_init(); // Initialise a cURL session

      $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)'; // Set our user agent to Googlebot
      curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

      curl_setopt($ch, CURLOPT_URL, 'http://www.roblox.com/Default.aspx');
      curl_setopt($ch, CURLOPT_FAILONERROR, true);    // Treat HTTP status codes >= 400 as failures
      curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Follow redirects
      curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // Set the Referer header when following redirects
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response instead of printing it
      curl_setopt($ch, CURLOPT_TIMEOUT, 45);          // Give up after 45 seconds

      $html = curl_exec($ch); // Pull the page HTML into the string $html

      if ($html === false) { // If the request failed, print the error and stop
          echo 'cURL error number: ' . curl_errno($ch) . '<br/>';
          echo 'cURL error: ' . curl_error($ch) . '<br/>';
          die();
      }

      echo strip_tags($html); // Print the page text with the HTML tags removed

      ?>

      I noticed that if the user agent isn't set to Googlebot, the site returns an error.

        I've found that it's helpful to forward the user agent of whoever is visiting your page.
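One way to do that forwarding is sketched below, assuming the script runs behind a web server so that `$_SERVER['HTTP_USER_AGENT']` is populated. The helper name `visitor_user_agent` and the fallback string are invented for this example.

```php
<?php

// Hypothetical helper: choose the User-Agent to send with the outgoing request.
// $server is normally $_SERVER; $fallback is used when the visitor sent no
// User-Agent header (e.g. when the script is run from the command line).
function visitor_user_agent(array $server, string $fallback = 'Mozilla/5.0 (compatible; WordFetcher/1.0)'): string
{
    $agent = (string) ($server['HTTP_USER_AGENT'] ?? '');

    // Remove CR/LF so a crafted header value can't inject extra request headers.
    $agent = trim(str_replace(["\r", "\n"], '', $agent));

    return $agent !== '' ? $agent : $fallback;
}
```

With cURL, the result would be passed along as `curl_setopt($ch, CURLOPT_USERAGENT, visitor_user_agent($_SERVER));` in place of a hard-coded user agent string.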
