Hi all

I am building a forum for a client, and I record each time a forum thread is viewed, so the count is incremented by one on every page refresh. This is just some basic statistical data the client wanted added. However, it occurred to me that this data may be misleading if search engines crawl my web pages.

What is the best way to determine if the page view is by a human or not?

Many Thanks

    The easiest way is probably to check the User-Agent string supplied by the visitor. This value is sent by the client, so it won't work if the visitor omits it or fakes it (for example a script using curl, or someone deliberately spoofing it), but most visitors (and bots) are well behaved.

    Here's a super simple script from a very long time ago, but its bot list is hardly comprehensive:

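    <?php
    // The header may be missing entirely, so fall back to an empty string.
    $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    function is_bot($user_agent) {
      // Substrings of known crawler user agents (far from a complete list).
      $spiders = array('Googlebot', 'MSNBOT', 'FAST-WebCrawler', 'Gigabot', 'YahooSeeker', 'ZyBorg');
      foreach ($spiders as $bot) {
        // stripos() is case-insensitive; compare against FALSE because a match at position 0 is falsy.
        if (stripos($user_agent, $bot) !== FALSE) {
          return TRUE;
        }
      }
      // if we reach this point, we've tried all the bots with no match
      return FALSE;
    }
    echo 'result:' . (is_bot($user_agent) ? 'bot' : 'not a bot');
    ?>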

    I bet you could find a better one by googling around.

      9 days later

      For the forum I created, I only incremented the thread views if the viewer was logged in and it wasn't their own thread. I also checked for a specific GET variable and value.
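
      A minimal sketch of the logged-in / not-own-thread rule described above. The function name and its parameters are hypothetical (they are not from the forum in question), and the GET variable check is left out because the post does not say which variable was used:

      ```php
      <?php
      // Hypothetical helper: decide whether a page view should bump the thread counter.
      // $viewerId is null for guests; $authorId is the id of the user who started the thread.
      function should_count_view($viewerId, $authorId) {
          if ($viewerId === null) {
              return false; // not logged in, so this could be a crawler
          }
          // Cast both ids so that "7" from a database row and 7 from a session compare equal.
          return (int) $viewerId !== (int) $authorId; // skip the author's own visits
      }
      ```

      The trade-off is that guest views are never counted, so the number reflects member views only; crawlers drop out as a side effect because they are not logged in.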

        The user-agent parameter is settable by the user and shouldn't be relied upon.

          A little trick used to catch rogue bots and/or spiders is to place a link (or image) on the page that is hidden from the user. I would guess you could use a similar technique here, since a spider will generally follow the link without noticing that it is hidden from human visitors.
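
          The hidden-link idea could be sketched roughly like this. Everything named here is an assumption for illustration: the trap URL, the function names, and the plain array standing in for whatever storage (database table, cache) a real forum would use to remember flagged addresses:

          ```php
          <?php
          // Hypothetical URL of a page that only a link-following bot should ever request.
          define('TRAP_PATH', '/trap.php');

          // Markup for a link that human visitors will not see or click.
          function hidden_trap_link() {
              return '<a href="' . TRAP_PATH . '" style="display:none">&nbsp;</a>';
          }

          // Called from the trap page: remember the IP address that followed the link.
          function flag_bot(array $flagged, $ip) {
              $flagged[$ip] = true;
              return $flagged;
          }

          // Called before incrementing a thread's view count: skip flagged addresses.
          function is_flagged(array $flagged, $ip) {
              return isset($flagged[$ip]);
          }
          ```

          The forum template would print hidden_trap_link() on each page, the trap page would call flag_bot() with $_SERVER['REMOTE_ADDR'], and the view counter would only increment when is_flagged() returns false. A bot is only flagged after it has followed the link, so its first few page views may still be counted.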

          Of course as this question was asked almost two weeks ago I don't know if the original poster is even still looking for a solution??? 😕

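            Why not use robots.txt? It will obviously not stop malicious robots, email address harvesters, etc. from scanning your site anyway, but you should be able to keep any well-behaved search engine bot from going where it shouldn't go. Note that the example below blocks the entire site for every bot; narrow the Disallow path to just the pages you don't want crawled.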

            User-agent: *
            Disallow: /
            