Hey guys.

I had a guy write this code for me a while back, before I started learning PHP. The script is a stats tracker, but it must have some problems. To see if it was working, I started using Google Analytics as well. Google Analytics says I get 100 visits a day, while the home-brew version says I get 1000. I'm inclined to believe Google's numbers, but I want to fix the original script that was made for me. So here is my question:

Is it possible to determine that it is Google's or Yahoo's spider checking out my site?
I'm thinking something like if(Google or Yahoo){don't count};

Is there any way to do this?

Thanks fellas and Happy New Year !!

    Not guaranteed; off-the-shelf stats services like Google Analytics usually work by putting some JavaScript on your site, so they only count browsers which support JavaScript and choose to allow it in that context.

    Robots don't normally execute JavaScript at all, so they never show up in those counts. Moreover, many bots are "evil" bots which neither respect the robots exclusion protocol nor identify themselves as robots.

    You can match the request's user-agent against those of known robots, but beyond that there's little you can do: "evil" bots will usually claim to be a web browser and come from arbitrary (rather than well-known) IP addresses.

    Googlebot and other "good" robots are well documented in their user-agent string and originating IP.
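
    Something along these lines would do as a first pass -- just a sketch, and the user-agent substrings below are only the big, well-known crawlers (Googlebot for Google, Slurp for Yahoo, msnbot for MSN, ia_archiver for Alexa); extend the list with whatever else turns up in your logs:

        <?php
        // Sketch only: does the user-agent contain the signature of a well-known crawler?
        function is_known_bot($userAgent)
        {
            $botSignatures = array('Googlebot', 'Slurp', 'msnbot', 'ia_archiver');

            foreach ($botSignatures as $signature) {
                if (stripos($userAgent, $signature) !== false) {
                    return true;    // looks like a known spider
                }
            }
            return false;           // treat it as a real visitor
        }

        // In the stats script, skip the count when a known bot comes by:
        $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        if (!is_known_bot($agent)) {
            // ...record the visit as usual...
        }
        ?>

    This won't catch bots that lie about their user-agent, but those are exactly the ones you can't reliably catch anyway.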

    Mark

      OK...so smaller or malicious bots/spiders aside, you think I should look for Google's spider's IP address and not count it when it visits...is that correct?

        You can filter out most of the decent spiders by examining the User-Agent header. Google for some common ones.... Google even has this page to help you block Googlebot.

          OK...here's a question. Is it possible to track how many times Google or MSN or Yahoo or AltaVista spiders my site? I think this could also be useful information.

            You could install BBClone 0.49b on your site. I have it running on all three of mine. It keeps track of robots/spiders in red and regular visitors in blue.

              Yeah I could, but I want to make my original script do this, not add another counter to the mix....Any other thoughts?

                This is from Google:
                The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

                So if you want to ID a Google spider/bot, look at the user-agent. If it contains the string "Googlebot", it's most likely Google (or another bot pretending to be Google, which I'd guess is very unlikely). The $_SERVER superglobal, which you're probably familiar with, contains the user-agent information you need.

                  mikell wrote:

                  This is from Google:
                  The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

                  So if you want to ID a Google spider/bot, look at the user-agent. If it contains the string "Googlebot", it's most likely Google (or another bot pretending to be Google, which I'd guess is very unlikely). The $_SERVER superglobal, which you're probably familiar with, contains the user-agent information you need.

                  Thank you for the tip, but what about the other bots, like those from Yahoo or MSN or AOL?

                    I read on another forum that I might want to think about segregating by IP address. I was directed to a website that has something like 300 IPs on file for these spiders. Unfortunately, I have no idea how to code a simple script to read an IP and add it to a counter.... Is the IP logged somehow through the $_SERVER variable?

                      And what happens when Google buys another IP range? Or Yahoo? Or MSN? Why check for several ranges that Google might use when you could instead do a simple string match for "Googlebot" in the user agent?

                      shrug To answer your question, the client's IP address is stored in $_SERVER['REMOTE_ADDR']; more information about this superglobal (and the others) can be found here: [man]variables.predefined[/man].
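
                      If it helps, a bare-bones version of "read the IP and log it" could look like the sketch below. The file name spider_hits.txt is just an example, not something your existing script uses -- you'd presumably write to whatever table or file your tracker already keeps:

                          <?php
                          // Sketch: append the visitor's IP and user-agent to a log file,
                          // so spider visits can be tallied later.
                          $ip    = $_SERVER['REMOTE_ADDR'];
                          $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';

                          $line = date('Y-m-d H:i:s') . "\t" . $ip . "\t" . $agent . "\n";
                          file_put_contents('spider_hits.txt', $line, FILE_APPEND);
                          ?>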

                        That makes really solid sense...I like that idea and will try going with that. Thank you. When I run into coding snags, I will post here again and hopefully I can get some more help.

                          This thread should be stickied. Extremely useful. 🙂

                            Yeah, I agree...A lot of webmasters might find this sort of info on spiders and bots very useful.

                              I have been storing unique user agents on my forum today. I know that at least 2 visitors were spider bots, but they don't show up in my database.

                              A typical entry in the agent column is

                              Mozilla/5.0 (Windows; U; Win 9x 4.90; de; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11
                              

                              Can someone tell me what I should see if it is a Google or Yahoo bot, etc.?

                              Thanks

                                You might consider checking whether the IP is listed on iplists.com, or doing a reverse lookup on dnsstuff.com.

                                  I use network-tools on my forum, though my database does not have an entry for this IP.

                                  I think I will give it another day of gathering unique hits, then have a look over and see if any bots have been added.
                                  That iplists link will come in very handy, thanks.

                                    11 days later

                                    I use

                                                gethostbyaddr($ip)
                                    

                                     Usually you will get a result like 123.123.122.222.googlebot.com or 123.123.123.123.somecrawler.whatever.net.
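
                                     If you want to automate that check, something along these lines should do it -- only a sketch, and the crawler domains listed are just what I'd expect for the big three, so double-check them. Also, gethostbyaddr() can be slow, so you probably wouldn't run it on every single hit:

                                         <?php
                                         // Sketch: reverse-resolve the IP and see whether the host name
                                         // ends in a known crawler domain. gethostbyaddr() returns the IP
                                         // unchanged if the lookup fails.
                                         $ip   = $_SERVER['REMOTE_ADDR'];
                                         $host = gethostbyaddr($ip);

                                         $crawlerDomains = array('googlebot.com', 'crawl.yahoo.net', 'search.msn.com');

                                         $isCrawler = false;
                                         foreach ($crawlerDomains as $domain) {
                                             if (substr($host, -strlen($domain)) === $domain) {
                                                 $isCrawler = true;
                                                 break;
                                             }
                                         }

                                         // For a stricter check you could also forward-resolve $host with
                                         // gethostbyname() and make sure it maps back to the same IP.
                                         ?>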

                                       Woocha, are you still checking the IP range instead of the bot string?