They're at "80legs dot com" of course. They claim to have the fastest web crawler ever.

And Saturday night they came partying and bashed our site to death. Our "support" did us the dubious favor of chmodding index.php to 000, where it remained for approximately 28 hours!

Mostly, I'm curious about their business model. They say "customize your web-crawls to extract data" and "our #1 goal is getting you access to web-scale data".

With Google, Bing, Yahoo, Baidu, etc. we know what we're getting from letting their spider in --- I can't determine if we're getting any value at all from "80legs".

Does anyone have experience with 80legs? Thoughts?

    Never heard of it - just searched the logs for my sites and it's not showing up on any of them.

      It looks like maybe they provide their spider as a service to customers, so it may be in part a result of some customer of theirs hitting your site (maliciously or not).

        Yes, true, NogDog. The good news is that they say they obey the robots.txt protocol, so we can crawl-delay them, or disallow (I picked the former, but VP decided the latter). We can't predict who their customers are, or what they'll be doing with their data, so we can't see any value in them coming 'round...
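
        Something along these lines should do it --- the "008" user-agent token is just what their spider docs seem to use (verify it against your own logs), and the 30-second delay is an arbitrary example:

        # Throttle the 80legs spider ("008" per their docs -- confirm against your access logs)
        User-agent: 008
        Crawl-delay: 30
        Disallow:

        # ...or, the option the VP went with: keep it out entirely
        # User-agent: 008
        # Disallow: /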

        Interesting stuff, though: 50,000 machines sending the requests, supposedly impossible to block, 6th place in Rice University's business competition, and more.

        I guess I wish 'em luck going against Google though --- of course, they're not, really. Trying another niche.

          Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK, both were a couple of years ago, so they've at least been around for a while.

          According to their website, changes to robots.txt don't take effect immediately, only gradually, so if you're getting whacked you might just have to put up with it for the short term :-/

            Ashley Sheridan;11002112 wrote:

            Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK, both were a couple of years ago, so they've at least been around for a while.

            According to their website, changes to robots.txt don't take effect immediately, only gradually, so if you're getting whacked you might just have to put up with it for the short term :-/

            I think they gave up after our site went 403. :queasy:

            If you've only got 2 hits, consider yourself either lucky or in need of a better layout (I'd guess it's the former; I doubt your site has significant issues otherwise 😉 ).

            Once they figured out that they could pull every item in the DB with the 'b' var in the query string, they must've set a rather large set of boxen to attempt all of 'em. We saw consecutive requests from different IPs walking a large range of numbers: b=foo, b=foo+1, b=foo+2 ... On top of the fact that our scripts aren't the quickest on the block, I've been doing some new work of the "we couldn't find this, but we found these instead" kind, and it may need some refining as well. CentOS was reporting load averages in the 200-500 range for about 45 minutes that night. :eek:

              I think I was lucky, my site is fairly simple and easy for a bot to navigate. It's most likely that 80legs was following a couple of links from elsewhere to my site, and then stopped after one page?

                Hmm, possibly. You'd think, though, that it would follow any links it could. Of course, it kind of appears that the company designed it for "deep data", so they may have decided there wasn't much of that type of stuff on yours. They certainly tried to delve too deep and too quickly on ours ... those greedy Dwarves! 😃

                  I guess it depends what it was configured to do really. I can't tell what it was trying to hit specifically because it was so long ago and I don't have complete data to look at :-/

                    dalecosp wrote:

                    CentOS was reporting load averages in the 200-500 range for about 45 minutes that night. :eek:

                    Well, they are supposed to have the fastest web crawler ever...

                      Weedpacket;11002167 wrote:

                      Well, they are supposed to have the fastest web crawler ever...

                      Yes; how unfortunate they unknowingly put it into a tarpit last weekend... :rolleyes:

                        dalecosp;11002250 wrote:

                        Yes; how unfortunate they unknowingly put it into a tarpit last weekend... :rolleyes:

                        Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).
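
                        A rough sketch of what I'm picturing --- the /trap/ path, trap.php name, and rewrite rule are all hypothetical, just one way to route every URL to the same script:

                        <?php
                        // trap.php -- assumes something like "RewriteRule ^trap/ /trap.php [L]"
                        // in .htaccess, so every URL under /trap/ lands here. Each "page" links
                        // to more /trap/ URLs, so the crawlable graph never ends. Seeding the
                        // RNG from the request URI makes a given URL always render the same page.
                        srand(crc32($_SERVER['REQUEST_URI']));

                        sleep(1); // don't let the crawler swamp our own box

                        $links = '';
                        for ($i = 0; $i < 10; $i++) {
                            $n = rand(); // each link leads to yet another generated page
                            $links .= "<li><a href=\"/trap/$n\">Page $n</a></li>";
                        }

                        echo '<html><head><title>Page ' . rand() . '</title></head><body>';
                        echo '<p>' . str_repeat('Lorem ipsum dolor sit amet. ', rand(5, 50)) . '</p>';
                        echo "<ul>$links</ul>";
                        echo '</body></html>';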

                        :evilgrin:

                          NogDog;11002377 wrote:

                          Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).

                          :evilgrin:

                          Mezzacotta comic for 15 December, 1098

                            2 months later

                            Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

                            SetEnvIfNoCase User-Agent "Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
                            Order Allow,Deny
                            Allow from all
                            Deny from env=bad_bot

                            Other more "creative" ways to block malicious spiders or bots can be found here:
                            http://www.kloth.net/internet/bottrap.php

                              getout;11006591 wrote:

                              Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

                              SetEnvIfNoCase User-Agent "Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
                              Order Allow,Deny
                              Allow from all
                              Deny from env=bad_bot

                              Other more "creative" ways to block malicious spiders or bots can be found here:
                              http://www.kloth.net/internet/bottrap.php

                              Sorry to hear you had to deal with this too.

                              Other more "creative" ways to block malicious spiders or bots can be found here: http://www.kloth.net/internet/bottrap.php

                              Thanks - I think I've read the bot trap page before.

                              The .htaccess hack does sound pretty good. Mightn't a regexp only on the bot's information page URI give you just as good a result, and be "future-proof" in the event they rebuild their UA with a newer Gecko version, etc.?
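
                              Something like this, maybe (untested --- same Allow/Deny dance, just keying on the spider-info URL rather than the full UA string):

                              SetEnvIfNoCase User-Agent "80legs\.com/spider" bad_bot
                              Order Allow,Deny
                              Allow from all
                              Deny from env=bad_bot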
