They're at "80legs dot com" of course. They claim to have the fastest web crawler ever.

And Saturday night they came partying and bashed our site to death. Our "support" did us the dubious favor of chmodding index.php to 000, where it remained for approximately 28 hours!

Mostly, I'm curious about their business model. They say "customize your web-crawls to extract data" and "our #1 goal is getting you access to web-scale data".

With Google, Bing, Yahoo, Baidu, etc. we know what we're getting from letting their spider in --- I can't determine if we're getting any value at all from "80legs".

Does anyone have experience with 80legs? Thoughts?

    Never heard of it - just searched the logs for my sites and it's not showing up on any of them.

      It looks like maybe they provide their spider as a service to customers, so it may be in part a result of some customer of theirs hitting your site (maliciously or not).

        Yes, true, NogDog. The good news is that they say they obey the robots.txt protocol, so we can crawl-delay them, or disallow (I picked the former, but VP decided the latter). We can't predict who their customers are, or what they'll be doing with their data, so we can't see any value in them coming 'round...
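
        Something along these lines should do it --- the "008" user-agent token is just what their spider docs seem to use (verify it against your own logs), and the 30-second delay is an arbitrary example:

        # Throttle the 80legs spider ("008" per their docs -- confirm against your access logs)
        User-agent: 008
        Crawl-delay: 30
        Disallow:

        # ...or, the option the VP went with: keep it out entirely
        # User-agent: 008
        # Disallow: /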

        Interesting stuff, though: 50,000 machines sending the requests, supposedly impossible to block, 6th place in Rice University's business competition, and more.

        I guess I wish 'em luck going against Google though --- of course, they're not, really. Trying another niche.

          Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK, both were a couple of years ago, so they've at least been around for a while.

          According to their website, changes to robots.txt don't take effect immediately, only gradually, so if you're getting whacked you might just have to put up with it for the short term :-/

            Ashley Sheridan;11002112 wrote:

            Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK, both were a couple of years ago, so they've at least been around for a while.

            According to their website, changes to robots.txt don't take effect immediately, only gradually, so if you're getting whacked you might just have to put up with it for the short term :-/

            I think they gave up after our site went 403. :queasy:

            If you've only got 2 hits, consider yourself either lucky or in need of a better layout (I'd guess it's the former; I doubt your site has significant issues otherwise 😉 ).

            Once they figured out that they could pull every item in the DB with the 'b' var in the query string, they must've set a rather large set of boxen to attempt all of 'em. We saw consecutive requests from different IPs walking a large range of numbers: b=foo, b=foo+1, b=foo+2 ... On top of the fact that our scripts aren't the quickest on the block, I've been doing some new work of the "we couldn't find this, but we found these instead" kind, and it may need some refining as well. CentOS was reporting load averages in the 200-500 range for about 45 minutes that night. :eek:

              I think I was lucky, my site is fairly simple and easy for a bot to navigate. It's most likely that 80legs was following a couple of links from elsewhere to my site, and then stopped after one page?

                Hmm, possibly. You'd think, though, that it would follow any links it could. Of course, it kind of appears that the company designed it for "deep data", so they may have decided there wasn't much of that type of stuff on yours. They certainly tried to delve too deep and too quickly on ours ... those greedy Dwarves! 😃

                  I guess it depends what it was configured to do really. I can't tell what it was trying to hit specifically because it was so long ago and I don't have complete data to look at :-/

                    dalecosp wrote:

                    CentOS was reporting load averages in the 200-500 range for about 45 minutes that night. :eek:

                    Well, they are supposed to have the fastest web crawler ever...

                      Weedpacket;11002167 wrote:

                      Well, they are supposed to have the fastest web crawler ever...

                      Yes; how unfortunate they unknowingly put it into a tarpit last weekend... :rolleyes:

                        dalecosp;11002250 wrote:

                        Yes; how unfortunate they unknowingly put it into a tarpit last weekend... :rolleyes:

                        Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).
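
                        A rough sketch of what I'm picturing --- the /trap/ path, trap.php name, and rewrite rule are all hypothetical, just one way to route every URL to the same script:

                        <?php
                        // trap.php -- assumes something like "RewriteRule ^trap/ /trap.php [L]"
                        // in .htaccess, so every URL under /trap/ lands here. Each "page" links
                        // to more /trap/ URLs, so the crawlable graph never ends. Seeding the
                        // RNG from the request URI makes a given URL always render the same page.
                        srand(crc32($_SERVER['REQUEST_URI']));

                        sleep(1); // don't let the crawler swamp our own box

                        $links = '';
                        for ($i = 0; $i < 10; $i++) {
                            $n = rand(); // each link leads to yet another generated page
                            $links .= "<li><a href=\"/trap/$n\">Page $n</a></li>";
                        }

                        echo '<html><head><title>Page ' . rand() . '</title></head><body>';
                        echo '<p>' . str_repeat('Lorem ipsum dolor sit amet. ', rand(5, 50)) . '</p>';
                        echo "<ul>$links</ul>";
                        echo '</body></html>';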

                        :evilgrin:

                          NogDog;11002377 wrote:

                          Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).

                          :evilgrin:

                          Mezzacotta comic for 15 December, 1098

                            2 months later

                            Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

                            SetEnvIfNoCase User-Agent "Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
                            Order Allow,Deny
                            Allow from all
                            Deny from env=bad_bot

                            Other more "creative" ways to block malicious spiders or bots can be found here:
                            http://www.kloth.net/internet/bottrap.php

                              getout;11006591 wrote:

                              Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

                              SetEnvIfNoCase User-Agent "Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
                              Order Allow,Deny
                              Allow from all
                              Deny from env=bad_bot

                              Other more "creative" ways to block malicious spiders or bots can be found here:
                              http://www.kloth.net/internet/bottrap.php

                              Sorry to hear you had to deal with this too.

                              Other more "creative" ways to block malicious spiders or bots can be found here: http://www.kloth.net/internet/bottrap.php

                              Thanks - I think I've read the bot trap page before.

                              The .htaccess hack does sound pretty good. Mightn't a regexp only on the bot's information page URI give you just as good a result, and be "future-proof" in the event they rebuild their UA with a newer Gecko version, etc.?
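
                              Something like this, maybe (untested --- same Allow/Deny dance, just keying on the spider-info URL rather than the full UA string):

                              SetEnvIfNoCase User-Agent "80legs\.com/spider" bad_bot
                              Order Allow,Deny
                              Allow from all
                              Deny from env=bad_bot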
