Anyone heard of "80legs" web crawler?

  1. #1
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623

    Anyone heard of "80legs" web crawler?

    They're at "80legs dot com" of course. They claim to be the fastest web-crawler ever.

    And Saturday night they came partying and bashed our site to death. Our "support" did us the dubious favor of chmodding index.php to 000, where it remained for approximately 28 hours!

    Mostly, I'm curious about their business model. They say "customize *your* web-crawls to extract data" and "our #1 goal is getting you access to web-scale data".

    With Google, Bing, Yahoo, Baidu, etc. we know what we're getting from letting their spider in --- I can't determine if we're getting any value at all from "80legs".

    Does anyone have experience with 80legs? Thoughts?
    /!!\ mysql_ is deprecated --- don't use it! Tell your hosting company you will switch if they don't upgrade! /!!!\ ereg() is deprecated --- don't use it!

    dalecosp "God doesn't play dice." --- Einstein "Perl is hardly a paragon of beautiful syntax." --- Weedpacket

    Getting Help at All --- Collected Solutions to Common Problems --- Debugging 101 --- Unanswered Posts --- OMBE: Office Machines, Business Equipment

  2. #2
    Senior Member Derokorian's Avatar
    Join Date
    Apr 2011
    Location
    Denver
    Posts
    1,740
    Never heard of it - just searched the logs for my sites and it's not showing up on any of them.
    Sadly, nobody codes for anyone on this forum. People taste your dishes and tell you what is missing, but they don't cook for you. ~anoopmail
    I'd rather be a comma, then a full stop.
    User Authentication in PHP with MySQLi - Don't forget to mark threads resolved - MySQL(i) warning

  3. #3
    High Energy Magic Dept. NogDog's Avatar
    Join Date
    Aug 2006
    Location
    Ankh-Morpork
    Posts
    13,820
    It looks like maybe they provide their spider as a service to customers, so it may be in part a result of some customer of theirs hitting your site (maliciously or not).
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be." ~ from Nation, by Terry Pratchett

    "But the main reason that any programmer learning any new language thinks the new language is SO much better than the old one is because hes a better programmer now!" ~ http://www.oreillynet.com/ruby/blog/...ck_to_p_1.html


    eBookworm.us

  4. #4
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623
    Yes, true, NogDog. The good news is that they say they obey the robots.txt protocol, so we can crawl-delay them, or disallow (I picked the former, but VP decided the latter). We can't predict who their customers are, or what they'll be doing with their data, so we can't see any value in them coming 'round...
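
    For reference, a minimal sketch of the two robots.txt options mentioned above (throttle with Crawl-delay, or disallow outright), checked with Python's standard urllib.robotparser. The user-agent token "008" is an assumption based on the "008/0.83" product token in the crawler's user-agent string; check the crawler's own documentation for the token it actually honors. Crawl-delay is a non-standard directive, so not every crawler respects it.

```python
from urllib.robotparser import RobotFileParser

# Option 1: let the crawler in, but ask it to wait 10 seconds between requests.
# The token "008" is an assumption taken from the crawler's user-agent string.
THROTTLE = """\
User-agent: 008
Crawl-delay: 10
"""

# Option 2: keep the crawler out of the whole site.
BLOCK = """\
User-agent: 008
Disallow: /
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text so a policy can be checked before deploying it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    url = "http://example.com/index.php?b=1"  # placeholder URL
    throttle, block = load_policy(THROTTLE), load_policy(BLOCK)
    print(throttle.crawl_delay("008"), throttle.can_fetch("008", url))
    print(block.can_fetch("008", url), block.can_fetch("Googlebot", url))
```

    Either way this only works for crawlers that actually read and obey robots.txt; it is a request, not an enforcement mechanism.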

    Interesting stuff though. 50000 machines sending the requests, impossible to block @firewall, 6th place in Rice University's business competition, and more.

    I guess I wish 'em luck going against Google though --- of course, they're not, really. Trying another niche.

  5. #5
    Senior Member
    Join Date
    Aug 2008
    Location
    London, UK
    Posts
    753
    Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK. Both were a couple of years ago, so they've at least been around for a while.

    According to their website, a robots.txt won't take immediate effect, but a gradual one, so if you're getting whacked you might just have to put up with it for the short term :-/
    Ashley Sheridan
    www.ashleysheridan.co.uk

  6. #6
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623
    Quote Originally Posted by Ashley Sheridan
    Just checked my own stats and I've had exactly 2 hits from these guys, one from the US and the other from the UK. Both were a couple of years ago, so they've at least been around for a while.

    According to their website, a robots.txt won't take immediate effect, but a gradual one, so if you're getting whacked you might just have to put up with it for the short term :-/
    I think they gave up after our site went 403.

    If you've only got 2 hits, either you're lucky or you need to improve your layout (I'd guess it's "lucky"; I doubt your site has significant issues otherwise).

    Once they figured out that they could pull every item in the DB with the 'b' var in the query string, they must've set a rather large set of boxen to attempt all of 'em. We saw consecutive requests from different IPs for a large range of numbers at b=foo, b=foo++, b=foo++ ... In addition to the fact that our scripts aren't the quickest on the block already, I've been doing some new work of the "we couldn't find this, but we found these instead" kind, and it may need some refining also. CentOS was reporting load averages in the 200-500 range for about 45 minutes that night.
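
    A rough sketch of how that pattern could be spotted in a request log: walk the requests in arrival order and report runs of consecutive 'b' values, along with how many distinct IPs took part in each run. The function name, the (ip, url) input shape, and the min_length threshold are all hypothetical; only the 'b' query variable comes from the description above.

```python
from urllib.parse import urlparse, parse_qs


def sequential_runs(requests, min_length=5):
    """Find runs of consecutive ?b=N requests.

    requests: iterable of (ip, url) tuples in arrival order (hypothetical shape).
    Returns a list of (first_b, last_b, distinct_ip_count) for each run of at
    least min_length consecutive values.
    """
    runs = []
    start = prev = None
    ips = set()
    for ip, url in requests:
        values = parse_qs(urlparse(url).query).get("b")
        if not values or not values[0].isdigit():
            continue  # not an item request; ignore it
        b = int(values[0])
        if prev is not None and b == prev + 1:
            ips.add(ip)  # run continues
        else:
            # run broken: record the previous one if it was long enough
            if prev is not None and prev - start + 1 >= min_length:
                runs.append((start, prev, len(ips)))
            start, ips = b, {ip}
        prev = b
    if prev is not None and prev - start + 1 >= min_length:
        runs.append((start, prev, len(ips)))
    return runs
```

    A long run spread over many IPs is the signature described here: per-IP rate limits never trip because no single address sends more than a handful of requests.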

  7. #7
    Senior Member
    Join Date
    Aug 2008
    Location
    London, UK
    Posts
    753
    I think I was lucky, my site is fairly simple and easy for a bot to navigate. It's most likely that 80legs was following a couple of links from elsewhere to my site, and then stopped after one page?
    Ashley Sheridan
    www.ashleysheridan.co.uk

  8. #8
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623
    Hmm, possibly. You'd think, though, that it would follow any links it could. Of course, it kind of appears that the company designed it for "deep data", so they may have decided there wasn't much of that type of stuff on yours. They certainly tried to delve too deep and too quickly on ours ... those greedy Dwarves!

  9. #9
    Senior Member
    Join Date
    Aug 2008
    Location
    London, UK
    Posts
    753
    I guess it depends what it was configured to do really. I can't tell what it was trying to hit specifically because it was so long ago and I don't have complete data to look at :-/
    Ashley Sheridan
    www.ashleysheridan.co.uk

  10. #10
    Pedantic Curmudgeon Weedpacket's Avatar
    Join Date
    Aug 2002
    Location
    General Systems Vehicle "Thrilled To Be Here"
    Posts
    21,773

    What is that even supposed to mean?

    Quote Originally Posted by dalecosp
    CentOS was reporting load averages in the 200-500 range for about 45 minutes that night.
    Well, they are supposed to have the fastest web crawler ever...
    THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER
    FAQs! FAQs! FAQs! Most forums have them!
    Search - Debugging 101 - Collected Solutions - General Guidelines - Getting help at all

  11. #11
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623
    Quote Originally Posted by Weedpacket
    Well, they are supposed to have the fastest web crawler ever...
    Yes; how unfortunate they unknowingly put it into a tarpit last weekend...

  12. #12
    High Energy Magic Dept. NogDog's Avatar
    Join Date
    Aug 2006
    Location
    Ankh-Morpork
    Posts
    13,820
    Quote Originally Posted by dalecosp
    Yes; how unfortunate they unknowingly put it into a tarpit last weekend...
    Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).
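
    Purely as an illustration of that idea, here is a minimal sketch in Python rather than PHP: every path deterministically yields a page of links to further made-up paths, so the link graph is effectively endless without storing anything. The function name and the /maze/ prefix are hypothetical, and the one-second delay mirrors the sleep(1) suggestion above.

```python
import random
import time


def tarpit_page(path: str, links: int = 5, delay: float = 1.0) -> str:
    """Return an HTML page whose links lead only to more generated pages.

    Seeding the generator with the path makes each page stable across
    requests, so a crawler sees a consistent (if pointless) site.
    """
    time.sleep(delay)  # throttle so the trap does not swamp its own host
    rng = random.Random(path)
    hrefs = ["/maze/%08x" % rng.getrandbits(32) for _ in range(links)]
    items = "\n".join('<li><a href="%s">%s</a></li>' % (h, h) for h in hrefs)
    return "<html><body><ul>\n%s\n</ul></body></html>" % items


if __name__ == "__main__":
    print(tarpit_page("/maze/start", links=3, delay=0))
```

    In a real deployment the URL rewriting would route every request under the prefix to this one handler, and robots.txt would disallow the prefix so that well-behaved crawlers never enter it.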


  13. #13
    Pedantic Curmudgeon Weedpacket's Avatar
    Join Date
    Aug 2002
    Location
    General Systems Vehicle "Thrilled To Be Here"
    Posts
    21,773
    Quote Originally Posted by NogDog
    Ooh...If I had idle time on my hands I'd buy a domain, and use some url-rewriting and a bit of creative PHP coding to generate random pages that include random links, with the idea of generating some potentially huge number of pages (maybe putting a sleep(1) in each so they don't swamp it too much).

    Mezzacotta comic for 15 December, 1098

  14. #14
    Junior Member
    Join Date
    Jun 2012
    Posts
    1

    Pleh...

    Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

    SetEnvIfNoCase User-Agent "^Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

    Other more "creative" ways to block malicious spiders or bots can be found here:
    http://www.kloth.net/internet/bottrap.php

  15. #15
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    7,623
    Quote Originally Posted by getout
    Came into the office this morning to find this bot ravaging our Magento site. We have hundreds upon hundreds of pages and alternate pages of pages, so (since this spider ignores most robots.txt rules) I dug up a way to block it entirely and immediately using .htaccess:

    SetEnvIfNoCase User-Agent "^Mozilla/5\.0 \(compatible; 008/0\.83; http://www\.80legs\.com/spider\.html;\) Gecko/2008032620" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

    Other more "creative" ways to block malicious spiders or bots can be found here:
    http://www.kloth.net/internet/bottrap.php
    Sorry to hear you had to deal with this too.

    Other more "creative" ways to block malicious spiders or bots can be found here: http://www.kloth.net/internet/bottrap.php
    Thanks - I think I've read the bot trap page before.

    The .htaccess hack does sound pretty good. Mightn't a regexp only on the bot's information page URI give you just as good a result, and be "future proof" in the event they rebuild their UA with a newer Gecko version, etc.?
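
    To make the trade-off concrete, here is a small sketch using Python's re module as a stand-in for Apache's regex engine (the two agree on the constructs used here, but treat that as an assumption and test on your own server). It also shows a pitfall with full-string patterns: unescaped parentheses form a regex group instead of matching the literal "(" and ")" in the user-agent, so they have to be escaped. NEWER_UA is a made-up string used only to mimic a future rebuild of the UA.

```python
import re

# User-agent as quoted in this thread.
POSTED_UA = ("Mozilla/5.0 (compatible; 008/0.83; "
             "http://www.80legs.com/spider.html;) Gecko/2008032620")
# Hypothetical future variant: different version and info page.
NEWER_UA = ("Mozilla/5.0 (compatible; 008/0.90; "
            "http://www.80legs.com/bot.html) Gecko/2010010100")
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"

# Full string with parentheses left unescaped: they act as a group.
unescaped = re.compile("^" + POSTED_UA, re.IGNORECASE)
# Full string with every metacharacter escaped: exact match only.
exact = re.compile("^" + re.escape(POSTED_UA), re.IGNORECASE)
# Match only on the info-page host, as suggested above.
loose = re.compile(r"80legs\.com", re.IGNORECASE)

if __name__ == "__main__":
    for name, pattern in (("unescaped", unescaped), ("exact", exact), ("loose", loose)):
        hits = [bool(pattern.search(ua)) for ua in (POSTED_UA, NEWER_UA, BROWSER_UA)]
        print(name, hits)
```

    The loose pattern keeps matching when the version numbers or the info page change, at the cost of also catching anything else that happens to mention the host in its user-agent.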
