I'm working on a new server for reasons I can't discuss right now.

I've got Apache running, and it's not particularly firewalled, so anyone can reach it. Maybe I should lock it down for the time being, but I'm kind of afraid I'd punch DEPLOY and forget to change the ruleset back ... 😛

Anyway, the script_kiddiez are always looking for phpMyAdmin, admin.php, shell.php, log.php, hell.php, java.php (WTF?), x.php, z.php, etc., etc.

In the past what I've done is log the IP address and have it added to the firewall by a cronjob.
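
Roughly, the PHP side of that just appends the client IP to a list the cron job picks up. A minimal sketch (the log path is a placeholder, and the cron half simply feeds these lines to the firewall):

// Sketch: append the offending IP to a file; a cron job later adds
// these addresses to the firewall. The path below is a placeholder.
$banned_ip_log = '/var/log/myapp/banned_ips.txt';

$ip = $_SERVER['REMOTE_ADDR'] ?? '';
if ($ip !== '') {
    // one IP per line; LOCK_EX keeps concurrent hits from interleaving
    file_put_contents($banned_ip_log, $ip . PHP_EOL, FILE_APPEND | LOCK_EX);
}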

I want to do that once this box goes into production. I was thinking for now I might just toy with them.

Should I header them somewhere? A large file? That might hurt the site I send them to.

What about just

sleep($big_seconds);
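
Something like this, say (just a sketch; the delay and the bland status afterwards are arbitrary):

// Sketch of the "toy with them" idea: stall the request, then hand
// back something deliberately boring.
$big_seconds = 300;       // placeholder delay

sleep($big_seconds);

http_response_code(404);  // bland on purpose
exit;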

Thoughts?

    Eh ... don't waste your time (or bandwidth) on them. Just send them an HTTP 402 and be done with them. 🙂

    Yeah ... playing with them (say, a redirect loop) would still have an impact on your site. If this is someone sitting at a keyboard trying stuff, then a big sleep followed by a 402 would do it (I was thinking 451 myself). But an automated attack that makes thousands of requests would leave you with a lot of sleeping processes.

    Picking a large file somewhere else and redirecting them to that would just move the impact there.

    But "random file somewhere else" made me think of https://randomyoutube.net/api

      Well, if this site were currently "live" or "in production", you can bet I wouldn't. I'm saving the logs for my FW reporting script, and I could hardly wait to mess with them if I could think of a good quick hack. Not to mention it might qualify as free load testing, eh?

      "Picking a large file somewhere else and redirecting them would just move the impact there."

      Yeah, I'd thought about that, and so I wasn't going to do it unless it was one of the Big Boys who might not notice. Say, an obscure product on Amazon, or some PDF on MSFT ....

      Thanks for the randomyoutube.net link though ... surely that can come in handy for something, someday.

        Just a thought: is there some way to get a list of sites that are actively hosting exploits? I.e., maybe you could send the script kiddies off to get hacked by real pros, with some kind of random redirect into a malware list.

        sneakyimp

        Hmm...I personally wouldn't want to risk getting lawyers involved, especially if you send them there on a false positive -- and even if it's not a false positive, you might still be liable.

          VERY good point. Maybe send them to chatroulette.com?

            In the past I've sometimes 302'ed to fbi.gov ...
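
            Which is just a one-liner, if anyone's curious (header() sends a 302 by default when you set a Location):

            header('Location: https://www.fbi.gov/');
            exit;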

              5 days later

              It's looking like my mysterious bottleneck may be related to some poorly behaved bots, so my interest in this question has been re-piqued. I would point out a few things:
              1) A single IP might represent a large number of users, most of whom are well-behaved, e.g. a Starbucks or airport WiFi network, a cellular carrier, or a large university. Simply adding the IP to the firewall might ban innocent folks.
              2) The moment the request is made, you have a connection to the culprit. Acting right then and there seems like the right thing to do.
              3) A bland response is probably best. And I mean really bland, like a 403 or 404 or 410.

              Would it make any sense to set a cookie so you can identify this culprit again? Surely they can easily drop the cookie on their end, but if they come back with the same session ID, that info might be useful.
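
              What I had in mind is something like this (a sketch; the cookie name and lifetime are arbitrary):

              // Sketch: tag a suspected scanner with a long-lived cookie so repeat
              // visits can be correlated even if the IP changes. They can drop it
              // on their end, of course.
              $tag_name = 'srv_tag';   // arbitrary cookie name

              if (empty($_COOKIE[$tag_name])) {
                  $tag = bin2hex(random_bytes(16));   // PHP 7+
                  setcookie($tag_name, $tag, time() + 86400 * 365, '/');
              } else {
                  // seen this tag before: log it alongside the current IP
                  error_log('repeat suspicious visitor ' . $_COOKIE[$tag_name]
                      . ' from ' . ($_SERVER['REMOTE_ADDR'] ?? 'unknown'));
              }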

                Is there somewhere to check a particular bot to see if it's badly behaved? I've been googling 'bot check' etc. but get mostly tools for checking social media accounts. Other links are lists on github, etc. but nothing jumps out as authoritative.

                In the meantime, if anyone has feelings about these bots, I'd be curious to hear your thoughts.
                DotBot - 216.244.66.202
                MegaIndex - 46.4.64.197
                BLEXBot - 94.130.237.98
                SemrushBot - 46.229.168.0/24

                At the moment, I'm tempted to just ban these user agents.


                  We've got Semrush blocked with a 503 right at the top of config.php ... a dumb place in theory, but we no longer even consider allowing them access, so it's about the first thing that happens. I can't remember how we decided on that, but that's how it is.
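
                  Roughly this sort of thing, from memory (a sketch, not the exact line):

                   // very early in config.php: refuse SemrushBot outright
                   if (isset($_SERVER['HTTP_USER_AGENT'])
                           && stristr($_SERVER['HTTP_USER_AGENT'], 'semrush')) {
                       header('HTTP/1.0 503 Service Unavailable');
                       exit;
                   }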

                  We have a "bad_bots.php" that's included a little further down in most scripts/pages. Among the highlights:

                   $banned_uas = array(
                      "compatible; synapse",
                      "seokicks",
                      "ahrefs",
                      "linkdex",
                      "hubspot",
                      "phantomjs"
                   );
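
                  A sketch of the check over that array (the loop itself isn't shown here, so the details are assumed):

                   $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

                   foreach ($banned_uas as $needle) {
                       // case-insensitive substring match, same as the other checks
                       if ($ua !== '' && stristr($ua, $needle)) {
                           header('HTTP/1.0 403 Forbidden');
                           exit;
                       }
                   }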

                  403 for them. I seem to remember mentioning PhantomJS earlier ... it may also be blocked in Apache config/.htaccess, so possibly it (and Hubspot) aren't needed in this array. That might also be true for this next one:

                   // old default UA of Go's http client; quietly give those requests nothing
                   if (stristr($_SERVER['HTTP_USER_AGENT'] ?? '', 'Go 1.1 package http')) {
                       header("HTTP/1.0 204 No Content");
                       die();
                   }

                   // one specific stale Chrome build seen in a wave of bogus requests (see below)
                   if (stristr($_SERVER['HTTP_USER_AGENT'] ?? '', "Chrome/60.0.3112.113")) {
                       header("HTTP/1.0 503 Service Unavailable");
                       die();
                   }

                  The second one's interesting, maybe. We were getting lots of bogus simultaneous requests from distributed IPs with that UA string. Given that Chrome forces updates in most environments, and that at the time the policy was written very few of the browser-stats sites reported anyone still on Chrome 60, we put that in place.

                  Finally:

                  // we have three blocks like this, wrapped in tests of sys_getloadavg().
                  // this list is for the lightest loads; if the load avg. is higher these numbers
                  // are *lower*  ($cos == "chance of success")
                  $cos_bing    =
                  $cos_slurp   =
                  $cos_sougou  =
                  $cos_yandex  = 99;
                  $cos_dotbot  = 19;
                  $cos_unknown = 8;
                  $cos_MJ12    = 70;
                  
                   $limits_array = array(
                   
                       // key = UA substring to match, value = which $cos_* bucket applies
                       'bingbot'     => 'bing',
                       'slurp'       => 'slurp',
                       'sogou'       => 'sougou',
                       'sougou'      => 'sougou',
                       'yandex'      => 'yandex',
                       'alexabot'    => 'bing',
                       'MJ12'        => 'MJ12',
                       'dotbot'      => 'dotbot',
                       'mail.ru_bot' => 'dotbot',
                       'netseer'     => 'unknown',
                       'xovi'        => 'unknown',
                       'easou'       => 'sougou',
                       'crawl'       => 'sougou',
                       'spider'      => 'sougou',
                       'iOpus'       => 'dotbot',
                       'seznambot'   => 'sougou'
                   );
                  
                   $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
                   
                   foreach ($limits_array as $ua_fragment => $bucket) {
                   
                       if ($ua === '' || !stristr($ua, $ua_fragment)) {
                           continue;
                       }
                   
                       // "chance of success" for this bucket, set above per load average
                       $cos_var = ${"cos_" . $bucket};
                   
                       if (rand(0, 100) > $cos_var) {
                   
                           header("HTTP/1.0 503 Service Unavailable");
                           die('Server too busy. Please try again later.');
                       }
                   }
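
                  And the sys_getloadavg() wrapper mentioned in that comment is roughly this shape (a sketch; the thresholds here are made up):

                   // Sketch of the load-average gating; thresholds are invented.
                   $load = sys_getloadavg();   // [1-min, 5-min, 15-min averages]
                   
                   if ($load[0] < 2.0) {
                       // light load: the generous numbers shown above
                       $cos_bing = 99;  $cos_dotbot = 19;  $cos_unknown = 8;
                   } elseif ($load[0] < 5.0) {
                       // moderate load: tighter odds for everyone
                       $cos_bing = 80;  $cos_dotbot = 10;  $cos_unknown = 4;
                   } else {
                       // heavy load: almost everyone gets the 503
                       $cos_bing = 25;  $cos_dotbot = 2;   $cos_unknown = 1;
                   }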

                  Dunno that the PHB would really like us ever serving a 503 to Bing/Yahoo, but I don't recall asking; if the load's high enough, there's a chance they'll see one (and roughly a one percent chance that they will even if load is light ... that might bear reconsideration). If Google sees any 5XX it's most likely an actual server problem, and they do seem to think they see some ... I think it's bad chars in the auto-generated page URIs we feed them. (I need to add that to the bug DB, actually ...)

                    Do you ever ban IP blocks or anything like that? I see a few IPs doing some dodgy stuff and I'm tempted to ban them but I don't want to block some giant municipal wifi network entirely if only one person is screwing around.

                    I've also been tempted to ban every server in China and/or Russia and/or Cyprus or Belarus or whatever. These folks should never need our site anyway.

                    EDIT: another question...should we ban or 403 visitors who don't supply any user agent? Is there any risk of excluding honest users?

                    sneakyimp another question...should we ban or 403 visitors who don't supply any user agent? Is there any risk of excluding honest users?

                    Well, a user agent is not required, and if I as a user decide it's none of your business, I'm not sure that's a good criterion for refusing to let me use your site; but maybe that's an edge case so small you could afford to ignore it. (shrug)

                      sneakyimp

                      Good question. I'm not the only one who handles that, and there are other mechanisms in various parts of the octopus 😉 that may ban entire ranges. I'm not sure we've done it with our site, though; it's possible we have.

                      That might be a good job for .htaccess, defining some whatever-they-call-it and "require ip not whatever-its-called"?

                        Regarding the "banning" of IP addresses, I love love love iptables. For those not familiar with it, iptables is a software firewall. It's used by fail2ban and provides tremendous power to level the ban hammer against bad folks. You can use it to truly lock down your server (and you can really shoot yourself in the foot if you aren't careful), but it can also ban folks. I saw today that I wrote some PHP to "ban" SemrushBot and others years ago, but looking at my access logs this week, I see they are still coming around despite the 403 Forbidden and 410 Gone responses. I therefore took the rather bold step of banning the entire IP block they were emanating from:

                        sudo iptables -I INPUT 28 -s 46.229.168.0/24 -j REJECT

                        This instruction inserts a rule at slot #28 that refuses any type of connection (any port, TCP or UDP) from the specified IP range. It takes effect immediately and doesn't even bother being polite about it. I know this is heavy-handed, but it is also gratifying for that very reason.

                        I would also remind everyone of the robot exclusion standard.

                        EDIT: I'd also like to pose another question. What is a polite and effective way to tell the bots (good ones and bad ones alike) that a particular URL request is not for bots? E.g., my site has a lot of pages with a link to a printable version of the page without all the header and footer stuff, i.e., just the important text without CSS and all that. I don't want Googlebot crawling those pages. Is 403 the right move? If I'm not mistaken, that and the robots meta tag are no guarantee that the bot won't crawl the page. I want to prevent bots from even requesting the page, but that would require a very extensive list of URLs in a robots.txt file, because the Robots Exclusion Standard doesn't support regex-type patterns. E.g., this URL is fine:
                        https://example.com/careers/haberdasher-001.php
                        But THIS url should never be visited by a bot:
                        https://example.com/careers/haberdasher-001.php?printable=1
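
                        One thing worth mentioning, though it's not a complete answer (just an illustration): the printable variant could at least announce itself as not-for-indexing, so even if a bot does fetch it, well-behaved ones won't index it:

                        // Sketch: on the ?printable=1 variant, ask well-behaved crawlers
                        // not to index or follow it. This limits indexing; it doesn't
                        // stop the request from being made.
                        if (!empty($_GET['printable'])) {
                            header('X-Robots-Tag: noindex, nofollow');
                        }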

                        sneakyimp EDIT: I'd also like to pose another question. What is a polite and effective way to tell the bots (good ones and bad ones alike) that a particular URL request is not for bots?

                        The "polite" way is robots.txt, but bad ones will ignore that. Then again, bad ones would ignore anything else: if they (or their writers) are determined enough they'll be doing their best to imitate humans.

                        So in other words, letting in legitimate users but keeping out bots is an instance of the Imitation game.

                        On the grounds that being asked "Is Gerry there?" and replying "Who's asking?" is leaving the denial of service too late, you only have the content of the request itself and the address it came from and your decision to respond or not has to be based on that.

                        Bots generally don't make POST requests (unless they're trying to spam a board), but wrenching no-follow links into a shape where they make POSTs instead of GETs seems like a good way to make a hash of a site.

                        Of course, the server itself should be doing this - that's its job; your application shouldn't have to be doing it. Depending on your server there may be something available (e.g. mod_security) - what would be "effective" would be for the server to simply drop the entire connection on the floor without even acknowledging the request. With luck the client will sit spinning its wheels waiting for a response that will never come until it times out.

                        And of course, the web server's job is to serve web pages. It's the firewall's job to block connections. So if you build up a list of banned IPs then that is where you want to deploy them.

                        sneakyimp E.g., my site has a lot of pages with a link to a printable version of the page without all the header and footer stuff, i.e., just the important text without CSS and all that. I don't want Googlebot crawling those pages. Is 403 the right move?

                        More like a 303. But to skew the subject a bit; have you considered using @media rules in your CSS to control how your page is printed in contrast to how it is displayed? Then you don't need to serve an additional "printer-friendly" version.
