Hey everybody!

I spend a fair amount of time looking through my Apache logs, and I always see entries like these that get me thinking about security again:

68.183.193.242 - - [20/Mar/2024:10:59:08 +0000] "GET /Temporary_Listen_Addresses HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:08 +0000] "GET /ews/exchanges/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:08 +0000] "GET /ews/exchange%20/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /ews/exchange/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /ews/%20/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /ews/ews/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /ews/autodiscovers/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /autodiscover/autodiscovers/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /autodiscover/autodiscover%20/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /autodiscover/autodiscoverrs/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
68.183.193.242 - - [20/Mar/2024:10:59:09 +0000] "GET /autodiscove/ HTTP/1.1" 404 1625 "-" "Mozilla/5.0 zgrab/0.x"
135.125.244.48 - - [20/Mar/2024:11:16:16 +0000] "GET /.env HTTP/1.1" 404 397 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
135.125.244.48 - - [20/Mar/2024:11:16:16 +0000] "POST / HTTP/1.1" 404 397 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"

194.38.23.16 - - [20/Mar/2024:14:12:31 +0000] "GET /sites/all/modules/civicrm/packages/OpenFlashChart/php-ofc-library/ofc_upload_image.php HTTP/1.1" 404 1494 "-" "ALittle Client"
194.38.23.16 - - [20/Mar/2024:14:12:31 +0000] "GET /php-ofc-library/ofc_upload_image.php HTTP/1.1" 404 1494 "-" "ALittle Client"
194.38.23.16 - - [20/Mar/2024:14:12:32 +0000] "GET /sites/default/modules/civicrm/packages/OpenFlashChart/php-ofc-library/ofc_upload_image.php HTTP/1.1" 404 1494 "-" "ALittle Client"
35.94.93.42 - - [20/Mar/2024:15:38:50 +0000] "GET /.git/HEAD HTTP/1.1" 404 360 "-" "Python-urllib/3.10"
161.97.147.235 - - [20/Mar/2024:16:12:08 +0000] "GET /wp-login.php HTTP/1.1" 404 3439 "http://www.example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/95.0"
174.53.49.200 - - [20/Mar/2024:16:38:24 +0000] "-" 408 0 "-" "-"
185.254.196.173 - - [20/Mar/2024:18:42:13 +0000] "GET /.env HTTP/1.1" 404 397 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
135.125.244.48 - - [20/Mar/2024:19:03:50 +0000] "POST / HTTP/1.1" 404 397 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
172.104.11.4 - - [20/Mar/2024:19:47:51 +0000] "\x16\x03\x01" 400 392 "-" "-"

Some of these are trying to sniff out secret credential files (.env), some are looking for git repos, some open a connection and never send a request (the 408 response), some are seeking WordPress entry points (wp-login.php), some are looking for particular modules that probably contain exploits, and some send raw binary (\x16\x03\x01, which looks like the start of a TLS handshake sent to the plain HTTP port).
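To make those categories concrete, here is a minimal Python sketch that parses one line of the combined log format shown above and attaches a label to it. The label names and the pattern list are my own assumptions drawn from the excerpt, not a complete signature set, and the regex will not cope with request strings that contain escaped quotes.

```python
import re

# Apache "combined" log format, as in the excerpt above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical labels; the patterns come from requests seen in the excerpt.
PROBES = [
    ("credential-file", re.compile(r"^GET /\.env\b")),
    ("git-repo", re.compile(r"^GET /\.git/")),
    ("wordpress", re.compile(r"^GET /wp-")),
    ("exchange", re.compile(r"^GET /(ews|autodiscover?)/", re.I)),
    ("upload-exploit", re.compile(r"ofc_upload_image\.php")),
    # Apache logs non-printable bytes as literal \xNN escapes.
    ("tls-on-http-port", re.compile(r"^\\x16\\x03")),
]


def classify(line):
    """Return a dict describing one log line, or None if it does not parse."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    request, status = m["request"], m["status"]
    if request == "-" and status == "408":
        # The client connected but never sent a request line.
        label = "no-request-timeout"
    else:
        label = next(
            (name for name, rx in PROBES if rx.search(request)),
            "unclassified",
        )
    return {
        "ip": m["ip"],
        "status": int(status),
        "agent": m["agent"],
        "label": label,
    }
```

Anything that falls through to "unclassified" (such as the /Temporary_Listen_Addresses request) is exactly the sort of line worth eyeballing before adding a new pattern.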

It's some small comfort that the HTTP responses seem correct, but this still bothers me. It seems like such requests might be munged to generate some kind of threat map and that, in turn, might help one better protect one's systems. I know that fail2ban has various jails, and I tried their 404 jail once (I believe this is the apache-noscript jail), but it caused serious problems. Missing images or CSS files, perfectly innocent ones, generated a lot of 404 responses, and this commonly banned an IP address behind which there were a LOT of people (e.g., the University of Michigan or some busy Starbucks somewhere), which was NOT GOOD.

So I have two questions:
1) Supposing I were to cook up some kind of machine learning script that, given an apache log full of novel requests, could classify each request as good (apparently a good faith request) or bad (a hack attempt or script kiddie screwing around), would folks find this helpful? Can we think of any data we might want to extract with such a script (e.g., "guiltiest IP addresses" or "fishiest user agent strings") or any sort of application for such a machine learning tool?
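As a baseline before any machine learning, the "guiltiest IP addresses" and "fishiest user agent strings" reports can be roughed out with a plain tally. The sketch below is a hand-written heuristic, not a classifier; the SUSPICIOUS pattern is an assumed starter list taken from the log excerpt above.

```python
import re
from collections import Counter

LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed starter list of path fragments that no real visitor should request.
SUSPICIOUS = re.compile(
    r"/\.env|/\.git/|/wp-login\.php|ofc_upload_image\.php", re.I
)


def tally(lines):
    """Count suspicious requests per client IP and per user agent string."""
    by_ip, by_agent = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and SUSPICIOUS.search(m["request"]):
            by_ip[m["ip"]] += 1
            by_agent[m["agent"]] += 1
    return by_ip, by_agent
```

Feeding it an open log file and printing `by_ip.most_common(10)` gives a first "guiltiest" list; the same counts could later serve as features or labels if a real classifier turns out to be worth the effort.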

2) What tricks are folks using these days to keep the script kiddies and scrapers out without excluding friendly users and friendly web crawlers? I've seen some sites lately that appear to perform some kind of cookie or captcha check to even let you browse the site.

I've been thinking I might work up some code in my PHP app that sets a cookie when it detects suspicious behavior and then bans anyone showing up with that cookie. Of course, most bad behavior probably doesn't bother with cookies, so I started thinking I might set a cookie for any fresh visitor and if they failed to present such a cookie on subsequent requests, I might just show them an error page or something. Neither of these solutions seems very good to me.

    I'm actually doing something like this and have attempted to do so for many years (dating back to my last programming job 2011-2020). A Security class is included in "404.php" (Apache custom 404 script) that has a list of likely strings.

    Apache has .htaccess in the web root. If you ask for a typical script-kiddie resource you get added to it thus:

    #Tried to get /.env
    deny from 20.172.38.178

    I'd actually feel better if it modified firewall rules, but you run a high risk of screwing things up royally as a fat-fingered URL in your browser might lock you out pretty hard....
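The lock-yourself-out risk mentioned above can be reduced by checking an allowlist before anything gets written. Below is a minimal Python sketch of building such an entry; NEVER_BAN and htaccess_entry are hypothetical names, the allowlist contents are an assumption, and the actual file append is left out.

```python
import ipaddress

# Hypothetical allowlist: addresses that must never be banned
# (your own office or home IP, monitoring hosts, and so on).
NEVER_BAN = {ipaddress.ip_address("127.0.0.1")}


def htaccess_entry(ip, path, never_ban=NEVER_BAN):
    """Build the comment + deny pair for one offender, or None if allowlisted."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on anything not an IP
    if addr in never_ban:
        return None
    # Strip newlines so a crafted path cannot inject extra directives.
    safe_path = path.replace("\r", " ").replace("\n", " ")
    return f"#Tried to get {safe_path}\ndeny from {addr}\n"
```

Validating the address first also means nothing other than a well-formed IP can end up after `deny from`.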

    It also bans clients that don't provide a UA string, plus some clients by default ("go-http-client" comes to mind...)

    dalecosp I do something similar. I don't add to the htaccess but my sites all run through index.php and the very first thing I do on page load is check for their existence in the ban table, stopping page load with a simple "banned" before spending any overhead on the visitor. The second thing I do is check URL string, UA, IP, POST/GET data, referer, etc for honeypot stuff, banning them if a match is found.

    Low-effort attempts (like a script trying all the popular WP exploits) get a one-hour ban, while more serious efforts get 30 days.
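A rough Python sketch of that tiered ban table follows, with an in-memory dict standing in for the database table. The class and severity names are hypothetical; only the one-hour and 30-day durations come from the post.

```python
import time

# Durations from the post above: one hour for low-effort probes,
# 30 days for more serious attempts.
BAN_SECONDS = {"low": 60 * 60, "serious": 30 * 24 * 60 * 60}


class BanTable:
    """In-memory stand-in for a database ban table (illustration only)."""

    def __init__(self):
        self._expires = {}  # ip -> unix timestamp when the ban ends

    def ban(self, ip, severity, now=None):
        now = time.time() if now is None else now
        until = now + BAN_SECONDS[severity]
        # Never shorten a longer ban that is already in place.
        self._expires[ip] = max(until, self._expires.get(ip, 0))

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        until = self._expires.get(ip)
        if until is None:
            return False
        if until <= now:
            del self._expires[ip]  # expired: forget the entry
            return False
        return True
```

Because bans expire on their own, a shared IP such as a coffee shop network is only shut out for as long as the ban tier says, not forever.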

    Cool, although I don't like running everything from index.php ... some package at a company I used to work for did that, and it turned me off. Index.php there was like a huge SWITCH statement with every possible value that could be put in the GET string to show whatever content the params called for, and errors if they were outside of boundaries.

    I've not decided to un-ban IPs from the .htaccess. I might, but this particular site is for a very specific US business type and they couldn't care less if anyone else goes there for the most part, so the thought of any particular IP being perma-banned doesn't bother us AFAIK.

    I did have a "bad_bots.php" that we included in the head of every page at my last major PHP job; it analyzed the UA string and current system load and then, if a bot, did a rand() to decide if they saw content or an HTTP/429 or /503 header. IIRC, Google never was penalized, Bing was allowed 94%-96% chance of a 200 response, and the rest were given lower chances of success.
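The dice-roll idea described above might look something like this Python sketch. Only "Google always" and "Bing roughly 94%-96%" come from the post; DEFAULT_BOT_CHANCE, LOAD_LIMIT, and the rule for choosing between 429 and 503 are placeholders I made up. The function assumes the caller has already decided the request comes from a bot.

```python
import random

# From the post: Google was never penalized, Bing got about a 95% chance.
ALLOW_CHANCE = {"googlebot": 1.0, "bingbot": 0.95}
DEFAULT_BOT_CHANCE = 0.5  # placeholder for "the rest", not from the post
LOAD_LIMIT = 4.0          # hypothetical load-average threshold


def bot_status(agent, load, rng=random.random):
    """Pick the HTTP status for a request already identified as a bot."""
    ua = agent.lower()
    chance = next(
        (p for name, p in ALLOW_CHANCE.items() if name in ua),
        DEFAULT_BOT_CHANCE,
    )
    if rng() < chance:
        return 200
    # Assumption: report 503 when the box is busy, 429 otherwise.
    return 503 if load > LOAD_LIMIT else 429
```

On a Unix host the load figure could come from `os.getloadavg()[0]`. The `rng` parameter exists only so the dice roll can be replaced with a fixed value when testing.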

      6 days later

      Forgive me, I initially didn't realize that there were responses to this thread because the forum has various bugs and isn't updating response counts, etc.

      schwim I'm confused, where's your 4,000 wp-* requests per hour?

      Oh this is just a small extraction from a server that is not very busy. I've seen those script kiddies sounding out my server for a wp install, and boy does it rankle me. I've long contemplated setting up some page to handle those requests, but I have not done so because I'm not sure what measures I might take against the script kiddies and, if I'm just collecting information, I'm not sure what I would even do with the collected info. Perhaps cram it into some table for later assessment?

      deny from 20.172.38.178
      dalecosp I'd actually feel better if it modified firewall rules, but you run a high risk of screwing things up royally as a fat-fingered URL in your browser might lock you out pretty hard....

      Thanks for this, @dalecosp. Are you not concerned that blocking an entire IP address might lock out well-intentioned visitors? E.g., what if that IP represents a busy coffee shop wifi network? Or a large corporate office? Or the entire University of Michigan? I once tried enabling the apache-noscript jail in fail2ban, and this had the very bad side effect of blocking nearly all my visitors because my markup (which was admittedly very ugly due to a sloppy front end guy) had all kinds of bad links to transparent GIFs. This was back in the Bad Old Days when CSS didn't really work and IE was in widespread use.

      dalecosp It also bans UAs that don't provide a UA string and some clients by default ("go-http-client" comes to mind...)

      I've seen those script kiddie UAs, which also rankle me. I worry about damaging my SEO reputation if I accidentally block search engines or bots. Is an empty UA inherently bad? Would privacy settings or some popular search engine ever show up with an empty UA? Or might a search engine (or social preview bot) show up with some unexpected curl UA or some default UA set by a code library? What criteria are we using to distinguish good UAs from bad UAs?

      schwim I don't add to the htaccess but my sites all run through index.php and the very first thing I do on page load is check for their existence in the ban table, stopping page load with a simple "banned" before spending any overhead on the visitor. The second thing I do is check URL string, UA, IP, POST/GET data, referer, etc for honeypot stuff, banning them if a match is found.

      Low-effort attempts (like a script trying all the popular WP exploits) get a one-hour ban, while more serious efforts get 30 days.

      This sounds like what I am most likely to do, although my framework or mod_rewrite might need some mods or massage to be able to handle requests that don't match any of my defined endpoints or routes. I've considered adding honeypot endpoints/routes to sniff out the common forms of snooping, exact actions TBD. I would again ask: are you not concerned about blocking by IP address?

      How do we feel about setting some kind of cookie that identifies bad actors, like a big red "dunce" cap? Let's call that approach DunceCookie. I guess most bad requests skip cookies entirely, so DunceCookie doesn't sound like it would help with the vast majority of script kiddie activity.

      Alternatively, perhaps we set a cookie on the first page request and, if that cookie is missing on subsequent requests, we refuse to do anything useful for the visitor. Let's call this FriendCookie. If someone visits and there's no FriendCookie, we could show them a WELCOME/CONTINUE message and attempt to set the FriendCookie. For any subsequent visits, if they have the FriendCookie, then we could serve them and feel better about them. If they screw around, we could convert FriendCookie to DunceCookie. Any cookieless request would show the welcome page and nothing else.
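The FriendCookie/DunceCookie rules above boil down to a small decision table, sketched here in Python. The cookie values "friend" and "dunce" and the names Action and decide are hypothetical, and how a request gets flagged as misbehaving is left to the caller.

```python
from enum import Enum


class Action(Enum):
    WELCOME = "show the welcome/continue page"
    SERVE = "serve the requested content"
    REFUSE = "refuse to do anything useful"


def decide(cookie, misbehaved):
    """Map the presented cookie and current behavior to a response.

    cookie: None (no cookie), "friend", or "dunce".
    Returns (action, cookie value to set on the response).
    """
    if cookie == "dunce":
        return Action.REFUSE, "dunce"
    if cookie == "friend":
        if misbehaved:
            # Convert FriendCookie into DunceCookie.
            return Action.REFUSE, "dunce"
        return Action.SERVE, "friend"
    # No recognized cookie: welcome page only, and try to set FriendCookie.
    return Action.WELCOME, "friend"
```

One caveat this sketch ignores: a client can delete or forge a plain cookie, so a real version would probably need a signed value (for example an HMAC over a timestamp) before the cookie could be trusted at all.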

      dalecosp ...turned me off. Index.php there was like a huge SWITCH statement with every possible value...

      Routing everything through index.php is quite powerful and you don't need a giant switch statement (that does sound ugly). You can use some algorithmic mapping of urls onto PHP scripts, possibly involving a routing table. Laravel and CodeIgniter use this approach, and I've found it very useful for code organization and for establishing consistent and handy application state to handle every request with a minimum of worry and effort.

      dalecosp I did have a "bad_bots.php" that we included in the head of every page at my last major PHP job; it analyzed the UA string and current system load and then, if a bot, did a rand() to decide if they saw content or an HTTP/429 or /503 header. IIRC, Google never was penalized, Bing was allowed 94%-96% chance of a 200 response, and the rest were given lower chances of success.

      This detail is helpful. I'd be curious how we go about recognizing bad UAs (script kiddies!) versus good UAs (google, bing, duck duck go, other?, actual users).

      sneakyimp

      Specifically on the last point, there's probably no substitute for just watching the access logs; I tend to have one monitor dedicated to "tail -f somelogfile" a lot of the time.

      As for SEO and such, I think that here we're very concerned about only people we want to use our site using it (distributorship in a close-knit industry), so no one's worried too much (yet!) about blocking good bots or users in privacy mode/using Brave/using DDG Browser, etc....

      6 days later

      dalecosp

      Thanks @dalecosp, for your responses. I do often check the log files directly, typically filtered with grep to isolate certain things. It is precisely this monitoring that caused me to see all these dodgy requests, and start wondering what to do about them.
