Denial of Service attack?

sneakyimp · Jan 11, 2020

So I think my server might be suffering a Denial of Service attack.

We got notified by pingdom (website monitoring) that our website was unavailable starting around 3AM. Early today we started checking the apache error logs and saw a whole bunch of these errors:

AH00485: scoreboard is full, not at MaxRequestWorkers

We also saw that our PHP-FPM process pool frequently needed to spawn more servers:

[pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children

We tried increasing MaxRequestWorkers in the apache conf and some other remedies, but none of these got rid of the scoreboard error in the apache error log. So, against my better judgement, I followed the advice in this thread and set MinSpareThreads and MaxSpareThreads equal to MaxRequestWorkers. These changes appear to have removed the scoreboard error.

I also greatly increased MaxRequestWorkers because we have a lot of RAM that evidently isn't being utilized. Our server has 8 cores and, despite these really high config values, doesn't seem to be using much of its RAM at all:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        1.8G        2.0G         38M        4.0G        5.8G
Swap:            0B          0B          0B

I'm pretty nervous about these high values for MaxRequestWorkers in the apache conf and pm.max_children in php-fpm configuration.

Here's the basic config in mpm_event.conf

<IfModule mpm_event_module>
        StartServers             2
        MinSpareThreads          800
        MaxSpareThreads          800
        ThreadLimit              64
        ThreadsPerChild          25
        ServerLimit              800
        MaxRequestWorkers        800
        MaxConnectionsPerChild   0
</IfModule>
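
As a rough sanity check on how these directives fit together, here is a minimal Python sketch. It assumes the event MPM runs a fixed ThreadsPerChild threads in each child process, so the number of child processes is MaxRequestWorkers divided by ThreadsPerChild. The function name and the returned keys are made up for illustration and are not part of Apache.

```python
def event_mpm_layout(max_request_workers, threads_per_child, server_limit, thread_limit):
    """Estimate how many child processes the event MPM needs (illustrative only)."""
    if threads_per_child > thread_limit:
        raise ValueError("ThreadsPerChild must not exceed ThreadLimit")
    # Assumption: every child process runs exactly threads_per_child worker threads.
    processes = max_request_workers // threads_per_child
    if processes > server_limit:
        raise ValueError("ServerLimit is too low for this MaxRequestWorkers")
    return {
        "processes": processes,
        "threads": processes * threads_per_child,
        "unused_server_slots": server_limit - processes,
    }


# Values taken from the mpm_event.conf shown above.
layout = event_mpm_layout(
    max_request_workers=800, threads_per_child=25, server_limit=800, thread_limit=64
)
print(layout)
```

With the posted values this works out to 32 processes of 25 threads each, which lines up with the 32 PIDs in the server-status table further down. It also suggests that ServerLimit 800 leaves a lot of process slots that can never be used at this MaxRequestWorkers.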

Here are some settings in a php-fpm conf file:

pm.max_children = 256
pm.start_servers = 64
pm.min_spare_servers = 64
pm.max_spare_servers = 128

Here's some basic server info:

Server version: Apache/2.4.18 (Ubuntu)
Server built:   2019-10-08T13:31:25
Server's Module Magic Number: 20120211:52
Server loaded:  APR 1.5.2, APR-UTIL 1.5.4
Compiled using: APR 1.5.2, APR-UTIL 1.5.4
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)

And here's some of the data from the apache server-status output:

Server Version: Apache/2.4.18 (Ubuntu) OpenSSL/1.0.2g
Server MPM: event
Server Built: 2019-10-08T13:31:25

Current Time: Friday, 10-Jan-2020 22:58:55 CST
Restart Time: Friday, 10-Jan-2020 22:26:32 CST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 32 minutes 22 seconds
Server load: 4.69 5.06 5.12
Total accesses: 78434 - Total Traffic: 1.5 GB
CPU Usage: u2970.53 s5037.34 cu0 cs0 - 412% CPU load
40.4 requests/sec - 0.8 MB/second - 19.7 kB/request
797 requests currently being processed, 3 idle workers

PID	Connections total	Connections accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
6124	28	yes	25	0	0	0	3
6125	27	yes	25	0	0	0	2
6182	30	yes	25	0	0	1	4
6210	28	yes	25	0	0	0	3
6211	29	yes	25	0	0	0	5
6266	28	yes	25	0	0	2	1
6267	25	yes	25	0	0	0	1
6269	28	no	24	1	0	1	3
6276	28	yes	25	0	0	0	3
6378	28	yes	25	0	0	0	3
6379	31	no	24	1	0	4	3
6380	27	yes	25	0	0	0	3
6384	26	yes	25	0	0	0	2
6397	28	yes	25	0	0	2	1
6405	27	yes	25	0	0	0	2
6414	26	yes	25	0	0	1	0
6423	27	no	24	1	0	1	1
6602	27	yes	25	0	0	0	3
6603	28	yes	25	0	0	0	4
6604	26	yes	25	0	0	0	1
6617	30	yes	25	0	0	0	5
6646	26	yes	25	0	0	0	2
6676	27	yes	25	0	0	0	2
6694	30	yes	25	0	0	0	5
6705	28	yes	25	0	0	0	3
6730	29	yes	25	0	0	0	4
6765	29	yes	25	0	0	0	4
6781	27	yes	25	0	0	0	2
6805	28	yes	25	0	0	0	4
6836	28	yes	25	0	0	0	3
6858	27	yes	25	0	0	0	3
6859	27	no	25	0	0	1	1
Sum	888	 	797	3	0	13	86

The worker mode part is the most disconcerting. Almost every single one is in read mode:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRR_RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
_RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR_RRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
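
To put numbers on a scoreboard like this rather than eyeballing it, a small Python sketch along these lines could tally the worker states. The legend is the key that mod_status prints beneath the scoreboard; the function name and the sample string are made up for illustration.

```python
from collections import Counter

# Key as printed by mod_status under the scoreboard.
LEGEND = {
    "_": "Waiting for Connection",
    "S": "Starting up",
    "R": "Reading Request",
    "W": "Sending Reply",
    "K": "Keepalive (read)",
    "D": "DNS Lookup",
    "C": "Closing connection",
    "L": "Logging",
    "G": "Gracefully finishing",
    "I": "Idle cleanup of worker",
    ".": "Open slot with no current process",
}


def tally_scoreboard(scoreboard):
    """Count workers per state, ignoring line breaks and other whitespace."""
    counts = Counter(ch for ch in scoreboard if not ch.isspace())
    return {LEGEND.get(ch, "unknown (%s)" % ch): n for ch, n in counts.most_common()}


# Made-up sample, not the real scoreboard from this server.
sample = "RRRR_RRW\nRR.R"
tally = tally_scoreboard(sample)
print(tally)
```

Pasting the real scoreboard text into `tally_scoreboard` would give the R, W and idle counts at that moment, which makes it easier to compare snapshots taken a few minutes apart.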

And at the end there's this:

SSL/TLS Session Cache Status:
cache type: SHMCB, shared memory: 512000 bytes, current entries: 2176
subcaches: 32, indexes per subcache: 88
time left on oldest entries' objects: avg: 220 seconds, (range: 197...243)
index usage: 77%, cache usage: 99%
total entries stored since starting: 60122
total entries replaced since starting: 0
total entries expired since starting: 0
total (pre-expiry) entries scrolled out of the cache: 57946
total retrieves since starting: 3405 hit, 59594 miss
total removes since starting: 0 hit, 0 miss
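
For reference, the ratios implied by those session cache counters can be worked out with a few lines of Python. This is only arithmetic on the numbers posted above; the function name is made up, and what counts as a "bad" ratio is a judgement call.

```python
def session_cache_ratios(hits, misses, stored, scrolled_out):
    """Return the retrieve hit ratio and the share of stored entries scrolled out."""
    lookups = hits + misses
    return {
        "hit_ratio": hits / lookups if lookups else 0.0,
        "scrolled_out_ratio": scrolled_out / stored if stored else 0.0,
    }


# Counters copied from the SSL/TLS Session Cache Status output above.
ratios = session_cache_ratios(hits=3405, misses=59594, stored=60122, scrolled_out=57946)
print({name: round(value, 3) for name, value in ratios.items()})
```

So only about 5% of session lookups hit the cache, and roughly 96% of the entries stored since the restart were pushed out before they expired.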

And netstat shows some 3000+ connections to port 80 and port 443:

$ netstat -n | egrep ":80|443" | wc -l
3715
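
A raw count like this doesn't say who is connected or what state the sockets are in. A hedged Python sketch such as the one below could break the netstat output down by TCP state and by remote IP. It assumes the usual Linux `netstat -n` column layout (proto, recv-q, send-q, local address, foreign address, state). The sample lines use documentation-range addresses and are made up.

```python
from collections import Counter

WEB_PORTS = {"80", "443"}


def summarize_connections(netstat_lines, ports=WEB_PORTS):
    """Tally TCP connections to the given local ports by state and by remote IP."""
    by_state = Counter()
    by_remote_ip = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers, unix sockets and anything else that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # Match the exact local port so that 8080 or a remote port is not counted.
        if local.rsplit(":", 1)[-1] not in ports:
            continue
        by_state[state] += 1
        by_remote_ip[remote.rsplit(":", 1)[0]] += 1
    return by_state, by_remote_ip


# Made-up sample rows for illustration.
sample = [
    "tcp        0      0 192.0.2.10:443          203.0.113.7:51234       ESTABLISHED",
    "tcp        0      0 192.0.2.10:443          203.0.113.7:51240       ESTABLISHED",
    "tcp        0      0 192.0.2.10:80           198.51.100.23:40022     TIME_WAIT",
    "tcp        0      0 192.0.2.10:8080         198.51.100.23:40023     ESTABLISHED",
]
states, remote_ips = summarize_connections(sample)
print(states.most_common())
print(remote_ips.most_common(5))
```

Passing `sys.stdin` instead of `sample` would let it read piped netstat output. A handful of remote IPs holding hundreds of connections each would look quite different from thousands of IPs with one or two connections apiece.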

What the heck is going on? The server has been running fine for months with much more modest configuration settings. Something seems to have abruptly changed last night around 3AM.

Any guidance would be much appreciated. I searched here first and found this other thread but it's a different version of apache running in prefork mode instead of event like mine. I also don't understand how the little bit of information in that thread led to a SlowLoris diagnosis.

Is there any way to confirm that the machine is under a DoS attack?

Disclaimer: I also posted this on ServerFault but rather expect they might be jerks about it like they usually are.

dalecosp · Jan 11, 2020

The access_log is key. Note that it's usually a LOT of lines. Might be possible to copy it and download it to a non-production box (also with some RAM) and open it. Find the accesses around 3 AM and see what you can see. Look for funny user-agents, strange GET params, long URI requests, etc.

And, it could be that this doesn't help ... but in my experience if you want to have any clue what's going on, that's the place to look.

sneakyimp · Jan 13, 2020

Response Part 1

dalecosp wrote:

"The access_log is key. Note that it's usually a LOT of lines. Might be possible to copy it and download it to a non-production box (also with some RAM) and open it. Find the accesses around 3 AM and see what you can see. Look for funny user-agents, strange GET params, long URI requests, etc."

This is exactly what I did. I was also monitoring the access_log in real time using the tail -f command in conjunction with various grep statements to isolate bots, 404s, etc. I was not able to identify anything particularly ugly except maybe a few bots that look unwelcome, but this traffic was not especially high. I also did some analysis comparing traffic that day to traffic on prior days. Curiously there was no spike in traffic that corresponded to the server outage.
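
For anyone wanting to repeat the day-to-day traffic comparison, a minimal Python sketch like this one could bucket access log entries by hour. It assumes Apache's common/combined log format and an English month abbreviation in the timestamp. The regex, the function name and the sample lines are illustrative and not taken from the actual logs.

```python
import re
from collections import Counter
from datetime import datetime

# client ident user [timestamp] "request line" status ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) ')


def requests_per_hour(lines):
    """Count requests per hour from common/combined format log lines."""
    buckets = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that don't look like access log entries
        when = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
        buckets[when.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(buckets.items()))


# Made-up sample lines for illustration.
sample = [
    '203.0.113.7 - - [10/Jan/2020:02:58:10 -0600] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Jan/2020:02:59:42 -0600] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Jan/2020:03:01:15 -0600] "POST /login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
]
hourly = requests_per_hour(sample)
print(hourly)
```

One caveat: the access log only records requests that completed or timed out, so connections that sit open without ever sending a full request may not show up as a spike here.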

As described before, I tweaked our apache settings in /etc/apache2/mods-available/mpm_event.conf to increase the available apache workers (as recommended here), even though I was worried about these values being far too high:

<IfModule mpm_event_module>
        StartServers             2
        MinSpareThreads          800
        MaxSpareThreads          800
        ThreadLimit              64
        ThreadsPerChild          25
        ServerLimit              800
        MaxRequestWorkers        800
        MaxConnectionsPerChild   0
</IfModule>

The machine still clearly has plenty of RAM available:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        2.5G        1.7G         88M        3.6G        5.0G

This leads me to conclude that my understanding of the MaxRequestWorkers setting is wrong. I was under the impression that I could use top or ps to list the number of running apache processes, figure out how much RAM each process was using, and make some estimate of the maximum number of apache processes the machine could support before blasting through all its RAM. If I'm not mistaken, my chief misapprehension is that MaxRequestWorkers = number of apache processes running as listed by those commands. This does not appear to be the case. Now I believe that apache may launch some smaller number of processes and that each of those processes can launch multiple threads and that it's actually the number of threads that expands eventually to MaxRequestWorkers. Furthermore, threads seem to be more efficient because each thread shares some amount of memory with the other threads running under the same parent process. The upshot being that I get 800 threads to serve my visitors while only running about 30-40 processes.

I.e., this command lists my apache2 processes, of which there are currently 33:

ps --no-headers -C apache2 -o "uname,ppid,pid,%cpu,pmem,rss,command"

However, the apache fullstatus output clearly tells me I have 40 busy and 760 idle threads. The ps -L option shows threads as well, and this command outputs about 866 rows:

ps -ALf | grep apache2

If anyone could corroborate and/or clarify my new understanding that MaxRequestWorkers is a limit on THREADS and not PROCESSES, I'd very much appreciate it. More importantly, if anyone could help me understand how to estimate the optimum apache and php-fpm settings for my machine to utilize ALL available RAM and CPU resources without risking a disk-thrashing meltdown, I'd appreciate it. So far I've been increasing things. A MaxRequestWorkers setting of 800 seems really high, but there still seems to be plenty of RAM around, even during moments of extremely high traffic. Also, the 8-core machine has not reported a 5m load average higher than 7.1 at any point in the past month -- and that moment was not while the server was down.
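
One common rule of thumb for sizing pm.max_children is to divide the RAM you are willing to give PHP by the average resident size of a php-fpm worker. The Python sketch below just does that arithmetic. The 40 MB per worker and the 1000 MB reserve are made-up numbers, so measure your own with ps before trusting any result.

```python
def estimate_max_children(available_mb, avg_worker_rss_mb, reserve_mb=512):
    """Rough upper bound on php-fpm children that fit in RAM (rule of thumb only)."""
    usable_mb = available_mb - reserve_mb
    if usable_mb <= 0 or avg_worker_rss_mb <= 0:
        return 0
    return int(usable_mb // avg_worker_rss_mb)


# Hypothetical inputs: about 5000 MB available, 40 MB per worker,
# and 1000 MB held back for apache, the OS and the page cache.
children = estimate_max_children(available_mb=5000, avg_worker_rss_mb=40, reserve_mb=1000)
print(children)
```

RSS counts memory shared between workers once per process, so this tends to err on the conservative side. It also says nothing about CPU or database limits, which may bite well before RAM does.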

I think I can pretty safely and surely say that it was NOT any lack of RAM or CPU power that caused the site to go down the other night. This brings me to Part 2.

Response Part 2
So, having noticed essentially all the apache workers in "R" - Reading Request mode during these moments of crisis, it seems to me upon reading that other thread I mentioned that my server might be experiencing some DoS situation, possibly an intentional attack, possibly an accidental by-product of low-bandwidth traffic from Africa and Asia.

I would point out that I do not have direct evidence that I have an especially high number of low-bandwidth connections nor do I fully understand just yet what it means for an apache worker to be in R - Reading Request mode. What does reading request mean?

Does it mean the remote client has not fully transmitted its page request and is just hanging around with an open socket?
Could the worker still be in R mode if it has received the full request but is waiting for local system resources (CPU, network bandwidth, PHP process pool, database connection, database response)?

If I could get a more precise description of what R mode is, that would help very much. I haven't had much luck finding any detail, even in the apache mod_status docs.

Working on this theory, I tried setting apache's Timeout directive down from 300 to 30. I restarted apache for this to take effect but it did almost nothing.

The good news is that I enabled apache mod_reqtimeout and was able to dramatically improve things with this setting in /etc/apache2/mods-available/reqtimeout.conf, the mod_reqtimeout conf file:

RequestReadTimeout handshake=5-10,MinRate=500 header=5-20,MinRate=500 body=10,MinRate=500

This had an immediate effect and, while lots of workers were still in R mode, that number was considerably lower, with many switching to mode=_ (meaning idle, and ready to serve requests). The web server immediately became MUCH more responsive to incoming requests. A LOT faster.

That said, there was suddenly a very large number of 408 responses in the access log, which one can monitor with this command:

sudo tail -f /var/log/apache2/access.log | grep '" 408 '

A very large number of the IP addresses that show up among these 408s are from China -- which is more than a bit odd given the nature of our site for reasons I will not specify here.
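
To see which clients account for most of the 408s, rather than watching them scroll past, a small Python sketch like this could rank IPs by the number of 408 responses. It assumes the common/combined log format. The regex, the function name and the sample lines (documentation-range addresses) are made up for illustration.

```python
import re
from collections import Counter

# client ident user [timestamp] "request line" status ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')


def top_clients_for_status(lines, status="408", limit=10):
    """Return the clients with the most responses of the given status code."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group(2) == status:
            hits[match.group(1)] += 1
    return hits.most_common(limit)


# Made-up sample lines for illustration.
sample = [
    '203.0.113.7 - - [13/Jan/2020:10:00:01 -0600] "-" 408 0 "-" "-"',
    '203.0.113.7 - - [13/Jan/2020:10:00:04 -0600] "-" 408 0 "-" "-"',
    '198.51.100.23 - - [13/Jan/2020:10:00:05 -0600] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.99 - - [13/Jan/2020:10:00:09 -0600] "-" 408 0 "-" "-"',
]
top = top_clients_for_status(sample)
print(top)
```

Feeding the top addresses into a whois or GeoIP lookup afterwards would show whether the 408s really cluster in a few networks.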

Any thoughts folks might have on those RequestReadTimeout settings would be most welcome. My thoughts are that 5 seconds to read simple headers (up to 20 seconds??) and 10 seconds to deliver the body of your request is a very long time -- except perhaps if someone was uploading a large image. I also have no idea how likely these reqtimeout settings are to disturb normal functionality of our site.
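
As I read the mod_reqtimeout docs, a stage like header=5-20,MinRate=500 means: allow 5 seconds to start with, add 1 second for every 500 bytes received, and never allow more than 20 seconds in total. Without an upper bound (as in body=10,MinRate=500) the timeout keeps growing with the data received. The Python sketch below models that arithmetic only. The function is hypothetical and is not how Apache implements it internally.

```python
def read_timeout(initial, min_rate=None, max_timeout=None, bytes_received=0):
    """Seconds allowed for a stage after bytes_received bytes have arrived."""
    allowed = float(initial)
    if min_rate:
        # One extra second for every min_rate bytes received so far.
        allowed += bytes_received / min_rate
    if max_timeout is not None:
        allowed = min(allowed, float(max_timeout))
    return allowed


# header=5-20,MinRate=500
header_small = read_timeout(5, min_rate=500, max_timeout=20, bytes_received=2000)
header_large = read_timeout(5, min_rate=500, max_timeout=20, bytes_received=10000)
# body=10,MinRate=500 (no upper bound)
body_upload = read_timeout(10, min_rate=500, bytes_received=100000)
print(header_small, header_large, body_upload)
```

Under this model, a client that keeps sending at 500 bytes per second or faster is never cut off during the body stage, so a large image upload should be fine as long as the connection isn't crawling.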

dalecosp · Jan 13, 2020

https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers

You are correct: MaxRequestWorkers would be simultaneous threads, not the number of processes.

sneakyimp wrote:

"Now I believe that apache may launch some smaller number of processes and that each of those processes can launch multiple threads and that it's actually the number of threads that expands eventually to MaxRequestWorkers. Furthermore, threads seem to be more efficient because each thread shares some amount of memory with the other threads running under the same parent process. The upshot being that I get 800 threads to serve my visitors while only running about 30-40 processes."

Unfortunately, I've not been exposed to RequestReadTimeout before. The docs don't seem to vary far from what you're showing. The 408s are being sent by mod_reqtimeout, as that's its job. It could be that these settings are too restrictive for some subset of your audience who are far away or on very slow networks (e.g., behind the Great Firewall of China).