Response Part 1
The access_log is key. Note that it's usually a LOT of lines. It might be possible to copy it down to a non-production box (one with some RAM to spare) and open it there. Find the accesses around 3 AM and see what you can see. Look for funny user-agents, strange GET params, long URI requests, etc.
This is exactly what I did. I was also monitoring the access_log in real time using the
tail -f command in conjunction with various grep statements to isolate bots, 404s, etc. I was not able to identify anything particularly ugly except maybe a few bots that look unwelcome, but this traffic was not especially high. I also did some analysis comparing traffic that day to traffic on prior days. Curiously there was no spike in traffic that corresponded to the server outage.
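For instance, that day-over-day comparison can be scripted. A minimal sketch, assuming the common/combined log format, where field 4 holds the timestamp (e.g. [10/Oct/2023:03:12:45):

```shell
# Tally requests per hour so traffic around the 3 AM outage can be
# compared with the rest of the day. Log path and format are assumptions.
requests_per_hour() {
  awk '{ split($4, t, ":"); hours[t[2]]++ }
       END { for (h in hours) print h, hours[h] }' "$1" | sort
}
# Usage: requests_per_hour /var/log/apache2/access.log
```

Running it against rotated logs from prior days gives a quick hour-by-hour comparison.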
As described before, I tried tweaking our apache settings in /etc/apache2/mods-available/mpm_event.conf to increase the number of available apache workers (as recommended here), even though I was worried the new values were far too high.
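For reference, the knobs I was turning look roughly like this (the numbers here are illustrative, not my exact values):

```apache
# /etc/apache2/mods-available/mpm_event.conf -- hypothetical excerpt
<IfModule mpm_event_module>
    StartServers              2
    MinSpareThreads          25
    MaxSpareThreads          75
    ThreadLimit              64
    ThreadsPerChild          25
    MaxRequestWorkers       800
    MaxConnectionsPerChild    0
</IfModule>
```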
The machine still clearly has plenty of RAM available:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        2.5G        1.7G         88M        3.6G        5.0G
This leads me to conclude that my understanding of the MaxRequestWorkers setting was wrong. I was under the impression that I could use ps to list the running apache processes, figure out how much RAM each process was using, and estimate the maximum number of apache processes the machine could support before blasting through all its RAM. My chief misapprehension was assuming that MaxRequestWorkers equals the number of apache processes as listed by those commands. That does not appear to be the case. I now believe that apache launches a smaller number of processes, that each of those processes runs multiple threads, and that it's the number of threads that eventually expands up to MaxRequestWorkers. (If that's right, the process count should be roughly MaxRequestWorkers divided by ThreadsPerChild; with the default ThreadsPerChild of 25, 800 / 25 = 32, which matches the ~33 processes I see.) Furthermore, threads seem to be more memory-efficient, because each thread shares some amount of memory with the other threads running under the same parent process. The upshot is that I get 800 threads to serve my visitors while only running about 30-40 processes.
I.e., this command lists my apache2 processes, of which there are currently 33:
ps --no-headers -C apache2 -o "uname,ppid,pid,%cpu,pmem,rss,command"
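As a sanity check on those per-process numbers, the RSS column can be summed -- with the caveat that RSS double-counts pages shared between processes, so the true footprint is somewhat lower. A sketch:

```shell
# Sum a column of RSS values (KiB, as ps reports them) and print MiB.
# Note: RSS overcounts shared pages, so this is an upper bound.
sum_rss() {
  awk '{ kib += $1 } END { printf "%.1f MiB\n", kib / 1024 }'
}
# Usage: ps --no-headers -C apache2 -o rss | sum_rss
```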
However, the apache fullstatus output clearly tells me I have 40 busy and 760 idle threads. Also, the -L option tells ps to show threads; this outputs about 866 rows:
ps -ALf | grep apache2
If anyone could corroborate and/or clarify my new understanding that MaxRequestWorkers is a limit on THREADS and not PROCESSES, I'd very much appreciate it. More importantly, if anyone could help me understand how to estimate the optimal apache and php-fpm settings for my machine to utilize ALL available RAM and CPU without risking a disk-thrashing meltdown, I'd appreciate that even more. So far I've simply been increasing things. A MaxRequestWorkers setting of 800 seems really high, but there still seems to be plenty of RAM to spare, even during moments of extremely high traffic. Also, the 8-core machine has not reported a 5-minute load average higher than 7.1 at any point in the past month -- and that moment was not while the server was down.
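To make my "estimate" reproducible, here is the back-of-envelope arithmetic I have been doing, as a small function. All three inputs are assumptions you would substitute from your own free and ps output; this is a sketch of the reasoning, not a tuning recommendation:

```shell
# MaxRequestWorkers estimate: how many child processes fit in a RAM
# budget, times the threads each child runs.
#   $1 = RAM budget for apache (MiB)
#   $2 = average RSS of one apache2 child (MiB)
#   $3 = ThreadsPerChild
size_workers() {
  awk -v budget="$1" -v rss="$2" -v tpc="$3" \
      'BEGIN { print int(budget / rss) * tpc }'
}
# e.g. size_workers 4096 120 25  ->  34 children * 25 threads each = 850
```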
I think I can pretty safely say that it was NOT any lack of RAM or CPU power that caused the site to go down the other night. This brings me to Part 2.
Response Part 2
So, having noticed essentially all the apache workers in "R" - Reading Request mode during these moments of crisis, it seems to me, after reading that other thread I mentioned, that my server might be experiencing some DoS situation -- possibly an intentional attack, possibly an accidental by-product of low-bandwidth traffic from Africa and Asia.
I would point out that I do not have direct evidence of an especially high number of low-bandwidth connections, nor do I fully understand just yet what it means for an apache worker to be in R - Reading Request mode. What does "reading request" mean?
- Does it mean the remote client has not fully transmitted its request and is just hanging around with an open socket?
- Could the worker still be in R mode if it has received the full request but is waiting on local system resources (CPU? network bandwidth? PHP process pool? database connection? database response?)
If I could get a more precise description of what that R mode is, that would help very much. I haven't had much luck finding any detail, even in the apache mod_status docs.
Working on this theory, I tried setting apache's Timeout directive down from 300 to 30 and restarted apache for the change to take effect, but it did almost nothing.
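For reference, that experiment amounted to this one-line change (the value is in seconds; 300 is the Debian default, and 30 was a guess, not a recommendation):

```apache
# /etc/apache2/apache2.conf -- core server timeout, distinct from mod_reqtimeout
Timeout 30
```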
The good news is that I enabled apache's mod_reqtimeout and was able to dramatically improve things with this setting in its conf file, /etc/apache2/mods-available/reqtimeout.conf:
RequestReadTimeout handshake=5-10,MinRate=500 header=5-20,MinRate=500 body=10,MinRate=500
This had an immediate effect: while lots of workers were still in R mode, there were considerably fewer of them, with many switching to mode "_" (meaning idle and ready to serve requests). The web server immediately became MUCH more responsive to incoming requests. A LOT faster.
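One can watch this shift numerically by tallying the scoreboard characters in the mod_status output. A sketch, assuming apache2ctl fullstatus works on your box (mod_status enabled plus a text browser installed); the scoreboard lines are runs of characters like _, R, W, K:

```shell
# Count occurrences of each scoreboard state ("_" waiting, "R" reading
# request, "W" sending reply, "." open slot, etc.) in mod_status output
# read from stdin.
scoreboard_counts() {
  grep -E '^[_SRWKDCLGI.]+$' | fold -w1 | sort | uniq -c \
    | awk '{ print $2, $1 }' | LC_ALL=C sort
}
# Usage: apache2ctl fullstatus | scoreboard_counts
```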
That said, there was suddenly a very large number of 408 responses in the access log, which one can monitor with this command:
sudo tail -f /var/log/apache2/access.log | grep '" 408 '
A very large number of the IP addresses that show up among these 408s are from China -- which is more than a bit odd given the nature of our site for reasons I will not specify here.
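To see who is actually generating those 408s, the responses can be tallied by client IP. A sketch, again assuming the combined log format, where the client IP is field 1 and the status code is field 9:

```shell
# Count 408 responses per client IP, most frequent offenders first.
count_408_by_ip() {
  awk '$9 == 408 { hits[$1]++ }
       END { for (ip in hits) print hits[ip], ip }' "$1" | sort -rn
}
# Usage: count_408_by_ip /var/log/apache2/access.log
```

Piping the top offenders through a whois or GeoIP lookup would then confirm (or refute) the China pattern.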
Any thoughts folks might have on those RequestReadTimeout settings would be most welcome. My feeling is that 5 seconds to read simple headers (let alone up to 20) and 10 seconds to deliver the request body is a very long time -- except perhaps when someone is uploading a large image. I also have no idea how likely these reqtimeout settings are to disturb the normal functionality of our site.