I've run a site for 15 years now. Today it's hosted in the Amazon cloud and consists of a web server and an RDS server doing database duty. It's a busy time of year, and we've very recently started getting reports of timeouts and 'site can't be reached.' I've checked the monitoring stats for the two servers and, with the exception of a few very brief CPU spikes here and there, CPU and memory usage look totally nominal.

The machine is not even using the memory swap:

$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.5G       3.2G       322M        31M       146M       2.6G
-/+ buffers/cache:       419M       3.1G
Swap:           0B         0B         0B

The db server looks even more nominal. CPU usage is typically 5%, with occasional spikes to 15%. Plenty of RAM available. The slow query log hasn't had an entry in weeks.

I'm wondering if it might be a network bottleneck with the web server? The web server is an Amazon EC2 instance and actual network specs seem really hard to come by. My instance is an m1.medium, which is rumored to have a network bandwidth of 800Mbit/sec.
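
One rough check I can think of (a sketch; it assumes the instance's primary interface is eth0) is to sample the interface byte counters ten seconds apart and convert to Mbit/sec, to see how close I actually get to that rumored 800:

$ RX1=$(cat /sys/class/net/eth0/statistics/rx_bytes); TX1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
$ sleep 10
$ RX2=$(cat /sys/class/net/eth0/statistics/rx_bytes); TX2=$(cat /sys/class/net/eth0/statistics/tx_bytes)
$ echo "in:  $(( (RX2 - RX1) * 8 / 10 / 1000000 )) Mbit/s  out: $(( (TX2 - TX1) * 8 / 10 / 1000000 )) Mbit/s"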

I haven't been able to check the machine right when the trouble is happening (the site is usually quick again by the time I check) but when I log in and count TCP connections, it's typically between 1000 and 1400:

$ netstat -nt | wc -l
993

1400 simultaneous connections seems like it could easily consume 800Mbit/sec, though even then each user would still be getting about half a Mbit/sec (800 / 1400 is roughly 0.57 Mbit/sec).

Can anyone suggest a methodical approach to locating the problem? This seems like such a black art to me.

EDIT: this is a screenshot of the Amazon console. Note the network stats. More incoming than outgoing? Seems fishy. [upl-image-preview url=https://board.phpbuilder.com/assets/files/2018-09-20/1537478757-514613-myplan-web-server.png]

    1000 connections is quite a few. Access logs available for analysis? I'm not sure we ever run that high in traffic, but typically once we started getting a handle on bad code, the only performance-impacting spikes we've seen are from badly behaved bots. Doing things like blocking HEAD requests and certain known bad UA's (yes, I'm looking at YOU PhantomJS) is also SOP for us.

    I also run this:

    
    DATED=`/bin/date +%F-%T`
    /usr/bin/top -b -n1 > "/home/me/toplog/top.$DATED"

    via cron every 120 seconds (it's probably not really needed any more though).

    systat/sar/iostat are also occasionally helpful in figuring out these things, maybe?
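
    Concretely, something like this (a sketch; it assumes the sysstat package is installed on your end) shows per-interface throughput and disk utilization sampled live, which is handy when you suspect the network or the disks rather than CPU/RAM:

    # network throughput per interface: 12 samples, 5 seconds apart
    sar -n DEV 5 12

    # extended disk stats (watch %util and await), every 5 seconds
    iostat -x 5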

    dalecosp 1000 connections is quite a few.

    It's currently:

    $ netstat -n | wc -l
    1137

    dalecosp Access logs available for analysis?

    Apache logs yes of course, but what analysis? Not sure how to tease any great epiphany out of these logs regarding vague but persistent reports of server slowness and dropped connections. Any specific analysis you can suggest? I'm pretty handy with grep.

    dalecosp the only performance-impacting spikes we've seen are from badly behaved bots. Doing things like blocking HEAD requests and certain known bad UA's (yes, I'm looking at YOU PhantomJS) is also SOP for us.

    This code has been running for about 15 years, and has some tweaks to manage bot madness. Why block HEAD requests? Or might you have a list of bad UAs that I could compare to my own?

    The output of top -b -n1 shows a LOT of apache processes. I'm starting to wonder if it might be an apache config issue. Apache MaxClients or MaxRequestWorkers or something.

    Does anyone have any thoughts about the incoming network traffic exceeding the outgoing network traffic? That's what those graphs seem to suggest -- and this makes no sense given the nature of our site. Is there some easy process to track network behavior continuously for inspection later? I've seen some netstat tips, but am not sure what to make of the output of netstat -s.

      As far as analysis goes, the first thing I look for is a large number of requests from a single IP address. That's not conclusive, but once I find one I can look at its User-Agent string and decide whether I think it's a bot.

      We disabled HEAD because, looking at the logs, we saw nothing using HEAD that looked like a valid request; what we did see appeared to be automated/bot traffic.
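
      If it helps, these are the kind of quick-and-dirty pipelines I mean (a sketch; adjust the log path, and the field positions if you're not on the default combined log format):

      # top 20 client IPs by request count
      awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

      # which User-Agents are actually sending HEAD requests
      grep '"HEAD ' /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn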

      Our bad UA data is variously in .htaccess or httpd.conf:

      RewriteCond %{HTTP_USER_AGENT} ^.*(PhantomJS|wget|HTTrack|python).*$
      RewriteRule . - [F,L]
      
      RewriteCond %{HTTP_USER_AGENT} Java.*
      RewriteRule ^/(.*)$ /$1 [F]
      
      RewriteCond %{HTTP_USER_AGENT} ^.*(lr_http_client).*$ [NC]
      RewriteRule ^/(.*)$ /$1 [F]
      
      RewriteCond %{HTTP_USER_AGENT} ^linkoatl
      RewriteRule ^/(.*)$ /$1 [F]
      
      RewriteCond %{HTTP_USER_AGENT} ^zgrab
      RewriteRule ^/(.*)$ /$1 [F]
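
      A quick sanity check after reloading rules like these (a sketch; the URL is a placeholder for your own site) is to fake the User-Agent with curl and confirm the blocked one comes back 403 while a normal one still gets 200:

      curl -s -o /dev/null -w '%{http_code}\n' -A "PhantomJS" https://www.example.com/
      curl -s -o /dev/null -w '%{http_code}\n' -A "Mozilla/5.0" https://www.example.com/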

      As for "incoming more than outgoing", is it all HTTP traffic? Because that seems like a classic symptom of a Ping-Flood type DDOS attack. But I'm paranoid, of course, and no expert.

      As for this being an Apache issue, it would certainly be possible given the amount of traffic you're indicating. Do you watch "apachectl fullstatus", for example?

      7 days later

      dalecosp As for "incoming more than outgoing", is it all HTTP traffic? Because that seems like a classic symptom of a Ping-Flood type DDOS attack. But I'm paranoid, of course, and no expert.

      It should be all HTTP traffic but I'm not really sure how to check that sort of thing. Suggestions welcome. I don't think the server is under attack. Suggestions on how to check that are also welcome. Our server doesn't respond to pings and lives behind a firewall.
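
      My best guess at how to check it (a sketch; assumes the interface is eth0 and that legitimate traffic is on ports 80/443) is to count packets on the web ports for a short window with tcpdump and compare that against everything else. Connections to the RDS box would land in the second bucket, so some non-web traffic is expected:

      # packets on the web ports over ~30 seconds
      sudo timeout 30 tcpdump -nn -i eth0 -q 'tcp port 80 or tcp port 443' 2>/dev/null | wc -l

      # packets on anything else (ICMP, UDP, other TCP ports) over the same window
      sudo timeout 30 tcpdump -nn -i eth0 -q 'not (tcp port 80 or tcp port 443)' 2>/dev/null | wc -l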

      What do you mean by "watch apachectl fullstatus"? I've looked at the output of that command and it's a bit Greek to me. I'm imagining some kind of cron job to accumulate its output over time for analysis later. Is that what you mean?
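
      If that is the idea, I'm picturing something like this (a sketch modeled on your top-logging script; the paths are placeholders, and fullstatus needs mod_status enabled plus a text browser like lynx installed) run from cron every couple of minutes:

      #!/bin/sh
      DATED=`/bin/date +%F-%T`
      /usr/sbin/apachectl fullstatus > "/home/me/statuslog/status.$DATED"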

      In the meantime, I'm greatly concerned that there's some kind of ticking time bomb performance-wise that needs to be dealt with. I've implemented some of the palliative measures you suggested regarding bad bots and so on, but I think the issue seems more likely to happen during our busy times (two users have complained) and I just can't seem to find out where the bottleneck is.

      My usual bottleneck-finding tricks are:
      1) check the Amazon Web Services graphs to look for really high CPU usage or low memory. These graphs seem mostly nominal (see above).
      2) login to machine and check memory and processes using free and/or top
      3) scan apache error logs looking for errors.

      None of the three yields much news. The graphs look mostly nominal (except maybe the incoming traffic being higher than outgoing traffic). Memory seems fine inasmuch as the swap isn't being used at all. CPU usage seems totally nominal and the load averages are low. I've also checked the apache error logs and, while there were a couple of 'MaxRequestWorkers reached' errors, these are very infrequent -- only 4 in the past 10 days. No errors of that sort occurred today, when we got 2 complaints of slowness.

      I desperately need some kind of comprehensive and methodical approach to finding this bottleneck and I'm not even sure where to start. I've exhausted my usual bag of tricks.

        At that point I might be complaining to your cloud provider? (Although how Amazon is with that stuff I have no idea.) We have a couple of hosted VMs with a different service. They have led us to believe that one tenant on a VM host can use enough resources (the back-end SAN, for example) to impact other users in the "general vicinity" -- a VM is usually one of several tenants on one real-world hardware box, right? So I wonder if the problem isn't actually you. That would at least have to be one possibility to consider when gazing into this crystal ball ...

        As for "fullstatus", I have this in my shell resource file for some of our servers:

        alias "status" "apachectl fullstatus | head -n 29"

        That gives me this:

                          Apache Server Status for localhost (via ::1)
        
           Server Version: Apache/2.4.35 (NetBSD) OpenSSL/1.0.1u-freebsd
                  PHP/5.6.37
        
           Server MPM: prefork
           Server Built: unknown
             __________________________________________________________________
        
           Current Time: Friday, 28-Sep-2018 16:09:11 CDT
           Restart Time: Friday, 21-Sep-2018 10:11:12 CDT
           Parent Server Config. Generation: 1
           Parent Server MPM Generation: 0
           Server uptime: 37 days 5 hours 57 minutes 59 seconds
           Server load: 0.34 0.53 0.51
           Total accesses: 1096342 - Total Traffic: 15.4 GB
           CPU Usage: u183.914 s11.9219 cu0 cs0 - .0313% CPU load
           11.75 requests/sec - 9.1 kB/second - 5.2 kB/request
           16 requests currently being processed, 14 idle workers
        
        _CWC_WR_KKK.KSCWWRLC___________...........
        ................................................................
        ................................................................
        ................................................................
        
           Scoreboard Key:
           "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
           "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
           "C" Closing connection, "L" Logging, "G" Gracefully finishing,

          dalecosp

          At that point I might be complaining to your cloud provider? (Although how Amazon is with that stuff I have no idea)...

          Amazon is terrible for things like this. I believe you can pay for the privilege of speaking with a tech, but I have never in a decade had any real live help from an Amazon employee. They sometimes chime in on the forums, but it's rare.

          And yes, it's my understanding that these EC2 instances are timeslices running on a hypervisor on some actual "box" -- though whether that box is some kind of awesome hardware like a mainframe, I do not know. AWS does offer specs on their EC2 instances, but I don't really know what tools or techniques might be used to test these promised specs -- especially not on a production system.

          Given that I don't lack memory, that CPU usage seems fine, and that the apache log doesn't complain much about MaxRequestWorkers (or anything else really), is there some other resource or latency problem? Is there some utility/command/file that I might check to definitively identify and document the problem? Possible candidates that occur to me are database latency, network congestion, disks being maxed out, etc.
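
          Partly answering my own question, the two candidates I know how to poke at from the command line (a sketch; the RDS endpoint and user are placeholders) are disk saturation and round-trip latency to the database:

          # disk: %util near 100 or large await values point at storage as the bottleneck
          iostat -x 5 3

          # db latency: time a trivial query against the RDS endpoint
          time mysql -h mydb.example.rds.amazonaws.com -u myuser -p -e 'SELECT 1;'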

          As for "fullstatus", I have this in my shell resource file for some of our servers

          How do you use this information? If you're capturing it via cron, do you somehow rotate it? One of the problems I have is that these periods of slowness often happen while I'm busy with something else. Seems to me that you'd need some kind of file rotation scheme. Looking at the output of mine, the first bit of output seems helpful for determining network usage:

          Current Time: Friday, 28-Sep-2018 21:59:49 UTC
          Restart Time: Tuesday, 25-Sep-2018 18:53:19 UTC
          Parent Server Config. Generation: 1
          Parent Server MPM Generation: 0
          Server uptime: 3 days 3 hours 6 minutes 29 seconds
          Server load: 0.04 0.05 0.07
          Total accesses: 4939977 - Total Traffic: 12.9 GB
          CPU Usage: u9.08 s4.58 cu0 cs.03 - .00506% CPU load
          18.3 requests/sec - 50.0 kB/second - 2800 B/request
          18 requests currently being processed, 16 idle workers

          This just jogged a thought loose: my network input exceeding my network output might be related to the fact that db traffic is also on the network. That said, 12.9GB over 3 days hardly seems to approach any network limits.

            I only watch "fullstatus" when I'm logged in and have a need. I do have some automated collection that I use during anomalies/emergencies (typically AFTER the emergency, alas, but you understand).

            I dunno if this is helpful, but here's something I do that has helped in the past. top.sh runs via cron every 2 minutes.

            #!/bin/sh
            
            DATE=`/bin/date +%F-%T`
            /usr/bin/top -b -n1 > "/home/me/toplog/top.$DATE"

            endtop.sh runs each morning at 4 AM-ish.

            #!/bin/sh
            
            DATE=`date +%F`
            
            /usr/bin/head -n 19 /home/me/toplog/top* > /home/me/toplog/archive/top.$DATE
            /bin/rm -f /home/me/toplog/top*

            Nineteen lines of output gets me all the details about Tasks, CPU, Mem, Swap and the top 12 programs running every two minutes during the past twenty-four hours. I can page through the file and see, historically, what was going on at different points of the day.

            Then I use that knowledge (what time it was) and go to that section of the access logs to see what the server was feeding, and whom it was feeding it to, at the time the system was affected. Sometimes it allows some insight ... other times, not so much.

              Thank you, dalecosp, for that detail.

              This problem has become serious, and I don't think the issue is due to any memory/disk/bandwidth constraints. I think it has to do with the HTTPS handshake or something. It is happening right now, and I hope that you knowledgeable types might go take a look. If the issue is an HTTPS handshake, what could be causing it?
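
              If it is the TLS handshake, one way I know to measure it from the outside (a sketch; the URL stands in for our site) is curl's timing variables -- time_appconnect minus time_connect is roughly the handshake time:

              curl -s -o /dev/null \
                -w 'dns: %{time_namelookup}  tcp: %{time_connect}  tls: %{time_appconnect}  total: %{time_total}\n' \
                https://www.example.com/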

                OK the slowness problem has magically vanished. I was in the process of checking all resource amounts (memory, bandwidth, etc.) and the slow query log (which hasn't shown anything at all today) when suddenly the server speed was fine again.

                I think the issue may be due to TLS handshaking or something. The crazy irony is that once the slowness disappeared, I saw a dramatic increase in the number of connections to the server and the number of connections between the web server and the database server. The bottleneck, whatever it is, appears to prevent network connections to and from my server from being made.

                For instance, during the slowness, I'd run this command to see how many network connections were active:

                netstat -n | wc -l

                This value was steady around 800. I used a variation of this command to see how many connections were open between the web server and the database server:

                # where 123.123.123.123 would be replaced with the IP of the db server
                netstat -an | grep 123.123.123.123 | wc -l

                This value was very steady between 190 and 210.

                As soon as the performance problem vanished, these numbers doubled:

                netstat -n | wc -l
                1705
                netstat -an | grep 123.123.123.123 | wc -l
                484

                This is maddening.
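
                Since I keep missing the window, my plan is to log these counts from cron every minute so I have numbers from during the slowness (a sketch; the db IP is the same placeholder as above):

                #!/bin/sh
                NOW=`/bin/date +%F-%T`
                TOTAL=`netstat -n | wc -l`
                DB=`netstat -an | grep 123.123.123.123 | wc -l`
                echo "$NOW total=$TOTAL db=$DB" >> /home/me/connlog/connections.log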

                  14 days later

                  For anyone who was interested in this question, the slowness came and went and came and went. I eventually solved the problem by increasing MaxRequestWorkers in my apache config from 150 to 200. It's important to avoid setting this value too high or you can run out of memory REALLY fast and your server will start to use the swap and slow to a crawl. You can get an idea of how much memory apache uses on average by massaging the output of this command:

                  top -b -n 1 | grep apache2

                  On my machine, the memory column displays as a % of total memory. I did some find-and-replace to flank this column with tabs, pasted it into a LibreOffice spreadsheet, and took an average; the inverse of that average (or something like it) gives the max number of apache processes that might run with my available memory.
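
                  For anyone who wants to skip the spreadsheet, something like this (a sketch; it assumes the processes are named apache2 as in the command above, and the 2500 MB figure is a placeholder for however much RAM you can spare for Apache) gets the same estimate straight from ps:

                  # average resident size per apache2 process, in MB
                  ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {printf "avg %.1f MB over %d procs\n", sum/n/1024, n}'

                  # rough MaxRequestWorkers ceiling: spare RAM divided by average size per worker
                  ps -C apache2 -o rss= | awk -v avail=2500 '{sum+=$1; n++} END {printf "max workers ~ %d\n", avail/(sum/n/1024)}'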
