I've run a site for 15 years now. Today it's hosted in the Amazon cloud and consists of a web server and an RDS server doing database duty. It's a busy time of year, but we've very recently started getting reports of timeouts and 'site can't be reached.' I've checked the monitoring stats for the two servers and, with the exception of a few very brief CPU spikes here and there, the CPU and memory usage seem totally nominal.
The machine isn't swapping at all (in fact, no swap is configured):
$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.5G       3.2G       322M        31M       146M       2.6G
-/+ buffers/cache:       419M       3.1G
Swap:           0B         0B         0B
The DB server looks even more nominal. CPU usage is typically 5%, with occasional spikes to 15%. Plenty of RAM available. The slow query log hasn't had an entry in weeks.
I'm wondering if it might be a network bottleneck on the web server? The web server is an Amazon EC2 instance, and actual network specs seem really hard to come by. My instance is an m1.medium, which is rumored to have a network bandwidth of 800Mbit/sec.
I haven't been able to check the machine right when the trouble is happening (it's usually quite quick when I check it later), but when I log in and count TCP connections, the number is typically between 1000 and 1400:
$ netstat -nt | wc -l
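A raw line count lumps every connection state together, and it also includes the two header lines that netstat prints. As a rough sketch (the function name `count_states` is my own, and it assumes the usual net-tools layout where the state is the sixth column), the total could be broken down by state like this:

```shell
# Tally TCP connections by state from `netstat -nt` output read on stdin.
# Assumes the state (ESTABLISHED, TIME_WAIT, ...) is the 6th column and
# that the first two lines are headers, as in net-tools netstat.
count_states() {
    awk 'NR > 2 { states[$6]++ } END { for (s in states) print states[s], s }' | sort -rn
}

# Usage on the live machine:
#   netstat -nt | count_states
```

If most of the 1000 to 1400 turn out to be TIME_WAIT rather than ESTABLISHED, the number of users actually connected at once is probably much lower than the raw count suggests.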
1400 simultaneous connections seems like it could easily consume 800Mbit/sec, but even with the link fully saturated each connection would still get roughly 0.57 Mbit/sec (800 / 1400), or about half a Mbit/sec per user.
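To make that back-of-the-envelope number explicit, here is the arithmetic as a small sketch. The helper name `per_conn_mbit` is made up for illustration, and the 800Mbit/sec figure is still only the rumored value, not a confirmed spec:

```shell
# Even share of a link across N connections, in Mbit/sec.
# $1 = total link bandwidth in Mbit/sec (800 is the rumored m1.medium figure)
# $2 = number of simultaneous connections
per_conn_mbit() {
    awk -v total="$1" -v conns="$2" 'BEGIN { printf "%.2f\n", total / conns }'
}

per_conn_mbit 800 1400   # prints 0.57
```

This assumes every connection is active and gets an equal share, which is a worst case; idle keep-alive connections would leave more for the rest.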
Can anyone suggest a methodical approach to locating the problem? This seems like such a black art to me.
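Since I can never seem to catch the machine while the trouble is happening, one thing I could do is log a sample every few seconds and compare it against the times users report problems. A minimal sketch, assuming Linux (`/proc/loadavg`); the function name `log_sample` and the default log path are hypothetical:

```shell
# Append one timestamped sample (connection count and 1-minute load) to a log.
# Reads `netstat -nt` output on stdin and skips its two header lines.
# $1 = log file path (the default below is just a placeholder).
log_sample() {
    log_file="${1:-/tmp/conn-samples.log}"
    conns=$(awk 'NR > 2' | wc -l | tr -d ' ')
    load=$(cut -d ' ' -f 1 /proc/loadavg 2>/dev/null || echo "n/a")
    printf '%s conns=%s load1=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$conns" "$load" >> "$log_file"
}

# Usage: sample every 10 seconds (run inside screen/tmux):
#   while true; do netstat -nt | log_sample; sleep 10; done
```

The resulting log can then be lined up with the timeout reports to see whether connection counts or load actually jump at those moments.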
EDIT: this is a screenshot of the Amazon console. Note the network stats. More incoming than outgoing? Seems fishy. [upl-image-preview url=https://board.phpbuilder.com/assets/files/2018-09-20/1537478757-514613-myplan-web-server.png]