You will benefit more from persistent connections when you have a higher [latency * bandwidth]:filesize ratio. That is, the higher the latency and bandwidth and the smaller the file size, the more you benefit from persistent connections.
The higher the latency, the longer the round trips needed to close and open connections, and thus the more time is wasted managing each connection. This is the time saved by persistent connections.

There is also a mechanism called slow-start (though I do not know if it is always used by servers), which means the server starts by sending a single packet. For each ACK it receives, it sends one additional packet in the next round, so each round trip doubles the transfer rate until some maximum is reached. With persistent connections there is no need for this: subsequent files may be sent at the previously determined speed. The possible gain here depends on the maximum-transfer-rate:file-size ratio. With larger files, a smaller part of the total time is spent reaching maximum transfer speed; with very small files, you might never reach it at all.
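
As a bit of napkin math (a rough sketch only; the function name is illustrative and it assumes 1460-byte segments, an initial window of one segment and doubling every round trip, which simplifies what real TCP stacks do):

// Rough estimate of round trips spent in slow-start before $fileBytes
// have been sent. Assumes 1460-byte segments, an initial window of one
// segment and doubling per round trip -- a simplification.
function slowStartRtts($fileBytes, $segment = 1460) {
    $sent = 0;
    $window = 1;
    $rtts = 0;
    while ($sent < $fileBytes) {
        $sent += $window * $segment;
        $window *= 2;
        $rtts++;
    }
    return $rtts;
}

echo slowStartRtts(50 * 1024) . "\n"; // ~6 round trips for a ~50KB file; at 250ms
                                      // latency that is ~1.5s of ramp-up per cold connection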

Apart from sneaky's advice on multiple concurrent requests, you probably stand to gain the most from compressed transfers. See RFC 2616, section 3.5 (content codings, e.g. gzip and deflate) and section 14.3 (the Accept-Encoding header). I'm guessing neither file_get_contents nor curl sends an Accept-Encoding header by default, but curl at least allows you to set request headers. Perhaps it is possible to specify request headers through php.ini for file_get_contents? But I doubt it. Anyway, the server does not have to compress anything, but if it does compress transfers, you will gain a lot versus uncompressed transfers. Compression also works well with persistent connections: as file size goes down, the gain from persistent connections goes up.
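
For what it's worth, on the cURL side requesting compression is a single option (a sketch; whether you actually get gzip back is still up to the server):

// An empty CURLOPT_ENCODING value makes libcurl send Accept-Encoding with
// every encoding it supports (gzip, deflate, ...) and transparently
// decompress the response body for you.
$ch = curl_init('http://example.com/somepage.html');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING       => '',
));
$html = curl_exec($ch);
curl_close($ch);

(As an aside, I believe request headers for file_get_contents can be passed via a stream context rather than php.ini, though the http wrapper will not decompress the body for you.)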

    sneakyimp;11043143 wrote:

    Depending on what exactly your script is doing with these remote requests and also on how many requests you queue up for curl_multi_exec, I'd be willing to bet you'll get a manifold performance boost. Requesting a long list of remote files sequentially one at a time is one of the most latency-plagued operations you can do. If you MUST request pages in sequence, then you are unlikely to get much improvement unless you can find some way to break down your overall task into parallelizable tasks.

    What the script does is save the entire contents of the remote web page to disk (and later, analyze it). An initial test looks somewhat promising. The tool has been running 3 stages with about a 20 hour total runtime. I ran the first 2 stages at end of day yesterday and they completed in 2.3 hours; this was running via curl_multi_exec with 6 concurrent "threads". However, the script itself has changed because the target site has been refactored; I think we downloaded about 80% fewer files, so my estimate of the "performance gain" from six threads over one is about one hour and 40 minutes.

    NOTE: depending on remote server security features (e.g., fail2ban) and capabilities, your raft of massively parallel file requests may appear to be a DoS attack. It may in fact bring the remote server to its knees depending on how many simultaneous requests you make. I would strongly recommend some way to easily configure how many simultaneous requests you make. Maybe define a constant CONCURRENT_REQUESTS=40 or something.

    Lol! Fairly well aware of that. We have been inadvertently blocked by one of our target sites; a few emails later and we're OK with them. Not sure about this target. I do note that in changing from file_get_contents to cURL multi on this tool I failed to configure my UA string in cURL 😉 I'd probably best resolve that before the next run 😃

    I would also strongly recommend configuring your script so you can turn on or off keep-alive settings. That should make it super easy to profile performance across many connection requests.

    A good & logical idea 🙂

    johanafm;11043155 wrote:

    You will benefit more from persistent connections when you have a higher [latency * bandwidth]:filesize ratio. That is, the higher the latency and bandwidth and the smaller the file size, the more you benefit from persistent connections.
    The higher the latency, the longer the round trips needed to close and open connections, and thus the more time is wasted managing each connection. This is the time saved by persistent connections.

    The first two stages produce 13K files with an average size of 51KB each. Is that "small"? I'm guessing "yes", but really I'm not sure. I do have tools that load big files, but, typically, that's One Big File To Rule Them All.

    There is also a mechanism called slow-start (though I do not know if it is always used by servers), which means the server starts by sending a single packet. For each ACK it receives, it sends one additional packet in the next round, so each round trip doubles the transfer rate until some maximum is reached. With persistent connections there is no need for this: subsequent files may be sent at the previously determined speed. The possible gain here depends on the maximum-transfer-rate:file-size ratio. With larger files, a smaller part of the total time is spent reaching maximum transfer speed; with very small files, you might never reach it at all.

    Ah, yes; that's actually a TCP algorithm, described in RFC 5681 (learned about that when I was doing troubleshooting for people with ISP-via-satellite). I hadn't considered that as having an effect here.

    It brings up an issue, for certain. Since I'm reading $x files and saving them, the process now looks something like:

    1. Do the $n concurrent HTTP requests and put the results in an array...
    2. Loop the array and write each file to disk.

    It appears to me that I might be better off to have each thread do the write as well ... but at some point I'll likely be I/O bound. And I'm not quite sure how to work the writing into the curl_multi_exec() routine I've got working at the moment:

    function curlMultiRequest($urls, $options = array()) {
        $ch = array();
        $results = array();
        $mh = curl_multi_init();
        foreach($urls as $key => $val) {
            $ch[$key] = curl_init();
            if ($options) {
                curl_setopt_array($ch[$key], $options);
            }
            curl_setopt($ch[$key], CURLOPT_URL, $val);
            curl_multi_add_handle($mh, $ch[$key]);
        }
        $running = null;
        do {
            // Note: this spins (busy-waits) until every transfer in the batch has finished.
            curl_multi_exec($mh, $running);
        } while ($running > 0);
        // Get content and remove handles.
        foreach ($ch as $key => $val) {
            $results[$key] = curl_multi_getcontent($val);
            curl_multi_remove_handle($mh, $val);
        }
        curl_multi_close($mh);
        return $results;
    }
    
    $url_chunks = array_chunk($urllist, $numthreads, true);

    foreach ($url_chunks as $chunk) {
        $returns = curlMultiRequest($chunk, array(CURLOPT_RETURNTRANSFER => 1));
        foreach ($returns as $key => $val) {
            $write = file_put_contents("cache/redacted/subpages/" . $key . ".html", $val);
        }
    }
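
    One way to skip the in-memory array (and the second loop) is to point each handle at a file and let cURL stream the body to disk as it arrives. A sketch of the idea only, reusing the cache path from above -- not a drop-in replacement for curlMultiRequest():

    $fh = array();
    foreach ($urls as $key => $val) {
        $ch[$key] = curl_init($val);
        // Open the target file and have libcurl write straight into it,
        // instead of buffering the body with CURLOPT_RETURNTRANSFER.
        $fh[$key] = fopen("cache/redacted/subpages/" . $key . ".html", "w");
        curl_setopt($ch[$key], CURLOPT_FILE, $fh[$key]);
        curl_multi_add_handle($mh, $ch[$key]);
    }
    // ... run the multi loop as before, then curl_multi_remove_handle()
    // and fclose() each file handle once its transfer has finished.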

    Apart from sneaky's advice on multiple concurrent requests, you probably stand to gain the most from compressed transfers. See RFC 2616, section 3.5 (content codings, e.g. gzip and deflate) and section 14.3 (the Accept-Encoding header). I'm guessing neither file_get_contents nor curl sends an Accept-Encoding header by default, but curl at least allows you to set request headers. Perhaps it is possible to specify request headers through php.ini for file_get_contents? But I doubt it. Anyway, the server does not have to compress anything, but if it does compress transfers, you will gain a lot versus uncompressed transfers. Compression also works well with persistent connections: as file size goes down, the gain from persistent connections goes up.

    I will definitely look into what it takes to request gzipped content (I know we have gzip-enabled content for some MIME types on our sites). Persistent connections will also get a look-see ... hopefully we can make this thing run like a gazelle instead of an overweight hippo ...

    Thanks, fellas 🙂

      dalecosp;11043157 wrote:

      The first two stages produce 13K files with an average size of 51KB each. Is that "small"?

      They are "small". I read a report from way back regarding time gains from persistent connections. They considered a "regular user" to be on 28.8 kbps, and average files in ordinary requests back then were around 6kB. An example calculation for 250ms latency and 6kB files gave ~15% waste on 28.8 kbps and ~50% waste on 100 kbps, where "waste" is the time spent setting up and closing connections. This "waste" therefore represents the potential gain from persistent connections. I do not remember whether it also included slow-start waste.
      Comparing this to your actual file sizes and taking a wild bandwidth guess: at only 10x the file size and 100x (2.8 Mbps) to 1000x (28 Mbps) the bandwidth (?), you should have more than 50% potential gain from persistent connections, assuming other things are similar to their calculations.
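
      Put differently, a crude way to estimate that "waste" fraction yourself (a sketch; it only counts a rough connection setup/teardown cost and ignores slow-start entirely):

      // Fraction of per-file time spent opening and closing the connection:
      // waste = setup / (setup + transfer). Setup is guessed at 1.5 RTTs.
      function connectionWaste($fileBytes, $bitsPerSec, $rttSec) {
          $setup    = 1.5 * $rttSec;
          $transfer = ($fileBytes * 8) / $bitsPerSec;
          return $setup / ($setup + $transfer);
      }

      // The old report's numbers, roughly reproduced:
      echo connectionWaste(6 * 1024, 28800, 0.25) . "\n";    // ~0.18
      echo connectionWaste(6 * 1024, 100000, 0.25) . "\n";   // ~0.43
      // 51KB files on a few Mbps still leave plenty on the table:
      echo connectionWaste(51 * 1024, 2800000, 0.25) . "\n"; // ~0.72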

      I have not used curl_multi_exec, so do not know if this is easily done or even at all possible. You might actually have to handle the multithreading yourself, with each thread using one persistent connection and downloading lots of files in sequence.

        My understanding of persistent connections is that your browser, for instance may want to request a url (e.g. http://example.com) and ask that the connection be kept open just in case the resulting HTML might result in more requests (e.g., for images, CSS, JS, etc.). Unfortunately we have no idea whether curl_multi_exec will repurpose a connection to a particular domain for subsequent requests or not. That's essentially why I suggested creating a configuration constant or something so you can flick it on or off and compare the results.

        The entire advantage of 'keep-alive' as I see it is to eliminate the overhead involved in initiating a new connection to the remote server. My understanding of this is somewhat limited but I think it basically amounts to making a socket connection on port 80 (or 443 or whatever) which may result in one having to join a queue of waiting connections. Once you are first in the queue, the remote host will hand your connection request off to some other port where you will open a socket connection and initiate the file transfer handshaking and stuff. A 'persistent connection' would maintain this socket connection longer just in case additional files were to be transferred. A non-persistent connection would immediately close once the file request was served. johanafm's post seems pretty interesting. I have a vague recollection of my web server suffering from slow page loads because each page consisted of numerous small files (html, images, js, etc.) and when we allowed persistent connections, a single page load was much much faster.

        The balancing idea, however, is that if you have a whole ton of persistent connections being made then you may get a situation where the server cannot permit any more connections but each connection is not being fully utilized. E.g., dozens of requests connect to the server via persistent connection and then only request one file each. This would be a total waste of the server's available connections, and additional connection attempts would probably time out. This is analogous to a 'too many connections' error you might sometimes get using MySQL.

        With curl_multi_exec, I'm going to make a guess and speculate that it may not keep a pool of connections for efficient re-use. This is only a guess and is partly based on the notion that one can feed curl_multi_exec a list of totally unrelated domains for all these connections and partly based on the idea that writing the extra code in curl to intelligently manage such a pool of connections sounds just a tiny bit harder than just opening a connection and closing it. I could be totally wrong.
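
        If you do want to profile with and without reuse, the request side of the toggle is cheap (a sketch; the constant name is just for illustration, and as far as I understand reuse with curl_multi happens via the multi handle's connection pool, so building a fresh multi handle per chunk also throws that pool away between batches):

        define('USE_KEEPALIVE', true); // flip to false for the comparison run

        $opts = array(CURLOPT_RETURNTRANSFER => 1);
        if (!USE_KEEPALIVE) {
            // Force a brand-new connection for every request and close it afterwards.
            $opts[CURLOPT_FRESH_CONNECT] = true;
            $opts[CURLOPT_FORBID_REUSE]  = true;
        }
        $returns = curlMultiRequest($chunk, $opts);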

          dalecosp;11043157 wrote:

          this was running via curl_multi_exec with 6 concurrent "threads".

          Only six? I'd go for twenty at least. There is surely some calculation that can be performed, but I think of it this way: a typical server page request takes about a second to serve from the moment you enter the url to the time the page starts to load. Depending on the size of the file requested (very small in your case) and the bandwidth available for the request, most of this time is latency. How long do you think it would take your system to process the response of the server once it has been retrieved? 10 milliseconds? Some simple napkin math would suggest to me that 1000ms/10ms means 100 threads would probably be optimal for your server -- not so much for the remote machine which may panic or ban you or something.

          dalecosp;11043157 wrote:

          However, the script itself has changed because the target site has been refactored; I think we downloaded about 80% fewer files, so my estimate of the "performance gain" from six threads over one is about one hour and 40 minutes.

          I set up my system to try each file 10 times before giving up. Recently requested files that had failed were of course moved to the end of the queue (or at least postponed). You may not have time for this and, again, there could be security features in place that would make this a bad idea. For instance, if you requested a file and got a 404 NOT FOUND and the remote server had fail2ban installed with a NOSCRIPT jail then you would find yourself banned pretty quickly.
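
          The retry bookkeeping does not have to be fancy. Something along these lines (a sketch; fetchOne() is a hypothetical "download and save, return true on success" helper):

          $queue       = $urllist;   // key => url, keys reused as filenames
          $attempts    = array();
          $maxAttempts = 10;

          while (!empty($queue)) {
              reset($queue);
              $key = key($queue);
              $url = $queue[$key];
              unset($queue[$key]);
              $attempts[$key] = isset($attempts[$key]) ? $attempts[$key] + 1 : 1;

              if (fetchOne($url, $key)) {
                  continue;                      // success, move on
              }
              if ($attempts[$key] < $maxAttempts) {
                  $queue[$key] = $url;           // failed: push it to the back of the queue
              } else {
                  error_log("giving up on $url after $maxAttempts attempts");
              }
          }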

          dalecosp;11043157 wrote:

          Lol! Fairly well aware of that. We have been inadvertently blocked by one of our target sites; a few emails later and we're OK with them. Not sure about this target. I do note that in changing from file_get_contents to cURL multi on this tool I failed to configure my UA string in cURL 😉 I'd probably best resolve that before the next run 😃

          I know I don't have to tell you this, but the more you can seem like real users, the less likely you are to get blocked/banned/scolded.

          dalecosp;11043157 wrote:

          It brings up an issue, for certain. Since I'm reading $x files and saving them, the process now looks something like:

          1. Do the $n concurrent HTTP requests and put the results in an array...
          2. Loop the array and write each file to disk.

          It appears to me that I might be better off to have each thread do the write as well ... but at some point I'll likely be I/O bound. And I'm not quite sure how to work the writing into the curl_multi_exec() routine I've got working at the moment:

          A typical modern hard drive can write something like 100MB/sec to disk, so I think you'd be much more likely to be I/O bound on your network connection than on your disk. Also, if you store all the server responses in an array, there's a chance you may exceed your memory limits. Make sure you log errors/exceptions so you can be sure that your script is properly completing its work with each request.

          function curlMultiRequest($urls, $options = array()) {
              $ch = array();
              $results = array();
              $mh = curl_multi_init();
              foreach($urls as $key => $val) {
                  $ch[$key] = curl_init();
                  if ($options) {
                      curl_setopt_array($ch[$key], $options);
                  }
                  curl_setopt($ch[$key], CURLOPT_URL, $val);
                  curl_multi_add_handle($mh, $ch[$key]);
              }
              $running = null;
              do {
                  // Note: this spins (busy-waits) until every transfer in the batch has finished.
                  curl_multi_exec($mh, $running);
              } while ($running > 0);
              // Get content and remove handles.
              foreach ($ch as $key => $val) {
                  $results[$key] = curl_multi_getcontent($val);
                  curl_multi_remove_handle($mh, $val);
              }
              curl_multi_close($mh);
              return $results;
          }
          
          $url_chunks = array_chunk($urllist, $numthreads, true);

          foreach ($url_chunks as $chunk) {
              $returns = curlMultiRequest($chunk, array(CURLOPT_RETURNTRANSFER => 1));
              foreach ($returns as $key => $val) {
                  $write = file_put_contents("cache/redacted/subpages/" . $key . ".html", $val);
              }
          }

          Seems to me that the flaw with this code is that a whole bunch of connection attempts might just end up waiting on the slowest one in the bunch. One bad apple can ruin the party for everyone.

          I'm also wondering if you are doing it right. I haven't used it myself but the logic in the documentation looks a little different than yours:

          $active = null;
          //execute the handles
          do {
              $mrc = curl_multi_exec($mh, $active);
          } while ($mrc == CURLM_CALL_MULTI_PERFORM);
          
          while ($active && $mrc == CURLM_OK) {
              if (curl_multi_select($mh) != -1) {
                  do {
                      $mrc = curl_multi_exec($mh, $active);
                  } while ($mrc == CURLM_CALL_MULTI_PERFORM);
              }
          }

            I've read the doc example ... had trouble wrapping my head around combining that logic with a write operation using [man]curl_multi_getcontent[/man].

            That said, I'm playing now with the pcntl_fork() idea and having some marginal success. I even did 7 threads, hee hee!!

            I'll take your audacious 😉 😉 goal (20-100 threads) to heart on the next run and see what happens. 🙂

              I've always been a bit confused by curl_multi_exec mostly because it seems to offer concurrent operation in a single-threaded context somehow. I had similar mind bogglers when working with [man]socket_select[/man].

              One must be very careful when starting to use multithreaded programs -- which is essentially what you are doing when you use [man]pcntl_fork[/man]. The basic idea is that you fork off a new process from your primary process and that new process inherits some of the values from the parent process and you must take great care when trying to access data objects because you have no guarantees about which process will access something first or in what sequence. The two biggest risks that come to mind are race conditions and deadlock. I have found that a database is a pretty good way to coordinate many simultaneous processes, but you still have to think about what happens when you have many different processes all working on some data set. For example, if your database holds the list of remote urls to be fetched, how do you make sure that only one process fetches a particular url? It might take two operations to find a url that needs fetching and then lock that record so there's a slim window between the two operations where some other process might sneak in there, etc. It can be quite a mind boggler.
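
              For the "who fetches which url" race specifically, one common trick is to make the claim itself a single atomic statement, e.g. against a hypothetical urls table with claimed_by and fetched_at columns (PDO/MySQL sketch):

              // The UPDATE either grabs exactly one unclaimed row for this worker
              // or matches nothing; two workers cannot end up owning the same row.
              $pid   = getmypid();
              $claim = $pdo->prepare(
                  "UPDATE urls SET claimed_by = ? WHERE claimed_by IS NULL LIMIT 1"
              );
              $claim->execute(array($pid));

              if ($claim->rowCount() > 0) {
                  $stmt = $pdo->prepare(
                      "SELECT id, url FROM urls WHERE claimed_by = ? AND fetched_at IS NULL LIMIT 1"
                  );
                  $stmt->execute(array($pid));
                  $row = $stmt->fetch(PDO::FETCH_ASSOC);
                  // ... fetch $row['url'], save it, then set fetched_at for $row['id'].
              }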

              But it's surely worth it sometimes. You can efficiently bring an enormous amount of computing power to bear on big computational problems if you do it right.

              I found this little project (github version) recommended by johanafm very helpful as a basis for my previous project where I was fetching millions of images. My thread on that project is here and reading my old description, sounds kind of similar to your problem.

                When using curl_multi_exec you should generally handle the finished calls as soon as possible. This has the added benefit of allowing you to keep popping more on. As you're doing it now, you start a batch, wait for them all to finish, process them, and start another batch yes? Why not simply start another request as soon as possible and keep n requests going as long as possible?

                Think of this ascii art as completion time for each of 3 threads at a time:

                ...........
                .........
                ..
                  ^ here you could start a new request, instead of waiting for both of the others to complete
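
                Roughly, the "keep n going" version looks like this (an untested sketch -- the function name is arbitrary, it reuses the cache path and CURLOPT_RETURNTRANSFER from the code above, and error handling is mostly omitted):

                function rollingCurl(array $urls, $maxConcurrent, array $options = array()) {
                    $mh      = curl_multi_init();
                    $pending = $urls;      // key => url, key doubles as the cache filename
                    $labels  = array();    // resource id => key, so we know what finished

                    // Move one url from $pending onto the multi handle.
                    $start = function () use ($mh, &$pending, &$labels, $options) {
                        if (empty($pending)) {
                            return false;
                        }
                        reset($pending);
                        $key = key($pending);
                        $url = $pending[$key];
                        unset($pending[$key]);

                        $ch = curl_init($url);
                        curl_setopt_array($ch, $options + array(CURLOPT_RETURNTRANSFER => 1));
                        curl_multi_add_handle($mh, $ch);
                        $labels[(int) $ch] = $key;
                        return true;
                    };

                    // Prime the pool with the first $maxConcurrent transfers.
                    for ($i = 0; $i < $maxConcurrent; $i++) {
                        if (!$start()) {
                            break;
                        }
                    }

                    $running = 0;
                    do {
                        curl_multi_exec($mh, $running);
                        // Sleep until something happens (some curl builds return -1 here
                        // immediately; a short usleep() guard is a common workaround).
                        curl_multi_select($mh, 1.0);

                        // Handle every finished transfer, then immediately top the pool up.
                        while ($done = curl_multi_info_read($mh)) {
                            $ch  = $done['handle'];
                            $key = $labels[(int) $ch];
                            if ($done['result'] === CURLE_OK) {
                                file_put_contents("cache/redacted/subpages/" . $key . ".html",
                                                  curl_multi_getcontent($ch));
                            }
                            curl_multi_remove_handle($mh, $ch);
                            curl_close($ch);
                            unset($labels[(int) $ch]);
                            $start();
                        }
                    } while ($running > 0 || !empty($labels));

                    curl_multi_close($mh);
                }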
                
                  Derokorian;11043177 wrote:

                  When using curl_multi_exec you should generally handle the finished calls as soon as possible. This has the added benefit of allowing you to keep popping more on. As you're doing it now, you start a batch, wait for them all to finish, process them, and start another batch yes? Why not simply start another request as soon as possible and keep n requests going as long as possible?

                  I was not aware you could keep feeding the beast. If you are an old hand at curl_multi_exec, perhaps you could elaborate a bit on the proper use of curl_multi_exec and curl_multi_select and how to check their return values? That logic from the documentation example looks a bit weird to me. We've got two loops both calling curl_multi_exec -- just seems a bit wrong.

                  $active = null;
                  //execute the handles
                  do {
                      $mrc = curl_multi_exec($mh, $active);
                  } while ($mrc == CURLM_CALL_MULTI_PERFORM);
                  
                  while ($active && $mrc == CURLM_OK) {
                      if (curl_multi_select($mh) != -1) {
                          do {
                              $mrc = curl_multi_exec($mh, $active);
                          } while ($mrc == CURLM_CALL_MULTI_PERFORM);
                      }
                  } 
                  

                    OK, as it now stands:

                    The code is now using pcntl_fork() to fork $threads PHP processes, which run this:

                    function systemFetchSequential($urls,$path = "") {
                       // Fetch each url in sequence by shelling out to fetch(1),
                       // saving the body to $path/$key (or ./$key if no path is given).
                       $ua = "Mozilla/5.0 (FreeBSD i386; U; en-us) FooSpider(tm) 0.1.3BETA";
                       if (!is_array($urls) || empty($urls)) return false;
                       foreach ($urls as $key=>$val) {
                          $fetchurl = $val;
                          if (strlen($path)) {
                             $cmd = "fetch --user-agent=\"".$ua."\" -o $path"."/"."$key \"$fetchurl\"";
                          } else {
                             $cmd = "fetch --user-agent=\"".$ua."\" -o $key \"$fetchurl\"";
                          }
                          // Backticks run the command; stdout/stderr are discarded.
                          $action = `$cmd > /dev/null 2>&1`;
                       }
                    }

                    I've tested with both wget(1) and the BSD native fetch(1). It doesn't seem to matter. I am not clear on whether fetch(1) uses keep-alives, but wget does and evidence would suggest that fetch(1) does also (literature seems to suggest that most any HTTP/1.1 (read: modern) client program does).

                    So far, I've tried $threads at various numbers with 25 being highest. It seems to work best around $threads = 18. That might change on other, more powerful hardware, but the target machine might not like this, and the performance uptick is substantial enough I think I'm willing to leave it there. Tricky issue: now that we're calling system() from multiple PHP threads ... gotta use "kill" on the server if we want to abort. The target site was visibly affected when I forgot this and ran $threads = 25 a second time while $threads = 25 was still running from the previous test. Not one of my finer moments. Fortunately it didn't go on for TOO long.
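
                    For reference, the fork/wait wrapper around that is roughly this shape (a simplified sketch, not the exact code; it assumes $threads and $urllist are already set and reuses systemFetchSequential() from above):

                    // Split the url list into ~$threads roughly equal queues, keeping keys.
                    $url_chunks = array_chunk($urllist, (int) ceil(count($urllist) / $threads), true);
                    $pids = array();

                    foreach ($url_chunks as $chunk) {
                        $pid = pcntl_fork();
                        if ($pid == -1) {
                            die("fork failed\n");
                        } elseif ($pid === 0) {
                            // Child: fetch its share of urls, then exit.
                            systemFetchSequential($chunk, "cache/redacted/subpages");
                            exit(0);
                        }
                        $pids[] = $pid;   // parent: remember the child and keep forking
                    }

                    foreach ($pids as $pid) {
                        pcntl_waitpid($pid, $status);   // block until each child has finished
                    }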

                      You can prevent multiple instances from running. Either use a System V semaphore (only blocking mode is available in PHP) or use fopen with mode 'x' (which creates and returns a file handle, or returns false if the file already exists). Just make sure you clean up the file on shutdown. Additionally, you may write the pid to your lock file, so that it is possible to verify whether the process that created the file still exists.
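
                      For example, the fopen('x') variant might look like this (a sketch; the lock file path is arbitrary):

                      $lockFile = "/var/run/fetcher.lock";

                      $fh = @fopen($lockFile, 'x');        // fails if the file already exists
                      if ($fh === false) {
                          $other = (int) @file_get_contents($lockFile);
                          die("already running (pid $other?)\n");
                      }
                      fwrite($fh, (string) getmypid());    // record our pid for later inspection
                      fclose($fh);

                      register_shutdown_function(function () use ($lockFile) {
                          unlink($lockFile);               // clean up the lock on shutdown
                      });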

                      Installing signal handlers is easy: [man]pcntl_signal[/man]. But unless you already know how signals work you will probably have to read up on their specifics. For example, if you send a signal to the main process, I believe the children will also receive the signal unless you assign them to a different process group. If this is the case, you probably only have to install one signal handler for the main process (to wait for children to finish before shutting down itself), and install another signal handler for each child you create which deals with aborting work, closing down connections and returning.

                      When fetching one file per system call, you could of course handle INT and TERM by setting a flag to signify that int/term was received. After each file download completes (one iteration over your list of files), return instead of fetching the next file. When fetching a list of files, you might have to abort prematurely. You should ideally not take more than a few seconds to handle INT/TERM, because the system might simply kill you eventually (at shutdown). But if you only fetch one file per system call, you are invoking a new shell for each download, and it would surprise me if that let you reuse the same persistent connection. If you use curl you can fetch a list of files by subsequently calling curl_exec with the same handle. Doing this allows you to check for int/term in between downloads, while using a persistent connection for all files in the list.
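
                      Putting the last two paragraphs together, the curl variant might be shaped like this (a sketch; $abort and fetchList() are illustration names only):

                      declare(ticks = 1);   // so the handlers actually run; pcntl_signal_dispatch() works too
                      $abort = false;

                      pcntl_signal(SIGINT,  function () use (&$abort) { $abort = true; });
                      pcntl_signal(SIGTERM, function () use (&$abort) { $abort = true; });

                      function fetchList(array $urls, $path, &$abort) {
                          $ch = curl_init();   // one handle, so the connection can be reused
                          curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

                          foreach ($urls as $key => $url) {
                              if ($abort) {
                                  break;       // INT/TERM seen: stop between downloads
                              }
                              curl_setopt($ch, CURLOPT_URL, $url);
                              $body = curl_exec($ch);
                              if ($body !== false) {
                                  file_put_contents($path . "/" . $key . ".html", $body);
                              }
                          }
                          curl_close($ch);
                      }

                      fetchList($chunk, "cache/redacted/subpages", $abort);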

                        Multithreaded applications always need some kind of tweakable settings. Certainly how many threads but it also helps to keep an eye on memory consumption and CPU load averages and such. Try launching 1,000 threads doing something memory and CPU-intensive and watch your machine grind to a halt.

                        IIRC, I constructed my application to keep a pool of threads and each thread, before running the remote file fetch, would check to see if memory was available and if the load average wasn't too high. Unfortunately, I had to keep checking 1-minute load average because there's no system-busy-indicator-type-thing that indicates the system's load on a shorter time scale -- at least none that I know.
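
                        The check itself is simple enough (a sketch; the function name and thresholds are made up, and memory_get_usage() only sees this process, not the whole box):

                        // Returns true if it looks safe to start another worker.
                        function haveHeadroom($maxLoad = 4.0, $maxMemBytes = 134217728) {
                            $load = sys_getloadavg();            // 1, 5 and 15 minute load averages
                            if ($load !== false && $load[0] > $maxLoad) {
                                return false;
                            }
                            return memory_get_usage(true) < $maxMemBytes;
                        }

                        if (haveHeadroom()) {
                            // ... fork off / queue up another fetch ...
                        }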

                        I also wrote my system so that it would write a PID to a file and each thread would also pay attention to incoming signals. You could halt all the child processes using kill -15 $(cat /path/to/pid/file). I still had weird circumstances where some child processes would get orphaned or go zombie or something. It's hard to debug them sometimes.

                          Thank you, gentlemen, for your time and attention. I may do some additional tuning, but currently I feel that I've improved performance enough that I need to go forward and refactor my parser.

                          I'm currently forking 12-20 threads that run the system's fetch(1) against a list of URLs provided by the parent process. A simple pcntl_waitpid() loop holds them and their parent until they finish. Since the queues are all approximately equal in size, there's not too much waiting to do.

                          The machine is dedicated, so I'm not too worried about monitoring load.

                          You guys really are fantastically helpful!

                            5 days later
                            dalecosp;11043285 wrote:

                            Thank you, gentlemen, for your time and attention. I may do some additional tuning, but currently I feel that I've improved performance enough that I need to go forward and refactor my parser.

                            Got any estimate of how much you've improved performance? Just curious.

                            dalecosp;11043285 wrote:

                            I'm currently forking 12-20 threads that run the system's fetch(1) against a list of URLs provided by the parent process. A simple pcntl_waitpid() loop holds them and their parent until they finish. Since the queues are all approximately equal in size, there's not too much waiting to do.

                            It's been a long time since I set up my MT image fetching script, but I seem to recall you can work in logic which assesses whether the machine has enough headroom to just send off another process again. A carefully constructed MT program can be smart enough to consume all available system resources but not get out of hand.
