sneakyimp;11043143 wrote:Depending on what exactly your script is doing with these remote requests and also on how many requests you queue up for curl_multi_exec, I'd be willing to bet you'll get a manifold performance boost. Requesting a long list of remote files sequentially one at a time is one of the most latency-plagued operations you can do. If you MUST request pages in sequence, then you are unlikely to get much improvement unless you can find some way to break down your overall task into parallelizable tasks.
What the script does is save the entire contents of each remote web page to disk (and analyze it later). An initial test looks somewhat promising. The tool runs in 3 stages with about a 20 hour total runtime. I ran the first 2 stages at the end of the day yesterday and they completed in 2.3 hours; this was running via curl_multi_exec with 6 concurrent "threads". However, the script itself has changed because the target site has been refactored; I think we downloaded about 80% fewer files, so my estimate of the "performance gain" from six threads over one is about one hour and 40 minutes.
NOTE: depending on remote server security features (e.g., fail2ban) and capabilities, your raft of massively parallel file requests may appear to be a DoS attack. It may in fact bring the remote server to its knees, depending on how many simultaneous requests you make. I would strongly recommend some way to easily configure how many simultaneous requests you make. Maybe define a constant CONCURRENT_REQUESTS=40 or something.
Lol! Fairly well aware of that. We have been inadvertently blocked by one of our target sites; a few emails later and we're OK with them. Not sure about this target. I do note that in changing from file_get_contents to cURL multi on this tool I failed to configure my UA string in cURL 😉 I'd probably best resolve that before the next run 😃
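Something like this should cover both points, passed through the $options argument of the routine further down; the UA value and the CONCURRENT_REQUESTS constant are just placeholders:

define('CONCURRENT_REQUESTS', 6); // per sneakyimp's suggestion; value arbitrary here
$options = array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_USERAGENT      => 'OurTool/1.0 (admin@example.com)', // placeholder UA string
);
$url_chunks = array_chunk($urllist, CONCURRENT_REQUESTS, true); // chunk size = concurrency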
I would also strongly recommend configuring your script so you can turn on or off keep-alive settings. That should make it super easy to profile performance across many connection requests.
A good & logical idea 🙂
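Something like this, I suppose (just a sketch; the $use_keepalive flag is my invention, but CURLOPT_FRESH_CONNECT and CURLOPT_FORBID_REUSE are standard cURL options):

$use_keepalive = true; // flip this when profiling
$options = array(
    CURLOPT_RETURNTRANSFER => 1,
    // With keep-alive off, force a brand-new connection for each request
    // and tell cURL not to return the connection to its reuse pool.
    CURLOPT_FRESH_CONNECT  => !$use_keepalive,
    CURLOPT_FORBID_REUSE   => !$use_keepalive,
);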
johanafm;11043155 wrote:You will benefit more from persistent connections when you have a higher [latency * bandwidth]:filesize ratio. That is, the higher the latency and bandwidth and the smaller the file size, the more you benefit from persistent connections.
The higher the latency, the longer the round trips needed to close and open connections, and thus the more time is wasted managing each connection. This is the time saved with persistent connections.
The first two stages produce 13K files with an average size of 51KB each. Is that "small"? I'm guessing "yes", but really I'm not sure. I do have tools that load big files, but, typically, that's One Big File To Rule Them All.
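Just to put a very rough number on it (back-of-envelope only; the 100 ms round-trip time is a pure guess, and I'm counting roughly one extra round trip per new connection for the TCP handshake):

$files   = 13000; // files produced by the first two stages
$rtt     = 0.1;   // assumed round-trip time in seconds -- a guess
$threads = 6;     // concurrent requests
// One extra round trip per fresh connection, spread across the threads:
$overhead = $files * $rtt / $threads;
echo round($overhead) . " seconds of connection-setup overhead\n"; // ~217s, i.e. a few minutes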
There is also a mechanism called slow-start (but I do not know if this is always used by servers), which means that the server starts by sending a single packet. For each received ACK the server sends one additional packet the next time. Each round trip doubles the transfer rate, until some maximum is reached. With persistent connections there is no need for this. Subsequent files may be sent at the previously determined speed. The possible gain in this regard depends on the ratio maximum-transfer-rate:file-size. With larger files, a smaller part of the total time is spent reaching maximum transfer speed. For very small files, you might not even reach the threshold.
Ah, yes; that's actually a TCP algorithm, described in RFC 5681 (learned about that when I was doing troubleshooting for people with ISP-via-satellite). I hadn't considered that as having an effect here.
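Out of curiosity, the slow-start arithmetic for one of our average files looks roughly like this (the MSS of 1460 bytes and the initial window of 10 segments are assumptions; older stacks started with smaller windows):

$filesize = 51 * 1024; // average file size in bytes
$mss      = 1460;      // assumed maximum segment size
$cwnd     = 10;        // assumed initial congestion window, in segments
$segments = ceil($filesize / $mss); // ~36 segments for a 51KB file
$rtts = 0;
$sent = 0;
while ($sent < $segments) { // the window roughly doubles each round trip
    $sent += $cwnd;
    $cwnd *= 2;
    $rtts++;
}
echo "$rtts round trips to deliver the file from a cold start\n"; // ~3 here, vs ~1 on a warm connection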
It brings up an issue, for certain. Since I'm reading $x files and saving them, the process now looks something like:
- Do the $n concurrent HTTP requests and put the results in an array...
- Loop the array and write each file to disk.
It appears to me that I might be better off having each thread do the write as well ... though at some point I'll likely be I/O bound, and I'm not quite sure how to work the writing into the curl_multi_exec() routine I've got working at the moment (one possible approach is sketched after the code below):
function curlMultiRequest($urls, $options = array()) {
    $ch = array();
    $results = array();
    $mh = curl_multi_init();
    foreach ($urls as $key => $val) {
        $ch[$key] = curl_init();
        if ($options) {
            curl_setopt_array($ch[$key], $options);
        }
        curl_setopt($ch[$key], CURLOPT_URL, $val);
        curl_multi_add_handle($mh, $ch[$key]);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);
    // Get content, then remove and close the handles.
    foreach ($ch as $key => $val) {
        $results[$key] = curl_multi_getcontent($val);
        curl_multi_remove_handle($mh, $val);
        curl_close($val);
    }
    curl_multi_close($mh);
    return $results;
}
$url_chunks = array_chunk($urllist, $numthreads, true);
foreach ($url_chunks as $chunk) {
    $returns = curlMultiRequest($chunk, array(CURLOPT_RETURNTRANSFER => 1));
    foreach ($returns as $key => $val) {
        $write = file_put_contents("cache/redacted/subpages/".$key.".html", $val);
    }
}
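For the record, the closest I've come to "working the writing in" is to let cURL stream each response straight to disk via CURLOPT_FILE instead of buffering it with CURLOPT_RETURNTRANSFER. Very much an untested sketch; curlMultiDownload() and its $files argument (a target path per key) are made up for illustration:

function curlMultiDownload($urls, $files, $options = array()) {
    $ch = array();
    $fp = array();
    $mh = curl_multi_init();
    foreach ($urls as $key => $val) {
        $ch[$key] = curl_init();
        $fp[$key] = fopen($files[$key], 'w'); // open the target file up front
        if ($options) {
            curl_setopt_array($ch[$key], $options);
        }
        curl_setopt($ch[$key], CURLOPT_URL, $val);
        curl_setopt($ch[$key], CURLOPT_FILE, $fp[$key]); // cURL writes the body here
        curl_multi_add_handle($mh, $ch[$key]);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);
    foreach ($ch as $key => $val) {
        curl_multi_remove_handle($mh, $val);
        curl_close($val);
        fclose($fp[$key]); // flush and release the file handle
    }
    curl_multi_close($mh);
}

Called that way, the per-chunk loop wouldn't need the file_put_contents() pass at all; $files would just map each $key onto "cache/redacted/subpages/".$key.".html".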
Apart from sneaky's advice on multiple concurrent requests, you probably stand to gain the most from compressed transfers. See RFC 2616, section 3.5 (content codings, e.g. gzip and deflate) and section 14.3 (the Accept-Encoding header). I'm guessing neither file_get_contents nor curl sends an Accept-Encoding header by default, but curl at least allows you to set headers. Perhaps it is possible to specify request headers through php.ini for file_get_contents? But I doubt it. Anyway, the server does not have to compress anything, but if it does compress transfers, you will gain a lot versus uncompressed transfers. Compression also works well with persistent connections: as file size goes down, the gain from persistent connections goes up.
I will definitely look into what it takes to request gzipped content (I know we have gzip-enabled content for some MIME types on our sites). Persistent connections will also get a look-see ... hopefully we can make this thing run like a gazelle instead of an overweight hippo ...
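From what I can tell, the easy win on the cURL side is CURLOPT_ENCODING: set it to an empty string and cURL advertises every encoding it supports and decompresses the response transparently, so it should just drop into the existing $options array. A sketch:

$options = array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_ENCODING       => '', // send Accept-Encoding for all encodings cURL supports
                                  // (gzip, deflate, ...) and decode the response automatically
);
$returns = curlMultiRequest($chunk, $options);

(For completeness: file_get_contents can send request headers via a stream context rather than php.ini, but then you'd have to gzdecode() the body yourself, so cURL seems the easier path.)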
Thanks, fellas 🙂