dalecosp;11050799 wrote:No, just a list of URLs to fetch. Big array of URLs, chunk it by number of threads, give each process its list.
OK, yes, this seems to be one of the fundamental pieces of wisdom when dealing with MT situations: things can be a lot simpler if you give a process all the data it needs to complete its job when you fire it up in the first place. If you have a single 'taskmaster' thread forking off the work to be done, you can completely avoid a situation where a swarm of threads fights over some stack of data to grab its workload.
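For anyone who wants to see it in code, here's a minimal sketch of that chunk-and-fork pattern in PHP (the urls.txt file and the worker count are just assumptions for illustration):

[code]
<?php
// Sketch: split the whole workload up front, then fork one child per chunk.
// Each child gets its complete URL list at launch, so no shared state is needed.

const NUM_WORKERS = 4;                 // assumed worker count

$urls   = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$chunks = array_chunk($urls, (int)ceil(count($urls) / NUM_WORKERS));

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {                  // child: work through its own list, then exit
        foreach ($chunk as $url) {
            file_get_contents($url);   // the actual fetch; error handling omitted
        }
        exit(0);
    }
}

while (pcntl_waitpid(-1, $status) > 0);   // parent: reap all children
[/code]

Because nobody shares a queue, there's nothing to lock.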
dalecosp;11050799 wrote:Generally speaking they're all gonna finish within a few seconds of each other. As the program was taking multiple hours* to complete, a few seconds either way was No Big Deal(tm).
Your usage example is almost identical to one that I had to deal with. In my case, I had to fetch as many as several hundred thousand image URLs. It was not practical to break out the entire batch at the outset, because the likelihood of some kind of failure was too great for such a long-running job. I ended up using a MySQL db (and SQL transactions) and having my child processes fetch their own job queues from the db. I had to introduce some means of locking records, to show they were being worked on, so that separate processes didn't try to work on the same records.
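A stripped-down sketch of that claim-your-own-batch idea (table and column names here are invented, not my actual schema): a single atomic UPDATE marks a batch of rows as claimed, then the worker reads back only its own rows.

[code]
<?php
// Sketch of letting each child claim its own batch of jobs from MySQL.
// Table/column names (image_jobs, claimed_by, status) are made up here.

$pdo = new PDO('mysql:host=localhost;dbname=images', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$workerId = gethostname() . ':' . getmypid();   // unique-enough worker tag

// The single UPDATE is atomic, so two workers can never claim the same rows.
$claim = $pdo->prepare(
    "UPDATE image_jobs
        SET claimed_by = :worker, status = 'working'
      WHERE claimed_by IS NULL AND status = 'pending'
      LIMIT 50"
);
$claim->execute([':worker' => $workerId]);

// Read back exactly the rows this worker just claimed.
$jobs = $pdo->prepare(
    "SELECT id, url FROM image_jobs
      WHERE claimed_by = :worker AND status = 'working'"
);
$jobs->execute([':worker' => $workerId]);

foreach ($jobs->fetchAll(PDO::FETCH_ASSOC) as $job) {
    // fetch $job['url'], then mark the row 'done'
}
[/code]

An explicit transaction with SELECT ... FOR UPDATE works too; the point is that the claim itself has to be atomic.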
dalecosp;11050799 wrote:(*Now running in 25% of the time it formerly took. 🙂 )
YES YES YES. The magic of MT (and distributed processing) is amazing. Not only is my scheme MT, it is also multi-machine. In the bad old days, a daily cron job PHP script would still be running after 24 hours, when it was supposed to start again the next day. Between the actual work to be done and the latency involved in serial HTTP requests, it was still trying to choke down the 10,000 or so images it had found the prior day.

I modified the script to spawn multiple processes to perform the image processing in parallel, which sped things up about 10 or 20 times. I then went further: another script monitors the pending workload and spins up new virtual machines in response to the number of pending jobs. The result is that I can now download 300,000 images (and generate about 2.4M thumbs of varying sizes) in about 8 hours. When there's nothing to be done, the monitor script spins down the unused virtual machines, saving gobs of money each month. I've attached a chart of the images to be processed (pending images) and a chart of the number of virtual machines (imagedaemons). If you consider what the inverse of the imagedaemon graph might look like, you get an idea of how much money this saves each month versus having 10 servers running all the time.
[Attachments: pending-image-chart-screenshot.jpg, image-daemon-charg-screenshot.jpg]
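For anyone curious, the decision loop in the monitor script boils down to something like this (a rough sketch only; both spin functions are stand-ins for whatever cloud API you use, and the thresholds are invented):

[code]
<?php
// Sketch of the monitor's core loop: scale machine count to pending work.
// JOBS_PER_MACHINE, MAX_MACHINES, and both spin functions are invented here.

const JOBS_PER_MACHINE = 5000;
const MAX_MACHINES     = 10;

function spinUpImageDaemon(): void        { /* cloud-provider API call goes here */ }
function spinDownIdleImageDaemons(): void { /* and here */ }

$pdo = new PDO('mysql:host=localhost;dbname=images', 'user', 'pass');

$pending = (int)$pdo->query(
    "SELECT COUNT(*) FROM image_jobs WHERE status = 'pending'"
)->fetchColumn();
$running = (int)$pdo->query(
    "SELECT COUNT(*) FROM machines WHERE state = 'running'"
)->fetchColumn();

$wanted = min(MAX_MACHINES, (int)ceil($pending / JOBS_PER_MACHINE));

if ($wanted > $running) {
    for ($i = $running; $i < $wanted; $i++) {
        spinUpImageDaemon();            // more work pending: add capacity
    }
} elseif ($pending === 0 && $running > 0) {
    spinDownIdleImageDaemons();         // queue drained: stop paying for VMs
}
[/code]

Run that from cron every few minutes and the imagedaemon count tracks the pending-image curve, which is exactly what the two charts show.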
I'm obviously quite excited about the success of this project and am grateful that my client was kind enough to let me spend all the time it required to get it working. I'm hoping to apply this MT magic to another project, which involves a trickier situation: I need to spawn threads in a socket server to unserialize incoming messages from a socket and then queue them up on a list of tasks. More broadly, I feel that being able to implement MT when dealing with multiple shared resources is a monumental skill that I'd like to have. I studied this in college and wish I had remembered the basics. Sadly, I must re-learn it.
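I haven't built it yet, but the shape I have in mind is roughly this (a sketch only; since threading in PHP is awkward, this version substitutes forked processes and a SysV message queue for the thread-plus-shared-list design, and the port and queue key are assumptions):

[code]
<?php
// Sketch: accept connections, unserialize each incoming message, and push it
// onto a shared task queue. A SysV message queue plays the "list of tasks";
// real threads (ext-pthreads/parallel) would be the other route.

$queue  = msg_get_queue(ftok(__FILE__, 'q'));        // shared task queue
$server = stream_socket_server('tcp://127.0.0.1:9000', $errno, $errstr)
    or die("listen failed: $errstr\n");

while ($conn = stream_socket_accept($server, -1)) {
    if (pcntl_fork() === 0) {                        // child owns this connection
        while (($line = fgets($conn)) !== false) {
            $task = unserialize(trim($line));        // risky on untrusted input!
            msg_send($queue, 1, $task);              // enqueue for the workers
        }
        fclose($conn);
        exit(0);
    }
    fclose($conn);                                   // parent keeps accepting
    pcntl_waitpid(-1, $status, WNOHANG);             // reap finished children
}
[/code]

A pool of worker processes would then msg_receive() from the same queue, and that's where the shared-resource discipline I mentioned actually bites.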