sneakyimp;10974655 wrote:
If I'm using a product_processing table, I can't depend on the auto_increment functionality of the main products_table so I have to work through the logic possibilities to make sure everything ends up with a valid product_id.
Another option would be to insert the products straight into the products table, but only show customers products that already have an image file (WHERE image IS NOT NULL), if you want to avoid the risk of showing products without images. The image grabber would then obviously select only those products that lack images. If retrieval fails several times for a product, you could either insert a row in a process_failed table to avoid processing it over and over again, or both insert a row in process_failed and remove the row from the product table. Or simply keep a count of failed attempts on the product row and stop processing after a certain number of attempts.
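Roughly like this, as a sketch. It assumes a mysqli connection in `$db`, and the column names used later in this post (image, failures); the `name` column and the threshold value are just placeholders:

```php
<?php
// Customer-facing listing: only show products that already have an image.
$result = $db->query(
    "SELECT id, name, image FROM product WHERE image IS NOT NULL"
);

// Image grabber, after a failed retrieval: bump the failure counter so the
// product is eventually skipped instead of being retried forever.
$maxAttempts = 5; // arbitrary threshold, tune to taste
$stmt = $db->prepare("UPDATE product SET failures = failures + 1 WHERE id = ?");
$stmt->bind_param('i', $productId);
$stmt->execute();
```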
sneakyimp;10974655 wrote:
* making sure the daemon process is running when it needs to be -- like if the server reboots. This daemon might need some logic to identify itself as running or offer a status to properly formatted requests. Requests on a socket? In a PID file?
You should create a startup script (e.g. an init script) so the daemon is relaunched when the server reboots. The daemon also makes use of a PID file to ensure there's only one instance running.
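The PID-file check could look something like this. The file path is just a placeholder, and `posix_kill($pid, 0)` sends no actual signal; it only reports whether the process exists:

```php
<?php
// Returns true if the PID file exists and points at a live process.
function is_daemon_running(string $pidFile): bool
{
    if (!file_exists($pidFile)) {
        return false;
    }
    $pid = (int) trim(file_get_contents($pidFile));
    return $pid > 0 && posix_kill($pid, 0);
}

$pidFile = '/var/run/imagegrabber.pid'; // placeholder path
if (is_daemon_running($pidFile)) {
    exit("Already running\n");
}
file_put_contents($pidFile, getmypid() . "\n");
// ... daemon main loop ...
```

A status request mechanism (e.g. listening on a socket) could be added on top, but for "is it running" the PID file alone is usually enough.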
sneakyimp;10974655 wrote:
* configuring the daemon process to not overload the server. does it need to be aware of the system's load average or memory availability? Can I rely on configuration settings to ensure optimal performance without overwhelming the machine?
Available memory should not be a problem for the daemon, though it always can be. It will temporarily hold an SQL connection and an SQL result set resource, and after that the retrieved rows stay in memory until processed. On top of that there's some overhead for the class itself, but all in all the class should use little memory. Sure, if the machine runs out of memory, this will affect the daemon as well, but the daemon should not be the cause of it.
Also, whenever the daemon is not actually working, it should be sleeping and thus take up no processing time. The work performed by the daemon itself is kept to a minimum (keeping track of the actual work that needs to be done). The children do the actual work, thus using up more memory the more concurrent children you run.
The child processes each get one product at a time to work on. They will also need an SQL connection for updates, and they will be working with images, which means both using the file system and holding images in memory. If you get "insane" images when it comes to size, then this may cause problems. And remember, size in this regard is not necessarily the same as file size. A 2000x2000 px red square image might take up a few hundred KB as a JPEG, but would still require roughly 2000 × 2000 × 3 bytes (24 bits per pixel) ≈ 12 MB as an uncompressed image resource.
However, running 10 threads each processing such an image is still no more than 120 MB of RAM, plus whatever else is used. Running 100 threads, on the other hand, is more likely to cause problems.
But if your images really are this big, and assuming you don't need to display full-sized images to customers, perhaps you should start by looking at how to retrieve smaller images in the first place. Should that be the case, you might not even need this complicated approach, since file retrieval would be quicker.
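The back-of-the-envelope estimate above, spelled out (decoded size is width × height × bytes per pixel, regardless of how well the file compresses):

```php
<?php
// Memory needed for one decoded 24-bit image, and for 10 children
// each holding one such image at once.
$width         = 2000;
$height        = 2000;
$bytesPerPixel = 3; // 24 bits per pixel
$perImage = $width * $height * $bytesPerPixel; // 12,000,000 bytes ≈ 11.4 MiB
$children = 10;
$total    = $perImage * $children;             // ≈ 114 MiB for 10 children
echo round($perImage / 1048576, 1) . " MiB per image\n";
```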
sneakyimp;10974655 wrote:
* concurrent threads accessing a single DB table? I know MySQL is good for this sort of thing, but I don't want to go crashing any tables -- what about locking resources?
I don't want multiple threads downloading the same remote image file and at the same time I don't want a dead process to permanently lock a particular image download so that it never gets downloaded when that process fails.
There should be no particular problem when it comes to this. The daemon deals with accessing the DB to retrieve a set of rows to work on and distributing the work to its children, while the children do the work and eventually update the DB. But the children will never be looking at the same product. Moreover, the daemon updates the DB before passing a product on for processing, thus ensuring that it will not retrieve the same row again. Once a child is done with a product, it updates the row either to make it live or, on failure, to make it selectable by the daemon again. No explicit locking should be needed.
I'd go along the lines of
* daemon: wait until a child is free
* daemon: if the daemon has no product rows in memory, query the DB: SELECT * FROM product WHERE image IS NULL AND process_time IS NULL AND failures < [max_failed_attempts]
* daemon: array_shift the first row off the array as $row
* daemon: UPDATE product SET process_time = UNIX_TIMESTAMP() WHERE id = [id for $row]
* daemon: pass $row to child for processing - then sleep
* child: retrieve the image. If you use cURL, you can set timeout thresholds, so the child won't stay occupied more or less indefinitely due to an oversized image combined with a slow connection
* child (on success): UPDATE product SET image = [filename] WHERE id...
* child (on failure):
UPDATE product SET failures = failures + 1, last_failure_time = UNIX_TIMESTAMP(), process_time = NULL WHERE id...
The above should ensure that the daemon never retrieves a product for processing when work is in progress for this product, while still allowing you to disregard products for which image retrieval failed too many times, and doing all of this without explicit locking of rows or tables.
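The daemon side of those steps could be sketched like this. It assumes a mysqli connection in `$db`; `wait_for_free_child()` and `dispatch_to_child()` stand in for whatever your multi-process class provides, so all those names are placeholders:

```php
<?php
$maxFailedAttempts = 5; // placeholder threshold
$rows = [];

while (true) {
    wait_for_free_child(); // blocks until a child slot opens up

    if (empty($rows)) {
        $result = $db->query(
            "SELECT * FROM product
             WHERE image IS NULL AND process_time IS NULL
               AND failures < $maxFailedAttempts"
        );
        $rows = $result->fetch_all(MYSQLI_ASSOC);
        if (empty($rows)) {
            sleep(10); // nothing to do, check again later
            continue;
        }
    }

    $row = array_shift($rows);

    // Mark the row as in progress BEFORE handing it out, so the next
    // SELECT cannot pick it up again.
    $stmt = $db->prepare(
        "UPDATE product SET process_time = UNIX_TIMESTAMP() WHERE id = ?"
    );
    $stmt->bind_param('i', $row['id']);
    $stmt->execute();

    dispatch_to_child($row);
}
```

The child then performs the success/failure UPDATEs listed above.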
You are also keeping your queries to a minimum (almost). You could remove the daemon's UPDATE of process_time and instead have the daemon class keep track, per child, of which product it is currently processing, including the start time if need be. But you'd need to modify the abstract MT class beyond what's otherwise necessary, so the easiest approach is to stick with the DB for this.