sneakyimp;10974655 wrote:
If I'm using a product_processing table, I can't depend on the auto_increment functionality of the main products_table so I have to work through the logic possibilities to make sure everything ends up with a valid product_id.
Another option would be to insert the products straight into the products table, but only show customers products that already have an image file (WHERE image IS NOT NULL), if you want to avoid the risk of showing products without images. The image grabber would then obviously select only those products that lack images. If retrieval fails several times for a product, you could either insert a row in a process_failed table to avoid processing it over and over again, or both insert a row in process_failed and remove the row from the product table. Or simply keep a count of failed attempts on the product row and stop processing after a certain number of attempts.
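Roughly like this, as a sketch. It assumes a mysqli connection in `$db`, and the column names used later in this post (image, failures); the `name` column and the threshold value are just placeholders:

```php
<?php
// Customer-facing listing: only show products that already have an image.
$result = $db->query(
    "SELECT id, name, image FROM product WHERE image IS NOT NULL"
);

// Image grabber, after a failed retrieval: bump the failure counter so the
// product is eventually skipped instead of being retried forever.
$maxAttempts = 5; // arbitrary threshold, tune to taste
$stmt = $db->prepare("UPDATE product SET failures = failures + 1 WHERE id = ?");
$stmt->bind_param('i', $productId);
$stmt->execute();
```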
sneakyimp;10974655 wrote:
* making sure the daemon process is running when it needs to be -- like if the server reboots. This daemon might need some logic to identify itself as running or offer a status to properly formatted requests. Requests on a socket? In a PID file?
You should create a startup script (e.g. an init script) so the daemon is relaunched when the server reboots. The daemon also makes use of a PID file to ensure there's only one instance running.
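The PID-file check could look something like this. The file path is just a placeholder, and `posix_kill($pid, 0)` sends no actual signal; it only reports whether the process exists:

```php
<?php
// Returns true if the PID file exists and points at a live process.
function is_daemon_running(string $pidFile): bool
{
    if (!file_exists($pidFile)) {
        return false;
    }
    $pid = (int) trim(file_get_contents($pidFile));
    return $pid > 0 && posix_kill($pid, 0);
}

$pidFile = '/var/run/imagegrabber.pid'; // placeholder path
if (is_daemon_running($pidFile)) {
    exit("Already running\n");
}
file_put_contents($pidFile, getmypid() . "\n");
// ... daemon main loop ...
```

A status request mechanism (e.g. listening on a socket) could be added on top, but for "is it running" the PID file alone is usually enough.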
sneakyimp;10974655 wrote:
* configuring the daemon process to not overload the server. does it need to be aware of the system's load average or memory availability? Can I rely on configuration settings to ensure optimal performance without overwhelming the machine?
Available memory should not be a problem for the daemon, though it always can be. It will temporarily hold an SQL connection and an SQL result set resource, and after that the retrieved rows stay in memory until processed. On top of that there's some overhead for the class itself, but all in all the class should use little memory. Sure, if the machine runs out of memory, this will affect the daemon as well, but the daemon should not be the cause of it.
Also, whenever the daemon is not actually working, it should be sleeping and thus take up no processing time. The work performed by the daemon itself is kept to a minimum (keeping track of the actual work that needs to be done). The children do the actual work, thus using up more memory the more concurrent children you run.
The child processes each get one product at a time to work on. They will also need an SQL connection for updates, and they will be working with images, which means both using the file system and holding images in memory. If you get "insane" images when it comes to size, then this may cause problems. And remember, size in this regard is not necessarily the same as file size. A 2000x2000 px red square image might take up a few hundred KB as a JPEG, but would still require roughly 2000 × 2000 × 3 bytes (24 bits per pixel) ≈ 12 MB as an uncompressed image resource.
However, running 10 threads each processing such an image is still no more than 120 MB of RAM, plus whatever else is used. Running 100 threads, on the other hand, is more likely to cause problems.
But if your images really are this big, and assuming you don't need to display full-sized images to customers, perhaps you should start by looking at how to retrieve smaller images in the first place. Should that be the case, you might not even need this complicated approach, since file retrieval would be quicker.
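The back-of-the-envelope estimate above, spelled out (decoded size is width × height × bytes per pixel, regardless of how well the file compresses):

```php
<?php
// Memory needed for one decoded 24-bit image, and for 10 children
// each holding one such image at once.
$width         = 2000;
$height        = 2000;
$bytesPerPixel = 3; // 24 bits per pixel
$perImage = $width * $height * $bytesPerPixel; // 12,000,000 bytes ≈ 11.4 MiB
$children = 10;
$total    = $perImage * $children;             // ≈ 114 MiB for 10 children
echo round($perImage / 1048576, 1) . " MiB per image\n";
```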
sneakyimp;10974655 wrote:
* concurrent threads accessing a single DB table? I know MySQL is good for this sort of thing, but I don't want to go crashing any tables -- what about locking resources?
I don't want multiple threads downloading the same remote image file and at the same time I don't want a dead process to permanently lock a particular image download so that it never gets downloaded when that process fails.
There should be no particular problem when it comes to this. The daemon deals with accessing the DB to retrieve a set of rows to work on and distributing the work to its children, while the children do the work and eventually update the DB. But the children will never be looking at the same product. Moreover, the daemon updates the DB before passing a product on for processing, thus ensuring that it will not retrieve the same row again. Once a child is done with a product, it updates the row either to make it live or, on failure, to make it selectable by the daemon again. No explicit locking should be needed.
I'd go along the lines of
* daemon: wait until a child is free
* daemon: if the daemon has no product rows in memory, query the DB: SELECT * FROM product WHERE image IS NULL AND process_time IS NULL AND failures < [max_failed_attempts]
* daemon: array_shift the first row off the array as $row
* daemon: UPDATE product SET process_time = UNIX_TIMESTAMP() WHERE id = [id for $row]
* daemon: pass $row to child for processing - then sleep
* child: retrieve the image. If you use cURL, you can set timeout thresholds, so the child won't stay occupied more or less indefinitely due to an oversized image combined with a slow connection
* child (on success): UPDATE product SET image = [filename] WHERE id...
* child (on failure):
UPDATE product SET failures = failures + 1, last_failure_time = UNIX_TIMESTAMP(), process_time = NULL WHERE id...
The above should ensure that the daemon never retrieves a product for processing when work is in progress for this product, while still allowing you to disregard products for which image retrieval failed too many times, and doing all of this without explicit locking of rows or tables.
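The daemon side of those steps could be sketched like this. It assumes a mysqli connection in `$db`; `wait_for_free_child()` and `dispatch_to_child()` stand in for whatever your multi-process class provides, so all those names are placeholders:

```php
<?php
$maxFailedAttempts = 5; // placeholder threshold
$rows = [];

while (true) {
    wait_for_free_child(); // blocks until a child slot opens up

    if (empty($rows)) {
        $result = $db->query(
            "SELECT * FROM product
             WHERE image IS NULL AND process_time IS NULL
               AND failures < $maxFailedAttempts"
        );
        $rows = $result->fetch_all(MYSQLI_ASSOC);
        if (empty($rows)) {
            sleep(10); // nothing to do, check again later
            continue;
        }
    }

    $row = array_shift($rows);

    // Mark the row as in progress BEFORE handing it out, so the next
    // SELECT cannot pick it up again.
    $stmt = $db->prepare(
        "UPDATE product SET process_time = UNIX_TIMESTAMP() WHERE id = ?"
    );
    $stmt->bind_param('i', $row['id']);
    $stmt->execute();

    dispatch_to_child($row);
}
```

The child then performs the success/failure UPDATEs listed above.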
You are also keeping your queries to a minimum (almost). You could remove the daemon's UPDATE of process_time and instead have the daemon class keep track, per child, of which product it is currently processing, including the start time if need be. But you'd need to modify the abstract MT class beyond what's otherwise necessary, so the easiest approach is to stick with the DB for this.