sneakyimp;10975164 wrote:
Johanafm, you are a saint. I truly appreciate your dragging me kicking and screaming into the world of threads and (transactional SQL?).
TableProducts is part of a large, live website. I'm not sure if I am allowed to change its DB engine from MyISAM to InnoDB because they may or may not be using full-text search features.
My pleasure. I have several reasons for suggesting that you switch away from MyISAM. The first is that it is, in fact, not transactional. Should you ever need to perform more than one query without race conditions, you cannot START TRANSACTION, query 1, ..., query N, COMMIT. You'd need to lock the table, do your work, then unlock it, and that will not be possible for any table under a decent workload.
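As a sketch of the difference (table and column names are made up for illustration):

```sql
-- MyISAM: the only way to make two statements atomic is a full table lock,
-- which blocks every other reader and writer for the duration.
LOCK TABLES accounts WRITE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
UNLOCK TABLES;

-- InnoDB: a transaction with row-level locks; only the touched rows block.
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```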
It also has no support for foreign key constraints, which I find scary.
Last but not least, I recently spoke to a friend who is a sys admin for a large ISP, and he told me he's had issues with MyISAM on a system that had heaps of inserts/updates/deletes (and selects as well). The tables often became corrupt, and backing them up required full table locks, which he had to work around. It is of course possible that things have changed since then (I believe this was about 1-2 years ago).
At work we actually do use MyISAM for our news site (online newspaper) without such problems, but that system has a low insert/update load, and very few deletes. I don't know how many queries are performed, but the number of page requests is about 100k per day.
He made a switch to InnoDB and Sphinx, which is for full text indexing outside of the database, and from what I've heard also faster than MySQL's built in full text indexing.
So it may still be worthwhile to check whether full text indexing is indeed used. If not, you could easily switch to InnoDB. If they do need full text indexing, you could go for PostgreSQL (which my friend always recommends over MySQL).
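A quick way to check is to look for FULLTEXT indexes (assuming MySQL 5.x; substitute your actual table name):

```sql
-- Lists every FULLTEXT index in the current database.
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND INDEX_TYPE = 'FULLTEXT';

-- Or inspect a single table directly:
SHOW INDEX FROM TableProducts;
```

If the first query returns no rows, nothing in the database depends on MyISAM's full text search.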
sneakyimp;10975164 wrote:
A secondary reason (which may be totally wrong) is the potential need for locking the image table because I want the image download script to run in massively multi-threaded fashion. If I get a massively parallel image fetching operation working, this table would be slammed with requests for image records in need of fetching and updates to images that have been fetched. More on that in a bit.
You will not be able to use table locks, since your users will spend too much time waiting (no selects will go through while tables are locked). Also, your concurrently running update threads will spend a lot of time waiting for the lock to be released, which means you (most likely) will not gain the performance increase that you need.
But, assuming that TableProductFetchImage (was that the correct name for the table of images to be fetched?) is InnoDB, you might be fine. This is where you will need your locks. I am also assuming that joining a MyISAM table for update (TableProductImage) with an InnoDB table (TableProductFetchImage) works when the InnoDB table has row locks. But this is a question I'd post on the MySQL forum to get a definitive answer.
By the way, using mixed letter casing for database identifiers is not recommended (for MySQL anyway), since case sensitivity is platform dependent. That means you may have working code on, for example, Windows (not case sensitive) while some queries fail to execute on, for example, FreeBSD (case sensitive).
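For instance (hypothetical rename, just to illustrate):

```sql
-- Works on Windows, fails on a case-sensitive file system
-- if the table was created as TableProduct:
SELECT * FROM tableproduct;

-- Sticking to lowercase (e.g. snake_case) sidesteps the problem:
RENAME TABLE TableProduct TO table_product;
SELECT * FROM table_product;
```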
My line of reasoning goes like this (hoping I'm not making assumptions that are too far off):
- 3 tables like you've said: TableProduct, TableProductImage (MyISAM) and TableProductFetchImage (InnoDB). I'll just call them Product, Image, FetchImage
- readFiles.php parses client data and decides on updates. Things to be updated go into TableProductFetchImage, at least as far as the image processing is concerned
- daemon retrieves data from TableProductFetchImage, passes one data row at a time to a separate thread (I'll simply call it Child)
- Child updates TableProduct
A. readFiles
1. parses a product.
2. to see if anything needs to be done, it may need one or several rows from FetchImage which means
3. START TRANSACTION
4. SELECT stuff FROM FetchImage FOR UPDATE (reading and writing is delayed until COMMIT or ROLLBACK for these rows)
5. do necessary checks
6. If necessary, UPDATE one or several rows in FetchImage
7. COMMIT
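Steps 3-7 as a concrete sketch (column names and WHERE conditions are placeholders for your actual schema):

```sql
START TRANSACTION;

-- Row locks are taken here; concurrent readers/writers of these rows wait.
SELECT image, update_time
FROM FetchImage
WHERE product_id = 42
FOR UPDATE;

-- ... application-side checks on the selected rows happen here ...

UPDATE FetchImage
SET image = '/new/source/url',
    update_time = UNIX_TIMESTAMP()
WHERE product_id = 42;

COMMIT;  -- locks released
```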
B. daemon
1. SELECT id FROM FetchImage WHERE ... LIMIT 0, 1000
2. pass one id at a time to Child
C. child
1. starts with an Image id
2. START TRANSACTION
3. SELECT stuff from FetchImage ... FOR UPDATE
$fetchImage is assigned value of FetchImage.image
This row may no longer exist due to work done by readFiles.php after the daemon selected its rows from the db. In that case, COMMIT here and exit the thread.
4. fetch and scale image
5. UPDATE Image AS i INNER JOIN FetchImage AS fi ON ... SET i.file = '/some/file', fi.processing_done = UNIX_TIMESTAMP() WHERE ... AND fi.image = $fetchImage
In case readFiles has updated this row (but not deleted it), I'm assuming that the image file is either still the same and should still be fetched, or it has changed and this update should no longer be performed. In this case fi.image is no longer equal to $fetchImage and there is no row to update.
6. COMMIT
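Putting C together as one sketch (the join condition, id and column names are guesses at your schema):

```sql
START TRANSACTION;

-- Lock this FetchImage row and remember its image source;
-- if no row comes back, COMMIT and exit the thread.
SELECT image INTO @fetch_image
FROM FetchImage
WHERE id = 42
FOR UPDATE;

-- ... fetch and scale the image outside SQL; the row lock is held
--     for the duration, which is what serializes A and C ...

UPDATE Image AS i
INNER JOIN FetchImage AS fi ON fi.image_id = i.id
SET i.file = '/some/file',
    fi.processing_done = UNIX_TIMESTAMP()
WHERE fi.id = 42
  AND fi.image = @fetch_image;  -- no-op if readFiles changed it meanwhile

COMMIT;
```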
If readFiles (A) tries to fetch data from FetchImage while a child (C) is working on it, A will wait until C has finished updating its row. readFiles will now work on current data.
If a Child (C) wants to fetch a row from FetchImage while readFiles (A) is working on it, it will wait until A has finished updating, and will either end up with no work (which cost you one SELECT query) or with work being done on current data.
If the daemon (B) tries to SELECT rows while work is being performed upon them by one or more Children (C), then it will also have to wait until updates have been written before it gets new data. With a proper WHERE clause, the daemon should not retrieve such a row until readFiles has once again updated it.
If you need to run several daemons (B) on different machines, you could change B to accommodate this. Give each machine an identifier, say MACHINE_IDENTIFIER is either 1, 2 or 3.
B. daemon
1. UPDATE FetchImage SET machine_identifier=MACHINE_IDENTIFIER WHERE machine_identifier IS NULL LIMIT 1000
2. SELECT id FROM FetchImage WHERE ... AND machine_identifier=MACHINE_IDENTIFIER LIMIT 0,1000
3. Pass one id at a time to Child
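The claim step for, say, machine 2 would look like this (UPDATE ... LIMIT is MySQL-specific; the WHERE conditions are placeholders):

```sql
-- Claim up to 1000 unclaimed rows for this machine.
UPDATE FetchImage
SET machine_identifier = 2
WHERE machine_identifier IS NULL
  AND processing_done IS NULL
LIMIT 1000;

-- Read back exactly the rows this machine claimed.
SELECT id
FROM FetchImage
WHERE machine_identifier = 2
  AND processing_done IS NULL
LIMIT 0, 1000;

-- Each Child then releases its row when done:
-- UPDATE ... SET machine_identifier = NULL,
--               processing_done = UNIX_TIMESTAMP() WHERE ...
```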
C. child
5. UPDATE Image AS i INNER JOIN FetchImage AS fi ON ... SET i.file = '/some/file', fi.processing_done = UNIX_TIMESTAMP(), fi.machine_identifier = NULL WHERE ...
This way, each machine's daemon will "claim" a set of rows, and its children will release each row after image processing is done by resetting machine_identifier to NULL. Since it also updates other information at the same time, this row should not be included in a subsequent query until after it has been updated anew by readFiles.
Another way to split workload between machines would of course be to have one readFiles.php per machine if it is possible to split this workload in some sensible fashion. Updates going into FetchImage could be made machine specific from the start, in the same way as before, or even passed on directly to a Child for processing.
It should be possible to slack on row locking by either not making readFiles select for update or making Child not select for update.
If readFiles.php does not SELECT ... FOR UPDATE, but instead, when updating, will SET update_time = UNIX_TIMESTAMP(), you could still prevent Child from updating the row if it has changed since Child started working on the image. In that case, one run of image fetching and processing is wasted.
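That update_time guard is essentially an optimistic check; a sketch (the update_time column and ids are assumptions about your schema):

```sql
-- Child remembered update_time = @seen when it read the row
-- (plain SELECT, no FOR UPDATE, so no lock was held during the fetch).
UPDATE Image AS i
INNER JOIN FetchImage AS fi ON fi.image_id = i.id
SET i.file = '/some/file',
    fi.processing_done = UNIX_TIMESTAMP()
WHERE fi.id = 42
  AND fi.update_time = @seen;  -- matches zero rows if readFiles touched it

-- If ROW_COUNT() = 0, the fetched image is simply discarded;
-- the row will be picked up again on a later daemon pass.
```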
If Child does not SELECT ... FOR UPDATE, then it is possible that an image was updated after readFiles.php selected this image data to check whether it should be updated. It is entirely possible that the check would have said no after this update but instead said yes, and that you will waste one run of image fetching and processing, since you will perform one extra run later.
Likewise, it is possible that both the check done on old data and the check on new data would have said yes, and in this case you do two runs where only the latter would have been needed.
I have no idea whether either of these "slack lock" versions would run faster (due to fewer waits for locks) or slower (due to more unnecessary work being done). I'd guess it depends on how time consuming the different parts are, how many rows readFiles locks at a time, and probably things I haven't foreseen as well.
I'd probably start with transactions in both places, and then, if needed or out of interest, check how the slack lock versions perform in comparison.