sneakyimp;10975164 wrote:
Johanafm, you are a saint. I truly appreciate your dragging me kicking and screaming into the world of threads and (transactional SQL?).
TableProducts is part of a large, live website. I'm not sure if I am allowed to change its DB engine from MyISAM to InnoDB because they may or may not be using full-text search features.
My pleasure. I have several reasons for suggesting that you switch away from MyISAM. The first is that it is, in fact, not transactional. Should you ever need to perform more than one query without race conditions, you cannot START TRANSACTION, query 1, ..., query N, COMMIT. You'd need to lock the table, do your work, then unlock it, and that will not be possible for any table under a decent workload.
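As a sketch of the difference (table and column names are made up for illustration):

```sql
-- MyISAM: the only way to make two statements atomic is a full table lock,
-- which blocks every other reader and writer for the duration.
LOCK TABLES accounts WRITE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
UNLOCK TABLES;

-- InnoDB: a transaction with row-level locks; only the touched rows block.
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```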
It also has no support for foreign key constraints, which I find scary.
Last but not least, I recently spoke to a friend who is a sys admin for a large ISP, and he told me he's had issues with MyISAM on a system that had heaps of inserts/updates/deletes (and selects as well). The tables often became corrupt, and backing them up required full table locks, which he had to work around. It is of course possible that things have changed since then (I believe this was about 1-2 years ago).
At work we actually do use MyISAM for our news site (online newspaper) without such problems, but that system has a low insert/update load, and very few deletes. I don't know how many queries are performed, but the number of page requests is about 100k per day.
He made a switch to InnoDB and Sphinx, which is for full text indexing outside of the database, and from what I've heard also faster than MySQL's built in full text indexing.
So it may still be worthwhile to check whether full text indexing is indeed used. If not, you could easily switch to InnoDB. If they do need full text indexing, you could go for PostgreSQL (which my friend always recommends over MySQL).
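A quick way to check is to look for FULLTEXT indexes (assuming MySQL 5.x; substitute your actual table name):

```sql
-- Lists every FULLTEXT index in the current database.
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND INDEX_TYPE = 'FULLTEXT';

-- Or inspect a single table directly:
SHOW INDEX FROM TableProducts;
```

If the first query returns no rows, nothing in the database depends on MyISAM's full text search.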
sneakyimp;10975164 wrote:
A secondary reason (which may be totally wrong) is the potential need for locking the image table because I want the image download script to run in massively multi-threaded fashion. If I get a massively parallel image fetching operation working, this table would be slammed with requests for image records in need of fetching and updates to images that have been fetched. More on that in a bit.
You will not be able to use table locks, since your users will spend too much time waiting (no selects will go through while tables are locked). Also, your concurrently running update threads will spend a lot of time waiting for the lock to be released, which means you (most likely) will not gain the performance increase that you need.
But, assuming that TableProductFetchImage (was that the correct name for the table of images to be fetched?) is InnoDB, you might be fine. This is where you will need your locks. I am also assuming that joining a MyISAM table for update (TableProductImage) with an InnoDB table (TableProductFetchImage) works when the InnoDB table has row locks. But this is a question I'd post on the MySQL forum to get a definitive answer.
By the way, using mixed letter casing for database identifiers is not recommended (for MySQL anyway), since case sensitivity is platform dependent. That means you may have working code on, for example, Windows (not case sensitive) while some queries fail to execute on, for example, FreeBSD (case sensitive).
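For instance (hypothetical rename, just to illustrate):

```sql
-- Works on Windows, fails on a case-sensitive file system
-- if the table was created as TableProduct:
SELECT * FROM tableproduct;

-- Sticking to lowercase (e.g. snake_case) sidesteps the problem:
RENAME TABLE TableProduct TO table_product;
SELECT * FROM table_product;
```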
My line of reasoning goes like this (hoping I'm not making assumptions that are too far off):
- 3 tables like you've said: TableProduct, TableProductImage (MyISAM) and TableProductFetchImage (InnoDB). I'll just call them Product, Image, FetchImage
- readFiles.php parses client data and decides on updates. Things to be updated go into TableProductFetchImage, at least as far as the image processing is concerned
- daemon retrieves data from TableProductFetchImage, passes one data row at a time to a separate thread (I'll simply call it Child)
- Child updates TableProduct
A. readFiles
1. parses a product.
2. to see if anything needs to be done, it may need one or several rows from FetchImage which means
3. START TRANSACTION
4. SELECT stuff FROM FetchImage FOR UPDATE (reading and writing is delayed until COMMIT or ROLLBACK for these rows)
5. do necessary checks
6. If necessary, UPDATE one or several rows in FetchImage
7. COMMIT
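Steps 3-7 as a concrete sketch (column names and WHERE conditions are placeholders for your actual schema):

```sql
START TRANSACTION;

-- Row locks are taken here; concurrent readers/writers of these rows wait.
SELECT image, update_time
FROM FetchImage
WHERE product_id = 42
FOR UPDATE;

-- ... application-side checks on the selected rows happen here ...

UPDATE FetchImage
SET image = '/new/source/url',
    update_time = UNIX_TIMESTAMP()
WHERE product_id = 42;

COMMIT;  -- locks released
```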
B. daemon
1. SELECT id FROM FetchImage WHERE ... LIMIT 0, 1000
2. pass one id at a time to Child
C. child
1. starts with an Image id
2. START TRANSACTION
3. SELECT stuff from FetchImage ... FOR UPDATE
$fetchImage is assigned value of FetchImage.image
This row may no longer exist due to work done by readFiles.php after the daemon selected its rows from the db. In that case, COMMIT here and exit the thread.
4. fetch and scale image
5. UPDATE Image AS i INNER JOIN FetchImage AS fi ON ... SET i.file = '/some/file', fi.processing_done = UNIX_TIMESTAMP() WHERE ... AND fi.image = $fetchImage
In case readFiles has updated this row (but not deleted it), I'm assuming that the image file is either still the same and should still be fetched, or it has changed and this update should no longer be performed. In this case fi.image is no longer equal to $fetchImage and there is no row to update.
6. COMMIT
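Putting C together as one sketch (the join condition, id and column names are guesses at your schema):

```sql
START TRANSACTION;

-- Lock this FetchImage row and remember its image source;
-- if no row comes back, COMMIT and exit the thread.
SELECT image INTO @fetch_image
FROM FetchImage
WHERE id = 42
FOR UPDATE;

-- ... fetch and scale the image outside SQL; the row lock is held
--     for the duration, which is what serializes A and C ...

UPDATE Image AS i
INNER JOIN FetchImage AS fi ON fi.image_id = i.id
SET i.file = '/some/file',
    fi.processing_done = UNIX_TIMESTAMP()
WHERE fi.id = 42
  AND fi.image = @fetch_image;  -- no-op if readFiles changed it meanwhile

COMMIT;
```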
If readFiles (A) tries to fetch data from FetchImage while a child (C) is working on it, A will wait until C has finished updating its row. readFiles will now work on current data.
If a Child (C) wants to fetch a row from FetchImage while readFiles (A) is working on it, it will wait until A has finished updating, and will either end up with no work (which cost you one SELECT query) or with work being done on current data.
If the daemon (B) tries to SELECT rows while work is being performed upon them by one or more Children (C), then it will also have to wait until updates have been written before it gets new data. With a proper WHERE clause, the daemon should not retrieve such a row until readFiles has once again updated it.
If you need to run several daemons (B) on different machines, you could change B to accommodate this. Give each machine an identifier, say MACHINE_IDENTIFIER is either 1, 2 or 3.
B. daemon
1. UPDATE FetchImage SET machine_identifier=MACHINE_IDENTIFIER WHERE machine_identifier IS NULL LIMIT 1000
2. SELECT id FROM FetchImage WHERE ... AND machine_identifier=MACHINE_IDENTIFIER LIMIT 0,1000
3. Pass one id at a time to Child
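The claim step for, say, machine 2 would look like this (UPDATE ... LIMIT is MySQL-specific; the WHERE conditions are placeholders):

```sql
-- Claim up to 1000 unclaimed rows for this machine.
UPDATE FetchImage
SET machine_identifier = 2
WHERE machine_identifier IS NULL
  AND processing_done IS NULL
LIMIT 1000;

-- Read back exactly the rows this machine claimed.
SELECT id
FROM FetchImage
WHERE machine_identifier = 2
  AND processing_done IS NULL
LIMIT 0, 1000;

-- Each Child then releases its row when done:
-- UPDATE ... SET machine_identifier = NULL,
--               processing_done = UNIX_TIMESTAMP() WHERE ...
```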
C. child
5. UPDATE Image AS i INNER JOIN FetchImage AS fi ON ... SET i.file = '/some/file', fi.processing_done = UNIX_TIMESTAMP(), fi.machine_identifier = NULL WHERE ...
This way, each machine's daemon will "claim" a set of rows, and its children will release each row after image processing is done by resetting machine_identifier to NULL. Since it also updates other information at the same time, this row should not be included in a subsequent query until after it has been updated anew by readFiles.
Another way to split workload between machines would of course be to have one readFiles.php per machine if it is possible to split this workload in some sensible fashion. Updates going into FetchImage could be made machine specific from the start, in the same way as before, or even passed on directly to a Child for processing.
It should be possible to slack on row locking by either not making readFiles select for update or making Child not select for update.
If readFiles.php does not SELECT ... FOR UPDATE, but instead, when updating, will SET update_time = UNIX_TIMESTAMP(), you could still prevent Child from updating the row if it has changed since Child started working on the image. In that case, one run of image fetching and processing is wasted.
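That update_time guard is essentially an optimistic check; a sketch (the update_time column and ids are assumptions about your schema):

```sql
-- Child remembered update_time = @seen when it read the row
-- (plain SELECT, no FOR UPDATE, so no lock was held during the fetch).
UPDATE Image AS i
INNER JOIN FetchImage AS fi ON fi.image_id = i.id
SET i.file = '/some/file',
    fi.processing_done = UNIX_TIMESTAMP()
WHERE fi.id = 42
  AND fi.update_time = @seen;  -- matches zero rows if readFiles touched it

-- If ROW_COUNT() = 0, the fetched image is simply discarded;
-- the row will be picked up again on a later daemon pass.
```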
If Child does not SELECT ... FOR UPDATE, then it is possible that an image was updated after readFiles.php selected this image data to check whether it should be updated. It is entirely possible that the check would have said no after this update but instead said yes, and that you will waste one run of image fetching and processing, since you will perform one extra run later.
Likewise, it is possible that both the check done on old data and the check on new data would have said yes, and in this case you do two runs where only the latter would have been needed.
I have no idea whether either of these "slack lock" versions would run faster (due to fewer waits for locks) or slower (due to more unnecessary work being done). I'd guess it depends on how time consuming the different parts are, how many rows readFiles locks at a time, and probably things I haven't foreseen as well.
I'd probably start with transactions in both places, and then, if needed or out of interest, check how the slack lock versions perform in comparison.