The short version:
What's the best way to queue thousands of rows, with the ability to process each of those rows individually as needed (view, update, delete)?

The details:

This is with regard to one part of my web app -> sending messages to an API...

I have the following tables

outbox (user_id, name, address, message, etc.) - the table that keeps a record of every message sent; each row is 1 recipient
broadcast (id, user_id, name, message, address, stat[pending,sending,sent,failed], etc.) - each broadcast row is one mass message that can be related to anywhere from 1 to 20,000 (or more) recipients in the outbox table
(in case I get critiqued... I keep a copy of the user_id, name, and message on purpose)

PROCESS:
1) User creates a broadcast session
2) User selects a contact group to send to
3) User presses SEND

At this point the web app pulls all the individual recipient data from the selected contact group and sends it to the queue... <-- this is the part I'm struggling to optimize.

At the moment, upon sending, I INSERT all the recipient data into the outbox; that way I have individual rows I can monitor, query, search, and even cancel/delete.

This is fine for most use cases, where there are usually fewer than 1,000 recipients, but I've reached the point where I'll need to send to 10,000 recipients or more, and I'm worried about the script/SQL timing out on writes, or wasting auto-incremented IDs in the outbox table for broadcasts that are cancelled (I essentially have to delete unsent rows, leaving huge gaps in the primary IDs).

So I was thinking of holding the huge recipient list somewhere in a queued state, and only writing it to the outbox table when I actually send it to the API (at around 10 rows/second). That way I prevent clutter in my outbox table and I don't need to do a massive 10,000+ row INSERT, which may cause timeouts (then again, I still have to write those 10,000+ rows somewhere... hmmm).

So the idea is to dump all of that into a single source location (e.g. 1 txt file) and load it from there.

Anyway my options are:

OPTION 1: Use a flat file
-> I would simply write all rows to a text file (in serialized array format for convenience) and load that into PHP

OPTION 2: Use an array stored in a single full-text column in the broadcast table
-> I would simply store all recipients in that column, in serialized array format

The only problem I see with these is that every time I need to access some rows, I have to load a huge chunk of data and parse through it (easy enough with an array find by id) just to view or update rows, or to send a few rows to the outbox table.

OPTION 3: Quit making my life so hard and just stick with using the outbox table.

Any thoughts?

    Big arrays are memory hogs, so I would tend to avoid them. My first thought would be to just process the database result set one row at a time, if feasible:

    while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false)
    {
        // do your thing with this row
    }
    

    If for some reason you feel this is not practical and you want to write it to a file first, then the same thing, but first fopen() a write file handle, and write each result row to it (maybe fputcsv()?). Then you can fopen() it for reading, and read one line at a time (fgetcsv()?) and process it, again avoiding saving the whole thing into PHP memory.
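    In case it helps, here's a rough, untested sketch of that write-then-read approach (the file name is just a placeholder, and $stmt is the same statement as above):

    // Write the result set to disk one row at a time, so the whole list never sits in a PHP array.
    $out = fopen('/tmp/broadcast_queue.csv', 'w');
    while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false)
    {
        fputcsv($out, $row);
    }
    fclose($out);

    // Later: read it back one line at a time and process each recipient.
    $in = fopen('/tmp/broadcast_queue.csv', 'r');
    while (($fields = fgetcsv($in)) !== false)
    {
        // send this recipient to the API, update its status, etc.
    }
    fclose($in);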

      Oh, you're saying "queue". I was going "¿Que?"

      Anyway, what you're saying is that your three options are:
      1: Write your own DBMS. (Don't forget to handle the cases where two people send to a large group at the same time, or the previous list hasn't been finished with before the next begins, or...).
      2: Use an existing DBMS but avoid using its functionality.
      3: Continue to use an existing DBMS.

      Tea_J wrote:

      and I'm worried about the script/SQL timing out on writes

      Are you having this problem? (I.e., you're not just making trouble for yourself for the sake of having something to worry about?) You can't increase the script's timeout? You can't run the INSERT asynchronously? If you can't have the database do the query and INSERT INTO the table itself - if you have to generate the INSERTs yourself for some reason - can't you do a bulk copy or at least a prepared statement? Those are just some ideas. Another, depending on your ecosystem, is if you've got some sort of messaging interface already available that can fill the role of your outbox (e.g., SQL Server/MS Message Queue/etc.).
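      For instance, something roughly like this would let MySQL do the copying itself, so PHP never holds the recipient list (just a sketch; the table and column names are guesses based on your description):

      $sql = "INSERT INTO outbox (broadcast_id, user_id, name, address, message)
              SELECT :broadcast_id, user_id, name, address, :message
              FROM contacts
              WHERE group_id = :group_id";
      $ins = $pdo->prepare($sql);
      $ins->execute([
          ':broadcast_id' => $broadcastId,
          ':message'      => $messageText,
          ':group_id'     => $groupId,
      ]);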

        Thank you both for your replies.

        I was actually reading up on this:
        http://techblog.procurios.nl/k/news/view/34972/14863/cache-a-large-array-json-serialize-or-var_export.html

        hence the idea of writing the pending messages to a file: put all the pending messages in an array, serialize it, and write it to disk. I'm sure writing is not a problem, but reading it back, holding it in memory, and looping through it is what worries me.

        Say a user wanted to cancel 1 recipient from a list of 10,000 rows. To get to that specific row (by id) I'd have to traverse or do an array find through the whole array, instead of just querying the database, which is much better at this.

        Hmm. I'm leaning more towards just dumping the whole lot into the outbox table as normal, to keep things simple... hmmm

          I had to do something like this. A website had to download and resize images based on a daily data import. I altered the script to have no timeout, but things were getting out of hand because the script would still run for 24 hours and not finish before it had to run again the next day.

          I thought long and hard about it and thought it would be bad if I just had the import process create an array in memory. What happens if an exception gets thrown because one record has unexpected bad data? In that case, my entire array in memory would be lost when my script halts due to the uncaught error. There were also memory concerns for a script that holds such a large array and runs for 24 hours.

          I decided that a text file wasn't that great either. You could read it one line at a time if you had to, and it was stored on disk, so it addressed some of my concerns about my process halting and memory issues, but it seemed like a pain in the ass to try to both read and write this file a line at a time somehow (if that's even possible without rewriting the whole file's contents), so I gave up on the flat file too.

          I ended up going with a DBMS (as Weedpacket suggests) and created a table specifically for my work queue. If a job was finished, you could run a query to mark a single record 'finished'. If a job needed to be cancelled, you could run a query to remove a single record from the queue table. DBMSes are very very very good at this and are most definitely the tool for the job. If you want to add some extra code to lock records and/or tables, then you can construct your program that performs the work to spawn a bunch of independent processes to do all the work in parallel much, much, much faster. This is pretty complicated though and not for the faint of heart.
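          To make that concrete, the queue-table queries can be as simple as this (a sketch only; the table and column names here are made up):

          // Mark one job finished:
          $pdo->prepare("UPDATE work_queue SET status = 'finished' WHERE id = ?")
              ->execute([$jobId]);

          // Cancel one job that hasn't been picked up yet:
          $pdo->prepare("DELETE FROM work_queue WHERE id = ? AND status = 'pending'")
              ->execute([$jobId]);

          // Claim a batch of pending jobs for one worker, so parallel workers don't collide:
          $pdo->prepare("UPDATE work_queue SET status = 'processing', worker_id = ? WHERE status = 'pending' LIMIT 100")
              ->execute([$workerId]);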

            sneakyimp wrote:

            I ended up going with a DBMS (as Weedpacket suggests) and created a table specifically for my work queue. If a job was finished, you could run a query to mark a single record 'finished'. If a job needed to be cancelled, you could run a query to remove a single record from the queue table. DBMSes are very very very good at this and are most definitely the tool for the job. If you want to add some extra code to lock records and/or tables, then you can construct your program that performs the work to spawn a bunch of independent processes to do all the work in parallel much, much, much faster. This is pretty complicated though and not for the faint of heart.

            I note that there are task queue libraries that would arguably be the most direct tool for the job. They may well use a relational DBMS for the storage backend, but that is not always so; e.g., a quick search brings up a PHP port of resque that uses Redis as the backend.

              laserlight;11045705 wrote:

              I note that there are task queue libraries that would arguably be the most direct tool for the job. They may well use a relational DBMS for the storage backend, but that is not always so; e.g., a quick search brings up a PHP port of resque that uses Redis as the backend.

              That looks pretty useful.

                Thank you guys for your input.

                Well, so I tried storing the queue in a full-text column in the broadcast table.
                --> This was the most convenient because I didn't need to make a separate call to load the queued outbox messages, nor open a file, etc. All it took was 1 SELECT from the broadcast table; I unserialized the pending_messages column and ran through the array to get the statistical info, as well as to make updates if necessary. If there were updates, I wrote them back to that full-text column.
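                Roughly, the round trip looks like this (sketch only; the column names are from my schema above, and the variables are placeholders):

                $stmt = $pdo->prepare("SELECT pending_messages FROM broadcast WHERE id = ?");
                $stmt->execute([$broadcastId]);
                $pending = unserialize($stmt->fetchColumn());   // one big array of recipients

                foreach ($pending as $i => $recipient) {
                    // gather stats, tweak individual entries, etc.
                }

                // any change means serializing and writing the whole array back
                $upd = $pdo->prepare("UPDATE broadcast SET pending_messages = ? WHERE id = ?");
                $upd->execute([serialize($pending), $broadcastId]);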

                I tested this with 10,000 rows and the thing was still really fast, so I guess it doesn't take much processing power to hold and loop through an array of over 10,000 rows.

                This was true for both reading and inserting the 10,000 records in the full-text column.

                Updating a record was a bit more time/effort because I have to read and loop through the whole array to update 1 record, and then write the whole array (serialized) back to the full-text column.

                Of course a DBMS would be wonderful for updating purposes, since I can target a specific row exactly.

                But I was weighing the use cases, and while doing massive INSERTs (helped a lot by multi-row INSERT SQL) is a bit daunting to think about, it is the least frequent activity in the life cycle of a mass-message task. Most of the activity is selecting and updating, so it's pretty clear to me that a DBMS (temp table) is the better way to go.

                I liked the idea of resque, but honestly it feels too daunting to do all that just for this little requirement, plus it adds dependencies and complexity to my program. So my solution for the multi-insert issue is:

                1) Limit inserts to 5,000 rows per batch
                2) Store the rest in the full-text column in the broadcast table and process each batch of 5,000 records with a sleep() delay in between (rough sketch below)
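                Something along these lines, roughly (untested; the variable names are placeholders):

                $batchSize = 5000;
                foreach (array_chunk($recipients, $batchSize) as $batch) {
                    // build a multi-row INSERT: one (?, ?, ?, ?) group per recipient
                    $placeholders = implode(', ', array_fill(0, count($batch), '(?, ?, ?, ?)'));
                    $sql = "INSERT INTO outbox (broadcast_id, name, address, message) VALUES $placeholders";

                    $params = [];
                    foreach ($batch as $r) {
                        $params[] = $broadcastId;
                        $params[] = $r['name'];
                        $params[] = $r['address'];
                        $params[] = $messageText;
                    }

                    $pdo->prepare($sql)->execute($params);
                    sleep(1); // breathing room between batches
                }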

                I think it's a helluva lot simpler and more straightforward than the other options.

                Anyway, I'll go down this route and report my experience (gonna try to hammer my local machine with 30,000 or more rows per batch).

                  Tea_J;11045737 wrote:

                  Well, so I tried storing the queue in a full-text column in the broadcast table.
                  --> This was the most convenient because I didn't need to make a separate call to load the queued outbox messages, nor open a file, etc. All it took was 1 SELECT from the broadcast table; I unserialized the pending_messages column and ran through the array to get the statistical info, as well as to make updates if necessary. If there were updates, I wrote them back to that full-text column.

                  Hm. 10,000 records serialized into one array and crammed into a single field of a single database record sounds fairly suspect to me. I probably would never do that.

                  Tea_J;11045737 wrote:

                  I tested this with 10,000 rows and the thing was still really fast, so I guess it doesn't take much processing power to hold and loop through an array of over 10,000 rows.

                  This was true for both reading and inserting the 10,000 records in the full-text column.

                  10,000 rows, if they are small rows with a few fields, is certainly no big deal for a modern computer. But if each row has dozens or hundreds of fields, and perhaps you move to 100,000 rows, then you may start to have a problem. You should assess whether this needs to grow or not.

                  Tea_J;11045737 wrote:

                  Updating a record was a bit more time/effort because I have to read and loop through the whole array to update 1 record, and then write the whole array (serialized) back to the full-text column.

                  Doing one record (or 10 or 100) at a time is what would allow you to scale up to dozens/hundreds/thousands of simultaneous processes all working on the same problem. This was the issue I had to solve with the image-downloading script I mentioned before. If you just have one giant array in memory and a bunch of processes fighting over it, things tend to break and then you have to start all over again. If your system can easily process 10,000 records at once with high reliability, then you may not need to go through the trouble of doling out the work to multiple processes. You'll have to evaluate that for your case alone.

                  Tea_J;11045737 wrote:

                  Of course a DBMS would be wonderful for updating purposes, since I can target a specific row exactly.

                  Yes, exactly! However, you still have to take care that you don't process any single record twice, and if you have multiple processes running at once, you need to make sure that one process does not overwrite the other. You have to lock records while they are being processed so that different threads of code execution don't step on each other's toes.
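                  With MySQL/InnoDB, one way to sketch that locking (column names assumed from your schema) is to claim a row inside a transaction before working on it:

                  $pdo->beginTransaction();
                  $stmt = $pdo->prepare("SELECT id, address, message FROM outbox WHERE stat = 'pending' LIMIT 1 FOR UPDATE");
                  $stmt->execute();
                  $job = $stmt->fetch(PDO::FETCH_ASSOC);

                  if ($job) {
                      // ... send $job to the API ...
                      $pdo->prepare("UPDATE outbox SET stat = 'sent' WHERE id = ?")
                          ->execute([$job['id']]);
                  }
                  $pdo->commit();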

                  Tea_J;11045737 wrote:

                  But I was weighing the use cases, and while doing massive INSERTs (helped a lot by multi-row INSERT SQL) is a bit daunting to think about, it is the least frequent activity in the life cycle of a mass-message task. Most of the activity is selecting and updating, so it's pretty clear to me that a DBMS (temp table) is the better way to go.

                  If you are intimidated by the prospect of building all the code, consider using laserlight's suggestion. It might be waaaaay easier. Or not.

                  Tea_J;11045737 wrote:

                  I liked the idea of resque, but honestly it feels too daunting to do all that just for this little requirement, plus it adds dependencies and complexity to my program. So my solution for the multi-insert issue is:

                  1) Limit inserts to 5,000 rows per batch
                  2) Store the rest in the full-text column in the broadcast table and process each batch of 5,000 records with a sleep() delay in between

                  I think it's a helluva lot simpler and more straightforward than the other options.

                  Anyway, I'll go down this route and report my experience (gonna try to hammer my local machine with 30,000 or more rows per batch).

                  Yes, you'll need to weigh the cost and effort of code development against the needs of your application. In my case, the work simply had to be done and was worth the cost (it now handles about 100,000-200,000 images per hour, creating up to 1.4M resized images in a CDN). When there are no images to be processed, the system terminates all the running cloud servers, saving lots of $$. It took a LOT of time to get it working, though.
