web database syndicates with central database

blackhorse

The central database is on MySQL too.

The web database will read from the central database and then populate its own tables with the data. (The schemas of the central and web databases are different.)

A cron job will check the central database every hour and then syndicate the data to the web database.

My approach

1) Export the central database data (the part relevant to the web database) into an XML file.
2) Cron will empty the web database first, then read the XML and repopulate the web database.
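
As a rough illustration of the two steps above, here is a minimal sketch of the "empty, then repopulate from XML" job. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere; the XML layout, the `web_event` table, and the `refresh_from_xml` function are all made-up names for illustration, not anything from the real databases.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical export format; the real element and attribute names would
# come from whatever the central database export produces.
SAMPLE_XML = """
<events>
  <event id="1"><title>Opening</title></event>
  <event id="2"><title>Keynote</title></event>
</events>
"""

def refresh_from_xml(conn, xml_text):
    """Empty the web table and reload it from the XML export in one transaction."""
    rows = [
        (int(e.get("id")), e.findtext("title"))
        for e in ET.fromstring(xml_text).iter("event")
    ]
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM web_event")
        conn.executemany("INSERT INTO web_event (id, title) VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_event (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO web_event VALUES (99, 'stale row')")
conn.commit()

loaded = refresh_from_xml(conn, SAMPLE_XML)
```

Parsing the whole file before touching the table means a malformed XML file fails early and leaves the existing data alone.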

I think that is what Google Base does too.

The other approaches seem stupid.

1) Instead of exporting the central database to XML and reading that XML to populate the web database, I could read the data directly from the central database and insert it into the web database right away, without XML as the middle step. But I think using XML as the middle step is the right approach: it separates the central database's process from the web database's.

2) Instead of first emptying the web database and then reading the XML to repopulate it, I could try to detect which records are new in the central database, which have been deleted, and which have been updated, and then add/delete/update those records in the web database step by step. But that seems stupid.
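
For comparison, the record-by-record detection described in 2) can be sketched in a few lines when both sides fit in memory. This is only an illustration: `diff_snapshots` and the sample rows are invented, and it assumes every record has a stable primary key shared by both databases.

```python
def diff_snapshots(old, new):
    """Compare two {primary_key: row} snapshots and classify the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = sorted(old.keys() - new.keys())
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, updated, deleted

# Hypothetical data: what the web database holds now vs. the central export.
web = {1: ("Opening", "09:00"), 2: ("Keynote", "10:00"), 3: ("Lunch", "12:00")}
central = {1: ("Opening", "09:00"), 2: ("Keynote", "10:30"), 4: ("Panel", "14:00")}

inserted, updated, deleted = diff_snapshots(web, central)
```

The three results map directly onto INSERT, UPDATE, and DELETE statements against the web database.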

Any advice on how you would do it?

Thanks!

Bjom

...seems stupid.... That proposition doesn't seem to hold a lot of information.

Step 2) doesn't "seem stupid" to me at all: if you use the central database's logs to determine what changed and simply apply incremental updates to the web db accordingly, you will have to do less data shuffling than with emptying and rebuilding.

However I am not an expert in this field ... hope some of our DB wizards can shed more light on it.

Bjom

Weedpacket

Assuming that the slave servers could be anything, here's a suggestion. Offer the data in two forms: A complete dump (in the master's schema because the slaves could have anything). That would allow new slaves to be brought up without additional messing about, and also allow really outdated servers to catch up quickly. The other would be to log all changes within the past n hours or whatever (depending on how often the data changes, how big the changes are, how big the resulting diffs would be compared to the complete dump, etc.), and allow slaves to request all updates since a certain time or within a certain interval. Then the slaves update as and when needed.

The log table could be generated by attaching appropriate triggers to the relevant tables of the master db.
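
A minimal sketch of such a trigger-fed log table, using SQLite through Python so it can be run as-is. The `event` and `change_log` tables and the trigger names are hypothetical, and the trigger syntax on the real master (MySQL here) would differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT);

-- One row per change, in the order the changes happened.
CREATE TABLE change_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id   INTEGER NOT NULL,
    action     TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER event_ai AFTER INSERT ON event BEGIN
    INSERT INTO change_log (event_id, action) VALUES (NEW.id, 'insert');
END;
CREATE TRIGGER event_au AFTER UPDATE ON event BEGIN
    INSERT INTO change_log (event_id, action) VALUES (NEW.id, 'update');
END;
CREATE TRIGGER event_ad AFTER DELETE ON event BEGIN
    INSERT INTO change_log (event_id, action) VALUES (OLD.id, 'delete');
END;
""")

conn.execute("INSERT INTO event VALUES (1, 'Opening')")
conn.execute("UPDATE event SET title = 'Opening session' WHERE id = 1")
conn.execute("DELETE FROM event WHERE id = 1")

log = conn.execute("SELECT event_id, action FROM change_log ORDER BY seq").fetchall()
```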

Alternative: instead of logging every single change and when it occurred, simply log which records have changed and when they were most recently changed. The diff document would consist of the changed records (and identifiers of the records that have been deleted). To prevent diffs getting as big as the complete dumps they're supposed to be replacing, slaves would have to state in their request when (if ever) they last made an update (it would be better to have the slave keep track of this because the master probably won't know if the slave is being restored from a backup), and the master would have to keep an eye on when it last flushed the diff table (basically, the timestamp of the oldest change). Any slaves that haven't updated since the most recent flush would get a complete dump; slaves that have get the appropriate subset of the changed records.
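
The master-side decision in that alternative might look roughly like the sketch below. Everything here is illustrative: `build_response`, the shape of the `changed` mapping, and the sample timestamps are assumptions made for the example.

```python
from datetime import datetime

def build_response(last_sync, oldest_logged_change, changed, all_records):
    """Return ("full", records) or ("diff", records) for one slave request.

    changed maps record id -> (changed_at, row); row is None for a deletion.
    """
    # Never synced, or last synced before the diff table was flushed:
    # the diff table can no longer describe everything this slave missed.
    if last_sync is None or last_sync < oldest_logged_change:
        return "full", dict(all_records)
    recent = {k: row for k, (ts, row) in changed.items() if ts > last_sync}
    return "diff", recent

all_records = {1: "Opening", 2: "Keynote v2"}
changed = {
    2: (datetime(2024, 1, 1, 12, 0), "Keynote v2"),
    3: (datetime(2024, 1, 1, 13, 0), None),  # record 3 was deleted
}
oldest = datetime(2024, 1, 1, 8, 0)

new_slave = build_response(None, oldest, changed, all_records)
stale_slave = build_response(datetime(2024, 1, 1, 7, 0), oldest, changed, all_records)
current_slave = build_response(datetime(2024, 1, 1, 12, 30), oldest, changed, all_records)
```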

All this of course assumes that the slaves could be anything. If they're the same DBMS as the master then obviously native replication would be more appropriate.

blackhorse

Thanks for the responses from the database experts.

Sorry, I didn't give enough information. The central database is my client's database, and I only have read access this time. I don't have access to their database log, and I am not sure whether I can run a database dump to copy the data to my web database. (I can ask, but I am not sure they will allow it.) Plus, in the future, other clients may not even give me read access; they may just give me an XML file.

Our web database schema will be different from these clients' database schemas, so we will write our own code to read their data and populate our database with it.

It is just like Google Base: we give Google an XML file, and Google uses the new XML file to completely replace our old data in Google Base.

So I am thinking like this: if the client gives me read access, as this time, I will use it to generate the XML file, and by reading the XML I will get the most current data. If a client doesn't give me read access but gives me the XML file instead, that will be even better. In the future, it would be better still to give clients my XML DTD and ask them to supply XML that conforms to it, if they can.

So, for the given situation, is using XML as the data transfer format the right approach?

If in the first step I use XML for the data transfer, then in the second step I empty my web database and repopulate it from the XML file after every scheduled read. Will that be right too? The web database is small and focuses on one event.

I think that is how Google Base works.

Weedpacket

Oh, so you're doing the other end.

So you're just getting dumps of their tables. You don't know what's changed since the last time you looked, and time you spend trying to figure that out could have been spent just populating a new table.

A new table, note. Perhaps temporary tables; then truncate the existing tables and copy the contents of the temp tables into the just-truncated ones. Assuming you can't just switch the tables directly. Updating the existing tables depends on comparing them with the temp tables and determining what's an update, what's an insert, and what's a deletion.
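
A minimal sketch of the temp-table route, again with SQLite via Python standing in for the real DBMS. The `event` and `event_stage` tables and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO event VALUES (1, 'old title')")
conn.commit()

# Stage the incoming rows first; the live table is untouched while this runs.
conn.execute("CREATE TEMP TABLE event_stage (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO event_stage VALUES (?, ?)",
    [(1, "Opening"), (2, "Keynote")],
)

# Empty the live table and copy the staged rows across in a single transaction.
with conn:
    conn.execute("DELETE FROM event")
    conn.execute("INSERT INTO event SELECT id, title FROM event_stage")

conn.execute("DROP TABLE event_stage")
live = conn.execute("SELECT id, title FROM event ORDER BY id").fetchall()
```

The slow part (parsing and loading) happens against the staging table, so the live table is only being rewritten for the duration of the final copy.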

The difference in schemas isn't really major. Set up some views that mimic the source schema and add INSERT triggers to them that translate insertions into the views into whatever the corresponding commands would be for your schema. Then as the XML is read and rows inserted into these views the triggers fire and perform the schema conversion (the user that does this needs insert permissions on these views, obviously, but doesn't need any other permissions on anything else).
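
A sketch of the view-plus-trigger idea, run against SQLite, which supports INSTEAD OF triggers on views. Both schemas are invented for the example. Note that MySQL does not allow triggers on views, so on MySQL the same conversion would have to live somewhere else, for example in the import script or in a trigger on a staging table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Our own schema (hypothetical): a single combined start time.
CREATE TABLE web_event (id INTEGER PRIMARY KEY, name TEXT, starts_at TEXT);

-- A view shaped like the source schema (also hypothetical).
CREATE VIEW src_event AS
    SELECT id AS event_id,
           name AS title,
           substr(starts_at, 1, 10) AS event_date,
           substr(starts_at, 12) AS event_time
    FROM web_event;

-- Inserts into the view are translated into inserts into our table.
CREATE TRIGGER src_event_insert INSTEAD OF INSERT ON src_event BEGIN
    INSERT INTO web_event (id, name, starts_at)
    VALUES (NEW.event_id, NEW.title, NEW.event_date || ' ' || NEW.event_time);
END;
""")

# The importer only needs to know the source-shaped view.
conn.execute(
    "INSERT INTO src_event (event_id, title, event_date, event_time) "
    "VALUES (1, 'Opening', '2024-05-01', '09:00')"
)
rows = conn.execute("SELECT id, name, starts_at FROM web_event").fetchall()
```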

blackhorse

Weedpacket;10925131 wrote:
Oh, so you're doing the other end.

So you're just getting dumps of their tables. You don't know what's changed since the last time you looked, and time you spend trying to figure that out could have been spent just populating a new table.

Yes.

A new table, note. Perhaps temporary tables; then truncate the existing tables and copy the contents of the temp tables into the just-truncated ones. Assuming you can't just switch the tables directly. Updating the existing tables depends on comparing them with the temp tables and determining what's an update, what's an insert, and what's a deletion.

The difference in schemas isn't really major. Set up some views that mimic the source schema and add INSERT triggers to them that translate insertions into the views into whatever the corresponding commands would be for your schema. Then as the XML is read and rows inserted into these views the triggers fire and perform the schema conversion (the user that does this needs insert permissions on these views, obviously, but doesn't need any other permissions on anything else).

I will follow your advice here. Just a quick question: the web database part will be very simple, 3-4 tables, each with no more than 100 records. So what would be the problem with just emptying the tables and inserting the new data? It would be done in a second. There must be a good reason that you don't take this "quick and dirty" approach.

johanafm

A thought that struck me that I'd like some feedback on is, how about:

CREATE TABLE newTable LIKE oldTable;
-- populate newTable here
DROP TABLE oldTable;
ALTER TABLE newTable RENAME TO oldTable;

Any queries on oldTable could check for a missing-table error during the short time it's gone and simply retry the same query.
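
The retry idea could be sketched like this. `query_with_retry` is a made-up helper, the error check matches SQLite's "no such table" message (a MySQL client would test for its own missing-table error instead), and the swap finishing between attempts is simulated inside `read_events`.

```python
import sqlite3
import time

def query_with_retry(run_query, attempts=3, delay=0.05):
    """Call run_query(), retrying briefly if the table is missing mid-swap."""
    for attempt in range(attempts):
        try:
            return run_query()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a missing table, or the final failure.
            if "no such table" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay)

conn = sqlite3.connect(":memory:")
calls = {"n": 0}

def read_events():
    calls["n"] += 1
    if calls["n"] == 2:  # simulate the rename completing between attempts
        conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute("INSERT INTO event VALUES (1, 'Opening')")
    return conn.execute("SELECT id, title FROM event").fetchall()

rows = query_with_retry(read_events)
```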

And as for the last question in your previous post, if the condition of no more than 100 rows per table won't change, I'd say there is little reason to care about dealing with diffs. However, this information wasn't previously known, and your tables could have been holding a few million rows instead.

blackhorse

johanafm;10925159 wrote:
A thought that struck me that I'd like some feedback on is, how about:
CREATE TABLE newTable LIKE oldTable;
-- populate newTable here
DROP TABLE oldTable;
ALTER TABLE newTable RENAME TO oldTable;

I like this idea. Does Weedpacket's "switch the tables" mean the same thing? I was thinking about the same kind of approach for "switching the tables" as Weedpacket suggests.

This way, the only downtime is the few milliseconds taken by the DROP oldTable and ALTER newTable queries.

Detecting the error will be simple too.

I will try this out.

Is there anything else I should be concerned about when I use this approach? Please advise.

Thanks a lot.

Weedpacket

Yeah, that's what I meant about switching tables. One thing to do is ensure that all this happens within a transaction. That way (a) if it falls over you don't leave things half-done, and (b) no-one else (like your web app) sees things in a half-done state (they'll continue to look at the old content until the transaction is committed).
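
A sketch of the transactional swap, assuming a DBMS where DDL statements take part in transactions (SQLite, used here, does). MySQL implicitly commits on statements such as DROP TABLE and ALTER TABLE, so there a single `RENAME TABLE old TO backup, new TO old` statement, which is atomic, is the closer equivalent. Table names and the `swap_in_new_data` helper are hypothetical.

```python
import sqlite3

def swap_in_new_data(conn, rows):
    """Build event_new, then replace event with it inside one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute("CREATE TABLE event_new (id INTEGER PRIMARY KEY, title TEXT)")
        conn.executemany("INSERT INTO event_new VALUES (?, ?)", rows)
        conn.execute("DROP TABLE event")
        conn.execute("ALTER TABLE event_new RENAME TO event")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undoes the CREATE as well as the inserts
        raise

# isolation_level=None: Python does not open transactions for us,
# so the explicit BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO event VALUES (1, 'old')")

swap_in_new_data(conn, [(1, "Opening"), (2, "Keynote")])

# A failing load (duplicate key) must leave the previous content in place.
try:
    swap_in_new_data(conn, [(3, "A"), (3, "duplicate key")])
except sqlite3.IntegrityError:
    pass

current = conn.execute("SELECT id, title FROM event ORDER BY id").fetchall()
```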

I did suggest emptying the tables and repopulating them; using temp tables to effect the schema conversion was simply a possible convenience. Using a real table and renaming it would also work (assuming that it doesn't break things like foreign key relationships).