Weedpacket;10905898 wrote:Why was the query bad? Wouldn't it be better to fix the problem than try monitoring for symptoms?
If you're slicing vegetables, do you regularly check whether you've lopped off any of your extremities (unless you suffer from leprosy or something)? I doubt a reliable replication system would need to resort to such kludges as regularly reading the error log.
Actually, the problem here appears to be the rather fragile replication that MySQL employs. There are plenty of ways to make it stop working that seem non-obvious at first, and there are a ton of bugs listed at bugs.mysql.com for replication failing that are months or even years old. And, sadly, no one seems to be working on them. PostgreSQL or Slony bug reports result in patches being available in hours or days. That's the advantage of software that's actually maintained by a vibrant developer community.
So, fixing the problem likely involves fixing MySQL's replication. Good luck with that. I've submitted very simple bug reports for things like bad packaging that took several months to get fixed, only to see them reverted to the buggy behaviour as a quick fix for another bug.
The replication engine I use with pgsql (Slony) is much more complex to set up and maintain than MySQL's built-in replication. However, I can initiate replication while the db is live and running, and add or remove tables from the replication set while live as well. My application-based monitoring can promote a slave to master with one command in under a second, and the whole system just keeps running smoothly.
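For the curious, the promotion is done with slonik, Slony's admin scripting tool. Here's a rough sketch of a controlled switchover; the set and node IDs (set 1, nodes 1 and 2) and the cluster name are assumptions for illustration, not my actual config:

```
# slonik script -- controlled switchover of set 1 from node 1 to node 2.
cluster name = replication;
node 1 admin conninfo = 'dbname=mydb host=master';
node 2 admin conninfo = 'dbname=mydb host=slave';

lock set (id = 1, origin = 1);
move set (id = 1, old origin = 1, new origin = 2);
```

If the old master has actually died, you'd use slonik's FAILOVER command instead of MOVE SET, which doesn't need the old origin to be reachable.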
I have a single query that runs every minute to see if replication has stopped or fallen behind. Neither has happened yet, and we replicate millions of updates, inserts, and deletes a day with it.
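The check itself is just a select against the sl_status view that Slony maintains in the cluster's schema. Something like this sketch; "_replication" stands in for whatever your cluster schema is actually named, and the thresholds are illustrative:

```sql
-- Flag any node whose replication lag exceeds our tolerances.
-- "_replication" is a placeholder for the real Slony cluster schema name.
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
  FROM _replication.sl_status
 WHERE st_lag_num_events > 1000
    OR st_lag_time > interval '5 minutes';
```

Zero rows back means replication is current; any row back and the monitor raises an alert.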