Weedpacket;10905898 wrote:Why was the query bad? Wouldn't it be better to fix the problem than try monitoring for symptoms?
If you're slicing vegetables, do you regularly check whether you've lopped off any of your extremities (unless you suffer from leprosy or something)? I doubt a reliable replication system would need to resort to such kludges as regularly reading the error log.
Actually, the problem here appears to be the rather fragile replication that MySQL employs. There are plenty of ways to make it stop working that seem non-obvious at first, and there are a ton of bugs listed at bugs.mysql.com for replication failing that are months or even years old. And, sadly, no one seems to be working on them. PostgreSQL or Slony bug reports result in patches being available in hours or days. That's the advantage of software that's actually maintained by a vibrant developer community.
So, fixing the problem likely involves fixing MySQL's replication. Good luck with that. I've submitted very simple bug reports for things like bad packaging that took several months to get fixed, only to see them reverted to the buggy behaviour as a quick fix for another bug.
The replication engine I use with pgsql (Slony) is much more complex to set up and maintain than MySQL's built-in replication. However, I can initiate replication while the db is live and running, and add or remove tables from the replication set while live as well. My application-based monitoring can promote a slave to master with one command in under a second, and the whole system just keeps running smoothly.
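For the curious, the promotion is done with slonik, Slony's admin scripting tool. Here's a rough sketch of a controlled switchover; the set and node IDs (set 1, nodes 1 and 2) and the cluster name are assumptions for illustration, not my actual config:

```
# slonik script -- controlled switchover of set 1 from node 1 to node 2.
cluster name = replication;
node 1 admin conninfo = 'dbname=mydb host=master';
node 2 admin conninfo = 'dbname=mydb host=slave';

lock set (id = 1, origin = 1);
move set (id = 1, old origin = 1, new origin = 2);
```

If the old master has actually died, you'd use slonik's FAILOVER command instead of MOVE SET, which doesn't need the old origin to be reachable.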
I have a single query that runs every minute to see if replication has stopped or fallen behind. Neither has happened yet, and we replicate millions of updates, inserts, and deletes a day with it.
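The check itself is just a select against the sl_status view that Slony maintains in the cluster's schema. Something like this sketch; "_replication" stands in for whatever your cluster schema is actually named, and the thresholds are illustrative:

```sql
-- Flag any node whose replication lag exceeds our tolerances.
-- "_replication" is a placeholder for the real Slony cluster schema name.
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
  FROM _replication.sl_status
 WHERE st_lag_num_events > 1000
    OR st_lag_time > interval '5 minutes';
```

Zero rows back means replication is current; any row back and the monitor raises an alert.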