Closed Bug 757911 Opened 12 years ago Closed 12 years ago

buildbot database replication event causing multiple failures

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: scabral)

References

Details

(Whiteboard: [buildduty][outage])

This event was seen in nagios:

[13:02]  <nagios-scl3> [542] buildbot1.db.scl3.mozilla.com:MySQL Replication is CRITICAL: Replication Stopped - Last error:

and we are getting buildbot command queue items appearing in the dead queue:

[12:58]  <nagios-sjc1> [83] buildbot-master15.build.scl1:Command Queue is CRITICAL: 3 dead items

and aki and hwine are reporting duplicate jobs in the status displays.
Trees are closed.

[13:57]  <edmorley> bear: inbound and m-c closed, want me to do the others?
[13:58]  <bear> if you can, please and thanks
This is causing triplicates of in-progress release builds, which then fail out with errors because of collisions.
FYI: the FF 13.0b5 release build cannot start due to this (the start is already in the DB).
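For the DB side of the triage, a minimal sketch of how the stopped replication would normally be inspected on the flagged node (standard MySQL commands; the error text below is illustrative of a duplicate-key break, not copied from this host):

-- on buildbot1.db.scl3.mozilla.com
SHOW SLAVE STATUS\G
-- fields of interest when replication breaks on a duplicate key:
--   Slave_SQL_Running: No
--   Last_SQL_Errno: 1062  (ER_DUP_ENTRY)
--   Last_SQL_Error: Error 'Duplicate entry ... for key ...' on query ...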
Blocks: 744594
[13:30:07] <sheeri> so dustin  remember how we asked if buildbot master/master would have problems if we used auto_increment gaps?
[13:30:30] <sheeri> dustin we're running into the duplicate key problem, looks like writes are happening to both masters (probably a zeus switching thing)
[13:30:55] <dustin> yeah, that's no good
[13:31:05] <atoll> aiee
[13:31:43] <sheeri> well, your choices are to not have master/master auto-failover (we can still do manual failover) or to be able to have gaps in autoincrements.
[13:31:55] <dustin> well, catlee would make that call
[13:32:00] <atoll> sheeri: even/odd type thing?
[13:32:02] <dustin> but AFAIK we had decided on the first
[13:32:03] <sheeri> atoll yes
[13:32:07] <atoll> mm
[13:32:18] <dustin> sheeri: recall that things depend on insert ordering
[13:32:22] <atoll> i was thinking, gaps because you wanted to use the decreased-locking mode on autoinc
[13:32:24] <sheeri> dustin I do recall that
[13:32:39] <sheeri> atoll gaps because otherwise if things failover you might get 2 of the same primary key
[13:32:42] <atoll> right
[13:32:46] <sheeri> dustin but I don't recall we actually made a decision
[13:32:54] <sheeri> of course, I could be mis-remembering.
[13:33:00] <atoll> to the bug!
[13:33:13] <dustin> so I think we had decided that failovers would be non-revertive, and rare enough that we'd eat some bum queries on a failover, but not continuously
[13:33:23] <sal> van: https://bugzilla.mozilla.org/show_bug.cgi?id=757630
[13:33:27] <dustin> so if that means manual failover, that's what it means :)
[13:33:48] coop|mtg is now known as coop|buildduty
[13:35:55] <sheeri> dustin yeah that's what I thought we came to the conclusion of.
[13:36:07] <sheeri> dustin but I have 3 in a row duplicate key pages from nagios
[13:36:12] <sheeri> OK, the ro and rw buildbot pools now have no failovers
[13:36:35] <dustin> any idea what happened? did zeus bounce around for a bit?
[13:37:55] <sheeri> dustin that seems to be what happened based on the db errors
[13:38:06] <sheeri> not sure if we're logging that stuff or what on zeus, fox2mike or cshields  knows better
[13:38:12] <sheeri> dustin right now I'm focusing on cleaning stuff up
mozilla-central, inbound, fx-team, aurora, beta, esr10 & try closed at 1900 UTC+1.
Stuff is now cleaned up, so I'll do a brief summary:

Root cause was a known issue that dustin and I had agreed was probably rare. Basically the load balancer was auto-failing over between the two databases, and the DBs were not set up with offset auto-increment values (because the app would break with gaps). When the load balancer failed over and back a few times, writes hit both masters and created duplicate entries for the same auto-increment IDs, causing replication to fail and the data to go out of sync.
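(For reference, a minimal sketch of the offset/even-odd auto-increment scheme discussed above; it was not deployed here precisely because the app depends on gap-free, ordered inserts. auto_increment_increment and auto_increment_offset are standard MySQL server variables:)

# my.cnf on the first master
[mysqld]
auto_increment_increment = 2   # advance every AUTO_INCREMENT value by 2
auto_increment_offset    = 1   # this master hands out 1, 3, 5, ...

# my.cnf on the second master
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2   # this master hands out 2, 4, 6, ...

With that in place, writes landing on both masters during a failover flap can't collide on the same primary key; the cost is gaps and interleaved IDs, which is exactly the trade-off that was declined.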
Corey got the logs for when things happened with Zeus:

[23/May/2012:09:17:28 -0700]	SERIOUS	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodefail	Node 10.22.70.200 has failed - Timeout while establishing connection (the machine may be down, or the network congested; increasing the 'max_connect_time' on the pool's 'Connection Management' page may help)

[23/May/2012:09:17:28 -0700]	SERIOUS	pools/buildbot-rw-db	pooldied	Pool has no back-end nodes responding

[23/May/2012:09:18:16 -0700]	SERIOUS	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodefail	Node 10.22.70.200 has failed - A monitor has detected a failure

[23/May/2012:09:49:21 -0700]	INFO	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodeworking	Node 10.22.70.200 is working again

[23/May/2012:09:49:21 -0700]	INFO	pools/buildbot-rw-db	poolok	Pool now has working nodes

[23/May/2012:10:35:20 -0700]	INFO	pools/buildbot-rw-db	confmod	Configuration file modified

[23/May/2012:10:35:34 -0700]	INFO	pools/buildbot-ro-db	confmod	Configuration file modified

(those last 2 entries are me changing the pools to not have auto-failover)
Assignee: server-ops-database → scabral
Per IRC:

1) RelEng is now OK to restart release builds.

2) Trees remain closed while fallout mop-up continues.
Corey notes that the Zeus failover hadn't happened since May 14th, so it's not a common daily occurrence...but certainly more frequent than we'd like.
esr10, aurora, beta and fx-team were reopened at 2025 UTC+1 since they don't have any backlog. mozilla-central & inbound are left closed for now, just until we clear the pending jobs (mozilla-central only has one push pending, but if I open it and leave inbound closed, the pile-up will just shift to whichever trees are open...).
And then reclosed a few minutes later, because whether or not things are running, results aren't actually showing up on tbpl.
(In reply to Sheeri Cabral [:sheeri] from comment #9)
> Corey notes that the Zeus failover hadn't happened since May 14th, so it's
> not a common daily occurrence...but certainly more frequent than we'd like.

do we have nagios alerting for this, so we can at least know if it happens again?
(In reply to John O'Duinn [:joduinn] from comment #12)
> (In reply to Sheeri Cabral [:sheeri] from comment #9)
> > Corey notes that the Zeus failover hadn't happened since May 14th, so it's
> > not a common daily occurrence...but certainly more frequent than we'd like.
> 
> do we have nagios alerting for this, so we can at least know if it happens
> again?

There is no reason why we shouldn't; we do this kind of Zeus monitoring for websites. Sheeri, if that is not the case here, please have Ops fix that ASAP.
It's a moot point, because failover is turned off now to avoid this problem. We thought that if a failover ever happened, it would be when the DB was actually having problems or was unreachable, so we figured duplicates wouldn't happen. We were, obviously, wrong.
tbpl started showing jobs finishing, and while we are still seeing some dead commands, they are from older jobs that are only now finishing.

The trees are closed for other issues, but I feel this one is resolved.
[18:03]  <sheeri> things are coming back now tho.
[18:03]  <sheeri> bear from db point of view it's done
[18:04]  <bear> shall I close it or do you want the pleasure?
[18:05]  <sheeri> bear if you're in it, you can close it
[18:05]  <bear> done
[18:05]  <sheeri> bear you might get joduinn-mtg's blessing too
[18:05]  <sheeri> I guess if joduinn-mtg  doesn't give his blessing he'll re-open
[18:05]  <bear> he will let me know if something is needed
[18:06]  <joduinn-mtg> sheeri: /me fine with closing 757911, that tree closure + repair work is done afaict
[18:06]  <sheeri> w00t!
[18:06]  <sheeri> par-tay!
Severity: blocker → normal
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team