Closed Bug 757911 Opened 12 years ago Closed 12 years ago

buildbot database replication event causing multiple failures

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: scabral)

References

Details

(Whiteboard: [buildduty][outage])

This event was seen in nagios:

[13:02]  <nagios-scl3> [542] buildbot1.db.scl3.mozilla.com:MySQL Replication is CRITICAL: Replication Stopped - Last error:

and we are getting buildbot command queue items appearing in the dead queue:

[12:58]  <nagios-sjc1> [83] buildbot-master15.build.scl1:Command Queue is CRITICAL: 3 dead items

and aki and hwine are reporting duplicate jobs in the status displays.
Trees are closed.

[13:57]  <edmorley> bear: inbound and m-c closed, want me to do the others?
[13:58]  <bear> if you can, please and thanks
This is causing triplicates of in-progress release builds, which then fail out with errors because of collisions.
FYI: the FF 13.0b5 release build cannot start due to this (the start is already in the DB).
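For the DB side of the triage, a minimal sketch of how the stopped replication would normally be inspected on the flagged node (standard MySQL commands; the error text below is illustrative of a duplicate-key break, not copied from this host):

-- on buildbot1.db.scl3.mozilla.com
SHOW SLAVE STATUS\G
-- fields of interest when replication breaks on a duplicate key:
--   Slave_SQL_Running: No
--   Last_SQL_Errno: 1062  (ER_DUP_ENTRY)
--   Last_SQL_Error: Error 'Duplicate entry ... for key ...' on query ...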
Blocks: 744594
[13:30:07] <sheeri> so dustin  remember how we asked if buildbot master/master would have problems if we used auto_increment gaps?
[13:30:30] <sheeri> dustin we're running into the duplicate key problem, looks like writes are happening to both masters (probably a zeus switching thing)
[13:30:55] <dustin> yeah, that's no good
[13:31:05] <atoll> aiee
[13:31:43] <sheeri> well, your choices are to not have master/master auto-failover (we can still do manual failover) or to be able to have gaps in autoincrements.
[13:31:55] <dustin> well, catlee would make that call
[13:32:00] <atoll> sheeri: even/odd type thing?
[13:32:02] <dustin> but AFAIK we had decided on the first
[13:32:03] <sheeri> atoll yes
[13:32:07] <atoll> mm
[13:32:18] <dustin> sheeri: recall that things depend on insert ordering
[13:32:22] <atoll> i was thinking, gaps because you wanted to use the decreased-locking mode on autoinc
[13:32:24] <sheeri> dustin I do recall that
[13:32:39] <sheeri> atoll gaps because otherwise if things failover you might get 2 of the same primary key
[13:32:42] <atoll> right
[13:32:46] <sheeri> dustin but I don't recall we actually made a decision
[13:32:54] <sheeri> of course, I could be mis-remembering.
[13:33:00] <atoll> to the bug!
[13:33:13] <dustin> so I think we had decided that failovers would be non-revertive, and rare enough that we'd eat some bum queries on a failover, but not continuously
[13:33:23] <sal> van: https://bugzilla.mozilla.org/show_bug.cgi?id=757630
[13:33:27] <dustin> so if that means manual failover, that's what it means :)
[13:33:48] coop|mtg is now known as coop|buildduty
[13:35:55] <sheeri> dustin yeah that's what I thought we came to the conclusion of.
[13:36:07] <sheeri> dustin but I have 3 in a row duplicate key pages from nagios
[13:36:12] <sheeri> OK, the ro and rw buildbot pools now have no failovers
[13:36:35] <dustin> any idea what happened? did zeus bounce around for a bit?
[13:37:55] <sheeri> dustin that seems to be what happened based on the db errors
[13:38:06] <sheeri> not sure if we're logging that stuff or what on zeus, fox2mike or cshields  knows better
[13:38:12] <sheeri> dustin right now I'm focusing on cleaning stuff up
mozilla-central, inbound, fx-team, aurora, beta, esr10 & try closed at 1900 UTC+1.
Stuff is now cleaned up, so I'll do a brief summary:

Root cause was a known issue that dustin and I had agreed was probably rare. Basically the load balancer was auto-failing over between the two databases, and the DBs were not set up with offset auto-increment values (because the app would break with gaps). When the load balancer failed over and back a few times, writes hit both masters and created duplicate entries for the same auto-increment IDs, causing replication to fail and the data to go out of sync.
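(For reference, a minimal sketch of the offset/even-odd auto-increment scheme discussed above; it was not deployed here precisely because the app depends on gap-free, ordered inserts. auto_increment_increment and auto_increment_offset are standard MySQL server variables:)

# my.cnf on the first master
[mysqld]
auto_increment_increment = 2   # advance every AUTO_INCREMENT value by 2
auto_increment_offset    = 1   # this master hands out 1, 3, 5, ...

# my.cnf on the second master
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2   # this master hands out 2, 4, 6, ...

With that in place, writes landing on both masters during a failover flap can't collide on the same primary key; the cost is gaps and interleaved IDs, which is exactly the trade-off that was declined.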
Corey got the logs for when things happened with Zeus:

[23/May/2012:09:17:28 -0700]	SERIOUS	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodefail	Node 10.22.70.200 has failed - Timeout while establishing connection (the machine may be down, or the network congested; increasing the 'max_connect_time' on the pool's 'Connection Management' page may help)

[23/May/2012:09:17:28 -0700]	SERIOUS	pools/buildbot-rw-db	pooldied	Pool has no back-end nodes responding

[23/May/2012:09:18:16 -0700]	SERIOUS	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodefail	Node 10.22.70.200 has failed - A monitor has detected a failure

[23/May/2012:09:49:21 -0700]	INFO	pools/buildbot-rw-db	nodes/10.22.70.200:3306	nodeworking	Node 10.22.70.200 is working again

[23/May/2012:09:49:21 -0700]	INFO	pools/buildbot-rw-db	poolok	Pool now has working nodes

[23/May/2012:10:35:20 -0700]	INFO	pools/buildbot-rw-db	confmod	Configuration file modified

[23/May/2012:10:35:34 -0700]	INFO	pools/buildbot-ro-db	confmod	Configuration file modified

(those last 2 entries are me changing the pools to not have auto-failover)
Assignee: server-ops-database → scabral
Per IRC:

1) RelEng is now OK to restart release builds.

2) Trees remain closed while fallout mop-up continues.
Corey notes that the Zeus failover hadn't happened since May 14th, so it's not a common daily occurrence...but certainly more frequent than we'd like.
esr10, aurora, beta and fx-team were reopened at 2025 UTC+1 since they don't have any backlog. mozilla-central & inbound are left closed for now, just until we clear the pending jobs (mozilla-central only has one push pending, but if I open it and leave inbound closed, the pile-up will just shift to whichever trees are open...).
And then reclosed a few minutes later, because whether or not things are running, results aren't actually showing up on tbpl.
(In reply to Sheeri Cabral [:sheeri] from comment #9)
> Corey notes that the Zeus failover hadn't happened since May 14th, so it's
> not a common daily occurrence...but certainly more frequent than we'd like.

do we have nagios alerting for this, so we can at least know if it happens again?
(In reply to John O'Duinn [:joduinn] from comment #12)
> (In reply to Sheeri Cabral [:sheeri] from comment #9)
> > Corey notes that the Zeus failover hadn't happened since May 14th, so it's
> > not a common daily occurrence...but certainly more frequent than we'd like.
> 
> do we have nagios alerting for this, so we can at least know if it happens
> again?

There is no reason why we shouldn't; we do this kind of Zeus monitoring for websites. Sheeri, if that is not the case here, please have Ops fix that ASAP.
It's a moot point, because failover is turned off now to avoid this problem. We thought that if a failover ever happened, it would be when the DB was actually having problems or was unreachable, so we figured duplicates wouldn't happen. We were, obviously, wrong.
tbpl started showing jobs finishing, and while we are still seeing some dead commands, they are from older jobs that are only now finishing.

The trees are closed for other issues, but I feel this one is resolved.
[18:03]  <sheeri> things are coming back now tho.
[18:03]  <sheeri> bear from db point of view it's done
[18:04]  <bear> shall I close it or do you want the pleasure?
[18:05]  <sheeri> bear if you're in it, you can close it
[18:05]  <bear> done
[18:05]  <sheeri> bear you might get joduinn-mtg's blessing too
[18:05]  <sheeri> I guess if joduinn-mtg  doesn't give his blessing he'll re-open
[18:05]  <bear> he will let me know if something is needed
[18:06]  <joduinn-mtg> sheeri: /me fine with closing 757911, that tree closure + repair work is done afaict
[18:06]  <sheeri> w00t!
[18:06]  <sheeri> par-tay!
Severity: blocker → normal
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team