Closed Bug 1080855 Opened 10 years ago Closed 10 years ago

RFO needed for buildbot database outages on October 9

Categories

(Data & BI Services Team :: DB: MySQL, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)

References

Details

Trees were closed twice today for multiple hours due to buildbot database outages. What happened here, and how can we prevent it in the future?
Here's what I know:

Earlier yesterday, sheeri was doing ?something? with the buildbot slave db. This ?query? unexpectedly propagated to the master db. The database began dropping connections from the buildbot masters, claiming the connection limit had been reached. Within an hour, the buildbot masters were able to reconnect to the db and dead queued jobs could be requeued. Trees were closed for about 2 hours waiting for jobs to catch up. We took this opportunity to test hot failover from the master db to the slave db. This happened without incident.
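For reference, the MySQL side of a hot failover like this amounts to promoting the replica and demoting the old master; a minimal sketch (the exact commands and ordering here are assumptions, and the load-balancer swap is a separate step, as the IRC log below shows):

    -- On the replica being promoted (buildbot2 in the morning failover): start accepting writes
    SET GLOBAL read_only = OFF;
    -- On the demoted master, once it rejoins as the read slave: refuse writes so the copies can't diverge
    SET GLOBAL read_only = ON;
    -- Before swapping the VIP, confirm replication has caught up
    SHOW SLAVE STATUS\G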

In the afternoon, we wanted to try failing back from the slave db to the master db. This went less well. We ended up with data inconsistencies between the two databases, which resulted in many jobs being marked as pending but never claimed by buildbot. To get back to a good state, sheeri had to do a hot copy between the two database instances.
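For the record, a quick way to test a single suspect table for that kind of divergence is to compare checksums on both servers; a minimal sketch, assuming the schema lives in a database named buildbot (the real schema name may differ, and the full checksum run mentioned below is the thorough version):

    -- Run on both buildbot1 and buildbot2 and compare the values;
    -- differing checksums mean the two copies have diverged.
    CHECKSUM TABLE buildbot.buildrequests;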

Hopefully sheeri can fill us in as to the root cause, and better document what we did to get back on our feet when the databases were in an inconsistent state.
Flags: needinfo?(scabral)
Here's the IRC log I have from the afternoon incident yesterday:

[3:46pm]sheeri:coop|buildduty: I'm ready to fail buildbot back if you are
[3:48pm]coop|buildduty:sheeri: sure
[3:49pm]hoosteeno is now known as hoosteeno|out.
[3:50pm]sheeri:buildbot1 is set to read_only=OFF
[3:50pm]sheeri:buildbot2 draining cxns
[3:50pm]sheeri:buildbot2 shutting down
[3:50pm]sheeri:buildbot1 in the lb as the master
[3:51pm]sheeri:buildbot2 started up in read-only mode
[3:51pm]sheeri:coop|buildduty: we're good
[4:09pm]hoosteeno|out is now known as hoosteeno.
[4:15pm]coop|buildduty:sheeri: ok
[4:40pm]coop|buildduty:sheeri: we haven’t had any jobs pick up in 50min
[4:41pm]sheeri:I have queries running....
[4:41pm]sheeri:nothing's backed up 
[4:42pm]sheeri:coop|buildduty: not having jobs pick up is a bad thing right?
[4:42pm]coop|buildduty:yes
[4:42pm]sheeri:read_only is indeed set to off
[4:42pm]sheeri:are you getting any tracebacks or anything?
[4:44pm]sheeri:coop|buildduty:  ^^
[4:45pm]coop|buildduty:sheeri: no
[4:45pm]sheeri:Hrm. can you actively kick off a job? is it possible that no jobs were submitted?
[4:45pm]coop|buildduty:there are 726 pending jobs in the queue
[4:46pm]coop|buildduty:my fear is that every time we failover, we need to wait the full 60 minute timeout for buildbot to reclaim the jobs
[4:46pm]coop|buildduty:but that’s not necessarily a db issue
[4:46pm]sheeri:is that what happened earlier?
[4:47pm]sheeri:coop|buildduty:  ^
[4:47pm]coop|buildduty:i’d need to look at the timing more closely, but i think there was about an hour from when we first noticed the issue to the fail over
[4:53pm]sheeri:OK, so shall we wait a few more mins?
[4:53pm]sheeri:that's a good thing to note if it's true
[4:58pm]coop|buildduty:doesn’t seem to be happening. the pending jobs are well past 1hr now
[4:59pm]sheeri:coop|buildduty: is there any way to debug this?
[4:59pm]sheeri:I definitely see connections and queries on the master
[4:59pm]sheeri:is there a table or status I can look at?
[5:03pm]sheeri:coop|buildduty: could it be caused by a data inconsistency in a certain table? I could look at that table if so...
[5:04pm]coop|buildduty:i don’t know enough about the db structure to know, honestly
[5:04pm]sheeri:coop|buildduty: I hear ya...but I don't know the code. the db is working just fine, and writes are happening, so I don't know how I can help
[5:11pm]Callek:sheeri: do we have any means to do relatively quick integrity checks, without halting everything?
[5:11pm]•Callek assumes not unless we know specifically what we are looking for
[5:11pm]sheeri:Callek: if you give me a table or set of tables, I can look.
[5:11pm]sheeri:Callek: I don't think it's useful to spend time doing that unless we actually know that yes, it causes problems.
[5:13pm]sheeri:Callek: there's nothing quick, checksums take 3h 20 mins on buildbot
[5:13pm]sheeri:(at least, this morning's run did)
[5:13pm]dustin:it doesn't sound like internal DB corruption
[5:13pm]dustin:it's a state asynchrony
[5:13pm]dustin:either between the buildmasters and the DB, or between the DB servers
[5:13pm]•Callek looks at https://rpm.newrelic.com/accounts/263620/dashboard/3101981 and https://rpm.newrelic.com/accounts/263620/dashboard/3101982 and doesn't see any indication that writes/etc are happening
[5:14pm]Callek:sheeri: ^
[5:14pm]sheeri:dustin: what does "state asynchrony between the DB servers" mean
[5:14pm]sheeri:Callek: sure, but I can see writes happening in the db. I'll check the binlogs
[5:15pm]dustin:sheeri: different data, or delayed replication, or something like that
[5:15pm]dustin:something comparing checksums would find
[5:15pm]sheeri:dustin: that's what we were talking about.
[5:15pm]sheeri:dustin: what tables?
[5:15pm]Callek:sheeri: and yea, the mere fact that newrelic is showing 0 data makes me wonder if whatever is causing stuff to not show there is related somehow
[5:15pm]dustin:buildrequests seems the likely target from what I've heard
[5:15pm]sheeri:Callek: I see inserts into steps, updates for buildrequests, etc.
[5:15pm]Callek:unless "don't report to new relic" was an intentional choice
[5:15pm]dustin:so let's start with some evidence
[5:15pm]dustin:what's a request that hasn't started?
[5:15pm]Callek:I might be chasing down a rabbit hole 
[5:16pm]•dustin looks at Callek
[5:16pm]•sheeri goes to checksum the buildrequests table
[5:16pm]coop|buildduty:dustin: i’m looking at https://secure.pub.build.mozilla.org/buildapi/pending
[5:17pm]Callek:and I'm pulling up buildapi
[5:17pm]dustin:ok
[5:17pm]coop|buildduty:i can see activity on the scheduler master: tests are being scheduled for in-flight builds that are finishing, but none of that pending backlog is clearing
[5:18pm]Callek:...we have a bunch of w64 from last night thats invalid now (aiui)
[5:19pm]dustin:http://pastebin.mozilla.org/6750827
[5:19pm]dustin:that's from sourcestamps with b5fbc56ab657
[5:19pm]dustin:revision=b5fbc56ab657
[5:19pm]sheeri:would it help to just sync the dbs?
[5:19pm]dustin:is that incarnation still running?
[5:19pm]dustin:er, gtg
[5:19pm]dustin:hope that started things..
[5:19pm]sheeri:or rather, would it help to have the master and slave be the same db?
[5:22pm]sheeri:dustin: hrm, there's one row in that query on the master, 17 on the slave (that used to be the master)
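(The pastebin contents aren't preserved here; a query of roughly this shape — assuming the stock buildbot 0.8 schema, which may differ from what was actually run — is the kind of check that turned up 1 row on the master and 17 on the slave:)

    -- Incomplete build requests for the revision in question;
    -- run against both buildbot1 and buildbot2 and compare row counts.
    SELECT br.id, br.buildername, br.claimed_at, br.complete
      FROM buildrequests br
      JOIN buildsets bs ON bs.id = br.buildsetid
      JOIN sourcestamps ss ON ss.id = bs.sourcestampid
     WHERE ss.revision = 'b5fbc56ab657'
       AND br.complete = 0;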
[5:23pm]Callek:dustin: for future, using \G instead of ; is a nice way to throw it on pastebin or even terminal
[5:23pm]Callek:(imho)
[5:24pm]Callek:so yea, thats sounding more and more like a data integrity issue
[5:25pm]sheeri:OK, well I've changed the master to be both the read and write slaves
[5:25pm]Callek:sheeri: was the failover a conscious choice here, or was it an automatic choice?
[5:26pm]sheeri:conscious
[5:26pm]sheeri:but we had to do it to fix the steps table.
[5:27pm]jp is now known as jp-away.
[5:27pm]coop|buildduty:and to be clear, i did OK the fail over
[5:28pm]Callek:ooo I wasn't questioning if it was an ok failover, just trying to find out what went wrong 
[5:28pm]Callek:since for me, knowing that usually helps my brain determine the best next step to fix
[5:28pm]sheeri:Callek: we did the failover twice, it worked the first time
[5:28pm]Callek:(at least in a case like this)
[5:28pm]coop|buildduty:sure
[5:28pm]sheeri:So, right now the master and slave are the same db, buildbot1
[5:29pm]sheeri:if the master and slave are the same db, that takes away the data inconsistency issues, right?
[5:29pm]Callek:sheeri: so when you fixed buildbot2, did we make sure that not only did it replicate new changes, but we killed off old changes? (as in, did we remove rows that were not also on buildbot1) ?
[5:29pm]Callek:sheeri: "maybe"
[5:30pm]Callek:sheeri: since buildbot has some inter-table assumptions, and if there was an odd inconsistency those may now be violated 
[5:30pm]sheeri:Callek: I made sure of that, yes.
[5:30pm]sheeri:Callek: when buildbot2 became the master, it was in sync with buildbot1
[5:30pm]sheeri:I think it's buildbot1 that's inconsistent, and buildbot2 has the "right" data
[5:31pm]sheeri:since the master right now is buildbot1, and that seems to be having the problem.
[5:31pm]sheeri:I'm hesitant to fail back to buildbot2 being the master b/c we'll have to wait 60m to see if it worked....any way to speed that up?
[5:31pm]Callek:could buildbot2 have had any prior changes idle in its binlog that propagated when we failed back?
[5:31pm]•Callek is grasping at straws
[5:32pm]sheeri:Callek: buildbot2 is having duplicate key errors from buildbot1
[5:32pm]Callek:I'm not certain what recognizes that 60m thing
[5:32pm]sheeri:so I know they're not consistent
[5:32pm]sheeri:coop|buildduty: your thoughts?
[5:33pm]•Callek is trying to find some data atm, but net seems to be acting slow
[5:37pm]Callek:hrm not sure where this is running https://bug923155.bugzilla.mozilla.org/attachment.cgi?id=8359204
[5:37pm]•Callek found a script that tries to reset the non-started things for us
[5:37pm]sheeri:Callek: OK, so..try running it/
[5:38pm]sheeri:??
[5:38pm]Callek:heh, yea
[5:38pm]Callek:I think its still running somewhere in a cron though
[5:38pm]Callek:trying to find out
[5:39pm]Callek:as in, if thats so, then running it manually may not help
[5:40pm]coop|buildduty:Callek: but also wouldn’t hurt IMO
[5:41pm]Callek:indeed
[5:41pm]Callek:old cron e-mail says bm81
[5:41pm]Callek:mx search says no
[5:41pm]Callek:*mxr
[5:42pm]sheeri:OK, I can run that on buildbot1
[5:43pm]sheeri:OK, it reset a bunch of claims
[5:43pm]sheeri:https://scabral.pastebin.mozilla.org/6750951
[5:43pm]sheeri:ran it twice, 2nd time it did nothing
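(The script itself is attached to bug 923155 and isn't reproduced here; as a hedged sketch — assuming the standard buildbot 0.8 buildrequests columns, the real script may differ — the reset amounts to something like:)

    -- Release claims on incomplete requests whose claim is older than the
    -- 60-minute reclaim window, so masters will pick them up again.
    UPDATE buildrequests
       SET claimed_at = 0,
           claimed_by_name = NULL,
           claimed_by_incarnation = NULL
     WHERE complete = 0
       AND claimed_at > 0
       AND claimed_at < UNIX_TIMESTAMP() - 3600;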
[5:44pm]•Callek runs it
[5:44pm]sheeri:So....how long until we know if it worked? If it's not working I'd say we have to fail back to buildbot2, because that's the source of truth
[5:44pm]Callek:sheeri: ahh bah
[5:44pm]Callek:sheeri: I was gonna do the "blurt" one first
[5:44pm]Callek:https://bug923155.bugzilla.mozilla.org/attachment.cgi?id=8359203
[5:44pm]Callek:o well
[5:44pm]sheeri:Callek: you can do both if you want
[5:44pm]sheeri:Callek: doesn't hurt to run it again, right?
[5:45pm]Callek:yea blurt isn't giving me output though
[5:45pm]sheeri:oh, the blurt is a select
[5:45pm]•Callek wanted the data to help fall back on
[5:45pm]sheeri:right, because nothing's in the queue now
[5:45pm]sheeri:b/c I deleted them
[5:45pm]Callek:yea
[5:45pm]sheeri:Callek: did you check out the pastebin I put in above?
[5:45pm]sheeri:it has the list of things it deleted.
[5:45pm]Callek:yea
[5:45pm]sheeri:So....how long until we know if it worked? If it's not working I'd say we have to fail back to buildbot2, because that's the source of truth....
[5:45pm]Callek:but would have been useful to know what the claimed at and claimed by *said* prior to change
[5:46pm]Callek:sheeri: the interesting part is it only changed a few entries, and aiui coop|buildduty says theres a whole bunch
[5:46pm]sheeri:Callek: well, that was against buildbot1
[5:46pm]sheeri:I really do think buildbot2 has the right data at this oint
[5:46pm]sheeri:point
[5:47pm]sheeri:but I need coop|buildduty to chime in to see if we want to failover again
[5:47pm]sheeri:I can failover and then hot-copy the db from buildbot2 to buildbot1 so they're in sync
[5:51pm]coop|buildduty:we’ve dipped below the original 726 pending count we were stuck at. that’s a good sign
[5:52pm]Callek:coop|buildduty: sooo, where/what do I need to look at tbpl-wise to see these jobs that are pending
[5:52pm]Callek:coop|buildduty: at least inbound looks like its started just about everything that was scheduled since 2p
[5:52pm]sheeri:so, *some* jobs are being processed.
[5:52pm]Callek:(2p ET)
[5:52pm]coop|buildduty:i’m just going by https://secure.pub.build.mozilla.org/buildapi/pending
[5:56pm]sheeri:coop|buildduty: looks like it's going way down
[5:57pm]Callek:ok, based on that --> (*) a whole bunch of Date branch win64 vm tests as of ~midnight ET (*) a whole bunch of the latest linux64 l10n nightlies, as of ~6am ET (*) chunk of mavericks tests as of ~9am ET (*) few periodic things as of ~10-11 am ET, e.g. b2g emulator (*) a large swath of jobs at ~4p ET
[5:57pm]Callek:do any of those times line up with "events" we know of
[5:57pm]Callek:also another batch around 5:30pm ET
[5:58pm]Callek:the 5:30 PM ET stuff is the largest batch of jobs afaict
[5:58pm]sheeri:that's 27 min ago
[5:58pm]Callek:(most of those seem to be try or related test jobs)
[5:58pm]Callek:so those actually pending may be legit
[5:59pm]sheeri:Callek:  14:28 <@sheeri> So, right now the master and slave are the same db, buildbot1
[5:59pm]sheeri:could that be it?
[6:00pm]Callek:maybe
[6:00pm]Callek:I'm not *sure* we're out of the problem scope right now
[6:02pm]sheeri:some of the pending stuff was submitted last night (going to the last page there)
[6:04pm]sheeri:Callek: how do we know when we're out of the problem scope?
[6:04pm]•Callek is leaving that to coop to define atm
[6:05pm]sheeri:coop|buildduty:  ^^
[6:05pm]sheeri:Callek: the stuff from last night isn't try
[6:05pm]Callek:agreed
[6:06pm]Callek:the win64 stuff I suspect is "expected" (as in I think we turned off all machines that can do those jobs, while awaiting a deploy of a change that stops scheduling them)
[6:06pm]coop|buildduty:win64 vm should be ignored because we’re turning those off
[6:06pm]sheeri:sure but "Firefox mozilla-aurora linux l10n nightly"
[6:06pm]Callek:the l10n stuff I'm slightly concerned about, but not very (we don't have any real priority given to those jobs)
[6:07pm]coop|buildduty:mavericks we have almost zero capacity, so ignore that too
[6:07pm]sheeri:OK, b2g_b2g-inbound_emulator-kk-debug_nonunified
[6:07pm]sheeri:Or wait
[6:07pm]sheeri:let me stop ratholing on individual ones
[6:07pm]sheeri:coop|buildduty:  how do we know if we're good?
[6:08pm]coop|buildduty:if the try jobs for 90-120min ago get picked up, then we’re good
[6:08pm]sheeri:OK, and if not....is there a way to pick them up or resubmit them?
[6:09pm]coop|buildduty:it looks like we’re constrained on machines there
[6:09pm]Callek:coop|buildduty: per https://secure.pub.build.mozilla.org/builddata/reports/slave_health/ at least, we only have mac try builds left, (I;m ignoring try tests atm)
[6:12pm]Callek:coop|buildduty: two minutes ago we started 171 linux spot instances to tackle jobs as well
[6:12pm]Callek:fwiw
[6:12pm]Callek:which of course take a bit from "start" in aws sense to "buildbot running and jobs started"
[6:13pm]coop|buildduty:yeah
[6:13pm]coop|buildduty:we’re definitely trending better
[6:13pm]Callek:and about 12 min ago we started about 330 instances
[6:14pm]Callek:sheeri: before we swap read-slave back, or do any other changes, hows load on buildbot1 looking?
[6:15pm]sheeri:load's OK
[6:15pm]sheeri:box is still responsive
[6:15pm]sheeri:but before I swap back I want to sync buildbot2
[6:16pm]You are now known as coop|afk.
[6:17pm]Callek:yea would have said to do a sync anyway
[6:17pm]Callek:sheeri: any insight, or separate bug on why its not reporting to newrelic anymore
[6:19pm]Callek:hrm "replication lag" https://rpm.newrelic.com/accounts/263620/dashboard/3101982?tw[dur]=last_24_hours interesting, since the timestamps its showing are pretty far off
[6:19pm]Callek:sheeri: maybe the server timestamps are wrong?
[6:19pm]•Callek is again grasping at straws
[6:21pm]sheeri:"its"
[6:23pm]Callek:sheeri: sorry my apostrophe key gets stuck since I use it correctly so much, [except when I'm typing too fast, since I never stop and make sure it struck properly]  
[6:23pm]•Callek hopes that is the best excuse you have heard for grammar mistakes all day
[6:24pm]sheeri:nono
[6:24pm]sheeri:wasn't a grammar thing
[6:24pm]sheeri:when you say "it isn't reporting"
[6:24pm]sheeri:do you mean both? one? the other?
[6:24pm]Callek:sheeri: buildbot1/2
[6:24pm]Callek:both
[6:24pm]sheeri:OK
[6:24pm]Callek:so "they" is more accurate
[6:24pm]Callek:sorry
[6:24pm]sheeri:are any other dbs reporting?
[6:24pm]Callek:https://rpm.newrelic.com/accounts/263620/dashboard/3101981 and https://rpm.newrelic.com/accounts/263620/dashboard/3101982 
[6:24pm]•Callek looks at other db's
[6:25pm]Callek:addons3.dev is
[6:25pm]Callek:https://rpm.newrelic.com/accounts/263620/dashboard/4401498
[6:25pm]sheeri:I ask b/c our newrelic is centralized for mysql stuff, so it's not like one agent died
[6:25pm]sheeri:OK, kicked off the sync of buildbot2 while I'm at it
[6:26pm]Callek:sheeri: basically "green bar" vs "grey bar" it looks like: https://rpm.newrelic.com/accounts/263620/plugins/2805
[6:26pm]Callek:sheeri: which looks like ~ half the things aren't reporting
[6:26pm]Callek:sheeri: hrm actually its a common trend
[6:26pm]Callek:sheeri: *.phx is reporting, *.scl3 is not
[6:26pm]sheeri:Callek: let me guess - scl3 stopped reporting
[6:26pm]sheeri:
[6:27pm]sheeri:plugin *was* running, but I stopped and started it again
[6:28pm]Callek:I'd suspect some sort of netops work if all of scl3 is dark to newrelic
[6:28pm]Callek:but anyway, I trust you to get bug(s) on file for that and drive it
[6:28pm]sheeri:is it all of scl3, or just scl3 dbs?
[6:28pm]Callek:sheeri: scl3 db's is all I looked at
[6:28pm]Callek:not too familiar with our newrelic context
[6:28pm]sheeri:right, that's centralized on a server I run
[6:29pm]sheeri:https://rpm.newrelic.com/accounts/263620/dashboard/4041139 that's http on a scl3 server
[6:29pm]sheeri:so it's me
[6:31pm]Callek:sheeri: anyway, I'm going to prelim say we're clear. I'd like to touchpoint at about 730p/8p as a sanity check, but I'm ok with sending a page if we're not
[6:31pm]sheeri:OK
[6:33pm]sheeri:oh wow, down to 177 entries
[6:41pm]Callek:sheeri: still around?
[6:41pm]sheeri:yep
[6:42pm]Callek:sheeri: nick just alerted me to a sign of further trouble:
[6:42pm]Callek:[18:38:56] nthomas|away bm82 is stalled, waiting on a socket to buildbot-rw-vip.db.scl3
[6:42pm]Callek:[18:39:21] nthomas|away for about 8 mins
[6:42pm]Callek:he can open a new connection with nc
[6:42pm]sheeri:there's only 459 connections, max is 1200
[6:42pm]Callek:[18:41:27] nthomas|away show processlist has an awful lot of 'Waiting for global read lock'
[6:42pm]sheeri:but I am doing the hot copy, so maybe things are a bit slow?
[6:43pm]nthomas joined the chat room.
[6:43pm]Callek:sheeri: but yea 8 min is pretty hefty of a problem
[6:43pm]Callek:(as in, most of the buildbot things die at that level, iirc)
[6:43pm]nthomas:show processlist on the buildbot db doesn't look so great, it's very long and a lot of 'Waiting for global read lock'
[6:43pm]sheeri:*nod* I see that
[6:43pm]sheeri:the hot copy has been going for 16 minutes
[6:43pm]nthomas:but I don't actually see anything active
[6:43pm]nthomas:does that lock then ?
[6:44pm]sheeri:it's at 114G now, the db is 129G, and it does about 6G a minute, so I'd like to see if waiting 3 mins helps
[6:44pm]sheeri:nthomas: it usually grabs a lock and quickly releases it.
[6:44pm]sheeri:nthomas: but I bet buildbot's myisam nature is the problem
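(For future reference, a standard way to see which connections are stuck behind those table/global read locks — plain MySQL, nothing buildbot-specific:)

    SELECT id, user, host, time, state, info
      FROM information_schema.processlist
     WHERE state LIKE '%lock%'
     ORDER BY time DESC;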
[6:44pm]sheeri:OK, hot copy just finished
[6:45pm]sheeri:nope, I lied
[6:45pm]sheeri:my script to watch the hot copy finished 
[6:45pm]sheeri:it's at 125G now, so not long
[6:46pm]nthomas:ok
[6:46pm]nthomas:bm82 strace just started turning over
[6:46pm]sheeri:there, NOW it's finished
[6:46pm]sheeri:yeah, it was the hot copy
[6:46pm]sheeri:my apologies
[6:47pm]nthomas:what was the hot copy for ? just got here so no context
[6:48pm]sheeri:nthomas: sync the dbs
[6:48pm]sheeri:nthomas: which was the problem we were having about an hour or so ago
[6:48pm]sheeri:they were out of sync
[6:48pm]nthomas:ok, so recovering from earlier
[6:48pm]sheeri:correct
[6:49pm]Callek:didn't realize a hot copy would stall things
[6:49pm]sheeri:yeah, I *should* have realized it would have - for myisam it does
[6:49pm]sheeri:because myisam only has table level locking
[6:50pm]Callek:sheeri: for future is it possible to batch hotcopy stuff?
[6:50pm]Callek:e.g. 10% at a time, then pause for 10-30 seconds
[6:50pm]sheeri:Callek: it batches the myisam stuff at the end
[6:50pm]sheeri:Callek: no
[6:50pm]sheeri:sadly 
[6:50pm]sheeri:next time I should hot copy from the backup instance.
[6:50pm]sheeri:the good news is that buildbot2 is happy again
[6:51pm]sheeri:and it's back in the lb as the read slave
[6:52pm]Callek:thanks
[6:52pm]sheeri:Callek: I'm going to make some dinner now, definitely page me if something's up again
[6:52pm]sheeri:and again, my apologies for the clusterfuck of buildbot today
[6:52pm]Callek:ok, sounds good
I was defragmenting a table on the slave after taking it out of the load balancer. I had intended to set the session so the statement would not replicate, but missed that step; setting it is standard procedure for defragmentation queries.
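The guard that was missed is disabling binary logging for the session before the maintenance statement, so the defragmentation runs only on the replica it is issued on. A minimal sketch (the table name is illustrative; the actual table isn't named here):

    -- Requires SUPER; affects only this session, normal replication continues.
    SET SESSION sql_log_bin = 0;
    OPTIMIZE TABLE buildbot.some_table;  -- the defragmentation statement (illustrative)
    SET SESSION sql_log_bin = 1;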
Flags: needinfo?(scabral)
Product: mozilla.org → Data & BI Services Team
Please let me know if more information is needed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED