Closed Bug 1286963 Opened 9 years ago Closed 9 years ago

RCA 2016-07-14 - Buildbot DB Table Crash

Categories

(Infrastructure & Operations :: MOC: Root Cause Analysis, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fauweh, Assigned: mpressman)

References

Details

(Whiteboard: [postmortem done])

No description provided.
Blocks: 1287117
See Also: → 1287117
Please see https://bugzilla.mozilla.org/show_bug.cgi?id=1286942 and WP entry https://whistlepig.mozilla.org/en-US/detail/729/
Please complete the following form to the best of your ability for the RCA: https://docs.google.com/document/d/14XzguY_SoWKP9kfkGwayL7u6f7HSfdOoBG8w-8siAgo/edit
Sheeri, please update the RCA doc or delegate as necessary. Thanks!
Assignee: kferrando → scabral
RCA completed; feel free to schedule the postmortem whenever. Attendees should include me, Matt Pressman, Callek, coop, hwine, linda and natalie.
Sheeri: can you please add the data requested in bug 1286942 comment 3 to the RCA? Given that this occurred so quickly after the midday failover, we'd like to see data that shows there were no failover issues. Also, I've heard rumors that the DB has been failed back in response to similar DB corruption issues. Can you please add to the timeline any failovers which have occurred since the P1 event? Keegan: can you add :catlee to the invitee list, please?
Flags: needinfo?(scabral)
Flags: needinfo?(kferrando)
Postmortem scheduled for Tue, July 26, 9am – 10am with requested participants and calendar invite sent.
Flags: needinfo?(kferrando)
The master is still buildbot2, which is what we failed over to in https://bugzilla.mozilla.org/show_bug.cgi?id=1286942#c8, so I don't think the rumors are true. There was corruption of a small table on 7/18, but that was fixed in real time, and AFAIK there were no outages (and definitely no failover). Adding the other info requested into the RCA.
Flags: needinfo?(scabral)
See also bug 1287817 for another outage on 2016-07-19.
See Also: → 1287817
Assignee: scabral → mpressman
I'm unclear on status here. There are postmortem results in the doc but it calls for a followup meeting. mpressman, do you have state on action items and any bugs for them?
Flags: needinfo?(mpressman)
Status: NEW → ASSIGNED
Whiteboard: [postmortem done]
:hwine since this caused you pain, are you happy with the postmortem results?
Flags: needinfo?(hwine)
Yes - I see no value in going deeper. For history, the following actions discussed during this meeting have been taken:
- buildbot database backend converted from MyISAM to InnoDB
- alerts about replication lag have been fixed
- observations on the RCA process were provided
Flags: needinfo?(hwine)
Thanks :hwine. With a postmortem done and action items taken I think this can be closed.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: MOC: Incidents → MOC: Problems
Component: MOC: Problems → MOC: Root Cause Analysis
Flags: needinfo?(mpressman)