Closed
Bug 1286963
Opened 9 years ago
Closed 9 years ago
RCA 2016-07-14 - Buildbot DB Table Crash
Categories
(Infrastructure & Operations :: MOC: Root Cause Analysis, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: fauweh, Assigned: mpressman)
References
Details
(Whiteboard: [postmortem done])
No description provided.
Comment 1 (Reporter)•9 years ago
Please see https://bugzilla.mozilla.org/show_bug.cgi?id=1286942
and the WP entry https://whistlepig.mozilla.org/en-US/detail/729/
Please complete the following RCA form to the best of your ability:
https://docs.google.com/document/d/14XzguY_SoWKP9kfkGwayL7u6f7HSfdOoBG8w-8siAgo/edit
Sheeri, please update RCA doc or delegate as necessary.
Thanks!
Assignee: kferrando → scabral
Comment 2•9 years ago
RCA completed; feel free to schedule the postmortem whenever. Attendees should include me, Matt Pressman, Callek, coop, hwine, linda and natalie.
Comment 3•9 years ago
Sheeri: can you please add the data requested in bug 1286942 comment 3 to the RCA?
Given that this occurred so soon after the midday failover, we'd like to see data showing that there were no failover issues.
Also, I've heard rumors that the DB has been failed back in response to similar DB corruption issues. Can you add to the timeline any failovers that have occurred since the P1 event, please?
Keegan: can you add :catlee to the invitee list, please?
Comment 4 (Reporter)•9 years ago
Postmortem scheduled for Tue, July 26, 9am – 10am with the requested participants; calendar invite sent.
Flags: needinfo?(kferrando)
Comment 5•9 years ago
The master is still buildbot2, which was what we failed over to in https://bugzilla.mozilla.org/show_bug.cgi?id=1286942#c8, so I don't think the rumors are true.
There was corruption of a small table on 7/18, but that was fixed in real time, and AFAIK there were no outages (and definitely no failover).
Adding the other info requested into the RCA.
Flags: needinfo?(scabral)
Updated (Assignee)•9 years ago
Assignee: scabral → mpressman
Comment 7•9 years ago
I'm unclear on status here. There are postmortem results in the doc but it calls for a followup meeting.
mpressman, do you have state on action items and any bugs for them?
Flags: needinfo?(mpressman)
Updated•9 years ago
Status: NEW → ASSIGNED
Whiteboard: [postmortem done]
Comment 8•9 years ago
:hwine since this caused you pain, are you happy with the postmortem results?
Flags: needinfo?(hwine)
Comment 9•9 years ago
Yes - I see no value in going deeper. For history, the following actions discussed during this meeting have been taken:
- buildbot database backend converted from MyISAM to InnoDB
- alerts about replication lag have been fixed
- observations on the RCA process were provided
Flags: needinfo?(hwine)
Comment 10•9 years ago
Thanks :hwine.
With the postmortem done and the action items taken, I think this can be closed.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•8 years ago
Component: MOC: Incidents → MOC: Problems
Updated•8 years ago
Component: MOC: Problems → MOC: Root Cause Analysis
Updated (Assignee)•8 years ago
Flags: needinfo?(mpressman)