Closed
Bug 852380
Opened 12 years ago
Closed 12 years ago
buildbot2.db.scl3 issues 2013-03-18
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: u429623, Assigned: rbryce)
References
Details
(Whiteboard: [reit-ops] [closed-trees])
Bug filed after the fact to capture the DB crash and related fallout. Assigned to rbryce, as he handled it and can fill in more specifics.
From the RelEng viewpoint, the cascade was:
* mysql blew up, which broke the buildbot schedulers and buildapi, leading to tree closure.
* Restart load may have contributed to the HG issues in bug 852376.
The mysql nagios alert was cleared after rbryce reloaded the kernel module.
Updated•12 years ago
Comment 1•12 years ago
What could prevent mysql from blowing up and the trees from getting closed? Thanks!
Comment 2•12 years ago (Assignee)
The be2net driver crashed, causing the network device on the server to shut down. I quickly reloaded the module and restarted networking. This driver crash is a known issue across many of our older HP blades. I believe RHEL is still working on a fix, Bug 831054.
I'm not sure the Hg load spike was caused by this network outage on buildbot2. The first load spike came and went before the buildbot2.db outage. With this in mind, we (IT) have not been able to recreate the circumstances that led to the be2net driver crash. It may have been increased traffic from Hg that caused the driver to crash.
Comment 3•12 years ago
Do you know what a long-term solution could be, so that something like this does not bring down continuous integration?
Thanks for the debrief!
Comment 4•12 years ago
In fact, the DBAs were not even paged about this issue. Since it was 100% network device related, MySQL stayed up (it's been up since 1/4 at 6:59 am Pacific time), so there was no worry of corruption or anything.
Comment 5•12 years ago
Armen - the long-term solution is to get a proper fix from RHEL.
Adding explicit link to the RHEL bug 831054
Depends on: 831054
Comment 7•12 years ago
Closing this out. The immediate issue is over, and there's an audit planned in Q3 of all machines: https://bugzilla.mozilla.org/show_bug.cgi?id=883228
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
Product: mozilla.org → Data & BI Services Team