Closed
Bug 662100
Opened 13 years ago
Closed 13 years ago
Analyze mini failure statistics to estimate remaining lifespan of r3 talos
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: zandr, Assigned: dustin)
Details
Attachments
(1 file)
7.96 KB, application/octet-stream
The failure rate of r3 minis seems to be rising. The number of required reboots has been increasing, we have a few minis with dead drives, and we have at least one indication that the 'gray screen' reboots may be related to drive problems. We should do some data mining in bugzilla to estimate the remaining lifespan of this batch of machines, to inform decisions about repairing or retiring this pool.
Reporter
Comment 1•13 years ago
Instances of dead drives that I know about:
Bug 655437 - talos-r3-leopard-007 - burning jobs with "Device not configured"
Bug 660303 - I/O error on talos-r3-snow-051
Bug 661377 - hardware problems on talos-r3-xp-045
The 'gray screen' AHT failure: https://bugzilla.mozilla.org/show_bug.cgi?id=654499#c1
Reporter
Comment 2•13 years ago
Capturing some brainstorming: this dataset should also include the purchase date and in-rack location ('slot') from inventory. 'slot' encodes the RU in the rack (counting from the bottom) to the left of the decimal point, and the left-right position to the right of the decimal point. There are six or seven minis in each rack. There might be something interesting to learn about thermal effects from the horizontal position. I would be suspicious of any correlation with vertical position, since that also correlates with OS.
Assignee
Comment 3•13 years ago
We've done 106 reboots in the last 4 months, an average rate of a little under one per day. Of those, only five were for Snow Leopard. Breakdown:
9 date problems
21 dead fish
44 gray screens (on 27 distinct hosts)
6 powered off, all within the last month
I'll do the inventory analysis and then get back with more interesting info.
Assignee
Comment 4•13 years ago
Stripping out the dead fish and date problems, and counting each slave only once, I get the following counts for position on the rack:
1: 8
2: 5
3: 5
4: 8
5: 10
6: 5
7: 3
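As a quick sanity check (my own addition, not part of the original analysis), a Pearson chi-square statistic against a uniform spread over the seven positions suggests these counts are not obviously skewed:

```python
# Failure counts by rack position, from the comment above.
counts = [8, 5, 5, 8, 10, 5, 3]

total = sum(counts)                  # 44 failures
expected = total / len(counts)       # uniform expectation: 44/7 ~= 6.29 per position

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - expected) ** 2 / expected for o in counts)

# The 0.05 critical value for 6 degrees of freedom is about 12.59;
# chi2 ~= 5.64 is well below it, so no significant positional effect.
print(round(chi2, 2))
```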
Assignee
Comment 5•13 years ago
Sorting the hardware failures by asset tag (purchase date is fabulously inaccurate) shows no great pattern - a fairly even split between the 12/22/09 batch (asset tags in the low 3000s) and the 5/1/10 batch (asset tags ~3400). The three mentioned above as totally dead are from all over:
talos-r3-leopard-007 - 12/22/09
talos-r3-xp-045 - 5/1/10
talos-r3-snow-051 - dunno, asset tag 4568

Failure counts by week:
week of 2/20 - 4
2/27 - 4
3/6 - 6
3/13 - 3
3/20 - 0 (missed bug?)
3/27 - 4
4/3 - 1 (missed bug?)
4/10 - 3
4/17 - 7
4/24 - 2
5/1 - 7
5/8 - 0
5/15 - 9
5/22 - 7
5/29 - 7
6/5 - 8

That trend looks pretty significant to me. Conservatively assuming a mean of 5/wk for February and 7/wk for June, and a linear fit, we'll be doing 10/wk by December. At that point, it certainly starts to impact performance!

Finally, failure count by image:
fed: 17
fed64: 15
leopard: 1
snow: 4
w7: 25
xp: 10

w7 is, of course, adversely affected by our platinum club members, w7-036 and w7-032, which racked up 7 and 5 failures themselves, respectively. Even allowing for that, there's a clear bimodal distribution of "Mac OS X" vs. "Other". I don't think this is useful information for *this* revision of talos, but it is great food for thought for the next version: let's not run Windows or Linux on Apple hardware!

That's all the questions I can think of to answer. I'll attach the spreadsheet here and close, but please reopen if you have more interesting questions.
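The linear extrapolation above can be sketched as follows (a minimal illustration using the weekly counts from this comment; the raw least-squares fit is my own addition and comes out a bit steeper than the conservative 5-to-7/wk estimate):

```python
# Weekly failure counts from the comment, week of 2/20 through week of 6/5.
counts = [4, 4, 6, 3, 0, 4, 1, 3, 7, 2, 7, 0, 9, 7, 7, 8]
weeks = list(range(len(counts)))  # week index, 0 = week of 2/20

# Ordinary least-squares fit: y = intercept + slope * x
n = len(counts)
x_mean = sum(weeks) / n
y_mean = sum(counts) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(weeks, counts)) \
        / sum((x - x_mean) ** 2 for x in weeks)
intercept = y_mean - slope * x_mean

# Week of 2/20 plus 40 weeks lands in early December.
december = intercept + slope * 40
print(f"slope: {slope:+.2f}/wk, projected early-December rate: {december:.1f}/wk")
```

Fitting the noisy raw counts projects roughly 13/wk by December, somewhat above the 10/wk that the conservative endpoint-based estimate in the comment gives; either way the trend points the same direction.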
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee
Comment 6•13 years ago
Oh, and I should add, since we've had relatively few permanent failures, I don't think there's any way to predict the future size of the pool.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations