Closed Bug 662100 Opened 13 years ago Closed 13 years ago

Analyze mini failure statistics to estimate remaining lifespan of r3 talos

(Reporter: zandr, Assigned: dustin)

The failure rate of r3 minis seems to be rising. The number of required reboots has been increasing, we have a few minis with dead drives, and we have at least one indication that the 'gray screen' reboots may be related to drive problems.

We should do some data mining in bugzilla to estimate the remaining lifespan of this batch of machines, to inform decisions about repairing or retiring this pool.
Instances of dead drives that I know about:

Bug 655437 - talos-r3-leopard-007 - burning jobs with "Device not configured"
Bug 660303 - I/O error on talos-r3-snow-051
Bug 661377 - hardware problems on talos-r3-xp-045

The 'gray screen' AHT failure:
https://bugzilla.mozilla.org/show_bug.cgi?id=654499#c1
Capturing some brainstorming:
This dataset should also include the purchase date and in-rack location ('slot') from inventory.

'slot' encodes two things: the number to the left of the decimal point is the rack unit (RU), counted from the bottom of the rack, and the number to the right of the decimal point is the left-to-right position within that RU. There are six or seven minis in each rack.

There might be something interesting to learn about thermal effects from the horizontal position. I would be suspicious of any correlation with vertical position, since that also correlates with OS.
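
A minimal sketch of splitting the slot value described above into its two parts. It assumes the inventory stores the slot as text such as "12.3"; the exact inventory format is not given here, and the names `Slot` and `parse_slot` are hypothetical.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    rack_unit: int  # RU, counted from the bottom of the rack
    position: int   # left-to-right position within that RU


def parse_slot(slot: str) -> Slot:
    """Split a slot string such as '12.3' into (rack_unit, position)."""
    # Parse as text rather than as a float, so that a value like '12.10'
    # is not silently read as 12.1.
    ru_text, _, pos_text = slot.strip().partition(".")
    if not ru_text or not pos_text:
        raise ValueError(f"expected 'RU.position', got {slot!r}")
    return Slot(int(ru_text), int(pos_text))


print(parse_slot("12.3"))
```

Keeping the horizontal position as its own integer makes it easy to group failures by position alone, which is the axis less likely to be confounded with OS.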
We've done 106 reboots in the last 4 months, at an average rate of a little under one per day.

Of those, only five were for snow-leopard. By cause: 9 date problems, 21 dead fish, 44 gray screens (on 27 distinct hosts), and 6 powered off (all within the last month).

I'll do the inventory analysis and then get back with more interesting info.
Stripping the dead fish and date problems, and only looking at each slave once, I get the following counts for position on the rack:

1: 8
2: 5
3: 5
4: 8
5: 10
6: 5
7: 3
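
The filtering behind the counts above could be sketched roughly as follows. The record layout (host, category, position), the category strings, and the sample rows are all made up for illustration; the real columns of reboots.csv are not shown in this bug.

```python
from collections import Counter

# Categories stripped before counting (labels are assumed spellings).
EXCLUDED = {"dead fish", "date problem"}


def count_by_position(reboots):
    """Count each host at most once, by rack position, skipping EXCLUDED."""
    seen = set()
    counts = Counter()
    for host, category, position in reboots:
        if category in EXCLUDED or host in seen:
            continue
        seen.add(host)
        counts[position] += 1
    return counts


# Hypothetical rows: host-a fails twice but is counted once,
# and host-b's dead fish is dropped entirely.
sample = [
    ("host-a", "gray screen", 5),
    ("host-a", "gray screen", 5),
    ("host-b", "dead fish", 2),
    ("host-c", "powered off", 1),
]
print(sorted(count_by_position(sample).items()))  # [(1, 1), (5, 1)]
```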
Sorting the hardware failures by asset tag (purchase date is fabulously inaccurate) shows no clear pattern: a nice split between the 12/22/09 batch (asset tags in the low 3000s) and the 5/1/10 batch (asset tags ~3400).  The three mentioned above as being totally dead are from all over:

 talos-r3-leopard-007 - 12/22/09
 talos-r3-xp-045 - 5/1/10
 talos-r3-snow-051 - dunno, asset tag 4568

Failure counts by week:

week of
2/20 - 4
2/27 - 4
3/6  - 6
3/13 - 3
3/20 - 0 (missed bug?)
3/27 - 4
4/3  - 1 (missed bug?)
4/10 - 3
4/17 - 7
4/24 - 2
5/1  - 7
5/8  - 0
5/15 - 9
5/22 - 7
5/29 - 7
6/5  - 8

That looks pretty significant to me.  Conservatively assuming a mean of 5/wk for February and 7/wk for June, a linear fit gives a rise of about 0.5/wk per month, which means we'll be doing 10/wk by December.  At that point, it certainly starts to impact performance!
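
For anyone who wants to redo the arithmetic, here is a rough sketch. The first function reproduces the two-point estimate above (months numbered 2, 6 and 12 for February, June and December). The second is an ordinary least-squares line through the weekly counts, which is my own addition and not how the 10/wk figure was derived; note that it takes the zero weeks at face value even though some may just be missed bugs.

```python
# Weekly failure counts from the table above, week of 2/20 through 6/5.
weekly = [4, 4, 6, 3, 0, 4, 1, 3, 7, 2, 7, 0, 9, 7, 7, 8]


def two_point_projection(x0, y0, x1, y1, x):
    """Extend the straight line through (x0, y0) and (x1, y1) out to x."""
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x - x1)


def least_squares(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# 5/wk in February (month 2), 7/wk in June (month 6), projected to month 12.
december = two_point_projection(2, 5, 6, 7, 12)
slope, intercept = least_squares(weekly)

print(f"two-point estimate for December: {december:.0f}/wk")
print(f"least-squares slope: {slope:.2f} failures/wk per week")
```

Both approaches show an upward trend; neither says anything about whether the growth stays linear.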

Finally, failure count by image:

fed: 17
fed64: 15
leopard: 1
snow: 4
w7: 25
xp: 10

w7 is, of course, adversely affected by our platinum club members, w7-036 and w7-032, which racked up 7 and 5 failures themselves, respectively.  Even allowing for that, there's a clear bimodal distribution of "Mac OS X" vs. "Other".  I don't think this is useful information for *this* revision of talos, but is great food for thought on the next version of talos: let's not run Windows or Linux on Apple hardware!
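
A quick sketch of the Mac-vs-other split, using only the per-image counts and the two outlier hosts quoted above. It does not normalize by the number of hosts per image, since those pool sizes are not listed here.

```python
# Failure counts by image, from the list above.
by_image = {"fed": 17, "fed64": 15, "leopard": 1, "snow": 4, "w7": 25, "xp": 10}
MAC_IMAGES = {"leopard", "snow"}

# The two w7 hosts with 7 and 5 failures respectively.
outliers = {"w7-036": 7, "w7-032": 5}

mac = sum(n for image, n in by_image.items() if image in MAC_IMAGES)
other = sum(n for image, n in by_image.items() if image not in MAC_IMAGES)

# Discount the two outlier hosts to see whether the gap survives.
w7_adjusted = by_image["w7"] - sum(outliers.values())
other_adjusted = other - sum(outliers.values())

print(f"Mac OS X: {mac}, other: {other}")  # Mac OS X: 5, other: 67
print(f"w7 without outliers: {w7_adjusted}, other without outliers: {other_adjusted}")
```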

That's all the questions I can think of to answer.  I'll attach the spreadsheet here and close, but please reopen if you have more interesting questions.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attached file reboots.csv
Oh, and I should add, since we've had relatively few permanent failures, I don't think there's any way to predict the future size of the pool.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations