Closed Bug 662100 Opened 13 years ago Closed 13 years ago

Analyze mini failure statistics to estimate remaining lifespan of r3 talos

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: zandr, Assigned: dustin)

Details

Attachments

(1 file)

reboots.csv, 13 years ago, Dustin J. Mitchell [:dustin] (he/him), 7.96 KB, application/octet-stream

Zandr Milewski [:zandr]

Reporter

Description

•

13 years ago

The failure rate of r3 minis seems to be rising. The number of required reboots has been increasing, we have a few minis with dead drives, and we have at least one indication that the 'gray screen' reboots may be related to drive problems.

We should do some data mining in bugzilla to estimate the remaining lifespan of this batch of machines, to inform decisions about repairing or retiring this pool.

Zandr Milewski [:zandr]

Reporter

Comment 1

•

13 years ago

Instances of dead drives that I know about:

Bug 655437 - talos-r3-leopard-007 - burning jobs with "Device not configured"
Bug 660303 - I/O error on talos-r3-snow-051
Bug 661377 - hardware problems on talos-r3-xp-045

The 'gray screen' AHT failure:
https://bugzilla.mozilla.org/show_bug.cgi?id=654499#c1

Zandr Milewski [:zandr]

Reporter

Comment 2

•

13 years ago

Capturing some brainstorming:
This dataset should also include the purchase date and in-rack location ('slot') from inventory.

'slot' encodes the position in the rack: the number to the left of the decimal point is the RU, counted from the bottom, and the number to the right of the decimal point is the left-to-right position. There are six or seven minis in each rack.

There might be something interesting to learn about thermal effects from the horizontal position. I would be suspicious about correlation with vertical position, since that also correlates to OS.
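For whoever builds the dataset, here is a minimal sketch of splitting a slot value into its two parts so that horizontal and vertical position can be analyzed separately. It assumes the inventory exports the slot as text in the form "RU.position"; the `Slot` type, the `parse_slot` helper, and the sample value are hypothetical and not taken from inventory.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    rack_unit: int  # RU, counted from the bottom of the rack
    position: int   # left-to-right position within that RU


def parse_slot(slot: str) -> Slot:
    """Split a slot string such as "12.3" into (RU, position).

    Assumption: the slot is stored as text. Parsing it as a float
    would lose information (for example, "12.10" would become 12.1).
    """
    ru_text, sep, pos_text = slot.strip().partition(".")
    if not sep or not ru_text.isdigit() or not pos_text.isdigit():
        raise ValueError(f"unexpected slot format: {slot!r}")
    return Slot(int(ru_text), int(pos_text))


example = parse_slot("12.3")  # hypothetical value, for illustration only
print(example.rack_unit, example.position)
```

Keeping the two parts as separate integer columns makes it straightforward to group failures by horizontal position alone, which is the comparison less entangled with OS.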

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 3

•

13 years ago

We've done 106 reboots in the last 4 months, at an average rate of a little under one per day.

Of those, only five were for snow-leopard. By category: 9 date problems, 21 dead fish, 44 gray screens (on 27 distinct hosts), and 6 powered off (all within the last month).

I'll do the inventory analysis and then get back with more interesting info.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 4

•

13 years ago

Stripping the dead fish and date problems, and only looking at each slave once, I get the following counts for position on the rack:

1: 8
2: 5
3: 5
4: 8
5: 10
6: 5
7: 3
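One rough way to check whether these counts differ from an even spread is a chi-square goodness-of-fit test against a uniform distribution. The sketch below is only illustrative: it assumes every position holds the same number of minis, which is probably not true (racks hold six or seven, so position 7 likely has fewer machines), and it uses the standard 5% critical value for 6 degrees of freedom.

```python
# Distinct failing hosts per left-to-right position, from the counts above.
observed = {1: 8, 2: 5, 3: 5, 4: 8, 5: 10, 6: 5, 7: 3}


def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts per position."""
    total = sum(counts.values())
    expected = total / len(counts)  # assumption: same number of minis per position
    return sum((o - expected) ** 2 / expected for o in counts.values())


statistic = chi_square_uniform(observed)

# Standard chi-square critical value at the 5% level with 7 - 1 = 6 degrees of freedom.
CRITICAL_5PCT_DF6 = 12.592
consistent_with_uniform = statistic < CRITICAL_5PCT_DF6

print(f"chi-square = {statistic:.2f}, consistent with uniform: {consistent_with_uniform}")
```

Under those assumptions the statistic stays well below the critical value, so these counts alone do not show a position effect. Weighting the expected counts by the real number of minis at each position would be a better test once inventory data is available.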

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 5

•

13 years ago

Sorting the hardware failures by asset tag (purchase date is fabulously inaccurate) shows no great pattern - a nice split between the 12/22/09 batch (asset tags in the low 3000s) and the 5/1/10 batch (asset tags ~3400).  The three mentioned above as being totally dead are from all over:

 talos-r3-leopard-007 - 12/22/09
 talos-r3-xp-045 - 5/1/10
 talos-r3-snow-051 - dunno, asset tag 4568

Failure counts by week:

week of
2/20 - 4
2/27 - 4
3/6  - 6
3/13 - 3
3/20 - 0 (missed bug?)
3/27 - 4
4/3  - 1 (missed bug?)
4/10 - 3
4/17 - 7
4/24 - 2
5/1  - 7
5/8  - 0
5/15 - 9
5/22 - 7
5/29 - 7
6/5  - 8

That looks pretty significant to me.  Conservatively assuming a mean of 5/wk for February and 7/wk for June, and a linear fit, that means we'll be doing 10/wk by December.  At that point, it certainly starts to impact performance!
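The arithmetic behind that projection can be sketched as follows. The first function reproduces the two-point linear extrapolation described above (5/wk in February, 7/wk in June). The second fits an ordinary least-squares line to the weekly counts as a cross-check; that fit is an addition for illustration, not the method used above, and it takes the two zero weeks at face value even though some may be missed bugs. All function and variable names are hypothetical.

```python
def two_point_projection(month, m0=2, r0=5.0, m1=6, r1=7.0):
    """Extrapolate failures/week linearly from (Feb, 5/wk) and (Jun, 7/wk)."""
    slope = (r1 - r0) / (m1 - m0)  # failures/week gained per month
    return r0 + slope * (month - m0)


# Weekly failure counts from the table above, week of 2/20 through week of 6/5.
weekly = [4, 4, 6, 3, 0, 4, 1, 3, 7, 2, 7, 0, 9, 7, 7, 8]


def least_squares(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


december_rate = two_point_projection(12)
slope, intercept = least_squares(weekly)

print(f"two-point projection for December: {december_rate:.0f}/wk")
print(f"least-squares trend: {slope:+.2f} failures/wk per week")
```

Both approaches show an upward trend. With only 16 noisy weekly points, either line is a rough guide and not a forecast.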

Finally, failure count by image:

fed: 17
fed64: 15
leopard: 1
snow: 4
w7: 25
xp: 10

w7 is, of course, adversely affected by our platinum club members, w7-036 and w7-032, which racked up 7 and 5 failures themselves, respectively.  Even allowing for that, there's a clear split between "Mac OS X" and "Other".  I don't think this is useful information for *this* revision of talos, but it is great food for thought on the next version of talos: let's not run Windows or Linux on Apple hardware!
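For reference, the Mac-versus-other split can be tallied directly from the per-image counts. This is a small sketch with hypothetical variable names. Note that the counts are not normalized by the number of hosts per image, since pool sizes are not given in this bug.

```python
# Failure counts by image, from the list above.
by_image = {"fed": 17, "fed64": 15, "leopard": 1, "snow": 4, "w7": 25, "xp": 10}
MAC_IMAGES = {"leopard", "snow"}

# The two w7 hosts called out above and their individual failure counts.
repeat_offenders = {"w7-036": 7, "w7-032": 5}

total = sum(by_image.values())
mac = sum(n for image, n in by_image.items() if image in MAC_IMAGES)
other = total - mac

# Same comparison with the two repeat offenders removed from the w7 count.
offender_failures = sum(repeat_offenders.values())
w7_adjusted = by_image["w7"] - offender_failures
other_adjusted = other - offender_failures

print(f"Mac OS X: {mac} of {total}; other: {other} of {total}")
print(f"w7 without repeat offenders: {w7_adjusted}; other without them: {other_adjusted}")
```

Dividing each count by the number of slaves running that image would turn these into per-host rates, which is the fairer comparison if the pools differ in size.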

That's all the questions I can think of to answer.  I'll attach the spreadsheet here and close, but please reopen if you have more interesting questions.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 6

•

13 years ago

Attached file reboots.csv

Oh, and I should add, since we've had relatively few permanent failures, I don't think there's any way to predict the future size of the pool.

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations


Categories

(Infrastructure & Operations :: RelOps: General, task)
