Bug 798380 (Closed): Opened 12 years ago, Closed 12 years ago

Run hardware diagnostics on talos-r4-lion-063.build.scl1

Categories: Infrastructure & Operations :: DCOps
Type: task
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: rail, Unassigned

... particularly RAM tests.
colo-trip: --- → scl1
Since lion-063 shares the same chassis as lion-064, please take both machines offline so that I can run the diagnostics.

Thanks,
Vinh
Hi Vinh, I have disabled both.
You can go ahead any time after the next 30 minutes or so (it is just finishing a job).
Can we go ahead with this now? I don't miss 063 in the least, but losing a good slave for three weeks just because it sits next to a bad one is a bit painful.
Sorry for the delay, guys. Hardware diagnostics did not find any problems with lion-063. I've brought both 063 and 064 back online.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
We seem to be hitting a lot of requests for hardware diagnostics that don't show any problems.  We should probably be trying reimages first (releng folks) since they seem to clear up a number of crufty filesystem issues.
I'd be in favor of that as soon as someone develops a system for actually testing reimaged slaves other than "throw them back into production, and hope that after a dozen invalid code bugs are filed and several patches are backed out or abandoned because of inexplicable failures on Try, someone will notice that a broken slave is still broken."

In the case of this one, which had absolutely nothing to do with the filesystem (so a reimage smells an awful lot like cargo-culting), about 500 runs of reftest without an unexplained failure ought to do it.

The other options I've come up with so far are to rename the broken slaves we insist on continuing to use, so that people might have a chance of noticing that their test failed on "talos-r4-snow-014-totally-unreliable," or to have a pool of slaves which are allowed to say that a test suite passed but not that it failed, only setting RETRY so that some other, unbroken slave reruns it when they fail. Dunno how many tests we have that would produce false positives when they should have failed but are run on broken hardware, though.
philor: Often machines come back having passed the Apple hardware checks, and operations has no way of validating beyond that point. The hardware looks fine and the OS passes its checks.

Releng: should questionable machines come back in a separate pool to be verified?
IMHO we should design a system that shepherds slaves from preproduction to production automatically, but only if they pass a whole set of jobs and/or burn-in tests.

Meanwhile, I will figure out what manual process we can put in place.
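To make that automatic shepherding concrete, here is a minimal sketch of one possible promotion rule, assuming we record per-slave job outcomes somewhere; the threshold, function name, and data shape are all hypothetical, not existing releng tooling:

    # Hypothetical burn-in gate: a preproduction slave is promoted only
    # after an unbroken streak of green jobs. Threshold is an assumption.
    REQUIRED_GREEN_JOBS = 50

    def should_promote(job_results):
        """job_results: outcomes for one slave, oldest first, e.g.
        ["success", "success", "retry", ...]."""
        if len(job_results) < REQUIRED_GREEN_JOBS:
            return False
        recent = job_results[-REQUIRED_GREEN_JOBS:]
        return all(result == "success" for result in recent)

    # Example: only the slave with an unbroken green streak gets promoted.
    preprod = {
        "talos-r4-lion-063": ["success"] * 50,
        "talos-r4-lion-064": ["success"] * 30 + ["retry"] + ["success"] * 19,
    }
    print([name for name, runs in preprod.items() if should_promote(runs)])

Anything that fails the gate would stay in preproduction (or get a diagnostics bug filed) instead of being thrown straight back at real pushes.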
We can make use of the preproduction masters and make sure that each slave takes a batch of jobs there first.
We want to look into making bug 712206 a way to prevent bad slaves from getting into the pool.
I want to add prospective slaves to the "see also" list for bug 712206.
We should collect the problems we see, determine how to diagnose them, and add a method that prevents a slave from starting if any issues are found.
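The "prevent a slave from starting" piece could be as simple as a wrapper that runs a few sanity checks before launching buildbot and refuses to start (non-zero exit) if any of them fail. A rough sketch, assuming Python 3.3+ and placeholder thresholds/paths; this is not the actual mechanism being tracked in bug 712206:

    #!/usr/bin/env python3
    # Hypothetical pre-start health check; exits non-zero so a wrapper
    # can decline to start the slave. Thresholds and paths are assumptions.
    import os
    import shutil
    import sys

    MIN_FREE_BYTES = 20 * 1024 ** 3          # assumed: 20 GB free on /
    REQUIRED_DIRS = ["/builds", "/tools"]    # assumed build-slave paths

    def check_disk():
        free = shutil.disk_usage("/").free
        return free >= MIN_FREE_BYTES, "free disk: %d bytes" % free

    def check_dirs():
        missing = [d for d in REQUIRED_DIRS if not os.path.isdir(d)]
        return not missing, "missing dirs: %s" % (missing or "none")

    def main():
        failed = False
        for name, check in (("disk", check_disk), ("dirs", check_dirs)):
            ok, detail = check()
            print("%s: %s (%s)" % (name, "OK" if ok else "FAIL", detail))
            failed = failed or not ok
        return 1 if failed else 0

    if __name__ == "__main__":
        sys.exit(main())

New failure modes we learn about (bad RAM, flaky disk, wrong clock) could each become another check in the same script.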
Assignee: server-ops → server-ops-dcops
Product: mozilla.org → Infrastructure & Operations