Bug 798380 (Closed): Opened 12 years ago, Closed 12 years ago

Run hardware diagnostics on talos-r4-lion-063.build.scl1

Categories: Infrastructure & Operations :: DCOps
Type: task
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: rail, Unassigned

... particularly RAM tests.
colo-trip: --- → scl1
Since lion-063 shares the same chassis as lion-064, please take both machines offline so that I can run the diagnostics.

Thanks,
Vinh
Hi Vinh, I have disabled both.
You can go ahead any time after the next 30 minutes or so (it is just finishing a job).
Can we go ahead with this now? I don't miss 063 in the least, but losing a good slave for three weeks just because it sits next to a bad one is a bit painful.
Sorry for the delay, guys. Hardware diagnostics did not find any problems with lion-063. I've brought both 063 and 064 back online.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
We seem to be hitting a lot of requests for hardware diagnostics that don't show any problems.  We should probably be trying reimages first (releng folks) since they seem to clear up a number of crufty filesystem issues.
I'd be in favor of that as soon as someone develops a system for actually testing reimaged slaves other than "throw them back into production, and hope that after a dozen invalid code bugs are filed and several patches are backed out or abandoned because of inexplicable failures on Try, someone will notice that a broken slave is still broken."

In the case of this one, which had absolutely nothing to do with the filesystem (so a reimage smells an awful lot like cargo-culting), about 500 runs of reftest without an unexplained failure ought to do it.

The other options I've come up with so far are to rename the broken slaves we insist on continuing to use, so that people might have a chance of noticing that their test failed on "talos-r4-snow-014-totally-unreliable," or to have a pool of slaves which are allowed to say that a test suite passed but not that it failed, only setting RETRY so that some other, unbroken slave reruns it when they fail. Dunno how many tests we have that would produce false positives when they should have failed but are run on broken hardware, though.
philor: Often machines come back having passed the Apple hardware checks, and operations has no way of validating beyond that point. The hardware looks fine and the OS passes its checks.

Releng: should questionable machines come back in a separate pool to be verified?
IMHO we should design a system that shepherds slaves from preproduction to production automatically, but only if they pass a whole set of jobs and/or burn-in tests.

Meanwhile, I will figure out what manual process we can put in place.
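To make that automatic shepherding concrete, here is a minimal sketch of one possible promotion rule, assuming we record per-slave job outcomes somewhere; the threshold, function name, and data shape are all hypothetical, not existing releng tooling:

    # Hypothetical burn-in gate: a preproduction slave is promoted only
    # after an unbroken streak of green jobs. Threshold is an assumption.
    REQUIRED_GREEN_JOBS = 50

    def should_promote(job_results):
        """job_results: outcomes for one slave, oldest first, e.g.
        ["success", "success", "retry", ...]."""
        if len(job_results) < REQUIRED_GREEN_JOBS:
            return False
        recent = job_results[-REQUIRED_GREEN_JOBS:]
        return all(result == "success" for result in recent)

    # Example: only the slave with an unbroken green streak gets promoted.
    preprod = {
        "talos-r4-lion-063": ["success"] * 50,
        "talos-r4-lion-064": ["success"] * 30 + ["retry"] + ["success"] * 19,
    }
    print([name for name, runs in preprod.items() if should_promote(runs)])

Anything that fails the gate would stay in preproduction (or get a diagnostics bug filed) instead of being thrown straight back at real pushes.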
We can make use of the preproduction masters and make sure that each slave takes a batch of jobs there first.
We want to look into making bug 712206 a way to prevent bad slaves from getting into the pool.
I want to add prospective slaves to the "see also" list for bug 712206.
We should collect the problems we see, determine how to diagnose them, and add a method that prevents a slave from starting if any issues are found.
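The "prevent a slave from starting" piece could be as simple as a wrapper that runs a few sanity checks before launching buildbot and refuses to start (non-zero exit) if any of them fail. A rough sketch, assuming Python 3.3+ and placeholder thresholds/paths; this is not the actual mechanism being tracked in bug 712206:

    #!/usr/bin/env python3
    # Hypothetical pre-start health check; exits non-zero so a wrapper
    # can decline to start the slave. Thresholds and paths are assumptions.
    import os
    import shutil
    import sys

    MIN_FREE_BYTES = 20 * 1024 ** 3          # assumed: 20 GB free on /
    REQUIRED_DIRS = ["/builds", "/tools"]    # assumed build-slave paths

    def check_disk():
        free = shutil.disk_usage("/").free
        return free >= MIN_FREE_BYTES, "free disk: %d bytes" % free

    def check_dirs():
        missing = [d for d in REQUIRED_DIRS if not os.path.isdir(d)]
        return not missing, "missing dirs: %s" % (missing or "none")

    def main():
        failed = False
        for name, check in (("disk", check_disk), ("dirs", check_dirs)):
            ok, detail = check()
            print("%s: %s (%s)" % (name, "OK" if ok else "FAIL", detail))
            failed = failed or not ok
        return 1 if failed else 0

    if __name__ == "__main__":
        sys.exit(main())

New failure modes we learn about (bad RAM, flaky disk, wrong clock) could each become another check in the same script.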
Assignee: server-ops → server-ops-dcops
Product: mozilla.org → Infrastructure & Operations