Closed Bug 779332 (t-snow-r4-0014) Opened 12 years ago Closed 11 years ago

t-snow-r4-0014 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrubeck, Unassigned)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

Several mochitest runs have failed on this slave with some sort of silent error in the "unpack tests" step.  For example:
https://tbpl.mozilla.org/php/getParsedLog.php?id=14015967&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=14499503&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=14495965&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=14491115&tree=Ionmonkey

They aren't quite silent; it's just that unzip hides its error light under a bushel of output. One of those is a CRC error, another is the amusing "reftest/tests/editor/libeditor/html/pests/browserscope/lib/richtext2/richtext2/static/js/variables.js:  mismatching "local" filename (reftest/tests/editor/libeditor/html/tests/browserscope/lib/richtext2/richtext2/static/js/variables.js)," (we really should have more tests in directories named "pests").
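For anyone who wants these errors surfaced instead of buried in the unpack output, here is a minimal sketch in Python using the standard `zipfile` module. It is not what the "unpack tests" step actually runs; the function name `find_corrupt_members` is made up for illustration. It reads every member to the end, which forces the CRC check, and `zipfile` also raises when the name in the central directory and the name in the local header differ, which is the Python counterpart of unzip's mismatching "local" filename complaint.

```python
import zipfile
import zlib


def find_corrupt_members(archive):
    """Return (member_name, problem) pairs for entries that fail verification.

    `archive` is a path or a seekable binary file object (hypothetical helper,
    not part of any Mozilla harness).
    """
    problems = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            try:
                # open() compares the central-directory name with the
                # local-header name; reading to EOF triggers the CRC check.
                with zf.open(info) as member:
                    while member.read(1 << 16):
                        pass
            except (zipfile.BadZipFile, zlib.error, OSError, EOFError) as exc:
                problems.append((info.filename, str(exc)))
    return problems
```

An empty list means every member read back cleanly; anything else names the members that would otherwise be lost in the noise.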

19 red out of the last 500 jobs, but it's tough to say how many of the 38 test failures were just normal test failures, and how many were due to reading something wrong off the failing disk.
Summary: talos-r4-snow-014 test runs failing in unpack tests → Please disable talos-r4-snow-014 and run disk diagnostics on it, test runs failing in unpack tests
Whiteboard: [badslave?] → [badslave?][buildduty]
Disabled in slavealloc
Also possible memory corruption issues, see bug 785724.
Depends on: 785724
No longer depends on: 785724
Alias: talos-r4-snow-014
Priority: -- → P3
Summary: Please disable talos-r4-snow-014 and run disk diagnostics on it, test runs failing in unpack tests → talos-r4-snow-014 problem tracking
Whiteboard: [badslave?][buildduty] → [badslave?][buildduty][buildslaves][capacity]
No longer blocks: 785751
Depends on: 785751
The machine got a re-image, let's try it out again.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Is there a reason to think that replacing the software on the machine (that's what you mean by re-image, right?) is going to fix what appear to be hardware problems?
(In reply to David Baron [:dbaron] from comment #7)
> Is there a reason to think that replacing the software on the machine
> (that's what you mean by re-image, right?) is going to fix what appear to be
> hardware problems?

Reimaging has fixed various problems in the past; it's generally our first step. For what it's worth, the first job this slave took was green.
Most of them have been.  The problem is that it's the source of a decent number of false reports of failures.
Yeah, I don't think you can call this fixed when no hardware diagnostics were run.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Re-disabled in the meantime.
Depends on: 787281
Depends on: 794926
No longer depends on: 787281
Hardware diagnostics were run and came out clean. Going to try it again in production. If it has more issues, I guess we'll just decommission it.
Running jobs fine in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=16818117&tree=Mozilla-Inbound was the "pink pixel of death" reftest problem, where one bit for one pixel gets flipped, so perhaps we should have been blaming the memory rather than the disk while it was getting CRC failures.
And https://tbpl.mozilla.org/php/getParsedLog.php?id=16816997&tree=Mozilla-Inbound shortly before that, where it silently failed to start up httpd.js.
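To make the "one bit for one pixel" signature concrete, here is a rough sketch in Python of how a difference between two raw pixel buffers could be classified as a single flipped bit. This is an assumption-laden illustration, not how the reftest harness compares images; `single_bit_flip` and its arguments are hypothetical names.

```python
def single_bit_flip(expected: bytes, actual: bytes):
    """Return (byte_offset, bit_index) if the buffers differ in exactly one bit.

    Returns None if they are identical, differ in length, or differ in more
    than one bit. Hypothetical helper for illustration only.
    """
    if len(expected) != len(actual):
        return None
    found = None
    for offset, (a, b) in enumerate(zip(expected, actual)):
        diff = a ^ b
        if diff == 0:
            continue
        # A second differing byte, or more than one set bit in this byte,
        # means this is not a lone bit flip.
        if found is not None or diff & (diff - 1):
            return None
        found = (offset, diff.bit_length() - 1)
    return found
```

A lone flipped bit in an otherwise identical image points at the hardware more than at the renderer, which is the reasoning in the comment above.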
Depends on: 811925
Disabled in slavealloc pending resolution of bug 811925.
Logic board was replaced in bug 811925, then the machine was reimaged. I've hooked it up to puppet to squelch the puppet exception spam, and re-enabled it in slavealloc.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Whiteboard: [badslave?][buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity]
Product: mozilla.org → Release Engineering
Alias: talos-r4-snow-014 → t-snow-r4-0014
Summary: talos-r4-snow-014 problem tracking → t-snow-r4-0014 problem tracking
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard