Closed
Bug 785724
Opened 12 years ago
Closed 12 years ago
Please remove talos-r4-snow-014 from service
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 779332
People
(Reporter: dbaron, Unassigned)
There have been a few intermittent-orange bugs filed in layout recently that have looked like totally random memory corruption. Such bugs could sometimes be the result of a memory corruption bug in our code. However, when a memory corruption bug in our code causes failing tests it usually causes a noticeable spike in such problems; these were very infrequent -- infrequent enough to make them basically impossible to test in any way.
Furthermore, I noticed today that both of the reports I was most concerned about -- bug 775045 and bug 772725 -- occurred on the same slave, which made me suspect a problem that leads to memory corruption on that build slave. So I started looking for other bugs to see if there was a pattern. And there seems to be.
The following intermittent failure bugs have all occurred only on talos-r4-snow-014, and each is explicitly a report of a small amount of memory being corrupted in a way that doesn't interfere with program execution:
bug 769567 (known to be a single bit memory error)
bug 772725
bug 775045
The following failures have also occurred only on that slave:
bug 662991 comment 3
bug 765190
bug 766508 (suspicion of slave also expressed in bug)
bug 768827
bug 775942
bug 777751 (also a memory error)
I think this is enough evidence that we should remove this build machine (talos-r4-snow-014) from service.
I don't know if we gather data on failures that happen only on a single machine (we probably should), and thus I don't know how abnormal these numbers are. However, from a layout perspective alone, this single machine explains a significant portion of the totally-inexplicable orange that I've seen recently.
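As a rough illustration of the kind of per-machine data mentioned above, here is a minimal Python sketch that finds failures seen on only one machine and groups them by that machine. The record format (pairs of bug ID and machine name), the function name single_machine_failures, and the sample values are all hypothetical; they are not taken from any existing tool or from real failure logs.

```python
from collections import defaultdict


def single_machine_failures(failures):
    """Return {machine: [bug_id, ...]} for bugs seen on exactly one machine.

    `failures` is an iterable of (bug_id, machine) pairs; this input
    shape is an assumption made for the sketch.
    """
    # Collect the set of machines on which each bug has been observed.
    machines_by_bug = defaultdict(set)
    for bug_id, machine in failures:
        machines_by_bug[bug_id].add(machine)

    # Keep only the bugs whose every occurrence was on a single machine.
    bugs_by_machine = defaultdict(list)
    for bug_id, machines in machines_by_bug.items():
        if len(machines) == 1:
            (machine,) = machines
            bugs_by_machine[machine].append(bug_id)

    return {m: sorted(bugs) for m, bugs in bugs_by_machine.items()}


# Hypothetical records, for illustration only.
records = [
    ("bug-A", "machine-1"),
    ("bug-B", "machine-1"),
    ("bug-C", "machine-1"),
    ("bug-C", "machine-2"),  # bug-C is seen on two machines, so it is excluded
    ("bug-D", "machine-2"),
]
result = single_machine_failures(records)
```

A machine with an unusually long list in the result would be a candidate for closer inspection. Judging how abnormal a given count is would still need a baseline across all machines, which this sketch does not attempt.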
If it's worth, say, running a RAM test to see if the problematic memory can be isolated and then replacing it in order to put it back into service, you could do that. I'm not sure if that's more time than the value of the machine, though.
Updated•12 years ago
Blocks: t-snow-r4-0014
Reporter
Comment 1•12 years ago
Oops, looks like bug 779332 was also a report about the same machine, but I didn't catch it because it didn't have the machine name in any of the comments (only in the summary).
Reporter
Comment 2•12 years ago
Curiously enough, the one similar case I remembered that isn't this machine is its neighbor: talos-r4-snow-013, reported in bug 772440.
Comment 3•12 years ago
Turns out it was disabled on Aug 18 in bug 779332.
No longer blocks: t-snow-r4-0014
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard