Closed Bug 785724 Opened 12 years ago Closed 12 years ago

Please remove talos-r4-snow-014 from service

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 779332

People

(Reporter: dbaron, Unassigned)

References

Details

There have been a few intermittent-orange bugs filed in layout recently that have looked like totally random memory corruption.  Such bugs could sometimes be the result of a memory corruption bug in our code.  However, when a memory corruption bug in our code causes failing tests it usually causes a noticeable spike in such problems; these were very infrequent -- infrequent enough to make them basically impossible to test in any way.

Furthermore, I noticed today that both of the reports I was most concerned about -- bug 775045 and bug 772725 -- occurred on the same slave, which made me suspect a problem that leads to memory corruption on that build slave.  So I started looking for other bugs to see if there was a pattern.  And there seems to be.

The following intermittent failure bugs have all occurred only on talos-r4-snow-014, and each is explicitly a report of a small amount of memory being corrupted in a way that doesn't interfere with program execution:
bug 769567 (known to be a single bit memory error)
bug 772725
bug 775045
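
To illustrate what a "single bit memory error" looks like when comparing an expected value against the corrupted one, here is a minimal sketch. The function names are hypothetical, and this is not necessarily how the error in bug 769567 was identified; it only shows the usual XOR check for a one-bit difference.

```python
def differs_by_single_bit(expected: int, observed: int) -> bool:
    """Return True if the two values differ in exactly one bit position."""
    diff = expected ^ observed  # bits set where the values disagree
    # A nonzero value with exactly one bit set satisfies diff & (diff - 1) == 0.
    return diff != 0 and (diff & (diff - 1)) == 0


def flipped_bit_index(expected: int, observed: int) -> int:
    """Return the index (0 = least significant) of the single flipped bit."""
    if not differs_by_single_bit(expected, observed):
        raise ValueError("values do not differ by exactly one bit")
    return (expected ^ observed).bit_length() - 1
```

A corrupted value that passes this check is consistent with a hardware bit flip, although it does not by itself rule out a software cause.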

The following failures have also occurred only on that slave:
bug 662991 comment 3
bug 765190
bug 766508 (suspicion of slave also expressed in bug)
bug 768827
bug 775942
bug 777751 (also a memory error)

I think this is enough evidence that we should remove this build machine (talos-r4-snow-014) from service.

I don't know if we gather data on failures that happen only on a single machine (we probably should), and thus I don't know how abnormal these numbers are.  However, from a layout perspective alone, this single machine explains a significant portion of the totally-inexplicable orange that I've seen recently.
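
As a rough idea of what gathering that data could look like, here is a minimal sketch. It assumes failure records are available as (bug id, machine name) pairs; that record layout and the function name `single_machine_bugs` are hypothetical and do not describe any existing tool.

```python
from collections import defaultdict


def single_machine_bugs(records, min_bugs=2):
    """Find machines that are the only place certain bugs were ever seen.

    records: iterable of (bug_id, machine_name) pairs (hypothetical layout).
    Returns {machine: sorted bug ids seen only on that machine}, keeping
    machines that have at least min_bugs such bugs.
    """
    machines_by_bug = defaultdict(set)
    for bug_id, machine in records:
        machines_by_bug[bug_id].add(machine)

    bugs_by_machine = defaultdict(list)
    for bug_id, machines in machines_by_bug.items():
        if len(machines) == 1:  # every occurrence was on one machine
            (only_machine,) = machines
            bugs_by_machine[only_machine].append(bug_id)

    return {m: sorted(b) for m, b in bugs_by_machine.items() if len(b) >= min_bugs}


if __name__ == "__main__":
    # Illustrative input: three bugs from this report, plus one made-up
    # bug (id 1) seen on two placeholder machines.
    sample = [
        (769567, "talos-r4-snow-014"),
        (772725, "talos-r4-snow-014"),
        (775045, "talos-r4-snow-014"),
        (1, "machine-a"),
        (1, "machine-b"),
    ]
    print(single_machine_bugs(sample))
```

A machine that shows up here with many bugs is a candidate for a hardware check, though how many is abnormal would still need a baseline across all machines.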

If it's worth, say, running a RAM test to see whether the problematic memory can be isolated and replaced so that the machine can go back into service, you could do that.  I'm not sure whether that would take more time than the machine is worth, though.
Blocks: 769567
Blocks: 772725
Blocks: 775045
Blocks: 765190
Blocks: 766508
Blocks: 768827
Blocks: 775942
Blocks: 777751
Oops, looks like bug 779332 was also a report about the same machine, but I didn't catch it because it didn't have the machine name in any of the comments (only in the summary).
Curiously enough, the one similar case I remembered that isn't this machine is its neighbor: talos-r4-snow-013, reported in bug 772440.
Turns out it was disabled on Aug 18 in bug 779332.
No longer blocks: t-snow-r4-0014
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard