Closed Bug 857705 (ppod) Opened 11 years ago Closed 11 years ago

find source of pink pixel of death and fix it

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

Attachments

(1 file)

We've currently got at least 6 test machines offline due to the pink pixel of death. I suspect we'll be getting more. We need to find out what causes this, and fix it.
Here's a log that contains the pink pixel of death:
https://tbpl.mozilla.org/php/getParsedLog.php?id=21233252&tree=Mozilla-Inbound#error0

For more context, two images that should be completely green differ in only one pixel:
http://cl.ly/O1xB

The differing pixel is rgb(0, 145, 0) in one image and rgb(0, 128, 0) in the other.

Should we ask whoever wrote the test to grab one of those machines and look at it?

Is this a permanent issue on these machines, or is it intermittent even on them?
If the pixel in question isn't needed for the test, we could potentially change the test to not use that pixel, making all of these machines available for testing.
There is no "the test."

The issue is that the reftest harness writes the color values for every pixel in the canvas to memory, and when it reads them back, one value comes back with a single bit flipped (or two, in that rare case in comment 1, since it wrote 10010001 and read 10000000).
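
For concreteness, here is a tiny arithmetic sketch (illustrative only, not harness code; the variable names are mine) confirming that the two channel values above differ in exactly two bits:

# Illustrative only: compare the written and read-back green-channel values
# from the rgb(0, 145, 0) vs rgb(0, 128, 0) pixel mentioned earlier.
written = 0b10010001    # 145, the value the harness wrote
readback = 0b10000000   # 128, the value it got back

diff = written ^ readback          # XOR leaves a 1 in every flipped bit position
flipped = bin(diff).count("1")
print(f"wrote {written:08b}, read {readback:08b}, {flipped} bit(s) flipped")
# -> wrote 10010001, read 10000000, 2 bit(s) flipped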

It's intermittent on a given machine, and completely random which reftest it hits (there is no "the test") and which pixel within that test it hits. There's something about reusing canvases in the reftest harness which explains why it sometimes hits just one test in a run, and sometimes hits several in a row like https://tbpl.mozilla.org/php/getParsedLog.php?id=19995987&tree=Mozilla-B2g18_v1_0_1 did, with the same pixel being off.

If you run reftests on a given busted machine hundreds of times, you may only get it once, since the best theory anyone has for it is a single bit of bad RAM, and you're not going to be getting the same chunk of physical RAM every time you run. If you run tests which are not reftests on that same machine hundreds of times, as I said in bug 845123, you may get one intermittent (in this case, more accurately random) failure, and you will have absolutely no way of identifying it as the result of this: how are you going to tell that a timeout or a GC crash was caused by a single bit of RAM flipping its value, causing something not to happen or a pointer to point where it ought not point?

The thing I find interesting is that we only hit it on OS X slaves, despite running tests (other than, now, Linux tests) for every OS on the same Mac hardware. Does Boot Camp, or do Windows and Linux themselves, do something different about error correction or avoiding bad RAM?
Is comment 3 a reasonably correct repeat of what I've been told?
Flags: needinfo?(dbaron)
(In reply to Phil Ringnalda (:philor) from comment #3)
> There's something about reusing canvases in the reftest harness which
> explains why it sometimes hits just one test in a run, and sometimes hits
> several in a row like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=19995987&tree=Mozilla-
> B2g18_v1_0_1 did, with the same pixel being off.
> 
> If you run reftests on a given busted machine hundreds of times, you may
> only get it once


Given these, I feel like we should put the machines back in the pool and treat this as an intermittent orange rather than reducing our capacity by pulling them permanently.
Or we could replace the RAM in them...
Comment 3 seems reasonable.

I don't think there's particularly strong evidence that it's RAM rather than some other component.

But the slaves that encountered the pink (or other color) pixel also encountered a higher rate of other intermittent failures, primarily crashes (which were the type of failure that led me to start investigating bug 787281 in the first place).  These other intermittent failures lead to substantially *more* confusion than the single pixel in the middle of a white field reftest failures, since they're less obviously hardware-related.  For example, let's take the first attachment (attachment 657101 [details]) on bug 787281, and look at the bugs associated with the first obviously-bad machine on the list (talos-r4-snow-013):
https://bugzilla.mozilla.org/buglist.cgi?bug_id=751057%2C%20754807%2C%20764116%2C%20764754%2C%20772441%2C%20774700%2C%20781813%2C%20781814%2C%20589445%2C%20628667%2C%20756198%2C%20758095%2C%20760056%2C%20768942%2C%20769564%2C%20780129%2C%20780577%2C%20780694%2C%20782501%2C%20782505%2C%20782931
This list of bugs (intermittent failures open at the time that occurred only on that machine) should have about 1-2 (or *maybe* 3) failures on it for a normal machine; this list has 21 failures that have occurred only on that machine.  Of those 21 failures, exactly 1 (bug 768942) was a single-pixel reftest failure.  All but one of the rest were crashes.

So if we run tests on these machines, should we show them in a funny color on tbpl to indicate that the failures come from a suspicious machine and might be ignorable?  But what if those failures turn out to be real?

I don't think we should continue spending thousands of dollars of engineering time debugging problems that could be fixed by tossing a machine worth hundreds of dollars into the garbage can.

And this isn't only a reftest problem.  This is a random data corruption problem.
Flags: needinfo?(dbaron)
Sorry, I did sort of get it wrong with that "get it once."

If you ran reftests a hundred times on a known-good revision, and if you ignored any odd crashes that claimed to be in system libraries or had GC on the stack (as the sheriffs mostly do now, since they are as likely to be the result of slaves with bad RAM as they are to be actual bugs in Gecko or SpiderMonkey, until we've seen them lots of times), you might *visibly* get it once.

There are three sorts of reftests (sketched in rough code after this list):

* ones which load two pages, and assert that every pixel in the two pages should match; these are the ones where you see PPoD failures as orange test runs

* ones which load two pages, and assert that there must be at least one pixel which differs between the two; if these get a PPoD, then they will either just ignore it and correctly pass because some other pixels were also different, or, they will produce a false positive, saying that a failing test passes because the Pink Pixel makes the two different when they are otherwise the same

* ones which load two pages, and assert that they must have no more than n pixels different, and that the biggest difference must be no more than m; if these get a PPoD, then they will either fail or pass despite it, one or the other
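
(Editorial aside: to make those three comparison modes concrete, here is a minimal sketch in Python. The real reftest harness is JavaScript and more involved; the helper below is hypothetical, not harness code.)

# Hypothetical sketch of the three reftest comparison modes; illustrative only.
# pixels_a and pixels_b are flat sequences of channel values for the two pages.
def reftest_compare(pixels_a, pixels_b, mode, max_pixels=0, max_diff=0):
    diffs = [abs(a - b) for a, b in zip(pixels_a, pixels_b) if a != b]
    if mode == "==":            # every pixel must match: a PPoD shows up as an orange run
        return len(diffs) == 0
    if mode == "!=":            # at least one pixel must differ: a PPoD can make a
        return len(diffs) > 0   # genuinely-failing (identical) rendering "pass"
    if mode == "fuzzy":         # at most max_pixels may differ, each by at most max_diff
        return len(diffs) <= max_pixels and (not diffs or max(diffs) <= max_diff)
    raise ValueError(f"unknown mode: {mode}")

The "!=" branch is the false-positive case described above; the fuzzy branch either absorbs or trips on the stray pixel depending on the thresholds.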

So on your known good revision, you can say that you know you got it once, and you may have gotten it between once and one hundred times since the 99 green runs could have all hit it in a way that doesn't show. On a hundred different revisions of unknown quality, you can say... nothing. They may be a hundred good revisions, one of which got a false orange, or they may be a hundred bad revisions which got false green, or one bad revision which got a false orange and a false green in the reftest it should have failed, or anything else. If one of these slaves runs reftests on a revision, you know slightly more than if it doesn't, but you absolutely do not know whether or not that revision actually passes reftests, no matter what result the run gets.

And that's just reftests, and just ignoring crashes and hangs and other non-comparison failures. The only reason we're not really finding them via bug 787281 anymore is that invalid bugs about crashes caused by these defective slaves, like bug 854839, mostly just don't get filed anymore. We just ignore it when a Mac test crashes, unless it happens so often that we start to recognize the stack.

And that may well be the real cost of these bad slaves - not the engineering time spent chasing after code bugs which do not exist, but the chilling effect that has caused us to stop filing bugs about intermittent failures, and has caused engineers not to look at the bugs when we do file them.
The counter-argument to "just throw the busted things away," which I can make as easily as my own argument, having taken both sides so many times before, is:

We're in this situation where we have a single pool of slaves that run both unittests and talos, and to run talos the entire pool needs to be identical hardware so we can meaningfully compare performance numbers across runs, which means that for a given OS X release, we buy n minis, and by the time we might want more Apple is no longer selling that hardware spec. Possibly, we can afford to throw away these six r4 minis, but there is absolutely no way to get more r4 minis, so at some point saying "throw away this slave" becomes "and also throw away the remaining perfectly fine 60 other slaves of the same class, buy 100 new minis, and spend thousands of dollars of releng, relops, and dcops time on racking and imaging and deploying an entirely new pool of slaves for an old version of OS X."

Possibly that's the right thing to do, buy 100 or 200 r5 if they are still available or more likely a new r6 mini, image them as 10.7 slaves, turn the remaining not-yet-broken 10.7 slaves into 10.6 slaves, and hope that'll carry us through until we drop 10.6, and if we're very lucky those r6 minis will still be what we can buy when we decide we want to buy 200 or 300 for 10.9 when it ships. More likely they would not be, and 10.9 would have to go on r7, and we'd be throwing out the r6 ones when we drop 10.7.
(In reply to Phil Ringnalda (:philor) from comment #9)
> The counter-argument to "just throw the busted things away," which I can
> make as easily as my own argument, having taken both sides so many times
> before, is:
> 
> We're in this situation where we have a single pool of slaves that run both
> unittests and talos, and to run talos the entire pool needs to be identical
> hardware so we can meaningfully compare performance numbers across runs,
> which means that for a given OS X release, we buy n minis, and by the time
> we might want more Apple is no longer selling that hardware spec. Possibly,
> we can afford to throw away these six r4 minis, but there is absolutely no
> way to get more r4 minis, so at some point saying "throw away this slave"
> becomes "and also throw away the remaining perfectly fine 60 other slaves of
> the same class, buy 100 new minis, and spend thousands of dollars of releng,
> relops, and dcops time on racking and imaging and deploying an entirely new
> pool of slaves for an old version of OS X."
> 
> Possibly that's the right thing to do, buy 100 or 200 r5 if they are still
> available or more likely a new r6 mini, image them as 10.7 slaves, turn the
> remaining not-yet-broken 10.7 slaves into 10.6 slaves, and hope that'll
> carry us through until we drop 10.6, and if we're very lucky those r6 minis
> will still be what we can buy when we decide we want to buy 200 or 300 for
> 10.9 when it ships. More likely they would not be, and 10.9 would have to go
> on r7, and we'd be throwing out the r6 ones when we drop 10.7.

No one has ever expressed it better ;)

I don't know how much budget we have this year besides getting the 10.9 machines.

I think we can soon start discussions to stop supporting 10.6 on tbpl and then re-purpose machines as lion machines. 

Replacing the memory on these machines is worth the money and time. Any objections if we try?

* talos-r4-snow-047
* talos-r4-snow-074
* talos-r4-lion-005
* talos-r4-lion-039
* talos-r4-lion-040
* talos-r4-lion-059
* talos-r4-lion-072
Blocks: 855282
Armen: why not start with running a memory test (memtest86) on each of those to confirm whether replacement of the RAM is actually needed?
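
(For illustration only: the write-pattern/read-back idea behind a tool like memtest86 looks roughly like the sketch below. It runs in user space against virtual memory, so it is no substitute for booting the real tool; it just mirrors the write-then-verify step the reftest harness is effectively performing with its canvas pixels.)

# Toy pattern test, illustrative only; not memtest86 and not a real diagnostic.
def pattern_check(n_bytes=16 * 1024 * 1024, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    buf = bytearray(n_bytes)
    for p in patterns:
        buf[:] = bytes([p]) * n_bytes           # write the pattern everywhere
        for offset, value in enumerate(buf):    # read it back and compare
            if value != p:
                return f"mismatch at offset {offset}: wrote {p:#04x}, read {value:#04x}"
    return "no mismatches found"

print(pattern_check())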
Depends on: 875811
(In reply to John Hopkins (:jhopkins) from comment #11)
> Armen: why not start with running a memory test (memtest86) on each of those
> to confirm whether replacement of the RAM is actually needed?

It makes sense! That comment was from when we did not know that memtest would catch the issues.
No longer blocks: 855282
It seems that all Mac failures (except talos-r4-lion-059) were due to bad RAM.
We seem to have found the source of the issue for Macs but not for t-w732-ix-118.

What do we do with this bug?

I've requested that talos-r4-lion-059 be re-imaged, and I'm leaning toward doing the same for the Windows machine.
Do these machines all use system memory for graphics, or do they have dedicated graphics memory?
The rev4 machines have whatever graphics card comes with them.
The t-w732-ix-* machines have an NVIDIA GeForce GT 610 GPU.
If the eventual full answer to that question is "the rev4 minis are the mid-2011 Core i7 ones, the ones with a Radeon HD 6630M with 256MB of dedicated graphics RAM, which is used by OS X but isn't used while running Linux or Windows through Bootcamp and which isn't checked by the diagnostic tools we use to check system RAM" then that would make a nice neat fit for both my question from the end of comment 3 and for the diagnostics disagreeing with the reftests about whether something's wrong.
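
(If someone wants to check that theory on a slave, one assumed-but-untested way to see what OS X reports for the GPU and its memory is to shell out to system_profiler, e.g.:)

# Hedged sketch: list the GPU model and VRAM that OS X reports, to see whether
# a slave has dedicated graphics memory that system-RAM diagnostics never touch.
# Assumes it runs on the Mac slave itself; the fields filtered for below
# ("Chipset Model", "VRAM") are what system_profiler normally prints.
import subprocess

out = subprocess.check_output(["system_profiler", "SPDisplaysDataType"], text=True)
for line in out.splitlines():
    if "Chipset Model" in line or "VRAM" in line:
        print(line.strip())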
(In reply to Armen Zambrano G. [:armenzg] (Release Engineering) from comment #13)
> What do we do with this bug?

Three possible directions to take it:

* determine why we never saw PPoD failures in Windows and Linux while they were running on the same hardware, in case that will tell us a way to get more reliable memory while running OS X on it

* determine whether we are actually running diagnostics on the memory that the reftests are using, or running diagnostics on system memory while they are using the memory on the graphics card

* declare that we've found the source of the PPoD, that it was exactly what we thought it was all along, one or more bad bits of memory, RESO FIXED.
Alias: ppod
Pretty much the last thing anyone wants to see (since among other things, it's completely and utterly unfixable), but https://tbpl.mozilla.org/php/getParsedLog.php?id=24129328&tree=Mozilla-Inbound is a PPoD on an ec2 slave.  

Assuming our thesis is correct that a PPoD is the result of a bad bit of memory, and that such bad memory will cause not only PPoDs but also random crashes and other random failures, and assuming my understanding is correct that any ec2 slave can do any run on any of Amazon's hardware, then any failure we see on an ec2 slave means absolutely nothing at all until we see it on another platform.
It's pretty clear at this point that this issue comes down to bad RAM. We've had great success replacing the RAM in machines suffering from this problem.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General