Bug 784278 (Closed) - Opened 12 years ago, Closed 11 years ago

New tegras (and some old ones) failing in reftest intermittently

Categories: Release Engineering :: General, defect, P3
Hardware: x86_64
OS: Other
Type: defect

Tracking: (Not tracked)

Status: RESOLVED WORKSFORME

People: (Reporter: Callek, Unassigned)

Keywords: intermittent-failure

Attachments: (1 file)

So far we have many instances of our new batch of tegras failing while being run on one of our previous mac foopies.

Some of these same failing tegras have also passed some test runs, across trees: try, m-i, m-c, etc.

The screenshot data: URLs appear to show a completely blank [white] screen.
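
For anyone poking at the logs, here is a minimal sketch of how one might confirm the all-white screenshots programmatically. It assumes Python with Pillow installed and that the data: URL has already been copied out of a log; the function name is only illustrative, not part of our harness.

import base64
import io

from PIL import Image  # assumes Pillow is available locally


def is_blank_white(data_url):
    """Decode a reftest screenshot data: URL and report whether it is a flat white image."""
    # The logged URLs look like "data:image/png;base64,<payload>".
    _header, payload = data_url.split(",", 1)
    image = Image.open(io.BytesIO(base64.b64decode(payload))).convert("RGB")
    # getcolors(maxcolors=1) returns None if the image has more than one color.
    colors = image.getcolors(maxcolors=1)
    return colors is not None and colors[0][1] == (255, 255, 255)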

I welcome *any* and *all* ideas on what to look for, or offers to get hands-on with a tegra or two, or even with the foopy itself.

See any of:

https://secure.pub.build.mozilla.org/buildapi/recent/tegra-306
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-305
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-304 (no orange reftests yet)
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-302
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-300
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-299
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-298
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-297
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-296
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-295
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-294
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-293
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-290

I am hoping we can find out what is different/problematic here rather than needing to junk/return these tegras.
Some random observations:

 - there are both UNEXPECTED-FAIL and UNEXPECTED-PASS failures in all of these logs; they seem to be in nearly equal proportion
 - the set of failing tests is consistent; the 2 R1 logs are virtually identical, and the same is true for the 2 R2 logs and the 2 R3 logs (a quick way to tally and diff these is sketched just after this list)
 - the environment, SUT version, OS version and everything else I could think to check looks the same as on "old" tegras; the only difference I have spotted is:
  HOME=/Users/cltbld
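
A minimal sketch of how one might tally those UNEXPECTED-FAIL / UNEXPECTED-PASS counts and diff the failing-test sets between two logs. Assumptions: the raw text logs are saved locally, the lines follow the usual "TEST-UNEXPECTED-FAIL | <test> | <reason>" format, and the script/log file names are only illustrative.

import re
import sys

# Hypothetical usage: python compare_reftest_logs.py r1-tegra-305.log r1-tegra-306.log
FAILURE_RE = re.compile(r"TEST-UNEXPECTED-(FAIL|PASS) \| ([^|]+) \|")

def unexpected_results(path):
    """Return a set of (kind, test) pairs for every unexpected result in a reftest log."""
    results = set()
    with open(path) as log:
        for line in log:
            match = FAILURE_RE.search(line)
            if match:
                results.add((match.group(1), match.group(2).strip()))
    return results

first, second = unexpected_results(sys.argv[1]), unexpected_results(sys.argv[2])
for kind in ("FAIL", "PASS"):
    print("UNEXPECTED-%s in first log: %d" % (kind, sum(1 for k, _ in first if k == kind)))
print("only in first log:", sorted(first - second))
print("only in second log:", sorted(second - first))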
Depends on: 784545
I see the same as Geoff here. I don't know why setting the home directory would matter for any of our code, so that doesn't feel like it should cause this.  It really looks like the tegras are simply not rendering graphics during these test runs, as unusual as that would be.
I'm running tegra 305 on my desk with exactly the same set of builds and tests as it ran in the automation (different host-utils, because I can't download the one the tegras use), and it runs perfectly every single time.

It's done 5 runs now and never failed.

I'm running it in a loop 50 times, rebooting between each run. I'll check back on it in a few hours and see what happens.
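
In case anyone wants to reproduce the same loop, a minimal sketch under stated assumptions: run_reftests.sh and reboot_tegra.sh are hypothetical wrapper scripts standing in for the local harness invocation and the device reboot/wait, not scripts from our automation.

import subprocess

# Repeat the local run 50 times, rebooting between runs and keeping every log.
for run in range(1, 51):
    log_path = "reftest-run-%02d.log" % run
    with open(log_path, "w") as log:
        # Hypothetical wrapper around the local reftest invocation.
        subprocess.call(["./run_reftests.sh"], stdout=log, stderr=subprocess.STDOUT)
    with open(log_path) as log:
        if any("UNEXPECTED" in line for line in log):
            print("run %d hit unexpected results; see %s" % (run, log_path))
    # Hypothetical wrapper that reboots the tegra and waits for it to come back.
    subprocess.call(["./reboot_tegra.sh"])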
No longer blocks: 767447
So, overnight, tegra-305 ran the same code that caused the intermittent orange in comment 1 fifty-five times in succession and never failed a single test.

When I went into 2.IDF, the tegras were sitting on a metal shelf with styrofoam insulating them from the metal shelf at the bottom. In my cursory review of the set of tegras (while I was looking for tegra 305), I found that most or all of the tegras having this issue are the ones on the ends of the shelves, resting against the metal supports on either side of the shelf. I recommend that we insulate these better, because the metal may be adding conductivity across the board where there shouldn't be any, causing these intermittent issues.

I will definitely agree that the theory sounds kind of wacko, but given that the software on the mac foopy running these didn't change, and that removing the tegra from that environment and running the same software on it in a new environment changed the behavior, I'm thinking our one variable left is the physical environment itself. I'll file a DC-ops bug to get more insulation around these devices, and have it block this one.
(In reply to Clint Talbert ( :ctalbert ) from comment #9) 
> I will definitely agree that the theory sounds kind of wacko, but given that
> the software on the mac foopy running these didn't change, and that removing
> the tegra from that environment and running the same software on it in a new
> environment changed the behavior, I'm thinking our one variable left is the
> physical environment itself. I'll file a DC-ops bug to get more insulation
> around these devices, and have it block this one.

FWIW, I don't think this sounds wacko at all. We had a similar issue a few years ago with the iX machines where a certain combination of flooring materials, racks, and fan/drive harmonics in different colos caused degraded performance on *some* of the iX machines. These are often the craziest situations to debug, so kudos if this works. Fingers crossed here.
Blocks: 784767
Bug 784767 should be done now; I am 90% sure that comment #12 here was before that work was done. So starting now we should be on the lookout for more cases of this. (I'm hoping it does not repeat.)
No longer blocks: 784767
Depends on: 784767
tegra-367
https://tbpl.mozilla.org/php/getParsedLog.php?id=14755449&tree=Firefox
tegra-298

(We'll be missing a fair number of instances of this that'll wind up in bug 663657 since it sometimes times out, like this one did, after the several hundred failures.)
This doesn't seem limited to Mac foopy builds.  Many of the above failures have foopy_type 'Linux'.
Summary: New tegras failing in reftest (on a *mac* foopy) intermittently → New tegras failing in reftest (on a *mac* (or linux?) foopy) intermittently
(In reply to Matt Brubeck (:mbrubeck) from comment #79)
> This doesn't seem limited to Mac foopy builds.  Many of the above failures
> have foopy_type 'Linux'.

Good point; the initial reasoning for calling it out was that this was not an issue with the Linux foopy alone [and that was before I brought the Linux foopy to production for any new tegras].
Summary: New tegras failing in reftest (on a *mac* (or linux?) foopy) intermittently → New tegras failing in reftest intermittently
https://tbpl.mozilla.org/php/getParsedLog.php?id=14863796&tree=Mozilla-Inbound
tegra-338

(another triple, but on my push, which is less funny)
Whiteboard: [orange]
Is this just the new normal, and from now on 10 or 20 reftest runs a day will fail this way to go along with the 10 or 20 reftest runs a day that will time out?
Blocks: 438871
We can't seem to repro this anywhere. I'd like to see if the fixes from bug 737961, which will eliminate the need to run at a massive resolution, will fix this. When we eliminated the need for the 800 x 1000 resolution from the jsreftests and crashtests, those frameworks became far more stable and green.
Depends on: 737961
That's two inbound pushes in a row, one hit this on all three reftest hunks, the next hit this on two of the three, and bug 660480. Callek asked me early on whether this was bad enough that we would be better off not running the new tegras at all instead of enduring it. At the time, the answer was no, we didn't need to shut them off. Now the answer is yes, they are in some way broken, and need to go away until they get better.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15075660&tree=Firefox
tegra-073 (which has only existed since Thursday, so yeah, "new")
https://tbpl.mozilla.org/php/getParsedLog.php?id=15119800&tree=Mozilla-Inbound
tegra-302

That's the retriggered run, on a push which probably broke R1. We thought it probably didn't, despite attempts on try to show how it was breaking it, because we've lost pretty much all faith in the ability of tegras to run reftests anymore.
(In reply to Clint Talbert ( :ctalbert ) from comment #6)
> I'm running tegra 305 on my desk with exactly the same set of builds and
> tests as it ran in the automation (different host-utils, because I can't
> download the one the tegras use), and it runs perfectly every single time.
> 
We can get you the same host-utils.
Depends on: 790689
Depends on: 790698
This will soon slow down and be solved, since we won't run reftests on the new batches of tegras (bug 790698).
Or it'll be morphing into something even weirder, since there are a very few older tegras getting infected with the all-white-reftest disease.

https://tbpl.mozilla.org/php/getParsedLog.php?id=15197834&tree=Mozilla-Inbound
tegra-280
https://tbpl.mozilla.org/php/getParsedLog.php?id=15222705&tree=Firefox
tegra-267

Which isn't a full set, and there probably are two separate things getting mixed together here.
Summary: New tegras failing in reftest intermittently → New tegras (and some old ones) failing in reftest intermittently
Depends on: 792212
OS: Windows 7 → Other
Priority: -- → P3
https://tbpl.mozilla.org/php/getParsedLog.php?id=15748376&tree=Mozilla-Inbound
tegra-094 (which makes me weepy, since that's the first one since the patch for bug 792212 landed)
Blocks: 803408
The new tegras seem to work the same as the old now, with the patch for bug 797942. Can we close this out and remove the special configuration preventing new tegras from running reftests?
Attached patch: time to revert (Splinter Review)
Attachment #674711 - Flags: review?(bugspam.Callek)
Comment on attachment 674711 [details] [diff] [review]
time to revert

staging first though! (and double check we have some of these new ones up in staging)
Attachment #674711 - Flags: review?(bugspam.Callek) → review+
Since bug 797942 is on Gecko 19, despite the 18 milestone, wouldn't the "time to revert" be when 19 hits mozilla-release in February?
Whiteboard: [orange]
Resolving WFM keyword:intermittent-failure bugs last modified >3 months ago, whose whiteboard contains none of:
{random,disabled,marked,fuzzy,todo,fails,failing,annotated,time-bomb,leave open}

There will inevitably be some false positives; for that (and the bugspam) I apologise. Filter on orangewfm.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering
Component: General Automation → General