Closed Bug 892990 Opened 11 years ago Closed 10 years ago

Intermittent Win7/Win8 reftest-no-accel content/canvas/test/reftest/webgl-clear-test.html?__&_____&________ | application timed out after 330 seconds with no output

Categories

(Core :: Graphics: CanvasWebGL, defect)

x86
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox29 --- wontfix
firefox30 --- wontfix
firefox31 --- wontfix
firefox-esr24 --- wontfix
b2g-v1.2 --- wontfix
b2g-v1.3 --- wontfix
b2g-v1.4 --- wontfix
b2g-v2.0 --- wontfix

People

(Reporter: RyanVM, Assigned: jgilbert)

References

Details

(Keywords: intermittent-failure, Whiteboard: webgl-internal)

Attachments

(2 files)

This actually started a few days ago but was treated as a random one-off. Unfortunately, it's not going away. Appears to be Win8 reftest-no-accel only. Jeff, can you please take a look or suggest someone who can? The screenshots in the log don't seem to show anything overly useful in the failures I've looked at. Sometimes they show an all-white Firefox window or they don't show anything at all.

https://tbpl.mozilla.org/php/getParsedLog.php?id=25214902&tree=Mozilla-Inbound

WINNT 6.2 mozilla-inbound opt test reftest-no-accel on 2013-07-12 04:08:21 PDT for push 216b74874ef0
slave: t-w864-ix-077

04:17:38     INFO -  REFTEST TEST-START | file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-clear-test.html?__&_____&________
04:17:38     INFO -  REFTEST TEST-LOAD | file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-clear-test.html?__&_____&________ | 4258 / 9436 (45%)
04:23:08  WARNING -  TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-clear-test.html?__&_____&________ | application timed out after 330 seconds with no output
04:23:08     INFO -  args: ['C:\\slave\\test\\build\\tests\\bin\\screenshot.exe', 'c:\\users\\cltbld~1.t-w\\appdata\\local\\temp\\mozilla-test-fail_s4rbkh']
04:23:08     INFO -  SCREENSHOT: <see log>

command timed out: 1200 seconds without output, attempting to kill
Flags: needinfo?(jgilbert)
I take that back, Win7 too.
https://tbpl.mozilla.org/php/getParsedLog.php?id=25214248&tree=Fx-Team
Summary: Intermittent Win8 reftest-no-accel content/canvas/test/reftest/webgl-clear-test.html?__&_____&________ | application timed out after 330 seconds with no output → Intermittent Win7/Win8 reftest-no-accel content/canvas/test/reftest/webgl-clear-test.html?__&_____&________ | application timed out after 330 seconds with no output
We're about to land some changes to the webgl reftests, so let's check if those fix this.
Flags: needinfo?(jgilbert)
I'll add the cause I suspect as well:
What was triggering this in some recent changes was RAF not being triggered because we weren't drawing to the canvas, thus not enqueueing an RAF event, thus not triggering the end of the reftest, leaving it to time out.

We should probably add a failsafe setTimeout to assure that this doesn't happen if this is indeed the issue.
Jeff, is this something Guillaume can take care of?
(In reply to Milan Sreckovic [:milan] from comment #45)
> Jeff, is this something Guillaume can take care of?
Flags: needinfo?(jgilbert)
Probably not. This isn't an issue with WebGL per se, but rather is almost certainly a Layers issue.

Did anything in Win8 layers change around this date, Bas?
Flags: needinfo?(jgilbert) → needinfo?(bas)
Perhaps OMTC was switched on around this time? The metro folks would know that, I seriously doubt anything changed in Win8 desktop layers around this time.
Flags: needinfo?(bas)
(In reply to Bas Schouten (:bas.schouten) from comment #65)
> Perhaps OMTC was switched on around this time? The metro folks would know
> that, I seriously doubt anything changed in Win8 desktop layers around this
> time.

Jim, any ideas? As much fun as going in circles is.
Flags: needinfo?(jmathies)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #80)
> (In reply to Bas Schouten (:bas.schouten) from comment #65)
> > Perhaps OMTC was switched on around this time? The metro folks would know
> > that, I seriously doubt anything changed in Win8 desktop layers around this
> > time.
> 
> Jim, any ideas? As much fun as going in circles is.

omtc hasn't been switched on for desktop AFAIA.
Flags: needinfo?(jmathies)
Have we tried to piece together a regression window?
(In reply to Jim Mathies [:jimm] from comment #82)
> Have we tried to piece together a regression window?

Outside of early July, no. And at this point, doing much better than that is going to be rather difficult. Not impossible, but difficult.
Looks like this merged is what regressed it. Will continue bisecting.
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3ff36cb6a1a&tochange=dde4dcd6fa46
Some possibly pertinent information given what the regression window is quickly narrowing down to. This has been happening since early July, yet there has never been a failure on Aurora/Beta in all this time. All failures have been on trunk only. Stay tuned for the possible relevance of this tidbit :)
Whelp, this appears to somehow be a regression from bug 785487 (I know, right?).

Try run reverted to 184127374d60 (the push prior to bug 785487):
https://tbpl.mozilla.org/?tree=Try&rev=ea074a0159e1

Try run reverted to f900daf35771 (the push for bug 785487):
https://tbpl.mozilla.org/?tree=Try&rev=673d2e64aace

I actually thought 184127374d60 was going to the push that it fell over on (hence comment 153) so I could argue that somehow it was causing trunk-only problems. No such luck :(

I have no idea how bug 785487 could have caused this, but Try seems to be pointing the finger at it. Gavin, any theories?
Flags: needinfo?(gavin.sharp)
And in a stark reminder of why I hate this bug with a passion, retriggers 201-250 managed to hit this failure on Win8 on ref 184127374d60. Rev a75301db9899 (the one prior to 184127374d60) already has 250 retriggers going for it without a failure, but what the hey, let's try some more. On the bright side, at least it would get back to the comment 153 theory if a75301db9899 holds up.

https://tbpl.mozilla.org/?tree=Try&rev=0f4d116ccd5b
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #156)
> https://tbpl.mozilla.org/?tree=Try&rev=0f4d116ccd5b

500 runs on both Win7 and Win8 without a single instance now. Going to run with rev 184127374d60 (Bug 888927) being the culprit.

Try run of m-c tip with 184127374d60 backed out:
https://tbpl.mozilla.org/?tree=Try&rev=e14d5fa7228b
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #158)
> Try run of m-c tip with 184127374d60 backed out:
> https://tbpl.mozilla.org/?tree=Try&rev=e14d5fa7228b

500/500 for both Win7 and Win8. Calling this a "regression" from bug 888927.
Blocks: 888927
Bug 888927 enabled telemetry-on-by-default code in tinderbox test builds, when it previously was only enabled for nightly builds. I have no idea how the telemetry code could be causing this reftest failure.
Flags: needinfo?(gavin.sharp)
Attached image screenshot
This screenshot (from the log in comment 160) looks weird. Completely blank window?
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #163)
> Created attachment 815910 [details]
> screenshot
> 
> This screenshot (from the log in comment 160) looks weird. Completely blank
> window?

There's another small blank window in the more recent screenshots.

We know what caused this, can we please get someone to look into why?
Flags: needinfo?(gavin.sharp)
I don't think I can help out much with this one, seems like a Windows gfx hacker is more likely to have insight.
Flags: needinfo?(gavin.sharp) → needinfo?(bas)
Whiteboard: webgl-internal
I know nothing about webgl...
Flags: needinfo?(bas) → needinfo?(jgilbert)
Assignee: nobody → jgilbert
Flags: needinfo?(jgilbert)
I retriggered this some. With sample-size of 13, we hit this twice on Win7 PGO Ru.
Looks like this is just a timeout where we don't get an rAF after doing a gl.clear().

Trivial fix that waits for the soonest of {rAF, setTimeout(MAX_DELAY)}.
Here's a Try run, though I accidentally set MAX_DELAY to 0.5ms:
https://tbpl.mozilla.org/?tree=Try&rev=60b86005cc63
Attached patch color-reRAFSplinter Review
Attachment #8404194 - Flags: review?(bjacob)
Comment on attachment 8404194 [details] [diff] [review]
color-reRAF

Review of attachment 8404194 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/canvas/test/reftest/webgl-utils.js
@@ +78,5 @@
> +    func();
> +  };
> +
> +  rAF(doneFunc);
> +  setTimeout(doneFunc, MAX_WAIT_FOR_COMPOSITE_DELAY_MS);

I hope that that won't just replace an intermittent timeout by an intermittent orange... could you maybe at least dump() a message here if the timeout is hit?
Attachment #8404194 - Flags: review?(bjacob) → review+
<jgilbert> bjacob, why would dump() be useful if we hit the timeout?
<bjacob> jgilbert: because it would give us a clue in the mochitest log
<jgilbert> bjacob, we sort of know the superficial reason that the test is timing out
<jgilbert> rAF is not being triggered
<jgilbert> actually, we haven't returned from JS
<jgilbert> hmm
<bjacob> jgilbert: but with your patch, if we 're unlucky, we can have intermitten oranges if the 500ms timeout races with the raf

I don't see how this is susceptible to racing, since either of {timeout, raf} will trigger the 'done' condition, and JS runs to completion, so we can't get execution of these callbacks interleaved.
Flags: needinfo?(bjacob)
My concern was that, assuming that it does matter that we actually wait for the compositor and that raf is a way to achieve that, which seems to be the premise for having this waitforcompositor function in the first place, then presumably it does matter whether we actually waited for the compositor or merely just timed out, i.e. presumably that could make the difference between passing and failing a test?

Anyway, I didn't mean these comments as blocking landing.
Flags: needinfo?(bjacob)
(In reply to Benoit Jacob [:bjacob] from comment #363)
> My concern was that, assuming that it does matter that we actually wait for
> the compositor and that raf is a way to achieve that, which seems to be the
> premise for having this waitforcompositor function in the first place, then
> presumably it does matter whether we actually waited for the compositor or
> merely just timed out, i.e. presumably that could make the difference
> between passing and failing a test?
> 
> Anyway, I didn't mean these comments as blocking landing.

Ah, so the issue is actually the opposite! Sometimes, rAF doesn't fire because the compositor is done already, somehow. (We should figure out why!)

That is:
function frame() {
  gl.clear();
  rAF(nextFrame); // might never run!
}
https://hg.mozilla.org/mozilla-central/rev/6f5a918051d5
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
(In reply to TBPL Robot from comment #371)

Crap.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla31 → ---
That was Fx-Team. It probably just didn't get merged to there yet. I think we can close this still.
Flags: needinfo?(ryanvm)
It did, the merge to fx-team was eight pushes below that.
I have no idea then. What I did was literally add a setTimeout(500ms) that should failsafe-terminate the test.
Flags: needinfo?(ryanvm)
Closing bugs where TBPLbot has previously commented, but have now not been modified for >3 months & do not contain the whiteboard strings for disabled/annotated tests or use the keyword leave-open. Filter on: mass-intermittent-bug-closure-2014-07
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: