Closed Bug 1452800 Opened 6 years ago Closed 6 years ago

Intermittent gfx/layers/apz/test/mochitest/test_bug1151663.html | found the RCD node

Categories

(Core :: Panning and Zooming, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: apavel, Assigned: mikokm)

References

Details

(Keywords: intermittent-failure, Whiteboard: [retriggered][stockwell fixed:other])

Attachments

(2 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1437295 +++

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9ffceba6cce1d2797c3c88fd8a469f174b48138e&selectedJob=172743449

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=172743449&repo=autoland&lineNumber=6520

[task 2018-04-09T21:17:56.699Z] 21:17:56     INFO - TEST-START | gfx/layers/apz/test/mochitest/test_bug1151663.html
[task 2018-04-09T21:17:58.277Z] 21:17:58     INFO - GECKO(5550) | Parent process: flushed APZ repaints, waiting for callback...
[task 2018-04-09T21:17:58.342Z] 21:17:58     INFO - GECKO(5550) | Flushed APZ repaints, waiting for callback...
[task 2018-04-09T21:17:58.560Z] 21:17:58     INFO - GECKO(5550) | Flushed APZ repaints, waiting for callback...
[task 2018-04-09T21:17:58.657Z] 21:17:58     INFO - TEST-INFO | started process screentopng
[task 2018-04-09T21:17:59.123Z] 21:17:59     INFO - TEST-INFO | screentopng: exit 0
[task 2018-04-09T21:17:59.124Z] 21:17:59     INFO - Buffered messages logged at 21:17:58
[task 2018-04-09T21:17:59.124Z] 21:17:59     INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_bug1151663.html | expected at least one paint in compositor test data 
[task 2018-04-09T21:17:59.124Z] 21:17:59     INFO - Buffered messages finished
[task 2018-04-09T21:17:59.125Z] 21:17:59     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_bug1151663.html | found the RCD node 
[task 2018-04-09T21:17:59.126Z] 21:17:59     INFO -     testBug1151663@gfx/layers/apz/test/mochitest/helper_bug1151663.html:51:7
[task 2018-04-09T21:17:59.126Z] 21:17:59     INFO -     promise callback*@gfx/layers/apz/test/mochitest/helper_bug1151663.html:57:5
[task 2018-04-09T21:17:59.128Z] 21:17:59     INFO - GECKO(5550) | JavaScript error: http://mochi.test:8888/tests/gfx/layers/apz/test/mochitest/helper_bug1151663.html, line 52: TypeError: rcd is null
[task 2018-04-09T21:22:56.704Z] 21:22:56     INFO - Not taking screenshot here: see the one that was previously logged
[task 2018-04-09T21:22:56.704Z] 21:22:56     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_bug1151663.html | Test timed out. 
[task 2018-04-09T21:22:56.705Z] 21:22:56     INFO -     reportError@SimpleTest/TestRunner.js:121:7
[task 2018-04-09T21:22:56.705Z] 21:22:56     INFO -     TestRunner._checkForHangs@SimpleTest/TestRunner.js:142:7
[task 2018-04-09T21:22:56.705Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.706Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.707Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.707Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.708Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.709Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.710Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.711Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.711Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.712Z] 21:22:56     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:22:56.713Z] 21:22:56     INFO -     TestRunner.runTests@SimpleTest/TestRunner.js:380:5
[task 2018-04-09T21:22:56.714Z] 21:22:56     INFO -     RunSet.runtests@SimpleTest/setup.js:194:3
[task 2018-04-09T21:22:56.714Z] 21:22:56     INFO -     RunSet.runall@SimpleTest/setup.js:173:5
[task 2018-04-09T21:22:56.715Z] 21:22:56     INFO -     hookupTests@SimpleTest/setup.js:266:5
[task 2018-04-09T21:22:56.715Z] 21:22:56     INFO - parseTestManifest@http://mochi.test:8888/manifestLibrary.js:36:5
[task 2018-04-09T21:22:56.716Z] 21:22:56     INFO - getTestManifest/req.onload@http://mochi.test:8888/manifestLibrary.js:49:11
[task 2018-04-09T21:22:56.716Z] 21:22:56     INFO - EventHandlerNonNull*getTestManifest@http://mochi.test:8888/manifestLibrary.js:45:3
[task 2018-04-09T21:22:56.716Z] 21:22:56     INFO -     hookup@SimpleTest/setup.js:246:5
[task 2018-04-09T21:22:56.717Z] 21:22:56     INFO - EventHandlerNonNull*@http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&manifestFile=tests.json&dumpOutputDirectory=%2Ftmp&cleanupCrashes=true:11:1
[task 2018-04-09T21:22:57.707Z] 21:22:57     INFO - GECKO(5550) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2018-04-09T21:22:57.708Z] 21:22:57     INFO - GECKO(5550) | MEMORY STAT | vsize 1577MB | residentFast 126MB | heapAllocated 22MB
[task 2018-04-09T21:23:00.716Z] 21:23:00     INFO - TEST-OK | gfx/layers/apz/test/mochitest/test_bug1151663.html | took 304018ms
[task 2018-04-09T21:23:03.725Z] 21:23:03     INFO - Error: Unable to restore focus, expect failures and timeouts.
[task 2018-04-09T21:23:03.762Z] 21:23:03     INFO - TEST-START | gfx/layers/apz/test/mochitest/test_bug1151667.html
[task 2018-04-09T21:23:07.952Z] 21:23:07     INFO - GECKO(5550) | Parent process: flushed APZ repaints, waiting for callback...
[task 2018-04-09T21:28:26.710Z] 21:28:26     INFO - Not taking screenshot here: see the one that was previously logged
[task 2018-04-09T21:28:26.711Z] 21:28:26     INFO - Buffered messages logged at 21:23:07
[task 2018-04-09T21:28:26.712Z] 21:28:26     INFO - Error: Unable to restore focus, expect failures and timeouts.
[task 2018-04-09T21:28:26.713Z] 21:28:26     INFO - must wait for load
[task 2018-04-09T21:28:26.714Z] 21:28:26     INFO - must wait for focus
[task 2018-04-09T21:28:26.714Z] 21:28:26     INFO - Buffered messages finished
[task 2018-04-09T21:28:26.715Z] 21:28:26     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_bug1151667.html | Test timed out. 
[task 2018-04-09T21:28:26.716Z] 21:28:26     INFO -     reportError@SimpleTest/TestRunner.js:121:7
[task 2018-04-09T21:28:26.716Z] 21:28:26     INFO -     TestRunner._checkForHangs@SimpleTest/TestRunner.js:142:7
[task 2018-04-09T21:28:26.717Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.717Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.718Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.718Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.719Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.720Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.720Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.721Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.722Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.722Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.723Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.724Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.724Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.725Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.725Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.726Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.727Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.728Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.728Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.729Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.729Z] 21:28:26     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2018-04-09T21:28:26.730Z] 21:28:26     INFO -     TestRunner.runTests@SimpleTest/TestRunner.js:380:5
[task 2018-04-09T21:28:26.731Z] 21:28:26     INFO -     RunSet.runtests@SimpleTest/setup.js:194:3
[task 2018-04-09T21:28:26.731Z] 21:28:26     INFO -     RunSet.runall@SimpleTest/setup.js:173:5
[task 2018-04-09T21:28:26.732Z] 21:28:26     INFO -     hookupTests@SimpleTest/setup.js:266:5
[task 2018-04-09T21:28:26.732Z] 21:28:26     INFO - parseTestManifest@http://mochi.test:8888/manifestLibrary.js:36:5
[task 2018-04-09T21:28:26.733Z] 21:28:26     INFO - getTestManifest/req.onload@http://mochi.test:8888/manifestLibrary.js:49:11
[task 2018-04-09T21:28:26.733Z] 21:28:26     INFO - EventHandlerNonNull*getTestManifest@http://mochi.test:8888/manifestLibrary.js:45:3
[task 2018-04-09T21:28:26.734Z] 21:28:26     INFO -     hookup@SimpleTest/setup.js:246:5
[task 2018-04-09T21:28:26.735Z] 21:28:26     INFO - EventHandlerNonNull*@http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&manifestFile=tests.json&dumpOutputDirectory=%2Ftmp&cleanupCrashes=true:11:1
[task 2018-04-09T21:28:27.712Z] 21:28:27     INFO - Not taking screenshot here: see the one that was previously logged
[task 2018-04-09T21:28:27.718Z] 21:28:27     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_bug1151667.html | [SimpleTest.finish()] waitForFocus() was called a different number of times from the number of callbacks run.  Maybe the test terminated prematurely -- be sure to use SimpleTest.waitForExplicitFinish(). - got 1, expected +0
Hello! I did some retriggers in order to find the where this originated and found the following push: https://goo.gl/tUHnNX
Before this push there are 4 pushes on which this fail does not appear. 

All of the fails appear only on webrender and the commit message from this push https://hg.mozilla.org/integration/autoland/rev/a4af53b3f62d38b50b75e0f0ea33cf7e158317dc shows it was worked on that.

Miko, could you please take a look at this? Thank you.
Flags: needinfo?(mikokm)
Whiteboard: [stockwell needswork][retriggered]
The retriggers do seem to point to miko's patch as the culprit, but it's not obvious to me how that change would have caused this failure - it looks like flushApzRepaints just never completes. It should be pretty easy to track down since the codepath in question is not that long. I tried reproducing locally and couldn't, but I'll do a try push with logging and see what's going on.
Oh I misread the log, I thought it was hanging in flushApzRepaints but that's in the subsequent test, not the one with the failure. The failure is that it's not finding the rcd node in the tree. Which is more strange, because we record a paint every time we rebuild the hit-testing tree [1]. And that in turns gets triggered from the call at [2] which we do for full transactions but not empty transactions.

In theory Miko's patch should only turn full transactions into empty transactions when the last full transaction was identical. So in this case we should have recorded the paint with the rcd node on the previous full transaction, so even if we skip recording the paint on the empty transaction it shouldn't matter.

The try push at [3] might shed some more light on this when I see what's in the compositor paints structure. But Miko - are there scenarios where we take this codepath even if the display lists are not totally identical?

[1] https://searchfox.org/mozilla-central/rev/4114ad2cfcbc511705c7865a4a34741812f9a2a9/gfx/layers/apz/src/APZCTreeManager.cpp#350
[2] https://searchfox.org/mozilla-central/rev/4114ad2cfcbc511705c7865a4a34741812f9a2a9/gfx/layers/wr/WebRenderBridgeParent.cpp#608
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb7d2c57f1f249083c25ea0ab863d1ba84ab4046
Ok, so based on the logging in the try push above, it looks like Miko's patch is breaking the forceLayerTreeToCompositor mechanism that we use in this test, which more or less makes sense. The test forces an extra content process paint which is intended to arrive in the compositor after the parent process layer tree has the right RefLayer. The content process display list might not have changed at all but we want to force the paint anyway. It's explained in more detail at [1]. With Miko's patch the transaction for this forced paint gets downgraded to an empty transaction and so we end up not recording the paint in the compositor.

So I guess if we recorded a paint on the compositor side even on empty transactions (which is what it looks like we do for non-webrender) then this wouldn't happen. Alternatively we could make forceLayerTreeToCompositor set a flag that forces a full transaction (i.e. bypasses the check Miko's patch added). The former will behave more like non-webrender, but will incur a lot of extra useless work because we'll end up rebuilding the APZ tree unnecessarily on empty transactions. I'm not really sure why we're doing that for non-webrender, that might be a bug.

[1] https://searchfox.org/mozilla-central/rev/4114ad2cfcbc511705c7865a4a34741812f9a2a9/gfx/layers/apz/test/mochitest/apz_test_utils.js#343
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> I'm not really sure why we're doing
> that for non-webrender, that might be a bug.

For non-webrender we seem to never actually send anything across the wire in the empty transaction case, and end up revoking the transaction id on the client side. So that's behaving as expected.

Given this I think the best fix would be to add some mechanism to bypass the check in Miko's patch, and use that mechanism in forceLayerTreeToCompositor so that we actually force the layer tree over. Miko, do you have suggestions on what's the best way to do this? If there's some existing mechanism that would be great otherwise we can add one somehow.
Thank you so much for looking into this Kats, it's a huge help as I am unfamiliar with this code.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> The try push at [3] might shed some more light on this when I see what's in
> the compositor paints structure. But Miko - are there scenarios where we
> take this codepath even if the display lists are not totally identical?

The code that toggles the identical display list-flag is quite easy to reason about, and I believe this scenario is not possible.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Given this I think the best fix would be to add some mechanism to bypass the
> check in Miko's patch, and use that mechanism in forceLayerTreeToCompositor
> so that we actually force the layer tree over. Miko, do you have suggestions
> on what's the best way to do this? If there's some existing mechanism that
> would be great otherwise we can add one somehow.

If I am understanding this correctly, the root problem is that forceLayerTreeToCompositor relies on the previous behavior (as in, the problem is with the test)? I think the simplest solution here might be to force layout flush, which in turn should cause display list rebuild, followed by a non-empty transaction. What do you think?
Flags: needinfo?(mikokm)
(In reply to Miko Mynttinen [:miko] from comment #9)
> If I am understanding this correctly, the root problem is that
> forceLayerTreeToCompositor relies on the previous behavior (as in, the
> problem is with the test)? I think the simplest solution here might be to
> force layout flush, which in turn should cause display list rebuild,
> followed by a non-empty transaction. What do you think?

That sounds reasonable, yeah. Will you write the patch for this?
Assignee: nobody → mikokm
Status: NEW → ASSIGNED
There are 297 failures in the last 7 days.
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-04-10&endday=2018-04-17&tree=trunk&bug=1452800

Miko, do you have any updates on this, or should we go ahead and disable it?
Flags: needinfo?(mikokm)
(In reply to Narcis Beleuzu [:NarcisB] from comment #15)
> There are 297 failures in the last 7 days.
> https://treeherder.mozilla.org/intermittent-failures.html#/
> bugdetails?startday=2018-04-10&endday=2018-04-17&tree=trunk&bug=1452800
> 
> Miko, do you have any updates on this, or should we go ahead and disable it?

I think we have a solution for this (comment 9), but this is a bit tricky to implement as this bug only triggers intermittently on CI. I think if we don't have a patch, say tomorrow, we should look into disabling it.
Flags: needinfo?(mikokm)
For the record, I would rather back out the regressing patch than disable the test, since the regressing patch is specific to QR, and therefore should pass QR tests in order to stay in the tree.
No, let's give it another day. I'll help Miko track this down today.
Flags: needinfo?(mikokm)
Attached patch Patch to repro problem (obsolete) — Splinter Review
Here's a patch that forces the problem to reproduce when run with
  MOZ_ACCELERATED=1 MOZ_WEBRENDER=1

Note that this also causes the problem to occur on non-webrender, and that actually has a different root cause. Miko's patch to flush a layout [1] seems to do the job, but I think we also need to ensure the repaint in forceLayerTreeToCompositor does a flush. With those changes locally the test seems to pass consistently for me.

I'll have the final patch up shortly since Miko is done for the day.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ca8e8e6fbe095e6c110c51537821173f26aa3e5
Whoops, that patch had extra stuff in it too. This is the patch I meant to attach.
Attachment #8968616 - Attachment is obsolete: true
The verify try push didn't actually run the test, but the retriggers in the other try push seem to be green so far.
Comment on attachment 8968649 [details]
Bug 1452800 - Trigger a layout flush in forceLayerTreeToCompositor.

https://reviewboard.mozilla.org/r/237332/#review243134
Attachment #8968649 - Flags: review?(bugmail) → review+
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a89554c69f61
Trigger a layout flush in forceLayerTreeToCompositor. r=kats
https://hg.mozilla.org/mozilla-central/rev/a89554c69f61
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Whiteboard: [retriggered][stockwell disable-recommended] → [retriggered][stockwell fixed:other]
You need to log in before you can comment on or make changes to this bug.