Closed Bug 1601568 Opened 3 years ago Closed 3 years ago

Intermittent TEST-UNEXPECTED-TIMEOUT | gfx/layers/apz/test/mochitest/test_group_zoom-2.html | application timed out after 370 seconds with no output

Categories

(Core :: Panning and Zooming, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox73 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: botond)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

Filed by: nerli [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=279740409&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Tg6jkxBGRPmQgzI1aZ9fPA/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Tg6jkxBGRPmQgzI1aZ9fPA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2019-12-05T08:21:45.355Z] 08:21:45 INFO - TEST-START | gfx/layers/apz/test/mochitest/test_group_zoom-2.html
[task 2019-12-05T08:21:46.572Z] 08:21:46 INFO - GECKO(1123) | WaitUntilApzStable: flushed APZ repaints in parent proc, waiting for callback...
[task 2019-12-05T08:21:46.581Z] 08:21:46 INFO - GECKO(1123) | WaitUntilApzStable: APZ flush done in parent proc
[task 2019-12-05T08:21:46.582Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:46.582Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:46.582Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: got apz-flush-done in child proc
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: done promiseFocus
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | done...
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: done promiseAllPaintsDone
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:46.590Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: all done
[task 2019-12-05T08:21:46.654Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:46.654Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:46.872Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:46.872Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:46.953Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:46.954Z] 08:21:46 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:47.110Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:47.110Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:47.126Z] 08:21:47 INFO - GECKO(1123) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2019-12-05T08:21:47.126Z] 08:21:47 INFO - GECKO(1123) | MEMORY STAT | vsize 2558MB | residentFast 167MB | heapAllocated 23MB
[task 2019-12-05T08:21:47.162Z] 08:21:47 INFO - TEST-OK | gfx/layers/apz/test/mochitest/test_group_zoom-2.html | took 1807ms
[task 2019-12-05T08:21:47.198Z] 08:21:47 INFO - TEST-START | Shutdown
[task 2019-12-05T08:21:47.199Z] 08:21:47 INFO - Passed: 9
[task 2019-12-05T08:21:47.200Z] 08:21:47 INFO - Failed: 0
[task 2019-12-05T08:21:47.201Z] 08:21:47 INFO - Todo: 0
[task 2019-12-05T08:21:47.201Z] 08:21:47 INFO - Mode: e10s
[task 2019-12-05T08:21:47.202Z] 08:21:47 INFO - Slowest: 1807ms - /tests/gfx/layers/apz/test/mochitest/test_group_zoom-2.html
[task 2019-12-05T08:21:47.202Z] 08:21:47 INFO - SimpleTest START Loop 2
[task 2019-12-05T08:21:47.203Z] 08:21:47 INFO - TEST-START | gfx/layers/apz/test/mochitest/test_group_zoom-2.html
[task 2019-12-05T08:21:47.360Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:47.361Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:47.397Z] 08:21:47 INFO - GECKO(1123) | WaitUntilApzStable: flushed APZ repaints in parent proc, waiting for callback...
[task 2019-12-05T08:21:47.398Z] 08:21:47 INFO - GECKO(1123) | WaitUntilApzStable: APZ flush done in parent proc
[task 2019-12-05T08:21:47.398Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: got apz-flush-done in child proc
[task 2019-12-05T08:21:47.398Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: done promiseFocus
[task 2019-12-05T08:21:47.398Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | done...
[task 2019-12-05T08:21:47.399Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: done promiseAllPaintsDone
[task 2019-12-05T08:21:47.399Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:47.399Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:47.399Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | WaitUntilApzStable: all done
[task 2019-12-05T08:21:47.453Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:47.454Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:47.711Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:47.711Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:47.772Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | got MozAfterPaint: 0,0,0,0
[task 2019-12-05T08:21:47.772Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | Dispatching 0 onpaint listeners
[task 2019-12-05T08:21:47.834Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: Flushed APZ repaints, waiting for callback...
[task 2019-12-05T08:21:47.834Z] 08:21:47 INFO - GECKO(1123) | helper_bug1280013.html | PromiseApzRepaintsFlushed: APZ flush done
[task 2019-12-05T08:21:47.854Z] 08:21:47 INFO - GECKO(1123) | MEMORY STAT | vsize 2554MB | residentFast 161MB | heapAllocated 23MB
[task 2019-12-05T08:21:47.890Z] 08:21:47 INFO - TEST-OK | gfx/layers/apz/test/mochitest/test_group_zoom-2.html | took 694ms

Flags: needinfo?(botond)
Regressed by: 1476893
Has Regression Range: --- → yes
Keywords: regression
Assignee: nobody → botond

The logging indicates that for this touch-drag, while the touch-end event makes it to APZCTM, it is not triggering a TransformEnd notification for some reason.

Trying to narrow it down further: https://treeherder.mozilla.org/#/jobs?repo=try&revision=90b5bb729bcf97b8bd8394a9fcb9410090d37782

The events do not make it to an APZC (not even the touch-start). Possibly they are stuck in the input queue.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e10136c25898eb873a2672a2e49acc62bc82a9be

The events are assigned a null target APZC via SetTargetAPZC.

After a few more rounds on Try, I have a partial diagnosis:

The test performs three touch drags. The first two are intended to bring an iframe into view, and the third one is intended to drag content inside the iframe. The iframe is not initially layerized, so the third touch-drag needs to go through the dispatch-to-content / SetTargetAPZC codepath.

On the failing test runs, for some reason (I don't understand why yet) the second touch drag targets the iframe rather than the root content document. As a result, the second touch drag takes the dispatch-to-content codepath.

When the touch-start event for the second drag arrives at content, content determines the target and checks if it's layerized. Since it's not, it sets a displayport, and takes the "layers-dependent" SetTargetAPZC codepath, where it registers a post-refresh observer and sends the SetTargetAPZC notification when the observer fires. This ensures that the SetTargetAPZC notification arrives at APZ after the layer transaction which tells APZ about the scroll id of the newly layerized iframe.

Now here's the key part: the main thread is busy (or the emulator is slow etc.), and takes a while to get around to doing a paint with the new displayport and fire the observer. As a result, the third touch-drag, which also targets the iframe, not only arrives at APZ before the SetTargetAPZC message arrives (and therefore also takes the DTC codepath -- which, in and of itself, is fine), but also arrives at content before the refresh observer from the second drag fires.

At this point, the check that the iframe is layerized passes (because it's checking for the displayport property, which has by now been set), so content figures it's fine to send the SetTargetAPZC notification for the third touch-drag right away, rather than taking the layers-dependent codepath. However, the layer transaction containing the iframe's scroll id hasn't been sent yet, and therefore the SetTargetAPZC notification, mentioning the iframe's scroll id, arrives at APZ before the layer transaction that establishes that scroll id. As a result, the APZC lookup results in a null APZC, we set that as the target, and end up dropping the touch events of the third drag.


Now, this mode of failure wouldn't happen in this test if the second touch drag were targeted correctly. However, the fact that the SetTargetAPZC notification can race the layer transaction in this way in the case where two subsequent input blocks target an element that isn't initially layerized, is a latent bug that should be fixed either way.

That's one of the greatest analysis for intermittent failures I've ever seen!

Thanks :) I have a candidate fix, let's see if it resolves the intermittent:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=93c87994a7bdf998aed6c2987c66dae30bd14bc2

Better Try push here:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1&revision=9bacff75313037731a21d4fe70309ff9ae056976

It does appear to resolve the intermittent, but it regressed a couple of other tests; will need to investigate that.

(In reply to Botond Ballo [:botond] from comment #10)

It does appear to resolve the intermittent, but it regressed a couple of other tests; will need to investigate that.

I am confused by this failure:

GECKO(1701) | pointerevent_touch-action-inherit_child-pan-x-child-pan-x_touch-manual.html | TEST-UNEXPECTED-FAIL | /tests/dom/events/test/pointerevents/pointerevent_touch-action-inherit_child-pan-x-child-pan-x_touch-manual.html | touch-action attribute test: assert_not_equals: scroll x offset should not be 0 in the end of the test got disallowed value 0

This is a manual test, and it's running in automation?

(In reply to Botond Ballo [:botond] from comment #11)

This is a manual test, and it's running in automation?

It looks like, even though the pointer-events tests were written as manual tests, we have some logic to drive them in an automated fashion.

(In reply to Botond Ballo [:botond] from comment #11)

GECKO(1701) | pointerevent_touch-action-inherit_child-pan-x-child-pan-x_touch-manual.html | TEST-UNEXPECTED-FAIL | /tests/dom/events/test/pointerevents/pointerevent_touch-action-inherit_child-pan-x-child-pan-x_touch-manual.html | touch-action attribute test: assert_not_equals: scroll x offset should not be 0 in the end of the test got disallowed value 0

So here, the test is synthesizing native events that are expected to cause scrolling, and then checks that scrolling has occurred.

I've verified that the events reach APZ and cause scrolling there. I suspect the issue is that the waitForApzFlushedRepaints() mechanism is somehow failing to wait until the scrolling is propagated to content.

In fact it's just a bug in my patch. When the displayport changes, the old property object is deleted and a new one is created; the new mPainted flag needs to be propagated from the old to the new.

I do feel like I've gotten a better understanding of waitForApzFlushedRepaints as a result of debugging this, though: it is not sufficient to simply synthesize a bunch of native events, call waitForApzFlushedRepaints, and then assert that scrolling has taken place. For waitForApzFlushedRepaints to have the desired effect, at the time it's called the paint that will carry the layers-dependent target confirmation needs to have been scheduled already (which generally means, the events that trigger said confirmation need to have arrived at content).

The pointer-events test in question does this correctly (it waits for the drag gesture's touchend before calling waitForApzFlushedRepaints), but this is useful to keep in mind for other tests.

Merely checking for having a displayport isn't enough, because the paint to
tell APZ about the new scroll frame could still be pending.

(In reply to Botond Ballo [:botond] from comment #15)

I do feel like I've gotten a better understanding of waitForApzFlushedRepaints as a result of debugging this, though: it is not sufficient to simply synthesize a bunch of native events, call waitForApzFlushedRepaints, and then assert that scrolling has taken place. For waitForApzFlushedRepaints to have the desired effect, at the time it's called the paint that will carry the layers-dependent target confirmation needs to have been scheduled already (which generally means, the events that trigger said confirmation need to have arrived at content).

The pointer-events test in question does this correctly (it waits for the drag gesture's touchend before calling waitForApzFlushedRepaints), but this is useful to keep in mind for other tests.

Should we file a bug to improve this situation in the future? Either audit test usage to make sure it's okay and add a comment about how to use waitForApzFlushedRepaints or extend/modify waitForApzFlushedRepaints to avoid this problem?

(In reply to Timothy Nikkel (:tnikkel) from comment #18)

Should we file a bug to improve this situation in the future? Either audit test usage to make sure it's okay and add a comment about how to use waitForApzFlushedRepaints or extend/modify waitForApzFlushedRepaints to avoid this problem?

Yes, good idea! Filed bug 1604270.

Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e0d923d3736a
Only send a SetTargetAPZC notification right away if the target has a displayport and it has been painted. r=tnikkel
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73
Depends on: 1629466
You need to log in before you can comment on or make changes to this bug.