Closed Bug 1720485 Opened 3 years ago Closed 3 years ago

Intermittent layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have

Categories

(Core :: Layout, defect, P5)

defect

Tracking

()

RESOLVED FIXED
93 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox90 --- unaffected
firefox91 --- unaffected
firefox92 --- fixed
firefox93 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dholbert)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell unknown])

Attachments

(1 file)

Filed by: abutkovits [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=345190321&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SAnzxqamQEOnw5fP9M7JjQ/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SAnzxqamQEOnw5fP9M7JjQ/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2021-07-14T12:06:11.189Z] 12:06:11     INFO - TEST-START | layout/base/tests/test_after_paint_pref.html
[task 2021-07-14T12:06:11.424Z] 12:06:11     INFO - TEST-INFO | started process screentopng
[task 2021-07-14T12:06:11.586Z] 12:06:11     INFO - TEST-INFO | screentopng: exit 0
[task 2021-07-14T12:06:11.587Z] 12:06:11     INFO - Buffered messages logged at 12:06:11
[task 2021-07-14T12:06:11.588Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | A valid string reason is expected 
[task 2021-07-14T12:06:11.588Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | Reason cannot be empty 
[task 2021-07-14T12:06:11.589Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | pref defaults to true in mochitest harness 
[task 2021-07-14T12:06:11.590Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | loaded 
[task 2021-07-14T12:06:11.590Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step1 reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-07-14T12:06:11.591Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step2 reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-07-14T12:06:11.592Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | got MozAfterPaint (timeout for next step is 900ms) 
[task 2021-07-14T12:06:11.592Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step3 reached 
[task 2021-07-14T12:06:11.593Z] 12:06:11     INFO - TEST-FAIL | layout/base/tests/test_after_paint_pref.html | The author of the test has indicated that flaky timeouts are expected.  Reason: untriaged 
[task 2021-07-14T12:06:11.594Z] 12:06:11     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | failstep reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-07-14T12:06:11.594Z] 12:06:11     INFO - Buffered messages finished
[task 2021-07-14T12:06:11.594Z] 12:06:11     INFO - TEST-UNEXPECTED-FAIL | layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have 
[task 2021-07-14T12:06:11.594Z] 12:06:11     INFO -     SimpleTest.ok@SimpleTest/SimpleTest.js:417:16
[task 2021-07-14T12:06:11.594Z] 12:06:11     INFO -     failstep@layout/base/tests/test_after_paint_pref.html:101:5
[task 2021-07-14T12:06:12.319Z] 12:06:12     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step4 reached 
[task 2021-07-14T12:06:12.346Z] 12:06:12     INFO - GECKO(6140) | MEMORY STAT | vsize 2494MB | residentFast 116MB | heapAllocated 7MB
[task 2021-07-14T12:06:12.372Z] 12:06:12     INFO - TEST-OK | layout/base/tests/test_after_paint_pref.html | took 1182ms

This started to fail on Tier 2 jobs

Summary: Intermittent [tier2] TV layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have → Intermittent layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have
Summary: Intermittent layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have → Intermittent [Tier 2] layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have

Started to fail on Tier 1 jobs

Summary: Intermittent [Tier 2] layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have → Intermittent layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have

Set release status flags based on info from the regressing bug 1718755

Update:

There have been 40 failures within the last 7 days:

  • 11 failures on android-em-7-0-x86_64-lite-qr opt/debug
  • 12 failures on Android 7.0 x86-64 WebRender opt/debug
  • 1 failure Android 7.0 x86-64 Shippable WebRender opt
  • 1 failure on Linux 18.04 x64 asan opt
  • 9 failures on Linux 18.04 x64 WebRender opt/debug
  • 6 failures on Linux 18.04 x64 WebRender Shippable opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=347027749&repo=autoland&lineNumber=6807

[task 2021-08-01T06:23:44.581Z] 06:23:44     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | failstep reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-08-01T06:23:44.581Z] 06:23:44     INFO - Buffered messages finished
[task 2021-08-01T06:23:44.581Z] 06:23:44     INFO - TEST-UNEXPECTED-FAIL | layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have 
[task 2021-08-01T06:23:44.581Z] 06:23:44     INFO -     SimpleTest.ok@SimpleTest/SimpleTest.js:417:16
[task 2021-08-01T06:23:44.581Z] 06:23:44     INFO -     failstep@layout/base/tests/test_after_paint_pref.html:101:5
[task 2021-08-01T06:23:45.201Z] 06:23:45     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step4 reached 

Daniel, could you help us assign this to someone?
Thank you.

Flags: needinfo?(dholbert)
Whiteboard: [stockwell needswork:owner]

Update:
There have been 50 failures within the last 7 days:
• 20 failures on android-em-7-0-x86_64-lite-qr debug
• 2 failures on android-em-7-0-x86_64-lite-qr opt
• 7 failures on Android 7.0 x86-64 WebRender debug
• 7 failures on Linux 18.04 x64 WebRender asan opt
• 1 failures on Linux 18.04 x64 CCov WebRender opt
• 5 failures on Linux 18.04 x64 WebRender opt
• 3 failures on Linux 18.04 x64 WebRender debug
• 4 failures on Linux 18.04 x64 WebRender Shippable opt
• 1 failures on linux1804-64-tsan-qr opt
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=347774783&repo=mozilla-central&lineNumber=8080

[task 2021-08-08T09:53:00.940Z] 09:53:00     INFO - TEST-START | layout/base/tests/test_after_paint_pref.html
[task 2021-08-08T09:53:02.453Z] 09:53:02     INFO - TEST-INFO | started process screentopng
[task 2021-08-08T09:53:02.697Z] 09:53:02     INFO - TEST-INFO | screentopng: exit 0
[task 2021-08-08T09:53:02.699Z] 09:53:02     INFO - Buffered messages logged at 09:53:02
[task 2021-08-08T09:53:02.700Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | A valid string reason is expected 
[task 2021-08-08T09:53:02.700Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | Reason cannot be empty 
[task 2021-08-08T09:53:02.700Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | pref defaults to true in mochitest harness 
[task 2021-08-08T09:53:02.700Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | loaded 
[task 2021-08-08T09:53:02.701Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step1 reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-08-08T09:53:02.702Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step2 reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-08-08T09:53:02.703Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | got MozAfterPaint (timeout for next step is 900ms) 
[task 2021-08-08T09:53:02.705Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step3 reached 
[task 2021-08-08T09:53:02.705Z] 09:53:02     INFO - TEST-FAIL | layout/base/tests/test_after_paint_pref.html | The author of the test has indicated that flaky timeouts are expected.  Reason: untriaged 
[task 2021-08-08T09:53:02.706Z] 09:53:02     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | failstep reached: boundingClientRect=(top=0,left=0,width=0,height=undefined) 
[task 2021-08-08T09:53:02.708Z] 09:53:02     INFO - Buffered messages finished
[task 2021-08-08T09:53:02.708Z] 09:53:02     INFO - TEST-UNEXPECTED-FAIL | layout/base/tests/test_after_paint_pref.html | got MozAfterPaint when we should not have 
[task 2021-08-08T09:53:02.708Z] 09:53:02     INFO -     SimpleTest.ok@SimpleTest/SimpleTest.js:417:16
[task 2021-08-08T09:53:02.708Z] 09:53:02     INFO -     failstep@layout/base/tests/test_after_paint_pref.html:101:5
[task 2021-08-08T09:53:03.328Z] 09:53:03     INFO - TEST-PASS | layout/base/tests/test_after_paint_pref.html | step4 reached 
[task 2021-08-08T09:53:03.369Z] 09:53:03     INFO - GECKO(8079) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2021-08-08T09:53:03.371Z] 09:53:03     INFO - GECKO(8079) | MEMORY STAT heapAllocated not supported in this build configuration.
[task 2021-08-08T09:53:03.372Z] 09:53:03     INFO - GECKO(8079) | MEMORY STAT | vsize 20974416MB | residentFast 364MB
[task 2021-08-08T09:53:03.424Z] 09:53:03     INFO - TEST-OK | layout/base/tests/test_after_paint_pref.html | took 2484ms
[task 2021-08-08T09:53:03.562Z] 09:53:03     INFO - TEST-START | layout/base/tests/test_border_radius_hit_testing.html

(In reply to Natalia Csoregi [:nataliaCs] from comment #6)

Daniel, could you help us assign this to someone?
Thank you.

Sorry for the delay. mattwoodrow, perhaps you could take a look here? Not sure who's the best person to ask about the MozAfterPaint event (and this test) these days, but you feel like a good candidate. :)

Some initial thoughts from skimming the test:

The test does have this suspicious snippet, which is just before the failure in step3:

  // When there was previously another page in our window, we seem to
  // get duplicate events, simultaneously, so we need to register our
  // next listener after a zero timeout.
  await SpecialPowers.pushPrefEnv({'set': [['dom.send_after_paint_to_content', false]]});
  await SpecialPowers.promiseTimeout(0);
  setTimeout(step3, 0, timeout);

https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/layout/base/tests/test_after_paint_pref.html#81

That timeout feels suspicious/hacky. I wonder if some recent architectural change "broke" this promiseTimeout(0) hackaround here (and so now we're getting a mozAfterPaint event that had been queued up from some previously-loaded test, of the sort referenced in the code comment). If the hackaround is indeed just broken/insufficient, maybe there's a better hackaround we could be doing, e.g. a double-rAF to allow any currently-pending paints to be flushed (and hence any pending MozAfterPaint calls to be flushed), or something...?

Alternately/also, maybe there's a race condition that causes the pref-tweak to not immediately take effect in the content process... (I don't have a good feel for whether await SpecialPowers.pushPrefEnv( is meant to be immediately observable in the content process after the promise has resolved. I would suspect it is, but I'm not sure.)

Flags: needinfo?(dholbert) → needinfo?(matt.woodrow)

I managed to catch this in rr locally (by running the test repeatedly in chaos mode for ~20min, in a debug+O1-opt build). I've uploaded the recording to pernosco and I'll post the link when I've got it.

Pernosco link (focused on the first of two test-failures in the aforementioned repeated-chaos-mode testrun):
https://pernos.co/debug/Xobq0MrDYw0nGkOF5vX7Bg/index.html

I have a theory about what's going on, and I think my double-rAF suggestion from comment 9 will fix it.

Notes:

  • This dom.send_after_paint_to_content pref is internally represented as a nsPresContext flag named mSendAfterPaintToContent
  • We update that flag somewhat lazily, on the next refresh cycle tick.
  • ...but before we can do that, we get composited, and that (rather than a refresh driver tick) is what prompts us to create a MozAfterPaint event.

So we end up dispatching MozAfterPaint barely after the pref was turned off, due to:

  • the lazy refresh-driver-throttled way that we get the pref-change reflected into our nsPresContext
  • ...combined with the fact that MozAfterPaint is depent on compositor timing (i.e. receiving IPC events), not refresh-driver timing.
  • ...combined with the fact that we had some earlier painting to do, which is what prompts some earlier paint that ends up resulting in this unfortunately-timed composite operation.

I suspect a double-rAF will help here (in place of the promiseTimeout(0)) because I think it will actually make us wait on the compositor to flush all of its pending composites. At least, I'm pretty sure rAF is gated on the compositor.

ni=me to follow up on this...

Flags: needinfo?(matt.woodrow) → needinfo?(dholbert)

nsPresContext has a flag that caches the value of this pref, and the flag gets
updated in the next refresh driver tick after a pref-change occurs. Right now
we don't have a strong guarantee that the flag will have been updated by the
time the test assumes that it has, which is why we're seeing intermittent
failures.

This patch adjusts the test so that it waits for a double-rAF before proceeding
from its pref adjustment. I believe this is the recommended way of forcing &
waiting-out at least one refresh driver tick; and this should let the test
reliably assume the pref-adjustment has taken effect.

Assignee: nobody → dholbert
Status: NEW → ASSIGNED
Pushed by dholbert@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/510d15e845ba Adjust test_after_paint_pref.html to use a double-rAF instead of a timeout, to give its pref-adjustment a better chance to take effect. r=mattwoodrow

I'm fairly confident that the landed patch will fix this.

I ran the same process as in comment 11, with the patch that has now landed, and I was able to get through 9x the number of test iterations without any failures (at which point I stopped the trial).

In comment 11, I let the test run on repeat for about 20 minutes and got 2 test-failures. Now I've repeated that same process 9 times, with the patch, and I got zero test-failures.

For reference, the command I'm using is:

 MOZ_CHAOSMODE=1 ./mach mochitest --headless --setpref="fission.autostart=true" layout/base/tests/test_after_paint_pref.html --repeat 10000
Flags: needinfo?(dholbert)
See Also: → 1459682
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: