Closed Bug 1700634 Opened 4 years ago Closed 4 years ago

Intermittent Windows AArch64 layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending) DON'T USE FOR CLASSIFICATION BUT FILE INDIVIDUAL BUGS

Categories

(Core :: Web Painting, defect, P5)

Tracking

RESOLVED FIXED
92 Branch
Tracking Status
firefox91 --- fixed
firefox92 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jmaher)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=334277531&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Cg5eDHIJTZOJr9PI-7YN5A/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Cg5eDHIJTZOJr9PI-7YN5A/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2021-03-24T12:27:52.585Z] 12:27:52     INFO - REFTEST TEST-START | layout/generic/crashtests/370699-1.html
[task 2021-03-24T12:27:52.586Z] 12:27:52     INFO - REFTEST TEST-LOAD | file:///Z:/task_1616587823/build/tests/reftest/tests/layout/generic/crashtests/370699-1.html | 2180 / 3905 (55%)
[task 2021-03-24T12:29:08.367Z] 12:29:08     INFO - [Parent 7724, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x804B000A (NS_ERROR_MALFORMED_URI): file /builds/worker/checkouts/gecko/caps/BasePrincipal.cpp:1149
[task 2021-03-24T12:31:47.043Z] 12:31:47     INFO - JavaScript error: resource://gre/modules/PurgeTrackerService.jsm, line 387: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIScriptSecurityManager.createContentPrincipalFromOrigin]
[task 2021-03-24T12:41:21.118Z] 12:41:21     INFO - REFTEST TEST-UNEXPECTED-FAIL | layout/generic/crashtests/370699-1.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)
[task 2021-03-24T12:41:21.118Z] 12:41:21     INFO - REFTEST INFO | Saved log: START file:///Z:/task_1616587823/build/tests/reftest/tests/layout/generic/crashtests/370699-1.html
[task 2021-03-24T12:41:21.120Z] 12:41:21     INFO - REFTEST INFO | Saved log: [CONTENT] FromChildAfterPaintListener from about:blank
Summary: Intermittent layout/generic/crashtests/370699-1.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending) → Intermittent layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

Daniel, there is an instance of this issue where the failure log is full of lines referencing AfterPaintListener, could you please take a look at what might be causing this?

Flags: needinfo?(dholbert)

(In reply to Alexandru Michis [:malexandru] from comment #7)

Daniel, there is an instance of this issue where the failure log is full of lines referencing AfterPaintListener, could you please take a look at what might be causing this?

That's a case where the test reloads itself forever, and the AfterPaintListener spam is probably just one line per reload. (And we probably never realize that the test is done because we wait for pending paints to hit 0, and we get unlucky: our pending-paint check is too slow, so it always runs after the reload has already queued another paint.)

That test is just kinda bogus as a crashtest; crashtests shouldn't reload themselves, lest they trigger issues like this.

I'll post a patch to clean up that particular test over in bug 1691034 (which I think is about this same issue for that test).

Flags: needinfo?(dholbert)

Daniel, any further plans for this issue affecting random crashtests on Windows AArch64? It's failing almost permanently on central and beta.

Flags: needinfo?(dholbert)
Summary: Intermittent layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending) → Intermittent Windows AArch64 layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

It's not a single issue that's affecting random crashtests -- it's distinct issues with individual crashtests, where the crashtest never stops painting (due to e.g. setTimeout loops that never end, which result in a paint queue that's never empty when the harness happens to check). This is presumably coming up on aarch64 due to specific paint and/or harness-event-timing behavior on our aarch64 test machines, which makes us more likely to hit this issue there.

The tests need individual fixups to avoid looping forever. I did one such fixup in bug 1691034, for the test that was implicated in comment 7 here. Other instances of this "waiting for pending paint count to reach zero" issue are likely indications of other tests that are similarly problematic and need fixups.
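
As an illustration of the kind of fixup described above, a self-rescheduling loop can be capped so that the page eventually stops asking for paints. This is only a minimal sketch, not the patch from bug 1691034; `runBounded`, `step`, `maxIterations`, and `schedule` are hypothetical names chosen for the example.

```javascript
// Hypothetical helper: runs `step` at most `maxIterations` times, yielding to
// the event loop between runs, then stops rescheduling itself.
function runBounded(step, maxIterations, schedule = (fn) => setTimeout(fn, 0)) {
  let remaining = maxIterations;

  function tick() {
    step();
    remaining -= 1;
    if (remaining > 0) {
      schedule(tick);
    }
    // Otherwise nothing is rescheduled, so no further paints are requested
    // and the pending paint count can settle at zero.
  }

  if (remaining > 0) {
    tick();
  }
}
```

In a crashtest, `step` would be the DOM manipulation that the original loop performed, and the cap would be whatever number of iterations is enough to exercise the crash. The `schedule` parameter exists only so the loop can be driven without real timers.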

e.g. one recent instance is https://treeherder.mozilla.org/logviewer?job_id=338238518&repo=mozilla-central&lineNumber=25300 for https://searchfox.org/mozilla-central/source/layout/generic/crashtests/471360.html which is a test that also happened to come up in bug 1691192 recently. We can fix that one over there (and we should probably uplift these test fixes, if this is a problem for beta as well.)

Summary: Intermittent Windows AArch64 layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending) → Intermittent Windows AArch64 layout/generic/crashtests/<random_test>.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending) DON'T USE FOR CLASSIFICATION BUT FILE INDIVIDUAL BUGS

(In reply to Daniel Holbert [:dholbert] from comment #12)

The tests need individual fixups to avoid looping forever. I did one such fixup in bug 1691034, [...]
we should probably uplift these test fixes, if this is a problem for beta as well.)

(Note: We don't need to uplift that one to beta, because it landed before the most recent merge so it's already there. Any remaining instances of this on beta are probably in different tests.)

Here's a log of the last ~10 days of failures:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1700634&startday=2021-04-20&endday=2021-04-30&tree=all

471360.html looks like the most common culprit (and I've got a patch to fix it in bug 1691192, as noted in comment 12).

Here's a list of other tests where we also seem to have hit this in that time range (note, not all of these are highlighted in the logviewer, possibly because the logs are quite long):
https://treeherder.mozilla.org/logviewer?job_id=338192696&repo=mozilla-beta&lineNumber=5735

REFTEST TEST-UNEXPECTED-FAIL | layout/generic/crashtests/286491.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)
REFTEST TEST-UNEXPECTED-FAIL | layout/generic/crashtests/478504.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

https://treeherder.mozilla.org/logviewer?job_id=338139160&repo=mozilla-beta&lineNumber=23755

REFTEST TEST-UNEXPECTED-FAIL | layout/forms/crashtests/1279354.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

https://treeherder.mozilla.org/logviewer?job_id=338140600&repo=mozilla-central&lineNumber=58114

REFTEST TEST-UNEXPECTED-FAIL | layout/generic/crashtests/1460158-1.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

https://treeherder.mozilla.org/logviewer?job_id=337591720&repo=mozilla-central&lineNumber=16399

REFTEST TEST-UNEXPECTED-FAIL | layout/generic/crashtests/1460158-2.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

https://treeherder.mozilla.org/logviewer?job_id=337581302&repo=mozilla-central&lineNumber=15071

REFTEST TEST-UNEXPECTED-FAIL | gfx/tests/crashtests/783041-3.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)

https://treeherder.mozilla.org/logviewer?job_id=337355726&repo=mozilla-beta&lineNumber=61673

REFTEST TEST-UNEXPECTED-FAIL | layout/style/crashtests/1400035.html | load failed: timed out waiting for pending paint count to reach zero (waiting for updateCanvasPending)
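
To tally entries like these from a saved log, a small script can pull the test path out of each failure line. This is a rough sketch that assumes the `REFTEST TEST-UNEXPECTED-FAIL | <path> | <message>` layout shown above; `collectPaintTimeouts` is a hypothetical helper, not part of the reftest harness or Treeherder.

```javascript
// Matches a reftest failure line and captures the test path and the message.
const FAILURE_RE = /REFTEST TEST-UNEXPECTED-FAIL \| (\S+) \| (.*)$/;

// Hypothetical helper: returns a Map from test path to the number of
// "pending paint count" timeouts seen for that test in `logText`.
function collectPaintTimeouts(logText) {
  const counts = new Map();
  for (const line of logText.split(/\r?\n/)) {
    const match = FAILURE_RE.exec(line);
    if (!match) {
      continue; // Not a failure line.
    }
    const [, testPath, message] = match;
    if (!message.includes("pending paint count")) {
      continue; // Some other kind of failure.
    }
    counts.set(testPath, (counts.get(testPath) || 0) + 1);
  }
  return counts;
}
```

Sorting the resulting entries by count would give a per-test view of which crashtests most often hit the timeout.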

It looks like these are tests that continue painting forever.

Some of these are just pathological fuzzer testcases (like 370174-3.html was and 471360.html is; I've previously posted patches to address both). For example, these two have a setInterval or a setTimeout ping-pong that causes some repeated DOM manipulation, indefinitely:
https://searchfox.org/mozilla-central/source/layout/generic/crashtests/286491.html
https://searchfox.org/mozilla-central/source/layout/generic/crashtests/478504.html

Those are pretty straightforward to fix, if we want to.

But there are other cases that are less clear about what we should do. This one just has a CSS animation (which cycles forever):
https://searchfox.org/mozilla-central/source/layout/style/crashtests/1400035.html

...and in several of the cases, the dynamic painting-forever thing is just a <progress> element, which by default plays a "bounce back and forth" animation:
https://searchfox.org/mozilla-central/source/layout/forms/crashtests/1279354.html
https://searchfox.org/mozilla-central/source/layout/generic/crashtests/1460158-1.html
https://searchfox.org/mozilla-central/source/layout/generic/crashtests/1460158-2.html
https://searchfox.org/mozilla-central/source/gfx/tests/crashtests/783041-3.html

I'm not sure we want CSS animations and <progress> elements to have the power to cause a crashtest to fail (due to spamming the harness with never-ending paints). This doesn't seem to be a problem on other platforms, so I assume we have some sort of solution that's not working on this new platform. mattwoodrow, do you know what might be going wrong here? I recall you working on the piece of the reftest harness that makes us wait until the pending paint count reaches zero.

Component: Layout → Web Painting
Flags: needinfo?(dholbert) → needinfo?(matt.woodrow)

I dealt with a lot of those types of issues when converting the reftest harness to handle Fission, and based on what you describe, this bug might in fact be a regression from that work. We needed to make operations async that were previously sync, which makes them take slightly longer in wall-clock time. That means things that invalidate every frame have less chance of finishing that work before the next invalidate asks for another paint.
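
The timing argument above can be illustrated with a toy model. This is a sketch under simplifying assumptions (one invalidation per frame, paints handled one at a time, 1 ms time steps), not a description of how the reftest harness is implemented; `timeToIdle` and its parameters are hypothetical, and the numbers are illustrative only.

```javascript
// Toy model: the page invalidates once every `frameMs`, and each paint takes
// `workMs` to complete (including any async round trip). Returns the first
// time at which the pending paint count is zero, or null if that never
// happens within `timeoutMs`.
function timeToIdle(frameMs, workMs, timeoutMs) {
  let pending = 0;
  let busyUntil = null; // Completion time of the paint currently in flight.

  for (let t = 0; t <= timeoutMs; t++) {
    if (busyUntil !== null && t >= busyUntil) {
      pending -= 1; // The in-flight paint has finished.
      busyUntil = null;
    }
    if (t % frameMs === 0) {
      pending += 1; // A new frame invalidates and requests a paint.
    }
    if (pending > 0 && busyUntil === null) {
      busyUntil = t + workMs; // Start working on the next pending paint.
    }
    if (pending === 0) {
      return t; // The harness would see an idle page here.
    }
  }
  return null;
}

console.log(timeToIdle(16, 10, 1000)); // 10: work finishes before the next frame
console.log(timeToIdle(16, 16, 1000)); // null: the next invalidate always arrives first
```

In this model, once the per-paint cost reaches the frame interval there is never a moment when the count reads zero, which matches the symptom of the harness timing out while waiting.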

Thanks, tnikkel -- yeah, this sounds like that sort of issue.

Do you have any suggestions for how to address this, based on your approaches in that effort? Particularly for tests that have some continuously-painting thing like <progress> elements and continuous CSS animations.

I'm hoping we don't have to hand-fix all such tests. I think that would mean e.g. adding JS to an otherwise-static test to remove the animated element after some arbitrary period of time.

Flags: needinfo?(matt.woodrow) → needinfo?(tnikkel)

I usually just added dump statements in the reftest harness and key parts of c++ code to get an idea of how the loop was happening and not getting broken.

Flags: needinfo?(tnikkel)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE

Windows AArch64 tests on mozilla-central

Windows AArch64 crashtests had their last successful run on May 9th. Reftests have been failing permanently since the switch to WebRender on June 24th (bug 1717912) and had a success rate >50% before that. The current state of test results guarantees only that the application can be launched in a test environment.

Joel, what shall be done here?

  • Fixing: Doesn't look like an option based on previous discussion in the bug without spending a big chunk of developer time on it.
  • Disable all intermittently failing tests: Let me know if I shall build a query of all tests for this platform which failed recently. From inspection, this seems to affect random tests and might be an issue with the test environment.
  • Demote tasks from tier 2 to tier 3: Sheriffs wouldn't monitor these tasks anymore (they would still be listed in my reports about permanently failing tasks), but it doesn't sound like the tasks would provide any value in this state (similar to the current one).
  • Turn off these tests for Windows AArch64. The mochitest-media suite would still be running and indicate the basic health of the build.
Flags: needinfo?(jmaher)

Thanks for bringing this up, :aryx. I agree that fixing this is difficult and not realistic; disabling the intermittents would also be painful and probably not very beneficial. I suspect that after the 100 common causes are disabled on win/aarch64 we would have a more stable crashtest suite, but it could be more: the tasks are timing out.

I think the best course of action is to disable the crashtests on this platform. The media tests run the large majority of the possible media tests and are green almost all the time.

Flags: needinfo?(jmaher)
Assignee: nobody → jmaher
Assignee: jmaher → aryx.bugmail
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/06952a064dd7 turn off crashtest/reftest on win/aarch64. r=releng-reviewers,bhearsum
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch