WPT reftest acid/acid2/reftest.html started intermittently failing recently
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking
Release | Tracking | Status
---|---|---
firefox-esr128 | --- | unaffected
firefox137 | --- | unaffected
firefox138 | --- | unaffected
firefox139 | --- | fixed
firefox140 | --- | fixed
People
(Reporter: dholbert, Assigned: nika)
References
(Regression)
Details
(Keywords: regression)
Attachments
(2 files)
48 bytes, text/x-phabricator-request | dmeehan: approval-mozilla-beta+
48 bytes, text/x-phabricator-request | dmeehan: approval-mozilla-beta+
Description
See bug 1785395 comments 54 through 56 - WPT acid/acid2/reftest.html started intermittently failing a few days ago (with the smiley face's eyes missing, and either orange/red or yellow showing up instead).
It's a pretty frequent intermittent -- 44 failures in the past 4-5 days. (bug 1785395 comment 53 is a spot where the bot alerted us about 15 annotated failures in a single day.) That's on track to be among our most-spammy intermittents, I think -- and given that this is a pretty ancient benchmark/conformance-test, we should probably take regressions in it seriously.
Backfills seem to point at bug 1954597 as being where this started, based on the current test results here:
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=web-platform-tests-reftest-sw%2Clinux%2Cdebug%2Cwr6&tochange=303fc19db920e2b98df3b0e4e345a10550bada6b&fromchange=164c9b97d9366943fa05f3e81724d4d4bdd44416
Given that regressor: nika, could you take a look perhaps?
Reporter
Comment 1 • 4 months ago
For convenience, here's the most recent starred test failure:
https://treeherder.mozilla.org/logviewer?job_id=504278748&repo=mozilla-central&lineNumber=1983
reftest-analyzer view (showing an orange/red stripe where the smiley's eyes should be):
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/HkuTS84GSt6VALniQXbR_Q/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Comment 2 • 4 months ago
Set release status flags based on info from the regressing bug 1954597
Reporter
Comment 3 • 4 months ago
(In reply to Daniel Holbert [:dholbert] from comment #0)
> It's a pretty frequent intermittent
> [...]
> Backfills seem to point at bug 1954597 as being where this started, based on the current test results here:
> https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=web-platform-tests-reftest-sw%2Clinux%2Cdebug%2Cwr6&tochange=303fc19db920e2b98df3b0e4e345a10550bada6b&fromchange=164c9b97d9366943fa05f3e81724d4d4bdd44416
Expanding on this being "pretty frequent" - if I'm doing my math correctly, it looks like this is failing on roughly 1/3 of the test runs for the affected pushes there (on Linux 18.04 x64 WebRender debug Wr6 or the Wr6 backfill tasks). I'm counting 17 failures out of 49 completed test runs (in the 6 total affected pushes shown in that treeherder link, starting with the push for bug 1954597 that seems to have regressed this). And 17/49 ≈ 34.7%, i.e. roughly 1/3.
Hopefully that's frequent enough that we can capture this in pernosco and reason about what's going on. I triggered the self-service pernosco UI from treeherder to attempt to repro the bug, and I'll share the trace if pernosco is able to capture one.
Assignee
Comment 4 • 4 months ago
Yikes, that's quite the frequent failure.
Looking into this a bit (with emilio's help), it appears that the eyes are rendered using a series of 3 nested <object> tags:
<object data="data:application/x-unknown,ERROR">
<object data="404.html?pipe=status(404)" type="text/html">
<object data="data:image/png;base64,...">ERROR</object>
</object>
</object>
The way these work is that the content inside the object tag is rendered as fallback if the data fails to load. Our failure to render the image could perhaps be caused by one of these object/embed tags (which are supposed to fail to load) intermittently not encountering a load error, and thus not rendering its fallback content. I haven't been able to reproduce it yet, so I don't know which load is failing to fail, but I could believe there is some subtle issue in the regressing patch stack.
In the test failure which was linked, I do see an error being propagated up in the DocumentLoadListener:
[task 2025-04-11T18:47:36.441Z] 18:47:36 INFO - PID 947 | [Parent 947, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040154 (NS_ERROR_FACTORY_NOT_REGISTERED): file /builds/worker/checkouts/gecko/netwerk/ipc/DocumentLoadListener.cpp:299
[task 2025-04-11T18:47:36.442Z] 18:47:36 INFO - PID 947 | [Parent 947, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x805D0001 (NS_ERROR_WONT_HANDLE_CONTENT): file /builds/worker/checkouts/gecko/uriloader/base/nsURILoader.cpp:188
I think this is normal though, and happening because of the data:application/x-unknown,ERROR data URI not successfully performing a type conversion. I believe that we should fail the OnStartRequest in that case, and end up closing the channel.
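For reference, here's a minimal sketch of the listener contract that last sentence relies on (illustrative only; DeliverFailedLoad is a made-up helper, not the actual DocumentLoadListener code):

```cpp
#include "nsIChannel.h"
#include "nsIStreamListener.h"

// Illustrative sketch only: if OnStartRequest fails (e.g. because no handler
// is registered for application/x-unknown), the request should be cancelled
// with that status and OnStopRequest delivered, which is what lets the
// <object> element observe the failed load and render its fallback.
static nsresult DeliverFailedLoad(nsIChannel* aChannel,
                                  nsIStreamListener* aListener) {
  nsresult rv = aListener->OnStartRequest(aChannel);
  if (NS_FAILED(rv)) {
    // Propagate the listener's refusal as the channel status.
    aChannel->Cancel(rv);
  }
  nsresult status = NS_OK;
  aChannel->GetStatus(&status);
  // OnStopRequest always runs, carrying the (possibly failing) status.
  return aListener->OnStopRequest(aChannel, status);
}
```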
Assignee
Comment 6 • 4 months ago
I left the above comment to record progress, because I was about to be out for a week. Unfortunately I haven't had the chance to look into the specifics of what might be failing here.
Just before I went on PTO, we got this reproduction trace on pernosco, which I hope will help us figure out what is going wrong: https://pernos.co/debug/QW5-wpsFxeeqPbphO6bC2A/index.html
Assignee
Comment 7 • 4 months ago
Looking at the trace with the reftest failure, I believe I figured out the issue.
All 3 loads happened as expected (with 2 failed OnStartRequest loads, and one successful OnStartRequest load), but the reftest snapshot appears to be taken before the successful navigation is processed. It appears that we're firing the load event before the fallback content from the object/embed element has a chance to be rendered, meaning that the next load does not start in time to delay the load event from firing and the screenshot from being taken.
This is perhaps causing intermittent failures because the screenshot is taken asynchronously from the load failure, so most of the time the fallback loads finish fast enough that they appear in the output anyway.
As far as I can tell this is a result of the changes from https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentChannel.cpp#131-137. This logic was added in https://phabricator.services.mozilla.com/D49526 for downloads, and removes a load from the load group before firing OnStartRequest/OnStopRequest to the listener in the case where the load is being disconnected from the parent process. This appears to have been done to ensure that the load group doesn't see loads which were redirected to being downloads as load failures, roughly mimicking the re-targeting of load notifications which was previously performed by https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/uriloader/exthandler/nsExternalHelperAppService.cpp#1410-1415 in nsExternalAppHandler.
By removing the load from the load group before OnStartRequest/OnStopRequest, this allows the load group to fire the load event before the handler for the failed load can be processed. In this case, it is causing the failed loads from the outer 2 object/embed elements to not start the inner loads for fallback content before the main document's load event is fired.
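To make that ordering concrete, here is a simplified sketch of the problem; RemoveRequest/OnStartRequest/OnStopRequest are real APIs, but DisconnectAndNotify and the overall shape are illustrative rather than the actual DocumentChannel code:

```cpp
#include "nsIChannel.h"
#include "nsILoadGroup.h"
#include "nsIStreamListener.h"

// Simplified illustration of the ordering described above; not actual Gecko
// code.
static void DisconnectAndNotify(nsILoadGroup* aLoadGroup, nsIChannel* aChannel,
                                nsIStreamListener* aListener, nsresult aStatus,
                                nsresult aLoadGroupStatus) {
  // (1) The request is pulled out of the load group *before* the listener
  //     hears about the failed load. If it was the last request in the
  //     group, the document's load event can fire right here...
  aLoadGroup->RemoveRequest(aChannel, nullptr, aLoadGroupStatus);

  // (2) ...and only afterwards does the <object> element's listener learn
  //     that the load failed and kick off its fallback load. That fallback
  //     load joins a load group that has already fired `load`, so nothing
  //     stops the reftest snapshot from being taken too early.
  aListener->OnStartRequest(aChannel);
  aListener->OnStopRequest(aChannel, aStatus);
}
```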
I expect the situation was made worse by the changes in https://phabricator.services.mozilla.com/D242422. In that patch, I changed some logic to disconnect child channel loads from the parent process earlier when the load is doomed to fail, as hardening against the child process coming to a different conclusion about whether the load was going to fail. This caused some of the intentionally failing loads in acid2 to start taking the DisconnectListeners codepath, meaning that the fallback content did not block the load event, and the reftest screenshot was taken early.
I expect that the easiest solution here might be to change the logic in https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentChannel.cpp#131-137 to only do the early removal from the load group in the case where aLoadGroupStatus != aStatus? I think that this would remove the need for doing this early synchronous request removal from the load group in the cases which matter, while preserving the existing behaviour in the multipart-channel download case (https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentLoadListener.cpp#2912).
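A rough sketch of the shape that condition could take (assumed shape only, not the patch that actually landed):

```cpp
#include "nsILoadGroup.h"
#include "nsIRequest.h"

// Sketch of the suggested tweak: only remove the request from the load group
// before notifying the listener when the status reported to the group
// differs from the channel status, i.e. the download/retargeting case this
// early-removal logic was originally added for.
static void MaybeRemoveFromLoadGroupEarly(nsILoadGroup* aLoadGroup,
                                          nsIRequest* aRequest,
                                          nsresult aStatus,
                                          nsresult aLoadGroupStatus) {
  if (aLoadGroupStatus == aStatus) {
    // Ordinary failed load: keep the request in the group so the document's
    // load event stays blocked until OnStopRequest has run and any fallback
    // load has had a chance to start.
    return;
  }
  // Retargeted-to-download case: keep the existing early removal so the load
  // group doesn't count the retargeted load as a failure.
  aLoadGroup->RemoveRequest(aRequest, nullptr, aLoadGroupStatus);
}
```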
I expect that a "correct" solution might be to understand what original behaviour with multipart downloads this logic was working around and fix that, but that's a larger project.
Assignee
Updated • 4 months ago
Assignee
Comment 8 • 4 months ago
Comment 9 • 4 months ago
Another alternative could be for object to call BlockOnload() and UnblockOnload(), but that might just paper over the issue.
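For reference, a minimal sketch of that alternative; Document::BlockOnload()/UnblockOnload() are real APIs, but the surrounding helper is hypothetical, not a real patch:

```cpp
#include "mozilla/dom/Document.h"

// Hypothetical helper sketching the alternative above: keep the document's
// load event blocked across the failed-load -> fallback-load transition in
// the object/embed element.
static void HoldLoadEventAcrossFallback(mozilla::dom::Document* aDoc) {
  // Block the load event before the failed load is torn down...
  aDoc->BlockOnload();

  // ...start the fallback load here (omitted)...

  // ...then unblock once the fallback request has joined the load group and
  // can keep the load event blocked on its own.
  aDoc->UnblockOnload(/* aFireSync = */ false);
}
```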
Updated • 4 months ago
Comment 10 • 4 months ago
Set release status flags based on info from the regressing bug 1954597
Updated • 4 months ago
Assignee
Comment 11 • 4 months ago
Comment 12 • 4 months ago
Comment 13 • 4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e3466c9b686f
https://hg.mozilla.org/mozilla-central/rev/90e17694492e
Comment 14 • 3 months ago
Did you want to nominate this for Beta uplift?
Assignee
Comment 15 • 3 months ago
Comment on attachment 9484667 [details]
Bug 1961008 - Part 2: Fix toplevel data blocking test failure, r=smaug!
Beta/Release Uplift Approval Request
- User impact if declined/Reason for urgency: Load event ordering problems for object/embed elements
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): Change to how errored channels are torn down to make it more consistent with how it used to work. Should be fairly low-risk.
- String changes made/needed: none
- Is Android affected?: Yes
Assignee
Updated • 3 months ago
Comment 16 • 3 months ago
Comment on attachment 9481574 [details]
Bug 1961008 - Part 1: Only disconnect DocumentChannel from LoadGroup early for Downloads, r=smaug!
Approved for 139.0b6
Comment 17 • 3 months ago
Comment on attachment 9484667 [details]
Bug 1961008 - Part 2: Fix toplevel data blocking test failure, r=smaug!
Approved for 139.0b6
Comment 18 • 3 months ago
uplift
Updated • 3 months ago