WPT reftest acid/acid2/reftest.html started intermittently failing recently
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking
Release | Tracking | Status
---|---|---
firefox-esr128 | --- | unaffected
firefox137 | --- | unaffected
firefox138 | --- | unaffected
firefox139 | --- | fixed
firefox140 | --- | fixed
People
(Reporter: dholbert, Assigned: nika)
References
(Regression)
Details
(Keywords: regression)
Attachments
(2 files)
48 bytes, text/x-phabricator-request | dmeehan: approval-mozilla-beta+
48 bytes, text/x-phabricator-request | dmeehan: approval-mozilla-beta+
Description
See bug 1785395 comments 54 through 56 - WPT acid/acid2/reftest.html started intermittently failing a few days ago (with the smiley face's eyes missing, and either orange/red or yellow showing up instead).
It's a pretty frequent intermittent -- 44 failures in the past 4-5 days. (bug 1785395 comment 53 is a spot where the bot alerted us about 15 annotated failures in a single day.) That's on track to be among our most-spammy intermittents, I think -- and given that this is a pretty ancient benchmark/conformance-test, we should probably take regressions in it seriously.
Backfills seem to point at bug 1954597 as being where this started, based on the current test results here:
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=web-platform-tests-reftest-sw%2Clinux%2Cdebug%2Cwr6&tochange=303fc19db920e2b98df3b0e4e345a10550bada6b&fromchange=164c9b97d9366943fa05f3e81724d4d4bdd44416
Given that regressor: nika, could you take a look perhaps?
Reporter
Comment 1 • 4 months ago
For convenience, here's the most recent starred test failure:
https://treeherder.mozilla.org/logviewer?job_id=504278748&repo=mozilla-central&lineNumber=1983
reftest-analyzer view (showing an orange/red stripe where the smiley's eyes should be):
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/HkuTS84GSt6VALniQXbR_Q/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Comment 2 • 4 months ago
Set release status flags based on info from the regressing bug 1954597
Reporter
Comment 3 • 4 months ago
(In reply to Daniel Holbert [:dholbert] from comment #0)
> It's a pretty frequent intermittent
> [...]
> Backfills seem to point at bug 1954597 as being where this started, based on the current test results here:
> https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=web-platform-tests-reftest-sw%2Clinux%2Cdebug%2Cwr6&tochange=303fc19db920e2b98df3b0e4e345a10550bada6b&fromchange=164c9b97d9366943fa05f3e81724d4d4bdd44416
Expanding on this being "pretty frequent" - if I'm doing my math correctly, it looks like this is failing on roughly 1/3 of the test runs for the affected pushes there (on Linux 18.04 x64 WebRender debug Wr6 or the Wr6 backfill tasks). I'm counting 17 failures out of 49 completed test runs (in the 6 total affected pushes shown in that treeherder link, starting with the push for bug 1954597 that seems to have regressed this). And 17/49 ≈ 34.7%, i.e. roughly 1/3.
Hopefully that's frequent enough that we can capture this in pernosco and reason about what's going on. I triggered the self-service pernosco UI from treeherder to attempt to repro the bug, and I'll share the trace if pernosco is able to capture one.
Assignee
Comment 4 • 4 months ago
Yikes, that's quite the frequent failure.
Looking into this a bit (with emilio's help), it appears that the eyes are rendered using a series of 3 nested <object> tags:
<object data="data:application/x-unknown,ERROR">
<object data="404.html?pipe=status(404)" type="text/html">
<object data="data:image/png;base64,...">ERROR</object>
</object>
</object>
The way these work is that the content inside the object tag is rendered as fallback if the data fails to load. Our failure to render the image could perhaps be caused by one of these object/embed tags (which are supposed to fail to load) intermittently not encountering a load error, and thus not rendering its fallback content. I haven't been able to reproduce it yet, so I don't know which load is failing to fail, but I could believe there is some subtle issue in the regressing patch stack.
In the test failure which was linked, I do see an error being propagated up in the DocumentLoadListener:
[task 2025-04-11T18:47:36.441Z] 18:47:36 INFO - PID 947 | [Parent 947, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040154 (NS_ERROR_FACTORY_NOT_REGISTERED): file /builds/worker/checkouts/gecko/netwerk/ipc/DocumentLoadListener.cpp:299
[task 2025-04-11T18:47:36.442Z] 18:47:36 INFO - PID 947 | [Parent 947, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x805D0001 (NS_ERROR_WONT_HANDLE_CONTENT): file /builds/worker/checkouts/gecko/uriloader/base/nsURILoader.cpp:188
I think this is normal though, and happening because of the data:application/x-unknown,ERROR data URI not successfully performing a type conversion. I believe that we should fail the OnStartRequest in that case, and end up closing the channel.
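For reference, here's a minimal sketch of the listener contract that last sentence relies on (illustrative only; DeliverFailedLoad is a made-up helper, not the actual DocumentLoadListener code):

```cpp
#include "nsIChannel.h"
#include "nsIStreamListener.h"

// Illustrative sketch only: if OnStartRequest fails (e.g. because no handler
// is registered for application/x-unknown), the request should be cancelled
// with that status and OnStopRequest delivered, which is what lets the
// <object> element observe the failed load and render its fallback.
static nsresult DeliverFailedLoad(nsIChannel* aChannel,
                                  nsIStreamListener* aListener) {
  nsresult rv = aListener->OnStartRequest(aChannel);
  if (NS_FAILED(rv)) {
    // Propagate the listener's refusal as the channel status.
    aChannel->Cancel(rv);
  }
  nsresult status = NS_OK;
  aChannel->GetStatus(&status);
  // OnStopRequest always runs, carrying the (possibly failing) status.
  return aListener->OnStopRequest(aChannel, status);
}
```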
Assignee
Comment 6 • 4 months ago
I left the above comment to record progress, because I was about to be out for a week. Unfortunately I haven't had the chance to look into the specifics of what might be failing here.
Just before I went on PTO, we got this reproduction trace on pernosco, which I hope will help us figure out what is going wrong: https://pernos.co/debug/QW5-wpsFxeeqPbphO6bC2A/index.html
Assignee
Comment 7 • 4 months ago
Looking at the trace with the reftest failure, I believe I figured out the issue.
All 3 loads happened as expected (with 2 failed OnStartRequest loads, and one successful OnStartRequest load), but the reftest snapshot appears to be taken before the successful navigation is processed. It appears that we're firing the load event before the fallback content from the object/embed element has a chance to be rendered, meaning that the next load does not start in time to delay the load event from firing and the screenshot from being taken.
This is perhaps causing intermittent failures because the screenshot is taken asynchronously from the load failure, so most of the time the fallback loads finish fast enough that they appear in the output anyway.
As far as I can tell this is a result of the changes from https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentChannel.cpp#131-137. This logic was added in https://phabricator.services.mozilla.com/D49526 for downloads, and removes a load from the load group before firing OnStartRequest/OnStopRequest to the listener in the case where the load is being disconnected from the parent process. This appears to have been done to ensure that the load group doesn't see loads which were redirected to being downloads as load failures, roughly mimicking the re-targeting of load notifications which was previously performed by https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/uriloader/exthandler/nsExternalHelperAppService.cpp#1410-1415 in nsExternalAppHandler.
By removing the load from the load group before OnStartRequest/OnStopRequest, this allows the load group to fire the load event before the handler for the failed load can be processed. In this case, it is causing the failed loads from the outer 2 object/embed elements to not start the inner loads for fallback content before the main document's load event is fired.
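To make that ordering concrete, here is a simplified sketch of the problem; RemoveRequest/OnStartRequest/OnStopRequest are real APIs, but DisconnectAndNotify and the overall shape are illustrative rather than the actual DocumentChannel code:

```cpp
#include "nsIChannel.h"
#include "nsILoadGroup.h"
#include "nsIStreamListener.h"

// Simplified illustration of the ordering described above; not actual Gecko
// code.
static void DisconnectAndNotify(nsILoadGroup* aLoadGroup, nsIChannel* aChannel,
                                nsIStreamListener* aListener, nsresult aStatus,
                                nsresult aLoadGroupStatus) {
  // (1) The request is pulled out of the load group *before* the listener
  //     hears about the failed load. If it was the last request in the
  //     group, the document's load event can fire right here...
  aLoadGroup->RemoveRequest(aChannel, nullptr, aLoadGroupStatus);

  // (2) ...and only afterwards does the <object> element's listener learn
  //     that the load failed and kick off its fallback load. That fallback
  //     load joins a load group that has already fired `load`, so nothing
  //     stops the reftest snapshot from being taken too early.
  aListener->OnStartRequest(aChannel);
  aListener->OnStopRequest(aChannel, aStatus);
}
```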
I expect the situation was made worse by the changes in https://phabricator.services.mozilla.com/D242422. In that patch, I changed some logic to disconnect child channel loads from the parent process earlier when the load is doomed to fail, as hardening against the child process coming to a different conclusion about whether the load was going to fail. This caused some of the intentionally failing loads in acid2 to start taking the DisconnectListeners codepath, meaning that the fallback content did not block the load event, and the reftest screenshot was taken early.
I expect that the easiest solution here might be to change the logic in https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentChannel.cpp#131-137 to only do the early removal from the load group in the case where aLoadGroupStatus != aStatus? I think that this would remove the need for doing this early synchronous request removal from the load group in the cases which matter, while preserving the existing behaviour in the multipart-channel download case (https://searchfox.org/mozilla-central/rev/931de234363586a45ea9ab66d281960ccf1c3597/netwerk/ipc/DocumentLoadListener.cpp#2912).
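A rough sketch of the shape that condition could take (assumed shape only, not the patch that actually landed):

```cpp
#include "nsILoadGroup.h"
#include "nsIRequest.h"

// Sketch of the suggested tweak: only remove the request from the load group
// before notifying the listener when the status reported to the group
// differs from the channel status, i.e. the download/retargeting case this
// early-removal logic was originally added for.
static void MaybeRemoveFromLoadGroupEarly(nsILoadGroup* aLoadGroup,
                                          nsIRequest* aRequest,
                                          nsresult aStatus,
                                          nsresult aLoadGroupStatus) {
  if (aLoadGroupStatus == aStatus) {
    // Ordinary failed load: keep the request in the group so the document's
    // load event stays blocked until OnStopRequest has run and any fallback
    // load has had a chance to start.
    return;
  }
  // Retargeted-to-download case: keep the existing early removal so the load
  // group doesn't count the retargeted load as a failure.
  aLoadGroup->RemoveRequest(aRequest, nullptr, aLoadGroupStatus);
}
```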
I expect that a "correct" solution might be to understand what original behaviour with multipart downloads this logic was working around and fix that, but that's a larger project.
Assignee
Updated • 4 months ago
Assignee
Comment 8 • 4 months ago
Comment 9 • 4 months ago
Another alternative could be for object to call BlockOnload() and UnblockOnload(), but that might just paper over the issue.
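For reference, a minimal sketch of that alternative; Document::BlockOnload()/UnblockOnload() are real APIs, but the surrounding helper is hypothetical, not a real patch:

```cpp
#include "mozilla/dom/Document.h"

// Hypothetical helper sketching the alternative above: keep the document's
// load event blocked across the failed-load -> fallback-load transition in
// the object/embed element.
static void HoldLoadEventAcrossFallback(mozilla::dom::Document* aDoc) {
  // Block the load event before the failed load is torn down...
  aDoc->BlockOnload();

  // ...start the fallback load here (omitted)...

  // ...then unblock once the fallback request has joined the load group and
  // can keep the load event blocked on its own.
  aDoc->UnblockOnload(/* aFireSync = */ false);
}
```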
Updated • 4 months ago
Comment 10 • 4 months ago
Set release status flags based on info from the regressing bug 1954597
Updated • 4 months ago
Assignee
Comment 11 • 4 months ago
Comment 12 • 4 months ago
Comment 13 • 4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e3466c9b686f
https://hg.mozilla.org/mozilla-central/rev/90e17694492e
Comment 14 • 3 months ago
Did you want to nominate this for Beta uplift?
Assignee
Comment 15 • 3 months ago
Comment on attachment 9484667 [details]
Bug 1961008 - Part 2: Fix toplevel data blocking test failure, r=smaug!
Beta/Release Uplift Approval Request
- User impact if declined/Reason for urgency: Load event ordering problems for object/embed elements
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): Change to how errored channels are torn down to make it more consistent with how it used to work. Should be fairly low-risk.
- String changes made/needed: none
- Is Android affected?: Yes
Assignee
Updated • 3 months ago
Comment 16 • 3 months ago
Comment on attachment 9481574 [details]
Bug 1961008 - Part 1: Only disconnect DocumentChannel from LoadGroup early for Downloads, r=smaug!
Approved for 139.0b6
Comment 17 • 3 months ago
Comment on attachment 9484667 [details]
Bug 1961008 - Part 2: Fix toplevel data blocking test failure, r=smaug!
Approved for 139.0b6
Comment 18 • 3 months ago
uplift
Updated • 3 months ago