Closed Bug 1694904 Opened 4 years ago Closed 1 year ago

Crash in [@ mozilla::net::DocumentChannel::SetLoadFlags] with reason DocumentChannel::SetLoadFlags: Don't set flags after creation

Categories

(Core :: DOM: Navigation, defect, P2)

Unspecified
Windows
defect

Tracking


RESOLVED FIXED
133 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox-esr128 --- fixed
firefox120 --- wontfix
firefox121 --- wontfix
firefox122 --- wontfix
firefox130 --- wontfix
firefox131 --- wontfix
firefox132 --- fixed
firefox133 --- fixed

People

(Reporter: sg, Assigned: emilio)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [bugmon:bisected,confirmed])

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/357266c6-b17d-434c-b107-1b5670210224

MOZ_CRASH Reason: MOZ_CRASH(DocumentChannel::SetLoadFlags: Don't set flags after creation)

Top 10 frames of crashing thread:

0 xul.dll mozilla::net::DocumentChannel::SetLoadFlags netwerk/ipc/DocumentChannel.cpp:300
1 xul.dll mozilla::net::nsLoadGroup::AddRequest netwerk/base/nsLoadGroup.cpp:435
2 xul.dll mozilla::net::DocumentChannelChild::AsyncOpen netwerk/ipc/DocumentChannelChild.cpp:75
3 xul.dll nsURILoader::OpenURI uriloader/base/nsURILoader.cpp:696
4 xul.dll nsDocShell::InternalLoad docshell/base/nsDocShell.cpp:9567
5 xul.dll nsDocShell::LoadURI docshell/base/nsDocShell.cpp:906
6 xul.dll mozilla::dom::LocationBase::SetURI dom/base/LocationBase.cpp:155
7 xul.dll mozilla::dom::LocationBase::SetHrefWithBase dom/base/LocationBase.cpp:234
8 xul.dll mozilla::dom::LocationBase::DoSetHref dom/base/LocationBase.cpp:189
9 xul.dll mozilla::dom::Location_Binding::set_href dom/bindings/LocationBinding.cpp:122

First (recent) report from Nightly build id 20210222214056.

Nika, might this be related to Bug 1682285?

Flags: needinfo?(nika)

I think this is unlikely to be related to that issue.

I'm not sure exactly how something like this could happen off the top of my head, and haven't had a ton of time to look into it yet. Perhaps :mattwoodrow might remember more context of why we could assume that we wouldn't have SetLoadFlags called on us in DocumentChannelChild?

Flags: needinfo?(nika) → needinfo?(matt.woodrow)

It's mostly because we start loads in the parent process, without going through the docshell first. That requires computing the load flags without the docshell/load-group, and having them be correct.

DocumentChannelChild still needs to have the load flags available for reading (but it'd be nice if it didn't), so we also compute them in the content process (using the same code that runs in the parent) and pass them to the ctor.

If something attempts to mutate the load flags though, the parent and content process would diverge on what they think the correct load flags are, and we probably have a bug.
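
For reference, here is a minimal sketch of the guard being described, assuming the member is called mLoadFlags and the crash macro is MOZ_CRASH_UNSAFE_PRINTF; this is the shape suggested by the crash reason above, not the exact mozilla-central code:

NS_IMETHODIMP DocumentChannel::SetLoadFlags(nsLoadFlags aLoadFlags) {
  // The flags were computed once (with the same code in parent and child) and
  // handed to the constructor; mutating them afterwards would let the two
  // processes diverge, so refuse loudly and report which bits differ.
  nsLoadFlags diff = mLoadFlags ^ aLoadFlags;
  if (diff) {
    MOZ_CRASH_UNSAFE_PRINTF(
        "DocumentChannel::SetLoadFlags: Don't set flags after creation "
        "(differing flags %x != %x)",
        mLoadFlags & diff, aLoadFlags & diff);
  }
  return NS_OK;
}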

Flags: needinfo?(matt.woodrow)

To expand a bit more, this is a pain because DocumentChannelChild isn't really an nsIChannel, it's a placeholder that might resolve to an nsIChannel (or might trigger a process switch and the teardown of the current doc shell).

Unfortunately, lots of code is depending on the nsIChannel interface being available, so DocumentChannelChild implements this, and then just asserts that most of the API isn't used.

Ideally we'd want a new interface/type to represent this placeholder (and what functionality it actually has) more clearly, and fix docshell/load group to understand having that instead of an nsIChannel for part of the loading process.

Bug 1617502 is open for part of the prerequisite work for doing that, though I don't think there's a general meta bug.

I think this should live in DOM: Navigation; please revert if this inference is incorrect.

Component: Networking → DOM: Navigation

P2 S2

Nika says this might be related to load flags when there's a mismatch between the load flags on the LoadGroup and the network request.

Severity: -- → S2
Priority: -- → P2
See Also: → 1617502
Whiteboard: [domcore-s2-revisit]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Bug 1743456 has a test case.

All of the wild crashes I've looked at so far are from loads triggered from the child, not the parent, so I'm not sure that things getting out of sync is a concern. It also happens in AsyncOpen before we send the constructor to the parent, so if we had to, we could send the flags to the parent. I'm not sure bug 1743456 is strictly the same issue, but it could be.

My guess, at this point, is that the problem is inheriting mLoadFlags & (LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE | VALIDATE_ALWAYS | VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER) from the load group, rather than the BrowsingContext default load flags changing after the channel is created. The default load flags on the load group should match the ones on the browsing context (it's unfortunate that we have both). But, in any case, the parent process doesn't have access to the load group, so parent-initiated loads wouldn't have access to those flags, and we really don't want different behavior for parent-initiated vs. child-initiated loads.
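
For context, this is roughly what that inheritance looks like on the load-group side (paraphrased from memory of nsLoadGroup::MergeLoadFlags in netwerk/base/nsLoadGroup.cpp, so treat the details as an approximation): when a request joins the group, the group's background/cache/validation bits get OR'ed into the request, and if that changes anything the group calls SetLoadFlags on the request, which is the call DocumentChannel refuses.

nsresult nsLoadGroup::MergeLoadFlags(nsIRequest* aRequest, nsLoadFlags& aOutFlags) {
  nsLoadFlags flags = 0;
  nsresult rv = aRequest->GetLoadFlags(&flags);
  if (NS_FAILED(rv)) {
    return rv;
  }
  nsLoadFlags oldFlags = flags;
  // Inherit the group's background/cache/validation policy.
  flags |= mLoadFlags & (LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE |
                         VALIDATE_ALWAYS | VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER);
  if (flags != oldFlags) {
    rv = aRequest->SetLoadFlags(flags);  // <- this is what trips the DocumentChannel guard
  }
  aOutFlags = flags;
  return rv;
}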

I'll look into this some more.

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Hsin-Yi, is there someone who can look into this?

Flags: needinfo?(htsai)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Exploring ...

Assignee: nobody → afarre
Status: NEW → ASSIGNED
Flags: needinfo?(htsai)

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Andreas, do you have cycles to investigate this Fenix topcrash?

Flags: needinfo?(afarre)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Flags: needinfo?(htsai)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Andreas is back from PTO this week. He is resuming the work here.

Flags: needinfo?(htsai)

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

These stacks all look rather strange. As far as I can tell they all come into nsDocShell::DoURILoad, where they call nsDocShell::OpenInitializedChannel. This is always called with a freshly created DocumentChannel. We calculate the nsLoadFlags just before creating the channel, so we should really be able to assume that they're correct at that point. Then we don't touch the load flags of the DocumentChannel until we've entered nsURILoader::OpenURI.

It is here, when we're calling DocumentChannelChild::AsyncOpen, that we assert while adding the DocumentChannelChild to the loadGroup in nsLoadGroup::AddRequest. Note that we're adding DocumentChannelChild to its own load group! That load group was set a few lines earlier as a result of calling nsURILoader::OpenChannel, which in turn extracted it from the nsDocShell that we're calling DoLoadURI on.

I'm not sure if it's nsLoadGroup::MergeDefaultLoadFlags or nsLoadGroup::MergeLoadFlags that asserts; that code is optimized out.

But as far as I can tell the problem is that the load group doesn't have the same load flags as the browsing context has. Is this a problem with how we compute the load flags in nsDocShellLoadState::CalculateChannelLoadFlags? We never suspend execution by returning to the main loop; this all happens linearly.
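
To make the suspected sequence concrete, here's a standalone illustration (my own sketch, not Gecko code; the 0x1000 value is simply the differing flag from the crash reports): the flags baked into the channel come from one snapshot of state, while nsLoadGroup::AddRequest later merges in whatever the load group holds, so if the two ever disagree we hit the guard even though everything runs linearly.

#include <cstdint>

using nsLoadFlags = uint32_t;
constexpr nsLoadFlags kDifferingFlag = 0x1000;  // VALIDATE_NEVER in the crash reports

struct FakeBrowsingContext { nsLoadFlags defaultLoadFlags = 0; };  // what DoURILoad snapshots
struct FakeLoadGroup { nsLoadFlags loadFlags = kDifferingFlag; };  // what AddRequest merges in

// Stand-in for nsDocShellLoadState::CalculateChannelLoadFlags().
nsLoadFlags CalculateChannelLoadFlags(const FakeBrowsingContext& aBC) {
  return aBC.defaultLoadFlags;
}

int main() {
  FakeBrowsingContext bc;
  FakeLoadGroup group;
  nsLoadFlags channelFlags = CalculateChannelLoadFlags(bc);  // baked into the DocumentChannel
  // nsLoadGroup::AddRequest would now try SetLoadFlags(channelFlags | 0x1000)
  // on a channel created with 0, which is exactly the crash we see.
  return (group.loadFlags & ~channelFlags) ? 1 : 0;
}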

This is in line with what :kmag found in comment 10.

I'm at a bit of a loss of how to resolve this. :kmag, :nika any ideas?

Flags: needinfo?(nika)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(afarre)
Depends on: 1849351
Depends on: 1852213
Duplicate of this bug: 1852039

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] → [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc]
No longer duplicate of this bug: 1852039

I doubt these offsets will be good for more than one build. I'm not sure if this is a new spike on Nightly or just bad symbols causing a new signature.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc] → [@ mozilla::net::DocumentChannel::SetLoadFlags]

The annotations in the first patch of Bug 1852213 were wrong, but a fix to that landed yesterday. Now we only need to wait for some reports with the annotation data available.

So the annotations say that it's LOAD_TRR_DISABLED_MODE that is the differing flag.

I wonder if this is the problem: that it somehow races with a load. I'll try to make the in-process updates happen synchronously to see if that makes any difference.

I guess that I'd really like to know what the offline discussion was, but it does feel like a problem of the load flags changing because of the captive portal.

Valentin, I don't know how the captive portal works; are there tests that I could look at/modify to reproduce this?

Nika, you reviewed the code that changed this; do you know how to resolve this?

Flags: needinfo?(valentin.gosu)

So when the captive portal checker detects we're in a captive portal, we display the captive portal banner to the user.
The user can click "Log into captive portal" on that banner, which opens a tab for the login with the disableTRR: true property set.
When that flag is set, we then pass LOAD_FLAGS_DISABLE_TRR into fixupAndLoadURIString.
Then we set it on the defaultLoadFlags here.

The tests we do have for the UI are here: https://searchfox.org/mozilla-central/source/browser/base/content/test/captivePortal
These two tests actually click on the button and open that tab: https://searchfox.org/mozilla-central/search?q=openCaptivePortalLoginTab&redirect=false

Hope this helps.

Flags: needinfo?(valentin.gosu)
See Also: → 1852039
Keywords: stalled
Whiteboard: [domcore-s2-revisit]
Flags: needinfo?(kmaglione+bmo)

I've sifted through the recent crashes under this signature and there are a few things that stand out and may help diagnose and fix this:

  • All recent crashes have the reason DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000). This means that mozilla::net::DocumentChannel::SetLoadFlags() was called with the nsIRequest::VALIDATE_NEVER flag, which had not previously been set (see the sketch after this list). This seems to happen only in a handful of places, and in this particular stack trace this is the most likely occurrence. There's also this on the same path, but it would require the user to have changed their prefs, which seems unlikely at this scale.
  • Looking at URLs it seems that this is happening mostly on webpages with videos. I've sifted through the URLs and couldn't find a single SFW one, but they come from a lot of different webpages, so it doesn't look site-specific.
  • Some comments mention browsing either back or forward causing the crash.
  • Last but not least, we have several crashes on file that appear to be coming from the same user, on the same page. It appears as if they tried to reload the same page (which crashed) and encountered more crashes. See these three crashes for example: 1, 2 and 3. This suggests that Firefox is in a state where browsing to a page will consistently crash it. Given that the places where nsIRequest::VALIDATE_NEVER is set depend on the cache, maybe it has to do with page caching (IIUC).
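
As a quick sanity check on the "0 != 1000" value, here is a small standalone decoder (my own helper, not part of the crash reporter; the bit positions are my reading of nsIRequest.idl, so treat them as assumptions):

#include <cstdint>
#include <cstdio>

int main() {
  struct { uint32_t bit; const char* name; } kFlags[] = {
      {1u << 9, "LOAD_BYPASS_CACHE"},
      {1u << 10, "LOAD_FROM_CACHE"},
      {1u << 11, "VALIDATE_ALWAYS"},
      {1u << 12, "VALIDATE_NEVER"},
      {1u << 13, "VALIDATE_ONCE_PER_SESSION"},
  };
  uint32_t oldFlags = 0x0, newFlags = 0x1000;  // the values printed in the crash reason
  uint32_t diff = oldFlags ^ newFlags;
  for (const auto& f : kFlags) {
    if (diff & f.bit) {
      std::printf("differing flag: %s (0x%x)\n", f.name, f.bit);
    }
  }
  return 0;
}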

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue but the test cases are also unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

Blocks: domino
Flags: needinfo?(afarre)
Keywords: pernosco

(In reply to Tyson Smith [:tsmith] from comment #38)

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue but the test cases are also unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

This is wonderful news! I'll dig right in! Thank you!

Flags: needinfo?(afarre)

:tsmith, you said that this was caught by fuzzers, albeit unreliably? Do you have something that reproduced it? It might be that reloading a history-navigated page that never did a push or replace on history is what triggers this. But there might also be a race in there, maybe a history load that doesn't finish.

Yes, but nothing has been reduced to the point where it would be useful for anything other than replaying at the moment. If/when we do get a test case reduced I will attach it to the bug.

We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.

(In reply to Gabriele Svelto [:gsvelto] from comment #42)

We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.

Yeah, it's definitely correct. The Pernosco recording that Tyson made (comment 39) actually shows this happening due to loads and reloads interleaving somehow. I'm trying to work through the recording to build a test case.

Clearing my old needinfo, as it looks like :farre is looking into this based on comment 43. Feel free to flag me again if you have more questions on this.

Flags: needinfo?(nika) → needinfo?(afarre)

I guess this is still stalled; I had trouble getting good information out of the Pernosco trace. One thing that could help unblock this is if Tyson has had any success in finding a minimized fuzzer testcase.

Flags: needinfo?(afarre) → needinfo?(twsmith)
Attached file testcase.html

Got one reduced! Make sure you serve it from a web server. Let it run for a few seconds. Easiest to just use Grizzly.

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework --upgrade
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>
Flags: needinfo?(twsmith)
Keywords: bugmon, testcase

Let me know if you have any trouble trying to reproduce the issue.

Flags: needinfo?(afarre)

Verified bug as reproducible on mozilla-central 20240628161338-30e83f26e442.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
End: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
BuildFlags: BuildFlags(asan=True, tsan=False, debug=False, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Whiteboard: [bugmon:bisected,confirmed]

I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch. If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.

Flags: needinfo?(afarre) → needinfo?(twsmith)

(In reply to Andreas Farre [:farre] from comment #49)

I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch.

I logged that (or a similar) rr issue upstream a few months ago. I am not seeing it with the latest code from git.

If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.

A Pernosco session is available here: https://pernos.co/debug/9i40MeOBmKsCp2reqfy5sg/index.html

Flags: needinfo?(twsmith) → needinfo?(afarre)

Thanks for catching this crash! Looks like this crash's volume spiked for Nightly 132.0a1 users after 2024-09-11, from fewer than 100 reports per day to over 500.

The MOZ_CRASH reason is "DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000)".

Flags: needinfo?(cpeterson)

(In reply to eclaudiu64 from comment #51)

I have this error in Fenix: https://crash-stats.mozilla.org/report/index/0751e415-6a1a-4b09-901f-61ae90240918, every time I go to this link: https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US

I was able to reproduce it. Thank you for the pointer.
It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #53)

It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.

This could be a good lead! I'm resurrecting the Pernosco session from comment 50 to see if that's the case there as well.

Flags: needinfo?(afarre)

So this feels very bisectable. Chris, I've tried my best building Fenix, but I've failed. Do you have pointers or someone that I can reach out to for help? Or someone that could bisect for me? It's very possible it turns out to be a highly used site and not an error from us, but it would be good to know that.

Flags: needinfo?(cpeterson)

(In reply to Andreas Farre [:farre] from comment #55)

So this feels very bisectable. Chris, I've tried my best building Fenix, but I've failed. Do you have pointers or someone that I can reach out to for help? Or someone that could bisect for me?

If you have ADB set up to talk to your Android device, you can bisect Fenix builds without building. Use mach mozregression --app fenix --arch arm64-v8a or, better yet since it might produce a narrower regression range, you can bisect GeckoViewExample builds using mach mozregression --app gve --arch aarch64.

If you still need to build Fenix, try these instructions and then ask in #mobile-android-team Slack channel or #Fenix Matrix room:

https://firefox-source-docs.mozilla.org/mobile/android/fenix.html

Flags: needinfo?(cpeterson) → needinfo?(afarre)

Yeah, mozregression on fenix/emulator gives:

 6:06.61 INFO: Newest known good nightly: 2024-09-10
 6:06.61 INFO: Oldest known bad nightly: 2024-09-11

which is exactly what crash-stats says, but doesn't tell us much more. My STR were:

  1. Open https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US
  2. Open new tab go to some page
  3. Close tab from 1 and press undo

and this crashes. It's a bit harder to get this to trigger with gve, mainly because it's harder to get into a situation where a restore would trigger it.

There were 2 2024-09-11 Fenix nightly builds, so it's a bit unfortunate that mozregression doesn't give a build ID to identify which one it was testing. That said, here's a pushlog range showing both builds for that day.
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ebf0e33ba93e&tochange=606085de1f6c

I've done some bisecting locally to catch this crash. Even though I could not see the details of the native crash I was experiencing with the STR that Andreas provided (c57), I could spot a potential regressor: https://hg.mozilla.org/mozilla-central/rev/710d5d0be43889b6e5acdc54112fa9d93dc971cc
(as I am not sure if that's the same crash due to the lack of crash details - a confirmation would be good)

Right, I've found the spike for Android, and it's the second patch of Bug 1911977.

Emilio, do you know why this is happening?

Flags: needinfo?(afarre) → needinfo?(emilio)

Filed a bug with a patch for reverting bug 1911977. See bug 1921972.

See Also: → 1921972

Not off-hand, but I don't think we should just blindly revert. Let's investigate it in bug 1921972.

Flags: needinfo?(emilio)
See Also: → 1911977
Duplicate of this bug: 1918275

This was fixed by Bug 1921972. The testcase attached to this bug reproduces immediately with grizzly without the patch in Bug 1921972, but not in 50 runs with the patch. Also, the crash table in Bug 1921972 is green. I'm closing this, but if it comes back we can re-open.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled

Bug appears to be fixed on mozilla-central 20241010092233-541f36a6b102 but BugMon was unable to find a usable build for a5369079cd89dd91a42689da9aad1fc607ef9146.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon
Target Milestone: --- → 133 Branch
Assignee: afarre → emilio
Depends on: 1921972
