Open Bug 1694904 Opened 3 years ago Updated 3 months ago

Crash in [@ mozilla::net::DocumentChannel::SetLoadFlags] with reason DocumentChannel::SetLoadFlags: Don't set flags after creation

Categories

(Core :: DOM: Navigation, defect, P2)

Unspecified
Windows
defect

Tracking


ASSIGNED
Tracking Status
firefox-esr115 --- affected
firefox120 --- affected
firefox121 --- affected
firefox122 --- affected

People

(Reporter: sg, Assigned: farre, NeedInfo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(4 keywords)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/357266c6-b17d-434c-b107-1b5670210224

MOZ_CRASH Reason: MOZ_CRASH(DocumentChannel::SetLoadFlags: Don't set flags after creation)

Top 10 frames of crashing thread:

0 xul.dll mozilla::net::DocumentChannel::SetLoadFlags netwerk/ipc/DocumentChannel.cpp:300
1 xul.dll mozilla::net::nsLoadGroup::AddRequest netwerk/base/nsLoadGroup.cpp:435
2 xul.dll mozilla::net::DocumentChannelChild::AsyncOpen netwerk/ipc/DocumentChannelChild.cpp:75
3 xul.dll nsURILoader::OpenURI uriloader/base/nsURILoader.cpp:696
4 xul.dll nsDocShell::InternalLoad docshell/base/nsDocShell.cpp:9567
5 xul.dll nsDocShell::LoadURI docshell/base/nsDocShell.cpp:906
6 xul.dll mozilla::dom::LocationBase::SetURI dom/base/LocationBase.cpp:155
7 xul.dll mozilla::dom::LocationBase::SetHrefWithBase dom/base/LocationBase.cpp:234
8 xul.dll mozilla::dom::LocationBase::DoSetHref dom/base/LocationBase.cpp:189
9 xul.dll mozilla::dom::Location_Binding::set_href dom/bindings/LocationBinding.cpp:122

First (recent) report from Nightly build id 20210222214056.

Nika, might this be related to Bug 1682285?

Flags: needinfo?(nika)

I think this is unlikely to be related to that issue.

I'm not sure off the top of my head exactly how something like this could happen, and I haven't had a ton of time to look into it yet. Perhaps :mattwoodrow remembers more context on why we assumed SetLoadFlags wouldn't be called on us in DocumentChannelChild?

Flags: needinfo?(nika) → needinfo?(matt.woodrow)

It's mostly because we start loads in the parent process, without going through the docshell first. That requires computing the load flags without the docshell/load-group, and having them be correct.

DocumentChannelChild still needs to have the load flags available for reading (but it'd be nice if it didn't), so we also compute them in the content process (using the same code that runs in the parent) and pass them to the ctor.

If something attempts to mutate the load flags, though, the parent and content processes would diverge on what they think the correct load flags are, and we probably have a bug.
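
To make that concrete, the guard that fires is roughly this (a sketch from memory, not the exact code in netwerk/ipc/DocumentChannel.cpp; the crash string matches the reports, but the macro and format details are approximations):

  // Sketch only: the flags are computed once (in both processes, with the
  // same code) and handed to the constructor, so any later mutation is
  // treated as a bug.
  NS_IMETHODIMP DocumentChannel::SetLoadFlags(nsLoadFlags aLoadFlags) {
    uint32_t diff = mLoadFlags ^ aLoadFlags;
    if (diff) {
      MOZ_CRASH_UNSAFE_PRINTF(
          "DocumentChannel::SetLoadFlags: Don't set flags after creation "
          "(differing flags %x != %x)",
          uint32_t(mLoadFlags & diff), uint32_t(aLoadFlags & diff));
    }
    return NS_OK;
  }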

Flags: needinfo?(matt.woodrow)

To expand a bit more, this is painful because DocumentChannelChild isn't really an nsIChannel; it's a placeholder that might resolve to an nsIChannel (or might trigger a process switch and the teardown of the current docshell).

Unfortunately, lots of code depends on the nsIChannel interface being available, so DocumentChannelChild implements it and then just asserts that most of the API isn't used.

Ideally we'd want a new interface/type that represents this placeholder (and the functionality it actually has) more clearly, and we'd fix the docshell/load group to accept that instead of an nsIChannel for part of the loading process.
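
As an illustration of that pattern (hand-written, not copied from the tree; Suspend and Open are real nsIRequest/nsIChannel members, but exactly which methods are stubbed this way, and how, is from memory):

  // Illustration of the placeholder pattern: most of the nsIChannel surface
  // either flags unexpected use or reports "not implemented".
  NS_IMETHODIMP DocumentChannel::Suspend() {
    MOZ_ASSERT_UNREACHABLE("Should not suspend a DocumentChannel placeholder");
    return NS_ERROR_NOT_IMPLEMENTED;
  }

  NS_IMETHODIMP DocumentChannel::Open(nsIInputStream** aStream) {
    // A synchronous open makes no sense for a channel that may never resolve
    // to a real nsIChannel (it might turn into a process switch instead).
    return NS_ERROR_NOT_IMPLEMENTED;
  }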

Bug 1617502 is open for part of the prerequisite work for doing that, though I don't think there's a general meta bug.

I think this should live in DOM: Navigation; please revert if this inference is incorrect.

Component: Networking → DOM: Navigation

Nika says this might be related to load flags when there is a mismatch between the load flags on the LoadGroup and the network request.

P2 S2

Severity: -- → S2
Priority: -- → P2
See Also: → 1617502
Whiteboard: [domcore-s2-revisit]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Bug 1743456 has a test case.

All of the wild crashes I've looked at so far are from loads triggered from the child, not the parent, so I'm not sure that things getting out of sync is a concern. It also happens in AsyncOpen before we send the constructor to the parent, so if we had to, we could send the flags to the parent. I'm not sure bug 1743456 is strictly the same issue, but it could be.

My guess, at this point, is that the problem is inheriting (mLoadFlags & (LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE | VALIDATE_ALWAYS | VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER)) from the load group, rather than the BrowsingContext default load flags changing after the channel is created. The default load flags on the load group should match the ones on the browsing context (it's unfortunate that we have both). But, in any case, the parent process doesn't have access to the load group, so parent-initiated loads wouldn't have access to those flags, and we really don't want different behavior for parent-initiated vs child-initiated loads.
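
For reference, the inheritance I mean happens when the request joins the load group, roughly like this (paraphrasing nsLoadGroup::MergeLoadFlags from memory, so treat the details as a sketch rather than the exact code):

  // Sketch of the merge step on the nsLoadGroup::AddRequest path (frame 1 in
  // the crash stack): the group ORs a subset of its own flags into the new
  // request, and if that changes anything it calls SetLoadFlags on the
  // request, which is where DocumentChannel crashes.
  nsresult nsLoadGroup::MergeLoadFlags(nsIRequest* aRequest,
                                       nsLoadFlags& aOutFlags) {
    nsLoadFlags flags = 0;
    nsresult rv = aRequest->GetLoadFlags(&flags);
    if (NS_FAILED(rv)) {
      return rv;
    }
    nsLoadFlags oldFlags = flags;
    flags |= mLoadFlags & (nsIRequest::LOAD_BACKGROUND |
                           nsIRequest::LOAD_BYPASS_CACHE |
                           nsIRequest::LOAD_FROM_CACHE |
                           nsIRequest::VALIDATE_ALWAYS |
                           nsIRequest::VALIDATE_ONCE_PER_SESSION |
                           nsIRequest::VALIDATE_NEVER);
    if (flags != oldFlags) {
      rv = aRequest->SetLoadFlags(flags);  // triggers the MOZ_CRASH above
    }
    aOutFlags = flags;
    return rv;
  }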

I'll look into this some more.

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Hsin-Yi, is there someone who can look into this?

Flags: needinfo?(htsai)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Exploring ...

Assignee: nobody → afarre
Status: NEW → ASSIGNED
Flags: needinfo?(htsai)

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Andreas, do you have cycles to investigate this Fenix topcrash?

Flags: needinfo?(afarre)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Flags: needinfo?(htsai)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Andreas is back from PTO this week. He is resuming the work here.

Flags: needinfo?(htsai)

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

These stacks all look pretty strange. As far as I can tell they all come into nsDocShell::DoURILoad, where they call nsDocShell::OpenInitializedChannel. This is always called with a freshly created DocumentChannel. We calculate the nsLoadFlags just before creating the channel, so we should really be able to assume that they're correct at that point. Then we don't touch the load flags of the DocumentChannel until we've entered nsURILoader::OpenURI.

It is here, when we call DocumentChannelChild::AsyncOpen, that we assert while adding the DocumentChannelChild to the load group in nsLoadGroup::AddRequest. Note that we're adding DocumentChannelChild to its own load group! That load group was set a few lines earlier as a result of calling nsURILoader::OpenChannel, and it in turn is extracted from the nsDocShell that we're calling DoLoadURI on.

I'm not sure if it's nsLoadGroup::MergeDefaultLoadFlags or nsLoadGroup::MergeLoadFlags that asserts. That's optimized out.
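
For completeness, my understanding of the other candidate is roughly the following (again a sketch from memory, not the exact nsLoadGroup.cpp code): instead of the fixed subset that MergeLoadFlags uses, MergeDefaultLoadFlags folds the group's default load flags into the request.

  // Sketch of nsLoadGroup::MergeDefaultLoadFlags. If the group's defaults
  // changed after the DocumentChannel captured its flags, this would also
  // end up in DocumentChannel::SetLoadFlags and hit the same guard.
  nsresult nsLoadGroup::MergeDefaultLoadFlags(nsIRequest* aRequest,
                                              nsLoadFlags& aOutFlags) {
    nsLoadFlags flags = 0;
    nsresult rv = aRequest->GetLoadFlags(&flags);
    if (NS_FAILED(rv)) {
      return rv;
    }
    nsLoadFlags oldFlags = flags;
    flags |= mDefaultLoadFlags;
    if (flags != oldFlags) {
      rv = aRequest->SetLoadFlags(flags);
    }
    aOutFlags = flags;
    return rv;
  }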

But as far as I can tell, the problem is that the load group doesn't have the same load flags as the browsing context. Is this a problem with how we compute the load flags in nsDocShellLoadState::CalculateChannelLoadFlags? We never suspend execution by returning to the main loop; this all happens linearly.

This is in line with what :kmag found in comment 10.

I'm at a bit of a loss as to how to resolve this. :kmag, :nika, any ideas?

Flags: needinfo?(nika)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(afarre)
Depends on: 1849351
Depends on: 1852213
Duplicate of this bug: 1852039

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] → [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc]
No longer duplicate of this bug: 1852039

I doubt these offsets will be good for more than one build. I'm not sure if this is a new spike on Nightly or just bad symbols causing a new signature.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc] → [@ mozilla::net::DocumentChannel::SetLoadFlags]

The annotations in the first patch of Bug 1852213 were wrong, but a fix to that landed yesterday. Now we only need to wait for some reports with the annotation data available.

So the annotations say that LOAD_TRR_DISABLED_MODE is the differing flag.

I wonder if this is the problem: that it somehow races with a load. I'll try to make the in-process updates happen synchronously to see if that makes any difference.

I guess I'd really like to know what the offline discussion was, but it does feel like a problem with the load flags changing because of the captive portal.

Valentin, I don't know how the captive portal works; are there tests that I could look at/modify to reproduce this?

Nika, you reviewed the code that changed this; do you know how to resolve it?

Flags: needinfo?(valentin.gosu)

So when the captive portal checker detects we're in a captive portal, we display the captive portal banner to the user.
The user can click "Log into captive portal" on that banner, which opens a tab for the login with the disableTRR: true property set.
When that flag is set, we then pass LOAD_FLAGS_DISABLE_TRR into fixupAndLoadURIString.
Then we set it on the defaultLoadFlags here.
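
In channel terms, the end result is roughly the following sketch (the real code is JS; the flag name comes from the crash annotations mentioned earlier, and whether the relevant defaults live on the load group, the browsing context, or both is exactly the split kmag pointed out in comment 10):

  // Sketch of the effect, not the actual captive-portal code: once the
  // defaults change, any DocumentChannel created with the old flags
  // disagrees with its group, and the next merge will call SetLoadFlags.
  void DisableTrrOnLoadGroup(nsILoadGroup* aLoadGroup) {
    nsLoadFlags defaults = 0;
    aLoadGroup->GetDefaultLoadFlags(&defaults);
    // Assumption: LOAD_FLAGS_DISABLE_TRR ultimately translates into the
    // TRR-disabling request flag on the defaults.
    aLoadGroup->SetDefaultLoadFlags(defaults |
                                    nsIRequest::LOAD_TRR_DISABLED_MODE);
  }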

The tests we do have for the UI are here: https://searchfox.org/mozilla-central/source/browser/base/content/test/captivePortal
These two tests actually click on the button and open that tab: https://searchfox.org/mozilla-central/search?q=openCaptivePortalLoginTab&redirect=false

Hope this helps.

Flags: needinfo?(valentin.gosu)
See Also: → 1852039
Keywords: stalled
Whiteboard: [domcore-s2-revisit]
Flags: needinfo?(kmaglione+bmo)

I've sifted through the recent crashes under this signature, and there are a few things that stand out that might help diagnose and fix this:

  • All recent crashes have the reason DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000). This means that mozilla::net::DocumentChannel::SetLoadFlags() was called with the nsIRequest::VALIDATE_NEVER flag, which had not previously been set (see the short flag-decoding check after this list). This seems to happen only in a handful of places, and in this particular stack trace this looks like the most likely occurrence. There's also this on the same path, but it would require the user to have changed their prefs, which seems unlikely at this scale.
  • Looking at the URLs, it seems that this is happening mostly on webpages with videos. I've sifted through the URLs and couldn't find a single SFW one, but they come from a lot of different webpages, so it doesn't look site-specific.
  • Some comments mention browsing either back or forward causing the crash.
  • Last but not least, we have several crashes on file that appear to be coming from the same user, on the same page. It appears as if they tried to reload the same page (which crashed) and encountered more crashes. See these three crashes for example: 1, 2 and 3. This suggests that Firefox is in a state where browsing to a page will consistently crash it. Given that the places where nsIRequest::VALIDATE_NEVER is set depend on the cache, maybe it has to do with page caching (IIUC).
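
Here is the quick standalone check behind the flag decoding in the first bullet (the numeric value of VALIDATE_NEVER is taken from nsIRequest.idl as I remember it, so verify it against the tree):

  // The crash reason prints the differing bits in hex as "0 != 1000": the
  // stored flags lack a bit that the incoming flags carry, and that bit,
  // 0x1000, is VALIDATE_NEVER (1 << 12) if my reading of nsIRequest.idl
  // is right.
  #include <cstdio>

  int main() {
    const unsigned kValidateNever = 1u << 12;  // 0x1000
    std::printf("VALIDATE_NEVER = 0x%x\n", kValidateNever);
    return 0;
  }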

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue, but the test cases are unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

Blocks: domino
Flags: needinfo?(afarre)
Keywords: pernosco

(In reply to Tyson Smith [:tsmith] from comment #38)

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue but the test cases are also unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

This is wonderful news! I'll dig right in! Thank you!

Flags: needinfo?(afarre)

:tsmith, you said that this was caught by fuzzers, albeit unreliably? Do you have something that reproduces it? It might be that reloading a history-navigated page that never did a push or replace on history is what triggers this. But there might also be a race in there, maybe a history load that doesn't finish.

Yes, but nothing has been reduced to the point where it would be useful for anything other than replaying at the moment. If/when we do get a reduced test case, I will attach it to the bug.

We're seeing a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs, but just looking at the stacks it seems like my analysis in comment 37 still stands.

(In reply to Gabriele Svelto [:gsvelto] from comment #42)

We're seeing a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs, but just looking at the stacks it seems like my analysis in comment 37 still stands.

Yeah, it's definitely correct. The Pernosco recording that Tyson made (comment 39) actually shows this happening due to loads and reloads interleaving somehow. I'm trying to work through the recording to build a test case.
