Crash in [@ mozilla::net::DocumentChannel::SetLoadFlags] with reason DocumentChannel::SetLoadFlags: Don't set flags after creation
Categories
(Core :: DOM: Navigation, defect, P2)
People
(Reporter: sg, Assigned: emilio)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(4 keywords, Whiteboard: [bugmon:bisected,confirmed])
Crash Data
Attachments
(1 file: 773 bytes, text/html)
Crash report: https://crash-stats.mozilla.org/report/index/357266c6-b17d-434c-b107-1b5670210224
MOZ_CRASH Reason: MOZ_CRASH(DocumentChannel::SetLoadFlags: Don't set flags after creation)
Top 10 frames of crashing thread:
0 xul.dll mozilla::net::DocumentChannel::SetLoadFlags netwerk/ipc/DocumentChannel.cpp:300
1 xul.dll mozilla::net::nsLoadGroup::AddRequest netwerk/base/nsLoadGroup.cpp:435
2 xul.dll mozilla::net::DocumentChannelChild::AsyncOpen netwerk/ipc/DocumentChannelChild.cpp:75
3 xul.dll nsURILoader::OpenURI uriloader/base/nsURILoader.cpp:696
4 xul.dll nsDocShell::InternalLoad docshell/base/nsDocShell.cpp:9567
5 xul.dll nsDocShell::LoadURI docshell/base/nsDocShell.cpp:906
6 xul.dll mozilla::dom::LocationBase::SetURI dom/base/LocationBase.cpp:155
7 xul.dll mozilla::dom::LocationBase::SetHrefWithBase dom/base/LocationBase.cpp:234
8 xul.dll mozilla::dom::LocationBase::DoSetHref dom/base/LocationBase.cpp:189
9 xul.dll mozilla::dom::Location_Binding::set_href dom/bindings/LocationBinding.cpp:122
First (recent) report from Nightly build id 20210222214056.
Comment 1•4 years ago (Reporter)
Nika, might this be related to Bug 1682285?
Comment 2•4 years ago
I think this is unlikely to be related to that issue.
I'm not sure exactly how something like this could happen off the top of my head, and I haven't had a ton of time to look into it yet. Perhaps :mattwoodrow might remember more context on why we assumed that SetLoadFlags wouldn't be called on us in DocumentChannelChild?
Comment 3•4 years ago
It's mostly because we start loads in the parent process, without going through the docshell first. That requires computing the load flags without the docshell/load-group, and having them be correct.
DocumentChannelChild still needs to have the load flags available for reading (though it would be nice if it didn't), so we also compute them in the content process (using the same code that runs in the parent) and pass them to the ctor.
If something attempts to mutate the load flags, though, the parent and content processes would diverge on what they think the correct load flags are, and we probably have a bug.
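A minimal sketch of the contract described above, with simplified names (this is not the actual Gecko implementation): the load flags are computed up front by the same code in both processes and handed to the constructor, and any later SetLoadFlags() call with a different value is treated as a hard error, since the parent would never learn about the change.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

using nsLoadFlags = std::uint32_t;

class DocumentChannelSketch {
 public:
  // The flags are computed once (identically in parent and child) and are
  // fixed for the lifetime of the placeholder channel.
  explicit DocumentChannelSketch(nsLoadFlags aFlags) : mLoadFlags(aFlags) {}

  nsLoadFlags GetLoadFlags() const { return mLoadFlags; }

  void SetLoadFlags(nsLoadFlags aFlags) {
    if (aFlags != mLoadFlags) {
      // The real crash message also prints which bits differ; the exact
      // diagnostic format is omitted in this sketch.
      std::fprintf(stderr,
                   "DocumentChannel::SetLoadFlags: "
                   "Don't set flags after creation\n");
      std::abort();
    }
  }

 private:
  const nsLoadFlags mLoadFlags;  // fixed at construction time
};
```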
Comment 4•4 years ago
To expand a bit more, this is a pain because DocumentChannelChild isn't really an nsIChannel; it's a placeholder that might resolve to an nsIChannel (or might trigger a process switch and the teardown of the current docshell).
Unfortunately, lots of code is depending on the nsIChannel interface being available, so DocumentChannelChild implements this, and then just asserts that most of the API isn't used.
Ideally we'd want a new interface/type to represent this placeholder (and what functionality it actually has) more clearly, and fix docshell/load group to understand having that instead of an nsIChannel for part of the loading process.
Bug 1617502 is open for part of the prerequisite work for doing that, though I don't think there's a general meta bug.
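A sketch of the placeholder pattern described above, with made-up type and method names: the object exposes the small amount of state that docshell/load-group code legitimately reads while the real load is resolved in the parent, and asserts on the rest of the (stand-in) channel API so misuse is caught early. Which concrete nsIChannel methods DocumentChannelChild actually supports is not claimed here; the methods below are purely illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a small slice of the nsIChannel surface.
struct ChannelLike {
  virtual ~ChannelLike() = default;
  virtual std::uint32_t GetLoadFlags() const = 0;
  virtual void SetContentType(const char* aType) = 0;
  virtual void Open() = 0;
};

class PlaceholderDocumentChannel final : public ChannelLike {
 public:
  explicit PlaceholderDocumentChannel(std::uint32_t aLoadFlags)
      : mLoadFlags(aLoadFlags) {}

  // Readable state that callers genuinely need while the placeholder exists.
  std::uint32_t GetLoadFlags() const override { return mLoadFlags; }

  // The placeholder may be swapped for a real channel or torn down entirely,
  // so most of the "channel" API is not meaningfully implementable and just
  // asserts if something calls it.
  void SetContentType(const char*) override {
    assert(false && "not supported on the placeholder");
  }
  void Open() override {
    assert(false && "not supported on the placeholder");
  }

 private:
  const std::uint32_t mLoadFlags;
};
```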
Comment 5•4 years ago
I think this should live in DOM: Navigation; please revert if this inference is incorrect.
Comment 6•4 years ago
P2 S2
Mismatched load flags on the LoadGroup and the network request.
Nika says this might be related to load flags when:
- bypassing TRR for captive portal detection: https://searchfox.org/mozilla-central/rev/36f79bed679ad7ec46f7cd05868a8f6dc823e1be/docshell/base/nsDocShell.cpp#820-827
- bypassing cache in DevTools' network panel
- or overriding the loading behavior of the background thumbnail process.
Updated•4 years ago
Comment 8•3 years ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 5 desktop browser crashes on Mac on beta
For more information, please visit auto_nag documentation.
Comment 9•3 years ago
Bug 1743456 has a test case.
Comment 10•3 years ago
All of the wild crashes I've looked at so far are from loads triggered from the child, not the parent, so I'm not sure that things getting out of sync is a concern. It also happens in AsyncOpen before we send the constructor to the parent, so if we had to, we could send the flags to the parent. I'm not sure bug 1743456 is strictly the same issue, but it could be.
My guess, at this point, is that the problem is inheriting (mLoadFlags & (LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE | VALIDATE_ALWAYS | VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER)) from the load group, rather than the BrowsingContext default load flags changing after the channel is created. The default load flags on the load group should match the ones on the browsing context (it's unfortunate that we have both). But, in any case, the parent process doesn't have access to the load group, so parent-initiated loads wouldn't have access to those flags, and we really don't want different behavior for parent-initiated vs child-initiated loads.
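A rough sketch of the inheritance step being suspected here, using the mask quoted above; the helper name is made up and the constant values are illustrative (the real definitions live in nsIRequest.idl):

```cpp
#include <cstdint>

using nsLoadFlags = std::uint32_t;

// Illustrative values only; see nsIRequest.idl for the real constants.
constexpr nsLoadFlags LOAD_BACKGROUND = 1u << 0;
constexpr nsLoadFlags LOAD_BYPASS_CACHE = 1u << 9;
constexpr nsLoadFlags LOAD_FROM_CACHE = 1u << 10;
constexpr nsLoadFlags VALIDATE_ALWAYS = 1u << 11;
constexpr nsLoadFlags VALIDATE_NEVER = 1u << 12;
constexpr nsLoadFlags VALIDATE_ONCE_PER_SESSION = 1u << 13;

// The mask quoted in the comment above: cache/validation bits a request
// inherits from its load group.
constexpr nsLoadFlags kInheritedFromGroup =
    LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE | VALIDATE_ALWAYS |
    VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER;

// If the group's flags carry e.g. VALIDATE_NEVER but the request's flags
// don't, the merged value differs from what the DocumentChannel was
// constructed with, and a subsequent SetLoadFlags() call trips the crash.
constexpr nsLoadFlags MergeFromGroup(nsLoadFlags aRequestFlags,
                                     nsLoadFlags aGroupFlags) {
  return aRequestFlags | (aGroupFlags & kInheritedFromGroup);
}

static_assert(MergeFromGroup(0, VALIDATE_NEVER) == VALIDATE_NEVER,
              "the differing bit comes straight from the group");
```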
I'll look into this some more.
Comment hidden (obsolete)
Comment hidden (obsolete)
Comment hidden (obsolete)
Comment 14•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit auto_nag documentation.
Comment 15•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit auto_nag documentation.
Comment 17•2 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)
> Hsin-Yi, is there someone who can look into this?
Exploring ...
Updated•2 years ago
Updated•2 years ago
Comment 18•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on nightly
For more information, please visit BugBot documentation.
Comment 19•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 20•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on nightly
For more information, please visit BugBot documentation.
Comment 21•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 22•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on nightly
For more information, please visit BugBot documentation.
Comment 23•2 years ago
Andreas, do you have cycles to investigate this Fenix topcrash?
Comment 24•2 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)
> Hsin-Yi, is there someone who can look into this?
Any news on this? Still a Fenix topcrash.
Comment 25•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 26•2 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)
> (In reply to Ryan VanderMeulen [:RyanVM] from comment #16)
> > Hsin-Yi, is there someone who can look into this?
> Any news on this? Still a Fenix topcrash.
Andreas is back from PTO this week. He is resuming the work here.
Comment 27•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on nightly
For more information, please visit BugBot documentation.
Comment 28•2 years ago
These stacks all look mighty strange. As far as I can tell they all come into nsDocShell::DoURILoad, where they call nsDocShell::OpenInitializedChannel. This is always called with a freshly created DocumentChannel. We calculate the nsLoadFlags just before creating the channel, so we should really be able to assume that they're correct at that point. Then we don't touch the load flags of the DocumentChannel until we've entered nsURILoader::OpenURI.
It is here, when we call DocumentChannelChild::AsyncOpen, that we assert while adding the DocumentChannelChild to the load group in nsLoadGroup::AddRequest. Note that we're adding the DocumentChannelChild to its own load group! That load group was set a few lines earlier as a result of calling nsURILoader::OpenChannel, and is in turn extracted from the nsDocShell that we're calling DoLoadURI on.
I'm not sure whether it's nsLoadGroup::MergeDefaultLoadFlags or nsLoadGroup::MergeLoadFlags that asserts; that's optimized out.
But as far as I can tell the problem is that the load group doesn't have the same load flags as the browsing context. Is this a problem with how we compute the load flags in nsDocShellLoadState::CalculateChannelLoadFlags? We never suspend execution by returning to the main loop; this all happens linearly.
This is in line with what :kmag found in comment 10.
I'm at a bit of a loss as to how to resolve this. :kmag, :nika, any ideas?
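For reference, a minimal sketch of the two merge steps named above, assuming (based on the crash stack, AddRequest → SetLoadFlags) that nsLoadGroup::AddRequest ends up pushing merged flags onto the request via SetLoadFlags(). Everything here is a simplified stand-in, not the real nsLoadGroup code, and the order and exact behavior of the two merges is an assumption.

```cpp
#include <cstdint>

using nsLoadFlags = std::uint32_t;

struct RequestSketch {
  nsLoadFlags flags = 0;
  // For a DocumentChannel, this is the call that aborts when the new value
  // differs from the flags it was constructed with.
  void SetLoadFlags(nsLoadFlags aFlags) { flags = aFlags; }
};

struct LoadGroupSketch {
  nsLoadFlags mDefaultLoadFlags = 0;  // e.g. derived from the docshell/BrowsingContext
  nsLoadFlags mLoadFlags = 0;         // the group's own request flags

  // "MergeDefaultLoadFlags": OR the group's default flags into the request.
  void MergeDefaultLoadFlags(RequestSketch& aRequest) const {
    nsLoadFlags merged = aRequest.flags | mDefaultLoadFlags;
    if (merged != aRequest.flags) {
      aRequest.SetLoadFlags(merged);
    }
  }

  // "MergeLoadFlags": inherit selected cache/validation bits from the group's
  // own flags (see the mask quoted in comment 10).
  void MergeLoadFlags(RequestSketch& aRequest, nsLoadFlags aInheritMask) const {
    nsLoadFlags merged = aRequest.flags | (mLoadFlags & aInheritMask);
    if (merged != aRequest.flags) {
      aRequest.SetLoadFlags(merged);
    }
  }

  // AddRequest: either merge can end up calling SetLoadFlags with a value the
  // DocumentChannel was not constructed with.
  void AddRequest(RequestSketch& aRequest, nsLoadFlags aInheritMask) const {
    MergeLoadFlags(aRequest, aInheritMask);
    MergeDefaultLoadFlags(aRequest);
  }
};
```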
Comment 30•2 years ago
Copying crash signatures from duplicate bugs.
Comment 31•2 years ago
I doubt these offsets will be good for more than one build. I'm not sure if this is a new spike on Nightly or just bad symbols causing a new signature.
Comment 32•2 years ago
The annotations in the first patch of Bug 1852213 were wrong, but a fix to that landed yesterday. Now we only need to wait for some reports with the annotation data available.
Comment 33•2 years ago
So the annotations say that it's LOAD_TRR_DISABLED_MODE that is the differing flag.
Comment 34•2 years ago
I wonder if this is the problem: that it somehow races with a load. I'll try to make the in-process updates happen synchronously to see if that makes any difference.
Comment 35•2 years ago
I guess I'd really like to know what the offline discussion was, but it does feel like a case of the load flags changing because of the captive portal.
Valentin, I don't know how the captive portal works; are there tests that I could look at/modify to reproduce this?
Nika, you reviewed the code that changed this; do you know how to resolve it?
Comment 36•2 years ago
So when the captive portal checker detects we're in a captive portal, we display the captive portal banner to the user.
The user can click "Log into captive portal" on that banner, which opens a tab for the login with the disableTRR: true property set.
When that flag is set, we then pass LOAD_FLAGS_DISABLE_TRR into fixupAndLoadURIString.
Then we set it on the defaultLoadFlags here.
The tests we do have for the UI are here: https://searchfox.org/mozilla-central/source/browser/base/content/test/captivePortal
These two tests actually click on the button and open that tab: https://searchfox.org/mozilla-central/search?q=openCaptivePortalLoginTab&redirect=false
Hope this helps.
Updated•2 years ago
Comment 37•2 years ago
I've sifted through the recent crashes under this signature and there are a few things that stand out and could maybe help diagnose and fix this:
- All recent crashes have the reason DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000). This means that mozilla::net::DocumentChannel::SetLoadFlags() was called with the nsIRequest::VALIDATE_NEVER flag, which had not previously been set (see the decoding sketch after this list). This seems to happen only in a handful of places, and in this particular stack trace this seems to be the most likely occurrence. There's also this on the same path, but it would require the user to have changed their prefs, which seems unlikely at this scale.
- Looking at URLs, it seems that this is happening mostly on webpages with videos. I've sifted through the URLs and couldn't find a single SFW one, but they come from a lot of different webpages, so it doesn't look site-specific.
- Some comments mention browsing either back or forward causing the crash.
- Last but not least, we have several crashes on file that appear to be coming from the same user, on the same page. It appears as if they tried to reload the same page (which crashed) and encountered more crashes. See these three crashes for example: 1, 2 and 3. This suggests that Firefox is in a state where browsing to a page will consistently crash it. Given that the instances where nsIRequest::VALIDATE_NEVER is set depend on the cache, maybe this has to do with page caching (IIUC).
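A small, self-contained helper for decoding the "differing flags X != Y" values from the crash reason, assuming they are hex bitmasks describing which of the differing bits were set on the old vs. the new flags; the attribution of 0x1000 (bit 12) to nsIRequest::VALIDATE_NEVER is taken from the analysis above.

```cpp
#include <cstdio>

int main() {
  // Values from the crash reason "differing flags 0 != 1000" (assumed hex).
  const unsigned oldDiff = 0x0;     // differing bits set on the channel's existing flags
  const unsigned newDiff = 0x1000;  // differing bits set on the incoming flags

  for (unsigned bit = 0; bit < 32; ++bit) {
    const unsigned mask = 1u << bit;
    if ((oldDiff | newDiff) & mask) {
      std::printf("bit %u (0x%x) differs: old=%u new=%u\n", bit, mask,
                  (oldDiff & mask) ? 1u : 0u, (newDiff & mask) ? 1u : 0u);
    }
  }
  // Prints: bit 12 (0x1000) differs: old=0 new=1  -> nsIRequest::VALIDATE_NEVER
  return 0;
}
```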
Comment 38•2 years ago
A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html
The fuzzers are also reporting this issue, but the test cases are unreliable. I did manage to get a Pernosco session.
Andreas: does this provide enough information to make progress?
Updated•2 years ago
Comment 39•2 years ago
(In reply to Tyson Smith [:tsmith] from comment #38)
> A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html
> The fuzzers are also reporting this issue, but the test cases are unreliable. I did manage to get a Pernosco session.
> Andreas: does this provide enough information to make progress?
This is wonderful news! I'll dig right in! Thank you!
Comment 40•2 years ago
:tsmith, you said that this was caught by fuzzers, albeit unreliably? Do you have something that reproduces it? It might be that reloading a history-navigated page that never did a push or replace on history is what triggered this. But there might also be a race in there, maybe a history load that doesn't finish.
Comment 41•2 years ago
Yes, but nothing has been reduced to the point where it would be useful for anything other than replaying at the moment. If/when we do get a reduced test case I will attach it to the bug.
Comment 42•2 years ago
We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.
Comment 43•2 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #42)
> We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.
Yeah, it's definitely correct. The Pernosco recording that Tyson made (comment 39) actually shows this happening due to loads and reloads interleaving somehow. I'm trying to work through the recording to build a test case.
Comment 44•1 year ago
Clearing my old needinfo, as it looks like :farre is looking into this based on comment 43. Feel free to flag me again if you have more questions on this.
Updated•1 year ago
Comment 45•1 year ago
I guess this is still stalled; I had trouble getting good information out of the Pernosco trace. One thing that could help unblock this: Tyson, have you had any success in finding a minimized fuzzer test case?
Comment 46•1 year ago
Got one reduced! Make sure you serve it from a web server. Let it run for a few seconds. Easiest to just use Grizzly.
To reproduce via Grizzly Replay:
$ pip install fuzzfetch grizzly-framework --upgrade
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>
Updated•1 year ago
Comment 47•1 year ago
Let me know if you have any trouble trying to reproduce the issue.
Comment 48•1 year ago
Verified bug as reproducible on mozilla-central 20240628161338-30e83f26e442.
Unable to bisect testcase (Testcase reproduces on start build!):
Start: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
End: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
BuildFlags: BuildFlags(asan=True, tsan=False, debug=False, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)
Comment 49•1 year ago
I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch. If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.
Comment 50•1 year ago
(In reply to Andreas Farre [:farre] from comment #49)
> I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch.
I logged that (or a similar) rr issue upstream a few months ago. I am not seeing it with the latest code from git.
> If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.
A Pernosco session is available here: https://pernos.co/debug/9i40MeOBmKsCp2reqfy5sg/index.html
Updated•1 year ago
Updated•1 year ago
Comment 51•1 year ago
I have this error in Fenix: https://crash-stats.mozilla.org/report/index/0751e415-6a1a-4b09-901f-61ae90240918 every time I go to this link: https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US
Comment 52•1 year ago
Thanks for catching this crash! Looks like this crash's volume spiked for Nightly 132.0a1 users after 2024-09-11, from fewer than 100 reports per day to over 500.
The MOZ_CRASH reason is "DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000)".
Comment 53•1 year ago
(In reply to eclaudiu64 from comment #51)
> I have this error in Fenix: https://crash-stats.mozilla.org/report/index/0751e415-6a1a-4b09-901f-61ae90240918 every time I go to this link: https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US
I was able to reproduce it. Thank you for the pointer.
It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.
Comment 54•1 year ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #53)
> It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.
This could be a good lead! I'm resurrecting the Pernosco session from comment 50 to see if that's the case there as well.
Comment 55•1 year ago
So this feels very bisectable. Chris, I've tried my best to build Fenix, but I've failed. Do you have pointers, or someone I can reach out to for help? Or someone who could bisect for me? It's very possible this turns out to be a heavily used site and not an error on our side, but it would be good to know that.
Comment 56•1 year ago
(In reply to Andreas Farre [:farre] from comment #55)
> So this feels very bisectable. Chris, I've tried my best to build Fenix, but I've failed. Do you have pointers, or someone I can reach out to for help? Or someone who could bisect for me?
If you have ADB set up to talk to your Android device, you can bisect Fenix builds without building. Use mach mozregression --app fenix --arch arm64-v8a or, better yet since it might produce a narrower regression range, you can bisect GeckoViewExample builds using mach mozregression --app gve --arch aarch64.
If you still need to build Fenix, try these instructions and then ask in #mobile-android-team Slack channel or #Fenix Matrix room:
https://firefox-source-docs.mozilla.org/mobile/android/fenix.html
Comment 57•1 year ago
Yeah, mozregression on fenix/emulator gives:
6:06.61 INFO: Newest known good nightly: 2024-09-10
6:06.61 INFO: Oldest known bad nightly: 2024-09-11
which is exactly what crash-stats says, but doesn't tell us much more. My steps to reproduce were:
- Open https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US
- Open a new tab and go to some page
- Close the tab from step 1 and press undo
and this crashes. It's a bit harder to get this to trigger with GVE, mainly because it's harder to get into a situation where a restore would trigger.
Comment 58•1 year ago
There were 2 2024-09-11 Fenix nightly builds, so it's a bit unfortunate that mozregression doesn't give a build ID to identify which one it was testing. That said, here's a pushlog range showing both builds for that day.
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ebf0e33ba93e&tochange=606085de1f6c
Comment 59•1 year ago
I've done some bisecting locally to catch this crash. Even though I could not see the details of the native crash I was experiencing with the STR that Andreas provided (c57), I could spot a potential regressor: https://hg.mozilla.org/mozilla-central/rev/710d5d0be43889b6e5acdc54112fa9d93dc971cc
(as I am not sure if that's the same crash due to the lack of crash details - a confirmation would be good)
Updated•1 year ago
Comment 60•1 year ago
Right, I've found the spike for Android, and it's the second patch of Bug 1911977.
Emilio, do you know why this is happening?
Comment 61•1 year ago
Filed a bug with a patch for reverting bug 1911977. See bug 1921972.
Comment 62•1 year ago (Assignee)
Not off-hand, but I don't think we should just blindly revert. Let's investigate it in bug 1921972.
Comment 64•1 year ago
This was fixed by Bug 1921972. The testcase attached to this bug reproduces immediately with Grizzly without the patch from Bug 1921972, but not in 50 runs with the patch. Also, the crash table in Bug 1921972 is green. I'm closing this, but if it comes back we can reopen.
Comment 65•1 year ago
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.
Comment 66•1 year ago
Bug appears to be fixed on mozilla-central 20241010092233-541f36a6b102 but BugMon was unable to find a usable build for a5369079cd89dd91a42689da9aad1fc607ef9146.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Updated•1 year ago
Updated•1 year ago
Updated•1 year ago