<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 6

•

4 years ago

P2 S2

mismatch load flags on LoadGroup and the network request.

Nika says this might be related to load flags when:

bypassing TRR for captive portal detection: https://searchfox.org/mozilla-central/rev/36f79bed679ad7ec46f7cd05868a8f6dc823e1be/docshell/base/nsDocShell.cpp#820-827
bypassing cache in DevTools' network panel
or overriding loading behavior of background thumbnail process.

Severity: -- → S2

Priority: -- → P2

Updated

•

4 years ago

Whiteboard: [domcore-s2-revisit]

Andrew McCreight [:mccr8]

Comment 8

•

3 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 5 desktop browser crashes on Mac on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Comment 9

•

3 years ago

Bug 1743456 has a test case.

Kris Maglione [:kmag]

Comment 10

•

3 years ago

All of the wild crashes I've looked at so far are from loads triggered from the child, not the parent, so I'm not sure that things getting out of sync is a concern. It also happens in AsyncOpen before we send the constructor to the parent, so if we had to, we could send the flags to the parent. I'm not sure bug 1743456 is strictly the same issue, but it could be.

My guess, at this point, is that the problem is inheriting (mLoadFlags & (LOAD_BACKGROUND | LOAD_BYPASS_CACHE | LOAD_FROM_CACHE | VALIDATE_ALWAYS | VALIDATE_ONCE_PER_SESSION | VALIDATE_NEVER)); from the load group, rather than the BrowsingContext default load flags changing after the channel is created. The default load flags on the load group should match the ones on the browsing context (it's unfortunate that we have both). But, in any case, the parent process doesn't have access to the load group, so parent-initiated loads wouldn't have access to those flags, and we really don't want different behavior for parent-initiated vs child-initiated loads.

I'll look into this some more.

Comment hidden (obsolete)

Comment 14

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Comment 15

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Comment 16

•

2 years ago

Hsin-Yi, is there someone who can look into this?

Flags: needinfo?(htsai)

Comment 17

•

2 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Exploring ...

Updated

•

2 years ago

Assignee: nobody → afarre

Status: NEW → ASSIGNED

Updated

•

2 years ago

Flags: needinfo?(htsai)

Comment 18

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 19

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 20

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 21

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 22

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 23

•

2 years ago

Andreas, do you have cycles to investigate this Fenix topcrash?

Flags: needinfo?(afarre)

Comment 24

•

2 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Flags: needinfo?(htsai)

Comment 25

•

2 years ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 26

•

2 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)

Hsin-Yi, is there someone who can look into this?

Any news on this? Still a Fenix topcrash.

Andreas is back from PTOs this week. He is resuming the work here.

Flags: needinfo?(htsai)

Comment 27

•

2 years ago

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

Top 10 AArch64 and ARM crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 28

•

2 years ago

•

Edited

These stacks are all looking hefty strange. As far as I can tell they all come into nsDocShell::DoURILoad, where they call nsDocShell::OpenInitializedChannel. This is always called with a freshly created DocumentChannel. We calculate the nsLoadFlags just before creating the channel, so we should really be able to assume that they're correct at that point. Then we don't touch the load flags of the DocumentChannel until we've entered nsURILoader::OpenURI.

It is here, when we're calling DocumentChannelChild::AsyncOpen that we'll assert when we add the DocumentChannelChild to the loadGroup in nsLoadGroup::AddRequest. Note that we're adding DocumentChannelChild to its own load group! That load group was set a few lines earlier as a result of calling nsURILoader::OpenChannel, which in turn is extracted from the nsDocShell that we're calling DoLoadURI on.

I'm not sure if it's nsLoadGroup::MergeDefaultLoadFlags or nsLoadGroup::MergeLoadFlags that asserts. That's optimized out.

But as far as I can tell the problem is that the load group doesn't have the same load flags as the browsing context has. Is this a problem with how we compute the load flags in nsDocShellLoadState::CalculateChannelLoadFlags? We never suspend execution by returning to the main loop, this is happening linearly.

This is in line with what :kmag found in comment 10.

I'm at a bit of a loss of how to resolve this. :kmag, :nika any ideas?

Flags: needinfo?(nika)

Flags: needinfo?(kmaglione+bmo)

Flags: needinfo?(afarre)

Updated

•

2 years ago

Depends on: 1849351

Updated

•

2 years ago

Depends on: 1852213

Kershaw Chang [:kershaw]

Updated

•

2 years ago

Duplicate of this bug: 1852039

Andrew McCreight [:mccr8]

Comment 30

•

2 years ago

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] → [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc]

Updated

•

2 years ago

No longer duplicate of this bug: 1852039

Andrew McCreight [:mccr8]

Comment 31

•

2 years ago

I doubt these offsets will be good for more than one build. I'm not sure if this is a new spike on Nightly or just bad symbols causing a new signature.

Crash Signature: [@ mozilla::net::DocumentChannel::SetLoadFlags] [@ base.odex@0x13522c] [@ base.odex@0x13a1cc] [@ base.odex@0x13b1cc] → [@ mozilla::net::DocumentChannel::SetLoadFlags]

Comment 32

•

2 years ago

The annotations in the first patch of Bug 1852213 were wrong, but a fix to that landed yesterday. Now we only need to wait for some reports with the annotation data available.

Comment 33

•

2 years ago

So the annotations says that it's LOAD_TRR_DISABLED_MODE that is the differing flag.

Comment 34

•

2 years ago

•

Edited

I wonder if this is the problem. That it somehow races with a load. I'll try to make the in process updates happen synchronously to see if that makes any difference.

Valentin Gosu [:valentin] (he/him)

Comment 35

•

2 years ago

I guess that I'd really like to know what the offline discussion was, but it does feel like a changing load flags because of captive portal problem.

Valentin, I don't know how the captive portal work, are there tests that I could look at/modify to reproduce this?

Nika, you reviewed the code that changed this, do you know how to resolve this?

Flags: needinfo?(valentin.gosu)

Comment 36

•

2 years ago

So when the captive portal checker detects we're in a captive portal, we display the captive portal banner to the user.
The user can click "Log into captive portal" on that banner, which opens a tab for the log in with the disableTRR: true property set.
When that flag is set, we then pass LOAD_FLAGS_DISABLE_TRR into fixupAndLoadURIString.
Then we set it on the defaultLoadFlags here.

The tests we do have for the UI are here: https://searchfox.org/mozilla-central/source/browser/base/content/test/captivePortal
These two tests actually click on the button and open that tab: https://searchfox.org/mozilla-central/search?q=openCaptivePortalLoginTab&redirect=false

Hope this helps.

Flags: needinfo?(valentin.gosu)

Gabriele Svelto [:gsvelto]

Updated

•

2 years ago

Updated

•

2 years ago

Keywords: stalled

Whiteboard: [domcore-s2-revisit]

Gabriele Svelto [:gsvelto]

Updated

•

2 years ago

Flags: needinfo?(kmaglione+bmo)

Comment 37

•

2 years ago

I've sifted through the recent crashes under this signature and there's a few things that stand out and maybe could help diagnose and fix this:

All recent crashes have the reason DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000). This means that mozilla::net::DocumentChannel::SetLoadFlags() was called with the nsIRequest::VALIDATE_NEVER which had not been previously set. This seems to happen only in a handful of places and in this particular stack trace it seems that this is the most likely occurrence. There's also this on the same path but it would require the user to have changed their prefs, which seems unlikely at this scale.
Looking at URLs it seems that this is happening mostly on webpages with videos. I've sifted to the URLs and couldn't find a single SFW one but they come from a lot of different webpages, so it doesn't look site specific.
Some comments mention browsing either back or forward causing the crash.
Last but not least we have several crashes on file that appear to be coming from the same user, on the same page. It appears as if they tried to reload the same page (which crashed) and encountered more crashes. See these three crashes for example: 1, 2 and 3. This suggests that Firefox is in a state where browsing to a page will consistently crash it. Given that the instances where nsIRequest::VALIDATE_NEVER is set depend on the cache then maybe it has to do with page caching (IIUC).

Sylvestre Ledru [:Sylvestre]

Comment 38

•

2 years ago

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue but the test cases are also unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

Blocks: domino

status-firefox122: --- → affected

Flags: needinfo?(afarre)

Keywords: pernosco

Updated

•

2 years ago

status-firefox120: --- → affected

status-firefox121: --- → affected

status-firefox-esr115: --- → affected

Comment 39

•

2 years ago

(In reply to Tyson Smith [:tsmith] from comment #38)

A Pernosco session is available here: https://pernos.co/debug/d8q-rXYkfZv27Ej0fdQ5mA/index.html

The fuzzers are also reporting this issue but the test cases are also unreliable. I did manage to get a Pernosco session.

Andreas: does this provide enough information to make progress?

This is wonderful news! I'll dig right in! Thank you!

Flags: needinfo?(afarre)

Comment 40

•

2 years ago

:tsmith, you said that this was caught by fuzzers albeit unreliably? Do you have something that reproduced it? It might be reloading a history navigated page that never did push or replace on history that triggered this. But there might be a race in there also, maybe a history load that doesn't finish.

Gabriele Svelto [:gsvelto]

Comment 41

•

2 years ago

Yes but nothing reduced to the point where it would be useful for anything other than replaying at the moment. If/when we do get a test case reduced I will attach it to the bug.

Comment 42

•

2 years ago

We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.

Nika Layzell [:nika] (ni? for response)

Comment 43

•

2 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #42)

We're having a fairly large increase in crashes coming from Android with version 121+. Android doesn't report URLs but just looking at the stacks it seems like my analysis in comment 37 still stands.

Yeah, it's definitely correct. The pernosco that Tyson made (comment 39) actually shows this happening due to loads and reloads interleaving somehow. I'm trying to work the recording to build a test case.

Updated

•

1 year ago

Comment 44

•

1 year ago

Clearing my old needinfo, as it looks like :farre is looking into this based on comment 43. Feel free to flag me again if you have more questions on this.

Flags: needinfo?(nika) → needinfo?(afarre)

Updated

•

1 year ago

status-firefox120: affected → wontfix

status-firefox121: affected → wontfix

status-firefox122: affected → wontfix

Comment 45

•

1 year ago

I guess this is still stalled, I had trouble getting good information out of the pernosco trace. One thing that could help unblock this is if Tyson has had any success in finding a minimized fuzzer testcase?

Flags: needinfo?(afarre) → needinfo?(twsmith)

Comment 46

•

1 year ago

Attached file testcase.html — Details

Got one reduced! Make sure you serve it from a web server. Let it run for a few seconds. Easiest to just use Grizzly.

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework --upgrade
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>

Flags: needinfo?(twsmith)

Updated

•

1 year ago

Keywords: bugmon, testcase

Bugmon [:jkratzer for issues]

Comment 47

•

1 year ago

Let me know if you have any trouble trying to reproduce the issue.

Flags: needinfo?(afarre)

Comment 48

•

1 year ago

Verified bug as reproducible on mozilla-central 20240628161338-30e83f26e442.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
End: 81b43c5e095f0b0407b6d1a8d4a7f92bf7115e5e (20230701093845)
BuildFlags: BuildFlags(asan=True, tsan=False, debug=False, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Whiteboard: [bugmon:bisected,confirmed]

Comment 49

•

1 year ago

I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch. If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.

Flags: needinfo?(afarre) → needinfo?(twsmith)

Comment 50

•

1 year ago

(In reply to Andreas Farre [:farre] from comment #49)

I can reproduce it, but not with rr. I get stopped_or_unexpected_exit() from --rr regardless. With --pernosco I can't launch.

I logged that (or a similar) rr issue upstream a few months ago. I am not seeing it with the latest code from git.

If you're able to reproduce with pernosco, I'd be extremely happy if you could ping me a link.

A Pernosco session is available here: https://pernos.co/debug/9i40MeOBmKsCp2reqfy5sg/index.html

Flags: needinfo?(twsmith) → needinfo?(afarre)

Updated

•

1 year ago

Flags: needinfo?(afarre)

Updated

•

1 year ago

Flags: needinfo?(afarre)

eclaudiu64

Comment 51

•

1 year ago

I have this error, in Fenix: https://crash-stats.mozilla.org/report/index/0751e415-6a1a-4b09-901f-61ae90240918 , every time I go to this link: https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US

Flags: needinfo?(cpeterson)

Valentin Gosu [:valentin] (he/him)

Comment 52

•

1 year ago

Thanks for catching this crash! Looks like this crash's volume spiked for Nightly 132.0a1 users after 2024-09-11, from fewer than 100 reports per day to over 500.

The MOZ_CRASH reason is "DocumentChannel::SetLoadFlags: Don't set flags after creation (differing flags 0 != 1000)".

status-firefox130: --- → affected

status-firefox131: --- → affected

status-firefox132: --- → affected

Flags: needinfo?(cpeterson)

Comment 53

•

1 year ago

(In reply to eclaudiu64 from comment #51)

I have this error, in Fenix: https://crash-stats.mozilla.org/report/index/0751e415-6a1a-4b09-901f-61ae90240918 , every time I go to this link: https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US

I was able to reproduce it. Thank you for the pointer.
It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.

Comment 54

•

1 year ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #53)

It doesn't happen to me on navigation to the URL, but if I kill fenix and reopen, it crashes when the page is restored.

This could be a good lead! I'm resurrecting the Pernosco session from comment 50 to see if that's the case there as well.

Flags: needinfo?(afarre)

Comment 55

•

1 year ago

So this feels very bisectable. Chris, I've tried my best building Fenix, but I've failed. Do you have pointers or someone that I can reach out to for help? Or someone that could bisect for me? It's very possible it turns out to be a highly used site and not an error from us, but it would be good to know that.

Flags: needinfo?(cpeterson)

https://firefox-source-docs.mozilla.org/mobile/android/fenix.html

Comment 56

•

1 year ago

(In reply to Andreas Farre [:farre] from comment #55)

So this feels very bisectable. Chris, I've tried my best building Fenix, but I've failed. Do you have pointers or someone that I can reach out to for help? Or someone that could bisect for me?

If you have you have ADB device set up to talk to your Android device, you bisect Fenix builds without building. Use mach mozregression --app fenix --arch arm64-v8a or, better yet since it might produce a narrower regression range, you can bisect GeckoViewExample builds using mach mozregression --app gve --arch aarch64.

If you still need to build Fenix, try these instructions and then ask in #mobile-android-team Slack channel or #Fenix Matrix room:

Flags: needinfo?(cpeterson) → needinfo?(afarre)

Comment 57

•

1 year ago

•

Edited

Yeah, mozregression on fenix/emulator gives:

 6:06.61 INFO: Newest known good nightly: 2024-09-10
 6:06.61 INFO: Oldest known bad nightly: 2024-09-11

which is exactly what crashstat says, but doesn't tell us much more. My STR:s were:

Open https://play.google.com/store/apps/details?id=com.zynga.starwars.hunters&hl=en_US
Open new tab go to some page
Close tab from 1 and press undo

and this crashes. It's a bit harder trying to get this to trigger with gve, mainly because getting into a situation where a restore would trigger is harder.

Comment 58

•

1 year ago

There were 2 2024-09-11 Fenix nightly builds, so it's a bit unfortunate that mozregression doesn't give a build ID to identify which one it was testing. That said, here's a pushlog range showing both builds for that day.
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ebf0e33ba93e&tochange=606085de1f6c

Kaya [:kaya]

Comment 59

•

1 year ago

I've done some bisecting locally to catch this crash. Even though I could not see the details of the native crash I was experiencing with the STR that Andreas provided (c57), I could spot a potential regressor: https://hg.mozilla.org/mozilla-central/rev/710d5d0be43889b6e5acdc54112fa9d93dc971cc
(as I am not sure if that's the same crash due to the lack of crash details - a confirmation would be good)

Updated

•

1 year ago

status-firefox130: affected → wontfix

status-firefox131: affected → wontfix

status-firefox133: --- → affected

status-firefox-esr115: affected → wontfix

status-firefox-esr128: --- → affected

Comment 60

•

1 year ago

Right, I've found the spike for android, and its the second patch of Bug 1911977.

Emilio, do you know why this is happening?

Flags: needinfo?(afarre) → needinfo?(emilio)

Comment 61

•

1 year ago

•

Edited

Filed a bug with patch for reverting bug 1911977. See bug 1921972

Comment 62

•

1 year ago

Not off-hand, but I don't think we should just blindly revert. Let's investigate it in bug 1921972.

Flags: needinfo?(emilio)

Updated

•

1 year ago

Updated

•

1 year ago

Duplicate of this bug: 1918275

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 64

•

1 year ago

This was fixed by Bug 1921972. The testcase attached in this bug reproduces immediately with grizzly without the patch in Bug 1921972 , but not in 50 runs with the patch. Also, the crash table in Bug 1921972 is green. I'm closing this, but in the case of it coming back we can re-open.

Status: ASSIGNED → RESOLVED

Closed: 1 year ago

Resolution: --- → FIXED

Comment 65

•

1 year ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled

Bugmon [:jkratzer for issues]

Comment 66

•

1 year ago

Bug appears to be fixed on mozilla-central 20241010092233-541f36a6b102 but BugMon was unable to find a usable build for a5369079cd89dd91a42689da9aad1fc607ef9146.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon

Updated

•

1 year ago

status-firefox133: affected → fixed

Target Milestone: --- → 133 Branch

Updated

•

1 year ago

Assignee: afarre → emilio

Depends on: 1921972

Updated

•

1 year ago

status-firefox132: affected → fixed