Closed Bug 1707057 Opened 4 years ago Closed 4 years ago

Crash in [@ PLDHashTable::MakeEntryHandle | mozilla::DataStorage::GetFromRawFileName]

Categories

(Firefox :: Sync, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
relnote-firefox --- 90+
firefox-esr78 --- unaffected
firefox89 --- wontfix
firefox90 + fixed
firefox91 --- fixed

People

(Reporter: sefeng, Assigned: teshaq)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/945f5296-4b4b-40ff-98b7-bc7550210422

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll PLDHashTable::MakeEntryHandle xpcom/ds/PLDHashTable.cpp:673
1 xul.dll static mozilla::DataStorage::GetFromRawFileName security/manager/ssl/DataStorage.cpp:114
2 xul.dll static mozilla::DataStorage::Get security/manager/ssl/DataStorageList.h:17
3 xul.dll nsClientAuthRememberService::Init security/manager/ssl/nsClientAuthRemember.cpp:73
4 xul.dll mozilla::xpcom::CreateInstanceImpl xpcom/components/StaticComponents.cpp:9560
5 xul.dll nsComponentManagerImpl::GetServiceByContractID xpcom/components/nsComponentManager.cpp:1466
6 xul.dll nsCOMPtr_base::assign_from_gs_contractid xpcom/base/nsCOMPtr.cpp:82
7 xul.dll nsNSSComponent::InitializeNSS security/manager/ssl/nsNSSComponent.cpp:1923
8 xul.dll nsNSSComponent::Init security/manager/ssl/nsNSSComponent.cpp:2191
9 xul.dll mozilla::xpcom::CreateInstanceImpl xpcom/components/StaticComponents.cpp:11602

If I'm reading this correctly we're trying to create the NSS component way too late during shutdown, probably after we already shut it down in the first place. The call that triggers this sequence comes from viaduct. I'm not familiar with viaduct and how it interacts with the rest of the codebase, Ryan could you take a look? Should this be moved to the Firefox Sync component?

Flags: needinfo?(rfkelly)

Should this be moved to the Firefox Sync component?

Thanks, yep, that sounds like the best place for it for further investigation. Leaving ni? myself to dig in more.

Component: Security → Sync
Product: Core → Firefox

I don't think Sync on desktop uses viaduct. The rust-fxaclient does, but that's pref'd off. I wonder if it might be glean?

Found another signature.

Crash Signature: [@ PLDHashTable::MakeEntryHandle | mozilla::DataStorage::GetFromRawFileName] → [@ PLDHashTable::MakeEntryHandle | mozilla::DataStorage::GetFromRawFileName] [@ PLDHashTable::MakeEntryHandle | PLDHashTable::WithEntryHandle<T> | mozilla::DataStorage::GetFromRawFileName]

I know what's going on. All the crashes happen after xpcom-shutdown, when NSS has already been terminated. The request tries to re-initialize NSS which fails. So first of all we need to figure out who's creating a new request because it should check for shutdown and cancel pending requests and then we probably also need to put some assertions in place to prevent this from happening again - or to make debugging this kind of issue easier. I'll file another bug for that.

This is definitely a regression. The first affected build id is 20210415133200. I don't know this code well enough to look for the commit that caused this.

Keywords: regression
See Also: → 1713749

:janerik is it possible this was triggered by recent Glean changes in mozilla-central? Glean is the only active viaduct consumer I could see in mozilla-central (the fxa-client crate is in there and uses viaduct, but is preffed off in the build).

Flags: needinfo?(rfkelly) → needinfo?(jrediger)

Could be. Paging in :chutten as well.

Flags: needinfo?(jrediger) → needinfo?(chutten)

Thread 27 is launching the viaduct uploader.
We do initialize Viaduct when initializing Glean.
There's several reasons we might send a ping early (though I thought we wouldn't initialize Glean that early?)
Guess we have to special-case when we're at shutdown and avoid doing the work?

See Also: → 1714372

I would contend that Viaduct should raise an error, since it is more tightly coupled to the layer that's crashing.

I mean, Glean should also ensure it doesn't init after it is asked to shut down. But unless we've been init, we don't know xpcom's been shut down because we haven't registered any observers yet... so this might be a bit tricky. (and it might not solve this if init was triggered just before shutdown was announced). Lemme file a bug for that much at least. (bug 1714372 filed)

Around the same time this started showing up (gsvelto notes build id 20210415133200) is when we started sending pings at startup (bug 1704184 first build with: 20210413214314), so if we're looking for a specific bug that started this, that'd be my guess.

Flags: needinfo?(chutten)

I would contend that Viaduct should raise an error, since it is more tightly coupled to the layer that's crashing.

I spent a bit of time trying to refresh my memory of what Viaduct is up to here; some notes:

  • Viaduct does a bit of a dance to convert between the synchronous calls made by Rust code and the asynchronous handling of those calls be our network stack.
  • Rust code invokes Viaduct on a background thread, via callback, eventually ending calling ViaductRequest::MakeRequest here.
  • The implementation of ViaductRequest::MakeRequest dispatches a function call to the main thread to invoke ViaductRequest::LaunchRequest, then blocks waiting for it to complete.
  • The call to ViaductRequest::LaunchRequest invokes NS_NewChannel to make the outgoing request, which is what leads to the crash in this bug.

It seems entirely possible that Glean could initiate a network request before shutdown starts, but then shutdown is initiated asynchronously before ViaductRequest::LaunchRequest gets to run on the main thread. So yes, I agree, Viaduct seems like the thing that should be checking for shutdown here.

I'm well out of my depth in the C++ bits here, but willing to do the legwork with a bit of guidance. So...I notice some other code calling PastShutdownPhase to check whether shutdown is underway and error out cleanly. Perhaps we could add an if PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown) inside ViaductRequest::LaunchRequest and fail early if shutdown is already under way?

:dragana, I see you reviewed the initial landing of this Viaduct code in Bug 1628068, any chance you could advise on the above, or suggest an alternate approach?

Flags: needinfo?(dd.mozilla)

(In reply to Ryan Kelly [:rfkelly] from comment #12)

I'm well out of my depth in the C++ bits here, but willing to do the legwork with a bit of guidance. So...I notice some other code calling PastShutdownPhase to check whether shutdown is underway and error out cleanly. Perhaps we could add an if PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown) inside ViaductRequest::LaunchRequest and fail early if shutdown is already under way?

:dragana, I see you reviewed the initial landing of this Viaduct code in Bug 1628068, any chance you could advise on the above, or suggest an alternate approach?

I think that is the best approach. nsIOService also has the information about shutdown, but I am afraid that there may be cases that it is already destroyed when this code is invocted. so let's use PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown) and return NS_ERROR_SERVICE_NOT_AVAILABLE or just NS_ERROR_FAILURE.

Flags: needinfo?(dd.mozilla)
Assignee: nobody → teshaq

:janerik or :chutten, do you know if I can find somewhere instructions to test glean's behavior on desktop? I have a small patch up doing what :rfkelly and :dragana suggested, but I'm finding it hard to test/verify. At a minimum, I'm hoping to make sure Viaduct is still working correctly

Flags: needinfo?(jrediger)
Flags: needinfo?(chutten)

You can find those instructions in FOG's in-tree documentation here: https://firefox-source-docs.mozilla.org/toolkit/components/glean/dev/testing.html which links out to https://mozilla.github.io/glean/book/user/debugging/index.html for Glean's instructions.

The integration suite's probably the most relevant for automated coverage. For manual testing you can use about:glean to tag and submit the "baseline" ping to the Glean debug ping viewer (be patient, it takes a few minutes to show up).

Flags: needinfo?(jrediger)
Flags: needinfo?(chutten)

Awesome! Thanks :chutten!

Using a local build with the patch, I navigated to about:glean and submitted a couple of pings. I tried both baseline and metrics, and they were received on the ping viewer. (They're the ones under id viaduct). I also ran the integration tests and those passed too

This tells me that at least Viaduct seems to behave correctly with the change

Pushed by teshaq@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bc54273733d6 Add shutdown handling for viaduct. r=dragana.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

We should probably leave this on the radar for a possible RC respin or dot release ride-along given that the volume seems to have gone up in 90.

The patch landed in nightly and beta is affected.
:teshaq, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(teshaq)

Comment on attachment 9228188 [details]
Bug 1707057 - Add shutdown handling for viaduct. r=rfkelly.

Beta/Release Uplift Approval Request

  • User impact if declined: currently, a low volume crasher
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: Hard to reproduce and verify, requires a timing situation where NSS shuts down while/before glean is attempting to send a network request
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Adds erroring out behavior instead of crashing, medium risk because the fix can't be verified easily, otherwise the code change has low impact
  • String changes made/needed:
Flags: needinfo?(teshaq)
Attachment #9228188 - Flags: approval-mozilla-release?

Comment on attachment 9228188 [details]
Bug 1707057 - Add shutdown handling for viaduct. r=rfkelly.

approved for 90.0.1

Attachment #9228188 - Flags: approval-mozilla-release? → approval-mozilla-release+

Added to 90.0.1 release notes: "Fixed a rare crash on shutdown"

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: