Crash in [@ PLDHashTable::MakeEntryHandle | mozilla::DataStorage::GetFromRawFileName]
Categories
(Firefox :: Sync, defect)
Tracking
()
People
(Reporter: sefeng, Assigned: teshaq)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-release+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/945f5296-4b4b-40ff-98b7-bc7550210422
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll PLDHashTable::MakeEntryHandle xpcom/ds/PLDHashTable.cpp:673
1 xul.dll static mozilla::DataStorage::GetFromRawFileName security/manager/ssl/DataStorage.cpp:114
2 xul.dll static mozilla::DataStorage::Get security/manager/ssl/DataStorageList.h:17
3 xul.dll nsClientAuthRememberService::Init security/manager/ssl/nsClientAuthRemember.cpp:73
4 xul.dll mozilla::xpcom::CreateInstanceImpl xpcom/components/StaticComponents.cpp:9560
5 xul.dll nsComponentManagerImpl::GetServiceByContractID xpcom/components/nsComponentManager.cpp:1466
6 xul.dll nsCOMPtr_base::assign_from_gs_contractid xpcom/base/nsCOMPtr.cpp:82
7 xul.dll nsNSSComponent::InitializeNSS security/manager/ssl/nsNSSComponent.cpp:1923
8 xul.dll nsNSSComponent::Init security/manager/ssl/nsNSSComponent.cpp:2191
9 xul.dll mozilla::xpcom::CreateInstanceImpl xpcom/components/StaticComponents.cpp:11602
Comment 1•4 years ago
|
||
If I'm reading this correctly we're trying to create the NSS component way too late during shutdown, probably after we already shut it down in the first place. The call that triggers this sequence comes from viaduct. I'm not familiar with viaduct and how it interacts with the rest of the codebase, Ryan could you take a look? Should this be moved to the Firefox Sync component?
Comment 2•4 years ago
|
||
Should this be moved to the Firefox Sync component?
Thanks, yep, that sounds like the best place for it for further investigation. Leaving ni? myself to dig in more.
Comment 3•4 years ago
|
||
I don't think Sync on desktop uses viaduct. The rust-fxaclient does, but that's pref'd off. I wonder if it might be glean?
Comment 4•4 years ago
|
||
Found another signature.
Comment 5•4 years ago
|
||
I know what's going on. All the crashes happen after xpcom-shutdown, when NSS has already been terminated. The request tries to re-initialize NSS which fails. So first of all we need to figure out who's creating a new request because it should check for shutdown and cancel pending requests and then we probably also need to put some assertions in place to prevent this from happening again - or to make debugging this kind of issue easier. I'll file another bug for that.
Comment 6•4 years ago
|
||
This is definitely a regression. The first affected build id is 20210415133200. I don't know this code well enough to look for the commit that caused this.
Comment 7•4 years ago
|
||
:janerik is it possible this was triggered by recent Glean changes in mozilla-central? Glean is the only active viaduct consumer I could see in mozilla-central (the fxa-client crate is in there and uses viaduct, but is preffed off in the build).
Comment 8•4 years ago
•
|
||
Could be. Paging in :chutten as well.
Comment 9•4 years ago
|
||
Thread 27 is launching the viaduct uploader.
We do initialize Viaduct when initializing Glean.
There's several reasons we might send a ping early (though I thought we wouldn't initialize Glean that early?)
Guess we have to special-case when we're at shutdown and avoid doing the work?
Comment 10•4 years ago
|
||
I would contend that Viaduct should raise an error, since it is more tightly coupled to the layer that's crashing.
I mean, Glean should also ensure it doesn't init after it is asked to shut down. But unless we've been init, we don't know xpcom's been shut down because we haven't registered any observers yet... so this might be a bit tricky. (and it might not solve this if init was triggered just before shutdown was announced). Lemme file a bug for that much at least. (bug 1714372 filed)
Around the same time this started showing up (gsvelto notes build id 20210415133200) is when we started sending pings at startup (bug 1704184 first build with: 20210413214314), so if we're looking for a specific bug that started this, that'd be my guess.
Comment 11•4 years ago
|
||
I would contend that Viaduct should raise an error, since it is more tightly coupled to the layer that's crashing.
I spent a bit of time trying to refresh my memory of what Viaduct is up to here; some notes:
- Viaduct does a bit of a dance to convert between the synchronous calls made by Rust code and the asynchronous handling of those calls be our network stack.
- Rust code invokes Viaduct on a background thread, via callback, eventually ending calling
ViaductRequest::MakeRequest
here. - The implementation of
ViaductRequest::MakeRequest
dispatches a function call to the main thread to invokeViaductRequest::LaunchRequest
, then blocks waiting for it to complete. - The call to
ViaductRequest::LaunchRequest
invokesNS_NewChannel
to make the outgoing request, which is what leads to the crash in this bug.
It seems entirely possible that Glean could initiate a network request before shutdown starts, but then shutdown is initiated asynchronously before ViaductRequest::LaunchRequest
gets to run on the main thread. So yes, I agree, Viaduct seems like the thing that should be checking for shutdown here.
Comment 12•4 years ago
|
||
I'm well out of my depth in the C++ bits here, but willing to do the legwork with a bit of guidance. So...I notice some other code calling PastShutdownPhase
to check whether shutdown is underway and error out cleanly. Perhaps we could add an if PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown)
inside ViaductRequest::LaunchRequest
and fail early if shutdown is already under way?
:dragana, I see you reviewed the initial landing of this Viaduct code in Bug 1628068, any chance you could advise on the above, or suggest an alternate approach?
Comment 13•4 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #12)
I'm well out of my depth in the C++ bits here, but willing to do the legwork with a bit of guidance. So...I notice some other code calling
PastShutdownPhase
to check whether shutdown is underway and error out cleanly. Perhaps we could add anif PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown)
insideViaductRequest::LaunchRequest
and fail early if shutdown is already under way?:dragana, I see you reviewed the initial landing of this Viaduct code in Bug 1628068, any chance you could advise on the above, or suggest an alternate approach?
I think that is the best approach. nsIOService also has the information about shutdown, but I am afraid that there may be cases that it is already destroyed when this code is invocted. so let's use PastShutdownPhase(ShutdownPhase::AppShutdownNetTeardown) and return NS_ERROR_SERVICE_NOT_AVAILABLE or just NS_ERROR_FAILURE.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 14•4 years ago
|
||
Assignee | ||
Comment 15•4 years ago
|
||
:janerik or :chutten, do you know if I can find somewhere instructions to test glean's behavior on desktop? I have a small patch up doing what :rfkelly and :dragana suggested, but I'm finding it hard to test/verify. At a minimum, I'm hoping to make sure Viaduct is still working correctly
Assignee | ||
Updated•4 years ago
|
Comment 16•4 years ago
|
||
You can find those instructions in FOG's in-tree documentation here: https://firefox-source-docs.mozilla.org/toolkit/components/glean/dev/testing.html which links out to https://mozilla.github.io/glean/book/user/debugging/index.html for Glean's instructions.
The integration suite's probably the most relevant for automated coverage. For manual testing you can use about:glean
to tag and submit the "baseline" ping to the Glean debug ping viewer (be patient, it takes a few minutes to show up).
Assignee | ||
Comment 17•4 years ago
|
||
Awesome! Thanks :chutten!
Using a local build with the patch, I navigated to about:glean
and submitted a couple of pings. I tried both baseline and metrics, and they were received on the ping viewer. (They're the ones under id viaduct
). I also ran the integration tests and those passed too
This tells me that at least Viaduct seems to behave correctly with the change
Comment 18•4 years ago
|
||
Comment 19•4 years ago
|
||
bugherder |
Comment 20•4 years ago
|
||
We should probably leave this on the radar for a possible RC respin or dot release ride-along given that the volume seems to have gone up in 90.
Updated•4 years ago
|
Comment 21•4 years ago
|
||
The patch landed in nightly and beta is affected.
:teshaq, is this bug important enough to require an uplift?
If not please set status_beta
to wontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 22•4 years ago
|
||
Comment on attachment 9228188 [details]
Bug 1707057 - Add shutdown handling for viaduct. r=rfkelly.
Beta/Release Uplift Approval Request
- User impact if declined: currently, a low volume crasher
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce: Hard to reproduce and verify, requires a timing situation where NSS shuts down while/before glean is attempting to send a network request
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): Adds erroring out behavior instead of crashing, medium risk because the fix can't be verified easily, otherwise the code change has low impact
- String changes made/needed:
Comment 23•4 years ago
|
||
Comment on attachment 9228188 [details]
Bug 1707057 - Add shutdown handling for viaduct. r=rfkelly.
approved for 90.0.1
Comment 24•4 years ago
|
||
bugherder uplift |
Comment 25•4 years ago
|
||
Added to 90.0.1 release notes: "Fixed a rare crash on shutdown"
Description
•