Closed Bug 1677516 Opened 4 years ago Closed 4 years ago

cert_storage blocks the parent process main thread

Categories

(Core :: Security: PSM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
85 Branch
Performance Impact high
Tracking Status
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- fixed

People

(Reporter: florian, Assigned: keeler)

Details

(Keywords: perf, perf:responsiveness, Whiteboard: [bhr:cert_storage][psm-assigned])

Attachments

(1 file)

This currently represents more than 6% of the total hang time reported through BHR for the nightly user base, making it one of the top five hangs (http://queze.net/bhr/test/#row=1&filter=cert_storage).
The main thread seems to be blocked trying to get a lock (RtlAcquireSRWLockExclusive).

The two most common stacks look like this:

RtlAcquireSRWLockExclusive ntdll
cert_storage::{{impl}}::done<tuple<>,closure-0>(cert_storage::SecurityStateTask<tuple<>, closure-0>*) xul
moz_task::{{impl}}::allocate::Run(xpcom::interfaces::idl::nsIRunnable*) xul
mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex &> const&) xul
mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:120:7'>::Run() xul
nsThread::ProcessNextEvent(bool, bool*) xul

and

RtlAcquireSRWLockExclusive ntdll
cert_storage::{{impl}}::allocate::AddCRLiteStash(xpcom::interfaces::idl::nsICertStorage*, thin_vec::ThinVec<u8>*, xpcom::interfaces::idl::nsICertStorageCallback*) xul
XPTC__InvokebyIndex xul
static XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) xul
XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) xul
(unresolved)
gre/modules/psm/RemoteSecuritySettings.jsm:797

I was on PTO last week, and when I looked at BHR data about a week ago, I don't remember seeing hangs involving cert_storage in the top hangs. Has something related to this code changed recently that would either have introduced these hangs, or significantly increased their frequency?

The certList.addCRLiteStash call at https://searchfox.org/mozilla-central/rev/c37038c592a352eda0f5e77dfb58c4929bf8bcd3/security/manager/ssl/RemoteSecuritySettings.jsm#797 seems to have been added in May 2020. Was it there but somehow disabled until last week?

Do these reports include stack traces from other threads? That would be super helpful in figuring out what's going on here.

Flags: needinfo?(florian)

BHR collects a stack of the main thread when it has been unresponsive for more than 128ms, it doesn't collect stacks for other threads. If we find a way to reproduce, we could capture a profile of multiple threads, but I don't know steps to reproduce.

Flags: needinfo?(florian)
Whiteboard: [qf][bhr:cert_storage] → [qf:p1:responsiveness][bhr:cert_storage]

Is there a way to tell if the changes from bug 1677399 had an effect on this?

Flags: needinfo?(florian)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #3)

Is there a way to tell if the changes from bug 1677399 had an effect on this?

Yes, but it takes about 3 days for BHR (Telemetry) data to be received.
Today http://queze.net/bhr/test/ shows data from the 20201123 Nightlies, and this bug is now the 8th most frequent hang, with about 1% of the total hang time. Yesterday the data from the 20201122 Nightlies showed this bug as the 3rd most frequent hang, with 2.13% of the total hang time. A few days ago, this was the most common hang, with more than 10% of the total hang time.

20201120094511 was the first Nightly with the patch from bug 1677399, so the timeframe is correct to say it could have had an impact, but I can't say for sure.

Flags: needinfo?(florian)

When looking today at BHR data (from the 20201124 Nightly), this bug is no longer visible.

The stacks I'm seeing all involve base64::decode, so one easy improvement would be to move the decoding step off the main thread.

Assignee: nobody → dkeeler
Severity: -- → S1
Priority: -- → P1
Whiteboard: [qf:p1:responsiveness][bhr:cert_storage] → [qf:p1:responsiveness][bhr:cert_storage][psm-assigned]

Telemetry indicated that setting various security state (in particular, CRLite
state) was causing main thread hangs due to base64 decoding. This patch
rearranges cert_storage slightly to do these decodings off the main thread.

Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b90af486fbcd move base64-decoding operations of cert_storage off the main thread r=rmf,bbeurdouche
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch
Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness][bhr:cert_storage][psm-assigned] → [bhr:cert_storage][psm-assigned]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: