Cert override service does main-thread I/O
Categories
(Core :: Security: PSM, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox80 | --- | fixed |
People
(Reporter: nhnt11, Assigned: nhnt11)
Details
(Whiteboard: [psm-assigned])
Attachments
(1 file, 1 obsolete file)
Dana/Ehsan, do you know of an example of how to avoid main-thread I/O in this implementation? Perhaps another bug where a similar conversion was done? I'd like to work on this, but I've never converted main-thread I/O before so I'd like a reference.
Assignee | ||
Updated•6 years ago
|
Comment 1•6 years ago
|
||
Yes, looks like it does main-thread IO.
The common way to off-load these types of IO to a background thread is to use the "STS thread", aka the stream transport service. This is a thread used by Necko for the delivery messages of network notifications internally and is commonly used by various components in Gecko to run a small piece of code off the main thread.
There are two general techniques for doing so:
a) A fire and forget technique, like how we save the spell checking custom dictionary (when the user adds a custom word to the spell checking dictionary here: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/extensions/spellcheck/src/mozPersonalDictionary.cpp#347. We grab the STS thread event target and post a runnable to it, and the runnable's Run()
method is responsible to do the IO. That function will run off the main thread. The WaitForSave()
function uses a monitor to optionally block on an ongoing pending save where needed (e.g. when kicking off another save operation, you don't wanna race with a previously running one.)
b) A fire and wait for completion technique, like how we save the preferences file (https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/modules/libpref/Preferences.cpp#4166). This is very similar to the above except for the setup used to fire the runnable. Here we use a "sync runnable" (see the call to SyncRunnable::DispatchToThread()
). That dispatches the runnable to a background thread, but blocks the main thread until that runnable has been processed on the background thread! But while doing so, instead of actually blocking the main thread's event queue, we'd be processing the incoming messages, so the main thread will remain responsive, but we just won't make progress running more code from our function (in other words, we're spinning a nested event loop here.)
Which one of these solutions is appropriate here is really up to Dana since she's the domain expert here, I'll let her speak to that part. :-)
![]() |
||
Comment 2•6 years ago
|
||
In PSM we generally subclass CryptoTask
like so: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/ContentSignatureVerifier.cpp#41
The downside of this approach is that it creates (and destroys when it's done) a new thread for every task created. Our webcrypto implementation uses the same framework but has a thread pool, which presumably is much more efficient: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/crypto/WebCryptoThreadPool.h#18 (I don't know why we didn't just make all CryptoTask
s use a thread pool when we implemented that, but oh well).
Comment 3•6 years ago
|
||
Hmm, looks like usage of the transient thread is optional? https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/CryptoTask.cpp#43 Presumably one could just create a CryptoTask
and submit it to the STS thread, no?
![]() |
||
Comment 4•6 years ago
|
||
I suppose so but is it a good idea to be using the socket thread for unrelated tasks?
Comment 5•6 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)
I suppose so but is it a good idea to be using the socket thread for unrelated tasks?
Not the socket thread, the stream transport thread. This is a common source of confusion, see this thread for example: https://groups.google.com/d/msg/mozilla.dev.platform/v2MePAm0YFQ/RgMxVgDLe4YJ. The Necko team definitely thinks this is OK and actually invites people to use this thread for these types of tasks. As far as I understand other types of activities that can happen on this thread are also IO in nature.
![]() |
||
Comment 6•6 years ago
|
||
Ohhhh that makes a lot more sense. Thanks!
Looking at how the cert override service works, I imagine we could get away with fire-and-forget using the stream transport service in most cases (since we update in-memory behind a lock and the writing is only for saving state). For the initial read, again it would be nice to do that on a background thread, but in that case we'd essentially need to block any lookups until the read had completed (a bit like DataStorage: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/DataStorage.cpp#629 )
Comment 7•6 years ago
|
||
That makes sense. Maybe we should file a follow-up bug to move some of those cases to the stream transport service?
(For Nihanth's original purpose, I hope what I said in comment 3 makes sense?)
![]() |
||
Comment 8•6 years ago
|
||
Sure - I filed bug 1555854.
Comment 9•6 years ago
|
||
The priority flag is not set for this bug.
:keeler, could you have a look please?
For more information, please visit auto_nag documentation.
![]() |
||
Comment 10•6 years ago
|
||
Nihanth, are you planning on working on this?
Assignee | ||
Comment 11•6 years ago
•
|
||
Sure, sorry I didn't get a chance to pick this up already. Thank you both for already starting the discussion to figure out the best way to do this! Here's what I understood from reading the conversation above and looking at the links:
- Dana already landed a patch to make
CryptoTask
s use the stream transport thread, so that's done. - We should create a subclass of
CryptoTask
to handle the I/O. It needs to overrideCalculateResult
to do the actual I/O, andCallCallback
which can be used to set a ready state on the override service and to propagate errors. Either the constructor can accept a boolean indicating read or write, or we can just have two subclasses. - Reads should set a
ready
state to false, and lookups should wait until the state is true. - Writes can be a "fire and forget".
- I need to double-check if the stream transport service uses a thread or a thread-pool. If it's a pool, maybe writes should block until
ready
too, to ensure they execute in the right order.
Thanks! I'll start working on a patch. Let me know if I'm missing something.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 12•6 years ago
|
||
I need some help to continue. I started working on making a CryptoTask
subclass for writing, and realized I had two options:
- Iterate on
mSettingsTable
and build all the data to be written into a string, and pass this to the Task to write. - Figure out the right way to access
mSettingsTable
from the Task.
What's the right approach here? If 1
, is there an example of string building that I should follow or can I just use an ostringstream
? 2
feels very unsafe and wrong, but if there's a good way to do it, I'm all ears.
Comment 13•6 years ago
|
||
This is really a PSM specific question so I'll leave it to Dana to decide how she prefers this to be handled...
![]() |
||
Comment 14•6 years ago
|
||
I would go with option 1. Here's an example of marshaling a bunch of data into a string and then creating a task that writes it out: https://searchfox.org/mozilla-central/rev/b3b401254229f0a26f7ee625ef5f09c6c31e3949/security/manager/ssl/DataStorage.cpp#895 (I would take that implementation with a grain of salt, though - it's a bit dated and uses some paradigms that maybe aren't best practice any longer).
Assignee | ||
Comment 15•6 years ago
|
||
Assignee | ||
Comment 16•6 years ago
|
||
Thanks Dana! I posted a proof-of-concept patch that moves the writing off the main thread by implementing a CryptoTask
subclass as a private inner class of nsCertOverrideService
and passing it the string to write. It works for me locally, but I haven't run any tests. I'm posting the patch for initial feedback before implementing the reading part.
![]() |
||
Comment 17•6 years ago
|
||
Nihanth, is this still something you're available to do?
Assignee | ||
Comment 18•6 years ago
|
||
Apologies for being away from this bug, had to prioritize Skyline stuff. I will be picking this back up soon. Leaving the needinfo for tracking.
Updated•5 years ago
|
![]() |
||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 19•5 years ago
|
||
Comment 20•5 years ago
|
||
Backed out as requested.
Backout link: https://hg.mozilla.org/integration/autoland/rev/1a9b9ba8a0c9dc3f46e11cc821b4c42afb244814
Comment 21•5 years ago
|
||
Comment 22•5 years ago
|
||
Backed outfor causing build bustage on nsCertOverrideService.cpp.
Backout link: https://hg.mozilla.org/integration/autoland/rev/f1bba90e7986517a2802dc2a4a6b74451f6406c0
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=6a93e7a5e2378be1878c44383fccc4c26ed4736d
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307596273&repo=autoland&lineNumber=39823
Updated•5 years ago
|
Assignee | ||
Comment 23•5 years ago
|
||
(In reply to Atila Butkovits from comment #22)
Backed outfor causing build bustage on nsCertOverrideService.cpp.
Backout link: https://hg.mozilla.org/integration/autoland/rev/f1bba90e7986517a2802dc2a4a6b74451f6406c0
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=6a93e7a5e2378be1878c44383fccc4c26ed4736d
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307596273&repo=autoland&lineNumber=39823
Build failure was due to GetCurrentThreadSerialEventTarget()
getting renamed in bug 1637500.
Comment 24•5 years ago
|
||
Assignee | ||
Comment 25•5 years ago
|
||
Comment 26•5 years ago
|
||
Backed out changeset 7e9307a4a744 (bug 1555557) for talos failures
Backout: https://hg.mozilla.org/integration/autoland/rev/f43be4059409ec108f2ea366144107be89b1a7ba
Failure push: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=7e9307a4a7441dc00fa8f53472b463adc3063e0d
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307666556&repo=autoland&lineNumber=2722
Assignee | ||
Comment 27•5 years ago
|
||
At first glance I have no idea how that failure is related to anything in this patch, but I'll look into it....
Assignee | ||
Comment 28•5 years ago
•
|
||
Assignee | ||
Comment 29•5 years ago
•
|
||
(In reply to Nihanth Subramanya [:nhnt11] from comment #27)
At first glance I have no idea how that failure is related to anything in this patch, but I'll look into it....
The log shows a JS error which threw me off - there's a crash in nsCertOverrideService::Init() because we MOZ_RELEASE_ASSERT that adding the AsyncShutdown blocker succeeded, and apparently, in some cases, it doesn't.
I need to figure out the answers to:
- When does this happen?
- Is this a real issue that needs to be fixed/worked around or is MOZ_RELEASE_ASSERT just unnecessary here?
So far I've found:
- GetShutdownBarrier definitely worked because there's an assert there too that isn't being hit.
- In m-c, there are some consumers that do the MOZ_RELEASE_ASSERT on the AddBlocker result and some that don't.
- I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.
Assignee | ||
Comment 30•5 years ago
|
||
(In reply to Nihanth Subramanya [:nhnt11] from comment #29)
(In reply to Nihanth Subramanya [:nhnt11] from comment #27)
3. I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.
Based on the stack trace in the log, it does indeed appear that we are hitting PSM init after the ScopedXPCOMStartup
destructor has been called, and we are crashing because we're trying to add a blocker to a shutdown phase that's already in progress (or maybe even finished).
I think that it should be sufficient here to WARN and return instead of MOZ_RELEASE_ASSERT.
Dana, do you have any input on this diagnosis+solution? It appears there's an XHR during shutdown which is triggering PSM init.
Here's the stack trace: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307741501&repo=try&lineNumber=3640
Assignee | ||
Comment 31•5 years ago
|
||
(In reply to Nihanth Subramanya [:nhnt11] from comment #30)
(In reply to Nihanth Subramanya [:nhnt11] from comment #29)
(In reply to Nihanth Subramanya [:nhnt11] from comment #27)
3. I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.I think that it should be sufficient here to WARN and return instead of MOZ_RELEASE_ASSERT.
Perhaps in addition to this, we should remember this state and return early in Write()
if we failed to add the shutdown blocker. Just in case.
Assignee | ||
Comment 32•5 years ago
•
|
||
![]() |
||
Comment 33•5 years ago
|
||
I feel like if we non-fatally fail in a function called from nsCertOverrideService::Init()
we should propagate the error (that way, the override service will never be created and no writers will be dispatched. It would also be interesting to know what is causing PSM to be initialized when we're shutting down.
Assignee | ||
Comment 34•5 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #33)
I feel like if we non-fatally fail in a function called from
nsCertOverrideService::Init()
we should propagate the error (that way, the override service will never be created and no writers will be dispatched. It would also be interesting to know what is causing PSM to be initialized when we're shutting down.
Sounds good, in that case we should try adding the blocker before adding any observers so we don't leak. I'll update the patch.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 35•5 years ago
|
||
Comment 36•5 years ago
|
||
bugherder |
Description
•