Closed Bug 1555557 Opened 6 years ago Closed 5 years ago

Cert override service does main-thread I/O

Categories

(Core :: Security: PSM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: nhnt11, Assigned: nhnt11)

Details

(Whiteboard: [psm-assigned])

Attachments

(1 file, 1 obsolete file)

See https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/nsCertOverrideService.cpp#246

Dana/Ehsan, do you know of an example of how to avoid main-thread I/O in this implementation? Perhaps another bug where a similar conversion was done? I'd like to work on this, but I've never converted main-thread I/O before so I'd like a reference.

Flags: needinfo?(dkeeler)
Flags: needinfo?(ehsan)

Yes, looks like it does main-thread IO.

The common way to off-load these types of IO to a background thread is to use the "STS thread", aka the stream transport service. This is a thread used by Necko for the delivery messages of network notifications internally and is commonly used by various components in Gecko to run a small piece of code off the main thread.

There are two general techniques for doing so:

a) A fire and forget technique, like how we save the spell checking custom dictionary (when the user adds a custom word to the spell checking dictionary here: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/extensions/spellcheck/src/mozPersonalDictionary.cpp#347. We grab the STS thread event target and post a runnable to it, and the runnable's Run() method is responsible to do the IO. That function will run off the main thread. The WaitForSave() function uses a monitor to optionally block on an ongoing pending save where needed (e.g. when kicking off another save operation, you don't wanna race with a previously running one.)

b) A fire and wait for completion technique, like how we save the preferences file (https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/modules/libpref/Preferences.cpp#4166). This is very similar to the above except for the setup used to fire the runnable. Here we use a "sync runnable" (see the call to SyncRunnable::DispatchToThread()). That dispatches the runnable to a background thread, but blocks the main thread until that runnable has been processed on the background thread! But while doing so, instead of actually blocking the main thread's event queue, we'd be processing the incoming messages, so the main thread will remain responsive, but we just won't make progress running more code from our function (in other words, we're spinning a nested event loop here.)

Which one of these solutions is appropriate here is really up to Dana since she's the domain expert here, I'll let her speak to that part. :-)

Flags: needinfo?(ehsan)

In PSM we generally subclass CryptoTask like so: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/ContentSignatureVerifier.cpp#41
The downside of this approach is that it creates (and destroys when it's done) a new thread for every task created. Our webcrypto implementation uses the same framework but has a thread pool, which presumably is much more efficient: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/crypto/WebCryptoThreadPool.h#18 (I don't know why we didn't just make all CryptoTasks use a thread pool when we implemented that, but oh well).

Flags: needinfo?(dkeeler)

Hmm, looks like usage of the transient thread is optional? https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/CryptoTask.cpp#43 Presumably one could just create a CryptoTask and submit it to the STS thread, no?

I suppose so but is it a good idea to be using the socket thread for unrelated tasks?

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)

I suppose so but is it a good idea to be using the socket thread for unrelated tasks?

Not the socket thread, the stream transport thread. This is a common source of confusion, see this thread for example: https://groups.google.com/d/msg/mozilla.dev.platform/v2MePAm0YFQ/RgMxVgDLe4YJ. The Necko team definitely thinks this is OK and actually invites people to use this thread for these types of tasks. As far as I understand other types of activities that can happen on this thread are also IO in nature.

Ohhhh that makes a lot more sense. Thanks!
Looking at how the cert override service works, I imagine we could get away with fire-and-forget using the stream transport service in most cases (since we update in-memory behind a lock and the writing is only for saving state). For the initial read, again it would be nice to do that on a background thread, but in that case we'd essentially need to block any lookups until the read had completed (a bit like DataStorage: https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/security/manager/ssl/DataStorage.cpp#629 )

That makes sense. Maybe we should file a follow-up bug to move some of those cases to the stream transport service?

(For Nihanth's original purpose, I hope what I said in comment 3 makes sense?)

Sure - I filed bug 1555854.

The priority flag is not set for this bug.
:keeler, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dkeeler)

Nihanth, are you planning on working on this?

Flags: needinfo?(dkeeler) → needinfo?(nhnt11)

Sure, sorry I didn't get a chance to pick this up already. Thank you both for already starting the discussion to figure out the best way to do this! Here's what I understood from reading the conversation above and looking at the links:

  1. Dana already landed a patch to make CryptoTasks use the stream transport thread, so that's done.
  2. We should create a subclass of CryptoTask to handle the I/O. It needs to override CalculateResult to do the actual I/O, and CallCallback which can be used to set a ready state on the override service and to propagate errors. Either the constructor can accept a boolean indicating read or write, or we can just have two subclasses.
  3. Reads should set a ready state to false, and lookups should wait until the state is true.
  4. Writes can be a "fire and forget".
  5. I need to double-check if the stream transport service uses a thread or a thread-pool. If it's a pool, maybe writes should block until ready too, to ensure they execute in the right order.

Thanks! I'll start working on a patch. Let me know if I'm missing something.

Assignee: nobody → nhnt11
Status: NEW → ASSIGNED
Flags: needinfo?(nhnt11)

I need some help to continue. I started working on making a CryptoTask subclass for writing, and realized I had two options:

  1. Iterate on mSettingsTable and build all the data to be written into a string, and pass this to the Task to write.
  2. Figure out the right way to access mSettingsTable from the Task.

What's the right approach here? If 1, is there an example of string building that I should follow or can I just use an ostringstream? 2 feels very unsafe and wrong, but if there's a good way to do it, I'm all ears.

Flags: needinfo?(ehsan)
Flags: needinfo?(dkeeler)

This is really a PSM specific question so I'll leave it to Dana to decide how she prefers this to be handled...

Flags: needinfo?(ehsan)

I would go with option 1. Here's an example of marshaling a bunch of data into a string and then creating a task that writes it out: https://searchfox.org/mozilla-central/rev/b3b401254229f0a26f7ee625ef5f09c6c31e3949/security/manager/ssl/DataStorage.cpp#895 (I would take that implementation with a grain of salt, though - it's a bit dated and uses some paradigms that maybe aren't best practice any longer).

Flags: needinfo?(dkeeler)
Priority: -- → P1
Whiteboard: [psm-assigned]

Thanks Dana! I posted a proof-of-concept patch that moves the writing off the main thread by implementing a CryptoTask subclass as a private inner class of nsCertOverrideService and passing it the string to write. It works for me locally, but I haven't run any tests. I'm posting the patch for initial feedback before implementing the reading part.

Nihanth, is this still something you're available to do?

Flags: needinfo?(nhnt11)

Apologies for being away from this bug, had to prioritize Skyline stuff. I will be picking this back up soon. Leaving the needinfo for tracking.

Flags: needinfo?(nhnt11)
Attachment #9072908 - Attachment description: Bug 1555557 - Do cert override file writes off the main thread. r=keeler → Bug 1555557 - Do cert override file writes off the main thread. r=keeler!,baku
Severity: normal → S4
Attachment #9072908 - Attachment description: Bug 1555557 - Do cert override file writes off the main thread. r=keeler!,baku → Bug 1555557 - Do cert override file writes off the main thread. r=keeler!
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/7fbc8bcf6859 Do cert override file writes off the main thread. r=keeler
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/6a93e7a5e237 Do cert override file writes off the main thread. r=keeler
Flags: needinfo?(nhnt11)
Flags: needinfo?(nhnt11)
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/7e9307a4a744 Do cert override file writes off the main thread. r=keeler

At first glance I have no idea how that failure is related to anything in this patch, but I'll look into it....

Flags: needinfo?(nhnt11)

(In reply to Nihanth Subramanya [:nhnt11] from comment #27)

At first glance I have no idea how that failure is related to anything in this patch, but I'll look into it....

The log shows a JS error which threw me off - there's a crash in nsCertOverrideService::Init() because we MOZ_RELEASE_ASSERT that adding the AsyncShutdown blocker succeeded, and apparently, in some cases, it doesn't.

I need to figure out the answers to:

  1. When does this happen?
  2. Is this a real issue that needs to be fixed/worked around or is MOZ_RELEASE_ASSERT just unnecessary here?

So far I've found:

  1. GetShutdownBarrier definitely worked because there's an assert there too that isn't being hit.
  2. In m-c, there are some consumers that do the MOZ_RELEASE_ASSERT on the AddBlocker result and some that don't.
  3. I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.

(In reply to Nihanth Subramanya [:nhnt11] from comment #29)

(In reply to Nihanth Subramanya [:nhnt11] from comment #27)
3. I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.

Based on the stack trace in the log, it does indeed appear that we are hitting PSM init after the ScopedXPCOMStartup destructor has been called, and we are crashing because we're trying to add a blocker to a shutdown phase that's already in progress (or maybe even finished).

I think that it should be sufficient here to WARN and return instead of MOZ_RELEASE_ASSERT.

Dana, do you have any input on this diagnosis+solution? It appears there's an XHR during shutdown which is triggering PSM init.

Here's the stack trace: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307741501&repo=try&lineNumber=3640

Flags: needinfo?(dkeeler)

(In reply to Nihanth Subramanya [:nhnt11] from comment #30)

(In reply to Nihanth Subramanya [:nhnt11] from comment #29)

(In reply to Nihanth Subramanya [:nhnt11] from comment #27)
3. I THINK AddBlocker might be failing in cases when we're somehow already shutting down and/or past the target shutdown phase.

I think that it should be sufficient here to WARN and return instead of MOZ_RELEASE_ASSERT.

Perhaps in addition to this, we should remember this state and return early in Write() if we failed to add the shutdown blocker. Just in case.

I feel like if we non-fatally fail in a function called from nsCertOverrideService::Init() we should propagate the error (that way, the override service will never be created and no writers will be dispatched. It would also be interesting to know what is causing PSM to be initialized when we're shutting down.

Flags: needinfo?(dkeeler)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #33)

I feel like if we non-fatally fail in a function called from nsCertOverrideService::Init() we should propagate the error (that way, the override service will never be created and no writers will be dispatched. It would also be interesting to know what is causing PSM to be initialized when we're shutting down.

Sounds good, in that case we should try adding the blocker before adding any observers so we don't leak. I'll update the patch.

Attachment #9159692 - Attachment description: Bug 1555557 - Do cert override file writes off the main thread - follow-up cleanup. r=jya! → Bug 1555557 - Do cert override file writes off the main thread. r=keeler!
Attachment #9159692 - Attachment is obsolete: true
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/95db54ab9c71 Do cert override file writes off the main thread. r=keeler
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: