nsUrlClassifierDBService shutdown deadlock

RESOLVED FIXED in Firefox 57

Status

()

Toolkit
Safe Browsing
P2
normal
RESOLVED FIXED
7 months ago
4 months ago

People

(Reporter: mcmanus, Assigned: tnguyen)

Tracking

unspecified
mozilla57
Points:
---

Firefox Tracking Flags

(firefox57 fixed)

Details

(Whiteboard: #sbv4-m9)

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

(Reporter)

Description

7 months ago
shutdown() does a synchronous dispatch to the url classifier worker thread

  SyncRunnable::DispatchToThread(gDbBackgroundThread, r);

meanwhile, on the worker thread in openDB()

  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

and if you dig into nsNSSComponents.cpp 101 (from that createInstance stack) you will find that when an nss instance is created off the main thread it does a synchronous dispatch to the main thread to ensure initialization (even if it is indeed initialized)

so we deadlock if these things are done simultaneously.

There is a shared state variable, gShuttingDownThread, that looks like it might be meant to guard against some of this but without any locking it is easy to see how it could be racy.

I can trivially reproduce this by grabbing a ref to the classifier service in nsHttpHandler::Init() and then running some tests that never use http.. basically the startup of the thread is racing against its shutdown the test completes so quickly. toolkit/components/crashmonitor/test/unit/test_init.js seems to repro the problem easily.

but even without my change to handler, this seems like a latent bug that has been closed WONTFIX a couple times lacking fresh data, a repro, or a theory.

Comment 1

7 months ago
I was aware of the fact that 

mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

will need to SyncRunnable::Dispatch to the main thread. 

However, only the very first "do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID)"
on non-main thread will need to do SyncRunnable::Dispatch to the main thread.
Maybe that's why I haven't run into this potential deadlock. I did see this
kind of deadlocks in other place but have got around it.

As for gShuttingDownThread, I would say it's just a "hint" for making shutdown
more quickly. Racing on gShuttingDownThread is totally expected and it should
lead to any issue.

I'll try if I can reproduce it tomorrow. Thanks for bring this up!
Assignee: nobody → hchang
Status: NEW → ASSIGNED
Priority: -- → P1

Comment 2

7 months ago
(In reply to Patrick McManus [:mcmanus] from comment #0)
> shutdown() does a synchronous dispatch to the url classifier worker thread
> 
>   SyncRunnable::DispatchToThread(gDbBackgroundThread, r);
> 
> meanwhile, on the worker thread in openDB()
> 
>   mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
>
> and if you dig into nsNSSComponents.cpp 101 (from that createInstance stack)

From what I dug a while ago, there is no "creation thread restriction" 
on NS_CRYPTO_HASH_CONTRACTID [1]. Instead, the enforcement in at [2],
which is only enforced for the very first time.

[1] http://searchfox.org/mozilla-central/rev/224cc663d54085994a4871ef464b7662e0721e83/security/manager/ssl/nsNSSModule.cpp#191
[2] http://searchfox.org/mozilla-central/rev/6580dd9a4eb0224773897388dba2ddf5ed7f4216/security/manager/ssl/nsNSSComponent.cpp#101
[3] http://searchfox.org/mozilla-central/rev/224cc663d54085994a4871ef464b7662e0721e83/toolkit/components/url-classifier/SafeBrowsing.jsm#454

> you will find that when an nss instance is created off the main thread it
> does a synchronous dispatch to the main thread to ensure initialization
> (even if it is indeed initialized)
> 
> so we deadlock if these things are done simultaneously.
> 
> There is a shared state variable, gShuttingDownThread, that looks like it
> might be meant to guard against some of this but without any locking it is
> easy to see how it could be racy.
> 
> I can trivially reproduce this by grabbing a ref to the classifier service
> in nsHttpHandler::Init() and then running some tests that never use http..
> basically the startup of the thread is racing against its shutdown the test
> completes so quickly. toolkit/components/crashmonitor/test/unit/test_init.js
> seems to repro the problem easily.
> 
> but even without my change to handler, this seems like a latent bug that has
> been closed WONTFIX a couple times lacking fresh data, a repro, or a theory.

This should be introduced by bug 1339050, where I significantly changed the 
shutdown process.

Updated

7 months ago
Blocks: 1345058
Priority: P1 → P3
(Assignee)

Updated

4 months ago
Assignee: hchang → tnguyen
(Assignee)

Comment 3

4 months ago
(In reply to Henry Chang [:hchang] from comment #1)
> I was aware of the fact that 
> 
> mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> 
> will need to SyncRunnable::Dispatch to the main thread. 
> 
> However, only the very first "do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID)"
> on non-main thread will need to do SyncRunnable::Dispatch to the main thread.
> Maybe that's why I haven't run into this potential deadlock. I did see this
> kind of deadlocks in other place but have got around it.
> 
> As for gShuttingDownThread, I would say it's just a "hint" for making
> shutdown
> more quickly. Racing on gShuttingDownThread is totally expected and it should
> lead to any issue.
> 
> I'll try if I can reproduce it tomorrow. Thanks for bring this up!

I am trying to catch the idea and follow what Henry has done in this bug.
I see the case in this bug, we still use SyncRunnable::Dispatch EnsureNSSInitializedChromeOrContent off main thread, even we are already sure nss is already initialized in mainthread.
Hi dkeeler, do we really need to send SyncRunnable::Dispatch in this case? Is it risky to remove that dispatching if nss is initialized?
I guess SyncRunnable::Dispatch EnsureNSSInitializedChromeOrContent will be useful to trigger NSS initialization from background at the first time (and after that we will use cached variable
http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/security/manager/ssl/nsNSSComponent.cpp#90)
Flags: needinfo?(dkeeler)
I would start by removing the mCryptoHash members of nsUrlClassifierDBServiceWorker and ProtocolParser. The hasher is only used in SafebrowsingHash::FromPlaintext, and since that usage consists of a complete set of Init, Update, and Finish, there's no reason to keep it around. I would then either add some synchronization to ensure that the background thread can't enter that section if the browser is in shutdown or add something before the shutdown that ensures NSS has been initialized (e.g. an explicit call to EnsureNSSInitializedChromeOrContent) (but honestly that seems like a waste if all we're doing is ensuring that we don't call that code at the wrong time).
Flags: needinfo?(dkeeler)
(Assignee)

Comment 5

4 months ago
> I would start by removing the mCryptoHash members of
> nsUrlClassifierDBServiceWorker and ProtocolParser. The hasher is only used
> in SafebrowsingHash::FromPlaintext, and since that usage consists of a
> complete set of Init, Update, and Finish, there's no reason to keep it

Thanks David for the prompt reply. 
If I understand your idea correctly, we are going to move the hash related to mainthread. It seems break work flow 
http://searchfox.org/mozilla-central/rev/6482c8a5fa5c7446e82ef187d1a1faff49e3379e/toolkit/components/url-classifier/Classifier.cpp#473
We may have to change in many interfaces, and do much more stuffs in main thread to pass correct argument to worker thread later. We may lose a perf impact, I am not sure if there's any risk.

> around. I would then either add some synchronization to ensure that the
> background thread can't enter that section if

We had ensured background thread could enter if ff is going to shutdown phase.
http://searchfox.org/mozilla-central/rev/6482c8a5fa5c7446e82ef187d1a1faff49e3379e/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#936,955

> add something before the shutdown that ensures NSS has been initialized
> (e.g. an explicit call to EnsureNSSInitializedChromeOrContent) 

It does not seem to work because shutdown() is triggered in main thread. The problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS initialized in main thread, EnsureNSSInitializedChromeOrConten off thread still trigger SyncRunnable::Dispatch (only for the first time of every worker thread). That is the same with comment 3.
(Assignee)

Comment 6

4 months ago
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #5)
> 
> It does not seem to work because shutdown() is triggered in main thread. The
> problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS
> initialized in main thread, EnsureNSSInitializedChromeOrConten off thread
> still trigger SyncRunnable::Dispatch (only for the first time of every
> worker thread). That is the same with comment 3.
The simpler idea is not dispatching from background to main thread when the worker has not been initialized yet (actually it will be bailed out and does not do anything)
(Assignee)

Comment 7

4 months ago
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #6)
> (In reply to Thomas Nguyen[:tnguyen] ni plz from comment #5)
> > 
> > It does not seem to work because shutdown() is triggered in main thread. The
> > problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS
> > initialized in main thread, EnsureNSSInitializedChromeOrConten off thread
> > still trigger SyncRunnable::Dispatch (only for the first time of every
> > worker thread). That is the same with comment 3.
> The simpler idea is not dispatching from background to main thread when the
> worker has not been initialized yet (actually it will be bailed out and does
> not do anything)

I could easily reproduce the deadlock by running
toolkit/components/crashmonitor/test/unit/test_init.js
Indeed, that was the race when we are shutting down too quickly when mClassifier is still nullptr
Comment hidden (mozreview-request)
(Assignee)

Comment 9

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52c47be10708cc713d0c9ccc7a3b51f5c4798389&selectedJob=123181406
(Assignee)

Updated

4 months ago
Attachment #8897268 - Flags: review?(hchang)
(Assignee)

Comment 10

4 months ago
I removed mCryptoHash in some places which contain full set of Init, Update, Finish and added a condition check in shutdown() and tested on my local machine (we can easily reproduce the deadlock as comment 0)
Henry, could you please take a look?
(Assignee)

Updated

4 months ago
Priority: P3 → P2
Whiteboard: #sbv4-m9
(Assignee)

Comment 11

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ef65e48d611922ccb7dcdebbf473a2f1d5479d7
(Assignee)

Comment 12

4 months ago
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #11)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=0ef65e48d611922ccb7dcdebbf473a2f1d5479d7

Running 20 tests on top of backed out commit in Bug 1345058 and it looks good
(Assignee)

Comment 13

4 months ago
As discussed with Henry, we may care about the case FlushAndDisableAsyncUpdate does not work expectly. It will not happen in real life but may happen in our test, for example when ResetDatabase and ReloadDatabase are called (those are only used for test purpose).
http://searchfox.org/mozilla-central/rev/13148faaa91a1c823a7d68563d9995480e714979/toolkit/components/url-classifier/tests/browser/classifierHelper.js#135
Running full try to see if we could pass all the tests
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a22868a29707589998c0433947789b4b4817fa3
(Assignee)

Comment 14

4 months ago
It looks good, except wpt failures 
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a22868a29707589998c0433947789b4b4817fa3&selectedJob=123446172
That's related to ipc PrincipalFlashClassifier::Result, should be fixed in Bug 1345058
(Assignee)

Comment 15

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=23f8c1bc52187b3c05e131f57aa6344f6d1eda32
(Assignee)

Comment 16

4 months ago
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #15)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=23f8c1bc52187b3c05e131f57aa6344f6d1eda32
Try looks good. I did not see any racing problem between shutting down againts reseting/reloading database

Comment 17

4 months ago
mozreview-review
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176110

Given that what I am worried about is not happening [1], I am positive to this patch.
However, I feel like what we have to do is just leave the IsDBOpened() function and
keep

mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

as it in OpenDb(). Also, we probably shouldn't null out mCryptoHash in CloseDb().
Thomas, could you give that a try?

Thanks!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1363038#c16

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:942
(Diff revision 1)
>  
>    nsresult rv;
> -  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> +  {
> +    // Force PSM loading
> +    nsCOMPtr<nsICryptoHash> dummy = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> -  NS_ENSURE_SUCCESS(rv, rv);
> +    NS_ENSURE_SUCCESS(rv, rv);

I am thinking if we can just do

  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

here and leave the IsDBOpened() check in Shutdown() so that we don't need to bother removing mCryptoHash everywhere.
(Assignee)

Comment 18

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b3f6b9b62ad9227f52714affc312597fef845d05
(Assignee)

Comment 19

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ad0fd284302d4ac612ff4048bbaf3c607debb21d
(Assignee)

Comment 20

4 months ago
mozreview-review-reply
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176110

FF may crash when trying to release worker and mCryptoHash. I guess we may change to use weak ref, but it may not be obligated here.
So I just add IsDBOpened to resolve the race issue.

> I am thinking if we can just do
> 
>   mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> 
> here and leave the IsDBOpened() check in Shutdown() so that we don't need to bother removing mCryptoHash everywhere.

Sure, will file and do that in another bug if needed
Comment hidden (mozreview-request)
(Assignee)

Comment 22

4 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&tochange=b3f6b9b62ad9227f52714affc312597fef845d05&fromchange=3908649a27d1f9b0ae4bc9d9be396ebfe80703ff

Comment 23

4 months ago
mozreview-review
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176694

r=me given the try result looks good. We can back this simple patch out anytime if what I'm worried (nsIUrlClassifierDBService.resetDatabase/reloadDatabase) occurs. Thanks Thomas :)
Attachment #8897268 - Flags: review?(hchang) → review+
(Assignee)

Updated

4 months ago
Keywords: checkin-needed

Comment 24

4 months ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/af78a23cda9c
Remove synchronous dispatch when worker is racing against shutdown r=hchang
Keywords: checkin-needed

Comment 25

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/af78a23cda9c
Status: ASSIGNED → RESOLVED
Last Resolved: 4 months ago
status-firefox57: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.