Closed Bug 1363038 Opened 7 years ago Closed 7 years ago

nsUrlClassifierDBService shutdown deadlock

Categories

(Toolkit :: Safe Browsing, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: mcmanus, Assigned: tnguyen)

References

Details

(Whiteboard: #sbv4-m9)

Attachments

(1 file)

shutdown() does a synchronous dispatch to the url classifier worker thread

  SyncRunnable::DispatchToThread(gDbBackgroundThread, r);

meanwhile, on the worker thread in openDB()

  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

and if you dig into nsNSSComponents.cpp 101 (from that createInstance stack) you will find that when an nss instance is created off the main thread it does a synchronous dispatch to the main thread to ensure initialization (even if it is indeed initialized)

so we deadlock if these things are done simultaneously.

There is a shared state variable, gShuttingDownThread, that looks like it might be meant to guard against some of this but without any locking it is easy to see how it could be racy.

I can trivially reproduce this by grabbing a ref to the classifier service in nsHttpHandler::Init() and then running some tests that never use http.. basically the startup of the thread is racing against its shutdown the test completes so quickly. toolkit/components/crashmonitor/test/unit/test_init.js seems to repro the problem easily.

but even without my change to handler, this seems like a latent bug that has been closed WONTFIX a couple times lacking fresh data, a repro, or a theory.
I was aware of the fact that 

mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

will need to SyncRunnable::Dispatch to the main thread. 

However, only the very first "do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID)"
on non-main thread will need to do SyncRunnable::Dispatch to the main thread.
Maybe that's why I haven't run into this potential deadlock. I did see this
kind of deadlocks in other place but have got around it.

As for gShuttingDownThread, I would say it's just a "hint" for making shutdown
more quickly. Racing on gShuttingDownThread is totally expected and it should
lead to any issue.

I'll try if I can reproduce it tomorrow. Thanks for bring this up!
Assignee: nobody → hchang
Status: NEW → ASSIGNED
Priority: -- → P1
(In reply to Patrick McManus [:mcmanus] from comment #0)
> shutdown() does a synchronous dispatch to the url classifier worker thread
> 
>   SyncRunnable::DispatchToThread(gDbBackgroundThread, r);
> 
> meanwhile, on the worker thread in openDB()
> 
>   mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
>
> and if you dig into nsNSSComponents.cpp 101 (from that createInstance stack)

From what I dug a while ago, there is no "creation thread restriction" 
on NS_CRYPTO_HASH_CONTRACTID [1]. Instead, the enforcement in at [2],
which is only enforced for the very first time.

[1] http://searchfox.org/mozilla-central/rev/224cc663d54085994a4871ef464b7662e0721e83/security/manager/ssl/nsNSSModule.cpp#191
[2] http://searchfox.org/mozilla-central/rev/6580dd9a4eb0224773897388dba2ddf5ed7f4216/security/manager/ssl/nsNSSComponent.cpp#101
[3] http://searchfox.org/mozilla-central/rev/224cc663d54085994a4871ef464b7662e0721e83/toolkit/components/url-classifier/SafeBrowsing.jsm#454

> you will find that when an nss instance is created off the main thread it
> does a synchronous dispatch to the main thread to ensure initialization
> (even if it is indeed initialized)
> 
> so we deadlock if these things are done simultaneously.
> 
> There is a shared state variable, gShuttingDownThread, that looks like it
> might be meant to guard against some of this but without any locking it is
> easy to see how it could be racy.
> 
> I can trivially reproduce this by grabbing a ref to the classifier service
> in nsHttpHandler::Init() and then running some tests that never use http..
> basically the startup of the thread is racing against its shutdown the test
> completes so quickly. toolkit/components/crashmonitor/test/unit/test_init.js
> seems to repro the problem easily.
> 
> but even without my change to handler, this seems like a latent bug that has
> been closed WONTFIX a couple times lacking fresh data, a repro, or a theory.

This should be introduced by bug 1339050, where I significantly changed the 
shutdown process.
Blocks: 1345058
Priority: P1 → P3
Assignee: hchang → tnguyen
(In reply to Henry Chang [:hchang] from comment #1)
> I was aware of the fact that 
> 
> mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> 
> will need to SyncRunnable::Dispatch to the main thread. 
> 
> However, only the very first "do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID)"
> on non-main thread will need to do SyncRunnable::Dispatch to the main thread.
> Maybe that's why I haven't run into this potential deadlock. I did see this
> kind of deadlocks in other place but have got around it.
> 
> As for gShuttingDownThread, I would say it's just a "hint" for making
> shutdown
> more quickly. Racing on gShuttingDownThread is totally expected and it should
> lead to any issue.
> 
> I'll try if I can reproduce it tomorrow. Thanks for bring this up!

I am trying to catch the idea and follow what Henry has done in this bug.
I see the case in this bug, we still use SyncRunnable::Dispatch EnsureNSSInitializedChromeOrContent off main thread, even we are already sure nss is already initialized in mainthread.
Hi dkeeler, do we really need to send SyncRunnable::Dispatch in this case? Is it risky to remove that dispatching if nss is initialized?
I guess SyncRunnable::Dispatch EnsureNSSInitializedChromeOrContent will be useful to trigger NSS initialization from background at the first time (and after that we will use cached variable
http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/security/manager/ssl/nsNSSComponent.cpp#90)
Flags: needinfo?(dkeeler)
I would start by removing the mCryptoHash members of nsUrlClassifierDBServiceWorker and ProtocolParser. The hasher is only used in SafebrowsingHash::FromPlaintext, and since that usage consists of a complete set of Init, Update, and Finish, there's no reason to keep it around. I would then either add some synchronization to ensure that the background thread can't enter that section if the browser is in shutdown or add something before the shutdown that ensures NSS has been initialized (e.g. an explicit call to EnsureNSSInitializedChromeOrContent) (but honestly that seems like a waste if all we're doing is ensuring that we don't call that code at the wrong time).
Flags: needinfo?(dkeeler)
> I would start by removing the mCryptoHash members of
> nsUrlClassifierDBServiceWorker and ProtocolParser. The hasher is only used
> in SafebrowsingHash::FromPlaintext, and since that usage consists of a
> complete set of Init, Update, and Finish, there's no reason to keep it

Thanks David for the prompt reply. 
If I understand your idea correctly, we are going to move the hash related to mainthread. It seems break work flow 
http://searchfox.org/mozilla-central/rev/6482c8a5fa5c7446e82ef187d1a1faff49e3379e/toolkit/components/url-classifier/Classifier.cpp#473
We may have to change in many interfaces, and do much more stuffs in main thread to pass correct argument to worker thread later. We may lose a perf impact, I am not sure if there's any risk.

> around. I would then either add some synchronization to ensure that the
> background thread can't enter that section if

We had ensured background thread could enter if ff is going to shutdown phase.
http://searchfox.org/mozilla-central/rev/6482c8a5fa5c7446e82ef187d1a1faff49e3379e/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#936,955

> add something before the shutdown that ensures NSS has been initialized
> (e.g. an explicit call to EnsureNSSInitializedChromeOrContent) 

It does not seem to work because shutdown() is triggered in main thread. The problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS initialized in main thread, EnsureNSSInitializedChromeOrConten off thread still trigger SyncRunnable::Dispatch (only for the first time of every worker thread). That is the same with comment 3.
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #5)
> 
> It does not seem to work because shutdown() is triggered in main thread. The
> problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS
> initialized in main thread, EnsureNSSInitializedChromeOrConten off thread
> still trigger SyncRunnable::Dispatch (only for the first time of every
> worker thread). That is the same with comment 3.
The simpler idea is not dispatching from background to main thread when the worker has not been initialized yet (actually it will be bailed out and does not do anything)
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #6)
> (In reply to Thomas Nguyen[:tnguyen] ni plz from comment #5)
> > 
> > It does not seem to work because shutdown() is triggered in main thread. The
> > problem is, no matter EnsureNSSInitializedChromeOrContent made sure of NSS
> > initialized in main thread, EnsureNSSInitializedChromeOrConten off thread
> > still trigger SyncRunnable::Dispatch (only for the first time of every
> > worker thread). That is the same with comment 3.
> The simpler idea is not dispatching from background to main thread when the
> worker has not been initialized yet (actually it will be bailed out and does
> not do anything)

I could easily reproduce the deadlock by running
toolkit/components/crashmonitor/test/unit/test_init.js
Indeed, that was the race when we are shutting down too quickly when mClassifier is still nullptr
Attachment #8897268 - Flags: review?(hchang)
I removed mCryptoHash in some places which contain full set of Init, Update, Finish and added a condition check in shutdown() and tested on my local machine (we can easily reproduce the deadlock as comment 0)
Henry, could you please take a look?
Priority: P3 → P2
Whiteboard: #sbv4-m9
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #11)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=0ef65e48d611922ccb7dcdebbf473a2f1d5479d7

Running 20 tests on top of backed out commit in Bug 1345058 and it looks good
As discussed with Henry, we may care about the case FlushAndDisableAsyncUpdate does not work expectly. It will not happen in real life but may happen in our test, for example when ResetDatabase and ReloadDatabase are called (those are only used for test purpose).
http://searchfox.org/mozilla-central/rev/13148faaa91a1c823a7d68563d9995480e714979/toolkit/components/url-classifier/tests/browser/classifierHelper.js#135
Running full try to see if we could pass all the tests
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a22868a29707589998c0433947789b4b4817fa3
It looks good, except wpt failures 
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a22868a29707589998c0433947789b4b4817fa3&selectedJob=123446172
That's related to ipc PrincipalFlashClassifier::Result, should be fixed in Bug 1345058
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #15)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=23f8c1bc52187b3c05e131f57aa6344f6d1eda32
Try looks good. I did not see any racing problem between shutting down againts reseting/reloading database
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176110

Given that what I am worried about is not happening [1], I am positive to this patch.
However, I feel like what we have to do is just leave the IsDBOpened() function and
keep

mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

as it in OpenDb(). Also, we probably shouldn't null out mCryptoHash in CloseDb().
Thomas, could you give that a try?

Thanks!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1363038#c16

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:942
(Diff revision 1)
>  
>    nsresult rv;
> -  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> +  {
> +    // Force PSM loading
> +    nsCOMPtr<nsICryptoHash> dummy = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> -  NS_ENSURE_SUCCESS(rv, rv);
> +    NS_ENSURE_SUCCESS(rv, rv);

I am thinking if we can just do

  mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);

here and leave the IsDBOpened() check in Shutdown() so that we don't need to bother removing mCryptoHash everywhere.
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176110

FF may crash when trying to release worker and mCryptoHash. I guess we may change to use weak ref, but it may not be obligated here.
So I just add IsDBOpened to resolve the race issue.

> I am thinking if we can just do
> 
>   mCryptoHash = do_CreateInstance(NS_CRYPTO_HASH_CONTRACTID, &rv);
> 
> here and leave the IsDBOpened() check in Shutdown() so that we don't need to bother removing mCryptoHash everywhere.

Sure, will file and do that in another bug if needed
Comment on attachment 8897268 [details]
Bug 1363038 - Remove synchronous dispatch when worker is racing against shutdown

https://reviewboard.mozilla.org/r/168552/#review176694

r=me given the try result looks good. We can back this simple patch out anytime if what I'm worried (nsIUrlClassifierDBService.resetDatabase/reloadDatabase) occurs. Thanks Thomas :)
Attachment #8897268 - Flags: review?(hchang) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/af78a23cda9c
Remove synchronous dispatch when worker is racing against shutdown r=hchang
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/af78a23cda9c
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: