Deadlock during startup in nsRFPService::RandomMidpoint

RESOLVED FIXED in Firefox 60

Status

()

defect
--
critical
RESOLVED FIXED
Last year
4 months ago

People

(Reporter: sergey.kogan, Assigned: tjr)

Tracking

({hang})

60 Branch
mozilla61
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox59 unaffected, firefox60+ fixed, firefox61+ fixed)

Details

Attachments

(2 attachments)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36

Steps to reproduce:

Start Firefox with an URL as a command line argument (https://yandex.ru for example).


Actual results:

Firefox sometimes shows the main window but the window is blank and doesn't respond to input.
See the screenshot attached.

The issue is pretty rare. It started to happen on 03/15 with Firefox 60.0.0.6645 and continued with 60.0.0.6648. All cases were on Windows 10 RS3. It happened 8 times so far.

Please contact me to get a memory dump.
Digging into the memory dump shows that this is a simple deadlock.

The main thread is waiting inside nsRFPService::RandomMidpoint trying to acquire a lock.

Line 346: https://dxr.mozilla.org/mozilla-central/rev/6ff60a083701d08c52702daf50f28e8f46ae3a1c/toolkit/components/resistfingerprinting/nsRFPService.cpp#346

The call stack is:

...
06 ntdll!RtlEnterCriticalSection
07 mozglue!mozilla::detail::MutexImpl::lock
08 xul!mozilla::OffTheBooksMutex::Lock
09 xul!mozilla::StaticMutex::Lock
0a xul!mozilla::BaseAutoLock<mozilla::StaticMutex>::{ctor}
0b xul!mozilla::nsRFPService::RandomMidpoint <---------------- trying to acquire the lock
0c xul!mozilla::nsRFPService::ReduceTimePrecisionImpl
0d xul!mozilla::nsRFPService::ReduceTimePrecisionAsUSecsWrapper
0e xul!NowAsMillis
0f xul!js::date_now
10 xul!js::CallJSNative
...

Meanwhile that critical section is locked by another thread which is also inside nsRFPService::RandomMidpoint:

...
07 xul!mozilla::MonitorAutoLock::Wait
08 xul!mozilla::SyncRunnable::DispatchToThread
09 xul!mozilla::SyncRunnable::DispatchToThread
0a xul!EnsureNSSInitializedChromeOrContent
0b xul!mozilla::psm::Constructor<nsRandomGenerator,0,1,1>
0c xul!mozilla::GenericFactory::CreateInstance
0d xul!nsComponentManagerImpl::CreateInstanceByContractID
0e xul!nsComponentManagerImpl::GetServiceByContractID
0f xul!CallGetService
10 xul!nsGetServiceByContractIDWithError::operator()
11 xul!nsCOMPtr_base::assign_from_gs_contractid_with_error
12 xul!nsCOMPtr<nsIRandomGenerator>::{ctor}
13 xul!mozilla::nsRFPService::RandomMidpoint <------------- running the frames above under the lock
14 xul!mozilla::nsRFPService::ReduceTimePrecisionImpl
15 xul!mozilla::nsRFPService::ReduceTimePrecisionAsUSecsWrapper
16 xul!NowAsMillis
17 xul!js::date_now
18 xul!js::CallJSNative
...

From the code of EnsureNSSInitializedChromeOrContent it seems that it tries to dispatch a runnable to the main thread and waits for the result. The problem is that the main thread is busy waiting for the lock that was locked by the second thread.

Other dumps that I made have the same deadlock with the same stacks.
Severity: normal → critical
Component: Untriaged → DOM: Events
Keywords: hang
Product: Firefox → Core
Version: unspecified → 60 Branch
Component: DOM: Events → DOM
Flags: needinfo?(tom)
Firstly, thank you so much for digging into this and finding the root cause.

Hrm.

The first thing that comes to mind is a hack. If NSS is not initialized, and we're not on the main thread - don't do anything that might initialize NSS.

But as the repro steps illustrate: we would be returning unjittered time to web content. Practically speaking, one needs to run an attack using a high res timer for 'some amount of time' and NSS will surely be initialized and start jitting time early on in the process. So it's.... probably safe? But it's definitely not correct and it's quite ugly.
Flags: needinfo?(tom)
Assignee: nobody → tom
Depends on: 1440195, 1443943
Tom - how about not holding sLock while getting a handle on the random generator service? i.e.

    // If we don't have a seed, we need to get one.
    if (MOZ_UNLIKELY(!sSecretMidpointSeed)) {
      nsCOMPtr<nsIRandomGenerator> randomGenerator =
          do_GetService("@mozilla.org/security/random-generator;1", &rv);
      if (NS_WARN_IF(NS_FAILED(rv))) { return rv; }
      StaticMutexAutoLock lock(sLock);
      if (MOZ_LIKELY(!sSecretMidpointSeed)) {
        rv = randomGenerator->GenerateRandomBytes(kSeedSize, &sSecretMidpointSeed);
        if (NS_WARN_IF(NS_FAILED(rv))) { return rv; }
      }
    }
Flags: needinfo?(tom)
Comment on attachment 8961482 [details]
Bug 1447685 Move nsIRandomGenerator creation out of a lock to avoid a deadlock

https://reviewboard.mozilla.org/r/230256/#review235856
Attachment #8961482 - Flags: review?(dkeeler) → review+
Flags: needinfo?(tom)
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/e63bed369163
Move nsIRandomGenerator creation out of a lock to avoid a deadlock r=keeler
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/e63bed369163
Status: UNCONFIRMED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Sergei - any chance you could confirm this is fixed in the latest Nightly?
Flags: needinfo?(sergey.kogan)
(In reply to Tom Ritter [:tjr] from comment #8)
> Sergei - any chance you could confirm this is fixed in the latest Nightly?

Unfortunately the issue is hard to reproduce. I've seen it 9 times since 03/15 and during that time we've done roughly 8000 launches of Firefox 60.* in our test suite.
Flags: needinfo?(sergey.kogan)
Comment on attachment 8961482 [details]
Bug 1447685 Move nsIRandomGenerator creation out of a lock to avoid a deadlock

Approval Request Comment
[Feature/Bug causing the regression]: Deadlock in Time Jittering

[User impact if declined]: Occasional deadlocks

[Is this code covered by automated tests?]: No

[Has the fix been verified in Nightly?]: No; I couldn't repro it. But Nightly works so it didn't break it entirely...

[Needs manual test from QE? If yes, steps to reproduce]: No

[List of other uplifts needed for the feature/fix]: No

[Is the change risky?]: No

[Why is the change risky/not risky?]: It's moving code out of a critical section
Attachment #8961482 - Flags: approval-mozilla-beta?
Comment on attachment 8961482 [details]
Bug 1447685 Move nsIRandomGenerator creation out of a lock to avoid a deadlock

fix a deadlock regression in beta60
Attachment #8961482 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.