Closed Bug 1461731 Opened 2 years ago Closed 2 years ago

Crash in shutdownhang | libsystem_kernel.dylib@0x1ca1e

Categories

(NSS :: Libraries, defect, P1, critical)

Unspecified
macOS
defect

Tracking

(relnote-firefox 60+, firefox-esr52 unaffected, firefox-esr6060+ fixed, firefox60+ fixed, firefox61+ fixed, firefox62+ fixed)

RESOLVED FIXED
Tracking Status
relnote-firefox --- 60+
firefox-esr52 --- unaffected
firefox-esr60 60+ fixed
firefox60 + fixed
firefox61 + fixed
firefox62 + fixed

People

(Reporter: marcia, Assigned: froydnj)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(6 files, 3 obsolete files)

This bug was filed from the Socorro interface and is
report bp-690da5e1-4448-44e9-8900-ef3db0180515.
=============================================================

Currently at the top of **Mac** crash stats in 60, although present in other versions as well: https://bit.ly/2KqQba0. All users on 60 are running 10.13.

Comments:
*Firefox gets stuck (and sometimes crashes) since installation of 60.0 
*60 is very buggy on my iMac. Freezes constantly. I have to use Chrome 
*no more network access (from firefox only). Safari is still ok. 

MOZ_CRASH(Shutdown hanging before starting.) - Seems like a bad user experience.

Top 10 frames of crashing thread:

0 libsystem_kernel.dylib libsystem_kernel.dylib@0x1ca1e 
1 libsystem_pthread.dylib _pthread_cond_wait 
2 libsystem_pthread.dylib pthread_cond_signal_thread_np 
3 libmozglue.dylib <name omitted> mozglue/misc/ConditionVariable_posix.cpp:118
4 XUL mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/CondVar.h:68
5 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:967
6 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:517
7 XUL mozilla::net::nsHttpConnectionMgr::Shutdown xpcom/threads/nsThreadUtils.h:323
8 XUL mozilla::net::nsHttpHandler::Observe netwerk/protocol/http/nsHttpHandler.cpp:2754
9 XUL non-virtual thunk to mozilla::net::nsHttpHandler::Observe netwerk/protocol/http/nsHttpHandler.cpp

=============================================================
All users on 60 are running 10.13.4 17E202. I don't see any obvious correlations.
Hey Stephen, can you take a look, I think we're shutting down a thread that hasn't started yet.
Flags: needinfo?(spohl.mozilla.bugs)
Adding another signature which is 10.12 specific. All of those crashes save one are MOZ_CRASH(Shutdown hanging before starting.
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] [@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
Hello, we are bumping into this problem too (hang of FF after opening several tabs).

We have sent several crash reports today like this one:

https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-258ed1180522

After further research, we noticed that the problem seems linked to the presence of a security module (pkcs#11)
configured in Firefox.

Apparently there is the same issue with the Belgium eID security module (as we found in the crash statistics database):

https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-5c3ab1180522

Some threads are blocked in libnss3.dylib calling  nssSlot_IsTokenPresent() function
(In reply to BPER_ILEX from comment #4)
> Hello, we are bumping into this problem too (hang of FF after opening
> several tabs).
> 
> We have sent several crash reports today like this one:
> 
> https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-
> 258ed1180522

Thank you for commenting! Could you clarify step by step how you get into this state? Does Firefox hang right away? Or do you have to visit a particular site before the hang occurs? Is the crash report generated automatically after some time? Or do you try to exit Firefox and it is then that Firefox crashes?

> After further research, we noticed that the problem seems linked to the
> presence of a security module (pkcs#11)
> configured in Firefox.

Is there a way for us to reproduce this, or does this require some kind of smart-card reader or other device?

> Apparently there is the same issue with the Belgium eID security module (as
> we found in the crash statistics database):
> 
> https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-
> 5c3ab1180522

Looking into this.
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(bertrand.perret)
Same issue with Gemalto security module:
 https://crash-stats.mozilla.com/report/index/c2458f7f-5bb4-485b-84f0-414ef0180522

So the this problem seems general to all security modules handled by Firefox and especially the NSS layer.
(In reply to Stephen A Pohl [:spohl] from comment #5)
> (In reply to BPER_ILEX from comment #4)
> > Hello, we are bumping into this problem too (hang of FF after opening
> > several tabs).
> > 
> > We have sent several crash reports today like this one:
> > 
> > https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-
> > 258ed1180522
> 
> Thank you for commenting! Could you clarify step by step how you get into
> this state? Does Firefox hang right away? Or do you have to visit a
> particular site before the hang occurs? Is the crash report generated
> automatically after some time? Or do you try to exit Firefox and it is then
> that Firefox crashes?

The steps to reproduce:

0°) Lanch FF (current release)
1°) add a security module in FF
2°) Close FF
3°) Connect a PC/SC smartcard reader to the Mac
4°) Launch FF
5°) Go to Google News
6°) Open (say 5 items) in a separate tab

At one moment, a tab doesn't display its content

If we contine, further tabs won't work

If we try to close FF, it remains in memory

So we force it to close => then the Apple report dialog appears.

Click 'signal to Apple'

> 
> > After further research, we noticed that the problem seems linked to the
> > presence of a security module (pkcs#11)
> > configured in Firefox.
> 
> Is there a way for us to reproduce this, or does this require some kind of
> smart-card reader or other device?

You will need at least a reader and a security module downloadable here: http://testssl.asipsante.fr

> 
> > Apparently there is the same issue with the Belgium eID security module (as
> > we found in the crash statistics database):
> > 
> > https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-
> > 5c3ab1180522
> 
> Looking into this.
Flags: needinfo?(bertrand.perret)
To get the Firefox crash report dialog when FF hangs

We type in a Terminal:  kill -s SIGABRT <PID>
Sorry, there is an erratum in comment 7:

The installer for the security module: https://diagcps.eservices.esante.gouv.fr/
Use the "Diagnostiquer mon poste" button to acces the PKG download link


Web extension to configure the security module downloadable here: http://testssl.asipsante.fr
Another crash with FF 61b7 on macOS Sierra.

https://crash-stats.mozilla.com/report/index/6d11f44e-6e53-4b70-a723-461d00180523

With following user comments:

Security module configured.
Opening tabs from Google News on HTTPS url.
At one moment, the new tab does not show its content.
Further tabs do not either.
Closing Firefox then it hangs => then wait for it to crash
Thank you! I have ordered a Yubikey and will try to reproduce at the end of the week.
Priority: -- → P1
Attached image Pb_FF60_61_62.png
We easily reproduce the issue with all the builds supplied in your comment.

When the pb occurs we get a status message 'Performing TLS handshake on ...'

In attachment Screenshot revealing the problem of tab not showing its content.
Flags: needinfo?(bertrand.perret)
(In reply to Alvaro from comment #14)
> Hi,
> 
> We are not able to reproduce the issue with this build: 
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg
> 
> but occurs in the next build :
> https://archive.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-22-00-43-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg

Thank you! This was sort of what I expected, but I was hoping it would have started to work again with one of the later builds.

The first build was a build prior to landing an update to NSS to upgrade to version 3.36. The second build was the first build with version 3.36[1], which was known to cause issues with PKCS#11. The third build was supposed to have a fix for this (bug 1447628), but this does not appear to have resolved the issue, at least on macOS. All subsequent builds were after any further updates to NSS, which also do not appear to have resolved this. Will investigate.

[1] https://hg.mozilla.org/mozilla-central/rev/7ced88e883e9
(In reply to BPER_ILEX from comment #13)
> Created attachment 8980197 [details]
> Pb_FF60_61_62.png
> 
> We easily reproduce the issue with all the builds supplied in your comment.
> 
> When the pb occurs we get a status message 'Performing TLS handshake on ...'
> 
> In attachment Screenshot revealing the problem of tab not showing its
> content.

Could you double-check that this is also the case with the following build? PKCS#11 was expected to still work in this build:

https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-mozilla-central/firefox-60.0a1.en-US.mac.dmg
Flags: needinfo?(bertrand.perret)
(In reply to Stephen A Pohl [:spohl] from comment #16)
> Could you double-check that this is also the case with the following build?
> PKCS#11 was expected to still work in this build:
> 
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg

Oh, I missed the fact that you two ("BPER_ILE" and "Alvaro") may be from the same group (domain name of your email addresses appeared to be different at first sight). If this is the case, this was all the confirmation I needed. Thank you!
(In reply to Stephen A Pohl [:spohl] from comment #16)
> (In reply to BPER_ILEX from comment #13)
> > Created attachment 8980197 [details]
> > Pb_FF60_61_62.png
> > 
> > We easily reproduce the issue with all the builds supplied in your comment.
> > 
> > When the pb occurs we get a status message 'Performing TLS handshake on ...'
> > 
> > In attachment Screenshot revealing the problem of tab not showing its
> > content.
> 
> Could you double-check that this is also the case with the following build?
> PKCS#11 was expected to still work in this build:
> 
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg

Hi,

I have performed another check for this particular build.
And I confirm that it does work with our security module configured.
Flags: needinfo?(bertrand.perret)
Attached image Build_OK_Infos.png
about:support infos for the last 60 Nightly build that works.
Attached image NSS_Libs_OK.png
NSS libraries versions shipped with the last 60 Nightly build that works with security module configured.
Attached image Build_KO_Infos.png
about:support infos for the last 60 Nightly build that no longer works.
Attached image NSS_Libs_KO.png
NSS libraries versions shipped with the last 60 Nightly build that no longer works with security module configured.
(In reply to BPER_ILEX from comment #21)
> Created attachment 8980511 [details]
> Build_KO_Infos.png
> 
> about:support infos for the last 60 Nightly build that no longer works.

Erratum:
about:support infos for the first 60 Nightly build that no longer works.
(In reply to BPER_ILEX from comment #22)
> Created attachment 8980512 [details]
> NSS_Libs_KO.png
> 
> NSS libraries versions shipped with the last 60 Nightly build that no longer
> works with security module configured.

Erratum:
NSS libraries versions shipped with the first 60 Nightly build that no longer
> works with security module configured.
Could someone who has been able to reproduce this issue test these builds and let me know if this is fixed in either one (or both) of them?

1. https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/artifacts/public/build/target.dmg
2. https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/artifacts/public/build/target.dmg
Flags: needinfo?(bertrand.perret)
Flags: needinfo?(alvaro.rocha)
I just received my Yubikey and I believe that both builds in comment 27 have this problem resolved. It would be great to get confirmation of my finding. I will post a patch and more detail over the weekend.
Attached patch Patch (obsolete) — Splinter Review
Here is what seems to be happening:

In nssSlot_IsTokenPresent, we want to make sure that only one thread executes some code at a time[1]. However, we can currently run into deadlocks when trying to do so. Looking at this[2] crash report for example, here is what's happening (make sure to display all threads by clicking the link at the bottom of the page):

1. Thread 4 has entered the section of code that is only supposed to be run by one thread.
2. It acquired the lock[3]...
3. ...but released the lock a few lines later[4].
4. Thread 59 now acquires the lock[3].
5. But thread 59 isn't allowed to execute the section of code yet because it isn't the present thread.
6. So it waits. Without releasing the lock.[5].
7. When thread 4 goes to set isPresentThread to NULL and notify isPresentCondition, it attempts to acquire the lock again.[6]
8. But since thread 59 has acquired it and is waiting for isPresentCondition to be notified and isPresentThread to be NULL, we deadlock.
9. Last but not least, the main thread (thread 0) also attempts to acquire the lock and never succeeds, ultimately causing the hang in Firefox.[7]

This patch fixes the deadlock by releasing the lock before waiting for isPresentCondition to be notified, and by reacquiring it once this happens and before proceeding to make sure that no other race with another thread has occurred. Furthermore, we no longer release the lock before a thread has been able to notify isPresentCondition and set isPresentThread to NULL.


[1] https://dxr.mozilla.org/nss/rev/2f3ace03cbf89e9300c2dd09da2317a82e382791/nss/lib/dev/devslot.c#153-157
[2] https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-258ed1180522#allthreads
[3] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l154
[4] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l170
[5] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l156
[6] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l259
[7] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l138
Assignee: nobody → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8980922 - Flags: review?(rrelyea)
Component: XPCOM → Libraries
Product: Core → NSS
Version: Trunk → other
I'm curious what the best way is to get this fixed in mozilla-central as soon as possible. Should we get this fixed upstream and then update mozilla-central to the latest version of NSS after the patch has landed (similar to bug 1432177)? Or should we apply the patch to mozilla-central individually?
Blocks: 1432177
Depends on: 1447628
(In reply to Stephen A Pohl [:spohl] from comment #29)
> Created attachment 8980922 [details] [diff] [review]
> Patch
> 
> Here is what seems to be happening:
> 
> In nssSlot_IsTokenPresent, we want to make sure that only one thread
> executes some code at a time[1]. However, we can currently run into
> deadlocks when trying to do so. Looking at this[2] crash report for example,
> here is what's happening (make sure to display all threads by clicking the
> link at the bottom of the page):
> 
> 1. Thread 4 has entered the section of code that is only supposed to be run
> by one thread.
> 2. It acquired the lock[3]...
> 3. ...but released the lock a few lines later[4].
> 4. Thread 59 now acquires the lock[3].
> 5. But thread 59 isn't allowed to execute the section of code yet because it
> isn't the present thread.
> 6. So it waits. Without releasing the lock.[5].

Waiting on a condition variable implicitly releases the lock associated with the condition variable:

https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prcvar.h#38-42

But perhaps the threads are racing, and thread 4 tries to acquire the lock, and sleeps, before thread 59 releases the lock to wait on the condition variable?

In that case, I believe the correct fix is to remove the intermediate PZ_Unlock/PZ_Lock pair, as you have done, but to not release and re-acquire the lock around the condition variable wait.
(In reply to Nathan Froyd [:froydnj] from comment #31)
> (In reply to Stephen A Pohl [:spohl] from comment #29)
> > Created attachment 8980922 [details] [diff] [review]
> > Patch
> > 
> > Here is what seems to be happening:
> > 
> > In nssSlot_IsTokenPresent, we want to make sure that only one thread
> > executes some code at a time[1]. However, we can currently run into
> > deadlocks when trying to do so. Looking at this[2] crash report for example,
> > here is what's happening (make sure to display all threads by clicking the
> > link at the bottom of the page):
> > 
> > 1. Thread 4 has entered the section of code that is only supposed to be run
> > by one thread.
> > 2. It acquired the lock[3]...
> > 3. ...but released the lock a few lines later[4].
> > 4. Thread 59 now acquires the lock[3].
> > 5. But thread 59 isn't allowed to execute the section of code yet because it
> > isn't the present thread.
> > 6. So it waits. Without releasing the lock.[5].
> 
> Waiting on a condition variable implicitly releases the lock associated with
> the condition variable:
> 
> https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prcvar.h#38-
> 42
> 
> But perhaps the threads are racing, and thread 4 tries to acquire the lock,
> and sleeps, before thread 59 releases the lock to wait on the condition
> variable?

Interesting. If this is indeed the case, I can't quite explain why https://hg.mozilla.org/try/rev/92f182ace45032d713b0952656107e140fe6481a appeared to fix the deadlock for me. Maybe just changed the timing of things enough for the race to no longer occur? Try build with this change can be downloaded here:
https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/artifacts/public/build/target.dmg

> In that case, I believe the correct fix is to remove the intermediate
> PZ_Unlock/PZ_Lock pair, as you have done, but to not release and re-acquire
> the lock around the condition variable wait.

This would be the equivalent of my try run with this change:
https://hg.mozilla.org/try/rev/b65d256480fdf75cc98443eda64077e29667e46f

...resulting in try build:
https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/artifacts/public/build/target.dmg

...and this fixed the issue for me as well. I ended up combining the two fixes, for the patch that I've uploaded, but it sounds like the proper fix would be to only remove the intermediate PZ_Unlock/PZ_Lock pair. Will upload a new patch shortly. Thank you for your input, :froydnj!
Attached patch Patch (obsolete) — Splinter Review
Updated patch with feedback addressed.
Attachment #8980922 - Attachment is obsolete: true
Attachment #8980922 - Flags: review?(rrelyea)
Attachment #8980993 - Flags: review?(rrelyea)
(In reply to Stephen A Pohl [:spohl] from comment #27)

Hello,
> Could someone who has been able to reproduce this issue test these builds
> and let me know if this is fixed in either one (or both) of them?
> 
> 1.
> https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/
> artifacts/public/build/target.dmg

This build does not work in our scenario. Still easily reproducible.

> 2.
> https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> artifacts/public/build/target.dmg

This build does !
Flags: needinfo?(bertrand.perret)
(In reply to Stephen A Pohl [:spohl] from comment #27)
> Could someone who has been able to reproduce this issue test these builds
> and let me know if this is fixed in either one (or both) of them?
> 
> 1.
> https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/
> artifacts/public/build/target.dmg

For your information,
the crashes associated with this build :
https://crash-stats.mozilla.com/report/index/0a8d8051-d341-49ea-8214-713e70180528
https://crash-stats.mozilla.com/report/index/701c4972-5400-462f-b5d5-499830180528 

> 2.
> https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> artifacts/public/build/target.dmg
(In reply to BPER_ILEX from comment #35)
> (In reply to Stephen A Pohl [:spohl] from comment #27)
> 
> Hello,
> > Could someone who has been able to reproduce this issue test these builds
> > and let me know if this is fixed in either one (or both) of them?
> > 
> > [...]
> > 
> > 2.
> > https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> > artifacts/public/build/target.dmg
> 
> This build does !

Thank you for confirming! The patch that's up for review implements this change.
Comment on attachment 8980993 [details] [diff] [review]
Patch

Review of attachment 8980993 [details] [diff] [review]:
-----------------------------------------------------------------

In your analysis, you say that the lock is not being released before we wait, but the isPresentLock and the isPresentCondition are paired. The semantics of PR_WaitCondVar is that the associated lock is "atomically" released before you wait on the condition and reacquired when the condition returns.

I suspect there's a bug in the MAC implementation of PR_WaitCondVar where it's not releasing the lock. (which is why this issue only shows up on the Mac and not other platforms.

bob

::: lib/dev/devslot.c
@@ -166,5 @@
>       * if the token is present and that it needs initialization. */
>      slot->lastTokenPingState = nssSlotLastPingState_Update;
>      slot->isPresentThread = PR_GetCurrentThread();
>  
> -    PZ_Unlock(slot->isPresentLock);

We need to release this lock here. The bug is in the Mac implementation of  PR_WaitCondVar, which is supposed to release the lock associated with the condition variable.
Attachment #8980993 - Flags: review?(rrelyea) → review-
> 1. Thread 4 has entered the section of code that is only supposed to be run by one thread.
> 2. It acquired the lock[3]...
> 3. ...but released the lock a few lines later[4].
> 4. Thread 59 now acquires the lock[3].
> 5. But thread 59 isn't allowed to execute the section of code yet because it isn't the present thread.
> 6. So it waits. Without releasing the lock.[5].

Wait is supposed to free the lock. If it's not, then there's a problem with wait. Once you return from the wait, you regain the lock. See https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PR_WaitCondVar

If this isn't happening, then there's a bug in the NSPR implementation of PR_WaitCondVar for the Mac.

> 7. When thread 4 goes to set isPresentThread to NULL and notify isPresentCondition, it attempts to acquire the lock again.[6]

If PR_WaitCondVar released the lock as expected, then thread 4 can get this lock and proceed normally, causing thread 59 to wake up.

> 8. But since thread 59 has acquired it and is waiting for isPresentCondition to be notified and isPresentThread to be NULL, we deadlock.

Thread 59 will only regain the lock on waking up, quickly determined that it now as the isPresent data, free it's lock and return. 

> 9. Last but not least, the main thread (thread 0) also attempts to acquire the lock and never succeeds, ultimately causing the hang in Firefox.[7]

and now the lock is available for the main thread.


I wonder if the locking primitives used by NSPR for the Mac has changed?

bob
[Tracking Requested - why for this release]:
"Currently at the top of **Mac** crash stats in 60" (see comment 0).


(In reply to Robert Relyea from comment #39)
> I wonder if the locking primitives used by NSPR for the Mac has changed?

This hang started happening after the landing of NSS 3.36 in bug 1432177, which introduced the code in question here (see comment 15 and changesets [1], [2] and [3]). If this is indeed a bug in the NSPR implementation of PR_WaitCondVar for macOS, this bug may have existed all along. Or the changes in NSS 3.36 may have exposed a bug at the OS level, since this appears to affect macOS 10.12 and above.

I'm going to have to bow out of this one, but considering that this is "Currently at the top of **Mac** crash stats in 60" per comment 0, I think it's worth nominating for tracking and we should find someone who can tackle this. At the moment, this is known to affect at least macOS 10.12 and 10.13.

[1] https://hg.mozilla.org/mozilla-central/rev/348b36156a84
[2] https://hg.mozilla.org/mozilla-central/rev/1c8bbd57b772
[3] https://hg.mozilla.org/mozilla-central/rev/7ced88e883e9
Assignee: spohl.mozilla.bugs → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(alvaro.rocha)
The Mac NSPR locking code is shared with all other pthreads-using Unixes, so I think it's unlikely there's a bug in that particular code.

Perhaps the problem is that the PR_WaitCondVar call is using 0 (= PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected PR_INTERVAL_NO_TIMEOUT:

    PZ_Lock(slot->isPresentLock);
    while (slot->isPresentThread) {
        PR_WaitCondVar(slot->isPresentCondition, 0);
    }

?  This is the only PR_WaitCondVar call that passes a bare 0 here; some code in lib/util/nssilock.c also passes PR_INTERVAL_NO_WAIT, which seems suspicious on its own, but a bare constant 0 is suspicious nonetheless.

A timeout of 0 is translated into an absolute time of "now" before calling pthread_cond_timedwait.  And then by the time whatever library code gets around to testing the timeout value, we're now at now+epsilon, and we fail to release the lock, since we've passed "now".  macOS 10.12/10.13 might well have changed something along this codepath, making this particular "failure" happen all the time.

Due to the impossibly small timeout, thread 59, waiting for something to happen, is essentially busy-waiting on the lock, and the main thread and thread 4 can't actually ever acquire the lock to make the progress thread 59 needs.  That seems like a plausible scenario given the crash stacks, yes?

I will try pushing a try build with that change tomorrow if nobody beats me to it, and then perhaps somebody would be willing to take that try build for a test?
(In reply to Nathan Froyd [:froydnj] from comment #41)
> The Mac NSPR locking code is shared with all other pthreads-using Unixes, so
> I think it's unlikely there's a bug in that particular code.
> 
> Perhaps the problem is that the PR_WaitCondVar call is using 0 (=
> PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected
> PR_INTERVAL_NO_TIMEOUT:

This! I just built this locally and it does indeed avoid the hang! Pushed to try to confirm in comment 42. Will take the build for a spin tomorrow morning.

Thank you, Nathan! Would you like to post the patch for this? I'm happy to do it on your behalf, but I would need to know how to go about this: Should this be fixed in the NSS repo first and we then merge a version of NSS that includes the fix into m-c?
(In reply to Stephen A Pohl [:spohl] from comment #43)
> (In reply to Nathan Froyd [:froydnj] from comment #41)
> > The Mac NSPR locking code is shared with all other pthreads-using Unixes, so
> > I think it's unlikely there's a bug in that particular code.
> > 
> > Perhaps the problem is that the PR_WaitCondVar call is using 0 (=
> > PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected
> > PR_INTERVAL_NO_TIMEOUT:
> 
> This! I just built this locally and it does indeed avoid the hang! Pushed to
> try to confirm in comment 42. Will take the build for a spin tomorrow
> morning.

Excellent, thank you for testing!

> Thank you, Nathan! Would you like to post the patch for this? I'm happy to
> do it on your behalf, but I would need to know how to go about this: Should
> this be fixed in the NSS repo first and we then merge a version of NSS that
> includes the fix into m-c?

Yes, I believe we have to fix things in the NSS repo, and then a fixed (release?) version of NSS will be pushed around to all the appropriate places.

I'll post a patch for this tomorrow!
(In reply to Stephen A Pohl [:spohl] from comment #42)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=fd67c1c76446cd4677de810f81c127e11fdf50e7

Just tested and the fix also works as expected in the try build:
https://queue.taskcluster.net/v1/task/MBVBzAUvRX-1ka4WiIvcOQ/runs/0/artifacts/public/build/target.dmg
Here's the patch, as promised.
Assignee: nobody → nfroyd
Attachment #8980993 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8981818 - Flags: review?(rrelyea)
[Tracking Requested - why for this release]: hangs on recent Mac versions when using 2FA devices.  Seems like this is a problem that should be fixed, given that we touted this capability for Firefox 60...
Comment on attachment 8981818 [details] [diff] [review]
wait indefinitely in nssSlot_IsTokenPresent

Review of attachment 8981818 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm. I don't see any reason to use PR_INTERVAL_NO_WAIT here.
Attachment #8981818 - Flags: review+
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #48)
> lgtm. I don't see any reason to use PR_INTERVAL_NO_WAIT here.

Thank you!

Apparently I do not have push permissions to NSS, would you mind amending the reviewer in the commit message and pushing the patch for me?
Flags: needinfo?(franziskuskiefer)
Will do. I'm waiting on try https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=deeb10851254ba949c106f4c0b4e8e21b682899e
I understand we want this in Fx61. So we'll have to create a 3.37.2 release.
I agree with the fix using PR_INTERVAL_NO_TIMEOUT. PR_INTERVAL_NO_WAIT amounts to busy waiting, which is not what we want on any platform.
Comment on attachment 8981818 [details] [diff] [review]
wait indefinitely in nssSlot_IsTokenPresent

Review of attachment 8981818 [details] [diff] [review]:
-----------------------------------------------------------------

r+ rrelyea. This is the correct fix.

bob
Attachment #8981818 - Flags: review?(rrelyea) → review+
Small change to the commit message to note fkiefer's review and to make explicit the possibility that we could have seen this crash on earlier versions of macOS.
Attachment #8981818 - Attachment is obsolete: true
Attachment #8981956 - Flags: review+
trunk: https://hg.mozilla.org/projects/nss/rev/8232a58332dda55059b02430b93d09aabf93d840
3.37 branch: https://hg.mozilla.org/projects/nss/rev/97ba6eed6d8beb366e01eba5970b452282910e6a
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(franziskuskiefer)
Resolution: --- → FIXED
Target Milestone: --- → 3.38
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #50)
> Will do. I'm waiting on try
> https://treeherder.mozilla.org/#/jobs?repo=nss-
> try&revision=deeb10851254ba949c106f4c0b4e8e21b682899e
> I understand we want this in Fx61. So we'll have to create a 3.37.2 release.

Cannot this fix merged to the Fx 60 branch instead ?

I ask this because many of our customers report this problem. And our first line support team is very buzy.

Thanks.
franziskus, nfroyd, is there then follow up work to land on mozilla-central and request uplift to beta?
Flags: needinfo?(nfroyd)
Flags: needinfo?(franziskuskiefer)
See Also: → 1466028
See Also: → 1466038
See bug 1466028 and bug 1466038 for point releases and bug 1460617 for the next regular NSS release.
Flags: needinfo?(nfroyd)
Flags: needinfo?(franziskuskiefer)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #56)
> franziskus, nfroyd, is there then follow up work to land on mozilla-central
> and request uplift to beta?

Franziskus already covered this.  But this bug is probably worthy of a release note ("hey, we fixed the feature that we advertised in the last release!" ;)...what bug should we attach the request to?  This bug, since it actually fixed the problem, or the NSS upgrade bugs, since those will actually land the fix in the appropriate branch?
Flags: needinfo?(lhenry)
Duplicate of this bug: 1465215
Attaching the note to this bug is fine. I'll add it now as "Updated to NSS 3.37.2"

and will link it to the actual notes at https://developer.mozilla.org/docs/Mozilla/Projects/NSS/
Flags: needinfo?(lhenry)
Or, I guess that note should say, NSS 3.36?
Flags: needinfo?(franziskuskiefer)
This will go into 3 releases: NSS 3.38 Firefox 62, NSS 3.37.2 for Firefox 61, NSS 3.36.3 for Firefox 60 ESR and potentially 60.0.2.
There are no release notes yet but they'll be at https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/NSS_3.36.3_release_notes etc.
Flags: needinfo?(franziskuskiefer)
Hello,

after some tests performed with our security module enabled:

- it is fixed on current Nightly (2018-06-04)
- but it's still present in the current Beta (61 b10)
> status-firefox61: affected → fixed: this not quite true as for FF 61 b11

This was marked fixed an hour before your comment. That doesn't mean it's fixed in a release NOW. The next beta build should be done on Thursday and will include the fix.
Duplicate of this bug: 1464979
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] [@ shutdownhang | libsystem_kernel.dylib@0x19bf2] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] [@ shutdownhang | libsystem_kernel.dylib@0x19bf2] [@ shutdownhang | libsystem_kernel.dylib@0x16db6 ]
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] [@ shutdownhang | libsystem_kernel.dylib@0x19bf2] [@ shutdownhang | libsystem_kernel.dylib@0x16db6 ] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] [@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
You need to log in before you can comment on or make changes to this bug.