Closed
Bug 1461731
Opened 7 years ago
Closed 7 years ago
Crash in shutdownhang | libsystem_kernel.dylib@0x1ca1e
Categories
(NSS :: Libraries, defect, P1)
Tracking
(relnote-firefox 60+, firefox-esr52 unaffected, firefox-esr6060+ fixed, firefox60+ fixed, firefox61+ fixed, firefox62+ fixed)
RESOLVED
FIXED
3.38
People
(Reporter: marcia, Assigned: froydnj)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(6 files, 3 obsolete files)
This bug was filed from the Socorro interface and is
report bp-690da5e1-4448-44e9-8900-ef3db0180515.
=============================================================
Currently at the top of **Mac** crash stats in 60, although present in other versions as well: https://bit.ly/2KqQba0. All users on 60 are running 10.13.
Comments:
*Firefox gets stuck (and sometimes crashes) since installation of 60.0
*60 is very buggy on my iMac. Freezes constantly. I have to use Chrome
*no more network access (from firefox only). Safari is still ok.
MOZ_CRASH(Shutdown hanging before starting.) - Seems like a bad user experience.
Top 10 frames of crashing thread:
0 libsystem_kernel.dylib libsystem_kernel.dylib@0x1ca1e
1 libsystem_pthread.dylib _pthread_cond_wait
2 libsystem_pthread.dylib pthread_cond_signal_thread_np
3 libmozglue.dylib <name omitted> mozglue/misc/ConditionVariable_posix.cpp:118
4 XUL mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/CondVar.h:68
5 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:967
6 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:517
7 XUL mozilla::net::nsHttpConnectionMgr::Shutdown xpcom/threads/nsThreadUtils.h:323
8 XUL mozilla::net::nsHttpHandler::Observe netwerk/protocol/http/nsHttpHandler.cpp:2754
9 XUL non-virtual thunk to mozilla::net::nsHttpHandler::Observe netwerk/protocol/http/nsHttpHandler.cpp
=============================================================
Reporter | ||
Comment 1•7 years ago
|
||
All users on 60 are running 10.13.4 17E202. I don't see any obvious correlations.
![]() |
||
Comment 2•7 years ago
|
||
Hey Stephen, can you take a look, I think we're shutting down a thread that hasn't started yet.
Flags: needinfo?(spohl.mozilla.bugs)
Reporter | ||
Comment 3•7 years ago
|
||
Adding another signature which is 10.12 specific. All of those crashes save one are MOZ_CRASH(Shutdown hanging before starting.
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e]
[@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
Hello, we are bumping into this problem too (hang of FF after opening several tabs).
We have sent several crash reports today like this one:
https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-258ed1180522
After further research, we noticed that the problem seems linked to the presence of a security module (pkcs#11)
configured in Firefox.
Apparently there is the same issue with the Belgium eID security module (as we found in the crash statistics database):
https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-5c3ab1180522
Some threads are blocked in libnss3.dylib calling nssSlot_IsTokenPresent() function
Comment 5•7 years ago
|
||
(In reply to BPER_ILEX from comment #4)
> Hello, we are bumping into this problem too (hang of FF after opening
> several tabs).
>
> We have sent several crash reports today like this one:
>
> https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-
> 258ed1180522
Thank you for commenting! Could you clarify step by step how you get into this state? Does Firefox hang right away? Or do you have to visit a particular site before the hang occurs? Is the crash report generated automatically after some time? Or do you try to exit Firefox and it is then that Firefox crashes?
> After further research, we noticed that the problem seems linked to the
> presence of a security module (pkcs#11)
> configured in Firefox.
Is there a way for us to reproduce this, or does this require some kind of smart-card reader or other device?
> Apparently there is the same issue with the Belgium eID security module (as
> we found in the crash statistics database):
>
> https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-
> 5c3ab1180522
Looking into this.
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(bertrand.perret)
Same issue with Gemalto security module:
https://crash-stats.mozilla.com/report/index/c2458f7f-5bb4-485b-84f0-414ef0180522
So the this problem seems general to all security modules handled by Firefox and especially the NSS layer.
(In reply to Stephen A Pohl [:spohl] from comment #5)
> (In reply to BPER_ILEX from comment #4)
> > Hello, we are bumping into this problem too (hang of FF after opening
> > several tabs).
> >
> > We have sent several crash reports today like this one:
> >
> > https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-
> > 258ed1180522
>
> Thank you for commenting! Could you clarify step by step how you get into
> this state? Does Firefox hang right away? Or do you have to visit a
> particular site before the hang occurs? Is the crash report generated
> automatically after some time? Or do you try to exit Firefox and it is then
> that Firefox crashes?
The steps to reproduce:
0°) Lanch FF (current release)
1°) add a security module in FF
2°) Close FF
3°) Connect a PC/SC smartcard reader to the Mac
4°) Launch FF
5°) Go to Google News
6°) Open (say 5 items) in a separate tab
At one moment, a tab doesn't display its content
If we contine, further tabs won't work
If we try to close FF, it remains in memory
So we force it to close => then the Apple report dialog appears.
Click 'signal to Apple'
>
> > After further research, we noticed that the problem seems linked to the
> > presence of a security module (pkcs#11)
> > configured in Firefox.
>
> Is there a way for us to reproduce this, or does this require some kind of
> smart-card reader or other device?
You will need at least a reader and a security module downloadable here: http://testssl.asipsante.fr
>
> > Apparently there is the same issue with the Belgium eID security module (as
> > we found in the crash statistics database):
> >
> > https://crash-stats.mozilla.com/report/index/374b0aad-7693-43ad-9579-
> > 5c3ab1180522
>
> Looking into this.
Flags: needinfo?(bertrand.perret)
To get the Firefox crash report dialog when FF hangs
We type in a Terminal: kill -s SIGABRT <PID>
Sorry, there is an erratum in comment 7:
The installer for the security module: https://diagcps.eservices.esante.gouv.fr/
Use the "Diagnostiquer mon poste" button to acces the PKG download link
Web extension to configure the security module downloadable here: http://testssl.asipsante.fr
Comment 10•7 years ago
|
||
Another crash with FF 61b7 on macOS Sierra.
https://crash-stats.mozilla.com/report/index/6d11f44e-6e53-4b70-a723-461d00180523
With following user comments:
Security module configured.
Opening tabs from Google News on HTTPS url.
At one moment, the new tab does not show its content.
Further tabs do not either.
Closing Firefox then it hangs => then wait for it to crash
Comment 11•7 years ago
|
||
Thank you! I have ordered a Yubikey and will try to reproduce at the end of the week.
![]() |
||
Updated•7 years ago
|
Priority: -- → P1
Updated•7 years ago
|
status-firefox-esr60:
--- → affected
Comment 12•7 years ago
|
||
Are you able to reproduce this reliably? Would you be able to run a few builds and let us know in which ones the issue is reproducible?
1. https://archive.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-22-00-43-mozilla-central/firefox-60.0a1.en-US.mac.dmg
2. https://archive.mozilla.org/pub/firefox/nightly/2018/03/2018-03-06-10-01-23-mozilla-central/firefox-60.0a1.en-US.mac.dmg
3. https://archive.mozilla.org/pub/firefox/nightly/2018/03/2018-03-16-10-01-32-mozilla-central/firefox-61.0a1.en-US.mac.dmg
4. https://archive.mozilla.org/pub/firefox/nightly/2018/05/2018-05-04-22-01-15-mozilla-central/firefox-61.0a1.en-US.mac.dmg
5. https://archive.mozilla.org/pub/firefox/nightly/2018/05/2018-05-15-10-00-38-mozilla-central/firefox-62.0a1.en-US.mac.dmg
Thank you!
Flags: needinfo?(bertrand.perret)
Comment 13•7 years ago
|
||
We easily reproduce the issue with all the builds supplied in your comment.
When the pb occurs we get a status message 'Performing TLS handshake on ...'
In attachment Screenshot revealing the problem of tab not showing its content.
Flags: needinfo?(bertrand.perret)
Comment 14•7 years ago
|
||
Hi,
We are not able to reproduce the issue with this build:
https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-mozilla-central/firefox-60.0a1.en-US.mac.dmg
but occurs in the next build :
https://archive.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-22-00-43-mozilla-central/firefox-60.0a1.en-US.mac.dmg
Comment 15•7 years ago
|
||
(In reply to Alvaro from comment #14)
> Hi,
>
> We are not able to reproduce the issue with this build:
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg
>
> but occurs in the next build :
> https://archive.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-22-00-43-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg
Thank you! This was sort of what I expected, but I was hoping it would have started to work again with one of the later builds.
The first build was a build prior to landing an update to NSS to upgrade to version 3.36. The second build was the first build with version 3.36[1], which was known to cause issues with PKCS#11. The third build was supposed to have a fix for this (bug 1447628), but this does not appear to have resolved the issue, at least on macOS. All subsequent builds were after any further updates to NSS, which also do not appear to have resolved this. Will investigate.
[1] https://hg.mozilla.org/mozilla-central/rev/7ced88e883e9
Comment 16•7 years ago
|
||
(In reply to BPER_ILEX from comment #13)
> Created attachment 8980197 [details]
> Pb_FF60_61_62.png
>
> We easily reproduce the issue with all the builds supplied in your comment.
>
> When the pb occurs we get a status message 'Performing TLS handshake on ...'
>
> In attachment Screenshot revealing the problem of tab not showing its
> content.
Could you double-check that this is also the case with the following build? PKCS#11 was expected to still work in this build:
https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-mozilla-central/firefox-60.0a1.en-US.mac.dmg
Flags: needinfo?(bertrand.perret)
Comment 17•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #16)
> Could you double-check that this is also the case with the following build?
> PKCS#11 was expected to still work in this build:
>
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg
Oh, I missed the fact that you two ("BPER_ILE" and "Alvaro") may be from the same group (domain name of your email addresses appeared to be different at first sight). If this is the case, this was all the confirmation I needed. Thank you!
Comment 18•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #16)
> (In reply to BPER_ILEX from comment #13)
> > Created attachment 8980197 [details]
> > Pb_FF60_61_62.png
> >
> > We easily reproduce the issue with all the builds supplied in your comment.
> >
> > When the pb occurs we get a status message 'Performing TLS handshake on ...'
> >
> > In attachment Screenshot revealing the problem of tab not showing its
> > content.
>
> Could you double-check that this is also the case with the following build?
> PKCS#11 was expected to still work in this build:
>
> https://ftp.mozilla.org/pub/firefox/nightly/2018/01/2018-01-23-10-20-17-
> mozilla-central/firefox-60.0a1.en-US.mac.dmg
Hi,
I have performed another check for this particular build.
And I confirm that it does work with our security module configured.
Flags: needinfo?(bertrand.perret)
Comment 19•7 years ago
|
||
about:support infos for the last 60 Nightly build that works.
Comment 20•7 years ago
|
||
NSS libraries versions shipped with the last 60 Nightly build that works with security module configured.
Comment 21•7 years ago
|
||
about:support infos for the last 60 Nightly build that no longer works.
Comment 22•7 years ago
|
||
NSS libraries versions shipped with the last 60 Nightly build that no longer works with security module configured.
Comment 23•7 years ago
|
||
(In reply to BPER_ILEX from comment #21)
> Created attachment 8980511 [details]
> Build_KO_Infos.png
>
> about:support infos for the last 60 Nightly build that no longer works.
Erratum:
about:support infos for the first 60 Nightly build that no longer works.
Comment 24•7 years ago
|
||
(In reply to BPER_ILEX from comment #22)
> Created attachment 8980512 [details]
> NSS_Libs_KO.png
>
> NSS libraries versions shipped with the last 60 Nightly build that no longer
> works with security module configured.
Erratum:
NSS libraries versions shipped with the first 60 Nightly build that no longer
> works with security module configured.
Comment 25•7 years ago
|
||
Comment 26•7 years ago
|
||
Comment 27•7 years ago
|
||
Could someone who has been able to reproduce this issue test these builds and let me know if this is fixed in either one (or both) of them?
1. https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/artifacts/public/build/target.dmg
2. https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/artifacts/public/build/target.dmg
Flags: needinfo?(bertrand.perret)
Flags: needinfo?(alvaro.rocha)
Comment 28•7 years ago
|
||
I just received my Yubikey and I believe that both builds in comment 27 have this problem resolved. It would be great to get confirmation of my finding. I will post a patch and more detail over the weekend.
Comment 29•7 years ago
|
||
Here is what seems to be happening:
In nssSlot_IsTokenPresent, we want to make sure that only one thread executes some code at a time[1]. However, we can currently run into deadlocks when trying to do so. Looking at this[2] crash report for example, here is what's happening (make sure to display all threads by clicking the link at the bottom of the page):
1. Thread 4 has entered the section of code that is only supposed to be run by one thread.
2. It acquired the lock[3]...
3. ...but released the lock a few lines later[4].
4. Thread 59 now acquires the lock[3].
5. But thread 59 isn't allowed to execute the section of code yet because it isn't the present thread.
6. So it waits. Without releasing the lock.[5].
7. When thread 4 goes to set isPresentThread to NULL and notify isPresentCondition, it attempts to acquire the lock again.[6]
8. But since thread 59 has acquired it and is waiting for isPresentCondition to be notified and isPresentThread to be NULL, we deadlock.
9. Last but not least, the main thread (thread 0) also attempts to acquire the lock and never succeeds, ultimately causing the hang in Firefox.[7]
This patch fixes the deadlock by releasing the lock before waiting for isPresentCondition to be notified, and by reacquiring it once this happens and before proceeding to make sure that no other race with another thread has occurred. Furthermore, we no longer release the lock before a thread has been able to notify isPresentCondition and set isPresentThread to NULL.
[1] https://dxr.mozilla.org/nss/rev/2f3ace03cbf89e9300c2dd09da2317a82e382791/nss/lib/dev/devslot.c#153-157
[2] https://crash-stats.mozilla.com/report/index/a62c1c51-c218-4bdb-b8ad-258ed1180522#allthreads
[3] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l154
[4] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l170
[5] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l156
[6] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l259
[7] https://hg.mozilla.org/releases/mozilla-esr60/annotate/31d4880fde318a25abb7dc5b99a0aaafc63af3bb/security/nss/lib/dev/devslot.c#l138
Assignee: nobody → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8980922 -
Flags: review?(rrelyea)
Updated•7 years ago
|
Component: XPCOM → Libraries
Product: Core → NSS
Version: Trunk → other
Comment 30•7 years ago
|
||
I'm curious what the best way is to get this fixed in mozilla-central as soon as possible. Should we get this fixed upstream and then update mozilla-central to the latest version of NSS after the patch has landed (similar to bug 1432177)? Or should we apply the patch to mozilla-central individually?
Updated•7 years ago
|
![]() |
Assignee | |
Comment 31•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #29)
> Created attachment 8980922 [details] [diff] [review]
> Patch
>
> Here is what seems to be happening:
>
> In nssSlot_IsTokenPresent, we want to make sure that only one thread
> executes some code at a time[1]. However, we can currently run into
> deadlocks when trying to do so. Looking at this[2] crash report for example,
> here is what's happening (make sure to display all threads by clicking the
> link at the bottom of the page):
>
> 1. Thread 4 has entered the section of code that is only supposed to be run
> by one thread.
> 2. It acquired the lock[3]...
> 3. ...but released the lock a few lines later[4].
> 4. Thread 59 now acquires the lock[3].
> 5. But thread 59 isn't allowed to execute the section of code yet because it
> isn't the present thread.
> 6. So it waits. Without releasing the lock.[5].
Waiting on a condition variable implicitly releases the lock associated with the condition variable:
https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prcvar.h#38-42
But perhaps the threads are racing, and thread 4 tries to acquire the lock, and sleeps, before thread 59 releases the lock to wait on the condition variable?
In that case, I believe the correct fix is to remove the intermediate PZ_Unlock/PZ_Lock pair, as you have done, but to not release and re-acquire the lock around the condition variable wait.
Comment 32•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #31)
> (In reply to Stephen A Pohl [:spohl] from comment #29)
> > Created attachment 8980922 [details] [diff] [review]
> > Patch
> >
> > Here is what seems to be happening:
> >
> > In nssSlot_IsTokenPresent, we want to make sure that only one thread
> > executes some code at a time[1]. However, we can currently run into
> > deadlocks when trying to do so. Looking at this[2] crash report for example,
> > here is what's happening (make sure to display all threads by clicking the
> > link at the bottom of the page):
> >
> > 1. Thread 4 has entered the section of code that is only supposed to be run
> > by one thread.
> > 2. It acquired the lock[3]...
> > 3. ...but released the lock a few lines later[4].
> > 4. Thread 59 now acquires the lock[3].
> > 5. But thread 59 isn't allowed to execute the section of code yet because it
> > isn't the present thread.
> > 6. So it waits. Without releasing the lock.[5].
>
> Waiting on a condition variable implicitly releases the lock associated with
> the condition variable:
>
> https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prcvar.h#38-
> 42
>
> But perhaps the threads are racing, and thread 4 tries to acquire the lock,
> and sleeps, before thread 59 releases the lock to wait on the condition
> variable?
Interesting. If this is indeed the case, I can't quite explain why https://hg.mozilla.org/try/rev/92f182ace45032d713b0952656107e140fe6481a appeared to fix the deadlock for me. Maybe just changed the timing of things enough for the race to no longer occur? Try build with this change can be downloaded here:
https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/artifacts/public/build/target.dmg
> In that case, I believe the correct fix is to remove the intermediate
> PZ_Unlock/PZ_Lock pair, as you have done, but to not release and re-acquire
> the lock around the condition variable wait.
This would be the equivalent of my try run with this change:
https://hg.mozilla.org/try/rev/b65d256480fdf75cc98443eda64077e29667e46f
...resulting in try build:
https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/artifacts/public/build/target.dmg
...and this fixed the issue for me as well. I ended up combining the two fixes, for the patch that I've uploaded, but it sounds like the proper fix would be to only remove the intermediate PZ_Unlock/PZ_Lock pair. Will upload a new patch shortly. Thank you for your input, :froydnj!
Comment 33•7 years ago
|
||
Updated patch with feedback addressed.
Attachment #8980922 -
Attachment is obsolete: true
Attachment #8980922 -
Flags: review?(rrelyea)
Attachment #8980993 -
Flags: review?(rrelyea)
Comment hidden (Intermittent Failures Robot) |
Comment 35•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #27)
Hello,
> Could someone who has been able to reproduce this issue test these builds
> and let me know if this is fixed in either one (or both) of them?
>
> 1.
> https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/
> artifacts/public/build/target.dmg
This build does not work in our scenario. Still easily reproducible.
> 2.
> https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> artifacts/public/build/target.dmg
This build does !
Flags: needinfo?(bertrand.perret)
Comment 36•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #27)
> Could someone who has been able to reproduce this issue test these builds
> and let me know if this is fixed in either one (or both) of them?
>
> 1.
> https://queue.taskcluster.net/v1/task/Vzm1O27CRquxdw9e6HKuAg/runs/0/
> artifacts/public/build/target.dmg
For your information,
the crashes associated with this build :
https://crash-stats.mozilla.com/report/index/0a8d8051-d341-49ea-8214-713e70180528
https://crash-stats.mozilla.com/report/index/701c4972-5400-462f-b5d5-499830180528
> 2.
> https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> artifacts/public/build/target.dmg
Comment 37•7 years ago
|
||
(In reply to BPER_ILEX from comment #35)
> (In reply to Stephen A Pohl [:spohl] from comment #27)
>
> Hello,
> > Could someone who has been able to reproduce this issue test these builds
> > and let me know if this is fixed in either one (or both) of them?
> >
> > [...]
> >
> > 2.
> > https://queue.taskcluster.net/v1/task/FqHaPM1yTe6HdXLIcwx57w/runs/0/
> > artifacts/public/build/target.dmg
>
> This build does !
Thank you for confirming! The patch that's up for review implements this change.
Comment 38•7 years ago
|
||
Comment on attachment 8980993 [details] [diff] [review]
Patch
Review of attachment 8980993 [details] [diff] [review]:
-----------------------------------------------------------------
In your analysis, you say that the lock is not being released before we wait, but the isPresentLock and the isPresentCondition are paired. The semantics of PR_WaitCondVar is that the associated lock is "atomically" released before you wait on the condition and reacquired when the condition returns.
I suspect there's a bug in the MAC implementation of PR_WaitCondVar where it's not releasing the lock. (which is why this issue only shows up on the Mac and not other platforms.
bob
::: lib/dev/devslot.c
@@ -166,5 @@
> * if the token is present and that it needs initialization. */
> slot->lastTokenPingState = nssSlotLastPingState_Update;
> slot->isPresentThread = PR_GetCurrentThread();
>
> - PZ_Unlock(slot->isPresentLock);
We need to release this lock here. The bug is in the Mac implementation of PR_WaitCondVar, which is supposed to release the lock associated with the condition variable.
Attachment #8980993 -
Flags: review?(rrelyea) → review-
Comment 39•7 years ago
|
||
> 1. Thread 4 has entered the section of code that is only supposed to be run by one thread.
> 2. It acquired the lock[3]...
> 3. ...but released the lock a few lines later[4].
> 4. Thread 59 now acquires the lock[3].
> 5. But thread 59 isn't allowed to execute the section of code yet because it isn't the present thread.
> 6. So it waits. Without releasing the lock.[5].
Wait is supposed to free the lock. If it's not, then there's a problem with wait. Once you return from the wait, you regain the lock. See https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PR_WaitCondVar
If this isn't happening, then there's a bug in the NSPR implementation of PR_WaitCondVar for the Mac.
> 7. When thread 4 goes to set isPresentThread to NULL and notify isPresentCondition, it attempts to acquire the lock again.[6]
If PR_WaitCondVar released the lock as expected, then thread 4 can get this lock and proceed normally, causing thread 59 to wake up.
> 8. But since thread 59 has acquired it and is waiting for isPresentCondition to be notified and isPresentThread to be NULL, we deadlock.
Thread 59 will only regain the lock on waking up, quickly determined that it now as the isPresent data, free it's lock and return.
> 9. Last but not least, the main thread (thread 0) also attempts to acquire the lock and never succeeds, ultimately causing the hang in Firefox.[7]
and now the lock is available for the main thread.
I wonder if the locking primitives used by NSPR for the Mac has changed?
bob
Comment 40•7 years ago
|
||
[Tracking Requested - why for this release]:
"Currently at the top of **Mac** crash stats in 60" (see comment 0).
(In reply to Robert Relyea from comment #39)
> I wonder if the locking primitives used by NSPR for the Mac has changed?
This hang started happening after the landing of NSS 3.36 in bug 1432177, which introduced the code in question here (see comment 15 and changesets [1], [2] and [3]). If this is indeed a bug in the NSPR implementation of PR_WaitCondVar for macOS, this bug may have existed all along. Or the changes in NSS 3.36 may have exposed a bug at the OS level, since this appears to affect macOS 10.12 and above.
I'm going to have to bow out of this one, but considering that this is "Currently at the top of **Mac** crash stats in 60" per comment 0, I think it's worth nominating for tracking and we should find someone who can tackle this. At the moment, this is known to affect at least macOS 10.12 and 10.13.
[1] https://hg.mozilla.org/mozilla-central/rev/348b36156a84
[2] https://hg.mozilla.org/mozilla-central/rev/1c8bbd57b772
[3] https://hg.mozilla.org/mozilla-central/rev/7ced88e883e9
Assignee: spohl.mozilla.bugs → nobody
Status: ASSIGNED → NEW
tracking-firefox62:
--- → ?
Flags: needinfo?(alvaro.rocha)
![]() |
Assignee | |
Comment 41•7 years ago
|
||
The Mac NSPR locking code is shared with all other pthreads-using Unixes, so I think it's unlikely there's a bug in that particular code.
Perhaps the problem is that the PR_WaitCondVar call is using 0 (= PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected PR_INTERVAL_NO_TIMEOUT:
PZ_Lock(slot->isPresentLock);
while (slot->isPresentThread) {
PR_WaitCondVar(slot->isPresentCondition, 0);
}
? This is the only PR_WaitCondVar call that passes a bare 0 here; some code in lib/util/nssilock.c also passes PR_INTERVAL_NO_WAIT, which seems suspicious on its own, but a bare constant 0 is suspicious nonetheless.
A timeout of 0 is translated into an absolute time of "now" before calling pthread_cond_timedwait. And then by the time whatever library code gets around to testing the timeout value, we're now at now+epsilon, and we fail to release the lock, since we've passed "now". macOS 10.12/10.13 might well have changed something along this codepath, making this particular "failure" happen all the time.
Due to the impossibly small timeout, thread 59, waiting for something to happen, is essentially busy-waiting on the lock, and the main thread and thread 4 can't actually ever acquire the lock to make the progress thread 59 needs. That seems like a plausible scenario given the crash stacks, yes?
I will try pushing a try build with that change tomorrow if nobody beats me to it, and then perhaps somebody would be willing to take that try build for a test?
Comment 42•7 years ago
|
||
Comment 43•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #41)
> The Mac NSPR locking code is shared with all other pthreads-using Unixes, so
> I think it's unlikely there's a bug in that particular code.
>
> Perhaps the problem is that the PR_WaitCondVar call is using 0 (=
> PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected
> PR_INTERVAL_NO_TIMEOUT:
This! I just built this locally and it does indeed avoid the hang! Pushed to try to confirm in comment 42. Will take the build for a spin tomorrow morning.
Thank you, Nathan! Would you like to post the patch for this? I'm happy to do it on your behalf, but I would need to know how to go about this: Should this be fixed in the NSS repo first and we then merge a version of NSS that includes the fix into m-c?
![]() |
Assignee | |
Comment 44•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #43)
> (In reply to Nathan Froyd [:froydnj] from comment #41)
> > The Mac NSPR locking code is shared with all other pthreads-using Unixes, so
> > I think it's unlikely there's a bug in that particular code.
> >
> > Perhaps the problem is that the PR_WaitCondVar call is using 0 (=
> > PR_INTERVAL_NO_WAIT) for the timeout value, rather than the expected
> > PR_INTERVAL_NO_TIMEOUT:
>
> This! I just built this locally and it does indeed avoid the hang! Pushed to
> try to confirm in comment 42. Will take the build for a spin tomorrow
> morning.
Excellent, thank you for testing!
> Thank you, Nathan! Would you like to post the patch for this? I'm happy to
> do it on your behalf, but I would need to know how to go about this: Should
> this be fixed in the NSS repo first and we then merge a version of NSS that
> includes the fix into m-c?
Yes, I believe we have to fix things in the NSS repo, and then a fixed (release?) version of NSS will be pushed around to all the appropriate places.
I'll post a patch for this tomorrow!
Comment 45•7 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #42)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=fd67c1c76446cd4677de810f81c127e11fdf50e7
Just tested and the fix also works as expected in the try build:
https://queue.taskcluster.net/v1/task/MBVBzAUvRX-1ka4WiIvcOQ/runs/0/artifacts/public/build/target.dmg
![]() |
Assignee | |
Comment 46•7 years ago
|
||
Here's the patch, as promised.
Assignee: nobody → nfroyd
Attachment #8980993 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8981818 -
Flags: review?(rrelyea)
![]() |
Assignee | |
Comment 47•7 years ago
|
||
[Tracking Requested - why for this release]: hangs on recent Mac versions when using 2FA devices. Seems like this is a problem that should be fixed, given that we touted this capability for Firefox 60...
status-firefox-esr52:
--- → unaffected
tracking-firefox61:
--- → ?
tracking-firefox-esr60:
--- → ?
relnote-firefox:
--- → ?
Updated•7 years ago
|
Comment 48•7 years ago
|
||
Comment on attachment 8981818 [details] [diff] [review]
wait indefinitely in nssSlot_IsTokenPresent
Review of attachment 8981818 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm. I don't see any reason to use PR_INTERVAL_NO_WAIT here.
Attachment #8981818 -
Flags: review+
![]() |
Assignee | |
Comment 49•7 years ago
|
||
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #48)
> lgtm. I don't see any reason to use PR_INTERVAL_NO_WAIT here.
Thank you!
Apparently I do not have push permissions to NSS, would you mind amending the reviewer in the commit message and pushing the patch for me?
Flags: needinfo?(franziskuskiefer)
Comment 50•7 years ago
|
||
Will do. I'm waiting on try https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=deeb10851254ba949c106f4c0b4e8e21b682899e
I understand we want this in Fx61. So we'll have to create a 3.37.2 release.
Comment 51•7 years ago
|
||
I agree with the fix using PR_INTERVAL_NO_TIMEOUT. PR_INTERVAL_NO_WAIT amounts to busy waiting, which is not what we want on any platform.
Comment 52•7 years ago
|
||
Comment on attachment 8981818 [details] [diff] [review]
wait indefinitely in nssSlot_IsTokenPresent
Review of attachment 8981818 [details] [diff] [review]:
-----------------------------------------------------------------
r+ rrelyea. This is the correct fix.
bob
Attachment #8981818 -
Flags: review?(rrelyea) → review+
![]() |
Assignee | |
Comment 53•7 years ago
|
||
Small change to the commit message to note fkiefer's review and to make explicit the possibility that we could have seen this crash on earlier versions of macOS.
Attachment #8981818 -
Attachment is obsolete: true
Attachment #8981956 -
Flags: review+
Comment 54•7 years ago
|
||
trunk: https://hg.mozilla.org/projects/nss/rev/8232a58332dda55059b02430b93d09aabf93d840
3.37 branch: https://hg.mozilla.org/projects/nss/rev/97ba6eed6d8beb366e01eba5970b452282910e6a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(franziskuskiefer)
Resolution: --- → FIXED
Target Milestone: --- → 3.38
Comment 55•7 years ago
|
||
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #50)
> Will do. I'm waiting on try
> https://treeherder.mozilla.org/#/jobs?repo=nss-
> try&revision=deeb10851254ba949c106f4c0b4e8e21b682899e
> I understand we want this in Fx61. So we'll have to create a 3.37.2 release.
Cannot this fix merged to the Fx 60 branch instead ?
I ask this because many of our customers report this problem. And our first line support team is very buzy.
Thanks.
Comment 56•7 years ago
|
||
franziskus, nfroyd, is there then follow up work to land on mozilla-central and request uplift to beta?
Flags: needinfo?(nfroyd)
Flags: needinfo?(franziskuskiefer)
Comment 57•7 years ago
|
||
See bug 1466028 and bug 1466038 for point releases and bug 1460617 for the next regular NSS release.
Flags: needinfo?(nfroyd)
Flags: needinfo?(franziskuskiefer)
![]() |
Assignee | |
Comment 58•7 years ago
|
||
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #56)
> franziskus, nfroyd, is there then follow up work to land on mozilla-central
> and request uplift to beta?
Franziskus already covered this. But this bug is probably worthy of a release note ("hey, we fixed the feature that we advertised in the last release!" ;)...what bug should we attach the request to? This bug, since it actually fixed the problem, or the NSS upgrade bugs, since those will actually land the fix in the appropriate branch?
Flags: needinfo?(lhenry)
Comment 59•7 years ago
|
||
Comment 61•7 years ago
|
||
Attaching the note to this bug is fine. I'll add it now as "Updated to NSS 3.37.2"
and will link it to the actual notes at https://developer.mozilla.org/docs/Mozilla/Projects/NSS/
Flags: needinfo?(lhenry)
Comment 62•7 years ago
|
||
Or, I guess that note should say, NSS 3.36?
Flags: needinfo?(franziskuskiefer)
Comment 63•7 years ago
|
||
This will go into 3 releases: NSS 3.38 Firefox 62, NSS 3.37.2 for Firefox 61, NSS 3.36.3 for Firefox 60 ESR and potentially 60.0.2.
There are no release notes yet but they'll be at https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/NSS_3.36.3_release_notes etc.
Flags: needinfo?(franziskuskiefer)
Comment 64•7 years ago
|
||
Hello,
after some tests performed with our security module enabled:
- it is fixed on current Nightly (2018-06-04)
- but it's still present in the current Beta (61 b10)
Comment 65•7 years ago
|
||
fixed in nightly by https://hg.mozilla.org/mozilla-central/rev/4f9eec636127
Updated•7 years ago
|
Comment 66•7 years ago
|
||
status-firefox61: affected → fixed: this not quite true as for FF 61 b11
I can reproduce the problem in a rate of 4/5 launches
Here are some crashes I have sent:
https://crash-stats.mozilla.com/report/index/b5812d72-4df1-488d-8bf7-a3f650180606
https://crash-stats.mozilla.com/report/index/d0ebb48f-1114-444d-8a7f-db5990180606
https://crash-stats.mozilla.com/report/index/ad2ec1f6-c2dc-4aa8-a6d1-296b50180606
https://crash-stats.mozilla.com/report/index/4032b99c-eeab-497f-a12f-081980180606
https://crash-stats.mozilla.com/report/index/e2d331f2-aa9b-46cc-9bf3-d2a690180606
The deadlock problem seems have moved elsewhere...
Comment 67•7 years ago
|
||
> status-firefox61: affected → fixed: this not quite true as for FF 61 b11
This was marked fixed an hour before your comment. That doesn't mean it's fixed in a release NOW. The next beta build should be done on Thursday and will include the fix.
Updated•7 years ago
|
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e]
[@ shutdownhang | libsystem_kernel.dylib@0x19bf2] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e]
[@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
[@ shutdownhang | libsystem_kernel.dylib@0x16db6 ]
Updated•7 years ago
|
Crash Signature: [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e]
[@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
[@ shutdownhang | libsystem_kernel.dylib@0x16db6 ] → [@ shutdownhang | libsystem_kernel.dylib@0x1ca1e]
[@ shutdownhang | libsystem_kernel.dylib@0x19bf2]
You need to log in
before you can comment on or make changes to this bug.
Description
•