Intermittent netwerk/test/unit/test_tls_server.js | Assertion failure: ss->opt.noLocks || !ssl_HaveRecvBufLock(ss), at /builds/worker/checkouts/gecko/security/nss/lib/ssl/sslsecur.c:37
Categories
(Core :: Security: PSM, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox85 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: keeler)
References
Details
(Keywords: intermittent-failure, Whiteboard: [psm-assigned])
Attachments
(3 files)
Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=299975034&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/N7uUZuYjQReZ0dz1CLDnvA/runs/0/artifacts/public/logs/live_backing.log
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - TEST-PASS | netwerk/test/unit/test_tls_server.js | - Got self-signed cert - true == true
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | [11507, Main Thread] WARNING: GetDnsSuffixList is not supported without a bridge connection: file /builds/worker/checkouts/gecko/netwerk/system/android/nsAndroidNetworkLinkService.cpp, line 138
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - "Accept TLS client connection"
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - "Output to server written"
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | Assertion failure: ss->opt.noLocks || !ssl_HaveRecvBufLock(ss), at /builds/worker/checkouts/gecko/security/nss/lib/ssl/sslsecur.c:37
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | Aborted
[task 2020-04-29T12:21:05.772Z] 12:21:05 INFO - <<<<<<<
[task 2020-04-29T12:21:06.203Z] 12:21:06 INFO - mozcrash Copy/paste: /builds/worker/fetches/minidump_stackwalk/minidump_stackwalk /tmp/tmpFaSquN/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp /builds/worker/workspace/build/symbols
[task 2020-04-29T12:21:10.679Z] 12:21:10 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp
[task 2020-04-29T12:21:10.679Z] 12:21:10 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/12b45b73-c9dd-939f-72b3-8a2a3746f516.extra
[task 2020-04-29T12:21:10.681Z] 12:21:10 WARNING - PROCESS-CRASH | netwerk/test/unit/test_tls_server.js | application crashed [@ libc.so + 0x1cc48]
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash dump filename: /tmp/tmpFaSquN/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Operating system: Android
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 0.0.0 Linux 3.10.0+ #260 SMP PREEMPT Fri May 19 12:48:14 PDT 2017 x86_64
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - CPU: amd64
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - family 6 model 6 stepping 3
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 4 CPUs
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - GPU: UNKNOWN
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash reason: SIGABRT
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash address: 0x2cf3
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Process uptime: not available
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Thread 3 (crashed)
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 0 libc.so + 0x1cc48
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rax = 0x0000000000000000 rdx = 0x0000000000000006
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rcx = 0xffffffffffffffff rbx = 0x00007f551f3464f8
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rsi = 0x0000000000002cf7 rdi = 0x0000000000002cf3
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rbp = 0x0000000000000000 rsp = 0x00007f551f3444d8
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - r8 = 0x00007f551f624090 r9 = 0xbab733f05ccd319a
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r10 = 0x0000000000000000 r11 = 0x0000000000000246
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r12 = 0x0000000000002cf7 r13 = 0x0000000000000006
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r14 = 0x00007f551498298a r15 = 0x00007f55149a59b0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rip = 0x00007f5512811c48
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: given as instruction pointer in context
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 1 libc.so + 0x8a602
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f3444e0 rip = 0x00007f551287f602
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 2 libc.so + 0x30242
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344510 rip = 0x00007f5512825242
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 3 libc.so + 0x2877e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344520 rip = 0x00007f551281d77e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 4 liblog.so + 0x56f0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344580 rip = 0x00007f55140dd6f0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 5 libfreebl3.so!mp_clear [mpi.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 291 + 0x9]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f3445e0 rip = 0x00007f550a06f0fc
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 6 libfreebl3.so!from_montgomery [ecp_256_32.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 1381 + 0x5]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344600 rip = 0x00007f550a04a6cd
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 7 libnss3.so!PR_LogPrint [prlog.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 445 + 0xf]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344720 rip = 0x00007f551488248e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 8 libcrypto.so + 0xbb4b3
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344758 rip = 0x00007f5511b8d4b3
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Thread 0
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•4 years ago
|
||
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
For more information, please visit auto_nag documentation.
Comment 13•4 years ago
|
||
New occurrence: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=315249333&repo=autoland&lineNumber=4023
Comment 14•4 years ago
|
||
I just took a quick look at this and the stack trace is a little worrying. The stack shows that we are handling a certificate from a server in a callback. From there we initiate a send on a different socket. Callbacks from libssl aren't really very well suited to this sort of treatment, so it's surprising that there isn't any dispatch involved for all the certificate validation stuff.
Looking at the registers involved, we have several in common in the two layers of libssl code:
3 libssl3.so!ssl_Do1stHandshake [...]
rbx = 0x00007efd670a4000 rbp = 0x00007efd67fada40
rsp = 0x00007efd67fada10 r12 = 0x0000000000000000
r13 = 0x00007efd74b73ab9 r14 = 0x00007efd74b6b7a1
r15 = 0x00007efd74b710de rip = 0x00007efd74baa0fc
...
29 libssl3.so!ssl_Do1stHandshake [...]
rbx = 0x00007efd670a4000 rbp = 0x00007efd67faeaf0
rsp = 0x00007efd67faeac0 r12 = 0x0000000000000000
r13 = 0x00007efd74b73ab9 r14 = 0x00007efd74b6b7a1
r15 = 0x00007efd74b710de rip = 0x00007efd74baa151
I know very little about the calling convention here or registers, but it appears as though we might be attempting to send on the same socket (if any of r13, r14, r15 holds the single argument to this function). That would definitely explain the assertion failing.
And here I was sure that we should just disable locking here.
Forcing a dispatch might avoid this specific problem, but that's almost papering over the problem. The problem seems to be caused by the chain of events spawned from the constructor of the new authentication code here. The act of trying to access a service causes the event loop to be pumped in nsComponentManagerImpl::GetServiceLocked
. As the next event on the loop touches the same socket, that causes the safeguard against reentrant calls to fire. I thought that the stack looked a little too tall.
nsComponentManagerImpl::GetServiceLocked probably shouldn't be cranking the event loop synchronously like this, but I really don't understand our event loop well enough to say for sure and it's been this way for 10 years now.
Note that this appears to be hitting ClientAuthDataRunnable::BuildChainForCertificate
where we might ordinarily be doing a remote invocation via RemoteClientAuthDataRunnable::BuildChainForCertificate
instead. With that in mind, we might not hit this issue in the real world. That doesn't mean we shouldn't try to fix it.
Sorry Dana, it looks like this is the consequence of some weirdness in our event loop code. I don't think that it's fair to ask you to have to deal with bad choices of the past, but that seems to be at the root of this. I would support an effort to try to remove the extra call to NS_ProcessNextEvent
, but I'm sure that would require significant effort. In the meantime, an additional dispatch should avoid this particular instance of the problem.
Assignee | ||
Comment 15•4 years ago
|
||
Thanks for looking into this! I agree it would be good if we could avoid spinning the even loop there. I'll see if that's feasible and figure out a workaround if not.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 17•4 years ago
|
||
Before this patch, if nsIDNSService were created on a thread other than the
main thread, the code that instantiates it would synchronously dispatch an
event to the main thread to do the actual construction. However, this can lead
to deadlocks unless the XPCOM service instantiation code uses a nested event
loop to work around such problems. Since that causes stability issues, this
patch reworks the nsIDNSService constructor to only run on the main thread, and
ensures that the code calling the constructor only runs on the main thread.
Updated•4 years ago
|
Assignee | ||
Comment 18•4 years ago
|
||
If the implementation of an XPCOM service constructor involves dispatching an
event to another thread and blocking the current thread on that event, a
deadlock can occur if the target thread is also attempting to obtain that
service at the same time.
For instance, say nsIExampleService needs to be constructed on the main thread,
but some code in thread T is attempting to get a handle on it. Suppose the
implementation checks which thread it's running on and dispatches a synchronous
event to the main thread to do the actual construction work in the case that it
is not on the main thread. Further suppose that before that event is run, other
code on the main thread also attempts to get a handle on nsIExampleService. The
XPCOM service machinery sees that another thread is in the process of
constructing nsIExampleService, so it must wait for that to complete. However,
the thread that is attempting to do this construction is waiting for an event
to be run on the main thread. Thus, these two threads are deadlocked.
Before this patch, the XPCOM service machinery worked around this problem by
processing events on the second thread until it could proceed. However, this
caused a nested event loop, which is bad for stability because all guarantees
relating to when events will run are violated. In particular, this could cause
reentrency into the TLS implementation, which is not expected and has caused
intermittent crashes.
This patch removes the nested event loop in
nsComponentManagerImpl::GetServiceLocked in favor of not synchronously
dispatching events in XPCOM service constructors. One way to fix such
constructors is to move the event dispatch and blocking outside of the XPCOM
service machinery. See the other patch in this bug for an example.
Depends on D92799
Assignee | ||
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 20•4 years ago
|
||
Some PSM services need to be initialized on the main thread. Before this patch,
this was achieved by dispatching a synchronous task to the main thread in the
event that a different thread was attempting to acquire a given service for the
first time. However, with the upcoming removal of the nested event loop in the
XPCOM service instantiation code (see other patches in this bug), this can
cause a deadlock. This patch avoids the deadlock by removing the synchronous
dispatch and ensuring that these services get initialized on the main thread
relatively early, when PSM itself is initialized.
Depends on D92799
Comment 21•4 years ago
|
||
Bug 1671914 is a somewhat different instance of this weird nested event loop causing issues, for actual users. It also involved the NSS service wrappers, but it is the most trivial service imaginable, so I'm not sure exactly why it is failing.
Comment hidden (Intermittent Failures Robot) |
Comment 23•4 years ago
|
||
Comment 24•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/3aea9f380da0
https://hg.mozilla.org/mozilla-central/rev/dfc54b523a0b
https://hg.mozilla.org/mozilla-central/rev/87f72a8ec898
Updated•4 years ago
|
Description
•