Closed Bug 1634065 Opened 5 years ago Closed 4 years ago

Intermittent netwerk/test/unit/test_tls_server.js | Assertion failure: ss->opt.noLocks || !ssl_HaveRecvBufLock(ss), at /builds/worker/checkouts/gecko/security/nss/lib/ssl/sslsecur.c:37

Categories

(Core :: Security: PSM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
85 Branch
Root Cause Poor Architecture
Tracking Status
firefox85 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: keeler)

References

Details

(Keywords: intermittent-failure, Whiteboard: [psm-assigned])

Attachments

(3 files)

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=299975034&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/N7uUZuYjQReZ0dz1CLDnvA/runs/0/artifacts/public/logs/live_backing.log


[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - TEST-PASS | netwerk/test/unit/test_tls_server.js | - Got self-signed cert - true == true
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | [11507, Main Thread] WARNING: GetDnsSuffixList is not supported without a bridge connection: file /builds/worker/checkouts/gecko/netwerk/system/android/nsAndroidNetworkLinkService.cpp, line 138
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - "Accept TLS client connection"
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - "Output to server written"
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | Assertion failure: ss->opt.noLocks || !ssl_HaveRecvBufLock(ss), at /builds/worker/checkouts/gecko/security/nss/lib/ssl/sslsecur.c:37
[task 2020-04-29T12:21:05.771Z] 12:21:05 INFO - netwerk/test/unit/test_tls_server.js | Aborted
[task 2020-04-29T12:21:05.772Z] 12:21:05 INFO - <<<<<<<
[task 2020-04-29T12:21:06.203Z] 12:21:06 INFO - mozcrash Copy/paste: /builds/worker/fetches/minidump_stackwalk/minidump_stackwalk /tmp/tmpFaSquN/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp /builds/worker/workspace/build/symbols
[task 2020-04-29T12:21:10.679Z] 12:21:10 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp
[task 2020-04-29T12:21:10.679Z] 12:21:10 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/12b45b73-c9dd-939f-72b3-8a2a3746f516.extra
[task 2020-04-29T12:21:10.681Z] 12:21:10 WARNING - PROCESS-CRASH | netwerk/test/unit/test_tls_server.js | application crashed [@ libc.so + 0x1cc48]
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash dump filename: /tmp/tmpFaSquN/12b45b73-c9dd-939f-72b3-8a2a3746f516.dmp
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Operating system: Android
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 0.0.0 Linux 3.10.0+ #260 SMP PREEMPT Fri May 19 12:48:14 PDT 2017 x86_64
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - CPU: amd64
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - family 6 model 6 stepping 3
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 4 CPUs
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - GPU: UNKNOWN
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash reason: SIGABRT
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Crash address: 0x2cf3
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Process uptime: not available
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - Thread 3 (crashed)
[task 2020-04-29T12:21:10.681Z] 12:21:10 INFO - 0 libc.so + 0x1cc48
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rax = 0x0000000000000000 rdx = 0x0000000000000006
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rcx = 0xffffffffffffffff rbx = 0x00007f551f3464f8
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rsi = 0x0000000000002cf7 rdi = 0x0000000000002cf3
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - rbp = 0x0000000000000000 rsp = 0x00007f551f3444d8
[task 2020-04-29T12:21:10.682Z] 12:21:10 INFO - r8 = 0x00007f551f624090 r9 = 0xbab733f05ccd319a
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r10 = 0x0000000000000000 r11 = 0x0000000000000246
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r12 = 0x0000000000002cf7 r13 = 0x0000000000000006
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - r14 = 0x00007f551498298a r15 = 0x00007f55149a59b0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rip = 0x00007f5512811c48
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: given as instruction pointer in context
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 1 libc.so + 0x8a602
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f3444e0 rip = 0x00007f551287f602
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 2 libc.so + 0x30242
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344510 rip = 0x00007f5512825242
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 3 libc.so + 0x2877e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344520 rip = 0x00007f551281d77e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 4 liblog.so + 0x56f0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344580 rip = 0x00007f55140dd6f0
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 5 libfreebl3.so!mp_clear [mpi.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 291 + 0x9]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f3445e0 rip = 0x00007f550a06f0fc
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 6 libfreebl3.so!from_montgomery [ecp_256_32.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 1381 + 0x5]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344600 rip = 0x00007f550a04a6cd
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 7 libnss3.so!PR_LogPrint [prlog.c:2eaf4ae7a038df49405ed53f8826aadbf163fa76 : 445 + 0xf]
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344720 rip = 0x00007f551488248e
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - 8 libcrypto.so + 0xbb4b3
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - rsp = 0x00007f551f344758 rip = 0x00007f5511b8d4b3
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Found by: stack scanning
[task 2020-04-29T12:21:10.688Z] 12:21:10 INFO - Thread 0

Assignee: nobody → nobody
Component: Networking → Libraries
Product: Core → NSS
QA Contact: jjones
Version: unspecified → other
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

I just took a quick look at this and the stack trace is a little worrying. The stack shows that we are handling a certificate from a server in a callback. From there we initiate a send on a different socket. Callbacks from libssl aren't really very well suited to this sort of treatment, so it's surprising that there isn't any dispatch involved for all the certificate validation stuff.

Looking at the registers involved, we have several in common in the two layers of libssl code:

3  libssl3.so!ssl_Do1stHandshake [...]
   rbx = 0x00007efd670a4000   rbp = 0x00007efd67fada40
   rsp = 0x00007efd67fada10   r12 = 0x0000000000000000
   r13 = 0x00007efd74b73ab9   r14 = 0x00007efd74b6b7a1
   r15 = 0x00007efd74b710de   rip = 0x00007efd74baa0fc
...
29  libssl3.so!ssl_Do1stHandshake [...]
    rbx = 0x00007efd670a4000   rbp = 0x00007efd67faeaf0
    rsp = 0x00007efd67faeac0   r12 = 0x0000000000000000
    r13 = 0x00007efd74b73ab9   r14 = 0x00007efd74b6b7a1
    r15 = 0x00007efd74b710de   rip = 0x00007efd74baa151

I know very little about the calling convention here or registers, but it appears as though we might be attempting to send on the same socket (if any of r13, r14, r15 holds the single argument to this function). That would definitely explain the assertion failing.

And here I was sure that we should just disable locking here.

Forcing a dispatch might avoid this specific problem, but that's almost papering over the problem. The problem seems to be caused by the chain of events spawned from the constructor of the new authentication code here. The act of trying to access a service causes the event loop to be pumped in nsComponentManagerImpl::GetServiceLocked. As the next event on the loop touches the same socket, that causes the safeguard against reentrant calls to fire. I thought that the stack looked a little too tall.

nsComponentManagerImpl::GetServiceLocked probably shouldn't be cranking the event loop synchronously like this, but I really don't understand our event loop well enough to say for sure and it's been this way for 10 years now.

Note that this appears to be hitting ClientAuthDataRunnable::BuildChainForCertificate where we might ordinarily be doing a remote invocation via RemoteClientAuthDataRunnable::BuildChainForCertificate instead. With that in mind, we might not hit this issue in the real world. That doesn't mean we shouldn't try to fix it.

Sorry Dana, it looks like this is the consequence of some weirdness in our event loop code. I don't think that it's fair to ask you to have to deal with bad choices of the past, but that seems to be at the root of this. I would support an effort to try to remove the extra call to NS_ProcessNextEvent, but I'm sure that would require significant effort. In the meantime, an additional dispatch should avoid this particular instance of the problem.

Assignee: nobody → nobody
Blocks: 1612587
Root Cause: --- → Poor Architecture
Component: Libraries → Security: PSM
Flags: needinfo?(dkeeler)
OS: Unspecified → All
Product: NSS → Core
QA Contact: jjones
Hardware: Unspecified → All
Target Milestone: --- → mozilla77
Version: other → Trunk

Thanks for looking into this! I agree it would be good if we could avoid spinning the even loop there. I'll see if that's feasible and figure out a workaround if not.

Flags: needinfo?(dkeeler)

Before this patch, if nsIDNSService were created on a thread other than the
main thread, the code that instantiates it would synchronously dispatch an
event to the main thread to do the actual construction. However, this can lead
to deadlocks unless the XPCOM service instantiation code uses a nested event
loop to work around such problems. Since that causes stability issues, this
patch reworks the nsIDNSService constructor to only run on the main thread, and
ensures that the code calling the constructor only runs on the main thread.

Assignee: nobody → dkeeler

If the implementation of an XPCOM service constructor involves dispatching an
event to another thread and blocking the current thread on that event, a
deadlock can occur if the target thread is also attempting to obtain that
service at the same time.
For instance, say nsIExampleService needs to be constructed on the main thread,
but some code in thread T is attempting to get a handle on it. Suppose the
implementation checks which thread it's running on and dispatches a synchronous
event to the main thread to do the actual construction work in the case that it
is not on the main thread. Further suppose that before that event is run, other
code on the main thread also attempts to get a handle on nsIExampleService. The
XPCOM service machinery sees that another thread is in the process of
constructing nsIExampleService, so it must wait for that to complete. However,
the thread that is attempting to do this construction is waiting for an event
to be run on the main thread. Thus, these two threads are deadlocked.
Before this patch, the XPCOM service machinery worked around this problem by
processing events on the second thread until it could proceed. However, this
caused a nested event loop, which is bad for stability because all guarantees
relating to when events will run are violated. In particular, this could cause
reentrency into the TLS implementation, which is not expected and has caused
intermittent crashes.
This patch removes the nested event loop in
nsComponentManagerImpl::GetServiceLocked in favor of not synchronously
dispatching events in XPCOM service constructors. One way to fix such
constructors is to move the event dispatch and blocking outside of the XPCOM
service machinery. See the other patch in this bug for an example.

Depends on D92799

Severity: normal → S3
Priority: P5 → P1
Whiteboard: [psm-assigned]
See Also: → 1671914

Some PSM services need to be initialized on the main thread. Before this patch,
this was achieved by dispatching a synchronous task to the main thread in the
event that a different thread was attempting to acquire a given service for the
first time. However, with the upcoming removal of the nested event loop in the
XPCOM service instantiation code (see other patches in this bug), this can
cause a deadlock. This patch avoids the deadlock by removing the synchronous
dispatch and ensuring that these services get initialized on the main thread
relatively early, when PSM itself is initialized.

Depends on D92799

Bug 1671914 is a somewhat different instance of this weird nested event loop causing issues, for actual users. It also involved the NSS service wrappers, but it is the most trivial service imaginable, so I'm not sure exactly why it is failing.

Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3aea9f380da0 ensure nsIDNSService is only created on the main thread r=dragana,necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/dfc54b523a0b re-work how PSM services get initialized on the main thread r=kjacobs,necko-reviewers,bbeurdouche https://hg.mozilla.org/integration/autoland/rev/87f72a8ec898 remove the nested event loop in nsComponentManagerImpl::GetServiceLocked r=nika
Regressions: 1678565
Regressions: 1696593
Target Milestone: mozilla77 → 85 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: