SSL client dies in handshake code

RESOLVED FIXED in 3.4

Status

P1
normal
RESOLVED FIXED
17 years ago
17 years ago

People

(Reporter: julien.pierre, Assigned: wtc)

Tracking

3.3.2
Sun
Solaris

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

17 years ago
This occurred running my httptest client with NSS 3.3.2 with double-handshake
client auth . I have the core file.

detected a multithreaded program
t@64 (l@58) terminated by signal ABRT (Abort)
Current function is PR_Assert
  495       abort();
(dbx) where
current thread: t@64
  [1] __sigprocmask(0x0, 0xfdab1210, 0x0, 0x0, 0x0, 0x0), at 0xfec39bf0
  [2] _resetsig(0xfec3c510, 0x0, 0x0, 0xfdab1d78, 0xfec4e000, 0x0), at 0xfec2e620
  [3] _sigon(0xfdab1d78, 0xfec55990, 0x6, 0xfdab12e4, 0xfdab1d78, 0xfdab1328),
at 0xfec2dd10
  [4] _thrp_kill(0x0, 0x40, 0x6, 0xfec4e000, 0x40, 0xfebba428), at 0xfec30e84
  [5] raise(0x6, 0x0, 0x0, 0xffffffff, 0xfebba394, 0xfee6b6dc), at 0xfeb49b08
  [6] abort(0xfebb6000, 0x3f, 0xfebbd99c, 0xfebb9c78, 0x7b, 0xfeb91c94), at
0xfeb35124
=>[7] PR_Assert(s = 0xff316418 "!ssl_HaveXmitBufLock(ss)", file = 0xff316434
"sslsecur.c", ln = 123), line 495 in "prlog.c"
  [8] ssl_Do1stHandshake(ss = 0xd3dc750), line 123 in "sslsecur.c"
  [9] ssl_SecureSend(ss = 0xd3dc750, buf = 0xa2e68 "GET / HTTP/1.0\n\n", len =
16, flags = 0), line 1075 in "sslsecur.c"
  [10] ssl_Send(fd = 0x19c138, buf = 0xa2e68, len = 16, flags = 0, timeout =
360000000U), line 1192 in "sslsock.c"
  [11] PR_Send(fd = 0x19c138, buf = 0xa2e68, amount = 16, flags = 0, timeout =
360000000U), line 221 in "priometh.c"
  [12] SunRequest::send(this = 0x1c21c8, sock = 0x19c138), line 308 in "request.cpp"
  [13] SunEngine::makeRequest(this = 0xfdab1b64, request = CLASS, server =
CLASS, ignoresenderror = 1), line 413 in "engine.cpp"
  [14] SunTest::run(this = 0xd3758), line 454 in "tests.cpp"
  [15] TestInstance::execute_test(this = 0xd3708), line 152 in "tests.cpp"
  [16] thread_entry(arg = (nil)), line 1684 in "tests.cpp"
  [17] _pt_root(arg = 0xf1380), line 214 in "ptthread.c"
(dbx)
(Reporter)

Updated

17 years ago
Priority: -- → P1
Target Milestone: --- → 3.4
Version: 3.4 → 3.3.2
(Reporter)

Updated

17 years ago
Summary: SSL client dies in handshake code → SSL client dies in handshake code with NSS 3.3.2
The assertion that failed is in a loop, at the top.  It is there to ensure
that (a) the lock is not held when entering the loop, and (b) that no 
function called from within the loop returns while holding the lock. 
It appears that the caller of this function never holds the lock, so the 
obvious candidate is that some function that was called in this loop 
returned while holding the lock.  

I tried to look at the core file to see if I could determine which function
was called in the last iteration of this loop.  But the core file is enormous.  
dbx crashes when I try to use it to inspect this core file.  So, we probably 
won't get far with this one.  

If this is reproducible, one idea is to have a variable in the function that
holds the address of the function called by the previous iteration of the
loop.  That would make it much easier to find the culprit.  But even that
doesn't help if dbm cannot display the value.
(Assignee)

Comment 2

17 years ago
With Nelson's pointer, I searched all the files in nss/lib/ssl,
trying to match ssl_ReleaseRecvBufLock with ssl_GetRecvBufLock
and ssl_ReleaseXmitBufLock with ssl_GetXmitBufLock.

I believe the culprit is the ssl_CreateSecurityInfo function in
sslsecur.c, and the likely cause for the missing unlock error was
that sslBuffer_Grow() failed, indicating an out-of-memory error.
The relevant code in ssl_CreateSecurityInfo is reproduced below:

    ssl_GetRecvBufLock(ss);
    if ((gs = ss->gather) == 0) {
        ss->gather = gs = ssl_NewGather();
        if (!gs) {
            goto loser;
        }
    }

    rv = sslBuffer_Grow(&gs->buf, 4096);
    if (rv) {
        goto loser;
    }
    ssl_ReleaseRecvBufLock(ss);

    ssl_GetXmitBufLock(ss);
    rv = sslBuffer_Grow(&sec->writeBuf, 4096);
    if (rv) {
        goto loser;
    }
    ssl_ReleaseXmitBufLock(ss);

    SSL_TRC(5, ("%d: SSL[%d]: security info created", SSL_GETPID(), ss->fd));
    return SECSuccess;

  loser:
    if (sec) {
        PORT_Free(sec);
        ss->sec = sec = (sslSecurityInfo *)0;
    }
    if (gs) {
        ssl_DestroyGather(gs);
        ss->gather = gs = (sslGather *)0;
    }
    return SECFailure;

Nelson, does the loop call ssl_CreateSecurityInfo?  The loop
invokes ss->handshakeCallback and ss->handshake.  It's not
clear to me what those function pointers may be. 

How do I verify this theory with the core file?
Status: NEW → ASSIGNED
Each of the 3 "goto loser" statements quoted above leaves a lock locked
when the function returns.  This function must be fixed to ensure that
it never returns while leaving a lock locked.

The pointers ss->handshakeCallback and ss->handshake should always point
to one of the following functions, (or be null):
AlwaysBlock;
ssl2_BeginClientHandshake;
ssl2_BeginServerHandshake;
ssl2_HandleClientHelloMessage;
ssl2_HandleClientSessionKeyMessage;
ssl2_HandleMessage;
ssl2_HandleServerHelloMessage;
ssl2_HandleVerifyMessage;
ssl_GatherRecord1stHandshake;

BTW, I see that the "lock" in question is actually a PRMonitor, so it 
is possible that the monitor's lock count was already non-zero when 
the function was entered.  The value of ss->handshake would tell us 
if that is the case.
(Assignee)

Comment 4

17 years ago
Created attachment 71360 [details] [diff] [review]
Patch to unlock the recvBufLock or xmitBufLock on error

This patch gets rid of the problem of leaving a lock locked
before going to the 'loser' label.  However, I suspect that
we also need to lock the recvBufLock when destroying ss->gather
and lock the xmitBufLock when destroying ss->sec.  Alternatively,
I am not sure if we really need to lock the recvBufLock when
creating ss->gather and ss->gather->buf and lock the xmitBufLock
when creating ss->sec->writeBuf.  Nelson, do you know?
(Assignee)

Comment 5

17 years ago
Just to summarize what I have done and my findings.

I inspected all the files in nss/lib/ssl on the trunk of NSS as of
this morning (2002-02-25), trying to match ssl_ReleaseRecvBufLock
with ssl_GetRecvBufLock and ssl_ReleaseXmitBufLock with
ssl_GetXmitBufLock.

The only problems I found are in the ssl_CreateSecurityInfo function
in sslsecur.c.  Then I verified that that function has not changed
since NSS 3.3.2, so NSS 3.3.2 has exactly the same problems in that
function.

The problems are that ssl_CreateSecurityInfo may leave the recvBufLock
or xmitBufLock locked if it fails to create ss->gather, ss->gather->buf,
or ss->sec->writeBuf.  The most likely cause for the failure is an
out-of-memory error from the memory allocation routines(PORT_ZAlloc,
PORT_Realloc, or PORT_Alloc).

I believe this is the cause of the assertion failures reported in this
bug and bug 126501.
The RecvBufLock protects the gather buffer (among other things),
and yes this lock should be held while destroying the gather buffer.

The XmitBufLock protects the sec->writeBuf, among other things.
But the only way to get to the writebuf is through the sec 
structure that this function is creating.  If we didn't assign
ss->sec = sec until just before we return SECSuccess, then it should
not be necessary to use a lock to protect sec->writeBuf at all 
until ss->sec = sec.
Created attachment 71614 [details] [diff] [review]
Huge patch, see comments below.

This rather large patch accomplishes several things:

1. The sslSecurityInfo and sslGather structs were formerly allocated
separately from the sslSocket struct, and pointers to them were kept in
the sslSocket struct.  Now, they are part of the sslSocket struct itself,
not separtely allocated.  This has several effects:
a) slightly fewer malloc and free calls for each ssl socket,
b) eliminates the need to double check the pointers for these structs
to ensure they're not null in every function that accessed their members,
c) reduces the levels of pointer indirection required to access their
members, substantially reducing the number of memory cycles required in
some functions.  E.g. ss->sec->ci.elements is now ss->sec.ci.elements.

2) The sslGather struct is conceptually separate from the sslSecurityInfo
struct, and is no longer initialized as a side effect of calling
ssl_CreateSecurityInfo().  It is now initialized by the new function
ssl_InitGather().

3) Formerly, the sslSecurityInfo struct was allocated when SSL_ImportFD
was called, and then SSL_ResetHandshake would discard the old one and 
allocate a new one.  Now, SSL_ResetHandshake calls ssl_ResetSecurityInfo
to simply reset the state of that struct to its initial state.

4) many places that formerly did not check for memory allocation failures
now do check, and do the right thing when it fails, avoiding leaks.  

As a side effect, all the functions that handle sslSecurityInfo structs
were rewritten.  Hopefully, these changes will eliminate the cause of the
assertion failure reported in this bug. 

I have run the QA test script on this code, and will check it in shortly.
Attachment #71360 - Attachment is obsolete: true
(Reporter)

Comment 8

17 years ago
Nelson,

This should probably go into the 3.4 branch, right ? There is a 3.4 defect :
126501 .
(Reporter)

Updated

17 years ago
Summary: SSL client dies in handshake code with NSS 3.3.2 → SSL client dies in handshake code
(Reporter)

Comment 9

17 years ago
*** Bug 126501 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 10

17 years ago
We have not been able to reproduce this assertion failure.
I will assume that the assertion failure was caused by the
condition described in comment #2 and the following comments
in bug 126501 and is therefore fixed by Nelson's patch:
http://bugzilla.mozilla.org/show_bug.cgi?id=126501#c3
http://bugzilla.mozilla.org/show_bug.cgi?id=126501#c4
(Assignee)

Comment 11

17 years ago
Marked the bug fixed.
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.