Closed Bug 202179 Opened 21 years ago Closed 21 years ago

MAC computation failed with strsclnt

Categories

(NSS :: Libraries, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: wtc)

Details

(Whiteboard: [3.3.5][3.7.10][3.8.3])

Attachments

(4 files, 1 obsolete file)

On my dual Xeon 2.2 GHz with hyperthreading, I consistently get 6 out of 100,000
MAC computation failures when connecting to aka . This server uses a 40-bit
cipher suite requesting client auth. Note that in this case strsclnt does not
present a client cert, so it just gets denied access.

It's possible that the failure is with the server rather than the client. The
server software is likely quite old. Still, this is abnormal.

C:\nss\39\mozilla\dist\WIN954.0_DBG.OBJ\bin>strsclnt -d . -D -N -q -p 443 -c 100
0 -t 100 aka 2>&1 | grep -v Validated
strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable
strsclnt: -- SSL:strsclnt: 0 cache hits; 1998 cache misses, 0 cache not reusable

strsclnt: PR_Read returned error -12213:
MAC computation failed.
strsclnt: PR_Read returned error -12213:
MAC computation failed.
strsclnt: PR_Read returned error -12213:
MAC computation failed.
strsclnt: PR_Read returned error -12213:
MAC computation failed.
strsclnt: PR_Read returned error -12213:
MAC computation failed.
strsclnt: PR_Read returned error -12213:
MAC computation failed.
FYI :

I tried reducing the number of client threads from 100.
I didn't get any failure with 1, 2, 4, 8, 16, 32, 64, 80 threads.

I got a single failure with 90 threads.

7 failures with 99 threads.

I can consistently reproduce multiple failures with 100 threads.

With 150 threads I got no failure however ! So much for the theory that I was
overloading the server and somehow bad things were happening.

There appears to be a range that causes the problem to be reproduced, and 100
threads is right in that range.

This might require testing on other machines to see if it is machine-dependent.
The error reported above is SSL_ERROR_MAC_COMPUTATION_FAILURE.  
It means that the attempt to compute the MAC value locally failed to compute 
any MAC value at all.  It means that one of the following functions returned
SECFailure: PK11_DigestBegin, PK11_DigestOp, PK11_DigestFinal.

That error is entirely distinct from the errors SSL_ERROR_BAD_MAC_READ and
SSL_ERROR_BAD_MAC_ALERT.  Those errors mean that a MAC was computed, but
the computed value was not the expected value.  

As an approach to debugging this, I would suggest setting a breakpoint inside
the PK11_ functions named above in the error paths that cause SECFailure to 
be returned.  That will tell you which function fails.  Then, if the failure
is actually occuring in softoken, set similar breakpoints in the error paths
of the softoken function(s) used by the particular failing PK11_ function.
When I run the test inside the debugger, it takes much longer and the problem
does not occur. I wlil try putting assertions.

I tried this test on my Sun as well - it instantly hangs, needing a pull of the
power cord. My Sun has been very sick in the past week and can't stay up for
more than a few minutes (it used to stay up for 3 - 6 months at a time, before I
rebooted it for software upgrades).
So far only my very fastest system has reproduced the problem. The slower OS/2 &
Sun boxes (when not crashing) in my office don't show the error.
I'm still busy putting asserts to go backwards up to the lowest level failure.
So far, I have found this :

C_SignInit is returning CKR_KEY_HANDLE_INVALID .
This occurs during pk11_context_init .

The stack is :

NTDLL! DbgBreakPoint address 0x77f9180c
pk11_context_init(PK11ContextStr * 0x0446c770, CK_MECHANISM * 0x0937fcc4) line
3566 + 21 bytes
PK11_DigestBegin(PK11ContextStr * 0x0446c770) line 3953 + 13 bytes
ssl3_ComputeRecordMAC(ssl3CipherSpec * 0x0056b680, PK11ContextStr * 0x0446c770,
int 0x00000017, unsigned short 0x0300, SSL3SequenceNumber {...}, unsigned char *
0x0055ab48, int 0x000000e1, unsigned char * 0x0937fd84, unsigned int *
0x0937fdd8) line 1597 + 9 bytes
ssl3_HandleRecord(sslSocketStr * 0x005468c8, SSL3Ciphertext * 0x0937fe00,
sslBufferStr * 0x00546a70) line 8313 + 95 bytes
ssl3_GatherCompleteHandshake(sslSocketStr * 0x005468c8, int 0x00000000) line 203
+ 22 bytes
ssl3_GatherAppDataRecord(sslSocketStr * 0x005468c8, int 0x00000000) line 232 +
13 bytes
DoRecv(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int 0x0000f000,
int 0x00000000) line 499 + 11 bytes
ssl_SecureRecv(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int
0x0000f000, int 0x00000000) line 980 + 21 bytes
ssl_SecureRead(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int
0x0000f000) line 989 + 19 bytes
ssl_Read(PRFileDesc * 0x005e0280, void * 0x0c22a850, int 0x0000f000) line 1262 +
21 bytes
PR_Read(PRFileDesc * 0x005e0280, void * 0x0c22a850, int 0x0000f000) line 138 +
20 bytes
handle_connection(PRFileDesc * 0x005e0280, int 0x000003dd) line 677 + 19 bytes
do_connects(void * 0x0012fea0, void * 0x0053f130, int 0x000003dd) line 792 + 13
bytes
thread_wrapper(void * 0x00411e80) line 373 + 26 bytes
_PR_NativeRunThread(void * 0x00542ed0) line 433 + 13 bytes
MSVCRT! 780060ce()
KERNEL32! lstrcmpiW + 183 by

I'll keep adding asserts to my tree to trace this ...
Here is the rest of the stack :

pk11_ObjectFromHandleOnSlot(unsigned long 0x00003c83, PK11SlotStr * 0x004d8cd8)
line 2131 + 36 bytes
pk11_ObjectFromHandle(unsigned long 0x00003c83, PK11SessionStr * 0x0825e838)
line 2145 + 13 bytes
pk11_InitGeneric(PK11SessionStr * 0x0825e838, PK11SessionContextStr * *
0x09b7fc78, int 0x00000003, PK11ObjectStr * * 0x09b7fc7c, unsigned long
0x00003c83, unsigned long * 0x09b7fc88, unsigned long 0x00000003, unsigned long
0x00000108) line 341 + 13 bytes
NSC_SignInit(unsigned long 0x00000787, CK_MECHANISM * 0x09b7fcc4, unsigned long
0x00003c83) line 1702 + 34 bytes


pk11_ObjectFromHandleOnSlot is returning a NULL object pointer

This is set by the macro pk11queue_find . Bob, any idea ?
Sounds like a race on the key handle.

Either 1) someone is deleting a key when they weren't supposed to, or 2) the key
is sharing a session with someone else and that session is getting clobbered, or
3) some internal race condition in the softoken is causing the object handle to
get lost. I would privately suspect case 2 as the most likely.

bob
Attached patch patch (obsolete) — Splinter Review
Ian, please review this patch.	Is it what you want?
Comment on attachment 122910 [details] [diff] [review]
patch

This patch is intended for bug 202270.
Attachment #122910 - Attachment is obsolete: true
Could this be related to bug 196353?
Ian,

There is a certain amount of commonality in these two bugs indeed.
We don't reuse object handles that quickly however in softoken. I'm certain that
I'm not going over the 2**28 boundary of handles in this short test. Perhaps the
problem is that the session object is being shared accross threads in the SSL
client, and when one of the threads closes the PKCS#11 session, it causes
problems for the other threads, which still have the a handle to the session
object that got destroyed ? This is just a theory, I haven't actually gone back
to look at the code recently to check if that makes sense.
Priority: -- → P2
FYI, I have now reproduced the problem on another system - my OS/2 box, which
was upgraded from a 450 MHz Pentium 2 to a single-CPU 2.4 GHz Pentium 4 recently.
This is the bug I was thinking of in the meeting today. Unfortunately, it wasn't
fixed. I was investigating it the week before I went on sabbatical. I cleaned my
tree from all the asserts a long time ago, but it should be possible to
reproduce by looking at the posted stack traces.
Here are some crashes seen at Sun.  Machine (and other) info:

Iam running on a ultra 10 (I think) with 512 MB memory and 1 GM swap space and
2 cpu machine. Iam using Loadrunner which simulates around 250-300 browsers on
my win2K machine and hits a servlet on the webserver. I can see the error
message almost immediately and then the crashes happen in 3-4 minutes. There
are no hardware accerelators and the machine is patched up to dater. As per
Chris when I
reverted back to NSS 3.3.4 it seemed to run fine without errors or crashes for
25 minutes.


Note they are able to use 3.3.4 without problems.  I'm trying to see if we can
isolate a problem patch.
Kirk, Ian, Chris,

Could you open a separate bug on your problem and cc
us?  We should presume that it is different from the
problem that Julien oroginally reported in this bug.
I'm pretty sure we're being bitten by the bug Julien found: 1. PR_Recv returns
SSL_ERROR_MAC_COMPUTATION_FAILURE 2. there's a race when closing a session
that's shared across multiple threads
The dbx output attached above seems to show two separate threads, destroying
two separate symkeys with separate session handles.  This is not a race to 
destroy a single session.  If the two session handles map to some common
object inside softoken, that's probably the bug.
I discussed this bug with Nelson and have been debugging this some more.
I determined that the key for which we don't find the object is the spec server
write MAC key.
We were able to track down the problem at Sun.  I believe there is a good 
chance it is the same problem Julien opened this bug for, so I'll describe it 
here.

Long ago, I made some changes to how NSS uses/reuses PKCS#11 sessions.  
Specifically, there were a couple of places where NSS systematically created  
sessions, did not use them, and then destroyed them.  Additionally, though NSS 
had a symkey freelist where sessions could be reused, I found that that was 
rarely the case, as most of the keys were not "owners" of their sessions.

The changes I made attempted to assure that all sessions we created were 
utilized, and increased the amount of reusable sessions on the freelist.

One of the most wasteful things NSS did was to create sessions for each of the 
SSL session keys, and then destroy them immediately.  I realize now I didn't 
fully understand why we did that at the time.  I thought my changes would 
obviate the need for that step, as I wanted open sessions for the keys on the 
freelist.

But I was wrong.  That was apparently a hack, and a necessary one.  What 
happens is the following sequence:

1.  NSS opens session 1 and generates the PMS in it.
2.  NSS uses session 1 and the PMS to derive the MS.  It opens session 2 and 
associates it with the MS.
3.  NSS uses session 2 and the MS to derive the SSL session keys.  It opens 
sessions 3-6 for the keys.
4.  NSS destroys sessions 3-6.  It associates session 2 with the client write 
MAC key.  This happens intentionally, but not obviously, as it just so happens 
that the client write MAC key is the first one sent to 
PK11_CreateSymKeyByHandle, and a hack in their transfers session ownership from 
the parent symkey (the MS) to the child (client write MAC).  Thus, after this 
step, the MS key and all of the SSL sessions keys are associated with session 
2, but only the client write MAC is the "owner".
5.  When the connection is complete, NSS deletes, in order, the MS, the server 
keys, and then the client keys, with the MAC key last.

From step 3, you can see that all the SSL session keys belong to session 2.  
Thus when session 2 is closed, the PKCS#11 objects representing the session 
keys are destroyed.  The hack was put in to ensure that session 2 was always 
destroyed last, based on the order that SSL destroys its keys.  If any part of 
the process described above is changed, problems will occur.

What I did long ago was to remove step 4, which seemed like an obvious 
performance bonus (and it was).  But that meant that session 2 could be closed 
before the SSL session keys are destroyed, as the SSL code destroys the MS 
first..

For our 3.3.5 release, we reverted to doing step 4 again.
This is what we used to solve the problems we saw.
The bug was introduced in the patch for bug 145322.
I prepared this patch for the NSS TIP (3.9).
Comment on attachment 129671 [details] [diff] [review]
patch for NSS TIP (3.9)

I've checked in this patch on the NSS trunk.

Julien, could you verify this bug fix?	Thanks.
Attachment #129671 - Flags: superreview?(nelsonb)
Attachment #129671 - Flags: review?(jpierre)
Wan-Teh,

I tested your patch for 3.9 and it fixes the problem.
Thank you, Julien.  Glad to know that this bug is indeed
the same as the crashes seen at Sun.  Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Whiteboard: [3.3.5]
Target Milestone: --- → 3.9
Ian, thanks for your comment 20.  
I think it shows a design flaw with the present concept of a session "owner".
It shows that there may be numerous objects that reference a session, and
that the object that is the "owner" of the session must be destroyed last.
If the references are destroyed in another order, the code crashes. 

The pk11 wrapper code hides the existance and ownership of sessions from the 
caller (e.g. from libSSL).  It seems that the wrapper code that maintains this
"ownership" has been hand crafted to be correct as currently used by libSSL.
But it seems that an innocent change to libSSL that caused it to destroy 
objects in a different order would cause crashes in lower layer code.  

Similarly, one imagines that perhaps new applications that use pk11wrap 
might also experience crashes due to destroying objects in a different order
than the order that pk11wrap seems to require.  

Please comment here if that analysis is flawed.  

I think the pk11wrapper layer APIs need to allow the objects to be destroyed
in any order.  I would think the reference counting, rather than the present
ownership scheme, would be necessary to accomplish that.  Your comments in 
the patch suggest that you think likewise.  

Is that feasible for NSS 3.9?

Also, I wonder if we should regard the bug as fixed, or as being worked 
around.  In particular, does this patch introduce a performance regression?
Is there a solution that is performant and correct regardless of the order
of object destruction?
We should regard this bug as fixed, otherwise we
won't be able to indicate the fact that we are no
longer getting the "Mac computation failed" errors.

We should open a new bug for the work to come up
with a better solution.
Nelson,

I agree with everything you said in comment #26.  In fact, I started working on 
implementing reference-counted sessions for NSS 3.9, and I hope to have time to 
work on that soon.  I don't believe I opened a bug for that work yet, but I 
will.

I see no other way to address the performance problems in this area.  The 
scheme of transferring session ownership is very fragile, and changing it in 
any way invites these kind of regressions.
The bug opened as a follow-up to this one is bug 216552.
Comment on attachment 129671 [details] [diff] [review]
patch for NSS TIP (3.9)

My concerns with this patch are now the subject of bug 216552.
Attachment #129671 - Flags: superreview?(MisterSSL) → superreview+
Attachment #129671 - Flags: review?(jpierre)
Comment on attachment 129671 [details] [diff] [review]
patch for NSS TIP (3.9)

I carried this fix back to the NSS_3_7_BRANCH (NSS 3.7.10)
and NSS_3_8_BRANCH (NSS 3.8.3).
Whiteboard: [3.3.5] → [3.3.5][3.7.10][3.8.3]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: