Closed
Bug 202179
Opened 21 years ago
Closed 21 years ago
MAC computation failed with strsclnt
Categories
(NSS :: Libraries, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
3.9
People
(Reporter: julien.pierre, Assigned: wtc)
Details
(Whiteboard: [3.3.5][3.7.10][3.8.3])
Attachments
(4 files, 1 obsolete file)
10.17 KB,
text/plain
|
Details | |
6.33 KB,
text/plain
|
Details | |
986 bytes,
patch
|
Details | Diff | Splinter Review | |
939 bytes,
patch
|
nelson
:
superreview+
|
Details | Diff | Splinter Review |
On my dual Xeon 2.2 GHz with hyperthreading, I consistently get 6 out of 100,000 MAC computation failures when connecting to aka . This server uses a 40-bit cipher suite requesting client auth. Note that in this case strsclnt does not present a client cert, so it just gets denied access. It's possible that the failure is with the server rather than the client. The server software is likely quite old. Still, this is abnormal. C:\nss\39\mozilla\dist\WIN954.0_DBG.OBJ\bin>strsclnt -d . -D -N -q -p 443 -c 100 0 -t 100 aka 2>&1 | grep -v Validated strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable strsclnt: -- SSL:strsclnt: 0 cache hits; 1998 cache misses, 0 cache not reusable strsclnt: PR_Read returned error -12213: MAC computation failed. strsclnt: PR_Read returned error -12213: MAC computation failed. strsclnt: PR_Read returned error -12213: MAC computation failed. strsclnt: PR_Read returned error -12213: MAC computation failed. strsclnt: PR_Read returned error -12213: MAC computation failed. strsclnt: PR_Read returned error -12213: MAC computation failed.
Reporter | ||
Comment 1•21 years ago
|
||
FYI : I tried reducing the number of client threads from 100. I didn't get any failure with 1, 2, 4, 8, 16, 32, 64, 80 threads. I got a single failure with 90 threads. 7 failures with 99 threads. I can consistently reproduce multiple failures with 100 threads. With 150 threads I got no failure however ! So much for the theory that I was overloading the server and somehow bad things were happening. There appears to be a range that causes the problem to be reproduced, and 100 threads is right in that range. This might require testing on other machines to see if it is machine-dependent.
Comment 2•21 years ago
|
||
The error reported above is SSL_ERROR_MAC_COMPUTATION_FAILURE. It means that the attempt to compute the MAC value locally failed to compute any MAC value at all. It means that one of the following functions returned SECFailure: PK11_DigestBegin, PK11_DigestOp, PK11_DigestFinal. That error is entirely distinct from the errors SSL_ERROR_BAD_MAC_READ and SSL_ERROR_BAD_MAC_ALERT. Those errors mean that a MAC was computed, but the computed value was not the expected value. As an approach to debugging this, I would suggest setting a breakpoint inside the PK11_ functions named above in the error paths that cause SECFailure to be returned. That will tell you which function fails. Then, if the failure is actually occuring in softoken, set similar breakpoints in the error paths of the softoken function(s) used by the particular failing PK11_ function.
Reporter | ||
Comment 3•21 years ago
|
||
When I run the test inside the debugger, it takes much longer and the problem does not occur. I wlil try putting assertions. I tried this test on my Sun as well - it instantly hangs, needing a pull of the power cord. My Sun has been very sick in the past week and can't stay up for more than a few minutes (it used to stay up for 3 - 6 months at a time, before I rebooted it for software upgrades).
Reporter | ||
Comment 4•21 years ago
|
||
So far only my very fastest system has reproduced the problem. The slower OS/2 & Sun boxes (when not crashing) in my office don't show the error.
Reporter | ||
Comment 5•21 years ago
|
||
I'm still busy putting asserts to go backwards up to the lowest level failure. So far, I have found this : C_SignInit is returning CKR_KEY_HANDLE_INVALID . This occurs during pk11_context_init . The stack is : NTDLL! DbgBreakPoint address 0x77f9180c pk11_context_init(PK11ContextStr * 0x0446c770, CK_MECHANISM * 0x0937fcc4) line 3566 + 21 bytes PK11_DigestBegin(PK11ContextStr * 0x0446c770) line 3953 + 13 bytes ssl3_ComputeRecordMAC(ssl3CipherSpec * 0x0056b680, PK11ContextStr * 0x0446c770, int 0x00000017, unsigned short 0x0300, SSL3SequenceNumber {...}, unsigned char * 0x0055ab48, int 0x000000e1, unsigned char * 0x0937fd84, unsigned int * 0x0937fdd8) line 1597 + 9 bytes ssl3_HandleRecord(sslSocketStr * 0x005468c8, SSL3Ciphertext * 0x0937fe00, sslBufferStr * 0x00546a70) line 8313 + 95 bytes ssl3_GatherCompleteHandshake(sslSocketStr * 0x005468c8, int 0x00000000) line 203 + 22 bytes ssl3_GatherAppDataRecord(sslSocketStr * 0x005468c8, int 0x00000000) line 232 + 13 bytes DoRecv(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int 0x0000f000, int 0x00000000) line 499 + 11 bytes ssl_SecureRecv(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int 0x0000f000, int 0x00000000) line 980 + 21 bytes ssl_SecureRead(sslSocketStr * 0x005468c8, unsigned char * 0x0c22a850, int 0x0000f000) line 989 + 19 bytes ssl_Read(PRFileDesc * 0x005e0280, void * 0x0c22a850, int 0x0000f000) line 1262 + 21 bytes PR_Read(PRFileDesc * 0x005e0280, void * 0x0c22a850, int 0x0000f000) line 138 + 20 bytes handle_connection(PRFileDesc * 0x005e0280, int 0x000003dd) line 677 + 19 bytes do_connects(void * 0x0012fea0, void * 0x0053f130, int 0x000003dd) line 792 + 13 bytes thread_wrapper(void * 0x00411e80) line 373 + 26 bytes _PR_NativeRunThread(void * 0x00542ed0) line 433 + 13 bytes MSVCRT! 780060ce() KERNEL32! lstrcmpiW + 183 by I'll keep adding asserts to my tree to trace this ...
Reporter | ||
Comment 6•21 years ago
|
||
Here is the rest of the stack : pk11_ObjectFromHandleOnSlot(unsigned long 0x00003c83, PK11SlotStr * 0x004d8cd8) line 2131 + 36 bytes pk11_ObjectFromHandle(unsigned long 0x00003c83, PK11SessionStr * 0x0825e838) line 2145 + 13 bytes pk11_InitGeneric(PK11SessionStr * 0x0825e838, PK11SessionContextStr * * 0x09b7fc78, int 0x00000003, PK11ObjectStr * * 0x09b7fc7c, unsigned long 0x00003c83, unsigned long * 0x09b7fc88, unsigned long 0x00000003, unsigned long 0x00000108) line 341 + 13 bytes NSC_SignInit(unsigned long 0x00000787, CK_MECHANISM * 0x09b7fcc4, unsigned long 0x00003c83) line 1702 + 34 bytes pk11_ObjectFromHandleOnSlot is returning a NULL object pointer This is set by the macro pk11queue_find . Bob, any idea ?
Comment 7•21 years ago
|
||
Sounds like a race on the key handle. Either 1) someone is deleting a key when they weren't supposed to, or 2) the key is sharing a session with someone else and that session is getting clobbered, or 3) some internal race condition in the softoken is causing the object handle to get lost. I would privately suspect case 2 as the most likely. bob
Comment 8•21 years ago
|
||
Ian, please review this patch. Is it what you want?
Assignee | ||
Comment 9•21 years ago
|
||
Comment on attachment 122910 [details] [diff] [review] patch This patch is intended for bug 202270.
Attachment #122910 -
Attachment is obsolete: true
Comment 10•21 years ago
|
||
Could this be related to bug 196353?
Reporter | ||
Comment 11•21 years ago
|
||
Ian, There is a certain amount of commonality in these two bugs indeed. We don't reuse object handles that quickly however in softoken. I'm certain that I'm not going over the 2**28 boundary of handles in this short test. Perhaps the problem is that the session object is being shared accross threads in the SSL client, and when one of the threads closes the PKCS#11 session, it causes problems for the other threads, which still have the a handle to the session object that got destroyed ? This is just a theory, I haven't actually gone back to look at the code recently to check if that makes sense.
Reporter | ||
Updated•21 years ago
|
Priority: -- → P2
Reporter | ||
Comment 12•21 years ago
|
||
FYI, I have now reproduced the problem on another system - my OS/2 box, which was upgraded from a 450 MHz Pentium 2 to a single-CPU 2.4 GHz Pentium 4 recently.
Reporter | ||
Comment 13•21 years ago
|
||
This is the bug I was thinking of in the meeting today. Unfortunately, it wasn't fixed. I was investigating it the week before I went on sabbatical. I cleaned my tree from all the asserts a long time ago, but it should be possible to reproduce by looking at the posted stack traces.
Comment 14•21 years ago
|
||
Here are some crashes seen at Sun. Machine (and other) info: Iam running on a ultra 10 (I think) with 512 MB memory and 1 GM swap space and 2 cpu machine. Iam using Loadrunner which simulates around 250-300 browsers on my win2K machine and hits a servlet on the webserver. I can see the error message almost immediately and then the crashes happen in 3-4 minutes. There are no hardware accerelators and the machine is patched up to dater. As per Chris when I reverted back to NSS 3.3.4 it seemed to run fine without errors or crashes for 25 minutes. Note they are able to use 3.3.4 without problems. I'm trying to see if we can isolate a problem patch.
Comment 15•21 years ago
|
||
Assignee | ||
Comment 16•21 years ago
|
||
Kirk, Ian, Chris, Could you open a separate bug on your problem and cc us? We should presume that it is different from the problem that Julien oroginally reported in this bug.
Comment 17•21 years ago
|
||
I'm pretty sure we're being bitten by the bug Julien found: 1. PR_Recv returns SSL_ERROR_MAC_COMPUTATION_FAILURE 2. there's a race when closing a session that's shared across multiple threads
Comment 18•21 years ago
|
||
The dbx output attached above seems to show two separate threads, destroying two separate symkeys with separate session handles. This is not a race to destroy a single session. If the two session handles map to some common object inside softoken, that's probably the bug.
Reporter | ||
Comment 19•21 years ago
|
||
I discussed this bug with Nelson and have been debugging this some more. I determined that the key for which we don't find the object is the spec server write MAC key.
Comment 20•21 years ago
|
||
We were able to track down the problem at Sun. I believe there is a good chance it is the same problem Julien opened this bug for, so I'll describe it here. Long ago, I made some changes to how NSS uses/reuses PKCS#11 sessions. Specifically, there were a couple of places where NSS systematically created sessions, did not use them, and then destroyed them. Additionally, though NSS had a symkey freelist where sessions could be reused, I found that that was rarely the case, as most of the keys were not "owners" of their sessions. The changes I made attempted to assure that all sessions we created were utilized, and increased the amount of reusable sessions on the freelist. One of the most wasteful things NSS did was to create sessions for each of the SSL session keys, and then destroy them immediately. I realize now I didn't fully understand why we did that at the time. I thought my changes would obviate the need for that step, as I wanted open sessions for the keys on the freelist. But I was wrong. That was apparently a hack, and a necessary one. What happens is the following sequence: 1. NSS opens session 1 and generates the PMS in it. 2. NSS uses session 1 and the PMS to derive the MS. It opens session 2 and associates it with the MS. 3. NSS uses session 2 and the MS to derive the SSL session keys. It opens sessions 3-6 for the keys. 4. NSS destroys sessions 3-6. It associates session 2 with the client write MAC key. This happens intentionally, but not obviously, as it just so happens that the client write MAC key is the first one sent to PK11_CreateSymKeyByHandle, and a hack in their transfers session ownership from the parent symkey (the MS) to the child (client write MAC). Thus, after this step, the MS key and all of the SSL sessions keys are associated with session 2, but only the client write MAC is the "owner". 5. When the connection is complete, NSS deletes, in order, the MS, the server keys, and then the client keys, with the MAC key last. From step 3, you can see that all the SSL session keys belong to session 2. Thus when session 2 is closed, the PKCS#11 objects representing the session keys are destroyed. The hack was put in to ensure that session 2 was always destroyed last, based on the order that SSL destroys its keys. If any part of the process described above is changed, problems will occur. What I did long ago was to remove step 4, which seemed like an obvious performance bonus (and it was). But that meant that session 2 could be closed before the SSL session keys are destroyed, as the SSL code destroys the MS first.. For our 3.3.5 release, we reverted to doing step 4 again.
Comment 21•21 years ago
|
||
This is what we used to solve the problems we saw.
Assignee | ||
Comment 22•21 years ago
|
||
The bug was introduced in the patch for bug 145322. I prepared this patch for the NSS TIP (3.9).
Assignee | ||
Comment 23•21 years ago
|
||
Comment on attachment 129671 [details] [diff] [review] patch for NSS TIP (3.9) I've checked in this patch on the NSS trunk. Julien, could you verify this bug fix? Thanks.
Attachment #129671 -
Flags: superreview?(nelsonb)
Attachment #129671 -
Flags: review?(jpierre)
Reporter | ||
Comment 24•21 years ago
|
||
Wan-Teh, I tested your patch for 3.9 and it fixes the problem.
Assignee | ||
Comment 25•21 years ago
|
||
Thank you, Julien. Glad to know that this bug is indeed the same as the crashes seen at Sun. Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Whiteboard: [3.3.5]
Target Milestone: --- → 3.9
Comment 26•21 years ago
|
||
Ian, thanks for your comment 20. I think it shows a design flaw with the present concept of a session "owner". It shows that there may be numerous objects that reference a session, and that the object that is the "owner" of the session must be destroyed last. If the references are destroyed in another order, the code crashes. The pk11 wrapper code hides the existance and ownership of sessions from the caller (e.g. from libSSL). It seems that the wrapper code that maintains this "ownership" has been hand crafted to be correct as currently used by libSSL. But it seems that an innocent change to libSSL that caused it to destroy objects in a different order would cause crashes in lower layer code. Similarly, one imagines that perhaps new applications that use pk11wrap might also experience crashes due to destroying objects in a different order than the order that pk11wrap seems to require. Please comment here if that analysis is flawed. I think the pk11wrapper layer APIs need to allow the objects to be destroyed in any order. I would think the reference counting, rather than the present ownership scheme, would be necessary to accomplish that. Your comments in the patch suggest that you think likewise. Is that feasible for NSS 3.9? Also, I wonder if we should regard the bug as fixed, or as being worked around. In particular, does this patch introduce a performance regression? Is there a solution that is performant and correct regardless of the order of object destruction?
Assignee | ||
Comment 27•21 years ago
|
||
We should regard this bug as fixed, otherwise we won't be able to indicate the fact that we are no longer getting the "Mac computation failed" errors. We should open a new bug for the work to come up with a better solution.
Comment 28•21 years ago
|
||
Nelson, I agree with everything you said in comment #26. In fact, I started working on implementing reference-counted sessions for NSS 3.9, and I hope to have time to work on that soon. I don't believe I opened a bug for that work yet, but I will. I see no other way to address the performance problems in this area. The scheme of transferring session ownership is very fragile, and changing it in any way invites these kind of regressions.
Comment 29•21 years ago
|
||
The bug opened as a follow-up to this one is bug 216552.
Comment 30•21 years ago
|
||
Comment on attachment 129671 [details] [diff] [review] patch for NSS TIP (3.9) My concerns with this patch are now the subject of bug 216552.
Attachment #129671 -
Flags: superreview?(MisterSSL) → superreview+
Reporter | ||
Updated•21 years ago
|
Attachment #129671 -
Flags: review?(jpierre)
Assignee | ||
Comment 31•21 years ago
|
||
Comment on attachment 129671 [details] [diff] [review] patch for NSS TIP (3.9) I carried this fix back to the NSS_3_7_BRANCH (NSS 3.7.10) and NSS_3_8_BRANCH (NSS 3.8.3).
Assignee | ||
Updated•21 years ago
|
Whiteboard: [3.3.5] → [3.3.5][3.7.10][3.8.3]
You need to log in
before you can comment on or make changes to this bug.
Description
•