Closed Bug 262192 Opened 20 years ago Closed 20 years ago

SIGSEGV in pk11_forceTokenAttribute in softoken

Categories

(NSS :: Libraries, defect, P1)

x86
SunOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

Attachments

(1 file, 4 obsolete files)

I was running an SSL session reuse stress test on AMD64 solaris and got a core
file after about 48hrs. Unfortunately, this was with optimized bits, and I can't
narrow down the crash to a particular line of code. The stack on the crashing
thread appears corrupt - dbx only shows a single stack frame. I'm trying to
reproduce the crash with debug bits.
Keywords: sun-orion3
Priority: -- → P1
Target Milestone: --- → 3.9.3
Do you think this bug is only on Solaris 10 AMD64 or could affect any other
plateform ?
Any idea what object, attribute, and value was being forced?
Might give us some clues
Christophe,

In my previous SSL stress tests, I never saw a core file on any other platform,
so I'm not sure that they would be affected.

It's possible this has to do with the compiler optimizations . I was using the
highest level of optimizations in studio9, with the -xO5 and -fast options, in
order to get the best performance.

I'm now running a stress test with debug bits to find if I can reproduce the
crash. I will leave it running until monday. If it doesn't crash by then, I may
restart a test with the optimized bits to find if I can reproduce it again and
narrow it down to compiler optimizations ...

Nelson : unfortunately, the optimized core file really does not say much at all.
I can't get the value of arguments displayed.

(dbx) threads
      t@1  a  l@1   ?()   sleep on 0x53dfb8  in  __lwp_park() 
      t@2  a  l@2   _pt_root()   LWP suspended in  mutex_trylock_adaptive() 
      t@3  a  l@3   _pt_root()   LWP suspended in  __lwp_park() 
o>    t@4  a  l@4   _pt_root()   signal SIGSEGV in  pk11_forceTokenAttribute() 
      t@5  a  l@5   _pt_root()   LWP suspended in  __lwp_park() 
      t@6  a  l@6   _pt_root()   LWP suspended in  __pollsys() 
      t@7  a  l@7   _pt_root()   LWP suspended in  _close() 
      t@8  a  l@8   _pt_root()   LWP suspended in  __pollsys() 
      t@9  a  l@9   _pt_root()   LWP suspended in  __lwp_park() 
     t@10  a l@10   _pt_root()   sleep on 0x5a89f8  in  __lwp_park() 
(dbx) thread t@4
t@4 (l@4) stopped in pk11_forceTokenAttribute at 0x7ffffface53e
0x00007ffffface53e: pk11_forceTokenAttribute+0x00be:    movq    
0x0000000000000030(%r12),%r10
(dbx) where     
current thread: t@4
=>[1] pk11_forceTokenAttribute(), at 0x7ffffface53e 
(dbx) 

Julien, 
thanks for emailing me the stack contents, disassembly and register values.  
The stack does not appear to me to be clobbered.  
The crash occurred at line 1714 in pk11_forceTokenAttribute

1713     attribute=pk11_FindAttribute(object,type);
1714     if ((type != CKA_LABEL) && (attribute->attrib.ulValueLen == len) &&
1715         PORT_Memcmp(attribute->attrib.pValue,value,len) == 0) {

the value of "attribute" returned by pk11_FindAttribute was NULL, and 
line 1714 doesn't check for NULL before dereferencing it.  

I found that the compiler had in-lined the functions pk11_narrowToTokenObject
and pk11_FindAttribute, which effectively caused the pk11_FindAttribute call 
to be turned into a call to pk11_FindTokenAttribute.  Pretty impressive 
optimization.  

The fact that pk11_FindTokenAttribute returned NULL may or may not be 
indicative of another error, perhaps due to optimization.  But the immediate
cause of the crash is the failure to check the attribute pointer value for
NULL.  
Thanks, Nelson !

I think it would be useful to add a NULL check for the attribute here, but I'm
not certain what CKR error code to return.

Most other parts of the code return CKR_TEMPLATE_INCOMPLETE for this case. But
pk11_findAttribute doesn't say why it didn't find the attribute.

The proper return code might be CKR_ATTRIBUTE_TYPE_INVALID, or
CKR_OBJECT_HANDLE_INVALID . It's hard to tell.

However, I just found out that the tree that got the core file is a tip tree,
not 3.9 branch, which contained a patch made by Ian to enhance performance with
PKCS#11 contexts . This may very well be the cause of this crash. I will attach
the patch in question. Unfortunately it is one more variable if I have to
reproduce the problem.
Target Milestone: 3.9.3 → ---
Version: 3.9.2 → unspecified
Attached patch Ian's patch for the tip (obsolete) — Splinter Review
The attribute type being forced was CKA_NEVER_EXTRACTABLE.
That attribute is forced in these places:
/security/nss/lib/softoken/pkcs11.c, line 1306 
/security/nss/lib/softoken/pkcs11.c, line 1413 
/security/nss/lib/softoken/pkcs11c.c, line 2926 
/security/nss/lib/softoken/pkcs11c.c, line 3404 
/security/nss/lib/softoken/pkcs11c.c, line 3408 
The places that attempt to set CKA_NEVER_EXTRACTABLE are all in key (pair)
generation / unwrapping / deriving.  Ian's patch seems to affect digesting
only, so it's not at all obvious that his patch would affect the behavior 
of the operations relevant to this crash.
Hmm what is the object type that we are trying to force this attribute on?

The correct CKR value should CKR_ATTRIBUTE_TYPE_INVALID  or CKR_HOST_MEMORY,
depending on why it couldn't be found. The Token object calls should return a
static attribute for this value, unless object type isn't what we thought it
was. (indicating a trash somewhere else).

Adding a null check here is a good sanity check anyway, and should be included
in the patch.

bob
Julien sent me a dump of the memory to which the "object" pointer points.
The content of that memory is NOT a valid PK11Object.  That explains why
pk11_FindTokenAttribute returned NULL.  But it also says that the NULL 
pointer dereference was only a side-effect of the invalid object pointer, 
and the real cause of the crash (the code responsible for the invalid object
pointer, or which trashed the PK11 Object) has not yet been identified.  :(

Since the stack appears intact, we should be able to walk down the stack
until we find the source of the invalid pointer.  That value appears several
times in the stack, so we should be able to find the origin.
FYI, my test with debug bits is still running and the crash has not been reproduced.
Julien, you should try to do an optimized build
with debug symbols.  You can try adding -g after
the -xO5 and -fast options.  The compiler will
tell you whether you can use -g or some other
variant of -g with that compiler optimization
level.
I made an "optimized debug build" after struggling a little bit with the
compiler. I restarted my test. I hope I can reproduce the crash and get more
info out of the new core ...
I haven't been able to reproduce this bug in over a week of trying,
unfortunately. I can't debug my old core file either because the specific build
has been overwritten, and I can't reproduce it as the particular compiler has
gone away too. We are just going to have make the defensive fix in the function
where the crash occured, and speculate about the root cause above it.
Attached patch fix crash symptoms (obsolete) — Splinter Review
This patch checks for NULL pointers in the function where the crash was
observed.

As Nelson noted, the content of the PK11Object structure is corrupt, so I
attempt to check if slot and refCount are valid. If there are any other
assumptions I can make to validate the content of this structure, I'd like to
add more checks as well.
Attached patch add missing assertions (obsolete) — Splinter Review
Attachment #162015 - Attachment is obsolete: true
Attachment #162016 - Flags: superreview?(rrelyea0264)
Attachment #162016 - Flags: review?(saul.edwards.bugs)
Comment on attachment 162016 [details] [diff] [review]
add missing assertions

R+.. though I think the first return after the first if should probably be
CKR_DEVICE_ERROR. It can only happen is there is some sort of internal
corruption. CKR_ARGUMENTS_BAD means the user of the PKCS #11 module passed in
bad arguments. That would already be checked by the code tha converts object id
into PK11Object *.
Attachment #162016 - Flags: superreview?(rrelyea0264) → superreview+
Comment on attachment 160799 [details] [diff] [review]
Ian's patch for the tip

FYI, Ian's patch causes problems with the SSL3 cipher suite RSA with RC4 128
and SHA1 .  The problem is that the server becomes non-interoperable with
standard implementations (MAC mismatch).

The RC4 128/ MD5 suite isn't affected.

This may not have anything to do with the cause of the crash, but I thought I
would mention it.
Attachment #160799 - Attachment is obsolete: true
Julien, your observation in comment 18 is more than a little frightening.
Whiel this info is still fresh in your mind, please file a separate bug
about it and cc me on that bug.  Please be specific about the RCS version
number and file name that contains the flaw, and how/why it adversely 
affects MACs.  Thanks!
Nelson, I'm reluctant to file a bug about a patch that hasn't been checked in
anywhere yet.

To observe the problem, you can apply Ian's patch to an otherwise virgin tip NSS
tree, run selfserv with the "-c n" option to select the RSA RC4 128 SHA1 cipher
suite, then connect with mozilla or other tools using pre-existing NSS libs.
They will not be able to connect. But if you use a client that is using the same
broken libs, such as occurs in all.sh, the client will work .

Given that Ian's patch deals with digests and attempts to optimize the use of
PKCS#11 context, if there is something wrong with it, it is not completely
surprising that it could affect MAC computation with SSL, as it does here for
SHA1. But two wrongs seem to make one right in this case, as I suppose both the
client and server perform the incorrect SHA1 computation with the same bad
result . I think this uncovers a critical flaw in our test coverage. Our tests
should be able to figure out that SSL interoperability has been broken by a code
change such as this one.
Actually I need to move the checks in the patch one level up the stack or it
won't do any good, since the same corruption in the PK11Object would crash
pk11_forceAttribute, the sole caller of pk11_forceTokenAttribute .

One more thing, about Ian's patch. I found that it just breaks SSL3 cipher
suites in general, not specifically RSA 128/SHA1. But all TLS cipher suites
still work with the patch. To reproduce the problem with the patch, run selfserv
(from the tip) with -S and -T options, to disable SSL2 and TLS respectively.
Then, the browser won't be able to connect.
With Nelson's help, I figured out that Ian's patch missed one place to flush the
new digest buffer, in PK11_DigestKey . Fixing that also fixed SSL3.
Unfortunately, the review of the patch shows it's very unlikely to have caused
the memory corruption in the PK11Object, so it's probably unrelated to the crash
in this bug.
Julien - I opened bug 265620 for my patch.  Can you post your updated patch
(with Nelson's suggested fix) there?  Thanks.
Comment on attachment 162016 [details] [diff] [review]
add missing assertions

The assertions added in this patch, e.g.

>+    PORT_Assert(object);
>+    PORT_Assert(object->refCount);
>+    PORT_Assert(object->slot);
>+    if (!object ||
>+        !object->refCount ||
>+        !object->slot) {
>+        return CKR_ARGUMENTS_BAD;
>+    }
>+
>+    to = pk11_narrowToTokenObject(object);
>     PORT_Assert(to);
>     if (to == NULL) {
> 	return CKR_DEVICE_ERROR;
>     }

are good, as far as they go.  But they don't really assert 
that the object pointer really points to a valid PKCS11Object.
They merely assert that two memory values at various offsets
in the object are non-zero, which IMO has some significant 
probability of being true even if the pointer is pointing at
random memory.	

So, given that the object is a private struct to softoken, 
I propose the following additional check be implemented.
I propose adding a new const char * pointer to struct 
PK11ObjectStr, immediately after next and prev.  Also, 
there will be a global const character string, global to 
the softoken only, that is the value "PKCS11Object".  
Whenever a PKCS11Object is created/initialized, the new 
pointer member must be set to point to that one string.
Thereafter, at any time that we want to check that we have 
a valid PCKS11Object pointer, we merely have to check that
the value of that pointer equals the address of that string.
No need to do a string comparison.  

We could use any magic value, and could be an int rather than
a char *, but having it be a char * has value in debugging.
With the string, you don't have to memorize magic values in 
order to ask "what kind of object is this"?  You just look 
at the string and it tells you.  

I suggest that if we assert that the object pointer points to
a real PKCS11Object, in all the functions that are likely to
have been in the crash's call stack, we will find the offending
code much sooner.
Attachment #162016 - Attachment is obsolete: true
Nelson,

Thanks for the ideas. Either a string or a magic value is fine with me (we can
use 0xDEADBEEF). I think we should implement this on the tip . For the
short-term  3.9.4 patch release, I would like to use the shorter patch just
submitted. 
Attachment #163079 - Attachment description: move assertions one level up → move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later.
Attachment #163079 - Flags: review?(nelson)
Comment on attachment 163079 [details] [diff] [review]
move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later.

Saul, please review for 3.9.4 check in. Thanks.
Attachment #163079 - Flags: superreview?(saul.edwards.bugs)
I have checked in the patch to NSS_3_9_BRANCH due to our imminent 3.9.4 build.

Checking in pkcs11u.c;
/cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v  <--  pkcs11u.c
new revision: 1.54.2.2; previous revision: 1.54.2.1
done

I would still like the reviews, though. Thanks.
Comment on attachment 163079 [details] [diff] [review]
move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later.

This patch has 2 "hunks".  The first hunk is fine. 
The second one asserts that a pointer is not NULL, 
immediately AFTER dereferencing it.  e.g.

>     if (pk11_isToken(object->handle)) {
>+        PORT_Assert(object);
>+        PORT_Assert(object->refCount);
>+        PORT_Assert(object->slot);
>+        if (!object ||
>+            !object->refCount ||
>+            !object->slot) {
>+            return CKR_DEVICE_ERROR;
>+        }
> 	return pk11_forceTokenAttribute(object,type,value,len);
>     }

Seems to me that these 3 assertions and the code that returns
CKR_DEVICE_ERROR should take place BEFORE we dereference object.

Also, I don't understand what is gained by moving these 
assertions up out of pk11_forceTokenAttribute into a code path
that only occurs immediately priori to calling that function.
I thought the point of moving the assertions up was so that 
they would also apply to the non-token cases.  No?
Attachment #163079 - Flags: review?(nelson) → review-
Comment on attachment 163079 [details] [diff] [review]
move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later.

Julien, Please correct this before shipping 3.9.4
Attachment #163079 - Flags: superreview?(saul.edwards.bugs)
Nelson,

You are right, the third patch was not functionally different from the second
one. I moved the assertions before the pk11_isToken call .

Checking in pkcs11u.c;
/cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v  <--  pkcs11u.c
new revision: 1.54.2.3; previous revision: 1.54.2.2
done
Attached patch updated patchSplinter Review
move assertions a few lines up. This is a cumulative diff
This patch has been checked in to both the NSS_3_9_BRANCH and the tip :

Checking in pkcs11u.c;
/cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v	<--  pkcs11u.c
new revision: 1.58; previous revision: 1.57
done

A more extensive patch will follow later to incorporate the magic logic and
Nelson's other ideas for the tip, so I'm keeping this bug open.
Attachment #163079 - Attachment is obsolete: true
Attachment #162016 - Flags: review?(saul.edwards.bugs)
I have not seen this specific problem occur again. Marking fixed.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Julien, please set the target milestone so we
know in which NSS release this bug was fixed.
Target Milestone: --- → 3.9.4
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: