Closed Bug 1054373 Opened 5 years ago Closed 2 years ago

Crash in PK11_DoesMechanism due to race condition

Categories

(NSS :: Libraries, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: elio.maldonado.batiz, Assigned: rrelyea)

References

Details

Attachments

(4 files, 11 obsolete files)

3.88 KB, patch
Details | Diff | Splinter Review
8.75 KB, patch
Details | Diff | Splinter Review
8.73 KB, patch
Details | Diff | Splinter Review
5.24 KB, patch
Details | Diff | Splinter Review
Attached patch Fixes race conditions (obsolete) — Splinter Review
Patch from Bob Relyea
Attachment #8473762 - Flags: review?(wtc)
Attached patch Fixes race conditions - V2 (obsolete) — Splinter Review
Red Hat testing revealed that the previous version caused deadlock on the pem module. This version addresses that.
Attachment #8473762 - Attachment is obsolete: true
Attachment #8473762 - Flags: review?(wtc)
Attachment #8476042 - Flags: review?(wtc)
Assignee: nobody → emaldona
Group: core-security
In terms of a security rating, is this just a null deref?
Flags: needinfo?(emaldona)
I was trying to limit access to the bug as it was pointed to me that there was some Red Hat customer server information that I forgot to redact out when I manually cloned this bug. This is not I what I was trying to accomplish. This is not a security sensitive bug.
Flags: needinfo?(emaldona)
marking comment 0 private, because it contains private information.

I suggest to mark this bug as invalid, and file another, cleaner one.
marking invalid. 

please file a new bug.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Keywords: sec-other
Comment on attachment 8476042 [details] [diff] [review]
Fixes race conditions - V2

Review of attachment 8476042 [details] [diff] [review]:
-----------------------------------------------------------------

r=wtc. Please review my suggested changes carefully and consult Bob
if necessary, because I didn't check the code carefully.

I pointed out a couple of changes that I don't understand, and I
ask you (or Bob) to doublecheck the changes are correct.

::: lib/pk11wrap/pk11util.c
@@ +1262,5 @@
>   * helper function to actually create and destroy user defined slots
>   */
>  static SECStatus
>  secmod_UserDBOp(PK11SlotInfo *slot, CK_OBJECT_CLASS objClass, 
> +		const char *sendSpec, PRBool needlock)

DESIGN: in general a boolean parameter makes the code less readable.
Since there are only two callers of this function, it may be better
to just let the caller be responsible for acquiring the necessary
lock.

@@ +1285,5 @@
>      if (crv != CKR_OK) {
>  	PORT_SetError(PK11_MapError(crv));
>  	return SECFailure;
>      }
> +    return SECSuccess;

IMPORTANT: please confirm that you intend to remove the SECMOD_UpdateSlotList call.

@@ +1308,2 @@
>  	PK11_FreeSlot(slot);
> +	if ((crv == CKR_OK) && 

Nit: delete the four spaces at the end of lines inside this function.
(You can see them in the code review tool.)

@@ +1363,5 @@
>      char *escSpec;
>      char *sendSpec;
>      SECStatus rv;
>  
> +    PZ_Lock(mod->refLock);   /* don't reuse a slot on the fly */

I don't understand this comment. At least it should be moved
to the previous line. Its current end-of-line location implies it
describes the PZ_Lock(mod->refLock) call, which is what I find
confusing.

@@ +1399,5 @@
>  	PK11_FreeSlot(slot);
>  	PORT_SetError(SEC_ERROR_NO_MEMORY);
>  	return NULL;
>      }
> +    rv = secmod_UserDBOp(slot, CKO_NETSCAPE_NEWSLOT, sendSpec, 

Nit: delete the space at the end of line.

@@ +1410,5 @@
>      PK11_FreeSlot(slot);
>      if (rv != SECSuccess) {
>  	return NULL;
>      }
> +    rv = SECMOD_UpdateSlotList(mod); /* don't call holding the mod->reflock */

This comment should be more clearer because it could be misinterpreted
to mean updating the slot list of 'mod' does not require holding mod->refLock.

We can just say SECMOD_UpdateSlotList will acquire mod->reflock internally.

@@ +1516,5 @@
>  	/* PR_smprintf does not set no memory error */
>  	PORT_SetError(SEC_ERROR_NO_MEMORY);
>  	return SECFailure;
>      }
> +    rv = secmod_UserDBOp(slot, CKO_NETSCAPE_DELSLOT, sendSpec, PR_TRUE);

Please confirm we don't need to call SECMOD_UpdateSlotList(mod) here.
Attachment #8476042 - Flags: review?(wtc) → review+
Elio: I didn't realize this bug is already closed. If the patch I
just reviewed is abandoned, please ignore my review comments.
Group: core-security
(In reply to Wan-Teh Chang from comment #8)
> Elio: I didn't realize this bug is already closed. If the patch I
> just reviewed is abandoned, please ignore my review comments.

Wan-Teh, it turns out that I can't ignore your review and we do need this patch or a variant of it. This bug was originally reported as a problem with Red Hat Directory Server / RHDS 9.1 crash on RHEL 6.5 in a heavily used environment at but now the problem has resurfaces. This time in a different manner with RHEL-7 after we rebased to NSS 3.21 and it wasn't until we included this patch plus another related one that the build succeeded. That's the reason for reponing this bug. I'll submit an updated version of the patch for review once we have some completed additional testing.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Attached patch Fixes race conditions - V3 (obsolete) — Splinter Review
Updated for the current sources and expanded due to further testing done downstream at Red Hat. This is Bob's patch so I defer to him to answer any questions.
Attachment #8739523 - Flags: review?(wtc)
Comment on attachment 8476042 [details] [diff] [review]
Fixes race conditions - V2

obsoleted by V3.
Attachment #8476042 - Attachment is obsolete: true
Attached patch Fixes race conditions - V4 (obsolete) — Splinter Review
Attachment #8739523 - Attachment is obsolete: true
Attachment #8739523 - Flags: review?(wtc)
Attachment #8739528 - Flags: review?(wtc)
Assignee: emaldona → rrelyea
Keywords: sec-other
Comment on attachment 8739528 [details] [diff] [review]
Fixes race conditions - V4

This is on Bob's behalf as he's the author of this patch. Wan-Teh's time for NSS reviews is extremely limited so I'm switching the review to Eric.
Attachment #8739528 - Flags: review?(wtc) → review?(ekr)
OS: Linux → All
Hardware: x86 → All
Version: 3.16.1 → trunk
Stealing, we should fix this.
Assignee: rrelyea → ttaubert
This is similar to what Elio did, but a lot smaller. I suspect that his patch contained other things besides just fixing the race condition. Let's concentrate on that and do any potential cleanup in another bug.
Attachment #8739528 - Attachment is obsolete: true
Attachment #8739528 - Flags: review?(ekr)
https://hg.mozilla.org/projects/nss/rev/92e0af39805c

Landed. If we want to take the cleanup suggested by Elio in his patch, let's open a follow-up.

Kai, do you have any way to verify that the problem is fixed for you too?
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Flags: needinfo?(kaie)
Resolution: --- → FIXED
Target Milestone: --- → 3.30
Blocks: 1334127
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Backed out to unblock bug 1334127.

https://hg.mozilla.org/projects/nss/rev/01d6c0dff06f
No longer blocks: 1334127
Flags: needinfo?(kaie)
Target Milestone: 3.30 → ---
We still have one of the older obsolete patches used in RHEL packages, but it now runs into a deadlock with the recent PK11_ResetToken fix, so we need to get this fixed, and should follow up on this work.

Bob, we have some trouble understanding how the locks must be set. Would you please be able to give us a general description of how the locking is supposed to work, so that we can try to work on a better patch?
Flags: needinfo?(rrelyea)
Bob: This happens to be our number 1 NSS crash for ChromeOS. Because of Chrome's multi-threaded certificate verification, we regularly have two threads verifying certificates, and due to slot/token events, find themselves clobbering eachother.

An example stack trace looks like this:
Thread 0:
C_GetMechanismList
PK11_ReadMechanismList
PK11_InitToken
nssSlot_Refresh
nssSlot_IsTokenPresent
nssSlot_GetToken
nssTrustDomain_FindCertificatesBySubject
CERT_CreateSubjectCertList
pkix_pl_Pk11CertSTore_GetCert
pkix_BuildForwardDepthFirstSearch
pkix_Build_InitiateBuildChain
PKIX_BuildChain
CERT_PKIXVerifyCert

Thread 1:
PK11_ReadMechanismList
PK11_InitToken
nssSlot_Refresh
nssSlot_IsTokenPresent
nssSlot_GetToken
nssTrustDOmain_FindCertificatesBySubject
CERT_CreateSubjectCertList
pkix_pl_Pk11CertSTore_GetCert
pkix_BuildForwardDepthFirstSearch
pkix_Build_InitiateBuildChain
PKIX_BuildChain
CERT_PKIXVerifyCert


In the case of the crash, Thread 1 has finished calling C_GetMechanismList and updated slot->mechanismList, and is the process of iterating it to set slot->mechanismBits. Thread 0, however, is busy reallocating slot->mechanismList as part of the C_GetMechanismList call.

The change in Comment #19 would have addressed this, by ensuring PK11_ReadMechanismList calls were serialized. However, the fact that two threads are in the process of calling PK11_InitToken as part of nssSlot_Refresh makes me think that we actually need locking further up, since PK11_InitToken modifies other attributes of the slot.

Does that sound unreasonable? My thinking is that the locking may need to be as high as at the nssTrustDomain level around the active slots, or potentially more aggressive use of nssSlot_EnterMonitor and nssSlot_ExitMonitor
Ryan, I believe you are correct. We should probably check why we have two threads in IsTokenPresent trying to refresh the token.

I'm sort or relunctant to use nssSlot_EnterMonitor there because it is already overloaded.

nssSlot_EnterMonitor protects 2 things: several PK11SlotInfo variables (including the mechanism list cache), and the PKCS #11 concurrency if necessary (for PKCS #11 modules that aren't thread safe). That means there are a lot of 'if (!isthreadsafe)' calls throughout the pk11wrap layer. It also protects concurrency on the 'global' session the pk11wrap layer creates to do single shot operations (or all operations when the token only supports one session).

To add complication to that, there is an equivalent lock for the module, and if the module is not thread safe, all slots on the module share that lock (so getting the module lock also gets the slot locks for all slots on the module). This lock also protects  all the the module variables (including it slot list).

Finally if the module isn't threadsafe, any session locks (which controls PKCS #11 concurrency over a particular session) are the same as the module lock (which is the same as the slot lock).

We probably want to decouple the variable consistencies with the PKCS #11 concurrency. That would push the concurrency locks down to the bottom of the calls, just before we make a PKCS #11 call. The consistency lock would be called at a higher level (with the rule you can't use the consistency lock while holding the concurrency lock).

We also should handle the multiple calls to PK11_InitToken with a standard condition variable, so only one caller initializes the token. (the others wait for the initialization, verify that it's initialized and continue one without calling PK11_InitToken.
Flags: needinfo?(rrelyea)
For reference purposes:
- we have another bug with an identical subject: bug 1068351
- that other bug was resolved as a duplicate of bug 1050069
- that bug 1050069 was marked as resolved with this small (probably incomplet) commit:
  https://hg.mozilla.org/projects/nss/rev/c765bf1d4e4b
See Also: → 1068351, 1050069
Attached patch token-init-cvar.patch (obsolete) — Splinter Review
I am not sure if I fully understand Bob's comment, but attaching a patch as a starting-point.  It is based on the following assumptions:

- PK11_ReadMechanism is, despite the name, a local function only called from PK11_TokenInit
- whenever PK11_DoesMechanism is called, slot->mechanismList should be already allocated and initialized

Given that, the patch protects PK11_TokenInit with a conditional variable, as used in nss_Init.
Attachment #8934164 - Flags: review?(rrelyea)
Comment on attachment 8934164 [details] [diff] [review]
token-init-cvar.patch

Review of attachment 8934164 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/pk11wrap/pk11slot.c
@@ +426,5 @@
>      return slot;
> +
> +loser:
> +    if (slot->sessionLock) {
> +        PZ_DestroyLock(slot->sessionLock);

BUG: This doesn't look correct

Line 365 ties slot->sessionLock only allocates a new lock if mod->isThreadSafe - otherwise, it uses mod->refLock. This would blow out the mod->refLock

It seems you should be tying this to mod->isThreadSafe as well
Attached patch isPresent_condition.patch (obsolete) — Splinter Review
This is an untested but builds patch to single thread the isPresent use of PK11_InitToken().

There are other uses, but the isPresent one is the one that is likely to be hitting other threads.

A similar set of code inside PK11_InitToken() should probably be created, except it should allow each call to PK11_InitToken() to proceed once the previous is finished. That's because the other usages are cases where we did something that could change the cached Token state (changing the password could change the token from needsLogin=TRUE to needsLogin=FALSE, for instance), so we would need to proceed with an InitToken once the previous InitToken completes. That is not included in this patch since it's not a race condition we've seen in the wild. A separate bug should be written to handle that.

bob
Comment on attachment 8934366 [details] [diff] [review]
isPresent_condition.patch

Review of attachment 8934366 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/dev/devslot.c
@@ +111,5 @@
>  
> +/* when we leave is nssSlot_IsTokenPresent once we set up
> + * the condition, we need to do a number of things:
> + * 1) set the lastTokenPing time so the future calls within
> + * the window returns our old status, 2) clean up our condition

Comment nit: Would it make sense to add a newline between "," ad "2)"

@@ +159,5 @@
> +    /* if we were one of multiple threads here, the first thread will have
> +     * given us the answer, no need to make more queries of the token. */
> +    if (within_token_delay_period(slot)) {
> +        PZ_Unlock(slot->isPresentLock);
> +        return ((slot->ckFlags & CKF_TOKEN_PRESENT) != 0);

Doesn't this have a race condition with ckFlags still?

Would it make more sense to stash the present status to a stack-local variable, unlock, and then return that stack local?

@@ +163,5 @@
> +        return ((slot->ckFlags & CKF_TOKEN_PRESENT) != 0);
> +    }
> +    /* this is the winning thread, block all others until we've determined
> +     * if the token is present and that it needs initialization. */
> +    slot->inIsPresent = PR_TRUE; 

nit: trailing space is present here

@@ +184,3 @@
>              return PR_FALSE;
>          }
> +        nssSlot_ExitMonitor(slot);

Is this correct? The slot monitor is exited on line 172 and it's not re-entered before this line.

@@ +242,5 @@
>      nssrv = nssSlot_Refresh(slot);
>      if (nssrv != PR_SUCCESS) {
>          slot->token->base.name[0] = 0; /* XXX */
>          slot->ckFlags &= ~CKF_TOKEN_PRESENT;
>          /* TODO: insert a barrier here to avoid reordering of the assingments */

Is this TODO still applicable, given the CV firing and locking?
Attachment #8934366 - Flags: review-
Comment on attachment 8934164 [details] [diff] [review]
token-init-cvar.patch

As Bob told us that his patch will likely fix the original issue, I withdraw this one.
Attachment #8934164 - Attachment is obsolete: true
Attachment #8934164 - Flags: review?(rrelyea)
Bob, will you respond to Ryan's comments? Do you need to attach a new patch?
Flags: needinfo?(rrelyea)
Comment on attachment 8934366 [details] [diff] [review]
isPresent_condition.patch

Review of attachment 8934366 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/dev/devslot.c
@@ +159,5 @@
> +    /* if we were one of multiple threads here, the first thread will have
> +     * given us the answer, no need to make more queries of the token. */
> +    if (within_token_delay_period(slot)) {
> +        PZ_Unlock(slot->isPresentLock);
> +        return ((slot->ckFlags & CKF_TOKEN_PRESENT) != 0);

So you are saying:
    unsigned slot_ckFlags = slot->ckFlags;
    PZ_Unloack(slot->isPresentLock);
    return (slot_ckflags & CKF_TOKEN_PRESENT) != 0);

I think that's a reasonable adjustment to the patch, though in practice all the threads waiting on the condition should fall through to here without any new threads entering the this function (they'd be short circuited above).

@@ +184,3 @@
>              return PR_FALSE;
>          }
> +        nssSlot_ExitMonitor(slot);

It's incorrect. it's left from another version of the patch.
Attached patch isPresentCondition.patch (obsolete) — Splinter Review
Since Ryan was kind enough to make the initial review (thanks), requesting an update review.
Attachment #8934366 - Attachment is obsolete: true
Flags: needinfo?(rrelyea)
Attachment #8935493 - Flags: review?(ryan.sleevi)
re comment 25: yes and yes;).
Comment on attachment 8934366 [details] [diff] [review]
isPresent_condition.patch

Review of attachment 8934366 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/dev/devslot.c
@@ +159,5 @@
> +    /* if we were one of multiple threads here, the first thread will have
> +     * given us the answer, no need to make more queries of the token. */
> +    if (within_token_delay_period(slot)) {
> +        PZ_Unlock(slot->isPresentLock);
> +        return ((slot->ckFlags & CKF_TOKEN_PRESENT) != 0);

Correct. The short-circuit (within_token_delay_period) is still racey, but the above pattern would avoid that race :)
Comment on attachment 8935493 [details] [diff] [review]
isPresentCondition.patch

Review of attachment 8935493 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure whether to set r- on a "Looks structurally sound, but could you explain why we're doing it this way", so leaving an r? and looking for feedback :)

::: lib/dev/devslot.c
@@ +112,5 @@
> +/* when we leave is nssSlot_IsTokenPresent once we set up
> + * the condition, we need to do a number of things:
> + * 1) set the lastTokenPing time so the future calls within
> + * the window returns our old status,
> + * 2) clean up our condition state and wake up all the other threads 

unnecessary trailing whitespace

@@ +116,5 @@
> + * 2) clean up our condition state and wake up all the other threads 
> + * that were were trying to get the token status at the same time we did.
> + * They should now short circuit within the delay period rather than each
> + * trying to get their the slot info on their own. */
> +#define NSS_IS_PRESENT_EXIT_CONDITION(slot) \

Is there a reason you went with the macro approach rather than the "rv" approach?

That is, I can see

PRBool rv = PR_FALSE;

...

done:
  slot->lastTokenPing = ...
  PZ_Lock(...);
  ...
  PZ_Unlock(...);
  return rv;

(Or even two labels, so that the early shortcircuits on 139, 144, and 151 can 'goto done' while the others 'goto notify_and_cleanup').

This seems more in line with existing NSS style throughout the codebase.

@@ +117,5 @@
> + * that were were trying to get the token status at the same time we did.
> + * They should now short circuit within the delay period rather than each
> + * trying to get their the slot info on their own. */
> +#define NSS_IS_PRESENT_EXIT_CONDITION(slot) \
> +        slot->lastTokenPing = PR_IntervalNow();\

It'd be good to have a consistent style between the ";\" and "; \"

Suggestion: "; \" for all four lines

::: lib/pk11wrap/dev3hack.c
@@ +119,5 @@
>      rvSlot->slotID = nss3slot->slotID;
>      /* Grab the slot name from the PKCS#11 fixed-length buffer */
>      rvSlot->base.name = nssUTF8_Duplicate(nss3slot->slot_name, td->arena);
>      rvSlot->lock = (nss3slot->isThreadSafe) ? NULL : nss3slot->sessionLock;
> +    rvSlot->isPresentLock= PZ_NewLock(nssiLockOther);

s/isPresentLock= /isPresentLock = /
(In reply to Ryan Sleevi from comment #34)
> 
> I'm not sure whether to set r- on a "Looks structurally sound, but could you
> explain why we're doing it this way", so leaving an r? and looking for
> feedback :)
Assignee: ttaubert → rrelyea
Flags: needinfo?(rrelyea)
Attachment #8935493 - Flags: review?(ryan.sleevi) → review-
Attachment #8935493 - Flags: feedback?(rrelyea)
Attached patch isPresentCondition.patch v3 (obsolete) — Splinter Review
Thanks for the review Ryan. The big thing was removing the Macro and doing a goto, which is consistent NSS style.  I only had 1 label, since there isn't any common clean up in the early cases (that's also pretty consistent NSS style...  early tests just return).

Changing that removes the macro altogether, so no need to deal with the second comment. 

This patch should also deal with the missing space before the = comment as well.

bob
Attachment #8935493 - Attachment is obsolete: true
Attachment #8935493 - Flags: feedback?(rrelyea)
Flags: needinfo?(rrelyea)
Attachment #8938557 - Flags: review?(ryan.sleevi)
Comment on attachment 8938557 [details] [diff] [review]
isPresentCondition.patch v3

Review of attachment 8938557 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to r+ with some comment nits you can fix before committing :)

::: lib/dev/devslot.c
@@ +236,4 @@
>      }
> +done:
> +    /* when we leave is nssSlot_IsTokenPresent once we set up
> +     * the condition, we need to do a number of things:

I'm having trouble parsing this sentence, so I'm not sure if my reword is correct, but

/* Before returning, it's necessary to:
 * 1) Set the lastTokenPing time so that any other threads waiting on this
 *    initialization and any future calls within the initialization window
 *    return the just-computed status.
 * 2) Indicate we're complete, waking up all other threads that may still
 *    be waiting on initialization can progress.
 */

@@ +238,5 @@
> +    /* when we leave is nssSlot_IsTokenPresent once we set up
> +     * the condition, we need to do a number of things:
> +     * 1) set the lastTokenPing time so the future calls within
> +     * the window returns our old status,
> +     * 2) clean up our condition state and wake up all the other threads 

unnecessary trailing whitespace
Attachment #8938557 - Flags: review?(ryan.sleevi) → review+
I see that Bob checked in yesterday, and the backed it out.
The commit can be found here:
  https://hg.mozilla.org/projects/nss/rev/919051b7381e1e7e36c92ed0bbbbbbd7bd71a588

The commit adjusted the comment.

clang-format required changes to whitespace, and scan-build complained about the following new compiler warning:

devslot.c:219:6: warning: Value stored to 'isPresent' is never read
            isPresent = PR_TRUE;

The code is:
        /* token not removed, finished */
        if (isPresent) {
            isPresent = PR_TRUE;
            goto done;
        }

Obviously the isPresent = PR_TRUE assignment is redundant, because it's already true, so obviously it can be removed.

I suggest that Bob double checks if something else was intended to be done, but otherwise, I guess we can check the patch in with the clang-format fixed and this statement removed.
Attached patch 1054373-v4.patch (obsolete) — Splinter Review
Attaching Bob's patch from last night, with code formatting fixed and the redundant statement removed.

Try build running here:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=87c74d1c84ff0646df67f23fb27d795ed5c57edd
try build looked fine.

Bob, I've checked it in for you:
https://hg.mozilla.org/projects/nss/rev/272dde8958e9
Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.35
I'm reopening, because more investigation might be required.

We have a new intermittent assertion failure:

cert.sh: OCSP response creation selftest
cert.sh: perform selftest --------------------------
ocspresp c:/Users/task_1515069813/tests_results/security/localhost.1/serverCA serverCA chain-1-serverCA -f ../tests.pw
Assertion failure: lock->owner == me, at c:/Users/task_1515069803/nspr/Debug/pr/src/threads/combined/../../../../../pr/src/threads/combined/prulock.c:319
cert.sh: #365: perform selftest (3)  - FAILED

This was seen only once in many test cycles. But it seems the patch introduced a new race?
Status: RESOLVED → REOPENED
Flags: needinfo?(rrelyea)
Resolution: FIXED → ---
https://hg.mozilla.org/projects/nss/rev/db89b28b84e3e6f05a38a31075f341f016a494a7

Backed out again for the same issue as seen in comment 41.  This seems to happen with tstclnt and a number of different utilities.

https://treeherder.mozilla.org/logviewer.html#?job_id=154244528&repo=nss&lineNumber=3057

This is probably a latent bug that this is exposing, so we should probably investigate further.
It's hard to see it. I've been trying to reproduce it locally without success. I single stack traceback would be infinitely helpful here.
Flags: needinfo?(rrelyea)
The assert is the case where we call a lock on a lock that we already hold.
I think Martin is correct that we are likely exposing a latent bug. The patch itself only adds lock calls to a new lock it creates. The locks always frees before exiting the function, and doesn't make any calls while holding the lock that could be reenterant. More telling, however, is in the tests that are failing, we should never be getting into the new code anyway. The code should only be active for removable tokens, which softoken (normally) is not.
The assertion failure happens inside PR_Unlock.

The assertion can be triggered in the following scenarios:
(a) it was locked by a different thread
(b) data corruption, an out of bounds write changed lock->owner
(c) lock has already been destroyed and lock->owner contains invalid data
(d) lock was never initialized, lock pointer points to random data, and lock destroy is attempted

Apparently the attached patch triggers one of the above scenarios more often. Which one is it most likely?

For (d), is it certain that all code added in the patch does reach a working init for the new locks?

For (c), should we clear the memory of each destroyed lock, after destruction, to ensure this gets reliably triggered whenever it happens, and cannot be hidden/delayed by memory still containing valid lock information, after destruction?
(In reply to Ryan Sleevi from comment #27)
> @@ +242,5 @@
> >      nssrv = nssSlot_Refresh(slot);
> >      if (nssrv != PR_SUCCESS) {
> >          slot->token->base.name[0] = 0; /* XXX */
> >          slot->ckFlags &= ~CKF_TOKEN_PRESENT;
> >          /* TODO: insert a barrier here to avoid reordering of the assignments */
> 
> Is this TODO still applicable, given the CV firing and locking?

As far as I understand, the TODO is still applicable.  Even the last available version of the patch does not ensure that slot->ckFlags is updated before slot->lastTokenPing and does not prevent other threads from observing (and acting upon) the partially updated slot state, does it?
(In reply to Kamil Dudka from comment #47)
> As far as I understand, the TODO is still applicable.  Even the last
> available version of the patch does not ensure that slot->ckFlags is updated
> before slot->lastTokenPing and does not prevent other threads from observing
> (and acting upon) the partially updated slot state, does it?

I believe it does, with the assumption that PZ_Lock/Unlock serve as memory fences.

The read of isInPresent (line 144) is only done within isPresentLock
The read of lastTokenPing (line 149) is only done within isPresentLock, and only if isInPresent == FALSE
The read of ckFlags (line 150) is only done within isPresentLock, and only if isInPresent == FALSE

The write of isInPresent (line 248) is only done within isPresentLock
The write of lastTokenPing (line 246) is only done while isInPresent == TRUE, without the lock - but is synchronized behind the lock that will set isInPresent = FALSE (line 248)
The write of ckFlags (line 234) is only done while isInPresent == TRUE, without the lock - but is synchronized behind the lock that will set isInPresent = FALSE (line 248)

So the writes are synchronized behind the lock to set isInPresent = FALSE, while the reads are synchronized behind the lock to check isInPresent's state.
(In reply to Kai Engert (:kaie:) from comment #38)
> and scan-build complained about the following new compiler warning:
> 
> devslot.c:219:6: warning: Value stored to 'isPresent' is never read
>             isPresent = PR_TRUE;
> 
> The code is:
>         /* token not removed, finished */
>         if (isPresent) {
>             isPresent = PR_TRUE;
>             goto done;
>         }

That was invaluably useful finding by the static analyzers.

> Obviously the isPresent = PR_TRUE assignment is redundant, because it's
> already true, so obviously it can be removed.

... but your conclusion was totally incorrect.

Now you have two instances of the isPresent variable in nssSlot_IsTokenPresent(), one of them shadowing the other.  Consequently, if the isPresent variable in the inner scope is updated to PR_TRUE, the same variable in the outer scope remains PR_FALSE and that value is mistakenly returned out of nssSlot_IsTokenPresent() later on.

Note that this flaw broke the basic functionality of curl.  You should seriously consider adding a curl smoke test to your CI.
(In reply to Kamil Dudka from comment #49)
> 
> Now you have two instances of the isPresent variable in
> nssSlot_IsTokenPresent(), one of them shadowing the other.  Consequently, if
> the isPresent variable in the inner scope is updated to PR_TRUE, the same
> variable in the outer scope remains PR_FALSE and that value is mistakenly
> returned out of nssSlot_IsTokenPresent() later on.

Kamil, thank you very much for finding this bug in the patch!

Bob, could you please update the patch?
Flags: needinfo?(rrelyea)
(In reply to Ryan Sleevi from comment #48)
> I believe it does, with the assumption that PZ_Lock/Unlock serve as memory
> fences.

I am not sure which code snapshot the line numbers in your comment can be mapped to ... but I can see a block of code in nssSlot_IsTokenPresent() that accesses both slot->lastTokenPing and slot->ckFlags without enforcing any synchronization:

      PRBool isPresent = PR_FALSE;

      /* permanent slots are always present unless they're disabled */
      if (nssSlot_IsPermanent(slot)) {
          return !PK11_IsDisabled(slot->pk11slot);
      }

      /* avoid repeated calls to check token status within set interval */
      if (within_token_delay_period(slot)) {
          return ((slot->ckFlags & CKF_TOKEN_PRESENT) != 0);
      }
(In reply to Kamil Dudka from comment #51)
> (In reply to Ryan Sleevi from comment #48)
> > I believe it does, with the assumption that PZ_Lock/Unlock serve as memory
> > fences.
> 
> I am not sure which code snapshot the line numbers in your comment can be
> mapped to ... 

Apologies, https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1054373&attachment=8938557

> but I can see a block of code in nssSlot_IsTokenPresent() that
> accesses both slot->lastTokenPing and slot->ckFlags without enforcing any
> synchronization:

Gosh, I completely missed that - yes, that would be unsafe/racy as you mentioned. The lock on Line 143 of https://bugzilla.mozilla.org/attachment.cgi?id=8938557&action=diff#a/lib/dev/devslot.c_sec3 would need to be moved to line 129, would you agree (in addition to the other bug you pointed, the shadowing on line 203 )
Attached patch isPresent.patch (obsolete) — Splinter Review
Here's the patch handling both problems that kamel pointed out.

NOTE: I'm working to set up a try for this patch. I don't think the fixes solve the tstclient asserts because I've verified tstclient never gets into the body of the isPresent code. When I place an assert(0) in the body, only certutil --merge crashes.

This means the race or memory corruption isn't in the isPresent code itself (though it might be in the initialization/free code in this patch).
Flags: needinfo?(rrelyea)
Attachment #8938557 - Attachment is obsolete: true
Attachment #8939753 - Attachment is obsolete: true
Attachment #8942359 - Attachment is obsolete: true
Comment on attachment 8942359 [details] [diff] [review]
isPresent.patch

Review of attachment 8942359 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/dev/devslot.c
@@ +249,1 @@
>      slot->lastTokenPing = PR_IntervalNow();

Arg. This was supposed to move within the isPresentLock below... (sigh).
Comment 55 applies to the patch posted in comment 53. The patch posted in comment 54 fixes the issue.
The try run looks good.  I re-ran a few of the targets and I couldn't get the intermittent error, so I'm tentatively saying that you have fixed this one.
Comment on attachment 8942364 [details] [diff] [review]
isPresentv5.patch

Functionally, it looks good.

>@@ -162,61 +190,73 @@ nssSlot_IsTokenPresent(
>         if (slot->token->base.name[0] != 0) {
>             /* notify the high-level cache that the token is removed */
>             slot->token->base.name[0] = 0; /* XXX */
>             nssToken_NotifyCertsNotVisible(slot->token);
>         }
>         slot->token->base.name[0] = 0; /* XXX */
>         /* clear the token cache */
>         nssToken_Remove(slot->token);
>-        slot->lastTokenPing = PR_IntervalNow();
>-        return PR_FALSE;
>+        isPresent = PR_FALSE;
>+        goto done;
>     }
>     /* token is present, use the session info to determine if the card
>      * has been removed and reinserted.
>      */
>     session = nssToken_GetDefaultSession(slot->token);
>     if (session) {
>-        PRBool isPresent = PR_FALSE;
>+        PRBool  tokenNotRemoved = PR_FALSE;

An outstanding space inserted here ^^^.

>         nssSession_EnterMonitor(session);
>         if (session->handle != CK_INVALID_SESSION) {
>             CK_SESSION_INFO sessionInfo;
>             ckrv = CKAPI(epv)->C_GetSessionInfo(session->handle, &sessionInfo);
>             if (ckrv != CKR_OK) {
>                 /* session is screwy, close and invalidate it */
>                 CKAPI(epv)
>                     ->C_CloseSession(session->handle);
>                 session->handle = CK_INVALID_SESSION;
>             }
>         }
>-        isPresent = session->handle != CK_INVALID_SESSION;
>+        tokenNotRemoved = session->handle != CK_INVALID_SESSION;

The above statement contains three negations in a row, which made me read it
three times before actually understanding it.  Perhaps using !tokenRemoved
instead of tokenNotRemoved would improve the readability?
I mean something like this:

--- a/nss/lib/dev/devslot.c
+++ b/nss/lib/dev/devslot.c
@@ -203,7 +203,7 @@ nssSlot_IsTokenPresent(
      */
     session = nssToken_GetDefaultSession(slot->token);
     if (session) {
-        PRBool  tokenNotRemoved = PR_FALSE;
+        PRBool tokenRemoved;
         nssSession_EnterMonitor(session);
         if (session->handle != CK_INVALID_SESSION) {
             CK_SESSION_INFO sessionInfo;
@@ -215,10 +215,10 @@ nssSlot_IsTokenPresent(
                 session->handle = CK_INVALID_SESSION;
             }
         }
-        tokenNotRemoved = session->handle != CK_INVALID_SESSION;
+        tokenRemoved = (session->handle == CK_INVALID_SESSION);
         nssSession_ExitMonitor(session);
         /* token not removed, finished */
-        if (tokenNotRemoved) {
+        if (!tokenRemoved) {
             isPresent = PR_TRUE;
             goto done;
         }
r+ for kamel's patch. I almost submitted a patch with exactly that wording.
based on Kamel's review and martin's comment I'll check in the patch with comment Kamel's comment 60 update.
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: 3.35 → 3.36
This doesn't seem entirely correct according to Coverity. Could you check if this is an issue Bob?

*** CID 1427599:  Concurrent data access violations  (MISSING_LOCK)
/lib/dev/devslot.c: 93 in nssSlot_ResetDelay()
87     }
88
89     NSS_IMPLEMENT void
90     nssSlot_ResetDelay(
91         NSSSlot *slot)
92     {
>>>     CID 1427599:  Concurrent data access violations  (MISSING_LOCK)
>>>     Accessing "slot->lastTokenPing" without holding lock "PRLock.mutex". Elsewhere, "NSSSlotStr.lastTokenPing" is accessed with "PRLock.mutex" held 1 out of 2 times (1 of these accesses strongly imply that it is necessary).
93         slot->lastTokenPing = 0;
94     }
95
96     static PRBool
97     within_token_delay_period(const NSSSlot *slot)
98     {
Flags: needinfo?(rrelyea)
That should be fine. I looked at that use before attaching the patch.

bob
Flags: needinfo?(rrelyea)
It doesn't look OK to me.  The code that updates and accesses the value is called under slot->isPresentLock every time.  It's not safe to assume that this write is effective without the lock.  If you want the zero value to propagate to other threads, then you need a lock to ensure that they don't continue to use cached values.
If lastTokenPing=0 wins a race, then the minor effect is:
  The caching was ineffective, the next check is performed within a shorter period of time.

If lastTokenPing=PR_IntervalNow() wins a race, then the ResetDelay request from another thread has failed.

If it's a requirement that the ResetDelay request always succeeds, then it seems necessary to ensure there's no race.
Flags: needinfo?(rrelyea)
 Things work if reset is a no-up, it waits the entire PR_Interval before the token is visible. In the race case, it doesn't matter that someone this the PR_IntervalNow() first. The whole PR_Interval is an optimization to keep from pinging the token to often (we usually get 'isTokenPresent' requests in fast clusters together. Only one needs to know the answer). They are all racing the actual token insertion/removal, so it has to be robust against actually winning or losing races.

The locks are meant to protect races getting multiple isPresent calls into the body of isPresent. The lock isn't protecting access to the interval timer.
Flags: needinfo?(rrelyea)
 Things work if reset is a no-op, it just waits the entire PR_Interval before the token is visible. It's an optimization in code where we know we've changed the state (and usually it's called by a synchronous call to ispresent which shouldn't race). In the race case, it doesn't matter that someone this the PR_IntervalNow() first. The whole PR_Interval is an optimization to keep from pinging the token to often (we usually get 'isTokenPresent' requests in fast clusters together. Only one needs to know the answer). They are all racing the actual token insertion/removal, so it has to be robust against actually winning or losing races.

The locks are meant to protect races getting multiple isPresent calls into the body of isPresent. The lock isn't protecting access to the interval timer.
Hmm so actually, I do need to protect the lastTokenPing, but the simple lock won't be sufficient. I'll have to check lastTokenPing == magic before I set it to PR_IntervalNow().
Attached patch race.update (obsolete) — Splinter Review
patch to fix the race condition. Also fixes theoretical collisions between reset state and PR_IntervalNow().
That basic argument doesn't apply to my (flawed) proposed logic in comment 69. (it's not "will we fail to notice a change in our running code", my argument was "it didn't matter if a particular isPresent detected a change or not"). The falacy of the argument is as follows.


Say isPresent is called and it isn't in the lastTokenPing interval and goes and reads the current state. After reading the state, but before updating the lastTokenPing time our code does something programmatically that changes the token state (like load a new database into a previous empty slot). The code calls reset(). This code path will call isPresent again sometime after the reset expecting to get the new state information. We wind up returning the the previous thread first (the current locking that was added as a result of this bug guarantees this, no thread can complete the body of isPresent until the previous thread has finished. That thread will load the previous token state in the cache and set the lastTokenPing, which our new isPresent call duly picks up. NOTE: we can't just lock around out setting of the variable (though we do need to add locks), because the lock isn't held in the body if 'isPresent'. Instead, we give Reset a unique value. If we detect that lastTokenPing was reset, we don't modify the lastTokenPing time. This will allow some other thread (like the isPresent call after the reset()) to get a fresh state. The particular isPresent call will still return the old state, which may no longer be accurate, but that's allowed in this function (that's the kind of race we have to be tolerant of because someone can remove or insert a token on us in the middle and we can't possibly lock that condition from racing).
Attachment #8948593 - Flags: review?(martin.thomson)
Comment on attachment 8948593 [details] [diff] [review]
race.update

Review of attachment 8948593 [details] [diff] [review]:
-----------------------------------------------------------------

BTW, splinter is the worst tool for this sort of review.

::: lib/dev/devslot.c
@@ +92,5 @@
>      return nssToken_GetName(slot->token);
>  }
>  
>  NSS_IMPLEMENT void
>  nssSlot_ResetDelay(

If I understand this function, setting this to zero prevents IsTokenPresent from ever doing anything meaningful.  Is that really the desired outcome?  It seems rather permanent.

@@ +97,5 @@
>      NSSSlot *slot)
>  {
> +    PZ_Lock(slot->isPresentLock);
> +    slot->lastTokenPing = NSSSLOT_RESET_STATE;
> +    PZ_Unlock(slot->isPresentLock);

I appreciate the locking here.  I concede that it might be just an abundance of caution, but is helps.

I'm less sure about the overloading of lastTokenPing for storing this state.  I think that you really want to change the type of inIsPresent to an int and use that for storing the state.  That will mean that ResetDelay can check if another thread is currently probing by reading inIsPresent.

You will have to change the code at the end of IsTokenPresent so that it checks for whatever the reset state is before restoring the state to the default.
> If I understand this function, setting this to zero prevents IsTokenPresent from ever doing anything meaningful.  Is that really the desired outcome?  It seems rather permanent.

No, if it's set to zero IsTokenPresent no longer short circuits. If we are in reset or getting state, within_token_delay_period always returns false, which causes the body of IsTokenPresent to execute.

> I appreciate the locking here.  I concede that it might be just an abundance of caution, but is helps.

Actually it is needed, otherwise you race the update of the lastTokenPing state in IsTokenPresent. Getting the lock means that the update happens either before the final lock sequence (in which case it respects that the reset happened while fetching the token state and does not update lastTokenPing) or after the final lock sequence (with causes the next call to IsTokenPresent to get the new state). If you don't have the lock you race the 'is getting state check'

> I'm less sure about the overloading of lastTokenPing for storing this state.  I think that you really want to change the type of
> inIsPresent to an int and use that for storing the state.  That will mean that ResetDelay can check if another thread is currently
> probing by reading inIsPresent.

I thought about that, and no I can't. The state we are referring to is the state of the lastTokenPing value. inIsPresent needs to maintain it's state as long as there is a thread running in the inIsPresent, even after a reset or new threads could enter the inIsPresent body (which is what your condition is trying to avoid). I could include another state variable for the lastTokenPing state, but it would say that the lastTokenPing state only valid if the state is a new 'VALID' state. I'd have pretty much the same checks as I have now except I don't need the collision check.

> You will have to change the code at the end of IsTokenPresent so that it checks for whatever the reset state is before restoring the state to the default.

Yeah I could actually do either check now, but it's mute because I can't change the inIsPresent flag in reset anyway.

bob
(In reply to Robert Relyea from comment #75)
> > If I understand this function, setting this to zero prevents IsTokenPresent from ever doing anything meaningful.  Is that really the desired outcome?  It seems rather permanent.
> 
> No, if it's set to zero IsTokenPresent no longer short circuits. If we are
> in reset or getting state, within_token_delay_period always returns false,
> which causes the body of IsTokenPresent to execute.

I see, I think that I misread the intent of the code.  ResetDelay causes the next call of IsTokenPresent to run in full.

> > I'm less sure about the overloading of lastTokenPing for storing this state.  I think that you really want to change the type of
> > inIsPresent to an int and use that for storing the state.  That will mean that ResetDelay can check if another thread is currently
> > probing by reading inIsPresent.
> 
> I thought about that, and no I can't. The state we are referring to is the
> state of the lastTokenPing value. inIsPresent needs to maintain it's state
> as long as there is a thread running in the inIsPresent, even after a reset
> or new threads could enter the inIsPresent body (which is what your
> condition is trying to avoid). I could include another state variable for
> the lastTokenPing state, but it would say that the lastTokenPing state only
> valid if the state is a new 'VALID' state. I'd have pretty much the same
> checks as I have now except I don't need the collision check.

If the states were CHECKING, CHECKING_BUT_RESET, NOT_CHECKING, and RESET, then you could track all potential states.  That actually looks like two state variables.

A PRIntervalTime value of 0 is not special, so the gymnastics you go through here to avoid that value made me nervous.  Better to explicitly track state.

Don't call it inIsPresent, call it presenceCheckingState or something like that.
>If the states were CHECKING, CHECKING_BUT_RESET, NOT_CHECKING, and RESET, then you could track all potential states. 

That becomes confusing because we are mixing 2 separate states. Once is 'is there a thread in the body of isPresent' and the other is 'what is the state of lastTokenPing'.

> That actually looks like two state variables.

Agreed. I would be OK added a second state variable. 

> A PRIntervalTime value of 0 is not special, so the gymnastics you go through here to avoid that value made me nervous.  Better to explicitly track state.

The treating of zero as special isn't new in the code. It was necessary even before the reset code as lastTokenPing is initialized to zero on startup. What is new was I needed a new state to detect whether or not the reset happened while we were in the body of the thread. (if the reset happened before, then we can update lastTokenPing as normal). That added yet another special value, which probably warrants it's own state variable.

> Don't call it inIsPresent, call it presenceCheckingState or something like that.

Normally I'm ambivalent to name change suggestions, but in this case I don't know what presenceCheckingState means, but inIsPresent describes the state exactly. I'll have inIsPresent and lastTokenPingState.
Attachment #8948593 - Attachment is obsolete: true
Attachment #8948593 - Flags: review?(martin.thomson)
Attachment #8950772 - Flags: review?(martin.thomson)
Comment on attachment 8950772 [details] [diff] [review]
Race Update with Martin's comments incorporated

Review of attachment 8950772 [details] [diff] [review]:
-----------------------------------------------------------------

This looks much better, thanks for indulging me.

You will need to run clang-format to get this to stick on landing, but it otherwise looks great.
Attachment #8950772 - Flags: review?(martin.thomson) → review+
You need to log in before you can comment on or make changes to this bug.