Closed Bug 1050069 Opened 5 years ago Closed 5 years ago

389-ds-base server reports crash in stan_GetCERTCertificate under the replication replay failure condition

Categories

(NSS :: Libraries, defect)

3.15.2
x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: elio.maldonado.batiz, Assigned: elio.maldonado.batiz)

References

Details

Attachments

(3 files)

As originally reprted by Sankar Ramalingam 2014-05-05 14:38:18 EDT

Description of problem: Directory Server reported a crash when running acceptance tests.

Version-Release number of selected component (if applicable): 389-ds-base-1.3.1.6-25

How reproducible: Consistently

Steps to Reproduce:
1. Run automated acceptance for mmrepl/accept tests
2. Clone a similar beaker job - https://beaker.engineering.redhat.com/jobs/643479
3. Observe the test results - 15/66 FAIL

Actual results: Test failures reported with trac47606 tests. Server crashed when trying initialize one of the consumer.

Expected results: No crash reported.

Additional info:

The crash is in NSS....

Program terminated with signal 11, Segmentation fault.
#0  PR_EnterMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:527
527        if (!pthread_equal(mon->owner, self))
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.3-4.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 nss-softokn-3.15.4-2.el7.x86_64 nss-softokn-freebl-3.15.4-2.el7.x86_64 nss-util-3.15.4-2.el7.x86_64 sqlite-3.7.17-4.el7.x86_64 xz-libs-5.1.2-8alpha.el7.x86_64 zlib-1.2.7-13.el7.x86_64
(gdb) bt
#0  PR_EnterMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:527
#1  0x00007f9fe29c74b6 in nssPKIObject_Lock (object=object@entry=0x7f9fc82836f0) at pkibase.c:22
#2  0x00007f9fe29c96e9 in stan_GetCERTCertificate (c=0x7f9fc82836f0, forceUpdate=forceUpdate@entry=1)
    at pki3hack.c:858
#3  0x00007f9fe29c9d04 in STAN_ForceCERTCertificateUpdate (c=<optimized out>) at pki3hack.c:914
#4  0x00007f9fe29c59c0 in nssTrustDomain_RemoveTokenCertsFromCache (td=0x7f9fe6101c60,
    token=<optimized out>) at tdcache.c:448
#5  0x00007f9fe298627c in nssToken_NotifyCertsNotVisible (tok=<optimized out>) at dev3hack.c:303
#6  0x00007f9fe29cac59 in nssSlot_IsTokenPresent (slot=0x7f9fe624e030) at devslot.c:172
#7  0x00007f9fe29cd66e in nssToken_IsPresent (token=<optimized out>) at devtoken.c:1442
#8  0x00007f9fe29af5d4 in pk11_IsPresentCertLoad (slot=0x7f9fbc030aa0, loadCerts=1)
    at pk11slot.c:1436
#9  0x00007f9fe29b28b1 in SECMOD_CloseUserDB (slot=0x7f9fbc030aa0) at pk11util.c:1522
#10 0x00007f9fe312bc73 in tlsm_ctx_free (ctx=0x7f9fbc009710) at tls_m.c:2148
#11 0x00007f9fe3127c85 in ldap_int_tls_destroy (lo=0x7f9fbc000e80) at tls2.c:105
#12 0x00007f9fe310c6d7 in ldap_ld_free (ld=0x7f9fbc000dd0, close=close@entry=1,
    sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0) at unbind.c:209
#13 0x00007f9fe310c807 in ldap_unbind_ext (ld=<optimized out>, sctrls=sctrls@entry=0x0,
    cctrls=cctrls@entry=0x0) at unbind.c:52
#14 0x00007f9fe3d2d3de in slapi_ldap_unbind (ld=<optimized out>) at ldap/servers/slapd/ldaputil.c:148
#15 0x00007f9fd97cbf99 in close_connection_internal (conn=conn@entry=0x7f9fe64211b0)
    at ldap/servers/plugins/replication/repl5_connection.c:1203
#16 0x00007f9fd97cc8c6 in perform_operation (conn=0x7f9fe64211b0, optype=optype@entry=1,
    dn=<optimized out>, attrs=attrs@entry=0x7f9fbc080de0, newrdn=newrdn@entry=0x0,
    newparent=newparent@entry=0x0, deleteoldrdn=deleteoldrdn@entry=0,
    update_control=update_control@entry=0x7f9fbc056430, extop_oid=extop_oid@entry=0x0,
    extop_payload=extop_payload@entry=0x0, message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_connection.c:738
#17 0x00007f9fd97cd01d in conn_send_add (conn=<optimized out>, dn=<optimized out>,
    attrs=attrs@entry=0x7f9fbc080de0, update_control=update_control@entry=0x7f9fbc056430,
    message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_connection.c:771
#18 0x00007f9fd97cf94a in replay_update (prp=prp@entry=0x7f9fe64465f0, op=0x7f9fd1f1ccd0,
    message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:1362
#19 0x00007f9fd97d0e56 in send_updates (num_changes_sent=0x7f9fd1f1cc60,
    remote_update_vector=<optimized out>, prp=0x7f9fe64465f0)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:1719
#20 repl5_inc_run (prp=<optimized out>) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1056
#21 0x00007f9fd97d5a5c in prot_thread_main (arg=0x7f9fe6422c30)
    at ldap/servers/plugins/replication/repl5_protocol.c:296
#22 0x00007f9fe233a740 in _pt_root (arg=0x7f9fe641ff20)
    at ../../../nspr/pr/src/pthreads/ptthread.c:204
#23 0x00007f9fe1cdbdf3 in start_thread (arg=0x7f9fd1f1d700) at pthread_create.c:308
#24 0x00007f9fe1a093dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


# rpm -q nss
nss-3.15.4-6.el7.x86_64

This is the latest NSS for rhel-7.0.
Just in case, other nss related package versions.
nss-3.15.4-6.el7.x86_64
nss-softokn-freebl-3.15.4-2.el7.x86_64
nss-util-3.15.4-2.el7.x86_64
nss-tools-3.15.4-6.el7.x86_64
nss-debuginfo-3.15.4-6.el7.x86_64
nss-softokn-3.15.4-2.el7.x86_64
nss-sysinit-3.15.4-6.el7.x86_64
The problem is reproducible with nss-3.16.2.
I'm submitting this patch on behalf of Bob Relyea.
Attachment #8469289 - Flags: review?(wtc)
Comment on attachment 8469289 [details] [diff] [review]
fix crash in stan_GetCERTCertificate

Review of attachment 8469289 [details] [diff] [review]:
-----------------------------------------------------------------

r=wtc. We should give this patch a high priority. Chrome on Linux and Chrome OS
have been plagued by a crash that occurs after importing a certificate into the
certificate database. I wonder if it's related to this bug.

I didn't try to understand why we need to call AddRef in these functions.
Is there a simple rule to determine when we should call AddRef on an object?
Or are you just blindly calling AddRef? It would be good to consult Bob on
this and have criteria for needing AddRef documented in the source code, so
that someone other than Bob will be able to maintain this code in the future.

::: lib/pki/pki3hack.c
@@ +853,5 @@
>  {
>      nssDecodedCert *dc = NULL;
>      CERTCertificate *cc = NULL;
>      CERTCertTrust certTrust;
> +    nssPKIObject *object = &c->object;

I think the code is easier to understand without the 'object' variable.

Alternatively, on line 892 (of the original code) there is one occurrence
of c->object that should also be replaced by 'object':

 891     } else if (CERT_GetCertTrust(cc, &certTrust) != SECSuccess &&
 892                !c->object.cryptoContext) {
Attachment #8469289 - Flags: review?(wtc) → review+
Flags: needinfo?(rrelyea)
Assignee: nobody → rrelyea
Elio, Bob, what's stopping this from getting checked in?
I don't know why Elio pinged this back to me. I think he just wanted me to answer wtc's questions. I'll do so here, but Elio, you have the r+, you can check in.

> I didn't try to understand why we need to call AddRef in these functions.

It's because it's not all that clear. It seems certain functions in pki will delete the existing object. The addref makes sure they don't go away while you are holding the object lock.

> Is there a simple rule to determine when we should call AddRef on an object?

No, the criteria for this one was were we calling nssCryptokiObject_Destroy somewhere here.


> Or are you just blindly calling AddRef? 

Somewhat. The main one in stan_CERTCertificateUpdate is because that's where we were crashing, and debugging showed that we were hitting a condition where the object was going away on us. The rest were other places in the file that we were calling nssCryptokiObject_Destroy on something that looked like a cert in a situation where we were having a net -1 on the reference count.

> It would be good to consult Bob on
> this and have criteria for needing AddRef documented in the source code, so
> that someone other than Bob will be able to maintain this code in the future.

Other than what I just said, I have nothing else to add. Maybe a comment that says this happens, so you may need AddRef elsewhere. The main case was nssTrustDomain_AddCertsToCache fixed in this patch:

https://hg.mozilla.org/projects/nss/diff/204f22c527f8/lib/pk11wrap/pk11cert.c

--------------

> I think the code is easier to understand without the 'object' variable.

Paranoia. Initially it looked like the cert->object field was being clobbered (actually it can't be because it's not a pointer, it's an imbedded struct).

Elio, go ahead and check in, you can make the change wtc suggested on line 892.

bob
Assignee: rrelyea → emaldona
Flags: needinfo?(rrelyea)
(In reply to Robert Relyea from comment #5)
> Elio, go ahead and check in, you can make the change wtc suggested on line
> 892.

That's option 2 of the ones Wan-Teh suggested. 

> I think the code is easier to understand without the 'object' variable.     <---- 1
> Alternatively, on line 892 (of the original code) there is one occurrence
> of c->object that should also be replaced by 'object':                      <---- 2

 891     } else if (CERT_GetCertTrust(cc, &certTrust) != SECSuccess &&
 892                !c->object.cryptoContext) {

If you don't mind, I'll go with the first option.
Pushed: https://hg.mozilla.org/projects/nss/rev/812c461c8daf
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
> If you don't mind, I'll go with the first option.

I prefered option 2, which is why I ased you do to it.
ased -> asked
Is it necessary to backout and reopen this bug?
Flags: needinfo?(rrelyea)
Target Milestone: --- → 3.18
No, getting the fix in was more important than bike shedding.
Flags: needinfo?(rrelyea)
Additional testing by Red Hat QE found that this fix introduced some problems. Patch comming next.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Patch provided by Bob Relyea, slightly adapted for the current sources.
Attachment #8547187 - Flags: review?(wtc)
Comment on attachment 8547187 [details] [diff] [review]
Fixes race condition

Review of attachment 8547187 [details] [diff] [review]:
-----------------------------------------------------------------

I spent about 1.5 hours studying this patch and the related
NSS code. I believe I understand the bug and the proposed fix.

Please go over my review comments and confirm that I understand
this code correctly. I also suggested some changes. Thanks.

::: lib/pki/tdcache.c
@@ +394,1 @@
>      nssPKIObject_AddRef(object);

If I understand the code correctly, this nssPKIObject_AddRef(object) call
and the matching nssPKIObject_Destroy(object) call at the end of this
function are unnecessary.

To make this patch less risky, we don't need to remove these two calls.

@@ +398,5 @@
>  	    nssCryptokiObject_Destroy(object->instances[i]);
>  	    object->instances[i] = object->instances[object->numInstances-1];
>  	    object->instances[object->numInstances-1] = NULL;
>  	    object->numInstances--;
> +	    /* make sure id doesn't disappear on us before we finish */

"id" in this comment is a typo. The function doesn't have any
argument or variable named "id".

I would further point out that we acquire this reference because
we are storing |c| in the dtor->certs array.

@@ +399,5 @@
>  	    object->instances[i] = object->instances[object->numInstances-1];
>  	    object->instances[object->numInstances-1] = NULL;
>  	    object->numInstances--;
> +	    /* make sure id doesn't disappear on us before we finish */
> +	    nssPKIObject_AddRef(object);

I probably would use nssCertificate_AddRef instead. The two AddRef
functions are the same, so it doesn't matter which one we use.

@@ +442,5 @@
>      nssHash_Iterate(td->cache->issuerAndSN, remove_token_certs, (void *)&dtor);
>      for (i=0; i<dtor.numCerts; i++) {
>  	if (dtor.certs[i]->object.numInstances == 0) {
>  	    nssTrustDomain_RemoveCertFromCacheLOCKED(td, dtor.certs[i]);
> +	    nssPKIObject_Destroy(&dtor.certs[i]->object);

Note: please read my next comment first, and then come back to
read this comment. It'll be clearer when read in this order.

Here, we cannot call nssCertificate_Destroy(dtor.certs[i])
because we are holding td->cache->lock and nssCertificate_Destroy
will try to lock td->cache->lock.

This means if dtor.certs[i]->object->refCount is decremented to
0 by nssPKIObject_Destroy, we won't destroy dtor.certs[i]
completely. I believe this problematic case won't happen because
we are holding td->cache->lock, which will prevent other
nssCertificate_Destroy calls.

@@ +450,5 @@
>      PZ_Unlock(td->cache->lock);
>      for (i=0; i<dtor.numCerts; i++) {
>  	if (dtor.certs[i]) {
>  	    STAN_ForceCERTCertificateUpdate(dtor.certs[i]);
> +	    nssPKIObject_Destroy(&dtor.certs[i]->object);

1. I believe it is this code, executed after releasing td->cache->lock,
that requires us to acquire a reference inside the remove_token_certs
function, right?

2. IMPORTANT: I believe we should call nssCertificate_Destroy(dtor.certs[i])
instead. nssCertificate_Destroy does more work than nssPKIObject_Destroy.
It is like a virtual destructor in C++. nssPKIObject_Destroy only destroys
the base class.
If I understand the race condition correctly, I believe
this patch should also work. The advantages of this patch
are that the _AddRef and the matching _Destroy calls are
in the same function, and that we call _AddRef only when
necessary.

I also call the correct destructor nssCertificate_Destroy.

On the other hand, if I misunderstand the bug, then this
patch is wrong.
Attachment #8551460 - Flags: superreview?(rrelyea)
Attachment #8551460 - Flags: review?(emaldona)
Comment on attachment 8547187 [details] [diff] [review]
Fixes race condition

Review of attachment 8547187 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/pki/tdcache.c
@@ +398,5 @@
>  	    nssCryptokiObject_Destroy(object->instances[i]);
>  	    object->instances[i] = object->instances[object->numInstances-1];
>  	    object->instances[object->numInstances-1] = NULL;
>  	    object->numInstances--;
> +	    /* make sure id doesn't disappear on us before we finish */

I think "id" is a typo for "it".
Comment on attachment 8551460 [details] [diff] [review]
Fixes race condition, alternative patch

Review of attachment 8551460 [details] [diff] [review]:
-----------------------------------------------------------------

I think I follow your logic. I was under the impression that the DestroyObject call called through a virtual pointer to destroy the full object type (independent of type), I think calling DestroyCertificate is fine if we *know* the object is a certificate (and not, say, a crl).

That being said I think your patch may be better. I'd like to have our ldap guys try it in their servers.
Attachment #8551460 - Flags: superreview?(rrelyea) → superreview+
See the discussion on Bug 1068351 which make me suspect that the fix may not be a complete after all, we know the original one wasn't, don't know if Wan-Teh's alternative fix will cause the same problem.
I think I was getting confused as I notice attachment title "Fixes race condition, alternative patch" so the appending the other one to this one isn't needed after all.
Flags: needinfo?(wtc)
Duplicate of this bug: 1068351
Comment on attachment 8551460 [details] [diff] [review]
Fixes race condition, alternative patch

Review of attachment 8551460 [details] [diff] [review]:
-----------------------------------------------------------------

r+ based on positive feedback from Noriko Hosoi who ran the QE tests against a test build we provided.
Attachment #8551460 - Flags: review?(emaldona) → review+
Flags: needinfo?(wtc)
Pushed: https://hg.mozilla.org/projects/nss/rev/c765bf1d4e4b
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Flags: needinfo?(wtc)
Resolution: --- → FIXED
Flags: needinfo?(wtc)
Attachment #8547187 - Flags: review?(ekr)
Comment on attachment 8547187 [details] [diff] [review]
Fixes race condition

Review of attachment 8547187 [details] [diff] [review]:
-----------------------------------------------------------------

WTC seems to be handling this.
Attachment #8547187 - Flags: review?(ekr)
Attachment #8547187 - Flags: review?(wtc)
See Also: → 1054373
You need to log in before you can comment on or make changes to this bug.