Closed Bug 489188 Opened 15 years ago Closed 12 years ago

object/memory leak in libpkix

Categories

(NSS :: Libraries, defect, P1)

3.12.3
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.13.6

People

(Reporter: alvolkov.bgs, Assigned: KaiE)

References

Details

(Keywords: memory-leak)

Attachments

(8 files, 1 obsolete file)

Yet another memory leak in libpkix. This time selfserv leaked list of chain checkers that were created in pkix_Build_CheckInCache. The actual reason of the error is not known, but the print out of the leaked object bring some light to the leak problem.

ssl.sh: Stress TLS  ECDH-ECDSA  AES 128 CBC with SHA (no reuse) ----
selfserv starting at Mon Apr 20 00:16:47 PDT 2009
selfserv -D -p 8111 -d ../server -n GORIDE.red.iplanet.com -B -s \
         -e GORIDE.red.iplanet.com-ec -w nss -c :C004 -i ../tests_pid.3120  &
trying to connect to selfserv at Mon Apr 20 00:16:47 PDT 2009
tstclnt -p 8111 -h GORIDE.red.iplanet.com -B -s -q \
        -d ../client -v < /export/tinderlight/data/goride_32_DBG/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt.exe: connecting to GORIDE.red.iplanet.com:8111 (address=192.18.72.38)
kill -0 1456 >/dev/null 2>/dev/null
selfserv with PID 1456 found at Mon Apr 20 00:16:47 PDT 2009
selfserv with PID 1456 started at Mon Apr 20 00:16:47 PDT 2009
strsclnt -q -p 8111 -d ../client -B -s -w nss -2 -c 100 -C :C004 -N \
          GORIDE.red.iplanet.com
strsclnt started at Mon Apr 20 00:16:47 PDT 2009
....
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
0[b52898]: Class Cert leaked 2 objects of size 144 bytes, total = 288 bytes
0[b52898]: Class CertChainChecker leaked 4 objects of size 20 bytes, total = 80 bytes
0[b52898]: Class PolicyCheckerState leaked 1 objects of size 88 bytes, total = 88 bytes
0[b52898]: Class PolicyNode leaked 1 objects of size 28 bytes, total = 28 bytes
0[b52898]: Class CertSelector leaked 1 objects of size 12 bytes, total = 12 bytes
0[b52898]: Class ComCertSelParams leaked 1 objects of size 80 bytes, total = 80 bytes
0[b52898]: Class Date leaked 1 objects of size 8 bytes, total = 8 bytes
0[b52898]: Class Error leaked 1 objects of size 20 bytes, total = 20 bytes
0[b52898]: Class List leaked 23 objects of size 20 bytes, total = 460 bytes
0[b52898]: Class OID leaked 14 objects of size 12 bytes, total = 168 bytes
0[b52898]: Class PublicKey leaked 3 objects of size 4 bytes, total = 12 bytes
0[b52898]: Class SignatureCheckerState leaked 1 objects of size 20 bytes, total = 20 bytes
0[b52898]: Class TargetCertCheckerState leaked 1 objects of size 32 bytes, total = 32 bytes
0[b52898]: Class X500Name leaked 4 objects of size 24 bytes, total = 96 bytes
0[b52898]: Assertion failure: numLeakedObjects == 0, at C:/mozilla-build/msys/export/tinderlight/data/goride_32_DBG/mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_lifecycle.c:294
strsclnt completed at Mon Apr 20 00:16:59 PDT 2009
ssl.sh: #4271: Stress TLS  ECDH-ECDSA  AES 128 CBC with SHA (no reuse) produced a returncode of 3, expected is 0.  - FAILED
trying to kill selfserv with PID 1456 at Mon Apr 20 00:16:59 PDT 2009
kill 1456
./all.sh: line 17:  1456 Terminated              ${PROFTOOL} ${BINDIR}/selfserv -D -p ${PORT} -d ${P_R_SERVERDIR} -n ${HOSTADDR} ${SERVER_OPTIONS} ${ECC_OPTIONS} -w nss ${sparam} -i ${R_SERVERPID} $verbose
selfserv with PID 1456 killed at Mon Apr 20 00:16:59 PDT 2009

Potential leak is at this line 3101 through 3104 of pkix_build.c rev 1.56.
The cleanup of the list of checkers happens only upon successful execution of  pkix_Build_ValidateEntireChain. They are leaked in case of it's failure.

Patch fixes memory leak by destroying object right before they get assigned to a new value.
Attachment #373678 - Flags: review?(nelson)
Attachment #373678 - Flags: review?(nelson) → review+
Comment on attachment 373678 [details] [diff] [review]
Patch v1 - resolt object leak problem

r=nelson
Please verify also this leak (Tinderbox goride MINGW32_NT-5.2 32bit DBG 2009/04/20 16:45:36), if it's caused by the same reason or if it is something different:

ssl.sh: Stress SSL3 ECDHE-ECDSA AES 128 CBC with SHA (no reuse) ----
selfserv starting at Mon Apr 20 18:45:38 PDT 2009
selfserv -D -p 8111 -d ../ext_server -n GORIDE.red.iplanet.com -B -s \
         -e GORIDE.red.iplanet.com-ec -w nss -c :C009 -i ../tests_pid.3544  &
trying to connect to selfserv at Mon Apr 20 18:45:38 PDT 2009
tstclnt -p 8111 -h GORIDE.red.iplanet.com -B -s -q \
        -d ../ext_client -v < /export/tinderlight/data/goride_32_DBG/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt.exe: connecting to GORIDE.red.iplanet.com:8111 (address=192.18.72.38)
kill -0 3784 >/dev/null 2>/dev/null
selfserv with PID 3784 found at Mon Apr 20 18:45:39 PDT 2009
selfserv with PID 3784 started at Mon Apr 20 18:45:39 PDT 2009
strsclnt -q -p 8111 -d ../ext_client -B -s -w nss -c 100 -C :C009 -N -T \
          GORIDE.red.iplanet.com
strsclnt started at Mon Apr 20 18:45:39 PDT 2009
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 8 cache misses, 0 cache not reusable
          0 stateless resumes
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
0[b528a0]: Class Error leaked 1 objects of size 20 bytes, total = 20 bytes
0[b528a0]: Assertion failure: numLeakedObjects == 0, at C:/mozilla-build/msys/export/tinderlight/data/goride_32_DBG/mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_lifecycle.c:294
strsclnt completed at Mon Apr 20 18:45:45 PDT 2009
ssl.sh: #4395: Stress SSL3 ECDHE-ECDSA AES 128 CBC with SHA (no reuse) produced a returncode of 3, expected is 0.  - FAILED
While Alexei is away, Julien and I need to take some of his bugs. 
I'm going to ask Julien to take the ones that are obviously related to 
the CRL cache.
Assignee: alexei.volkov.bugs → julien.pierre.boogz
Priority: -- → P1
Target Milestone: 3.12.1 → 3.12.4
This seems to be some PKIX shutdown problem. In nightly tests from yesterday (builds 20090420.1) I have seen the same bug on 2 machines.

AIX (failure in selfserv):

...
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable
          0 stateless resumes
strsclnt: NoReuse - 100 server certificates tested.
strsclnt completed at Tue Apr 21 09:47:54 PDT 2009
ssl.sh: #4701: Stress TLS RC4 128 with MD5 (no reuse, client auth) produced a returncode of 0, expected is 0.  - PASSED
trying to kill selfserv with PID 422004 at Tue Apr 21 09:47:54 PDT 2009
kill -USR1 422004
Assertion failure: numLeakedObjects == 0, at /h/guybrush.red.iplanet.com/export/mccrel3/security/securitytip/builds/20090420.1/wozzeck_Solaris8/mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_lifecycle.c:294
selfserv: 0 cache hits; 100 cache misses, 0 cache not reusable
          0 stateless resumes, 0 ticket parse failures
all.sh[12]: 422004 IOT/Abort trap(coredump)
selfserv with PID 422004 killed at Tue Apr 21 09:47:59 PDT 2009
ssl.sh: #4702: kill_selfserv core detection step - Core file is detected - FAILED

Solaris (failure in strsclnt):

strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
Assertion failure: numLeakedObjects == 0, at pkix_pl_lifecycle.c:294
strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable
          0 stateless resumes
strsclnt: NoReuse - 100 server certificates tested.
all.sh: line 326:   619 Abort                   (core dumped) ${PROFTOOL} ${BINDIR}/strsclnt -q -p ${PORT} -d ${P_R_CLIENTDIR} ${CLIENT_OPTIONS} -w nss $cparam $verbose ${HOSTADDR}
strsclnt completed at Mon Apr 20 23:56:20 PDT 2009
ssl.sh: #4705: Stress TLS  ECDH-ECDSA  AES 128 CBC with SHA (no reuse) produced a returncode of 134, expected is 0.  - Core file is detected - FAILED

Maybe this can tell something more:

Current function is PR_Assert
  550       abort();
(dbx) where                                        
  [1] _basename(0x1, 0x6, 0x0, 0xd0ec0000, 0x8046c88, 0xd0e503fe), at 0xd0e4f597 
  [2] _nss_netdb_aliases(0x6, 0xd1079d3c, 0x0, 0x0, 0x0, 0x0), at 0xd0e67d8d 
  [3] confstr(0xd1283114, 0xd1032cd9, 0x8046cec, 0xd12306cc, 0xd12727ac, 0xd12727c4), at 0xd0e503fe 
=>[4] PR_Assert(s = 0xd12727ac "numLeakedObjects == 0", file = 0xd12727c4 "pkix_pl_lifecycle.c", ln = 294), line 550 in "prlog.c"
  [5] PKIX_PL_Shutdown(plContext = 0x80b6848), line 294 in "pkix_pl_lifecycle.c"
  [6] PKIX_Shutdown(plContext = 0x80b6848), line 232 in "pkix_lifecycle.c"
  [7] NSS_Shutdown(), line 884 in "nssinit.c"
  [8] main(argc = 17, argv = 0x8046e14), line 1532 in "strsclnt.c"
tinderbox: build: trunk standard ciotat SunOS/sparc 64bit DBG


strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: -- SSL: Server Certificate Validated.
(PKIX_PL_Cert_VerifySignature: PKIX_PL_HashTable_Add skipped: entry existed
(pkix_CacheCertChain_Add: PKIX_PL_HashTable_Add for CertChain skipped: entry existed
strsclnt: 0 cache hits; 10 cache misses, 0 cache not reusable
          0 stateless resumes
strsclnt: NoReuse - 10 server certificates tested.
strsclnt completed at Fri May  8 07:51:17 PDT 2009
ssl.sh: #5054: Stress TLS ECDH-ECDSA AES 128 CBC with SHA (no reuse, client auth) produced a returncode of 0, expected is 0.  - PASSED
trying to kill selfserv with PID 29922 at Fri May  8 07:51:18 PDT 2009
kill -USR1 29922
1[100130a70]: Class Error leaked 1 objects of size 32 bytes, total = 32 bytes
1[100130a70]: Assertion failure: numLeakedObjects == 0, at pkix_pl_lifecycle.c:294
Assertion failure: numLeakedObjects == 0, at pkix_pl_lifecycle.c:294
selfserv: 0 cache hits; 10 cache misses, 0 cache not reusable
          0 stateless resumes, 0 ticket parse failures
./all.sh: line 17: 29922 Abort                   (core dumped) ${PROFTOOL} ${BINDIR}/selfserv -D -p ${PORT} -d ${P_R_SERVERDIR} -n ${HOSTADDR} ${SERVER_OPTIONS} ${ECC_OPTIONS} -w nss ${sparam} -i ${R_SERVERPID} $verbose

When viewed this bug I thought the approved v1 patch may not have been check in, but it was checked in 1.57 <alexei.volkov.bugs@sun.com> 2009-04-20 14:19
489188 - object/memory leak in libpkix.r=nelson.
I have seen some of these leaks in strsclnt as well. I recall that Alexei had a method to trace libpkix object leaks. Is it documented somewhere ?
This happened on Mac too :

http://tinderbox.mozilla.org/showlog.cgi?log=NSS/1242386373.1242395594.6390.gz&fulltext=1

strsclnt: -- SSL: Server Certificate Validated.
-1609533664[401f60]: Class Error leaked 1 objects of size 20 bytes, total = 20 bytes
-1609533664[401f60]: Assertion failure: numLeakedObjects == 0, at pkix_pl_lifecycle.c:294
Assertion failure: numLeakedObjects == 0, at pkix_pl_lifecycle.c:294
strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable
          0 stateless resumes
strsclnt: NoReuse - 100 server certificates tested.
./ssl.sh: line 421: 14849 Abort trap              ${PROFTOOL} ${BINDIR}/strsclnt -q -p ${PORT} -d ${P_R_CLIENTDIR} ${CLIENT_OPTIONS} -w nss $cparam $verbose ${HOSTADDR}
strsclnt completed at Fri May 15 05:40:55 PDT 2009
ssl.sh: #5142: Stress TLS  ECDH-ECDSA  AES 128 CBC with SHA (no reuse) produced a returncode of 134, expected is 0.  - FAILED
trying to kill selfserv with PID 14829 at Fri May 15 05:40:55 PDT 2009
I'm giving this bug back to Alexei.
Assignee: julien.pierre.boogz → alexei.volkov.bugs
Status: ASSIGNED → NEW
Keywords: mlk
Whiteboard: PKIX SUN_MUST_HAVE
> strsclnt: -- SSL: Server Certificate Validated.
> -1609533664[401f60]: Class Error leaked 1 objects of size 20 bytes, total = 20
> bytes
> -1609533664[401f60]: Assertion failure: numLeakedObjects == 0, at
> pkix_pl_lifecycle.c:294
This leak appears to be an incorrect handling/overwrite of the pointer to an error that was returned by a cache indicating an object, that was tried to be added to the cache, is already there. The problematic code is either in pkix_CacheCert_Add or in pkix_CacheCertChain_Add. Still working on trying to figure out the exact problematic code area.

However, I'm attaching the patch that will resolve the problem by handling the error within protected by locks area.
Attachment #387014 - Flags: review?(nelson)
Comment on attachment 387014 [details] [diff] [review]
Patch v2 - fix the leak by handling error in the protected area [backed out]

r=nelson. 
We will want to back this out when the real cause is found, so please say that in the checkin comment.
Attachment #387014 - Flags: review?(nelson) → review+
I'm running into this assertion failure when running all.sh
on Windows Vista.  The value of numLeakedObjects is 1.
Wan-Teh, did you make any changes or is it a build from the trunk?
If you have made some changes to lib pkix, this might mean, that you have leaked a pointer ref count somewhere.
Alexei, it is a build from the current NSS trunk, without
any changes.  I built it on Windows Vista with Visual
Studio 2008.  I set OS_TARGET=WIN95 and NSS_ENABLE_ECC=1
during the build, and set OS_TARGET=WIN95 when I ran all.sh.
(That is, I unset NSS_ENABLE_ECC before I ran all.sh.)
Thanks for the report. I'll check it out.
strsclnt had this assertion failure on an NSS Tinderbox last night:

trunk standard attic Linux 32bit DBG on 2010/01/21 20:18:13

Here is the excerpt of the failed SSL stress test from the test
log file.
We frequently get this problem on tinderbox, especially on machines with multiple cores.

This bug has a patch, it has r+, but apparently it never got checked in.

I've done so now, let's see if it helps.

Checking in lib/libpkix/pkix_pl_nss/system/pkix_pl_hashtable.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_hashtable.c,v  <--  pkix_pl_hashtable.c
new revision: 1.14; previous revision: 1.13
done
Marking fixed for now. I will reopen if I see the problem again...
Target Milestone: 3.12.4 → 3.14
Patch DIDN'T help. We still get the assertion on Linux/ppc64.
Leaving bug open.

Keep the patch, or backout?
Blocks: pkix-default
Whenever we run into this assertion, is's a stress test that sets
  "no session reuse"
in strsclnt.

With "session reuse enabled", we don't get this bug.
... also, by inspecting all the recent occurrences of this bug on tinderbox, we always had bypass mode enabled in both client and server.

This bug is mysterious.

On my Linux machine with 4 cores, using the following test execution command, the test suite runs into this bug at least once in three runs:
   NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix" \
   NSS_SSL_TESTS="bypass_normal" ./all.sh

I tried to reproduce this crash, by running the test that crashed, only.
I assumed, if I run this crashing test multiple times, I should be able to see the crash frequently.
But this assumption is wrong.

For example, the following series of commands (taken from the logfile) ran into the assertion:

selfserv -D -p 8443 -d ../server -n localhost.localdomain -B -s -e localhost.localdomain-ec -w nss -r -r 
tstclnt -p 8443 -h localhost.localdomain -B -s -q -d ../client -v < /home/kaie/moz/nss/head/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8443 -d ../client -B -s -w nss -c 100 -C c -T -N -n TestUser localhost.localdomain


I copied the client and db directories to "c" and "s" and I used a mininum script to run these commands:

selfserv -D -p 8443 -d ../s -n localhost.localdomain -B -s -e localhost.localdomain-ec -w nss -r -r  &
PID=$!
tstclnt -p 8443 -h localhost.localdomain -B -s -q -d ../c -v < /home/kaie/moz/nss/head/mozilla/security/nss/tests/ssl/sslreq.dat
kill -0 $PID >/dev/null 2>/dev/null
strsclnt -q -p 8443 -d ../c -B -s -w nss -c 100 -C c -T -N -n TestUser localhost.localdomain
kill -USR1 $PID
selfserv -b -p 8443 2>/dev/null;

I looped through above commands, but it never crashed.

(For testing purposes I ran "killall -11 selfserv", and that immediately crashed and produced a core file.)
After spending most of today on this bug, and after some black alleys, I made progress.

I edited sslstress.txt to run the problematic tests more often, reliably giving me at least one crash each test run, and I added printf statements to the error-creation and error-destroy functions.

Luckily the problamtic error code was unique.
When crashing, the process has allocated a NULLARGUMENT error object, but not freed it.

After adding some assertions to all places where this error is thrown, I got a core file with the stack causing the issue.

Function PKIX_PL_Cert_GetExtendedKeyUsage will always call PKIX_List_SetImmutable on list cert->extKeyUsages, even if that list pointer is NULL.

This causes a NULLARGUMENT error object to be created. I have not checked why this error object doesn't get freed.

However, we can avoid the creation of the error object, by avoiding this unnecessary function call if the list is NULL.

In my local testing, this simple NULL check fixes the problem.
Attached patch Patch v3 (obsolete) — Splinter Review
Attachment #624222 - Flags: review?(wtc)
Comment on attachment 387014 [details] [diff] [review]
Patch v2 - fix the leak by handling error in the protected area [backed out]

Wan-Teh, are you OK to backout this patch, which I landed yesterday, but which didn't help?
Attachment #387014 - Flags: feedback?(wtc)
Attachment #387014 - Attachment description: fix the leak by handling error in the protected area. → Patch v2 - fix the leak by handling error in the protected area - propose to backout
Comment on attachment 387014 [details] [diff] [review]
Patch v2 - fix the leak by handling error in the protected area [backed out]

Yes, we should back out this patch.  (I didn't review it.)
Comment on attachment 387014 [details] [diff] [review]
Patch v2 - fix the leak by handling error in the protected area [backed out]

backed out

Checking in lib/libpkix/pkix_pl_nss/system/pkix_pl_hashtable.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix_pl_nss/system/pkix_pl_hashtable.c,v  <--  pkix_pl_hashtable.c
new revision: 1.15; previous revision: 1.14
done
Attachment #387014 - Attachment description: Patch v2 - fix the leak by handling error in the protected area - propose to backout → Patch v2 - fix the leak by handling error in the protected area [backed out]
Attachment #387014 - Flags: feedback?(wtc)
Whiteboard: PKIX SUN_MUST_HAVE → [fix frequent tinderbox orange]
This bug is now waiting for review of patch v3, which fixes the bug according to my testing.
Why does this leak occur only intermittently?
(In reply to Brian Smith (:bsmith) from comment #30)
> Why does this leak occur only intermittently?

I haven't researched that.
Kai, please review and test this patch.

I studied the code related to cert->extKeyUsages and
cert->extKeyUsagesAbsent.  I found that this is where
we construct cert->extKeyUsages, so we only need to
call PKIX_List_SetImmutable after we just constructed
cert->extKeyUsages.  This is essentially the same as
your patch v3, with the PKIX_List_SetImmutable call
moved to a better place.

Note that this fix merely suppresses the creation of
the PKIX_NULLARGUMENT error object.  We still need to
track down why the error object is leaked.  If we
don't do that, we will still leak an error object
if a different error object is created elsewhere.
Attachment #624222 - Attachment is obsolete: true
Attachment #624956 - Flags: review?(kaie)
Attachment #624222 - Flags: review?(wtc)
Kai, please review and test this patch.  To test it, you need to
back out patch v4.

By studying all the functions on the call stack that allocates the
leaked NULLARGUMENT error, I believe this is where the error object
is leaked:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c&rev=1.64&mark=3078-3082,3096,3102-3106,3109#3078

We know the pkix_Build_ValidateEntireChain call returns a fatal
NULLARGUMENT error.  So the PKIX_CHECK_ONLY_FATAL macro causes
us to goto cleanup.

Under cleanup:, I believe that we enter the if statement and
call pkix_CacheCertChain_Remove inside the PKIX_CHECK_FATAL.
This causes us to lose the error object stored in the
pkixErrorResult variable.

I don't know if we should preserve the original error object.
This patch merely fixes the error object leak.  It does not
preserve the original error object if the pkix_CacheCertChain_Remove
call succeeds.
Attachment #624962 - Flags: review?(kaie)
This is an alternative fix for the error object leak.
This patch preserves the original error object if the
pkix_CacheCertChain_Remove doesn't fail.  Since I didn't
find out whether we need to preserve the original error
object, I think it is safer to maintain the current
behavior of dropping the original error object and just
fix the error object leak.
Comment on attachment 624966 [details] [diff] [review]
Fix pkixErrorResult leak, approach 2 (preserves the original error object)

There is only one caller of pkix_Build_CheckInCache, pkix_Build_InitiateBuildChain:

            ...
            PKIX_CHECK_ONLY_FATAL(
                pkix_Build_CheckInCache(state, &buildResult,
                                        &nbioContext, plContext),
                PKIX_UNABLETOBUILDCHAIN);
            ...

So, whether we preserve the original error or not doesn't matter in the CURRENT state of the code. However, is better to preserve the error when the call to pkix_CacheCertChain_Remove doesn't fail, because AFAICT that is how functions in libpkix are "supposed" to work. So, I think this patch (approach 2) is better than approach 1, as far as making the code understandable.

So, my understanding is that if we were to check in this patch (approach 2), we would start seeing intermittent NULLARGUMENT errors, and if we check in your other patch then those intermittent NULLARGUMENT errors would go away. 

If we were to check in approach 1, then the bug fixed in Patch v4 would be not result in a detectable failure, right? If so then I think the approach 1 patch is not just worse than approach 2 because of understandability, but really wrong.
Just to clarify: I recommend checking in two patches:
- patch v4
- Fix pkixErrorResult leak, either approach 1 or approach 2

Brian: your understanding in comment 35 is correct.  It is true that
the current code and also the "approach 1" patch both suppress the
NULLARGUMENT errors.  I now agree that is wrong.

Kai: please test "approach 2" instead of "approach 1".  Thanks.
It would also be nice to track down why the NULLARGUMENT errors
are intermittent.  But I understand you may not have time to do
that.
Attachment #624966 - Flags: review?(kaie)
Comment on attachment 624956 [details] [diff] [review]
Patch v4 [checked in]

Wan-Teh, I suspect you made an unintentional copy/paste mistake.

Your patch doesn't simply move the code - you modify a different list.
Attachment #624956 - Flags: review?(kaie) → review-
Comment on attachment 624956 [details] [diff] [review]
Patch v4 [checked in]

(In reply to Kai Engert (:kaie) from comment #37)
> 
> Your patch doesn't simply move the code - you modify a different list.

Sorry, my mistake. You do modify the correct list, which is named differently in this scope.
Attachment #624956 - Flags: review- → review?(kaie)
Wan-Teh, neither your "approach 1" nor your "approach 2" fixes the issue.
Here are the test results:

For all testing described in this comment, the following applies:
- patch from bug 761715 was reverted
- attachment 624227 [details] [diff] [review] from this bug was applied (test enhancements)
- test execution with:
  NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix" \
  NSS_SSL_TESTS="bypass_normal" ./all.sh

no patches 
test-run #1: fail
test-run #2: ok
test-run #3: fail

only patch v4:
test-run #1: ok
test-run #2: ok
test-run #3: ok

only approach 1:
test-run #1: ok
test-run #2: fail !
test-run #3: fail !

only approach 2:
test-run #1: fail !
test-run #2: ok
test-run #3: fail
Sorry. I've performed the testing without sufficient focus.

Of course, when using ONLY approach 1 or ONLY approach 2, we still fail, because my testing modifications cause us to assert/abort whenever the PKIX_NULLARGUMENT objects gets created.

In order to test your approaches, I must remove those additional asserts...
Ignore comment 39.
Comment on attachment 624962 [details] [diff] [review]
Fix pkixErrorResult leak, approach 1

Patch "approach 1" fixes the leak.

No other failures in the remainder of the test suite.
Comment on attachment 624966 [details] [diff] [review]
Fix pkixErrorResult leak, approach 2 (preserves the original error object)

This patch "approach 2" introduces a new problem.

Probably whenever we have the NULL argument error object,
we get another error, like the following example:

strsclnt: -- SSL: Server Certificate Invalid, err -8187.
security library: invalid arguments.
strsclnt: PR_Send returned error -8187:
security library: invalid arguments.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
selfserv: HDX PR_Read returned error -12271:
SSL peer cannot verify your certificate.
strsclnt: 0 cache hits; 12 cache misses, 0 cache not reusable
          0 stateless resumes
strsclnt: NoReuse - 12 server certificates tested.
strsclnt completed at Wed Jun  6 11:51:06 CEST 2012
ssl.sh: #595: Stress SSL3 RC4 128 with MD5 (no reuse, client auth) produced a returncode of 1, expected is 0.  - FAILED
Attachment #624966 - Flags: review?(kaie) → review-
When combining patch v4 and approach 2, the errors shown in comment 43 are gone.

Does this mean, whenever we saw a NULLARGUMENT error, it was caused by a certificate validation failure?
When using the diagnostic printf statements for file pkix_error.c found in attachment 624227 [details] [diff] [review], we get the attached output.

With the help of a patch from bug 762057 I translated those error numbers into meaning...

The set of error codes seen is:
code 702 NULLARGUMENT
code 865 PRIMHASHTABLEADDFAILED
code 19 ATTEMPTTOADDDUPLICATEKEY

But errors 865 and 19 are commonly seen, in all runs of strsclnt, even in the succeeding runs. The diagnostic output has statements meaning "did't add, because entry was already there" - meaning those errors aren't fatal but are supposed to be ignored.


Whenever we get certificate validation failure -8187, we can see that the NULLARGUMENT error code was produced immediately before.
Comment on attachment 624956 [details] [diff] [review]
Patch v4 [checked in]

r=kaie
Attachment #624956 - Flags: review?(kaie) → review+
Comment on attachment 624956 [details] [diff] [review]
Patch v4 [checked in]

I think we agree that this change is good.
I've checked it in to make the tinderbox orange stop.

Checking in pkix_pl_cert.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_cert.c,v  <--  pkix_pl_cert.c
new revision: 1.31; previous revision: 1.30
done


If you decide to further investigate this bug, you must back it out for further testing.
Attachment #624956 - Attachment description: Patch v4 → Patch v4 [checked in]
Comment on attachment 624956 [details] [diff] [review]
Patch v4 [checked in]

also checked in to NSS_3_13_4_BRANCH (for potential 3.13.6)

Checking in pkix_pl_cert.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_cert.c,v  <--  pkix_pl_cert.c
new revision: 1.28.2.2; previous revision: 1.28.2.1
done
Kai, thank you for checking in patch v4 and testing the "approach 1"
and "approach 2" patches.

The behaviors you reported in comment 42 and comment 43 are expected:
- "approach 1" should fix the object leak and preserve the current
  behavior (suppress the NULLARGUMENT error).
- "approach 2" should also fix the object leak, but it will expose
  the NULLARGUMENT error intentionally.

Since it seems wrong to suppress an error, if combining patch v4 and
approach 2 fixes the object leak, I suggest that we check in approach 2
instead of approach 1.
Comment on attachment 624966 [details] [diff] [review]
Fix pkixErrorResult leak, approach 2 (preserves the original error object)

Wan-Teh, thanks for the additional clarificaions.
r=kaie
Attachment #624966 - Flags: review- → review+
Attachment #624962 - Flags: review?(kaie)
Comment on attachment 624966 [details] [diff] [review]
Fix pkixErrorResult leak, approach 2 (preserves the original error object)

trunk:

Checking in pkix_build.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c,v  <--  pkix_build.c
new revision: 1.65; previous revision: 1.64
done


3_13_4 branch for 3.13.6:

Checking in pkix_build.c;
/cvsroot/mozilla/security/nss/lib/libpkix/pkix/top/pkix_build.c,v  <--  pkix_build.c
new revision: 1.63.2.1; previous revision: 1.63
done
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee: alvolkov.bgs → kaie
Whiteboard: [fix frequent tinderbox orange] → [fix frequent tinderbox orange][3.13.6]
Whiteboard: [fix frequent tinderbox orange][3.13.6] → [fix frequent tinderbox orange]
Target Milestone: 3.14 → 3.13.6
Cleaning up/fixing whiteboards containing 'orange' but not '[orange]' in advance of bug 791758.
Whiteboard: [fix frequent tinderbox orange]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: