Last Comment Bug 458905 - Memory leaks in PKIX bridge certificates.
: Memory leaks in PKIX bridge certificates.
Status: RESOLVED FIXED
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: trunk
: All All
: P2 normal (vote)
: 3.12.2
Assigned To: Nelson Bolyard (seldom reads bugmail)
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-10-07 10:07 PDT by Slavomir Katuscak
Modified: 2008-11-19 08:08 PST (History)
3 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
patch, hack fix. v1 (654 bytes, patch)
2008-10-08 12:26 PDT, Nelson Bolyard (seldom reads bugmail)
rrelyea: review+
Details | Diff | Review

Description Slavomir Katuscak 2008-10-07 10:07:12 PDT
While testing PKIX memory leak script on Linux (RHEL4 machine dopushups on standard 32bit/DBG Tinderbox build) I found some new memory leaks.

Steps to reproduce (on Linux):

1. Generate one selfsigned CA cert (root).
certutil -N -d ArmyDB -f ArmyDB/dbpasswd
certutil -s "CN=Army ROOT CA, O=Army, C=US" -S -n Army -t CTu,CTu,CTu -v 600 -x -d ArmyDB -1 -2 -5 -f ArmyDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092

2. Generate other selfsigned CA cert (root).
certutil -N -d NavyDB -f NavyDB/dbpasswd
certutil -s "CN=Navy ROOT CA, O=Navy, C=US" -S -n Navy -t CTu,CTu,CTu -v 600 -x -d NavyDB -1 -2 -5 -f NavyDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092

3. Generate bridge CA cert and sign it with both root certs.
certutil -N -d BridgeDB -f BridgeDB/dbpasswd
certutil -s "CN=Bridge Bridge, O=Bridge, C=US" -R -2 -d BridgeDB -f BridgeDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 -o BridgeReq.der
certutil -C -c Army -v 60 -d ArmyDB -i BridgeReq.der -o BridgeArmy.der -f ArmyDB/dbpasswd -7 Bridge@Army 
certutil -A -n Bridge -t u,u,u -d BridgeDB -f BridgeDB/dbpasswd -i BridgeArmy.der
certutil -C -c Navy -v 60 -d NavyDB -i BridgeReq.der -o BridgeNavy.der -f NavyDB/dbpasswd -7 Bridge@Navy 
certutil -A -n Bridge -t u,u,u -d BridgeDB -f BridgeDB/dbpasswd -i BridgeNavy.der

4. Generate user certificate signed by bridge CA.
certutil -N -d UserDB -f UserDB/dbpasswd
certutil -s "CN=User EE, O=User, C=US" -R  -d UserDB -f UserDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 -o UserReq.der
certutil -C -c Bridge -v 60 -d BridgeDB -i UserReq.der -o UserBridge.der -f BridgeDB/dbpasswd  
certutil -A -n User -t u,u,u -d UserDB -f UserDB/dbpasswd -i UserBridge.der

5. Create test DB and import there certs of all CAs (roots + both signed bridge certs).
certutil -N -d AllDB -f AllDB/dbpasswd
certutil -A -n Army  -t "" -d AllDB -f AllDB/dbpasswd -i Army.der
certutil -A -n Navy  -t "" -d AllDB -f AllDB/dbpasswd -i Navy.der
certutil -A -n Bridge  -t "" -d AllDB -f AllDB/dbpasswd -i BridgeArmy.der
certutil -A -n Bridge  -t "" -d AllDB -f AllDB/dbpasswd -i BridgeNavy.der

6. Verify end user cert with test DB and trust one of root certs (I run this under valgrind tool on Linux)
vfychain -d AllDB -pp -vv    UserBridge.der  -t Army

/usr/bin/valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --partial-loads-ok=yes --leak-resolution=high --num-callers=50 /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/dist/Linux2.6_x86_glibc_PTH_DBG.OBJ/bin/vfychain -d AllDB -pp -vv UserBridge.der -t Army

Leaks found:

==18734== 148 (60 direct, 88 indirect) bytes in 1 blocks are definitely lost in loss record 31 of 38
==18734==    at 0x40056BF: calloc (vg_replace_malloc.c:279)
==18734==    by 0x420088C: PR_Calloc (prmem.c:474)
==18734==    by 0x40F8000: nss_ZAlloc (arena.c:892)
==18734==    by 0x40F7891: nssArena_Create (arena.c:412)
==18734==    by 0x40EAF74: add_cert_to_cache (tdcache.c:794)
==18734==    by 0x40EB183: nssTrustDomain_AddCertsToCache (tdcache.c:877)
==18734==    by 0x40EE4BC: cert_createObject (pkibase.c:1061)
==18734==    by 0x40EE0EC: nssPKIObjectCollection_GetObjects (pkibase.c:893)
==18734==    by 0x40EE60E: nssPKIObjectCollection_GetCertificates (pkibase.c:1115)
==18734==    by 0x40E9310: nssTrustDomain_FindCertificatesBySubject (trustdomain.c:677)
==18734==    by 0x40E939B: NSSTrustDomain_FindCertificatesBySubject (trustdomain.c:702)
==18734==    by 0x40E1DE3: CERT_CreateSubjectCertList (stanpcertdb.c:692)
==18734==    by 0x418CFA4: pkix_pl_Pk11CertStore_CertQuery (pkix_pl_pk11certstore.c:238)
==18734==    by 0x418DDCA: pkix_pl_Pk11CertStore_GetCert (pkix_pl_pk11certstore.c:532)
==18734==    by 0x412EBB9: pkix_Build_GatherCerts (pkix_build.c:2072)
==18734==    by 0x412FEDE: pkix_BuildForwardDepthFirstSearch (pkix_build.c:2626)
==18734==    by 0x41362B4: pkix_Build_InitiateBuildChain (pkix_build.c:4260)
==18734==    by 0x4136E97: PKIX_BuildChain (pkix_build.c:4428)
==18734==    by 0x409A3FA: CERT_PKIXVerifyCert (certvfypkix.c:2155)
==18734==    by 0x804BF0B: main (vfychain.c:505)

==18734== 88 bytes in 1 blocks are indirectly lost in loss record 32 of 38
==18734==    at 0x40056BF: calloc (vg_replace_malloc.c:279)
==18734==    by 0x420088C: PR_Calloc (prmem.c:474)
==18734==    by 0x42118FE: PR_NewLock (ptsynch.c:174)
==18734==    by 0x40F78C4: nssArena_Create (arena.c:418)
==18734==    by 0x40EAF74: add_cert_to_cache (tdcache.c:794)
==18734==    by 0x40EB183: nssTrustDomain_AddCertsToCache (tdcache.c:877)
==18734==    by 0x40EE4BC: cert_createObject (pkibase.c:1061)
==18734==    by 0x40EE0EC: nssPKIObjectCollection_GetObjects (pkibase.c:893)
==18734==    by 0x40EE60E: nssPKIObjectCollection_GetCertificates (pkibase.c:1115)
==18734==    by 0x40E9310: nssTrustDomain_FindCertificatesBySubject (trustdomain.c:677)
==18734==    by 0x40E939B: NSSTrustDomain_FindCertificatesBySubject (trustdomain.c:702)
==18734==    by 0x40E1DE3: CERT_CreateSubjectCertList (stanpcertdb.c:692)
==18734==    by 0x418CFA4: pkix_pl_Pk11CertStore_CertQuery (pkix_pl_pk11certstore.c:238)
==18734==    by 0x418DDCA: pkix_pl_Pk11CertStore_GetCert (pkix_pl_pk11certstore.c:532)
==18734==    by 0x412EBB9: pkix_Build_GatherCerts (pkix_build.c:2072)
==18734==    by 0x412FEDE: pkix_BuildForwardDepthFirstSearch (pkix_build.c:2626)
==18734==    by 0x41362B4: pkix_Build_InitiateBuildChain (pkix_build.c:4260)
==18734==    by 0x4136E97: PKIX_BuildChain (pkix_build.c:4428)
==18734==    by 0x409A3FA: CERT_PKIXVerifyCert (certvfypkix.c:2155)
==18734==    by 0x804BF0B: main (vfychain.c:505)

There were also found stacks related to bugs 367384 and 430544.

I tried to reproduce this on Solaris, but DBX doesn't detect those leaks. 
Also experiments with using some certs on command line and some in DB shows, that leaks are found only when both bridge certs are in DB.
Comment 1 Nelson Bolyard (seldom reads bugmail) 2008-10-07 14:45:12 PDT
Slavo, were you testing with the environment variable 
NSS_DISABLE_ARENA_FREE_LIST set to 1 ?

The leaks reported here are all arena leaks.  I would expect this pattern 
of leaks in a process that was not using NSS_DISABLE_ARENA_FREE_LIST.
Comment 2 Slavomir Katuscak 2008-10-08 04:18:37 PDT
I reused functionality from previous memory leak testing scripts (memleak.sh), where is NSS_DISABLE_ARENA_FREE_LIST set to 1 in initialization function. You can expect this variable to be set now and also in future in memory leaks found by my testing scripts.
Comment 3 Slavomir Katuscak 2008-10-08 09:00:26 PDT
Reproduced on Solaris:

Memory Leak (mel):
Found leaked block of size 88 bytes at address 0x6e578
At time of allocation, the call stack was:
    [1] calloc() at 0xdf137488 
    [2] PR_Calloc() at line 475 in "prmem.c"
    [3] PR_NewLock() at line 174 in "ptsynch.c"
    [4] nssArena_Create() at line 418 in "arena.c"
    [5] add_cert_to_cache() at line 794 in "tdcache.c"
    [6] nssTrustDomain_AddCertsToCache() at line 877 in "tdcache.c"
    [7] cert_createObject() at line 1061 in "pkibase.c"
    [8] nssPKIObjectCollection_GetObjects() at line 893 in "pkibase.c"
    [9] nssPKIObjectCollection_GetCertificates() at line 1117 in "pkibase.c"
    [10] nssTrustDomain_FindCertificatesBySubject() at line 679 in "trustdomain.c"
    [11] NSSTrustDomain_FindCertificatesBySubject() at line 706 in "trustdomain.c"
    [12] CERT_CreateSubjectCertList() at line 696 in "stanpcertdb.c"
    [13] pkix_pl_Pk11CertStore_CertQuery() at line 242 in "pkix_pl_pk11certstore.c"
    [14] pkix_pl_Pk11CertStore_GetCert() at line 534 in "pkix_pl_pk11certstore.c"
    [15] pkix_Build_GatherCerts() at line 2078 in "pkix_build.c"
    [16] pkix_BuildForwardDepthFirstSearch() at line 2628 in "pkix_build.c"

Memory Leak (mel):
Found leaked block of size 60 bytes at address 0x6e428
At time of allocation, the call stack was:
    [1] calloc() at 0xdf137488 
    [2] PR_Calloc() at line 475 in "prmem.c"
    [3] nss_ZAlloc() at line 892 in "arena.c"
    [4] nssArena_Create() at line 412 in "arena.c"
    [5] add_cert_to_cache() at line 794 in "tdcache.c"
    [6] nssTrustDomain_AddCertsToCache() at line 877 in "tdcache.c"
    [7] cert_createObject() at line 1061 in "pkibase.c"
    [8] nssPKIObjectCollection_GetObjects() at line 893 in "pkibase.c"
    [9] nssPKIObjectCollection_GetCertificates() at line 1117 in "pkibase.c"
    [10] nssTrustDomain_FindCertificatesBySubject() at line 679 in "trustdomain.c"
    [11] NSSTrustDomain_FindCertificatesBySubject() at line 706 in "trustdomain.c"
    [12] CERT_CreateSubjectCertList() at line 696 in "stanpcertdb.c"
    [13] pkix_pl_Pk11CertStore_CertQuery() at line 242 in "pkix_pl_pk11certstore.c"
    [14] pkix_pl_Pk11CertStore_GetCert() at line 534 in "pkix_pl_pk11certstore.c"
    [15] pkix_Build_GatherCerts() at line 2078 in "pkix_build.c"
    [16] pkix_BuildForwardDepthFirstSearch() at line 2628 in "pkix_build.c"
Comment 4 Nelson Bolyard (seldom reads bugmail) 2008-10-08 09:27:07 PDT
Slavo, is this a new test? 
Or is this a new leak in an old test ("old" being older than 90 days) ?
Comment 5 Nelson Bolyard (seldom reads bugmail) 2008-10-08 11:11:45 PDT
This appears to be an original "day 1" bug.  I wonder why we never found 
this leak before now.  That's why I asked the questions in comment 4.

In a comment to follow this one, I will discuss the specifics of the bad
functions.  In this comment, I will discuss the general problem.

The code responsible for the leak is in tdcache.c.  It's bad logic spread
across many functions, the functions that deal with allocating and 
destroying the "entry" objects.  It's a bad design, in several ways.  

Any number from 0-3 objects try to share an nssArena.  They all occupy space 
allocated from the arena.  These objects may be independently destroyed.  
There is no reference counting of handles/references of the Arena. Instead,
The logic to deallocate the Arena makes use of the fatally-flawed "owner" 
concept, where one of the objects that occupies it is considered to "own" 
the Arena, and when that object is destroyed, the Arena is destroyed.  
This makes for several problems. 

1.  If the "owner" object is destroyed before the other objects that are 
allocated from the arena, those other objects are deallocated while 
references to them remain.

2. The "owner" object may not use the arena at all, in which case, either:

a) NO objects that hold references to the arena, and it will be leaked at
the time it is created, or

b) All the objects that hold references to the Arena are marked as NOT the 
owner of the Arena, so when they are all destroyed, the Arena is leaked. 

The solution to this problem requires completely eliminating the "owner"
concept from the management of the nssArena.  The solution is either:

a) have a separate Arena for each "subject entry", "nickname entry" and 
"email entry" object.  There is no "owner" flag.  Each of those entry 
objects implicitly owns the arena in which it is allocated, and destroying
that object destroys its arean.  Allocation of the arena is done in each
of the functions that allocate those "entry" objects, rather than in the
function that calls those entry object allocators, or 

b) have the Arena be truly reference counted.  Each "entry" object holds
a reference.  When each entry object is deleted, the reference count is 
decremented, and when it goes to zero, the Arena is deallocated.

NSS has been plagued with problems with the use of this "owner" concept.
No NSS developer should ever again attempt to put such code into NSS.
No NSS reviewer should ever allow it.
Comment 6 Nelson Bolyard (seldom reads bugmail) 2008-10-08 11:35:43 PDT
I'm taking this bug.  Even though it was triggered by a call from libPKIX,
the flaw is not (primarily) in libPKIX.  

I DO wonder why we never saw this bug before.  What is libPKIX doing 
differently than the rest of NSS that triggers this bug?   
Is that different behavior of libPKIX a bug?  

The problem is seen first and foremost in function add_cert_to_cache() in
file tdcache.c, and in the functions that it calls: add_subject_entry and add_nickname_entry, and the function they call, new_cache_entry, all of 
which are also in tdcache.c.

add_cert_to_cache creates an nssArena (line 794), which it then passes
to add_subject_entry.  add_subject_entry (line 560) looks to see if there
is already a subject entry for this subject.  If not, then at line 576 
it calls new_cache_entry, passing the argument PR_TRUE, telling 
new_cache_entry to mark this object as the "owner" of the Arena.  
See line 143 for the code that marks the new cache entry as owner.

HOWEVER, if add_subject_entry finds that there is already a subject 
entry for this subject (line 562) it reuses the existing subject entry,
and makes NO USE of the nssArena that was passed to it.  Consequently,
there is NO object that is the "owner" of the nssArena in that case.  

When add_subject_entry returns, add_cert_to_cache may (or may not) 
call add_nickname_entry (line 809) passing the Arena to it.  
add_nickname_entry also calls new_cache_entry (line 625), passing it
PR_FALSE, so that it will NOT be the owner of the Arena.

The test at lines 805-806 is intended to ensure that we only call 
add_nickname_entry in the case where add_subject_entry returned a newly
allocated subject entry, and not in the case where it returns a reused
entry.  I'm not yet convinced that that test is entirely correct.  

IFF the test at line 806 really does accurately detect a new entry,
so that it is always false when add_subject_entry returns a reused entry,
then a Quick-and-Dirty hack fix is to add an "else" onto the end of the 
big block of code in lines 806-836.  That else block can simply destroy
the nssArena that was allocated at line 794, since it has not been used.

I will write a patch that implements that quick-and-dirty hack.  
But I really want to rid NSS of this pernicious "owner" concept.
Comment 7 Nelson Bolyard (seldom reads bugmail) 2008-10-08 12:26:09 PDT
Created attachment 342289 [details] [diff] [review]
patch, hack fix.  v1

Bob, please reivew after reading the preceding comments in this bug.  
Thanks
Comment 8 Robert Relyea 2008-10-08 14:22:20 PDT
Comment on attachment 342289 [details] [diff] [review]
patch, hack fix.  v1

r+, though reluctantly.

It seems to me that it would make more sense for add_subject_entry to free an unused arena (clearly it's adopting the arena in the other case).

On the other hand it makes the most sense for add_subject_entry to create it's own arena. add_cert_to_cache can use the arena already created by add_subject_entry (though I would have to look at both functions to see what the heck add_cert_to_cache is allocating that should be part of the subject_entry).


As to why we don't see this in NSS proper? It's pretty likely that we don't have a lot of instances where we are importing multiple certs with the same subject in normal NSS. It's much more common in PKIX.

bob
Comment 9 Slavomir Katuscak 2008-10-09 03:13:06 PDT
(In reply to comment #4)
> Slavo, is this a new test? 
> Or is this a new leak in an old test ("old" being older than 90 days) ?

It's new test which we didn't run before.
Comment 10 Slavomir Katuscak 2008-10-29 04:14:37 PDT
Adding pattern to list of ignored leaks:

#458905
**/nssArena_Create/**
Comment 11 Nelson Bolyard (seldom reads bugmail) 2008-10-29 10:42:18 PDT
(In reply to comment #10)
> Adding pattern to list of ignored leaks:
> 
> #458905
> **/nssArena_Create/**

No, don't add that pattern.  Remove it if you have already done so.
That's too broad and will hide many more leaks than just this one.

Use something like:
**/add_cert_to_cache/nssTrustDomain_AddCertsToCache/cert_createObject/**
or 
**/cert_createObject/nssTrustDomain_AddCertsToCache/add_cert_to_cache/**
whichever order is correct.
Comment 12 Slavomir Katuscak 2008-10-30 02:04:45 PDT
Ok, removing old one, adding:
**/cert_createObject/nssTrustDomain_AddCertsToCache/add_cert_to_cache/**
Comment 13 Slavomir Katuscak 2008-11-03 01:46:50 PST
Adding pattern:
**/cert_createObject/nssTrustDomain_AddCertsToCache/nssArena_Create/**

Previous one was not matched in OPT builds in nightly tests.
Comment 14 Nelson Bolyard (seldom reads bugmail) 2008-11-19 08:08:40 PST
Checking in tdcache.c; new revision: 1.48; previous revision: 1.47

Note You need to log in before you can comment on or make changes to this bug.