Closed Bug 458905 Opened 16 years ago Closed 16 years ago

Memory leaks in PKIX bridge certificates.

Categories

(NSS :: Libraries, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED
3.12.2

People

(Reporter: slavomir.katuscak+mozilla, Assigned: nelson)

Details

Attachments

(1 file)

While testing PKIX memory leak script on Linux (RHEL4 machine dopushups on standard 32bit/DBG Tinderbox build) I found some new memory leaks. Steps to reproduce (on Linux): 1. Generate one selfsigned CA cert (root). certutil -N -d ArmyDB -f ArmyDB/dbpasswd certutil -s "CN=Army ROOT CA, O=Army, C=US" -S -n Army -t CTu,CTu,CTu -v 600 -x -d ArmyDB -1 -2 -5 -f ArmyDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 2. Generate other selfsigned CA cert (root). certutil -N -d NavyDB -f NavyDB/dbpasswd certutil -s "CN=Navy ROOT CA, O=Navy, C=US" -S -n Navy -t CTu,CTu,CTu -v 600 -x -d NavyDB -1 -2 -5 -f NavyDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 3. Generate bridge CA cert and sign it with both root certs. certutil -N -d BridgeDB -f BridgeDB/dbpasswd certutil -s "CN=Bridge Bridge, O=Bridge, C=US" -R -2 -d BridgeDB -f BridgeDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 -o BridgeReq.der certutil -C -c Army -v 60 -d ArmyDB -i BridgeReq.der -o BridgeArmy.der -f ArmyDB/dbpasswd -7 Bridge@Army certutil -A -n Bridge -t u,u,u -d BridgeDB -f BridgeDB/dbpasswd -i BridgeArmy.der certutil -C -c Navy -v 60 -d NavyDB -i BridgeReq.der -o BridgeNavy.der -f NavyDB/dbpasswd -7 Bridge@Navy certutil -A -n Bridge -t u,u,u -d BridgeDB -f BridgeDB/dbpasswd -i BridgeNavy.der 4. Generate user certificate signed by bridge CA. certutil -N -d UserDB -f UserDB/dbpasswd certutil -s "CN=User EE, O=User, C=US" -R -d UserDB -f UserDB/dbpasswd -z /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/tests_results/security/dopushups.11/tests_noise.19092 -o UserReq.der certutil -C -c Bridge -v 60 -d BridgeDB -i UserReq.der -o UserBridge.der -f BridgeDB/dbpasswd certutil -A -n User -t u,u,u -d UserDB -f UserDB/dbpasswd -i UserBridge.der 5. Create test DB and import there certs of all CAs (roots + both signed bridge certs). certutil -N -d AllDB -f AllDB/dbpasswd certutil -A -n Army -t "" -d AllDB -f AllDB/dbpasswd -i Army.der certutil -A -n Navy -t "" -d AllDB -f AllDB/dbpasswd -i Navy.der certutil -A -n Bridge -t "" -d AllDB -f AllDB/dbpasswd -i BridgeArmy.der certutil -A -n Bridge -t "" -d AllDB -f AllDB/dbpasswd -i BridgeNavy.der 6. Verify end user cert with test DB and trust one of root certs (I run this under valgrind tool on Linux) vfychain -d AllDB -pp -vv UserBridge.der -t Army /usr/bin/valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --partial-loads-ok=yes --leak-resolution=high --num-callers=50 /export/tinderlight/data/dopushups_32_DBG.last.1/mozilla/dist/Linux2.6_x86_glibc_PTH_DBG.OBJ/bin/vfychain -d AllDB -pp -vv UserBridge.der -t Army Leaks found: ==18734== 148 (60 direct, 88 indirect) bytes in 1 blocks are definitely lost in loss record 31 of 38 ==18734== at 0x40056BF: calloc (vg_replace_malloc.c:279) ==18734== by 0x420088C: PR_Calloc (prmem.c:474) ==18734== by 0x40F8000: nss_ZAlloc (arena.c:892) ==18734== by 0x40F7891: nssArena_Create (arena.c:412) ==18734== by 0x40EAF74: add_cert_to_cache (tdcache.c:794) ==18734== by 0x40EB183: nssTrustDomain_AddCertsToCache (tdcache.c:877) ==18734== by 0x40EE4BC: cert_createObject (pkibase.c:1061) ==18734== by 0x40EE0EC: nssPKIObjectCollection_GetObjects (pkibase.c:893) ==18734== by 0x40EE60E: nssPKIObjectCollection_GetCertificates (pkibase.c:1115) ==18734== by 0x40E9310: nssTrustDomain_FindCertificatesBySubject (trustdomain.c:677) ==18734== by 0x40E939B: NSSTrustDomain_FindCertificatesBySubject (trustdomain.c:702) ==18734== by 0x40E1DE3: CERT_CreateSubjectCertList (stanpcertdb.c:692) ==18734== by 0x418CFA4: pkix_pl_Pk11CertStore_CertQuery (pkix_pl_pk11certstore.c:238) ==18734== by 0x418DDCA: pkix_pl_Pk11CertStore_GetCert (pkix_pl_pk11certstore.c:532) ==18734== by 0x412EBB9: pkix_Build_GatherCerts (pkix_build.c:2072) ==18734== by 0x412FEDE: pkix_BuildForwardDepthFirstSearch (pkix_build.c:2626) ==18734== by 0x41362B4: pkix_Build_InitiateBuildChain (pkix_build.c:4260) ==18734== by 0x4136E97: PKIX_BuildChain (pkix_build.c:4428) ==18734== by 0x409A3FA: CERT_PKIXVerifyCert (certvfypkix.c:2155) ==18734== by 0x804BF0B: main (vfychain.c:505) ==18734== 88 bytes in 1 blocks are indirectly lost in loss record 32 of 38 ==18734== at 0x40056BF: calloc (vg_replace_malloc.c:279) ==18734== by 0x420088C: PR_Calloc (prmem.c:474) ==18734== by 0x42118FE: PR_NewLock (ptsynch.c:174) ==18734== by 0x40F78C4: nssArena_Create (arena.c:418) ==18734== by 0x40EAF74: add_cert_to_cache (tdcache.c:794) ==18734== by 0x40EB183: nssTrustDomain_AddCertsToCache (tdcache.c:877) ==18734== by 0x40EE4BC: cert_createObject (pkibase.c:1061) ==18734== by 0x40EE0EC: nssPKIObjectCollection_GetObjects (pkibase.c:893) ==18734== by 0x40EE60E: nssPKIObjectCollection_GetCertificates (pkibase.c:1115) ==18734== by 0x40E9310: nssTrustDomain_FindCertificatesBySubject (trustdomain.c:677) ==18734== by 0x40E939B: NSSTrustDomain_FindCertificatesBySubject (trustdomain.c:702) ==18734== by 0x40E1DE3: CERT_CreateSubjectCertList (stanpcertdb.c:692) ==18734== by 0x418CFA4: pkix_pl_Pk11CertStore_CertQuery (pkix_pl_pk11certstore.c:238) ==18734== by 0x418DDCA: pkix_pl_Pk11CertStore_GetCert (pkix_pl_pk11certstore.c:532) ==18734== by 0x412EBB9: pkix_Build_GatherCerts (pkix_build.c:2072) ==18734== by 0x412FEDE: pkix_BuildForwardDepthFirstSearch (pkix_build.c:2626) ==18734== by 0x41362B4: pkix_Build_InitiateBuildChain (pkix_build.c:4260) ==18734== by 0x4136E97: PKIX_BuildChain (pkix_build.c:4428) ==18734== by 0x409A3FA: CERT_PKIXVerifyCert (certvfypkix.c:2155) ==18734== by 0x804BF0B: main (vfychain.c:505) There were also found stacks related to bugs 367384 and 430544. I tried to reproduce this on Solaris, but DBX doesn't detect those leaks. Also experiments with using some certs on command line and some in DB shows, that leaks are found only when both bridge certs are in DB.
Slavo, were you testing with the environment variable NSS_DISABLE_ARENA_FREE_LIST set to 1 ? The leaks reported here are all arena leaks. I would expect this pattern of leaks in a process that was not using NSS_DISABLE_ARENA_FREE_LIST.
I reused functionality from previous memory leak testing scripts (memleak.sh), where is NSS_DISABLE_ARENA_FREE_LIST set to 1 in initialization function. You can expect this variable to be set now and also in future in memory leaks found by my testing scripts.
Reproduced on Solaris: Memory Leak (mel): Found leaked block of size 88 bytes at address 0x6e578 At time of allocation, the call stack was: [1] calloc() at 0xdf137488 [2] PR_Calloc() at line 475 in "prmem.c" [3] PR_NewLock() at line 174 in "ptsynch.c" [4] nssArena_Create() at line 418 in "arena.c" [5] add_cert_to_cache() at line 794 in "tdcache.c" [6] nssTrustDomain_AddCertsToCache() at line 877 in "tdcache.c" [7] cert_createObject() at line 1061 in "pkibase.c" [8] nssPKIObjectCollection_GetObjects() at line 893 in "pkibase.c" [9] nssPKIObjectCollection_GetCertificates() at line 1117 in "pkibase.c" [10] nssTrustDomain_FindCertificatesBySubject() at line 679 in "trustdomain.c" [11] NSSTrustDomain_FindCertificatesBySubject() at line 706 in "trustdomain.c" [12] CERT_CreateSubjectCertList() at line 696 in "stanpcertdb.c" [13] pkix_pl_Pk11CertStore_CertQuery() at line 242 in "pkix_pl_pk11certstore.c" [14] pkix_pl_Pk11CertStore_GetCert() at line 534 in "pkix_pl_pk11certstore.c" [15] pkix_Build_GatherCerts() at line 2078 in "pkix_build.c" [16] pkix_BuildForwardDepthFirstSearch() at line 2628 in "pkix_build.c" Memory Leak (mel): Found leaked block of size 60 bytes at address 0x6e428 At time of allocation, the call stack was: [1] calloc() at 0xdf137488 [2] PR_Calloc() at line 475 in "prmem.c" [3] nss_ZAlloc() at line 892 in "arena.c" [4] nssArena_Create() at line 412 in "arena.c" [5] add_cert_to_cache() at line 794 in "tdcache.c" [6] nssTrustDomain_AddCertsToCache() at line 877 in "tdcache.c" [7] cert_createObject() at line 1061 in "pkibase.c" [8] nssPKIObjectCollection_GetObjects() at line 893 in "pkibase.c" [9] nssPKIObjectCollection_GetCertificates() at line 1117 in "pkibase.c" [10] nssTrustDomain_FindCertificatesBySubject() at line 679 in "trustdomain.c" [11] NSSTrustDomain_FindCertificatesBySubject() at line 706 in "trustdomain.c" [12] CERT_CreateSubjectCertList() at line 696 in "stanpcertdb.c" [13] pkix_pl_Pk11CertStore_CertQuery() at line 242 in "pkix_pl_pk11certstore.c" [14] pkix_pl_Pk11CertStore_GetCert() at line 534 in "pkix_pl_pk11certstore.c" [15] pkix_Build_GatherCerts() at line 2078 in "pkix_build.c" [16] pkix_BuildForwardDepthFirstSearch() at line 2628 in "pkix_build.c"
OS: Linux → All
Hardware: PC → All
Slavo, is this a new test? Or is this a new leak in an old test ("old" being older than 90 days) ?
This appears to be an original "day 1" bug. I wonder why we never found this leak before now. That's why I asked the questions in comment 4. In a comment to follow this one, I will discuss the specifics of the bad functions. In this comment, I will discuss the general problem. The code responsible for the leak is in tdcache.c. It's bad logic spread across many functions, the functions that deal with allocating and destroying the "entry" objects. It's a bad design, in several ways. Any number from 0-3 objects try to share an nssArena. They all occupy space allocated from the arena. These objects may be independently destroyed. There is no reference counting of handles/references of the Arena. Instead, The logic to deallocate the Arena makes use of the fatally-flawed "owner" concept, where one of the objects that occupies it is considered to "own" the Arena, and when that object is destroyed, the Arena is destroyed. This makes for several problems. 1. If the "owner" object is destroyed before the other objects that are allocated from the arena, those other objects are deallocated while references to them remain. 2. The "owner" object may not use the arena at all, in which case, either: a) NO objects that hold references to the arena, and it will be leaked at the time it is created, or b) All the objects that hold references to the Arena are marked as NOT the owner of the Arena, so when they are all destroyed, the Arena is leaked. The solution to this problem requires completely eliminating the "owner" concept from the management of the nssArena. The solution is either: a) have a separate Arena for each "subject entry", "nickname entry" and "email entry" object. There is no "owner" flag. Each of those entry objects implicitly owns the arena in which it is allocated, and destroying that object destroys its arean. Allocation of the arena is done in each of the functions that allocate those "entry" objects, rather than in the function that calls those entry object allocators, or b) have the Arena be truly reference counted. Each "entry" object holds a reference. When each entry object is deleted, the reference count is decremented, and when it goes to zero, the Arena is deallocated. NSS has been plagued with problems with the use of this "owner" concept. No NSS developer should ever again attempt to put such code into NSS. No NSS reviewer should ever allow it.
I'm taking this bug. Even though it was triggered by a call from libPKIX, the flaw is not (primarily) in libPKIX. I DO wonder why we never saw this bug before. What is libPKIX doing differently than the rest of NSS that triggers this bug? Is that different behavior of libPKIX a bug? The problem is seen first and foremost in function add_cert_to_cache() in file tdcache.c, and in the functions that it calls: add_subject_entry and add_nickname_entry, and the function they call, new_cache_entry, all of which are also in tdcache.c. add_cert_to_cache creates an nssArena (line 794), which it then passes to add_subject_entry. add_subject_entry (line 560) looks to see if there is already a subject entry for this subject. If not, then at line 576 it calls new_cache_entry, passing the argument PR_TRUE, telling new_cache_entry to mark this object as the "owner" of the Arena. See line 143 for the code that marks the new cache entry as owner. HOWEVER, if add_subject_entry finds that there is already a subject entry for this subject (line 562) it reuses the existing subject entry, and makes NO USE of the nssArena that was passed to it. Consequently, there is NO object that is the "owner" of the nssArena in that case. When add_subject_entry returns, add_cert_to_cache may (or may not) call add_nickname_entry (line 809) passing the Arena to it. add_nickname_entry also calls new_cache_entry (line 625), passing it PR_FALSE, so that it will NOT be the owner of the Arena. The test at lines 805-806 is intended to ensure that we only call add_nickname_entry in the case where add_subject_entry returned a newly allocated subject entry, and not in the case where it returns a reused entry. I'm not yet convinced that that test is entirely correct. IFF the test at line 806 really does accurately detect a new entry, so that it is always false when add_subject_entry returns a reused entry, then a Quick-and-Dirty hack fix is to add an "else" onto the end of the big block of code in lines 806-836. That else block can simply destroy the nssArena that was allocated at line 794, since it has not been used. I will write a patch that implements that quick-and-dirty hack. But I really want to rid NSS of this pernicious "owner" concept.
Assignee: alexei.volkov.bugs → nbolyard
Priority: -- → P2
Target Milestone: 3.12.1 → 3.12.2
Bob, please reivew after reading the preceding comments in this bug. Thanks
Attachment #342289 - Flags: review?(rrelyea)
Comment on attachment 342289 [details] [diff] [review] patch, hack fix. v1 r+, though reluctantly. It seems to me that it would make more sense for add_subject_entry to free an unused arena (clearly it's adopting the arena in the other case). On the other hand it makes the most sense for add_subject_entry to create it's own arena. add_cert_to_cache can use the arena already created by add_subject_entry (though I would have to look at both functions to see what the heck add_cert_to_cache is allocating that should be part of the subject_entry). As to why we don't see this in NSS proper? It's pretty likely that we don't have a lot of instances where we are importing multiple certs with the same subject in normal NSS. It's much more common in PKIX. bob
Attachment #342289 - Flags: review?(rrelyea) → review+
(In reply to comment #4) > Slavo, is this a new test? > Or is this a new leak in an old test ("old" being older than 90 days) ? It's new test which we didn't run before.
Adding pattern to list of ignored leaks: #458905 **/nssArena_Create/**
(In reply to comment #10) > Adding pattern to list of ignored leaks: > > #458905 > **/nssArena_Create/** No, don't add that pattern. Remove it if you have already done so. That's too broad and will hide many more leaks than just this one. Use something like: **/add_cert_to_cache/nssTrustDomain_AddCertsToCache/cert_createObject/** or **/cert_createObject/nssTrustDomain_AddCertsToCache/add_cert_to_cache/** whichever order is correct.
Ok, removing old one, adding: **/cert_createObject/nssTrustDomain_AddCertsToCache/add_cert_to_cache/**
Adding pattern: **/cert_createObject/nssTrustDomain_AddCertsToCache/nssArena_Create/** Previous one was not matched in OPT builds in nightly tests.
Checking in tdcache.c; new revision: 1.48; previous revision: 1.47
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: