Closed
Bug 814438
Opened 12 years ago
Closed 8 years ago
Invalid Read of 4 bytes through bogus pointer in nssCertificate_Destroy (thunderbird)
Categories
(Core :: Security: PSM, defect)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: ishikawa, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(7 files)
While running thunderbird (built from comm-central) through "make mozmill" testing under valgrind, I noticed the following error in the output. The problem does not happen all the time. This is the first time I noticed this problem in about two dozen such test runs under valgrind (with varying timeout values). It is possible that this seems to be happening at a shutdown time, and maybe different threads stomping on some other task's thread for shut down processing of various data structure. Or maybe a genuine algorithmic bug. Looking at the bogus pointer value of 0xffffffd6, I have a feeling a code somewhere returns (-1) as invalid pointer or a filler for an invalid entry in a hash table, which for an unknown reason, the code picks it up as valid value, and try to use that as a valid pointer to access c in nssCertificate_Destroy (certificate.c) But if there is a race of threads, anything goes. Here is the relevant log: TEST-START | /TB-NEW/TB-3HG/new-src/mail/test/mozmill/search-window/test-multiple-search-windows.js | test_show_multiple_search_windows_for_the_same_folder Step Pass: {"function": "Controller.keypress()"} unsupported sort key: 17 unsupported sort key: 17 Step Pass: {"function": "Controller.keypress()"} unsupported sort key: 17 unsupported sort key: 17 ==26388== Invalid read of size 4 ==26388== at 0x47AAD70: nssCertificate_Destroy (certificate.c:102) ==26388== by 0x47AAE6E: NSSCertificate_Destroy (certificate.c:150) ==26388== by 0x47A800B: CERT_DestroyCertificate (stanpcertdb.c:795) ==26388== by 0x47A45A8: IssuerCache_Destroy (crl.c:1130) ==26388== by 0x47A4620: FreeIssuer (crl.c:1243) ==26388== by 0x403EAD9: PL_HashTableEnumerateEntries (plhash.c:374) ==26388== by 0x47A46CB: ShutdownCRLCache (crl.c:1308) ==26388== by 0x476A02F: nss_Shutdown (nssinit.c:1086) ==26388== by 0x476A168: NSS_Shutdown (nssinit.c:1149) ==26388== by 0x594CAD6: nsNSSComponent::ShutdownNSS() (nsNSSComponent.cpp:1866) ==26388== by 0x594D511: nsNSSComponent::DoProfileBeforeChange(nsISupports*) (nsNSSComponent.cpp:2549) ==26388== by 0x594D70B: nsNSSComponent::Observe(nsISupports*, char const*, unsigned short const*) (nsNSSComponent.cpp:2176) ==26388== by 0x6056C92: nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*) (nsObserverList.cpp:99) ==26388== by 0x6057205: nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) (nsObserverService.cpp:161) ==26388== by 0x4C35182: nsXREDirProvider::DoShutdown() (nsXREDirProvider.cpp:892) ==26388== by 0x4C2D6AB: ScopedXPCOMStartup::~ScopedXPCOMStartup() (nsAppRunner.cpp:1119) ==26388== by 0x4C3387F: XREMain::XRE_main(int, char**, nsXREAppData const*) (nsAppRunner.cpp:3915) ==26388== by 0x4C33AEC: XRE_main (nsAppRunner.cpp:4084) ==26388== by 0x41ADE45: (below main) (libc-start.c:228) ==26388== Address 0xffffffd6 is not stack'd, malloc'd or (recently) free'd ==26388== Relevant source snipets certificate.c 92 NSS_IMPLEMENT PRStatus 93 nssCertificate_Destroy ( 94 NSSCertificate *c 95 ) 96 { 97 nssCertificateStoreTrace lockTrace = {NULL, NULL, PR_FALSE, PR_FALSE}; 98 nssCertificateStoreTrace unlockTrace = {NULL, NULL, PR_FALSE, PR_FALSE}; 99 100 if (c) { 101 PRUint32 i; * 102 nssDecodedCert *dc = c->decoding; 103 NSSTrustDomain *td = STAN_GetDefaultTrustDomain(); certificate.c 145 NSS_IMPLEMENT PRStatus 146 NSSCertificate_Destroy ( 147 NSSCertificate *c 148 ) 149 { * 150 return nssCertificate_Destroy(c); 151 } 152 stanpcertdb.c 783 void 784 CERT_DestroyCertificate(CERTCertificate *cert) 785 { 786 if ( cert ) { 787 /* don't use STAN_GetNSSCertificate because we don't want to 788 * go to the trouble of translating the CERTCertificate into 789 * an NSSCertificate just to destroy it. If it hasn't been done 790 * yet, don't do it at all. 791 */ 792 NSSCertificate *tmp = cert->nssCertificate; 793 if (tmp) { 794 /* delete the NSSCertificate */ * 795 NSSCertificate_Destroy(tmp); 796 } else if (cert->arena) { crl.c 1089 /* destructor for CRL DPCache object */ 1090 static SECStatus DPCache_Destroy(CRLDPCache* cache) 1091 { 1092 PRUint32 i = 0; 1093 PORT_Assert(cache); [...] 1122 /* free the array of CRLs */ 1123 if (cache->crls) 1124 { 1125 PORT_Free(cache->crls); 1126 } 1127 /* destroy the cert */ 1128 if (cache->issuer) 1129 { *1130 CERT_DestroyCertificate(cache->issuer); 1131 } 1132 /* free the subject */ 1133 if (cache->subject) 1134 { crl.c 1228 /* callback function used in hash table destructor */ 1229 static PRIntn PR_CALLBACK FreeIssuer(PLHashEntry *he, PRIntn i, void *arg) 1230 { 1231 CRLIssuerCache* issuer = NULL; 1232 SECStatus* rv = (SECStatus*) arg; 1233 1234 PORT_Assert(he); 1235 if (!he) 1236 { 1237 return HT_ENUMERATE_NEXT; 1238 } 1239 issuer = (CRLIssuerCache*) he->value; 1240 PORT_Assert(issuer); 1241 if (issuer) 1242 { *1243 if (SECSuccess != IssuerCache_Destroy(issuer)) 1244 { 1245 PORT_Assert(rv); 1246 if (rv) 1247 { 1248 *rv = SECFailure; 1249 } 1250 return HT_ENUMERATE_NEXT; 1251 } 1252 } 1253 return HT_ENUMERATE_NEXT; 1254 } I am going to upload the full log to Bug 803816 - Valgrind warnings about uninitialised memory use (Thunderbird)
Reporter | ||
Comment 1•12 years ago
|
||
Further up the stack I see this code. ShutdownCRLCache (crl.c:1308) crl.c 1284 /* needs to be called at NSS shutdown time 1285 This will destroy the global CRL cache, including 1286 - the hash table of issuer cache objects 1287 - the issuer cache objects 1288 - DPCache objects in issuer cache objects */ 1289 SECStatus ShutdownCRLCache(void) 1290 { 1291 SECStatus rv = SECSuccess; 1292 if (PR_FALSE == crlcache_initialized && 1293 !crlcache.lock && !crlcache.issuers) 1294 { 1295 /* CRL cache has already been shut down */ 1296 return SECSuccess; 1297 } 1298 if (PR_TRUE == crlcache_initialized && 1299 (!crlcache.lock || !crlcache.issuers || !namedCRLCache.lock || 1300 !namedCRLCache.entries)) 1301 { 1302 /* CRL cache has partially been shut down */ 1303 PORT_SetError(SEC_ERROR_LIBRARY_FAILURE); 1304 return SECFailure; 1305 } 1306 /* empty the CRL cache */ 1307 /* free the issuers */ *1308 PL_HashTableEnumerateEntries(crlcache.issuers, &FreeIssuer, &rv); 1309 /* free the hash table of issuers */ 1310 PL_HashTableDestroy(crlcache.issuers); 1311 crlcache.issuers = NULL; 1312 /* free the global lock */ 1313 #ifdef GLOBAL_RWLOCK 1314 NSSRWLock_Destroy(crlcache.lock); 1315 #else 1316 PR_DestroyLock(crlcache.lock); 1317 #endif 1318 crlcache.lock = NULL; 1319 1320 /* empty the named CRL cache. This must be done after freeing the CRL 1321 * cache, since some CRLs in this cache are in the memory for the other */ 1322 /* free the entries */ 1323 PL_HashTableEnumerateEntries(namedCRLCache.entries, &FreeNamedEntries, &rv); 1324 /* free the hash table of issuers */ 1325 PL_HashTableDestroy(namedCRLCache.entries); 1326 namedCRLCache.entries = NULL; 1327 /* free the global lock */ 1328 PR_DestroyLock(namedCRLCache.lock); 1329 namedCRLCache.lock = NULL; 1330 1331 crlcache_initialized = PR_FALSE; 1332 return rv; 1333 } From the cursorly reading, I think we need a call to a lock function (?) a la #ifdef GLOBAL_RWLOCK NSSRWLock_LockRead(crlcache.lock); #else PR_Lock(crlcache.lock); #endif just before 1308 (if multiple threads are somehow running. But presumably there are? Or we don't need these locking scheme, correct?) And I am not sure if that ought to be against crlcache.lock even, but since it is destroyed immediately afterward in 1313, it seems likely the candidate. Sure we need to unlock it before destroying it. Oh yes, probably we need locking of namedCRLcache.lock at 1323 before we begin freeing the namedCRLcache entries. Don't forget to unlock it before destroying it. But I think someone in the know can make the final judgement. The locking scheme in this file is not well documented (for a casual passer-by). TIA
Updated•12 years ago
|
Component: Untriaged → Security: PSM
Product: Thunderbird → Core
Comment 2•12 years ago
|
||
bsmith, please can you take a look at this, given that it may be responsible for the #1 toporange, bug 761987.
Flags: needinfo?(bsmith)
Reporter | ||
Comment 3•12 years ago
|
||
For exploratory purposes, I added lock, unlock in two places. TB runs and does not crash, and I can terminate it without any bad effect. (Come to think of it, a few months ago, I experienced strange segmentation error on quiting firefox under linux (and sometimes still did in the last serveral weeks), and I wonder if this segmentation error is related to the bug reported here. - it happens on termination of TB (or FB, I presume.) Comment from people in the know will be appreciated.
Comment 4•12 years ago
|
||
I pushed this patch to try: https://tbpl.mozilla.org/?tree=Try&rev=e89771bec0cd It did not seem to cause any new failures. Unfortunately, there was a crash in nssCertificate_Destroy, still: https://tbpl.mozilla.org/php/getParsedLog.php?id=17433809&tree=Try&full=1#error0 (just like bug 761987).
Reporter | ||
Comment 5•12 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #4) > I pushed this patch to try: > > https://tbpl.mozilla.org/?tree=Try&rev=e89771bec0cd > > It did not seem to cause any new failures. > > Unfortunately, there was a crash in nssCertificate_Destroy, still: > > https://tbpl.mozilla.org/php/getParsedLog. > php?id=17433809&tree=Try&full=1#error0 > > (just like bug 761987). Thank you for trying. On my end, the valgrind tests simply seem to take loger to execute or something (?). But valgrind does not produce any warnings regarding this (but again, it happend only once.) So then, I can think of some hairy interaction in the hash-related functions (and miscompilation in ARM CPU? or even on X86. I am using x86 (intel CPU). Short of miscompilation, I think the hash-table itself is corrupted due to thread races (with insufficient lock/unlock, etc.). At least, on x86, the problem happened once clearly with possibly 0xFFFFFFF in one pointer value (which, I think, should have been 0, i.e., null.) Someone familiar with the code needs to inserts printf statements and run tests and see what happens when firefox or tb dies during testing.
Reporter | ||
Comment 6•12 years ago
|
||
I re-read the post: https://bugzilla.mozilla.org/show_bug.cgi?id=761987#c173 It seems then whichever thread that is executing the hash table manipulation code is doing this without acquiring the lock, I think. That is bad. I thought it was only the cleanup code that was missing the lock. But there are other places where the lock is missing obviously. That is why the above patch only was not enough.
Updated•12 years ago
|
Attachment #685854 -
Attachment is patch: true
Comment 7•12 years ago
|
||
Any possibility of using Helgrind to investigate the presence of races with and without the patch? In particular, to see if it makes any difference? If someone does this, I strongly suggest to use valgrind-3.8.1 on x86 or x86_64 linux, and use the --fair-sched=yes flag (gives more repeatable results with Helgrind).
Reporter | ||
Comment 8•12 years ago
|
||
TB Helgrind log with the patch. (Sorry I am uploading the log with the patch first.) Sorry I am only testing TB. With TB, I noticed that during a previous run that addon autocheck accesses some URLs and I saw NSS-related cert handling, and so accessed the addon menu, and then check if the plugins are all updated, then quit. (But due to the slowness of operaton under helgrind, I may have clicked a few incorrect menu entries, but we can at least identify some cert-related functions.) I will upload the TB helgrind log WITHOUT the patch. TAI
Reporter | ||
Comment 9•12 years ago
|
||
I ran TB without the patch under helgrind. But I had a problem. After I invoked help (which calls external browser), and checked the update of plug-in a few times, I tried to quit, but helgrind seemed to be in a state of resource starvation (maybe 32bit Linux cannot give more than 2GB of user space, and thus helgrind could not proceed any longer.) After waiting for about half an hour, I quit helgrind from a different terminal by "kill -HUP". The log is attached. I will upload another TB helgrind run where I only tested the plugin update twice. The next log is one for a complete shutdown.
Reporter | ||
Comment 10•12 years ago
|
||
This log captured the valgrind output while it ran TB (under 32bit Debian GNU/linux) without the patch was run and help was invoked once, and plugin check for update was performed twice, and TB was quit. This time, valgrind finished in a reasonable time. *BUT* note the strange line near the very end during the shutdown. ==28335== vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68 ==28335== I think, there are so many suspicious potential races and so forth, valgrind kept the huge in-memory structure to print the warning, and thus in the previous run, it ran out of memory and could not finish cleanly. In this log, valgrind could shutdow cleanly because the manual operation was not that complex and lengthy and it still had enough memory left. But the above unhandled instruction is a little bit to understand. Hope these logs might give some insight, but I am not sure such a small number of runs can uncover the bug caused by nondeterministic behavior. TIA PS: I have applied the patch to make GC friendly to helgrind and such. PPS: Oh yes, I noiced the following: but this happened with the patch here also. grep -h -4 waited ~/Dropbox/*helgrind-*.txt ==27130== by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478) ==27130== ==27130== ---------------------------------------------------------------- ==27130== ==27130== Thread #52: pthread_cond_destroy: destruction of condition variable being waited upon ==27130== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==27130== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==27130== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==27130== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==27130== by 0x595642D: mozilla::psm::(anonymous namespace)::AuthCertificate(mozilla::psm::TransportSecurityInfo*, CERTCertificateStr*) (SSLServerCertVerification.cpp:646) ==27130== ==27130== ---------------------------------------------------------------- ==27130== ==27130== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon ==27130== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==27130== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==27130== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==27130== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==28335== by 0x607D43A: nsComponentManagerImpl::CreateInstanceByContractID(char const*, nsISupports*, nsID const&, void**) (nsComponentManager.cpp:1031) ==28335== ==28335== ---------------------------------------------------------------- ==28335== ==28335== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon ==28335== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==28335== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==28335== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==28335== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==28335== by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478) ==28335== ==28335== ---------------------------------------------------------------- ==28335== ==28335== Thread #45: pthread_cond_destroy: destruction of condition variable being waited upon ==28335== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==28335== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==28335== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==28335== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==28335== by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478) ==28335== ==28335== ---------------------------------------------------------------- ==28335== ==28335== Thread #46: pthread_cond_destroy: destruction of condition variable being waited upon ==28335== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==28335== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==28335== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==28335== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==8015== by 0x1A78EA91: ??? ==8015== ==8015== ---------------------------------------------------------------- ==8015== ==8015== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon ==8015== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==8015== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==8015== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==8015== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==8015== by 0x595642D: mozilla::psm::(anonymous namespace)::AuthCertificate(mozilla::psm::TransportSecurityInfo*, CERTCertificateStr*) (SSLServerCertVerification.cpp:646) ==8015== ==8015== ---------------------------------------------------------------- ==8015== ==8015== Thread #53: pthread_cond_destroy: destruction of condition variable being waited upon ==8015== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==8015== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==8015== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==8015== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==8015== by 0x43DA15F: ??? ==8015== ==8015== ---------------------------------------------------------------- ==8015== ==8015== Thread #52: pthread_cond_destroy: destruction of condition variable being waited upon ==8015== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==8015== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==8015== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==8015== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) -- ==8015== by 0x4C69A09: nsIOService::GetInstance() (nsIOService.cpp:291) ==8015== ==8015== ---------------------------------------------------------------- ==8015== ==8015== Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon ==8015== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==8015== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==8015== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==8015== by 0x4719B06: PR_DestroyMonitor (ptsynch.c:551) -- TEST-START | /TB-NEW/TB-3HG/new-src/mail/test/mozmill/message-header/test-message-header.js | test_more_button_with_many_recipients Step Pass: {"function": "controller.click()"} ==13832== ---------------------------------------------------------------- ==13832== ==13832== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon ==13832== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==13832== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==13832== by 0x47191BA: PR_DestroyCondVar (ptsynch.c:340) ==13832== by 0x58C1A60: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) ishikawa@debian-vm:~/TB-NEW/TB-3HG/objdir-tb3/mozilla/dist/bin$
Comment 11•12 years ago
|
||
> ==28335==
> vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68
> ==28335==
Hmm, that's a bit strange. But .. assuming TB uses the same bunch of
JITs that FX uses .. won't we have to use --smc-check=all-non-file for
Helgrind, to make sure it runs correctly?
Reporter | ||
Comment 12•12 years ago
|
||
(In reply to Julian Seward from comment #11) > > ==28335== > > vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68 > > ==28335== > > Hmm, that's a bit strange. But .. assuming TB uses the same bunch of > JITs that FX uses .. won't we have to use --smc-check=all-non-file for > Helgrind, to make sure it runs correctly? Sorry, I thought --smc-check=all-non-file was meant for memcheck and did not specify it. Here is list the arguments to run "make mozmill" test under valgrind (memcheck). valgrind --trace-children=yes --smc-check=all-non-file --gen-suppressions=all --track-origins=yes --malloc-fill=0xA5 --free-fill=0xC3 --leak-check=full --num-callers=50 --suppressions=$HOME/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/cross-architecture.sup --suppressions=$HOME/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/i386-redhat-linux-gnu.sup --suppressions=$HOME/Dropbox/myown.sup --show-possibly-lost=no ~/TB-NEW/TB-3HG/objdir-tb3/mozilla/dist/bin/thunderbird-bin -profile /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile -jsbridge 24242 -foregroun Obviously, we do not need --trace-children (since the above calls the master script under which tb runs, we need this. This time I am invoking TB simply.) Do I need --track-origins? I will try.
Reporter | ||
Comment 13•12 years ago
|
||
(I am using TB built from comm-central repository) I am uploading a helgrind run of TB "make mozmill" test run with a very limited set of test target. I only left "search-window" in $MOZSRC/new-src/mail/test/mozmill/mozmilltests.list One of the uploaded logs is WITHOUT the patch, and the other is WITH the patch. I can not immediately see the big difference. I am afraid that --fair-sched=yes does not invoke the particular problematic order that caused a problem under x86. (We see a fewer number of read/write races, but I am not entirely sure if they are significant.) (I left a few warnings that appear near the beginning of the startup. One is obviously the GLX problem, I think, but I am not sure why GLX allows the call of pthread_barrier_init with count = 0. This has been going on for a long time for TB at least, I think. Also, a few messages related to the dynamic loading of key synchronization functions. I also see a lot of message/event queue related warnings, they may be genuine, or OK, but even if so, I wonder if we can use a lock/unlock without slowing down TB, and obtains a clean log.) I *do see* the disturbing massages about "destruction of condition variable being waited" in both cases. This may cause the problem in question. I suspect so since one of the places where this warning appears is in ==16069== Thread #47: pthread_cond_destroy: destruction of condition variable being waited upon ==16069== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==16069== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==16069== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==16069== by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56) ==16069== by 0x5947D82: nsHTTPListener::Release() (nsNSSCallbacks.cpp:536) ==16069== by 0x594755A: nsNSSHttpRequestSession::~nsNSSHttpRequestSession() (nsCOMPtr.h:410) ==16069== by 0x4803FFA: PKIX_PL_Object_DecRef (pkix_pl_object.c:891) and IMMEDIATELY AFTER THIS we see ==16069== Thread #47: pthread_cond_destroy: destruction of unknown cond var ==16069== at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940) ==16069== by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958) ==16069== by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372) ==16069== by 0x47ABE12: nssCertificate_Destroy (certificate.c:128) ==16069== by 0x47ABE6E: NSSCertificate_Destroy (certificate.c:150) ==16069== by 0x47A900B: CERT_DestroyCertificate (stanpcertdb.c:795) ==16069== by 0x47EB163: pkix_pl_Cert_Destroy (pkix_pl_cert.c:1167) See nssCertificate_Destroy() appears in the stack. So I suspect there *may be* a relation to the issue we have here. I have run the "make mozmill" with its all targets under helgrind, but the log is almost 16MB, and so I am not uploading it unless someone requests them (logs with or without patch. [log with the patch is still being collected. It takes more than to 12 hours finish...]) --- Some statistical observation from the log (without checking the logical consistency, etc.) ==================== LOG WITH PATCH. ==================== (grep Helgrind: LOGFILE | sort | uniq -c) 23 Helgrind:LockOrder 17 Helgrind:Misc 1 Helgrind:PthAPIerror 122 Helgrind:Race destruction of condition variable being waited (not sure why the thread number is different from the one without the patch below.) Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon Thread #49: pthread_cond_destroy: destruction of condition variable being waited upon Thread #50: pthread_cond_destroy: destruction of condition variable being waited upon Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon destruction of unknown cond var grep "destruction of unknown cond var" tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l 11 Potential read races? grep "during read of" tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l 27 Potential write races? grep "during write of" tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l 95 ==================== * LOG WITHOUT PATCH * ==================== (grep Helgrind: LOGFILE | sort | uniq -c) 25 Helgrind:LockOrder 17 Helgrind:Misc 1 Helgrind:PthAPIerror 134 Helgrind:Race destruction of condition variable being waited (not sure why the thread number is different from the one with the patch above.) Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon Thread #47: pthread_cond_destroy: destruction of condition variable being waited upon Thread #48: pthread_cond_destroy: destruction of condition variable being waited upon Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon destruction of unknown cond var grep "destruction of unknown cond var" tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l 11 Potential read races? grep "during read of" tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l 33 Potential write races? grep "during write of" tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l 101 TIA
Reporter | ||
Comment 14•12 years ago
|
||
This is the log from the patched thunderbird. I think we really need to exercise the SSL related access using https:, but I am not sure how to create such a test script (accessing the plug-in update check would be enough, and possibly calling HELP from within TB, which calls an external browser, seems to do the trick.) TIA
Reporter | ||
Comment 15•12 years ago
|
||
>I have run the "make mozmill" with its all targets under helgrind, but
>the log is almost 16MB, and so I am not uploading it unless someone
>requests them (logs with or without patch. [log with the patch is still being >collected. It takes more than to 12 hours finish...])
I found I could suppress "Helgrind:PthAPIerror"
by setting
LP_NUM_THREADS=2; export LP_NUM_THREADS
before running Thunderbird. I am not sure why GLX is compiled this way, but
I digress.
I ran "make mozmill" because I wanted to give a baseline comparison of thread-friendliness (or not) of TB with known user actions.
However, it is not quite aggressive enough (--fair-sched=yes), I am afraid.
Still here are comparison of the logs. Not much difference between the two from
generic statistical observation.
[WITHOUT THE PATCH]
ishikawa@debian-vm:/tmp/tachyon_insp_xe$ sh ~/Dropbox/t-summary.sh /home/ishikawa/Dropbox/tb-val32-helgrind-1203-WITHOUT-patch-log-take-1.txt
Analyzing /home/ishikawa/Dropbox/tb-val32-helgrind-1203-WITHOUT-patch-log-take-1.txt ...
========================================
Summary
========================================
720 Helgrind:LockOrder
565 Helgrind:Misc
2945 Helgrind:Race
========================================
destruction of condition variable being waited upon.
========================================
60
========================================
destruction of unknown cond var
========================================
470
========================================
Potential read races
========================================
663
========================================
Potential write races
========================================
2282
[WITH THE PATCH]
$ sh ~/Dropbox/t-summary.sh /home/ishikawa/Dropbox/tb-val32-helgrind-1202-with-patch-log-take-1.txt
Analyzing /home/ishikawa/Dropbox/tb-val32-helgrind-1202-with-patch-log-take-1.txt ...
========================================
Summary
========================================
715 Helgrind:LockOrder
517 Helgrind:Misc
2971 Helgrind:Race
========================================
destruction of condition variable being waited upon.
========================================
61
========================================
destruction of unknown cond var
========================================
421
========================================
Potential read races
========================================
640
========================================
Potential write races
========================================
2331
Maybe with the patch, we have less " destruction of unknown cond var ".
t-summary.sh is a a script:
:
#
if [ -s $1 ]
then
echo "Analyzing $1 ..."
else
echo "File $1 missing. Quitting..."
exit 1
fi
echo " ========================================"
echo " Summary"
echo " ========================================"
grep Helgrind: $1 | sort | uniq -c
echo " ========================================"
echo " destruction of condition variable being waited upon. "
echo " ========================================"
grep "condition variable being waited" $1 | wc -l
echo " ========================================"
echo " destruction of unknown cond var "
echo " ========================================"
grep "destruction of unknown cond var" $1 | wc -l
echo " ========================================"
echo " Potential read races "
echo " ========================================"
grep "during read of" $1 | wc -l
echo " ========================================"
echo " Potential write races "
echo " ========================================"
grep "during write of" $1 | wc -l
PS: I think we really need someone who knows inside SSL related functions to take a look at this problem.
Reporter | ||
Comment 16•12 years ago
|
||
I added a few more lock/unlock pairs in places where I thought the proper lock/unlock was missing. make mozmill of thunderbird using this patch and under helgrind seems to run OK without ill-effect. I think we need more proper lock and unlock in plarena.c and certificate.c needs better comments and a good overall design document about how the locking is done to protect various data structure. Not easy to deduce this from a cursory reading. Is there such design document ?
Reporter | ||
Updated•12 years ago
|
Reporter | ||
Comment 18•10 years ago
|
||
For those who are interested in thread races in mozilla software, the helgrind today has better handling of the destruction of condition variable, and so we may get a better, i.e., clean log that helps us to identify really problematic issues. I now run 64-bit linux and so might want to run it but given it takes almost half a day to finish the test, I may not be able to do it until some long holidays like the end of the year/new year. TIA
Gecko doesn't use CRLs any longer, so I don't think the crash from comment 0 is an issue now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Reporter | ||
Comment 20•8 years ago
|
||
(In reply to David Keeler [:keeler] (use needinfo?) from comment #19) > Gecko doesn't use CRLs any longer, so I don't think the crash from comment 0 > is an issue now. Just curious, if CRL is not used, what alternative is used now? OCSP, perhaps? TIA.
OCSP and a curated revocation list we're calling OneCRL: https://blog.mozilla.org/security/2015/03/03/revoking-intermediate-certificates-introducing-onecrl/
You need to log in
before you can comment on or make changes to this bug.
Description
•