Closed Bug 814438 Opened 12 years ago Closed 8 years ago

Invalid Read of 4 bytes through bogus pointer in nssCertificate_Destroy (thunderbird)

Categories

(Core :: Security: PSM, defect)

x86
Linux
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: ishikawa, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

While running thunderbird (built from comm-central) through
"make mozmill" testing under valgrind, I noticed the following error in the output.

The problem does not happen all the time. This is the first time I noticed this problem in about two dozen such test runs under valgrind (with varying timeout values).

It is possible that this seems to be happening at a shutdown time, and
maybe different threads stomping on some other task's thread for shut
down processing of various data structure.
Or maybe a genuine algorithmic bug.

Looking at the bogus pointer value of 0xffffffd6, I have a feeling a code somewhere returns (-1) as invalid pointer or a filler for an invalid entry in a hash table, which for an unknown reason, the code
picks it up as valid value, and try to use that as a valid pointer to access c in  nssCertificate_Destroy (certificate.c)

But if there is a race of threads, anything goes.

Here is the relevant log:

TEST-START | /TB-NEW/TB-3HG/new-src/mail/test/mozmill/search-window/test-multiple-search-windows.js | test_show_multiple_search_windows_for_the_same_folder
Step Pass: {"function": "Controller.keypress()"}
unsupported sort key: 17
unsupported sort key: 17
Step Pass: {"function": "Controller.keypress()"}
unsupported sort key: 17
unsupported sort key: 17
==26388== Invalid read of size 4
==26388==    at 0x47AAD70: nssCertificate_Destroy (certificate.c:102)
==26388==    by 0x47AAE6E: NSSCertificate_Destroy (certificate.c:150)
==26388==    by 0x47A800B: CERT_DestroyCertificate (stanpcertdb.c:795)
==26388==    by 0x47A45A8: IssuerCache_Destroy (crl.c:1130)
==26388==    by 0x47A4620: FreeIssuer (crl.c:1243)
==26388==    by 0x403EAD9: PL_HashTableEnumerateEntries (plhash.c:374)
==26388==    by 0x47A46CB: ShutdownCRLCache (crl.c:1308)
==26388==    by 0x476A02F: nss_Shutdown (nssinit.c:1086)
==26388==    by 0x476A168: NSS_Shutdown (nssinit.c:1149)
==26388==    by 0x594CAD6: nsNSSComponent::ShutdownNSS() (nsNSSComponent.cpp:1866)
==26388==    by 0x594D511: nsNSSComponent::DoProfileBeforeChange(nsISupports*) (nsNSSComponent.cpp:2549)
==26388==    by 0x594D70B: nsNSSComponent::Observe(nsISupports*, char const*, unsigned short const*) (nsNSSComponent.cpp:2176)
==26388==    by 0x6056C92: nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*) (nsObserverList.cpp:99)
==26388==    by 0x6057205: nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) (nsObserverService.cpp:161)
==26388==    by 0x4C35182: nsXREDirProvider::DoShutdown() (nsXREDirProvider.cpp:892)
==26388==    by 0x4C2D6AB: ScopedXPCOMStartup::~ScopedXPCOMStartup() (nsAppRunner.cpp:1119)
==26388==    by 0x4C3387F: XREMain::XRE_main(int, char**, nsXREAppData const*) (nsAppRunner.cpp:3915)
==26388==    by 0x4C33AEC: XRE_main (nsAppRunner.cpp:4084)
==26388==    by 0x41ADE45: (below main) (libc-start.c:228)
==26388==  Address 0xffffffd6 is not stack'd, malloc'd or (recently) free'd
==26388==

Relevant source snipets
certificate.c
   92	NSS_IMPLEMENT PRStatus
   93	nssCertificate_Destroy (
   94	  NSSCertificate *c
   95	)
   96	{
   97	    nssCertificateStoreTrace lockTrace = {NULL, NULL, PR_FALSE, PR_FALSE};
   98	    nssCertificateStoreTrace unlockTrace = {NULL, NULL, PR_FALSE, PR_FALSE};
   99	     
  100	    if (c) {
  101		PRUint32 i;
* 102		nssDecodedCert *dc = c->decoding;
  103		NSSTrustDomain *td = STAN_GetDefaultTrustDomain();

certificate.c
  145	NSS_IMPLEMENT PRStatus
  146	NSSCertificate_Destroy (
  147	  NSSCertificate *c
  148	)
  149	{
* 150	    return nssCertificate_Destroy(c);
  151	}
  152	


stanpcertdb.c
  783	void
  784	CERT_DestroyCertificate(CERTCertificate *cert)
  785	{
  786	    if ( cert ) {
  787		/* don't use STAN_GetNSSCertificate because we don't want to
  788		 * go to the trouble of translating the CERTCertificate into
  789		 * an NSSCertificate just to destroy it.  If it hasn't been done
  790		 * yet, don't do it at all.
  791		 */
  792		NSSCertificate *tmp = cert->nssCertificate;
  793		if (tmp) {
  794		    /* delete the NSSCertificate */
* 795		    NSSCertificate_Destroy(tmp);
  796		} else if (cert->arena) {


 crl.c
 1089	/* destructor for CRL DPCache object */
 1090	static SECStatus DPCache_Destroy(CRLDPCache* cache)
 1091	{
 1092	    PRUint32 i = 0;
 1093	    PORT_Assert(cache);
     [...]

 1122	    /* free the array of CRLs */
 1123	    if (cache->crls)
 1124	    {
 1125		PORT_Free(cache->crls);
 1126	    }
 1127	    /* destroy the cert */
 1128	    if (cache->issuer)
 1129	    {
*1130	        CERT_DestroyCertificate(cache->issuer);
 1131	    }
 1132	    /* free the subject */
 1133	    if (cache->subject)
 1134	    {


 crl.c
 1228	/* callback function used in hash table destructor */
 1229	static PRIntn PR_CALLBACK FreeIssuer(PLHashEntry *he, PRIntn i, void *arg)
 1230	{
 1231	    CRLIssuerCache* issuer = NULL;
 1232	    SECStatus* rv = (SECStatus*) arg;
 1233	
 1234	    PORT_Assert(he);
 1235	    if (!he)
 1236	    {
 1237	        return HT_ENUMERATE_NEXT;
 1238	    }
 1239	    issuer = (CRLIssuerCache*) he->value;
 1240	    PORT_Assert(issuer);
 1241	    if (issuer)
 1242	    {
*1243	        if (SECSuccess != IssuerCache_Destroy(issuer))
 1244	        {
 1245	            PORT_Assert(rv);
 1246	            if (rv)
 1247	            {
 1248	                *rv = SECFailure;
 1249	            }
 1250	            return HT_ENUMERATE_NEXT;
 1251	        }
 1252	    }
 1253	    return HT_ENUMERATE_NEXT;
 1254	}


I am going to upload the full log to  Bug 803816 - Valgrind warnings about uninitialised memory use (Thunderbird)
Blocks: 803816
Further up the stack I see this code.
ShutdownCRLCache (crl.c:1308)
crl.c

 1284	/* needs to be called at NSS shutdown time
 1285	   This will destroy the global CRL cache, including 
 1286	   - the hash table of issuer cache objects
 1287	   - the issuer cache objects
 1288	   - DPCache objects in issuer cache objects */
 1289	SECStatus ShutdownCRLCache(void)
 1290	{
 1291	    SECStatus rv = SECSuccess;
 1292	    if (PR_FALSE == crlcache_initialized &&
 1293	        !crlcache.lock && !crlcache.issuers)
 1294	    {
 1295	        /* CRL cache has already been shut down */
 1296	        return SECSuccess;
 1297	    }
 1298	    if (PR_TRUE == crlcache_initialized &&
 1299	        (!crlcache.lock || !crlcache.issuers || !namedCRLCache.lock ||
 1300	         !namedCRLCache.entries))
 1301	    {
 1302	        /* CRL cache has partially been shut down */
 1303	        PORT_SetError(SEC_ERROR_LIBRARY_FAILURE);
 1304	        return SECFailure;
 1305	    }
 1306	    /* empty the CRL cache */
 1307	    /* free the issuers */
*1308	    PL_HashTableEnumerateEntries(crlcache.issuers, &FreeIssuer, &rv);
 1309	    /* free the hash table of issuers */
 1310	    PL_HashTableDestroy(crlcache.issuers);
 1311	    crlcache.issuers = NULL;
 1312	    /* free the global lock */
 1313	#ifdef GLOBAL_RWLOCK
 1314	    NSSRWLock_Destroy(crlcache.lock);
 1315	#else
 1316	    PR_DestroyLock(crlcache.lock);
 1317	#endif
 1318	    crlcache.lock = NULL;
 1319	
 1320	    /* empty the named CRL cache. This must be done after freeing the CRL
 1321	     * cache, since some CRLs in this cache are in the memory for the other  */
 1322	    /* free the entries */
 1323	    PL_HashTableEnumerateEntries(namedCRLCache.entries, &FreeNamedEntries, &rv);
 1324	    /* free the hash table of issuers */
 1325	    PL_HashTableDestroy(namedCRLCache.entries);
 1326	    namedCRLCache.entries = NULL;
 1327	    /* free the global lock */
 1328	    PR_DestroyLock(namedCRLCache.lock);
 1329	    namedCRLCache.lock = NULL;
 1330	
 1331	    crlcache_initialized = PR_FALSE;
 1332	    return rv;
 1333	}

From the cursorly reading, I think we need a call to a lock function (?) a la

#ifdef GLOBAL_RWLOCK
    NSSRWLock_LockRead(crlcache.lock);
#else
    PR_Lock(crlcache.lock);
#endif

just before 1308 (if multiple threads are somehow running. But presumably there are? Or we don't need these locking scheme, correct?)

And I am not sure if that ought to be against crlcache.lock even, but since
it is destroyed immediately afterward in 1313, it seems likely the candidate.
Sure we need to unlock it before destroying it.


Oh yes, probably we need locking of namedCRLcache.lock at 1323 before we begin
freeing the namedCRLcache entries. Don't forget to unlock it before destroying it.

But I think someone in the know can make the final judgement.

The locking scheme in this file is not well documented (for a casual passer-by).

TIA
Component: Untriaged → Security: PSM
Product: Thunderbird → Core
bsmith, please can you take a look at this, given that it may be responsible for the #1 toporange, bug 761987.
Flags: needinfo?(bsmith)
For exploratory purposes, I added lock, unlock in two places.
TB runs and does not crash, and I can terminate it without any bad effect.

(Come to think of it, a few months ago, I experienced strange segmentation error on quiting firefox under linux (and sometimes still did in the last serveral weeks), and I wonder if this segmentation error is related to the bug reported here.
 - it happens on termination of TB (or FB, I presume.)

Comment from people in the know will be appreciated.
I pushed this patch to try:

https://tbpl.mozilla.org/?tree=Try&rev=e89771bec0cd

It did not seem to cause any new failures.

Unfortunately, there was a crash in nssCertificate_Destroy, still:

https://tbpl.mozilla.org/php/getParsedLog.php?id=17433809&tree=Try&full=1#error0

(just like bug 761987).
(In reply to Geoff Brown [:gbrown] from comment #4)
> I pushed this patch to try:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=e89771bec0cd
> 
> It did not seem to cause any new failures.
> 
> Unfortunately, there was a crash in nssCertificate_Destroy, still:
> 
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=17433809&tree=Try&full=1#error0
> 
> (just like bug 761987).

Thank you for trying. 
On my end, the valgrind tests simply seem to take loger to execute or something (?). But valgrind does not produce any warnings regarding this (but again, it happend only once.)

So then, I can think of some hairy interaction in the hash-related functions (and miscompilation in ARM CPU? or even on X86.

I am using x86 (intel CPU).

Short of miscompilation, I think the hash-table itself is corrupted due to
thread races (with insufficient lock/unlock, etc.).

At least, on x86, the problem happened once clearly with possibly 0xFFFFFFF in
one pointer value (which, I think, should have been 0, i.e., null.)

Someone familiar with the code needs to inserts printf statements  and run tests and see what happens when firefox or tb dies during testing.
I re-read the post:

https://bugzilla.mozilla.org/show_bug.cgi?id=761987#c173

It seems then whichever thread that is executing the hash table manipulation
code is doing this without acquiring the lock, I think. That is bad.

I thought it was only the cleanup code that was missing the lock.
But there are other places where the lock is missing obviously.

That is why the above patch only was not enough.
Any possibility of using Helgrind to investigate the presence of races
with and without the patch?  In particular, to see if it makes any
difference?

If someone does this, I strongly suggest to use valgrind-3.8.1 on 
x86 or x86_64 linux, and use the --fair-sched=yes flag (gives more
repeatable results with Helgrind).
TB Helgrind log with the patch. (Sorry I am uploading the log with the patch first.)
Sorry I am only testing TB.
With TB, I noticed that during a previous run that addon autocheck accesses some URLs and I saw NSS-related cert handling, and so accessed the addon menu, and then check if the plugins are all updated, then quit.
(But due to the slowness of operaton under helgrind, I may have clicked a few incorrect menu entries, but we can at least identify some cert-related functions.)

I will upload the TB helgrind log WITHOUT the patch.
TAI
I ran TB without the patch under helgrind.
But I had a problem. After I invoked help (which calls external browser), and
checked the update of plug-in a few times, I tried to quit, but 
helgrind seemed to be in a state of resource starvation (maybe 32bit Linux cannot give more than 2GB of user space, and thus helgrind could not proceed any longer.)
After waiting for about half an hour, I quit helgrind from a different terminal by "kill -HUP".

The log is attached. I will upload another TB helgrind run where I only tested the plugin update twice. The next log is one for a complete shutdown.
This log captured the valgrind output while it ran TB (under 32bit Debian GNU/linux) without the patch was run and help was invoked once, and plugin check for update was performed twice, and TB was quit.

This time, valgrind finished in a reasonable time.

*BUT* note the strange line near the very end during the shutdown.

==28335==
vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68
==28335==

I think, there are so many suspicious potential races and so forth, valgrind kept the huge in-memory structure to print the warning, and thus in the previous run, it ran out of memory and could not finish cleanly.

In this log, valgrind could shutdow cleanly because the manual operation was not that complex and lengthy and it still had enough memory left.

But the above unhandled instruction is a little bit to understand.

Hope these logs might give some insight, but I am not sure such a small number of runs can uncover the bug caused by nondeterministic behavior. 

TIA

PS: I have applied the patch to make GC friendly to helgrind and such.

PPS: Oh yes, I noiced 
the following: but this happened with the patch here also.

 grep -h -4 waited ~/Dropbox/*helgrind-*.txt
==27130==    by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478)
==27130==
==27130== ----------------------------------------------------------------
==27130==
==27130== Thread #52: pthread_cond_destroy: destruction of condition variable being waited upon
==27130==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==27130==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==27130==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==27130==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==27130==    by 0x595642D: mozilla::psm::(anonymous namespace)::AuthCertificate(mozilla::psm::TransportSecurityInfo*, CERTCertificateStr*) (SSLServerCertVerification.cpp:646)
==27130==
==27130== ----------------------------------------------------------------
==27130==
==27130== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==27130==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==27130==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==27130==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==27130==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==28335==    by 0x607D43A: nsComponentManagerImpl::CreateInstanceByContractID(char const*, nsISupports*, nsID const&, void**) (nsComponentManager.cpp:1031)
==28335==
==28335== ----------------------------------------------------------------
==28335==
==28335== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==28335==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==28335==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==28335==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==28335==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==28335==    by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478)
==28335==
==28335== ----------------------------------------------------------------
==28335==
==28335== Thread #45: pthread_cond_destroy: destruction of condition variable being waited upon
==28335==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==28335==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==28335==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==28335==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==28335==    by 0x4882949: ssl3_HandleHandshakeMessage (ssl3con.c:8478)
==28335==
==28335== ----------------------------------------------------------------
==28335==
==28335== Thread #46: pthread_cond_destroy: destruction of condition variable being waited upon
==28335==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==28335==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==28335==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==28335==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==8015==    by 0x1A78EA91: ???
==8015==
==8015== ----------------------------------------------------------------
==8015==
==8015== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==8015==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==8015==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==8015==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==8015==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==8015==    by 0x595642D: mozilla::psm::(anonymous namespace)::AuthCertificate(mozilla::psm::TransportSecurityInfo*, CERTCertificateStr*) (SSLServerCertVerification.cpp:646)
==8015==
==8015== ----------------------------------------------------------------
==8015==
==8015== Thread #53: pthread_cond_destroy: destruction of condition variable being waited upon
==8015==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==8015==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==8015==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==8015==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==8015==    by 0x43DA15F: ???
==8015==
==8015== ----------------------------------------------------------------
==8015==
==8015== Thread #52: pthread_cond_destroy: destruction of condition variable being waited upon
==8015==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==8015==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==8015==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==8015==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
--
==8015==    by 0x4C69A09: nsIOService::GetInstance() (nsIOService.cpp:291)
==8015==
==8015== ----------------------------------------------------------------
==8015==
==8015== Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon
==8015==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==8015==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==8015==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==8015==    by 0x4719B06: PR_DestroyMonitor (ptsynch.c:551)
--
TEST-START | /TB-NEW/TB-3HG/new-src/mail/test/mozmill/message-header/test-message-header.js | test_more_button_with_many_recipients
Step Pass: {"function": "controller.click()"}
==13832== ----------------------------------------------------------------
==13832==
==13832== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==13832==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==13832==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==13832==    by 0x47191BA: PR_DestroyCondVar (ptsynch.c:340)
==13832==    by 0x58C1A60: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
ishikawa@debian-vm:~/TB-NEW/TB-3HG/objdir-tb3/mozilla/dist/bin$
> ==28335==
> vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68
> ==28335==

Hmm, that's a bit strange.  But .. assuming TB uses the same bunch of
JITs that FX uses .. won't we have to use --smc-check=all-non-file for
Helgrind, to make sure it runs correctly?
(In reply to Julian Seward from comment #11)
> > ==28335==
> > vex x86->IR: unhandled instruction bytes: 0x6C 0x89 0x55 0x68
> > ==28335==
> 
> Hmm, that's a bit strange.  But .. assuming TB uses the same bunch of
> JITs that FX uses .. won't we have to use --smc-check=all-non-file for
> Helgrind, to make sure it runs correctly?

Sorry, I thought --smc-check=all-non-file was meant for memcheck and did
not specify it.

Here is list the arguments to run "make mozmill" test under valgrind 
(memcheck).

valgrind --trace-children=yes --smc-check=all-non-file --gen-suppressions=all --track-origins=yes --malloc-fill=0xA5 --free-fill=0xC3 --leak-check=full --num-callers=50 --suppressions=$HOME/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/cross-architecture.sup --suppressions=$HOME/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/i386-redhat-linux-gnu.sup --suppressions=$HOME/Dropbox/myown.sup --show-possibly-lost=no  ~/TB-NEW/TB-3HG/objdir-tb3/mozilla/dist/bin/thunderbird-bin -profile /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile -jsbridge 24242 -foregroun

Obviously, we do not need
--trace-children (since the above calls the master script under which
tb runs, we need this. This time I am invoking TB simply.)

Do I need --track-origins?

I will try.
(I am using TB built from comm-central repository)

I am uploading a helgrind run of TB "make mozmill" test run with a
very limited set of test target.  I only left "search-window" in
$MOZSRC/new-src/mail/test/mozmill/mozmilltests.list

One of the uploaded logs is WITHOUT the patch, and the other is WITH
the patch.

I can not immediately see the big difference.
I am afraid that --fair-sched=yes does not
invoke the particular problematic order that
caused a problem under x86.
(We see a fewer number of read/write races, but I am not entirely sure
if they are significant.)

(I left a few warnings that appear near the beginning of the startup.
One is obviously the GLX problem, I think, but I am not sure why
GLX allows the call of pthread_barrier_init with count = 0.
This has been going on for a long time for TB at least, I think.
Also, a few messages related to the dynamic loading of key 
synchronization functions.
I also see a lot of message/event queue related warnings, they may be genuine, or 
OK, but even if so, I wonder if we can use a lock/unlock without 
slowing down TB, and obtains a clean log.)

I *do see* the disturbing massages about
"destruction of condition variable being waited"
in both cases.
This may cause the problem in question. 
I suspect so since one of the places where this warning appears
is in 

==16069== Thread #47: pthread_cond_destroy: destruction of condition variable being waited upon
==16069==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==16069==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==16069==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==16069==    by 0x5947C40: nsHTTPListener::~nsHTTPListener() (CondVar.h:56)
==16069==    by 0x5947D82: nsHTTPListener::Release() (nsNSSCallbacks.cpp:536)
==16069==    by 0x594755A: nsNSSHttpRequestSession::~nsNSSHttpRequestSession() (nsCOMPtr.h:410)
==16069==    by 0x4803FFA: PKIX_PL_Object_DecRef (pkix_pl_object.c:891)


and IMMEDIATELY AFTER THIS we see

==16069== Thread #47: pthread_cond_destroy: destruction of unknown cond var
==16069==    at 0x4027A7F: pthread_cond_destroy_WRK (hg_intercepts.c:940)
==16069==    by 0x4029781: pthread_cond_destroy@* (hg_intercepts.c:958)
==16069==    by 0x47193BA: PR_DestroyCondVar (ptsynch.c:372)
==16069==    by 0x47ABE12: nssCertificate_Destroy (certificate.c:128)
==16069==    by 0x47ABE6E: NSSCertificate_Destroy (certificate.c:150)
==16069==    by 0x47A900B: CERT_DestroyCertificate (stanpcertdb.c:795)
==16069==    by 0x47EB163: pkix_pl_Cert_Destroy (pkix_pl_cert.c:1167)

See nssCertificate_Destroy() appears in the stack.
So I suspect there *may be* a relation to the issue we have here.


I have run the "make mozmill" with its all targets under helgrind, but
the log is almost 16MB, and so I am not uploading it unless someone
requests them (logs with or without patch. [log with the patch is still being collected. It takes more than to 12 hours finish...])
 
---

Some statistical observation from the log (without
checking the logical consistency, etc.)

====================
LOG WITH PATCH. 
====================

(grep Helgrind: LOGFILE | sort | uniq -c)

     23    Helgrind:LockOrder
     17    Helgrind:Misc
      1    Helgrind:PthAPIerror
    122    Helgrind:Race

destruction of condition variable being waited
(not sure why the thread number is different from the one without the patch below.)
Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #49: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #50: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon

destruction of unknown cond var
grep  "destruction of unknown cond var"  tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l
11

Potential read races?
grep  "during read of"  tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l
27

Potential write races?
grep  "during write of"  tb-val32-helgrind-1201-with-patch-take-1-log.txt | wc -l
95

====================
* LOG WITHOUT PATCH *
====================
(grep Helgrind: LOGFILE | sort | uniq -c)
     25    Helgrind:LockOrder
     17    Helgrind:Misc
      1    Helgrind:PthAPIerror
    134    Helgrind:Race

destruction of condition variable being waited
(not sure why the thread number is different from the one with the patch above.)
Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #47: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #48: pthread_cond_destroy: destruction of condition variable being waited upon
Thread #4: pthread_cond_destroy: destruction of condition variable being waited upon

destruction of unknown cond var
grep  "destruction of unknown cond var"  tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l

11


Potential read races?
grep  "during read of"  tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l
33

Potential write races?
grep  "during write of"  tb-val32-helgrind-1202-WITHOUT-patch-take-1-log.txt | wc -l
101


TIA
This is the log from the patched thunderbird.

I think we really need to exercise the SSL related access using https:, but
I am not sure how to create such a test script (accessing the plug-in update check would be enough, and possibly calling HELP from within TB, which calls an external browser, seems to do the trick.)

TIA
>I have run the "make mozmill" with its all targets under helgrind, but
>the log is almost 16MB, and so I am not uploading it unless someone
>requests them (logs with or without patch. [log with the patch is still being >collected. It takes more than to 12 hours finish...])

I found I could suppress  "Helgrind:PthAPIerror"
by setting
LP_NUM_THREADS=2; export LP_NUM_THREADS
before running Thunderbird. I am not sure why GLX is compiled this way, but
I digress.

I ran "make mozmill" because I wanted to give a baseline comparison of thread-friendliness (or not) of TB with  known user actions.
However, it is not quite aggressive enough (--fair-sched=yes), I am afraid.

Still here are comparison of the logs. Not much difference between the two from
generic statistical observation. 


[WITHOUT THE PATCH]
ishikawa@debian-vm:/tmp/tachyon_insp_xe$ sh ~/Dropbox/t-summary.sh /home/ishikawa/Dropbox/tb-val32-helgrind-1203-WITHOUT-patch-log-take-1.txt
Analyzing /home/ishikawa/Dropbox/tb-val32-helgrind-1203-WITHOUT-patch-log-take-1.txt ...
 ========================================
 Summary
 ========================================
    720    Helgrind:LockOrder
    565    Helgrind:Misc
   2945    Helgrind:Race
 ========================================
 destruction of condition variable being waited upon. 
 ========================================
60
 ========================================
 destruction of unknown cond var 
 ========================================
470
 ========================================
 Potential read races 
 ========================================
663
 ========================================
 Potential write races 
 ========================================
2282

[WITH THE PATCH]
$ sh ~/Dropbox/t-summary.sh /home/ishikawa/Dropbox/tb-val32-helgrind-1202-with-patch-log-take-1.txt
Analyzing /home/ishikawa/Dropbox/tb-val32-helgrind-1202-with-patch-log-take-1.txt ...
 ========================================
 Summary
 ========================================
    715    Helgrind:LockOrder
    517    Helgrind:Misc
   2971    Helgrind:Race
 ========================================
 destruction of condition variable being waited upon. 
 ========================================
61
 ========================================
 destruction of unknown cond var 
 ========================================
421
 ========================================
 Potential read races 
 ========================================
640
 ========================================
 Potential write races 
 ========================================
2331


Maybe with the patch, we have less " destruction of unknown cond var ".


 
t-summary.sh is a a script:
:
#

if [ -s $1 ]
then
    echo "Analyzing $1 ..."
else
    echo "File $1 missing. Quitting..."
    exit 1
fi

echo " ========================================"
echo " Summary"
echo " ========================================"

grep Helgrind: $1 | sort | uniq -c


echo " ========================================"
echo " destruction of condition variable being waited upon. "
echo " ========================================"

grep "condition variable being waited" $1  | wc -l 

echo " ========================================"
echo " destruction of unknown cond var "
echo " ========================================"
grep "destruction of unknown cond var" $1 | wc -l

echo " ========================================"
echo " Potential read races "
echo " ========================================"

grep  "during read of"  $1 | wc -l

echo " ========================================"
echo " Potential write races "
echo " ========================================"

grep  "during write of"  $1 | wc -l


PS: I think we really need someone who knows inside SSL related functions to take a look at this problem.
Depends on: 819445
I added a few more lock/unlock pairs in places where I thought the proper lock/unlock was missing.

make mozmill of thunderbird using this patch and under helgrind seems to run OK without ill-effect.

I think we need more proper lock and unlock in plarena.c and
certificate.c needs better comments and a good overall design document about
how the locking is done to protect various data structure. 
Not easy to deduce this from a cursory reading. Is there such design document ?
Depends on: 819602, 819599
bit late now
Flags: needinfo?(bsmith)
For those who are interested in thread races in mozilla software, the
helgrind today has better handling of the destruction of condition variable, and so
we may get a better, i.e., clean log that helps us to identify really problematic issues.
I now run 64-bit linux and so might want to run it but
given it takes almost half a day to finish the test, I may not be able to do it until some long holidays like the end of the year/new year.

TIA
Gecko doesn't use CRLs any longer, so I don't think the crash from comment 0 is an issue now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
(In reply to David Keeler [:keeler] (use needinfo?) from comment #19)
> Gecko doesn't use CRLs any longer, so I don't think the crash from comment 0
> is an issue now.

Just curious, if CRL is not used, what alternative is used now?
OCSP, perhaps?

TIA.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: