Closed
Bug 819386
Opened 12 years ago
Closed 11 years ago
Intermittent Android shutdown crash [@ nss_certificate_hash]
Categories
(Core :: Security: PSM, defect)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: emorley, Unassigned)
References
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(2 files)
+++ This bug was initially created as a clone of Bug #761987 +++ Android Tegra 250 mozilla-central opt test reftest-1 on 2012-12-07 06:53:46 PST for push 8432562db685 slave: tegra-188 https://tbpl.mozilla.org/php/getParsedLog.php?id=17713908&tree=Firefox { PROCESS-CRASH | Shutdown | application crashed [@ nss_certificate_hash] Crash dump filename: /tmp/tmpNp0916/4abc67c0-1bc6-bb7c-25427d9e-21042e11.dmp Operating system: Android 0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys CPU: arm 0 CPUs Crash reason: SIGSEGV Crash address: 0xffffffff Thread 4 (crashed) 0 libnss3.so!nss_certificate_hash [pkistore.c : 703 + 0x2] r4 = 0xffffffff r5 = 0x51ced4f8 r6 = 0x58437000 r7 = 0x00000000 r8 = 0x54127018 r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e911768 lr = 0x44959d45 pc = 0x51fc90d0 Found by: given as instruction pointer in context 1 libplds4.so!PL_HashTableLookup [plhash.c : 332 + 0x1] r4 = 0x58437000 r5 = 0x51ced4f8 r6 = 0x58437000 r7 = 0x00000000 r8 = 0x54127018 r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e911770 pc = 0x44959d45 Found by: call frame info 2 libnss3.so!nssHash_Lookup [hash.c : 328 + 0x7] r4 = 0x54127838 r5 = 0x58437000 r6 = 0x58437000 r7 = 0x00000000 r8 = 0x54127018 r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e911780 pc = 0x51fcec05 Found by: call frame info 3 libnss3.so!nssTrustDomain_RemoveCertFromCacheLOCKED [tdcache.c : 341 + 0x5] r4 = 0x54127018 r5 = 0x4e9117c8 r6 = 0x58437000 r7 = 0x00000000 r8 = 0x54127018 r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e911790 pc = 0x51fc8dbd Found by: call frame info 4 libnss3.so!nssCertificate_Destroy [certificate.c : 121 + 0x7] r4 = 0x58437000 r5 = 0x4e9117c8 r6 = 0x3fa0aaaa r7 = 0x00000000 r8 = 0x54127018 r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e9117b8 pc = 0x51fc6fab Found by: call frame info 5 libnss3.so!NSSCertificate_Destroy [certificate.c : 150 + 0x3] r4 = 0x50b7b280 r5 = 0x4ec46ff8 r6 = 0x00000000 r7 = 0x51cf8de0 r8 = 0x51fc25bd r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e9117f0 pc = 0x51fc6fff Found by: call frame info 6 libnss3.so!CERT_DestroyCertificate [stanpcertdb.c : 795 + 0x5] r4 = 0x50b7b280 r5 = 0x4ec46ff8 r6 = 0x00000000 r7 = 0x51cf8de0 r8 = 0x51fc25bd r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e9117f8 pc = 0x51fc4931 Found by: call frame info 7 libnss3.so!IssuerCache_Destroy [crl.c : 1130 + 0x3] r4 = 0x50b7b280 r5 = 0x4ec46ff8 r6 = 0x00000000 r7 = 0x51cf8de0 r8 = 0x51fc25bd r9 = 0x4e91184c r10 = 0x00000009 fp = 0x00000010 sp = 0x4e911800 pc = 0x51fc258f }
Comment 1•12 years ago
|
||
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0) > +++ This bug was initially created as a clone of Bug #761987 +++ > Thank you for splitting this from the bugzilla entry that receives so many errors these days. It is a pity that the true cause is not still found. But to me it looks as if the following could be *ONE* of the smoking guns: Bug 819602 - ~TimerThread() destroys condition variable on which other threads are blocked.
Comment 2•12 years ago
|
||
(In reply to ISHIKAWA, chiaki from comment #1) > (In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0) > But to me it looks as if the following could be *ONE* of the smoking guns: > > Bug 819602 - ~TimerThread() destroys condition variable on which other > threads are blocked. The bug in 819602 was "INVALID" in that the warning was false positive generated by helgrind (valgrind). After fixing the problem (see Bug 819599, ( "~nsEventQueue() destroys condition variable on which other threads are blocked." especially, the comment https://bugzilla.mozilla.org/show_bug.cgi?id=819599#c7 ) for details. Now I am looking into lock/unlock issue myself in secport.c, PL_ARENA_ALLOCATE macro, etc. But still no luck to nail down a problem. That the crash seems to occur near the end of the program life time suggests that some internal data structre has been desroyed prematurely or something... TIA
Updated•12 years ago
|
Crash Signature: [@ nss_certificate_hash] → [@ nss_certificate_hash]
[@ nss_certificate_hash | PL_HashTableLookup | nssHash_Lookup | nssTrustDomain_RemoveCertFromCacheLOCKED | nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | IssuerCache_Destroy]
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 9•11 years ago
|
||
Hi, after seeing some fishy lock/unlock in plarena.c [I thought it was fishy anyway], I expanded the area protected by the locks somewhat. I have been running TB (comm-central) with the modified plarena.c for like two months on 32-bit Debian GNU/Linux. BACKGROUND: The last time I saw a strange address value reported in comment 161 (and comment 170) of Bug 761987 was back in October. I think I saw the bug for the first time after about 20 runs of TB "make mozmill" test under valgrind. After the major surgery to plarena.c which I am reporting here, I think I have run "make mozmill" almost 80 times. But I have not encountered that problem yet. The point is: 1. I have not seen any ill-effect from this change (well, as far as I can tell), TB seems to work just fine running "make mozmill" test. 2. Short of compiler generating incorrect code, I can't think of other causes of non-deterministic error than the data races between the threads. 3. Not seeing the bug after 80 times of "make mozmill" does not guarantee that I will not see the bug in the 81st "make mozmill" run if the bug is caused by data races between threads. So this is a long shot, but will someone test this change to plarena.c on Android and if the frequency of the crash is reduced? (Short of compiler bug, I wonder if some synchronization instruction may not behave as expected in CPU manual. These subtle bugs have occurred in CPUs for embedded devices before.) TIA PS: As I waded through the source tree, I also found mozilla/security/nss/lib/base/arena.c that has a comment near the beginning which says * This contains the implementation of NSS's thread-safe arenas. We seem to have many dupes of similar codes :-(
Comment 10•11 years ago
|
||
(In reply to ISHIKAWA, chiaki from comment #9) > So this is a long shot, but will someone test this change to plarena.c on > Android and if the frequency of the crash is reduced? I pushed your patch to try: https://tbpl.mozilla.org/?tree=Try&rev=9a36b29d05ab I will trigger retries as time allows and watch for crashes.
Comment 11•11 years ago
|
||
The try run reproduced the nssCertificate_Destroy crash: https://tbpl.mozilla.org/php/getParsedLog.php?id=20230377&tree=Try&full=1#error0
Comment 12•11 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #11) > The try run reproduced the nssCertificate_Destroy crash: > https://tbpl.mozilla.org/php/getParsedLog. > php?id=20230377&tree=Try&full=1#error0 Thank you. OK, so reproducing the bug on ARM V6(7?) is so easy. I have not seen the same bug on x86 since November. {But my testing is with *TB* only.] And it was noticed ONLY BECAUSE I ran TB under valgrind/memgrind. If I ran it without valgrind/memgrind, the bogus pointer generated pointed at some valid memory area (in terms of accessibility) and would not cause the crash or anything. As someone noted earlier, I see ShutdownNSS() in the stack, and thus this must be related to some ordering issues of thread shutdown, etc. But for some reason, it may not manifest so badly on x86. Actually, in my TB testing running "make mozmill" using debug build of TB, I see very distressing message in some tests near shutdown: e.g., [somewhat sanitized] [...] WARNING: nsExceptionService ignoring thread destruction after shutdown: file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/base/nsExceptionService.cpp, line 167 WARNING: unable to Flush() dirty datasource during XPCOM shutdown: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/base/src/nsRDFXMLDataSource.cpp, line 745 WARNING: not an nsIRDFRemoteDataSource: 'remote != nullptr', file /TB-NEW/TB-3HG/new-src/mozilla/rdf/datasource/src/nsLocalStore.cpp, line 279 WARNING: Leaking the RDF Service.: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/build/nsRDFModule.cpp, line 165 WARNING: Leaking the RDF Service.: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/build/nsRDFModule.cpp, line 165 ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/build/nsXPComInit.cpp, line 674 [...] Sometimes I got, instead of RDF service problems, db left open /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile/Mail/Local Folders/dnsPrefetch.msf db left open /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile/Mail/Local Folders/generalContentPolicy.msf and other times I see WARNING: Fonts still alive while shutting down gfxFontCache: 'mFonts.Count() == 0', file /TB-NEW/TB-3HG/new-src/mozilla/gfx/thebes/gfxFont.cpp, line 1270 Assertion failed at /TB-NEW/TB-3HG/new-src/mozilla/gfx/cairo/cairo/src/cairo-hash.c:196: hash_table->live_entries == 0 just before ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/build/nsXPComInit.cpp, line 674 I am somewhat surprised that I don't get a crash after seeing these disturbing messages. It may be that these problems near shutdown may manifest in a stark manner on ARM. I am curious, though. Doesn't Firefox (presumably) on other platform, say, X86 linux show similar warning messages near shutdown (debug build version, that is) when it is run through similar tests? Another possibility is, of course, the hash table is messed up early and that could be the reason why the processing blew up in trying to traverse the messed pointers when the program tried to enumerate the entries. One thing caught my eye in the log: here, --- begin quote --- Thread 12 0 libc.so + 0xebb8 r4 = 0x50b888a4 r5 = 0x50a55a10 r6 = 0x00000000 r7 = 0x000000f0 r8 = 0x55b22100 r9 = 0x00000000 r10 = 0x009c181e fp = 0x00000000 sp = 0x517ffe58 lr = 0xafd108c8 pc = 0xafd0ebb8 Found by: given as instruction pointer in context 1 libc.so + 0x10992 sp = 0x517ffe78 pc = 0xafd10994 Found by: stack scanning 2 libnspr4.so!PR_SetCurrentThreadName [ptthread.c:9a36b29d05ab : 1712 + 0xb] sp = 0x517ffe80 pc = 0x449cefb1 Found by: stack scanning 3 libnspr4.so!PR_WaitCondVar [ptsynch.c:9a36b29d05ab : 385 + 0x7] sp = 0x517ffe98 pc = 0x449cb1b1 Found by: stack scanning --- end quote I thought PR_WaitCondVar is a sync primitive. A thread usually waits for a notification in PR_WaitCondVar, and when the notification comes, it gets released, and then do something. That there is another function invoked from PR_WaitCondVar, i.e., PR_SetCurrentThreadName, looks very fishy to me if I am not mistaken. Many other thread wait in PR_WaitCondVar, and such calls immediately end up in libc, presumably calling the lower-level support routines. This thread 12 stands out. This may need investigating. I wonder if some thread primitive misbehave [occasionally] under Arm/Android? Hard to believe. Still the frequency of the problem suggests something is different on this platform. - memory layout to cause the bogus pointer to cause SIGSEGV more easily, - occasionally misbehaving sync primitive?? (Thread 12 above), - some factors that multiply the ill-effect of shutdown ordering issues, etc.]
Comment 13•11 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #11) > The try run reproduced the nssCertificate_Destroy crash: > https://tbpl.mozilla.org/php/getParsedLog. > php?id=20230377&tree=Try&full=1#error0 Dear Geoff, Thank you again. I wonder if you can push the newly attached patch because the previous patch forgot to include the patches to related files that handle certificates and stuff. I have been running the whole set of patches for TB in the last few months. [I finally got the Tryserver access, but am still learning the ropes.] Sorry for the initial oversight. TIA
Comment 14•11 years ago
|
||
No problem: https://tbpl.mozilla.org/?tree=Try&rev=7a2c57ba793d
Comment 15•11 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #14) > No problem: https://tbpl.mozilla.org/?tree=Try&rev=7a2c57ba793d Geoff, I am not an expert to read the run result of tryserver, but the current output seems to be much better than the one in the previous post in comment 11. Well, correct me wrong if I am overlooking serious problems. TIA
Comment 16•11 years ago
|
||
I agree -- this looks great! I don't see any sign of the nssCertificate_Destroy crash. :randix -- are you still investigating that crash? Thoughts on this patch?
Comment 17•11 years ago
|
||
Oh no! With more retries, now there is one crash: https://tbpl.mozilla.org/php/getParsedLog.php?id=20287621&tree=Try&full=1#error0
Comment 18•11 years ago
|
||
I thi(In reply to Geoff Brown [:gbrown] from comment #17) > Oh no! With more retries, now there is one crash: > https://tbpl.mozilla.org/php/getParsedLog. > php?id=20287621&tree=Try&full=1#error0 I think we need to look at plhash.c (which I thought I did, but I will take a fresh look again), and nssinit.c to see if there are some disturbing thread-races. TIA
Comment 19•11 years ago
|
||
(In reply to ISHIKAWA, chiaki from comment #18) > I thi(In reply to Geoff Brown [:gbrown] from comment #17) > > Oh no! With more retries, now there is one crash: > > https://tbpl.mozilla.org/php/getParsedLog. > > php?id=20287621&tree=Try&full=1#error0 > > I think we need to look at plhash.c (which I thought I did, but I will take > a fresh look again), and nssinit.c to see if there are some disturbing > thread-races. > > TIA The use of functions in plhash.c on a hash table needs to be protected by locking appropriately, but it seems that some functions such as add may not be quite properly locked. (Well cursory check, and so there may be some misunderstanding of the code.) However, I am not entirely sure if there can be different thread trying to populate the CRL database (I highly doubt it, but I am no expert in the nss-related code.) I think we ought to add lock/unlock properly [though we don't know such races for a particular hashtable exists] until the bug is eradicated. Coming from embedded programming experience, I view the code as very fragile :-( TIA
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 21•11 years ago
|
||
I just want to add the comment that I am seeing somewhat random values in the arena pointer during the crashes from bug 761987. I think this track is the right one: that there are locking issues in the arena handling.
Comment 22•11 years ago
|
||
ISHIKAWA, chiaki and Geoff Brown [:gbrown]: Have either of you done more work on either plarena.c or plash.c? With your patch, my failure rate has gone down, approaching zero. Nonetheless, examining the code definitely exposes race conditions, even after your patch, unless it can be shown that the code is only called from a single thread in certain cases (as per the comments in the patch). That seems to me to be a poor assumption.
Comment 23•11 years ago
|
||
(In reply to Rand Dow [:randix] from comment #22) > ISHIKAWA, chiaki and Geoff Brown [:gbrown]: > Have either of you done more work on either plarena.c or plash.c? With your > patch, my failure rate has gone down, approaching zero. Nonetheless, > examining the code definitely exposes race conditions, even after your > patch, unless it can be shown that the code is only called from a single > thread in certain cases (as per the comments in the patch). That seems to me > to be a poor assumption. I have looked at plarena.c and (you mean) plhash.c. Currently, I am looking at the higher-level calling routines (that call the routines in these files) to see if they properly use lock/unlock, enter_monitor/exit_monitor, or whatever mechanism to observe the proper entry to the critical region (i.e., some functions in plarena.c and plhash.c, etc.) Reading the code really makes me wonder why there are so many different sync/lock primitives with subtle differences.: PR_Lock/PR_Unlock, PZ_Lock, PZ_Unlock, PR_EnterMonitor/PR_ExitMoniter, to name a few. (I have no idea what the difference is between PR_Lock and PZ_Lock series. The recent post to dev mailing suggests that we should not call PR_Lock, PR_Unlock directly any more(!?) at least in Gecko code. I wish there *were* COMMENTS to various routines. There were some exceptions to make us realize whoever wrote the code was aware of the lock/unlock issues with succinct, yet to-the-point comments, but other functions leave much room for improvement comment-wise. Anyway, right now, I have spotted one problem (I am only finished with checking the external calls to about one third of routines in plhash.c. I spotted a problem in a calling sequence that starts with pkix_pl_Pk11CertStore_ImportCrl(). we probably need to protect |namedCRLCache| since addition to CRLCache is done, but this function is declared in the preceding comment that "308 * THREAD SAFETY: 309 * Thread Safe (see Thread Safety Definitions in Programmer's Guide)" which means that the caller probably don't bother to lock/unlock to protect |namedCRLCache|. (So unless I am mistaken, there is a chance of race here.) I may uncover a few more bugs in the caller side which don't lock / unlock the resources properly by going through the callers of remaining 2/3 of functions in plhash.c Once I am through with checking, I intend to add proper lock/unlock to protect the resources and then run the modified source locally to see if my change does not bring about deadlock. If the local modification looks good, then I will ask some one to run the test again on firefox TryServer on Arm. [My local PC is set up for only for testing TB right now. I have not downloaded FF recently.] Also, it seems imperative that we test the software on ARM, which is only available on TryServer. It *IS* strange the problem manifests so badly on ARM CPU architecture. BTW, I think mozilla source code now needs javadoc and/or doxygen style annotation for better readability. Some files have such comments, but as far as I can figure out, most of the current problems I tackle have occurred in poorly documented files. But I digress. Anyway, here is what I am doing, but please don't hold your breath. It is a slow-going process on my end. But if someone wants to the scrutiny of the callers of the routines in plhash.c, I have checked the callers of the functions starting from the beginning of plhash.c I checked up to PL_HashTableRemove(), and was about to check PL_HashTableLookup() . So the big function that needs more checking on the caller side are PL_HashTableLookup(), and PL_HashTableEnumerateEntries(). Hmm, looks to me I am more than half way through checking instead of only one third. TIA
Comment 24•11 years ago
|
||
(In reply to Rand Dow [:randix] from comment #22) > ISHIKAWA, chiaki and Geoff Brown [:gbrown]: > Have either of you done more work on either plarena.c or plash.c? I have not - sorry.
Comment 25•11 years ago
|
||
(In reply to ISHIKAWA, chiaki from comment #23) > Also, it seems imperative that we test the software on ARM, which is only > available on TryServer. > It *IS* strange the problem manifests so badly on ARM CPU architecture. That has been bothering me too. However, I have been thinking, it is strange that it happens on *Android* builds -- I wonder if there is something in Fennec, Android-specific code that is using NSS incorrectly. I have been meaning to pursue bug 841157, but have not found the time.
Comment 26•11 years ago
|
||
I am no longer to reproduce these crashes in current Nightly, which is strange to me. Overall, the crashes have not diminished. Question for Geoff: Can you still get crashes?
Comment 27•11 years ago
|
||
Sorry - I have not tried lately. It may be that the logging has shifted the timing in such a way that a race outcome is always successful...disabling some of your logging might help?
Comment 28•11 years ago
|
||
(In reply to Rand Dow [:randix] from comment #26) > Overall, the crashes have not diminished. That's what I thought, but looking more closely at recent failures in bug 761987, I notice: - there are frequent failures in various tests on mozilla-aurora - on mozilla-inbound/central, almost all recent failures are on Talos tests, especially tsvg and tp4m
Comment hidden (Legacy TBPL/Treeherder Robot) |
Reporter | ||
Comment 30•11 years ago
|
||
(OrangeWFM for bugs not modified in > 2 months)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Comment 31•11 years ago
|
||
So the bug does not happen any more on ARM?
Reporter | ||
Comment 32•11 years ago
|
||
Not in automation, no :-)
You need to log in
before you can comment on or make changes to this bug.
Description
•