Closed Bug 819386 Opened 12 years ago Closed 11 years ago

Intermittent Android shutdown crash [@ nss_certificate_hash]

Categories

(Core :: Security: PSM, defect)

ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #761987 +++

Android Tegra 250 mozilla-central opt test reftest-1 on 2012-12-07 06:53:46 PST for push 8432562db685

slave: tegra-188

https://tbpl.mozilla.org/php/getParsedLog.php?id=17713908&tree=Firefox

{
PROCESS-CRASH | Shutdown | application crashed [@ nss_certificate_hash]
Crash dump filename: /tmp/tmpNp0916/4abc67c0-1bc6-bb7c-25427d9e-21042e11.dmp
Operating system: Android
                  0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
     0 CPUs

Crash reason:  SIGSEGV
Crash address: 0xffffffff

Thread 4 (crashed)
 0  libnss3.so!nss_certificate_hash [pkistore.c : 703 + 0x2]
     r4 = 0xffffffff    r5 = 0x51ced4f8    r6 = 0x58437000    r7 = 0x00000000
     r8 = 0x54127018    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e911768    lr = 0x44959d45    pc = 0x51fc90d0
    Found by: given as instruction pointer in context
 1  libplds4.so!PL_HashTableLookup [plhash.c : 332 + 0x1]
     r4 = 0x58437000    r5 = 0x51ced4f8    r6 = 0x58437000    r7 = 0x00000000
     r8 = 0x54127018    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e911770    pc = 0x44959d45
    Found by: call frame info
 2  libnss3.so!nssHash_Lookup [hash.c : 328 + 0x7]
     r4 = 0x54127838    r5 = 0x58437000    r6 = 0x58437000    r7 = 0x00000000
     r8 = 0x54127018    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e911780    pc = 0x51fcec05
    Found by: call frame info
 3  libnss3.so!nssTrustDomain_RemoveCertFromCacheLOCKED [tdcache.c : 341 + 0x5]
     r4 = 0x54127018    r5 = 0x4e9117c8    r6 = 0x58437000    r7 = 0x00000000
     r8 = 0x54127018    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e911790    pc = 0x51fc8dbd
    Found by: call frame info
 4  libnss3.so!nssCertificate_Destroy [certificate.c : 121 + 0x7]
     r4 = 0x58437000    r5 = 0x4e9117c8    r6 = 0x3fa0aaaa    r7 = 0x00000000
     r8 = 0x54127018    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e9117b8    pc = 0x51fc6fab
    Found by: call frame info
 5  libnss3.so!NSSCertificate_Destroy [certificate.c : 150 + 0x3]
     r4 = 0x50b7b280    r5 = 0x4ec46ff8    r6 = 0x00000000    r7 = 0x51cf8de0
     r8 = 0x51fc25bd    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e9117f0    pc = 0x51fc6fff
    Found by: call frame info
 6  libnss3.so!CERT_DestroyCertificate [stanpcertdb.c : 795 + 0x5]
     r4 = 0x50b7b280    r5 = 0x4ec46ff8    r6 = 0x00000000    r7 = 0x51cf8de0
     r8 = 0x51fc25bd    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e9117f8    pc = 0x51fc4931
    Found by: call frame info
 7  libnss3.so!IssuerCache_Destroy [crl.c : 1130 + 0x3]
     r4 = 0x50b7b280    r5 = 0x4ec46ff8    r6 = 0x00000000    r7 = 0x51cf8de0
     r8 = 0x51fc25bd    r9 = 0x4e91184c   r10 = 0x00000009    fp = 0x00000010
     sp = 0x4e911800    pc = 0x51fc258f
}
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0)
> +++ This bug was initially created as a clone of Bug #761987 +++
> 

Thank you for splitting this from the bugzilla entry that receives so many errors these days.

It is a pity that the true cause is not still found.
But to me it looks as if the following could be *ONE* of the smoking guns:

Bug 819602 - ~TimerThread() destroys condition variable on which other threads are blocked.
(In reply to ISHIKAWA, chiaki from comment #1)
> (In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0)

> But to me it looks as if the following could be *ONE* of the smoking guns:
> 
> Bug 819602 - ~TimerThread() destroys condition variable on which other
> threads are blocked.

The bug in 819602 was "INVALID" in that the warning was false positive generated
by helgrind (valgrind). After fixing the problem (see 
 Bug 819599,
( "~nsEventQueue() destroys condition variable on which other 
  threads are blocked." especially, the comment
https://bugzilla.mozilla.org/show_bug.cgi?id=819599#c7
) for details.

Now I am looking into lock/unlock issue myself in
secport.c, PL_ARENA_ALLOCATE macro, etc. But still no luck to nail down a problem.
That the crash seems to occur near the end of the program life time suggests that
some internal data structre has been desroyed prematurely or something... 


TIA
Crash Signature: [@ nss_certificate_hash] → [@ nss_certificate_hash] [@ nss_certificate_hash | PL_HashTableLookup | nssHash_Lookup | nssTrustDomain_RemoveCertFromCacheLOCKED | nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | IssuerCache_Destroy]
Depends on: 830046
No longer depends on: 830046
Hi, 
after seeing some fishy lock/unlock in plarena.c [I thought it was fishy anyway], I expanded the area protected by the locks somewhat.

I have been running TB (comm-central) with the modified plarena.c for like two months on 32-bit Debian GNU/Linux.

BACKGROUND:

The last time I saw a strange address value reported in 
comment 161 (and comment 170) of Bug 761987  was back in October. I think I saw the bug for the first time after about 20 runs of TB "make mozmill" test under valgrind.

After the major surgery to plarena.c which I am reporting here, I think I have run "make mozmill" almost 80 times. But I have not encountered that problem yet.

The point is:

1. I have not seen any ill-effect from this change (well, as far as 
   I can tell), TB seems to work just fine running "make mozmill" test.

2. Short of compiler generating incorrect code, I can't think of other
   causes of non-deterministic error than the data races between the threads.

3. Not seeing the bug after 80 times of "make mozmill" does not guarantee
   that I will not see the bug in the 81st "make mozmill" run if the bug is
   caused by data races between threads.

So this is a long shot, but will someone test this change to plarena.c on Android and if the frequency of the crash is reduced? (Short of compiler bug, I wonder if some synchronization instruction may not behave as expected in CPU manual. These subtle bugs have occurred in CPUs for embedded devices before.)

TIA
 
PS: As I waded through the source tree, I also found 
mozilla/security/nss/lib/base/arena.c

that has a comment near the beginning which says

 * This contains the implementation of NSS's thread-safe arenas.

We seem to have many dupes of similar codes :-(
(In reply to ISHIKAWA, chiaki from comment #9)
> So this is a long shot, but will someone test this change to plarena.c on
> Android and if the frequency of the crash is reduced?

I pushed your patch to try: https://tbpl.mozilla.org/?tree=Try&rev=9a36b29d05ab

I will trigger retries as time allows and watch for crashes.
The try run reproduced the nssCertificate_Destroy crash: https://tbpl.mozilla.org/php/getParsedLog.php?id=20230377&tree=Try&full=1#error0
(In reply to Geoff Brown [:gbrown] from comment #11)
> The try run reproduced the nssCertificate_Destroy crash:
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=20230377&tree=Try&full=1#error0

Thank you.

OK, so reproducing the bug on ARM V6(7?) is so easy. I have not seen the same bug
on x86 since November. {But my testing is with *TB* only.]
And it was noticed ONLY BECAUSE I ran TB under valgrind/memgrind.
If I ran it without valgrind/memgrind, the bogus pointer generated pointed at
some valid memory area (in terms of accessibility) and would not cause the crash or anything.

As someone noted earlier, I see ShutdownNSS() in the stack, and thus
this must be related to some ordering issues of thread shutdown, etc.
But for some reason, it may not manifest so badly on x86.
Actually, in my TB testing running "make mozmill" using debug build of TB, I see very distressing message in some tests near shutdown: e.g., [somewhat sanitized]
       [...]
WARNING: nsExceptionService ignoring thread destruction after shutdown: file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/base/nsExceptionService.cpp, line 167
WARNING: unable to Flush() dirty datasource during XPCOM shutdown: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/base/src/nsRDFXMLDataSource.cpp, line 745
WARNING: not an nsIRDFRemoteDataSource: 'remote != nullptr', file /TB-NEW/TB-3HG/new-src/mozilla/rdf/datasource/src/nsLocalStore.cpp, line 279
WARNING: Leaking the RDF Service.: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/build/nsRDFModule.cpp, line 165
WARNING: Leaking the RDF Service.: file /TB-NEW/TB-3HG/new-src/mozilla/rdf/build/nsRDFModule.cpp, line 165
###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/build/nsXPComInit.cpp, line 674

     [...]

Sometimes I got, instead of RDF service problems, 
db left open /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile/Mail/Local Folders/dnsPrefetch.msf
db left open /TB-NEW/TB-3HG/objdir-tb3/mozilla/_tests/mozmill/mozmillprofile/Mail/Local Folders/generalContentPolicy.msf

and other times I see

WARNING: Fonts still alive while shutting down gfxFontCache: 'mFonts.Count() == 0', file /TB-NEW/TB-3HG/new-src/mozilla/gfx/thebes/gfxFont.cpp, line 1270
Assertion failed at /TB-NEW/TB-3HG/new-src/mozilla/gfx/cairo/cairo/src/cairo-hash.c:196: hash_table->live_entries == 0

just before
###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /TB-NEW/TB-3HG/new-src/mozilla/xpcom/build/nsXPComInit.cpp, line 674

I am somewhat surprised that I don't get a crash after seeing these disturbing messages.

It may be that these problems near shutdown may manifest in a stark manner
on ARM. 

I am curious, though. Doesn't Firefox (presumably) on other platform, say, X86 linux show similar warning messages near shutdown (debug build version, that is) when it is run through similar tests?

Another possibility is, of course, the hash table is messed up early and
that could be the reason why the processing blew up in trying to traverse the
messed pointers when the program tried to enumerate the entries.

One thing caught my eye in the log: here,

--- begin quote ---
Thread 12
 0  libc.so + 0xebb8
     r4 = 0x50b888a4    r5 = 0x50a55a10    r6 = 0x00000000    r7 = 0x000000f0
     r8 = 0x55b22100    r9 = 0x00000000   r10 = 0x009c181e    fp = 0x00000000
     sp = 0x517ffe58    lr = 0xafd108c8    pc = 0xafd0ebb8
    Found by: given as instruction pointer in context
 1  libc.so + 0x10992
     sp = 0x517ffe78    pc = 0xafd10994
    Found by: stack scanning
 2  libnspr4.so!PR_SetCurrentThreadName [ptthread.c:9a36b29d05ab : 1712 + 0xb]
     sp = 0x517ffe80    pc = 0x449cefb1
    Found by: stack scanning
 3  libnspr4.so!PR_WaitCondVar [ptsynch.c:9a36b29d05ab : 385 + 0x7]
     sp = 0x517ffe98    pc = 0x449cb1b1
    Found by: stack scanning
--- end quote

I thought PR_WaitCondVar is a sync primitive. A thread usually waits for
a notification in PR_WaitCondVar, and when the notification comes,
it gets released, and then do something. 
That there is another function invoked from PR_WaitCondVar, i.e., PR_SetCurrentThreadName, looks very fishy to me if I am not mistaken.
Many other thread wait in PR_WaitCondVar, and such calls immediately end up in libc, presumably calling the lower-level support routines. 
This thread 12 stands out. This may need investigating.

I wonder if some thread primitive misbehave [occasionally] under Arm/Android? Hard to believe.
Still the frequency of the problem suggests something is different on this  platform.
 - memory layout to cause the bogus pointer to cause SIGSEGV more easily,
 - occasionally misbehaving sync primitive?? (Thread 12 above),
 - some factors that multiply the ill-effect of shutdown ordering issues, etc.]
(In reply to Geoff Brown [:gbrown] from comment #11)
> The try run reproduced the nssCertificate_Destroy crash:
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=20230377&tree=Try&full=1#error0

Dear Geoff,

Thank you again.

I wonder if you can push the newly attached patch because 
the previous patch forgot to include the patches to related files that handle certificates and stuff. I have been running the whole set of patches for TB in the last few months. [I finally got the Tryserver access, but am still learning the ropes.]

Sorry for the initial oversight.

TIA
(In reply to Geoff Brown [:gbrown] from comment #14)
> No problem: https://tbpl.mozilla.org/?tree=Try&rev=7a2c57ba793d

Geoff, I am not an expert to read the run result of tryserver, but the current output seems to be much better than the one in the previous post in comment 11.
Well, correct me wrong if I am overlooking serious problems.

TIA
I agree -- this looks great! I don't see any sign of the nssCertificate_Destroy crash.

:randix -- are you still investigating that crash? Thoughts on this patch?
I thi(In reply to Geoff Brown [:gbrown] from comment #17)
> Oh no! With more retries, now there is one crash:
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=20287621&tree=Try&full=1#error0

I think we need to look at plhash.c (which I thought I did, but I will take  a fresh look again), and nssinit.c to see if there are some disturbing thread-races.

TIA
(In reply to ISHIKAWA, chiaki from comment #18)
> I thi(In reply to Geoff Brown [:gbrown] from comment #17)
> > Oh no! With more retries, now there is one crash:
> > https://tbpl.mozilla.org/php/getParsedLog.
> > php?id=20287621&tree=Try&full=1#error0
> 
> I think we need to look at plhash.c (which I thought I did, but I will take 
> a fresh look again), and nssinit.c to see if there are some disturbing
> thread-races.
> 
> TIA

The use of functions in plhash.c on a hash table
needs to be protected by locking appropriately, but
it seems that some functions such as add  may not be quite properly locked.
(Well cursory check, and so there may be some misunderstanding of the code.)
However, I am not entirely sure if there can
be different thread trying to populate the CRL database (I highly doubt it, but I am no expert in the nss-related code.)
I think we ought to add lock/unlock properly [though we don't know such races for a particular  hashtable exists] until the bug is eradicated.
Coming from embedded programming experience, I view the code as very fragile :-(

TIA
I just want to add the comment that I am seeing somewhat random values in the arena pointer during the crashes from bug 761987. I think this track is the right one: that there are locking issues in the arena handling.
ISHIKAWA, chiaki and Geoff Brown [:gbrown]:
Have either of you done more work on either plarena.c or plash.c? With your patch, my failure rate has gone down, approaching zero. Nonetheless, examining the code definitely exposes race conditions, even after your patch, unless it can be shown that the code is only called from a single thread in certain cases (as per the comments in the patch). That seems to me to be a poor assumption.
(In reply to Rand Dow [:randix] from comment #22)
> ISHIKAWA, chiaki and Geoff Brown [:gbrown]:
> Have either of you done more work on either plarena.c or plash.c? With your
> patch, my failure rate has gone down, approaching zero. Nonetheless,
> examining the code definitely exposes race conditions, even after your
> patch, unless it can be shown that the code is only called from a single
> thread in certain cases (as per the comments in the patch). That seems to me
> to be a poor assumption.

I have looked at plarena.c and (you mean) plhash.c.
Currently, 
I am looking at the higher-level calling routines (that call the routines
in these files) to see if they properly use lock/unlock, enter_monitor/exit_monitor, or whatever mechanism to observe the proper entry to the critical region (i.e., some functions in plarena.c and plhash.c, etc.)

Reading the code really makes me wonder why there are so many different sync/lock primitives with subtle differences.:
PR_Lock/PR_Unlock, PZ_Lock, PZ_Unlock, PR_EnterMonitor/PR_ExitMoniter, to name a few. (I have no idea what the difference is between PR_Lock and PZ_Lock series.
The recent post to dev mailing suggests that we should not call PR_Lock, PR_Unlock directly any more(!?) at least in Gecko code.

I wish there *were* COMMENTS to various routines.
There were some exceptions to make us realize whoever wrote the code
was aware of the lock/unlock issues with succinct, yet to-the-point comments, but other functions leave much room for improvement comment-wise.

Anyway, right now, I have spotted one problem (I am only finished with checking the external calls to about one third of routines in plhash.c.
I spotted a problem in a calling sequence that starts with
pkix_pl_Pk11CertStore_ImportCrl().
we probably need to protect |namedCRLCache| since addition to CRLCache is done, but this function is declared in the preceding comment that 
"308  * THREAD SAFETY:
309  *  Thread Safe (see Thread Safety Definitions in Programmer's Guide)"

which means that the caller probably don't bother to lock/unlock to protect |namedCRLCache|. (So unless I am mistaken, there is a chance of race here.)

I may uncover a few more bugs in the caller side which don't lock / unlock
the resources properly by going through the callers of 
remaining 2/3 of functions in plhash.c

Once I am through with checking, I intend to add proper lock/unlock to protect the resources and then run the modified source locally to see 
if my change does not bring about
deadlock. If the local modification looks good, then I will 
ask some one to run the test again on firefox TryServer on Arm.
[My local PC is set up for only for testing TB right now. I have not downloaded
FF recently.]

Also, it seems imperative that we test the software on ARM, which is only
available on TryServer.
It *IS* strange the problem manifests so badly on ARM CPU architecture.

BTW, I think mozilla source code now needs javadoc and/or doxygen style 
annotation for better readability. Some files have such comments, but
as far as I can figure out, most of the current problems I tackle 
have occurred in poorly documented files. But I digress.

Anyway, here is what I am doing, but please don't hold your breath. It is a slow-going process on my end.
But if someone wants to the scrutiny of the callers of the routines in plhash.c,
I have checked the callers of the functions starting from the beginning of plhash.c
I checked up to PL_HashTableRemove(), and was about to check PL_HashTableLookup()
.
So the big function that needs more checking on the caller side
are PL_HashTableLookup(), and
PL_HashTableEnumerateEntries().

Hmm, looks to me I am more than half way through checking instead of only one third.

TIA
(In reply to Rand Dow [:randix] from comment #22)
> ISHIKAWA, chiaki and Geoff Brown [:gbrown]:
> Have either of you done more work on either plarena.c or plash.c? 

I have not - sorry.
(In reply to ISHIKAWA, chiaki from comment #23) 
> Also, it seems imperative that we test the software on ARM, which is only
> available on TryServer.
> It *IS* strange the problem manifests so badly on ARM CPU architecture.

That has been bothering me too. However, I have been thinking, it is strange that it happens on *Android* builds -- I wonder if there is something in Fennec, Android-specific code that is using NSS incorrectly. I have been meaning to pursue bug 841157, but have not found the time.
I am no longer to reproduce these crashes in current Nightly, which is strange to me.
Overall, the crashes have not diminished.

Question for Geoff: Can you still get crashes?
Sorry - I have not tried lately.

It may be that the logging has shifted the timing in such a way that a race outcome is always successful...disabling some of your logging might help?
(In reply to Rand Dow [:randix] from comment #26)
> Overall, the crashes have not diminished.

That's what I thought, but looking more closely at recent failures in bug 761987, I notice:
 - there are frequent failures in various tests on mozilla-aurora
 - on mozilla-inbound/central, almost all recent failures are on Talos tests, especially tsvg and tp4m
(OrangeWFM for bugs not modified in > 2 months)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
So the bug does not happen any more on ARM?
Not in automation, no :-)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: