Closed Bug 871575 Opened 6 years ago Closed 8 months ago

Investigate and fix the nss shutdown crash that we see on Android

Categories

(Core :: Networking, defect, P3)

ARM
Android
defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [necko-backlog])

Crash Data

I want to use this bug to track, communicate, and fix the nss_shutdown crash that we see on Android.  This bug should not be used for tbpl to star tests failures.
(This won't appear on TBPL unless the keyword intermittent-failure is used & the summary matches, so we're safe from starring).

Thank you for filing this - hopeful will make communication (and visibility into progress) easier :-)
Summary: [DO NOT STAR] investigate and fix the nss shutdown crash that we see on Android → Investigate and fix the nss shutdown crash that we see on Android
Brad, you mentioned Doug is making progress on this. Is that being tracked somewhere or is IRC pinging the only way to find out?
Crash Signature: [@ nssCertificate_Destroy] [@ nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | IssuerCache_Destroy] [@ 0xffff0fc4 | PR_AtomicDecrement | nssCertificate_Destroy] [@ FreeArenaList | PORT_FreeArena_Util | IssuerCache_Destroy] […
here is an example log file that shows this failure:
https://tbpl.mozilla.org/php/getParsedLog.php?id=22983942&tree=Mozilla-Inbound
I have a pretty good, but vague, idea of what is happening here. I haven't had time to dig into the details but I am planning to do so this week. I think there are a combination of bugs that need to be fixed: at least one in Necko and one in PSM. The Necko bug is that Necko tries to do SSL networking while and/or after NSS has been shut down. The PSM/NSS bug is almost definitely a refcounting error in CERTCertificate. i.e. we're calling CERT_DestroyCertificate() at least one time more than we're calling CERT_DupCertificate().
Assignee: doug.turner → bsmith
Best stack so far:


Thread 12 (crashed)
 0  libnss3.so!PR_Lock [ptsynch.c : 184 + 0x2]
     r4 = 0x00000000    r5 = 0x6ac02fa0    r6 = 0x63ff8484    r7 = 0x5d2bc858
     r8 = 0x62155818    r9 = 0x62140b60   r10 = 0x00000008    fp = 0x00000010
     sp = 0x5d2bc850    lr = 0x5f28591d    pc = 0x5f28591e
    Found by: given as instruction pointer in context
 1  libnss3.so!nssCertificate_Destroy [certificate.c:4e1be9ada51a : 106 + 0x7]
     sp = 0x5d2bc858    pc = 0x5f2060a5
    Found by: stack scanning
 2  libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1182 + 0x5]
     sp = 0x5d2bc88c    pc = 0x5f201c5d
    Found by: stack scanning
 3  libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1128 + 0x3]
     sp = 0x5d2bc890    pc = 0x5f201c29
    Found by: stack scanning
 4  libnss3.so!FreeIssuer [crl.c:4e1be9ada51a : 1241 + 0x3]
     sp = 0x5d2bc8a8    pc = 0x5f201c6d
    Found by: stack scanning
 5  libnss3.so!PL_HashTableEnumerateEntries [plhash.c : 374 + 0x1]
     sp = 0x5d2bc8b8    pc = 0x5f288e4d
    Found by: stack scanning
 6  libnss3.so!ShutdownCRLCache [crl.c:4e1be9ada51a : 1306 + 0x9]
     sp = 0x5d2bc8e8    pc = 0x5f201cd7
    Found by: stack scanning
 7  libnss3.so!nss_Shutdown [nssinit.c:4e1be9ada51a : 1082 + 0x3]
     sp = 0x5d2bc900    pc = 0x5f252ce9
    Found by: stack scanning
 8  libnss3.so!NSS_Shutdown [nssinit.c:4e1be9ada51a : 1145 + 0x3]
     sp = 0x5d2bc918    pc = 0x5f252dd1
    Found by: stack scanning
 9  libxul.so!nsNSSComponent::ShutdownNSS() [nsNSSComponent.cpp:4e1be9ada51a : 1902 + 0x3]
     sp = 0x5d2bc928    pc = 0x62da673b
    Found by: stack scanning
10  libxul.so!nsNSSComponent::DoProfileBeforeChange(nsISupports*) [nsNSSComponent.cpp:4e1be9ada51a : 2499 + 0x5]
     sp = 0x5d2bc940    pc = 0x62da67cb
    Found by: stack scanning
11  libxul.so!nsNSSComponent::Observe(nsISupports*, char const*, unsigned short const*) [nsNSSComponent.cpp:4e1be9ada51a : 2184 + 0x7]
     sp = 0x5d2bc960    pc = 0x62da7141
    Found by: stack scanning
12  libnss3.so!PR_Unlock [ptsynch.c : 205 + 0x3]
     sp = 0x5d2bc9b0    pc = 0x5f285a91
    Found by: stack scanning
13  libxul.so + 0x96e3a7
     sp = 0x5d2bc9bc    pc = 0x62da13a9
    Found by: stack scanning
14  libnss3.so!PR_ExitMonitor [ptsynch.c : 557 + 0x3]
     sp = 0x5d2bc9c0    pc = 0x5f285cb9
    Found by: stack scanning
15  libxul.so!nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverList.cpp:4e1be9ada51a : 99 + 0x7]
     sp = 0x5d2bc9e0    pc = 0x630b1887
    Found by: stack scanning
16  libxul.so!nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverService.cpp:4e1be9ada51a : 161 + 0x9]
     sp = 0x5d2bca08    pc = 0x630b1bdb
    Found by: stack scanning
17  libxul.so!nsObserverService::Create(nsISupports*, nsID const&, void**) [nsAutoPtr.h:4e1be9ada51a : 880 + 0xf]
     sp = 0x5d2bca14    pc = 0x630b1ba9
    Found by: stack scanning
18  libxul.so!nsXREDirProvider::DoShutdown() [nsXREDirProvider.cpp:4e1be9ada51a : 871 + 0x11]
     sp = 0x5d2bca20    pc = 0x6270d967
    Found by: stack scanning
19  libxul.so!nsAppShellService::CreateHiddenWindow() [nsAppShellService.cpp:4e1be9ada51a : 88 + 0x3]
     sp = 0x5d2bca24    pc = 0x62d759eb
    Found by: stack scanning
20  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup [nsAppRunner.cpp:4e1be9ada51a : 1120 + 0x9]
     sp = 0x5d2bca48    pc = 0x62709d19
    Found by: stack scanning
21  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:4e1be9ada51a : 3964 + 0x5]
     sp = 0x5d2bca60    pc = 0x6270d3cb
    Found by: stack scanning
22  libxul.so!XRE_main [nsAppRunner.cpp:4e1be9ada51a : 4140 + 0x3]
     sp = 0x5d2bca88    pc = 0x6270d553
    Found by: stack scanning
23  libmozglue.so!__wrap_realloc [jemalloc.c:4e1be9ada51a : 4692 + 0x3]
     sp = 0x5d2bcb00    pc = 0x5bc91d11
    Found by: stack scanning
24  libmozglue.so!arena_malloc [jemalloc.c:4e1be9ada51a : 4167 + 0x3]
     sp = 0x5d2bcb28    pc = 0x5bc907c5
    Found by: stack scanning
25  libmozalloc.so!moz_xrealloc [mozalloc.cpp : 86 + 0x7]
     sp = 0x5d2bcb58    pc = 0x5f03f02f
    Found by: stack scanning
26  libxul.so!nsTArray_base<nsTArrayInfallibleAllocator>::EnsureCapacity(unsigned int, unsigned int) [nsTArray.h:4e1be9ada51a : 196 + 0x5]
     sp = 0x5d2bcb68    pc = 0x62710e79
    Found by: stack scanning
27  libxul.so!XRE_InitChildProcess [GeckoProfilerImpl.h:4e1be9ada51a : 286 + 0x0]
     sp = 0x5d2bcb80    pc = 0x62710000
    Found by: stack scanning
28  libxul.so!GeckoStart [nsAndroidStartup.cpp:4e1be9ada51a : 73 + 0xf]
     sp = 0x5d2bcb98    pc = 0x627110b3
    Found by: stack scanning
29  libdvm.so + 0xbabd4
     sp = 0x5d2bcb9c    pc = 0x409dbbd6
    Found by: stack scanning
30  libdvm.so + 0x52383
     sp = 0x5d2bcbbc    pc = 0x40973385
    Found by: stack scanning
31  libxul.so + 0x2de01b
     sp = 0x5d2bcbc4    pc = 0x6271101d
    Found by: stack scanning
32  libmozglue.so!Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun [APKOpen.cpp:4e1be9ada51a : 355 + 0x3]
     sp = 0x5d2bcbc8    pc = 0x5bc99b15
So, we hacked cert_CheckCertRevocationStatus to lie, and this is the new fail.



Thread 8 (crashed)
 0  libnss3.so!nssCertificate_Destroy [certificate.c:6088f3785cb6 : 98 + 0x0]
     r4 = 0xffffff82    r5 = 0x58f7e010    r6 = 0x578093d0    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000001   r10 = 0x00000001    fp = 0x0000ffff
     sp = 0x514fbd18    lr = 0x50e8e680    pc = 0x50e8e688
    Found by: given as instruction pointer in context
 1  libnss3.so!ssl3_CleanupPeerCerts [ssl3con.c:6088f3785cb6 : 8493 + 0x6]
     sp = 0x514fbd50    pc = 0x50f085c4
    Found by: stack scanning
 2  libnss3.so!ssl3_DestroySSL3Info [ssl3con.c:6088f3785cb6 : 10784 + 0x6]
     sp = 0x514fbd60    pc = 0x50f125e4
    Found by: stack scanning
 3  libnss3.so!ssl_DestroySocketContents [sslsock.c:6088f3785cb6 : 408 + 0x6]
     sp = 0x514fbd68    pc = 0x50f1ef60
    Found by: stack scanning
 4  libnss3.so!ssl_FreeSocket [sslsock.c:6088f3785cb6 : 471 + 0x6]
     sp = 0x514fbd78    pc = 0x50f200e8
    Found by: stack scanning
 5  libnss3.so!ssl_DefClose [ssldef.c:6088f3785cb6 : 205 + 0x6]
     sp = 0x514fbd80    pc = 0x50f19d08
    Found by: stack scanning
 6  libnss3.so!ssl_Close [sslsock.c:6088f3785cb6 : 2088 + 0xe]
     sp = 0x514fbd90    pc = 0x50f1f634
    Found by: stack scanning
 7  libxul.so!nsNSSSocketInfo::CloseSocketAndDestroy(nsNSSShutDownPreventionLock const&) [nsNSSIOLayer.cpp:6088f3785cb6 : 769 + 0xe]
     sp = 0x514fbd98    pc = 0x5415e0b4
    Found by: stack scanning
 8  libxul.so!nsSSLIOLayerClose [nsNSSIOLayer.cpp:6088f3785cb6 : 747 + 0xa]
     sp = 0x514fbdb0    pc = 0x5415e11c
    Found by: stack scanning
 9  libnss3.so!PR_Close [priometh.c:6088f3785cb6 : 104 + 0xa]
     sp = 0x514fbdc0    pc = 0x50f3cd6c
    Found by: stack scanning
10  libxul.so!nsSocketTransport::ReleaseFD_Locked(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1452 + 0x6]
     sp = 0x514fbdc8    pc = 0x5385c1f8
    Found by: stack scanning
11  libxul.so!nsSocketTransport::OnSocketDetached(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1699 + 0x6]
     sp = 0x514fbdd0    pc = 0x5385e810
    Found by: stack scanning
12  libxul.so!nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext*, nsSocketTransportService::SocketContext*) [nsSocketTransportService2.cpp:6088f3785cb6 : 180 + 0xa]
     sp = 0x514fbdf0    pc = 0x5385f3a0
    Found by: stack scanning
13  libxul.so!nsSocketTransportService::DoPollIteration(bool) [nsSocketTransportService2.cpp:6088f3785cb6 : 819 + 0x6]
     sp = 0x514fbe10    pc = 0x5385fe38
    Found by: stack scanning
14  libnss3.so!PR_ExitMonitor [ptsynch.c:6088f3785cb6 : 557 + 0x6]
     sp = 0x514fbe18    pc = 0x50f50998
    Found by: stack scanning
15  libxul.so!nsSocketTransportService::Run() [nsSocketTransportService2.cpp:6088f3785cb6 : 641 + 0xe]
     sp = 0x514fbe48    pc = 0x5385fff0
    Found by: stack scanning
We have noted that these crashes occur in reftests but not in mochitests. I compared the way we launch the browser for reftests vs for mochitests. I only see these differences:

 - mochitest defines env var MOZ_HIDE_RESULTS_TABLE; reftest does not
 - mochitest passes a url on the command line; reftest does not

Of course the profile content is different -- others are looking at that. And the tests themselves...

-----

Raw data, mochitest:

cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/profile/', 'http://mochi.test:8888/tests/?autorun=1&closeWhenDone=1&logFile=%2Fmnt%2Fsdcard%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&totalChunks=8&thisChunk=1&hideResultsTable=1&testManifest=android.json&runOnly=true']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_HIDE_RESULTS_TABLE': '1', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpgJuCUkpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpz_tUa_/runtests_leaks.log'}

Raw data, reftest:

cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/reftest/profile/']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpVhCGWHpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpbL13B3/runreftest_leaks.log'}
I looked into the preferences a bit, and while adding these preferences (which exist in mochitest) to talos and reftest I had great success:
http://people.mozilla.org/~jmaher/mobile_prefs.patch (note: these resolve to nothing, 404)

reftest results:
https://tbpl.mozilla.org/?tree=Try&rev=23b4329d7cda (<2% failure rate)

talos results:
https://tbpl.mozilla.org/?tree=Try&rev=f52e0ac62c73 (<5% failure rate)

* we are usually between 8-14% total failure rate, reftest/talos jobs retriggered fall into that category if not higher on average.  These low failure rates are insane.

Looking at the failures, I do not see any process crashes, most of the stuff is installation issues or cleanup issues.  

The data is convincing that something about these preferences are allowing us to not hit this error condition.  I would like to narrow this list of preferences down to a smaller subset to see which preference[s] prevent us from seeing a crash on shutdown.

From a nss/necko perspective, does any of this help?
Thanks. I think the effect that your pref changes have is to basically stop all the SSL networking for this test suite. It also means that we can probably reduce this further by testing directly against the default safebrowsing server.

I need to add some better error handling to nssCertificate_Destroy and friends to better detect when the reference count goes below zero.

By the way, do these crashes ever happen in debug builds, or only in release builds?
we only run tests on opt builds, so I cannot answer with confidence on the debug build stuff.

bsmith- are you saying I should only have the safe browser prefs, or I should remove the safe browsing prefs first?
Well, we do run tests, hidden and going quite poorly, on debug on the Cedar tree, but between their fondness for crashing before shutdown and the infrequency with which they run, it's not really possible to say whether we don't see them with debug builds, or just we haven't yet seen them.
(In reply to Brian Smith (:bsmith) from comment #9)
> By the way, do these crashes ever happen in debug builds, or only in release
> builds?

When I was chasing this bug, I was only running debug builds, and for a while could reproduce it at will on the try servers for Android. I was also able to reproduce it for a while on osx using debug builds.
(In reply to Rand Dow [:randix] from comment #12)
> (In reply to Brian Smith (:bsmith) from comment #9)
> > By the way, do these crashes ever happen in debug builds, or only in release
> > builds?
> 
> When I was chasing this bug, I was only running debug builds, and for a
> while could reproduce it at will on the try servers for Android. I was also
> able to reproduce it for a while on osx using debug builds.

What, in as much detail as you can remember, did you do to reproduce it on OS X?
ok, I have narrowed it down to 1 single pref:
user_pref("extensions.update.background.url", "http://127.0.0.1:8888/extensions-dummy/updateBackgroundURL");

If we add this pref, I see no crashes:
https://tbpl.mozilla.org/?tree=Try&rev=bf4d196ef0fc

Does this help diagnose the problem?

I would like to add this pref in general as I have never seen jsreftest be so stable before!
Please land all of them, not just that one, right now, on mozilla-central (I'll watch it, and watch where I'll be merging it to all the other trunk branches).

Those are all "don't fail intermittently in incomprehensible ways at any point during the test run because we're hitting the network in the background in the way we absolutely should not be doing" prefs, that I'm horrified to learn we didn't manage to ever set for Android.
This will also need to be fixed in talos as well.  I have 1 other thing to investigate before updating talos, I can include this as well.  

The danger in setting the prefs is we don't fix the root cause which folks in the real world can hit.  Now that :bsmith has a really good read on this and we have narrowed it down to a smaller case, we are probably fine landing this.

I wonder if b2g needs this?
Yeah, I didn't close the bug as though I thought that was the total solution to everything, but completely apart from shutdown we do not under any circumstances want extension or plugin or safebrowsing background updates hitting the network during test runs - some of those prefs come from desktop leaks at shutdown from the updates running then, but some of them come from some random test intermittently failing and chasing down why it failed to surprise network activity sometimes having surprising results. In the case of some random test intermittently failing on Android, of course, we just haven't ever tried to chase down why it failed because reproducing failures in Android tests is massively annoying, so we don't.
I filed bug 874147 for the Reftest configuration issue and bug 874149 for the Talos issue, because we should track those things separately from the other things causing these test failures, such as the root cause of the crashing.
Keywords: crash
Crash Signature: IssuerCache_Destroy] [@ PR_Lock | nssCertificateStore_Lock | nssCertificate_ Destroy] [@ nssArena_Destroy | nssCertificate_Destroy] → IssuerCache_Destroy] [@ PR_Lock | nssCertificateStore_Lock | nssCertificate_ Destroy] [@ nssArena_Destroy | nssCertificate_Destroy] [@ nssCertificate_Destroy | IssuerCache_Destroy ]
Assignee: brian → nobody
Whiteboard: [necko-backlog]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → WONTFIX
Closing because no crashes reported for 12 weeks.
You need to log in before you can comment on or make changes to this bug.