Closed
Bug 871575
Opened 11 years ago
Closed 5 years ago
Investigate and fix the nss shutdown crash that we see on Android
Categories
(Core :: Networking, defect, P3)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: jmaher, Unassigned)
References
Details
(Keywords: crash, Whiteboard: [necko-backlog])
Crash Data
I want to use this bug to track, communicate, and fix the nss_shutdown crash that we see on Android. This bug should not be used to star test failures on TBPL.
Comment 1•11 years ago
(This won't appear on TBPL unless the keyword intermittent-failure is used & the summary matches, so we're safe from starring). Thank you for filing this - hopefully it will make communication (and visibility into progress) easier :-)
Summary: [DO NOT STAR] investigate and fix the nss shutdown crash that we see on Android → Investigate and fix the nss shutdown crash that we see on Android
Comment 2•11 years ago
Brad, you mentioned Doug is making progress on this. Is that being tracked somewhere or is IRC pinging the only way to find out?
Reporter | ||
Updated•11 years ago
Crash Signature: [@ nssCertificate_Destroy]
[@ nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | IssuerCache_Destroy]
[@ 0xffff0fc4 | PR_AtomicDecrement | nssCertificate_Destroy]
[@ FreeArenaList | PORT_FreeArena_Util | IssuerCache_Destroy]
[…
Reporter | ||
Comment 3•11 years ago
here is an example log file that shows this failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=22983942&tree=Mozilla-Inbound
Comment 4•11 years ago
I have a pretty good, but vague, idea of what is happening here. I haven't had time to dig into the details but I am planning to do so this week. I think there is a combination of bugs that need to be fixed: at least one in Necko and one in PSM. The Necko bug is that Necko tries to do SSL networking while and/or after NSS has been shut down. The PSM/NSS bug is almost definitely a refcounting error in CERTCertificate, i.e. we're calling CERT_DestroyCertificate() at least one time more than we're calling CERT_DupCertificate().
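To make the suspected refcounting error concrete, here is a toy model in Python. It is only a sketch of the general dup/destroy pattern: ToyCert, its methods, and the "cache" holder are hypothetical names for illustration and are not the NSS types or functions. It shows how a single extra release leaves a later holder calling destroy on an object that has already been freed.

```python
class ToyCert:
    """Toy stand-in for a reference-counted certificate (not the NSS type)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1      # the creator holds the first reference
        self.freed = False

    def dup(self):
        """Model of taking an extra reference."""
        if self.freed:
            raise RuntimeError(f"dup of freed cert {self.name!r}")
        self.refcount += 1
        return self

    def destroy(self):
        """Model of dropping a reference; the object is freed on the last one."""
        if self.freed:
            # In C this would be a use-after-free, not a clean exception.
            raise RuntimeError(f"destroy of freed cert {self.name!r}")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def run_unbalanced():
    """Release one time too many, then let a later holder clean up."""
    cert = ToyCert("issuer")
    cache_ref = cert.dup()   # a hypothetical cache takes its own reference
    cert.destroy()           # creator releases its reference
    cert.destroy()           # BUG: one release too many, object is now freed
    try:
        cache_ref.destroy()  # cache teardown now touches a freed object
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this model the extra release itself goes unnoticed; the failure only surfaces later, when the remaining holder tears down. That would be consistent with a crash that shows up at shutdown rather than at the point of the actual bug.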
Assignee: doug.turner → bsmith
Comment 5•11 years ago
Best stack so far:

Thread 12 (crashed)
 0  libnss3.so!PR_Lock [ptsynch.c : 184 + 0x2]
    r4 = 0x00000000 r5 = 0x6ac02fa0 r6 = 0x63ff8484 r7 = 0x5d2bc858
    r8 = 0x62155818 r9 = 0x62140b60 r10 = 0x00000008 fp = 0x00000010
    sp = 0x5d2bc850 lr = 0x5f28591d pc = 0x5f28591e
    Found by: given as instruction pointer in context
 1  libnss3.so!nssCertificate_Destroy [certificate.c:4e1be9ada51a : 106 + 0x7]
    sp = 0x5d2bc858 pc = 0x5f2060a5
    Found by: stack scanning
 2  libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1182 + 0x5]
    sp = 0x5d2bc88c pc = 0x5f201c5d
    Found by: stack scanning
 3  libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1128 + 0x3]
    sp = 0x5d2bc890 pc = 0x5f201c29
    Found by: stack scanning
 4  libnss3.so!FreeIssuer [crl.c:4e1be9ada51a : 1241 + 0x3]
    sp = 0x5d2bc8a8 pc = 0x5f201c6d
    Found by: stack scanning
 5  libnss3.so!PL_HashTableEnumerateEntries [plhash.c : 374 + 0x1]
    sp = 0x5d2bc8b8 pc = 0x5f288e4d
    Found by: stack scanning
 6  libnss3.so!ShutdownCRLCache [crl.c:4e1be9ada51a : 1306 + 0x9]
    sp = 0x5d2bc8e8 pc = 0x5f201cd7
    Found by: stack scanning
 7  libnss3.so!nss_Shutdown [nssinit.c:4e1be9ada51a : 1082 + 0x3]
    sp = 0x5d2bc900 pc = 0x5f252ce9
    Found by: stack scanning
 8  libnss3.so!NSS_Shutdown [nssinit.c:4e1be9ada51a : 1145 + 0x3]
    sp = 0x5d2bc918 pc = 0x5f252dd1
    Found by: stack scanning
 9  libxul.so!nsNSSComponent::ShutdownNSS() [nsNSSComponent.cpp:4e1be9ada51a : 1902 + 0x3]
    sp = 0x5d2bc928 pc = 0x62da673b
    Found by: stack scanning
10  libxul.so!nsNSSComponent::DoProfileBeforeChange(nsISupports*) [nsNSSComponent.cpp:4e1be9ada51a : 2499 + 0x5]
    sp = 0x5d2bc940 pc = 0x62da67cb
    Found by: stack scanning
11  libxul.so!nsNSSComponent::Observe(nsISupports*, char const*, unsigned short const*) [nsNSSComponent.cpp:4e1be9ada51a : 2184 + 0x7]
    sp = 0x5d2bc960 pc = 0x62da7141
    Found by: stack scanning
12  libnss3.so!PR_Unlock [ptsynch.c : 205 + 0x3]
    sp = 0x5d2bc9b0 pc = 0x5f285a91
    Found by: stack scanning
13  libxul.so + 0x96e3a7
    sp = 0x5d2bc9bc pc = 0x62da13a9
    Found by: stack scanning
14  libnss3.so!PR_ExitMonitor [ptsynch.c : 557 + 0x3]
    sp = 0x5d2bc9c0 pc = 0x5f285cb9
    Found by: stack scanning
15  libxul.so!nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverList.cpp:4e1be9ada51a : 99 + 0x7]
    sp = 0x5d2bc9e0 pc = 0x630b1887
    Found by: stack scanning
16  libxul.so!nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverService.cpp:4e1be9ada51a : 161 + 0x9]
    sp = 0x5d2bca08 pc = 0x630b1bdb
    Found by: stack scanning
17  libxul.so!nsObserverService::Create(nsISupports*, nsID const&, void**) [nsAutoPtr.h:4e1be9ada51a : 880 + 0xf]
    sp = 0x5d2bca14 pc = 0x630b1ba9
    Found by: stack scanning
18  libxul.so!nsXREDirProvider::DoShutdown() [nsXREDirProvider.cpp:4e1be9ada51a : 871 + 0x11]
    sp = 0x5d2bca20 pc = 0x6270d967
    Found by: stack scanning
19  libxul.so!nsAppShellService::CreateHiddenWindow() [nsAppShellService.cpp:4e1be9ada51a : 88 + 0x3]
    sp = 0x5d2bca24 pc = 0x62d759eb
    Found by: stack scanning
20  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup [nsAppRunner.cpp:4e1be9ada51a : 1120 + 0x9]
    sp = 0x5d2bca48 pc = 0x62709d19
    Found by: stack scanning
21  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:4e1be9ada51a : 3964 + 0x5]
    sp = 0x5d2bca60 pc = 0x6270d3cb
    Found by: stack scanning
22  libxul.so!XRE_main [nsAppRunner.cpp:4e1be9ada51a : 4140 + 0x3]
    sp = 0x5d2bca88 pc = 0x6270d553
    Found by: stack scanning
23  libmozglue.so!__wrap_realloc [jemalloc.c:4e1be9ada51a : 4692 + 0x3]
    sp = 0x5d2bcb00 pc = 0x5bc91d11
    Found by: stack scanning
24  libmozglue.so!arena_malloc [jemalloc.c:4e1be9ada51a : 4167 + 0x3]
    sp = 0x5d2bcb28 pc = 0x5bc907c5
    Found by: stack scanning
25  libmozalloc.so!moz_xrealloc [mozalloc.cpp : 86 + 0x7]
    sp = 0x5d2bcb58 pc = 0x5f03f02f
    Found by: stack scanning
26  libxul.so!nsTArray_base<nsTArrayInfallibleAllocator>::EnsureCapacity(unsigned int, unsigned int) [nsTArray.h:4e1be9ada51a : 196 + 0x5]
    sp = 0x5d2bcb68 pc = 0x62710e79
    Found by: stack scanning
27  libxul.so!XRE_InitChildProcess [GeckoProfilerImpl.h:4e1be9ada51a : 286 + 0x0]
    sp = 0x5d2bcb80 pc = 0x62710000
    Found by: stack scanning
28  libxul.so!GeckoStart [nsAndroidStartup.cpp:4e1be9ada51a : 73 + 0xf]
    sp = 0x5d2bcb98 pc = 0x627110b3
    Found by: stack scanning
29  libdvm.so + 0xbabd4
    sp = 0x5d2bcb9c pc = 0x409dbbd6
    Found by: stack scanning
30  libdvm.so + 0x52383
    sp = 0x5d2bcbbc pc = 0x40973385
    Found by: stack scanning
31  libxul.so + 0x2de01b
    sp = 0x5d2bcbc4 pc = 0x6271101d
    Found by: stack scanning
32  libmozglue.so!Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun [APKOpen.cpp:4e1be9ada51a : 355 + 0x3]
    sp = 0x5d2bcbc8 pc = 0x5bc99b15
Comment 6•11 years ago
So, we hacked cert_CheckCertRevocationStatus to lie, and this is the new fail.

Thread 8 (crashed)
 0  libnss3.so!nssCertificate_Destroy [certificate.c:6088f3785cb6 : 98 + 0x0]
    r4 = 0xffffff82 r5 = 0x58f7e010 r6 = 0x578093d0 r7 = 0x00000000
    r8 = 0x00000000 r9 = 0x00000001 r10 = 0x00000001 fp = 0x0000ffff
    sp = 0x514fbd18 lr = 0x50e8e680 pc = 0x50e8e688
    Found by: given as instruction pointer in context
 1  libnss3.so!ssl3_CleanupPeerCerts [ssl3con.c:6088f3785cb6 : 8493 + 0x6]
    sp = 0x514fbd50 pc = 0x50f085c4
    Found by: stack scanning
 2  libnss3.so!ssl3_DestroySSL3Info [ssl3con.c:6088f3785cb6 : 10784 + 0x6]
    sp = 0x514fbd60 pc = 0x50f125e4
    Found by: stack scanning
 3  libnss3.so!ssl_DestroySocketContents [sslsock.c:6088f3785cb6 : 408 + 0x6]
    sp = 0x514fbd68 pc = 0x50f1ef60
    Found by: stack scanning
 4  libnss3.so!ssl_FreeSocket [sslsock.c:6088f3785cb6 : 471 + 0x6]
    sp = 0x514fbd78 pc = 0x50f200e8
    Found by: stack scanning
 5  libnss3.so!ssl_DefClose [ssldef.c:6088f3785cb6 : 205 + 0x6]
    sp = 0x514fbd80 pc = 0x50f19d08
    Found by: stack scanning
 6  libnss3.so!ssl_Close [sslsock.c:6088f3785cb6 : 2088 + 0xe]
    sp = 0x514fbd90 pc = 0x50f1f634
    Found by: stack scanning
 7  libxul.so!nsNSSSocketInfo::CloseSocketAndDestroy(nsNSSShutDownPreventionLock const&) [nsNSSIOLayer.cpp:6088f3785cb6 : 769 + 0xe]
    sp = 0x514fbd98 pc = 0x5415e0b4
    Found by: stack scanning
 8  libxul.so!nsSSLIOLayerClose [nsNSSIOLayer.cpp:6088f3785cb6 : 747 + 0xa]
    sp = 0x514fbdb0 pc = 0x5415e11c
    Found by: stack scanning
 9  libnss3.so!PR_Close [priometh.c:6088f3785cb6 : 104 + 0xa]
    sp = 0x514fbdc0 pc = 0x50f3cd6c
    Found by: stack scanning
10  libxul.so!nsSocketTransport::ReleaseFD_Locked(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1452 + 0x6]
    sp = 0x514fbdc8 pc = 0x5385c1f8
    Found by: stack scanning
11  libxul.so!nsSocketTransport::OnSocketDetached(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1699 + 0x6]
    sp = 0x514fbdd0 pc = 0x5385e810
    Found by: stack scanning
12  libxul.so!nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext*, nsSocketTransportService::SocketContext*) [nsSocketTransportService2.cpp:6088f3785cb6 : 180 + 0xa]
    sp = 0x514fbdf0 pc = 0x5385f3a0
    Found by: stack scanning
13  libxul.so!nsSocketTransportService::DoPollIteration(bool) [nsSocketTransportService2.cpp:6088f3785cb6 : 819 + 0x6]
    sp = 0x514fbe10 pc = 0x5385fe38
    Found by: stack scanning
14  libnss3.so!PR_ExitMonitor [ptsynch.c:6088f3785cb6 : 557 + 0x6]
    sp = 0x514fbe18 pc = 0x50f50998
    Found by: stack scanning
15  libxul.so!nsSocketTransportService::Run() [nsSocketTransportService2.cpp:6088f3785cb6 : 641 + 0xe]
    sp = 0x514fbe48 pc = 0x5385fff0
    Found by: stack scanning
Comment 7•11 years ago
We have noted that these crashes occur in reftests but not in mochitests. I compared the way we launch the browser for reftests vs for mochitests. I only see these differences:
- mochitest defines env var MOZ_HIDE_RESULTS_TABLE; reftest does not
- mochitest passes a url on the command line; reftest does not

Of course the profile content is different -- others are looking at that. And the tests themselves...

-----
Raw data, mochitest:
cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/profile/', 'http://mochi.test:8888/tests/?autorun=1&closeWhenDone=1&logFile=%2Fmnt%2Fsdcard%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&totalChunks=8&thisChunk=1&hideResultsTable=1&testManifest=android.json&runOnly=true']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_HIDE_RESULTS_TABLE': '1', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpgJuCUkpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpz_tUa_/runtests_leaks.log'}

Raw data, reftest:
cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/reftest/profile/']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpVhCGWHpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpbL13B3/runreftest_leaks.log'}
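For anyone who wants to repeat this comparison mechanically, a small Python sketch follows. The cmd and env values are copied from the raw data in this comment, except that the mochitest URL's query string is shortened for readability; diff_launch is a hypothetical helper written for this example, not part of any harness.

```python
def diff_launch(a, b):
    """Return what differs between two (cmd, env) launch descriptions."""
    cmd_a, env_a = a
    cmd_b, env_b = b
    shared = set(env_a) & set(env_b)
    return {
        "cmd_only_a": [arg for arg in cmd_a if arg not in cmd_b],
        "cmd_only_b": [arg for arg in cmd_b if arg not in cmd_a],
        "env_only_a": sorted(set(env_a) - set(env_b)),
        "env_only_b": sorted(set(env_b) - set(env_a)),
        "env_changed": sorted(k for k in shared if env_a[k] != env_b[k]),
    }


# Query string shortened for the example; the full URL is in the raw data.
MOCHITEST_URL = "http://mochi.test:8888/tests/?autorun=1"

mochitest = (
    ["org.mozilla.fennec", "-no-remote", "-profile",
     "/mnt/sdcard/tests/profile/", MOCHITEST_URL],
    {"MOZ_CRASHREPORTER": "1", "XPCOM_DEBUG_BREAK": "stack",
     "MOZ_HIDE_RESULTS_TABLE": "1", "MOZ_CRASHREPORTER_NO_REPORT": "1",
     "NO_EM_RESTART": "1", "MOZ_PROCESS_LOG": "/tmp/tmpgJuCUkpidlog",
     "XPCOM_MEM_BLOAT_LOG": "/tmp/tmpz_tUa_/runtests_leaks.log"},
)

reftest = (
    ["org.mozilla.fennec", "-no-remote", "-profile",
     "/mnt/sdcard/tests/reftest/profile/"],
    {"MOZ_CRASHREPORTER": "1", "XPCOM_DEBUG_BREAK": "stack",
     "MOZ_CRASHREPORTER_NO_REPORT": "1", "NO_EM_RESTART": "1",
     "MOZ_PROCESS_LOG": "/tmp/tmpVhCGWHpidlog",
     "XPCOM_MEM_BLOAT_LOG": "/tmp/tmpbL13B3/runreftest_leaks.log"},
)

result = diff_launch(mochitest, reftest)
```

Besides the two differences listed above, the diff also flags the profile path and the two temp-file paths (MOZ_PROCESS_LOG, XPCOM_MEM_BLOAT_LOG), which differ only because they are generated per run.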
Reporter | ||
Comment 8•11 years ago
I looked into the preferences a bit, and while adding these preferences (which exist in mochitest) to talos and reftest I had great success:
http://people.mozilla.org/~jmaher/mobile_prefs.patch (note: these resolve to nothing, 404)

reftest results: https://tbpl.mozilla.org/?tree=Try&rev=23b4329d7cda (<2% failure rate)
talos results: https://tbpl.mozilla.org/?tree=Try&rev=f52e0ac62c73 (<5% failure rate)

* we are usually between 8-14% total failure rate, reftest/talos jobs retriggered fall into that category if not higher on average.

These low failure rates are insane. Looking at the failures, I do not see any process crashes; most of the stuff is installation issues or cleanup issues. The data is convincing that something about these preferences is allowing us to not hit this error condition. I would like to narrow this list of preferences down to a smaller subset to see which preference[s] prevent us from seeing a crash on shutdown.

From a nss/necko perspective, does any of this help?
Comment 9•11 years ago
Thanks. I think the effect that your pref changes have is to basically stop all the SSL networking for this test suite. It also means that we can probably reduce this further by testing directly against the default safebrowsing server. I need to add some better error handling to nssCertificate_Destroy and friends to better detect when the reference count goes below zero. By the way, do these crashes ever happen in debug builds, or only in release builds?
Reporter | ||
Comment 10•11 years ago
We only run tests on opt builds, so I cannot answer with confidence on the debug build stuff. bsmith- are you saying I should only have the safe browsing prefs, or I should remove the safe browsing prefs first?
Comment 11•11 years ago
Well, we do run tests, hidden and going quite poorly, on debug on the Cedar tree, but between their fondness for crashing before shutdown and the infrequency with which they run, it's not really possible to say whether we don't see them with debug builds, or just we haven't yet seen them.
Comment 12•11 years ago
(In reply to Brian Smith (:bsmith) from comment #9)
> By the way, do these crashes ever happen in debug builds, or only in release
> builds?

When I was chasing this bug, I was only running debug builds, and for a while could reproduce it at will on the try servers for Android. I was also able to reproduce it for a while on osx using debug builds.
Comment 13•11 years ago
(In reply to Rand Dow [:randix] from comment #12)
> (In reply to Brian Smith (:bsmith) from comment #9)
> > By the way, do these crashes ever happen in debug builds, or only in release
> > builds?
>
> When I was chasing this bug, I was only running debug builds, and for a
> while could reproduce it at will on the try servers for Android. I was also
> able to reproduce it for a while on osx using debug builds.

What, in as much detail as you can remember, did you do to reproduce it on OS X?
Reporter | ||
Comment 14•11 years ago
ok, I have narrowed it down to 1 single pref:
user_pref("extensions.update.background.url", "http://127.0.0.1:8888/extensions-dummy/updateBackgroundURL");

If we add this pref, I see no crashes:
https://tbpl.mozilla.org/?tree=Try&rev=bf4d196ef0fc

Does this help diagnose the problem? I would like to add this pref in general as I have never seen jsreftest be so stable before!
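As a rough illustration of adding such a pref from a Python harness, the sketch below renders pref lines in the user_pref(...) form shown above. How the reftest and talos harnesses actually write their profiles is not shown in this bug, so format_user_pref, render_user_js, and the NO_BACKGROUND_UPDATE_PREFS dict are hypothetical names; only the pref name and value come from this comment.

```python
import json


def format_user_pref(name, value):
    """Render one pref as a user_pref("name", value); line."""
    # json.dumps produces JS-compatible literals for str, bool and int values.
    return f"user_pref({json.dumps(name)}, {json.dumps(value)});"


def render_user_js(prefs):
    """Render a dict of prefs as text, one user_pref line per entry."""
    return "".join(format_user_pref(k, v) + "\n" for k, v in prefs.items())


# The single pref identified above; pointing it at 127.0.0.1 keeps the
# background update check off the real network.
NO_BACKGROUND_UPDATE_PREFS = {
    "extensions.update.background.url":
        "http://127.0.0.1:8888/extensions-dummy/updateBackgroundURL",
}

user_js_text = render_user_js(NO_BACKGROUND_UPDATE_PREFS)
```

Keeping the prefs in one dict would make it easier to share the same list between suites, assuming each harness can accept prefs in this form.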
Comment 15•11 years ago
Please land all of them, not just that one, right now, on mozilla-central (I'll watch it, and watch where I'll be merging it to all the other trunk branches). Those are all "don't fail intermittently in incomprehensible ways at any point during the test run because we're hitting the network in the background in the way we absolutely should not be doing" prefs, that I'm horrified to learn we didn't manage to ever set for Android.
Comment 16•11 years ago
Pushed in https://hg.mozilla.org/mozilla-central/rev/6d1306e3532a.
Reporter | ||
Comment 17•11 years ago
This will also need to be fixed in talos. I have 1 other thing to investigate before updating talos, so I can include this at the same time. The danger in setting the prefs is that we don't fix the root cause, which folks in the real world can hit. Now that :bsmith has a really good read on this and we have narrowed it down to a smaller case, we are probably fine landing this. I wonder if b2g needs this?
Comment 18•11 years ago
Yeah, I didn't close the bug as though I thought that was the total solution to everything, but completely apart from shutdown we do not under any circumstances want extension or plugin or safebrowsing background updates hitting the network during test runs - some of those prefs come from desktop leaks at shutdown from the updates running then, but some of them come from some random test intermittently failing and chasing down why it failed to surprise network activity sometimes having surprising results. In the case of some random test intermittently failing on Android, of course, we just haven't ever tried to chase down why it failed because reproducing failures in Android tests is massively annoying, so we don't.
Comment 19•11 years ago
I filed bug 874147 for the Reftest configuration issue and bug 874149 for the Talos issue, because we should track those things separately from the other things causing these test failures, such as the root cause of the crashing.
Updated•11 years ago
Crash Signature: IssuerCache_Destroy]
[@ PR_Lock | nssCertificateStore_Lock | nssCertificate_Destroy]
[@ nssArena_Destroy | nssCertificate_Destroy] → IssuerCache_Destroy]
[@ PR_Lock | nssCertificateStore_Lock | nssCertificate_Destroy]
[@ nssArena_Destroy | nssCertificate_Destroy]
[@ nssCertificate_Destroy | IssuerCache_Destroy ]
Updated•10 years ago
Assignee: brian → nobody
Updated•8 years ago
Whiteboard: [necko-backlog]
Comment 20•7 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Comment 21•7 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Comment 22•5 years ago
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Comment 23•5 years ago
Closing because no crashes reported for 12 weeks.