Closed
Bug 871575
Opened 12 years ago
Closed 6 years ago
Investigate and fix the nss shutdown crash that we see on Android
Categories
(Core :: Networking, defect, P3)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: jmaher, Unassigned)
References
Details
(Keywords: crash, Whiteboard: [necko-backlog])
Crash Data
I want to use this bug to track, discuss, and fix the nss_shutdown crash that we see on Android. This bug should not be used to star test failures on TBPL.
Comment 1•12 years ago
(This won't appear on TBPL unless the keyword intermittent-failure is used & the summary matches, so we're safe from starring).
Thank you for filing this - hopefully it will make communication (and visibility into progress) easier :-)
Summary: [DO NOT STAR] investigate and fix the nss shutdown crash that we see on Android → Investigate and fix the nss shutdown crash that we see on Android
Comment 2•12 years ago
Brad, you mentioned Doug is making progress on this. Is that being tracked somewhere or is IRC pinging the only way to find out?
| Reporter | ||
Updated•12 years ago
Crash Signature: [@ nssCertificate_Destroy]
[@ nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | IssuerCache_Destroy]
[@ 0xffff0fc4 | PR_AtomicDecrement | nssCertificate_Destroy]
[@ FreeArenaList | PORT_FreeArena_Util | IssuerCache_Destroy]
[…
| Reporter | ||
Comment 3•12 years ago
here is an example log file that shows this failure:
https://tbpl.mozilla.org/php/getParsedLog.php?id=22983942&tree=Mozilla-Inbound
Comment 4•12 years ago
I have a pretty good, but vague, idea of what is happening here. I haven't had time to dig into the details but I am planning to do so this week. I think there is a combination of bugs that need to be fixed: at least one in Necko and one in PSM. The Necko bug is that Necko tries to do SSL networking while and/or after NSS has been shut down. The PSM/NSS bug is almost definitely a refcounting error in CERTCertificate, i.e. we're calling CERT_DestroyCertificate() at least one more time than we're calling CERT_DupCertificate().
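To make the suspected refcounting error concrete, here is a minimal sketch in Python. It is a toy model, not NSS code: the class `RefCounted` and its `dup()`/`destroy()` methods are hypothetical stand-ins for a reference-counted certificate, and the "cache" holder is only an illustration. It shows how one destroy call too many lands on an object that has already been freed.

```python
class RefCounted:
    """Toy stand-in for a reference-counted object (hypothetical, not NSS)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1  # the creator holds the first reference
        self.freed = False

    def dup(self):
        """Take an additional reference."""
        if self.freed:
            raise RuntimeError(f"dup() on freed object {self.name!r}")
        self.refcount += 1
        return self

    def destroy(self):
        """Drop one reference; free the object when the count reaches zero."""
        if self.freed:
            # In C this would touch freed memory; the model fails loudly instead.
            raise RuntimeError(f"destroy() on freed object {self.name!r}")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


cert = RefCounted("issuer")
cache_ref = cert.dup()   # a second holder takes its own reference (count: 2)
cert.destroy()           # balanced release by the first holder (count: 1)
cache_ref.destroy()      # balanced release by the second holder (count: 0)

try:
    cache_ref.destroy()  # one destroy too many
    over_release_detected = False
except RuntimeError:
    over_release_detected = True
```

In the model the extra call raises immediately; in unchecked native code the same imbalance would more plausibly show up later, for example when shutdown walks a cache that still points at the freed object.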
Assignee: doug.turner → bsmith
Comment 5•12 years ago
Best stack so far:
Thread 12 (crashed)
0 libnss3.so!PR_Lock [ptsynch.c : 184 + 0x2]
r4 = 0x00000000 r5 = 0x6ac02fa0 r6 = 0x63ff8484 r7 = 0x5d2bc858
r8 = 0x62155818 r9 = 0x62140b60 r10 = 0x00000008 fp = 0x00000010
sp = 0x5d2bc850 lr = 0x5f28591d pc = 0x5f28591e
Found by: given as instruction pointer in context
1 libnss3.so!nssCertificate_Destroy [certificate.c:4e1be9ada51a : 106 + 0x7]
sp = 0x5d2bc858 pc = 0x5f2060a5
Found by: stack scanning
2 libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1182 + 0x5]
sp = 0x5d2bc88c pc = 0x5f201c5d
Found by: stack scanning
3 libnss3.so!IssuerCache_Destroy [crl.c:4e1be9ada51a : 1128 + 0x3]
sp = 0x5d2bc890 pc = 0x5f201c29
Found by: stack scanning
4 libnss3.so!FreeIssuer [crl.c:4e1be9ada51a : 1241 + 0x3]
sp = 0x5d2bc8a8 pc = 0x5f201c6d
Found by: stack scanning
5 libnss3.so!PL_HashTableEnumerateEntries [plhash.c : 374 + 0x1]
sp = 0x5d2bc8b8 pc = 0x5f288e4d
Found by: stack scanning
6 libnss3.so!ShutdownCRLCache [crl.c:4e1be9ada51a : 1306 + 0x9]
sp = 0x5d2bc8e8 pc = 0x5f201cd7
Found by: stack scanning
7 libnss3.so!nss_Shutdown [nssinit.c:4e1be9ada51a : 1082 + 0x3]
sp = 0x5d2bc900 pc = 0x5f252ce9
Found by: stack scanning
8 libnss3.so!NSS_Shutdown [nssinit.c:4e1be9ada51a : 1145 + 0x3]
sp = 0x5d2bc918 pc = 0x5f252dd1
Found by: stack scanning
9 libxul.so!nsNSSComponent::ShutdownNSS() [nsNSSComponent.cpp:4e1be9ada51a : 1902 + 0x3]
sp = 0x5d2bc928 pc = 0x62da673b
Found by: stack scanning
10 libxul.so!nsNSSComponent::DoProfileBeforeChange(nsISupports*) [nsNSSComponent.cpp:4e1be9ada51a : 2499 + 0x5]
sp = 0x5d2bc940 pc = 0x62da67cb
Found by: stack scanning
11 libxul.so!nsNSSComponent::Observe(nsISupports*, char const*, unsigned short const*) [nsNSSComponent.cpp:4e1be9ada51a : 2184 + 0x7]
sp = 0x5d2bc960 pc = 0x62da7141
Found by: stack scanning
12 libnss3.so!PR_Unlock [ptsynch.c : 205 + 0x3]
sp = 0x5d2bc9b0 pc = 0x5f285a91
Found by: stack scanning
13 libxul.so + 0x96e3a7
sp = 0x5d2bc9bc pc = 0x62da13a9
Found by: stack scanning
14 libnss3.so!PR_ExitMonitor [ptsynch.c : 557 + 0x3]
sp = 0x5d2bc9c0 pc = 0x5f285cb9
Found by: stack scanning
15 libxul.so!nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverList.cpp:4e1be9ada51a : 99 + 0x7]
sp = 0x5d2bc9e0 pc = 0x630b1887
Found by: stack scanning
16 libxul.so!nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) [nsObserverService.cpp:4e1be9ada51a : 161 + 0x9]
sp = 0x5d2bca08 pc = 0x630b1bdb
Found by: stack scanning
17 libxul.so!nsObserverService::Create(nsISupports*, nsID const&, void**) [nsAutoPtr.h:4e1be9ada51a : 880 + 0xf]
sp = 0x5d2bca14 pc = 0x630b1ba9
Found by: stack scanning
18 libxul.so!nsXREDirProvider::DoShutdown() [nsXREDirProvider.cpp:4e1be9ada51a : 871 + 0x11]
sp = 0x5d2bca20 pc = 0x6270d967
Found by: stack scanning
19 libxul.so!nsAppShellService::CreateHiddenWindow() [nsAppShellService.cpp:4e1be9ada51a : 88 + 0x3]
sp = 0x5d2bca24 pc = 0x62d759eb
Found by: stack scanning
20 libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup [nsAppRunner.cpp:4e1be9ada51a : 1120 + 0x9]
sp = 0x5d2bca48 pc = 0x62709d19
Found by: stack scanning
21 libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:4e1be9ada51a : 3964 + 0x5]
sp = 0x5d2bca60 pc = 0x6270d3cb
Found by: stack scanning
22 libxul.so!XRE_main [nsAppRunner.cpp:4e1be9ada51a : 4140 + 0x3]
sp = 0x5d2bca88 pc = 0x6270d553
Found by: stack scanning
23 libmozglue.so!__wrap_realloc [jemalloc.c:4e1be9ada51a : 4692 + 0x3]
sp = 0x5d2bcb00 pc = 0x5bc91d11
Found by: stack scanning
24 libmozglue.so!arena_malloc [jemalloc.c:4e1be9ada51a : 4167 + 0x3]
sp = 0x5d2bcb28 pc = 0x5bc907c5
Found by: stack scanning
25 libmozalloc.so!moz_xrealloc [mozalloc.cpp : 86 + 0x7]
sp = 0x5d2bcb58 pc = 0x5f03f02f
Found by: stack scanning
26 libxul.so!nsTArray_base<nsTArrayInfallibleAllocator>::EnsureCapacity(unsigned int, unsigned int) [nsTArray.h:4e1be9ada51a : 196 + 0x5]
sp = 0x5d2bcb68 pc = 0x62710e79
Found by: stack scanning
27 libxul.so!XRE_InitChildProcess [GeckoProfilerImpl.h:4e1be9ada51a : 286 + 0x0]
sp = 0x5d2bcb80 pc = 0x62710000
Found by: stack scanning
28 libxul.so!GeckoStart [nsAndroidStartup.cpp:4e1be9ada51a : 73 + 0xf]
sp = 0x5d2bcb98 pc = 0x627110b3
Found by: stack scanning
29 libdvm.so + 0xbabd4
sp = 0x5d2bcb9c pc = 0x409dbbd6
Found by: stack scanning
30 libdvm.so + 0x52383
sp = 0x5d2bcbbc pc = 0x40973385
Found by: stack scanning
31 libxul.so + 0x2de01b
sp = 0x5d2bcbc4 pc = 0x6271101d
Found by: stack scanning
32 libmozglue.so!Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun [APKOpen.cpp:4e1be9ada51a : 355 + 0x3]
sp = 0x5d2bcbc8 pc = 0x5bc99b15
Comment 6•12 years ago
So, we hacked cert_CheckCertRevocationStatus to lie, and this is the new fail.
Thread 8 (crashed)
0 libnss3.so!nssCertificate_Destroy [certificate.c:6088f3785cb6 : 98 + 0x0]
r4 = 0xffffff82 r5 = 0x58f7e010 r6 = 0x578093d0 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000001 r10 = 0x00000001 fp = 0x0000ffff
sp = 0x514fbd18 lr = 0x50e8e680 pc = 0x50e8e688
Found by: given as instruction pointer in context
1 libnss3.so!ssl3_CleanupPeerCerts [ssl3con.c:6088f3785cb6 : 8493 + 0x6]
sp = 0x514fbd50 pc = 0x50f085c4
Found by: stack scanning
2 libnss3.so!ssl3_DestroySSL3Info [ssl3con.c:6088f3785cb6 : 10784 + 0x6]
sp = 0x514fbd60 pc = 0x50f125e4
Found by: stack scanning
3 libnss3.so!ssl_DestroySocketContents [sslsock.c:6088f3785cb6 : 408 + 0x6]
sp = 0x514fbd68 pc = 0x50f1ef60
Found by: stack scanning
4 libnss3.so!ssl_FreeSocket [sslsock.c:6088f3785cb6 : 471 + 0x6]
sp = 0x514fbd78 pc = 0x50f200e8
Found by: stack scanning
5 libnss3.so!ssl_DefClose [ssldef.c:6088f3785cb6 : 205 + 0x6]
sp = 0x514fbd80 pc = 0x50f19d08
Found by: stack scanning
6 libnss3.so!ssl_Close [sslsock.c:6088f3785cb6 : 2088 + 0xe]
sp = 0x514fbd90 pc = 0x50f1f634
Found by: stack scanning
7 libxul.so!nsNSSSocketInfo::CloseSocketAndDestroy(nsNSSShutDownPreventionLock const&) [nsNSSIOLayer.cpp:6088f3785cb6 : 769 + 0xe]
sp = 0x514fbd98 pc = 0x5415e0b4
Found by: stack scanning
8 libxul.so!nsSSLIOLayerClose [nsNSSIOLayer.cpp:6088f3785cb6 : 747 + 0xa]
sp = 0x514fbdb0 pc = 0x5415e11c
Found by: stack scanning
9 libnss3.so!PR_Close [priometh.c:6088f3785cb6 : 104 + 0xa]
sp = 0x514fbdc0 pc = 0x50f3cd6c
Found by: stack scanning
10 libxul.so!nsSocketTransport::ReleaseFD_Locked(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1452 + 0x6]
sp = 0x514fbdc8 pc = 0x5385c1f8
Found by: stack scanning
11 libxul.so!nsSocketTransport::OnSocketDetached(PRFileDesc*) [nsSocketTransport2.cpp:6088f3785cb6 : 1699 + 0x6]
sp = 0x514fbdd0 pc = 0x5385e810
Found by: stack scanning
12 libxul.so!nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext*, nsSocketTransportService::SocketContext*) [nsSocketTransportService2.cpp:6088f3785cb6 : 180 + 0xa]
sp = 0x514fbdf0 pc = 0x5385f3a0
Found by: stack scanning
13 libxul.so!nsSocketTransportService::DoPollIteration(bool) [nsSocketTransportService2.cpp:6088f3785cb6 : 819 + 0x6]
sp = 0x514fbe10 pc = 0x5385fe38
Found by: stack scanning
14 libnss3.so!PR_ExitMonitor [ptsynch.c:6088f3785cb6 : 557 + 0x6]
sp = 0x514fbe18 pc = 0x50f50998
Found by: stack scanning
15 libxul.so!nsSocketTransportService::Run() [nsSocketTransportService2.cpp:6088f3785cb6 : 641 + 0xe]
sp = 0x514fbe48 pc = 0x5385fff0
Found by: stack scanning
Comment 7•12 years ago
We have noted that these crashes occur in reftests but not in mochitests. I compared the way we launch the browser for reftests vs for mochitests. I only see these differences:
- mochitest defines env var MOZ_HIDE_RESULTS_TABLE; reftest does not
- mochitest passes a url on the command line; reftest does not
Of course the profile content is different -- others are looking at that. And the tests themselves...
-----
Raw data, mochitest:
cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/profile/', 'http://mochi.test:8888/tests/?autorun=1&closeWhenDone=1&logFile=%2Fmnt%2Fsdcard%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&totalChunks=8&thisChunk=1&hideResultsTable=1&testManifest=android.json&runOnly=true']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_HIDE_RESULTS_TABLE': '1', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpgJuCUkpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpz_tUa_/runtests_leaks.log'}
Raw data, reftest:
cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/reftest/profile/']
cwd = None
env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpVhCGWHpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpbL13B3/runreftest_leaks.log'}
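The comparison above can be reproduced mechanically. The following sketch uses the raw data exactly as pasted in this comment; the variable names and the `url_args` helper are only for illustration. It compares environment variable names (the temp-file values differ on every run, so values are ignored) and looks for URL arguments on each command line.

```python
mochitest_cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/profile/', 'http://mochi.test:8888/tests/?autorun=1&closeWhenDone=1&logFile=%2Fmnt%2Fsdcard%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&totalChunks=8&thisChunk=1&hideResultsTable=1&testManifest=android.json&runOnly=true']
mochitest_env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_HIDE_RESULTS_TABLE': '1', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpgJuCUkpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpz_tUa_/runtests_leaks.log'}

reftest_cmd = ['org.mozilla.fennec', '-no-remote', '-profile', '/mnt/sdcard/tests/reftest/profile/']
reftest_env = {'MOZ_CRASHREPORTER': '1', 'XPCOM_DEBUG_BREAK': 'stack', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1', 'MOZ_PROCESS_LOG': '/tmp/tmpVhCGWHpidlog', 'XPCOM_MEM_BLOAT_LOG': '/tmp/tmpbL13B3/runreftest_leaks.log'}


def url_args(cmd):
    """Return the command-line arguments that look like URLs."""
    return [arg for arg in cmd if arg.startswith(("http://", "https://"))]


# Compare variable names only; log paths are per-run temp files.
only_in_mochitest = sorted(set(mochitest_env) - set(reftest_env))
only_in_reftest = sorted(set(reftest_env) - set(mochitest_env))

print("env vars only in mochitest:", only_in_mochitest)
print("env vars only in reftest:", only_in_reftest)
print("mochitest URL args:", len(url_args(mochitest_cmd)))
print("reftest URL args:", len(url_args(reftest_cmd)))
```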
| Reporter | ||
Comment 8•12 years ago
I looked into the preferences a bit, and while adding these preferences (which exist in mochitest) to talos and reftest I had great success:
http://people.mozilla.org/~jmaher/mobile_prefs.patch (note: these resolve to nothing, 404)
reftest results:
https://tbpl.mozilla.org/?tree=Try&rev=23b4329d7cda (<2% failure rate)
talos results:
https://tbpl.mozilla.org/?tree=Try&rev=f52e0ac62c73 (<5% failure rate)
* We usually see an 8-14% total failure rate; retriggered reftest/talos jobs fall into that range, if not higher, on average. These low failure rates are insane.
Looking at the failures, I do not see any process crashes; most of them are installation issues or cleanup issues.
The data is convincing that something about these preferences is allowing us to avoid this error condition. I would like to narrow this list of preferences down to a smaller subset to see which preference[s] prevent us from seeing a crash on shutdown.
From a nss/necko perspective, does any of this help?
Comment 9•12 years ago
Thanks. I think the effect that your pref changes have is to basically stop all the SSL networking for this test suite. It also means that we can probably reduce this further by testing directly against the default safebrowsing server.
I need to add some better error handling to nssCertificate_Destroy and friends to better detect when the reference count goes below zero.
By the way, do these crashes ever happen in debug builds, or only in release builds?
| Reporter | ||
Comment 10•12 years ago
we only run tests on opt builds, so I cannot answer with confidence on the debug build stuff.
bsmith- are you saying I should only have the safe browsing prefs, or I should remove the safe browsing prefs first?
Comment 11•12 years ago
Well, we do run tests, hidden and going quite poorly, on debug on the Cedar tree, but between their fondness for crashing before shutdown and the infrequency with which they run, it's not really possible to say whether we don't see them with debug builds, or just we haven't yet seen them.
Comment 12•12 years ago
(In reply to Brian Smith (:bsmith) from comment #9)
> By the way, do these crashes ever happen in debug builds, or only in release
> builds?
When I was chasing this bug, I was only running debug builds, and for a while could reproduce it at will on the try servers for Android. I was also able to reproduce it for a while on osx using debug builds.
Comment 13•12 years ago
(In reply to Rand Dow [:randix] from comment #12)
> (In reply to Brian Smith (:bsmith) from comment #9)
> > By the way, do these crashes ever happen in debug builds, or only in release
> > builds?
>
> When I was chasing this bug, I was only running debug builds, and for a
> while could reproduce it at will on the try servers for Android. I was also
> able to reproduce it for a while on osx using debug builds.
What, in as much detail as you can remember, did you do to reproduce it on OS X?
| Reporter | ||
Comment 14•12 years ago
ok, I have narrowed it down to 1 single pref:
user_pref("extensions.update.background.url", "http://127.0.0.1:8888/extensions-dummy/updateBackgroundURL");
If we add this pref, I see no crashes:
https://tbpl.mozilla.org/?tree=Try&rev=bf4d196ef0fc
Does this help diagnose the problem?
I would like to add this pref in general as I have never seen jsreftest be so stable before!
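As a rough illustration of applying this pref to a test profile, here is a sketch that appends the line to a profile's user.js. It assumes the profile is a local directory the harness can write to before launch; the helper name `add_user_pref` is hypothetical and is not the harness's actual API. The pref name and value are the ones given above.

```python
import json
import os
import tempfile


def add_user_pref(profile_dir, name, value):
    """Append one user_pref() line to user.js in profile_dir (hypothetical helper)."""
    # json.dumps produces a double-quoted, escaped string literal.
    line = "user_pref({}, {});\n".format(json.dumps(name), json.dumps(value))
    with open(os.path.join(profile_dir, "user.js"), "a", encoding="utf-8") as f:
        f.write(line)
    return line


# Demonstrate against a throwaway directory standing in for the test profile.
with tempfile.TemporaryDirectory() as profile_dir:
    add_user_pref(
        profile_dir,
        "extensions.update.background.url",
        "http://127.0.0.1:8888/extensions-dummy/updateBackgroundURL",
    )
    with open(os.path.join(profile_dir, "user.js"), encoding="utf-8") as f:
        contents = f.read()

print(contents, end="")
```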
Comment 15•12 years ago
Please land all of them, not just that one, right now, on mozilla-central (I'll watch it, and watch where I'll be merging it to all the other trunk branches).
Those are all "don't fail intermittently in incomprehensible ways at any point during the test run because we're hitting the network in the background in the way we absolutely should not be doing" prefs, that I'm horrified to learn we didn't manage to ever set for Android.
Comment 16•12 years ago
| Reporter | ||
Comment 17•12 years ago
This will need to be fixed in talos as well. I have one other thing to investigate before updating talos; I can include this then.
The danger in setting the prefs is that we don't fix the root cause, which folks in the real world can hit. Now that :bsmith has a really good read on this and we have narrowed it down to a smaller case, we are probably fine landing this.
I wonder if b2g needs this?
Comment 18•12 years ago
Yeah, I didn't close the bug as though I thought that was the total solution to everything. But completely apart from shutdown, we do not under any circumstances want extension, plugin, or safebrowsing background updates hitting the network during test runs. Some of those prefs come from desktop leaks at shutdown caused by the updates running then, but some of them come from a random test intermittently failing and someone chasing the failure down to surprise network activity that sometimes has surprising results. In the case of some random test intermittently failing on Android, of course, we just haven't ever tried to chase down why it failed, because reproducing failures in Android tests is massively annoying, so we don't.
Comment 19•12 years ago
I filed bug 874147 for the Reftest configuration issue and bug 874149 for the Talos issue, because we should track those things separately from the other things causing these test failures, such as the root cause of the crashing.
Updated•12 years ago
Crash Signature: IssuerCache_Destroy]
[@ PR_Lock | nssCertificateStore_Lock | nssCertificate_Destroy]
[@ nssArena_Destroy | nssCertificate_Destroy] → IssuerCache_Destroy]
[@ PR_Lock | nssCertificateStore_Lock | nssCertificate_Destroy]
[@ nssArena_Destroy | nssCertificate_Destroy]
[@ nssCertificate_Destroy | IssuerCache_Destroy ]
Updated•11 years ago
Assignee: brian → nobody
Updated•9 years ago
Whiteboard: [necko-backlog]
Comment 20•8 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Comment 21•8 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Comment 22•6 years ago
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Comment 23•6 years ago
Closing because no crashes reported for 12 weeks.