This bug was filed from the Socorro interface and is report bp-81611011-ce24-41f3-b56a-764c32160527.
=============================================================

There are a decent number of these crashes floating around on various versions. They're crashes early in startup, and users sometimes submit a large number of them.

I debugged the minidump for bp-81611011-ce24-41f3-b56a-764c32160527, and the MSVC debugger says we're crashing because gCacheData is null. The MSVC debugger also says sPreferences is null, and sShutdown is false. This suggests that the codepath from AddBoolVarCache calling GetBool calling GetBool calling InitStaticMembers didn't actually work right -- I'm guessing because the do_GetService somehow failed.

The stack in the MSVC debugger isn't any better than in crash-stats. However, Linux crashes like bp-563a3c80-2340-4bda-a49a-81d2d2160525 have a full stack.

Is there something obvious here that might cause the do_GetService to fail?
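To make that failure mode concrete, here is a minimal self-contained model of the crash (stand-in code, not the actual Gecko sources; it mirrors the names gCacheData, InitStaticMembers, GetBool, and AddBoolVarCache, but compresses the real logic, and aServiceOk is a hypothetical knob modeling whether do_GetService succeeds):

```cpp
#include <cstdio>
#include <vector>

struct CacheData { bool* location; bool defaultValue; };

// Mirrors the libpref globals: gCacheData is only allocated when the
// prefs service starts successfully, and is null before that.
static std::vector<CacheData>* gCacheData = nullptr;
static bool sShutdown = false;

// Stand-in for Preferences::InitStaticMembers(); aServiceOk models
// whether the do_GetService() call succeeds.
static bool InitStaticMembers(bool aServiceOk) {
  if (!sShutdown && aServiceOk && !gCacheData) {
    gCacheData = new std::vector<CacheData>();
  }
  return gCacheData != nullptr;
}

// Stand-in for Preferences::GetBool(): an init failure is swallowed and
// the default is returned, so the caller never learns init went wrong.
static bool GetBool(bool aDefault, bool aServiceOk) {
  if (!InitStaticMembers(aServiceOk)) {
    return aDefault;
  }
  return aDefault;  // (the real code would look the pref up here)
}

// Stand-in for Preferences::AddBoolVarCache(): nothing re-checks
// gCacheData, so a failed init turns the append into a null dereference,
// matching this bug's crash signature.
static void AddBoolVarCache(bool* aCache, bool aDefault, bool aServiceOk) {
  *aCache = GetBool(aDefault, aServiceOk);
  gCacheData->push_back({aCache, aDefault});  // crashes if gCacheData is null
}

int main() {
  bool cached = false;
  AddBoolVarCache(&cached, true, /*aServiceOk=*/false);  // segfaults here
  std::printf("not reached\n");
}
```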
OK, so null gCacheData means that Preferences::GetInstanceForService() hasn't been called, or didn't get to the point where it initializes gCacheData.

AddBoolVarCache -> GetBool -> InitStaticMembers should have spun up the prefs service. Let's work on the assumption that GetInstanceForService() did get called. I guess we could confirm that by looking at sRootBranch, sDefaultRootBranch, and sPreferences.

What would cause GetInstanceForService to return prior to initializing gCacheData? Only a failed call to Preferences::Init():

http://dxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#516

So what could have failed there?

* pref_InitInitialObjects, which has a ton of failure cases.
* PREF_CopyCharPref, which depends on an earlier call to PREF_Init succeeding so that gHashTable is initialized.

The observer service bits could have failed, but that seems unlikely; I think a number of other things would have failed prior to this if the observer service didn't work.

My money is on pref_InitInitialObjects failing somehow. Maybe a corrupted omnijar is causing things to fail?
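For reference, the shape of GetInstanceForService() at the dxr link above is roughly the following (a simplified sketch, not the exact code):

```cpp
// Simplified sketch of Preferences::GetInstanceForService(); see the dxr
// link above for the real thing. Every failure path tears the service
// back down, so gCacheData -- allocated only after a successful Init() --
// stays null, which matches the minidump (null sPreferences, null
// gCacheData, sShutdown false).
sPreferences = new Preferences();
NS_ADDREF(sPreferences);
if (NS_FAILED(sPreferences->Init())) {  // pref_InitInitialObjects() etc. run in here
  NS_RELEASE(sPreferences);             // sPreferences is null again
  return nullptr;
}
gCacheData = new nsTArray<nsAutoPtr<CacheData>>();
```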
Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, given that we're not going to be able to start up anyway?
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2)
> Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT,
> given that we're not going to be able to start up anyway?

Seems reasonable; maybe we could even get extra information about what's going wrong?
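For concreteness, a sketch of what that change could look like at one of the failure points (assuming Preferences::Init() currently uses the usual NS_ENSURE_SUCCESS early-return style there; the message string is illustrative):

```cpp
// Hypothetical sketch, not an actual patch: turn a silent early return in
// Preferences::Init() into an abort whose message reaches crash reports.
nsresult rv = pref_InitInitialObjects();
if (NS_FAILED(rv)) {
  // We cannot start up without core prefs anyway, so abort loudly here
  // instead of returning rv and crashing later with a useless stack.
  NS_RUNTIMEABORT("pref_InitInitialObjects() failed");
}
```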
Or we could try to start up without loading core preferences at all; I'm not sure whether crashing or no prefs would make for a worse user experience.
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
- nightly (version 50): 7 crashes from 2016-06-06.
- aurora (version 49): 34 crashes from 2016-06-07.
- beta (version 48): 431 crashes from 2016-06-06.
- release (version 47): 1105 crashes from 2016-05-31.
- esr (version 45): 67 crashes from 2016-04-07.

Crash volume on the last weeks:
           Week N-1  Week N-2  Week N-3  Week N-4  Week N-5  Week N-6  Week N-7
- nightly         0         0         0         0         0         0         0
- aurora         22         4         1         0         3         4         0
- beta           89        46        68        88        36        41        13
- release       170       147       138       179       183       149        62
- esr             1         3         2         2        21        26         3

Affected platforms: Windows, Linux
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
- nightly (version 51): 4 crashes from 2016-08-01.
- aurora (version 50): 11 crashes from 2016-08-01.
- beta (version 49): 141 crashes from 2016-08-02.
- release (version 48): 157 crashes from 2016-07-25.
- esr (version 45): 84 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
           W. N-1  W. N-2  W. N-3
- nightly       0       0       0
- aurora        6       5       0
- beta         36      79       5
- release      41      60      13
- esr           4       9       3

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser  Content  Plugin
- nightly     #185
- aurora
- beta        #475    #3221
- release     #510     #525
- esr         #837
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
- nightly (version 52): 4 crashes from 2016-09-19.
- aurora (version 51): 0 crashes from 2016-09-19.
- beta (version 50): 142 crashes from 2016-09-20.
- release (version 49): 172 crashes from 2016-09-05.
- esr (version 45): 108 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
           W. N-1  W. N-2
- nightly       4       0
- aurora        0       0
- beta        141       1
- release     142      30
- esr           7       7

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser  Content  Plugin
- nightly     #234
- aurora
- beta        #115
- release     #336    #1170
- esr        #1068
Too late for Firefox 52, mass-wontfix.
[Tracking Requested - why for this release]: Crash is spiking in Beta, and I'm about to dup bug 1396317 to it, and that bug is already being tracked.
My best idea is to add some diagnostic code that gives us info about exactly where the prefs initialization is going wrong.
Created attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

froydnj, this is my best idea for the moment. What do you think?
Here's a sample crash where I deliberately injected a fault: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

Also, the patch uses MOZ_CRASH_UNSAFE_PRINTF. In theory a "data steward" (whoever that is) is supposed to review uses of MOZ_CRASH_UNSAFE_PRINTF, but I don't think it's necessary here -- it's just printing static strings, so it's much the same as adding a dozen or so distinct MOZ_CRASH calls.
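As a sketch of the kind of diagnostic being described (illustrative only; the authoritative version is in the attachment, and gCacheDataDesc is a hypothetical name used here):

```cpp
// Illustrative sketch, not the attachment itself. A static string records
// which prefs-init step failed...
static const char* gCacheDataDesc = "untouched";

// ...each failure point during initialization sets it, e.g.:
//   gCacheDataDesc = "pref_ReadPrefFromJar() failed";

// ...and when the missing cache array is about to be used, we crash with
// that string. MOZ_CRASH_UNSAFE_PRINTF (from mozilla/Assertions.h) puts
// the formatted message into the crash report's MozCrashReason field;
// with only static strings as arguments, no user data can leak.
if (!gCacheData) {
  MOZ_CRASH_UNSAFE_PRINTF("!gCacheData: %s", gCacheDataDesc);
}
```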
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Review of attachment 8904854 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, let's do this.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ed7b5443cf8b98c807cd1fa29857b51511cd0076
Bug 1276488 - Add diagnostics about Prefs startup failures. r=froydnj.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Approval Request Comment
[Feature/Bug causing the regression]: This one. It's not a regression fix, just diagnostic code.
[User impact if declined]: Ongoing crashes; this is diagnostic code to try to work out why the Preferences module is failing to initialize for some users at startup.
[Is this code covered by automated tests?]: Not really; the Preferences module initializes correctly in all our tests. I did some spot tests locally by injecting a fault; here's a crash report from that: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906
[Has the fix been verified in Nightly?]: Not yet. The crash *is* showing up on Nightly so I could wait a bit, but it seems to be higher volume on Beta, so it would be nice to get it into Beta soon.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: Not really.
[Why is the change risky/not risky?]: Just some diagnostic code.
[String changes made/needed]: None.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Diagnostic patch for a top crash. Let's take it for beta 10.
(In reply to Nicholas Nethercote [:njn] from comment #17)
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Per Nicholas's comment, this does not require manual verification. Updating the qe-verify flag.
This is now showing up as [@ mozilla::CacheDataAppendElement] and hitting "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.
(In reply to [:philipp] from comment #22)
> This is now showing up as [@ mozilla::CacheDataAppendElement] and hitting
> "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.

Yes:

> (100.0% in signature vs 00.04% overall) moz_crash_reason = !gCacheData: pref_ReadPrefFromJar() failed

Currently we have 520 occurrences from 20 installations, all on Beta (none on Nightly).

The failing call looks like this:

> // Load jar:$gre/omni.jar!/greprefs.js
> rv = pref_ReadPrefFromJar(jarReader, "greprefs.js");

Within pref_ReadPrefFromJar() there is only one way that it can fail, here:

> nsCString manifest;
> MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(jarReader, nsDependentCString(name)));
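(For context: MOZ_TRY_VAR, from mfbt's mozilla/Result.h, unwraps a Result on success and early-returns the error otherwise, so a failure of URLPreloader::ReadZip() -- e.g. on a corrupt jar -- is the only exit. Roughly:)

```cpp
// Roughly what MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(...)) expands
// to (simplified; see mfbt/Result.h for the real macro):
auto result = URLPreloader::ReadZip(jarReader, nsDependentCString(name));
if (result.isErr()) {
  return Err(result.unwrapErr());  // a corrupt omni.ja surfaces here
}
manifest = result.unwrap();
```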
Some additional info from glandium:
- greprefs.js is the main prefs file, containing the contents of modules/libpref/init/all.js plus a few other smaller prefs files.
- greprefs.js is stored within the omni.ja file, which contains tons of Firefox resources. (See https://developer.mozilla.org/en-US/docs/Mozilla/About_omni.ja_(formerly_omni.jar) for details.)
- greprefs.js is typically the first file within omni.ja to be read on startup.
- Therefore, the most likely explanation is that omni.ja is somehow corrupted.

rstrong, any ideas if this likely omni.ja corruption could have something to do with the updater?
One interesting data point: more than half of these crash reports have an empty "install time" field. I don't recall seeing that before.
(In reply to Nicholas Nethercote [:njn] from comment #24)
<snip>
> rstrong, any ideas if this likely omni.ja corruption could have something to
> do with the updater?

It is possible, but there is not much, if anything, for me to go on, and I haven't seen any indicators of there being corruption.
While this is a high crash rate in beta 10 and 11, it's about 1000 crashes a week coming from only 35 installations. Marking this wontfix for 56.
For the early 56 release, this is in the top 5 crash signatures, and it is a startup crash. As in beta, it appears to be relatively few people hitting the crash over and over. Nathan or Nick, can you or someone else take another look here?
Sorry, I don't know what else to do beyond what I already did. It looks like the omni.ja file has been corrupted, and once that happens the entire installation is messed up and will require reinstalling. As you say, despite the high numbers of crash reports, it appears to be affecting a relatively small number of users.
Surely this is not the first time we've consistently seen corrupted omni.ja files, though? Have we been able to do anything about them in the past? Does the updater do any verification of the state of the omni.ja? Do we have any "health of your installation" checks?
Just kinda reaching for stuff here... I'm at a loss for fixes if we're looking at on-disk corruption, or even consistent memory corruption, so I'm just trying to throw stuff out and see if anything sticks.
Thanks for checking again, the spike in crashes was just a bit worrying. I'll wontfix for 56 and 57 since there is nothing actionable here and relatively few installations are affected.
I just wanted to point out that the rate of these crashes increased significantly, from 0.2% in 56.0.1 to 2.1% in 56.0.2.
(In reply to Tom Grabowski [:TomGrab] from comment #33)
> I just wanted to point out that the rate of these crashes increased
> significantly, from 0.2% in 56.0.1 to 2.1% in 56.0.2.

We started migrating 1/5/20% of eligible 32-bit users to 64-bit in 56.0.1, but we are now migrating 100% of eligible users in 56.0.2. I was worried this might be a new 64-bit crash from third-party software, but I don't think this is a 64-bit specific issue. This CacheDataAppendElement signature is 2.0% higher in 56.0.2 than in 56.0.1 or 56.0, even when only comparing Win32 crashes:

https://mozilla.github.io/stab-crashes/scomp.html?common=product%3DFirefox%26platform%3DWindows%26cpu_arch%21%3Damd64&p1=version%3D56.0%26version%3D56.0.1&p2=version%3D56.0.2
Is there a way to detect omni.ja corruption and recover somehow? Is that the avenue we should explore?
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #35)
> Is there a way to detect omni.ja corruption and recover somehow? Is that the
> avenue we should explore?

The jar file should have a CRC, right? In theory we should return NS_ERROR_FILE_CORRUPTED, so I guess the question is whether or not whatever reads omni.ja checks for that error.

https://searchfox.org/mozilla-central/rev/550148ab69b2879bfb82ffa698720ede1fa626f2/modules/libjar/nsJARInputStream.cpp#345
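For reference, the check at that link looks roughly like this (a sketch; member names approximate nsJARInputStream, and `finished` stands in for the end-of-stream condition):

```cpp
// Sketch of the CRC verification in nsJARInputStream (see the searchfox
// link above; names approximate). The inflated bytes are CRC'd (zlib's
// crc32) as they are read, and a mismatch with the CRC stored in the jar
// becomes NS_ERROR_FILE_CORRUPTED -- so the remaining question is whether
// the omni.ja consumer actually propagates that error usefully.
mOutCrc = crc32(mOutCrc, reinterpret_cast<const unsigned char*>(aBuffer),
                *aBytesRead);
if (finished && mOutCrc != mInCrc) {
  return NS_ERROR_FILE_CORRUPTED;
}
```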