XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar
Categories
(Core :: XPCOM, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox47 | --- | wontfix |
firefox48 | --- | wontfix |
firefox49 | --- | wontfix |
firefox-esr45 | --- | wontfix |
thunderbird_esr52 | --- | wontfix |
firefox50 | --- | wontfix |
firefox51 | --- | wontfix |
firefox52 | --- | wontfix |
firefox56 | + | wontfix |
firefox57 | + | wontfix |
firefox58 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | wontfix |
firefox63 | --- | wontfix |
firefox64 | --- | wontfix |
firefox65 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox68 | --- | wontfix |
People
(Reporter: dbaron, Assigned: n.nethercote)
References
Details
(Keywords: crash, topcrash, Whiteboard: [tbird crash] )
Crash Data
Attachments
(1 file)
17.28 KB,
patch
|
froydnj
:
review+
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-81611011-ce24-41f3-b56a-764c32160527. ============================================================= There are a decent number of these crashes floating around on various versions. They're crashes early in startup, and users sometimes submit a large number of them. I debugged the minidump for bp-81611011-ce24-41f3-b56a-764c32160527, and the MSVC debugger says we're crashing because gCacheData is null. The MSVC debugger also says sPreferences is null, and sShutdown is false. This suggests that the codepath from AddBoolVarCache calling GetBool calling GetBool calling InitStaticMembers didn't actually work right -- I'm guessing because the do_GetService somehow failed. The stack in the MSVC debugger isn't any better than in crash-stats. However, Linux crashes like bp-563a3c80-2340-4bda-a49a-81d2d2160525 have a full stack. Is there something obvious here that might cause the do_GetService to fail?
Comment 1•8 years ago
|
||
OK, so null gCacheData means that Preferences::GetInstanceForService() hasn't been called, or didn't get to the point where it initializes gCacheData. AddBoolVarCache -> GetBool -> InitStaticMembers should have spun up the prefs service. Let's work on the assumption that GetInstanceForService() did get called. I guess we could confirm that by looking at sRootBranch, sDefaultRootBranch, and sPreferences. What would cause GetInstanceForService to return prior to initializing gCacheData? Only a failed call to Preferences::Init(): http://dxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#516 So what could have failed there? * pref_InitInitialObjects, which has a ton of failure cases. * PREF_CopyCharPref, which depends on an earlier call to PREF_Init succeeding so that gHashTable is initialized. The observer service bits could have failed, but that seems unlikely; I think a number of other things would have failed prior to this if the observer service didn't work. My money is on pref_InitInitialObjects failing somehow. Maybe a corrupted omnijar is causing things to fail?
Reporter | ||
Comment 2•8 years ago
|
||
Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, given that we're not going to be able to start up anyway?
Comment 3•8 years ago
|
||
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2) > Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, > given that we're not going to be able to start up anyway? Seems reasonable; maybe we could even get extra information about what's going wrong?
Comment 4•8 years ago
|
||
Or we could try to startup without loading core preferences at all; I'm not sure whether crashing or no prefs would provide for a worse user experience.
Comment 5•8 years ago
|
||
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache': - nightly (version 50): 7 crashes from 2016-06-06. - aurora (version 49): 34 crashes from 2016-06-07. - beta (version 48): 431 crashes from 2016-06-06. - release (version 47): 1105 crashes from 2016-05-31. - esr (version 45): 67 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 0 0 0 0 0 0 0 - aurora 22 4 1 0 3 4 0 - beta 89 46 68 88 36 41 13 - release 170 147 138 179 183 149 62 - esr 1 3 2 2 21 26 3 Affected platforms: Windows, Linux
Comment 6•8 years ago
|
||
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache': - nightly (version 51): 4 crashes from 2016-08-01. - aurora (version 50): 11 crashes from 2016-08-01. - beta (version 49): 141 crashes from 2016-08-02. - release (version 48): 157 crashes from 2016-07-25. - esr (version 45): 84 crashes from 2016-05-02. Crash volume on the last weeks (Week N is from 08-22 to 08-28): W. N-1 W. N-2 W. N-3 - nightly 0 0 0 - aurora 6 5 0 - beta 36 79 5 - release 41 60 13 - esr 4 9 3 Affected platforms: Windows, Linux Crash rank on the last 7 days: Browser Content Plugin - nightly #185 - aurora - beta #475 #3221 - release #510 #525 - esr #837
Comment 7•8 years ago
|
||
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache': - nightly (version 52): 4 crashes from 2016-09-19. - aurora (version 51): 0 crashes from 2016-09-19. - beta (version 50): 142 crashes from 2016-09-20. - release (version 49): 172 crashes from 2016-09-05. - esr (version 45): 108 crashes from 2016-06-01. Crash volume on the last weeks (Week N is from 10-03 to 10-09): W. N-1 W. N-2 - nightly 4 0 - aurora 0 0 - beta 141 1 - release 142 30 - esr 7 7 Affected platforms: Windows, Linux Crash rank on the last 7 days: Browser Content Plugin - nightly #234 - aurora - beta #115 - release #336 #1170 - esr #1068
Comment 8•7 years ago
|
||
Too late for firefox 52, mass-wontfix.
Updated•7 years ago
|
Assignee | ||
Comment 9•7 years ago
|
||
[Tracking Requested - why for this release]: Crash is spiking in Beta, and I'm about to dup bug 1396317 to it, and that bug is already being tracked.
Assignee | ||
Comment 11•7 years ago
|
||
The crash address is always 0x0, so the analysis in comment 0 and comment 1 seems likely to still hold.
Assignee | ||
Comment 12•7 years ago
|
||
My best idea is to add some diagnostic code that gives us info about exactly where the prefs initialization is going wrong.
Updated•7 years ago
|
Assignee | ||
Comment 13•7 years ago
|
||
froydnj, this is my best idea for the moment. What do you think?
Assignee | ||
Updated•7 years ago
|
Assignee | ||
Comment 14•7 years ago
|
||
Here's a sample crash where I deliberately injected a fault: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906 Also, the patch uses MOZ_CRASH_UNSAFE_PRINTF. In theory a "data steward" (whoever that is) is supposed to review uses of MOZ_CRASH_UNSAFE_PRINTF, but I don't think it's necessary here -- it's just printing static strings, it's much the same as adding a dozen or so MOZ_CRASH calls with a delay.
Comment 15•7 years ago
|
||
Comment on attachment 8904854 [details] [diff] [review] Add diagnostics about Prefs startup failures Review of attachment 8904854 [details] [diff] [review]: ----------------------------------------------------------------- Sure, let's do this.
Updated•7 years ago
|
Assignee | ||
Comment 16•7 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/ed7b5443cf8b98c807cd1fa29857b51511cd0076 Bug 1276488 - Add diagnostics about Prefs startup failures. r=froydnj.
Assignee | ||
Comment 17•7 years ago
|
||
Comment on attachment 8904854 [details] [diff] [review] Add diagnostics about Prefs startup failures Approval Request Comment [Feature/Bug causing the regression]: This one. It's not a regression fix, just diagnostic code. [User impact if declined]: Ongoing crashes; this is diagnostic code to try to work out why the Preferences module is failing to initialize for some users at startup. [Is this code covered by automated tests?]: Not really; the Preferences module initializes correctly in all our tests. I did some spot tests locally by injecting a fault, here's a crash report from that: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906 [Has the fix been verified in Nightly?]: Not yet. The crash *is* showing up on Nightly so I could wait a bit, but it seems to be higher volume on Beta so it would be nice to get it into Beta soon. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: Not really. [Why is the change risky/not risky?]: Just some diagnostic code. [String changes made/needed]: None.
Comment 18•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ed7b5443cf8b
Updated•7 years ago
|
Comment 19•7 years ago
|
||
Comment on attachment 8904854 [details] [diff] [review] Add diagnostics about Prefs startup failures Diagnostic patch for a top crash. Let's take it for beta 10.
Comment 20•7 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/3484d06e80e1
Comment 21•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #17) > [Needs manual test from QE? If yes, steps to reproduce]: No. Per Nicholas's comment, this does not require manual verification. Updating the qe-verify flag.
Comment 22•7 years ago
|
||
this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.
Updated•7 years ago
|
Assignee | ||
Comment 23•7 years ago
|
||
(In reply to [:philipp] from comment #22) > this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting > "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10. Yes: > (100.0% in signature vs 00.04% overall) moz_crash_reason = !gCacheData: pref_ReadPrefFromJar() failed Currently we have 520 occurrences from 20 installations, all on Beta (none on Nightly). The failing call looks like this: > // Load jar:$gre/omni.jar!/greprefs.js > rv = pref_ReadPrefFromJar(jarReader, "greprefs.js"); Within pref_ReadPrefFromJar() there is only one way that it can fail, here: > nsCString manifest; > MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(jarReader, nsDependentCString(name)));
Assignee | ||
Comment 24•7 years ago
|
||
Some additional info from glandium: - greprefs.js is the main prefs file, containing the contents of modules/libpref/init/all.js plus a few other smaller prefs files. - greprefs.js is stored within the omni.ja file, which contains tons of Firefox resources. (See https://developer.mozilla.org/en-US/docs/Mozilla/About_omni.ja_(formerly_omni.jar) for details.) - greprefs.js is typically the first file within omni.ja to be read on startup. - Therefore, there the most likely explanation is that omni.ja is somehow corrupted. rstrong, any ideas if this likely omni.ja corruption could have something to do with the updater?
Assignee | ||
Comment 25•7 years ago
|
||
One interesting data point: more than half of these crash reports have an empty "install time" field. I don't recall seeing that before.
Comment 26•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #24) <snip> > rstrong, any ideas if this likely omni.ja corruption could have something to > do with the updater? It is possible but there is not much if anything for me to go on and I haven't seen any indicators of there being corruption.
Assignee | ||
Updated•7 years ago
|
Comment 27•7 years ago
|
||
While this is a high crash rate in beta 10 and 11, it's 1000 crashes a week for 35 installations. Marking this wontfix for 56.
Comment 28•7 years ago
|
||
For early 56 release, this is in the top 5 crash signatures, and is a startup crash. As in beta, it appears to be relatively few people hitting the crash over and over. Nathan or Nick, can you or someone else take another look here?
Assignee | ||
Comment 29•7 years ago
|
||
Sorry, I don't know what else to do beyond what I already did. It looks like the omni.ja file has been corrupted, and once that happens the entire installation is messed up and will require reinstalling. As you say, despite the high numbers of crash reports, it appears to be affecting a relatively small number of users.
Comment 30•7 years ago
|
||
Surely this is not the first time we've consistently seen corrupted omni.ja files, though? Have we been able to do anything about them in the past? Does the updater do any verification of the state of the omni.ja? Do we have any "health of your installation" checks?
Comment 31•7 years ago
|
||
Just kinda reaching for stuff here...I'm at a loss for fixes, if we're looking at on-disk corruption, or even consistent memory corruption, so just trying to throw stuff out and see if anything sticks.
Comment 32•7 years ago
|
||
Thanks for checking again, the spike in crashes was just a bit worrying. I'll wontfix for 56 and 57 since there is nothing actionable here and relatively few installations are affected.
Comment 33•7 years ago
|
||
I just wanted to point out that the number of these crashes increased significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.
Comment 34•7 years ago
|
||
(In reply to Tom Grabowski [:TomGrab] from comment #33) > I just wanted to point out that the number of these crashes increased > significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2. We started migrating 1/5/20% of eligible 32-bit users to 64-bit in 56.0.1, but we are now migrating 100% of eligible users in 56.0.2. I was worried this might be a new 64-bit crash from third-party software, but I don't think this is a 64-bit specific issue. This CacheDataAppendElement signature is 2.0% higher in 56.0.2 than 56.0.1 or 56.0 even when only comparing Win32 crashes: https://mozilla.github.io/stab-crashes/scomp.html?common=product%3DFirefox%26platform%3DWindows%26cpu_arch%21%3Damd64&p1=version%3D56.0%26version%3D56.0.1&p2=version%3D56.0.2
Comment 35•6 years ago
|
||
Is there a way to detect omni.ja corruption and recover somehow? Is that the avenue we should explore?
Comment 36•6 years ago
|
||
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #35) > Is there a way to detect omni.ja corruption and recover somehow? Is that the > avenue we should explore? The jar file should have a CRC right? In theory we should return `NS_ERROR_FILE_CORRUPTED` [1], so I guess the question is whether or not whatever reads omni.ja checks for an error. [1] https://searchfox.org/mozilla-central/rev/550148ab69b2879bfb82ffa698720ede1fa626f2/modules/libjar/nsJARInputStream.cpp#345
Comment 37•6 years ago
|
||
mozilla::Preferences::AddBoolVarCache appears to be the mac flavor
Comment 40•6 years ago
|
||
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #39) > Ted do you know who could dive into comment 36? I don't, sorry! Given that omni.ja contains many files required for Firefox to start successfully I'm not sure that there's much that callers could usefully do for recovery here.
Assignee | ||
Comment 41•6 years ago
|
||
> Given that omni.ja contains many files required for Firefox
> to start successfully I'm not sure that there's much that callers could
> usefully do for recovery here.
Presumably we'd show some error message saying something like "your installation is corrupt, please reinstall", which would be much better than silently crashing.
Comment 42•6 years ago
|
||
Just to sum up a bit as this bug is aging: we've talked about possible omnijar corruption a few times, but it's never been terribly verifiable (just theoretical, judging from what we've seen). It's probably worth doing the approach in #c36 to see *if* that happens, and then talk about how to recover from that. Before we undertake something like that, we'd want to know that we're solving a real problem (correctly).
Comment 43•6 years ago
|
||
Adding another signature that is showing up in 61 nightly.
Comment 44•6 years ago
|
||
This is also visible in 59.0.2 and 59.0.3, with multiple crashes per installation, usually at startup.
Comment 45•6 years ago
|
||
static void mozilla::CacheDataAppendElement signature is currently the #5 browser top crash on 61. I also updated the affect branches to reflect the fact that 62 and 63 are also affected.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 46•6 years ago
|
||
#5 top crash for 62 beta as well. Again, seems to be a repeated startup crash for the affected users.
Comment 47•6 years ago
|
||
Nathan can you help find someone to take another look at these crashes? Or if there's nothing actionable, let's add the "stalled" keyword tag.
Comment 48•6 years ago
|
||
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #47) > Nathan can you help find someone to take another look at these crashes? Or > if there's nothing actionable, let's add the "stalled" keyword tag. I looked through several reports from: [@ static void mozilla::CacheDataAppendElement ] [@ mozilla::Preferences::AddBoolVarCache ] and all of them are failing with the crash reason "!gCacheData: pref_ReadPrefFromJar() failed", which comes from: https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4907-4909 This reason is strong confirmation for the corrupted omni.ja hypothesis. Reading the preferences from the omni.ja is actually split into two stages: https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4772-4789 I'm not sure at which stage we're failing there. I guess we can add more instrumentation to try and figure out which exact cases are failing? Maybe we can get the string for the preferences just fine, but it can't be parsed for some reason. I don't know if we could get the entire prefs file provided in the minidump or something to examine for issues. Nick, do you have space to add some more error reasons, similar to your last patch? If we don't want to add more instrumentation, or think that it wouldn't help...given that we can't read the prefs from the omni.ja, the options I see, in order of increasing complexity, are: 1) Move forward in startup without the prefs, which might just move crashes to a different location or bring up a super-weird browser (if we can't read other files from omni.ja); 2) Bake prefs directly into the binary, to be used as a last resort if we can't read prefs from omni.ja. Again, corrupted omni.ja might just mean we're moving crashes around. Bad RAM or something might thwart this idea too; 3) Figure out some way to overwrite the corrupted omni.ja with a known-good omni.ja...sounds complicated. Are there other options?
Assignee | ||
Comment 49•6 years ago
|
||
We are going around in circles in this bug. I think it's clear that omni.ja corruption is the root problem here. If pref initialization was somehow made impervious to such corruption, it seems very likely that some other component would crash or fail catastrophically in some other way shortly afterwards. As mentioned above, I think the right approach is to introduce a mechanism to detect corrupt omni.ja files (and possibly other parts of the installation). This idea came up in the context of Project Uptime (https://wiki.mozilla.org/Platform/Uptime#Detect_if_Firefox_is_misinstalled) but nothing concrete happened.
Updated•6 years ago
|
Comment 50•5 years ago
|
||
Marking 65 as affected. In early 65 data, this is the top browser crash and is flagged as a startup crash.
Comment 51•5 years ago
|
||
Adding 66 as affected - #9 overall in b14.
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 52•5 years ago
|
||
Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.
Comment 53•5 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #52)
Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.
Not a single beta or nightly crash with mozilla::Preferences::AddBoolVarCache - https://crash-stats.mozilla.org/signature/?release_channel=%21release&signature=mozilla%3A%3APreferences%3A%3AAddBoolVarCache&date=%3E%3D2019-07-29T12%3A54%3A00.000Z&date=%3C2019-09-26T12%3A54%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-version&_sort=-date&page=1
Do we have a testcase or direct contact with someone reproducing this?
Comment 54•4 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #53)
(In reply to Nicholas Nethercote [:njn] from comment #52)
Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.
Not a single beta or nightly crash with mozilla::Preferences::AddBoolVarCache - https://crash-stats.mozilla.org/signature/?release_channel=%21release&signature=mozilla%3A%3APreferences%3A%3AAddBoolVarCache&date=%3E%3D2019-07-29T12%3A54%3A00.000Z&date=%3C2019-09-26T12%3A54%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-version&_sort=-date&page=1
Do we have a testcase or direct contact with someone reproducing this?
User in https://support.mozilla.org/en-US/questions/1286827 with signature mozilla::Preferences::AddBoolVarCache apparently can reproduce with 68.6.0
Comment 55•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
Comment 56•3 years ago
|
||
This bug isn't ready to be closed yet. We're still getting (albeit at a bit of a lower frequency) crashes which appear to be caused by the omni.ja being corrupted at startup, with this being the signature
This feels similar to bug 1471720, in that it's a startup crash which appears to be being caused by omni.ja corruption at startup.
Comment 57•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
Comment 59•2 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 60•2 years ago
|
||
We've removed VarCache so we shouldn't be running into this anymore.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Description
•