XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar

REOPENED
Unassigned

Status

()

defect
P1
critical
REOPENED
3 years ago
a month ago

People

(Reporter: dbaron, Unassigned)

Tracking

({crash, leave-open, topcrash})

unspecified
mozilla57
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox47 wontfix, firefox48 wontfix, firefox49 wontfix, firefox-esr45 wontfix, thunderbird_esr52 affected, firefox50 wontfix, firefox51 wontfix, firefox52 wontfix, firefox56+ wontfix, firefox57+ wontfix, firefox58 wontfix, firefox61 wontfix, firefox62 wontfix, firefox63 wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 fix-optional, firefox67 affected, firefox68 ?)

Details

(Whiteboard: [tbird crash] , crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is 
report bp-81611011-ce24-41f3-b56a-764c32160527.
=============================================================

There are a decent number of these crashes floating around on various versions.  They're crashes early in startup, and users sometimes submit a large number of them.

I debugged the minidump for bp-81611011-ce24-41f3-b56a-764c32160527, and the MSVC debugger says we're crashing because gCacheData is null.  The MSVC debugger also says sPreferences is null, and sShutdown is false.  This suggests that the codepath from AddBoolVarCache calling GetBool calling GetBool calling InitStaticMembers didn't actually work right -- I'm guessing because the do_GetService somehow failed.

The stack in the MSVC debugger isn't any better than in crash-stats.  However, Linux crashes like bp-563a3c80-2340-4bda-a49a-81d2d2160525 have a full stack.

Is there something obvious here that might cause the do_GetService to fail?
Flags: needinfo?(nfroyd)
OK, so null gCacheData means that Preferences::GetInstanceForService() hasn't been called, or didn't get to the point where it initializes gCacheData.

AddBoolVarCache -> GetBool -> InitStaticMembers should have spun up the prefs service.  Let's work on the assumption that GetInstanceForService() did get called.  I guess we could confirm that by looking at sRootBranch, sDefaultRootBranch, and sPreferences.

What would cause GetInstanceForService to return prior to initializing gCacheData?  Only a failed call to Preferences::Init():

http://dxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#516

So what could have failed there?

* pref_InitInitialObjects, which has a ton of failure cases.
* PREF_CopyCharPref, which depends on an earlier call to PREF_Init succeeding so that gHashTable is initialized.  The observer service bits could have failed, but that seems unlikely; I think a number of other things would have failed prior to this if the observer service didn't work.

My money is on pref_InitInitialObjects failing somehow.  Maybe a corrupted omnijar is causing things to fail?
Flags: needinfo?(nfroyd)
Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, given that we're not going to be able to start up anyway?
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2)
> Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT,
> given that we're not going to be able to start up anyway?

Seems reasonable; maybe we could even get extra information about what's going wrong?
Or we could try to startup without loading core preferences at all; I'm not sure whether crashing or no prefs would provide for a worse user experience.
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 50): 7 crashes from 2016-06-06.
 - aurora  (version 49): 34 crashes from 2016-06-07.
 - beta    (version 48): 431 crashes from 2016-06-06.
 - release (version 47): 1105 crashes from 2016-05-31.
 - esr     (version 45): 67 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora          22          4          1          0          3          4          0
 - beta            89         46         68         88         36         41         13
 - release        170        147        138        179        183        149         62
 - esr              1          3          2          2         21         26          3

Affected platforms: Windows, Linux
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 51): 4 crashes from 2016-08-01.
 - aurora  (version 50): 11 crashes from 2016-08-01.
 - beta    (version 49): 141 crashes from 2016-08-02.
 - release (version 48): 157 crashes from 2016-07-25.
 - esr     (version 45): 84 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        6       5       0
 - beta         36      79       5
 - release      41      60      13
 - esr           4       9       3

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #185
 - aurora
 - beta    #475      #3221
 - release #510                #525
 - esr     #837
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 52): 4 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 142 crashes from 2016-09-20.
 - release (version 49): 172 crashes from 2016-09-05.
 - esr     (version 45): 108 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       4       0
 - aurora        0       0
 - beta        141       1
 - release     142      30
 - esr           7       7

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser     Content Plugin
 - nightly #234
 - aurora
 - beta    #115
 - release #336                #1170
 - esr     #1068
Too late for firefox 52, mass-wontfix.

Updated

2 years ago
Whiteboard: [tbird crash]
[Tracking Requested - why for this release]:
Crash is spiking in Beta, and I'm about to dup bug 1396317 to it, and that bug is already being tracked.
Duplicate of this bug: 1396317
The crash address is always 0x0, so the analysis in comment 0 and comment 1 seems likely to still hold.
My best idea is to add some diagnostic code that gives us info about exactly where the prefs initialization is going wrong.
froydnj, this is my best idea for the moment. What do you think?
Attachment #8904854 - Flags: review?(nfroyd)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Here's a sample crash where I deliberately injected a fault: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

Also, the patch uses MOZ_CRASH_UNSAFE_PRINTF. In theory a "data steward" (whoever that is) is supposed to review uses of MOZ_CRASH_UNSAFE_PRINTF, but I don't think it's necessary here -- it's just printing static strings, it's much the same as adding a dozen or so MOZ_CRASH calls with a delay.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Review of attachment 8904854 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, let's do this.
Attachment #8904854 - Flags: review?(nfroyd) → review+
Priority: -- → P1
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Approval Request Comment
[Feature/Bug causing the regression]: This one. It's not a regression fix, just diagnostic code.

[User impact if declined]: Ongoing crashes; this is diagnostic code to try to work out why the Preferences module is failing to initialize for some users at startup.

[Is this code covered by automated tests?]: Not really; the Preferences module initializes correctly in all our tests. I did some spot tests locally by injecting a fault, here's a crash report from that: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

[Has the fix been verified in Nightly?]: Not yet. The crash *is* showing up on Nightly so I could wait a bit, but it seems to be higher volume on Beta so it would be nice to get it into Beta soon.

[Needs manual test from QE? If yes, steps to reproduce]: No.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: Not really.

[Why is the change risky/not risky?]: Just some diagnostic code.

[String changes made/needed]: None.
Attachment #8904854 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/ed7b5443cf8b
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57

Updated

2 years ago
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Diagnostic patch for a top crash. Let's take it for beta 10.
Attachment #8904854 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Nicholas Nethercote [:njn] from comment #17)
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Per Nicholas's comment, this does not require manual verification. Updating the qe-verify flag.
Flags: qe-verify-

Comment 22

2 years ago
this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement]
(In reply to [:philipp] from comment #22)
> this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting
> "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.

Yes:

> (100.0% in signature vs 00.04% overall) moz_crash_reason = !gCacheData: pref_ReadPrefFromJar() failed

Currently we have 520 occurrences from 20 installations, all on Beta (none on Nightly).

The failing call looks like this:

> // Load jar:$gre/omni.jar!/greprefs.js
> rv = pref_ReadPrefFromJar(jarReader, "greprefs.js");

Within pref_ReadPrefFromJar() there is only one way that it can fail, here:

> nsCString manifest;
> MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(jarReader, nsDependentCString(name)));
Some additional info from glandium:

- greprefs.js is the main prefs file, containing the contents of modules/libpref/init/all.js plus a few other smaller prefs files.

- greprefs.js is stored within the omni.ja file, which contains tons of Firefox resources. (See https://developer.mozilla.org/en-US/docs/Mozilla/About_omni.ja_(formerly_omni.jar) for details.)

- greprefs.js is typically the first file within omni.ja to be read on startup.

- Therefore, there the most likely explanation is that omni.ja is somehow corrupted.

rstrong, any ideas if this likely omni.ja corruption could have something to do with the updater?
Flags: needinfo?(robert.strong.bugs)
Summary: XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache → XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar
One interesting data point: more than half of these crash reports have an empty "install time" field. I don't recall seeing that before.
(In reply to Nicholas Nethercote [:njn] from comment #24)
<snip>
> rstrong, any ideas if this likely omni.ja corruption could have something to
> do with the updater?
It is possible but there is not much if anything for me to go on and I haven't seen any indicators of there being corruption.
Flags: needinfo?(robert.strong.bugs)
Assignee: n.nethercote → nobody
While this is a high crash rate in beta 10 and 11, it's 1000 crashes a week for 35 installations. Marking this wontfix for 56.
For early 56 release, this is in the top 5 crash signatures, and is a startup crash. 
As in beta, it appears to be relatively few people hitting the crash over and over. 
Nathan or Nick, can you or someone else take another look here?
Flags: needinfo?(nfroyd)
Flags: needinfo?(n.nethercote)
Sorry, I don't know what else to do beyond what I already did. It looks like the omni.ja file has been corrupted, and once that happens the entire installation is messed up and will require reinstalling. As you say, despite the high numbers of crash reports, it appears to be affecting a relatively small number of users.
Flags: needinfo?(n.nethercote)
Surely this is not the first time we've consistently seen corrupted omni.ja files, though?  Have we been able to do anything about them in the past?  Does the updater do any verification of the state of the omni.ja?  Do we have any "health of your installation" checks?
Flags: needinfo?(nfroyd)
Just kinda reaching for stuff here...I'm at a loss for fixes, if we're looking at on-disk corruption, or even consistent memory corruption, so just trying to throw stuff out and see if anything sticks.
Thanks for checking again, the spike in crashes was just a bit worrying.  I'll wontfix for 56 and 57 since there is nothing actionable here and relatively few installations are affected.
I just wanted to point out that the number of these crashes increased significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.
(In reply to Tom Grabowski [:TomGrab] from comment #33)
> I just wanted to point out that the number of these crashes increased
> significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.

We started migrating 1/5/20% of eligible 32-bit users to 64-bit in 56.0.1, but we are now migrating 100% of eligible users in 56.0.2. I was worried this might be a new 64-bit crash from third-party software, but I don't think this is a 64-bit specific issue. This CacheDataAppendElement signature is 2.0% higher in 56.0.2 than 56.0.1 or 56.0 even when only comparing Win32 crashes:

https://mozilla.github.io/stab-crashes/scomp.html?common=product%3DFirefox%26platform%3DWindows%26cpu_arch%21%3Damd64&p1=version%3D56.0%26version%3D56.0.1&p2=version%3D56.0.2
Is there a way to detect omni.ja corruption and recover somehow? Is that the avenue we should explore?
Flags: needinfo?(ddurst)
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #35)
> Is there a way to detect omni.ja corruption and recover somehow? Is that the
> avenue we should explore?

The jar file should have a CRC right? In theory we should return `NS_ERROR_FILE_CORRUPTED` [1], so I guess the question is whether or not whatever reads omni.ja checks for an error.

[1] https://searchfox.org/mozilla-central/rev/550148ab69b2879bfb82ffa698720ede1fa626f2/modules/libjar/nsJARInputStream.cpp#345
mozilla::Preferences::AddBoolVarCache appears to be the mac flavor
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache]
Duplicate of this bug: 873831
Ted do you know who could dive into comment 36?
Flags: needinfo?(ted)
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #39)
> Ted do you know who could dive into comment 36?

I don't, sorry! Given that omni.ja contains many files required for Firefox to start successfully I'm not sure that there's much that callers could usefully do for recovery here.
Flags: needinfo?(ted)
> Given that omni.ja contains many files required for Firefox
> to start successfully I'm not sure that there's much that callers could
> usefully do for recovery here.

Presumably we'd show some error message saying something like "your installation is corrupt, please reinstall", which would be much better than silently crashing.
Just to sum up a bit as this bug is aging: we've talked about possible omnijar corruption a few times, but it's never been terribly verifiable (just theoretical, judging from what we've seen).

It's probably worth doing the approach in #c36 to see *if* that happens, and then talk about how to recover from that. Before we undertake something like that, we'd want to know that we're solving a real problem (correctly).
Flags: needinfo?(ddurst)
Adding another signature that is showing up in 61 nightly.
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ static void mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache]
This is also visible in 59.0.2 and 59.0.3, with multiple crashes per installation, usually at startup.
static void mozilla::CacheDataAppendElement signature is currently the #5 browser top crash on 61. I also updated the affect branches to reflect the fact that 62 and 63 are also affected.
#5 top crash for 62 beta as well. Again, seems to be a repeated startup crash for the affected users.
Keywords: topcrash
Whiteboard: [tbird crash] → [tbird crash]
Nathan can you help find someone to take another look at these crashes? Or if there's nothing actionable, let's add the "stalled" keyword tag.
Flags: needinfo?(nfroyd)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #47)
> Nathan can you help find someone to take another look at these crashes? Or
> if there's nothing actionable, let's add the "stalled" keyword tag.

I looked through several reports from:

[@ static void mozilla::CacheDataAppendElement ]
[@ mozilla::Preferences::AddBoolVarCache ] 

and all of them are failing with the crash reason "!gCacheData: pref_ReadPrefFromJar() failed", which comes from:

https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4907-4909

This reason is strong confirmation for the corrupted omni.ja hypothesis.  Reading the preferences from the omni.ja is actually split into two stages:

https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4772-4789

I'm not sure at which stage we're failing there.  I guess we can add more instrumentation to try and figure out which exact cases are failing?  Maybe we can get the string for the preferences just fine, but it can't be parsed for some reason.  I don't know if we could get the entire prefs file provided in the minidump or something to examine for issues.  Nick, do you have space to add some more error reasons, similar to your last patch?

If we don't want to add more instrumentation, or think that it wouldn't help...given that we can't read the prefs from the omni.ja, the options I see, in order of increasing complexity, are:

1) Move forward in startup without the prefs, which might just move crashes to a different location or bring up a super-weird browser (if we can't read other files from omni.ja);
2) Bake prefs directly into the binary, to be used as a last resort if we can't read prefs from omni.ja.  Again, corrupted omni.ja might just mean we're moving crashes around.  Bad RAM or something might thwart this idea too;
3) Figure out some way to overwrite the corrupted omni.ja with a known-good omni.ja...sounds complicated.

Are there other options?
Flags: needinfo?(nfroyd) → needinfo?(n.nethercote)
We are going around in circles in this bug. I think it's clear that omni.ja corruption is the root problem here. If pref initialization was somehow made impervious to such corruption, it seems very likely that some other component would crash or fail catastrophically in some other way shortly afterwards.

As mentioned above, I think the right approach is to introduce a mechanism to detect corrupt omni.ja files (and possibly other parts of the installation). This idea came up in the context of Project Uptime (https://wiki.mozilla.org/Platform/Uptime#Detect_if_Firefox_is_misinstalled) but nothing concrete happened.
Flags: needinfo?(n.nethercote)

Marking 65 as affected. In early 65 data, this is the top browser crash and is flagged as a startup crash.

Adding 66 as affected - #9 overall in b14.

You need to log in before you can comment on or make changes to this bug.