XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar

REOPENED
Unassigned

Status

()

Core
XPCOM
P1
critical
REOPENED
2 years ago
2 days ago

People

(Reporter: dbaron, Unassigned, NeedInfo)

Tracking

({crash, leave-open})

unspecified
mozilla57
crash, leave-open
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox47 affected, firefox48 affected, firefox49 affected, firefox-esr45 affected, thunderbird_esr52 affected, firefox50 affected, firefox51 affected, firefox52 wontfix, firefox56+ wontfix, firefox57+ wontfix, firefox58 affected)

Details

(Whiteboard: [tbird crash], crash signature)

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-81611011-ce24-41f3-b56a-764c32160527.
=============================================================

There are a decent number of these crashes floating around on various versions.  They're crashes early in startup, and users sometimes submit a large number of them.

I debugged the minidump for bp-81611011-ce24-41f3-b56a-764c32160527, and the MSVC debugger says we're crashing because gCacheData is null.  The MSVC debugger also says sPreferences is null, and sShutdown is false.  This suggests that the codepath from AddBoolVarCache calling GetBool calling GetBool calling InitStaticMembers didn't actually work right -- I'm guessing because the do_GetService somehow failed.

The stack in the MSVC debugger isn't any better than in crash-stats.  However, Linux crashes like bp-563a3c80-2340-4bda-a49a-81d2d2160525 have a full stack.

Is there something obvious here that might cause the do_GetService to fail?
Flags: needinfo?(nfroyd)
OK, so null gCacheData means that Preferences::GetInstanceForService() hasn't been called, or didn't get to the point where it initializes gCacheData.

AddBoolVarCache -> GetBool -> InitStaticMembers should have spun up the prefs service.  Let's work on the assumption that GetInstanceForService() did get called.  I guess we could confirm that by looking at sRootBranch, sDefaultRootBranch, and sPreferences.

What would cause GetInstanceForService to return prior to initializing gCacheData?  Only a failed call to Preferences::Init():

http://dxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#516

So what could have failed there?

* pref_InitInitialObjects, which has a ton of failure cases.
* PREF_CopyCharPref, which depends on an earlier call to PREF_Init succeeding so that gHashTable is initialized.  The observer service bits could have failed, but that seems unlikely; I think a number of other things would have failed prior to this if the observer service didn't work.

My money is on pref_InitInitialObjects failing somehow.  Maybe a corrupted omnijar is causing things to fail?
Flags: needinfo?(nfroyd)
(Reporter)

Comment 2

2 years ago
Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, given that we're not going to be able to start up anyway?
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2)
> Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT,
> given that we're not going to be able to start up anyway?

Seems reasonable; maybe we could even get extra information about what's going wrong?
Or we could try to startup without loading core preferences at all; I'm not sure whether crashing or no prefs would provide for a worse user experience.
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 50): 7 crashes from 2016-06-06.
 - aurora  (version 49): 34 crashes from 2016-06-07.
 - beta    (version 48): 431 crashes from 2016-06-06.
 - release (version 47): 1105 crashes from 2016-05-31.
 - esr     (version 45): 67 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora          22          4          1          0          3          4          0
 - beta            89         46         68         88         36         41         13
 - release        170        147        138        179        183        149         62
 - esr              1          3          2          2         21         26          3

Affected platforms: Windows, Linux
status-firefox47: --- → affected
status-firefox48: --- → affected
status-firefox49: --- → affected
status-firefox50: --- → affected
status-firefox-esr45: --- → affected
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 51): 4 crashes from 2016-08-01.
 - aurora  (version 50): 11 crashes from 2016-08-01.
 - beta    (version 49): 141 crashes from 2016-08-02.
 - release (version 48): 157 crashes from 2016-07-25.
 - esr     (version 45): 84 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        6       5       0
 - beta         36      79       5
 - release      41      60      13
 - esr           4       9       3

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #185
 - aurora
 - beta    #475      #3221
 - release #510                #525
 - esr     #837
status-firefox51: --- → affected
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 52): 4 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 142 crashes from 2016-09-20.
 - release (version 49): 172 crashes from 2016-09-05.
 - esr     (version 45): 108 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       4       0
 - aurora        0       0
 - beta        141       1
 - release     142      30
 - esr           7       7

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser     Content Plugin
 - nightly #234
 - aurora
 - beta    #115
 - release #336                #1170
 - esr     #1068
status-firefox52: --- → affected
Too late for firefox 52, mass-wontfix.
status-firefox52: affected → wontfix

Updated

3 months ago
status-thunderbird_esr52: --- → affected
Whiteboard: [tbird crash]
[Tracking Requested - why for this release]:
Crash is spiking in Beta, and I'm about to dup bug 1396317 to it, and that bug is already being tracked.
tracking-firefox56: --- → ?
tracking-firefox57: --- → ?
Duplicate of this bug: 1396317
The crash address is always 0x0, so the analysis in comment 0 and comment 1 seems likely to still hold.
My best idea is to add some diagnostic code that gives us info about exactly where the prefs initialization is going wrong.
status-firefox56: --- → affected
status-firefox57: --- → affected
tracking-firefox56: ? → +
tracking-firefox57: ? → +
Created attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

froydnj, this is my best idea for the moment. What do you think?
Attachment #8904854 - Flags: review?(nfroyd)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Here's a sample crash where I deliberately injected a fault: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

Also, the patch uses MOZ_CRASH_UNSAFE_PRINTF. In theory a "data steward" (whoever that is) is supposed to review uses of MOZ_CRASH_UNSAFE_PRINTF, but I don't think it's necessary here -- it's just printing static strings, it's much the same as adding a dozen or so MOZ_CRASH calls with a delay.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Review of attachment 8904854 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, let's do this.
Attachment #8904854 - Flags: review?(nfroyd) → review+
Priority: -- → P1
https://hg.mozilla.org/integration/mozilla-inbound/rev/ed7b5443cf8b98c807cd1fa29857b51511cd0076
Bug 1276488 - Add diagnostics about Prefs startup failures. r=froydnj.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Approval Request Comment
[Feature/Bug causing the regression]: This one. It's not a regression fix, just diagnostic code.

[User impact if declined]: Ongoing crashes; this is diagnostic code to try to work out why the Preferences module is failing to initialize for some users at startup.

[Is this code covered by automated tests?]: Not really; the Preferences module initializes correctly in all our tests. I did some spot tests locally by injecting a fault, here's a crash report from that: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

[Has the fix been verified in Nightly?]: Not yet. The crash *is* showing up on Nightly so I could wait a bit, but it seems to be higher volume on Beta so it would be nice to get it into Beta soon.

[Needs manual test from QE? If yes, steps to reproduce]: No.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: Not really.

[Why is the change risky/not risky?]: Just some diagnostic code.

[String changes made/needed]: None.
Attachment #8904854 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/ed7b5443cf8b
Status: ASSIGNED → RESOLVED
Last Resolved: 2 months ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57

Updated

2 months ago
Status: RESOLVED → REOPENED
status-firefox57: fixed → affected
Keywords: leave-open
Resolution: FIXED → ---
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Diagnostic patch for a top crash. Let's take it for beta 10.
Attachment #8904854 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 20

2 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/3484d06e80e1
status-firefox56: affected → fixed
(In reply to Nicholas Nethercote [:njn] from comment #17)
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Per Nicholas's comment, this does not require manual verification. Updating the qe-verify flag.
Flags: qe-verify-

Comment 22

2 months ago
this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement]

Updated

2 months ago
status-firefox56: fixed → affected
(In reply to [:philipp] from comment #22)
> this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting
> "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.

Yes:

> (100.0% in signature vs 00.04% overall) moz_crash_reason = !gCacheData: pref_ReadPrefFromJar() failed

Currently we have 520 occurrences from 20 installations, all on Beta (none on Nightly).

The failing call looks like this:

> // Load jar:$gre/omni.jar!/greprefs.js
> rv = pref_ReadPrefFromJar(jarReader, "greprefs.js");

Within pref_ReadPrefFromJar() there is only one way that it can fail, here:

> nsCString manifest;
> MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(jarReader, nsDependentCString(name)));
Some additional info from glandium:

- greprefs.js is the main prefs file, containing the contents of modules/libpref/init/all.js plus a few other smaller prefs files.

- greprefs.js is stored within the omni.ja file, which contains tons of Firefox resources. (See https://developer.mozilla.org/en-US/docs/Mozilla/About_omni.ja_(formerly_omni.jar) for details.)

- greprefs.js is typically the first file within omni.ja to be read on startup.

- Therefore, there the most likely explanation is that omni.ja is somehow corrupted.

rstrong, any ideas if this likely omni.ja corruption could have something to do with the updater?
Flags: needinfo?(robert.strong.bugs)
Summary: XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache → XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar
One interesting data point: more than half of these crash reports have an empty "install time" field. I don't recall seeing that before.
(In reply to Nicholas Nethercote [:njn] from comment #24)
<snip>
> rstrong, any ideas if this likely omni.ja corruption could have something to
> do with the updater?
It is possible but there is not much if anything for me to go on and I haven't seen any indicators of there being corruption.
Flags: needinfo?(robert.strong.bugs)
Assignee: n.nethercote → nobody
While this is a high crash rate in beta 10 and 11, it's 1000 crashes a week for 35 installations. Marking this wontfix for 56.
status-firefox56: affected → wontfix
For early 56 release, this is in the top 5 crash signatures, and is a startup crash. 
As in beta, it appears to be relatively few people hitting the crash over and over. 
Nathan or Nick, can you or someone else take another look here?
status-firefox56: wontfix → affected
Flags: needinfo?(nfroyd)
Flags: needinfo?(n.nethercote)
Sorry, I don't know what else to do beyond what I already did. It looks like the omni.ja file has been corrupted, and once that happens the entire installation is messed up and will require reinstalling. As you say, despite the high numbers of crash reports, it appears to be affecting a relatively small number of users.
Flags: needinfo?(n.nethercote)
Surely this is not the first time we've consistently seen corrupted omni.ja files, though?  Have we been able to do anything about them in the past?  Does the updater do any verification of the state of the omni.ja?  Do we have any "health of your installation" checks?
Flags: needinfo?(nfroyd)
Just kinda reaching for stuff here...I'm at a loss for fixes, if we're looking at on-disk corruption, or even consistent memory corruption, so just trying to throw stuff out and see if anything sticks.
Thanks for checking again, the spike in crashes was just a bit worrying.  I'll wontfix for 56 and 57 since there is nothing actionable here and relatively few installations are affected.
status-firefox56: affected → wontfix
status-firefox57: affected → wontfix
I just wanted to point out that the number of these crashes increased significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.
(In reply to Tom Grabowski [:TomGrab] from comment #33)
> I just wanted to point out that the number of these crashes increased
> significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.

We started migrating 1/5/20% of eligible 32-bit users to 64-bit in 56.0.1, but we are now migrating 100% of eligible users in 56.0.2. I was worried this might be a new 64-bit crash from third-party software, but I don't think this is a 64-bit specific issue. This CacheDataAppendElement signature is 2.0% higher in 56.0.2 than 56.0.1 or 56.0 even when only comparing Win32 crashes:

https://mozilla.github.io/stab-crashes/scomp.html?common=product%3DFirefox%26platform%3DWindows%26cpu_arch%21%3Damd64&p1=version%3D56.0%26version%3D56.0.1&p2=version%3D56.0.2
status-firefox58: --- → affected
Is there a way to detect omni.ja corruption and recover somehow? Is that the avenue we should explore?
Flags: needinfo?(ddurst)
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #35)
> Is there a way to detect omni.ja corruption and recover somehow? Is that the
> avenue we should explore?

The jar file should have a CRC right? In theory we should return `NS_ERROR_FILE_CORRUPTED` [1], so I guess the question is whether or not whatever reads omni.ja checks for an error.

[1] https://searchfox.org/mozilla-central/rev/550148ab69b2879bfb82ffa698720ede1fa626f2/modules/libjar/nsJARInputStream.cpp#345
You need to log in before you can comment on or make changes to this bug.