Closed Bug 1276488 Opened 8 years ago Closed 2 years ago

XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar

Categories

(Core :: XPCOM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
thunderbird_esr52 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox56 + wontfix
firefox57 + wontfix
firefox58 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix

People

(Reporter: dbaron, Assigned: n.nethercote)

References

Details

(Keywords: crash, topcrash, Whiteboard: [tbird crash] )

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-81611011-ce24-41f3-b56a-764c32160527.
=============================================================

There are a decent number of these crashes floating around on various versions.  They're crashes early in startup, and users sometimes submit a large number of them.

I debugged the minidump for bp-81611011-ce24-41f3-b56a-764c32160527, and the MSVC debugger says we're crashing because gCacheData is null.  The MSVC debugger also says sPreferences is null, and sShutdown is false.  This suggests that the codepath from AddBoolVarCache calling GetBool calling GetBool calling InitStaticMembers didn't actually work right -- I'm guessing because the do_GetService somehow failed.

The stack in the MSVC debugger isn't any better than in crash-stats.  However, Linux crashes like bp-563a3c80-2340-4bda-a49a-81d2d2160525 have a full stack.

Is there something obvious here that might cause the do_GetService to fail?
Flags: needinfo?(nfroyd)
OK, so null gCacheData means that Preferences::GetInstanceForService() hasn't been called, or didn't get to the point where it initializes gCacheData.

AddBoolVarCache -> GetBool -> InitStaticMembers should have spun up the prefs service.  Let's work on the assumption that GetInstanceForService() did get called.  I guess we could confirm that by looking at sRootBranch, sDefaultRootBranch, and sPreferences.

What would cause GetInstanceForService to return prior to initializing gCacheData?  Only a failed call to Preferences::Init():

http://dxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#516

So what could have failed there?

* pref_InitInitialObjects, which has a ton of failure cases.
* PREF_CopyCharPref, which depends on an earlier call to PREF_Init succeeding so that gHashTable is initialized.  The observer service bits could have failed, but that seems unlikely; I think a number of other things would have failed prior to this if the observer service didn't work.

My money is on pref_InitInitialObjects failing somehow.  Maybe a corrupted omnijar is causing things to fail?
Flags: needinfo?(nfroyd)
Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT, given that we're not going to be able to start up anyway?
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2)
> Maybe some of those pref startup failure cases should be NS_RUNTIMEABORT,
> given that we're not going to be able to start up anyway?

Seems reasonable; maybe we could even get extra information about what's going wrong?
Or we could try to startup without loading core preferences at all; I'm not sure whether crashing or no prefs would provide for a worse user experience.
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 50): 7 crashes from 2016-06-06.
 - aurora  (version 49): 34 crashes from 2016-06-07.
 - beta    (version 48): 431 crashes from 2016-06-06.
 - release (version 47): 1105 crashes from 2016-05-31.
 - esr     (version 45): 67 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora          22          4          1          0          3          4          0
 - beta            89         46         68         88         36         41         13
 - release        170        147        138        179        183        149         62
 - esr              1          3          2          2         21         26          3

Affected platforms: Windows, Linux
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 51): 4 crashes from 2016-08-01.
 - aurora  (version 50): 11 crashes from 2016-08-01.
 - beta    (version 49): 141 crashes from 2016-08-02.
 - release (version 48): 157 crashes from 2016-07-25.
 - esr     (version 45): 84 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        6       5       0
 - beta         36      79       5
 - release      41      60      13
 - esr           4       9       3

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #185
 - aurora
 - beta    #475      #3221
 - release #510                #525
 - esr     #837
Crash volume for signature 'nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache':
 - nightly (version 52): 4 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 142 crashes from 2016-09-20.
 - release (version 49): 172 crashes from 2016-09-05.
 - esr     (version 45): 108 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       4       0
 - aurora        0       0
 - beta        141       1
 - release     142      30
 - esr           7       7

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser     Content Plugin
 - nightly #234
 - aurora
 - beta    #115
 - release #336                #1170
 - esr     #1068
Too late for firefox 52, mass-wontfix.
Whiteboard: [tbird crash]
[Tracking Requested - why for this release]:
Crash is spiking in Beta, and I'm about to dup bug 1396317 to it, and that bug is already being tracked.
The crash address is always 0x0, so the analysis in comment 0 and comment 1 seems likely to still hold.
My best idea is to add some diagnostic code that gives us info about exactly where the prefs initialization is going wrong.
froydnj, this is my best idea for the moment. What do you think?
Attachment #8904854 - Flags: review?(nfroyd)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Here's a sample crash where I deliberately injected a fault: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

Also, the patch uses MOZ_CRASH_UNSAFE_PRINTF. In theory a "data steward" (whoever that is) is supposed to review uses of MOZ_CRASH_UNSAFE_PRINTF, but I don't think it's necessary here -- it's just printing static strings, it's much the same as adding a dozen or so MOZ_CRASH calls with a delay.
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Review of attachment 8904854 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, let's do this.
Attachment #8904854 - Flags: review?(nfroyd) → review+
Priority: -- → P1
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Approval Request Comment
[Feature/Bug causing the regression]: This one. It's not a regression fix, just diagnostic code.

[User impact if declined]: Ongoing crashes; this is diagnostic code to try to work out why the Preferences module is failing to initialize for some users at startup.

[Is this code covered by automated tests?]: Not really; the Preferences module initializes correctly in all our tests. I did some spot tests locally by injecting a fault, here's a crash report from that: https://crash-stats.mozilla.com/report/index/7b7ac3a3-35db-4d56-9bd6-746131170906

[Has the fix been verified in Nightly?]: Not yet. The crash *is* showing up on Nightly so I could wait a bit, but it seems to be higher volume on Beta so it would be nice to get it into Beta soon.

[Needs manual test from QE? If yes, steps to reproduce]: No.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: Not really.

[Why is the change risky/not risky?]: Just some diagnostic code.

[String changes made/needed]: None.
Attachment #8904854 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/ed7b5443cf8b
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Comment on attachment 8904854 [details] [diff] [review]
Add diagnostics about Prefs startup failures

Diagnostic patch for a top crash. Let's take it for beta 10.
Attachment #8904854 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Nicholas Nethercote [:njn] from comment #17)
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Per Nicholas's comment, this does not require manual verification. Updating the qe-verify flag.
Flags: qe-verify-
this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement]
(In reply to [:philipp] from comment #22)
> this is now showing up as [@ mozilla::CacheDataAppendElement] and hitting
> "!gCacheData: pref_ReadPrefFromJar() failed" in 56.0b10.

Yes:

> (100.0% in signature vs 00.04% overall) moz_crash_reason = !gCacheData: pref_ReadPrefFromJar() failed

Currently we have 520 occurrences from 20 installations, all on Beta (none on Nightly).

The failing call looks like this:

> // Load jar:$gre/omni.jar!/greprefs.js
> rv = pref_ReadPrefFromJar(jarReader, "greprefs.js");

Within pref_ReadPrefFromJar() there is only one way that it can fail, here:

> nsCString manifest;
> MOZ_TRY_VAR(manifest, URLPreloader::ReadZip(jarReader, nsDependentCString(name)));
Some additional info from glandium:

- greprefs.js is the main prefs file, containing the contents of modules/libpref/init/all.js plus a few other smaller prefs files.

- greprefs.js is stored within the omni.ja file, which contains tons of Firefox resources. (See https://developer.mozilla.org/en-US/docs/Mozilla/About_omni.ja_(formerly_omni.jar) for details.)

- greprefs.js is typically the first file within omni.ja to be read on startup.

- Therefore, there the most likely explanation is that omni.ja is somehow corrupted.

rstrong, any ideas if this likely omni.ja corruption could have something to do with the updater?
Flags: needinfo?(robert.strong.bugs)
Summary: XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache → XPCOM Startup crashes in nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache, due to problems reading greprefs.js from the omnijar
One interesting data point: more than half of these crash reports have an empty "install time" field. I don't recall seeing that before.
(In reply to Nicholas Nethercote [:njn] from comment #24)
<snip>
> rstrong, any ideas if this likely omni.ja corruption could have something to
> do with the updater?
It is possible but there is not much if anything for me to go on and I haven't seen any indicators of there being corruption.
Flags: needinfo?(robert.strong.bugs)
Assignee: n.nethercote → nobody
While this is a high crash rate in beta 10 and 11, it's 1000 crashes a week for 35 installations. Marking this wontfix for 56.
For early 56 release, this is in the top 5 crash signatures, and is a startup crash. 
As in beta, it appears to be relatively few people hitting the crash over and over. 
Nathan or Nick, can you or someone else take another look here?
Flags: needinfo?(nfroyd)
Flags: needinfo?(n.nethercote)
Sorry, I don't know what else to do beyond what I already did. It looks like the omni.ja file has been corrupted, and once that happens the entire installation is messed up and will require reinstalling. As you say, despite the high numbers of crash reports, it appears to be affecting a relatively small number of users.
Flags: needinfo?(n.nethercote)
Surely this is not the first time we've consistently seen corrupted omni.ja files, though?  Have we been able to do anything about them in the past?  Does the updater do any verification of the state of the omni.ja?  Do we have any "health of your installation" checks?
Flags: needinfo?(nfroyd)
Just kinda reaching for stuff here...I'm at a loss for fixes, if we're looking at on-disk corruption, or even consistent memory corruption, so just trying to throw stuff out and see if anything sticks.
Thanks for checking again, the spike in crashes was just a bit worrying.  I'll wontfix for 56 and 57 since there is nothing actionable here and relatively few installations are affected.
I just wanted to point out that the number of these crashes increased significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.
(In reply to Tom Grabowski [:TomGrab] from comment #33)
> I just wanted to point out that the number of these crashes increased
> significantly from 0.2% in 56.0.1 to 2.1% in 56.0.2.

We started migrating 1/5/20% of eligible 32-bit users to 64-bit in 56.0.1, but we are now migrating 100% of eligible users in 56.0.2. I was worried this might be a new 64-bit crash from third-party software, but I don't think this is a 64-bit specific issue. This CacheDataAppendElement signature is 2.0% higher in 56.0.2 than 56.0.1 or 56.0 even when only comparing Win32 crashes:

https://mozilla.github.io/stab-crashes/scomp.html?common=product%3DFirefox%26platform%3DWindows%26cpu_arch%21%3Damd64&p1=version%3D56.0%26version%3D56.0.1&p2=version%3D56.0.2
Is there a way to detect omni.ja corruption and recover somehow? Is that the avenue we should explore?
Flags: needinfo?(ddurst)
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #35)
> Is there a way to detect omni.ja corruption and recover somehow? Is that the
> avenue we should explore?

The jar file should have a CRC right? In theory we should return `NS_ERROR_FILE_CORRUPTED` [1], so I guess the question is whether or not whatever reads omni.ja checks for an error.

[1] https://searchfox.org/mozilla-central/rev/550148ab69b2879bfb82ffa698720ede1fa626f2/modules/libjar/nsJARInputStream.cpp#345
mozilla::Preferences::AddBoolVarCache appears to be the mac flavor
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache]
Ted do you know who could dive into comment 36?
Flags: needinfo?(ted)
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #39)
> Ted do you know who could dive into comment 36?

I don't, sorry! Given that omni.ja contains many files required for Firefox to start successfully I'm not sure that there's much that callers could usefully do for recovery here.
Flags: needinfo?(ted)
> Given that omni.ja contains many files required for Firefox
> to start successfully I'm not sure that there's much that callers could
> usefully do for recovery here.

Presumably we'd show some error message saying something like "your installation is corrupt, please reinstall", which would be much better than silently crashing.
Just to sum up a bit as this bug is aging: we've talked about possible omnijar corruption a few times, but it's never been terribly verifiable (just theoretical, judging from what we've seen).

It's probably worth doing the approach in #c36 to see *if* that happens, and then talk about how to recover from that. Before we undertake something like that, we'd want to know that we're solving a real problem (correctly).
Flags: needinfo?(ddurst)
Adding another signature that is showing up in 61 nightly.
Crash Signature: [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache] → [@ nsTArray_Impl<T>::AppendElement<T> | mozilla::Preferences::AddBoolVarCache] [@ mozilla::CacheDataAppendElement] [@ static void mozilla::CacheDataAppendElement] [@ mozilla::Preferences::AddBoolVarCache]
This is also visible in 59.0.2 and 59.0.3, with multiple crashes per installation, usually at startup.
static void mozilla::CacheDataAppendElement signature is currently the #5 browser top crash on 61. I also updated the affect branches to reflect the fact that 62 and 63 are also affected.
#5 top crash for 62 beta as well. Again, seems to be a repeated startup crash for the affected users.
Keywords: topcrash
Whiteboard: [tbird crash] → [tbird crash]
Nathan can you help find someone to take another look at these crashes? Or if there's nothing actionable, let's add the "stalled" keyword tag.
Flags: needinfo?(nfroyd)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #47)
> Nathan can you help find someone to take another look at these crashes? Or
> if there's nothing actionable, let's add the "stalled" keyword tag.

I looked through several reports from:

[@ static void mozilla::CacheDataAppendElement ]
[@ mozilla::Preferences::AddBoolVarCache ] 

and all of them are failing with the crash reason "!gCacheData: pref_ReadPrefFromJar() failed", which comes from:

https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4907-4909

This reason is strong confirmation for the corrupted omni.ja hypothesis.  Reading the preferences from the omni.ja is actually split into two stages:

https://searchfox.org/mozilla-central/source/modules/libpref/Preferences.cpp#4772-4789

I'm not sure at which stage we're failing there.  I guess we can add more instrumentation to try and figure out which exact cases are failing?  Maybe we can get the string for the preferences just fine, but it can't be parsed for some reason.  I don't know if we could get the entire prefs file provided in the minidump or something to examine for issues.  Nick, do you have space to add some more error reasons, similar to your last patch?

If we don't want to add more instrumentation, or think that it wouldn't help...given that we can't read the prefs from the omni.ja, the options I see, in order of increasing complexity, are:

1) Move forward in startup without the prefs, which might just move crashes to a different location or bring up a super-weird browser (if we can't read other files from omni.ja);
2) Bake prefs directly into the binary, to be used as a last resort if we can't read prefs from omni.ja.  Again, corrupted omni.ja might just mean we're moving crashes around.  Bad RAM or something might thwart this idea too;
3) Figure out some way to overwrite the corrupted omni.ja with a known-good omni.ja...sounds complicated.

Are there other options?
Flags: needinfo?(nfroyd) → needinfo?(n.nethercote)
We are going around in circles in this bug. I think it's clear that omni.ja corruption is the root problem here. If pref initialization was somehow made impervious to such corruption, it seems very likely that some other component would crash or fail catastrophically in some other way shortly afterwards.

As mentioned above, I think the right approach is to introduce a mechanism to detect corrupt omni.ja files (and possibly other parts of the installation). This idea came up in the context of Project Uptime (https://wiki.mozilla.org/Platform/Uptime#Detect_if_Firefox_is_misinstalled) but nothing concrete happened.
Flags: needinfo?(n.nethercote)

Marking 65 as affected. In early 65 data, this is the top browser crash and is flagged as a startup crash.

Adding 66 as affected - #9 overall in b14.

Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.

(In reply to Nicholas Nethercote [:njn] from comment #52)

Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.

Not a single beta or nightly crash with mozilla::Preferences::AddBoolVarCache - https://crash-stats.mozilla.org/signature/?release_channel=%21release&signature=mozilla%3A%3APreferences%3A%3AAddBoolVarCache&date=%3E%3D2019-07-29T12%3A54%3A00.000Z&date=%3C2019-09-26T12%3A54%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-version&_sort=-date&page=1

Do we have a testcase or direct contact with someone reproducing this?

(In reply to Wayne Mery (:wsmwk) from comment #53)

(In reply to Nicholas Nethercote [:njn] from comment #52)

Bug 1569526 has landed, which removes the data structure involved in this signature. So the signature will change, but I'm not sure what it will change to.

Not a single beta or nightly crash with mozilla::Preferences::AddBoolVarCache - https://crash-stats.mozilla.org/signature/?release_channel=%21release&signature=mozilla%3A%3APreferences%3A%3AAddBoolVarCache&date=%3E%3D2019-07-29T12%3A54%3A00.000Z&date=%3C2019-09-26T12%3A54%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-version&_sort=-date&page=1

Do we have a testcase or direct contact with someone reproducing this?

User in https://support.mozilla.org/en-US/questions/1286827 with signature mozilla::Preferences::AddBoolVarCache apparently can reproduce with 68.6.0

The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?

Flags: needinfo?(nika)

This bug isn't ready to be closed yet. We're still getting (albeit at a bit of a lower frequency) crashes which appear to be caused by the omni.ja being corrupted at startup, with this being the signature

This feels similar to bug 1471720, in that it's a startup crash which appears to be being caused by omni.ja corruption at startup.

Flags: needinfo?(nika)

The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?

Flags: needinfo?(nika)
QA Whiteboard: qa-not-actionable

status is unchanged since the last ping

Flags: needinfo?(nika)

The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)
Flags: needinfo?(nika)

We've removed VarCache so we shouldn't be running into this anymore.

Status: REOPENED → RESOLVED
Closed: 7 years ago2 years ago
Resolution: --- → FIXED
Assignee: nobody → n.nethercote
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: