Closed Bug 1063052 Opened 6 years ago Closed 5 years ago

Firefox 32+ startup crash in nsFrame::BoxReflow

Categories

(Core :: Layout, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla35
Tracking Status
firefox32 + verified
firefox33 + verified
firefox34 + verified
firefox35 + verified
firefox-esr31 --- fixed

People

(Reporter: kairo, Assigned: robert)

References

(Depends on 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-e9f649b0-09ca-48c7-82d3-570622140902.
=============================================================

In early Firefox 32 stats, we have a startup crash in nsFrame::BoxReflow at #3 in the topcrash list, at ~25% the rate of the leading OOM|small signature.

All the crash reasons seem to be EXCEPTION_ACCESS_VIOLATION_READ at an address of 0x34.

See https://crash-stats.mozilla.com/report/list?signature=nsFrame%3A%3ABoxReflow%28nsBoxLayoutState%26%2C+nsPresContext*%2C+nsHTMLReflowMetrics%26%2C+nsRenderingContext*%2C+int%2C+int%2C+int%2C+int%2C+bool%29&product=Firefox&process_type=browser&version=Firefox%3A32.0
[Tracking Requested - why for this release]:
#3 topcrash in early 32 release data.

David, do you have any leads on what is going on here?
Flags: needinfo?(dmajor)
Historically I think this is the place we crash if we're unable to load our default stylesheets or similar base chrome stuff (though adding some explicit NS_RUNTIMEABORT to make it clearer what the problem is might be a good idea, since it would help with diagnosis).  See other nsFrame::BoxReflow crash bugs for some details.
|metrics| is null here: https://hg.mozilla.org/releases/mozilla-release/file/44234f451065/layout/generic/nsFrame.cpp#l8317

There was a spike of these crashes with build 20140822024446 (32.0b9) but few with build 20140825202822 (32.0 release going to beta channel). Then a bunch more on the 32.0 release channel. Since it's a startup crash, it may have been severe enough that the 32.0b9 users couldn't update (or changed browsers).

Here's the b9 changelog: https://hg.mozilla.org/releases/mozilla-beta/shortlog/208377
These two changes touched frames: https://hg.mozilla.org/releases/mozilla-beta/rev/8e6b808eed02
https://hg.mozilla.org/releases/mozilla-beta/rev/11a5306111d0 (but it's a backout)

Markus, Seth, any chance these changes could have led to a null metrics? It's a long shot I know.

In email addresses I am seeing lots of schools and universities. It's possible this may be related to some commonly-deployed educational software. I'll try contacting some users.
Flags: needinfo?(seth)
Flags: needinfo?(mstange)
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #3)
> https://hg.mozilla.org/releases/mozilla-beta/rev/11a5306111d0 (but it's a
> backout)

I think it's extremely unlikely that this change is related to the crash.

Seth's change looks more plausible as a cause to me.
Flags: needinfo?(mstange)
according to feedback on SUMO by affected users this can be solved through a clean reinstall, so it might be caused by external factors...

https://support.mozilla.org/en-US/questions/1018220
Yes, the .edu users who got back to me all mentioned that it was fixed by a reinstall, but I couldn't find anything common in their environments.

rmcguigan - heads up, I suspect we might see some more cases of this.
Given the high ranking (#3) and that this is a start-up crash, unless we can demonstrate the the crash signature is in decline due to people having used the workaround, I think we need to pursue a fix.

dmajor, mstange - Do either of you have any other leads?
Flags: needinfo?(mstange)
Flags: needinfo?(dmajor)
Note that if dbaron's guess in comment #2 is correct, even if we stop the crashes in layout, the browser UI will be almost entirely non-functional, which isn't much of an improvement.

bz, is there a way to detect failed loads of the initial stylesheets using telemetry or some other indicator?
Flags: needinfo?(bzbarsky)
Some explanation for the patch - this stack shouldn't be happening:

xul!nsFrame::DoLayout
xul!nsBoxFrame::LayoutChildAt
xul!mozilla::ScrollFrameHelper::LayoutScrollbars

That top frame out to be nsBoxFrame::DoLayout. Somehow the ScrollFrameHelper got an mScrollCornerBox that isn't a box frame.

Looking over some more user comments, it seems that some of them still experience this even after reinstalling. And the scary thing is I'm seeing this same crash on FennecAndroid 32 and even OSX SeaMonkey.
Flags: needinfo?(dmajor)
(In reply to Lawrence Mandel [:lmandel] from comment #7)
> dmajor, mstange - Do either of you have any other leads?

I don't have anything else to add.
Flags: needinfo?(mstange)
Bug 436232 and bug 380015 has some interesting discussion and details.
In particular landry@openbsd.org's backtraces in bug 436232 comment 27:
>Program received signal SIGSEGV, Segmentation fault.
>in nsFrame::BoxReflow (this=0x202391848, ...

and before that, the stack when that frame was created:

>Breakpoint 2, nsFrame::Init (this=0x202391848
>#4 nsCSSFrameConstructor::ConstructInline
>...
>#9 nsCSSFrameConstructor::BeginBuildingScrollFrame
>...
>#11 nsCSSFrameConstructor::ConstructDocElementFrame
Blocks: 436232
Comment on attachment 8486219 [details] [diff] [review]
wallpaper fix

As you said above, this will likely lead to a non-functional browser.
Or a new crash somewhere else.

So I'm not sure this actually improves the situation for the user.
For us, it kills the signal that a large number of installations
are broken, as dbaron said in bug 436232 comment 19.  He suggested
built-in UI instead.  That's probably a better solution for the user
if we provide links to some help channel (SUMO?) and for us too if we
can get in touch with users having this problem to investigate the
root cause.

Anyway, can you add a debug-only assert? (just in case this shows up
in tests somehow), like so:
     } else {
+      MOZ_ASSERT(frame->IsBoxFrame());
+      if (frame->IsBoxFrame()) {

I'll defer to dbaron to make the call if this is actually an
improvement.
Attachment #8486219 - Flags: review?(mats)
Attachment #8486219 - Flags: review?(dbaron)
Attachment #8486219 - Flags: review+
> bz, is there a way to detect failed loads of the initial stylesheets using telemetry or
> some other indicator?

We could probably add telemetry for it, yes...
Flags: needinfo?(bzbarsky)
Between my vacation and the CSS WG meeting I haven't had a chance to look into this closely, and given the time pressure on it, I suspect you probably want to get this in before I'll have a chance to.

It seems like this patch could have one of four outcomes for the users for whom Firefox is not starting:
 (1) Firefox now starts up correctly
 (2) Firefox now starts up but fails to display any useful UI, but also fails to submit any crash reports
 (3) Firefox crashes slightly later during startup, in an equally-not-exploitable way
 (4) Firefox crashes slightly later during startup, in a somehow more exploitable way

My gut feeling is that the most likely outcomes from the patch are either (2) or (3), which would suggest that the patch is slightly harmful.  But I haven't looked into the issue closely enough to put much weight behind that opinion.

I'm ok with taking the patch, though.  At the very least, though, we should add an NS_RUNTIMEABORT if the basic style sheets fail to load so that at least some potential causes give us crash reports.
(In reply to Mats Palmgren (:mats) from comment #12)
> Bug 436232 and bug 380015 has some interesting discussion and details.
> In particular landry@openbsd.org's backtraces in bug 436232 comment 27:
> >Program received signal SIGSEGV, Segmentation fault.
> >in nsFrame::BoxReflow (this=0x202391848, ...
> 
> and before that, the stack when that frame was created:
> 
> >Breakpoint 2, nsFrame::Init (this=0x202391848
> >#4 nsCSSFrameConstructor::ConstructInline
> >...
> >#9 nsCSSFrameConstructor::BeginBuildingScrollFrame
> >...
> >#11 nsCSSFrameConstructor::ConstructDocElementFrame

That's consistent with minimal-xul.css not having loaded.
I'm clearing the NI here because I don't really understand how the changes in my patch could have led to this.

If this is an issue in loading chrome stuff as suggested in comment 2, it's kinda suggestive that it happens only on Windows, isn't it? Maybe we are using a feature in the Windows chrome that we're not using on other platforms?
Flags: needinfo?(seth)
Seems like we should definitely do this.

I lean against taking the "wallpaper fix" at this time.
Attachment #8486845 - Flags: review?(cam)
(In reply to Seth Fowler [:seth] from comment #17)
> If this is an issue in loading chrome stuff as suggested in comment 2, it's
> kinda suggestive that it happens only on Windows, isn't it?

It doesn't only happen on Windows; see comment #10.
(In reply to Mats Palmgren (:mats) from comment #13)
> Anyway, can you add a debug-only assert? (just in case this shows up
> in tests somehow), like so:
>      } else {
> +      MOZ_ASSERT(frame->IsBoxFrame());
> +      if (frame->IsBoxFrame()) {

In debug builds we would have asserted here: http://hg.mozilla.org/mozilla-central/file/4d1793da0b96/layout/generic/nsGfxScrollFrame.cpp#l4577
    NS_PRECONDITION(!mScrollCornerBox || mScrollCornerBox->IsBoxFrame(), "Must be a box frame!");
Comment on attachment 8486845 [details] [diff] [review]
NS_RUNTIMEABORT if a builtin stylesheet fails to load

Review of attachment 8486845 [details] [diff] [review]:
-----------------------------------------------------------------

Given that not loading the builtin style sheet will surely result in a non-functional browser, and the crashes almost all seem to be happening just as the browser is loading, I agree it'd be better to runtime abort here with some additional information.

::: layout/style/nsLayoutStylesheetCache.cpp
@@ +463,5 @@
> +  nsresult rv = gCSSLoader->LoadSheetSync(aURI, aEnableUnsafeRules, true,
> +                                          getter_AddRefs(aSheet));
> +  if (NS_FAILED(rv)) {
> +    ErrorLoadingBuiltinSheet(aURI,
> +      nsPrintfCString("LoadSheetSync failed with error %x", rv).get());;

Remove duplicate ";".
Attachment #8486845 - Flags: review?(cam) → review+
I wonder if this could be related to the lazy loading of UA style sheets we do now (bug 1008455)?
Flags: needinfo?(jwatt)
Comment on attachment 8486845 [details] [diff] [review]
NS_RUNTIMEABORT if a builtin stylesheet fails to load

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TBPL]: N/A
[Risks and why]: this is a very low-risk patch that may help us track down mysterious startup crashes
[String/UUID change made/needed]: none
Attachment #8486845 - Flags: approval-mozilla-beta?
Attachment #8486845 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/a5678a961b1e
Assignee: nobody → robert
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
This crash is a candidate for a Fx32 chemspill. The patch here will still abort (though with better diagnostics.) I don't think we should uplift this one for a chemspill but still uplift to beta & aurora.
[roc, did you intend for this bug to get auto-closed as FIXED when your patch was merged to m-c? I'm guessing not -- but if so, we should file a new bug to track anticipated failures of the new NS_RUNTIMEABORT, since presumably the affected users will still have a startup crash, just a bit earlier & with more information.]
Flags: needinfo?(roc)
(In reply to Jet Villegas (:jet) from comment #26)
> This crash is a candidate for a Fx32 chemspill. The patch here will still
> abort (though with better diagnostics.) I don't think we should uplift this
> one for a chemspill but still uplift to beta & aurora.

Mats requested that we do take this in the chemspill. Unfortunately, if we are going to take this in the chemspill, I need to know soon.

Jet - Can you explain why you don't think we should take this diagnostic patch in the chemspill?
Flags: needinfo?(mats)
Flags: needinfo?(bugs)
(In reply to Daniel Holbert [:dholbert] from comment #27)
> [roc, did you intend for this bug to get auto-closed as FIXED when your
> patch was merged to m-c? I'm guessing not -- but if so, we should file a new
> bug to track anticipated failures of the new NS_RUNTIMEABORT, since
> presumably the affected users will still have a startup crash, just a bit
> earlier & with more information.]

Reopening as this bug hasn't been fixed. The patch just adds more diagnostic information to help track down the source of the failure.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Lawrence Mandel [:lmandel] from comment #28)
> Mats requested that we do take this in the chemspill. Unfortunately, if we
> are going to take this in the chemspill, I need to know soon.
> 
> Jet - Can you explain why you don't think we should take this diagnostic
> patch in the chemspill?

As we're going to build today, I spoke with bz and dmajor about this online. Although bz said that he doesn't expect this will make things any worse he also said that beta will probably be enough to get the diagnostic data that we need. As I don't have a strong case for taking this in the chemspill, it's not going to ride along. We can reconsider if we need to respin.
Flags: needinfo?(mats)
Flags: needinfo?(bugs)
Comment on attachment 8486845 [details] [diff] [review]
NS_RUNTIMEABORT if a builtin stylesheet fails to load

Beta+ and Aurora+. Let's see if this change provides any more useful information about this crash.
Attachment #8486845 - Flags: approval-mozilla-beta?
Attachment #8486845 - Flags: approval-mozilla-beta+
Attachment #8486845 - Flags: approval-mozilla-aurora?
Attachment #8486845 - Flags: approval-mozilla-aurora+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #32)
> https://hg.mozilla.org/releases/mozilla-beta/rev/b6cd0cc8c045

This was pushed to the GECKO330b3_2014091121_RELBRANCH instead of default. Please backout from there and re-push.
Flags: needinfo?(roc)
FYI, I have the data from a user who followed the instructions in https://support.mozilla.org/en-US/questions/1019342 , which might be useful for debugging the cause of this.  I haven't had any time to look into this today, though.
(Oh, and the data are from 64-bit Windows, so perhaps best looked into by somebody with a 64-bit Windows dev environment.)
Can I take a look?
Flags: needinfo?(dbaron)
Absolutely.  Sent email.
Flags: needinfo?(dbaron)
Thanks! After merging those files into my FF32 install, I hit the BoxReflow crash on every startup. Debugging now...
Nightly's new NS_RUNTIMEABORT says we can't find c:\...\firefox\chrome\toolkit\content\global\minimal-xul.css. That directory has a regular xul.css but not a minimal one.

How come this isn't an issue on vanilla installs? I don't even have a "chrome" folder on my usual machines. Are these supposed to come from omni.ja or something? Would a leftover xul.css prevent the extraction of minimal-xul.css? This user's xul.css mentions enndeakin at sympatico, so it looks like an old one.
Flags: needinfo?(bzbarsky)
(In reply to David Major [:dmajor] from comment #40)
> Nightly's new NS_RUNTIMEABORT says we can't find
> c:\...\firefox\chrome\toolkit\content\global\minimal-xul.css. That directory
> has a regular xul.css but not a minimal one.

(FWIW: the file minimal-xul.css was created in bug 1008455, for Firefox 32. So, this seems to be a regression from that bug.)
Blocks: 1008455
Some context, since some of it is buried in the sumo thread: The user at comment 35 had a bunch of files left over after uninstalling FF32, including a chrome directory containing an old xul.css and no minimal-xul.css.
(In reply to David Major [:dmajor] from comment #40)
> How come this isn't an issue on vanilla installs? I don't even have a
> "chrome" folder on my usual machines. Are these supposed to come from
> omni.ja or something?

Yes -- per https://hg.mozilla.org/mozilla-central/rev/b42ec325e34d#l9.1 , this was added to the manifest for omni.ja. (And I can verify that it's there inside my omni.ja in my own Firefox 32 installation.)

I'm not sure how the old stale xul.css impacts things, since I don't really know how this path-resolution code works for these resources. However, I could imagine that this code might assume it should use the "real" directory if that directory exists (instead of looking inside omni.ja); something like that would explain why it's trying & failing to find an actual minimal-xul.css file instead of looking inside the jar, in your case.
Seems like loading these stale files could be causing tons of other problems as well.

At this point it seems like it's likely prudent to back out bug 1008455 for Firefox 32 if that's still straightforward to do, at least assuming there aren't a bunch of other similar file splits also in Firefox 32 that could cause similar problems.

But we should definitely figure out how to avoid this problem of users hitting stale files for 33 and up.  (I could probably be convinced that some approaches might be safe for 32, even.)
The end of comment 43 sounds fairly plausible to me.  Benjamin, can you confirm whether that's how it works?

If so, we can work around by doing something where we load xul.css if minimal-xul.css fails, but that will still leave the users involved with a broken old xul.css, no?
Flags: needinfo?(bzbarsky) → needinfo?(benjamin)
But yes, the safest short-term fix for 32 is to back out bug 1008455 there.
FWIW, I don't think another point-release for 32 is planned atm. That said, we should fix this for 33 and up for sure.
Bug 1008455 had other work build on top of it. Specifically in this order:

b42ec325e34d
Bug 1008455 - Avoid loading the xul.css UA style sheet when possible

feff69cd698f
Bug 1013936, part 2 - Only load the html.css UA style sheet on-demand for SVG documents

3d3d1e8d9a6e
Bug 1011806 - Stop loading user-agent and user style sheets for SVG-as-an-image (the main UA sheets svg.css, html.css, etc. will still load on demand)

c92802ca70c1
Bug 1016131 - Fix crash in nsCSSStyleSheet::IsApplicable when dom.forms.number is disabled

254c144e9ca1
Bug 1016145 - Load the non-SVG user-agent style sheets for non-SVG documents served as image/svg+xml

Backing out bug 1008455 probably means backing out these commits too.
Flags: needinfo?(jwatt)
I'd be pretty surprised if we had a fallback mechanism, since the URLs here appear to be

chrome://global/content/xul.css
and chrome://global/content/minimal-xul.css

Those should both resolve to content within omni.ja. But it's possible that we have a fallback for *all of omnijar*. That would be really crazy though.
Do we have a user who was experiencing this and could send us a zip of their install? I'm specifically interested to see whether installdir/chrome.manifest exists and if so, what the contents are.

After a little poking, here's the loading order for the root chrome.manifest file:

1) installdir/omni.ja!/chrome.manifest (GRE omnijar)
2) installdir/chrome.manifest (GRE flat)
3) installdir/browser/omni.ja!/chrome.manifest (app omnijar)
4) installdir/browser/chrome.manifest (app flat)

Note that the "flat" chrome.manifest comes after the omnijar, and can therefore override entries found in the omnijar version.

Also note that in a normal Firefox build, installdir/chrome.manifest does not exist, so #2 ought to do nothing at all. But installdir/chrome.manifest is not listed in removed-files.in, so on update we might not remove it, depending on what kind of MAR you get. If a user ended up with an installdir/chrome.manifest from an old build on update, it could screw up all sorts of things.

Firefox hasn't had an installdir/chrome.manifest since the great GRE/browser split, but it did before (in FF 20).

rstrong, I remember you saying that removed-files.in isn't really necessary any more, but I don't remember the technical details. Should we be adding this to removed-files so that updates remove things correctly?
Flags: needinfo?(benjamin) → needinfo?(mgrimes)
Flags: needinfo?(robert.strong.bugs)
It would be interesting to know what nsChromeRegistry::ConvertChromeURL is doing here to lead us to something not in omni.ja.

That said, the user's "Mozilla Firefox" directory has a chrome.manifest file that seems to point to everything laid out on disk, not in a jar.  Do we just pick up chrome.manifest files in the install directory?  Should we?

(The xul.css the user has corresponds to the state of xul.css from 3d267302cbd6 until just prior to 25877e8f89c2.)

[mid-aired with bsmedberg's comment; added only the below]

bsmedberg -- I'll forward you the user's install directory (post-uninstall, I believe)
This is totally weird. The user had a completely unpacked set of chrome files, such as what you might have from a self-build before packaging or --disable-omnijar. I cannot come up with a reasonable explanation for how an end-user would end up in that state. Except perhaps pale moon or some alternate gecko-based browser uses this packaging format? Or some malware intentionally unpacks and modifies the files.

The contents of installdir/chrome.manifest:

manifest chrome/chrome.manifest
manifest components/components.manifest

The contents of installdir/chrome/chrome.manifest:

locale alerts en-US en-US/locale/en-US/alerts/
locale autoconfig en-US en-US/locale/en-US/autoconfig/
locale cookie en-US en-US/locale/en-US/cookie/
locale global en-US en-US/locale/en-US/global/
locale global-platform en-US en-US/locale/en-US/global-platform/
locale global-region en-US en-US/locale/en-US/global-region/
locale mozapps en-US en-US/locale/en-US/mozapps/
locale necko en-US en-US/locale/en-US/necko/
locale passwordmgr en-US en-US/locale/en-US/passwordmgr/
locale pipnss en-US en-US/locale/en-US/pipnss/
locale pippki en-US en-US/locale/en-US/pippki/
locale places en-US en-US/locale/en-US/places/
locale weave en-US en-US/locale/en-US/
content marionette marionette/content/
content specialpowers marionette/content/
content cookie toolkit/content/cookie/
content global toolkit/content/global/ contentaccessible=yes
content global-platform toolkit/content/global-platform/ platform
content global-region toolkit/content/global-region/
content mozapps toolkit/content/mozapps/
content passwordmgr toolkit/content/passwordmgr/
content satchel toolkit/content/satchel/
content xbl-marquee toolkit/content/xbl-marquee/
skin global classic/1.0 toolkit/skin/classic/aero/global/ os=WINNT osversion>=6
skin global classic/1.0 toolkit/skin/classic/global/ os!=WINNT
skin global classic/1.0 toolkit/skin/classic/global/ os=WINNT osversion<6
skin mozapps classic/1.0 toolkit/skin/classic/aero/mozapps/ os=WINNT osversion>=6
skin mozapps classic/1.0 toolkit/skin/classic/mozapps/ os!=WINNT
skin mozapps classic/1.0 toolkit/skin/classic/mozapps/ os=WINNT osversion<6
content recording recording/content/
content pippki pippki/content/pippki/
resource specialpowers marionette/modules/
override chrome://global/content/nsTransferable.js chrome://global/content/nsDragAndDrop.js
resource gre-resources toolkit/res/

The contents of xul.css match up with Firefox 30. I'm going to try and compare the FF30 omnijar against these files and see if there are any differences at all.
Flags: needinfo?(mgrimes)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #50)
> Do we have a user who was experiencing this and could send us a zip of their
> install? I'm specifically interested to see whether
> installdir/chrome.manifest exists and if so, what the contents are.
> 
> After a little poking, here's the loading order for the root chrome.manifest
> file:
> 
> 1) installdir/omni.ja!/chrome.manifest (GRE omnijar)
> 2) installdir/chrome.manifest (GRE flat)
> 3) installdir/browser/omni.ja!/chrome.manifest (app omnijar)
> 4) installdir/browser/chrome.manifest (app flat)
> 
> Note that the "flat" chrome.manifest comes after the omnijar, and can
> therefore override entries found in the omnijar version.
> 
> Also note that in a normal Firefox build, installdir/chrome.manifest does
> not exist, so #2 ought to do nothing at all. But installdir/chrome.manifest
> is not listed in removed-files.in, so on update we might not remove it,
> depending on what kind of MAR you get. If a user ended up with an
> installdir/chrome.manifest from an old build on update, it could screw up
> all sorts of things.
> 
> Firefox hasn't had an installdir/chrome.manifest since the great GRE/browser
> split, but it did before (in FF 20).
> 
> rstrong, I remember you saying that removed-files.in isn't really necessary
> any more, but I don't remember the technical details. Should we be adding
> this to removed-files so that updates remove things correctly?
It depends on when this file was added. It was implemented in Firefox 5 and the entries in removed-files was kept up to date until Firefox 27 since we needed a required update to a version that had these changes. Bug 386760, Bug 649607, and the removed-files.in have details on the implementation
http://mxr.mozilla.org/mozilla-central/source/browser/installer/removed-files.in

Looking at the patch that cleaned up removed-files.in I don't see xul.css listed. I also checked the Firefox 4 removed-files.in which had omni.ja support and it didn't have xul.css listed either.
attachment #806430 [details] [diff] [review]

Having said that, I would have expected this to have happened in Firefox 27 if it were due to the file being left behind outside of the omni.ja.
More details: the user's remaining unpacked files *exactly* match Firefox 31 release. I have no useful theory about how this actually happened. In the meantime, I think a patch to add installdir/chrome.manifest to removed-files would fix this. I invite argument about whether that might have unintended side effects :-(
Attached patch remove-chrome.manifest-1063052 (obsolete) — Splinter Review
Attachment #8489549 - Flags: superreview?(dbaron)
Attachment #8489549 - Flags: review?(robert.strong.bugs)
Attachment #8489549 - Flags: feedback?(mh+mozilla)
Comment on attachment 8489549 [details] [diff] [review]
remove-chrome.manifest-1063052

>diff --git a/browser/installer/removed-files.in b/browser/installer/removed-files.in
>--- a/browser/installer/removed-files.in
>+++ b/browser/installer/removed-files.in
>@@ -48,24 +48,29 @@
>...
>+# Some users are ending up with unpacked chrome instead of omni.ja. This
>+# causes updates to break badly, see bug 1063052. Removing the toplevel
>+# chrome.manifest causes us to use the updated omni.ja.
s/This causes updates/This causes Firefox/

since the update works.

>+chrome.manifest
Attachment #8489549 - Flags: review?(robert.strong.bugs) → review+
Flags: needinfo?(robert.strong.bugs)
something similar also happened in bug 1055766, so this approach might help with various start-up crash signatures.
(In reply to Jonathan Watt [:jwatt] from comment #48)
> Bug 1008455 had other work build on top of it. Specifically in this order:
> 
> [snip]

FWIW backing out all five commits and rebuilding seems to result in a usable Firefox. (Although if we're not going to spin another paint release hopefully we don't need to do that.)
Interesting! There is a single difference in that users unpacked browser directory and the FF31 browser omni.ja:

diff -U 4 -r ff31-browser-omnijar/defaults/preferences/firefox.js Mozilla Firefox/browser/defaults/preferences/firefox.js
--- ff31-browser-omnijar/defaults/preferences/firefox.js	2010-01-01 00:00:00 -0500
+++ Mozilla Firefox/browser/defaults/preferences/firefox.js	2014-08-03 01:03:24 -0400
@@ -196,9 +196,9 @@
 pref("general.autoScroll", true);
 //@line 270 "c:\builds\moz2_slave\rel-m-rel-w32_bld-000000000000\build\browser\app\profile\firefox.js"
 
 // At startup, check if we're the default browser and prompt user if not.
-pref("browser.shell.checkDefaultBrowser", true);
+pref("browser.shell.checkDefaultBrowser", false);
 pref("browser.shell.shortcutFavicons",true);
 
 // 0 = blank, 1 = home (browser.startup.homepage), 2 = last visited page, 3 = resume previous browser session
 // The behavior of option 3 is detailed at: http://wiki.mozilla.org/Session_Restore

So I'll be willing to wager that some kind of crapware is doing this unpack procedure precisely to disable checking for default-browser. This is not something we measure in FHR or telemetry, so figuring out how prevalent the problem is could be hard.

I am wondering whether we can/should pull 32.0.1 updates for this.
Flags: needinfo?(lmandel)
Could this be a portable Firefox or other distribution? Do you know the install path?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #50)
> Also note that in a normal Firefox build, installdir/chrome.manifest does
> not exist, 

FYI, firefox-32.0.1.tar.bz2 for Linux that I just downloaded from
mozilla.com comes with a top-level chrome.manifest file:

# cat chrome.manifest 
manifest components/components.manifest
# cat components/components.manifest
binary-component libdbusservice.so
binary-component libmozgnome.so
# ls components/
components.manifest  libdbusservice.so  libmozgnome.so
Now that we know more about this bug, tracking across the board. We should get fixes into Firefox 33 and ESR 31.2.0.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #59)

> I am wondering whether we can/should pull 32.0.1 updates for this.

This depends on whether we think this bug is critical enough to ship a 32.0.2. If so, then we should pull 32.0.1 and get 32.0.2 together and out the door. From Benjamin and my conversation on irc, I don't know that we have evidence to suggest that this bug affects a significant enough percentage of the population to warrant a point release. I'm happy to hear arguments in favour of 32.0.2.
Comment on attachment 8489549 [details] [diff] [review]
remove-chrome.manifest-1063052

sr=dbaron, although I wonder why we should even be looking at a chrome.manifest in this location if it's not what we're shipping; it seems to just add an unsupported extension point that we don't want people to use.

Though that depends on not using it ourselves (comment 61), I guess.

Also, I presume this won't cause a problem if we then replace the file ourselves, as on Linux (comment 61)?
Attachment #8489549 - Flags: superreview?(dbaron) → superreview+
> I'm happy to hear arguments in favour of 32.0.2.

This affects relatively few, but being a consistent startup crash, it means we probably won't see those users ever again.

> So I'll be willing to wager that some kind of crapware is doing this unpack
> procedure precisely to disable checking for default-browser.

This could explain why we're seeing it in managed environments, the software that pushes out updates might also add some IT-chosen customizations.
Depends on: 1067561
> > So I'll be willing to wager that some kind of crapware is doing this unpack
> > procedure precisely to disable checking for default-browser.
> 
> This could explain why we're seeing it in managed environments, the software
> that pushes out updates might also add some IT-chosen customizations.

mkaply, you might have some experience here given [1]. Any idea how common this practice is among enterprises?

[1] http://mike.kaply.com/2013/05/06/dont-unpack-and-repack-omni-jar/
Flags: needinfo?(mozilla)
Comment on attachment 8489549 [details] [diff] [review]
remove-chrome.manifest-1063052

Review of attachment 8489549 [details] [diff] [review]:
-----------------------------------------------------------------

As mentioned by the other Mike, linux builds *do* have a chrome.manifest file. It so happens that we currently don't have one on mac and windows, but if we'd ever add xpcom binary components, we'd end up with some, too.

That being said, iirc, we should be able to merge those xpcom components in libxul now, although that's probably not something we'd be doing on 32.
Attachment #8489549 - Flags: feedback?(mh+mozilla) → feedback-
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #66)
> As mentioned by the other Mike

Err, that was Mats.

> That being said, iirc, we should be able to merge those xpcom components in
> libxul now, although that's probably not something we'd be doing on 32.

Karl, do you remember why we haven't done so yet?
Flags: needinfo?(karlt)
Is it actually a *problem* if we list a file in removed-files.in that is also a file we install?  (i.e., if we remove first and then replace, that seems fine)
I'm fairly certain it is since the explicit file removals listed in removed-files.in are handled last. Whatever the case, only files you don't want to exist after an update should be listed in removed-files.in.
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #67)
> > That being said, iirc, we should be able to merge those xpcom components in
> > libxul now, although that's probably not something we'd be doing on 32.
> 
> Karl, do you remember why we haven't done so yet?

mozgnome and dbusservice are covered by bug 563628 (and bug 821291).

There's no reason to have them as separate shared libraries now.
Flags: needinfo?(karlt)
(In reply to David Major [:dmajor] from comment #65)
> mkaply, you might have some experience here given [1]. Any idea how common
> this practice is among enterprises?

I haven't seen cases where people keep unzip omni.ja and run Firefox unzipped unless they are doing testing.

We strongly encourage folks not to unzip or modify omni.ja.

I could post on the enterprise list to see if anyone has seen anything like this.

Can I get a summary of exactly what we are seeing so I can ask?
Flags: needinfo?(mozilla)
(In reply to Mike Kaply (:mkaply) from comment #71)
> Can I get a summary of exactly what we are seeing so I can ask?

Some users have the unzipped contents of omni.ja from FF31 in their FF directory. We don't know how or why. Those files remain even after installing FF32. FF32 adds a new file inside omni.ja, but the lookup for that file goes to the unzipped directory, which doesn't have that file.
#ifdef using MOZ_GTK. Also the first version caused build bustage because you can't put something in removed-files that you're also packaging, which is a nice protective measure so if we do start needing root chrome.manifest in the future on win/mac we'll discover it at packaging time.
Attachment #8490191 - Flags: review?(mh+mozilla)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #59)
> I am wondering whether we can/should pull 32.0.1 updates for this.

We disabled Firefox 32.0.1 updates for this today. This bug is the driver for 32.0.2.
Attachment #8490191 - Flags: review?(mh+mozilla) → review+
Summary: Firefox 32 startup crash in nsFrame::BoxReflow → Firefox 32+ startup crash in nsFrame::BoxReflow
Attachment #8489549 - Attachment is obsolete: true
Comment on attachment 8490191 [details] [diff] [review]
remove-chrome.manifest-1063052

Approval Request Comment
[Feature/regressing bug #]: Unknown user behavior causing frankenstalls
[User impact if declined]: crashes on upgrade from FF31->32
[Describe test coverage new/current, TBPL]: Very little real-world testing has been completed. This *should* fix the issue, but finding somebody to do the upgrade in this situation and test it might be impossible.
Nightly testers are not going to provide much useful testing of this: the risks are probably mostly related to updates from unusual configurations or perhaps to the installer sequence. I don't believe that there is a big risk here, but this is an install/update change which magnifies risk in general, and we should have focused testing of various install and update scenarios.

[String/UUID change made/needed]: None
Attachment #8490191 - Flags: approval-mozilla-release?
Attachment #8490191 - Flags: approval-mozilla-beta?
Attachment #8490191 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/eaf866fcb961
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #76)
> [Describe test coverage new/current, TBPL]: Very little real-world testing
> has been completed. This *should* fix the issue, but finding somebody to do
> the upgrade in this situation and test it might be impossible.
> Nightly testers are not going to provide much useful testing of this: the
> risks are probably mostly related to updates from unusual configurations or
> perhaps to the installer sequence. I don't believe that there is a big risk
> here, but this is an install/update change which magnifies risk in general,
> and we should have focused testing of various install and update scenarios.

I agree that we're not going to get useful testing on Nightly. We will push this to the release branch shortly and can test there.

How about having QA test the regular update paths, which they already do for release, and test one where they have reproduced the failing environment? Is that enough coverage?
Comment on attachment 8490191 [details] [diff] [review]
remove-chrome.manifest-1063052

Approved for release. We'll test there and then uplift to the other branches.
Attachment #8490191 - Flags: approval-mozilla-release? → approval-mozilla-release+
Followup fodder:
- That people would rely on extracting omni.ja completely to override a pref is wrong. There are (documented?) ways to do this. If those are not documented, they should be.
- It seems to me the main part of the failure in this bug comes from the fact that chrome manifests outside the omni.ja are treated after the chrome manifests inside the omni.ja. What about doing the opposite? If something is registered from an extracted omni.ja, it will be overridden by the contents of omni.ja. If people really want to use the contents of the extracted omni.ja, they'll have to remove omni.ja.
I think the main reason why the extracted "flat chrome" works is for developing. I agree with comment #81 in that those people wanting to do some development there without needing to build should also easily be able to at least rename omni.jar while testing their stuff, and making omni.jar override flat chrome would make us significantly more robust against bugs like the one here.

Glandium, do you want to file a separate bug on that reordering?
We've performed some additional testing today using Firefox 32.0.2 (BuildID=20140917194002) on Win 7 x86, Win 8.1 x64, Ubuntu 14.04 x86, and Mac OS X 10.9.4. 

Our testing covered:

1. Regular update
- tested update from Firefox 20, 27, 31, 32 and 32.0.1 to Firefox 32.0.2
- used different update types: Help -> About, Pave update, Automatic update
- no issues were encountered

2. nsFrame::BoxReflow original crash
We were able to consistently reproduce the original crash when updating from Firefox 31 to Firefox 32, on Windows 7 x64, but encountered no issue when updating from 31 to 32.0.2. 

The scenario used to reproduce the original issue and confirm the fix in 32.0.2:
1. Installed Firefox 31 cleanly in a new empty folder.
2. In the installation folder renamed omni.ja to omni.zip.
3. Using Windows Explorer extracted all contents of omni.zip to the installation folder.
4. Renamed omni.zip back to omni.ja.
5. Started Firefox 31 with a brand new profile and went to Help -> About to trigger update to 32.
6. After download completed, chose to restart Firefox to apply the update ---> Firefox crashed on startup on any try with the nsFrame::BoxReflow signature.

To verify update from 31 to 32.0.2, before step 5 I just edited defaults\pref\channel-prefs.js and set the update channel to betatest. This ensures the update will go to 32.0.2 instead of 32.0. 

I'll leave the bug open in case you want to do some additional verification or ask some people with this original issue to verify.
Comment on attachment 8490191 [details] [diff] [review]
remove-chrome.manifest-1063052

Approving for Beta and Aurora now that QA has verified.
Attachment #8490191 - Flags: approval-mozilla-beta?
Attachment #8490191 - Flags: approval-mozilla-beta+
Attachment #8490191 - Flags: approval-mozilla-aurora?
Attachment #8490191 - Flags: approval-mozilla-aurora+
Setting for verification in the other branches as well.
Flags: qe-verify+
Any chance to get an automated test for this underlying crash situation?
Flags: in-testsuite?
I've performed some additional testing today trying to verify the fix on all other branches.

I got the following results:
1. Reproduced the original crash each time using scenario from comment 83 (but doing a pave-over upgrade at step #5), for all branches (33 Beta 2, 34 Aurora from Sep 4th, 35 Nightly from Sep 4th).
2. Did not get any issue when updating 31 to 33 Beta 5 through About->Help (scenario from comment 83, but forced to update to Beta 33).
3. Did not manage to test updating to 34 and 35 through About->Help. I've set aurora/nightly in "channel-prefs.js" and firefox-mozilla-central/firefox-mozilla-aurora in "update-settings.ini" but update failed. I'm thinking I'm probably doing something wrong here while trying to force update to aurora/nightly.
4. Doing a pave-over upgrade at step #5 in scenario from comment 83 will still crash the application on startup with:
a) original crash ([@ nsFrame::BoxReflow(nsBoxLayoutState&, nsPresContext*, nsHTMLReflowMetrics&, nsRenderingContext*, int, int, int, int, bool)]) - on Firefox 32.0.2 and 33 Beta 5
b) [@ mozalloc_abort(char const* const) | NS_DebugBreak | ErrorLoadingBuiltinSheet ] (bp-1e4e2d40-60d5-4da8-b475-539932140919) - on latest 34 Aurora and latest 35 Nightly

#4 above seems to indicate that the fix works OK only when updating through About->Help. When overwriting the older installation (pave-over upgrade) we still get startup crashes. Any thoughts on this?
What is a "pave-over upgrade"?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #89)
> What is a "pave-over upgrade"?

It's downloading the new version and installing it in the location of the older version, thus performing an upgrade.
Florin, I think that's a little unexpected, but doesn't require re-opening this bug. Could you file a separate bug about that; in particular I'd expect the installer to remove things in the removed-files list, or everything that isn't explicitly in the install manifest.
Most things in the removed-files list don't need to be removed on install since the installer keeps track of what it installs. For cases like this we just add a one-off to remove the file.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #82)
> Glandium, do you want to file a separate bug on that reordering?

Filed bug 1072152.
Re-verified after bug 1070988 and bug 1072152 were fixed. Update through About -> Help (re-verified for Firefox 33 Beta 8) and pave-over install now both work without causing any crash. 

Versions covered:
- Firefox 33 Beta 8 - BuildID: 20140929180120
- Latest Firefox 34 Aurora - BuildID: 20140930004006.
- Latest Firefox 35 Nightly - BuildID: 20140929030205.

Setting issue as Verified.
Too late to get this into 31.1.0 esr so let's get an approval on this to ship with FF 34.
Flags: needinfo?(robert)
It's actually bsmedberg's patch that matters here
Flags: needinfo?(robert) → needinfo?(benjamin)
I don't understand what the question is. Didn't this all land?
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #97)
> I don't understand what the question is. Didn't this all land?

It's was being tracked for ESR31 - if this is not a serious crash on ESR31, I'm fine with not taking it as it doesn't truly meet the criteria for landing.  Do you have data to show which way we should go here?
Flags: needinfo?(benjamin)
The proximate cause of the regression was bug 1008455, which was not part of ESR31 and won't be uplifted. I don't think we need to uplift this.
Flags: needinfo?(benjamin)
In bug 1055766, it seems that we are experiencing the same issue with the ESR 24 => 31 upgrade. Benjamin, do you think it could be the same issue? Thanks.
Flags: needinfo?(benjamin)
bug 1055766 looks like a crash in JS garbage-collection code, which doesn't strike me as something that could be caused by stale omni.ja content [the ultimate underlying problem here]; so I don't think they're connected.  (See also comment 99.)
If it were related, note that the patch here only works on automatic (MAR) upgrade. To fix this for sysadmins who are manually upgrading, you'd need to take 1070988 or 1072152.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #102)
> If it were related, note that the patch here only works on automatic (MAR)
> upgrade. To fix this for sysadmins who are manually upgrading, you'd need to
> take 1070988 or 1072152.

Can we get a ESR31 patch to resolve the startup crashes?
Flags: needinfo?(benjamin)
The set of patches you might want to take are perhaps this bug, bug 1070988, and bug 1072152. I don't think we need new patches, just a decision about which patch(es) to take.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #104)
> The set of patches you might want to take are perhaps this bug, bug 1070988,
> and bug 1072152. I don't think we need new patches, just a decision about
> which patch(es) to take.

Right and just to confirm Bug #1055766 is a dupe of this bug right? Want to make sure we are addressing the right thing.
Where was that connection established? Per comment 101, it doesn't make much sense to me that bug 1055766 would be connected to this bug.
(In reply to Daniel Holbert [:dholbert] from comment #106)
> Where was that connection established? Per comment 101, it doesn't make much
> sense to me that bug 1055766 would be connected to this bug.

I think bug 1055766 ended with two different problems. The original problem in comment 0 does not appear to be related. Bug 1055766 comment 15 seems to be a different problem which looks much more similar to this bug. It was believed to be the same because bug 1055766 comment 5 and following comments noted that the steps to reproduce are similar, but I see no evidence to suggest that any of those later crashes observed are the same as the crash in comment 0.
Benjamin,

In testing for Bug #1055766 Steve Fink ran across this signature in crashed when updating ESR plus the crash signature in that bug. He thinks backportimg your patch for ESR could fix both possibly. Any chance you can look in this some and backporting?
Flags: needinfo?(benjamin)
Do the patches need changes? I don't think I have additional time for this bug, but if you've tried applying the patches to ESR and they don't work, I can ask dmajor to have a look. It's probably very trivial merge-conflict-resolution.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #109)
> Do the patches need changes? I don't think I have additional time for this
> bug, but if you've tried applying the patches to ESR and they don't work, I
> can ask dmajor to have a look. It's probably very trivial
> merge-conflict-resolution.

I will have to ask someone to check as I do not have access to a windows machine to test this.
Ryan can you try landing these to mozilla-esr31 branch and let us know if they merge without conflict?  If so, consider them a+ for esr31 and go ahead with pushing the changes.
Flags: needinfo?(ryanvm)
Per some of the earlier comments in the bug, Ben's patch is the one that matters most here (and the only one that landed for the Fx32 chemspill).

https://hg.mozilla.org/releases/mozilla-esr31/rev/d071f8dfe4d6
Flags: needinfo?(ryanvm)
Not worth writing a layout crashtest for IMO, since the root cause was
a frankenbuild.
Flags: in-testsuite? → in-testsuite-
No longer blocks: 436232
You need to log in before you can comment on or make changes to this bug.