Closed Bug 380015 Opened 17 years ago Closed 16 years ago

Crash [@ nsFrame::BoxReflow] on startup when Fx 2.0 libraries not removed from install directory

Product/Component: Firefox :: Installer (defect, P2)
Status: RESOLVED FIXED
Target Milestone: mozilla1.9beta5
Reporter: fantasai.bugs
Assignee: robert.strong.bugs
Keywords: crash, topcrash
Attachments: 3 files

Description:
  Nightly builds segfault on startup.

Steps to Reproduce:
  1. Download nightly build
  2. tar -xvjf f<tab>
  3. cd firefox
  4. ./firefox

Expected Results:
  Working firefox build

Actual Results:
  ./run-mozilla.sh: line 131:   759 Segmentation fault      "$prog" ${1+"$@"}

Tested with ftp.mozilla.org trunk nightlies, 2007-02-03 and 2007-05-07 on Ubuntu Linux 6.06 on Dell D620 machine.

I don't have this problem when I compile a build myself.
Product: Firefox → Core
QA Contact: general → general
Attached file backtrace
I can reproduce this with a static opt build on ubuntu 7.04.
oops, missed the reported error:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47007716719648 (LWP 10267)]
0x00002ac0ceed561c in nsFrame::BoxReflow (this=0x64dd70, aState=@0xbf08b0, aPresContext=0xd455e8, aDesiredSize=@0x7fffdc57ebd0, 
    aRenderingContext=0x177000ef4e10, aX=32767, aY=-823286586, aWidth=6000, aHeight=0, aMoveFrame=0) at nsFrame.cpp:6257
6257        if (metrics->mLastSize.width != aWidth)
Status: UNCONFIRMED → NEW
Ever confirmed: true
(gdb) print metrics
$1 = (nsBoxLayoutMetrics *) 0x0
Flags: blocking1.9?
Attached file valgrind warnings
The pango_shape valgrind warnings are likely due to bug 381654.

The XftFontOpenInfo warnings are independent.

Attachment 265736 was submitted for bug 381654, but I'm not so hopeful that it will help here.
Depends on: 381654
This has the same stack as the crash from bug 383875:
https://crash-reports.mozilla.com/reports/report/index/27a59de6-1758-11dc-a54e-001a4bd43ed6

Would be nice if crash-reports worked.  :-/
Should have put more info there: that crash is on Win32, with my patch to embed manifests in all DLLs. The crash is 100% reproducible on startup on a machine without the VC8 CRT installed.
Severity: normal → critical
Keywords: crash
Version: unspecified → Trunk
Comment on attachment 265674
backtrace

This backtrace shows a problem with the stack. Frame 5 has a null aPresContext, but its caller just passes it through.
These are opt builds so the parameter values in the stack are not reliable, right?
(In reply to comment #11)
> These are opt builds so the parameter values in the stack are not reliable,
> right?
> 

er, yes, of course
do people have any reason to believe this is not a duplicate of bug 292549?
Depends on: 292549
I launched -P and created a new profile (the old profile was just from 2.0.0.6, and was a "stock" profile with no new bookmarks or pref changes), and this started working for me again...

Linux mozilla-qa 2.6.20-16-generic #2 SMP Thu Jun 7 20:19:32 UTC 2007 i686 GNU/Linux
FWIW, according to [1] there were almost zero nsFrame::BoxReflow crashes on trunk from 2007-06-13 through 2007-08-25. Then came a series of Thunderbird crashes, which seem to have gone away.

[1] http://crash-stats.mozilla.com/report/list?range_unit=weeks&branch=1.9&range_value=2&signature=nsFrame%3A%3ABoxReflow%28nsBoxLayoutState%26%2C+nsPresContext%2A%2C+nsHTMLReflowMetrics%26%2C+nsIRenderingContext%2A%2C+int%2C+int%2C+int%2C+int%2C+int%29
+'ing.  Seems to be crashing on windows as well.  Setting priority to P2 so we can dig into this rather quickly.  It's at #16 on top crashers.
Flags: blocking1.9? → blocking1.9+
No real STR, let's watch for it in Beta 4 and see if we can figure out the cause.
Flags: tracking1.9+
(to be clear, not marking wanted-next+ as we're not sure that this hasn't been fixed by some other bug; for now this is a tracking bug for this type of crash, if it becomes a major issue then it should be renominated as blocking1.9?)
That tells me that this bug is still blocking, on investigation and confidence if not on additional code changes.  Renominating, because I think we should prefer to have maybe-fixed things on the list than to have to remember to watch for these or detect that a reported crash is similar enough to be renominated.
Flags: blocking1.9?
blocking1.9+, P2, since Shaver is so demanding and all.
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
It's worth noting that a lot of these stacks (all of the ones that seem unmangled all the way down to main) go through:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/toolkit/xre/nsAppRunner.cpp&rev=1.208&mark=3076#3076

Maybe there's something unusual with the way we bring up the modal dialog so early in startup?
So how does one get the extension manager to bring up modal dialogs from that line of code?  Is it possible that doing that is reliably crashy on Windows?
(In reply to comment #23)
> So how does one get the extension manager to bring up modal dialogs from that
> line of code?  Is it possible that doing that is reliably crashy on Windows?

If you have some extensions installed that aren't compatible with your current version of firefox (setting extensions.checkCompatibility to false will let you install them) then change extensions.lastAppVersion to a lower version number, quit firefox and delete the compatibility.ini file from your profile. Then on startup you should get the modal window from there.

It is worth mentioning that there is another point in there that we can launch a modal dialog, the extension updates dialog which would open from:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/toolkit/xre/nsAppRunner.cpp&rev=1.208&mark=3084#3084

Do we get stack traces there?

I wonder if this is related to the kind of reflow loop I see as part of bug 354527 (caused by bug 413336)
So, looking through the topcrashes on Windows showing up with this signature, I'm noticing that most (although not all) of them are 1.9 builds (with firefox.exe showing version 1.9.0.2988) that have a bunch of 1.8.* version libraries loaded (typically xpcom_core.dll, jar50.dll, and myspell.dll; sometimes also spellchk.dll and xpinstal.dll) -- generally all showing the same 1.8.* version (almost all 1.8.20080.20121, but I saw one 1.8.20071.12718).  I don't see this happening with other crashes that I looked at.

See bp-c3fbe6ba-f07f-11dc-b0b4-001a4bd43ef6 for an example (the "Modules" tab).
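
A minimal sketch of how one might check an install directory for the leftover libraries named above. The function name `find_stale_libraries` and the set `FX2_ONLY_LIBRARIES` are hypothetical, and the list assumes these file names do not exist in a clean 1.9 install; it is illustrative, not exhaustive.

```python
import os

# Names taken from the module lists described above. Assumption: none of
# these files ship with a clean 1.9 build, so finding one means leftovers.
FX2_ONLY_LIBRARIES = {
    "xpcom_core.dll",
    "jar50.dll",
    "myspell.dll",
    "spellchk.dll",
    "xpinstal.dll",
}


def find_stale_libraries(install_dir):
    """Return sorted names from FX2_ONLY_LIBRARIES found in install_dir."""
    # Lower-case the directory listing because Windows file names are
    # case-insensitive.
    present = {name.lower() for name in os.listdir(install_dir)}
    return sorted(present & FX2_ONLY_LIBRARIES)
```

An empty result would suggest the directory does not mix 1.8 and 1.9 libraries, at least for the names listed here.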
The end of the version ID is the Build ID. 2008020121 is 2.0.0.12:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12 (although this doesn't show the hour)

The fact that that's most of them just proves that we have good security release uptake. :)

Yeah, I knew that the 20080.20121 was a build ID.  I'm still not sure what the .2988 is.

I just did the experiment of (in a clean directory):
$ (cd /c/Program\ Files/Mozilla\ Firefox/ && tar -c .) | tar -x
$ mkdir deleteme
$ ./firefox.exe  -profile deleteme 
  [quit]
$ (cd /c/Program\ Files/Minefield/ && tar -c .) | tar -x
$ ./firefox.exe  -profile deleteme 

and the build crashed on startup, producing bp-faffdae1-f089-11dc-9cc6-001a4bd43ed6.  I'm still waiting to see if it's this crash, but I'm sort of expecting it will be.  (If so, the question then becomes how users end up in that situation and whether there's anything we can do to avoid ending up with those dlls there or to avoid using them.)
We changed the versioning on trunk, so the last field is just "days since jan 1, 2000".
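
The two version schemes discussed here can be decoded with a short sketch. The function names are hypothetical. It assumes the branch scheme splits a YYYYMMDDHH build ID across the last two fields (with the final field zero-padded to five digits), and that the trunk day count starts at zero on 2000-01-01.

```python
from datetime import date, datetime, timedelta


def decode_branch_build_id(version):
    """Decode a 1.8-style version such as '1.8.20080.20121'.

    Assumption: the last two fields concatenate to a YYYYMMDDHH build ID.
    """
    fields = version.split(".")
    # Numeric version fields drop leading zeros, so pad the last one back.
    build_id = fields[-2] + fields[-1].zfill(5)
    return datetime.strptime(build_id, "%Y%m%d%H")


def decode_trunk_day_count(version):
    """Decode a trunk-style version such as '1.9.0.2988'.

    Assumption: the last field counts days since 2000-01-01, starting at 0.
    """
    days = int(version.split(".")[-1])
    return date(2000, 1, 1) + timedelta(days=days)
```

Under those assumptions, `1.8.20080.20121` decodes to 2008-02-01 at hour 21, and the `.2988` in `1.9.0.2988` corresponds to 2008-03-07.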

Aren't crashes like these the reason we stopped shipping zip builds? I can't fathom how you'd wind up in this situation using the installer, seems like you would have to manually screw yourself by copying files around.
Although, I guess this was initially reported on Linux, where we *only* ship tarballs. :-/
Sure enough, this is the crash signature you get when you copy a Minefield install on top of a Firefox 2.0.0.12 install.  See bp-faffdae1-f089-11dc-9cc6-001a4bd43ed6 (same one mentioned above), bp-3fce0e4a-f08c-11dc-83ef-001a4bd46e84, and bp-54fcbec7-f08c-11dc-b2c3-001a4bd43ed6.

For what it's worth, I tried the other way around, putting 2.0.0.12 on top of 3.0 (wondering if I could get a 3.0 thanks to simultaneous use of updater and a fresh install on top), and I got a busted 2.0.0.12 since it (of course) doesn't know to remove the libraries that are new in 3.0 (brwsrcmp.dll), resulting in crashes in nsACString_internal::Assign: TB42544824, TB42544899.


So I guess we need to try to figure out how users are ending up in this situation (enough that it's the #6 topcrash on the first day of release, although it's since dropped to #16).
Summary: Crash [@ nsFrame::BoxReflow] on startup → Crash [@ nsFrame::BoxReflow] on startup when Fx 2.0 libraries not removed from install directory
I just discussed this briefly with Rob Strong.  The discussion led to two breakpad/socorro questions:

  1. are the files in the module list stored in the database as full pathnames and just shown in the UI as file basenames, or does the database only really have the basename?  If the former, could we see some examples from some of these **Windows** incidents (perhaps cleaned up) so we could find out:

    + what the directory name the install was done into was (might tell us something about steps to repro -- e.g., whether it's Firefox, Minefield, etc.)

    + so that we could confirm that this is happening on Windows because of crossing of files within a single install directory rather than some condition that causes dlls from separate install directories to be used at the same time

  2. whether it would be possible to make the install.log part of the data sent in the breakpad reports, at least temporarily, so that we could see if there were installation issues causing this and what they were.  (Potentially this could be conditioned on something you could test at runtime, like whether there's an "xpcom_core.dll" in the Modules list.)
1. Only the leaf name is being stored in the database, but the dump files contain full pathnames. They get stripped out during processing, so it would currently be slightly tricky to get them back, but I think we could.
2. We could probably send install.log wholesale, but since these are startup crashes I guess we'd have to do it very early in startup. Not terrible, just have to use NSPR file methods. The crash reporter doesn't currently conditionally send any data after the crash, it's all setup beforehand. We don't actually look at the list of modules while we're submitting it, so it'd be easier just to send it all the time. We'd need a db change to store it, though.
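
A rough sketch of the two ideas in this exchange: reducing full module pathnames to leaf names, and testing a module list for a marker library before deciding to attach install.log. The function names are hypothetical and this is not how breakpad or socorro actually process dumps; it only illustrates the condition being proposed.

```python
import ntpath


def leaf_names(module_paths):
    """Strip Windows directory components, keeping lower-cased leaf names."""
    # ntpath handles backslash-separated paths on any host platform.
    return [ntpath.basename(path).lower() for path in module_paths]


def should_attach_install_log(module_paths, marker="xpcom_core.dll"):
    """Return True when the marker library appears in the module list."""
    return marker in leaf_names(module_paths)
```

As noted above, the crash reporter sets up its data before the crash, so a check like this would more plausibly run on the server side or be skipped in favor of always sending the log.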
Keywords: topcrash
Any thoughts on how we can fix this?  Fx3 installer changes, with checks or warnings about installing on top of an existing installation?

We could quietly disallow installation on top, provide a warning, or insist that the user retry with another install location.

We are past the string deadline if we need to present the user with additional info, but this might deserve an exception.  Marking late-l10n until we know more about how we want to address this.  If it's late-l10n the priority should move to P1 as well.

dveditz might also have some ideas on how we have approached this in the past...

Keywords: late-l10n
The installer already removes these files. The only way to get into this situation is to manually extract on top of an existing build.
I think I see what is going on here for Win32 and should be able to fix it without a string change.
Moving over to toolkit -> nsis installer
Component: General → NSIS Installer
Product: Core → Toolkit
QA Contact: general → installer
Target Milestone: --- → mozilla1.9beta5
taking
Assignee: nobody → robert.bugzilla
Keywords: late-l10n
No longer depends on: 292549, 381654
note: should have a patch over this weekend.
(In reply to comment #30)
> I got a busted 2.0.0.12 since it (of course) doesn't know to remove
> the libraries that are new in 3.0 (brwsrcmp.dll)

We could fix that. If we expect some number of people to install an early 3.0 beta/preview and then downgrade until some site/addon/bug gets fixed we probably should. I'm pretty sure we did that with later versions of the FF1.5 installer.

Filed bug 423226 to add this to the FF2 installer; it should block 2.0.0.14 (unfortunately it has missed 2.0.0.13, unless we think this is respin-worthy and/or some other regression forces a respin). That's pretty late (it will miss beta 5, for example). Is this important enough to stop-ship 2.0.0.13? How common a crash is this?

Fixing the windows installer doesn't help on Mac or Linux of course, not sure what we could do there.
Depends on: 423226
Attached patch: patch rev 1
Adds delete on reboot support for files listed in removed-files.log that are deleted.

This is similar to the following for comparison purposes
http://lxr.mozilla.org/seamonkey/source/toolkit/mozapps/installer/windows/nsis/common.nsh#3710
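
The general delete-or-defer pattern described here can be sketched as follows. This is not the NSIS patch itself. The function name, the `schedule_on_reboot` callback, and the assumption that removed-files.log holds one relative path per line are all hypothetical. In NSIS the deferral would typically be `Delete /REBOOTOK`; on Windows more generally it is `MoveFileEx` with `MOVEFILE_DELAY_UNTIL_REBOOT`.

```python
import os


def remove_listed_files(install_dir, removed_files_log, schedule_on_reboot):
    """Delete each listed file now, deferring any that cannot be removed.

    Assumption: removed_files_log contains one relative path per line.
    schedule_on_reboot is a caller-supplied callback standing in for the
    platform's delete-on-reboot mechanism.
    """
    deferred = []
    with open(removed_files_log, encoding="utf-8") as log:
        for line in log:
            relative_path = line.strip()
            if not relative_path:
                continue
            path = os.path.join(install_dir, relative_path)
            if not os.path.exists(path):
                continue  # already gone, nothing to do
            try:
                os.remove(path)
            except OSError:
                # Typically the file is in use; queue it for the next reboot.
                schedule_on_reboot(path)
                deferred.append(path)
    return deferred
```

The returned list lets the caller decide whether to prompt for a reboot once the install finishes.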
Attachment #310306 - Flags: review?(benjamin)
Whiteboard: [has patch]
Comment on attachment 310306
patch rev 1

Boy, do I wish we had some installer unit-tests... rstrong, can you maybe file a bug about that and we can think about it after FF3 ships?
Attachment #310306 - Flags: review?(benjamin) → review+
Filed bug 423754 for the NSIS installer unit tests
Checked in to trunk. This should fix the recent rise in the Win32 crashes with trunk builds. bug 423226 is for the branch bug as mentioned in comment #39.

Checking in mozilla/toolkit/mozapps/installer/windows/nsis/common.nsh;
/cvsroot/mozilla/toolkit/mozapps/installer/windows/nsis/common.nsh,v  <--  common.nsh
new revision: 1.35; previous revision: 1.34
done
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [has patch]
As far as trunk topcrash data go:  this signature was responsible for about 10-40 crashes per build ID in nightlies in the week leading up to this fix landing; in the 3 days since it landed there have been a total of 6 crashes.  So it looks like it fixed most (although not quite all) of this problem.
Crash Signature: [@ nsFrame::BoxReflow]
Component: NSIS Installer → Installer
Product: Toolkit → Firefox