Solaris: Build packaging problem, crash on startup

RESOLVED FIXED

Status

SeaMonkey
Build Config
P1
critical
RESOLVED FIXED
19 years ago
13 years ago

People

(Reporter: mats, Assigned: Rich Burridge)

Tracking

({crash, helpwanted})

Trunk
Sun
Solaris
crash, helpwanted

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [dogfood-] Build packaging problem. Need plan.)

Attachments

(2 attachments)

(Reporter)

Description

19 years ago
Apprunner nightly builds for Solaris have been crashing on startup for about
2 weeks now. Viewer works fine though. I will attach a stack trace. uname -a:
SunOS claudia 5.6 Generic_105181-11 sun4u sparc SUNW,Ultra-2
(Reporter)

Comment 1

19 years ago
Created attachment 1566
Apprunner crash data, Solaris 2.6

Updated

19 years ago
Assignee: don → cyeh
Component: Browser-General → Build Config

Comment 2

19 years ago
Chris,

Are you folks seeing this problem?
(Reporter)

Comment 3

18 years ago
Just as a status update: it's still crashing. I have checked now and then for
the past month (latest build: 1999-10-07-16-M11) and it always crashes with
the same stack trace as the one I have attached.

Updated

18 years ago
Priority: P3 → P1
Should a P1 critical have no milestone?

Comment 5

18 years ago
Here is the analysis I sent several of the engineers on 8 September, plus some
more recent discoveries.  I'm a little sad that it hasn't resulted in any
action, so I'll post it here:

> 1) The problem appears to be that gdk_rgb_init calls gdk_rgb_choose_visual,
>    which does:
>
>         visuals = gdk_list_visuals ();
>
>    When this returns, visuals is NULL; subsequently, the code does:
>
>           tmp_list = visuals;
>           best_visual = tmp_list->data;
>
>    Which is clearly the cause of the crash.
>
> 2) So why does gdk_list_visuals() return NULL?  It's pretty simple:
>
>         GList *gdk_list_visuals (void)
>         {
>           GList *list;
>           guint i;
>
>           list = NULL;
>           for (i = 0; i < nvisuals; ++i)
>             list = g_list_append (list, (gpointer) &visuals[i]);
>
>           return list;
>         }
>
>     nvisuals is 0, so the list returned is NULL.
>
>     Sadly, what I can't answer is why nvisuals is 0.  I'll keep poking
>     to see if I can figure out why.
>
>     This leads me to...
>
> 3) Why is it that there are two complete copies of gdk in different
>    shared libraries?  Here's what I mean:
>
> $ dis -F gdk_rgb_init libgfx_gtk.so | head
> disassembly for libgfx_gtk.so
>
> section .text
> gdk_rgb_init()
>         165798:  9d e3 bf 90       save         %sp, -112, %sp
>         16579c:  11 00 00 00       sethi        %hi(XSynchronize), %o0
>         1657a0:  d0 4a 20 00       ldsb         [%o0 + XSynchronize], %o0
>
> $ dis -F gdk_rgb_init libwidget_gtk.so | head
> disassembly for libwidget_gtk.so
>
> section .text
> gdk_rgb_init()
>         1a5a0c:  9d e3 bf 90       save         %sp, -112, %sp
>         1a5a10:  11 00 00 00       sethi        %hi(XSynchronize), %o0
>         1a5a14:  d0 4a 20 00       ldsb         [%o0 + XSynchronize], %o0
>

That was my original mail-- I got some vague response that this "looked like
an X server bug" which I rejected.   Since then I thought some more about this
problem:

I think the issue is that one library is getting loaded after the other, and
effectively interposing on the other's functions.  This is a standard linker
trick, but I think here it is happening by mistake.  Recall that this would
also explain why nvisuals is 0-- it is state private to the gdk library.  In
this case, when the second copy of that library is loaded, it has its own,
*uninitialized* version of the nvisuals variable.

To check this, I performed the following experiment using LD_PRELOAD, which
should keep a particular library at the head of the link chain:

First, I installed libgtk and libgdk, etc. in a well-known place (in this case
/usr/local/lib):

$ /bin/ksh
$ export LD_PRELOAD=libgtk.so:libgdk.so
$ export LD_LIBRARY_PATH=/usr/local/lib

Then I ran mozilla, and everything came up.  In this case, I forced the
application to *always* use my copy of gtk/gdk, instead of getting confused
between two other copies.

So: I think removing one of the two statically linked copies of libgtk/gdk
should solve this problem.
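
The `dis` comparison in point 3 can be repeated across a whole package
directory. A minimal sketch, assuming an `nm` that accepts `-D` and prints
defined text symbols as `address T name` (GNU binutils behaviour; the native
Solaris `nm` formats its output differently). The function name
`libs_defining` and the `NM` override are hypothetical, introduced only for
illustration:

```shell
# Print every shared library in a directory that carries its own
# definition of the given symbol (assumes GNU-style `nm -D` output).
libs_defining() {
    symbol=$1
    dir=$2
    for lib in "$dir"/*.so; do
        # Skip the unexpanded glob when the directory has no .so files.
        [ -e "$lib" ] || continue
        # A defined function shows up as "<address> T <name>";
        # an undefined reference shows up as "U <name>" and is ignored.
        if ${NM:-nm} -D "$lib" 2>/dev/null |
            awk -v s="$symbol" '$2 == "T" && $3 == s { found = 1 } END { exit !found }'
        then
            echo "$lib"
        fi
    done
}
```

If something like `libs_defining gdk_rgb_init "$MOZILLA_FIVE_HOME"` lists more
than one library, each of them has its own private gdk state, which would be
consistent with the uninitialized nvisuals described above.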

Comment 6

18 years ago
Created attachment 2550
more successful stacktrace

Updated

18 years ago
QA Contact: leger → granrose

Comment 7

18 years ago
Updating QA contact to a release person.  Internal Core QA does not test
Solaris.

Updated

18 years ago
Status: NEW → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → DUPLICATE

Updated

18 years ago
Status: RESOLVED → VERIFIED

Comment 8

18 years ago
*** This bug has been marked as a duplicate of 13160 ***
(Reporter)

Updated

18 years ago
Status: VERIFIED → REOPENED
(Reporter)

Comment 9

18 years ago
Reopen because bug 13160 was fixed but this bug still occurs in build 1999111909

Updated

18 years ago
Resolution: DUPLICATE → ---

Comment 10

18 years ago
Clearing DUPLICATE resolution due to reopen.

Updated

18 years ago
Assignee: cyeh → chofmann
Status: REOPENED → NEW

Comment 11

18 years ago
did mcafee's recent award winning fixes help this? ;-)
jdunn, do you see this on other unix ports?

Comment 12

18 years ago
On AIX, we ran into this problem EONS ago and the 'fix' is to create
a shared gtk library and link against this instead of the static libs.

on HPUX we are still chasing the problem that was introduced with
superwin, but it is either a similar case to this OR a threading issue

Personally I would love it if we only linked against gtk/gdk once
instead of the 3-5 times we do it now.

Updated

18 years ago
Assignee: chofmann → mcafee
Target Milestone: M12

Comment 13

18 years ago
stealing from chofmann, m12.

Updated

18 years ago
Summary: Apprunner crash on startup → Solaris: Duplicate gtk libs -> crash @ startup

Comment 14

18 years ago
better summary

Updated

18 years ago
Target Milestone: M12 → M13

Comment 15

18 years ago
gtk/gdk should be dynamically linked, I only show an
undefined ref. to gdk_rgb_init in libwidget_gtk.so.
I'm guessing you've got some static versions of gdk/gtk
around, and/or missing some dynamic versions.  ?
This looks like gtk installation confusion to me.
pushing off m12.

Comment 16

18 years ago
I can't remember when this worked on Solaris - maybe M5?...
I just downloaded the latest build - ftp server said 12/02/99.
I haven't installed any custom GTK, GDK, etc.
I am viewing it through a PC X server if that helps anybody.

e5sey{gcfalck}85: ./mozilla
MOZILLA_FIVE_HOME=/home/iis/mozilla/package

LD_LIBRARY_PATH=/home/iis/mozilla/package:/usr/ucblib:/interleaf/rdm2.7/sun4os5/fulcrum/lib:/interleaf/rdm2.6/sun4os5/fulcrum/lib:/usr/openwin/lib
       SHLIB_PATH=/home/iis/mozilla/package
          LIBPATH=/home/iis/mozilla/package
      MOZ_PROGRAM=./mozilla-bin
      MOZ_TOOLKIT=
        moz_debug=0
     moz_debugger=
nNCL: registering deferred (0)
Segmentation Fault

e5sey{gcfalck}113: gdb mozilla-bin core
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
(no debugging symbols found)...
Core was generated by `./mozilla-bin'.
Program terminated with signal 9, Killed.
Reading symbols from /home/iis/mozilla/package/libraptorgfx.so...
(no debugging symbols found)...done.
Reading symbols from /home/iis/mozilla/package/libmozjs.so...
(no debugging symbols found)...done.
...[snip]...
(gdb) where
#0  0xeda74cc4 in gdk_rgb_set_min_colors ()
#1  0xeda74cc0 in gdk_rgb_set_min_colors ()
#2  0xeda74e90 in gdk_rgb_init ()
#3  0xeda77f78 in gdk_rgb_get_visual ()
#4  0xed9cc674 in nsDeviceContextGTK::Init ()
#5  0xee28eb58 in nsBaseWidget::BaseCreate ()
#6  0xee287a3c in nsWidget::CreateWidget ()
#7  0xee287d5c in nsWidget::Create ()
#8  0xee7bf290 in nsWebShellWindow::Initialize ()
#9  0xee7bd248 in nsAppShellService::JustCreateTopWindow ()
#10 0xee7bd0c0 in nsAppShellService::CreateTopLevelWindow ()
#11 0xedd9683c in nsProfile::LoadDefaultProfileDir ()
#12 0xedd96398 in nsProfile::StartupWithArgs ()
#13 0x1455c in NS_CanRun ()
#14 0x14aa0 in main ()
(gdb) q
e5sey{gcfalck}114: uname -a
SunOS e5sey 5.6 Generic_105181-09 sun4u sparc

Updated

18 years ago
Summary: Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup
Target Milestone: M13 → M12

Comment 17

18 years ago
ok, back to m12.  latest download crashes for me per
greg's last comment.

Updated

18 years ago
Whiteboard: PDT-

Comment 18

18 years ago
will we figure this out before Tuesday?

Comment 19

18 years ago
re: tuesday: I hope so.

Updated

18 years ago
Whiteboard: PDT- → [PDT-]

Updated

18 years ago
Target Milestone: M12 → M13

Comment 20

18 years ago
Moving to M13 since this has been determined to be PDT-.

Updated

18 years ago
Blocks: 21564

Comment 21

18 years ago
*** Bug 20748 has been marked as a duplicate of this bug. ***

Updated

18 years ago
Assignee: mcafee → briano

Comment 22

18 years ago
Brian, it looks like executor (2.5.1) is linking with
gtk libs improperly?  We should try running the build in
that tree to debug this.
(Reporter)

Comment 23

18 years ago
*** Bug 22814 has been marked as a duplicate of this bug. ***

Updated

18 years ago
Whiteboard: [PDT-] → [PDT-] Build packaging problem

Comment 24

18 years ago
Brian is trying a 2.6 build, this might fix the problem.

Updated

18 years ago
Summary: [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Build packaging problem, crash on startup
Whiteboard: [PDT-] Build packaging problem → Build packaging problem

Comment 25

18 years ago
Better summary, I think this is just a packaging mis-cue.
Putting back on PDT radar, we look stupid shipping
Solaris bits that DO NOT WORK ANYWHERE, NOT EVEN ON THE
BUILD MACHINE.

Updated

18 years ago
Whiteboard: Build packaging problem → PDT- Build packaging problem

Comment 26

18 years ago
marking PDT- per PDT, dogfood eaters can do manual install

Comment 27

18 years ago
*** Bug 20678 has been marked as a duplicate of this bug. ***

Updated

18 years ago
Whiteboard: PDT- Build packaging problem → [PDT-] Build packaging problem

Updated

18 years ago
Whiteboard: [PDT-] Build packaging problem → [PDT-] Build packaging problem. Need status from briano.

Comment 28

18 years ago
The problem as I see it (from a Release point of view) is that we are
incorrectly linking with the GTK libs multiple times when we should be
doing so once.  Linking dynamically is _not_ an option, because we can't
release a product that requires the user to find, download, build, and
install _anything_ (especially something like GTK that is changing rapidly,
and which any random version can't be guaranteed to be 100% compatible
with our releases).  The user must be able to simply download our product,
install it, and run.  And that means static GTK.  Period.

Let's fix the _real_ problem here, instead of coming up with hacks that
result in a product we ultimately can't ship.

Updated

18 years ago
Status: NEW → ASSIGNED

Comment 29

18 years ago
Brian - We are not going to ship a statically linked product on Linux.  This
would simply suck.  There are too many issues that can cause far more crashes
doing this.
What did we do with Motif in 4.x?  Why is this solution not good enough here?

(As I understand it, we shipped both dynamic and static versions for Linux/*BSD,
and dynamic-Motif-only for platforms that came with Motif by default.)

Comment 31

18 years ago
Because statically linking GTK with our app won't work correctly.  The whole
idea of components becomes broken at that point.  We can't simply link GTK in to
the main app because that then breaks the idea of being able to change toolkits
on the fly.  Also statically linking GTK with mozilla can cause theme version
conflicts which can kill the browser on startup.
We could statically link GTK into the GTK widget/gfx lib though, right?

That wouldn't solve the theme problem, though.

Comment 33

18 years ago
No, this won't work.  Widget and gfx would then both have the same symbols
inside them, so on broken platforms that have to resolve all the symbols at
runtime, the duplicated symbols would keep the app from starting up.

Comment 34

18 years ago
Which leaves us right back at the beginning....  Do we end up having to
ship the GTK shared libs (prebuilt for each platform) as part of our
product releases (except for Linux...)?

FYI: For 4.x, the only special case platform was Linux, where we provided
both a statically-linked Motif version and a dynamic Motif version.  All
the other platforms were statically linked.

Comment 35

18 years ago
We could make gtk a component! :-)

We need to:
  1) Get the bits we build at night working
  2) Figure out a shipping strategy.

Switching back to dynamic linking fixes (1), let's
do that now and work on (2).  I'd rather have something
working than nothing at all, which is currently the case.
And, in fact, though Mozilla can statically link with the LGPL'd GTK, I'm not
sure other commercial interests can.  I think we should package the GTK libs
right alongside the xpcom and js ones, since we're going to need the
LD_LIBRARY_PATH stuff set up anyway.

Comment 37

18 years ago
*** Bug 13682 has been marked as a duplicate of this bug. ***

Updated

18 years ago
Target Milestone: M13 → M14

Comment 38

18 years ago
briano checked in last night. mcafee's step one above might be able to be
crossed off and a possible large set of folks might be able to run.

let's get in a room and battle this out mano a mano.

Comment 39

18 years ago
bits still crash for me, Solaris 2.6

Comment 40

18 years ago
That's because no new builds have been delivered since 1/14 (due to build
errors of various different types).  Theoretically, tomorrow's builds might
make it all the way.

Comment 41

18 years ago
The Solaris build made it to the FTP server.  Right now, it seems that I
must have installed my own copy of gtk for mozilla to start up.  I'm willing
to do this, although I don't know what the blessed version of gtk is.  Mozilla
seems extremely volatile (crashes & freezes) when I point it at my copy of
gtk, however.  I guess that could be today's nightly build. I dunno.

I've gotten "Virtual memory exceeded in `new'" errors several times, even
though my machine has plenty of memory (and plenty of swap).

Comment 42

18 years ago
briano is no longer at netscape.
over to granrose, cc-ing leaf.
Assignee: briano → granrose
Status: ASSIGNED → NEW

Comment 43

18 years ago
Adding "crash" keyword to all known open crasher bugs.
Keywords: crash

Comment 44

18 years ago
accepted. changed QA contact to mcafee since I can't verify my own bug.  moved 
to M16, since Solaris delivery is not a beta blocker.

looking at the mozilla ftp site we had Solaris 2.6 and 2.51 builds deliver 
yesterday which is more than we've had most of January it seems.  anyone have 
any insights on how we're going to resolve this?
Status: NEW → ASSIGNED
QA Contact: granrose → mcafee
Whiteboard: [PDT-] Build packaging problem. Need status from briano. → [PDT-] Build packaging problem. Need plan.
Target Milestone: M14 → M16

Comment 45

18 years ago
Putting dogfood in the keyword field.
Keywords: dogfood

Comment 46

18 years ago
The 1/27 and 2/3 builds don't crash like before (and use my gtk).
Now I get two messages "Gdk-WARNING **:shmat failed!" The errno set by the call
corresponds to "Too many open files". Adding --no-xshm to the arguments of
mozilla-bin prevents this.
In either case, it gets down to "WEBSHELL+ = 1" and nothing else happens - no
crash but no windows open.
(SunOS 5.5.1 Gtk+-1.2.6)

Updated

18 years ago
Summary: [DOGFOOD] Solaris: Build packaging problem, crash on startup → Solaris: Build packaging problem, crash on startup
What machine are the ftp.mozilla.org bits built on?  I currently have a theory
(which I am still testing) that there is an optimizer bug in gcc 2.7.2.3 that's
causing the hang-at-startup problem.

Comment 48

18 years ago
my comment in this bug log says executor, still valid?
Executor doesn't have a dynamic copy of libg++ and libstdc++, as mentioned
above, this is likely part of the problem.  Even if we could link statically
with those libs (and it sounds like Pavlov thinks we can't), the current linking
setup actually attempts to pull non-PIC code out of those libraries into at
least one or two of the shared libs (libwidget_gtk, if i remember correctly).

Comment 50

18 years ago
not sure what to do with this one.  punting to nobody until someone wants to 
step forward and claim it.
Assignee: granrose → nobody
Status: ASSIGNED → NEW
Keywords: helpwanted
Target Milestone: M16 → ---

Comment 51

18 years ago
*** Bug 16210 has been marked as a duplicate of this bug. ***
Adding Rich Burridge to the CC on this bug; perhaps he will want it.  It's
possible that the fix he checked in to bug 15604 will have fixed this. 
Additionally, if the builds in question were gcc builds, there is a problem with
non-PIC code being pulled into shared libraries (see bug 23759) which could
conceivably be causing problems here as well.  I'm gonna try and look at 23759
again soon.

Updated

18 years ago
Depends on: 15604, 23759
(Assignee)

Comment 53

18 years ago
[richb - 10th April 2000]
I'll take it. This is how we intended to "fix" it for our Solaris Netscape
PR1 version, (whose bits we are currently building in order to give it back
to the mozilla.org site; hi Leaf!):

We have simply built glib, gtk+ and libIDL dynamically, and created a single
.tar.gz distribution which contains the mozilla distribution + those three
libraries + a simple "netscape" script (that'll probably be a copy of the
"mozilla" script). The "netscape" script will be set up to just look for
its dependencies within the distribution directory.

Does this approach sound like the right one to you all?

PS: We are also using the Gnu compilers so expect a 7.8Mb binary distribution
    and not a 16Mb one that you get with SW 5.0 compilers. I'll worry about
    *that* problem somewhere else.
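
A launcher along the lines described above might look roughly like this. It is
only a sketch, not the actual "netscape" script: the function names are
hypothetical, and it assumes the bundled glib, gtk+ and libIDL shared
libraries sit in the same directory as mozilla-bin.

```shell
# Put the distribution directory ahead of any existing search path so the
# bundled libraries are found before anything installed system-wide.
prepend_lib_path() {
    dist_dir=$1
    current=$2
    if [ -n "$current" ]; then
        echo "$dist_dir:$current"
    else
        echo "$dist_dir"
    fi
}

# Set up the environment and replace the shell with the real binary.
launch_from_dist() {
    dist_dir=$1
    shift
    LD_LIBRARY_PATH=$(prepend_lib_path "$dist_dir" "${LD_LIBRARY_PATH:-}")
    MOZILLA_FIVE_HOME=$dist_dir
    export LD_LIBRARY_PATH MOZILLA_FIVE_HOME
    exec "$dist_dir/mozilla-bin" "$@"
}
```

Prepending rather than replacing keeps whatever LD_LIBRARY_PATH the user
already relies on, while still making sure a single bundled copy of gtk/gdk
is the one that gets loaded.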

Comment 54

18 years ago
huh? PR1 on mozilla.org? you must have us confused for a portal company that's
released a mozilla-based browser and called it ``Netscape 6 PR1'' =)

Seriously, rich, if you are planning on putting up a pr1 build a la netscape,
send me private mail.
richb: that approach sounds reasonable to me.  in some ideal world, there might
be some mechanism for the user to configure the browser to use an already
existing set of shared libraries if they happen to already be installed. 
Conceivably split this out into two tarballs so that people who already have
this stuff don't need to download it again?

Assignee: nobody → rich.burridge

Comment 56

18 years ago
What about using Solaris packages?
Maybe you can ship packages like this:
- GTK
- GLIB
- libIDL
- Mozilla core
- Mozilla mail/news
- Mozilla editor
- Mozilla misc
- Mozilla add-ons

If a user already installed a library he/she may skip the matching package...
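
Skipping packages that are already present could be scripted along these
lines. A sketch only: it assumes the SVR4 `pkginfo -q` query (which exits 0
when the named package is installed), and the package names in the usage note
are made up for illustration.

```shell
# Print only those packages from the argument list that are not yet
# installed, so the installer can skip libraries the user already has.
packages_to_install() {
    for pkg in "$@"; do
        if ! pkginfo -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}
```

For example, `packages_to_install EXglib EXgtk EXlibidl EXmozcore` would list
just the pieces still missing on a machine where GTK is already installed.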
(Assignee)

Comment 57

18 years ago
[richb - 11th April 2000]
We're way ahead of you. For our equivalent of PR1, we are just going to do
a .tar.gz and make it available off somewhere like www.sunfreeware.com, but 
for beta2 and beyond, we'll use the SVR4 pkgadd delivery style. The extra
libraries (glib, gtk+ ...) will be in a separate package.

To the bug submitter: are you happy with our proposed fix for this problem?
Can I close the bug, or do you want to wait until Netscape 6 beta2? Thanks.
(Assignee)

Updated

18 years ago
Status: NEW → ASSIGNED

Comment 58

18 years ago
I'm still against hosting gpl binaries that we need to host the source for as
well. dmose, you think we should start hosting gpl licensed binaries, and comply
with the publishing of sources as well? Do you know how long we have to keep
hosting the sources?

Comment 59

18 years ago
Section 6 of the LGPL is what you want to look at:
http://www.gnu.org/copyleft/lgpl.html

The shortened answer is that you must make the source available upon request for
at least 3 yrs. 
If we get the binary builds from somewhere else, though, then we can use 3c in
the GPL to just ``pass along'' the source distribution requirements.  That would
save us some work.

Comment 61

18 years ago
*** Bug 35888 has been marked as a duplicate of this bug. ***

Comment 62

18 years ago
*** Bug 27995 has been marked as a duplicate of this bug. ***

Comment 63

18 years ago
Putting on [dogfood-] radar.
Whiteboard: [PDT-] Build packaging problem. Need plan. → [dogfood-] Build packaging problem. Need plan.

Comment 64

18 years ago
*** Bug 27112 has been marked as a duplicate of this bug. ***

Updated

18 years ago
No longer blocks: 21564

Comment 65

18 years ago
Nightly builds are back and work fine with gtk 1.2.3 fetched from
www.sunfreeware.com as well as gtk 1.2.7 built with WS5.0.  RichB has a
packaging plan.  Marking fixed.

Comment 66

18 years ago
Actually marking fixed this time.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey