Closed Bug 13202 Opened 25 years ago Closed 24 years ago

Solaris: Build packaging problem, crash on startup

Categories

(SeaMonkey :: Build Config, defect, P1)

Sun
Solaris
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: rich.burridge)

References

Details

(Keywords: crash, helpwanted, Whiteboard: [dogfood-] Build packaging problem. Need plan.)

Attachments

(2 files)

Apprunner nightly builds for Solaris have been crashing on startup for about
2 weeks now. Viewer works fine though. I will attach a stack trace. uname -a:
SunOS claudia 5.6 Generic_105181-11 sun4u sparc SUNW,Ultra-2
Assignee: don → cyeh
Component: Browser-General → Build Config
Chris,

Are you folks seeing this problem?
Just as a status update: it's still crashing. I have checked now and then for
the past month (latest build: 1999-10-07-16-M11) and it always crashes with
the same stack trace as the one I have attached.
Priority: P3 → P1
Should a P1 critical have no milestone?
Here is the analysis I sent several of the engineers on 8 September, plus some
more recent discoveries.  I'm a little sad that it hasn't resulted in any
action, so I'll post it here:

> 1) The problem appears to be that gdk_rgb_init calls gdk_rgb_choose_visual,
>    which does:
>
>         visuals = gdk_list_visuals ();
>
>    When this returns, visuals is NULL; subsequently, the code does:
>
>           tmp_list = visuals;
>           best_visual = tmp_list->data;
>
>    Which is clearly the cause of the crash.
>
> 2) So why does gdk_list_visuals() return NULL?  It's pretty simple:
>
>         gdk_list_visuals (void)
>         {
>           GList *list;
>           guint i;
>
>           list = NULL;
>           for (i = 0; i < nvisuals; ++i)
>             list = g_list_append (list, (gpointer) &visuals[i]);
>
>           return list;
>         }
>
>     nvisuals is 0, so the list returned is NULL.
>
>     Sadly, what I can't answer is why nvisuals is empty.  I'll keep poking
>     to see if I can figure out why.
>
>     This leads me to...
>
> 3) Why is it that there are two complete copies of gdk in different
>    shared libraries?  Here's what I mean:
>
> $ dis -F gdk_rgb_init libgfx_gtk.so | head
> disassembly for libgfx_gtk.so
>
> section .text
> gdk_rgb_init()
>         165798:  9d e3 bf 90       save         %sp, -112, %sp
>         16579c:  11 00 00 00       sethi        %hi(XSynchronize), %o0
>         1657a0:  d0 4a 20 00       ldsb         [%o0 + XSynchronize], %o0
>
> $ dis -F gdk_rgb_init libwidget_gtk.so | head
> disassembly for libwidget_gtk.so
>
> section .text
> gdk_rgb_init()
>         1a5a0c:  9d e3 bf 90       save         %sp, -112, %sp
>         1a5a10:  11 00 00 00       sethi        %hi(XSynchronize), %o0
>         1a5a14:  d0 4a 20 00       ldsb         [%o0 + XSynchronize], %o0
>

That was my original mail-- I got some vague response that this "looked like
an X server bug" which I rejected.   Since then I thought some more about this
problem:

I think the issue is that one library is getting loaded after the other, and
effectively interposing on the other's functions.  This is a standard linker
trick, but I think here it is happening by mistake.  Recall that this would
also explain why nvisuals is 0-- it is state private to the gdk library.  In
this case, when the second copy of that library is loaded, it has its own,
*uninitialized* version of the nvisuals variable.
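
One way to double-check the "two copies" theory without a disassembler is to compare the symbol tables of the libraries. The following is only a sketch: it assumes a POSIX-style `nm -P` is available (stripped libraries may need a different option to show dynamic symbols), and the helper names `defined_symbols` and `dup_symbols` are made up for illustration.

```shell
#!/bin/sh
# List the global text (code) symbols a library defines, one
# "library symbol" pair per line. Assumes a POSIX-style `nm -P`, whose
# output columns are: name, type, value, size.
defined_symbols() {
    nm -P "$1" | awk -v lib="$(basename "$1")" '$2 == "T" { print lib, $1 }'
}

# Read "library symbol" pairs on stdin and print each symbol that is
# defined by more than one library, followed by the libraries involved.
dup_symbols() {
    sort -u |
    awk '{ count[$2]++; libs[$2] = libs[$2] " " $1 }
         END { for (s in count) if (count[s] > 1) print s ":" libs[s] }' |
    sort
}

# Example:
#   for lib in libgfx_gtk.so libwidget_gtk.so; do
#       defined_symbols "$lib"
#   done | dup_symbols
```

If gdk_rgb_init shows up in the output for both libgfx_gtk.so and libwidget_gtk.so, each library carries its own statically linked copy, and with it its own private nvisuals.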

To check this, I performed the following experiment using LD_PRELOAD, which
should keep a particular library at the head of the link chain:

First, I installed libgtk and libgdk, etc. in a well-known place (in this case
/usr/local/lib):

$ /bin/ksh
$ export LD_PRELOAD=libgtk.so:libgdk.so
$ export LD_LIBRARY_PATH=/usr/local/lib

Then I ran mozilla, and everything came up.  In this case, I forced the
application to *always* use my copy of gtk/gdk, instead of getting confused
between two other copies.
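
The same experiment can be scoped to a single process, so the preload does not leak into every other command started from that shell. This is a minimal sketch, not part of the original test; the function name `run_with_preload` is hypothetical.

```shell
#!/bin/sh
# Run a command with LD_LIBRARY_PATH and LD_PRELOAD set for that one
# process only. The function body is a subshell, so the calling shell's
# environment is left untouched.
#   $1 = directory holding the shared libraries
#   $2 = colon-separated preload list (for example libgtk.so:libgdk.so)
#   remaining arguments = the command to run
run_with_preload() (
    LD_LIBRARY_PATH=$1
    LD_PRELOAD=$2
    shift 2
    export LD_LIBRARY_PATH LD_PRELOAD
    exec "$@"
)

# Example (paths from the experiment above):
#   run_with_preload /usr/local/lib libgtk.so:libgdk.so ./mozilla
```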

So: I think removing one of the two statically linked copies of libgtk/gdk
should solve this problem.
Attached file: more successful stacktrace
QA Contact: leger → granrose
Updating QA contact to a release person.  Internal Core QA does not test
Solaris.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
*** This bug has been marked as a duplicate of 13160 ***
Status: VERIFIED → REOPENED
Reopen because bug 13160 was fixed but this bug still occurs in build 1999111909
Resolution: DUPLICATE → ---
Clearing DUPLICATE resolution due to reopen.
Assignee: cyeh → chofmann
Status: REOPENED → NEW
did mcafee's recent award winning fixes help this? ;-)
jdunn, do you see this on other unix ports?
On AIX, we ran into this problem EONS ago and the 'fix' is to create
a shared gtk library and link against this instead of the static libs.

on HPUX we are still chasing the problem that was introduced with
superwin, but it is either a similar case to this OR a threading issue

Personally I would love it if we only linked against gtk/gdk... once
instead of the 3-5 times we do it now.
Assignee: chofmann → mcafee
Target Milestone: M12
stealing from chofmann, m12.
Summary: Apprunner crash on startup → Solaris: Duplicate gtk libs -> crash @ startup
better summary
Target Milestone: M12 → M13
gtk/gdk should be dynamically linked, I only show an
undefined ref. to gdk_rgb_init in libwidget_gtk.so.
I'm guessing you've got some static versions of gdk/gtk
around, and/or missing some dynamic versions.  ?
This looks like gtk installation confusion to me.
pushing off m12.
I can't remember when this worked on Solaris - maybe M5?...
I just downloaded the latest build - ftp server said 12/02/99.
I haven't installed any custom GTK, GDK, etc.
I am viewing it through a PC X server, if that helps anybody.

e5sey{gcfalck}85: ./mozilla
MOZILLA_FIVE_HOME=/home/iis/mozilla/package

LD_LIBRARY_PATH=/home/iis/mozilla/package:/usr/ucblib:/interleaf/rdm2.7/sun4os5/fulcrum/lib:/interleaf/rdm2.6/sun4os5/fulcrum/lib:/usr/openwin/lib
       SHLIB_PATH=/home/iis/mozilla/package
          LIBPATH=/home/iis/mozilla/package
      MOZ_PROGRAM=./mozilla-bin
      MOZ_TOOLKIT=
        moz_debug=0
     moz_debugger=
nNCL: registering deferred (0)
Segmentation Fault

e5sey{gcfalck}113: gdb mozilla-bin core
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
(no debugging symbols found)...
Core was generated by `./mozilla-bin'.
Program terminated with signal 9, Killed.
Reading symbols from /home/iis/mozilla/package/libraptorgfx.so...
(no debugging symbols found)...done.
Reading symbols from /home/iis/mozilla/package/libmozjs.so...
(no debugging symbols found)...done.
...[snip]...
(gdb) where
#0  0xeda74cc4 in gdk_rgb_set_min_colors ()
#1  0xeda74cc0 in gdk_rgb_set_min_colors ()
#2  0xeda74e90 in gdk_rgb_init ()
#3  0xeda77f78 in gdk_rgb_get_visual ()
#4  0xed9cc674 in nsDeviceContextGTK::Init ()
#5  0xee28eb58 in nsBaseWidget::BaseCreate ()
#6  0xee287a3c in nsWidget::CreateWidget ()
#7  0xee287d5c in nsWidget::Create ()
#8  0xee7bf290 in nsWebShellWindow::Initialize ()
#9  0xee7bd248 in nsAppShellService::JustCreateTopWindow ()
#10 0xee7bd0c0 in nsAppShellService::CreateTopLevelWindow ()
#11 0xedd9683c in nsProfile::LoadDefaultProfileDir ()
#12 0xedd96398 in nsProfile::StartupWithArgs ()
#13 0x1455c in NS_CanRun ()
#14 0x14aa0 in main ()
(gdb) q
e5sey{gcfalck}114: uname -a
SunOS e5sey 5.6 Generic_105181-09 sun4u sparc
Summary: Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup
Target Milestone: M13 → M12
ok, back to m12.  latest download crashes for me per
greg's last comment.
Whiteboard: PDT-
will we figure this out before Tuesday?
re: tuesday: I hope so.
Whiteboard: PDT- → [PDT-]
Target Milestone: M12 → M13
Moving to M13 since this has been determined to be PDT-.
Blocks: 21564
*** Bug 20748 has been marked as a duplicate of this bug. ***
Assignee: mcafee → briano
Brian, it looks like executor (2.5.1) is linking with
gtk libs improperly?  We should try running the build in
that tree to debug this.
*** Bug 22814 has been marked as a duplicate of this bug. ***
Whiteboard: [PDT-] → [PDT-] Build packaging problem
Brian is trying a 2.6 build, this might fix the problem.
Summary: [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Build packaging problem, crash on startup
Whiteboard: [PDT-] Build packaging problem → Build packaging problem
Better summary, I think this is just a packaging mis-cue.
Putting back on PDT radar, we look stupid shipping
Solaris bits that DO NOT WORK ANYWHERE, NOT EVEN ON THE
BUILD MACHINE.
Whiteboard: Build packaging problem → PDT- Build packaging problem
marking PDT- per PDT, dogfood eaters can do manual install
*** Bug 20678 has been marked as a duplicate of this bug. ***
Whiteboard: PDT- Build packaging problem → [PDT-] Build packaging problem
Whiteboard: [PDT-] Build packaging problem → [PDT-] Build packaging problem. Need status from briano.
The problem as I see it (from a Release point of view) is that we are
incorrectly linking with the GTK libs multiple times when we should be
doing so once.  Linking dynamically is _not_ an option, because we can't
release a product that requires the user to find, download, build, and
install _anything_ (especially something like GTK that is changing rapidly,
and which any random version can't be guaranteed to be 100% compatible
with our releases).  The user must be able to simply download our product,
install it, and run.  And that means static GTK.  Period.

Let's fix the _real_ problem here, instead of coming up with hacks that
result in a product we ultimately can't ship.
Status: NEW → ASSIGNED
Brian - We are not going to ship a statically linked product on Linux.  This
would simply suck.  There are too many issues that can cause far more crashes
doing this.
What did we do with Motif in 4.x?  Why is this solution not good enough here?

(As I understand it, we shipped both dynamic and static versions for Linux/*BSD,
and dynamic-Motif-only for platforms that came with Motif by default.)
Because statically linking GTK with our app won't work correctly.  The whole
idea of components becomes broken at that point.  We can't simply link GTK in to
the main app because that then breaks the idea of being able to change toolkits
on the fly.  Also statically linking GTK with mozilla can cause theme version
conflicts which can kill the browser on startup.
We could statically link GTK into the GTK widget/gfx lib though, right?

That wouldn't solve the theme problem, though.
No, this won't work.  Widget and gfx would then both contain the same symbols,
so on broken platforms that have to resolve all the symbols at runtime, the
duplicated symbols would keep the app from starting up.
Which leaves us right back at the beginning....  Do we end up having to
ship the GTK shared libs (prebuilt for each platform) as part of our
product releases (except for Linux...)?

FYI: For 4.x, the only special case platform was Linux, where we provided
both a statically-linked Motif version and a dynamic Motif version.  All
the other platforms were statically linked.
We could make gtk a component! :-)

We need to:
  1) Get the bits we build at night working
  2) Figure out a shipping strategy.

Switching back to dynamic linking fixes (1), let's
do that now and work on (2).  I'd rather have something
working than nothing at all, which is currently the case.
And, in fact, though Mozilla can statically link with the LGPL'd GTK, I'm not
sure other commercial interests can.  I think we should package the GTK libs
right alongside the xpcom and js ones, since we're going to need the
LD_LIBRARY_PATH stuff set up anyway.
*** Bug 13682 has been marked as a duplicate of this bug. ***
Target Milestone: M13 → M14
briano checked in last night. mcafee's step one above might be able to be
crossed off and a possible large set of folks might be able to run.

Let's get in a room and battle this out mano a mano.
bits still crash for me, Solaris 2.6
That's because no new builds have been delivered since 1/14 (due to build
errors of various different types).  Theoretically, tomorrow's builds might
make it all the way.
The Solaris build made it to the FTP server.  Right now, it seems that I
must have installed my own copy of gtk for mozilla to start up.  I'm willing
to do this, although I don't know what the blessed version of gtk is.  Mozilla
seems extremely volatile (crashes & freezes) when I point it at my copy of
gtk, however.  I guess that could be today's nightly build. I dunno.

I've gotten "Virtual memory exceeded in `new'" errors several times, even
though my machine has plenty of memory (and plenty of swap).
briano is no longer at netscape.
over to granrose, cc-ing leaf.
Assignee: briano → granrose
Status: ASSIGNED → NEW
Adding "crash" keyword to all known open crasher bugs.
Keywords: crash
accepted. changed QA contact to mcafee since I can't verify my own bug.  moved 
to M16, since Solaris delivery is not a beta blocker.

looking at the mozilla ftp site we had Solaris 2.6 and 2.51 builds deliver 
yesterday which is more than we've had most of January it seems.  anyone have 
any insights on how we're going to resolve this?
Status: NEW → ASSIGNED
QA Contact: granrose → mcafee
Whiteboard: [PDT-] Build packaging problem. Need status from briano. → [PDT-] Build packaging problem. Need plan.
Target Milestone: M14 → M16
Putting dogfood in the keyword field.
Keywords: dogfood
The 1/27 and 2/3 builds don't crash like before (and use my gtk).
Now I get two messages "Gdk-WARNING **:shmat failed!" The errno set by the call
corresponds to "Too many open files". Adding --no-xshm to the arguments of
mozilla-bin prevents this.
In either case, it gets down to "WEBSHELL+ = 1" and nothing else happens - no
crash but no windows open.
(SunOS 5.5.1 Gtk+-1.2.6)
Summary: [DOGFOOD] Solaris: Build packaging problem, crash on startup → Solaris: Build packaging problem, crash on startup
What machine are the ftp.mozilla.org bits built on?  I currently have a theory
(which I am still testing) that there is an optimizer bug in gcc 2.7.2.3 that's
causing the hang-at-startup problem.
my comment in this bug log says executor, still valid?
Executor doesn't have a dynamic copy of libg++ and libstdc++; as mentioned
above, this is likely part of the problem.  Even if we could link statically
with those libs (and it sounds like Pavlov thinks we can't), the current linking
setup actually attempts to pull non-PIC code out of those libraries into at
least one or two of the shared libs (libwidget_gtk, if I remember correctly).
not sure what to do with this one.  punting to nobody until someone wants to 
step forward and claim it.
Assignee: granrose → nobody
Status: ASSIGNED → NEW
Keywords: helpwanted
Target Milestone: M16 → ---
*** Bug 16210 has been marked as a duplicate of this bug. ***
Adding Rich Burridge to the CC on this bug; perhaps he will want it.  It's
possible that the fix he checked in to bug 15604 will have fixed this. 
Additionally, if the builds in question were gcc builds, there is a problem with
non-PIC code being pulled into shared libraries (see bug 23759) which could
conceivably be causing problems here as well.  I'm gonna try and look at 23759
again soon.
[richb - 10th April 2000]
I'll take it. This is how we intended to "fix" it for our Solaris Netscape
PR1 version, (whose bits we are currently building in order to give it back
to the mozilla.org site; hi Leaf!):

We have simply built glib, gtk+ and libIDL dynamically, and created a single
.tar.gz distribution which contains the mozilla distribution + those three
libraries + a simple "netscape" script (that'll probably be a copy of the
"mozilla" script). The "netscape" script will be set up to just look for
its dependencies within the distribution directory.
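
A rough sketch of what such a wrapper script might do follows. It is not the actual "netscape" or "mozilla" script: the function names `prepend_path` and `launch` are hypothetical, and it assumes the bundled libraries sit in the same directory as mozilla-bin.

```shell
#!/bin/sh
# Prepend a directory to a colon-separated search path. An empty existing
# path yields just the directory, avoiding a trailing colon (an empty path
# element is commonly treated as the current directory).
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Launch the browser binary that sits next to the wrapper script, resolving
# the bundled glib, gtk+ and libIDL from the distribution directory first.
#   $1 = path of the wrapper script (normally "$0")
launch() {
    dist_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    shift
    MOZILLA_FIVE_HOME=$dist_dir
    LD_LIBRARY_PATH=$(prepend_path "$dist_dir" "${LD_LIBRARY_PATH:-}")
    export MOZILLA_FIVE_HOME LD_LIBRARY_PATH
    exec "$dist_dir/mozilla-bin" "$@"
}

# A wrapper script would end with:  launch "$0" "$@"
```

Putting the distribution directory first means the bundled libraries win over any other copy found later on the path, which is the point of shipping them alongside the binary.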

Does this approach sound like the right one to you all?

PS: We are also using the Gnu compilers so expect a 7.8Mb binary distribution
    and not a 16Mb one that you get with SW 5.0 compilers. I'll worry about
    *that* problem somewhere else.
huh? PR1 on mozilla.org? you must have us confused for a portal company that's
released a mozilla-based browser and called it ``Netscape 6 PR1'' =)

Seriously, rich, if you are planning on putting up a pr1 build a la netscape,
send me private mail.
richb: that approach sounds reasonable to me.  in some ideal world, there might
be some mechanism for the user to configure the browser to use an already
existing set of shared libraries if they happen to already be installed. 
Conceivably split this out into two tarballs so that people who already have
this stuff don't need to download it again?
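
As a sketch of that kind of mechanism, a launcher could probe a list of candidate directories and use the first one that has both libraries. The function name `find_gtk_dir` and the bundled location are hypothetical, and the check only looks at file names, not at whether the installed version is actually compatible.

```shell
#!/bin/sh
# Print the first directory, in argument order, that contains both
# libgtk.so and libgdk.so. Returns non-zero if none of them does.
find_gtk_dir() {
    for dir in "$@"; do
        if [ -f "$dir/libgtk.so" ] && [ -f "$dir/libgdk.so" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example: prefer an installed copy, fall back to a bundled one.
# "$dist_dir/gtk" is a hypothetical location for the bundled libraries.
#   gtk_dir=$(find_gtk_dir /usr/local/lib "$dist_dir/gtk") || exit 1
```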

Assignee: nobody → rich.burridge
What about using Solaris packages?
Maybe you can ship packages like this:
- GTK
- GLIB
- libIDL
- Mozilla core
- Mozilla mail/news
- Mozilla editor
- Mozilla misc
- Mozilla add-ons

If a user already installed a library he/she may skip the matching package...
[richb - 11th April 2000]
We're way ahead of you. For our equivalent of PR1, we are just going to do
a .tar.gz and make it available off somewhere like www.sunfreeware.com, but 
for beta2 and beyond, we'll use the SVR4 pkgadd delivery style. The extra
libraries (glib, gtk+ ...) will be in a separate package.

To the bug submitter: are you happy with our proposed fix for this problem?
Can I close the bug, or do you want to wait until Netscape 6 beta2? Thanks.
Status: NEW → ASSIGNED
I'm still against hosting gpl binaries that we need to host the source for as
well. dmose, you think we should start hosting gpl licensed binaries, and comply
with the publishing of sources as well? Do you know how long we have to keep
hosting the sources?
Section 6 of the LGPL is what you want to look at:
http://www.gnu.org/copyleft/lgpl.html

The shortened answer is that you must make the source available upon request for
at least 3 yrs. 
If we get the binary builds from somewhere else, though, then we can use 3c in
the GPL to just ``pass along'' the source distribution requirements.  That would
save us some work.
*** Bug 35888 has been marked as a duplicate of this bug. ***
*** Bug 27995 has been marked as a duplicate of this bug. ***
Putting on [dogfood-] radar.
Whiteboard: [PDT-] Build packaging problem. Need plan. → [dogfood-] Build packaging problem. Need plan.
*** Bug 27112 has been marked as a duplicate of this bug. ***
No longer blocks: 21564
Nightly builds are back and work fine with gtk 1.2.3 fetched from
www.sunfreeware.com as well as gtk 1.2.7 built with WS5.0.  RichB has a
packaging plan.  Marking fixed.
Actually marking fixed this time.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey