Closed Bug 13202 Opened 26 years ago Closed 25 years ago

Solaris: Build packaging problem, crash on startup

Categories

(SeaMonkey :: Build Config, defect, P1)

Sun
Solaris
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: rich.burridge)

References

Details

(Keywords: crash, helpwanted, Whiteboard: [dogfood-] Build packaging problem. Need plan.)

Attachments

(2 files)

Apprunner nightly builds for Solaris have been crashing on startup for about 2 weeks now. Viewer works fine though. I will attach a stack trace. uname -a: SunOS claudia 5.6 Generic_105181-11 sun4u sparc SUNW,Ultra-2
Assignee: don → cyeh
Component: Browser-General → Build Config
Chris, are you folks seeing this problem?
Just as a status update: it's still crashing. I have checked now and then for the past month (latest build: 1999-10-07-16-M11) and it always crashes with the same stack trace as the one I have attached.
Priority: P3 → P1
Should a P1 critical have no milestone?
Here is the analysis I sent several of the engineers on 8 September, plus some more recent discoveries. I'm a little sad that it hasn't resulted in any action, so I'll post it here:

> 1) The problem appears to be that gdk_rgb_init calls gdk_rgb_choose_visual,
> which does:
>
>     visuals = gdk_list_visuals ();
>
> When this returns, visuals is NULL; subsequently, the code does:
>
>     tmp_list = visuals;
>     best_visual = tmp_list->data;
>
> Which is clearly the cause of the crash.
>
> 2) So why does gdk_list_visuals() return NULL? It's pretty simple:
>
>     gdk_list_visuals (void)
>     {
>       GList *list;
>       guint i;
>
>       list = NULL;
>       for (i = 0; i < nvisuals; ++i)
>         list = g_list_append (list, (gpointer) &visuals[i]);
>
>       return list;
>     }
>
> nvisuals is 0, so the list returned is NULL.
>
> Sadly, what I can't answer is why nvisuals is empty. I'll keep poking
> to see if I can figure out why.
>
> This leads me to...
>
> 3) Why is it that there are two complete copies of gdk in different
> shared libraries? Here's what I mean:
>
> $ dis -F gdk_rgb_init libgfx_gtk.so | head
> disassembly for libgfx_gtk.so
>
> section .text
> gdk_rgb_init()
>     165798: 9d e3 bf 90 save %sp, -112, %sp
>     16579c: 11 00 00 00 sethi %hi(XSynchronize), %o0
>     1657a0: d0 4a 20 00 ldsb [%o0 + XSynchronize], %o0
>
> $ dis -F gdk_rgb_init libwidget_gtk.so | head
> disassembly for libwidget_gtk.so
>
> section .text
> gdk_rgb_init()
>     1a5a0c: 9d e3 bf 90 save %sp, -112, %sp
>     1a5a10: 11 00 00 00 sethi %hi(XSynchronize), %o0
>     1a5a14: d0 4a 20 00 ldsb [%o0 + XSynchronize], %o0

That was my original mail-- I got some vague response that this "looked like an X server bug" which I rejected. Since then I thought some more about this problem: I think the issue is that one library is getting loaded after the other, and effectively interposing on the other's functions. This is a standard linker trick, but I think here it is happening by mistake.
Recall that this would also explain why nvisuals is 0-- it is state private to the gdk library. In this case, when the second copy of that library is loaded, it has its own, *uninitialized* version of the nvisuals variable.

To check this, I performed the following experiment using LD_PRELOAD, which should keep a particular library at the head of the link chain. First, I installed libgtk and libgdk, etc. in a well-known place (in this case /usr/local/lib):

$ /bin/ksh
$ export LD_PRELOAD=libgtk.so:libgdk.so
$ export LD_LIBRARY_PATH=/usr/local/lib

Then I ran mozilla, and everything came up. In this case, I forced the application to *always* use my copy of gtk/gdk, instead of getting confused between two other copies.

So: I think removing one of the two statically linked copies of libgtk/gdk should solve this problem.
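[Editor's sketch] The theory above can be modelled in a few lines of C. This is not GDK code: struct gdk_copy, copy_init and choose_visual are hypothetical names, and the integer "visuals" are placeholders. It only illustrates how two copies of a library each carry their own private nvisuals, so that initializing one copy leaves the other empty, and how a guard on the empty case avoids the NULL dereference described in point 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the private state of ONE statically
   linked copy of GDK (the real state lives in GDK's own globals). */
struct gdk_copy {
    int nvisuals;     /* stays 0 until this particular copy is initialized */
    int visuals[4];   /* placeholder visual records (bit depths) */
};

/* Models initialization: only the copy it is called on is populated. */
static void copy_init(struct gdk_copy *g)
{
    assert(g != NULL);
    g->nvisuals = 2;
    g->visuals[0] = 8;
    g->visuals[1] = 24;
}

/* Models choosing the best visual, with the guard the crashing code
   lacked: an empty set yields NULL instead of a dereference. */
static const int *choose_visual(const struct gdk_copy *g)
{
    if (g->nvisuals == 0)
        return NULL;

    const int *best = &g->visuals[0];
    for (int i = 1; i < g->nvisuals; ++i)
        if (g->visuals[i] > *best)
            best = &g->visuals[i];
    return best;
}
```

If one copy is initialized and the other is not, choose_visual returns a visual for the first and NULL for the second. That matches the reported symptom: the code that crashes appears to be reading the state of a copy that was never initialized.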
Attached file: more successful stacktrace
QA Contact: leger → granrose
Updating QA contact to a release person. Internal Core QA does not test Solaris.
Status: NEW → RESOLVED
Closed: 26 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
*** This bug has been marked as a duplicate of 13160 ***
Status: VERIFIED → REOPENED
Reopen because bug 13160 was fixed but this bug still occurs in build 1999111909
Resolution: DUPLICATE → ---
Clearing DUPLICATE resolution due to reopen.
Assignee: cyeh → chofmann
Status: REOPENED → NEW
did mcafee's recent award winning fixes help this? ;-) jdunn, do you see this on other unix ports?
On AIX, we ran into this problem EONS ago and the 'fix' is to create a shared gtk library and link against this instead of the static libs. On HPUX we are still chasing the problem that was introduced with superwin, but it is either a similar case to this OR a threading issue. Personally I would love it if we only linked against gtk/gdk once instead of the 3-5 times we do it now.
Assignee: chofmann → mcafee
Target Milestone: M12
stealing from chofmann, m12.
Summary: Apprunner crash on startup → Solaris: Duplicate gtk libs -> crash @ startup
better summary
Target Milestone: M12 → M13
gtk/gdk should be dynamically linked, I only show an undefined ref. to gdk_rgb_init in libwidget_gtk.so. I'm guessing you've got some static versions of gdk/gtk around, and/or missing some dynamic versions. ? This looks like gtk installation confusion to me. pushing off m12.
I can't remember when this worked on Solaris - maybe M5?... I just downloaded the latest build - ftp server said 12/02/99. I haven't installed any custom GTK, GDK, etc. I am viewing it through a PC X server if that helps anybody.

e5sey{gcfalck}85: ./mozilla
MOZILLA_FIVE_HOME=/home/iis/mozilla/package
LD_LIBRARY_PATH=/home/iis/mozilla/package:/usr/ucblib:/interleaf/rdm2.7/sun4os5/fulcrum/lib:/interleaf/rdm2.6/sun4os5/fulcrum/lib:/usr/openwin/lib
SHLIB_PATH=/home/iis/mozilla/package
LIBPATH=/home/iis/mozilla/package
MOZ_PROGRAM=./mozilla-bin
MOZ_TOOLKIT=
moz_debug=0
moz_debugger=
nNCL: registering deferred (0)
Segmentation Fault

e5sey{gcfalck}113: gdb mozilla-bin core
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
(no debugging symbols found)...
Core was generated by `./mozilla-bin'.
Program terminated with signal 9, Killed.
Reading symbols from /home/iis/mozilla/package/libraptorgfx.so...
(no debugging symbols found)...done.
Reading symbols from /home/iis/mozilla/package/libmozjs.so...
(no debugging symbols found)...done.
...[snip]...
(gdb) where
#0 0xeda74cc4 in gdk_rgb_set_min_colors ()
#1 0xeda74cc0 in gdk_rgb_set_min_colors ()
#2 0xeda74e90 in gdk_rgb_init ()
#3 0xeda77f78 in gdk_rgb_get_visual ()
#4 0xed9cc674 in nsDeviceContextGTK::Init ()
#5 0xee28eb58 in nsBaseWidget::BaseCreate ()
#6 0xee287a3c in nsWidget::CreateWidget ()
#7 0xee287d5c in nsWidget::Create ()
#8 0xee7bf290 in nsWebShellWindow::Initialize ()
#9 0xee7bd248 in nsAppShellService::JustCreateTopWindow ()
#10 0xee7bd0c0 in nsAppShellService::CreateTopLevelWindow ()
#11 0xedd9683c in nsProfile::LoadDefaultProfileDir ()
#12 0xedd96398 in nsProfile::StartupWithArgs ()
#13 0x1455c in NS_CanRun ()
#14 0x14aa0 in main ()
(gdb) q

e5sey{gcfalck}114: uname -a
SunOS e5sey 5.6 Generic_105181-09 sun4u sparc
Summary: Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup
Target Milestone: M13 → M12
ok, back to m12. latest download crashes for me per greg's last comment.
Whiteboard: PDT-
will we figure this out before Tuesday?
re: tuesday: I hope so.
Whiteboard: PDT- → [PDT-]
Target Milestone: M12 → M13
Moving to M13 since this has been determined to be PDT-.
Blocks: 21564
*** Bug 20748 has been marked as a duplicate of this bug. ***
Assignee: mcafee → briano
Brian, it looks like executor (2.5.1) is linking with gtk libs improperly? We should try running the build in that tree to debug this.
*** Bug 22814 has been marked as a duplicate of this bug. ***
Whiteboard: [PDT-] → [PDT-] Build packaging problem
Brian is trying a 2.6 build, this might fix the problem.
Summary: [DOGFOOD] Solaris: Duplicate gtk libs -> crash @ startup → [DOGFOOD] Solaris: Build packaging problem, crash on startup
Whiteboard: [PDT-] Build packaging problem → Build packaging problem
Better summary, I think this is just a packaging mis-cue. Putting back on PDT radar, we look stupid shipping Solaris bits that DO NOT WORK ANYWHERE, NOT EVEN ON THE BUILD MACHINE.
Whiteboard: Build packaging problem → PDT- Build packaging problem
marking PDT- per PDT, dogfood eaters can do manual install
*** Bug 20678 has been marked as a duplicate of this bug. ***
Whiteboard: PDT- Build packaging problem → [PDT-] Build packaging problem
Whiteboard: [PDT-] Build packaging problem → [PDT-] Build packaging problem. Need status from briano.
The problem as I see it (from a Release point of view) is that we are incorrectly linking with the GTK libs multiple times when we should be doing so once. Linking dynamically is _not_ an option, because we can't release a product that requires the user to find, download, build, and install _anything_ (especially something like GTK that is changing rapidly, and which any random version can't be guaranteed to be 100% compatible with our releases). The user must be able to simply download our product, install it, and run. And that means static GTK. Period. Let's fix the _real_ problem here, instead of coming up with hacks that result in a product we ultimately can't ship.
Status: NEW → ASSIGNED
Brian - We are not going to ship a statically linked product on Linux. This would simply suck. There are too many issues that can cause far more crashes doing this.
What did we do with Motif in 4.x? Why is this solution not good enough here? (As I understand it, we shipped both dynamic and static versions for Linux/*BSD, and dynamic-Motif-only for platforms that came with Motif by default.)
Because statically linking GTK with our app won't work correctly. The whole idea of components becomes broken at that point. We can't simply link GTK in to the main app because that then breaks the idea of being able to change toolkits on the fly. Also statically linking GTK with mozilla can cause theme version conflicts which can kill the browser on startup.
We could statically link GTK into the GTK widget/gfx lib though, right? That wouldn't solve the theme problem, though.
No, this won't work. Widget and gfx would then both have the same symbols inside them, so on broken platforms that have to resolve all the symbols at runtime, the duplicated symbols would keep the app from starting up.
Which leaves us right back at the beginning.... Do we end up having to ship the GTK shared libs (prebuilt for each platform) as part of our product releases (except for Linux...)? FYI: For 4.x, the only special case platform was Linux, where we provided both a statically-linked Motif version and a dynamic Motif version. All the other platforms were statically linked.
We could make gtk a component! :-) We need to: 1) Get the bits we build at night working 2) Figure out a shipping strategy. Switching back to dynamic linking fixes (1), let's do that now and work on (2). I'd rather have something working than nothing at all, which is currently the case.
And, in fact, though Mozilla can statically link with the LGPL'd GTK, I'm not sure other commercial interests can. I think we should package the GTK libs right alongside the xpcom and js ones, since we're going to need the LD_LIBRARY_PATH stuff set up anyway.
*** Bug 13682 has been marked as a duplicate of this bug. ***
Target Milestone: M13 → M14
briano checked in last night. mcafee's step one above might be able to be crossed off and a possible large set of folks might be able to run. let's get in a room and battle this out mano a mano.
bits still crash for me, Solaris 2.6
That's because no new builds have been delivered since 1/14 (due to build errors of various different types). Theoretically, tomorrow's builds might make it all the way.
The Solaris build made it to the FTP server. Right now, it seems that I must have installed my own copy of gtk for mozilla to start up. I'm willing to do this, although I don't know what the blessed version of gtk is. Mozilla seems extremely volatile (crashes & freezes) when I point it at my copy of gtk, however. I guess that could be today's nightly build. I dunno. I've gotten "Virtual memory exceeded in `new'" errors several times, even though my machine has plenty of memory (and plenty of swap).
briano is no longer at netscape. over to granrose, cc-ing leaf.
Assignee: briano → granrose
Status: ASSIGNED → NEW
Adding "crash" keyword to all known open crasher bugs.
Keywords: crash
accepted. changed QA contact to mcafee since I can't verify my own bug. moved to M16, since Solaris delivery is not a beta blocker. looking at the mozilla ftp site we had Solaris 2.6 and 2.51 builds deliver yesterday which is more than we've had most of January it seems. anyone have any insights on how we're going to resolve this?
Status: NEW → ASSIGNED
QA Contact: granrose → mcafee
Whiteboard: [PDT-] Build packaging problem. Need status from briano. → [PDT-] Build packaging problem. Need plan.
Target Milestone: M14 → M16
Putting dogfood in the keyword field.
Keywords: dogfood
The 1/27 and 2/3 builds don't crash like before (and use my gtk). Now I get two messages "Gdk-WARNING **:shmat failed!" The errno set by the call corresponds to "Too many open files". Adding --no-xshm to the arguments of mozilla-bin prevents this. In either case, it gets down to "WEBSHELL+ = 1" and nothing else happens - no crash but no windows open. (SunOS 5.5.1 Gtk+-1.2.6)
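[Editor's sketch] For context, shmat() is specified to fail with EMFILE when the process would exceed the system limit on attached shared memory segments, and EMFILE is the errno whose message reads "Too many open files". The sketch below shows the general pattern of checking shmat() and reporting the error so that a caller can fall back to a path that does not use shared memory. attach_or_fallback is a hypothetical helper, not a GDK function, and it makes no claim about how GDK handles the failure internally.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: try to attach a System V shared memory segment.
   On failure, store errno in *err and return NULL so the caller can
   switch to a non-shared-memory code path instead of aborting. */
static void *attach_or_fallback(int shmid, int *err)
{
    assert(err != NULL);

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        *err = errno;   /* e.g. EMFILE when the attach limit is exceeded */
        return NULL;
    }

    *err = 0;
    return addr;
}
```

A caller would test the result for NULL and compare *err against EMFILE to tell an exhausted attach limit apart from other failures such as an invalid segment id.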
Summary: [DOGFOOD] Solaris: Build packaging problem, crash on startup → Solaris: Build packaging problem, crash on startup
What machine are the ftp.mozilla.org bits built on? I currently have a theory (which I am still testing) that there is an optimizer bug in gcc 2.7.2.3 that's causing the hang-at-startup problem.
my comment in this bug log says executor, still valid?
Executor doesn't have a dynamic copy of libg++ and libstdc++, as mentioned above, this is likely part of the problem. Even if we could link statically with those libs (and it sounds like Pavlov thinks we can't), the current linking setup actually attempts to pull non-PIC code out of those libraries into at least one or two of the shared libs (libwidget_gtk, if i remember correctly).
not sure what to do with this one. punting to nobody until someone wants to step forward and claim it.
Assignee: granrose → nobody
Status: ASSIGNED → NEW
Keywords: helpwanted
Target Milestone: M16 → ---
*** Bug 16210 has been marked as a duplicate of this bug. ***
Adding Rich Burridge to the CC on this bug; perhaps he will want it. It's possible that the fix he checked in to bug 15604 will have fixed this. Additionally, if the builds in question were gcc builds, there is a problem with non-PIC code being pulled into shared libraries (see bug 23759) which could conceivably be causing problems here as well. I'm gonna try and look at 23759 again soon.
[richb - 10th April 2000] I'll take it. This is how we intended to "fix" it for our Solaris Netscape PR1 version (whose bits we are currently building in order to give it back to the mozilla.org site; hi Leaf!): We have simply built glib, gtk+ and libIDL dynamically, and created a single .tar.gz distribution which contains the mozilla distribution + those three libraries + a simple "netscape" script (that'll probably be a copy of the "mozilla" script). The "netscape" script will be set up to just look for its dependencies within the distribution directory. Does this approach sound like the right one to you all? PS: We are also using the Gnu compilers so expect a 7.8Mb binary distribution and not a 16Mb one that you get with SW 5.0 compilers. I'll worry about *that* problem somewhere else.
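[Editor's sketch] The wrapper described above is a shell script. The C fragment below only illustrates its central step, under the assumption that "look for its dependencies within the distribution directory" means putting that directory at the front of LD_LIBRARY_PATH, as the existing "mozilla" script output quoted earlier in this bug suggests. prepend_lib_dir and the example paths are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<dist_dir>" or "<dist_dir>:<existing>" into
   out, so the bundled libraries are searched before anything else.
   Returns 0 on success, -1 if the result does not fit in out_len bytes. */
static int prepend_lib_dir(const char *dist_dir, const char *existing,
                           char *out, size_t out_len)
{
    assert(dist_dir != NULL && out != NULL);

    int n;
    if (existing != NULL && existing[0] != '\0')
        n = snprintf(out, out_len, "%s:%s", dist_dir, existing);
    else
        n = snprintf(out, out_len, "%s", dist_dir);

    /* snprintf returns the length it wanted to write; treat truncation
       as an error rather than exporting a clipped search path. */
    return (n >= 0 && (size_t) n < out_len) ? 0 : -1;
}
```

A launcher built on this would export the resulting string as LD_LIBRARY_PATH and then exec mozilla-bin from the distribution directory.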
huh? PR1 on mozilla.org? you must have us confused for a portal company that's released a mozilla-based browser and called it ``Netscape 6 PR1'' =) Seriously, rich, if you are planning on putting up a pr1 build a la netscape, send me private mail.
richb: that approach sounds reasonable to me. in some ideal world, there might be some mechanism for the user to configure the browser to use an already existing set of shared libraries if they happen to already be installed. Conceivably split this out into two tarballs so that people who already have this stuff don't need to download it again?
Assignee: nobody → rich.burridge
What about using Solaris packages? Maybe you can ship packages like this:
- GTK
- GLIB
- libIDL
- Mozilla core
- Mozilla mail/news
- Mozilla editor
- Mozilla misc
- Mozilla add-ons
If a user has already installed a library, he/she may skip the matching package...
[richb - 11th April 2000] We're way ahead of you. For our equivalent of PR1, we are just going to do a .tar.gz and make it available off somewhere like www.sunfreeware.com, but for beta2 and beyond, we'll use the SVR4 pkgadd delivery style. The extra libraries (glib, gtk+ ...) will be in a separate package. To the bug submitter: are you happy with our proposed fix for this problem? Can I close the bug, or do you want to wait until Netscape 6 beta2? Thanks.
Status: NEW → ASSIGNED
I'm still against hosting gpl binaries that we need to host the source for as well. dmose, you think we should start hosting gpl licensed binaries, and comply with the publishing of sources as well? Do you know how long we have to keep hosting the sources?
Section 6 of the LGPL is what you want to look at: http://www.gnu.org/copyleft/lgpl.html The shortened answer is that you must make the source available upon request for at least 3 yrs.
If we get the binary builds from somewhere else, though, then we can use 3c in the GPL to just ``pass along'' the source distribution requirements. That would save us some work.
*** Bug 35888 has been marked as a duplicate of this bug. ***
*** Bug 27995 has been marked as a duplicate of this bug. ***
Putting on [dogfood-] radar.
Whiteboard: [PDT-] Build packaging problem. Need plan. → [dogfood-] Build packaging problem. Need plan.
*** Bug 27112 has been marked as a duplicate of this bug. ***
No longer blocks: 21564
Nightly builds are back and work fine with gtk 1.2.3 fetched from www.sunfreeware.com as well as gtk 1.2.7 built with WS5.0. RichB has a packaging plan. Marking fixed.
Actually marking fixed this time.
Status: ASSIGNED → RESOLVED
Closed: 26 years ago → 25 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey