[PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.

VERIFIED FIXED

Status

defect
P1
blocker
VERIFIED FIXED
20 years ago
19 years ago

People

(Reporter: laurel, Assigned: blizzard)

Details

(Whiteboard: Suggest dropping support for glibc2.0)

(Reporter)

Description

20 years ago
1999-06-24-12-m8 Linux rh5.2

The build is crashing on startup.  Happening on two of two machines tried within
the mailnewsqa group.  No stack trace.

Comment 1

20 years ago
Here is the log from Phillip Bond:

here is std out:

WARNING -- -editor is going away, use -edit instead!
width was not set
height was not set
**************************************************
nsComponentManager:
Load(/u/phillip/seamonkey/linux/99062412/package/components/librdf.so)
FAILED with error:
/u/phillip/seamonkey/linux/99062412/package/components/librdf.so:
undefined symbol: _._32CompositeArcsInOutEnumeratorImpl
**************************************************

and here is some gdb magic:

Program received signal SIGSEGV, Segmentation fault.
0x80e201b0 in ?? ()
(gdb) bt
#0  0x80e201b0 in ?? ()
#1  0x400733ed in nsComponentManagerImpl::LoadFactory ()
#2  0x400734e6 in nsComponentManagerImpl::FindFactory ()
#3  0x40073694 in nsComponentManagerImpl::CreateInstance ()
#4  0x400768be in nsComponentManager::CreateInstance ()
#5  0x4007703c in nsServiceManagerImpl::GetService ()
#6  0x400775ae in nsServiceManager::GetService ()
#7  0x403fe086 in nsChromeRegistry::InitRegistry ()
#8  0x4021ce1e in nsNetlibService::OpenStream ()
#9  0x40236de4 in nsDocumentBindInfo::Bind ()
#10 0x40236d3e in nsDocumentBindInfo::Bind ()
#11 0x40235d49 in nsDocLoaderImpl::LoadDocument ()
#12 0x4023a56e in nsWebShell::DoLoadURL ()
#13 0x4023a980 in nsWebShell::LoadURL ()
#14 0x4023a321 in nsWebShell::LoadURL ()
#15 0x4001b3c4 in nsWebShellWindow::Initialize ()
#16 0x4001a831 in nsAppShellService::CreateTopLevelWindow ()
#17 0x8051f1b in main ()
adding dp to the cc list, since this is happening in component-stuff.

Updated

20 years ago
Assignee: chofmann → waterson
Looks like rdf has an undefined symbol. Waterson?

Updated

20 years ago
QA Contact: jimmylee

Comment 4

20 years ago
This is not a XPInstall (aka SmartUpdate) issue.  Removing myself as QA Contact.

Comment 5

20 years ago
WORKSFORME. Please try removing dist/bin/component.reg and restarting.

Comment 6

20 years ago
This is happening to almost everyone in QA.  I can't imagine folks doing
anything different with today's respin builds than they did with previous Linux
builds.  Anyway, will forward this on.

Comment 7

20 years ago
(QA: component.reg is in the package directory with precompiled builds.)

Removing this file didn't fix my crash. It may work for people outside of QA,
but a) it still prevents us from doing any testing; and b) there are going to be
people "out there" with machines just like ours, too. Well, I need a vacation
anyway....

Updated

20 years ago
Summary: 1999-06-24-12-m8 Linux build crashes on startup. → [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup.

Comment 8

20 years ago
The interesting thing to find out is what changed between yesterday's and today's
builds to cause apprunner not to work for the majority of us running it in QA
(and I'm sure others if they have similar setups to ours, which we've been
using since the Seamonkey project began).
Being unable to run on 5.2 is unacceptable; this should be considered a blocking
bug, and is going to keep the tree closed tomorrow.

Updated

20 years ago
Target Milestone: M8

Comment 10

20 years ago
Putting on the M8 Target Milestone radar. :-)

Updated

20 years ago
Priority: P3 → P1
I agree with leaf. In fact, I am pretty surprised that we opened the tree
without resolving this.

Let us get this fixed.

Waterson, I wouldn't have thought component.reg would have anything to do with
this. Did you check this on a debug or release build?

Could you resolve this? (Either you convince the testing folks or they convince
you. Any middle ground is considered not resolved.)
Waterson (and a small cadre of xhead-types) is working on this diligently as we
idly comment this bug.

The verifications passed because they are pseudo-automated, and because they
were done on a redhat 6.0 machine.

Comment 13

20 years ago
This happens because of a race condition in the dynamic loader. Two threads are
trying to relocate code at the same time and this royally hoses stuff. Thread #1
is running the component manager. Evil Thread #2 comes from normal static
linkage from xpinstall. Reassigning to dveditz to deal.

Comment 14

20 years ago
Wow! awesome catch waterson.

Is exclusively locking the code that does the loading an option? I would like to
know more about this. How is static code from xpinstall causing a race?
I've taken the liberty of turning off the xpinstall build in
mozilla/xpinstall/Makefile.in

I'd prefer if tomorrow's builds were testable by qa while this is getting
resolved.
Assignee: dveditz → sgehani
Samir, leaf has turned off XPInstall in Unix again until this is fixed. Sounds
like Waterson has more details than are in the bug report.

Comment 18

20 years ago
the problem is that linux's libdl is not thread-safe, so when multiple threads
spin up and start relocating DLLs, they bump into each other. This may happen on
other platforms as well.

I don't think this is even really specific to mozilla; it's just
specific to applications that link against many, many DLLs and have many threads.

One solution we talked about was to override dlopen(), dlsym(), etc. to call
PR_LoadLibrary/PR_FindSymbol/etc., then modify PR_LoadLibrary/PR_FindSymbol/etc.
to call _dlopen()/_dlsym()/etc.

Adding wan-teh to the CC list, because he may have some comments.
Aren't we safe if all our dll loads happen via PR_LoadLibrary()? Is xpinstall
doing dlopen() instead of PR_LoadLibrary(), or does dlopen() happen implicitly as a
consequence of PR_LoadLibrary() of xpinstall.so?
Linking against _symbol rather than symbol is begging for pain, I think
(especially on Linux, where the glibc guys go to ever-increasing pains to hide
symbols that aren't part of the API).

On the other hand, glibc's threadsafety story is, um, mildly weak, especially
pre-2.1.  This may be our only choice. =/

I'm going to bug some glibc folks and see if there's not a better way.
I am told by glibc people that glibc 2.0's dl* isn't threadsafe (``doesn't
attempt threadsafety'', whee!), but 2.1's is supposed to be.

_dlopen doesn't exist in glibc 2.1, so we're going to need some nice autoconf
magic to make this work anyway.  This is going to be _so_ much fun.  If we're
racing with ld.so-driven relocations, I'm not sure that overriding dlopen will
actually work, but I guess we'll find out.

Also, can someone confirm that it doesn't happen on RH6.0 with glibc 2.1?

Comment 22

20 years ago
I would first try adding a lock in PR_LoadLibrary
to serialize its dlopen calls.  Is it possible to find
out the glibc version at run time?

Overriding standard library functions is always a
pain.
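
On the run-time version question, here is a minimal sketch. It assumes the C library exports gnu_get_libc_version() (declared in <gnu/libc-version.h>); as far as I know that call first appeared in glibc 2.1, so on a 2.0 system the symbol would be missing and would have to be probed for instead of linked against. The helper names version_older_than and running_on_old_glibc are made up for illustration.

```c
#include <stdio.h>
#include <gnu/libc-version.h>

/* Hypothetical helper: compare a "major.minor[.patch]" string against a
 * threshold. Returns 1 if the string is older than major.minor, 0 if it is
 * not, and -1 if the string cannot be parsed. */
int version_older_than(const char *version, int major, int minor)
{
    int have_major = 0;
    int have_minor = 0;

    if (sscanf(version, "%d.%d", &have_major, &have_minor) != 2)
        return -1;
    if (have_major != major)
        return have_major < major;
    return have_minor < minor;
}

/* Returns 1 when the running glibc predates 2.1, the first release the
 * glibc people describe as having thread-safe dl* functions. */
int running_on_old_glibc(void)
{
    return version_older_than(gnu_get_libc_version(), 2, 1);
}
```

A caller could use the result to decide whether to apply a workaround at all, so that glibc 2.1 systems are left alone.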

Comment 23

20 years ago
I agree, but the problem isn't that OUR code is calling dlopen() - it's that
libdl (part of glibc, not mozilla) is calling it when it resolves link-time
dependencies.

basically what's happening is this (or some minor variation, use your
imagination)
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, which in
turn calls dlopen() or dlsym() or something.

so if we overload dlopen() to call PR_LoadLibrary (assuming PR_LoadLibrary is
threadsafe, or is made to be threadsafe) then it should look something like:
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- we override dlsym() with PR_FindSymbol(), so that gets called instead
- PR_FindSymbol locks some mutex
- PR_FindSymbol calls _dlsym()
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, but this
time it blocks because PR_FindSymbol() is preventing it from continuing.
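
The serialized sequence above can be sketched as follows, assuming one process-wide mutex shared by the load path and the lookup path. The names locked_load and locked_find are stand-ins, not the real PR_LoadLibrary/PR_FindSymbol code. The sketch only covers calls that go through these wrappers; it cannot reach lookups that the dynamic linker performs on its own.

```c
#include <dlfcn.h>
#include <pthread.h>

/* One lock for every dl* call made through the wrappers below. */
static pthread_mutex_t dl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the load path: dlopen() runs with the lock held. */
void *locked_load(const char *path)
{
    void *handle;

    pthread_mutex_lock(&dl_lock);
    handle = dlopen(path, RTLD_LAZY);
    pthread_mutex_unlock(&dl_lock);
    return handle;
}

/* Stand-in for the lookup path: blocks while a load is in progress,
 * and makes a concurrent load wait while the lookup runs. */
void *locked_find(void *handle, const char *name)
{
    void *symbol;

    pthread_mutex_lock(&dl_lock);
    symbol = dlsym(handle, name);
    pthread_mutex_unlock(&dl_lock);
    return symbol;
}
```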

Updated

20 years ago
Assignee: sgehani → wtc

Comment 24

20 years ago
Reassigning to wtc as this appears to be an issue best dealt with by NSPR. Bug
#8971 has been entered specific to XPInstall.

Updated

20 years ago
Component: XPInstall → NSPR
Product: MailNews → NSPR

Comment 25

20 years ago
Based on Samir's comments, correcting component from mailnews to nspr since this
bug is not mail specific.

Updated

20 years ago
QA Contact: srinivas

Comment 26

20 years ago
Setting QA Contact

Comment 27

20 years ago
Added srinivas and larryh to the cc list.

Comment 28

20 years ago
If waterson@netscape.com's comments on 06/25/99 00:35
are correct, the root cause of this crash is that
libdl in glibc 2.0 is not thread safe.  Because
not all the dlopen calls are made through NSPR
(e.g., as alecf@netscape.com pointed out on
06/26/99 16:48, libdl is also calling dlopen
when it resolves link-time dependencies), we
cannot fix or work around this problem at the
NSPR level.

Larry and I have two suggestions:
1. Make all the dl* calls from the main thread.
2. Ask glibc 2.0 maintainers to make the
   dl* functions thread-safe.  Larry has submitted
   a bug report to bug-glibc@gnu.org.

Overriding dlopen, dlsym, etc., as suggested by
alecf@netscape.com on 06/25/99 15:30, may not be
feasible.  Larry and I examined the glibc 2.0.7
source code.  While there is the pair dlopen
and _dl_open, there is only dlsym but no _dl_sym.
This means it's essentially impossible to wrap
dlsym.

Comment 29

20 years ago
I sent the following to bug-glibc@gnu.org

-------- Original Message --------
Subject: bug glibc 2.0.7 dlopen() is not thread-safe
Date: Tue, 29 Jun 1999 17:43:36 -0700
From: Lawrence Hardiman <LarryH@Netscape.COM>
Organization: Netscape Communications Corporation, Mountain View CA, USA
To: bug-glibc@gnu.org

Sorry if this has been reported before, and fixed. I did not know
where or how to search a GNU bug database.

Abstract: dlopen() and related functions are not thread safe

Description:
glibc 2.0.7 dlopen() may yield unpredictable results when used in
a highly threaded environment. Users see various faults and other
unpredictable results.

glibc 2.1.1 does not exhibit this behavior. Examination of source
for dlopen() shows that in glibc 2.1.1 dlopen() and related
functions are thread-safe.

Is there a variant of glibc 2.0 in which dlopen(), etc. is
thread-safe?

See also: http://bugzilla.mozilla.org/show_bug.cgi?id=8849
=============================
... and got the following answer:

-------- Original Message --------
Subject: Re: bug glibc 2.0.7 dlopen() is not thread-safe
Date: 29 Jun 1999 17:46:17 -0700
From: Ulrich Drepper <drepper@cygnus.com>
Reply-To: drepper@cygnus.com (Ulrich Drepper)
To: larryh@netscape.com (Lawrence Hardiman)
CC: bug-glibc@gnu.org
References: <37796838.49B304FD@Netscape.COM>

larryh@netscape.com (Lawrence Hardiman) writes:

> Abstract: dlopen() and related functions are not thread safe

Use glibc 2.1.

--
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------
=======================================

Comment 30

20 years ago
Another suggestion: In the source for glibc (2.0 and 2.1) the code for libdl.so
appears to be contained in .../elf/*. This suggests that this library can be
built independently of the rest of glibc. Consider building the libdl.so from
glibc 2.1 and using it with the rest of glibc 2.0.

Packaging a replacement libdl with the browser may be more palatable than
requiring complete replacement of glibc. ... Give it a whirl.
I don't think that repackaging libdl.so will be enough: ld.so-driven dlopen
stuff may not use the system libdl.so, though I'm sure it uses the same source
code.  (I imagine a bootstrapping issue there: how does ld.so get the dlopen
symbol from libdl.so?  It can't use libdl.so, that's for sure!)

We _could_ package a replacement ld.so as well and start mozilla with that, if
glibc2.1's ld.so is compatible with glibc2.0's.  When do dlthings happen in the
code after startup?  Can we not synchronize them via PR_LoadLibrary locks, once
the initial ld.so scramble is done?

I guess I'm curious as to what _exactly_ is happening on the various threads
when this happens?  Does ld.so really do lazy symbol stuff unavoidably?  Does
LD_BIND_NOW=1 in the initial environment help us (at the cost of some startup
time, probably)?

Updated

20 years ago
Assignee: wtc → briano
Target Milestone: M8 → M9

Comment 32

20 years ago
try m9. reassign to briano to see if he can come up with a creative solution.
Severity: blocker → normal
Summary: [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup. → [PP] 1999-06-24-12-m8 Linux may crash on startup.
I've removed the "blocker" status because it isn't crashing at startup anymore
(we turned off the XPInstall thread). Now it's just a lurking bug waiting to
get us some other time.
Filed http://developer.redhat.com/bugzilla/show_bug.cgi?id=4011 with Red Hat, in
case they're still fixing problems with 5.x.

Updated

20 years ago
Assignee: briano → chofmann

Comment 35

20 years ago
I'm clearly not the right owner for this bug.  Reassigning to chofmann.

Comment 36

20 years ago
libdl.so from the glibc 2.1 is incompatible with the glibc 2.0.x (it is safe to
say that all internal ELF handling is changed, data structures are changed,
there is no reason for anybody to spend time on backporting libdl)

At this point in time it seems that our only hope is to add a number of
__libc_lock calls in the libdl code that comes with glibc 2.0.x. However, I am
not in a position where I can test or have the test applications doing this, so
there must be somebody else that goes in deep and starts throwing thread locks
left and right. :-(
(Assignee)

Updated

20 years ago
Assignee: chofmann → blizzard
(Assignee)

Comment 37

20 years ago
I'll take this bug and try to get it fixed.  I can probably talk RH into releasing an
official 5.2 update for it.

Updated

20 years ago
Summary: [PP] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup.

Comment 38

20 years ago
I'm seeing this now, on the 7/29 build, and it does prevent startup (and I'm
hearing the same thing from other 5.2 users), so I'm adding [blocker] back.

Updated

20 years ago
Summary: [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup.

Comment 39

20 years ago
I'm modifying the summary to reflect a more current date.

Updated

20 years ago
Summary: [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup.

Comment 40

20 years ago
adding RH5.2 in the title.
(Assignee)

Comment 41

20 years ago
*** Bug 9292 has been marked as a duplicate of this bug. ***

Updated

20 years ago
Severity: normal → blocker

Comment 42

20 years ago
Putting on the blocker radar
(Assignee)

Updated

20 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 43

20 years ago
I'm able to reproduce this problem with Red Hat 6.0 and glibc targeted for Red
Hat 6.1.  I'm continuing to look into it.

Comment 44

20 years ago
well, after lots of hacking in NSPR, I'm concerned that we're not actually
going to be able to override dlopen and friends. There are a few problems:
- there is no _dlopen, _dlclose, _dlsym, or _dlerror on linux, which means it's
very difficult to call the "real" versions of these functions
- the way we would call these functions without the _ versions would be to use
dlsym with the RTLD_NEXT flag, but it's kind of hard to call dlsym() when that's
the function you're trying to override.

One interesting fix that seems to make a big difference is to switch to using
RTLD_LAZY in PR_LoadLibrary instead of RTLD_NOW. This reduces the number of
symbol lookups (i.e. dlsym() calls) at load-time to almost nothing, and spreads
the actual dlsym() calls over the lifetime of the app instance, so the chances
of race conditions are much lower. This is not a fix to the problem though,
merely a way of reducing the probability of it occurring.

This basically means changing
#ifdef LINUX
#define  _PR_DLOPEN_FLAGS RTLD_NOW
#else
#define  _PR_DLOPEN_FLAGS RTLD_LAZY
#endif /* LINUX */

to just
#define  _PR_DLOPEN_FLAGS RTLD_LAZY

The problem is that the original #ifdef LINUX was put there to fix a Linux
porting issue that directory server was having back in February.

Larry - what do you think about changing prlink.c back the way it was, without
the special case for linux?

Updated

20 years ago
Whiteboard: fix on the way

Comment 45

20 years ago
alecf says:
Chris Blizzard has a fix for glibc itself - he's done the fix for the
unreleased redhat 6.1 and is now backporting it to redhat 6.0 and 5.2.

Comment 46

20 years ago
It's also happening with SUSE 6.1. Is it possible to use RH glibcs when chris
has finished his backport?
(Assignee)

Comment 47

20 years ago
RPMS are now available for Red Hat 6.0 and 5.2 for this problem.  Please note
that these ARE NOT OFFICIAL RED HAT RPMS.  They are not signed and pretty much
untested.   You are taking your own life into your hands by installing them.  I
hope you know how to use sash, just in case.  If it helps, I'm running the 6.0
rpms now and haven't seen any problems.

Assuming these work well, they will probably be released as official Red Hat
RPMS.

As for people running SUSE, I don't know what to tell you.  If you install the
Red Hat glibc, your system might be unusable.  I'd bug SUSE about it.

I'd like feedback on this.  Let me know if it works or if it doesn't.

URL:

http://people.redhat.com/blizzard/glibc/
(Assignee)

Comment 48

20 years ago
The glibc update for 5.2 doesn't seem to work.  I'm working on trying to sort
it out.  Keep tuned.

Comment 49

20 years ago
the RTLD_LAZY trick doesn't seem to work for me anymore (funny, it did
yesterday!) so that's not an option either.

Comment 50

20 years ago
oh, and just to clarify: I had two NSPR hacks, both of which seem to be
worthless now:
- use RTLD_LAZY like all the other platforms (doesn't seem to fix the problem
now)
- override and wrap dlsym(), dlopen() etc with locks so that they can't be
called simultaneously. This turned out to be a flop too because glibc doesn't do
the common "weak <symbol>/strong _<symbol>" that most other platforms do
(which means if I override dlsym() I can't get back to the original dlsym())

So I think now we're relying on chris' genius.

Comment 51

20 years ago
*** Bug 10600 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 52

20 years ago
I've talked to Ulrich about either making the dl library thread safe or porting
the dl library from 2.1 back to glibc 2.0.  He says there's zero chance of that
happening because of the architecture changes in the two versions.

So, this presents an interesting problem.  Since glibc 2.0 is not thread safe
with regards to dynamic loading, nspr is going to be very hard to set up to work
around the problem.  The problem is that anytime that you want to use dlsym()
you will have to suspend the operation of any threads.  The reason is that even
in normal operation other threads will be resolving symbols via the same
operations that dlsym() will.  They will inevitably step on each other's toes.

So, there are two ways that we can go here.

1. hack nspr to do this
2. don't use pthreads for a glibc 2.0 release and just use userland threads.

Comment 53

20 years ago
ok, so I've learned that hacking NSPR is next to impossible for dlsym() because
dlsym() happens to also be the function that needs to be available to hack NSPR.

More clearly:
- we need to hack NSPR to override dlsym()
- hacking NSPR to override any function requires the _use_ of the function
dlsym()
- dlsym() is not available for use because we'd be overriding it.
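
To make the circularity concrete, here is a sketch of the usual interposition idiom, assuming a glibc that provides RTLD_NEXT under _GNU_SOURCE. The wrapper is deliberately named my_dlopen (a hypothetical name) instead of dlopen so the sketch stays harmless. Forwarding to the real dlopen() works only because dlsym() is still the original; once dlsym() itself is the function being overridden, nothing is left to perform that lookup.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>

typedef void *(*dlopen_fn)(const char *, int);

static pthread_mutex_t wrap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: take a lock, then forward to the next dlopen()
 * in the library search order. */
void *my_dlopen(const char *path, int flags)
{
    void *handle = NULL;
    dlopen_fn real_dlopen;

    pthread_mutex_lock(&wrap_lock);
    /* This lookup is the catch: it needs a dlsym() that has not been
     * overridden. */
    real_dlopen = (dlopen_fn)dlsym(RTLD_NEXT, "dlopen");
    if (real_dlopen != NULL)
        handle = real_dlopen(path, flags);
    pthread_mutex_unlock(&wrap_lock);
    return handle;
}
```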

Comment 54

20 years ago
userland threads, eh? Wan-Teh or larry, how does one create a user-level thread
in NSPR? I'm confused between the type (USER vs. SYSTEM) and the scope (LOCAL
vs. GLOBAL vs. GLOBAL_UNBOUND). Would type=SYSTEM, scope=LOCAL create a
user-level, NSPR-scheduled thread? I'm experimenting with this now.

I've got another kind of interesting idea brewing too:

The basic problem is that libc is calling non-threadsafe calls when creating a
thread.

We've been trying to solve the problem by protecting those calls. What if we
instead solved the problem by protecting the creation of threads, so that all
threads are suspended between the PR_CreateThread() and the actual kick-off of
the thread function.

so it essentially looks like this:

PR_CreateThread(start_function)
  lock(thread_creation_lock)
  data.func=start_function
  pthread_create(_pt_root, data) /* data contains start_function */

_pt_root(data)
  unlock(thread_creation_lock)
  data->func();   /* this is start_function() */

this will at least synchronize all thread creation, which may be when some of
this symbol resolution is happening.

I'm going to try a userlevel thread first, then the above locking mechanism.

Comment 55

20 years ago
ok, I can't seem to get userlevel threads working:
- using PR_LOCAL_THREAD is the same as PR_GLOBAL_THREAD with pthreads
- compiling NSPR with CLASSIC_NSPR=1 (to force NSPR to use user-level threads)
makes it crash in _PR_CPU_Idle

I'm trying to do the NSPR locking thing I just mentioned but I'm not having much
luck with that because you don't seem to be allowed to release a lock that
another thread has locked. I'm still fiddling with this though.

Updated

20 years ago
Whiteboard: fix on the way → attempting two solutions

Comment 56

20 years ago
A clarification on the type and scope arguments to PR_CreateThread:

You should almost always set the thread type to USER; the SYSTEM type was meant
primarily for use by JVM.
The default build of NSPR on most Unix platforms supports GLOBAL scope threads
only; each GLOBAL scope thread is a pthread.
When building NSPR using CLASSIC_NSPR=1 you get LOCAL scope (user-level) threads
only.

Are you seeing the crash in _PR_CPU_Idle() with the checked in version of
NSPR source? If so, do you have a stack trace?
And yes, locks have to be unlocked by the owner of the lock.
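
Since a lock has to be released by its owner, the hand-off in the earlier thread-creation pseudocode needs a primitive without ownership. Below is a minimal sketch assuming POSIX unnamed semaphores, which any thread may post. The names gated_create, thread_root and set_flag are made up and are not NSPR functions. This only serializes thread start-up; it does nothing about symbol resolution that happens later in a thread's life.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

struct start_data {
    void (*func)(void *);
    void *arg;
};

static sem_t creation_gate;
static pthread_once_t gate_once = PTHREAD_ONCE_INIT;

static void gate_init(void)
{
    sem_init(&creation_gate, 0, 1);   /* one thread creation in flight */
}

/* Runs in the new thread: reopen the gate, then call the real function. */
static void *thread_root(void *p)
{
    struct start_data data = *(struct start_data *)p;

    free(p);
    sem_post(&creation_gate);         /* posted by the child, not the creator */
    data.func(data.arg);
    return NULL;
}

/* Hypothetical creation wrapper: waits until the previously created
 * thread has reached thread_root before starting another one. */
int gated_create(pthread_t *thread, void (*func)(void *), void *arg)
{
    struct start_data *data = malloc(sizeof *data);

    if (data == NULL)
        return -1;
    data->func = func;
    data->arg = arg;

    pthread_once(&gate_once, gate_init);
    sem_wait(&creation_gate);
    if (pthread_create(thread, NULL, thread_root, data) != 0) {
        sem_post(&creation_gate);     /* creation failed: reopen the gate */
        free(data);
        return -1;
    }
    return 0;
}

/* Example thread body used to exercise the wrapper. */
void set_flag(void *arg)
{
    *(int *)arg = 1;
}
```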

Comment 57

20 years ago
Ok, I put locks in PR_CreateThread to serialize thread creation, and then added
those locks to all the dlopen(), dlsym() etc calls... and we STILL have the same
problem.

I'm going to try to make user level threads work now. I'm not sure what the
problem is... userlevel threads work fine in nsprpub/pr/tests/threads.c

Comment 58

20 years ago
Oh, sspitzer had an interesting idea that also didn't work, unfortunately: if
loading the library fails then try again up to 10 times. In our tests it always
worked on the second attempt to load the library, but then the pointer we got
back from the next PR_FindSymbol (for NSFindFactory, etc) was garbage and it
crashes immediately.
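
For reference, the retry idea has roughly the shape below. This is a sketch only: load_with_retry is a made-up name, and RTLD_NOW is assumed because that is what the Linux build of prlink.c used at the time. As noted above, a non-NULL handle from a retried load proved nothing in our tests, since the symbol lookup that followed still returned garbage.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: try dlopen() up to max_attempts times and return
 * the first handle that comes back non-NULL, or NULL if every attempt fails. */
void *load_with_retry(const char *path, int max_attempts)
{
    int attempt;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        void *handle = dlopen(path, RTLD_NOW);
        if (handle != NULL)
            return handle;
    }
    return NULL;
}
```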

Comment 59

20 years ago
Ok, I'm not really having any problems with the stock Redhat 6 install.

I now personally think:
- we should solicit help from the net to get userlevel threads working on redhat
5.2 (aka glibc2)
- we should encourage everyone else to just upgrade to redhat 6.0 (aka glibc2.1)

I think by the time this product is released, RH52 will be old skool. RH6.1 will
probably be released by then anyway and even 6.0 will be old. glibc2.0 is just
too f***ed to waste our time with.

Updated

20 years ago
Whiteboard: attempting two solutions → Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2)

Comment 60

20 years ago
Updating summary to reflect my recommendation.
I know lots of people working just fine on Redhat 6.0 (2.2 kernel, 2.1 glibc)
and on Redhat 5.2 (2.0 kernel, 2.0 glibc)

We can leave it to the net to try and fix this one, though after the sheer
volume of work chris and I have put into this, I'll be surprised if anything
comes out of it.

Comment 61

20 years ago
Is there any real evidence which glibc function is triggering the dl calls?
I'm skeptical that it was really during pthread_create.

I think the idea of serializing the PR_CreateThread calls is fundamentally the
right solution for glibc 2.0, you just have to find the right function that's
triggering the dl calls.

The only dl calls off the top of my head that it might be would be the
nsswitch stuff; perhaps you need to serialize the first call to any resolver
functions or something like that? It shouldn't be too hard to do a grep over
the glibc sources to find all the locations that could trigger dl calls from
outside that section but I haven't got a libc source tree handy.

Comment 62

20 years ago
well, serializing PR_CreateThread calls didn't help anything....
blizzard might have more comments on the glibc thing, but he's spent a lot of
time hacking at it.

Comment 63

20 years ago
*** Bug 4303 has been marked as a duplicate of this bug. ***

Comment 64

20 years ago
blizzard: could you let us know if the current glibc in redhat rawhide also
addresses this? (2.1.2-3)

thanks.
(Assignee)

Comment 65

20 years ago
rpm -qp --changelog glibc-2.1.2-3.i386.rpm reveals this:

* Mon Aug 02 1999 Cristian Gafton <gafton@redhat.com>

- upgraded snapshot to get the ld.so fixes for thread safety

So the answer is "yes."

Comment 66

20 years ago
We are dropping 5.2 support...shall we mark this Resolved/Won't Fix?

Comment 67

20 years ago
To get a feel for how often this occurs here are 25 runs on my machine:

ggggbggbbggggggbbgbggbggb (g=good, b=bad)

8 load failures out of 25 runs.

Here's a sample of the errors,

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: GetStyleContext__C7nsFramePP15nsIStyleContext

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: Destroy__7nsFrameR14nsIPresContext

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol:
GetNextPrevLineFromeBlockFrame__7nsFrameP15nsIFocusTracker11nsDirectionP8nsIFrameiiPP10nsIContentPiSc

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi

Comment 68

20 years ago
I think either we should leave this open with no target fix and help wanted, or
close it resolved/wont fix.

Comment 69

20 years ago
Help Wanted sounds like a good idea for this bug -- maybe some of the glibc
gurus will be willing to look at fixes for other versions of Linux.

Comment 70

20 years ago
I tried to upgrade the glibc 2.0 from rh 5.2 to glibc 2.1 from rh 6.0.  A bunch
of stuff had to be upgraded as well, including binutils, kernel-headers,
kernel-source and compilers.

Everything seems to be working as expected, except for the debugger.  If you use
the 5.2 debugger, it (the debugger) hangs right after the first thread is
created.

If I use the 6.0 debugger, it (the debugger) core dumps right away.

So this franken-redhat setup might be a workaround for people that don't want to
upgrade to 6.0.

Of course, it's completely at your own risk, because it hasn't been QAed at all.

Updated

20 years ago
Target Milestone: M9 → M10

Comment 71

20 years ago
I've added a note to get this into the M9 release notes
as a high profile item.  Update this bug with
additional notes and comments to go into the M9 and future
release notes.
http://bugzilla.mozilla.org/show_bug.cgi?id=11352

Moving the rest of this work to M10 since it's not something
we will address directly in Seamonkey M9.

Comment 72

20 years ago
*** Bug 11311 has been marked as a duplicate of this bug. ***

Comment 73

20 years ago
Hi, all.

I ran into this bug with the CVS tree from today (8/19/99). When I ran
viewer I was getting strange symbol resolving errors on my RedHat 5.2
system. Doing this seemed to fix the problem (for me at least).

LD_BIND_NOW=1
./apprunner

I ran into this same problem when porting a JNI app to the JDK
port from blackdown. In that case, I also needed to preload
libpthread.so like this before it would work correctly.

LD_PRELOAD=libpthread.so

The guys over at blackdown were a big help in tracking down
this problem, so it might help to have one of your experts
talk to one of their experts. The fellow who wrote the fix
they are using in the JDK can be contacted at this address.

Juergen Kreileder <kreilede@issan.cs.uni-dortmund.de>

He is very friendly and helpful, so I am sure he would not
mind helping the mozilla team resolve this nasty bug.
(Assignee)

Updated

20 years ago
Target Milestone: M10 → M11
(Assignee)

Comment 74

20 years ago
Moving out to M11

Comment 75

20 years ago
AFAIK RH 5.x support has been dropped. This does work in 6, right? If so, we
should probably close this out as WONTFIX? Anyone?

Comment 76

20 years ago
I'd like to leave this open as "HELP WANTED"... because it would suck not to
support 5.2, but I don't think we have anyone right now who actually wants to
bother with this.

What's the HELP WANTED procedure? Let's do that.
Summary: [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup. → [PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.
Whiteboard: Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2) → Suggest dropping support for glibc2.0
HELP WANTED-ified, removed [BLOCKER] and updated status field.  (It's not a
kernel issue.)

Updated

20 years ago
Target Milestone: M11 → M20

Comment 78

20 years ago
moving out to m-way-far-away

Comment 79

20 years ago
everyone, I haven't seen apprunner crash on startup in a long
time... shall we close this one out?
(Assignee)

Comment 80

20 years ago
You haven't seen the crash on 5.2 in a long time?

Comment 81

20 years ago
there are SOME redhat 5.2 systems that have never seen this problem.
I don't know what it is about them. I'd like to leave this open unless we get
confirmation from more than 5 or 10 people that they aren't seeing this.

Comment 82

20 years ago
A conditional "works for me": I have a glibc 2.0 system (Debian slink, not
everything is RH :-) and it does work _if_ I set LD_BIND_NOW=1 and
LD_PRELOAD=libpthread.so.0. Until recently (about 2 weeks ago) I had to also
preload libgtk and libgdk, but this does not work any more - seems something in
the library dependencies inside mozilla has changed/improved.

I think it should get verified with someone who knows glibc internals (and not
just tells everyone to upgrade) if this fix really avoids the race condition,
and then set in the startup script, depending on library version.
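
If the workaround were applied from a small launcher instead of a shell script, it might look like the sketch below. The function names set_loader_workarounds and launch are hypothetical, and the variable values are simply the ones reported to work in this bug. Note that this overwrites any LD_PRELOAD the user already had, and whether it really avoids the race is exactly the open question above.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

/* Set the environment the dynamic linker reads at exec time.
 * Returns 0 on success, -1 on failure. */
int set_loader_workarounds(void)
{
    /* Resolve all symbols at load time instead of lazily. */
    if (setenv("LD_BIND_NOW", "1", 1) != 0)
        return -1;
    /* Preload the thread library, as reported to help on glibc 2.0. */
    if (setenv("LD_PRELOAD", "libpthread.so.0", 1) != 0)
        return -1;
    return 0;
}

/* Replace the current process with the real binary. The variables only
 * take effect in the exec'ed program. Returns only on failure. */
int launch(const char *binary, char *const argv[])
{
    if (set_loader_workarounds() != 0)
        return -1;
    execv(binary, argv);
    return -1;
}
```

A launcher's main() would call launch("./apprunner", argv), ideally only after checking that the installed C library is a 2.0 release.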
(Assignee)

Comment 83

20 years ago
Did anyone ever try preloading glibc 2.1 on one of those systems to see if it
worked?  That might be an acceptable workaround.
(Assignee)

Updated

20 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 20 years ago
Resolution: --- → WONTFIX
(Assignee)

Comment 84

20 years ago
There are ideas in this bug report on workarounds but it's not something that we
can actually fix.

Comment 85

19 years ago
Marking Verified/Won't Fix.
Status: RESOLVED → VERIFIED
(Assignee)

Comment 86

19 years ago
Please ignore the spam.  Changing address.
Assignee: blizzard → blizzard
Status: VERIFIED → NEW
(Assignee)

Comment 87

19 years ago
busted when I reassigned
Status: NEW → RESOLVED
Last Resolved: 20 years ago → 19 years ago
Resolution: WONTFIX → FIXED
(Assignee)

Comment 88

19 years ago
busted when I reassigned
Status: RESOLVED → VERIFIED

Updated

19 years ago
Target Milestone: M20 → ---