Closed
Bug 8849
Opened 25 years ago
Closed 25 years ago
[PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.
Categories
(NSPR :: NSPR, defect, P1)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: laurel, Assigned: blizzard)
References
Details
(Whiteboard: Suggest dropping support for glibc2.0)
1999-06-24-12-m8 Linux rh5.2
The build is crashing on startup. Happening on two of two machines tried within
the mailnewsqa group. No stack trace.
Here is the log from Phillip Bond:
here is std out:
WARNING -- -editor is going away, use -edit instead!
width was not set
height was not set
**************************************************
nsComponentManager:
Load(/u/phillip/seamonkey/linux/99062412/package/components/librdf.so)
FAILED with error:
/u/phillip/seamonkey/linux/99062412/package/components/librdf.so:
undefined symbol: _._32CompositeArcsInOutEnumeratorImpl
**************************************************
and here is some gdb magic:
Program received signal SIGSEGV, Segmentation fault.
0x80e201b0 in ?? ()
(gdb) bt
#0 0x80e201b0 in ?? ()
#1 0x400733ed in nsComponentManagerImpl::LoadFactory ()
#2 0x400734e6 in nsComponentManagerImpl::FindFactory ()
#3 0x40073694 in nsComponentManagerImpl::CreateInstance ()
#4 0x400768be in nsComponentManager::CreateInstance ()
#5 0x4007703c in nsServiceManagerImpl::GetService ()
#6 0x400775ae in nsServiceManager::GetService ()
#7 0x403fe086 in nsChromeRegistry::InitRegistry ()
#8 0x4021ce1e in nsNetlibService::OpenStream ()
#9 0x40236de4 in nsDocumentBindInfo::Bind ()
#10 0x40236d3e in nsDocumentBindInfo::Bind ()
#11 0x40235d49 in nsDocLoaderImpl::LoadDocument ()
#12 0x4023a56e in nsWebShell::DoLoadURL ()
#13 0x4023a980 in nsWebShell::LoadURL ()
#14 0x4023a321 in nsWebShell::LoadURL ()
#15 0x4001b3c4 in nsWebShellWindow::Initialize ()
#16 0x4001a831 in nsAppShellService::CreateTopLevelWindow ()
#17 0x8051f1b in main ()
Comment 2•25 years ago
adding dp to the cc list, since this is happening in component-stuff.
Updated•25 years ago
Assignee: chofmann → waterson
Comment 3•25 years ago
Looks like rdf has an undefined variable. Waterson?
This is not an XPInstall (aka SmartUpdate) issue. Removing myself as QA Contact.
Comment 5•25 years ago
WORKSFORME. Please try removing dist/bin/component.reg and restarting.
This is happening to almost everyone in QA. I can't imagine folks doing
anything different with today's respin builds than they did with previous Linux
builds. Anyway, will forward this on.
(QA: component.reg is in the package directory with precompiled builds.)
Removing this file didn't fix my crash. It may work for people outside of QA,
but a) it still prevents us from doing any testing; and b) there are going to be
people "out there" with machines just like ours, too. Well, I need a vacation
anyway....
Summary: 1999-06-24-12-m8 Linux build crashes on startup. → [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup.
The interesting thing to find out is what changed between yesterday's and today's
builds to cause apprunner not to work for the majority of us running it in QA
(and, I'm sure, others if they have setups similar to ours, which we've been
using since the Seamonkey project began).
Comment 9•25 years ago
Being unable to run on 5.2 is unacceptable; this should be considered a blocking
bug, and is going to keep the tree closed tomorrow.
Comment 10•25 years ago
Putting on the M8 Target Milestone radar. :-)
Updated•25 years ago
Priority: P3 → P1
Comment 11•25 years ago
I agree with leaf. In fact, I am pretty surprised that we opened the tree
without resolving this.
Let us get this fixed.
Waterson, I wouldn't have thought component.reg would have anything to do with
this. Did you check this on a debug or a release build?
Could you resolve this? Either you convince the testing folks or they convince
you; any middle ground is considered not resolved.
Comment 12•25 years ago
Waterson (and a small cadre of xhead-types) is working on this diligently as we
idly comment this bug.
The verifications passed because they are pseudo-automated, and because they
were done on a redhat 6.0 machine.
Comment 13•25 years ago
This happens because of a race condition in the dynamic loader. Two threads are
trying to relocate code at the same time and this royally hoses stuff. Thread #1
is running the component manager. Evil Thread #2 comes from normal static
linkage from xpinstall. Reassigning to dveditz to deal.
Comment 15•25 years ago
Wow! Awesome catch, waterson.
Is exclusively locking the code that does the loading an option? I would like to
understand this better. How is static code from xpinstall causing a race?
Comment 16•25 years ago
I've taken the liberty of turning off the xpinstall build in
mozilla/xpinstall/Makefile.in
I'd prefer if tomorrow's builds were testable by qa while this is getting
resolved.
Updated•25 years ago
Assignee: dveditz → sgehani
Comment 17•25 years ago
Samir, leaf has turned off XPInstall in Unix again until this is fixed. Sounds
like Waterson has more details than are in the bug report.
Comment 18•25 years ago
the problem is that linux's libdl is not thread-safe, so when multiple threads
spin up and start relocating DLLs, they bump into each other. This may happen on
other platforms as well.
I don't think this is even specific to mozilla; it's just
specific to applications that link against many, many DLLs and have many threads.
One solution we talked about was to override dlopen() dlsym() etc to call
PR_LoadLibrary/PR_FindSymbol/etc, then modify PR_LoadLibrary/PR_FindSymbol/etc
to call _dlopen()/_dlsym()/etc
Adding wan-teh to the CC list, because he may have some comments.
Comment 19•25 years ago
Aren't we safe if all our dll loads happen via PR_LoadLibrary()? Is xpinstall
doing dlopen() instead of PR_LoadLibrary(), or does dlopen() happen implicitly as a
consequence of PR_LoadLibrary() of xpinstall.so?
Comment 20•25 years ago
Linking against _symbol rather than symbol is begging for pain, I think
(especially on Linux, where the glibc guys go to ever-increasing pains to hide
symbols that aren't part of the API).
On the other hand, glibc's threadsafety story is, um, mildly weak, especially
pre-2.1. This may be our only choice. =/
I'm going to bug some glibc folks and see if there's not a better way.
Comment 21•25 years ago
I am told by glibc people that glibc 2.0's dl* isn't threadsafe (``doesn't
attempt threadsafety'', whee!), but 2.1's is supposed to be.
_dlopen doesn't exist in glibc 2.1, so we're going to need some nice autoconf
magic to make this work anyway. This is going to be _so_ much fun. If we're
racing with ld.so-driven relocations, I'm not sure that overriding dlopen will
actually work, but I guess we'll find out.
Also, can someone confirm that it doesn't happen on RH6.0 with glibc 2.1?
Comment 22•25 years ago
I would first try adding a lock in PR_LoadLibrary
to serialize its dlopen calls. Is it possible to find
out the glibc version at run time?
Overriding standard library functions is always a
pain.
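On the run-time version question, here is a minimal sketch, assuming the version is already available as a string such as "2.0.7". The function name glibc_dl_needs_workaround is hypothetical and is not part of NSPR; it only classifies the string and does not touch the loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not an NSPR API.
 * Returns 1 if the version string names glibc 2.0.x (the series this bug
 * describes as having non-thread-safe dl* functions), 0 for any other
 * parsable "major.minor" version, and -1 if the string cannot be parsed.
 */
int glibc_dl_needs_workaround(const char *version)
{
    int major = 0;
    int minor = 0;

    assert(version != NULL);

    /* Only the first two numeric components matter for this decision. */
    if (sscanf(version, "%d.%d", &major, &minor) != 2)
        return -1;

    return (major == 2 && minor == 0) ? 1 : 0;
}
```

On glibc the string could come from gnu_get_libc_version(), declared in <gnu/libc-version.h>; whether that call is available on the 2.0 systems discussed here is not established in this bug.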
Comment 23•25 years ago
I agree, but the problem isn't that OUR code is calling dlopen() - it's that
libdl (part of glibc, not mozilla) is calling it when it resolves link-time
dependencies.
Basically, what's happening is this (or some minor variation, use your
imagination):
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, which in
turn calls dlopen() or dlsym() or something.
So if we overload dlopen() to call PR_LoadLibrary (assuming PR_LoadLibrary is
threadsafe, or is made to be threadsafe) then it should look something like:
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- we override dlsym() with PR_FindSymbol(), so that gets called instead
- PR_FindSymbol locks some mutex
- PR_FindSymbol calls _dlsym()
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, but this
time it blocks because PR_FindSymbol() is preventing it from continuing.
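The locking half of that idea can be sketched as follows. This is a minimal illustration, not NSPR code: locked_dlopen, locked_dlsym and dl_lock are hypothetical names, and the wrappers call the public dlopen() and dlsym() directly. It only serializes callers that go through the wrappers; symbol resolution that the dynamic loader performs on its own, as described above, never takes this mutex.

```c
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

/* One process-wide lock shared by every wrapped dl* call (hypothetical). */
static pthread_mutex_t dl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialized dlopen(): at most one wrapped caller is inside libdl at a time. */
void *locked_dlopen(const char *path, int flags)
{
    void *handle;

    pthread_mutex_lock(&dl_lock);
    handle = dlopen(path, flags);
    pthread_mutex_unlock(&dl_lock);
    return handle;
}

/* Serialized dlsym(), sharing the same lock as locked_dlopen(). */
void *locked_dlsym(void *handle, const char *name)
{
    void *sym;

    assert(name != NULL);

    pthread_mutex_lock(&dl_lock);
    sym = dlsym(handle, name);
    pthread_mutex_unlock(&dl_lock);
    return sym;
}
```

The mutex is not recursive, so a library whose initializer calls back into locked_dlopen() would deadlock; a real implementation would have to decide how to handle that case.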
Updated•25 years ago
Assignee: sgehani → wtc
Comment 24•25 years ago
Reassigning to wtc as this appears to be an issue best dealt with by NSPR. Bug
#8971 has been entered specific to XPInstall.
Comment 25•25 years ago
Based on Samir's comments, correcting component from mailnews to nspr since this
bug is not mail specific.
Comment 26•25 years ago
Setting QA Contact
Comment 27•25 years ago
Added srinivas and larryh to the cc list.
Comment 28•25 years ago
If waterson@netscape.com's comments on 06/25/99 00:35
are correct, the root cause of this crash is that
libdl in glibc 2.0 is not thread safe. Because
not all the dlopen calls are made through NSPR
(e.g., as alecf@netscape.com pointed out on
06/26/99 16:48, libdl is also calling dlopen
when it resolves link-time dependencies), we
cannot fix or work around this problem at the
NSPR level.
Larry and I have two suggestions:
1. Make all the dl* calls from the main thread.
2. Ask glibc 2.0 maintainers to make the
dl* functions thread-safe. Larry has submitted
a bug report to bug-glibc@gnu.org.
Overriding dlopen, dlsym, etc., as suggested by
alecf@netscape.com on 06/25/99 15:30, may not be
feasible. Larry and I examined the glibc 2.0.7
source code. While there is the pair dlopen
and _dl_open, there is only dlsym but no _dl_sym.
This means it's essentially impossible to wrap
dlsym.
Comment 29•25 years ago
I sent the following to bug-glibc@gnu.org
-------- Original Message --------
Subject: bug glibc 2.0.7 dlopen() is not thread-safe
Date: Tue, 29 Jun 1999 17:43:36 -0700
From: Lawrence Hardiman <LarryH@Netscape.COM>
Organization: Netscape Communications Corporation, Mountain View CA, USA
To: bug-glibc@gnu.org
Sorry if this has been reported before, and fixed. I did not know
where or how to search a GNU bug database.
Abstract: dlopen() and related functions are not thread safe
Description:
glibc 2.0.7 dlopen() may yield unpredictable results when used in
a highly threaded environment. Users see various faults and other
unpredictable results.
glibc 2.1.1 does not exhibit this behavior. Examination of the source
for dlopen() shows that in glibc 2.1.1 dlopen() and related
functions are thread-safe.
Is there a variant of glibc 2.0 in which dlopen(), etc. is
thread-safe?
See also: http://bugzilla.mozilla.org/show_bug.cgi?id=8849
=============================
... and got the following answer:
-------- Original Message --------
Subject: Re: bug glibc 2.0.7 dlopen() is not thread-safe
Date: 29 Jun 1999 17:46:17 -0700
From: Ulrich Drepper <drepper@cygnus.com>
Reply-To: drepper@cygnus.com (Ulrich Drepper)
To: larryh@netscape.com (Lawrence Hardiman)
CC: bug-glibc@gnu.org
References: <37796838.49B304FD@Netscape.COM>
larryh@netscape.com (Lawrence Hardiman) writes:
> Abstract: dlopen() and related functions are not thread safe
Use glibc 2.1.
--
---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com `------------------------
=======================================
Comment 30•25 years ago
Another suggestion: In the source for glibc (2.0 and 2.1) the code for libdl.so
appears to be contained in .../elf/*. This suggests that this library can be
built independently of the rest of glibc. Consider building the libdl.so from
glibc 2.1 and using it with the rest of glibc 2.0.
Packaging a replacement libdl with the browser may be more palatable than
requiring complete replacement of glibc. ... Give it a whirl.
Comment 31•25 years ago
I don't think that repackaging libdl.so will be enough: ld.so-driven dlopen
stuff may not use the system libdl.so, though I'm sure it uses the same source
code. (I imagine a bootstrapping issue there: how does ld.so get the dlopen
symbol from libdl.so? It can't use libdl.so, that's for sure!)
We _could_ package a replacement ld.so as well and start mozilla with that, if
glibc2.1's ld.so is compatible with glibc2.0's. When do dlthings happen in the
code after startup? Can we not synchronize them via PR_LoadLibrary locks, once
the initial ld.so scramble is done?
I guess I'm curious as to what _exactly_ is happening on the various threads
when this happens? Does ld.so really do lazy symbol stuff unavoidably? Does
LD_BIND_NOW=1 in the initial environment help us (at the cost of some startup
time, probably)?
Updated•25 years ago
Assignee: wtc → briano
Target Milestone: M8 → M9
Comment 32•25 years ago
try m9. reassign to briano to see if he can come up with a creative solution.
Updated•25 years ago
Severity: blocker → normal
Summary: [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup. → [PP] 1999-06-24-12-m8 Linux may crash on startup.
Comment 33•25 years ago
I've removed the "blocker" status because it isn't crashing at startup anymore
(we turned off the XPInstall thread). Now it's just a lurking bug waiting to
get us some other time.
Comment 34•25 years ago
Filed http://developer.redhat.com/bugzilla/show_bug.cgi?id=4011 with Red Hat, in
case they're still fixing problems with 5.x.
Updated•25 years ago
Assignee: briano → chofmann
Comment 35•25 years ago
I'm clearly not the right owner for this bug. Reassigning to chofmann.
Comment 36•25 years ago
libdl.so from the glibc 2.1 is incompatible with the glibc 2.0.x (it is safe to
say that all internal ELF handling is changed, data structures are changed,
there is no reason for anybody to spend time on backporting libdl)
At this point in time it seems that our only hope is to add a number of
__libc_lock calls in the libdl code that comes with glibc 2.0.x. However, I am
not in a position where I can test or have the test applications doing this, so
there must be somebody else that goes in deep and starts throwing thread locks
left and right. :-(
Assignee
Updated•25 years ago
Assignee: chofmann → blizzard
Assignee
Comment 37•25 years ago
I'll take this bug and try to get it fixed. I can probably talk RH into releasing an
official 5.2 update for it.
Updated•25 years ago
Summary: [PP] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup.
Comment 38•25 years ago
I'm seeing this now, on the 7/29 build, and it does prevent startup (and I'm
hearing the same thing from other 5.2 users), so I'm adding [blocker] back.
Summary: [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup.
Comment 39•25 years ago
I'm modifying the summary to reflect a more current date.
Updated•25 years ago
Summary: [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup.
Comment 40•25 years ago
adding RH5.2 in the title.
Assignee
Comment 41•25 years ago
*** Bug 9292 has been marked as a duplicate of this bug. ***
Updated•25 years ago
Severity: normal → blocker
Comment 42•25 years ago
Putting on the blocker radar
Assignee
Updated•25 years ago
Status: NEW → ASSIGNED
Assignee
Comment 43•25 years ago
I'm able to reproduce this problem with Red Hat 6.0 and glibc targeted for Red
Hat 6.1. I'm continuing to look into it.
Comment 44•25 years ago
well, after lots of hacking in NSPR, I'm concerned that we're not actually
going to be able to override dlopen and friends. There are a few problems:
- there is no _dlopen, _dlclose, _dlsym, or _dlerror on linux, which means it's
very difficult to call the "real" versions of these functions
- the way we would call these functions without the _ versions would be to use
dlsym with the RTLD_NEXT flag, but it's kind of hard to call dlsym() when that's
the function you're trying to override.
One interesting fix that seems to make a big difference is to switch to using
RTLD_LAZY in PR_LoadLibrary instead of RTLD_NOW. This reduces the number of
symbol lookups (i.e. dlsym() calls) at load-time to almost nothing, and spreads
the actual dlsym() calls over the lifetime of the app instance, so the chances
of race conditions are much lower. This is not a fix to the problem though,
merely a way of reducing the probability of it occurring.
This basically means changing
#ifdef LINUX
#define _PR_DLOPEN_FLAGS RTLD_NOW
#else
#define _PR_DLOPEN_FLAGS RTLD_LAZY
#endif /* LINUX */
to just
#define _PR_DLOPEN_FLAGS RTLD_LAZY
The problem is that the original #ifdef LINUX was put there to fix a Linux
porting issue that directory server was having back in February.
Larry - what do you think about changing prlink.c back the way it was, without
the special case for linux?
Updated•25 years ago
Whiteboard: fix on the way
Comment 45•25 years ago
alecf says:
Chris Blizzard has a fix for glibc itself - he's done the fix for the
unreleased redhat 6.1 and is now backporting it to redhat 6.0 and 5.2.
Comment 46•25 years ago
It's also happening with SUSE 6.1. Is it possible to use RH glibcs when chris
has finished his backport?
Assignee
Comment 47•25 years ago
RPMS are now available for Red Hat 6.0 and 5.2 for this problem. Please note
that these ARE NOT OFFICIAL RED HAT RPMS. They are not signed and pretty much
untested. You are taking your own life into your hands by installing them. I
hope you know how to use sash, just in case. If it helps, I'm running the 6.0
rpms now and haven't seen any problems.
Assuming these work well, they will probably be released as official Red Hat
RPMS.
As for people running SUSE, I don't know what to tell you. If you install the
Red Hat glibc, your system might be unusable. I'd bug SUSE about it.
I'd like feedback on this. Let me know if it works or if it doesn't.
URL:
http://people.redhat.com/blizzard/glibc/
Assignee
Comment 48•25 years ago
The glibc update for 5.2 doesn't seem to work. I'm working on trying to sort
it out. Keep tuned.
Comment 49•25 years ago
the RTLD_LAZY trick doesn't seem to work for me anymore (funny, it did
yesterday!) so that's not an option either.
Comment 50•25 years ago
oh, and just to clarify: I had two NSPR hacks, both of which seem to be
worthless now:
- use RTLD_LAZY like all the other platforms (doesn't seem to fix the problem
now)
- override and wrap dlsym(), dlopen() etc with locks so that they can't be
called simultaneously. This turned out to be a flop too because glibc doesn't do
the common "weak <symbol>/strong _<symbol>" that most other platforms do
(which means if I override dlsym() I can't get back to the original dlsym())
So I think now we're relying on chris' genius.
Comment 51•25 years ago
*** Bug 10600 has been marked as a duplicate of this bug. ***
Assignee
Comment 52•25 years ago
I've talked to Ulrich about either making the dl library thread safe or porting
the dl library from 2.1 back to glibc 2.0. He says there's zero chance of that
happening because of the architecture changes in the two versions.
So, this presents an interesting problem. Since glibc 2.0 is not thread safe
with regards to dynamic loading, nspr is going to be very hard to set up to work
around the problem. The problem is that anytime that you want to use dlsym()
you will have to suspend the operation of any threads. The reason is that even
in normal operation other threads will be resolving symbols via the same
operations that dlsym() uses. They will inevitably step on each other's toes.
So, there are two ways that we can go here.
1. hack nspr to do this
2. don't use pthreads for a glibc 2.0 release and just use userland threads.
Comment 53•25 years ago
ok, so I've learned that hacking NSPR is next to impossible for dlsym() because
dlsym() happens to also be the function that needs to be available to hack NSPR.
More clearly:
- we need to hack NSPR to override dlsym()
- hacking NSPR to override any function requires the _use_ of the function
dlsym()
- dlsym() is not available for use because we'd be overriding it.
Comment 54•25 years ago
userland threads, eh? Wan-Teh or larry, how does one create a user-level thread
in NSPR? I'm confused between the type (USER vs. SYSTEM) and the scope (LOCAL
vs. GLOBAL vs. GLOBAL_UNBOUND). Would type=SYSTEM, scope=LOCAL create a
user-level, NSPR-scheduled thread? I'm experimenting with this now.
I've got another kind of interesting idea brewing too:
The basic problem is that libc is calling non-threadsafe calls when creating a
thread.
We've been trying to solve the problem by protecting those calls. What if we
instead solved the problem by protecting the creation of threads, so that all
threads are suspended between the PR_CreateThread() and the actual kick-off of
the thread function.
so it essentially looks like this:
PR_CreateThread(start_function)
    lock(thread_creation_lock)
    data.func=start_function
    pthread_create(_pt_root, data) /* data contains start_function */
_pt_root(data)
    unlock(thread_creation_lock)
    data->func(); /* this is myfunction() */
this will at least synchronize all thread creation, which may be when some of
this symbol resolution is happening.
I'm going to try a userlevel thread first, then the above locking mechanism.
Comment 55•25 years ago
ok, I can't seem to get userlevel threads working:
- using PR_LOCAL_THREAD is the same as PR_GLOBAL_THREAD with pthreads
- compiling NSPR with CLASSIC_NSPR=1 (to force NSPR to use user-level threads)
makes it crash in _PR_CPU_Idle
I'm trying to do the NSPR locking thing I just mentioned but I'm not having much
luck with that because you don't seem to be allowed to release a lock that
another thread has locked. I'm still fiddling with this though.
Updated•25 years ago
Whiteboard: fix on the way → attemping two solutions
Comment 56•25 years ago
A clarification on the type and scope arguments to PR_CreateThread:
You should almost always set the thread type to USER; the SYSTEM type was meant
primarily for use by JVM.
The default build of NSPR on most Unix platforms supports GLOBAL scope threads
only; each GLOBAL scope thread is a pthread.
When building NSPR using CLASSIC_NSPR=1 you get LOCAL scope (user-level) threads
only.
Are you seeing the crash in _PR_CPU_Idle() with the checked in version of
NSPR source? If so, do you have a stack trace?
And yes, locks have to be unlocked by the owner of the lock.
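Given that ownership rule, one way to express the thread-creation handoff from comment 54 is to let the creator keep the creation lock and wait for a "started" signal from the new thread, so no lock is ever released by a thread that did not take it. This is a minimal sketch in plain pthreads, not NSPR code: create_thread_serialized, thread_root, increment_int and the lock names are all hypothetical. It shows the handoff only and makes no claim about whether serializing creation avoids the loader race.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef void (*start_fn)(void *);

struct start_data {
    start_fn func;   /* user's start function */
    void *arg;       /* argument passed to it */
};

/* Held by the creating thread for the whole creation window. */
static pthread_mutex_t creation_lock = PTHREAD_MUTEX_INITIALIZER;
/* Protects child_started; each thread unlocks only what it locked. */
static pthread_mutex_t handoff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t handoff_cond = PTHREAD_COND_INITIALIZER;
static int child_started;

static void *thread_root(void *p)
{
    /* Copy first: the creator's stack data is only valid until we signal. */
    struct start_data local = *(struct start_data *)p;

    pthread_mutex_lock(&handoff_lock);
    child_started = 1;
    pthread_cond_signal(&handoff_cond);
    pthread_mutex_unlock(&handoff_lock);

    local.func(local.arg);
    return NULL;
}

/* Returns 0 on success, otherwise the pthread_create() error code. */
int create_thread_serialized(pthread_t *tid, start_fn func, void *arg)
{
    struct start_data data;
    int rc;

    assert(tid != NULL && func != NULL);
    data.func = func;
    data.arg = arg;

    pthread_mutex_lock(&creation_lock);
    child_started = 0;   /* no child exists yet, so nothing else touches it */
    rc = pthread_create(tid, NULL, thread_root, &data);
    if (rc == 0) {
        /* Block until the new thread has reached its root function. */
        pthread_mutex_lock(&handoff_lock);
        while (!child_started)
            pthread_cond_wait(&handoff_cond, &handoff_lock);
        pthread_mutex_unlock(&handoff_lock);
    }
    pthread_mutex_unlock(&creation_lock);
    return rc;
}

/* Example start function used to exercise the sketch. */
void increment_int(void *arg)
{
    ++*(int *)arg;
}
```

A second creator blocks on creation_lock until the first child has signaled, which is the serialization the pseudocode in comment 54 was after.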
Comment 57•25 years ago
Ok, I put locks in PR_CreateThread to serialize thread creation, and then added
those locks to all the dlopen(), dlsym() etc calls... and we STILL have the same
problem.
I'm going to try to make user level threads work now. I'm not sure what the
problem is... userlevel threads work fine in nsprpub/pr/tests/threads.c
Comment 58•25 years ago
Oh, sspitzer had an interesting idea that also didn't work, unfortunately: if
loading the library fails then try again up to 10 times. In our tests it always
worked on the second attempt to load the library, but then the pointer we got
back from the next PR_FindSymbol (for NSFindFactory, etc) was garbage and it
crashed immediately.
Comment 59•25 years ago
Ok, I'm not really having any problems with the stock Redhat 6 install.
I now personally think:
- we should solicit help from the net to get userlevel threads working on redhat
5.2 (aka glibc2)
- we should encourage everyone else to just upgrade to redhat 6.0 (aka glibc2.1)
I think by the time this product is released, RH52 will be old skool. RH6.1 will
probably be released by then anyway and even 6.0 will be old. glibc2.0 is just
too f***ed to waste our time with.
Updated•25 years ago
Whiteboard: attemping two solutions → Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2)
Comment 60•25 years ago
Updating summary to reflect my recommendation.
I know lots of people working just fine on Redhat 6.0 (2.2 kernel, 2.1 glibc)
and on Redhat 5.2 (2.0 kernel, 2.0 glibc)
We can leave it to the net to try and fix this one, though after the sheer
volume of work chris and I have put into this, I'll be surprised if anything
comes out of it.
Comment 61•25 years ago
Is there any real evidence which glibc function is triggering the dl calls?
I'm skeptical that it was really during pthread_create.
I think the idea of serializing the PR_CreateThread calls is fundamentally the
right solution for glibc 2.0, you just have to find the right function that's
triggering the dl calls.
The only dl calls off the top of my head that it might be would be the
nsswitch stuff; perhaps you need to serialize the first call to any resolver
functions or something like that? It shouldn't be too hard to do a grep over
the glibc sources to find all the locations that could trigger dl calls from
outside that section but I haven't got a libc source tree handy.
Comment 62•25 years ago
well, serializing PR_CreateThread calls didn't help anything....
blizzard might have more comments on the glibc thing, but he's spent a lot of
time hacking at it.
Comment 63•25 years ago
*** Bug 4303 has been marked as a duplicate of this bug. ***
Comment 64•25 years ago
blizzard: could you let us know if the current glibc in redhat rawhide also
addresses this? (2.1.2-3)
thanks.
Assignee
Comment 65•25 years ago
rpm -qp --changelog glibc-2.1.2-3.i386.rpm reveals this:
* Mon Aug 02 1999 Cristian Gafton <gafton@redhat.com>
- upgraded snapshot to get the ld.so fixes for thread safety
So the answer is "yes."
Comment 66•25 years ago
We are dropping 5.2 support...shall we mark this Resolved/Won't Fix?
Comment 67•25 years ago
To get a feel for how often this occurs here are 25 runs on my machine:
ggggbggbbggggggbbgbggbggb (g=good, b=bad)
8 load failures out of 25 runs.
Here's a sample of the errors,
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: GetStyleContext__C7nsFramePP15nsIStyleContext
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: Destroy__7nsFrameR14nsIPresContext
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol:
GetNextPrevLineFromeBlockFrame__7nsFrameP15nsIFocusTracker11nsDirectionP8nsIFrameiiPP10nsIContentPiSc
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi
Comment 68•25 years ago
I think either we should leave this open with no target fix and help wanted, or
close it resolved/wont fix.
Comment 69•25 years ago
Help Wanted sounds like a good idea for this bug -- maybe some of the glibc
gurus will be willing to look at fixes for other versions of Linux.
Comment 70•25 years ago
I tried to upgrade the glibc 2.0 from rh 5.2 to glibc 2.1 from rh 6.0. A bunch
of stuff had to be upgraded as well, including binutils, kernel-headers,
kernel-source and compilers.
Everything seems to be working as expected, except for the debugger. If you use
the 5.2 debugger, it (the debugger) hangs right after the first thread is
created.
If I use the 6.0 debugger, it (the debugger) core dumps right away.
So this franken-redhat setup might be a workaround for people that don't want to
upgrade to 6.0.
Of course, it's completely at your own risk, because it hasn't been QAed at all.
Updated•25 years ago
Target Milestone: M9 → M10
Comment 71•25 years ago
I've added a note to get this into the M9 release notes
as a high profile item. Update this bug with
additional notes and comments to go into the M9 and future
release notes.
http://bugzilla.mozilla.org/show_bug.cgi?id=11352
Moving the rest of this work to M10 since it's not something
we will address directly in Seamonkey M9.
Comment 72•25 years ago
*** Bug 11311 has been marked as a duplicate of this bug. ***
Comment 73•25 years ago
Hi, all.
I ran into this bug with the CVS tree from today (8/19/99). When I ran
viewer I was getting strange symbol resolving errors on my RedHat 5.2
system. Doing this seemed to fix the problem (for me at least).
LD_BIND_NOW=1 ./apprunner
I ran into this same problem when porting a JNI app to the JDK
port from blackdown. In that case, I also needed to preload
libpthread.so like this before it would work correctly.
LD_PRELOAD=libpthread.so
The guys over at blackdown were a big help in tracking down
this problem, so it might help to have one of your experts
talk to one of their experts. The fellow who wrote the fix
they are using in the JDK can be contacted at this address.
Juergen Kreileder <kreilede@issan.cs.uni-dortmund.de>
He is very friendly and helpful, so I am sure he would not
mind helping the mozilla team resolve this nasty bug.
Assignee
Updated•25 years ago
Target Milestone: M10 → M11
Assignee
Comment 74•25 years ago
Moving out to M11
Comment 75•25 years ago
AFAIK RH 5.x support has been dropped. This does work in 6, right? If so, we
should probably close this out as WONTFIX? Anyone?
Comment 76•25 years ago
I'd like to leave this open as "HELP WANTED"... because it would suck not to
support 5.2, but I don't think we have anyone right now who actually wants to
bother with this.
What's the HELP WANTED procedure? Let's do that.
Updated•25 years ago
Summary: [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup. → [PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.
Whiteboard: Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2) → Suggest dropping support for glibc2.0
Comment 77•25 years ago
HELP WANTED-ified, removed [BLOCKER] and updated status field. (It's not a
kernel issue.)
Updated•25 years ago
Target Milestone: M11 → M20
Comment 78•25 years ago
moving out to m-way-far-away
Comment 79•25 years ago
everyone, I haven't seen apprunner crash on startup in a long
time... shall we close this one out?
Assignee
Comment 80•25 years ago
You haven't seen the crash on 5.2 in a long time?
Comment 81•25 years ago
there are SOME redhat 5.2 systems that have never seen this problem.
I don't know what it is about them. I'd like to leave this open unless we get
confirmation from more than 5 or 10 people that they aren't seeing this.
Comment 82•25 years ago
A conditional "works for me": I have a glibc 2.0 system (Debian slink, not
everything is RH :-) and it does work _if_ I set LD_BIND_NOW=1 and
LD_PRELOAD=libpthread.so.0. Until recently (about 2 weeks ago) I had to also
preload libgtk and libgdk, but this does not work any more - seems something in
the library dependencies inside mozilla has changed/improved.
I think it should get verified with someone who knows glibc internals (and not
just tells everyone to upgrade) if this fix really avoids the race condition,
and then set in the startup script, depending on library version.
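The startup-script idea could also live in a small launcher program. The following is a minimal sketch of the environment half only, assuming a hypothetical helper named ensure_env_default; it is not code from the mozilla tree and it does not decide whether the workaround is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical launcher helper.
 * If `name` is not already in the environment, set it to `value`.
 * Returns 1 if the variable was added (the caller should then re-exec),
 * 0 if it was already present and left untouched, -1 if setenv() failed.
 */
int ensure_env_default(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);

    if (getenv(name) != NULL)
        return 0;   /* respect whatever the user already chose */

    return setenv(name, value, 0) == 0 ? 1 : -1;
}
```

Because the dynamic loader reads LD_BIND_NOW and LD_PRELOAD when a process starts, a launcher would follow a return value of 1 with an exec of the real binary (for example via execv()); that step, and the choice of which variables to set, are left out here.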
Assignee
Comment 83•25 years ago
Did anyone ever try preloading glibc 2.1 on one of those systems to see if it
worked? That might be an acceptable workaround.
Assignee
Updated•25 years ago
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → WONTFIX
Assignee
Comment 84•25 years ago
There are ideas in this bug report on workarounds but it's not something that we
can actually fix.
Assignee
Comment 86•25 years ago
Please ignore the spam. Changing address.
Assignee: blizzard → blizzard
Status: VERIFIED → NEW
Assignee
Comment 87•25 years ago
busted when I reassigned
Status: NEW → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: WONTFIX → FIXED
Updated•24 years ago
Target Milestone: M20 → ---