8849 - [PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.

Reporter

Description

•

25 years ago

1999-06-24-12-m8 Linux rh5.2

The build is crashing on startup.  Happening on two of two machines tried within
the mailnewsqa group.  No stack trace.

lchiang

Comment 1

•

25 years ago

Here is the log from Phillip Bond:

here is std out:

WARNING -- -editor is going away, use -edit instead!
width was not set
height was not set
**************************************************
nsComponentManager:
Load(/u/phillip/seamonkey/linux/99062412/package/components/librdf.so)
FAILED with error:
/u/phillip/seamonkey/linux/99062412/package/components/librdf.so:
undefined symbol: _._32CompositeArcsInOutEnumeratorImpl
**************************************************

and here is some gdb magic:

Program received signal SIGSEGV, Segmentation fault.
0x80e201b0 in ?? ()
(gdb) bt
#0  0x80e201b0 in ?? ()
#1  0x400733ed in nsComponentManagerImpl::LoadFactory ()
#2  0x400734e6 in nsComponentManagerImpl::FindFactory ()
#3  0x40073694 in nsComponentManagerImpl::CreateInstance ()
#4  0x400768be in nsComponentManager::CreateInstance ()
#5  0x4007703c in nsServiceManagerImpl::GetService ()
#6  0x400775ae in nsServiceManager::GetService ()
#7  0x403fe086 in nsChromeRegistry::InitRegistry ()
#8  0x4021ce1e in nsNetlibService::OpenStream ()
#9  0x40236de4 in nsDocumentBindInfo::Bind ()
#10 0x40236d3e in nsDocumentBindInfo::Bind ()
#11 0x40235d49 in nsDocLoaderImpl::LoadDocument ()
#12 0x4023a56e in nsWebShell::DoLoadURL ()
#13 0x4023a980 in nsWebShell::LoadURL ()
#14 0x4023a321 in nsWebShell::LoadURL ()
#15 0x4001b3c4 in nsWebShellWindow::Initialize ()
#16 0x4001a831 in nsAppShellService::CreateTopLevelWindow ()
#17 0x8051f1b in main ()

Daniel (Leaf) Nunes

Comment 2

•

25 years ago

adding dp to the cc list, since this is happening in component-stuff.

Suresh Duddi (gone)

Updated

•

25 years ago

Assignee: chofmann → waterson

Suresh Duddi (gone)

Comment 3

•

25 years ago

Looks like rdf has an undefined symbol. Waterson?

Jimmy Lee

Updated

•

25 years ago

QA Contact: jimmylee

Jimmy Lee

Comment 4

•

25 years ago

This is not an XPInstall (aka SmartUpdate) issue.  Removing myself as QA Contact.

Chris Waterson

Comment 5

•

25 years ago

WORKSFORME. Please try removing dist/bin/component.reg and restarting.

lchiang

Comment 6

•

25 years ago

This is happening to almost everyone in QA.  I can't imagine folks doing
anything different with today's respin builds than they did with previous Linux
builds.  Anyway, will forward this on.

scurtis

Comment 7

•

25 years ago

(QA: component.reg is in the package directory with precompiled builds.)

Removing this file didn't fix my crash. It may work for people outside of QA,
but a) it still prevents us from doing any testing; and b) there are going to be
people "out there" with machines just like ours, too. Well, I need a vacation
anyway....

lchiang

Updated

•

25 years ago

Summary: 1999-06-24-12-m8 Linux build crashes on startup. → [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup.

lchiang

Comment 8

•

25 years ago

The interesting thing to find out is what changed between yesterday's and today's
builds to cause apprunner not to work for the majority of us running it in QA
(and I'm sure others if they have similar setups to ours which we've been
using since Seamonkey project began).

Daniel (Leaf) Nunes

Comment 9

•

25 years ago

Being unable to run on 5.2 is unacceptable; this should be considered a blocking
bug, and is going to keep the tree closed tomorrow.

leger

Updated

•

25 years ago

Target Milestone: M8

leger

Comment 10

•

25 years ago

Putting on the M8 Target Milestone radar. :-)

Suresh Duddi (gone)

Updated

•

25 years ago

Priority: P3 → P1

Suresh Duddi (gone)

Comment 11

•

25 years ago

I agree with leaf. In fact, I am pretty surprised that we opened the tree
without resolving this.

Let us get this fixed.

Waterson, I wouldn't have thought component.reg would have anything to do with
this. Did you check this on a debug or release build?

Could you resolve this? (Either you convince the testing folks or they convince
you. Any middle ground is considered not resolved.)

Daniel (Leaf) Nunes

Comment 12

•

25 years ago

Waterson (and a small cadre of xhead-types) is working on this diligently as we
idly comment this bug.

The verifications passed because they are pseudo-automated, and because they
were done on a redhat 6.0 machine.

Chris Waterson

Comment 13

•

25 years ago

This happens because of a race condition in the dynamic loader. Two threads are
trying to relocate code at the same time and this royally hoses stuff. Thread #1
is running the component manager. Evil Thread #2 comes from normal static
linkage from xpinstall. Reassigning to dveditz to deal.

Suresh Duddi (gone)

Comment 15

•

25 years ago

Wow! awesome catch waterson.

Is exclusively locking the code that does the loading an option? I would like to
know more about this. How is static code from xpinstall causing a race?

Daniel (Leaf) Nunes

Comment 16

•

25 years ago

I've taken the liberty of turning off the xpinstall build in
mozilla/xpinstall/Makefile.in

I'd prefer if tomorrow's builds were testable by qa while this is getting
resolved.

Daniel Veditz [:dveditz]

Updated

•

25 years ago

Assignee: dveditz → sgehani

Daniel Veditz [:dveditz]

Comment 17

•

25 years ago

Samir, leaf has turned off XPInstall in Unix again until this is fixed. Sounds
like Waterson has more details than are in the bug report.

Alec Flett

Comment 18

•

25 years ago

the problem is that linux's libdl is not thread-safe, so when multiple threads
spin up and start relocating DLLs, they bump into each other. This may happen on
other platforms as well..

I don't think this is really even specific to mozilla; it's just
specific to applications that link against many many DLLs and have many threads.

One solution we talked about was to override dlopen() dlsym() etc to call
PR_LoadLibrary/PR_FindSymbol/etc, then modify PR_LoadLibrary/PR_FindSymbol/etc
to call _dlopen()/_dlsym()/etc

Adding wan-teh to the CC list, because he may have some comments.

Suresh Duddi (gone)

Comment 19

•

25 years ago

Aren't we safe if all our dll loads happen via PR_LoadLibrary()? Is xpinstall
doing dlopen() instead of PR_LoadLibrary(), or does dlopen() happen implicitly as a
consequence of PR_LoadLibrary() of xpinstall.so?

shaver-do-not-use-this-account

Comment 20

•

25 years ago

Linking against _symbol rather than symbol is begging for pain, I think
(especially on Linux, where the glibc guys go to ever-increasing pains to hide
symbols that aren't part of the API).

On the other hand, glibc's threadsafety story is, um, mildly weak, especially
pre-2.1.  This may be our only choice. =/

I'm going to bug some glibc folks and see if there's not a better way.

shaver-do-not-use-this-account

Comment 21

•

25 years ago

I am told by glibc people that glibc's 2.0's dl* isn't threadsafe (``doesn't
attempt threadsafety'', whee!), but 2.1's is supposed to be.

_dlopen doesn't exist in glibc 2.1, so we're going to need some nice autoconf
magic to make this work anyway.  This is going to be _so_ much fun.  If we're
racing with ld.so-driven relocations, I'm not sure that overriding dlopen will
actually work, but I guess we'll find out.

Also, can someone confirm that it doesn't happen on RH6.0 with glibc 2.1?
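
A minimal sketch of telling the two glibc lines apart at run time rather than in autoconf. It assumes gnu_get_libc_version() from <gnu/libc-version.h>; as far as I know that call first appeared in glibc 2.1 itself, so a binary that must also start on 2.0 would need some other fallback. The function names are made up for illustration.

```c
#include <stdio.h>

#ifdef __GLIBC__
#include <gnu/libc-version.h>   /* gnu_get_libc_version() */
#endif

/*
 * Return 1 if a "major.minor[.patch]" version string names glibc 2.1 or
 * later, 0 if it is older or cannot be parsed.
 */
int version_has_threadsafe_dl(const char *version)
{
    int major = 0, minor = 0;

    if (version == NULL || sscanf(version, "%d.%d", &major, &minor) != 2)
        return 0;
    return major > 2 || (major == 2 && minor >= 1);
}

/* Apply the check to the running C library; returns 0 when not on glibc. */
int running_libc_has_threadsafe_dl(void)
{
#ifdef __GLIBC__
    return version_has_threadsafe_dl(gnu_get_libc_version());
#else
    return 0;
#endif
}
```

The parsing is kept separate from the library call so the decision can be exercised with fixed strings such as "2.0.7" and "2.1.1".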

Wan-Teh Chang

Comment 22

•

25 years ago

I would first try adding a lock in PR_LoadLibrary
to serialize its dlopen calls.  Is it possible to find
out the glibc version at run time?

Overriding standard library functions is always a
pain.
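
A minimal sketch of the lock idea, assuming plain pthreads and dlfcn rather than NSPR's own lock and link code. The names locked_dlopen and locked_dlsym are hypothetical, not NSPR functions. It only covers calls made through these wrappers; symbol resolution done by the dynamic linker itself never takes this lock.

```c
#include <dlfcn.h>
#include <pthread.h>

/* One process-wide lock guarding every explicit library load and lookup. */
static pthread_mutex_t dl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: dlopen() with all callers serialized. */
void *locked_dlopen(const char *path, int flags)
{
    void *handle;

    pthread_mutex_lock(&dl_lock);
    handle = dlopen(path, flags);
    pthread_mutex_unlock(&dl_lock);
    return handle;
}

/* Hypothetical wrapper: dlsym() under the same lock. */
void *locked_dlsym(void *handle, const char *name)
{
    void *sym;

    pthread_mutex_lock(&dl_lock);
    sym = dlsym(handle, name);
    pthread_mutex_unlock(&dl_lock);
    return sym;
}
```

On older glibc this needs to be linked with -ldl and -lpthread.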

Alec Flett

Comment 23

•

25 years ago

I agree, but the problem isn't that OUR code is calling dlopen() - it's that
libdl (part of glibc, not mozilla) is calling it when it resolves link-time
dependencies.

basically what's happening is this (or some minor variation, use your
imagination)
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, which in
turn calls dlopen() or dlsym() or something.

so if we overload dlopen() to call PR_LoadLibrary (assuming PR_LoadLibrary is
threadsafe, or is made to be threadsafe) then it should look something like:
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- we override dlsym() with PR_FindSymbol(), so that gets called instead
- PR_FindSymbol locks some mutex
- PR_FindSymbol calls _dlsym()
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, but this
time it blocks because PR_FindSymbol() is preventing it from continuing.

Samir Gehani

Updated

•

25 years ago

Assignee: sgehani → wtc

Samir Gehani

Comment 24

•

25 years ago

Reassigning to wtc as this appears to be an issue best dealt with by NSPR. Bug
#8971 has been entered specific to XPInstall.

lchiang

Updated

•

25 years ago

Component: XPInstall → NSPR

Product: MailNews → NSPR

lchiang

Comment 25

•

25 years ago

Based on Samir's comments, correcting component from mailnews to nspr since this
bug is not mail specific.

leger

Updated

•

25 years ago

QA Contact: srinivas

leger

Comment 26

•

25 years ago

Setting QA Contact

Wan-Teh Chang

Comment 27

•

25 years ago

Added srinivas and larryh to the cc list.

Wan-Teh Chang

Comment 28

•

25 years ago

If waterson@netscape.com's comments on 06/25/99 00:35
are correct, the root cause of this crash is that
libdl in glibc 2.0 is not thread safe.  Because
not all the dlopen calls are made through NSPR
(e.g., as alecf@netscape.com pointed out on
06/26/99 16:48, libdl is also calling dlopen
when it resolves link-time dependencies), we
cannot fix or work around this problem at the
NSPR level.

Larry and I have two suggestions:
1. Make all the dl* calls from the main thread.
2. Ask glibc 2.0 maintainers to make the
   dl* functions thread-safe.  Larry has submitted
   a bug report to bug-glibc@gnu.org.

Overriding dlopen, dlsym, etc., as suggested by
alecf@netscape.com on 06/25/99 15:30, may not be
feasible.  Larry and I examined the glibc 2.0.7
source code.  While there is the pair dlopen
and _dl_open, there is only dlsym but no _dl_sym.
This means it's essentially impossible to wrap
dlsym.

larryh (gone)

Comment 29

•

25 years ago

I sent the following to bug-glibc@gnu.org

-------- Original Message --------
Subject: bug glibc 2.0.7 dlopen() is not thread-safe
Date: Tue, 29 Jun 1999 17:43:36 -0700
From: Lawrence Hardiman <LarryH@Netscape.COM>
Organization: Netscape Communications Corporation, Mountain View CA, USA
To: bug-glibc@gnu.org

Sorry if this has been reported before, and fixed. I did not know
where or how to search a GNU bug database.

Abstract: dlopen() and related functions are not thread safe

Description:
glibc 2.0.7 dlopen() may yield unpredictable results when used in
a highly threaded environment. Users see various faults and other
unpredictable results.

glibc 2.1.1 does not exhibit this behavior. Examination of source
for dlopen() shows that in glibc 2.1.1 that dlopen() and related
functions are thread-safe.

Is there a variant of glibc 2.0 in which dlopen(), etc. is
thread-safe?

See also: http://bugzilla.mozilla.org/show_bug.cgi?id=8849
=============================
... and got the following answer:

-------- Original Message --------
Subject: Re: bug glibc 2.0.7 dlopen() is not thread-safe
Date: 29 Jun 1999 17:46:17 -0700
From: Ulrich Drepper <drepper@cygnus.com>
Reply-To: drepper@cygnus.com (Ulrich Drepper)
To: larryh@netscape.com (Lawrence Hardiman)
CC: bug-glibc@gnu.org
References: <37796838.49B304FD@Netscape.COM>

larryh@netscape.com (Lawrence Hardiman) writes:

> Abstract: dlopen() and related functions are not thread safe

Use glibc 2.1.

--
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------
=======================================

larryh (gone)

Comment 30

•

25 years ago

Another suggestion: In the source for glibc (2.0 and 2.1) the code for libdl.so
appears to be contained in .../elf/*. This suggests that this library can be
built independently of the rest of glibc. Consider building the libdl.so from
glibc 2.1 and using it with the rest of glibc 2.0.

Packaging a replacement libdl with the browser may be more palatable than
requiring complete replacement of glibc. ... Give it a whirl.

shaver-do-not-use-this-account

Comment 31

•

25 years ago

I don't think that repackaging libdl.so will be enough: ld.so-driven dlopen
stuff may not use the system libdl.so, though I'm sure it uses the same source
code.  (I imagine a bootstrapping issue there: how does ld.so get the dlopen
symbol from libdl.so?  It can't use libdl.so, that's for sure!)

We _could_ package a replacement ld.so as well and start mozilla with that, if
glibc2.1's ld.so is compatible with glibc2.0's.  When do dlthings happen in the
code after startup?  Can we not synchronize them via PR_LoadLibrary locks, once
the initial ld.so scramble is done?

I guess I'm curious as to what _exactly_ is happening on the various threads
when this happens?  Does ld.so really do lazy symbol stuff unavoidably?  Does
LD_BIND_NOW=1 in the initial environment help us (at the cost of some startup
time, probably)?

chris hofmann

Updated

•

25 years ago

Assignee: wtc → briano

Target Milestone: M8 → M9

chris hofmann

Comment 32

•

25 years ago

try m9. reassign to briano to see if he can come up with a creative solution.

Daniel Veditz [:dveditz]

Updated

•

25 years ago

Severity: blocker → normal

Summary: [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup. → [PP] 1999-06-24-12-m8 Linux may crash on startup.

Daniel Veditz [:dveditz]

Comment 33

•

25 years ago

I've removed the "blocker" status because it isn't crashing at startup anymore
(we turned off the XPInstall thread). Now it's just a lurking bug waiting to
get us some other time.

shaver-do-not-use-this-account

Comment 34

•

25 years ago

Filed http://developer.redhat.com/bugzilla/show_bug.cgi?id=4011 with Red Hat, in
case they're still fixing problems with 5.x.

Brian Ostrom

Updated

•

25 years ago

Assignee: briano → chofmann

Brian Ostrom

Comment 35

•

25 years ago

I'm clearly not the right owner for this bug.  Reassigning to chofmann.

gafton

Comment 36

•

25 years ago

libdl.so from the glibc 2.1 is incompatible with the glibc 2.0.x (it is safe to
say that all internal ELF handling is changed, data structures are changed,
there is no reason for anybody to spend time on backporting libdl)

At this point in time it seems that our only hope is to add a number of
__libc_lock calls in the libdl code that comes with glibc 2.0.x. However, I am
not in a position where I can test or have the test applications doing this, so
there must be somebody else that goes in deep and starts throwing thread locks
left and right. :-(

Christopher Blizzard (:blizzard)

Assignee

Updated

•

25 years ago

Assignee: chofmann → blizzard

Christopher Blizzard (:blizzard)

Assignee

Comment 37

•

25 years ago

I'll take this bug and try to get it fixed.  I can probably talk RH into releasing an
official 5.2 update for it.

Akkana Peck

Updated

•

25 years ago

Summary: [PP] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup.

Akkana Peck

Comment 38

•

25 years ago

I'm seeing this now, on the 7/29 build, and it does prevent startup (and I'm
hearing the same thing from other 5.2 users), so I'm adding [blocker] back.

lchiang

Updated

•

25 years ago

Summary: [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup.

lchiang

Comment 39

•

25 years ago

I'm modifying the summary to reflect a more current date.

Chris McAfee

Updated

•

25 years ago

Summary: [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup.

Chris McAfee

Comment 40

•

25 years ago

adding RH5.2 in the title.

Christopher Blizzard (:blizzard)

Assignee

Comment 41

•

25 years ago

*** Bug 9292 has been marked as a duplicate of this bug. ***

Chris McAfee

Updated

•

25 years ago

Severity: normal → blocker

Chris McAfee

Comment 42

•

25 years ago

Putting on the blocker radar

Christopher Blizzard (:blizzard)

Assignee

Updated

•

25 years ago

Status: NEW → ASSIGNED

Christopher Blizzard (:blizzard)

Assignee

Comment 43

•

25 years ago

I'm able to reproduce this problem with Red Hat 6.0 and glibc targeted for Red
Hat 6.1.  I'm continuing to look into it.

Alec Flett

Comment 44

•

25 years ago

well, after lots of hacking in NSPR, I'm concerned that we're not actually
going to be able to override dlopen and friends. There are a few problems:
- there is no _dlopen, _dlclose, _dlsym, or _dlerror on linux, which means it's
very difficult to call the "real" versions of these functions
- the way we would call these functions without the _ versions would be to use
dlsym with the RTLD_NEXT flag, but it's kind of hard to call dlsym() when that's
the function you're trying to override.

One interesting fix that seems to make a big difference is to switch to using
RTLD_LAZY in PR_LoadLibrary instead of RTLD_NOW. This reduces the number of
symbol lookups (i.e. dlsym() calls) at load-time to almost nothing, and spreads
the actual dlsym() calls over the lifetime of the app instance, so the chances
of race conditions are much lower. This is not a fix to the problem though,
merely a way of reducing the probability of it occurring.

This basically means changing
#ifdef LINUX
#define  _PR_DLOPEN_FLAGS RTLD_NOW
#else
#define  _PR_DLOPEN_FLAGS RTLD_LAZY
#endif /* LINUX */

to just
#define  _PR_DLOPEN_FLAGS RTLD_LAZY

The problem is that the original #ifdef LINUX was put there to fix a linux
porting issue that directory server was having back in february.

Larry - what do you think about changing prlink.c back the way it was, without
the special case for linux?

Chris McAfee

Updated

•

25 years ago

Whiteboard: fix on the way

Chris McAfee

Comment 45

•

25 years ago

alecf says:
Chris Blizzard has a fix for glibc itself - he's done the fix for the
unreleased redhat 6.1 and is now backporting it to redhat 6.0 and 5.2.

Andreas Otte

Comment 46

•

25 years ago

It's also happening with SUSE 6.1. Is it possible to use RH glibcs when chris
has finished his backport?

Christopher Blizzard (:blizzard)

Assignee

Comment 47

•

25 years ago

RPMS are now available for Red Hat 6.0 and 5.2 for this problem.  Please note
that these ARE NOT OFFICIAL RED HAT RPMS.  They are not signed and pretty much
untested.   You are taking your own life into your hands by installing them.  I
hope you know how to use sash, just in case.  If it helps, I'm running the 6.0
rpms now and haven't seen any problems.

Assuming these work well, they will probably be released as official Red Hat
RPMS.

As for people running SUSE, I don't know what to tell you.  If you install the
Red Hat glibc, your system might be unusable.  I'd bug SUSE about it.

I'd like feedback on this.  Let me know if it works or if it doesn't.

URL:

http://people.redhat.com/blizzard/glibc/

Christopher Blizzard (:blizzard)

Assignee

Comment 48

•

25 years ago

The glibc update for 5.2 doesn't seem to work.  I'm working on trying to sort
it out.  Keep tuned.

Alec Flett

Comment 49

•

25 years ago

the RTLD_LAZY trick doesn't seem to work for me anymore (funny, it did
yesterday!) so that's not an option either.

Alec Flett

Comment 50

•

25 years ago

oh, and just to clarify: I had two NSPR hacks, both of which seem to be
worthless now:
- use RTLD_LAZY like all the other platforms (doesn't seem to fix the problem
now)
- override and wrap dlsym(), dlopen() etc with locks so that they can't be
called simultaneously. This turned out to be a flop too because glibc doesn't do
the common "weak <symbol>/strong _<symbol>" that most other platforms do
(which means if I override dlsym() I can't get back to the original dlsym())

So I think now we're relying on chris' genius.

Gagan

Comment 51

•

25 years ago

*** Bug 10600 has been marked as a duplicate of this bug. ***

Christopher Blizzard (:blizzard)

Assignee

Comment 52

•

25 years ago

I've talked to Ulrich about either making the dl library thread safe or porting
the dl library from 2.1 back to glibc 2.0.  He says there's zero chance of that
happening because of the architecture changes in the two versions.

So, this presents an interesting problem.  Since glibc 2.0 is not thread safe
with regards to dynamic loading, nspr is going to be very hard to set up to work
around the problem.  The problem is that anytime that you want to use dlsym()
you will have to suspend the operation of any threads.  The reason is that even
in normal operation other threads will be resolving symbols via the same
operations that dlsym() will.  They will inevitably step on each other's toes.

So, there are two ways that we can go here.

1. hack nspr to do this
2. don't use pthreads for a glibc 2.0 release and just use userland threads.

Alec Flett

Comment 53

•

25 years ago

ok, so I've learned that hacking NSPR is next to impossible for dlsym() because
dlsym() happens to also be the function that needs to be available to hack NSPR.

More clearly:
- we need to hack NSPR to override dlsym()
- hacking NSPR to override any function requires the _use_ of the function
dlsym()
- dlsym() is not available for use because we'd be overriding it.

Alec Flett

Comment 54

•

25 years ago

userland threads, eh? Wan-Teh or larry, how does one create a user-level thread
in NSPR? I'm confused between the type (USER vs. SYSTEM) and the scope (LOCAL
vs. GLOBAL vs. GLOBAL_UNBOUND). Would type=SYSTEM, scope=LOCAL create a
user-level, NSPR-scheduled thread? I'm experimenting with this now.

I've got another kind of interesting idea brewing too:

The basic problem is that libc is calling non-threadsafe calls when creating a
thread.

We've been trying to solve the problem by protecting those calls. What if we
instead solved the problem by protecting the creation of threads, so that all
threads are suspended between the PR_CreateThread() and the actual kick-off of
the thread function.

so it essentially looks like this:

PR_CreateThread(start_function)
  lock(thread_creation_lock)
  data->func = start_function
  pthread_create(_pt_root, data) /* data contains start_function */

_pt_root(data)
  unlock(thread_creation_lock)
  data->func();   /* this is start_function() */

this will at least synchronize all thread creation, which may be when some of
this symbol resolution is happening.

I'm going to try a userlevel thread first, then the above locking mechanism.

Alec Flett

Comment 55

•

25 years ago

ok, I can't seem to get userlevel threads working:
- using PR_LOCAL_THREAD is the same as PR_GLOBAL_THREAD with pthreads
- compiling NSPR with CLASSIC_NSPR=1 (to force NSPR to use user-level threads)
makes it crash in _PR_CPU_Idle

I'm trying to do the NSPR locking thing I just mentioned but I'm not having much
luck with that because you don't seem to be allowed to release a lock that
another thread has locked. I'm still fiddling with this though.

Alec Flett

Updated

•

25 years ago

Whiteboard: fix on the way → attemping two solutions

srinivas

Comment 56

•

25 years ago

A clarification on the type and scope arguments to PR_CreateThread:

You should almost always set the thread type to USER; the SYSTEM type was meant
primarily for use by JVM.
The default build of NSPR on most Unix platforms supports GLOBAL scope threads
only; each GLOBAL scope thread is a pthread.
When building NSPR using CLASSIC_NSPR=1 you get LOCAL scope (user-level) threads
only.

Are you seeing the crash in _PR_Cpu_Idle() with the checked in version of
NSPR source? If so, do you have a stack trace?
And yes, locks have to be unlocked by the owner of the lock.

Alec Flett

Comment 57

•

25 years ago

Ok, I put locks in PR_CreateThread to serialize thread creation, and then added
those locks to all the dlopen(), dlsym() etc calls... and we STILL have the same
problem.

I'm going to try to make user level threads work now. I'm not sure what the
problem is... userlevel threads work fine in nsprpub/pr/tests/threads.c

Alec Flett

Comment 58

•

25 years ago

Oh, sspitzer had an interesting idea that also didn't work, unfortunately: if
loading the library fails then try again up to 10 times. In our tests it always
worked on the second attempt to load the library, but then the pointer we got
back from the next PR_FindSymbol (for NSFindFactory, etc) was garbage and it
crashes immediately.

Alec Flett

Comment 59

•

25 years ago

Ok, I'm not really having any problems with the stock Redhat 6 install.

I now personally think:
- we should solicit help from the net to get userlevel threads working on redhat
5.2 (aka glibc2)
- we should encourage everyone else to just upgrade to redhat 6.0 (aka glibc2.1)

I think by the time this product is released, RH52 will be old skool. RH6.1 will
probably be released by then anyway and even 6.0 will be old. glibc2.0 is just
too f***ed to waste our time with.

Alec Flett

Updated

•

25 years ago

Whiteboard: attemping two solutions → Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2)

Alec Flett

Comment 60

•

25 years ago

Updating summary to reflect my recommendation.
I know lots of people working just fine on Redhat 6.0 (2.2 kernel, 2.1 glibc)
and on Redhat 5.2 (2.0 kernel, 2.0 glibc)

We can leave it to the net to try and fix this one, though after the sheer
volume of work chris and I have put into this, I'll be surprised if anything
comes out of it.

gsstark

Comment 61

•

25 years ago

Is there any real evidence which glibc function is triggering the dl calls?
I'm skeptical that it was really during pthread_create.

I think the idea of serializing the PR_CreateThread calls is fundamentally the
right solution for glibc 2.0, you just have to find the right function that's
triggering the dl calls.

The only dl calls off the top of my head that it might be would be the
nsswitch stuff. Perhaps you need to serialize the first call to any resolver
functions or something like that? It shouldn't be too hard to do a grep over
the glibc sources to find all the locations that could trigger dl calls from
outside that section, but I haven't got a libc source tree handy.

Alec Flett

Comment 62

•

25 years ago

well, serializing PR_CreateThread calls didn't help anything....
blizzard might have more comments on the glibc thing, but he's spent a lot of
time hacking at it.

Alec Flett

Comment 63

•

25 years ago

*** Bug 4303 has been marked as a duplicate of this bug. ***

old account

Comment 64

•

25 years ago

blizzard: could you let us know if the current glibc in redhat rawhide also
addresses this? (2.1.2-3)

thanks.

Christopher Blizzard (:blizzard)

Assignee

Comment 65

•

25 years ago

rpm -qp --changelog glibc-2.1.2-3.i386.rpm reveals this:

* Mon Aug 02 1999 Cristian Gafton <gafton@redhat.com>

- upgraded snapshot to get the ld.so fixes for thread safety

So the answer is "yes."

leger

Comment 66

•

25 years ago

We are dropping 5.2 support...shall we mark this Resolved/Won't Fix?

Steve Lamm

Comment 67

•

25 years ago

To get a feel for how often this occurs here are 25 runs on my machine:

ggggbggbbggggggbbgbggbggb (g=good, b=bad)

8 load failures out of 25 runs.

Here's a sample of the errors,

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: GetStyleContext__C7nsFramePP15nsIStyleContext

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: Destroy__7nsFrameR14nsIPresContext

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol:
GetNextPrevLineFromeBlockFrame__7nsFrameP15nsIFocusTracker11nsDirectionP8nsIFrameiiPP10nsIContentPiSc

Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED
with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so:
undefined symbol: SetFrameState__7nsFrameUi
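
The tally at the top of this comment can be checked mechanically. A small sketch, with made-up function names, that counts the 'b' entries in such a run log and turns them into a failure rate:

```c
#include <stddef.h>
#include <string.h>

/* Count the 'b' (bad) entries in a run log such as "ggbg". */
size_t count_bad_runs(const char *log)
{
    size_t bad = 0;

    for (; *log != '\0'; log++)
        if (*log == 'b')
            bad++;
    return bad;
}

/* Failure rate as a whole-number percentage, rounded down; 0 for an empty log. */
unsigned failure_percent(const char *log)
{
    size_t total = strlen(log);

    return total ? (unsigned)(count_bad_runs(log) * 100 / total) : 0;
}
```

For the 25-run log above this gives 8 bad runs, a 32% failure rate.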

Alec Flett

Comment 68

•

25 years ago

I think either we should leave this open with no target fix and help wanted, or
close it resolved/wont fix.

Akkana Peck

Comment 69

•

25 years ago

Help Wanted sounds like a good idea for this bug -- maybe some of the glibc
gurus will be willing to look at fixes for other versions of Linux.

ramiro

Comment 70

•

25 years ago

I tried to upgrade the glibc 2.0 from rh 5.2 to glibc 2.1 from rh 6.0.  A bunch
of stuff had to be upgraded as well, including binutils, kernel-headers,
kernel-source and compilers.

Everything seems to be working as expected, except for the debugger.  If you use
the 5.2 debugger, it (the debugger) hangs right after the first thread is
created.

If I use the 6.0 debugger, it (the debugger) core dumps right away.

So this franken-redhat setup might be a workaround for people that don't want to
upgrade to 6.0.

Of course, it's completely at your own risk, because it hasn't been QAed at all.

chris hofmann

Updated

•

25 years ago

Target Milestone: M9 → M10

chris hofmann

Comment 71

•

25 years ago

I've added a note to get this in the M9 release notes
as a high profile item.  Update this bug with
additional notes and comments to go into the M9 and future
release notes.
http://bugzilla.mozilla.org/show_bug.cgi?id=11352

moving the rest of this work to M10 since it's not something
we will address directly on Seamonkey M9.

jdalbec

Comment 72

•

25 years ago

*** Bug 11311 has been marked as a duplicate of this bug. ***

Mo DeJong

Comment 73

•

25 years ago

Hi, all.

I ran into this bug with the CVS tree from today (8/19/99). When I ran
viewer I was getting strange symbol resolving errors on my RedHat 5.2
system. Doing this seemed to fix the problem (for me at least).

LD_BIND_NOW=1 ./apprunner

I ran into this same problem when porting a JNI app to the JDK
port from blackdown. In that case, I also needed to preload
libpthread.so like this before it would work correctly.

LD_PRELOAD=libpthread.so

The guys over at blackdown were a big help in tracking down
this problem, so it might help to have one of your experts
talk to one of their experts. The fellow who wrote the fix
they are using in the JDK can be contacted at this address.

Juergen Kreileder <kreilede@issan.cs.uni-dortmund.de>

He is very friendly and helpful, so I am sure he would not
mind helping the mozilla team resolve this nasty bug.

Christopher Blizzard (:blizzard)

Assignee

Updated

•

25 years ago

Target Milestone: M10 → M11

Christopher Blizzard (:blizzard)

Assignee

Comment 74

•

25 years ago

Moving out to M11

cpratt

Comment 75

•

25 years ago

AFAIK RH 5.x support has been dropped. This does work in 6, right? If so, we
should probably close this out as WONTFIX? Anyone?

Alec Flett

Comment 76

•

25 years ago

I'd like to leave this open as "HELP WANTED"... because it would suck not to
support 5.2, but I don't think we have anyone right now who actually wants to
bother with this.

What's the HELP WANTED procedure? Let's do that.

Mike Shaver (:shaver -- probably not reading bugmail closely)

Updated

•

25 years ago

Summary: [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup. → [PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.

Whiteboard: Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2) → Suggest dropping support for glibc2.0

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 77

•

25 years ago

HELP WANTED-ified, removed [BLOCKER] and updated status field.  (It's not a
kernel issue.)

chris hofmann

Updated

•

25 years ago

Target Milestone: M11 → M20

chris hofmann

Comment 78

•

25 years ago

moving out to m-way-far-away

sujay

Comment 79

•

25 years ago

everyone, I haven't seen apprunner crash on startup in a long
time... shall we close this one out?

Christopher Blizzard (:blizzard)

Assignee

Comment 80

•

25 years ago

You haven't seen the crash on 5.2 in a long time?

Alec Flett

Comment 81

•

25 years ago

there are SOME redhat 5.2 systems that have never seen this problem.
I don't know what it is about them. I'd like to leave this open unless we get
confirmation from more than 5 or 10 people that they aren't seeing this.

olaf

Comment 82

•

25 years ago

A conditional "works for me": I have a glibc 2.0 system (Debian slink, not
everything is RH :-) and it does work _if_ I set LD_BIND_NOW=1 and
LD_PRELOAD=libpthread.so.0. Until recently (about 2 weeks ago) I had to also
preload libgtk and libgdk, but this does not work any more - seems something in
the library dependencies inside mozilla has changed/improved.

I think it should get verified with someone who knows glibc internals (and not
just tells everyone to upgrade) if this fix really avoids the race condition,
and then set in the startup script, depending on library version.
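
A sketch of what the launcher side of that could look like if it were done in C rather than in the startup script. The function name is hypothetical, and the two values are simply the ones reported above; whether they really close the race is exactly the open question.

```c
#include <stdlib.h>

/*
 * Set the two loader variables reported to help on glibc 2.0.
 * They affect programs started afterwards (for example via exec), not the
 * process that sets them. Any existing LD_PRELOAD value is overwritten.
 * Returns 0 on success, -1 on failure.
 */
int set_glibc20_workaround_env(void)
{
    if (setenv("LD_BIND_NOW", "1", 1) != 0)
        return -1;
    if (setenv("LD_PRELOAD", "libpthread.so.0", 1) != 0)
        return -1;
    return 0;
}
```

A wrapper would call this only when it detects glibc 2.0 and then exec the real binary, so that the dynamic linker sees the variables from the first instruction.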

Christopher Blizzard (:blizzard)

Assignee

Comment 83

•

25 years ago

Did anyone ever try preloading glibc 2.1 on one of those systems to see if it
worked?  That might be an acceptable workaround.

Christopher Blizzard (:blizzard)

Assignee

Updated

•

25 years ago

Status: ASSIGNED → RESOLVED

Closed: 25 years ago

Resolution: --- → WONTFIX

Christopher Blizzard (:blizzard)

Assignee

Comment 84

•

25 years ago

There are ideas in this bug report on workarounds but it's not something that we
can actually fix.

leger

Comment 85

•

24 years ago

Marking Verified/Won't Fix.

Status: RESOLVED → VERIFIED

Christopher Blizzard (:blizzard)

Assignee

Comment 86

•

24 years ago

Please ignore the spam.  Changing address.

Assignee: blizzard → blizzard

Status: VERIFIED → NEW

Christopher Blizzard (:blizzard)

Assignee

Comment 87

•

24 years ago

busted when I reassigned

Status: NEW → RESOLVED

Closed: 25 years ago → 24 years ago

Resolution: WONTFIX → FIXED

Christopher Blizzard (:blizzard)

Assignee

Comment 88

•

24 years ago

busted when I reassigned

Status: RESOLVED → VERIFIED

Wan-Teh Chang

Updated

•

24 years ago

Target Milestone: M20 → ---