Closed Bug 8849 Opened 25 years ago Closed 25 years ago

[PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.

Categories

(NSPR :: NSPR, defect, P1)

x86
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: laurel, Assigned: blizzard)

References

Details

(Whiteboard: Suggest dropping support for glibc2.0)

1999-06-24-12-m8 Linux rh5.2 The build is crashing on startup. Happening on two of two machines tried within the mailnewsqa group. No stack trace.
Here is the log from Phillip Bond:
here is std out:
WARNING -- -editor is going away, use -edit instead!
width was not set
height was not set
**************************************************
nsComponentManager: Load(/u/phillip/seamonkey/linux/99062412/package/components/librdf.so) FAILED with error: /u/phillip/seamonkey/linux/99062412/package/components/librdf.so: undefined symbol: _._32CompositeArcsInOutEnumeratorImpl
**************************************************
and here is some gdb magic:
Program received signal SIGSEGV, Segmentation fault.
0x80e201b0 in ?? ()
(gdb) bt
#0 0x80e201b0 in ?? ()
#1 0x400733ed in nsComponentManagerImpl::LoadFactory ()
#2 0x400734e6 in nsComponentManagerImpl::FindFactory ()
#3 0x40073694 in nsComponentManagerImpl::CreateInstance ()
#4 0x400768be in nsComponentManager::CreateInstance ()
#5 0x4007703c in nsServiceManagerImpl::GetService ()
#6 0x400775ae in nsServiceManager::GetService ()
#7 0x403fe086 in nsChromeRegistry::InitRegistry ()
#8 0x4021ce1e in nsNetlibService::OpenStream ()
#9 0x40236de4 in nsDocumentBindInfo::Bind ()
#10 0x40236d3e in nsDocumentBindInfo::Bind ()
#11 0x40235d49 in nsDocLoaderImpl::LoadDocument ()
#12 0x4023a56e in nsWebShell::DoLoadURL ()
#13 0x4023a980 in nsWebShell::LoadURL ()
#14 0x4023a321 in nsWebShell::LoadURL ()
#15 0x4001b3c4 in nsWebShellWindow::Initialize ()
#16 0x4001a831 in nsAppShellService::CreateTopLevelWindow ()
#17 0x8051f1b in main ()
adding dp to the cc list, since this is happening in component-stuff.
Assignee: chofmann → waterson
Looks like rdf has an undefined variable. Waterson?
QA Contact: jimmylee
This is not a XPInstall (aka SmartUpdate) issue. Removing myself as QA Contact.
WORKSFORME. Please try removing dist/bin/component.reg and restarting.
This is happening to mostly everyone in QA. I can't imagine folks doing anything different with today's respin builds than they did with previous Linux builds. Anyway, will forward this on.
(QA: component.reg is in the package directory with precompiled builds.) Removing this file didn't fix my crash. It may work for people outside of QA, but a) it still prevents us from doing any testing; and b) there are going to be people "out there" with machines just like ours, too. Well, I need a vacation anyway....
Summary: 1999-06-24-12-m8 Linux build crashes on startup. → [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup.
The interesting thing to find out is what changed between yesterday's and today's builds to cause apprunner not to work for the majority of us running it in QA (and I'm sure others, if they have similar setups to ours, which we've been using since the Seamonkey project began).
Being unable to run on 5.2 is unacceptable; this should be considered a blocking bug, and is going to keep the tree closed tomorrow.
Target Milestone: M8
Putting on the M8 Target Milestone radar. :-)
Priority: P3 → P1
I agree with leaf. In fact, I am pretty surprised that we opened the tree without resolving this. Let us get this fixed. Waterson, I wouldn't have thought components.reg would have anything to do with this. Did you check this on a debug or a release build? Could you resolve this? (Either you convince the testing folks or they convince you. Any middle ground is considered not resolved.)
Waterson (and a small cadre of xhead-types) is working on this diligently as we idly comment on this bug. The verifications passed because they are pseudo-automated, and because they were done on a redhat 6.0 machine.
This happens because of a race condition in the dynamic loader. Two threads are trying to relocate code at the same time and this royally hoses stuff. Thread #1 is running the component manager. Evil Thread #2 comes from normal static linkage from xpinstall. Reassigning to dveditz to deal.
Wow! Awesome catch, waterson. Is exclusive locking of the code that loads an option? I would like to know more about this. How is static code from xpinstall causing a race?
I've taken the liberty of turning off the xpinstall build in mozilla/xpinstall/Makefile.in. I'd prefer if tomorrow's builds were testable by qa while this is getting resolved.
Assignee: dveditz → sgehani
Samir, leaf has turned off XPInstall in Unix again until this is fixed. Sounds like Waterson has more details than are in the bug report.
The problem is that linux's libdl is not thread-safe, so when multiple threads spin up and start relocating DLLs, they bump into each other. This may happen on other platforms as well. I don't think this is really even specific to mozilla; it's just specific to applications that link against many many DLLs and have many threads. One solution we talked about was to override dlopen(), dlsym(), etc. to call PR_LoadLibrary/PR_FindSymbol/etc., then modify PR_LoadLibrary/PR_FindSymbol/etc. to call _dlopen()/_dlsym()/etc. Adding wan-teh to the CC list, because he may have some comments.
Aren't we safe if all our dll loads happen via PR_LoadLibrary()? Is xpinstall doing dlopen() instead of PR_LoadLibrary(), or does dlopen() happen implicitly as a consequence of PR_LoadLibrary() of xpinstall.so?
Linking against _symbol rather than symbol is begging for pain, I think (especially on Linux, where the glibc guys go to ever-increasing pains to hide symbols that aren't part of the API). On the other hand, glibc's threadsafety story is, um, mildly weak, especially pre-2.1. This may be our only choice. =/ I'm going to bug some glibc folks and see if there's not a better way.
I am told by glibc people that glibc 2.0's dl* isn't threadsafe (``doesn't attempt threadsafety'', whee!), but 2.1's is supposed to be. _dlopen doesn't exist in glibc 2.1, so we're going to need some nice autoconf magic to make this work anyway. This is going to be _so_ much fun. If we're racing with ld.so-driven relocations, I'm not sure that overriding dlopen will actually work, but I guess we'll find out. Also, can someone confirm that it doesn't happen on RH6.0 with glibc 2.1?
I would first try adding a lock in PR_LoadLibrary to serialize its dlopen calls. Is it possible to find out the glibc version at run time? Overriding standard library functions is always a pain.
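A minimal sketch of the "add a lock" suggestion above, assuming POSIX threads and the standard dlfcn.h interface. The names serialized_dlopen, serialized_dlsym and loader_lock are hypothetical and are not NSPR functions; the sketch only serializes loader calls that actually go through these wrappers.

```c
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>

/* One process-wide lock for every loader call made through the wrappers
 * below. Hypothetical names; this is not NSPR code. */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;

/* dlopen() with the loader lock held for the duration of the call.
 * The mutex is not recursive, so a library constructor that calls back
 * into these wrappers would deadlock. */
void *serialized_dlopen(const char *path, int flags)
{
    void *handle;
    int rc = pthread_mutex_lock(&loader_lock);

    assert(rc == 0);
    (void)rc;
    handle = dlopen(path, flags);
    pthread_mutex_unlock(&loader_lock);
    return handle;
}

/* dlsym() under the same lock, so lookups and loads cannot overlap. */
void *serialized_dlsym(void *handle, const char *name)
{
    void *symbol;
    int rc = pthread_mutex_lock(&loader_lock);

    assert(rc == 0);
    (void)rc;
    symbol = dlsym(handle, name);
    pthread_mutex_unlock(&loader_lock);
    return symbol;
}
```

A caller would use serialized_dlopen(path, RTLD_LAZY) wherever it previously called dlopen() directly. Symbol resolution that the dynamic linker performs on its own never enters these wrappers, so it is not covered by the lock.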
I agree, but the problem isn't that OUR code is calling dlopen() - it's that libdl (part of glibc, not mozilla) is calling it when it resolves link-time dependencies. Basically what's happening is this (or some minor variation, use your imagination):
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, which in turn calls dlopen() or dlsym() or something.
So if we overload dlopen() to call PR_LoadLibrary (assuming PR_LoadLibrary is threadsafe, or is made to be threadsafe) then it should look something like:
- thread spins up
- thread calls some function in another .so that hasn't been called yet
- libdl tries to do the resolution using dlsym() or something
- we override dlsym() with PR_FindSymbol(), so that gets called instead
- PR_FindSymbol locks some mutex
- PR_FindSymbol calls _dlsym()
- at the same time, XPCOM or some such beast is calling PR_LoadLibrary, but this time it blocks because PR_FindSymbol() is preventing it from continuing.
Assignee: sgehani → wtc
Reassigning to wtc as this appears to be an issue best dealt with by NSPR. Bug #8971 has been entered specific to XPInstall.
Component: XPInstall → NSPR
Product: MailNews → NSPR
Based on Samir's comments, correcting component from mailnews to nspr since this bug is not mail specific.
QA Contact: srinivas
Setting QA Contact
Added srinivas and larryh to the cc list.
If waterson@netscape.com's comments on 06/25/99 00:35 are correct, the root cause of this crash is that libdl in glibc 2.0 is not thread safe. Because not all the dlopen calls are made through NSPR (e.g., as alecf@netscape.com pointed out on 06/26/99 16:48, libdl is also calling dlopen when it resolves link-time dependencies), we cannot fix or work around this problem at the NSPR level. Larry and I have two suggestions:
1. Make all the dl* calls from the main thread.
2. Ask glibc 2.0 maintainers to make the dl* functions thread-safe. Larry has submitted a bug report to bug-glibc@gnu.org.
Overriding dlopen, dlsym, etc., as suggested by alecf@netscape.com on 06/25/99 15:30, may not be feasible. Larry and I examined the glibc 2.0.7 source code. While there is the pair dlopen and _dl_open, there is only dlsym but no _dl_sym. This means it's essentially impossible to wrap dlsym.
I sent the following to bug-glibc@gnu.org
-------- Original Message --------
Subject: bug glibc 2.0.7 dlopen() is not thread-safe
Date: Tue, 29 Jun 1999 17:43:36 -0700
From: Lawrence Hardiman <LarryH@Netscape.COM>
Organization: Netscape Communications Corporation, Mountain View CA, USA
To: bug-glibc@gnu.org
Sorry if this has been reported before, and fixed. I did not know where or how to search a GNU bug database.
Abstract: dlopen() and related functions are not thread safe
Description: glibc 2.0.7 dlopen() may yield unpredictable results when used in a highly threaded environment. Users see various faults and other unpredictable results. glibc 2.1.1 does not exhibit this behavior. Examination of source for dlopen() shows that in glibc 2.1.1 dlopen() and related functions are thread-safe. Is there a variant of glibc 2.0 in which dlopen(), etc. is thread-safe?
See also: http://bugzilla.mozilla.org/show_bug.cgi?id=8849
=============================
... and got the following answer:
-------- Original Message --------
Subject: Re: bug glibc 2.0.7 dlopen() is not thread-safe
Date: 29 Jun 1999 17:46:17 -0700
From: Ulrich Drepper <drepper@cygnus.com>
Reply-To: drepper@cygnus.com (Ulrich Drepper)
To: larryh@netscape.com (Lawrence Hardiman)
CC: bug-glibc@gnu.org
References: <37796838.49B304FD@Netscape.COM>
larryh@netscape.com (Lawrence Hardiman) writes:
> Abstract: dlopen() and related functions are not thread safe
Use glibc 2.1.
--
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------
=======================================
Another suggestion: In the source for glibc (2.0 and 2.1) the code for libdl.so appears to be contained in .../elf/*. This suggests that this library can be built independently of the rest of glibc. Consider building the libdl.so from glibc 2.1 and using it with the rest of glibc 2.0. Packaging a replacement libdl with the browser may be more palatable than requiring complete replacement of glibc. ... Give it a whirl.
I don't think that repackaging libdl.so will be enough: ld.so-driven dlopen stuff may not use the system libdl.so, though I'm sure it uses the same source code. (I imagine a bootstrapping issue there: how does ld.so get the dlopen symbol from libdl.so? It can't use libdl.so, that's for sure!) We _could_ package a replacement ld.so as well and start mozilla with that, if glibc2.1's ld.so is compatible with glibc2.0's. When do dlthings happen in the code after startup? Can we not synchronize them via PR_LoadLibrary locks, once the initial ld.so scramble is done? I guess I'm curious as to what _exactly_ is happening on the various threads when this happens? Does ld.so really do lazy symbol stuff unavoidably? Does LD_BIND_NOW=1 in the initial environment help us (at the cost of some startup time, probably)?
Assignee: wtc → briano
Target Milestone: M8 → M9
try m9. reassign to briano to see if he can come up with a creative solution.
Severity: blocker → normal
Summary: [blocker] [PP] 1999-06-24-12-m8 Linux build crashes on startup. → [PP] 1999-06-24-12-m8 Linux may crash on startup.
I've removed the "blocker" status because it isn't crashing at startup anymore (we turned off the XPInstall thread). Now it's just a lurking bug waiting to get us some other time.
Filed http://developer.redhat.com/bugzilla/show_bug.cgi?id=4011 with Red Hat, in case they're still fixing problems with 5.x.
Assignee: briano → chofmann
I'm clearly not the right owner for this bug. Reassigning to chofmann.
libdl.so from the glibc 2.1 is incompatible with the glibc 2.0.x (it is safe to say that all internal ELF handling is changed, data structures are changed, there is no reason for anybody to spend time on backporting libdl) At this point in time it seems that our only hope is to add a number of __libc_lock calls in the libdl code that comes with glibc 2.0.x. However, I am not in a position where I can test or have the test applications doing this, so there must be somebody else that goes in deep and starts throwing thread locks left and right. :-(
Assignee: chofmann → blizzard
I'll take this bug and try to get it fixed. I can probably talk RH into releasing an official 5.2 update for it.
Summary: [PP] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup.
I'm seeing this now, on the 7/29 build, and it does prevent startup (and I'm hearing the same thing from other 5.2 users), so I'm adding [blocker] back.
Summary: [PP] [BLOCKER] 1999-06-24-12-m8 Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup.
I'm modifying the summary to reflect a more current date.
Summary: [PP] [BLOCKER] 1999-07-29 builds - Linux may crash on startup. → [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup.
adding RH5.2 in the title.
*** Bug 9292 has been marked as a duplicate of this bug. ***
Severity: normal → blocker
Putting on the blocker radar
Status: NEW → ASSIGNED
I'm able to reproduce this problem with Red Hat 6.0 and glibc targeted for Red Hat 6.1. I'm continuing to look into it.
well, after lots of hacking in NSPR, I'm concerned that we're not actually going to be able to override dlopen and friends. There are a few problems:
- there is no _dlopen, _dlclose, _dlsym, or _dlerror on linux, which means it's very difficult to call the "real" versions of these functions
- the way we would call these functions without the _ versions would be to use dlsym with the RTLD_NEXT flag, but it's kind of hard to call dlsym() when that's the function you're trying to override.
One interesting fix that seems to make a big difference is to switch to using RTLD_LAZY in PR_LoadLibrary instead of RTLD_NOW. This reduces the number of symbol lookups (i.e. dlsym() calls) at load-time to almost nothing, and spreads the actual dlsym() calls over the lifetime of the app instance, so the chances of race conditions are much lower. This is not a fix to the problem though, merely a way of reducing the probability of it occurring. This basically means changing
#ifdef LINUX
#define _PR_DLOPEN_FLAGS RTLD_NOW
#else
#define _PR_DLOPEN_FLAGS RTLD_LAZY
#endif /* LINUX */
to just
#define _PR_DLOPEN_FLAGS RTLD_LAZY
The problem is that the original #ifdef LINUX was put there to fix a linux porting issue that directory server was having back in february. Larry - what do you think about changing prlink.c back the way it was, without the special case for linux?
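For readers unfamiliar with the two flags, here is a small sketch of what the proposed change amounts to at the dlopen() level. SKETCH_BIND_EAGERLY, SKETCH_DLOPEN_FLAGS and sketch_load_library are hypothetical stand-ins for illustration only; they are not the real prlink.c code.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical stand-in for the _PR_DLOPEN_FLAGS choice discussed above. */
#ifdef SKETCH_BIND_EAGERLY
#define SKETCH_DLOPEN_FLAGS RTLD_NOW   /* resolve every symbol inside dlopen() */
#else
#define SKETCH_DLOPEN_FLAGS RTLD_LAZY  /* resolve functions on their first call */
#endif

/* Opens a library with whichever binding policy was selected at build time.
 * Passing NULL returns a handle for the main program. */
void *sketch_load_library(const char *path)
{
    assert(SKETCH_DLOPEN_FLAGS == RTLD_LAZY || SKETCH_DLOPEN_FLAGS == RTLD_NOW);
    return dlopen(path, SKETCH_DLOPEN_FLAGS);
}

/* Reports whether this build defers function binding. */
int sketch_binds_lazily(void)
{
    return SKETCH_DLOPEN_FLAGS == RTLD_LAZY;
}
```

With RTLD_NOW an unresolvable symbol makes dlopen() itself fail, while RTLD_LAZY defers function resolution until the first call. That is why the lazy setting moves the lookups away from load time without removing them.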
Whiteboard: fix on the way
alecf says: Chris Blizzard has a fix for glibc itself - he's done the fix for the unreleased redhat 6.1 and is now backporting it to redhat 6.0 and 5.2.
It's also happening with SUSE 6.1. Is it possible to use RH glibcs when chris has finished his backport?
RPMS are now available for Red Hat 6.0 and 5.2 for this problem. Please note that these ARE NOT OFFICIAL RED HAT RPMS. They are not signed and pretty much untested. You are taking your own life into your hands by installing them. I hope you know how to use sash, just in case. If it helps, I'm running the 6.0 rpms now and haven't seen any problems. Assuming these work well, they will probably be released as official Red Hat RPMS. As for people running SUSE, I don't know what to tell you. If you install the Red Hat glibc, your system might be unusable. I'd bug SUSE about it. I'd like feedback on this. Let me know if it works or if it doesn't. URL: http://people.redhat.com/blizzard/glibc/
The glibc update for 5.2 doesn't seem to work. I'm working on trying to sort it out. Keep tuned.
The RTLD_LAZY trick doesn't seem to work for me anymore (funny, it did yesterday!) so that's not an option either.
oh, and just to clarify: I had two NSPR hacks, both of which seem to be worthless now:
- use RTLD_LAZY like all the other platforms (doesn't seem to fix the problem now)
- override and wrap dlsym(), dlopen() etc with locks so that they can't be called simultaneously. This turned out to be a flop too because glibc doesn't do the common "weak <symbol>/strong _<symbol>" that most other platforms do (which means if I override dlsym() I can't get back to the original dlsym())
So I think now we're relying on chris' genius.
*** Bug 10600 has been marked as a duplicate of this bug. ***
I've talked to Ulrich about either making the dl library thread safe or porting the dl library from 2.1 back to glibc 2.0. He says there's zero chance of that happening because of the architecture changes in the two versions. So, this presents an interesting problem. Since glibc 2.0 is not thread safe with regards to dynamic loading, nspr is going to be very hard to set up to work around the problem. The problem is that anytime that you want to use dlsym() you will have to suspend the operation of any threads. The reason is that even in normal operation other threads will be resolving symbols via the same operations that dlsym() will. They will inevitably step on each other's toes. So, there are two ways that we can go here:
1. hack nspr to do this
2. don't use pthreads for a glibc 2.0 release and just use userland threads.
ok, so I've learned that hacking NSPR is next to impossible for dlsym() because dlsym() happens to also be the function that needs to be available to hack NSPR. More clearly:
- we need to hack NSPR to override dlsym()
- hacking NSPR to override any function requires the _use_ of the function dlsym()
- dlsym() is not available for use because we'd be overriding it.
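The circularity is easier to see next to a case where the trick does work. Below is a hedged sketch, assuming a glibc-style dlfcn.h that exposes RTLD_NEXT under _GNU_SOURCE, of a program defining its own dlopen() and reaching the library's version through dlsym(RTLD_NEXT, ...). The lock and the interposed_calls counter are hypothetical additions for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* RTLD_NEXT is a GNU extension in glibc's dlfcn.h */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

typedef void *(*dlopen_fn)(const char *, int);

static pthread_mutex_t interpose_lock = PTHREAD_MUTEX_INITIALIZER;
int interposed_calls; /* how many calls went through the wrapper */

/* This definition takes the place of the library's dlopen() for callers
 * in this program. The mutex is not recursive, so a nested dlopen() from
 * a library constructor would deadlock. */
void *dlopen(const char *path, int flags)
{
    dlopen_fn real_dlopen;
    void *handle;

    pthread_mutex_lock(&interpose_lock);
    /* Finding the next definition of "dlopen" requires a working dlsym().
     * Wrapping dlsym() the same way has nothing left to do this lookup. */
    real_dlopen = (dlopen_fn)dlsym(RTLD_NEXT, "dlopen");
    assert(real_dlopen != NULL);
    interposed_calls++;
    handle = real_dlopen(path, flags);
    pthread_mutex_unlock(&interpose_lock);
    return handle;
}
```

The one call that makes the wrapper possible is dlsym() itself, which is exactly the function that cannot be replaced this way.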
userland threads, eh? Wan-Teh or larry, how does one create a user-level thread in NSPR? I'm confused between the type (USER vs. SYSTEM) and the scope (LOCAL vs. GLOBAL vs. GLOBAL_UNBOUND). Would type=SYSTEM, scope=LOCAL create a user-level, NSPR-scheduled thread? I'm experimenting with this now.
I've got another kind of interesting idea brewing too: The basic problem is that libc is calling non-threadsafe calls when creating a thread. We've been trying to solve the problem by protecting those calls. What if we instead solved the problem by protecting the creation of threads, so that all threads are suspended between the PR_CreateThread() and the actual kick-off of the thread function? So it essentially looks like this:
PR_CreateThread(start_function)
    lock(thread_creation_lock)
    data.func = start_function
    pthread_create(_pt_root, data) /* data contains start_function */
_pt_root(data)
    unlock(thread_creation_lock)
    data->func(); /* this is myfunction() */
This will at least synchronize all thread creation, which may be when some of this symbol resolution is happening. I'm going to try a userlevel thread first, then the above locking mechanism.
ok, I can't seem to get userlevel threads working:
- using PR_LOCAL_THREAD is the same as PR_GLOBAL_THREAD with pthreads
- compiling NSPR with CLASSIC_NSPR=1 (to force NSPR to use user-level threads) makes it crash in _PR_CPU_Idle
I'm trying to do the NSPR locking thing I just mentioned but I'm not having much luck with that because you don't seem to be allowed to release a lock that another thread has locked. I'm still fiddling with this though.
Whiteboard: fix on the way → attemping two solutions
A clarification on the type and scope arguments to PR_CreateThread: You should almost always set the thread type to USER; the SYSTEM type was meant primarily for use by JVM. The default build of NSPR on most Unix platforms supports GLOBAL scope threads only; each GLOBAL scope thread is a pthread. When building NSPR using CLASSIC_NSPR=1 you get LOCAL scope (user-level) threads only. Are you seeing the crash in _PR_Cpu_Idle() with the checked in version of NSPR source? If so, do you have a stack trace? And yes, locks have to be unlocked by the owner of the lock.
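Since a lock has to be released by its owner, the hand-off described earlier (creator locks, new thread unlocks) needs a primitive without ownership. One option is a POSIX semaphore. The sketch below assumes unnamed semaphores are available (as on Linux); every name in it is hypothetical, and it is not how NSPR's _pt_root works. It only illustrates the hand-off, not a fix for the loader race.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

static sem_t creation_gate; /* 1 = free, 0 = a creation is in flight */
static pthread_once_t gate_once = PTHREAD_ONCE_INIT;

static void init_gate(void)
{
    int rc = sem_init(&creation_gate, 0, 1);

    assert(rc == 0);
    (void)rc;
}

struct start_data {
    void *(*func)(void *);
    void *arg;
};

/* Runs in the new thread: reopen the gate, then call the real function. */
static void *thread_root(void *p)
{
    struct start_data data = *(struct start_data *)p;

    free(p);
    sem_post(&creation_gate); /* allowed: a semaphore has no owner */
    return data.func(data.arg);
}

/* Like pthread_create(), but only one creation is in flight at a time.
 * Returns 0 on success or an errno-style code. */
int serialized_thread_create(pthread_t *thread, void *(*func)(void *), void *arg)
{
    struct start_data *data;
    int rc;

    pthread_once(&gate_once, init_gate);
    data = malloc(sizeof *data);
    if (data == NULL)
        return ENOMEM;
    data->func = func;
    data->arg = arg;

    while (sem_wait(&creation_gate) == -1 && errno == EINTR)
        ; /* retry if a signal interrupted the wait */

    rc = pthread_create(thread, NULL, thread_root, data);
    if (rc != 0) { /* no child exists to post, so undo it here */
        sem_post(&creation_gate);
        free(data);
    }
    return rc;
}

/* Tiny worker used only to demonstrate the call. */
void *sketch_worker(void *arg)
{
    *(int *)arg += 1;
    return NULL;
}
```

A caller passes sketch_worker (or any start function) exactly as it would to pthread_create() and joins the thread as usual.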
Ok, I put locks in PR_CreateThread to serialize thread creation, and then added those locks to all the dlopen(), dlsym() etc calls... and we STILL have the same problem. I'm going to try to make user level threads work now. I'm not sure what the problem is... userlevel threads work fine in nsprpub/pr/tests/threads.c
Oh, sspitzer had an interesting idea that also didn't work, unfortunately: if loading the library fails then try again up to 10 times. In our tests it always worked on the second attempt to load the library, but then the pointer we got back from the next PR_FindSymbol (for NSFindFactory, etc) was garbage and it crashes immediately.
Ok, I'm not really having any problems with the stock Redhat 6 install. I now personally think:
- we should solicit help from the net to get userlevel threads working on redhat 5.2 (aka glibc2)
- we should encourage everyone else to just upgrade to redhat 6.0 (aka glibc2.1)
I think by the time this product is released, RH52 will be old skool. RH6.1 will probably be released by then anyway and even 6.0 will be old. glibc2.0 is just too f***ed to waste our time with.
Whiteboard: attemping two solutions → Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2)
Updating summary to reflect my recommendation. I know lots of people working just fine on Redhat 6.0 (2.2 kernel, 2.1 glibc) and on Redhat 5.2 (2.0 kernel, 2.0 glibc). We can leave it to the net to try and fix this one, though after the sheer volume of work chris and I have put into this, I'll be surprised if anything comes out of it.
Is there any real evidence of which glibc function is triggering the dl calls? I'm skeptical that it was really during pthread_create. I think the idea of serializing the PR_CreateThread calls is fundamentally the right solution for glibc 2.0; you just have to find the right function that's triggering the dl calls. The only dl calls off the top of my head that it might be would be the nsswitch stuff; perhaps you need to serialize the first call to any resolver functions or something like that? It shouldn't be too hard to do a grep over the glibc sources to find all the locations that could trigger dl calls from outside that section, but I haven't got a libc source tree handy.
well, serializing PR_CreateThread calls didn't help anything.... blizzard might have more comments on the glibc thing, but he's spent a lot of time hacking at it.
*** Bug 4303 has been marked as a duplicate of this bug. ***
blizzard: could you let us know if the current glibc in redhat rawhide also addresses this? (2.1.2-3) thanks.
rpm -qp --changelog glibc-2.1.2-3.i386.rpm reveals this:
* Mon Aug 02 1999 Cristian Gafton <gafton@redhat.com>
- upgraded snapshot to get the ld.so fixes for thread safety
So the answer is "yes."
We are dropping 5.2 support...shall we mark this Resolved/Won't Fix?
To get a feel for how often this occurs here are 25 runs on my machine:
ggggbggbbggggggbbgbggbggb (g=good, b=bad)
8 load failures out of 25 runs. Here's a sample of the errors:
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so: undefined symbol: GetStyleContext__C7nsFramePP15nsIStyleContext
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so: undefined symbol: Destroy__7nsFrameR14nsIPresContext
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so: undefined symbol: SetFrameState__7nsFrameUi
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so: undefined symbol: GetNextPrevLineFromeBlockFrame__7nsFrameP15nsIFocusTracker11nsDirectionP8nsIFrameiiPP10nsIContentPiSc
Load(/export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so) FAILED with error: /export/slamm/gecko/mozilla/dist/bin/components/libraptorhtml.so: undefined symbol: SetFrameState__7nsFrameUi
I think either we should leave this open with no target fix and help wanted, or close it resolved/wont fix.
Help Wanted sounds like a good idea for this bug -- maybe some of the glibc gurus will be willing to look at fixes for other versions of Linux.
I tried to upgrade the glibc 2.0 from rh 5.2 to glibc 2.1 from rh 6.0. A bunch of stuff had to be upgraded as well, including binutils, kernel-headers, kernel-source and compilers. Everything seems to be working as expected, except for the debugger. If you use the 5.2 debugger, it (the debugger) hangs right after the first thread is created. If I use the 6.0 debugger, it (the debugger) core dumps right away. So this franken-redhat setup might be a workaround for people that don't want to upgrade to 6.0. Of course, it's completely at your own risk, because it hasn't been QAed at all.
Target Milestone: M9 → M10
I've added a note to get this on the M9 release notes as a high profile item. Update this bug with additional notes and comments to go into the M9 and future release notes. http://bugzilla.mozilla.org/show_bug.cgi?id=11352 Moving the rest of this work to M10 since it's not something we will address directly on Seamonkey M9.
*** Bug 11311 has been marked as a duplicate of this bug. ***
Hi, all. I ran into this bug with the CVS tree from today (8/19/99). When I ran viewer I was getting strange symbol resolving errors on my RedHat 5.2 system. Doing this seemed to fix the problem (for me at least):
LD_BIND_NOW=1 ./apprunner
I ran into this same problem when porting a JNI app to the JDK port from blackdown. In that case, I also needed to preload libpthread.so like this before it would work correctly:
LD_PRELOAD=libpthread.so
The guys over at blackdown were a big help in tracking down this problem, so it might help to have one of your experts talk to one of their experts. The fellow who wrote the fix they are using in the JDK can be contacted at this address: Juergen Kreileder <kreilede@issan.cs.uni-dortmund.de>. He is very friendly and helpful, so I am sure he would not mind helping the mozilla team resolve this nasty bug.
Target Milestone: M10 → M11
Moving out to M11
AFAIK RH 5.x support has been dropped. This does work in 6, right? If so, we should probably close this out as WONTFIX? Anyone?
I'd like to leave this open as "HELP WANTED"... because it would suck not to support 5.2, but I don't think we have anyone right now who actually wants to bother with this. What's the HELP WANTED procedure? Let's do that.
Summary: [PP] [BLOCKER] 1999-07-29 - Linux/RH5.2 may crash on startup. → [PP] [HELP WANTED] 1999-07-29 - Linux/RH5.2 may crash on startup.
Whiteboard: Suggest dropping support for [glibc2.0,kernel2.2] (Modified redhat 5.2) → Suggest dropping support for glibc2.0
HELP WANTED-ified, removed [BLOCKER] and updated status field. (It's not a kernel issue.)
Target Milestone: M11 → M20
moving out to m-way-far-away
everyone, I haven't seen apprunner crash on startup in a long time... shall we close this one out?
You haven't seen the crash on 5.2 in a long time?
there are SOME redhat 5.2 systems that have never seen this problem. I don't know what it is about them. I'd like to leave this open unless we get confirmation from more than 5 or 10 people that they aren't seeing this.
A conditional "works for me": I have a glibc 2.0 system (Debian slink, not everything is RH :-) and it does work _if_ I set LD_BIND_NOW=1 and LD_PRELOAD=libpthread.so.0. Until recently (about 2 weeks ago) I had to also preload libgtk and libgdk, but this does not work any more - seems something in the library dependencies inside mozilla has changed/improved. I think it should get verified with someone who knows glibc internals (and not just tells everyone to upgrade) if this fix really avoids the race condition, and then set in the startup script, depending on library version.
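A rough sketch of the "set it at startup, depending on library version" idea, written as helpers for a small C launcher. The function names are hypothetical; the two environment variables and their values are the ones reported in the comments above. How the launcher obtains the version string is left open here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv() under strict C modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 for a glibc version string such as "2.0" or "2.0.7",
 * 0 for anything else, including strings that do not parse. */
int glibc_needs_loader_workaround(const char *version)
{
    int major = 0;
    int minor = 0;

    if (version == NULL || sscanf(version, "%d.%d", &major, &minor) != 2)
        return 0;
    return major == 2 && minor == 0;
}

/* Sets the two loader variables reported to help on glibc 2.0 systems.
 * They only affect programs started afterwards with this environment.
 * This overwrites any existing LD_PRELOAD; a real launcher would
 * probably want to merge with it instead. Returns 0 on success. */
int apply_loader_workaround(void)
{
    if (setenv("LD_BIND_NOW", "1", 1) != 0)
        return -1;
    return setenv("LD_PRELOAD", "libpthread.so.0", 1);
}
```

A launcher would call apply_loader_workaround() when glibc_needs_loader_workaround() returns 1 and then exec the real binary. On glibc 2.1 and later the version string can come from gnu_get_libc_version() in <gnu/libc-version.h>; as far as I know that function is not available on 2.0 itself, so a 2.0 system would need another source for it (for example, a value recorded at install time).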
Did anyone ever try preloading glibc 2.1 on one of those systems to see if it worked? That might be an acceptable workaround.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → WONTFIX
There are ideas in this bug report on workarounds but it's not something that we can actually fix.
Marking Verified/Won't Fix.
Status: RESOLVED → VERIFIED
Please ignore the spam. Changing address.
Assignee: blizzard → blizzard
Status: VERIFIED → NEW
busted when I reassigned
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: WONTFIX → FIXED
busted when I reassigned
Status: RESOLVED → VERIFIED
Target Milestone: M20 → ---