Closed
Bug 52111
Opened 24 years ago
Closed 24 years ago
Sometimes mozilla hangs on startup on SMP systems. Necko SMP race.
Categories
(Core :: Networking, defect, P3)
VERIFIED
FIXED
People
(Reporter: alla, Assigned: warrensomebody)
Details
(Whiteboard: [nsbeta3+])
Attachments
(1 file: patch, 3.03 KB)
I've got an SMP machine (dual PII 450 MHz), and sometimes when I start mozilla it hangs during startup and doesn't show the initial window. This happens quite seldom, and when it does the last printed line is always something like:

  WARNING: not calling OnDataAvailable, file nsAsyncStreamListener.cpp, line 403

This line isn't printed when mozilla starts correctly. I suppose this is some SMP/thread deadlock thing, but I haven't been able to find out more about it. If I attach to the deadlocked mozilla with gdb I can see that all threads are waiting in poll() or pthread_cond_wait().

Here is the full output from a start where it hangs:

  Type Manifest File: /export/alex/mozilla/dist/bin/components/xpti.dat
  nsNativeComponentLoader: autoregistering begins.
  *** Registering nsGfxGTKModule components (all right -- a generic module!)
  nsNativeComponentLoader: autoregistering succeeded
  nNCL: registering deferred (0)
  GFX: dpi=96 t2p=0.0666667 p2t=15 depth=16
  WEBSHELL+ = 1
  Initialized app shell component {18c2f989-b09f-11d2-bcde-00805f0e1353}, rv=0x00000000
  Initialized app shell component {33e569b0-40f8-11d4-9a41-000064657374}, rv=0x00000000
  WEBSHELL+ = 2
  CSSLoaderImpl::LoadAgentSheet: Load of URL 'file:///home/alex/.mozilla/default/ChromeuserChrome.css' failed. Error code: 16389
  CSSLoaderImpl::LoadAgentSheet: Load of URL 'file:///home/alex/.mozilla/default/ChromeuserContent.css' failed. Error code: 16389
  Enabling Quirk StyleSheet
  Note: verifyreflow is disabled
  Note: styleverifytree is disabled
  Note: frameverifytree is disabled
  Enabling Quirk StyleSheet
  WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
  JavaScript strict warning: chrome://communicator/content/bookmarks/bookmarks.js line 954: redeclaration of var cmd
  WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
  JavaScript strict warning: chrome://communicator/content/bookmarks/bookmarks.js line 956: redeclaration of var cmdResource
  WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
  JavaScript strict warning: chrome://navigator/content/navigator.js line 1230: function readFromClipboard does not always return a value
  WARNING: not calling OnDataAvailable, file nsAsyncStreamListener.cpp, line 403
Flaws with that version have been reported. The first recent security upgrade (2.1.3-19) made it even worse: the error was no longer random but almost guaranteed. A new glibc upgrade was issued after this: 2.1.3-21.

For some info about the bug that is now fixed, see: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=17203
Summary: "glibc-2.1.3-19 breaks Sun and IBM Java 1.3 on SMP" (non-SMP systems were also reported to be affected, as well as Mozilla). I believe this might be the same thread bug you are encountering.

You can find the errata at http://www.redhat.com/support/errata/RHSA-2000-057-04.html
Please note that, unless you already have, you also need to upgrade rpm to version 3.0.5-9: http://www.redhat.com/support/errata/RHEA-2000-051-01.html

Please report back.
Well:

  lrwxrwxrwx 1 root root 18 Aug 11 16:41 /boot/vmlinuz -> vmlinuz-2.2.14-5.0

But the kernel used (specified in lilo.conf) is vmlinuz-2.2.14-5.0smp. I've told the sysadmin to install the new glibc; I will report if I get any lockups after it is installed.
Upgrading glibc to 2.1.3-21 didn't help. I still get the lockup sometimes.
Uh oh. I have a hunch that an upgrade of glibc is one of those things that requires a reboot, but that was likely performed. More questions:
- Mozilla build ID? (Or the date it was pulled from CVS, if so.)
- Does the error occur with a clean profile? (Back up ~/.mozilla and delete all content there before starting mozilla. Check that no ghost mozilla processes are hanging around before you start.)
You most certainly do *not* have to reboot when you've upgraded libc. And I haven't. I pulled sometime yesterday, which would be early 2000-09-13, but I have seen this for a long time, and I don't have any reason to believe any of the recent checkins have made this better. It is pretty hard to test whether I get it with a clean profile. I develop mozilla full time all day, and I get this lockup maybe once a day.
I'm working on collecting "suspect" bugs related to glibc bugs, but it's a messy task. I suspect this might be one of them, however. To give a few more hints:

There are several "intermittent non-starter" bugs in bugzilla. In some there are user comments obviously originating from the glibc bug, even in cases where the original bug is assumed to be a native Mozilla bug (bug 51267, bug 52390).

In bug 41414 Daniel Egger from SuSE mentions that an upgrade to CVS glibc from early August made the bug go away on a non-SMP system.

Bug 21556, another SMP bug but on SUN, was marked "works for me" but notes that a few problems remain. The last stack trace there shows a thread bug on an earlier glibc system and on SUN, which again indicates a Mozilla bug at the time.

In bug 51212, another SMP hang bug, the reporter indicates the bug went away after reverting from rawhide glibc-2.1.92-5 to glibc-2.1.91. However, the same reporter concludes that the bug did not affect a glibc 2.1.3 SMP system (minor version unknown).

One sure thing is that glibc 2.1.3-19 frequently broke a full dozen applications, including Mozilla. Here is a list of some bugs I saw using kernel 2.2.16-3 and glibc 2.1.3-19 on a single-CPU system. They all went away again, for good, after upgrading to glibc 2.1.3-21: bug 41414, bug 50444, bug 51330, bug 51164, bug 51482, bug 51892.

Reporter: It would be useful to know what kind of Mozilla build is involved. Nightly? CVS? Debug or optimized? Are you testing new builds regularly? For how long have you been seeing the bug? (Was there a time when it always used to work, and then went bad after that time?) Do you use a fresh profile (no previous ~/.mozilla dir when starting)? Do you install as root? If yes: do you instantiate the chrome dirs/registry files by running once as root first?
Comment 10 • 24 years ago (Reporter)
I don't think this is glibc related. I've read the bugs you referred to, and they all talk about crashes or spinning at 100% CPU. The symptoms I see are much more like a classical thread race. Having attached to a hung mozilla with gdb and looked at the code in question, I think the problem is in the handling of I/O events. Somehow the target of an OnDataAvailable "event" is destroyed before getting its event, leading to the event being dropped on the floor and mozilla not progressing. This is the state I find when attaching to the process: all threads are sleeping on normal things. In fact, one time I saw this when opening the file requester (File->Open File). The same warning was printed, and the file requester never appeared, but everything else worked as normal.

FYI: I usually run a debug build from CVS (updated a few times a day), compiled with debug and optimize. Not a fresh profile, not installed, not root. I unfortunately don't remember when I first saw the bug, but it was a long time ago. I just never bothered to report it until now.
Comment 11 • 24 years ago
You should *definitely* try it with a fresh profile! Pack up the old one, move it out of the way, and tell us how that went.
Comment 12 • 24 years ago (Reporter)
Comment 13 • 24 years ago (Reporter)
I finally got tired of this bug and spent some cycles debugging it. The root cause seems to be a race between nsFileTransport.cpp and nsAsyncStreamListener.cpp. nsFileTransport does stuff like this in several places:

    mStatus = mOutputStream->WriteFrom(mSource, mTransferAmount, &writeAmt);
    if (mStatus == NS_BASE_STREAM_WOULD_BLOCK) {
        mStatus = NS_OK;
        return;
    }

This can seemingly run in parallel with nsOnDataAvailableEvent::HandleEvent(), which calls mChannel->GetStatus(). If, due to bad voodoo, the GetStatus() call happens between the WriteFrom() call and the mStatus = NS_OK statement, the event will be ignored and lost, leading to no initial window. This could hit on non-SMP too, but the probability is very low.

The patch fixes this by storing the status result in a temporary variable, so that mStatus is never set to NS_BASE_STREAM_WOULD_BLOCK. I've run the patch for quite a while and I haven't been able to provoke a hang.

Nominating for beta3 since it fixes a problem that has bothered me for quite a while, and the fix is small and "obviously correct" (it is obvious that we don't get any worse behaviour, but possibly not obvious that the race is fixed). I've added warren to the cc-list, because he seems to own nsFileTransport.
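For readers following along, the race and the temporary-variable fix described above can be modelled in a small, single-threaded C++ sketch. This is not the attached patch: `Transport`, `afterWrite`, `ListenerWouldDeliver`, `DeliveredDuring` and the status values are all hypothetical stand-ins, and the `afterWrite` hook simply forces the listener's status check to happen at the worst possible moment so the outcome is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-ins for NS_OK and NS_BASE_STREAM_WOULD_BLOCK;
// the numeric values are illustrative only.
enum Status { kOk = 0, kWouldBlock = 1 };

struct Transport {
    std::atomic<int> mStatus{kOk};     // status visible to other threads
    std::function<void()> afterWrite;  // models the listener running mid-Process

    // Pretend the output stream is full and the write would block.
    int WriteFrom() { return kWouldBlock; }

    // Original pattern: the shared status briefly holds WOULD_BLOCK.
    void ProcessRacy() {
        mStatus = WriteFrom();
        if (afterWrite) afterWrite();  // listener may observe WOULD_BLOCK here
        if (mStatus == kWouldBlock) {
            mStatus = kOk;
            return;
        }
    }

    // Patched pattern: the transient result stays in a local variable.
    void ProcessFixed() {
        int rv = WriteFrom();
        if (afterWrite) afterWrite();  // listener still sees the old status
        if (rv == kWouldBlock) {
            return;                    // mStatus was never touched
        }
        mStatus = rv;
    }
};

// Models the check in nsOnDataAvailableEvent::HandleEvent(): the event is
// only delivered when the channel status is OK (an assumption of this sketch).
bool ListenerWouldDeliver(const Transport& t) {
    return t.mStatus.load() == kOk;
}

// Runs one Process variant with the listener check injected mid-call and
// reports whether the event would have been delivered.
bool DeliveredDuring(void (Transport::*process)()) {
    Transport t;
    bool delivered = false;
    t.afterWrite = [&] { delivered = ListenerWouldDeliver(t); };
    (t.*process)();
    return delivered;
}
```

Calling `DeliveredDuring(&Transport::ProcessRacy)` models the lost event, while `DeliveredDuring(&Transport::ProcessFixed)` models the patched behaviour. In the real code the interleaving depends on thread scheduling, which is why the hang only showed up occasionally.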
Comment 16 • 24 years ago (Assignee)
This is not obviously correct to me. Why is it that just delaying setting the status is good enough to guarantee that OnStartRequest/OnDataAvailable always fire? Shouldn't we be prepared to deal with the case where they don't fire? Maybe we should get together on this one to talk it through. Adding nsbeta3+
Whiteboard: [nsbeta3+]
Comment 17 • 24 years ago (Reporter)
The idea is not to delay the setting of the status, but to avoid setting the status to an error (even for a short while) when it is in fact OK (not an error) to get WOULD_BLOCK at this point. I'm not into necko, but I think that in the case of a "real" error stuff might be handled correctly. In this case, though, a "false" error causes the event not to be sent, while the I/O request returns no error, giving no clue that something bad has happened.
Comment 18 • 24 years ago (Assignee)
Ok, I understand now. It used to be that WOULD_BLOCK was a success code, but it changed somewhere along the way. r=warren
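The success-versus-failure distinction mentioned here can be sketched as follows. XPCOM result codes use the high (severity) bit to mark failure, which is what the NS_FAILED and NS_SUCCEEDED macros test; everything else in this sketch is an assumption. `Result`, `Failed`, `Succeeded` and the two WOULD_BLOCK constants are hypothetical names with illustrative values, not copied from the Mozilla headers of the time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for nsresult.
using Result = std::uint32_t;

// The high bit marks a failure code, mirroring what NS_FAILED tests.
constexpr Result kSeverityBit = 0x80000000u;

constexpr bool Failed(Result rv) { return (rv & kSeverityBit) != 0; }
constexpr bool Succeeded(Result rv) { return !Failed(rv); }

// Illustrative values only: the same "would block" condition encoded
// once as a success code and once as a failure code.
constexpr Result kResultOk = 0;
constexpr Result kWouldBlockAsSuccess = 0x00470007u;
constexpr Result kWouldBlockAsFailure = 0x80470007u;

// A listener that only delivers on success keeps working while WOULD_BLOCK
// is a success code, and silently stops once it becomes a failure code.
constexpr bool ListenerDelivers(Result channelStatus) {
    return Succeeded(channelStatus);
}

static_assert(ListenerDelivers(kResultOk), "OK status delivers");
```

Under these assumptions, `ListenerDelivers(kWouldBlockAsSuccess)` is true and `ListenerDelivers(kWouldBlockAsFailure)` is false, which is one way a change in the code's classification could turn a harmless transient status into a dropped event.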
Comment 19 • 24 years ago (Assignee)
P.S. Let's put a comment in the Process method in the 2 places you do this that lets people know what's going on here, e.g. "don't let mStatus temporarily get set to WOULD_BLOCK because this might get picked up by the async stream listener..." or something like that.
Comment 20 • 24 years ago (Assignee)
Checked this in. Thanks Alex.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 22 • 15 years ago
Hello, I am investigating concurrency bugs and am very interested in this bug 52111. Which CVS branch should I check out to get the source code of this buggy version? Would gcc 3.4 be able to compile this version of the mozilla suite? Thank you very much, Wei