Closed Bug 52111 Opened 24 years ago Closed 24 years ago

Sometimes mozilla hangs on startup on SMP systems. Necko SMP race.

Categories

(Core :: Networking, defect, P3)

x86
Linux
defect

VERIFIED FIXED

People

(Reporter: alla, Assigned: warrensomebody)

Details

(Whiteboard: [nsbeta3+])

Attachments

(1 file)

I've got an SMP machine (dual PII 450 MHz), and sometimes when I start mozilla it
hangs during startup and doesn't show the initial window. This happens quite
seldom, and when it does the last printed line is always something like:
WARNING: not calling OnDataAvailable, file nsAsyncStreamListener.cpp, line 403

This line isn't printed when mozilla starts correctly.

I suppose this is some SMP/thread deadlock thing, but I haven't been able to
find out more about it. If I attach to the deadlocked mozilla with gdb I can see
that all threads are waiting in poll() or pthread_cond_wait().

Here is the full output from a start where it hangs:
Type Manifest File: /export/alex/mozilla/dist/bin/components/xpti.dat
nsNativeComponentLoader: autoregistering begins.
*** Registering nsGfxGTKModule components (all right -- a generic module!)
nsNativeComponentLoader: autoregistering succeeded
nNCL: registering deferred (0)
GFX: dpi=96 t2p=0.0666667 p2t=15 depth=16
WEBSHELL+ = 1
Initialized app shell component {18c2f989-b09f-11d2-bcde-00805f0e1353},
rv=0x00000000
Initialized app shell component {33e569b0-40f8-11d4-9a41-000064657374},
rv=0x00000000
WEBSHELL+ = 2
CSSLoaderImpl::LoadAgentSheet: Load of URL
'file:///home/alex/.mozilla/default/ChromeuserChrome.css' failed.  Error code:
16389
CSSLoaderImpl::LoadAgentSheet: Load of URL
'file:///home/alex/.mozilla/default/ChromeuserContent.css' failed.  Error code:
16389
Enabling Quirk StyleSheet
Note: verifyreflow is disabled
Note: styleverifytree is disabled
Note: frameverifytree is disabled
Enabling Quirk StyleSheet
WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
JavaScript strict warning: 
chrome://communicator/content/bookmarks/bookmarks.js line 954: redeclaration of
var cmd

WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
JavaScript strict warning: 
chrome://communicator/content/bookmarks/bookmarks.js line 956: redeclaration of
var cmdResource

WARNING: waaah!, file nsXULPrototypeDocument.cpp, line 523
JavaScript strict warning: 
chrome://navigator/content/navigator.js line 1230: function readFromClipboard
does not always return a value

WARNING: not calling OnDataAvailable, file nsAsyncStreamListener.cpp, line 403
Which version of glibc do you use?
I use the one in RH 6.2, 2.1.3-15.
Flaws with that version have been reported. The first recent security upgrade
(2.1.3-19) made it even worse: the error wasn't random anymore but almost
guaranteed. A new upgrade of glibc was issued after this: 2.1.3-21

For some info about the bug that is now fixed see:
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=17203
Summary: "glibc-2.1.3-19 breaks Sun and IBM Java 1.3 on SMP" (NON-SMP also
commented to be affected, as well as Mozilla.)
I believe this might just be the same thread bug you encounter.

You'll find the errata at
http://www.redhat.com/support/errata/RHSA-2000-057-04.html

Please note that, unless you already have, you also need to upgrade rpm to
version 3.0.5-9:
http://www.redhat.com/support/errata/RHEA-2000-051-01.html

Please report back.
Would you also please paste the output of
ls -la /boot/vmlinuz
Well:
lrwxrwxrwx    1 root     root           18 Aug 11 16:41 /boot/vmlinuz ->
vmlinuz-2.2.14-5.0

But the kernel used (specified in lilo.conf) is vmlinuz-2.2.14-5.0smp.

I've told the sysadm to install the new glibc. I will report if I get any
lockups after it is installed.
Upgrading glibc to 2.1.3-21 didn't help.
I still get the lockup sometimes.
Uh oh. I have a hunch an upgrade of glibc is one of those things that require a
reboot, but that was likely performed.

More questions:

-Mozilla build ID? (or which date it's pulled from CVS if so)

-Does the error occur with a clean profile? (Back up ~/.mozilla and delete all
content there before starting mozilla. Check that no ghost mozilla processes
are hanging around before you start.)
You most certainly do *not* have to reboot when you've upgraded libc. And I
haven't.

I pulled sometime yesterday, which would be early 2000-09-13, but I have seen
this for a long time, and I don't have any reason to believe any of the recent
checkins have made this better.

It is pretty hard to test if I get it with a clean profile. I develop mozilla
full time all day, and I get this lockup maybe once a day.
I'm working on collecting "suspect" bugs related to glibc bugs, but it's a messy
task. I suspect this might be one of them, however.

To give a few more hints:
There are several "intermittent non starter" bugs in bugzilla. In some there are
user comments obviously originating from the glibc bug, even in cases where the
original bug is assumed to be a native Mozilla bug. (bug 51267, bug 52390)

In bug 41414 Daniel Egger from SuSE mentions that an upgrade to CVS of glibc
from early August made the bug go away on a non-SMP system.

Bug 21556, another SMP bug but on SUN, was marked "works for me" but says there
are still a few problems left. The last stacktrace there shows a thread bug on
an earlier glibc system and on SUN, which again indicates a mozilla bug at the
time.
In bug 51212, another SMP hang bug, the reporter indicates the bug went away
after reverting from rawhide glibc-2.1.92-5 to glibc-2.1.91. However, the same
reporter concludes that the bug did not affect a glibc 2.1.3 SMP system (minor
version unknown).

One sure thing is that glibc 2.1.3-19 frequently broke a full dozen
applications, including Mozilla.
Here is a list of some bugs I saw using kernel 2.2.16-3 and glibc 2.1.3-19 on a
single CPU system. They all went away again, for good, after upgrading to
glibc 2.1.3-21: bug 41414, bug 50444, bug 51330, bug 51164, bug 51482, bug 51892

Reporter: It would be useful to know what kind of Mozilla build is involved.
Nightly? CVS? Debug or optimized?
Are you testing new builds regularly?
For how long have you been seeing the bug?
(Was there a time when it always used to work, and then went bad after that
time?)
Do you use a fresh profile? (no previous ~/.mozilla dir when starting)
Do you install as root? If yes: Do you instantiate chrome/dir's/registry files
by running once as root first?
I don't think this is glibc related. I've read the bugs you referred to, and
they all talk about crashes or spinning using 100% cpu. The symptoms I see are
much more like a classical thread race. Having attached to a hung mozilla with
gdb and looked at the code in question, I think the problem is in the handling
of io events. Somehow the target of an OnDataAvailable "event" is destroyed
before getting its event, leading to the event being dropped on the floor and
mozilla not progressing. This is the state I find when attaching to the
process. All threads are sleeping on normal things.

In fact, one time I saw this on opening the file requester (file->open file).
The same warning was printed, and the file requester never appeared. But
everything else worked as normal.

FYI: I usually run a debug build from cvs (updated a few times a day), compiled
with debug and optimize. Not a fresh profile, not installed, not root. I
unfortunately don't remember when I first saw the bug, but it was a long time
ago. I just never bothered to report it until now.
You should *definitely* try it with a fresh profile!
Pack the old one down, move it out of the way, and tell how that went.
I finally got tired of this bug and spent some cycles debugging it. The root
cause seems to be a race between nsFileTransport.cpp and
nsAsyncStreamListener.cpp.

nsFileTransport does stuff like this in several places:
mStatus = mOutputStream->WriteFrom(mSource, mTransferAmount, &writeAmt);
if (mStatus == NS_BASE_STREAM_WOULD_BLOCK) {
   mStatus = NS_OK;
   return;
}

This can seemingly run in parallel with nsOnDataAvailableEvent::HandleEvent(),
which calls mChannel->GetStatus(). If, due to bad voodoo, the GetStatus() call
happens between the WriteFrom() call and the mStatus = NS_OK statement, it sees
NS_BASE_STREAM_WOULD_BLOCK and the event will be ignored and lost, leading to
no initial window. This could hit on non-SMP too, but the probability is very
low.

The patch fixes this by storing the status result in a temporary variable, so
that mStatus is never set to NS_BASE_STREAM_WOULD_BLOCK. I've run the patch for
quite a while and I haven't been able to provoke a hang.
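
For readers without the attachment, here is a minimal sketch of the pattern,
not the actual Necko code. FakeTransport, Status, raceWindow and the helper
functions are all hypothetical names made up for illustration. The "other
thread" is simulated by a callback that runs right after the write returns, so
the bad interleaving is reproduced deterministically instead of by luck.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the real status codes (assumption, not Necko).
enum class Status { Ok, WouldBlock };

struct FakeTransport {
    Status mStatus = Status::Ok;

    // Runs right after the write returns. It stands in for the other
    // thread reading the status at the worst possible moment.
    std::function<void()> raceWindow;

    // Pretend the output stream is full, so the write would block.
    Status WriteFrom() { return Status::WouldBlock; }

    // Old pattern: mStatus briefly holds WouldBlock before being reset.
    void ProcessBuggy() {
        mStatus = WriteFrom();
        if (raceWindow) raceWindow();
        if (mStatus == Status::WouldBlock) {
            mStatus = Status::Ok;
            return;
        }
    }

    // Patched pattern: the result goes into a local variable first.
    void ProcessFixed() {
        Status rv = WriteFrom();
        if (raceWindow) raceWindow();
        if (rv == Status::WouldBlock) {
            return;  // mStatus was never set to WouldBlock
        }
        mStatus = rv;
    }
};

// Mimics the listener side: the event is only delivered while the
// transport status looks fine.
inline bool WouldDeliver(const FakeTransport& t) {
    return t.mStatus == Status::Ok;
}

// Returns whether the event survives when the listener checks the
// status inside the race window of the old code.
inline bool EventDeliveredWithBuggyProcess() {
    FakeTransport t;
    bool delivered = false;
    t.raceWindow = [&] { delivered = WouldDeliver(t); };
    t.ProcessBuggy();
    return delivered;
}

// Same check against the patched code.
inline bool EventDeliveredWithFixedProcess() {
    FakeTransport t;
    bool delivered = false;
    t.raceWindow = [&] { delivered = WouldDeliver(t); };
    t.ProcessFixed();
    return delivered;
}

// Small usage example: the old pattern drops the event, the new one keeps it.
inline void SelfCheck() {
    assert(!EventDeliveredWithBuggyProcess());
    assert(EventDeliveredWithFixedProcess());
}
```

The sketch only shows that the transient status is no longer visible to a
concurrent reader. It says nothing about whether other races remain.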

Nominating for beta3 since it fixes a problem that has bothered me for quite a
while, and the fix is small and "obviously correct" (it is obvious that we don't
get any worse behaviour, but possibly not obvious that the race is fixed).

I've added warren to the cc-list, because he seems to own nsFileTransport.
Component: Browser-General → Networking
Keywords: nsbeta3, patch
Summary: Sometimes mozilla hangs on startup on SMP systems → Sometimes mozilla hangs on startup on SMP systems. Necko SMP race.
updating owners.
Assignee: asa → gagan
QA Contact: doronr → tever
warren is the right reviewer for this. 
Assignee: gagan → warren
This is not obviously correct to me. Why is it that just delaying setting the 
status is good enough to guarantee that OnStartRequest/OnDataAvailable always 
fire? Shouldn't we be prepared to deal with the case where they don't fire?

Maybe we should get together on this one to talk it through.

Adding nsbeta3+
Whiteboard: [nsbeta3+]
The idea is not to delay the setting of the status, but to not set status to an
error (for a short while), when in fact it is ok (not an error) at this point to
get WOULD_BLOCK.

I'm not into necko, but I think in case of a "real" error stuff might be handled
correctly. But in this case a "false" error keeps the event from being sent,
while the io request returns no error, giving no clue that something bad has
happened.
Ok, I understand now. It used to be that WOULD_BLOCK was a success code, but it 
changed somewhere along the way. r=warren
P.S. Let's put a comment in the Process method in the 2 places you do this that 
lets people know what's going on here, e.g. "don't let mStatus temporarily get 
set to WOULD_BLOCK because this might get picked up by the async stream 
listener..." or something like that.
Checked this in. Thanks Alex.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
verified
Status: RESOLVED → VERIFIED
Hello,

I am doing some investigation of concurrency bugs, and I am very interested in this bug 52111. I am wondering which CVS branch I should check out to get the source code of this buggy version. Would gcc 3.4 be able to compile this version of the mozilla suite?

Thank you very much,

Wei