Closed Bug 99561 Opened 18 years ago Closed 17 years ago

Bad state when quitting on Mac OS X

Categories

(NSPR :: NSPR, defect, P3, critical)

PowerPC
macOS
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sfraser_bugs, Assigned: sfraser_bugs)

References

Details

(Keywords: dataloss, perf)

Attachments

(1 file, 9 obsolete files)

It's possible to get into a very bad state when quitting on Mac OS X, where you
get a continual slew of assertions, and never quit.

This happens because it's possible for Mozilla to go through the quit process
while running on a non-main NSPR thread. The PR_CPU_Idle thread on Mac calls
WaitNextEvent() (which is bad for other reasons), and while in this call, the OS
can send us the Quit AppleEvent that results from the user choosing the OS's
Quit menu item. This causes all of the application teardown code to run on the
PR_CPU_Idle NSPR thread, with distrous consequences.
That's evil that the call to WaitNextEvent() will actually processes an event
when given an event mask of 0. Maybe using EventAvail() would yield to other
processes (the old doc says it calls SystemTask) without processing a high-level
event?
Priority: -- → P2
Target Milestone: --- → 4.2
I'm not sure if what I was able to capture via Sampler is the same as the
previous attachment (mine is quite a bit shorter, but I may have just sampled
for less time), but if this is pointing to a different problem, let's open a new
bug report for it - Mozilla hanging on quit happens to me quite frequently, and
it seems to happen more often if I quit from the Mail/News window than if I quit
from the browser.
> it seems to happen more often if I quit from the Mail/News window than if I
> quit from the browser.

I think this is a different problem. I've noticed this as well. Do you have
"Empty Trash on Exit" checked in the mail server settings? Or, if IMAP, "Clean
up Inbox on Exit?"

I have a fairly flaky connection to my mail server so I unchecked this. Since my
connection is flaky, I didn't consider it to be a bug, and there was a way
around it.
OK - I've unselected these options, I'll see if the problem goes away.  I've
noticed that the way Moz handles IMAP connections is less than ideal (I think
there's already a bug reported about this) ... seems like a relatively simple
thing to determine if a connection has been dropped by the server and to just
rebuild the connection instead of simply trying to use a dead one ... oh well,
all in good time, I'm sure :)
Quitting from the mail/news window may be bug: 
http://bugzilla.mozilla.org/show_bug.cgi?id=97407
See also bug 116431.
This is a dataloss bug (see bug 120666). It's thus critical.
Blocks: 120666
Severity: normal → critical
Keywords: dataloss
Here's the first cut at a patch to use the Multiprocessing APIs to replace the
crufty, and dangerous, WNE in _MD_PauseCPU.  For now the timeout value used
when calling MPWaitOnSemaphore is fixed at 100 milliseconds as the raw timeout
passed to _MD_PauseCPU most certainly isn't in the timebase expected for that
call.

Since I've never been able repro the problem described in this bug, or the
supposedly easier to repro 102666, I'm hoping sfraser can look at this to see
if I'm on the right path and if in fact it makes a difference in the behavior
he and sairuh (se) have seen
This change does very nice things to the pageload times (on my dual 800) on Mac 
OS X.
Before the patch:     ~1680ms
After the patch:      ~1280ms
Keywords: perf
This patch differs from the previous patch in a number of ways:
1. The 'idle semaphore" routines are no longer #ifdeffed in just for carbon.
   They'll be called for both carbon and non-carbon builds.

2. In non-carbon builds, we still call WaitNextEvent() to idle, as now.
However,
   I added a call to WakeUpProcess in SignalIdleSemaphore(), which should have
   some positive impact on performance on 9 (classic anc carbon).

3. I added calls to SignalIdleSemaphore(); from the async file IO callback,
   and ensured that it was called from each of the OT notifiers.

4. I factored the Gestalt() calls into a single RunningOnOSX() method.

5. I changed the order of the termination cleanup calls to be the reverse
   of the init calls (just for cleanliness).

6. I fixed a 'variare used before being initialized' warning in 
   SendReceiveDgram() (using |err| rather than |result|).
Attachment #65578 - Attachment is obsolete: true
This plus the widget drawing patch in bug 107827 gets page load time down to 
1071ms on my dual 800MHz G4 (Mac OS X)
This final patch adds some #ifdef TARGET_CARBON around the MPSemaphore calls,
and adds a call to WakeUpProcess() from the signal routine. This should help
startup performance a little on 9, and does not impact the page load
performance.
Attachment #65711 - Attachment is obsolete: true
Comment on attachment 65716 [details] [diff] [review]
Final NSPR patch, changed to work for non-Carbon builds

r=sdagley
Attachment #65716 - Flags: review+
*** Bug 116431 has been marked as a duplicate of this bug. ***
Status: NEW → ASSIGNED
Whiteboard: waiting for wtc's review and a= from drivers@mozilla.org
Comment on attachment 65716 [details] [diff] [review]
Final NSPR patch, changed to work for non-Carbon builds

a=dbaron for 0.9.8
Attachment #65716 - Flags: approval+
Whiteboard: waiting for wtc's review and a= from drivers@mozilla.org
Comment on attachment 65716 [details] [diff] [review]
Final NSPR patch, changed to work for non-Carbon builds

This patch is fine. However, I insist that you
fix the indentation of the new code in macsockotpt.c
and macthr.c.

It seems that you indent by 2 spaces and set the
tab stop at 2.	In NSPR, we always indent by 4
spaces.  We are not consistent in our tab stop;
some files are 8 and others are 4.

I recommend that you use tab stop 4 for the Mac
NSPR files and indent your code by 4 spaces.

After you fix the indentation in macsockotpt.c
and macthr.c, you can go ahead and check in the
patch.	Please check it into both the tip and the
client branch of NSPR.

Thank you!
Comment on attachment 65716 [details] [diff] [review]
Final NSPR patch, changed to work for non-Carbon builds

Hmm... on second thought

>Index: pr/src/md/mac/macio.c
>===================================================================
>RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macio.c,v
>retrieving revision 3.27.4.3
>diff -b -u -8 -r3.27.4.3 macio.c
>--- pr/src/md/mac/macio.c	10 Dec 2001 22:27:22 -0000	3.27.4.3
>+++ pr/src/md/mac/macio.c	19 Jan 2002 02:35:42 -0000
>@@ -89,16 +89,18 @@
> /* XXX Not done yet for 68K */
> /* I/O completion routne for _MD_READ and _MD_WRITE */
> static void AsyncIOCompletion (ExtendedParamBlock *pbAsyncPtr)
> {
>     _PRCPU *cpu = _PR_MD_CURRENT_CPU();
>     PRThread *thread = pbAsyncPtr->thread;    
>     PRIntn is;
>     
>+    SignalIdleSemaphore();
>+
>     if (_PR_MD_GET_INTSOFF()) {
>         thread->md.missedIONotify = PR_TRUE;
>         cpu->u.missed[cpu->where] |= _PR_MISSED_IO;
>         return;
>     }
> 
>     _PR_INTSOFF(is);
>

It seems that you should signal the idle semaphore
*right before* you return from AsyncIOCompletion().
Similarly for the OT notifier routine.

I need to think about this more.
OK, I think I know why we must signal the idle semaphore
before _PR_INTSOFF(is).  NSPR sets ints off before it
calls _PR_MD_PAUSE_CPU().  Therefore, if you don't signal
the idle semaphore to wake up NSPR, the async I/O completion
routine will block on _PR_INTSOFF(is).  Did I understand
this right?

I would be more comfortable if we could make the
SignalIdleSemaphore() call after the DoneWaitingForThisThread()
call.  This would require WaitOnIdleSemaphore() to set ints
on before the MPWaitOnSemaphore() call and set ints off after
the MPWaitOnSemaphore() call.  Could you give this a try and
see if it works?

I think waiting on a semaphore during _MD_PauseCPU() is the
right direction.  I just wanted to make sure that we wake up
the idle CPU thread after the callback routine has finished
its work.  Also, the way we use the "intsoff" critical region
still has room for improvement, but that's another bug. 
I'm not sure that it matters where SignalIdleSemaphore(); is called in relation 
to disabling/enabling interrupts (which on X implies being in the critical 
section). The semaphore will be signalled either way, but critical section 
behaviour is maintained (i.e. signalling inside the critical section will not 
cause an immediate pthread switch).

What's clear is that we do need to signal the semaphore from the async notify 
routines whether interrupts are currently on or off.

As to whether to signal the semaphore from within the Timer task; I don't think 
this is necessary, and it would hurt performance. The timer task fires every 8ms, 
and it will continue to fire even when we're waiting on the semaphore (since it's 
on its own MPTask/pthread). Signalling it would just cause MD_PauseCPU to just 
wake every 8ms, then wait again.
In general one signals the semaphore *after* having made the
state change the other thread is waiting for.

Take the producer-consumer problem as an example.  The consumer
waits on a semaphore if the shared buffer is empty.  The
producer signals the semaphore *after* adding an item to
the buffer.  If the producer signals the semaphore upon entering
the buffer.put() method, it's wrong.

This is why it doesn't feel right to see a SignalIdleSemaphore()
call at the beginning of AsyncIOCompletion().
Attachment #65716 - Attachment is obsolete: true
Last patch looks good to me, and runs fine.
Comment on attachment 65963 [details] [diff] [review]
Revised patch to address wtc's concerns with the previous one

Steve, this patch is good.  r=wtc.

I'd like you to fix two nits before checking it in.

1. The indentation and tab stop.  In Mac NSPR code,
we indent by 4 and set the tab stop to 4.

2. Our if-else indentation style is usually:
    if (...) {
	...
    } else {
	...
    }
In a few places, you used
    if (...) {
	...
    }
    else
	statement;
Please move 'else' to the same line as the closing brace.

Thank you for addressing my concerns.  I am surprised that
you don't need to leave the critical region before waiting
on the idle semaphore.	But if you never deadlock or see
unexpected delays with this patch, I think it is fine.
Attachment #65963 - Flags: review+
Interrupts are off when we wait on the semaphore. But because the code that 
signals the semaphore runs whether interrupts are on or not, and because 
semaphore signalling works even inside an MP critical section, we always signal 
the semaphore successfully.
Ok, I've changed to tabs in macthr.c where we had 2 space indentation and in 
macsockoptp I've changed:

    else
        DoneWaitingOnThisThread(thread);

to
    else {
        DoneWaitingOnThisThread(thread);
    }

And let me just say I hate braces on the same line as the conditional (my brain 
likes to parse them on their own line :-)  and for the #ifdef TARGET_CARBON 
sections in macthr.c we need them to be on their own line so the #ifdef'ing is 
clean.

I'll land once the tree opens today
Checked in to NSPR tip and NSPRPUB_PRE_4_2_CLIENT_BRANCH
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Reopening. We're seeing some startup hangs because of this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The deadlock happens because the async file IO callback is waiting to enter the
critical region (turning interrupts off), but, since the main thread is also
waiting, this critical region is never exited. I don't know why we didn't see
this before, or why this doesn't happen all the time.
I predicted that!  See the last paragraph of comment #25.
I described this potential deadlock and proposed a solution
in comment #20.  Could you give it a try?
I can only repro this by rebooting my machine (which probably flushes disk caches 
and makes things load more slowely). Even then, when it hangs, the semaphore 
times out, and it wakes up again.
Hmm, on further thought, I don't see why this ever deadlocks. Interrupts are off 
while in MD_PauseCPU. So the AsyncIOCompletion should just go via the
if (_PR_MD_GET_INTSOFF()) path, and the signal should happen anyway. I don't see
why AsyncIOCompletion() is trying to enter the critical region.
Running commercial opt bits shows this bug pretty reliably. 
I think this is a race between the async IO callback, and the code going into 
MD_PauseCPU (these can run concurrently on a dual CPU machine). It seems to be a 
race to turn interrupts off, and may be caused by changing:

    if (_PR_MD_GET_INTSOFF()) {
        ...
        return;
    }

    _PR_INTSOFF(is);

to
    
    if (_PR_MD_GET_INTSOFF()) {
        ...
    } else {
        _PR_INTSOFF(is);


In the short window of time between the _PR_MD_GET_INTSOFF() test and the 
_PR_INTSOFF() call, the main thread is turning interrupts off in _PR_CPU_Idle, 
and then waiting on the semaphore. The async file IO thread then hangs, waiting 
on the interrupts critical section, that is not exited until a semaphore times 
out.

So we either need an atomic 'test and set' for the interrupts flag, or we change 
the logic.

wtc, does this sound plausible?
Simon,

This sounds plausible.

Does this pseudo-code work?

#ifdef CARBON
  if (NOT running on OS X) {
#endif
    if (_PR_MD_GET_INTSOFF()) {
      ...
      return;
    }
#ifdef CARBON
  }
#endif

  PR_INTSOFF(is);

This means that if we are running on OS X, the callback functions
will simply call PR_INTSOFF(is).

Alternatively, we can use a try-lock function to emulate the current
non-thread-safe code.


This trace is another confusing one. It shows PR_UserRunThread being run on a 
native thread that isn't the main thread, which does not happen. Maybe Sampler is 
somehow finding callers of semaphore_timedwait_signal_trap(), and working back up 
NSPR thread stacks?
Back to the code...
The logic in AsyncIOCompletion() (and some other places) was changed so that we 
could call SignalIdleSemaphore() in the case where interrupts was both on and 
off. So we could do:

    if (_PR_MD_GET_INTSOFF()) {
        ...
        SignalIdleSemaphore();
        return;
    }

    _PR_INTSOFF(is);
    ...

    _PR_FAST_INTSON(is);
    SignalIdleSemaphore();

or we could signal the semaphore first, like the original patch  :)
I can think of an explanantion for those weird-looking Sampler traces. Assume 
that we've introduced some synchronization problem that breaks the assumptions of 
the Mac intson/intsoff strategy (as described in the comment in function
DoneWaitingOnThisThread()). Further assume that breaking this assumption causes 
the code to follow code paths (e.g. in PR_Lock()) that allow for NSPR thread 
switching. Suddenly, we're running NSPR threads from OT notifiers and async file 
IO callbacks. Ouch.
Ha, I have proof that this can happen. I added a fatal assertion in PR_Schedule,
to assert that it was only ever called on the main native thread. That got hit
with these stacks:

Thread 0:
 #0   0x7003f4c8 in semaphore_wait_signal_trap
 #1   0x7003f2c8 in _pthread_cond_wait
 #2   0x70250ab0 in TSWaitOnCondition
 #3   0x70283eb0 in TSWaitOnConditionTimedRelative
 #4   0x702ae8b4 in MPWaitOnSemaphore
 #5   0x0053a2f0 in MD_CriticalRegionEnter
 #6   0x0052c144 in EnterCritialRegion
 #7   0x0051f0fc in PR_Sleep
 #8   0x02f9c918 in nsMacMessagePump::DispatchEvent(int, EventRecord *)
 #9   0x02f9c478 in nsMacMessagePump::DoMessagePump(void)
 #10  0x02f9bc00 in nsAppShell::Run(void)
 #11  0x02f25ac8 in nsAppShellService::Run(void)
 #12  0x004d1e94 in main1(int, char **, nsISupports *)
 #13  0x004d4544 in main

Thread 1 Crashed:
 #0   0x0052c2c0 in OnMainTask
 #1   0x050f6408 in 0x50f6408
 #2   0x00525f9c in PR_Schedule
 #3   0x0052b638 in MD_wait
 #4   0x005246f4 in PR_Lock
 #5   0x0052b8a8 in DoneWaitingOnThisThread
 #6   0x0052dc00 in WakeUpNotifiedThread
 #7   0x0052e068 in NotifierRoutine
 #8   0x7055386c in CarbonCallBackSetup
 #9   0x70551118 in ProcessEventQueue
 #10  0x70552fd4 in CarbonDispatchThread
 #11  0x705539a0 in ProcessReadyEvent
 #12  0x70554c0c in CarbonSelectThreadFunc
 #13  0x7002054c in _pthread_body
Status: REOPENED → ASSIGNED
Let it be known that the state in the previous comment was arrived at because of 
the non-blocking sockets bug 121326.
Thinking about this some more, I think the _PR_MD_GET_INTSOFF()/_PR_INTSOFF(is)
race has been there all along. The change now is that we delay for longer before 
the deadlock is broken (by a timeout); before, that was the length of one call to 
WaitNextEvent. Now it's the duration of the MPWaitOnSemaphore timeout.

I also think it's bad to signal the semaphore before setting state in the 
callbacks, since on OS X callbacks don't necessarily run to completion, so the 
NSPR threads could get a chance to run before the async file IO callback is done.

This suggests that we really need an atomic 'test and set' on the interrupts 
flag, but, since this is wrapped in a critical region, it might be hard to 
implement.
A possible solution here might be to live with the race condition, but to just 
user a shorter timeout in MP_PauseCPU, like a max of 200ms.
Keywords: mozilla0.9.8+
Taking, since I'm working on it.
Assignee: sdagley → sfraser
Status: ASSIGNED → NEW
This race condition/stall exists on the 0.9.8 branch, so 0.9.8 builds will 
exhibit the stall on launch on dual CPU machines. I think we need to get a 
workaround in for this.
Status: NEW → ASSIGNED
This workaround for the deadlock resulting from a race to ender the critical
region should help with startup problems on dual CPU machines. It should limit
that deadlock to half a second.
Since we do a PR_IntervalToMilliseconds() on the timeout value passed to 
WaitOnIdleSemaphore() don't we want to set timeout = 
PR_MillisecondsToInterval(500) here?
Attachment #49293 - Attachment is obsolete: true
Attachment #52511 - Attachment is obsolete: true
Attachment #65963 - Attachment is obsolete: true
Attachment #66405 - Attachment is obsolete: true
Attachment #67204 - Attachment is obsolete: true
Comment on attachment 67212 [details] [diff] [review]
Convert the interval to milliseconds and back correcty

r=sdagley
Attachment #67212 - Flags: review+
Comment on attachment 67212 [details] [diff] [review]
Convert the interval to milliseconds and back correcty

r=wtc.

You are only going to check this into the
mozilla 0.9.8 branch, right?
Correct, 0.9.8 branch only.
Patched checked into MOZILLA_0_9_8_BRANCH branch.
<sigh> we need the patch in attachment 67212 [details] [diff] [review] for Mozilla 0.9.9 as well, and I
think on the trunk.
Comment on attachment 67212 [details] [diff] [review]
Convert the interval to milliseconds and back correcty

a=dbaron for getting this on the trunk / 0.9.9
Attachment #67212 - Flags: approval+
Checked in on NSPR tip, Mozilla 0.9.9, and NSPRPUB_PRE_4_2_CLIENT_BRANCH.

I'll leave the bug open for a possible better fix later, since Mac OS X CFM
builds will live on for a while.
Severity: critical → normal
OS: Mac System 8.5 → MacOS X
Priority: P2 → P3
Comment on attachment 67212 [details] [diff] [review]
Convert the interval to milliseconds and back correcty

obsoleting checked in patch so it doesn't look like an approved change still
outstanding.
Attachment #67212 - Attachment is obsolete: true
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Since we're moving to MachO, wontfixing this.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago17 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.