Closed
Bug 99561
Opened 23 years ago
Closed 22 years ago
Bad state when quitting on Mac OS X
Categories
(NSPR :: NSPR, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
4.2
People
(Reporter: sfraser_bugs, Assigned: sfraser_bugs)
References
Details
(Keywords: dataloss, perf)
Attachments
(1 file, 9 obsolete files)
1.87 KB,
text/plain
|
Details |
It's possible to get into a very bad state when quitting on Mac OS X, where you get a continual slew of assertions, and never quit. This happens because it's possible for Mozilla to go through the quit process while running on a non-main NSPR thread. The PR_CPU_Idle thread on Mac calls WaitNextEvent() (which is bad for other reasons), and while in this call, the OS can send us the Quit AppleEvent that results from the user choosing the OS's Quit menu item. This causes all of the application teardown code to run on the PR_CPU_Idle NSPR thread, with distrous consequences.
Assignee | ||
Comment 1•23 years ago
|
||
Comment 2•23 years ago
|
||
That's evil that the call to WaitNextEvent() will actually processes an event when given an event mask of 0. Maybe using EventAvail() would yield to other processes (the old doc says it calls SystemTask) without processing a high-level event?
Updated•23 years ago
|
Priority: -- → P2
Target Milestone: --- → 4.2
Comment 3•23 years ago
|
||
Comment 4•23 years ago
|
||
I'm not sure if what I was able to capture via Sampler is the same as the previous attachment (mine is quite a bit shorter, but I may have just sampled for less time), but if this is pointing to a different problem, let's open a new bug report for it - Mozilla hanging on quit happens to me quite frequently, and it seems to happen more often if I quit from the Mail/News window than if I quit from the browser.
Comment 5•23 years ago
|
||
> it seems to happen more often if I quit from the Mail/News window than if I
> quit from the browser.
I think this is a different problem. I've noticed this as well. Do you have
"Empty Trash on Exit" checked in the mail server settings? Or, if IMAP, "Clean
up Inbox on Exit?"
I have a fairly flaky connection to my mail server so I unchecked this. Since my
connection is flaky, I didn't consider it to be a bug, and there was a way
around it.
Comment 6•23 years ago
|
||
OK - I've unselected these options, I'll see if the problem goes away. I've noticed that the way Moz handles IMAP connections is less than ideal (I think there's already a bug reported about this) ... seems like a relatively simple thing to determine if a connection has been dropped by the server and to just rebuild the connection instead of simply trying to use a dead one ... oh well, all in good time, I'm sure :)
Quitting from the mail/news window may be bug: http://bugzilla.mozilla.org/show_bug.cgi?id=97407
Assignee | ||
Comment 8•23 years ago
|
||
See also bug 116431.
Assignee | ||
Comment 9•23 years ago
|
||
This is a dataloss bug (see bug 120666). It's thus critical.
Comment 10•23 years ago
|
||
Here's the first cut at a patch to use the Multiprocessing APIs to replace the crufty, and dangerous, WNE in _MD_PauseCPU. For now the timeout value used when calling MPWaitOnSemaphore is fixed at 100 milliseconds as the raw timeout passed to _MD_PauseCPU most certainly isn't in the timebase expected for that call. Since I've never been able repro the problem described in this bug, or the supposedly easier to repro 102666, I'm hoping sfraser can look at this to see if I'm on the right path and if in fact it makes a difference in the behavior he and sairuh (se) have seen
Assignee | ||
Comment 11•23 years ago
|
||
This change does very nice things to the pageload times (on my dual 800) on Mac OS X. Before the patch: ~1680ms After the patch: ~1280ms
Keywords: perf
Assignee | ||
Comment 12•23 years ago
|
||
This patch differs from the previous patch in a number of ways: 1. The 'idle semaphore" routines are no longer #ifdeffed in just for carbon. They'll be called for both carbon and non-carbon builds. 2. In non-carbon builds, we still call WaitNextEvent() to idle, as now. However, I added a call to WakeUpProcess in SignalIdleSemaphore(), which should have some positive impact on performance on 9 (classic anc carbon). 3. I added calls to SignalIdleSemaphore(); from the async file IO callback, and ensured that it was called from each of the OT notifiers. 4. I factored the Gestalt() calls into a single RunningOnOSX() method. 5. I changed the order of the termination cleanup calls to be the reverse of the init calls (just for cleanliness). 6. I fixed a 'variare used before being initialized' warning in SendReceiveDgram() (using |err| rather than |result|).
Attachment #65578 -
Attachment is obsolete: true
Assignee | ||
Comment 13•23 years ago
|
||
This plus the widget drawing patch in bug 107827 gets page load time down to 1071ms on my dual 800MHz G4 (Mac OS X)
Assignee | ||
Comment 14•23 years ago
|
||
This final patch adds some #ifdef TARGET_CARBON around the MPSemaphore calls, and adds a call to WakeUpProcess() from the signal routine. This should help startup performance a little on 9, and does not impact the page load performance.
Attachment #65711 -
Attachment is obsolete: true
Comment 15•23 years ago
|
||
Comment on attachment 65716 [details] [diff] [review] Final NSPR patch, changed to work for non-Carbon builds r=sdagley
Attachment #65716 -
Flags: review+
Assignee | ||
Comment 16•23 years ago
|
||
*** Bug 116431 has been marked as a duplicate of this bug. ***
Updated•23 years ago
|
Status: NEW → ASSIGNED
Whiteboard: waiting for wtc's review and a= from drivers@mozilla.org
Comment on attachment 65716 [details] [diff] [review] Final NSPR patch, changed to work for non-Carbon builds a=dbaron for 0.9.8
Attachment #65716 -
Flags: approval+
Keywords: mozilla0.9.8+
Updated•23 years ago
|
Whiteboard: waiting for wtc's review and a= from drivers@mozilla.org
Comment 18•23 years ago
|
||
Comment on attachment 65716 [details] [diff] [review] Final NSPR patch, changed to work for non-Carbon builds This patch is fine. However, I insist that you fix the indentation of the new code in macsockotpt.c and macthr.c. It seems that you indent by 2 spaces and set the tab stop at 2. In NSPR, we always indent by 4 spaces. We are not consistent in our tab stop; some files are 8 and others are 4. I recommend that you use tab stop 4 for the Mac NSPR files and indent your code by 4 spaces. After you fix the indentation in macsockotpt.c and macthr.c, you can go ahead and check in the patch. Please check it into both the tip and the client branch of NSPR. Thank you!
Comment 19•23 years ago
|
||
Comment on attachment 65716 [details] [diff] [review] Final NSPR patch, changed to work for non-Carbon builds Hmm... on second thought >Index: pr/src/md/mac/macio.c >=================================================================== >RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macio.c,v >retrieving revision 3.27.4.3 >diff -b -u -8 -r3.27.4.3 macio.c >--- pr/src/md/mac/macio.c 10 Dec 2001 22:27:22 -0000 3.27.4.3 >+++ pr/src/md/mac/macio.c 19 Jan 2002 02:35:42 -0000 >@@ -89,16 +89,18 @@ > /* XXX Not done yet for 68K */ > /* I/O completion routne for _MD_READ and _MD_WRITE */ > static void AsyncIOCompletion (ExtendedParamBlock *pbAsyncPtr) > { > _PRCPU *cpu = _PR_MD_CURRENT_CPU(); > PRThread *thread = pbAsyncPtr->thread; > PRIntn is; > >+ SignalIdleSemaphore(); >+ > if (_PR_MD_GET_INTSOFF()) { > thread->md.missedIONotify = PR_TRUE; > cpu->u.missed[cpu->where] |= _PR_MISSED_IO; > return; > } > > _PR_INTSOFF(is); > It seems that you should signal the idle semaphore *right before* you return from AsyncIOCompletion(). Similarly for the OT notifier routine. I need to think about this more.
Comment 20•23 years ago
|
||
OK, I think I know why we must signal the idle semaphore before _PR_INTSOFF(is). NSPR sets ints off before it calls _PR_MD_PAUSE_CPU(). Therefore, if you don't signal the idle semaphore to wake up NSPR, the async I/O completion routine will block on _PR_INTSOFF(is). Did I understand this right? I would be more comfortable if we could make the SignalIdleSemaphore() call after the DoneWaitingForThisThread() call. This would require WaitOnIdleSemaphore() to set ints on before the MPWaitOnSemaphore() call and set ints off after the MPWaitOnSemaphore() call. Could you give this a try and see if it works? I think waiting on a semaphore during _MD_PauseCPU() is the right direction. I just wanted to make sure that we wake up the idle CPU thread after the callback routine has finished its work. Also, the way we use the "intsoff" critical region still has room for improvement, but that's another bug.
Assignee | ||
Comment 21•23 years ago
|
||
I'm not sure that it matters where SignalIdleSemaphore(); is called in relation to disabling/enabling interrupts (which on X implies being in the critical section). The semaphore will be signalled either way, but critical section behaviour is maintained (i.e. signalling inside the critical section will not cause an immediate pthread switch). What's clear is that we do need to signal the semaphore from the async notify routines whether interrupts are currently on or off. As to whether to signal the semaphore from within the Timer task; I don't think this is necessary, and it would hurt performance. The timer task fires every 8ms, and it will continue to fire even when we're waiting on the semaphore (since it's on its own MPTask/pthread). Signalling it would just cause MD_PauseCPU to just wake every 8ms, then wait again.
Comment 22•23 years ago
|
||
In general one signals the semaphore *after* having made the state change the other thread is waiting for. Take the producer-consumer problem as an example. The consumer waits on a semaphore if the shared buffer is empty. The producer signals the semaphore *after* adding an item to the buffer. If the producer signals the semaphore upon entering the buffer.put() method, it's wrong. This is why it doesn't feel right to see a SignalIdleSemaphore() call at the beginning of AsyncIOCompletion().
Comment 23•23 years ago
|
||
Attachment #65716 -
Attachment is obsolete: true
Assignee | ||
Comment 24•23 years ago
|
||
Last patch looks good to me, and runs fine.
Comment 25•23 years ago
|
||
Comment on attachment 65963 [details] [diff] [review] Revised patch to address wtc's concerns with the previous one Steve, this patch is good. r=wtc. I'd like you to fix two nits before checking it in. 1. The indentation and tab stop. In Mac NSPR code, we indent by 4 and set the tab stop to 4. 2. Our if-else indentation style is usually: if (...) { ... } else { ... } In a few places, you used if (...) { ... } else statement; Please move 'else' to the same line as the closing brace. Thank you for addressing my concerns. I am surprised that you don't need to leave the critical region before waiting on the idle semaphore. But if you never deadlock or see unexpected delays with this patch, I think it is fine.
Attachment #65963 -
Flags: review+
Assignee | ||
Comment 26•23 years ago
|
||
Interrupts are off when we wait on the semaphore. But because the code that signals the semaphore runs whether interrupts are on or not, and because semaphore signalling works even inside an MP critical section, we always signal the semaphore successfully.
Comment 27•23 years ago
|
||
Ok, I've changed to tabs in macthr.c where we had 2 space indentation and in macsockoptp I've changed: else DoneWaitingOnThisThread(thread); to else { DoneWaitingOnThisThread(thread); } And let me just say I hate braces on the same line as the conditional (my brain likes to parse them on their own line :-) and for the #ifdef TARGET_CARBON sections in macthr.c we need them to be on their own line so the #ifdef'ing is clean. I'll land once the tree opens today
Comment 28•23 years ago
|
||
Checked in to NSPR tip and NSPRPUB_PRE_4_2_CLIENT_BRANCH
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 29•23 years ago
|
||
Reopening. We're seeing some startup hangs because of this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 30•23 years ago
|
||
Assignee | ||
Comment 31•23 years ago
|
||
The deadlock happens because the async file IO callback is waiting to enter the critical region (turning interrupts off), but, since the main thread is also waiting, this critical region is never exited. I don't know why we didn't see this before, or why this doesn't happen all the time.
Comment 32•23 years ago
|
||
I predicted that! See the last paragraph of comment #25.
Comment 33•23 years ago
|
||
I described this potential deadlock and proposed a solution in comment #20. Could you give it a try?
Assignee | ||
Comment 34•23 years ago
|
||
I can only repro this by rebooting my machine (which probably flushes disk caches and makes things load more slowely). Even then, when it hangs, the semaphore times out, and it wakes up again.
Assignee | ||
Comment 35•23 years ago
|
||
Hmm, on further thought, I don't see why this ever deadlocks. Interrupts are off while in MD_PauseCPU. So the AsyncIOCompletion should just go via the if (_PR_MD_GET_INTSOFF()) path, and the signal should happen anyway. I don't see why AsyncIOCompletion() is trying to enter the critical region.
Assignee | ||
Comment 36•23 years ago
|
||
Running commercial opt bits shows this bug pretty reliably.
Assignee | ||
Comment 37•23 years ago
|
||
I think this is a race between the async IO callback, and the code going into MD_PauseCPU (these can run concurrently on a dual CPU machine). It seems to be a race to turn interrupts off, and may be caused by changing: if (_PR_MD_GET_INTSOFF()) { ... return; } _PR_INTSOFF(is); to if (_PR_MD_GET_INTSOFF()) { ... } else { _PR_INTSOFF(is); In the short window of time between the _PR_MD_GET_INTSOFF() test and the _PR_INTSOFF() call, the main thread is turning interrupts off in _PR_CPU_Idle, and then waiting on the semaphore. The async file IO thread then hangs, waiting on the interrupts critical section, that is not exited until a semaphore times out. So we either need an atomic 'test and set' for the interrupts flag, or we change the logic. wtc, does this sound plausible?
Comment 38•23 years ago
|
||
Simon, This sounds plausible. Does this pseudo-code work? #ifdef CARBON if (NOT running on OS X) { #endif if (_PR_MD_GET_INTSOFF()) { ... return; } #ifdef CARBON } #endif PR_INTSOFF(is); This means that if we are running on OS X, the callback functions will simply call PR_INTSOFF(is). Alternatively, we can use a try-lock function to emulate the current non-thread-safe code.
Assignee | ||
Comment 39•23 years ago
|
||
Assignee | ||
Comment 40•23 years ago
|
||
This trace is another confusing one. It shows PR_UserRunThread being run on a native thread that isn't the main thread, which does not happen. Maybe Sampler is somehow finding callers of semaphore_timedwait_signal_trap(), and working back up NSPR thread stacks?
Assignee | ||
Comment 41•23 years ago
|
||
Back to the code... The logic in AsyncIOCompletion() (and some other places) was changed so that we could call SignalIdleSemaphore() in the case where interrupts was both on and off. So we could do: if (_PR_MD_GET_INTSOFF()) { ... SignalIdleSemaphore(); return; } _PR_INTSOFF(is); ... _PR_FAST_INTSON(is); SignalIdleSemaphore(); or we could signal the semaphore first, like the original patch :)
Assignee | ||
Comment 42•23 years ago
|
||
I can think of an explanantion for those weird-looking Sampler traces. Assume that we've introduced some synchronization problem that breaks the assumptions of the Mac intson/intsoff strategy (as described in the comment in function DoneWaitingOnThisThread()). Further assume that breaking this assumption causes the code to follow code paths (e.g. in PR_Lock()) that allow for NSPR thread switching. Suddenly, we're running NSPR threads from OT notifiers and async file IO callbacks. Ouch.
Assignee | ||
Comment 43•23 years ago
|
||
Ha, I have proof that this can happen. I added a fatal assertion in PR_Schedule, to assert that it was only ever called on the main native thread. That got hit with these stacks: Thread 0: #0 0x7003f4c8 in semaphore_wait_signal_trap #1 0x7003f2c8 in _pthread_cond_wait #2 0x70250ab0 in TSWaitOnCondition #3 0x70283eb0 in TSWaitOnConditionTimedRelative #4 0x702ae8b4 in MPWaitOnSemaphore #5 0x0053a2f0 in MD_CriticalRegionEnter #6 0x0052c144 in EnterCritialRegion #7 0x0051f0fc in PR_Sleep #8 0x02f9c918 in nsMacMessagePump::DispatchEvent(int, EventRecord *) #9 0x02f9c478 in nsMacMessagePump::DoMessagePump(void) #10 0x02f9bc00 in nsAppShell::Run(void) #11 0x02f25ac8 in nsAppShellService::Run(void) #12 0x004d1e94 in main1(int, char **, nsISupports *) #13 0x004d4544 in main Thread 1 Crashed: #0 0x0052c2c0 in OnMainTask #1 0x050f6408 in 0x50f6408 #2 0x00525f9c in PR_Schedule #3 0x0052b638 in MD_wait #4 0x005246f4 in PR_Lock #5 0x0052b8a8 in DoneWaitingOnThisThread #6 0x0052dc00 in WakeUpNotifiedThread #7 0x0052e068 in NotifierRoutine #8 0x7055386c in CarbonCallBackSetup #9 0x70551118 in ProcessEventQueue #10 0x70552fd4 in CarbonDispatchThread #11 0x705539a0 in ProcessReadyEvent #12 0x70554c0c in CarbonSelectThreadFunc #13 0x7002054c in _pthread_body
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 44•23 years ago
|
||
Let it be known that the state in the previous comment was arrived at because of the non-blocking sockets bug 121326.
Assignee | ||
Comment 45•23 years ago
|
||
Thinking about this some more, I think the _PR_MD_GET_INTSOFF()/_PR_INTSOFF(is) race has been there all along. The change now is that we delay for longer before the deadlock is broken (by a timeout); before, that was the length of one call to WaitNextEvent. Now it's the duration of the MPWaitOnSemaphore timeout. I also think it's bad to signal the semaphore before setting state in the callbacks, since on OS X callbacks don't necessarily run to completion, so the NSPR threads could get a chance to run before the async file IO callback is done. This suggests that we really need an atomic 'test and set' on the interrupts flag, but, since this is wrapped in a critical region, it might be hard to implement.
Assignee | ||
Comment 46•23 years ago
|
||
A possible solution here might be to live with the race condition, but to just user a shorter timeout in MP_PauseCPU, like a max of 200ms.
Updated•23 years ago
|
Keywords: mozilla0.9.8+
Assignee | ||
Comment 47•23 years ago
|
||
Taking, since I'm working on it.
Assignee: sdagley → sfraser
Status: ASSIGNED → NEW
Assignee | ||
Comment 48•23 years ago
|
||
This race condition/stall exists on the 0.9.8 branch, so 0.9.8 builds will exhibit the stall on launch on dual CPU machines. I think we need to get a workaround in for this.
Status: NEW → ASSIGNED
Assignee | ||
Comment 49•23 years ago
|
||
This workaround for the deadlock resulting from a race to ender the critical region should help with startup problems on dual CPU machines. It should limit that deadlock to half a second.
Comment 50•23 years ago
|
||
Since we do a PR_IntervalToMilliseconds() on the timeout value passed to WaitOnIdleSemaphore() don't we want to set timeout = PR_MillisecondsToInterval(500) here?
Assignee | ||
Comment 51•23 years ago
|
||
Attachment #49293 -
Attachment is obsolete: true
Attachment #52511 -
Attachment is obsolete: true
Attachment #65963 -
Attachment is obsolete: true
Attachment #66405 -
Attachment is obsolete: true
Attachment #67204 -
Attachment is obsolete: true
Comment 52•23 years ago
|
||
Comment on attachment 67212 [details] [diff] [review] Convert the interval to milliseconds and back correcty r=sdagley
Attachment #67212 -
Flags: review+
Comment 53•23 years ago
|
||
Comment on attachment 67212 [details] [diff] [review] Convert the interval to milliseconds and back correcty r=wtc. You are only going to check this into the mozilla 0.9.8 branch, right?
Assignee | ||
Comment 54•23 years ago
|
||
Correct, 0.9.8 branch only.
Assignee | ||
Comment 55•23 years ago
|
||
Patched checked into MOZILLA_0_9_8_BRANCH branch.
Assignee | ||
Comment 56•23 years ago
|
||
<sigh> we need the patch in attachment 67212 [details] [diff] [review] for Mozilla 0.9.9 as well, and I think on the trunk.
Comment on attachment 67212 [details] [diff] [review] Convert the interval to milliseconds and back correcty a=dbaron for getting this on the trunk / 0.9.9
Attachment #67212 -
Flags: approval+
Assignee | ||
Comment 58•23 years ago
|
||
Checked in on NSPR tip, Mozilla 0.9.9, and NSPRPUB_PRE_4_2_CLIENT_BRANCH. I'll leave the bug open for a possible better fix later, since Mac OS X CFM builds will live on for a while.
Severity: critical → normal
OS: Mac System 8.5 → MacOS X
Priority: P2 → P3
Comment 59•23 years ago
|
||
Comment on attachment 67212 [details] [diff] [review] Convert the interval to milliseconds and back correcty obsoleting checked in patch so it doesn't look like an approved change still outstanding.
Attachment #67212 -
Attachment is obsolete: true
Comment 60•22 years ago
|
||
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and <http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss bugs are of critical or possibly higher severity. Only changing open bugs to minimize unnecessary spam. Keywords to trigger this would be crash, topcrash, topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Assignee | ||
Comment 61•22 years ago
|
||
Since we're moving to MachO, wontfixing this.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•