429304 - Hang involving synchronous proxies and nested event loops

Reporter

Description

•

17 years ago

There is a race condition which causes synchronous proxies to hang occasionally. I think what's going on is something like this: (1) A proxy is waiting for a proxy call to complete by running an event loop in nsProxyEventObject::CallMethod(). (2) nsThread::ProcessNextEvent() calls on the AppShell to process native events before processing xpcom events. (3) The AppShell, in turn, (sometimes) processes xpcom events in a nested event loop. (4) This is where the race condition occurs: If the completion event for the proxy call (1) arrives early enough to be handled by the nested event loop (3), we have a problem. (5) The AppShell inserts a dummy event into the thread event queue before returning. (6) The dummy event is ineffective in letting nsThread::ProcessNextEvent() complete, since it is duly filtered out by the proxy thread filter. (7) We hang indefinitely in nsThread::ProcessNextEvent(). A typical stack trace looks like this: #0 0x0028d2f7 in nsProxyObjectCallInfo::SetCompleted at nsProxyEvent.cpp:302 #1 0x0028d391 in nsProxyCallCompletedEvent::Run at nsProxyEvent.cpp:97 #2 0x00284a5c in nsThread::ProcessNextEvent at nsThread.cpp:510 #3 0x0020fcb0 in NS_ProcessPendingEvents_P at nsThreadUtils.cpp:180 #4 0x116596ab in nsBaseAppShell::NativeEventCallback at nsBaseAppShell.cpp:121 #5 0x1161e236 in nsAppShell::ProcessGeckoEvents at nsAppShell.mm:299 #6 0x965f662e in CFRunLoopRunSpecific #7 0x965f6d18 in CFRunLoopRunInMode #8 0x9103d6a0 in RunCurrentEventLoopInMode #9 0x9103d4b9 in ReceiveNextEventCommon #10 0x9103d32d in BlockUntilNextEventMatchingListInMode #11 0x91a0e7d9 in _DPSNextEvent #12 0x91a0e08e in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] #13 0x1161c3d8 in nsAppShell::ProcessNextNativeEvent at nsAppShell.mm:521 #14 0x1165973f in nsBaseAppShell::DoProcessNextNativeEvent at nsBaseAppShell.cpp:151 #15 0x11659d8e in nsBaseAppShell::OnProcessNextEvent at nsBaseAppShell.cpp:278 #16 0x1161c940 in nsAppShell::OnProcessNextEvent at nsAppShell.mm:656 #17 0x00284958 in nsThread::ProcessNextEvent at nsThread.cpp:497 #18 0x0020fb4e in NS_ProcessNextEvent_P at nsThreadUtils.cpp:227 #19 0x0028f7a0 in nsProxyEventObject::CallMethod at nsProxyEventObject.cpp:259 I think this is a regression of bug #389931. Not sure how to best address it. Maybe we shouldn't process native events while waiting for a proxy call to complete. Or maybe the dummy events posted by nsBaseAppShell::OnProcessNextEvent() could implement NS_PROXYEVENT_FILTER_IID, so that it doesn't get filtered.

Mats Palmgren (inactive)

Comment 1

•

17 years ago

> I think this is a regression of bug #389931. By your description I would guess so too. I don't think either of your suggested solutions scales. We would need to mess with appshell code every time someone implements a new filter class (if I understood you correctly).

Mats Palmgren (inactive)

Comment 2

•

17 years ago

Alternative solutions: Let the dummy event implement a new interface, nsIBypassFiltersEvent, and let those events through nsThread::nsChainedEventQueue::PutEvent. That costs a QI when 'mFilters' is non-null though. Make the dummy event a known global object (or one per thread?) and compare using == instead of QI.

Mats Palmgren (inactive)

Comment 3

•

17 years ago

Seems like a bad regression that we should fix before release. It shouldn't be too hard or risky to fix I think.

Severity: normal → critical

Flags: blocking1.9?

Keywords: hang, regression

OS: Mac OS X → All

Hardware: PC → All

Version: unspecified → Trunk

Mats Palmgren (inactive)

Comment 4

•

17 years ago

Attached patch wip1 — Details — Splinter Review

The QI version. Alex, does this fix the hang?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 5

•

17 years ago

Mats, how is your new interface any different from doing what Alex suggested with NS_PROXYEVENT_FILTER_IID ?

Mats Palmgren (inactive)

Comment 6

•

17 years ago

Because if I implement a new filter with MY_NEW_FILTER_IID then I would have to add that to the dummy event class too, etc. I got the impression he suggested that the appshell dummy event pass nsProxyThreadFilter::AcceptEvent() specifically, but maybe I misunderstood.

Alex Fritze

Reporter

Comment 7

•

17 years ago

Mats: Thanks for getting on the case. I haven't tested with your patch yet, but on further reflection I think the right course of action would be to arrange for the while(...) { NS_ProcessNextEvent(); } loop in nsProxyEventObject::CallMethod() to only pump xpcom events, and not native events too. The rationale goes something like this: If our main thread 'A' is making a synchronous proxy call to some other thread 'B', we WANT A to suspend processing of events until the proxy call is over. This is after all the whole point of a synchronous proxy: it should behave as much as possible like a normal sequential function call. The only reason for spinning a local nested event loop in nsProxyEventObject::CallMethod() (instead of e.g. blocking on a condition variable) is to support reentrant synchronous proxy calls from B to A. All events other than nsProxyObjectCallInfo or nsProxyCallCompletedEvent are filtered into the ancestor event queue, so in my opinion, we should hold off with processing native events too. It doesn't make sense to pump native events here, and it probably leads to all sorts of weird unreproducible behavior. I guess the question is if there is potential for deadlock if nsProxyEventObject::CallMethod() isn't pumping native events. As far as I can see there isn't; if a reentrant call requires native events to be pumped in order to run to completion (e.g. if it displays a modal dialog), it will spin another nested event loop (with proper processing of native events) anyway.

Alex Fritze

Reporter

Comment 8

•

17 years ago

(In reply to comment #4) > Created an attachment (id=316039) [details] > wip1 > > The QI version. Alex, does this fix the hang? > Mats: I tested with your patch and unfortunately I'm still seeing hangs when a proxy call is being made :-( Was has changed is that it doesn't hang indefinitely anymore. I can force it out of the higher event loop by generating some native events (i.e. moving the mouse).

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 9

•

17 years ago

Comment #7 makes sense to me.

Damon Sicore (:damons)

Comment 10

•

17 years ago

Can you explain how often this is hit? If it's a long shot that doesn't affect user experience, and can be fixed in a dot release, it shouldn't block. Can't tell from the comments here. Please re-nom with reasons for blocking, please.

Flags: blocking1.9? → blocking1.9-

Mats Palmgren (inactive)

Comment 11

•

17 years ago

(In reply to comment #10) > Can you explain how often this is hit? I don't know how common synchronous proxy calls are to be honest. Someone who knows should re-nom if they are common.

Mats Palmgren (inactive)

Comment 12

•

17 years ago

(In reply to comment #8) > ... it doesn't hang indefinitely anymore. I can force it > out of the higher event loop by generating some native events Yeah, I see the problem now. nsProxyEventObject::CallMethod() calls NS_ProcessNextEvent() with mayWait==true, which allows the appshell to block, waiting for a native event. I've been thinking about different ways of fixing this bug, but I always seem to end up with the conclusion that blocking on an event (nsProxyCallCompletedEvent) doesn't really work with nested event loops. Whether or not we should allow native events to be processed while waiting for the synchronous call to be completed is a different question. IMHO, nsProxyEventObject shouldn't rely on a specific behaviour of its consumers to avoid a deadlock. Replacing this block: http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/xpcom/proxy/src/nsProxyEventObject.cpp&rev=1.65&root=/cvsroot&mark=259-262#240 with a non-blocking NS_ProcessNextEvent(thread, PR_FALSE) would fix it although that would probably peg the CPU while polling for the event, so we need a local "sleep" of some sort to fix that. This also means that we don't really need the nsProxyCallCompletedEvent event, we only need the NS_ProcessNextEvent() to process re-entrant synchronous calls. Regarding the issue of processing native events, maybe we should add a method on nsIAppShell to control the behaviour? If someone really wants to disable native event handling, or the nested XPCOM event handling, they could do so.

Mats Palmgren (inactive)

Comment 13

•

17 years ago

Attached patch wip6 — Details — Splinter Review

Would something like this work?

Alex Fritze

Reporter

Comment 14

•

17 years ago

(In reply to comment #10) > Can you explain how often this is hit? I only did some casual grepping, but afaics the only code affected by this is external to firefox (e.g. mailnews/ in nsImapProtocol.cpp). There are a few synchronous proxies in firefox (most notably in security/), but they all appear to proxy call TO the main thread, whereas this particular bug only manifests itself when proxying a call FROM the main thread to some other thread.

Alex Fritze

Reporter

Comment 15

•

17 years ago

(In reply to comment #13) > Created an attachment (id=316339) [details] > wip6 > Would something like this work? This works great. I guess it would be possible to get rid of the polling loop by feeding a flag up to nsIThreadObserver::OnProcessNextEvent(), or by making observers aware of proxy events, but I like how the code is now self-contained in xpcom/proxy. With regards to native events being processed during proxy calls - I think that's something we can live with. Native events will spawn thread events which will be properly sequenced, courtesy of the proxy thread filter. There is a bigger problem with sequencing of proxy calls in scenarios involving more than 2 threads, but that's neither a new problem nor object of this bug.

Ben Turner (not reading bugmail, use the needinfo flag!)

Updated

•

17 years ago

Flags: wanted1.9.0.x?

Peter da Silva

Comment 16

•

17 years ago

I get continual hangs even if I'm not doing anything. I assumed that this was caused by Camino 1.6, but I backed out to Camino 1.5.5 and I'm still getting it... so I went in with Activity Monitor and found this was always due to a bunch of nested js_Invoke called by nsPACMan::getProxyForURI, called, at the top level, from -[Bookmark refreshIcon] Attached Activity Monitor samples for reference to Bug 235853

Samuel Sidler (old account; do not CC)

Comment 17

•

17 years ago

Mats, any more work to get this patch reviewed and landed?

Flags: wanted1.9.0.x? → wanted1.9.0.x+

John Gaunt (redfive)

Comment 18

•

17 years ago

Songbird is seeing what appears to be this bug on shutdown of our unit tests. bug 430260 is the actual symptoms we're seeing but we do a lot of proxy calls so I wouldn't be surprised if this is the underlying cause. Not positive if they are synch from main->background but I'm looking into it. Going to try this patch out (wip6) locally and will report back on it's outcome.

Mats Palmgren (inactive)

Comment 19

•

17 years ago

(In reply to comment #17) > Mats, any more work to get this patch reviewed and landed? I think it was complete, I just couldn't figure out a way to write a regression test for it. (In reply to comment #18) > Going to try this patch out (wip6) locally and will report back on it's > outcome. Did it help?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 20

•

17 years ago

Tests are good, but if we're unable to come up with a test we should still land the fix.

Mats Palmgren (inactive)

Comment 21

•

17 years ago

Ok, I'll try to make a patch for review. There was a similar bug reported today that might help with the test (bug 460366).

Assignee: nobody → mats.palmgren

timeless

Updated

•

17 years ago

Attachment #316339 - Flags: review?(doug.turner)

timeless

Updated

•

17 years ago

Blocks: 468736

Ilya Sherman

Comment 22

•

17 years ago

Attached file Trace with wip6 applied — Details

Trace shows that the attached patch doesn't fix the original bug; per timeless on IRC, we'll use bug 468736 for the original problem.

Alex Fritze

Reporter

Comment 23

•

17 years ago

(In reply to comment #22) > Trace shows that the attached patch doesn't fix the original bug; per timeless > on IRC, we'll use bug 468736 for the original problem. Erm, what do you mean by "doesn't fix the original bug"? wip6 certainly fixes the original problem that I reported on this bug. Your trace seems to be completely unrelated...

Doug Turner (:dougt)

Updated

•

15 years ago

Attachment #316339 - Flags: review?(dougt) → review?(benjamin)

Honza Bambas (:mayhemer)

Updated

•

15 years ago

Blocks: 430260

Mats Palmgren (inactive)

Comment 24

•

14 years ago

Since XPCOM proxies are being removed (bug 675221) this bug should probably be resolved as wontfix.

Assignee: matspal → nobody

Whiteboard: WONTFIX?

Mats Palmgren (inactive)

Updated

•

14 years ago

Attachment #316339 - Flags: review?(benjamin)

Benjamin Smedberg

Updated

•

13 years ago

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → INCOMPLETE

wip1 17 years ago Mats Palmgren (inactive) 8.15 KB, patch		Details \| Diff \| Splinter Review
wip6 17 years ago Mats Palmgren (inactive) 9.47 KB, patch		Details \| Diff \| Splinter Review
Trace with wip6 applied 17 years ago Ilya Sherman 10.11 KB, text/plain		Details