Closed Bug 429304 Opened 12 years ago Closed 8 years ago

Hang involving synchronous proxies and nested event loops

Categories

(Core :: XPCOM, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: alex, Unassigned)

References

Details

(Keywords: hang, regression, Whiteboard: WONTFIX?)

Attachments

(3 files)

There is a race condition which causes synchronous proxies to hang occasionally.

I think what's going on is something like this:

(1) A proxy is waiting for a proxy call to complete by running an event loop in nsProxyEventObject::CallMethod().

(2) nsThread::ProcessNextEvent() calls on the AppShell to process native events before processing xpcom events.

(3) The AppShell, in turn, (sometimes) processes xpcom events in a nested event loop.

(4) This is where the race condition occurs: If the completion event for the proxy call (1) arrives early enough to be handled by the nested event loop (3), we have a problem.

(5) The AppShell inserts a dummy event into the thread event queue before returning.

(6) The dummy event is ineffective in letting nsThread::ProcessNextEvent() complete, since it is duly filtered out by the proxy thread filter. 

(7) We hang indefinitely in nsThread::ProcessNextEvent().  

A typical stack trace looks like this:

#0	0x0028d2f7 in nsProxyObjectCallInfo::SetCompleted at nsProxyEvent.cpp:302
#1	0x0028d391 in nsProxyCallCompletedEvent::Run at nsProxyEvent.cpp:97
#2	0x00284a5c in nsThread::ProcessNextEvent at nsThread.cpp:510
#3	0x0020fcb0 in NS_ProcessPendingEvents_P at nsThreadUtils.cpp:180
#4	0x116596ab in nsBaseAppShell::NativeEventCallback at nsBaseAppShell.cpp:121
#5	0x1161e236 in nsAppShell::ProcessGeckoEvents at nsAppShell.mm:299
#6	0x965f662e in CFRunLoopRunSpecific
#7	0x965f6d18 in CFRunLoopRunInMode
#8	0x9103d6a0 in RunCurrentEventLoopInMode
#9	0x9103d4b9 in ReceiveNextEventCommon
#10	0x9103d32d in BlockUntilNextEventMatchingListInMode
#11	0x91a0e7d9 in _DPSNextEvent
#12	0x91a0e08e in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
#13	0x1161c3d8 in nsAppShell::ProcessNextNativeEvent at nsAppShell.mm:521
#14	0x1165973f in nsBaseAppShell::DoProcessNextNativeEvent at nsBaseAppShell.cpp:151
#15	0x11659d8e in nsBaseAppShell::OnProcessNextEvent at nsBaseAppShell.cpp:278
#16	0x1161c940 in nsAppShell::OnProcessNextEvent at nsAppShell.mm:656
#17	0x00284958 in nsThread::ProcessNextEvent at nsThread.cpp:497
#18	0x0020fb4e in NS_ProcessNextEvent_P at nsThreadUtils.cpp:227
#19	0x0028f7a0 in nsProxyEventObject::CallMethod at nsProxyEventObject.cpp:259


I think this is a regression of bug #389931.

Not sure how to best address it. Maybe we shouldn't process native events while waiting for a proxy call to complete. Or maybe the dummy events posted by nsBaseAppShell::OnProcessNextEvent() could implement NS_PROXYEVENT_FILTER_IID, so that it doesn't get filtered.
> I think this is a regression of bug #389931.

By your description I would guess so too.

I don't think either of your suggested solutions scales.  We would need
to mess with appshell code every time someone implements a new filter
class (if I understood you correctly).
Alternative solutions:

Let the dummy event implement a new interface, nsIBypassFiltersEvent,
and let those events through nsThread::nsChainedEventQueue::PutEvent.
That costs a QI when 'mFilters' is non-null though.

Make the dummy event a known global object (or one per thread?) and
compare using == instead of QI.
Seems like a bad regression that we should fix before release.
It shouldn't be too hard or risky to fix I think.
Severity: normal → critical
Flags: blocking1.9?
Keywords: hang, regression
OS: Mac OS X → All
Hardware: PC → All
Version: unspecified → Trunk
Attached patch wip1Splinter Review
The QI version.  Alex, does this fix the hang?
Mats, how is your new interface any different from doing what Alex suggested with NS_PROXYEVENT_FILTER_IID ?
Because if I implement a new filter with MY_NEW_FILTER_IID
then I would have to add that to the dummy event class too, etc.
I got the impression he suggested that the appshell dummy event pass
nsProxyThreadFilter::AcceptEvent() specifically, but maybe I misunderstood.
Mats: Thanks for getting on the case. I haven't tested with your patch yet, but on further reflection I think the right course of action would be to arrange for the while(...) { NS_ProcessNextEvent(); } loop in nsProxyEventObject::CallMethod() to only pump xpcom events, and not native events too.

The rationale goes something like this:

If our main thread 'A' is making a synchronous proxy call to some other thread 'B', we WANT A to suspend processing of events until the proxy call is over. This is after all the whole point of a synchronous proxy: it should behave as much as possible like a normal sequential function call. 

The only reason for spinning a local nested event loop in nsProxyEventObject::CallMethod() (instead of e.g. blocking on a condition variable) is to support reentrant synchronous proxy calls from B to A. All events other than nsProxyObjectCallInfo or nsProxyCallCompletedEvent are filtered into the ancestor event queue, so in my opinion, we should hold off with processing native events too. It doesn't make sense to pump native events here, and it probably leads to all sorts of weird unreproducible behavior.

I guess the question is if there is potential for deadlock if nsProxyEventObject::CallMethod() isn't pumping native events. As far as I can see there isn't; if a reentrant call requires native events to be pumped in order to run to completion (e.g. if it displays a modal dialog), it will spin another nested event loop (with proper processing of native events) anyway.
(In reply to comment #4)
> Created an attachment (id=316039) [details]
> wip1
> 
> The QI version.  Alex, does this fix the hang?
> 

Mats: I tested with your patch and unfortunately I'm still seeing hangs when a proxy call is being made :-(
Was has changed is that it doesn't hang indefinitely anymore. I can force it out of the higher event loop by generating some native events (i.e. moving the mouse).
Can you explain how often this is hit?  If it's a long shot that doesn't affect user experience, and can be fixed in a dot release, it shouldn't block.  Can't tell from the comments here.  Please re-nom with reasons for blocking, please. 
Flags: blocking1.9? → blocking1.9-
(In reply to comment #10)
> Can you explain how often this is hit?

I don't know how common synchronous proxy calls are to be honest.
Someone who knows should re-nom if they are common.
(In reply to comment #8)
> ... it doesn't hang indefinitely anymore. I can force it
> out of the higher event loop by generating some native events 

Yeah, I see the problem now.  nsProxyEventObject::CallMethod() calls
NS_ProcessNextEvent() with mayWait==true, which allows the appshell
to block, waiting for a native event.

I've been thinking about different ways of fixing this bug, but I always
seem to end up with the conclusion that blocking on an event
(nsProxyCallCompletedEvent) doesn't really work with nested event loops.
Whether or not we should allow native events to be processed while waiting
for the synchronous call to be completed is a different question.
IMHO, nsProxyEventObject shouldn't rely on a specific behaviour of its
consumers to avoid a deadlock.

Replacing this block:
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/xpcom/proxy/src/nsProxyEventObject.cpp&rev=1.65&root=/cvsroot&mark=259-262#240
with a non-blocking NS_ProcessNextEvent(thread, PR_FALSE) would fix it
although that would probably peg the CPU while polling for the event, so
we need a local "sleep" of some sort to fix that.  This also means that
we don't really need the nsProxyCallCompletedEvent event, we only need
the NS_ProcessNextEvent() to process re-entrant synchronous calls.

Regarding the issue of processing native events, maybe we should add
a method on nsIAppShell to control the behaviour?
If someone really wants to disable native event handling, or the
nested XPCOM event handling, they could do so.
Attached patch wip6Splinter Review
Would something like this work?
(In reply to comment #10)
> Can you explain how often this is hit? 

I only did some casual grepping, but afaics the only code affected by this is external to firefox (e.g. mailnews/ in nsImapProtocol.cpp). There are a few synchronous proxies in firefox (most notably in security/), but they all appear to proxy call TO the main thread, whereas this particular bug only manifests itself when proxying a call FROM the main thread to some other thread.


(In reply to comment #13)
> Created an attachment (id=316339) [details]
> wip6 
> Would something like this work?

This works great.

I guess it would be possible to get rid of the polling loop by feeding a flag up to nsIThreadObserver::OnProcessNextEvent(), or by making observers aware of proxy events, but I like how the code is now self-contained in xpcom/proxy.

With regards to native events being processed during proxy calls -  I think that's something we can live with. Native events will spawn thread events which will be properly sequenced, courtesy of the proxy thread filter. There is a bigger problem with sequencing of proxy calls in scenarios involving more than 2 threads, but that's neither a new problem nor object of this bug.
I get continual hangs even if I'm not doing anything. I assumed that this was
caused by Camino 1.6, but I backed out to Camino 1.5.5 and I'm still getting
it... so I went in with Activity Monitor and found this was always due to a
bunch of nested js_Invoke called by nsPACMan::getProxyForURI, called, at the
top level, from -[Bookmark refreshIcon]

Attached Activity Monitor samples for reference to Bug 235853
Mats, any more work to get this patch reviewed and landed?
Flags: wanted1.9.0.x? → wanted1.9.0.x+
Songbird is seeing what appears to be this bug on shutdown of our unit tests. bug 430260 is the actual symptoms we're seeing but we do a lot of proxy calls so I wouldn't be surprised if this is the underlying cause. Not positive if they are synch from main->background but I'm looking into it.

Going to try this patch out (wip6) locally and will report back on it's outcome.
(In reply to comment #17)
> Mats, any more work to get this patch reviewed and landed?

I think it was complete, I just couldn't figure out a way to write
a regression test for it.

(In reply to comment #18)
> Going to try this patch out (wip6) locally and will report back on it's
> outcome.

Did it help?
Tests are good, but if we're unable to come up with a test we should still land the fix.
Ok, I'll try to make a patch for review.  There was a similar bug reported
today that might help with the test (bug 460366).
Assignee: nobody → mats.palmgren
Attachment #316339 - Flags: review?(doug.turner)
Blocks: 468736
Trace shows that the attached patch doesn't fix the original bug; per timeless on IRC, we'll use bug 468736 for the original problem.
(In reply to comment #22)
 
> Trace shows that the attached patch doesn't fix the original bug; per timeless
> on IRC, we'll use bug 468736 for the original problem.

Erm, what do you mean by "doesn't fix the original bug"? wip6 certainly fixes the original problem that I reported on this bug. Your trace seems to be completely unrelated...
Attachment #316339 - Flags: review?(dougt) → review?(benjamin)
Blocks: 430260
Since XPCOM proxies are being removed (bug 675221) this bug should probably
be resolved as wontfix.
Assignee: matspal → nobody
Whiteboard: WONTFIX?
Attachment #316339 - Flags: review?(benjamin)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.