nested event consumption by nsAppShell::ProcessGeckoEvents breaks processNextEvent semantics; causing OS X xpcshell oranges

RESOLVED FIXED in mozilla2.0b7

Status

()

RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: asuth, Assigned: asuth)

Tracking

Trunk
mozilla2.0b7
All
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Assignee)

Description

8 years ago
A number of xpcshell (and mozmill) tests use an idiom where they manually spin an event loop using nsIThread:processNextEvent(aWait = true).  The expectation is that an event will modify a closed over variable which then causes the loop to terminate.

There are basically two key ingredients to the problem:

1) The nested event loop (ProcessGeckoEvents) triggered by the weird OS X event mechanism can process the gecko events while we are inside a ProcessNextNativeEvent() call that is waiting on nextEventMatchingMask.  This means that nsBaseAppShell::OnProcessNextEvent will have NS_HasPendingEvents return false.

2) Although the while loop in nsBaseAppShell::OnProcessNextEvent that spins on !NS_HasPendingEvents(thr) will break out of the loop if DoProcessNextNativeEvent returns false, we call DoProcessNextNativeEvent in the block immediately prior to that loop.  Not only that... if mFavorPerf <= 0, we will potentially keep calling DoProcessNextNativeEvent until it has eaten all of the native events that are produced by ProcessGeckoEvents to wake up the event loop so it can exit.

Here you will see the output of some dtrace probes that show the tail end of stuff going off the rails:

1 [184990]           -> non-virtual thunk to nsXPConnect::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           <- non-virtual thunk to nsXPConnect::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           -> nsXPConnect::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           <- nsXPConnect::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           -> non-virtual thunk to nsAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           <- non-virtual thunk to nsAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           -> nsAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]             -> nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]             <- nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]           <- nsAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int)
1 [184990]         <- nsAppShell::ProcessGeckoEvents(void*)
1 [184990]       <- nsAppShell::ProcessNextNativeEvent(int) => 1
1 [184990]     <- nsBaseAppShell::DoProcessNextNativeEvent(int) => 1
1 [184990]     -> nsBaseAppShell::DoProcessNextNativeEvent(int) (mayWait: 0)
1 [184990]       -> nsAppShell::ProcessNextNativeEvent(int) (mayWait: 0)
1 [184990]       <- nsAppShell::ProcessNextNativeEvent(int) => 0
1 [184990]     <- nsBaseAppShell::DoProcessNextNativeEvent(int) => 0
1 [184990]     -> nsBaseAppShell::DoProcessNextNativeEvent(int) (mayWait: 1)
1 [184990]       -> nsAppShell::ProcessNextNativeEvent(int) (mayWait: 1)

I am not entirely clear on the best way to solve this problem because the cocoa event loop state machine is already extremely complex.  My instinct is to add or reuse a variable that tracks whether we have processed any gecko events so far in OnProcessNextEvent and change:
  while (!NS_HasPendingEvents(thr)) {
to
  while (!NS_HasPendingEvents(thr) && processedAGeckoEventThisCycle) {

I'm going to see if there's an existing thing that jumps out at me, but if not I'm going to provide a patch that just adds a new variable because this is causing really bad bustage.

I believe the reason this started happening is that bug 602871 landed a new fake idle service that kills off the real idle service.  I believe the real idle service kept a steady stream of timer events hitting the main thread which should eventually unwedge whatever trouble the OS X event loop gets itself into.
(Assignee)

Comment 1

8 years ago
Er, I think I left out the actual problem.  Things deadlock because we never return from nsIThread:processNextEvent and no more events ever happen because the tests are deterministic and don't have random stuff happening in the background.
(Assignee)

Comment 2

8 years ago
Created attachment 484937 [details] [diff] [review]
do the simple extra variable thing

This does the simple thing I said.  This fixes my breakage locally on my 10.6 mac targeting 10.5.  I am going to push this to the try server and hope it does not explode things.  This was definitely the route of least changes, so I'm hoping it does not.
Attachment #484937 - Flags: review?(joshmoz)
(Assignee)

Comment 3

8 years ago
Created attachment 484938 [details]
full dtrace log output

That it may aid your understanding of the problem, here is the output of the dtrace run that helped inform my conclusions.  I will attach the dtrace probes next.
(Assignee)

Comment 4

8 years ago
Created attachment 484940 [details]
the probes!

I would basically do the following:

in console a:
  cd objdir/storage/test
  make SOLO_FILE=test_statement_executeAsync.js check-interactive

in console b:
  ps ax | grep xpcshell
  sudo dtrace -s osxevents.d -p THE-PID-FROM-THE-OUTPUT-OF-THE-PREVIOUS-THING

Comment 5

8 years ago
this makes Fennec alot better on the mac.  See bug 602539.

Comment 6

8 years ago
sorry about that.  false alarm.  bug 602539 doesn't have anything to do with this bug.

Comment 7

8 years ago
Comment on attachment 484937 [details] [diff] [review]
do the simple extra variable thing

Handing review off to Steven Michaud, if he's fine with it then I am.
Attachment #484937 - Flags: review?(joshmoz) → review?(smichaud)
Comment on attachment 484937 [details] [diff] [review]
do the simple extra variable thing

This looks fine to me.  It just gets around the fact that
NS_HasPendingEvents() doesn't (thanks to event nesting) always give us
the information we need.

Event processing is complex on all platforms.  The main reason is that
we have (in effect) two event loops -- the native event loop and
Gecko's event loop.  This is probably unavoidable in an app that's
designed to be cross-platform.
Attachment #484937 - Flags: review?(smichaud) → review+

Comment 9

8 years ago
Can someone in the know do a risk-analysis for this and request approval-2.0 if deemed appropriate?  It would be wonderful to clear up the frequently not-particularly-intermittent oranges this purports to fix.
(In reply to comment #9)
> Can someone in the know do a risk-analysis for this and request approval-2.0 if
> deemed appropriate?  It would be wonderful to clear up the frequently
> not-particularly-intermittent oranges this purports to fix.

I'll get asuth to do the risk analysis, but this is a blocker for the Thunderbird tree - our Mac builds are failing xpcshell-tests *every time* due to this bug (and no, I don't know what caused it to start happening unfortunately).

Updated

8 years ago
Attachment #484937 - Flags: approval2.0+
(In reply to comment #9)

Probably the most effective risk analysis is to let this bake on the
trunk for a while :-)

I think I count as "in the know" for OS X appshell stuff, and I'm
comfortable with this change.  The earlier calls to
DoProcessNextNativeEvent() in nsBaseAppShell::OnProcessNextEvent()
make it unlikely that we'll see native event starvation.

But I'm much less familiar with event processing on other platforms.

I also don't know how much difference this will make outside the
somewhat artificial environment of our testing harnesses (in other
words to ordinary users).  I also don't know what metrics we'd use to
measure this difference.

Any ideas, Andrew?
(In reply to comment #10)

If this bug makes a significant number of tests unusable on some
branch, I'd say that was a powerful argument to land its patch on that
branch.
Landed on trunk: http://hg.mozilla.org/mozilla-central/rev/6678fab3e784
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b8
(In reply to comment #12)
> (In reply to comment #10)
> 
> If this bug makes a significant number of tests unusable on some
> branch, I'd say that was a powerful argument to land its patch on that
> branch.

Just FYI, Thunderbird tree isn't a branch. It is based on trunk and re-uses mozilla-central directly (similar in style to the mobile builds).
(Assignee)

Comment 15

8 years ago
(In reply to comment #8)
> Event processing is complex on all platforms.  The main reason is that
> we have (in effect) two event loops -- the native event loop and
> Gecko's event loop.  This is probably unavoidable in an app that's
> designed to be cross-platform.

Yeah, mashing up two completely separate event loops is not trivial and bound to result in some complexity, especially when you have to deal with nested event loops.  It was just surprising to me that such a complex platform situation wouldn't just replace the thread mechanisms wholesale to reduce the number of moving parts.  I suspect if I followed the evolution of the code it all makes sense, especially because few people ever are crazy enough or have enough free cycles to completely re-write a working event loop in the pursuit of some platonic ideal :)

(In reply to comment #11)
> I also don't know how much difference this will make outside the
> somewhat artificial environment of our testing harnesses (in other
> words to ordinary users).  I also don't know what metrics we'd use to
> measure this difference.
> 
> Any ideas, Andrew?

I'm assuming this question is generally mooted since the patch has already landed and made the tree happy... but in any event, the strict result of this patch is that processNextEvent(mayWait=true) will always return if it has processed one or more mozilla platform events.  Prior to the patch this guarantee would not be maintained on OS X.  The potential performance impact is that control flow may return to the outer event handling loop more than it did previously.  Because we must process at least one native event for the behaviour change to occur, the extra cost is bounded and believed to be quite low in the native steady state.  When the event loop is pumped by JavaScript code the performance change may be slightly greater, but since that is exactly the situation that could deadlock without the fix in an event-starved environment, the extra cost for correctness is still deemed worth it.

It's conceivable this might also do good things for modal dialog event loops, but since those are unlikely to be (mozilla platform) event-starved, it's unlikely anyone would notice.
Target Milestone: mozilla2.0b8 → mozilla2.0b7
You need to log in before you can comment on or make changes to this bug.