nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing

RESOLVED FIXED in Firefox 55

Status

()

Core
XPCOM
RESOLVED FIXED
9 months ago
9 months ago

People

(Reporter: Ehsan, Assigned: Ehsan)

Tracking

(Depends on: 1 bug)

unspecified
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55+ fixed)

Details

Attachments

(1 attachment, 4 obsolete attachments)

(Assignee)

Description

9 months ago
STR:

1. Apply my patch in bug 1361461.
2. Apply my patch in bug 1368285.
3. Run `./mach xpcshell-test js/xpconnect/tests/unit/test_nuke_webextension_wrappers.js`

The test times out.

The reason this is happening is that during event processing we continually hit nsBaseAppShell::DispatchDummyEvent() <https://searchfox.org/mozilla-central/rev/596d188f6dbc8cb023e625f0a4310d184875f8fc/widget/nsBaseAppShell.cpp#299> and therefore every time nsThread::GetEvent() gets called for our main thread, we'd have something in mEvents, and we end up never looking at mIdleEvents as a result and starving the idle event that was dispatched.

This dummy event business is ancient and I'm actually not sure if we really need it any more.  I posted a patch to the try server to see if we can fix this issue by simply removing this old mechanism: <https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab2ea816712de9dea2a214ad3cc684b8d585e9d0>
(Assignee)

Comment 1

9 months ago
Try results look promising!
(Assignee)

Comment 2

9 months ago
Created attachment 8872109 [details] [diff] [review]
Don't dispatch a dummy event at the end of nsBaseAppShell::OnProcessNextEvent()
Attachment #8872109 - Flags: review?(bugs)

Comment 3

9 months ago
Why do we end up calling DispatchDummeEvent all the time? 
What code passes true as mayWait?

Couldn't DispatchDummyEvent dispatch the event to idle queue?

Comment 4

9 months ago
Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I think it should.

Comment 5

9 months ago
Comment on attachment 8872109 [details] [diff] [review]
Don't dispatch a dummy event at the end of nsBaseAppShell::OnProcessNextEvent()

I don't think the reasons why the code was added have disappeared.
Attachment #8872109 - Flags: review?(bugs) → review-
(Assignee)

Comment 6

9 months ago
(In reply to Olli Pettay [:smaug] from comment #4)
> Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I
> think it should.

Yes, that is indeed the root cause of the problem.
(Assignee)

Updated

9 months ago
Summary: DummyEvent gets in the way of idle events running → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process
(Assignee)

Updated

9 months ago
Summary: nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing
(Assignee)

Comment 7

9 months ago
Created attachment 8872420 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()
Attachment #8872420 - Flags: review?(bugs)
(Assignee)

Updated

9 months ago
Attachment #8872109 - Attachment is obsolete: true

Updated

9 months ago
Attachment #8872420 - Flags: review?(bugs) → review+

Comment 8

9 months ago
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b814165b471
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug

Comment 9

9 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/0b814165b471
Status: NEW → RESOLVED
Last Resolved: 9 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Assignee: nobody → ehsan
(Assignee)

Comment 10

9 months ago
Backed out because of bug 1369123.

https://hg.mozilla.org/mozilla-central/rev/edffa38ec0c5
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 11

9 months ago
An alternative approach here would be to make the dummy event special in some way.  We could use QI to some interface, but a more performance way would be to do something like:

  StaticRefPtr<nsIRunnable> sDummy;

  bool
  IsDummy(nsIRunnable* aRunnable)
  {
    return aRunnable == sDummy;
  }

  nsresult
  DispatchDummy(nsIThread* aThread)
  {
    return aThread->Dispatch(sDummy, nsIThread::DISPATCH_NORMAL);
  }

Then make idle processing do something like:

  if (!HasPendingEvents() || (LengthIsOne() && IsDummy(FirstEvent())) {
    // we're idle
  }

That would avoid changing behavior for other code using HasPendingEvents().
(Assignee)

Comment 12

9 months ago
[Tracking Requested - why for this release]: Idle queue events may end up being discarded...
status-firefox55: fixed → affected
tracking-firefox55: --- → ?
(Assignee)

Comment 13

9 months ago
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #12)
> [Tracking Requested - why for this release]: Idle queue events may end up
> being discarded...

Err, I meant to say starved.
(Assignee)

Comment 14

9 months ago
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #11)
> An alternative approach here would be to make the dummy event special in
> some way.

Yeah, I guess, but it depends on the root cause of the bug, and I for one still at least don't really understand it.  I actually realized I still have my old MacBook at home and spun up a recent build on it but haven't yet gotten a chance to test this myself.  I will give it a shot to see if I can figure out where the root cause is.  If the root cause is NS_HasPendingEvents() returning true in some cases confusing some other code this alternative idea may or may not help.
(Assignee)

Updated

9 months ago
See Also: → bug 1329041
Tracking this for 55 to make sure we follow up to resolve the issue before release.
tracking-firefox55: ? → +
(Assignee)

Comment 16

9 months ago
(Sorry for the length of the comment but the bug was complex.)

I debugged bug 1369123 on OSX.  Here is my findings.

I set a breakpoint in nsThread::HasPendingEvents() based on the condition of mEvents->HasPendingEvent() returning false and mIdleEvents.HasPendingEvent() returning true, since this is the only condition where nsThread::HasPendingEvents() would do something different after my patch.  The breakpoint got hit for the first time when I clicked the BuildAPI link after a few tries from the STR in bug 1369123 under this call stack:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x0000000110e23492 XUL`nsThread::HasPendingEvents(this=0x000000010fc20800, aResult=0x00007fff519f3917) at nsThread.cpp:1051
    frame #1: 0x0000000110e2c1fa XUL`hasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:430
    frame #2: 0x0000000110e2c1cd XUL`NS_HasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:448
    frame #3: 0x00000001157c5644 XUL`nsBaseAppShell::NativeEventCallback(this=0x000000010fc85b80) at nsBaseAppShell.cpp:104
    frame #4: 0x0000000115862b92 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x000000010fc85b80) at nsAppShell.mm:399
    frame #5: 0x00007fffbd318981 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
    frame #6: 0x00007fffbd2f9a7d CoreFoundation`__CFRunLoopDoSources0 + 557
    frame #7: 0x00007fffbd2f8f76 CoreFoundation`__CFRunLoopRun + 934
    frame #8: 0x00007fffbd2f8974 CoreFoundation`CFRunLoopRunSpecific + 420
    frame #9: 0x00007fffbc884a5c HIToolbox`RunCurrentEventLoopInMode + 240
    frame #10: 0x00007fffbc884891 HIToolbox`ReceiveNextEventCommon + 432
    frame #11: 0x00007fffbc8846c6 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
    frame #12: 0x00007fffbae2a5b4 AppKit`_DPSNextEvent + 1120
    frame #13: 0x00007fffbb5a4d6b AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 2789
    frame #14: 0x0000000115861684 XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x000000010e4ea740, _cmd="nextEventMatchingMask:untilDate:inMode:dequeue:", mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:130
    frame #15: 0x00007fffbae1ef35 AppKit`-[NSApplication run] + 926
    frame #16: 0x000000011586360c XUL`nsAppShell::Run(this=0x000000010fc85b80) at nsAppShell.mm:673
    frame #17: 0x00000001183e15fb XUL`nsAppStartup::Run(this=0x000000012a9e1a10) at nsAppStartup.cpp:283
    frame #18: 0x00000001185361cd XUL`XREMain::XRE_mainRun(this=0x00007fff519f5b88) at nsAppRunner.cpp:4571
    frame #19: 0x000000011853775a XUL`XREMain::XRE_main(this=0x00007fff519f5b88, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4751
    frame #20: 0x0000000118537edc XUL`XRE_main(argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4846
    frame #21: 0x000000011854b2c7 XUL`mozilla::BootstrapImpl::XRE_main(this=0x000000010e413138, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at Bootstrap.cpp:45
    frame #22: 0x000000010e20a3d6 firefox`do_main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:236
    frame #23: 0x000000010e209eb1 firefox`main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:309
    frame #24: 0x00007fffd2846255 libdyld.dylib`start + 1
    frame #25: 0x00007fffd2846255 libdyld.dylib`start + 1

This call comes from this code: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104>

Previously we would return false here but now we would go ahead and call DoProcessMoreGeckoEvents().  This is what our state looks like at this point:

(lldb) p *this
(nsBaseAppShell) $4 = {
  mRefCnt = {
    mValue = 8
  }
  _mOwningThread = (mThread = 0x000000010e4511a0)
  mSuspendNativeCount = 0
  mEventloopNestingLevel = 1
  mDummyEvent = (mRawPtr = 0x0000000000000000)
  mBlockedWait = 0x0000000000000000
  mFavorPerf = 0
  mNativeEventPending = {
    mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
      mValue = 1
    }
  }
  mStarvationDelay = 0
  mSwitchTime = 81294028
  mLastNativeEventTime = 82328514
  mEventloopNestingState = eEventloopNone
  mRunning = false
  mExiting = false
  mBlockNativeEvent = false
  mProcessedGeckoEvents = true
}

Then we end up returning early from here: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#224> without scheduling another native event callback.

I then looked at what happens when we get to nsBaseAppShell::OnProcessNextEvent() the next time.  We start off with this state:

(lldb) p *this
(nsBaseAppShell) $10 = {
  mRefCnt = {
    mValue = 8
  }
  _mOwningThread = (mThread = 0x000000010e4511a0)
  mSuspendNativeCount = 0
  mEventloopNestingLevel = 1
  mDummyEvent = (mRawPtr = 0x0000000000000000)
  mBlockedWait = 0x0000000000000000
  mFavorPerf = 0
  mNativeEventPending = {
    mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
      mValue = 1
    }
  }
  mStarvationDelay = 0
  mSwitchTime = 81294028
  mLastNativeEventTime = 82328514
  mEventloopNestingState = eEventloopNone
  mRunning = false
  mExiting = false
  mBlockNativeEvent = false
  mProcessedGeckoEvents = true
}

We first start to process native events but we give up here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#613>, so moreEvents will remain false and ProcessNextNativeEvent() will return false.  So we bail out from processing native events and get to here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#281> and run NS_HasPendingEvents() again.  At this point NS_HasPendingEvents() will still return true so the condition here will evaluate to false and we won’t enter the loop.  Similarly for the condition here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#298>.

From this point on, the program never reaches this point <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#621> so no further native event processing happens any more, we will only process XPCOM events from this point on.




I think the problem is that this call site to NS_HasPendingEvents() <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104> is really special.  See this comment about its semantics: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/windows/nsAppShell.cpp#287>  Basically this code is trying to see if we hit the starvation limit and if so it’s trying to schedule another native callback to see try to run the rest of the Gecko events in the next run of the callback.  However if we end up scheduling this next native callback in the precise window of time when we only have idle events in our queue and nothing in nsThread::mEvents, we get into this weird scenario on OSX at least (and possibly into other unexpected situations on other OSes, I didn’t try to debug this elsewhere.)  I think the right way to fix this is to add a way to query the state of the non-idle queue of the nsThread and use it for this call site.
(Assignee)

Comment 17

9 months ago
Created attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately

nsBaseAppShell::NativeEventCallback()'s particular semantics makes it care about
the non-idle event queue specifically so we need to make that particular consumer
not take the idle queue into account.
Attachment #8873923 - Flags: review?(bugs)
(Assignee)

Updated

9 months ago
Attachment #8872420 - Attachment is obsolete: true

Comment 18

9 months ago
I have no idea why the patch makes any sense. Before idle stuff all those runnables were in the normal event queue, why does it matter that they are now in idle queue?
Does the patch fix the issue by accident only?

Comment 19

9 months ago
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately

By default this looks wrong, but please explain why this is the right approach.
The comment in nsBaseAppShell::NativeEventCallback() doesn't explain _why_ this is needed.
Attachment #8873923 - Flags: review?(bugs)
(Assignee)

Comment 20

9 months ago
The last paragraph in comment 16 explains the reason.  Please read that entire comment before reviewing, otherwise the patych looks arbitrary.
(Assignee)

Updated

9 months ago
Attachment #8873923 - Flags: review?(bugs)

Comment 21

9 months ago
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately

Nothing explained the reason, and I'm pretty the issue is that mIdleEvents in nsThread doesn't use nsEventQueue::eSharedCondVarQueue mode, as the normal queue and queue for vsync.
Attachment #8873923 - Flags: review?(bugs)
(Assignee)

Updated

9 months ago
Blocks: 1300275
(Assignee)

Updated

9 months ago
No longer blocks: 1300275
Depends on: 1300275
(Assignee)

Comment 22

9 months ago
Created attachment 8874470 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()
Attachment #8874470 - Flags: review?(bugs)
(Assignee)

Updated

9 months ago
Attachment #8873923 - Attachment is obsolete: true

Comment 23

9 months ago
Comment on attachment 8874470 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()

>+TimeStamp
>+nsThread::GetIdleDeadline()
>+{
>+  TimeStamp idleDeadline;
>+  {
>+    // Releasing the lock temporarily since getting the idle period
>+    // might need to lock the timer thread. Unlocking here might make
>+    // us receive an event on the main queue, but we've committed to
>+    // run an idle event anyhow.
>+    MutexAutoUnlock unlock(mLock);
>+    mIdlePeriod->GetIdlePeriodHint(&idleDeadline);
>+  }
>+
>+  // If HasPendingEvents() has been called and it has returned true because of
>+  // pending idle events, there is a risk that we may decide here that we aren't
>+  // idle and return null, in which case HasPendingEvents() has effectively
>+  // lied.  Since we can't go back and fix the past, we have to adjust what we
>+  // do here and forcefully pick the idle queue task here.  Note that this means
>+  // that we are choosing to run a task from the idle queue when we would
>+  // normally decide that we aren't in an idle period, but this can only happen
>+  // if we fall out of the idle period in between the call to HasPendingEvents()
>+  // and here, which should hopefully be quite rare.  We are effectively
>+  // choosing to prioritize the sanity of our API semantics over the optimal
>+  // scheduling.
>+  if (!mHasPendingEventsPromisedIdleEvent &&
>+      (!idleDeadline || idleDeadline < TimeStamp::Now())) {
>+    return TimeStamp();
>+  }
>+  return idleDeadline;
>+}
This can't work. GetIdlePeriodHint may return IsNull timestamp, 
so GetIdleEvent might return null after all. You really need to return some non-null timestamp
when mHasPendingEventsPromisedIdleEvent is true. returning TimeStamp::Now() would be probably fine, or some dummy
now + 1ms



> NS_IMETHODIMP
> nsThread::HasPendingEvents(bool* aResult)
> {
>+  mHasPendingEventsPromisedIdleEvent = false;
>+
>   if (NS_WARN_IF(PR_GetCurrentThread() != mThread)) {
>     return NS_ERROR_NOT_SAME_THREAD;
>   }
> 
>   {
>     MutexAutoLock lock(mLock);

Could you move mHasPendingEventsPromisedIdleEvent = false;
here, after autolock
Attachment #8874470 - Flags: review?(bugs) → review-
(Assignee)

Comment 24

9 months ago
Created attachment 8874493 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()
Attachment #8874493 - Flags: review?(bugs)
(Assignee)

Updated

9 months ago
Attachment #8874470 - Attachment is obsolete: true
I re-applied the original patch that landed in comment 9 on central, and was easily able to reproduce the hang locally.

I am _not_ able to reproduce the hang with the latest patch. *thumbs up*.
(To be clear, the hang I'm referring to is the one originally reported in bug 1369123).

Updated

9 months ago
Attachment #8874493 - Flags: review?(bugs) → review+
(Assignee)

Updated

9 months ago
Depends on: 1370448
(Assignee)

Comment 27

9 months ago
Once the patch in bug 1370448 gets r+ed I will hopefully be able to land this.

Comment 28

9 months ago
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug

Comment 29

9 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/16798a4167a1
Status: REOPENED → RESOLVED
Last Resolved: 9 months ago9 months ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.