Closed Bug 1368286 Opened 2 years ago Closed 2 years ago

nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 + fixed

People

(Reporter: ehsan, Assigned: ehsan)

References

Details

Attachments

(1 file, 4 obsolete files)

STR:

1. Apply my patch in bug 1361461.
2. Apply my patch in bug 1368285.
3. Run `./mach xpcshell-test js/xpconnect/tests/unit/test_nuke_webextension_wrappers.js`

The test times out.

The reason this is happening is that during event processing we continually hit nsBaseAppShell::DispatchDummyEvent() <https://searchfox.org/mozilla-central/rev/596d188f6dbc8cb023e625f0a4310d184875f8fc/widget/nsBaseAppShell.cpp#299> and therefore every time nsThread::GetEvent() gets called for our main thread, we'd have something in mEvents, and we end up never looking at mIdleEvents as a result and starving the idle event that was dispatched.

This dummy event business is ancient and I'm actually not sure if we really need it any more.  I posted a patch to the try server to see if we can fix this issue by simply removing this old mechanism: <https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab2ea816712de9dea2a214ad3cc684b8d585e9d0>
Try results look promising!
Why do we end up calling DispatchDummeEvent all the time? 
What code passes true as mayWait?

Couldn't DispatchDummyEvent dispatch the event to idle queue?
Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I think it should.
Comment on attachment 8872109 [details] [diff] [review]
Don't dispatch a dummy event at the end of nsBaseAppShell::OnProcessNextEvent()

I don't think the reasons why the code was added have disappeared.
Attachment #8872109 - Flags: review?(bugs) → review-
(In reply to Olli Pettay [:smaug] from comment #4)
> Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I
> think it should.

Yes, that is indeed the root cause of the problem.
Summary: DummyEvent gets in the way of idle events running → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process
Summary: nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing
Attachment #8872109 - Attachment is obsolete: true
Attachment #8872420 - Flags: review?(bugs) → review+
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b814165b471
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug
https://hg.mozilla.org/mozilla-central/rev/0b814165b471
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Backed out because of bug 1369123.

https://hg.mozilla.org/mozilla-central/rev/edffa38ec0c5
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
An alternative approach here would be to make the dummy event special in some way.  We could use QI to some interface, but a more performance way would be to do something like:

  StaticRefPtr<nsIRunnable> sDummy;

  bool
  IsDummy(nsIRunnable* aRunnable)
  {
    return aRunnable == sDummy;
  }

  nsresult
  DispatchDummy(nsIThread* aThread)
  {
    return aThread->Dispatch(sDummy, nsIThread::DISPATCH_NORMAL);
  }

Then make idle processing do something like:

  if (!HasPendingEvents() || (LengthIsOne() && IsDummy(FirstEvent())) {
    // we're idle
  }

That would avoid changing behavior for other code using HasPendingEvents().
[Tracking Requested - why for this release]: Idle queue events may end up being discarded...
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #12)
> [Tracking Requested - why for this release]: Idle queue events may end up
> being discarded...

Err, I meant to say starved.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #11)
> An alternative approach here would be to make the dummy event special in
> some way.

Yeah, I guess, but it depends on the root cause of the bug, and I for one still at least don't really understand it.  I actually realized I still have my old MacBook at home and spun up a recent build on it but haven't yet gotten a chance to test this myself.  I will give it a shot to see if I can figure out where the root cause is.  If the root cause is NS_HasPendingEvents() returning true in some cases confusing some other code this alternative idea may or may not help.
See Also: → 1329041
Tracking this for 55 to make sure we follow up to resolve the issue before release.
(Sorry for the length of the comment but the bug was complex.)

I debugged bug 1369123 on OSX.  Here is my findings.

I set a breakpoint in nsThread::HasPendingEvents() based on the condition of mEvents->HasPendingEvent() returning false and mIdleEvents.HasPendingEvent() returning true, since this is the only condition where nsThread::HasPendingEvents() would do something different after my patch.  The breakpoint got hit for the first time when I clicked the BuildAPI link after a few tries from the STR in bug 1369123 under this call stack:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x0000000110e23492 XUL`nsThread::HasPendingEvents(this=0x000000010fc20800, aResult=0x00007fff519f3917) at nsThread.cpp:1051
    frame #1: 0x0000000110e2c1fa XUL`hasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:430
    frame #2: 0x0000000110e2c1cd XUL`NS_HasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:448
    frame #3: 0x00000001157c5644 XUL`nsBaseAppShell::NativeEventCallback(this=0x000000010fc85b80) at nsBaseAppShell.cpp:104
    frame #4: 0x0000000115862b92 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x000000010fc85b80) at nsAppShell.mm:399
    frame #5: 0x00007fffbd318981 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
    frame #6: 0x00007fffbd2f9a7d CoreFoundation`__CFRunLoopDoSources0 + 557
    frame #7: 0x00007fffbd2f8f76 CoreFoundation`__CFRunLoopRun + 934
    frame #8: 0x00007fffbd2f8974 CoreFoundation`CFRunLoopRunSpecific + 420
    frame #9: 0x00007fffbc884a5c HIToolbox`RunCurrentEventLoopInMode + 240
    frame #10: 0x00007fffbc884891 HIToolbox`ReceiveNextEventCommon + 432
    frame #11: 0x00007fffbc8846c6 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
    frame #12: 0x00007fffbae2a5b4 AppKit`_DPSNextEvent + 1120
    frame #13: 0x00007fffbb5a4d6b AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 2789
    frame #14: 0x0000000115861684 XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x000000010e4ea740, _cmd="nextEventMatchingMask:untilDate:inMode:dequeue:", mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:130
    frame #15: 0x00007fffbae1ef35 AppKit`-[NSApplication run] + 926
    frame #16: 0x000000011586360c XUL`nsAppShell::Run(this=0x000000010fc85b80) at nsAppShell.mm:673
    frame #17: 0x00000001183e15fb XUL`nsAppStartup::Run(this=0x000000012a9e1a10) at nsAppStartup.cpp:283
    frame #18: 0x00000001185361cd XUL`XREMain::XRE_mainRun(this=0x00007fff519f5b88) at nsAppRunner.cpp:4571
    frame #19: 0x000000011853775a XUL`XREMain::XRE_main(this=0x00007fff519f5b88, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4751
    frame #20: 0x0000000118537edc XUL`XRE_main(argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4846
    frame #21: 0x000000011854b2c7 XUL`mozilla::BootstrapImpl::XRE_main(this=0x000000010e413138, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at Bootstrap.cpp:45
    frame #22: 0x000000010e20a3d6 firefox`do_main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:236
    frame #23: 0x000000010e209eb1 firefox`main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:309
    frame #24: 0x00007fffd2846255 libdyld.dylib`start + 1
    frame #25: 0x00007fffd2846255 libdyld.dylib`start + 1

This call comes from this code: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104>

Previously we would return false here but now we would go ahead and call DoProcessMoreGeckoEvents().  This is what our state looks like at this point:

(lldb) p *this
(nsBaseAppShell) $4 = {
  mRefCnt = {
    mValue = 8
  }
  _mOwningThread = (mThread = 0x000000010e4511a0)
  mSuspendNativeCount = 0
  mEventloopNestingLevel = 1
  mDummyEvent = (mRawPtr = 0x0000000000000000)
  mBlockedWait = 0x0000000000000000
  mFavorPerf = 0
  mNativeEventPending = {
    mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
      mValue = 1
    }
  }
  mStarvationDelay = 0
  mSwitchTime = 81294028
  mLastNativeEventTime = 82328514
  mEventloopNestingState = eEventloopNone
  mRunning = false
  mExiting = false
  mBlockNativeEvent = false
  mProcessedGeckoEvents = true
}

Then we end up returning early from here: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#224> without scheduling another native event callback.

I then looked at what happens when we get to nsBaseAppShell::OnProcessNextEvent() the next time.  We start off with this state:

(lldb) p *this
(nsBaseAppShell) $10 = {
  mRefCnt = {
    mValue = 8
  }
  _mOwningThread = (mThread = 0x000000010e4511a0)
  mSuspendNativeCount = 0
  mEventloopNestingLevel = 1
  mDummyEvent = (mRawPtr = 0x0000000000000000)
  mBlockedWait = 0x0000000000000000
  mFavorPerf = 0
  mNativeEventPending = {
    mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
      mValue = 1
    }
  }
  mStarvationDelay = 0
  mSwitchTime = 81294028
  mLastNativeEventTime = 82328514
  mEventloopNestingState = eEventloopNone
  mRunning = false
  mExiting = false
  mBlockNativeEvent = false
  mProcessedGeckoEvents = true
}

We first start to process native events but we give up here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#613>, so moreEvents will remain false and ProcessNextNativeEvent() will return false.  So we bail out from processing native events and get to here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#281> and run NS_HasPendingEvents() again.  At this point NS_HasPendingEvents() will still return true so the condition here will evaluate to false and we won’t enter the loop.  Similarly for the condition here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#298>.

From this point on, the program never reaches this point <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#621> so no further native event processing happens any more, we will only process XPCOM events from this point on.




I think the problem is that this call site to NS_HasPendingEvents() <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104> is really special.  See this comment about its semantics: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/windows/nsAppShell.cpp#287>  Basically this code is trying to see if we hit the starvation limit and if so it’s trying to schedule another native callback to see try to run the rest of the Gecko events in the next run of the callback.  However if we end up scheduling this next native callback in the precise window of time when we only have idle events in our queue and nothing in nsThread::mEvents, we get into this weird scenario on OSX at least (and possibly into other unexpected situations on other OSes, I didn’t try to debug this elsewhere.)  I think the right way to fix this is to add a way to query the state of the non-idle queue of the nsThread and use it for this call site.
nsBaseAppShell::NativeEventCallback()'s particular semantics makes it care about
the non-idle event queue specifically so we need to make that particular consumer
not take the idle queue into account.
Attachment #8873923 - Flags: review?(bugs)
Attachment #8872420 - Attachment is obsolete: true
I have no idea why the patch makes any sense. Before idle stuff all those runnables were in the normal event queue, why does it matter that they are now in idle queue?
Does the patch fix the issue by accident only?
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately

By default this looks wrong, but please explain why this is the right approach.
The comment in nsBaseAppShell::NativeEventCallback() doesn't explain _why_ this is needed.
Attachment #8873923 - Flags: review?(bugs)
The last paragraph in comment 16 explains the reason.  Please read that entire comment before reviewing, otherwise the patych looks arbitrary.
Attachment #8873923 - Flags: review?(bugs)
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately

Nothing explained the reason, and I'm pretty the issue is that mIdleEvents in nsThread doesn't use nsEventQueue::eSharedCondVarQueue mode, as the normal queue and queue for vsync.
Attachment #8873923 - Flags: review?(bugs)
Blocks: 1300275
No longer blocks: 1300275
Depends on: 1300275
Attachment #8873923 - Attachment is obsolete: true
Comment on attachment 8874470 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()

>+TimeStamp
>+nsThread::GetIdleDeadline()
>+{
>+  TimeStamp idleDeadline;
>+  {
>+    // Releasing the lock temporarily since getting the idle period
>+    // might need to lock the timer thread. Unlocking here might make
>+    // us receive an event on the main queue, but we've committed to
>+    // run an idle event anyhow.
>+    MutexAutoUnlock unlock(mLock);
>+    mIdlePeriod->GetIdlePeriodHint(&idleDeadline);
>+  }
>+
>+  // If HasPendingEvents() has been called and it has returned true because of
>+  // pending idle events, there is a risk that we may decide here that we aren't
>+  // idle and return null, in which case HasPendingEvents() has effectively
>+  // lied.  Since we can't go back and fix the past, we have to adjust what we
>+  // do here and forcefully pick the idle queue task here.  Note that this means
>+  // that we are choosing to run a task from the idle queue when we would
>+  // normally decide that we aren't in an idle period, but this can only happen
>+  // if we fall out of the idle period in between the call to HasPendingEvents()
>+  // and here, which should hopefully be quite rare.  We are effectively
>+  // choosing to prioritize the sanity of our API semantics over the optimal
>+  // scheduling.
>+  if (!mHasPendingEventsPromisedIdleEvent &&
>+      (!idleDeadline || idleDeadline < TimeStamp::Now())) {
>+    return TimeStamp();
>+  }
>+  return idleDeadline;
>+}
This can't work. GetIdlePeriodHint may return IsNull timestamp, 
so GetIdleEvent might return null after all. You really need to return some non-null timestamp
when mHasPendingEventsPromisedIdleEvent is true. returning TimeStamp::Now() would be probably fine, or some dummy
now + 1ms



> NS_IMETHODIMP
> nsThread::HasPendingEvents(bool* aResult)
> {
>+  mHasPendingEventsPromisedIdleEvent = false;
>+
>   if (NS_WARN_IF(PR_GetCurrentThread() != mThread)) {
>     return NS_ERROR_NOT_SAME_THREAD;
>   }
> 
>   {
>     MutexAutoLock lock(mLock);

Could you move mHasPendingEventsPromisedIdleEvent = false;
here, after autolock
Attachment #8874470 - Flags: review?(bugs) → review-
Attachment #8874470 - Attachment is obsolete: true
I re-applied the original patch that landed in comment 9 on central, and was easily able to reproduce the hang locally.

I am _not_ able to reproduce the hang with the latest patch. *thumbs up*.
(To be clear, the hang I'm referring to is the one originally reported in bug 1369123).
Attachment #8874493 - Flags: review?(bugs) → review+
Depends on: 1370448
Once the patch in bug 1370448 gets r+ed I will hopefully be able to land this.
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug
https://hg.mozilla.org/mozilla-central/rev/16798a4167a1
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.