Closed Bug 1734705 Opened 3 months ago Closed 3 months ago

Sluggish scrolling if the gecko event loop is busy (native scroll events no longer get coalesced properly by macOS)

Categories

(Core :: Widget: Cocoa, defect, P2)

defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox92 --- unaffected
firefox93 --- unaffected
firefox94 + wontfix
firefox95 --- fixed

People

(Reporter: jrmuizel, Assigned: jrmuizel)

References

(Regression)

Details

(Keywords: regression)

Attachments

(4 files)

Check out the "APZ payload presented markers" on the compositor thread https://share.firefox.dev/3AlvLc3

Most of that latency comes from the "native event time stamp -> main thread sees native event" part of the pipeline. You can see high latency on the DOMEvent markers on the main thread: https://share.firefox.dev/3Dnfrcy

So then the question is, what kind of load are we producing so that the system can't deliver events to us fast enough? Or is there a problem with the event loop?

Regressed by: 1690687

I regression windowed this and bug 1690687 is to blame. It seems like we're probably starving the native event loop.

[Tracking Requested - why for this release]:
This regression can make Firefox noticeably less responsive.

Keywords: regression

Set release status flags based on info from the regressing bug 1690687

Steps to reproduce:

  1. In a fresh profile, enable devtools.chrome.enabled.
  2. Open the browser console.
  3. Paste the following and press enter:
var geckoEventLoopClogger = (function () {
  let messageName = 'ping-pong';
  let workerJs = `
    onmessage = e => {
      postMessage(e.data);
    };
  `;
  let blobUrl = URL.createObjectURL(new Blob([workerJs]));
  let worker;
  function handleMessage(event) {
    let startTime = performance.now();
    while (performance.now() - startTime < 14) {}
    worker.postMessage(event.data);
  }
  function start() {
    if (worker) {
      stop();
    }
    worker = new Worker(blobUrl);
    worker.addEventListener('message', handleMessage);
    worker.postMessage('ping-pong');
  }
  function stop() {
    if (!worker) {
      return;
    }
    worker.removeEventListener('message', handleMessage);
    worker.terminate();
    worker = undefined;
  }

  return { start, stop };
})();
  1. Type geckoEventLoopClogger.start() and press enter.
  2. In the main browser window, scroll with the touchpad or the magic mouse, up and down, continuously.
  3. On the browser console, execute geckoEventLoopClogger.stop()

Expected results:
Scrolling should be responsive and not lag very far behind.

Actual results:
Scrolling feels very sluggish and falls behind more and more.

Profile before bug 1690687 (build 2021-09-05): https://share.firefox.dev/3AoLWp3
Profile after bug 1690687 (build 2021-09-08): https://share.firefox.dev/2WZWWvu

Summary: Bad scrolling on Phoronix on macOS → Scroll events no longer get coalesced properly by macOS if the gecko event loop is busy
Summary: Scroll events no longer get coalesced properly by macOS if the gecko event loop is busy → Sluggish scrolling if the gecko event loop is busy (native scroll events no longer get coalesced properly by macOS)

I believe this is caused by AcquireFirstMatchingEventInQueue breaking CGEvent coalescing. I think we were previously often pulling out the dummy event, which may have prevented AcquireFirstMatchingEventInQueue from interfering with the coalescing.

Component: Graphics: WebRender → Widget: Cocoa

I'm not going to be able to work on this anytime soon.

Ah, I wasn't expecting you to. Just thinking you might remember some things from bug 996848.

Here's the patch that seems to avoid the problem:
https://treeherder.mozilla.org/jobs?repo=try&revision=994d02d0eff88ee8003ceaa8aefd1ba9da3684f8&selectedTaskRun=civiSeINTYWqch7VCyyXMg.0

It just skips all of the calls to AcquireFirstMatchingEventInQueue. Steven, do you recall why they're needed?

Flags: needinfo?(smichaud)

The code you're changing is in bool nsAppShell::ProcessNextNativeEvent(bool aMayWait). The change you have in mind basically chops off the second part of the if (aMayWait) conditional. With this change, nsAppShell::ProcessNextNativeEvent() will never process a native event while aMayWait is false.

I don't know whether this will make trouble. It might even be an improvement. But I have a suggestion that's less radical and might still solve your problem:

    diff --git a/widget/cocoa/nsAppShell.mm b/widget/cocoa/nsAppShell.mm
    --- a/widget/cocoa/nsAppShell.mm
    +++ b/widget/cocoa/nsAppShell.mm
    @@ -693,7 +693,7 @@ bool nsAppShell::ProcessNextNativeEvent(
           EventRef currentEvent =
               AcquireFirstMatchingEventInQueue(currentEventQueue, 0, NULL, kEventQueueOptionsNone);
           if (!currentEvent) {
    -        continue;
    +        break;
           }
           EventAttributes attrs = GetEventAttributes(currentEvent);
           UInt32 eventKind = GetEventKind(currentEvent);

Your problem isn't caused by calling AcquireFirstMatchingEventInQueue() directly -- there's another call (below) that your patch doesn't remove. The real question is why I avoided calling [NSApp nextEventMatchingMask:...] and [NSApp sendEvent:] when aMayWait is false. I no longer have a clear memory of what I was thinking when I wrote my patch for bug 996848. But I left copious notes there and in my source code. As best I can tell, the best overall summary is in this code comment.

Flags: needinfo?(smichaud)

Where (in macOS system code) does "CGEvent coalescing" happen?

I now have much better debugging tools than I did when I wrote my patch for bug 996848 (HookCase and the techniques I've developed for using it). At some point I'd like to dig into this code, and see what might interfere with it (for example not spinning the native event loop when aMayWait is false).

Oops, I just noticed a bug in my patch for bug 996848, which should result in the leaking of EventRefs, and possibly other problems. Here's the fix:

    diff --git a/widget/cocoa/nsAppShell.mm b/widget/cocoa/nsAppShell.mm
    --- a/widget/cocoa/nsAppShell.mm
    +++ b/widget/cocoa/nsAppShell.mm
    @@ -730,8 +730,17 @@ bool nsAppShell::ProcessNextNativeEvent(
       } while (mRunningEventLoop);
     
       if (eventProcessed) {
    -    moreEvents = (AcquireFirstMatchingEventInQueue(currentEventQueue, 0, NULL,
    -                                                   kEventQueueOptionsNone) != NULL);
    +    EventRef nextEvent =
    +      AcquireFirstMatchingEventInQueue(currentEventQueue, 0, NULL,
    +                                       kEventQueueOptionsNone);
    +    if (nextEvent) {
    +      // This call to ReleaseEvent() matches a call to RetainEvent() in
    +      // AcquireFirstMatchingEventInQueue().
    +      ReleaseEvent(nextEvent);
    +      moreEvents = true;
    +    } else {
    +      moreEvents = false;
    +    }
       }
     
       mRunningEventLoop = wasRunningEventLoop;

(In reply to Steven Michaud [:smichaud] (Retired) from comment #8)

Your problem isn't caused by calling AcquireFirstMatchingEventInQueue() directly -- there's another call (below) that your patch doesn't remove.

Setting eventProcessed = false avoids the second call to AcquireFirstMatchingEventInQueue() as well.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #9)

Where (in macOS system code) does "CGEvent coalescing" happen?

AcquireFirstMatchingEventInQueue() calls PullEventsFromWindowServer() which calls Convert1CGEvent(). This pulls an event out of sCGEventQueue into the Carbon event queue.

The coalescing happens in PushToCGEventQueue it will coalesce a wheel event with the sLastMouseCGEvent

Setting eventProcessed = false avoids the second call to AcquireFirstMatchingEventInQueue() as well.

Yes, when nsAppShell::ProcessNextNativeEvent() is called with aMayWait == false. But it is called when aMayWait and eventProcessed are true.

AcquireFirstMatchingEventInQueue() calls PullEventsFromWindowServer() which calls Convert1CGEvent(). This pulls an event out of sCGEventQueue into the Carbon event queue.

So you're saying that pulling a CGEvent out of the sCGEventQueue prevents it from being coalesced with another CGEvent?

The coalescing happens in PushToCGEventQueue it will coalesce a wheel event with the sLastMouseCGEvent

What calls PushToCGEventQueue?

(In reply to Steven Michaud [:smichaud] (Retired) from comment #12)

Setting eventProcessed = false avoids the second call to AcquireFirstMatchingEventInQueue() as well.

Yes, when nsAppShell::ProcessNextNativeEvent() is called with aMayWait == false. But it is called when aMayWait and eventProcessed are true.

True. Perhaps that's uncommon enough that the problem won't be noticeable.

AcquireFirstMatchingEventInQueue() calls PullEventsFromWindowServer() which calls Convert1CGEvent(). This pulls an event out of sCGEventQueue into the Carbon event queue.

So you're saying that pulling a CGEvent out of the sCGEventQueue prevents it from being coalesced with another CGEvent?

I believe so yes.

The coalescing happens in PushToCGEventQueue it will coalesce a wheel event with the sLastMouseCGEvent

What calls PushToCGEventQueue?

PullEventsFromWindowServerOnConnection which is called by PullEventsFromWindowServer. (This seems to be mostly happening on a separate NSEventThread thread now)

I'm looking at several Apple frameworks in a disassembler (HIToolbox, CoreGraphics, SkyLight, CoreFoundation), trying to find a way to discover whether or not an event is waiting for delivery without either spinning the event loop or pulling it out of the sCGEventQueue. So far no luck. But these things tend to be very complicated, and I don't really have the time now.

We can't spin the event loop when aMayWait is false. But that may not leave us another way to query for waiting events (or to process them) that doesn't cause other problems. So it might be worthwhile testing a patch on the trunk that gets rid of the second half (the false half) of the if (aMayWait) conditional. It should bake there (and potentially on the other branches) for a long time. And I think it should be backed out again if there are any problems.

Maybe by then I'll have time to really work on this bug.

By the way, I've written a HookCase hook library that covers a lot of this ground, and (if I may say so myself) is the best general reference for macOS event processing that I've ever seen:

https://github.com/steven-michaud/HookCase/blob/master/examples-events.md

It doesn't currently cover event coalescing, but I'll try to add that later.

I'd still like you to try my suggestion from comment #8. The results may be good enough to muddle through without making major changes to nsAppShell::ProcessNextNativeEvent().

Severity: -- → S2
Priority: -- → P2

I tried out your suggestion from comment 8 and it did not fix the issue. Since AcquireFirstMatchingEventInQueue() is still being called it seems like we'd be still pulling an event out of sCGEventQueue which will prevent the coalescing.

I hoped that it would be called less often. Oh well. Thanks for trying.

We'll already leave some events that we get back from
AcquireFirstMatchingEventInQueue in the queue. This change makes it so
we always do that without even looking at the queue. It's not completely
clear what impact this will have, so let's just keep it Nightly only
for now.

Assignee: nobody → jmuizelaar
Status: NEW → ASSIGNED

Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is true

This should actually be:

Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false

Attachment #9245273 - Attachment description: Bug 1734705. Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is true. → Bug 1734705. Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false.
Pushed by jmuizelaar@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8b7ed045dbd1
Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false. r=smichaud

Backed out changeset 8b7ed045dbd1 (Bug 1734705) for causing build bustages on nsAppShell.mm.
Backout link
Push with failures
Failure Log

Flags: needinfo?(jmuizelaar)
Pushed by jmuizelaar@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/48ce6a14c833
Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false. r=smichaud
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
Flags: needinfo?(jmuizelaar)

What kind of baking/testing/verification does this need before we nominate for Beta uplift? I guess we'd need another patch making it not Nightly-only when the time comes too.

Flags: needinfo?(jmuizelaar)

As far as I'm concerned, I think it just needs lots of exposure to many different kinds of users. This argues for extended baking on both trunk and beta (possibly for longer than a single update cycle on each). I have no idea what kinds of problems might occur, and so have no specific ideas how to test for them. Jeff might have some ideas, though.

Edit: On second thought, it'd be worthwhile to test for native event starvation (if we can figure out how to do that). The current patch stops all processing of native events by nsAppShell::ProcessNextNativeEvent() while aMayWait is false.

Are we going to want to ship bug 1690687 in Fx94 if we don't fix this bug there?

For what it's worth, I just performed a very primitive test for native event starvation in today's mozilla-central nightly (build id 20211012093200, which has this patch). I didn't see any problems.

I dragged objects around on Firefox's default start page. I also tested dragging the native color picker (and colors that I'd selected with it), using the testcase for bug 1618418.

Edit: I tested on macOS 10.15.7 build 19H1419 (the current version).

This will let us get some more testing.

We'll already leave some events that we get back from
AcquireFirstMatchingEventInQueue in the queue. This change makes it so
we always do that without even looking at the queue. It's not completely
clear what impact this will have, so let's just keep it Nightly only
for now.

Pushed by jmuizelaar@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/487c3eb6a0cd
Allow the event loop changes into EARLY_BETA. r=mstange

Comment on attachment 9245781 [details]
Bug 1734705. Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false.

Beta/Release Uplift Approval Request

  • User impact if declined: Fixes the sluggish scrolling caused by bug 1690687.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It was somewhat risky but the code is currently marked #ifdef EARLY_BETA_OR_EARLIER so it shouldn't cause any risk to release yet. I'm still trying to figure out whether it makes sense to back out bug 1690687 of beta yet, but in the mean time this will help us get more testing.
  • String changes made/needed:
Flags: needinfo?(jmuizelaar)
Attachment #9245781 - Flags: approval-mozilla-beta?

Comment on attachment 9245781 [details]
Bug 1734705. Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false.

Approved for 94.0b6. I will note, however, that 94.0b6 is our final early beta this cycle. Once 94.0b7 ships on Sunday night, we'll be back to where we started here.

Attachment #9245781 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9246238 [details]
Bug 1734705. Let bug 1734705 ride out to release.

Beta/Release Uplift Approval Request

  • User impact if declined: Fixes the sluggish scrolling caused by bug 1690687.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): There's definitely some risk to this, however, the code path is very hot and used for everything on macOS. This has been on Nightly for about a week and in beta for a day or two. The alternative is to back out 1690687.
  • String changes made/needed:
Attachment #9246238 - Flags: approval-mozilla-beta?

Comment on attachment 9245781 [details]
Bug 1734705. Avoid calling AcquireFirstMatchingEventInQueue when aMayWait is false.

We decided to backout bug 1690687 from Fx94, so this is no longer necessary.

Attachment #9245781 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Attachment #9246238 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Flags: qe-verify+
QA Whiteboard: [qa-triaged]
You need to log in before you can comment on or make changes to this bug.