Open Bug 1677509 Opened 4 years ago Updated 9 days ago

SendReceiveMouseInputEvent often blocks the main thread of the parent process

Categories

(Core :: IPC, defect)



Performance Impact high

People

(Reporter: florian, Unassigned, NeedInfo)

References

Details

(Keywords: perf:responsiveness, Whiteboard: [bhr:SendReceiveMouseInputEvent])

Attachments

(1 file)

BHR data shows this represents a bit more than 1% of the total hang time reported from our Nightly population: http://queze.net/bhr/test/#row=&filter=SendReceiveMouseInputEvent

Here's a profile of it on the slow quantum reference laptop: https://share.firefox.dev/3nv2UMg

Why is this IPC synchronous? Is there something that could be done about it?

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation? The IPC is synchronous because we need the input events to get properly un-transformed based on the current APZ async scroll, before they can be dispatched to gecko. If we made this asynchronous there's a risk of incoming events from the OS getting delivered to gecko out-of-order.

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #1)

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation?

The individual blue bars on this profile (the Graph type was "Stack height"; maybe you are more used to the "Categories" view? https://share.firefox.dev/2UDsFxz) show individual samples, i.e. the moments when the profiler stopped the process to capture a stack.

It's impossible to know how many IPC calls happened during this profile (I would have needed to profile with the ipcmessages feature enabled to have this information, or we would need to fix bug 1641181), but it's reasonable to guess that the first IPC, where there's a red jank marker, took about 100ms. Looking at the GPU process during the same time range, it has no samples, which likely means it got no CPU from the operating system scheduler during that time... so it likely didn't process the IPC either.

I was looking at the mousemove markers in the marker table for the parent process main thread. Each one of those will have an associated synchronous IPC to the GPU process main thread, and looking at the time delta between those markers, most of them are pretty small which indicates the main thread wasn't unduly held up while processing them. But yes, the 100ms jank at the beginning is concerning. If the problem is what you suggest, that the OS didn't schedule that thread during that time, then maybe that's something we can resolve by adjusting thread priorities.

Whiteboard: [qf][bhr:SendReceiveMouseInputEvent] → [qf:p1:responsiveness][bhr:SendReceiveMouseInputEvent]
Severity: -- → S3

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju
Florian, maybe you'd like to try again, now that bug 1641181 has landed?

How easy was it to get this to show up when you tried, Florian? I know the 2018 reference hardware is... exceptional when it comes to getting perf problems to show up. Would you be able to try to get another profile of this if it manifests easily?

Flags: needinfo?(florian)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #5)

How easy was it to get this to show up when you tried, Florian? I know the 2018 reference hardware is... exceptional when it comes to getting perf problems to show up. Would you be able to try to get another profile of this if it manifests easily?

It's easy to see it on the 2018 ref hardware: just load about:home, start the profiler, and move the mouse around quickly on the top site icons.

Here's a profile without CPU recording and without screenshots: https://share.firefox.dev/3aPuznd that has two >100ms sync IPCs. And here's another one with more profile features enabled, but the sync IPCs are slightly shorter (60-80ms range for the slowest ones): https://share.firefox.dev/32ZXYXK

(In reply to Gerald Squelart [:gerald] (he/him) from comment #4)

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju

Looks like you tried on wikipedia.org. It seems smoother for me on that page, but I still see some sync IPCs taking 20-30ms: https://share.firefox.dev/2S8kYBr

Flags: needinfo?(florian)

For any outside or later observers: here's a profile Florian took with the IPC I/O threads included: https://share.firefox.dev/3xC5G8e

Nothing interesting going on in those threads though. They look to be as idle as the rest of the threads we're waiting on. It seems like at a fundamental level here, we're just dealing with the fact that we're sending a ton of sync IPC messages from the parent to the GPU process, and in some circumstances that means that we jank the browser either because something is busy or because the necessary threads are just not being scheduled. We can't meaningfully adjust process or thread priorities here as far as I can reason, so either we accept this or we find a way to make these messages async.

tnikkel, does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

Flags: needinfo?(tnikkel)

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #1)

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation? The IPC is synchronous because we need the input events to get properly un-transformed based on the current APZ async scroll, before they can be dispatched to gecko. If we made this asynchronous there's a risk of incoming events from the OS getting delivered to gecko out-of-order.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #7)

tnikkel, does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

I don't know this area of code very well, but my naive guess is that in a Fission world, not using an up-to-date async scroll offset could mean we send the event to the wrong process (the location of an out-of-process iframe could be changing from under to not under the cursor, or vice versa). Forwarding the question to Botond.

Flags: needinfo?(tnikkel) → needinfo?(botond)

Are the messages mouse-related? I recall seeing code that throttled mouse moves (?) somewhere; maybe we can cut down on the number of sync calls?

(In reply to Doug Thayer [:dthayer] (he/him) from comment #7)

does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

I don't think it's completely insurmountable. We could conceivably make the messages async while preserving event ordering if we did something like:

  • Have the parent process maintain a queue of events. Note, all event types would need to go through the queue, not just the ones which need to be sent to the GPU process.
  • When the parent process receives an event from the OS, enqueue it. If it needs to be sent to the GPU process, send an async IPC message (with the response also async).
  • Consider an event ready to be dequeued if it has received the async response (for event types that need to be sent to the GPU process), or unconditionally (for other event types).
  • On appropriate triggers (after enqueue, after an async response from the GPU process was handled), check the queue for events ready to be dequeued and continue processing them.

It does sound like this would be a fairly non-trivial refactor.

I think such a design also has the potential to add latency to the processing of input events, though it's hard to reason about that, and whether or not it actually adds latency is likely to be workload-dependent (i.e. depending on what else the parent and GPU processes are doing).
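
For concreteness, here is a minimal sketch of the queue described above. All names (PendingEvent, InputEventQueue, SendTransformRequestToGpu, DispatchToGecko) are invented for illustration; this is not actual Gecko code:

  #include <deque>
  #include <functional>
  #include <utility>

  struct InputEvent { /* type, coordinates, etc. elided */ };

  // Stubs standing in for the real IPC and dispatch plumbing (hypothetical).
  void SendTransformRequestToGpu(const InputEvent& aEvent,
                                 std::function<void(InputEvent)> aOnReply);
  void DispatchToGecko(const InputEvent& aEvent);

  struct PendingEvent {
    InputEvent mEvent;
    bool mNeedsApzTransform = false;  // needs the GPU-process round trip?
    bool mTransformDone = false;      // async response received?
  };

  class InputEventQueue {
   public:
    // Called when the parent process receives an event from the OS.
    void Enqueue(InputEvent aEvent, bool aNeedsTransform) {
      mQueue.push_back({std::move(aEvent), aNeedsTransform, false});
      if (aNeedsTransform) {
        PendingEvent& entry = mQueue.back();  // deque refs survive push/pop
        SendTransformRequestToGpu(
            entry.mEvent, [this, &entry](InputEvent aTransformed) {
              entry.mEvent = std::move(aTransformed);
              entry.mTransformDone = true;
              Drain();  // a reply may unblock the head of the queue
            });
      }
      Drain();
    }

   private:
    void Drain() {
      // Dispatch in FIFO order; stop at the first event still waiting on
      // its async transform, which is what preserves event ordering.
      while (!mQueue.empty()) {
        PendingEvent& head = mQueue.front();
        if (head.mNeedsApzTransform && !head.mTransformDone) {
          return;
        }
        DispatchToGecko(head.mEvent);
        mQueue.pop_front();
      }
    }

    std::deque<PendingEvent> mQueue;
  };

The latency concern above is visible in the sketch: an event that needs no transform still waits behind any earlier event that does.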

Flags: needinfo?(botond)

(In reply to Timothy Nikkel (:tnikkel) from comment #9)

Are the messages mouse related? I recall seeing code that throttled mouse moves (?) somewhere, maybe we can cut down on the number of sync calls?

That could be something to explore, yeah.

cc Masayuki and Smaug, who might know more about any existing mouse-move throttling, and whether it may be possible to do in the parent process.

This is almost certainly a very small optimization, and likely won't address
the frequency of hangs from bug 1677509. However, it should still be an
improvement, and will work on anything Vista or later.

We observed, as part of trying to diagnose the somewhat mysterious bug 1677509,
that sometimes multiple messages would be queued, and yet the IPC I/O thread
would go idle in between servicing them in GetQueuedCompletionStatusEx. Given
that we send frequent sync IPC messages for mouse-move events, it seems prudent
to be able to service multiple at a time.
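
Roughly, the idea is the difference between pulling one completion packet per wake-up and draining a batch per wake-up. A minimal Win32 sketch of the latter (HandleCompletedIo is a hypothetical per-message handler; error handling elided):

  #include <windows.h>

  // Hypothetical handler for one completed I/O packet.
  void HandleCompletedIo(ULONG_PTR aKey, LPOVERLAPPED aOverlapped, DWORD aBytes);

  void ServiceCompletionPort(HANDLE aPort) {
    constexpr ULONG kMaxEntries = 16;  // arbitrary batch size for illustration
    OVERLAPPED_ENTRY entries[kMaxEntries];
    ULONG numRemoved = 0;
    // One wake-up can return several queued packets at once, instead of one
    // GetQueuedCompletionStatus() round trip per message.
    if (GetQueuedCompletionStatusEx(aPort, entries, kMaxEntries, &numRemoved,
                                    INFINITE, /* fAlertable */ FALSE)) {
      for (ULONG i = 0; i < numRemoved; ++i) {
        HandleCompletedIo(entries[i].lpCompletionKey,
                          entries[i].lpOverlapped,
                          entries[i].dwNumberOfBytesTransferred);
      }
    }
  }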

Assignee: nobody → dothayer
Status: NEW → ASSIGNED

I'm a little confused about the patch:

  • If these are sync messages between the same pair of threads, how would it be possible to handle more than one at a time? Aren't we necessarily ping-ponging regardless of how channels work?
  • Do we ever have more than one read and one write (per channel) in flight at the same time?

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #13)

I'm a little confused about the patch:

  • If these are sync messages between the same pair of threads, how would it be possible to handle more than one at a time? Aren't we necessarily ping-ponging regardless of how channels work?

The situation we observed (can't find the profile) looked something like this:

Parent process                  | GPU process
--------------------------------------------
send async msg                  |
send sync msg                   |
                                | recv async msg
                                |
                                |   (long idle gap before the sync
                                |    message is serviced)
                                |
                                | recv sync msg

  • Do we ever have more than one read and one write (per channel) in flight at the same time?

When I add logging right after the call to GetQueuedCompletionStatusEx with this patch applied, we very frequently encounter multiple messages returned. I didn't bother to count the ratio, but it's a constant spew of them in the console.

Does that answer your questions?

Flags: needinfo?(jld)

Botond, are there large swaths of time where we could know very cheaply that we don't need to transform any mouse move events? We see this occurring when the user isn't scrolling at all. Could we just have a shmem that the APZ code updates when it's obvious no scrolling is going on so the parent process can check that without having to send a sync IPC message?
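
A rough sketch of what that check could look like, assuming a lock-free atomic flag in shared memory (all names hypothetical; the shmem mapping itself is elided):

  #include <atomic>

  // Hypothetical state living in memory shared between the GPU (APZ) and
  // parent processes. Requires std::atomic<bool> to be lock-free, which it
  // is on the platforms we care about.
  struct ApzSharedState {
    std::atomic<bool> mAsyncScrollActive{false};
  };

  // GPU/APZ side: flip the flag when async scrolling starts or stops.
  void OnScrollActivityChanged(ApzSharedState* aShared, bool aActive) {
    aShared->mAsyncScrollActive.store(aActive, std::memory_order_release);
  }

  // Parent side: only pay for the sync IPC when an async scroll could have
  // moved content under the cursor.
  bool NeedsSyncTransform(const ApzSharedState* aShared) {
    return aShared->mAsyncScrollActive.load(std::memory_order_acquire);
  }

(As the reply below notes, this wouldn't be sufficient under Fission, where the round trip is also needed for compositor hit testing.)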

Flags: needinfo?(botond)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #15)

Botond, are there large swaths of time where we could know very cheaply that we don't need to transform any mouse move events? We see this occurring when the user isn't scrolling at all. Could we just have a shmem that the APZ code updates when it's obvious no scrolling is going on so the parent process can check that without having to send a sync IPC message?

I think that could have been feasible in a pre-Fission world.

However, with Fission, the sync dispatch is also needed so that the events can undergo compositor hit testing in the GPU process, and be annotated with a target LayersId so we know which content process to dispatch them to.

Flags: needinfo?(botond)

Two more cents:

I don't think that this is about spamming the GPU process with mouse events. Judging by the markers, in the worst of these profiles we are sending a mouse event on the order of every 10ms. That should be manageable, since by all accounts the handler is fast. Also, since ReceiveMouseInputEvent is sync, the only one ever in the queue is the one we are currently servicing. Since there aren't markers indicating that sampling is missing any of the IPC round-trips, I think we can assume each pause is for a single message. Moreover, the periodicity doesn't seem to correlate with jank -- indeed, we get more mouse moves when there isn't jank (presumably that's the OS throttling). Sync messaging is Bad, but I don't think it is the main culprit.

FYI, the IPDL mechanism for throttling mouse moves is used for parent <-> content communication -- it's the IPDL keyword [Compress] (see for example PBrowser::RealMouseMoveEvent). There is also [CompressAll]. They aren't especially versatile or deep -- a [CompressAll]ed message type will throw out an old message if one of that type is waiting in the message queue when you send a new one. [Compress] will only throw out the old message if it was at the back of the queue (i.e. it was the last message added). Either seems like it could be a decent idea for these messages, but I don't think it'll help this bug because I don't think we have a spamming issue.
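
For reference, the annotation looks roughly like this in PBrowser.ipdl (a sketch from memory; the exact signature may differ):

  // A compressed mouse-move message: [Compress] replaces the previous
  // pending message of this type if it is still the last one in the queue.
  [Compress] async RealMouseMoveEvent(WidgetMouseEvent event,
                                      ScrollableLayerGuid aGuid,
                                      uint64_t aInputBlockId);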

Now for the random guessing (I can't easily run experiments since my Windows machine self-immolated but I could dig out an old Surface to take a look). It's hard to see from the profiles but it kind of looks like the GPU process is waiting in ProcessNextNativeEvent a lot (essentially EVERY stack that isn't handling the mouse message -- 1399 out of 1548 samples), so it sits idle while IPC messages wait. The only profile with IPC markers didn't have much of a repro of the hang so I can't say that this is really a good guess but it would explain the latency.

Doug, re-upping your needinfo -- looks like you had a patch that landed and was backed out here. Do you have plans to get it re-landed and/or do you know what the extent of the treeherder badness was? (just a small issue that needs fixing, or some fundamental problem with the approach, or something else?)

Flags: needinfo?(dothayer)

Hey Daniel - sorry for the delay here. This patch was just intended to be a small change which should in theory be more optimal. However (and I should have documented this at the time), it introduced timeout failures in tests. It is possible that it was a perf improvement and yet introduced timeout issues by changing the ordering of some things, but I think it's more likely that it either introduced a perf degradation or had a bug; in either case, the impact of the patch should have been small enough that it's not worth looking into.

I don't and didn't historically have a patch that I am confident would resolve the actual hang here, so I'm unassigning myself.

Assignee: dothayer → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(dothayer)

Timothy pointed out that, apart from trying to make these messages async (which is possible but nontrivial and may affect input handling latency, per comment 10), another avenue to explore is speeding up the processing time of the messages while keeping them sync.

Nical, I know you did some work (in bug 1580178 and related) to try to avoid blocking on the render backend thread when doing hit-testing inside APZ. Do you know what the current status of that is, i.e. were the changes successful in making us avoid blocking in the common case?

Flags: needinfo?(nical.bugzilla)

Do you know what the current status of that is, i.e. were the changes successful in making us avoid blocking in the common case?

That landed a while ago and works well. What it does is maintain an immutable hit-testing tree that is atomically swapped each time the scene is updated (for example, on layout changes); all threads can safely read from the most current hit-testing tree without blocking.

This only avoids synchronous blocking within the process where rendering happens, so if the GPU process is enabled we probably still have some sync IPC chatter, but at least it shouldn't block on WebRender work completing.
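
The pattern described here boils down to publish-by-pointer-swap. A minimal C++ sketch (names invented; this is not the actual WebRender code):

  #include <memory>

  struct HitTestingTree { /* immutable nodes elided */ };

  class HitTestingSnapshot {
   public:
    // Scene builder side: publish a freshly built tree in one atomic swap
    // (e.g. after a layout change).
    void Publish(std::shared_ptr<const HitTestingTree> aNewTree) {
      std::atomic_store(&mCurrent, std::move(aNewTree));
    }

    // Any thread: grab a consistent snapshot and hit-test against it without
    // taking a lock; the tree is immutable once published.
    std::shared_ptr<const HitTestingTree> Acquire() const {
      return std::atomic_load(&mCurrent);
    }

   private:
    std::shared_ptr<const HitTestingTree> mCurrent;
  };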

Flags: needinfo?(nical.bugzilla)

(In reply to Florian Quèze [:florian] (PTO until January) from comment #6)

Here's a profile without CPU recording and without screenshots: https://share.firefox.dev/3aPuznd that has two >100ms sync IPCs. And here's another one with more profile features enabled, but the sync IPCs are slightly shorter (60-80ms range for the slowest ones): https://share.firefox.dev/32ZXYXK

(In reply to Gerald Squelart [:gerald] (he/him) from comment #4)

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju

Looks like you tried on wikipedia.org. It seems smoother for me on that page, but I still see some sync IPCs taking 20-30ms: https://share.firefox.dev/2S8kYBr

Can someone clarify the reason why the content process document is much faster than the parent one?

Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness][bhr:SendReceiveMouseInputEvent] → [bhr:SendReceiveMouseInputEvent]
See Also: → 1768766

Florian, do you have a more recent profile which shows this issue and which includes dynamic process priority markers?

Flags: needinfo?(florian)

(In reply to Markus Stange [:mstange] from comment #24)

Florian, do you have a more recent profile which shows this issue and which includes dynamic process priority markers?

I just grabbed one: https://share.firefox.dev/3JfKWJJ

Flags: needinfo?(florian)

I can see one slow synchronous IPC in that profile, here: https://share.firefox.dev/3Sdlk4w
It looks like the message is processed quickly in the compositor (20ms), but it takes 200ms for the "Reply" message to be processed in the parent. During that time the parent process main thread simply doesn't wake up.

It would be interesting to know if the IPC I/O thread woke up during that time. Florian, can you grab another profile with that thread added?

Flags: needinfo?(florian)

Ok, here's another case: https://share.firefox.dev/3PJnze1
The IPC markers tell the following story: 35ms to receive, 0.5ms to process, 64ms to reply
I think the reply took so long to get out because the IPC I/O thread in the GPU process didn't wake up sooner.

Based on the "Awake" markers, the IPC I/O thread in the GPU process has a "current priority" of 10, whereas the parent process main thread has a "current priority" of 12. But the latter is blocked because the former doesn't run.

Should we play a bit with the process priorities?
I can't recall what Chromium does currently, but https://bugs.chromium.org/p/chromium/issues/detail?id=965778 may have some background.

The Performance Priority Calculator has determined this bug's performance priority to be P1. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: [x] Windows [x] macOS [x] Linux
Impact on browser UI: Causes noticeable jank
[x] Bug affects multiple sites

Keywords: perf

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on performance.
:botond, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?

For more information, please visit auto_nag documentation.

Flags: needinfo?(botond)

Trying to summarize my understanding based on recent comments in this bug:

  • Based on comment 26, the ReceiveMouseInputEvent IPDL message being synchronous may not be the primary culprit for the observed performance issue after all
  • Based on comment 29, revising process priorities may be an appropriate next step

Markus, does this match your understanding? If so, would it be appropriate to find a home for this bug in a different component (e.g. XPCOM, or Performance perhaps)?

Flags: needinfo?(botond) → needinfo?(mstange.moz)

Good idea. So the bug is basically: Sync IPCs take too long to round-trip to the GPU process on busy low-core Windows machines.

I wonder if this has improved recently with the GPU process priority fix. There was also an IPC fix which removed one thread hop from the "sending" side, but I don't know if that would have affected the sync IPC reply case.

Florian, can you get another profile to see if this has improved?

Component: Panning and Zooming → IPC
Flags: needinfo?(mstange.moz) → needinfo?(florian)

(In reply to Markus Stange [:mstange] from comment #28)

Ok, here's another case: https://share.firefox.dev/3PJnze1
The IPC markers tell the following story: 35ms to receive, 0.5ms to process, 64ms to reply
I think the reply took so long to get out because the IPC I/O thread in the GPU process didn't wake up sooner.

If the reply was taking a long time to get out because the IPC I/O thread wasn't waking up, that would potentially be fixed by the thread-hop removal. After that change, the IPC I/O thread in most cases theoretically only needs to wake up when receiving messages. There's not really a way to get rid of the wake-up when receiving messages, though.

It's worth noting that this is only 0.5% of our total hang time these days.

Flags: needinfo?(jld)
Flags: needinfo?(florian)

:jld, can you review this bug and determine if the severity matches the performance impact? A High impact would translate to an S2. If you feel this is an S3, can you comment on the mismatch between our calculations and yours so we can reassess the impact?

Flags: needinfo?(jld)

Bas, do you think we should downgrade the performance impact or upgrade the severity?

Flags: needinfo?(jld) → needinfo?(bas)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #37)

Bas, do you think we should downgrade the performance impact or upgrade the severity?

Hrm. That's a tricky question and I could see it go either way.

So this is still responsible for about 2000 hangs a day on Nightly, averaging half a second (a fair number of them longer). That makes it one of the most severe causes of parent-process jank that our users experience, and it's likely disproportionately affecting those with less powerful hardware.

A half-second delay in responsiveness is a pretty big disruption. It's possible that there's not much we can do here, but I'd say the triage calculator is mostly right: at least getting a better understanding of whether there -is- something we can do here is S2, if only because it has the potential to improve the browser experience in other ways (i.e. a lot of delays would not be long enough to qualify as a BHR 'hang', but would still affect browser responsiveness).

Flags: needinfo?(bas)

Hi Florian,

Assuming you can still reproduce this issue, I wonder if we'd be able to test locally whether surrounding the sync call with a priority raise for the GPU process would actually make the issue mostly go away. It does look like the priority boost from the main thread doesn't transfer to the GPU process successfully, and it might be worth attempting to correct that.
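
If someone wants to try that experiment, a local hack could look something like this (illustrative only; obtaining the GPU process handle with the needed access rights, and the real sync IPC call, are elided):

  #include <windows.h>

  void SendReceiveMouseInputEventSync();  // stand-in for the real sync IPC

  void SendMouseEventWithBoost(HANDLE aGpuProcess) {
    // Temporarily raise the GPU process priority class so its threads are
    // more likely to get scheduled while we block on the reply.
    DWORD oldClass = GetPriorityClass(aGpuProcess);
    SetPriorityClass(aGpuProcess, ABOVE_NORMAL_PRIORITY_CLASS);

    SendReceiveMouseInputEventSync();

    SetPriorityClass(aGpuProcess, oldClass);  // restore the previous class
  }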

Flags: needinfo?(florian)
