Open Bug 1677509 Opened 4 years ago Updated 9 days ago

SendReceiveMouseInputEvent often blocks the main thread of the parent process

Categories

(Core :: IPC, defect)



Performance Impact high

People

(Reporter: florian, Unassigned, NeedInfo)

References

Details

(Keywords: perf:responsiveness, Whiteboard: [bhr:SendReceiveMouseInputEvent])

Attachments

(1 file)

BHR data shows this represents a bit more than 1% of the total hang time reported from our Nightly population: http://queze.net/bhr/test/#row=&filter=SendReceiveMouseInputEvent

Here's a profile of it on the slow quantum reference laptop: https://share.firefox.dev/3nv2UMg

Why is this IPC synchronous? Is there something that could be done about it?

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation? The IPC is synchronous because we need the input events to get properly un-transformed based on the current APZ async scroll, before they can be dispatched to gecko. If we made this asynchronous there's a risk of incoming events from the OS getting delivered to gecko out-of-order.

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #1)

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation?

The individual blue bars on this profile (the Graph type was "Stack height"; maybe you are more used to the "Categories" view? https://share.firefox.dev/2UDsFxz) show individual samples, i.e. the moments when the profiler stopped the process to capture a stack.

It's impossible to know how many IPC calls happened during this profile (I would have needed to profile with the ipcmessages feature enabled to have this information, or we would need to fix bug 1641181), but it's reasonable to guess that the first IPC, where there's a red jank marker, took about 100ms. Looking at the GPU process during the same time range, it has no samples, which likely means it got no CPU from the operating system scheduler during that time... so it likely didn't process the IPC either.

I was looking at the mousemove markers in the marker table for the parent process main thread. Each one of those will have an associated synchronous IPC to the GPU process main thread, and looking at the time delta between those markers, most of them are pretty small which indicates the main thread wasn't unduly held up while processing them. But yes, the 100ms jank at the beginning is concerning. If the problem is what you suggest, that the OS didn't schedule that thread during that time, then maybe that's something we can resolve by adjusting thread priorities.

Whiteboard: [qf][bhr:SendReceiveMouseInputEvent] → [qf:p1:responsiveness][bhr:SendReceiveMouseInputEvent]
Severity: -- → S3

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju
Florian, maybe you'd like to try again, now that bug 1641181 has landed?

How easy was it to get this to show up when you tried, Florian? I know the 2018 reference hardware is... exceptional when it comes to getting perf problems to show up. Would you be able to try to get another profile of this if it manifests easily?

Flags: needinfo?(florian)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #5)

How easy was it to get this to show up when you tried, Florian? I know the 2018 reference hardware is... exceptional when it comes to getting perf problems to show up. Would you be able to try to get another profile of this if it manifests easily?

It's easy to see it on the 2018 ref hardware: just load about:home, start the profiler, and move the mouse around quickly on the top site icons.

Here's a profile without CPU recording and without screenshots: https://share.firefox.dev/3aPuznd that has two >100ms sync IPCs. And here's another one with more profile features enabled, but the sync IPCs are slightly shorter (60-80ms range for the slowest ones): https://share.firefox.dev/32ZXYXK

(In reply to Gerald Squelart [:gerald] (he/him) from comment #4)

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju

Looks like you tried on wikipedia.org. It seems smoother for me on that page, but I still see some sync IPCs taking 20-30ms: https://share.firefox.dev/2S8kYBr

Flags: needinfo?(florian)

For any outside or later observers: here's a profile Florian took with the IPC I/O threads included: https://share.firefox.dev/3xC5G8e

Nothing interesting going on in those threads though. They look to be as idle as the rest of the threads we're waiting on. It seems like at a fundamental level here, we're just dealing with the fact that we're sending a ton of sync IPC messages from the parent to the GPU process, and in some circumstances that means that we jank the browser either because something is busy or because the necessary threads are just not being scheduled. We can't meaningfully adjust process or thread priorities here as far as I can reason, so either we accept this or we find a way to make these messages async.

tnikkel, does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

Flags: needinfo?(tnikkel)

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #1)

As far as I can tell from the profile, it's a large number of quick IPC calls, rather than a small number of slow IPC calls. Does that match your interpretation? The IPC is synchronous because we need the input events to get properly un-transformed based on the current APZ async scroll, before they can be dispatched to gecko. If we made this asynchronous there's a risk of incoming events from the OS getting delivered to gecko out-of-order.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #7)

tnikkel, does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

I don't know this area of code very well, but my naive guess is that in a Fission world, not using an up-to-date async scroll offset could mean we send the event to the wrong process (the location of an out-of-process iframe could be changing from under to not under the cursor, or vice versa). Forwarding the question to Botond.

Flags: needinfo?(tnikkel) → needinfo?(botond)

Are the messages mouse-related? I recall seeing code that throttled mouse moves (?) somewhere; maybe we can cut down on the number of sync calls?

(In reply to Doug Thayer [:dthayer] (he/him) from comment #7)

does what kats said in comment 1 seem insurmountable to you? I don't know this code well enough to judge whether we could work around this and still provide the same ordering guarantees on events.

I don't think it's completely insurmountable. We could conceivably make the messages async while preserving event ordering if we did something like:

  • Have the parent process maintain a queue of events. Note, all event types would need to go through the queue, not just the ones which need to be sent to the GPU process.
  • When the parent process receives an event from the OS, enqueue it. If it needs to be sent to the GPU process, send an async IPC message (with the response also async).
  • Consider an event ready to be dequeued if it has received the async response (for event types that need to be sent to the GPU process), or unconditionally (for other event types).
  • On appropriate triggers (after enqueue, after an async response from the GPU process was handled), check the queue for events ready to be dequeued and continue processing them.

It does sound like this would be a fairly non-trivial refactor.

I think such a design also has the potential to add latency to the processing of input events, though it's hard to reason about that, and whether or not it actually adds latency is likely to be workload-dependent (i.e. depending on what else the parent and GPU processes are doing).
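
For concreteness, here is a minimal sketch of the queue described above. All names (PendingEvent, InputEventQueue, SendTransformRequestToGpu, DispatchToGecko) are invented for illustration; this is not actual Gecko code:

  #include <deque>
  #include <functional>
  #include <utility>

  struct InputEvent { /* type, coordinates, etc. elided */ };

  // Stubs standing in for the real IPC and dispatch plumbing (hypothetical).
  void SendTransformRequestToGpu(const InputEvent& aEvent,
                                 std::function<void(InputEvent)> aOnReply);
  void DispatchToGecko(const InputEvent& aEvent);

  struct PendingEvent {
    InputEvent mEvent;
    bool mNeedsApzTransform = false;  // needs the GPU-process round trip?
    bool mTransformDone = false;      // async response received?
  };

  class InputEventQueue {
   public:
    // Called when the parent process receives an event from the OS.
    void Enqueue(InputEvent aEvent, bool aNeedsTransform) {
      mQueue.push_back({std::move(aEvent), aNeedsTransform, false});
      if (aNeedsTransform) {
        PendingEvent& entry = mQueue.back();  // deque refs survive push/pop
        SendTransformRequestToGpu(
            entry.mEvent, [this, &entry](InputEvent aTransformed) {
              entry.mEvent = std::move(aTransformed);
              entry.mTransformDone = true;
              Drain();  // a reply may unblock the head of the queue
            });
      }
      Drain();
    }

   private:
    void Drain() {
      // Dispatch in FIFO order; stop at the first event still waiting on
      // its async transform, which is what preserves event ordering.
      while (!mQueue.empty()) {
        PendingEvent& head = mQueue.front();
        if (head.mNeedsApzTransform && !head.mTransformDone) {
          return;
        }
        DispatchToGecko(head.mEvent);
        mQueue.pop_front();
      }
    }

    std::deque<PendingEvent> mQueue;
  };

The latency concern above is visible in the sketch: an event that needs no transform still waits behind any earlier event that does.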

Flags: needinfo?(botond)

(In reply to Timothy Nikkel (:tnikkel) from comment #9)

Are the messages mouse related? I recall seeing code that throttled mouse moves (?) somewhere, maybe we can cut down on the number of sync calls?

That could be something to explore, yeah.

cc Masayuki and Smaug, who might know more about any existing mouse-move throttling, and whether it may be possible to do in the parent process.

This is almost certainly a very small optimization, and likely won't address
the frequency of hangs from bug 1677509. However, it should still be an
improvement, and will work on anything Vista or later.

We observed, as part of trying to diagnose the somewhat mysterious bug 1677509,
that sometimes multiple messages would be queued, and yet the IPC I/O thread
would go idle in between servicing them in GetQueuedCompletionStatusEx. Given
that we send frequent sync IPC messages for mouse-move events, it seems prudent
to be able to service multiple at a time.
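
Roughly, the idea is the difference between pulling one completion packet per wake-up and draining a batch per wake-up. A minimal Win32 sketch of the latter (HandleCompletedIo is a hypothetical per-message handler; error handling elided):

  #include <windows.h>

  // Hypothetical handler for one completed I/O packet.
  void HandleCompletedIo(ULONG_PTR aKey, LPOVERLAPPED aOverlapped, DWORD aBytes);

  void ServiceCompletionPort(HANDLE aPort) {
    constexpr ULONG kMaxEntries = 16;  // arbitrary batch size for illustration
    OVERLAPPED_ENTRY entries[kMaxEntries];
    ULONG numRemoved = 0;
    // One wake-up can return several queued packets at once, instead of one
    // GetQueuedCompletionStatus() round trip per message.
    if (GetQueuedCompletionStatusEx(aPort, entries, kMaxEntries, &numRemoved,
                                    INFINITE, /* fAlertable */ FALSE)) {
      for (ULONG i = 0; i < numRemoved; ++i) {
        HandleCompletedIo(entries[i].lpCompletionKey,
                          entries[i].lpOverlapped,
                          entries[i].dwNumberOfBytesTransferred);
      }
    }
  }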

Assignee: nobody → dothayer
Status: NEW → ASSIGNED

I'm a little confused about the patch:

  • If these are sync messages between the same pair of threads, how would it be possible to handle more than one at a time? Aren't we necessarily ping-ponging regardless of how channels work?
  • Do we ever have more than one read and one write (per channel) in flight at the same time?

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #13)

I'm a little confused about the patch:

  • If these are sync messages between the same pair of threads, how would it be possible to handle more than one at a time? Aren't we necessarily ping-ponging regardless of how channels work?

The situation we observed (can't find the profile) looked something like this:

Parent process                  | GPU process
--------------------------------------------
send async msg                  |
send sync msg                   |
                                | recv async msg
                                |
                                |   (long idle gap before the sync
                                |    message is serviced)
                                |
                                | recv sync msg

  • Do we ever have more than one read and one write (per channel) in flight at the same time?

When I add logging right after the call to GetQueuedCompletionStatusEx with this patch applied, we very frequently encounter multiple messages returned. I didn't bother to count the ratio, but it's a constant spew of them in the console.

Does that answer your questions?

Flags: needinfo?(jld)

Botond, are there large swaths of time where we could know very cheaply that we don't need to transform any mouse move events? We see this occurring when the user isn't scrolling at all. Could we just have a shmem that the APZ code updates when it's obvious no scrolling is going on so the parent process can check that without having to send a sync IPC message?
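
A rough sketch of what that check could look like, assuming a lock-free atomic flag in shared memory (all names hypothetical; the shmem mapping itself is elided):

  #include <atomic>

  // Hypothetical state living in memory shared between the GPU (APZ) and
  // parent processes. Requires std::atomic<bool> to be lock-free, which it
  // is on the platforms we care about.
  struct ApzSharedState {
    std::atomic<bool> mAsyncScrollActive{false};
  };

  // GPU/APZ side: flip the flag when async scrolling starts or stops.
  void OnScrollActivityChanged(ApzSharedState* aShared, bool aActive) {
    aShared->mAsyncScrollActive.store(aActive, std::memory_order_release);
  }

  // Parent side: only pay for the sync IPC when an async scroll could have
  // moved content under the cursor.
  bool NeedsSyncTransform(const ApzSharedState* aShared) {
    return aShared->mAsyncScrollActive.load(std::memory_order_acquire);
  }

(As the reply below notes, this wouldn't be sufficient under Fission, where the round trip is also needed for compositor hit testing.)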

Flags: needinfo?(botond)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #15)

Botond, are there large swaths of time where we could know very cheaply that we don't need to transform any mouse move events? We see this occurring when the user isn't scrolling at all. Could we just have a shmem that the APZ code updates when it's obvious no scrolling is going on so the parent process can check that without having to send a sync IPC message?

I think that could have been feasible in a pre-Fission world.

However, with Fission, the sync dispatch is also needed so that the events can undergo compositor hit testing in the GPU process, and be annotated with a target LayersId so we know which content process to dispatch them to.

Flags: needinfo?(botond)

Two more cents:

I don't think that this is about spamming the GPU process with mouse events. Judging by the markers, in the worst of these profiles we are sending a mouse event on the order of every 10ms. That should be manageable, since by all accounts the handler is fast. Also, since ReceiveMouseInputEvent is sync, the only one ever in the queue is the one we are currently servicing. Since there aren't markers indicating that sampling is missing any of the IPC round-trips, I think we can assume each pause is for a single message. Moreover, the periodicity doesn't seem to correlate with jank -- indeed, we get more mouse moves when there isn't jank (presumably that's the OS throttling). Sync messaging is Bad, but I don't think it is the main culprit.

FYI, the IPDL mechanism for throttling mouse moves is used for parent <-> content communication -- it's the IPDL keyword [Compress] (see for example PBrowser::RealMouseMoveEvent). There is also [CompressAll]. They aren't especially versatile or deep -- a [CompressAll]ed message type will throw out an old message if one of that type is waiting in the message queue when you send a new one. [Compress] will only throw out the old message if it was at the back of the queue (i.e. it was the last message added). Either seems like it could be a decent idea for these messages, but I don't think it'll help this bug because I don't think we have a spamming issue.
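
For reference, the annotation looks roughly like this in PBrowser.ipdl (a sketch from memory; the exact signature may differ):

  // A compressed mouse-move message: [Compress] replaces the previous
  // pending message of this type if it is still the last one in the queue.
  [Compress] async RealMouseMoveEvent(WidgetMouseEvent event,
                                      ScrollableLayerGuid aGuid,
                                      uint64_t aInputBlockId);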

Now for the random guessing (I can't easily run experiments since my Windows machine self-immolated but I could dig out an old Surface to take a look). It's hard to see from the profiles but it kind of looks like the GPU process is waiting in ProcessNextNativeEvent a lot (essentially EVERY stack that isn't handling the mouse message -- 1399 out of 1548 samples), so it sits idle while IPC messages wait. The only profile with IPC markers didn't have much of a repro of the hang so I can't say that this is really a good guess but it would explain the latency.

Doug, re-upping your needinfo -- looks like you had a patch that landed and was backed out here. Do you have plans to get it re-landed and/or do you know what the extent of the treeherder badness was? (just a small issue that needs fixing, or some fundamental problem with the approach, or something else?)

Flags: needinfo?(dothayer)

Hey Daniel - sorry for the delay here. This patch was just intended to be a small change which should in theory be more optimal. However (and I should have documented this at the time), it introduced timeout failures in tests. It is possible that it was a perf improvement and yet introduced timeout issues by changing the ordering of some things, but I think it's more likely that it either introduced a perf degradation or had a bug; in either case, the impact of the patch should have been small enough that it's not worth looking into.

I don't and didn't historically have a patch that I am confident would resolve the actual hang here, so I'm unassigning myself.

Assignee: dothayer → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(dothayer)

Timothy pointed out that, apart from trying to make these messages async (which is possible but nontrivial and may affect input handling latency, per comment 10), another avenue to explore is speeding up the processing time of the messages while keeping them sync.

Nical, I know you did some work (in bug 1580178 and related) to try to avoid blocking on the render backend thread when doing hit-testing inside APZ. Do you know what the current status of that is, i.e. were the changes successful in making us avoid blocking in the common case?

Flags: needinfo?(nical.bugzilla)

Do you know what the current status of that is, i.e. were the changes successful in making us avoid blocking in the common case?

That landed a while ago and works well. What it does is maintain an immutable hit-testing tree that is atomically swapped each time the scene is updated (for example, on layout changes); all threads can safely read from the most current hit-testing tree without blocking.

This only avoids synchronous blocking within the process where rendering happens, so if the GPU process is enabled we probably still have some sync IPC chatter, but at least it shouldn't block on WebRender work completing.
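
The pattern described here boils down to publish-by-pointer-swap. A minimal C++ sketch (names invented; this is not the actual WebRender code):

  #include <memory>

  struct HitTestingTree { /* immutable nodes elided */ };

  class HitTestingSnapshot {
   public:
    // Scene builder side: publish a freshly built tree in one atomic swap
    // (e.g. after a layout change).
    void Publish(std::shared_ptr<const HitTestingTree> aNewTree) {
      std::atomic_store(&mCurrent, std::move(aNewTree));
    }

    // Any thread: grab a consistent snapshot and hit-test against it without
    // taking a lock; the tree is immutable once published.
    std::shared_ptr<const HitTestingTree> Acquire() const {
      return std::atomic_load(&mCurrent);
    }

   private:
    std::shared_ptr<const HitTestingTree> mCurrent;
  };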

Flags: needinfo?(nical.bugzilla)

(In reply to Florian Quèze [:florian] (PTO until January) from comment #6)

Here's a profile without CPU recording and without screenshots: https://share.firefox.dev/3aPuznd that has two >100ms sync IPCs. And here's another one with more profile features enabled, but the sync IPCs are slightly shorter (60-80ms range for the slowest ones): https://share.firefox.dev/32ZXYXK

(In reply to Gerald Squelart [:gerald] (he/him) from comment #4)

As part of perf bug bash week, I tried to reproduce it on my Acer ref hw, but I didn't see much jank, and sync mouse IPCs seem fast: https://share.firefox.dev/3gN2Kju

Looks like you tried on wikipedia.org. It seems smoother for me on that page, but I still see some sync IPCs taking 20-30ms: https://share.firefox.dev/2S8kYBr

Can someone clarify the reason why the content process document is much faster than the parent one?

Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness][bhr:SendReceiveMouseInputEvent] → [bhr:SendReceiveMouseInputEvent]
See Also: → 1768766

Florian, do you have a more recent profile which shows this issue and which includes dynamic process priority markers?

Flags: needinfo?(florian)

(In reply to Markus Stange [:mstange] from comment #24)

Florian, do you have a more recent profile which shows this issue and which includes dynamic process priority markers?

I just grabbed one: https://share.firefox.dev/3JfKWJJ

Flags: needinfo?(florian)

I can see one slow synchronous IPC in that profile, here: https://share.firefox.dev/3Sdlk4w
It looks like the message is processed quickly in the compositor (20ms), but it takes 200ms for the "Reply" message to be processed in the parent. During that time the parent process main thread simply doesn't wake up.

It would be interesting to know if the IPC I/O thread woke up during that time. Florian, can you grab another profile with that thread added?

Flags: needinfo?(florian)

Ok, here's another case: https://share.firefox.dev/3PJnze1
The IPC markers tell the following story: 35ms to receive, 0.5ms to process, 64ms to reply
I think the reply took so long to get out because the IPC I/O thread in the GPU process didn't wake up sooner.

Based on the "Awake" markers, the IPC I/O thread in the GPU process has a "current priority" of 10, whereas the parent process main thread has a "current priority" of 12. But the latter is blocked because the former doesn't run.

Should we play a bit with the process priorities?
I can't recall what Chromium does currently, but https://bugs.chromium.org/p/chromium/issues/detail?id=965778 may have some background.

The Performance Priority Calculator has determined this bug's performance priority to be P1. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: [x] Windows [x] macOS [x] Linux
Impact on browser UI: Causes noticeable jank
[x] Bug affects multiple sites

Keywords: perf

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on performance.
:botond, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?

For more information, please visit auto_nag documentation.

Flags: needinfo?(botond)

Trying to summarize my understanding based on recent comments in this bug:

  • Based on comment 26, the ReceiveMouseInputEvent IPDL message being synchronous may not be the primary culprit for the observed performance issue after all
  • Based on comment 29, revising process priorities may be an appropriate next step

Markus, does this match your understanding? If so, would it be appropriate to find a home for this bug in a different component (e.g. XPCOM, or Performance perhaps)?

Flags: needinfo?(botond) → needinfo?(mstange.moz)

Good idea. So the bug is basically: Sync IPCs take too long to round-trip to the GPU process on busy low-core Windows machines.

I wonder if this has improved recently with the GPU process priority fix. There was also an IPC fix which removed one thread hop from the "sending" side, but I don't know if that would have affected the sync IPC reply case.

Florian, can you get another profile to see if this has improved?

Component: Panning and Zooming → IPC
Flags: needinfo?(mstange.moz) → needinfo?(florian)

(In reply to Markus Stange [:mstange] from comment #28)

Ok, here's another case: https://share.firefox.dev/3PJnze1
The IPC markers tell the following story: 35ms to receive, 0.5ms to process, 64ms to reply
I think the reply took so long to get out because the IPC I/O thread in the GPU process didn't wake up sooner.

If the reply was taking a long time to get out because the IPC I/O thread wasn't waking up, that would potentially be fixed by the thread-hop removal. After that change, the IPC I/O thread in most cases theoretically only needs to wake up when receiving messages. There's not really a way to get rid of the wake-up when receiving messages, though.

It's worth noting that this is only 0.5% of our total hang time these days.

Flags: needinfo?(jld)
Flags: needinfo?(florian)

:jld, can you review this bug and determine if the severity matches the performance impact? A High impact would translate to an S2. If you feel this is an S3, can you comment on the mismatch between our calculations and yours so we can reassess the impact?

Flags: needinfo?(jld)

Bas, do you think we should downgrade the performance impact or upgrade the severity?

Flags: needinfo?(jld) → needinfo?(bas)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #37)

Bas, do you think we should downgrade the performance impact or upgrade the severity?

Hrm. That's a tricky question and I could see it go either way.

So this is still responsible for about 2000 hangs a day on Nightly, averaging half a second (a fair number of them longer). That makes it one of the most severe causes of parent-process jank that our users experience, and it's likely disproportionately affecting those with less powerful hardware.

A half-second delay in responsiveness is a pretty big disruption. It's possible that there's not much we can do here, but I'd say the triage calculator is mostly right: at least getting a better understanding of whether there -is- something we can do here is S2, if only because it has the potential to improve the browser experience in other ways (i.e. a lot of delays would not be long enough to qualify as a BHR 'hang', but would still affect browser responsiveness).

Flags: needinfo?(bas)

Hi Florian,

Assuming you can still reproduce this issue, I wonder if we'd be able to test locally whether surrounding the sync call with a priority raise for the GPU process would actually make the issue mostly go away. It does look like the priority boost from the main thread doesn't transfer to the GPU process successfully, and it might be worth attempting to correct that.
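
If someone wants to try that experiment, a local hack could look something like this (illustrative only; obtaining the GPU process handle with the needed access rights, and the real sync IPC call, are elided):

  #include <windows.h>

  void SendReceiveMouseInputEventSync();  // stand-in for the real sync IPC

  void SendMouseEventWithBoost(HANDLE aGpuProcess) {
    // Temporarily raise the GPU process priority class so its threads are
    // more likely to get scheduled while we block on the reply.
    DWORD oldClass = GetPriorityClass(aGpuProcess);
    SetPriorityClass(aGpuProcess, ABOVE_NORMAL_PRIORITY_CLASS);

    SendReceiveMouseInputEventSync();

    SetPriorityClass(aGpuProcess, oldClass);  // restore the previous class
  }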

Flags: needinfo?(florian)
