Intermittent gfx/layers/apz/test/mochitest/test_bug1285070.html | helper_bug1285070.html | Event pointerup should be generated once - got +0, expected 1

RESOLVED FIXED in Firefox 51

Status

()

P3
normal
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: intermittent-bug-filer, Assigned: kats)

Tracking

({intermittent-failure})

51 Branch
mozilla52
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 unaffected, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 attachments)

Priority: -- → P3
Version: unspecified → 51 Branch
Assignee: nobody → sshih
19:49:26     INFO -  Received event pointerover
19:49:26     INFO -  Received event pointerenter
19:49:26     INFO -  Received event pointerdown

Test failures...

19:49:28     INFO -  Received event pointerup
19:49:28     INFO -  Received event pointerout
19:49:28     INFO -  Received event pointerleave

From the log, it looks like the click event was fired before pointerup. Have no idea how this could be happened and trying to reproduce it.
This sounds like the sort of thing that could be regressed by bug 1289650. There were a number of other intermittent failures that also spiked in volume yesterday, which are likely also a result of threading changes from the same bug. I'll try to investigate today.
The pointerup event is triggered by the touchup event. The path for that is as follows:

- nsBaseWidget calls APZCTreeManager::ReceiveInputEvent with it on the main thread [1]
- nsBaseWidget then dispatches it to the parent process event mechanism [2]
- the event is sent over PContent to the child process main thread [3]

Whereas the click event is also spawned from the touchup event using this path:

- nsBaseWidget calls APZCTreeManager::ReceiveInputEvent with it on the main thread [1]
- AsyncPanZoomController does a PostDelayedTask back to the parent process main thread [4]
- RemoteContentController shifts it to the compositor thread [5]
- the event is sent over PAPZ to the child process main thread (?) [6]

So it sounds like if PContent is blocked, the touchup event might be delayed. If PAPZ (which now is part of PCompositorBridge rather than PContent) is not blocked, the click event can "overtake" the touchup event and get into the child process main thread queue first. That would result in the click event getting fired before the touchup, which sounds like what Stone was seeing.

In practice this can result in things like bug 965381, as per the comment at [7]. In fact the PostDelayedTask call at [4] was inserted specifically to avoid this, but now that the click event is getting remoted over PCompositorBridge it is less effective.

[1] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/widget/nsBaseWidget.cpp#1227
[2] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/widget/nsBaseWidget.cpp#1233
[3] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/dom/ipc/TabParent.cpp#1641
[4] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/gfx/layers/apz/src/AsyncPanZoomController.cpp#2043
[5] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/gfx/layers/ipc/RemoteContentController.cpp#59
[6] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/gfx/layers/ipc/RemoteContentController.cpp#70
[7] http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/gfx/layers/apz/src/AsyncPanZoomController.cpp#2030
(Note that the above is still speculation on my part, I haven't confirmed it. If I'm right, it should be possible to reproduce by somehow artificially jamming PContent so that messages are delayed).
Blocks: 1289650
status-firefox50: --- → unaffected
status-firefox51: --- → affected
Whiteboard: [gfx-noted]
Assigning to rhunt for now, since I'm pretty sure it's a regression from his change.
Assignee: sshih → rhunt
I caught an instance of this using rr chaos mode. I'll investigate.
Assignee: ryanhunt → bugmail
Comment 3 is indeed what is happening. And it'll get worse with GPU process separation, because the touchup event will have an extra process hop (GPU process -> main process -> content process) whereas the click will go directly from GPU process -> content process) and so will be more likely to "overtake" the touchup.

Not yet sure how to fix this.
As far as I can see there are two main options for fixing this:

1) Route the HandleTap message back through PAPZCTreeManager to the parent process, and from there to the child process (if necessary). That way the HandleTap would "follow" the touchup on the same protocols/threads, and therefore be guaranteed to arrive after the touchup.

2) Add some machinery in APZEventState that queues up and delays the single-tap events from being disabled until after the corresponding touch-up has been handled.

I was initially leaning towards (1) but it's pretty messy to implement because there are no other messages that take that path, and so a bunch of new plumbing needs to be added. Another complication is that smaug is working on some DOM changes that entail receiving input events into the content process off the main thread. So stuff like RecvRealTouchEvent would move off the main thread to some other thread, and RecvHandleTap is in the same bucket and would need to move. As part of that I realized that a couple of other functions like RequestContentRepaint and NotifyAPZStateChange might need moving also. It might be that PAPZ ends up getting shifted out of PCompositorBridge and into some other top-level protocol.

All this to say, I would like to build a solution that is robust at the endpoint so that no matter which protocol we end up using to send the messages, it will remain ordered in the way it's supposed to be - and that means going with option (2). So now I'm leaning towards that, and will try cooking up something that uses that approach.
See Also: → bug 1145090
After some discussion with smaug and keeping in mind future changes I'm leaning towards (1) again. I implemented it and it seems to work. Try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4ecf04afffc

I might still try implementing (2) as it might help with bug 1145090.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> I might still try implementing (2) as it might help with bug 1145090.

I spent some time thinking about this and didn't get anywhere useful. I was hoping to find a way to add a bunch of state tracking to APZEventState that would ensure the incoming messages for a given input block are handled together and don't get interrupted by messages for a different input block. However it's turning out rather complicated and confusing because I'm not fully clear on which interleavings are possible and am having trouble dealing with all the cases. So I'm going to punt on that for now and get these patches up.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment on attachment 8792980 [details]
Bug 1298173 - Remove redundant parameter.

https://reviewboard.mozilla.org/r/79792/#review78544
Attachment #8792980 - Flags: review?(dvander) → review+
Comment on attachment 8792981 [details]
Bug 1298173 - Push HandleTap from the GPU process back to the parent process before passing it on to the child process.

https://reviewboard.mozilla.org/r/79794/#review78550

::: gfx/layers/ipc/RemoteContentController.cpp:61
(Diff revision 1)
>                                     uint64_t aInputBlockId)
>  {
> -  if (MessageLoop::current() != mCompositorThread) {
> -    // We have to send messages from the compositor thread
> -    mCompositorThread->PostTask(NewRunnableMethod<TapType, LayoutDevicePoint, Modifiers,
> -                                        ScrollableLayerGuid, uint64_t>(this,
> +  APZThreadUtils::AssertOnControllerThread();
> +  const CompositorBridgeParent::LayerTreeState* state =
> +      CompositorBridgeParent::GetIndirectShadowTree(aGuid.mLayersId);
> +  if (state) {

If the controller thread != compositor thread, this probably isn't safe. LayerTreeState and sIndirectLayerTreeState are mutated on the compositor thread, so stuff like mParent and mApzcTreeManagerParent have to be grabbed inside the lock.

Some helper method like CompositorBridgeParent::GetApzcTreeManagerForRootLayer(aLayersId) would be fine to hide all this.

::: gfx/layers/ipc/RemoteContentController.cpp:75
(Diff revision 1)
> +    if (state) {
> +      APZCTreeManagerParent* apzctmp = state->mApzcTreeManagerParent;
> +      if (apzctmp) {
> +        // We're in the GPU process, on the compositor thread. Send the message
> +        // to the parent process to deal with.
> +        // XXX MOZ_ASSERT(XRE_IsGpuProcess());

Can this only fire in the GPU process because APZCTreeManagerParent won't exist in-process? Is it possible to do a process-type check before we grab layer tree state to make this clearer?
Comment on attachment 8792981 [details]
Bug 1298173 - Push HandleTap from the GPU process back to the parent process before passing it on to the child process.

https://reviewboard.mozilla.org/r/79794/#review78558

r=me with the LayerTreeState stuff addressed
Attachment #8792981 - Flags: review?(dvander) → review+
(In reply to David Anderson [:dvander] from comment #15)
> If the controller thread != compositor thread, this probably isn't safe.
> LayerTreeState and sIndirectLayerTreeState are mutated on the compositor
> thread, so stuff like mParent and mApzcTreeManagerParent have to be grabbed
> inside the lock.
> 
> Some helper method like
> CompositorBridgeParent::GetApzcTreeManagerForRootLayer(aLayersId) would be
> fine to hide all this.

Sounds good, I'll do that.

> ::: gfx/layers/ipc/RemoteContentController.cpp:75
> (Diff revision 1)
> > +    if (state) {
> > +      APZCTreeManagerParent* apzctmp = state->mApzcTreeManagerParent;
> > +      if (apzctmp) {
> > +        // We're in the GPU process, on the compositor thread. Send the message
> > +        // to the parent process to deal with.
> > +        // XXX MOZ_ASSERT(XRE_IsGpuProcess());
> 
> Can this only fire in the GPU process because APZCTreeManagerParent won't
> exist in-process?

Yes. (To be clear, the APZCTreeManagerParent won't exist in-process *for the root layers id* - it will still exist for the content processes.)

> Is it possible to do a process-type check before we grab
> layer tree state to make this clearer?

Sure, but since there isn't a gpu process type check I'll just do a !XRE_IsParentProcess() instead.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 21

2 years ago
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/037d241ec65c
Remove redundant parameter. r=dvander
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3879b41dc6c
Push HandleTap from the GPU process back to the parent process before passing it on to the child process. r=dvander

Comment 22

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/037d241ec65c
https://hg.mozilla.org/mozilla-central/rev/c3879b41dc6c
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox52: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Please request Aurora approval on this when you get a chance.
Flags: needinfo?(bugmail)
I wanted to give this one a few days' bake time before requesting approval. There's a bunch of IPDL code changes that I'm mildly worried will cause crashes, because... well it's IPDL. I'm keeping it on my radar and will request uplift sometime next week once I'm more confident in it.
Flags: needinfo?(bugmail)
Comment on attachment 8792980 [details]
Bug 1298173 - Remove redundant parameter.

Approval Request Comment
[Feature/regressing bug #]: bug 1289650
[User impact if declined]: If the user taps (on touch devices) the click event for the tap may be dispatched before the touchend event, which is wrong. This has caused issues on websites in the past.
[Describe test coverage new/current, TreeHerder]: We have an automated test to catch this, and it was failing intermittently
[Risks and why]: relatively low, there's a bunch of IPDL boilerplate changes and some new stuff but if it results in crashes it should show up pretty quickly on crash-stats (and I haven't seen anything so far).
[String/UUID change made/needed]: none
Attachment #8792980 - Flags: approval-mozilla-aurora?
Attachment #8792981 - Flags: approval-mozilla-aurora?
Comment on attachment 8792980 [details]
Bug 1298173 - Remove redundant parameter.

Fix a tap/click behavior failure. Take it in 51 aurora.
Attachment #8792980 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

2 years ago
Attachment #8792981 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
The second patch hit conflicts in gfx/ipc/RemoteCompositorSession.cpp. Could we get a rebased patch for uplift?
Flags: needinfo?(bugmail)
Depends on: 1305198
See Also: → bug 1306644
See Also: → bug 1309678
You need to log in before you can comment on or make changes to this bug.