Closed Bug 1139575 Opened 5 years ago Closed 5 years ago

[Aries][Nexus-5][hidpi?] Touch events are flaky, especially with 2 fingers

Categories

(Core :: Panning and Zooming, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- wontfix
firefox38 --- wontfix
firefox39 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: drs, Assigned: kats)

References

()

Details

(Whiteboard: [spark])

Attachments

(4 files, 1 obsolete file)

See video.
Alex, could you take a look?
Flags: needinfo?(lissyx+mozillians)
I've noticed this, and not consistently, only in gallery app. It's fine in Browser on webpages, for example. Someone more familiar with touches events and apzc should rather look at this instead of me.
Flags: needinfo?(lissyx+mozillians) → needinfo?(bugmail.mozilla)
What is Aries? Is this behaviour specific to particular devices? What app is being used in the video? Is that a screenshot in the gallery app or something else?
Flags: needinfo?(bugmail.mozilla)
That's the gallery app that Doug is using. Aries is the codename for the device being considered as part of project lightsaber.  I see similar issues (though maybe not as bad as Doug) on Nexus 5 in the gallery app.  Both of those devices use a soft homescreen button rather than a hard one, for what it is worth.

Doug and I (on Aries and Nexus 5 respectively) see error messages in console when pinching and zooming in gallery. They errors are from shared/js/gesture_detector.js and indicate that the getIdentifiedTouch() method of the touch events is returning null. So it looks to me like the hardware is losing track of a touch and is replacing it with something with a new identifier.
Flags: needinfo?(bugmail.mozilla)
Hm, so if pinching works in the browser but not in the gallery, it's unlikely to be a problem at the hardware level. Still, it might help to log all the touch events coming out of the hardware and also all the touch events received by gesture_detector.js. That will help trisect the problem to see if it is coming from hardware, APZ, or gesture_detector.js.
Flags: needinfo?(bugmail.mozilla)
This diff adds debugging output to shared/js/gesture_detector.js.

If you apply the patch then pinch to zoom on images in the Gallery app, you'll see that sometimes gecko seems to lose track of a touch events.  We get two touches with ids 0 and 1 and begin the pinch, then sometimes we get a move event on one of those touches, and the other touch is just gone. The touch with id 0 moves and the touch with id 1 is no longer in the e.touches[] array. Or vice versa. 

The gesture_detector.js code has been a stable part of gaia for over two years and we've never had this kind of problem with it, so I think this is some kind of recent regression.  I don't have an Aries device to test with, but this is what I'm seeing on Nexus 5.
Thanks. I don't have an Aries or a Nexus 5 but I can try on the Nexus 4. Is Aries a Lollipop device?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> Hm, so if pinching works in the browser but not in the gallery, it's
> unlikely to be a problem at the hardware level. Still, it might help to log
> all the touch events coming out of the hardware and also all the touch
> events received by gesture_detector.js. That will help trisect the problem
> to see if it is coming from hardware, APZ, or gesture_detector.js.

I can all but guarantee that the problem is not in gesture_detector.js.  That code is has been stable for years. And also, we're seeing multi-touch problems in new, unrelated gesture detection code in https://github.com/fxos/gesture

Does the APZ event handling code do anything differently on devices with a soft home button instead of a hard button?  If not, I'd suspect it has something to do with hardware. Nexus 5 and Aries are both high-end large screen quad core devices. Different chipsets, I think, though.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Thanks. I don't have an Aries or a Nexus 5 but I can try on the Nexus 4. Is
> Aries a Lollipop device?

No, Aries is based on KitKat.
(In reply to David Flanagan [:djf] from comment #8)
> Does the APZ event handling code do anything differently on devices with a
> soft home button instead of a hard button?  If not, I'd suspect it has
> something to do with hardware. Nexus 5 and Aries are both high-end large
> screen quad core devices. Different chipsets, I think, though.

No, APZ doesn't have any direct interaction with a soft vs hard home button. It might affect the layer tree structure which might have an indirect impact, I'm not sure. I can post a patch to log events closer to where they come out of the hardware.
And the nexus5 I'm using is Lollipop.
Try applying this to the gecko tree and grab a log with both sets of output. We can compare the touch events from hardware to the touch events arriving in JS and see if they match up.
I'm not really set up to easily create patched gecko builds, so unlikely to try out that gecko logging patch right away.

I have verified that I see the same bug with the home gesture enabled (so the soft home button is hidden). And I see the same bug with gaia 2.1 and gaia 2.2, so this does not seem to caused by any recent changes in the system app.
GestureDetector does not listen for touchcancel events, but I've added a listener to verify that no touchcancel events are being delivered. So the gesture detector is not getting confused by missing touchcancel.
I can repro on the Nexus 4. My investigation shows that we are hitting the code in the APZ at [1] which is preventing the input events from getting to content. It looks like the gesture_detector.js never does a preventDefault() on the touch event stream (in fact it says so explicitly at the top of the file) which IMO is the root of this problem, because it means that both APZ and the gesture detector are going to be handling these events. In some cases APZ will eat blocks of touch input for more advanced gestures (like double-flings) and so the JS gesture detector won't get all the raw events.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/InputQueue.cpp?rev=d805db38cd5f#113
Tangentially it looks like there's an infinite number of repaints or layer transactions going through when viewing an image in the gallery, even if nothing is moving. That might be worth profiling to see if it can be improved (it almost certainly can).
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #15)
> I can repro on the Nexus 4. My investigation shows that we are hitting the
> code in the APZ at [1] which is preventing the input events from getting to
> content. It looks like the gesture_detector.js never does a preventDefault()
> on the touch event stream (in fact it says so explicitly at the top of the
> file) which IMO is the root of this problem, because it means that both APZ
> and the gesture detector are going to be handling these events. In some
> cases APZ will eat blocks of touch input for more advanced gestures (like
> double-flings) and so the JS gesture detector won't get all the raw events.
> 

Thanks for the quick investigation, Kats.  Gesture detector predates APZ, I think. But even if it didn't I've never understood how it works or what the protocol is for not having it interfere with content event handling. Can you point me to docs somewhere that would help me understand why and when to call preventDefault?

Also, if the problem is what you suggest, why do you think it does not occur on the Flame? Could this have something to do with software home button vs no software home button? On the flame, the gallery app takes up the entire screen so maybe APZ sees that there is nothing scrollable and doesn't do anything?  But with a software home button the app doesn't take up the whole screen? I'm speculating wildly here, since I have no real clue what APZ does...

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)
> Tangentially it looks like there's an infinite number of repaints or layer
> transactions going through when viewing an image in the gallery, even if
> nothing is moving. That might be worth profiling to see if it can be
> improved (it almost certainly can).

Is this something I should be able to see with "Flash repainted area" turned on?  I don't see repeated repaints on either Flame or Nexus 5
Flags: needinfo?(bugmail.mozilla)
(In reply to David Flanagan [:djf] from comment #17)
> I've never understood how it works or what the
> protocol is for not having it interfere with content event handling. Can you
> point me to docs somewhere that would help me understand why and when to
> call preventDefault?

There aren't docs describing this explicitly. The best I can point you to is the W3C spec for touch events at [1] which says that calling preventDefault will prevent the default actions associated the event. The "default actions" is a poorly-defined term that is basically browser-specific. On B2G the default actions include APZ panning. In this case it makes sense to be calling preventDefault on them because you want to consume the event in the gesture_detector code and not have APZ handle it.

> Also, if the problem is what you suggest, why do you think it does not occur
> on the Flame? Could this have something to do with software home button vs
> no software home button? On the flame, the gallery app takes up the entire
> screen so maybe APZ sees that there is nothing scrollable and doesn't do
> anything?  But with a software home button the app doesn't take up the whole
> screen? I'm speculating wildly here, since I have no real clue what APZ
> does...

It's definitely a possibility. I'll check the layer dumps and see what's going on.

> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)
> > Tangentially it looks like there's an infinite number of repaints or layer
> > transactions going through when viewing an image in the gallery, even if
> > nothing is moving. That might be worth profiling to see if it can be
> > improved (it almost certainly can).
> 
> Is this something I should be able to see with "Flash repainted area" turned
> on?  I don't see repeated repaints on either Flame or Nexus 5

The "Flash repainted area" will only flash if layout is repainting stuff, and it's possible to have layer transactions without repaints. If that were the case you wouldn't see anything with "Flash repainted area". However if you turn on the FPS counter you should see that we're still compositing at around 50 fps even while idle in the gallery app. I can file a separate bug for that and track it down.

[1] http://www.w3.org/TR/touch-events/
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #18)
> It's definitely a possibility. I'll check the layer dumps and see what's
> going on.

So actually I can reproduce this on the flame as well, with no soft home button. When pinching you just have to move the first finger a bit and then add the second finger. I suspect it's just easier to reproduce on higher DPI devices because it's easier to go into a "fast" fling.

I filed bug 1139918 for the repainting issue, I think it's probably a Gecko problem somewhere.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #18)
> > It's definitely a possibility. I'll check the layer dumps and see what's
> > going on.
> 
> So actually I can reproduce this on the flame as well, with no soft home
> button. When pinching you just have to move the first finger a bit and then
> add the second finger. I suspect it's just easier to reproduce on higher DPI
> devices because it's easier to go into a "fast" fling.

Sorry, I was mistaken here - I think this is just intentional behaviour in the gesture_detector code. It looks like once you go into the panStartedState you can't go into the transformState without lifting the finger and starting over.
I debugged this some more and found a couple of APZ bugs I can fix to make this problem go away. I still think the gesture_detector code should be calling preventDefault, and that's the root cause, but it not doing so did help expose these APZ bugs. Patches coming in a sec.
One problem I was seeing here is that the velocity was never getting cleared once we started a pinch. So finger 1 goes down, moves around -> we have a velocity. Finger 2 goes down -> we enter a pinch state, velocity is still nonzero. Fingers are lifted -> velocity is still nonzero. Finger 1 goes down again -> InputQueue detects fast-motion and doesn't send events to content.
Attachment #8573371 - Flags: review?(botond)
So, if I understand correctly:

1) Normally, my gesture detector code gets to look at events before APZ does and can call preventDefault on those events.

2) But if I don't call preventDefault, then when APZ detects a gesture it may start removing events from the queue before my code ever gets to see them.

Is that right?

I'll try preventDefault and see if that solves the problem in my github.com/fxos/gesture code. Changing shared/js/gesture_detector.js will be a lot trickier since that could introduce regressions in the apps that use it.

It still seems to me that there is an APZ bug here. If, as you say in comment #15, APZ can "eat a block of touch events" it needs to guarantee that the uneaten parts of the event stream are valid so that apps can be written with no knowledge of APZ at all.  What I'm seeing is that touches start and then just disappear from the e.touches array with no touchend event. If APZ is making the touch disappear, it seems to me that it is a bug that it does not synthesize a touchend event so that the event stream my code sees remains valid.
The second problem I found is that we would set the fast-motion flag on multitouch input blocks which I think doesn't make sense. So for example finger 1 goes down and moves around -> we have a velocity, then finger 2 goes down to start a pinch, but since we are "moving fast" this input block doesn't get sent to content. My patch doesn't set the fast-motion flag unless we have only one touch input in the block.

I also moved the code out of the function it currently lives in because the name of the function doesn't reflect what it's really doing.
Attachment #8573372 - Flags: review?(botond)
Doug/Alexandre/David: please try the patches and let me know if it resolves the issues you were seeing.
I wrote comment 23 before I saw comment 21. Thanks for continuing to debug this. Does that final paragraph in comment 23 make sense?

I don't think I can safely change the legacy gesture detector code (shared/js/gesture_detector.js) that the gallery app uses to add preventDefault.

But I will add it to my new gesture detector which is what lead to this bug being filed.
(In reply to David Flanagan [:djf] from comment #23)
> 1) Normally, my gesture detector code gets to look at events before APZ does
> and can call preventDefault on those events.

Technically, the events go to APZ first, and the APZ queues them. APZ then sends a copy of them to content (e.g. your gesture detector code). If content calls preventDefault on them, that sends a notification back to the APZ, which then drops the events from its queue. If the notification doesn't arrive in 300ms or if content doesn't call preventDefault then APZ goes ahead and consumes them.

> 2) But if I don't call preventDefault, then when APZ detects a gesture it
> may start removing events from the queue before my code ever gets to see
> them.

In some cases when events go the APZ initially, the APZ is already in some state (e.g. in a fast scroll) where it makes to simply consume those events right away rather than queueing them and sending a copy on to content. That's what was happening here. Let me know if that makes sense.

> It still seems to me that there is an APZ bug here. If, as you say in
> comment #15, APZ can "eat a block of touch events" it needs to guarantee
> that the uneaten parts of the event stream are valid so that apps can be
> written with no knowledge of APZ at all.  What I'm seeing is that touches
> start and then just disappear from the e.touches array with no touchend
> event. If APZ is making the touch disappear, it seems to me that it is a bug
> that it does not synthesize a touchend event so that the event stream my
> code sees remains valid.

Yes, you're right - I hadn't considered that fully. When APZ consumes a block of events in this manner it will eat everything from one touchstart to the next. In this case it would consume everything from the second finger's touchstart up to but not including the next touchstart event. The set of consumed events therefore included both touchend events. However the "part 2" patch that I posted should fix this as well.
Assignee: nobody → bugmail.mozilla
Component: GonkIntegration → Panning and Zooming
Product: Firefox OS → Core
Whoops, made a mistake in the old patch that caused a gtest to fail. Fixed it up.
Attachment #8573372 - Attachment is obsolete: true
Attachment #8573372 - Flags: review?(botond)
Attachment #8573403 - Flags: review?(botond)
Adding preventDefault() to my gesture code in https://github.com/fxos/gesture does not resolve the issue I'm seeing. This was purely a gaia-level change without the gecko patches attached here. I just wanted to see if using preventDefault would prevent APZ bugs from interfering with my gesture detection.

What I'm doing with that gesture detection code is trying to detect a two-finger swipe. In my testing on a nexus 5, what I'm seeing is that if both fingers touch the screen at the same time, the gesture typically works. But if one finger goes down before the other one then I get one touch start event, and a second touch start event. At this point, I consider the gesture to have started so I register handlers for touchmove and touchend events. Then, the next event that I see is a touchmove event with only one touch in the e.touches[] array.

Less commonly, but still reproducibly, I see two touch start events followed by a third touch start. But that third touchstart event only has two touches in its e.touches array.

In both cases, there seems to be a missing touchend event. This could be because I don't register my touchend handler until after the second touchstart event. Maybe a touchend is being sent but I'm missing it?  (I think Doug tested that hypothesis already, though.) Or maybe some other bug is causing the touchend not to be sent at all. To be clear the actual physical touch is not ending: my finger is still down, but the event stream is behaving as if there was a touchend event in the stream somewhere.

I do not call preventDefault on the first touchstart event because if I do that, it breaks single finger gestures like clicks. I only call it on the second touchstart and the move and end events that follow. I did test calling it on everything (and breaking clicks) and it still didn't fix the issue.

Kats says that using preventDefault should workaround the APZ bugs he's found. Since it does not resolve the issue, it makes me wonder if there are also hardware or driver bugs on these devices...
Horizontal flings are broken on some apps and pages with these patches applied, e.g. Browser pointed to https://people.mozilla.org/~dsherk/

(In reply to David Flanagan [:djf] from comment #29)
> What I'm doing with that gesture detection code is trying to detect a
> two-finger swipe. In my testing on a nexus 5, what I'm seeing is that if
> both fingers touch the screen at the same time, the gesture typically works.
> But if one finger goes down before the other one then I get one touch start
> event, and a second touch start event. At this point, I consider the gesture
> to have started so I register handlers for touchmove and touchend events.
> Then, the next event that I see is a touchmove event with only one touch in
> the e.touches[] array.

These patches definitely mitigate the problems. However, I'm still finding the issue that David described here.

Note that our app is not scrollable, so the pinch-to-zoom code should be disabling itself in this frame. Thus, I don't think that this is a result of our not using `preventDefault()`.
We're getting around the issue that I described in comment 30 for now using this: https://github.com/fxos/gesture/commit/095a27be06c1f6d24729264bd78f9999f87fe62f
(In reply to David Flanagan [:djf] from comment #29)
> What I'm doing with that gesture detection code is trying to detect a
> two-finger swipe. In my testing on a nexus 5, what I'm seeing is that if
> both fingers touch the screen at the same time, the gesture typically works.
> But if one finger goes down before the other one then I get one touch start
> event, and a second touch start event. At this point, I consider the gesture
> to have started so I register handlers for touchmove and touchend events.
> Then, the next event that I see is a touchmove event with only one touch in
> the e.touches[] array.

Hm, interesting. This sounds broken and I can't think of a way APZC might be producing an event stream that looks like this. That doesn't mean there isn't one though.

> Less commonly, but still reproducibly, I see two touch start events followed
> by a third touch start. But that third touchstart event only has two touches
> in its e.touches array.

This is also bad. If you can make a simple test page to reproduce this and log the bad cases I can take a look.

> This could be
> because I don't register my touchend handler until after the second
> touchstart event.

As long as you register the listener inside the touchstart handler (i.e. not a setTimeout or something) you should be getting the move/end events, since it will register the listener before the move is dispatched.

> To be clear the actual
> physical touch is not ending: my finger is still down, but the event stream
> is behaving as if there was a touchend event in the stream somewhere.

That's interesting too - that does make it sound hardware related. APZ might eat entire input blocks but it should never be splitting stuff like this.

> I do not call preventDefault on the first touchstart event because if I do
> that, it breaks single finger gestures like clicks.

Yeah, this is expected. The behaviour of touch events leaves much to be desired.

> I only call it on the
> second touchstart and the move and end events that follow. I did test
> calling it on everything (and breaking clicks) and it still didn't fix the
> issue.

Huh. Yeah, that sounds like hardware then.
I've created a test case to demonstrate the issue where the number of touches in the touches[] array of an event does not match the number of touchstart events.  Visit http://djf.net/et.html in the FirefoxOS browser (it also works in app form).  Then make multi-touch gesture on the screen.  It will print a + for each touchstart, a digit (the touch identifier) for each touch move and a - for each touch end.  And a ! for touch cancel, but I never see any of those.

It keeps track of the number of starts and ends and expects that many touch objects in the e.touches[] array. If the numbers don't match, it prints an "touch length mismatch at n" error, where n is the index in the event stream where the mismatch occurred.

Unlike previous test cases, I can reproduce this on Flame, so maybe this is not device specific after all. 

I cannot reproduce if I visit that website using Firefox Beta on my Android phone, so this is either FirefoxOS specific, or it is a recent regression that is not in Beta yet.
(In reply to Doug Sherk (:drs) (use needinfo?) from comment #30)
> Horizontal flings are broken on some apps and pages with these patches
> applied, e.g. Browser pointed to https://people.mozilla.org/~dsherk/

Was this with the updated part 2 patch? The first one I posted was buggy.

> Note that our app is not scrollable, so the pinch-to-zoom code should be
> disabling itself in this frame. Thus, I don't think that this is a result of
> our not using `preventDefault()`.

Pinch-to-zoom code doesn't disable itself unless you set a meta-viewport tag that specifies that.

Nonetheless the easiest way to figure out if it's a hardware issue or not is to use the gecko patch I attached to this bug and just look at the stream of events coming in. That will tell you pretty quickly where the blame lies. I can play around with the test page djf posted above and see what I can find.
That seems to fix it in Gallery app. I had strange display issues when pinch-to-zoom on pictures, but that may be related to the OMTA-failures that have been going on for some days ...
Long shot: it seems this is also fixing the nasty « press power-cannot unlock lockscreen » issue
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> Long shot: it seems this is also fixing the nasty « press power-cannot
> unlock lockscreen » issue

I've given a shot to retesting this, and it seems that I can confirm: the pending patches do also fix the strange lockscreen touch not doing anything.
Flags: needinfo?(kgrandon)
Flags: needinfo?(anygregor)
Attachment #8573371 - Flags: review?(botond) → review+
Attachment #8573403 - Flags: review?(botond) → review+
Thanks, great work here!
Flags: needinfo?(kgrandon)
(In reply to David Flanagan [:djf] from comment #33)
> I've created a test case to demonstrate the issue where the number of
> touches in the touches[] array of an event does not match the number of
> touchstart events.  Visit http://djf.net/et.html in the FirefoxOS browser
> (it also works in app form).

Thanks for this test case! It makes it quite easy to reproduce and track down these problems. I found an issue with the touch resampling code, where it might issue a touchmove event with a single touch right after a second touch point is added, because it doesn't flush its resampling queue. I'll put together a fix for that.
https://hg.mozilla.org/mozilla-central/rev/2ce0fe8018b2
https://hg.mozilla.org/mozilla-central/rev/264c30d29374
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Comment on attachment 8573371 [details] [diff] [review]
Part 1 - Reset velocity when going from a pan to a pinch

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): APZ
User impact if declined: Touch event listeners in web content can get invalid touch event streams (or streams with missing touch events) around when the user does a pinch. This can lead to websites not behaving properly. Although we haven't found any examples of this "in the wild" some of our Gaia/platform developers noticed bad behavior resulting from this bug.
Testing completed: on master
Risk to taking this patch (and alternatives if risky): fairly low risk, code is well understood and patches are small.
String or UUID changes made by this patch: none
Attachment #8573371 - Flags: approval-mozilla-b2g37?
Attachment #8573403 - Flags: approval-mozilla-b2g37?
Changing the title to reflect this also impacts nexux-5 which is the only reason to consider approval for 2.2 as I think Aries is targeting 3.0 at this time.
Summary: [Aries] Touch events are flaky, especially with 2 fingers → [Aries][Nexus-5] Touch events are flaky, especially with 2 fingers
I believe this will affect all services although is most easily encountered on high-dpi devices, which is why it's been seen on aries, nexus-5, and nexus-4.
Summary: [Aries][Nexus-5] Touch events are flaky, especially with 2 fingers → [Aries][Nexus-5][hidpi?] Touch events are flaky, especially with 2 fingers
Attachment #8573371 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Attachment #8573403 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Flags: needinfo?(anygregor)
Whiteboard: [lightsaber] → [ignite]
Whiteboard: [ignite] → [spark]
Blocks: spark-device
No longer blocks: spark
You need to log in before you can comment on or make changes to this bug.