Closed Bug 1274597 Opened 4 years ago Closed 3 years ago
Clicking on links sometimes doesn't work, long-pressing results in text selection instead of context menu
44.58 KB, text/plain
64.96 KB, text/plain
1009.91 KB, image/png
52.00 KB, text/plain
6.53 KB, text/plain
Bug 1274597 - Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses.
58 bytes, text/x-review-board-request
It seems like sometimes Firefox gets into a situation where clicks on links (both short and long) aren't recognised any more: (In reply to Ryan VanderMeulen [:RyanVM] from bug 1231570 comment #29) > I'm guessing this is why I'm finding it nearly impossible to click links > with today's Aurora nightly? (In reply to Ryan VanderMeulen [:RyanVM] from bug 1231570 comment #31) > Yes. HTC M8 w/ Marshmallow. I see it most easily on Reddit. When I try to > long-click a link, I get selection handles (and the associated menu) instead > of the context menu letting me open the link in a new tab. Oddly, I was > unable to even short-click links at that point either. (In reply to Jan Henning [:JanH] from bug 1231570 comment #32) > I've seen something similar (short-clicking not working and long-clicking > only displaying the selection handles) very recently on a local build, > although in my case it was on a plain HTTP server directory listing page. A > restart fixed it, although I can't quite remember whether restarting Fennec > was enough, or if I had to actually reboot my phone... apart from that > incident, I haven't encountered any difficulties clicking links.
Like Jan, I force-killed and restarted Fennec and I could no longer reproduce. We'll see if it returns I guess.
This is really bad and I've seen this and did not realize we closed bug 1265251 which may be the first report of this. Might be APZ?
Sounds reasonable to track this for 48/49 until we figure out what is going on.
Assignee: nobody → rbarker
tracking-fennec: ? → 48+
I ran into it this morning again, on Nightly. Switching tabs to about:config, checking the value of some prefs, and then switching back seemed to fix it. I'm going to start running local builds so that if it happens again I can try to gdb it.
Summary: Clicking on links sometimes doesn't work → Clicking on links sometimes doesn't work, long-pressing results in text selection instead of context menu
I have my local Fennec build in this state now. From the logging I added and gdb investigation everything *appears* to be working correctly in terms of coordinate flow, but I think the hit test is failing for some reason. In particular: - The page is loaded and zoomed in to a resolution of ~2.2x - If I tap on a particular link and inspect the coordinates the widget creates and puts into the MultiTouchInput as a ScreenIntPoint, they are in the neighbourhood of (100, 480) depending on exactly where my finger lands. - The APZ untransform is a no-op, so the touch events and single-tap message sent to gecko all have the same LayoutDevice coordinates of (100, 480) - The APZ callback transform is empty, so that's not affecting this at all. - The division at  is doing nothing (which I believe is correct on Fennec - that was intended for B2G IIRC) - During hit-testing the PET code at  is getting what appears to be the correct aPointRelativeToRootFrame of (3000, 14400) app units (which corresponds to 100,480 with a 2x devicePixelRatio). - During the hit-test I see that point getting RemoveResolution'd at  - HOWEVER, the result of the hit-test doesn't appear to be correct. It seems to find the body element as the target, rather than the link that I'm actually tapping on. - I also inspected the coordinates at various other points and I think they all seem to be correct based on my understanding of how the resolution removal works. For example at  the coordinates that get logged are (50,240) which is correct for resolution-applied CSS coordinates. - In order to trace through the hit-testing code, I need to enable display-list dumping or ideally turn on the dumping at . Unfortunately I did an opt build without paint dumping, so I can't make that dumping work. And I can't seem to attach the devtools (filed a bug for that) which would allow me to turn on display-list dumping via JS APIs. If I open about:config or switch tabs at all I know this state will get cleared so I won't be able to debug further. For now I'm leaving it in this state, if anybody has any ideas on what I can do let me know. Otherwise I might just have to do another build with paint-dumping enabled and wait until the next time I can reproduce this. Also I feel like the results I have seen so far are consistent with the observable symptoms, which gives me a little bit of confidence that I'm on the right track. The bad hit-test means that the _highlightElement in browser.js gets set to the body, which is why long-pressing doesn't give a context menu. I think the mouse events that constitute the click are also probably getting the wrong target (the body) and so clicks have no effect. Text selection still works because it uses the eMouseLongTap event point rather than an event target (I think).  http://searchfox.org/mozilla-central/rev/bfcc10319e4e3ce78367fa9bba9316f7eb5248b6/gfx/layers/apz/util/APZCCallbackHelper.cpp#404  http://searchfox.org/mozilla-central/source/layout/base/PositionedEventTargeting.cpp#547  http://searchfox.org/mozilla-central/rev/bfcc10319e4e3ce78367fa9bba9316f7eb5248b6/layout/base/nsDisplayList.cpp#4959  http://searchfox.org/mozilla-central/rev/bfcc10319e4e3ce78367fa9bba9316f7eb5248b6/gfx/layers/apz/util/APZEventState.cpp#172  http://searchfox.org/mozilla-central/rev/bfcc10319e4e3ce78367fa9bba9316f7eb5248b6/layout/base/nsLayoutUtils.cpp#3176
so my understanding of how things were supposed to work was wrong. The call to GetEventCoordinatesRelativeTo is supposed to do a RemoveResolution on the point, so that when it goes through the display list build and hit-test the coordinates already have the resolution removed. At least that's what I'm seeing now on a "working normally" Fennec build. So for some reason we seem to get into a state where after having called GetEventCoordinatesRelativeTo, the coordinates still have the resolution applied. I'm still unable to find a good way to reproduce this behaviour, so it's hard to debug. I did find a bug in the touch-action support while investigating though, will file another bug for that.
I am really scared that we are shipping this in 48. I know that investigation has been fruitless so far :-(
I have my device in this state again. The logging seems to indicate that during the course of a long-press, there are many calls to GetEventCoordinatesRelativeTo, some of which go through the code path at  and some of which go through . When it goes through , the resolution is 1.0 and at  the resolution is correctly reflecting my zoom level. The main-thread hit-testing seems to occur mostly with the coordinates returned by , and so the display items that get built are for something not directly under the finger.  http://searchfox.org/mozilla-central/rev/ff0e3782b1c44ffd1a68ef8e9e4f8be3866741e2/layout/base/nsLayoutUtils.cpp#2234  http://searchfox.org/mozilla-central/rev/ff0e3782b1c44ffd1a68ef8e9e4f8be3866741e2/layout/base/nsLayoutUtils.cpp#2273
I've found that the scrolling behavior with nightly on marketwatch.com mobile site has been suspect since they revamped it a few months ago. Specifically, it seems like when you attempt to scroll down the page, you can "miss" the page or something, and the content won't scroll at all or it will scroll 1-3px and then just stop scrolling even though you keep moving your finger. After this "miss-scroll" occurs, bug 1274597 seems to occur (more?) and it seems to select the text of the link rather than try to hit the link itself. Just some behavior I've been seeing for the last 4+ weeks on nightly w/apz. Further, you can try to scroll again and scrolling will work on the 2nd+. Bug 1274597 then does not seem to occur as much(?), but it still does sometimes. Try the following: 1. marketwatch.com 2. Scroll down the page about 1-inch with your finger. Note how the page will scroll just a few pixels. 3. Long tap a link. It doesn't go, text selected instead. Refresh the page, rinse, repeat to do it again. I hope this helps; this bug is aggravating.
Thanks for the suggestions, but unfortunately even with that I'm not able to reproduce the problem.
I ran into this again. I don't have much time to look into this now since I'm going on PTO but I'm attaching a log which covers me long-pressing on a link and it getting selected instead of showing the context menu. The patch which I used to generating the logging is at . It looks like nsLayoutUtils::GetEventCoordinatesRelativeTo is getting called with a frame 0x91b2d740 that is the "reference frame" (it shows up as such in the display list dump). The code at  gets the presShell from the frame and then gets the APZ resolution scale from that, but that's coming out as 1.0. This results in the event not being transformed properly and so the hit-test returns the wrong thing. In the log you can also see other calls to nsLayoutUtils::GetEventCoordinatesRelativeTo where the frame is 0x8b421620 (this is the ViewportFrame, according to the display list dump). In that case the frame's view's widget is 0x0 and it ends up falling through to  where it correctly gets 1.547295 as the resolution scale and unapplies it. However it looks like those calls are not the ones that are part of a hit-test (because there's no display list dump immediately after). I *think* the upshot of all this is that we're using the wrong presShell at , at least in some cases, where the frame being passed in is the reference frame.  https://github.com/staktrace/gecko-dev/commit/e58c4fe58d173ad606d7fb2d47cf36fcfb11ded4  http://searchfox.org/mozilla-central/rev/d9a04f71630ce4203ff0a5e26722723045d035b5/layout/base/nsLayoutUtils.cpp#2230  http://searchfox.org/mozilla-central/rev/d9a04f71630ce4203ff0a5e26722723045d035b5/layout/base/nsLayoutUtils.cpp#2273
Also, I did a tab switch to clear the bad state, and repeated the long-press. This is the resulting log. Note how in this one GECRT never gets called with the frame 0x91b2d740.
(In reply to (away Aug5-14) Kartikaya Gupta (email:email@example.com) from comment #14) > I *think* the upshot of all this is that we're using the wrong presShell at > , at least in some cases, where the frame being passed in is the > reference frame. > >  > http://searchfox.org/mozilla-central/rev/ > d9a04f71630ce4203ff0a5e26722723045d035b5/layout/base/nsLayoutUtils.cpp#2230 cc'ing Timothy; I have a recollection of us talking about this early exit from GECRT in London.
3 years ago
Version: Trunk → 48 Branch
I looked at the logs I posed in comment 14 and comment 15 again, and noticed that in the bad case, where the 0x91b2d740 is used as the GECRT frame, the hit-testing display list dump is actually dumping the root XUL document rather than the HTML document. So 0x91b2d740 is actually the reference frame for the XUL document while 0x8b421620 is the reference frame for the HTML document, and what we want to be targeting. So the root cause is somewhere earlier from that point, probably in PresShell::HandleEvent, where we pick the wrong document as the target. Will need to add more logging and run this again.
kats, would it be useful to have more attempts to reproduce and log or find STR? ioana, can your team try to reproduce this issue?
Yes, finding reliable STR would be quite helpful. I've only been able to hit randomly (and not at all in the last couple of weeks) so debugging it is very slow.
I am able to reproduce it both on Reddit and using steps from Comment #12 aroudn 1 out of 4/5 times. I can reproduce easily using end of string or even near proximity ( this last case might be intended) see screens
Hope it helps - but again I am not sure I'm reproducing it clean all times. I can have a look tomorrow morning too.
This is exactly what I was able to do in Comment 12. Notice in the screenshot, however, that the selected text is specifically at the end of the string in both cases. I can accomplish the same thing to any string of text, whether or not it is a hyperlink, simply by long-tapping somewhere in the small amount of white space just after the last character in the string, and it happens 100% of the time, but I don't think it actually reproduces the bug being described here. I believe it is correct behavior to highlight the text of a link if the browser believes the text is being selected for text manipulation rather than being invoked to perform the link action. But I could be very wrong here too. I dropped off the radar because ever since Comment 12, the thing I've been trying to reproduce is to get a token which isn't the first or last token in the hyperlink string to be selected, proving this bug's existence and sidestepping any issues from my above description, whether or not it is true. And again, maybe this is wrongful thinking. After a seemingly week+ hiatus, this behavior just happened again to me today using some email unsubscribe page (I was on yesterday's Nightly at the time). A word in the center of the hyperlink string was selected when I attempted to long-tap the link, which was surely an occurrence of this bug. I wasn't really paying attention at the time because it was early morning, but the things I remember were that: 1) I did a single zoom-in gesture on the page immediately after it finished loading and before I selected the link. 2) I began selecting the link almost immediately after finishing the zoom gesture, i.e. I was navigating quickly. 3) I don't remember exactly, but I'm pretty sure the dynamic toolbar was displayed in its entirety the whole time I was on the page, and I do have full screen browsing turned on. 4) All of this happened in about 1 second between the time the page finished loading and the time the bug occurred/hyperlink text was selected. Try and see if you can reproduce where some token in the middle of the hyperlink string is selected. I think that's what we're looking for. It doesn't even have to be text! I've seen picture elements and radio buttons get selected as well. Other things I've noticed: It seems this bug likes to occur when you start to see some fps drop or see other kinds of jank while panning or zooming, or the behavior I described above where you can "miss" scrolling the page. I called out marketwatch.com because that site was a pig at the time. I'll sometimes have to select a link for a very long time before the context menu appears as well (2+ seconds?). With marketwatch.com, I was usually lucky in reproducing this bug because the browser showed both of these symptoms in a significant way and just so happened to notice that I was able to reproduce this bug more frequently on that site at the time. The weird scrolling behavior no longer occurs there and I haven't been able to reproduce this bug on that site in a few weeks. I'd be happy to file bugs for any behavior I've called out above.
tracking-fennec: 48+ → 50+
(In reply to Ioana Chiorean from comment #25) > Hope it helps - but again I am not sure I'm reproducing it clean all times. > I can have a look tomorrow morning too. What DrBill said in comment 26 is basically correct - long-pressing outside a link will generally do text-selection. It's when you tap directly on a link and it doesn't work, or if you long-press directly on a link and it does text selection instead of showing the context menu, that you're seeing the same problem. At this point logs probably won't help much because the logs won't contain any necessary information, but reliable STR would be great.
This month, I saw the problem several times in the following STR with firefox 48 on my nexus 9. [Prerequisite] - Open http://syosetu.com/ and log in to an account. I already registered some novels to read. Updated novels appear left of the page when I log in. - I always keep Firefox and the tab open. [STR]  Touch page update icon in the toolbar. Then latest updated novel appear.  Long press a link of a latest story. And select "Open Link in New Tab". I did it for all new updated stories.  Move to a Tab and read a story. When I end to read it, close the tab Continue - several times in a day. There was a case that long push did not work for a link in .
(In reply to Sotaro Ikeda [:sotaro] from comment #28) > Continue - several times in a day. There was a case that long push did > not work for a link in . In this case text selection just happened.
(In reply to Kartikaya Gupta (email:firstname.lastname@example.org) from comment #27) > but reliable STR would be great. For STR I mentioned same as DrBill's in comment #12: 1. marketwatch.com 2. Scroll down the page about 1-inch with your finger. Note how the page will scroll just a few pixels. 3. Long tap a link. It doesn't go, text selected instead or simply go to Reddit and long press any link from the page at a end or where some white space s ( when a title is on 2 lines etc)
I wasn't able to reproduce this on marketwatch.com, although I just ran into it after loading https://bugzilla.mozilla.org/show_bug.cgi?id=1296887 in a local Fennec build. Unfortunately the local build didn't have my logging, but I was able to use gdb and I discovered that during hit-testing, the call at  is returning null. That's why the events are getting sent to the root XUL document instead of the content document. And that function in turn is returning null because the primaryContentShell is null at . According to the comment just above that, that would indicate there is no "active tab". Not sure how we can get into that state but at least it's another step in the investigation.  http://searchfox.org/mozilla-central/rev/3611a7607db7554b9d7065ebfef3d5111f7171b8/layout/base/nsPresShell.cpp#7688  http://searchfox.org/mozilla-central/rev/3611a7607db7554b9d7065ebfef3d5111f7171b8/layout/base/nsPresShell.cpp#8246
I forgot to mention - the last time I saw this, I also checked BrowserApp.selectedTab and BrowserApp.selectedBrowser using the JS debugger. And it looked like selectedBrowser had type="content-targetable" instead of the type="content-primary" that I would expect based on .  http://searchfox.org/mozilla-central/rev/3582398bc9db5a83e25c2a8299cc780a52ad7ca8/mobile/android/chrome/content/browser.js#3689
(In reply to Kartikaya Gupta (email:email@example.com) from comment #32) > I forgot to mention - the last time I saw this, I also checked > BrowserApp.selectedTab and BrowserApp.selectedBrowser using the JS debugger. > And it looked like selectedBrowser had type="content-targetable" instead of > the type="content-primary" that I would expect based on . > >  > http://searchfox.org/mozilla-central/rev/ > 3582398bc9db5a83e25c2a8299cc780a52ad7ca8/mobile/android/chrome/content/ > browser.js#3689 The first line there causes the function to bail out if the docShell is null. Is that possible? My guess is that it is possible if someone decided to guard against it. The result would be as you observed, an 'active' tab that isn't really active.
I have it in this state again. I started Fennec, which restored the one tab I had from last time (my bugmail dashboard). I reloaded it, and then after some panning around, long-pressed on a link to open it in a new background tab. After that point it was in the bad state. Using the JS debugger I see that both tabs are returning false from getActive(). This is on a local build where I added logging into Tab.setActive, and I don't see that logging anywhere in my logcat, which seems to indicate Tab.setActive was never called at all. That's bizarre - it should have at least gotten called from the Tab.create function.
Ugh. It looks like my logcat is incomplete, because of http://stackoverflow.com/questions/34587273/android-logcat-logs-chatty-module-line-expire-message
Hit it again. My logging is still incomplete, but I see that the tab is in fact deactivated by browser.js. I see this in the log: 09-11 06:05:14.637 D/GeckoAppShell(27012): Notification client already set 09-11 06:05:14.652 D/GeckoBrowser(27012): staktrace: setting tab active true 09-11 06:05:14.652 D/GeckoBrowser(27012): setActive@chrome://browser/content/browser.js:3718:11 09-11 06:05:14.652 D/GeckoBrowser(27012): ao_observe@chrome://browser/content/browser.js:6606:7 09-11 06:05:14.654 D/GeckoScreenOrientation(27012): unlocking 09-11 06:05:14.679 D/GeckoBrowser(27012): staktrace: setting tab active false 09-11 06:05:14.679 D/GeckoBrowser(27012): setActive@chrome://browser/content/browser.js:3718:11 09-11 06:05:14.679 D/GeckoBrowser(27012): ao_observe@chrome://browser/content/browser.js:6606:7 which is the last thing to change a tab's state - so something is dispatching a application-foreground quickly followed by a application-background notification, and the selectedtab is activated and then deactivated. This happened while Fennec was in the foreground and I was interacting with it (although I don't recall if maybe I got some system notification popup at this time). The "Notification client already set" logging is coming from GeckoService.onCreate(), which does in fact dispatch the application-foreground message via GeckoThread.onResume(). It seems logical to assume that the application-background message is coming from GeckoService.onDestroy() via GeckoThread.onPause(). I've added more logging to catch this if happens. snorp, do you know what this GeckoService is, and what causes it to get created/destroyed? And in particular, if it might get created/destroyed while in the middle of interacting with Fennec?
Interesting. GeckoService is used to handle some background stuff like updating addons or showing a persistent notification while Fennec isn't running. The onResume/onPause stuff in GeckoThread ends up in nsAppShell which keeps a count. Apparently we have an unbalanced pause/resume, leading to nsAppShell thinking we are 'paused' when we are not. A quick glance suggests that we are missing an initial GeckoThread.onResume() call when Fennec is started. Without this, GeckoService starting and stopping yields a 'paused' state. Jim does this sound right to you?  https://dxr.mozilla.org/mozilla-central/source/widget/android/nsAppShell.cpp?q=android%2FnsAppShell.cpp&redirect_type=direct#141
Flags: needinfo?(snorp) → needinfo?(nchen)
So it looks like the problem is that the sPauseCounter in nsAppShell isn't set up to track multiple resumes (i.e. it's a one-way counter, rather than two way). So on startup, sPauseCount is 0. Then GeckoThread.onResume() is called by GeckoService.onCreate, and we go through . Since sPauseCount is 0 we return immediately without doing anything. Then GeckoThread.onPause() is called by GeckoService.onDestroy, and we go through . Since sPauseCount is 0, it gets incremented to 1 and we send the application-background message. As snorp pointed out on IRC, this can be reproduced by sticking in calls to GeckoThread.onResume(); GeckoThread.onPause(); somewhere in a code path that can be executed on-demand. Doing so reproduces the problem. I think it's a relatively simple fix to make sPauseCount signed instead of unsigned and update the conditions so that it can track multiple resumes as well as multiple pauses.  http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/widget/android/nsAppShell.cpp#173  http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/widget/android/nsAppShell.cpp#141
Assignee: rbarker → bugmail
Comment on attachment 8790371 [details] Bug 1274597 - Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses. https://reviewboard.mozilla.org/r/78240/#review77254 The patch looks correct, but I don't think we should actually be allowing multiple pause/resume calls. The app is either in the foreground or background, and there should be one place managing that. We (or just myself) can rework that in a followup.
Attachment #8790371 - Flags: review?(snorp) → review+
(In reply to James Willcox (:snorp) (firstname.lastname@example.org) from comment #41) > I don't think we should actually be allowing > multiple pause/resume calls. The app is either in the foreground or > background, and there should be one place managing that. We (or just myself) > can rework that in a followup. Sounds good.
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/109281cadd42 Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses. r=snorp
Comment on attachment 8790371 [details] Bug 1274597 - Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses. Approval Request Comment [Feature/regressing bug #]: unsure, but probably is an interaction between bug 1260243 and APZ [User impact if declined]: randomly fennec will get into a state where tapping on links doesn't work. a tab switch is required to get it working again properly. it's very annoying. There might be other bugs that result from the same root cause, not sure. [Describe test coverage new/current, TreeHerder]: no automated coverage. the bug is intermittent so it's hard to verify [Risks and why]: fairly low risk. even if this ends up not fixing the bug the patch is pretty safe and shouldn't break anything [String/UUID change made/needed]: none
Attachment #8790371 - Flags: approval-mozilla-aurora?
Comment on attachment 8790371 [details] Bug 1274597 - Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses. Approval Request Comment - see above comment I'm not sure if we're still taking patches for 49 beta but if we are this is probably a good one to consider.
Hi Jan, Ryan, could you please verify this issue is fixed as expected on a latest Nightly build? Thanks!
I wasn't ever able to consistently reproduce it on my device. Not sure if Ioana's team has had better luck in doing so.
Flags: needinfo?(ryanvm) → needinfo?(ioana.chiorean)
Comment on attachment 8790371 [details] Bug 1274597 - Update the pause counter in nsAppShell to track multiple resumes as well multiple pauses. Fixes a recent regression, Aurora50+
Attachment #8790371 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
What Ryan said - from a user perspective it's essentially random.
Maybe a potential candidate to ride along on a dot release, should there be any?
That's a good point. Liz, is there a process for flagging patches as potential ride-alongs for dot releases? This one would be good on 49 if there's a dot release for Fennec.
Yes, I can mark it affected and track it. My list of tracked bugs is pretty small at this point, mostly things that are also candidates for dot release. I'm pretty sure we will end up with one, but there isn't a clear "driver" for it yet.
Hello, This fix was verified on latest Fennec Beta 51.0b2 - build1, using the next devices: - Huawei Honor 5x (Android 5.1.1); - Nexus 9 (Android 7.0).
I've seen a resurgence of this problem under aurora 52 on Galaxy S4 w/ Cyanogenmod-13/Android-6.0.1
I also started seeing this again this week.
(In reply to neil from comment #59) > I've seen a resurgence of this problem under aurora 52 on Galaxy S4 w/ > Cyanogenmod-13/Android-6.0.1 (In reply to DrBill from comment #60) > I also started seeing this again this week. Could you please file a new bug with as much detail as possible? Given the amount of trouble we had tracking this down last time, the more information we have the better. Thanks!
[Tracking Requested - why for this release]: (In reply to Kartikaya Gupta (email:firstname.lastname@example.org) from comment #61) Bug resurging in Firefox 53 aurora as of February 2017, on Galaxy S4 i9515 w/ up-to-date official stock ROM. What kind of extra detail can I provide for you on an unrooted device? Any log-files might be provided if I know how to create them. This really needs to be addressed, since opening links in new tabs is an absolute must for newspapers, etc. Browser history to go back is painful. And thanks for the friendly reply on IRC :)
I filed bug 1336209 for this. Please continue any discussion over on that bug.
You need to log in before you can comment on or make changes to this bug.