Closed Bug 1257862 Opened 8 years ago Closed 8 years ago

Scrolling via mousewheel and touchpad intermittently stops working

Categories

(Core :: Panning and Zooming, defect, P1)

48 Branch
Unspecified
All
defect

Tracking

()

VERIFIED FIXED
mozilla48
Tracking Status
firefox47 --- unaffected
firefox48 + verified

People

(Reporter: RyanVM, Assigned: kats)

References

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Attachments

(2 files)

There have been a couple recent mentions on IRC of people encountering this on OSX recently. I've also seen the same on Win10 and restarting Fx is the only way to get it working again. Scrolling via arrow keys and scrollbars still works. This only broke for me yesterday, but happened twice throughout the course of the day. Unfortunately, I don't have reliable STR :(
Version: 47 Branch → 48 Branch
Blocks: apz-desktop
Whiteboard: [gfx-noted]
I'm seeing this too every few hours on OSX. Trying to get some STR.
Tracking since this sounds likely to be a recent regression.
Could it have been caused by bug 1247854? The timing looks similar
(In reply to :Felipe Gomes (needinfo me!) from comment #3)
> Could it have been caused by bug 1247854? The timing looks similar

That got uplifted to beta, so that's not very good... Markus?
Flags: needinfo?(mstange)
It's unlikely to have been caused by that. If mousewheel/trackpad stops working entirely it's likely that some global state is getting screwed up, where as the changes in bug 1247854 are per-page, so loading a new page or switching tabs would fix it.
That being said we definitely need to figure out what's going on here. Marking it as a APZ release blocker, although for the moment we've only had it reported on Nightly/48.
Severity: normal → major
Priority: -- → P1
Flags: needinfo?(mstange)
I've found a page where I can reproduce this reliably. https://www.rightscon.org/program/

If you mouse over the schedule and scroll, it doesn't work, but if you're just left or right of the schedule, it starts working again.
(In reply to Mike Hoye [:mhoye] from comment #7)
> but if you're
> just left or right of the schedule, it starts working again.

Unfortunately, this means that it's a different bug. In this bug, scrolling stops permanently for the whole window.
The next time I hit this (which seems to be about once a day), is there a place I should be setting a breakpoint with lldb / Xcode to give you some actionable data about what's going wrong?
Some useful places to set breakpoints (assuming this is OS X trackpad scrolling): [1] [2] [3] [4]. Depending on which of these breakpoints get hit (or not) we'll have a better idea of what's going on. If you run into this and botond is in the office he should be able to help you debug it.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/InputQueue.cpp?rev=fd968c710a39#49
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/AsyncPanZoomController.cpp?rev=4ba52b16f83a#1131
[3] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/AsyncPanZoomController.cpp?rev=4ba52b16f83a#2929
[4] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/util/APZCCallbackHelper.cpp?rev=6a6e000c7a42#100
I can reliably reproduce this on my main profile , on OS X, using this url:

data:text/html,<input type=button onclick=window.print()>aaa</input>

You need to click on the button and you can cancel the print afterwards. Manually triggering a print by Cmd+P doesn't seem to work.


But I cannot reproduce it on a fresh profile

I'm digging to see if I can get any more info
(In reply to Mike Hoye [:mhoye] from comment #7)
> I've found a page where I can reproduce this reliably.
> https://www.rightscon.org/program/
> 
> If you mouse over the schedule and scroll, it doesn't work, but if you're
> just left or right of the schedule, it starts working again.

I can reproduce this on Linux. Split it out into bug 1259135.
15:09.93 INFO: Narrowed nightly regression window from [2016-03-10, 2016-03-12] (2 days) to [2016-03-11, 2016-03-12] (1 days) (~0 steps left)
15:09.93 INFO: Got as far as we can go bisecting nightlies...
15:09.93 INFO: Last good revision: 102886e9ac63b3fa33a6f1b394aea054abce2dfd (2016-03-11)
15:09.93 INFO: First bad revision: 946ed22cad04431c75ab5093989dfedf1bae5a3e (2016-03-12)
15:09.93 INFO: Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=102886e9ac63b3fa33a6f1b394aea054abce2dfd&tochange=946ed22cad04431c75ab5093989dfedf1bae5a3e

Continuing mozregression into inbound now
Hmm mozregression won't go into inbound

 0:01.54 INFO: Getting mozilla-inbound builds between 102886e9ac63b3fa33a6f1b394aea054abce2dfd and 946ed22cad04431c75ab5093989dfedf1bae5a3e
 0:07.28 WARNING: Skipping build 28bca8e29596: Unable to find build info using the taskcluster route 'gecko.v2.mozilla-inbound.revision.28bca8e29596cfa1c24b864925cb5cb5e20e35ab.firefox.macosx64-opt'
 0:09.33 INFO: Oh noes, no (more) inbound revisions :(

From the range above, bug 1242690 looks like a good candidate
Ah, I can repro using your STR if I enable apz.drag.enabled, but not if that pref is disabled (the default). Everybody else who's seen this issue: do you guys have apz.drag.enabled set to true?
Flags: needinfo?(ryanvm)
Flags: needinfo?(mconley)
I do not have apz.drag.enabled set to true.
Flags: needinfo?(mconley)
Nor do I.
Flags: needinfo?(ryanvm)
Felipe, do you? If so that might be the difference between your main profile and the clean profile - and if so this issue you found is a separate issue (although may still have the same underlying root cause).
Actually I can repro this problem even without apz.drag.enabled. However it's racy. What's happening is that we schedule the main-thread timeout on the drag block in the InputQueue, and the timeout callback never gets called. I think the print dialog opening up somehow destroys the MessageLoop on which the callback was scheduled, and so it drops that task on the floor. As a result the drag block never moves out of the queue and just all future input events get stuck behind it.
Are there other ways that race could happen? I don't seem to recall printing lately, certainly not leading up to hitting this bug.
I was printing just before I hit this bug.
I don't have apz.drag.enabled either

I hit this once right when printing but also a couple of times when printing was not involved. So something else in my profile causes the race condition to always fail
So it seems like it's not the MessageLoop getting destroyed, but some code calling SetNestableTasksAllowed(false) on the MessageLoop. I'm still digging but presumably any sort of modal dialog (not just printing) might trigger this.
That could fit the times I've hit it. I seem to recall it was when I was doing tree management-type things that could involve modal dialogs for entering credentials.
Ok, so I think I've figured out what's going on. Summary:

a) APZ code calls MessageLoop::PostDelayedTask() with a task scheduled for 300ms later.
b) MessageLoop moves the task from the "incoming" queue to the "work" queue, and then to the "delayed work" queue, and calls ScheduleDelayedWork so that DoDelayedWork eventually runs
c) Then MessageLoop::RunTask runs (for some other task, triggered by the print dialog). This sets nestable_tasks_allowed_ to false before Run()'ing the task
d) This task goes into DoWorkRunnable::Run [1] a bunch of times. Each time through, this function saves the current value of NestableTasksAllowed(), forces it true, and then restores the old value. The old value is always false (because it was set to false in step c).
e) At some point, between invocations of DoWorkRunnable::Run, MessageLoop::DoDelayedWork gets run (recall it was scheduled in step b). When this runs, nestable_tasks_allowed_ is false, and so it bails out after resetting *next_delayed_work_time [2]
f) The RunTask call from step (c) unwinds, and sets nestable_tasks_allowed_ back to true
g) Any future delayed tasks scheduled on the MessageLoop are pushed onto the back of the delayed work queue [3] and AFAICT DoDelayedWork never runs again, and the stuff in the delayed work queue is never processed.

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp?rev=6d0c889a1c0e#211
[2] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc?rev=e8c7dfe727cd#471
[3] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc?rev=e8c7dfe727cd#455
khuey, is the issue described in comment 25 a known issue? Do you have any suggestions on the best way to fix it (other than "don't use MessageLoop" :))?
Flags: needinfo?(khuey)
Attached patch PatchSplinter Review
After discussion on IRC with khuey, he suggested this as a fix. It seems to work, by wrapping the invocation of MessageLoop::DoDelayedWork in step (e) with code that sets nestable_tasks_allowed_ to true, which allows the delayed work task to run.

Flagging bsmedberg for review as well as per khuey's request (bsmedberg: see comment 25 for a summary of the problem).
Assignee: nobody → bugmail.mozilla
Flags: needinfo?(khuey)
Attachment #8734139 - Flags: review?(khuey)
Attachment #8734139 - Flags: review?(benjamin)
Comment on attachment 8734139 [details] [diff] [review]
Patch

Review of attachment 8734139 [details] [diff] [review]:
-----------------------------------------------------------------

r=me in the "this looks like a reasonable thing to do" sense, not in the "this will definitely work and will have no unintended side effects" sense.  But we can try it.
Attachment #8734139 - Flags: review?(khuey) → review+
Comment on attachment 8734139 [details] [diff] [review]
Patch

I'd really like Billm to think about this, but dvander might be good as well. I totally don't remember what nestable tasks should/shouldn't be for.
Attachment #8734139 - Flags: review?(benjamin) → review?(dvander)
Attached file Testcase
(attaching the testcase for posterity)
Comment on attachment 8734139 [details] [diff] [review]
Patch

Review of attachment 8734139 [details] [diff] [review]:
-----------------------------------------------------------------

It looks like the approach here is to always allow nestable tasks for all of Gecko. I don't know how risky that is, or what could go wrong, but our current behavior does seem broken. We never reschedule the timer.

The Chromium approach is to opt-in at each re-entrant task point, so "ScopedNestableTaskAllower" appears all over the place.

My vote is to try it out and see what happens, and if anything breaks we might have to restrict where nestable tasks are allowed. (Or fix the scheduling bug.) Anyway, I'll r? Bill just to see if he has any thoughts.
Attachment #8734139 - Flags: review?(wmccloskey)
Attachment #8734139 - Flags: review?(dvander)
Attachment #8734139 - Flags: review+
I'm going to land this so we can find out sooner if it's gonna break things. Leaving review flag for billm since he should still look at it.
https://hg.mozilla.org/mozilla-central/rev/f56afd13cb91
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
No longer blocks: 1259965
Comment on attachment 8734139 [details] [diff] [review]
Patch

Review of attachment 8734139 [details] [diff] [review]:
-----------------------------------------------------------------

Trying to figure out how the Chrome developers wanted this to work, I keep coming to the conclusion that it doesn't work. However, the underlying Chrome code doesn't seem to have changed much that I can see.

I think the best thing is to remove all the nestable_tasks_allowed_ code. I don't see any benefits to it. We're now not using it at all on our main threads, and I seriously doubt it's useful on non-main threads.
Attachment #8734139 - Flags: review?(wmccloskey) → review+
See Also: → 1263786
Verified as fixed using latest Dev Edition 48.0a2.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.