Blank white persistent frames happen intermittently when scrolling with the scroll wheel

VERIFIED FIXED in Firefox 46

Status

()

defect
P1
normal
VERIFIED FIXED
4 years ago
3 years ago

People

(Reporter: jgilbert, Assigned: kats)

Tracking

unspecified
mozilla46
Unspecified
Windows 10
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox46 fixed)

Details

()

Attachments

(7 attachments)

Scroll-wheel scrolling will sometimes cause all (or more rarely, most) of a page to be blank white. Arrow-key scrolling (which I'm told doesn't use APZ) consistently behaves fine.
It was more obvious on longer pages. I saw it frequently on:
https://news.ycombinator.com/item?id=10512882

After a restart, it's not showing up. I restarted because I was unable to access the start menu to open the screenshotting util. Perhaps these are related, and there's handle or resource exhaustion?

Alternatively, my machine resets its GPU about one in three times I sleep+wake it, so perhaps it's due to device reset handling.
I had restarted the browser a couple times, but that didn't seem to fix it, or didn't fix it for long.
Resolving since this went away and is not actionable any more. If you encounter it again please reopen this bug and don't reboot. We'll need to enable various loggings to diagnose the problem when somebody is able to reproduce it.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
I'm still getting this, though it's closer to 1 in 300 frames right now. Previously it was as bad as 1 in 3 before I restarted.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Given that this also happens sometimes partially, maybe it's happening along tile boundaries?
Hm, interesting. Can you set the apz.highlight_checkerboarded_areas pref to true (requires browser restart)? That will make checkerboarding appear as a pink color. If the white is replaced by pink then it's a perma-checkerboarding problem. If not then it's likely that the painted layers themselves are bad. A layers dump might also help.
Flags: needinfo?(bugmail.mozilla)
Ok, so that definitely looks like perma-checkerboarding. :( Either repaint requests aren't getting through or the painted content isn't getting through. Likely the former since scrolling with keyboard works.

What device are you seeing this on? It would be easiest to debug if I can reproduce it locally, otherwise I'll need to send you a special build with extra logging or ask you to attach a debugger and see what's going on.

If you *do* want to attach a debugger, I'd be interested to know if, in the parent process, we enter the function at [1] and where in that function code flow goes through, and also if we hit the code at [2].

[1] https://dxr.mozilla.org/mozilla-central/rev/a8ed7dd831d1969a5a1a8636e63bd93d6aeaf94a/gfx/layers/apz/src/AsyncPanZoomController.cpp#2732
[2] https://dxr.mozilla.org/mozilla-central/rev/a8ed7dd831d1969a5a1a8636e63bd93d6aeaf94a/gfx/layers/apz/src/AsyncPanZoomController.cpp#2790
Flags: needinfo?(bugmail.mozilla)
It's a 2013 15" retina macbook pro running Win10.

This is from my about:memory:

      5,488.44 MB ── canvas-2d-pixels                   <<<<<< What?!
         58.04 MB ── d3d11-shared-textures
          0.00 MB ── d3d9-shared-texture
          0.00 MB ── d3d9-shared-textures
          0.00 MB ── gfx-d2d-vram-draw-target
          0.00 MB ── gfx-d2d-vram-source-surface
          0.00 MB ── gfx-surface-win32
          0.00 MB ── gfx-textures
          0.00 MB ── gfx-tiles-waste
                2 ── ghost-windows
          0.00 MB ── gpu-committed
         75.40 MB ── gpu-dedicated
        224.34 MB ── gpu-shared
      1,630.80 MB ── heap-allocated
            2,110 ── heap-chunks
          1.00 MB ── heap-chunksize
      1,666.61 MB ── heap-committed
      2,110.00 MB ── heap-mapped
            2.19% ── heap-overhead-ratio
                0 ── host-object-urls
          3.84 MB ── imagelib-surface-cache-estimated-locked
          3.94 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count
         10.59 MB ── js-main-runtime-temporary-peak
      4,174.24 MB ── private
      2,849.61 MB ── resident
          0.18 MB ── shmem-allocated
          0.18 MB ── shmem-mapped
         50.17 MB ── system-heap-allocated
      5,861.49 MB ── vsize
129,988,053.61 MB ── vsize-max-contiguous
                2 ── webgl-buffer-count
          0.00 MB ── webgl-buffer-memory
                2 ── webgl-context-count
                0 ── webgl-renderbuffer-count
          0.00 MB ── webgl-renderbuffer-memory
                4 ── webgl-shader-count
                2 ── webgl-texture-count
          0.00 MB ── webgl-texture-memory
After scrolling around some of my GL spec PDFs (pdf.js) some more, it's at:
5,630.24 MB ── canvas-2d-pixels

Maybe, or maybe not unrelated.
I see this as well on a PC with Intel HD Graphics 2000. With apz.highlight_checkerboarded_areas set to true I see the fetching pink colour. (Most of the time -- though I did once see pink in one tab and white in another.)

On pages where it has started to occur (usually long pages), it will reliably happen if I make the smallest possible movement of the mouse wheel.

Nightly 45.0a1, for about the last week or more.
(In reply to Alistair Vining from comment #14)
> Nightly 45.0a1, for about the last week or more.

If this is a recent regression, any chance you could use mozregression to narrow down when it started? I don't know if you can reproduce it reliably enough for that though.
Yes, I had trouble with mozregression for a while because of false negatives, but after reading Jeff's comments about restarting, I think the following STR were reliable for me:
* Open URL in new tab (e.g. this page, or the original URL), no need to scroll or resize browser
* Hibernate PC (didn't try with sleep)
* Restart PC and scroll page with mouse wheel

That gave the following pushes:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d53a52b39a95dced722cca90ac74529b66dd5253&tochange=5ca03a00d26823ce91ee0eaa2937bed605bd53c1

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=076f95c512c459f9bd668222fc65b75f442ae4e0&tochange=c9af4f7acba9c7cb9c486c4343b3bad26e367925

Bug 1207412: Mouse wheel smooth-scroll "jumps forward" every ~250ms with APZ enabled - regression from 2015-08-24

Obviously if the STR are not 100% then this is just a red herring. Without hibernating the PC it comes and goes; yesterday it seemed to be present all day, this morning I couldn't otherwise see it or it would happen once and go away.
Thank you so much for getting that regression window. It certainly seems reliable enough if you were able to bisect it down that far, and the regressing changeset is definitely related so your result makes sense. Brian, any thoughts on how hibernation might affect the timestamps on the wheel events?
Blocks: 1207412
Flags: needinfo?(bbirtles)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Thank you so much for getting that regression window. It certainly seems
> reliable enough if you were able to bisect it down that far, and the
> regressing changeset is definitely related so your result makes sense.
> Brian, any thoughts on how hibernation might affect the timestamps on the
> wheel events?

That changeset only switched a call from TimeStamp::NowLoRes to TimeStamp::Now. It's possible the underlying API calls behave differently in the face of hibernation but I really don't know without stepping through with a debugger. Unfortunately I won't have time to look into it this week. If anyone else can have a look, I'd be most grateful.
Flags: needinfo?(bbirtles)
(In reply to Alistair Vining from comment #16)
> Yes, I had trouble with mozregression for a while because of false
> negatives, but after reading Jeff's comments about restarting, I think the
> following STR were reliable for me:
> * Open URL in new tab (e.g. this page, or the original URL), no need to
> scroll or resize browser
> * Hibernate PC (didn't try with sleep)
> * Restart PC and scroll page with mouse wheel
> 
> That gave the following pushes:
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=d53a52b39a95dced722cca90ac74529b66dd5253&tochange=5ca0
> 3a00d26823ce91ee0eaa2937bed605bd53c1
> 
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=076f95c512c459f9bd668222fc65b75f442ae4e0&tochange=c9af
> 4f7acba9c7cb9c486c4343b3bad26e367925
> 
> Bug 1207412: Mouse wheel smooth-scroll "jumps forward" every ~250ms with APZ
> enabled - regression from 2015-08-24
> 
> Obviously if the STR are not 100% then this is just a red herring. Without
> hibernating the PC it comes and goes; yesterday it seemed to be present all
> day, this morning I couldn't otherwise see it or it would happen once and go
> away.

Alistair, I've been trying to reproduce this on my laptop but haven't been able to get it yet. Should it be sufficient to close the lid, wait for hibernation to complete, then reopen it? (I did make it is hibernating, rather than sleeping.)
That should be sufficient, David: I have a desktop PC and the power button is set to Hibernate (S4), so I just push that and wait for the power light to go out. I have just tested sleep and that seems to reproduce the bug as well.
(In reply to Alistair Vining from comment #20)
> That should be sufficient, David: I have a desktop PC and the power button
> is set to Hibernate (S4), so I just push that and wait for the power light
> to go out. I have just tested sleep and that seems to reproduce the bug as
> well.

I'm having trouble reproducing this on my Desktop as well. What version of Windows do you have? Could you also attach a copy of about:support? (Just the graphics section is enough)
I am running the recent Windows 10 Home version 1511; I have attached the contents of about:support and below that the same details from safe mode where the bug also shows up.
(In reply to Alistair Vining from comment #22)
> Created attachment 8691021 [details]
> Contents of about:support (normal + safe mode)
> 
> I am running the recent Windows 10 Home version 1511; I have attached the
> contents of about:support and below that the same details from safe mode
> where the bug also shows up.

Well, I'm out of ideas for trying to repro this locally then :) Alistair, I'll try to get you an instrumented build with extra logging.
Assignee: nobody → dvander
Status: REOPENED → ASSIGNED
Priority: -- → P1
Jeff Gilbert was looking at this today, he caught it in the debugger but with questionable symbols. It looked like the timestamp object was getting corrupted when it was getting passed into TaskThrottler::TaskComplete from AsyncPanZoomController. Also it looked like some of the TimeStamp values were 0 and otherwise kind of dodgy. I'm hoping that dvander's deletion of the TaskThrottler in bug 1192919 will just make this problem go away but if it doesn't then we'll have to debug it further.
Depends on: 1192919
I'm able to reproduce this issue using Hibernate on a Surface running windows 10. I also tested using a try build that had the patches from bug 1192919, and I can still repro it. I'll investigate.
Assignee: dvander → bugmail.mozilla
No longer depends on: 1192919
So the problem seems to happen at the start of a wheel-scroll animation in the APZ. What happens it that mLastSampleTime gets reset to Now() at [1] which has little relationship to the vsync timestamp that comes in at [2]. By "little relationship" I mean that Now() is almost always ahead of the vsync timestamp, and by sometimes large amounts. The computed sampleTimeDelta value is effectively garbage on the first animation sample, which results in the axis velocity getting set to garbage at [3], and that throws the displayport calculation way off.

I'm not entirely sure why this only happens after hibernation, though. As far as I can tell it should happen even during normal operation. I have a patch that band-aids the problem but I'm not really satisfied that I've found the root cause yet, so I'll dig a bit more.

[1] https://dxr.mozilla.org/mozilla-central/rev/0711218a018d912036f7d3be2ae2649e213cfb85/gfx/layers/apz/src/AsyncPanZoomController.cpp#2573
[2] https://dxr.mozilla.org/mozilla-central/rev/0711218a018d912036f7d3be2ae2649e213cfb85/gfx/layers/apz/src/AsyncPanZoomController.cpp#2910
[3] https://dxr.mozilla.org/mozilla-central/rev/0711218a018d912036f7d3be2ae2649e213cfb85/gfx/layers/apz/src/WheelScrollAnimation.cpp#61
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #26)
> The computed
> sampleTimeDelta value is effectively garbage on the first animation sample,
> which results in the axis velocity getting set to garbage at [3], and that
> throws the displayport calculation way off.

So this is true. The difference between the pre- and post-hibernation scenarios is that in the post-hibernation scenarios almost all the animation samples are "first" animation samples. This happens because the wheel transaction keeps getting terminated and we start a new wheel block, which cancels and restarts the animation. The wheel transaction getting terminated, in turn, because we keep getting eMouseHitTest event types coming in to the APZCTreeManager and hitting the MaybeTimeout call at [1]. For some reason the timestamps on the eMouseHitTest events are a few seconds off from the timestamps on the wheel events, causing the problem.

But wait! The call at [1] is actually using TimeStamp::Now() instead of the timestamp on the eMouseHitTest (but they're about the same, so it doesn't matter). Some more investigation reveals that the wheel events get their timestamp from SystemTimeConverter::GetTimeStampFromSystemTime(::GetMessageTime()) at [2]. And this is different from TimeStamp::Now(). Pre-hibernation the difference is less than 20ms but post-hibernation this goes up to ~3300ms.

The code in SystemTimeConverter is supposed to ensure this never happens, but on windows the GetTimeAsyncForPossibleBackwardsSkew function for the callback isn't implemented, and so backwards skew never actually gets corrected.

[1] https://dxr.mozilla.org/mozilla-central/rev/66fb852962c0d5f6f5fe0604204da4f5d17763c9/gfx/layers/apz/src/APZCTreeManager.cpp#994
[2] https://dxr.mozilla.org/mozilla-central/rev/66fb852962c0d5f6f5fe0604204da4f5d17763c9/widget/windows/nsWindow.cpp#3822
Posted patch PatchSplinter Review
This fixes the skew post-hibernate and solves the problem. Looking at the Linux implementation I suppose it would be better to do a ::PostMessage and then run the skew fix when that message is delivered, but to do a PostMessage I think I need a window handle which this code doesn't have access to.
Attachment #8700090 - Flags: review?(bbirtles)
Comment on attachment 8700090 [details] [diff] [review]
Patch

Review of attachment 8700090 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. If Karl is available it would be good for him to check it over too.
Attachment #8700090 - Flags: review?(karlt)
Attachment #8700090 - Flags: review?(bbirtles)
Attachment #8700090 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/203644fd8bbf
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Comment on attachment 8700090 [details] [diff] [review]
Patch

There are a couple of issues here.

One is that I don't know of anything to provide that native messages already
in the queue will be processed before this Gecko event is processed.  Without
that, the skew adjustment can change between two associated native events that
have the same native time, which will result in them no longer having the same
timestamp and so they will appear not to be associated.

The other is that GetMessageTime() returns the time of the last native message
processed and there is nothing to provide that another message will be
generated after aNow was generated.  CompensateForBackwardsSkew() assumes that
aReferenceTime is after aLowerBound, which is not necessarily true here, and so the UpdateReferenceTime() call can over-correct.  Pending events will seem to be
in the future when received and so their timestamps will compressed together.

Using GetTickCount() instead of GetMessageTime() would address the second
issue but not the first.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #28)
> Looking at the
> Linux implementation I suppose it would be better to do a ::PostMessage and
> then run the skew fix when that message is delivered, but to do a
> PostMessage I think I need a window handle which this code doesn't have
> access to.

Yes, I think that would produce the desired results (though I'm not familiar
with the WINAPI methods).
https://msdn.microsoft.com/en-us/library/windows/desktop/ms644944%28v=vs.85%29.aspx
indicates that the hWnd parameter is optional and can have the NULL value.
Attachment #8700090 - Flags: review?(karlt)
Thanks, I filed bug 1234838 to switch to use ::PostMessage. I'll test my patch and put it up shortly.
We were unable to reproduce this. Alistair or Jeff, could you please confirm that this works as expected on latest versions? Thank you!
Flags: needinfo?(jgilbert)
Flags: needinfo?(alistair.vining)
This has worked fine for me on Nightly for months, presumably since the checkin in comment 32.
Flags: needinfo?(alistair.vining)
Thanks a lot! Marking as verified.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(jgilbert)
You need to log in before you can comment on or make changes to this bug.