Open Bug 1502334 Opened 3 years ago Updated 3 years ago

Reddit content is no longer displayed when scrolled with a keyboard

Categories

(Core :: Graphics: Layers, defect, P2)

Desktop
Windows 10
defect

Tracking

()

Tracking Status
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- fix-optional

People

(Reporter: cbaica, Unassigned)

Details

(Keywords: regression)

Attachments

(1 file)

Attached video reddit scroll bug
[Affected versions]:
- Fx63.0 RC
- Fx64.0b4
- Fx65.0a1

[Affected platforms]:
- Windows 10 x64

[Steps to reproduce]:
1. Launch Firefox.
2. Go to www.reddit.com
3. Scroll the page down to load a lot of posts (creating a long rendered post list that can be scrolled using the keyboard).
4. Start scrolling up using the keyboard by pressing the 'up' arrow.

[Expected result]:
- Scroll operation is done smoothly and without any issues.

[Actual result]:
- At some point the content disappears, only a grey background is displayed.

[Regression range]:
- I will come back with a regression range ASAP.

[Additional notes]:
- The issue does not occur on macOS or Ubuntu.
- For some reason, when using a screen recorder (OBS Studio in my case), the issue stops happening.
I suspect this is something broken in our Async Panning & Zooming code (or possibly graphics/painting, given that a screen recorder makes the problem go away, presumably by prompting more aggressive screen-repainting).

A regression window would be super-useful for tracking down the issue here!
Component: Layout: Scrolling and Overflow → Panning and Zooming
Last known good build: 2016-05-30 
First known bad build: 2016-05-31

Best I could narrow it down to was this change set:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3435dd7ad71fe9003bdeee18fd38d815e033beef&tochange=111970c738234569c8c180319155327316335deb
Too late to fix in 63 but we could still take a patch for 65 (and maybe 64).
In that range, bug 1274528 jumps out as possibly-related.

(It's entirely believable that the STR here cause us to reach the limit on active layers, and hence that bug 1274528 would've caused a change in behavior for this STR.)
Blocks: 1274528
Component: Panning and Zooming → Graphics: Layers
I doubt that bug 1274528 is responsible: layers.max-active only affects Android (it is infinite on desktop), there aren't very many layers on reddit, and that change makes us less likely to hit any limit.

Having said that, there's nothing else in that range immediately sticking out to me. I'll investigate further.
I tried bisecting myself. While I'm not confident in what I found, as I'm only able to reproduce sporadically, I did reproduce on several builds earlier than the regression range posted above.

The earliest I reproduced on was 2016-03-23, but as I said, I'm not confident this is when the problem was introduced.
No longer blocks: 1274528
I'm actually able to reproduce this on my Linux machine too, and on windows with the old compositor, advanced layers, or webrender.

I think the problem here is that the network is too slow at loading new content as you scroll. If I throttle the network in devtools the problem seems to be much more pronounced.

Here's a profile, in which the problem occurred at the end for a second or two, then the content became visible again right before I captured the profile: https://perfht.ml/2zyL182

Here's a profile in which I managed to capture before the content became visible again: https://perfht.ml/2zAukJx

The painting and compositing looks fine in both, except for at the end in the first one there's a large gap between 2 composites, with a fairly long style, reflow, and rasterize time. Note that the problem had already occurred by this stage though. Since so much content had changed since the last paint, it's not surprising it took a little longer than usual.

I'm fairly confident this isn't a gfx issue, but don't know how to investigate it any further.

Kats, might this be a panning and zooming issue? Or some other component?

I can reproduce this locally, and based on what I see it looks like garden-variety checkerboarding. However, it's not actually APZ checkerboarding, because we're not generating reports in about:checkerboard - so it's a content-side checkerboarding (i.e. the content isn't getting put into the layers that APZ is compositing). This agrees with what Jamie said in comment 7 and with what his profiles are showing.

I think next steps here would be to try and understand how the page is doing its infinite list scrolling implementation, and why the content isn't getting into the DOM promptly enough. I did notice that while scrolling the page is sending out XHR POST requests but it doesn't seem to be actually fetching content, more like it's reporting which content is visible to the user. But maybe it's blocking on those XHR responses before inserting stuff into the DOM, or something. Also probably relevant is that when holding down the "up" key we end up doing lots of small scrolls at regular intervals (compared to wheel scrolling or scrollbar dragging) and so it could be that the event timing causes the page to delay DOM updates for much longer.
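
To illustrate the event-timing theory, here is a minimal sketch, assuming the page postpones its DOM update with a trailing-edge debounce. The debounce itself, the 100 ms wait, and the 33 ms key-repeat interval are all hypothetical; nothing here is taken from Reddit's actual script.

```python
def debounced_fire_times(event_times_ms, wait_ms):
    """Return the times at which a trailing-edge debounced callback fires.

    Each event restarts a timer of `wait_ms`; the callback runs only when
    the timer expires without a newer event arriving first.
    """
    fires = []
    for i, t in enumerate(event_times_ms):
        deadline = t + wait_ms
        is_last = i == len(event_times_ms) - 1
        # The timer survives only if no later event lands before (or at) its deadline.
        if is_last or event_times_ms[i + 1] > deadline:
            fires.append(deadline)
    return fires


# Hypothetical inputs: holding the arrow key yields a scroll event every 33 ms
# for about two seconds; a few wheel ticks are spaced 250 ms apart.
key_repeat = list(range(0, 2000, 33))
wheel_ticks = [0, 250, 500, 750]

key_fires = debounced_fire_times(key_repeat, wait_ms=100)
wheel_fires = debounced_fire_times(wheel_ticks, wait_ms=100)

print(key_fires)    # a single update, only after the key is released
print(wheel_fires)  # one update shortly after every wheel tick
```

Under these assumptions the closely spaced key-repeat events keep resetting the timer, so the update runs once at the very end, while sparser wheel events each get their own update.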

Either way this doesn't seem like a graphics or APZ bug, but maybe DOM/network or tech evangelism?

Flags: needinfo?(kats)

Jessie, can you help find someone to take a look? kats describes what he thinks is going on in comment 9.
Though if we decide it is a problem for Reddit itself to fix, please let me know.

Component: Graphics: Layers → DOM: Networking
Flags: needinfo?(jbonisteel)

From Kats' comment:

Either way this doesn't seem like a graphics or APZ bug, but maybe DOM/network or tech evangelism?

redirecting to appropriate triage owner.

Flags: needinfo?(jbonisteel) → needinfo?(sdeckelmann)

Honza -- can you have a look at this please?

Flags: needinfo?(sdeckelmann) → needinfo?(honzab.moz)
Priority: -- → P2
Whiteboard: [necko-triaged]

If I am able to reproduce it, then yes.

Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)

Had to look at the video instead of trying to reproduce locally (which I was able to, exactly as shown in the video attached to this bug). This is I believe a rendering problem. However, I'd like to investigate why we are so slow to repaint.

Moving a component.

Component: DOM: Networking → Graphics: Layers
Whiteboard: [necko-triaged]

(In reply to Honza Bambas (:mayhemer) from comment #14)

This is I believe a rendering problem.

Do you have any justification for this? In particular do you have anything that contradicts what I said in comment 9?

Flags: needinfo?(honzab.moz)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #15)

(In reply to Honza Bambas (:mayhemer) from comment #14)

This is I believe a rendering problem.

Do you have any justification for this? In particular do you have anything that contradicts what I said in comment 9?

"I think the problem here is that the network is too slow at loading new content as you scroll." from comment 7 seems a bit out. The problem appears when you scroll up, so all the content is already loaded and "DOM'ified." What I can see is that layers are not rendered, not individual content, like images. The whole frame area that is scrolled appears to be non-re-rendered. But I have to confess that I was not monitoring network activity.

Can you please explain in more detail what you say in comment 9 for someone unfamiliar with checkerboarding? I really would love to investigate this (having some local tools at hand). I think a private chat/vidyo with you, me and Jamie about what I have in mind could be in order. Will contact you on IRC shortly.

Flags: needinfo?(honzab.moz) → needinfo?(kats)

No need for ni?, we are going to chat f2f.

Flags: needinfo?(kats)

(In reply to Honza Bambas (:mayhemer) from comment #17)

No need for ni?, we are going to chat f2f.

Honza, do you have an update on this bug? Given that this is a P2, do you intend to work on this for 67? Thanks

Flags: needinfo?(honzab.moz)

(In reply to Pascal Chevrel:pascalc from comment #18)

(In reply to Honza Bambas (:mayhemer) from comment #17)

No need for ni?, we are going to chat f2f.

Honza, do you have an update on this bug? Given that this is a P2, do you intend to work on this for 67? Thanks

Actually, I work on this bug most of my time! This is currently my primary Backtrack use case.

So far I can say the following:

  • I can track repaints to the place where invalidation bits are set for it; so far this doesn't seem enough to find the right path to the actual source, hence all the following bits are so far just theories I can't confirm
  • it seems that there is a lot of content code that removes and re-adds dom nodes from scroll change notifications
  • there are also content timers involved in the processing of dom with regard to scrolling
  • (I'm reproducing the problem with mouse, not with arrow up key) a single mouse event on the critical path (if tracked correctly) to the repaint may take several hundred milliseconds (lot of removals and few complicated additions to the dom tree)

Overall, I'm still missing some tracking bits. I'm just today about to involve more people in the process to add more exact tracking that may reveal the exact cause.

As this is P2, I don't consider this a blocker for 67. And I'm sure this problem has been there for a long time, but I was not trying to hunt for a regression range, as it's not the way I want to approach this bug.

Does that answer your question?

Flags: needinfo?(honzab.moz) → needinfo?(pascalc)

(In reply to Honza Bambas (:mayhemer) from comment #19)

As this is P2, I don't consider this a blocker for 67. And I'm sure this problem has been there for a long time, but I was not trying to hunt for a regression range, as it's not the way I want to approach this bug.

Does that answer your question?

I was actually asking because it is a P2 ("Fix in the next release or iteration" https://mozilla.github.io/bug-handling/triage-bugzilla#how-do-you-triage), but given your explanations maybe it should be moved to the backlog (P3).

Flags: needinfo?(pascalc)

(In reply to Pascal Chevrel:pascalc from comment #20)

(In reply to Honza Bambas (:mayhemer) from comment #19)

As this is P2, I don't consider this a blocker for 67. And I'm sure this problem has been there for a long time, but I was not trying to hunt for a regression range, as it's not the way I want to approach this bug.

Does that answer your question?

I was actually asking because it is a P2 ("Fix in the next release or iteration" https://mozilla.github.io/bug-handling/triage-bugzilla#how-do-you-triage), but given your explanations maybe it should be moved to the backlog (P3).

Ah, then yes, P3 is probably more appropriate for this. OTOH, I really keep working on this 80+% of my time.

FWIW, I can't reproduce the issue on Firefox nightly on my Linux box, but now I could see the same issue on Chrome on the same Linux box. :)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)

FWIW, I can't reproduce the issue on Firefox nightly on my Linux box, but now I could see the same issue on Chrome on the same Linux box. :)

With throttling network actually.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #23)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)

FWIW, I can't reproduce the issue on Firefox nightly on my Linux box, but now I could see the same issue on Chrome on the same Linux box. :)

With throttling network actually.

Interesting point. But my research so far never showed a delay because of network in Fx.

To sum my findings so far:

  • there is a content script that adds display: none; height: current-height; styling on article-wrapping <div>s when they are off screen; the inner HTML is preserved, though, and nothing seems to be removed from and later re-added to the DOM
  • on a scroll event, it seems this content script starts periodic observation via a timer for position changes (intervals of a few milliseconds or a few tens of milliseconds)
  • the intended behavior of the script seems to be to remove the display: none styling from the <div>s when they are (or are soon to be?) in the visible part of the viewport again

My theory is that either processing the styling change takes us too long (I can see single vsync tick processing on the main thread occasionally take some 600ms+, which sometimes blocks repaints from happening) or that we give the script wrong (too old, maybe) coordinate information.

Note that it takes about 900ms or less on my screen to scroll one page with the arrow key. So, if there is a cumulative delay close to this number, the divs intended to appear do appear, but just under the edge of the screen.
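
As a rough worked example of that arithmetic, assuming a constant scroll speed and a hypothetical 1080 px viewport (only the 900 ms and 600 ms figures come from the observations above):

```python
def lag_in_pixels(delay_ms, page_scroll_ms, viewport_px):
    """Distance scrolled while one delayed update is still pending,
    assuming the scroll speed is constant."""
    return viewport_px * delay_ms / page_scroll_ms


# 600 ms main-thread stall, 900 ms to scroll one page, 1080 px viewport (assumed).
lag_600 = lag_in_pixels(600, 900, 1080)
# A cumulative delay equal to the page-scroll time.
lag_900 = lag_in_pixels(900, 900, 1080)

print(lag_600, lag_900)
```

With these numbers a 600 ms stall corresponds to two thirds of a page scrolled, and a 900 ms cumulative delay to a full page, so content revealed for the old position would land entirely outside the current viewport.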

I currently don't work on this bug.

Assignee: honzab.moz → nobody