1658205 - Webrender Android: janky scrolling on mozilla.org

Mark

Reporter

Description

•

4 years ago

Attached image Screenshot_20200809-150433_Firefox Nightly.jpg — Details

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:81.0) Gecko/20100101 Firefox/81.0

Steps to reproduce:

Galaxy A40 (Mali-G71)
Today's Fenix Nightly
Turn on WebRender & restart
navigate to mozilla.org

Actual results:

First fling, inertia scroll stutters as the images fade/animate in, CPU (Renderer) spikes of up to 30 ms, see attached debug profiler.
I see the same on other sites eg bbc.co.uk, particularly as images fade in during fling inertia scrolls.

Expected results:

Smooth inertia scrolling despite images fading/animating in

(I expect this bug is a duplicate, feel free to close it if so)

Thanks!

Jamie Nicol [:jnicol]

Assignee

Comment 1

•

4 years ago

Thanks for the bug report Mark!

Would you be able to capture this jank in the gecko profiler? To do this, go to about:debugging on your desktop. Enable USB devices if required. Plug your phone in to your computer, then in Fenix's settings on your phone you need to enable "Remote debugging via USB". Once that's done, your phone should appear on the left hand side of the about:debugging page on your computer. Connect to it, then select it. Then click the "profile performance" button. Then change the "settings" dropdown to "custom" and click "edit settings". Make sure that the "Renderer" thread is ticked, and that "Native Stacks" is ticked, and please untick the screenshots option. Then click "Save settings and go back" at the top.

Finally click "start recording" and wait a few seconds, then scroll around on your phone to reproduce the jank, then click "capture recording".

I have a hunch this is caused by shader compilation, but the profile should help confirm that.

Flags: needinfo?(mark.paxman99)

Mark

Reporter

Comment 2

•

4 years ago

https://share.firefox.dev/3iquXdd
(hope I did it correctly!)

Today's Fenix Nightly
Galaxy A40
Reload mozilla.org
Start the profile recording
Fling the page top to bottom at moderate speed, it took four or five flings in total.
Nasty Inertia scroll jank as the images fade in
Stop the profile recording & publish

Flags: needinfo?(mark.paxman99)

Mark

Reporter

Comment 3

•

4 years ago

Correction:- I only see the inertia scroll jank the first time I load mozilla.org into a tab. Subsequent reloads of mozilla.org are jank-free until I close the tab & open a new one, then navigate to mozilla.org.

So I see this
1 close all tabs
2 open a new tab
3 navigate to mozilla.org
4 fling down the page
5 horrible inertia scroll jank as images fade in
6 reload mozilla.org
7 inertia scroll is (mostly) smooth even when images fade in
8 close tab
9 open a new tab
go back to 3

I profiled step 4 & 5

BugBot [:suhaib / :marco/ :calixte]

Comment 4

•

4 years ago

The severity field is not set for this bug.
:ktaeleman, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ktaeleman)

Kris Taeleman (:ktaeleman)

Comment 5

•

4 years ago

While this might not be mali specific, adding it to wr-mali for tracking.

Blocks: wr-perf, wr-mali

Severity: -- → S2

Flags: needinfo?(ktaeleman)

Priority: -- → P2

Nicolas Silva [:nical]

Updated

•

4 years ago

Blocks: wr-android-perf
No longer blocks: wr-perf

Kris Taeleman (:ktaeleman)

Updated

•

4 years ago

Blocks: wr-malig-release
No longer blocks: wr-mali

Mark

Reporter

Comment 6

•

4 years ago

See also https://github.com/mozilla-mobile/fenix/issues/16307. My guess is a general issue not just Mali (Comment 5)

Mark

Reporter

Comment 7

•

4 years ago

Scrolling got substantially better in last couple of days, maybe bug 1661528 kicking in????

It's better, but IMHO there is still more jank than with the OpenGL renderer which is butter smooth

A fling of mozilla.org still stutters occasionally as the images fade in, particularly near the top of the page. Profile below, Galaxy A40, Mali G71

https://share.firefox.dev/38Yz7Yi

Thanks!

Jamie Nicol [:jnicol]

Assignee

Comment 8

•

4 years ago

Thanks for the profile! I was going to ask you if you could check again after bug 1661528 landed. :)

Could you try setting gfx.webrender.max-partial-present-rects to 0, restart, and see if that helps? (And attach another profile?) Thanks.

Mark

Reporter

Comment 9

•

4 years ago

Jank is more or less the same

https://share.firefox.dev/3lOCyUZ

Mark

Reporter

Comment 10

•

4 years ago

My Xperia X Compact (Adreno 510) has about the same level of jank with WebRender & is butter smooth with OpenGL. So I think perhaps the residual jank is not Mali specific.

Jamie Nicol [:jnicol]

Assignee

Comment 11

•

4 years ago

That's certainly possible. There is still a large amount of time spent in eglSwapBuffers, that I do not see on other devices. But I'm inclined to agree that the device-specific low hanging fruit has probably been fixed and we're mainly left with optimizing webrender in general for less powerful devices

Jamie Nicol [:jnicol]

Assignee

Comment 12

•

4 years ago

Mark, couple more things that I can think of at this stage. Thanks for your patience!

First, could you try enabling the webrender profiler again (gfx.webrender.debug.profiler = true, like you did in your initial bug report), and attach a screenshot?

Second, could you try enabling gfx.webrender.debug.picture-caching and do a screen recording of the jank on mozilla.com or describe to me what you see. It should cover the screen in green or red rectangular tiles. Hopefully they should remain mostly green and full-sized while you scroll. If they flash red a lot or are completely red and get smaller, then that could explain the poor performance.

Flags: needinfo?(mark.paxman99)

Mark

Reporter

Comment 13

•

4 years ago

FWIW here's a profile from the Xperia, WebRender enabled, fairly janky. gfx.webrender.max-partial-present-rects default (1)
https://share.firefox.dev/35P2QB4

will check out your other stuff now

Mark

Reporter

Comment 14

•

4 years ago

Attached image Galaxy jank.jpg — Details

Screenshot, fling scrolling just the upper part of the page where jank is worst. Galaxy A40 (Mali), WebRender on, gfx.webrender.max-partial-present-rects default (1). Jank was worst during the first fling where all the red peaks are.

Flags: needinfo?(mark.paxman99)

Mark

Reporter

Comment 15

•

4 years ago

Re picture caching I see 4 complete tiles plus part of a 5th tile and fragments of 5 more tiles on RHS of screen. Tiles are green when stationary but during a fling the each tile stays red as the image fades in and then turns green. So in practice several tiles can stay red most of the way up the screen, because the fade-in takes approx whole screen height to complete. Once all fades are complete flinging up & down page all tiles are green. Jank is definitely associated with most of the screen being red. Possibly (????) the jank occurs mostly when the red turns to green ie the animation completes???

(I'm going to try to do a video capture to when in the fade animation the jank occurs, but that's going to take a while, I might even get too bored).

Mark

Reporter

Comment 16

•

4 years ago

When analysing a frame-by-frame of a fling, if the GPU chokes and fails to deliver a frame on the clock tick, what should I see?

My guess is I'll see two adjacent frames with no vertical panning motion, then the third frame will have a big jump in panning to "catch up"??

Like this (Ignoring scroll friction)

smooth scrolling:-
frame         offset 
N             base
N+1           base + pan
N+2           base + 2*pan
N+3           base + 3*pan
etc

jank :-
frame         offset 
N             base
N+1           base + pan
N+2           base + pan <- GPU chokes, no new frame
N+3           base + 3*pan

cheers!

Jamie Nicol [:jnicol]

Assignee

Comment 17

•

4 years ago

I think that sounds right. You might actually find that on the N+3rd frame you only see base + 2 * pan because the frames are pipelined, so that one had already been created assuming it would be presented on time rather than delayed. Then on N+4 it would catch up to base + 4*pan.

Mark

Reporter

Comment 18

•

4 years ago

Yeah ignoring friction it should always move by zero or an integer multiple of pan. But I am sometimes seeing jumps by a fraction of pan. Weird.

I wonder if there is something funny going on with WR interaction with APZ or vsync? A bit like bug 1432019.

This guy was complaining about WR jank when using non-standard APZ settings, perhaps he's on to something.
https://github.com/mozilla-mobile/fenix/issues/16307

I'll try to do a montage of frame by frame, it's tedious :(

I see very similar jank on my S7 Exynos even when screen set to 1280x720. S7 GPU is waaaay more powerful than Adreno 510, it ought to be buttery at low res??? Maybe.

watch this space!

Mark

Reporter

Comment 19

•

4 years ago

Attached image Jank 1.jpg — Details

Here's a frame by frame, earliest frame on left at back, newest frame on right at front. So content is panning upwards. For most frames the content pans by value pan but for one frame it pans by a small fraction of pan. Which I don't understand.

Sometimes it doesn't pan at all, zero offset. That could be a frame drop I guess.

for the fraction of pan case, I wonder if there is a mismatch between the clock tick used by APZ and that used by WebRender??? It looks like sometimes there's a calculation pan = velocity*delta-t which is using an incorrect delta-t.

This is on Galaxy A40, another example to follow

Mark

Reporter

Comment 20

•

4 years ago

Attached image Jank 2.jpg — Details

Another frame by frame. Similar story. Different point on same fling, Galaxy A40. One frame is offset by small fraction of pan

Mark

Reporter

Comment 21

•

4 years ago

Attached image Slide7.jpg — Details

Smoking gun??? Sometimes WebRender advances the Sample counters in the Profiler, and advances the fade animations, so WR has correctly generated a new frame and sent to hardware. But the scroll offset is incorrect, it's only incremented a tiny fraction of the correct frame-to-frame "pan" value. This picture (Slide 7) is a longer sequence, the next slide (Slide 8) is a close up of just 4 frames so you can see the Sample counters increment in the Profiler.

Mark

Reporter

Comment 22

•

4 years ago

Attached image Slide8.jpg — Details

This is the same sequence, just a close up of 4 frames so you can see the Profiler data.

Mark

Reporter

Comment 23

•

4 years ago

This is on the Galaxy A40 with WebRender at its default values. I see pretty much the same on the S7 Exynos with WebRender on, even through the S7 has a much more powerful chipset. I don't see this behavior with the OpenGL compositor. The jank seems worse when the chipset is working hard, eg when it's handling a fade animation. So, mozilla.org janks a lot when I first fling it, because of all the fade animations. But subsequent flings of the same page are smoother (but not perfect). I also see the jank on bbc.co.uk/news and about:config and other sites.

https://github.com/mozilla-mobile/fenix/issues/16307 is right, if you reduce the fling friction (apz.android.chrome_fling_physics.friction = 0.002) the jank becomes much more noticeable. I don't think it changes the amount of jank, I think the long low-friction inertia scrolls just make it easier to see.

My guess is WR is behaving well but something's going wrong in APZ, it sometimes sends incorrect scroll offset values to WR, perhaps because APZ is getting the wrong vsync clock tick???? I wonder if bug 1432019 isn't fixed on WR for some reason?

Mark

Reporter

Comment 24

•

4 years ago

Hi Kats, I'm back to torture you with more APZ stuff (maybe)

:(

Mark

Reporter

Comment 25

•

4 years ago

Attached image Slide5.jpg — Details

One more example & then I'll shut up. In the 4th and 5th frames the fade animation advances correctly towards a darker mustard colour but the content moves upwards only a little bit, much less than it should. But some other frames move maybe twice what they should????

Markus Stange [:mstange]

Comment 26

•

4 years ago

Mark, can you re-test with today's Nightly, and also attach the raw video? And ideally also a profile that was recorded at the same time as the screen recording. Set the profiling interval to 10ms (for lower overhead), and add the RenderBackend thread to the profiled threads.
Also, do you see the "bad pan displacement" frame while the finger is down, or during the inertial animation that runs while after the finger has left the screen?

I recently landed two changes that might have a chance of making a difference here: Bug 1675887 may have improved the scroll delta uniformity during the animation (through more consistent sample timestamps), and bug 1676771 improved the scroll delta uniformity during the pan gesture (while the finger is touching the screen) on devices where the touch sensor runs at a rate that doesn't align cleanly with the display refresh rate.

Markus Stange [:mstange]

Updated

•

4 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Mark

Reporter

Comment 27

•

4 years ago

Hi Markus

I always use latest Nightly, I've just updated to 201121 17:01 Build #2015776971 and the jank is much the same.

The "bad pan displacement" happens when finger is up during the inertial animation. So I expect bug 1675887 is relevant but the other might not be. IIUC the inertial animation should be governed purely by vsync timestamps and apz.android.chrome_fling_physics.friction.

I'll tweak the profiler and try to grab a profile and screen recording of a "bad pan displacement". Sounds tricky.

Mark

Reporter

Comment 28

•

4 years ago

Attached video 3lTW16J.mp4 — Details

https://share.firefox.dev/3lTW16J

The screen recorder + profiler make the phone work very hard, I had to really turn down the screen recorder quality to prevent additional jank. Hope it's OK.

There are at least 2 and maybe 3 bad pan events,

when the Mozilla VPN image is near the top of the screen, the image fades in by one step (correctly) but the scroll offset advances by less than it should (incorrectly)
when the hexagonal data breaches image is center screen, the image fades in by one step (correctly) but the scroll offset doesn't advance at all (incorrectly)
(possible) when the mustard-coloured gate image is center screen, the the image fades in by one step (correctly) but the scroll offset advances by less than it should (incorrectly)

I think also in all cases the frame before the bad pan event has too much scrolling. So I think maybe the scroll stepping goes like this:- normal - normal - normal - TOO BIG - TOO SMALL - normal - normal

Mark

Reporter

Comment 29

•

3 years ago

Attached image Slide9.jpg — Details

I think I see the same on Desktop.

Check WebRender is ON. Open a new tab, navigate to mozilla.org, fling page with screen recorder running = jank. Subsequent flings in the same tab are much less janky.

In the slide attached I see the same behavior as Android, scroll steps are normal - TOO BIG - ZERO - normal - normal. Between frames 2 and 3 the two fade animations progress (particularly obvious on the mustard colored image at the bottom) but the content does not scroll. I see quite often. I haven't yet seen the Android case where content scrolls just a little bit, on Desktop it's mostly normal, too far, or zero.

This is Nightly Desktop Mac 85.0a1 (2020-11-28) (64-bit), I made the Firefox window small to minimise GPU & CPU loads.

Cheers now!

Mark

Reporter

Comment 30

•

3 years ago

PS I think I also see it on Desktop with the OpenGL compositor (WebRender OFF). Exactly the same behavior with the fade animations progressing but the content not scrolling. I checked pretty carefully in about:support that WebRender was off :) I I think I don't see it on Android with the OpenGL compositor.

So I think the "bad pan displacement" issue goes like this (but not 100% sure)

Occasional bad pan displacement on fling with fade animations?
                       WebRender          OpenGL
Android                 Yes                No
Desktop Mac             Yes                Yes

Markus Stange [:mstange]

Comment 31

•

3 years ago

Thanks, Mark!

I think I have managed to reproduce this on my Moto G5. Profile: https://share.firefox.dev/36GGXUY
In that profile, we manage to composite at full 60fps (and there are no SkippedComposite markers on the Compositor thread). However, in some parts of the page, the scroll delta between frames is not uniform, and alternates between big steps and small steps.
In the selected range in the profile above, the scroll deltas are as follows (as observed in the screenshots):
big, small, big, big, small, big, big, small, big, big, big

There's clearly something fishy going on here. We should always be sampling based on the vsync timestamp. The evidence suggests that we're not. Maybe we're sampling with TimeStamp::Now() when a transaction comes in from content?

Mark

Reporter

Comment 32

•

3 years ago

(In reply to Markus Stange [:mstange] from comment #31)

I think I have managed to reproduce this on my Moto G5. Profile: https://share.firefox.dev/36GGXUY
...
big, small, big, big, small, big, big, small, big, big, big

Yes that looks like the same thing, the pan step at 6.68 seconds is much smaller than it should be but is not zero either. Glad you see it too

Maybe we're sampling with TimeStamp::Now() when a transaction comes in from content?

A rhetorical question I hope ;) what little I know about vsync timestamps I learned from from bug 1432019.

Clues????

See Comment 30, I think I see the same issue on OpenGL Desktop Mac too, so the only version which doesn't suffer the issue is OpenGL Android which is what Kats fixed in bug 1432019. I don't think this bug is the same as bug 1432019, the symptoms are different, but perhaps Kats accidentally fixed this bug at the same time as bug 1432019???

On Desktop OpenGL and Desktop WebRender I don’t see the “too small” pan steps, just integer multiples 0, 1 or 2 of the “correct” pan step (it should always be 1). But I do sometimes see the fade animations update with a 0 pan step so there’s definitely something wrong. On Android WebRender I see the funny small pan steps so the steps can can be 0, 1, 1.1 (ish) or 2 of the "correct” pan step. I don't know why Android often shows the small steps but Desktop doesn't (ever?).

On Android OpenGL everything is butter smooth :)

I wonder if there is a race, the jank mostly seems to happen when there’s a lot of compositing/rendering going on. I wonder if a thread gets too busy to pick up the vsync value and uses 0 or Now()? Wild guess I barely even know the words.

If you have the time could you do a build with extra vsync logging like Kats did, see bug 1432019 Comment 17 onwards. I’d be happy to logcat & grep it out and maybe we can see what’s going on? It might take a few iterations and some head scratching.

Thanks for your interest! [the jank is really bugging me now, I see it everywhere :( ]

Mark

Reporter

Comment 33

•

3 years ago

Correction: on Android there's sometimes a small pan step, so it can be 0, 0.1 (ish), 1 or 2 of the correct pan step. On Desktop there seems to be no small pan step, just 0, 1 or 2.

Markus Stange [:mstange]

Comment 34

•

3 years ago

(In reply to Mark from comment #32)

(In reply to Markus Stange [:mstange] from comment #31)

Maybe we're sampling with TimeStamp::Now() when a transaction comes in from content?

A rhetorical question I hope ;) what little I know about vsync timestamps I learned from from bug 1432019.

Oh haha, yes, this was a question for myself so that I would know where to continue when I come back to this bug.

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Updated

•

3 years ago

Comment 35

•

3 years ago

Hi Mark, sorry about the lack of activity on this bug. I'm trying to track down the cause of it now. I don't think I've found the full answer yet but I have found something in the APZ/webrender code that seems not quite right. Would you be able to test a build for me and see if it helps at all?

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UYEt5x3fT7uIeVZ8fw1c0w/runs/0/artifacts/public/build/geckoview_example.apk

Flags: needinfo?(mark.paxman99)

Jamie Nicol [:jnicol]

Assignee

Comment 36

•

3 years ago

(In reply to Markus Stange [:mstange] from comment #31)

Maybe we're sampling with TimeStamp::Now() when a transaction comes in from content?

I've found one cause of an inconsistent timestamp: https://searchfox.org/mozilla-central/rev/f47a4b67643b3048ef9a2e2ac0c34edf6d1ebff3/gfx/layers/apz/src/APZSampler.cpp#119-123

The build in comment 35 removes this, and according to my logging makes the scroll deltas more consistent. I think it appears smoother too, but I don't have the eye for this.

This seems to occur when we perform a sample after a (slow?) scene build. This was added in bug 1653796, and looks like it was intended to fix another source of inconsistent sample times.

(Note that bug 1653796 landed 3 days before this bug was filed, and around the same time that webrender was enabled on Mali. And there were legitimate performance issues on Mali at the time which may have obscured sampling-related issues. So I can't be confident that this is just a regression from that bug, but it does seem like a lead worth following.)

Here is some logging I added:

update_document() frame_id: None has_built_scene: true, render_frame: false
SampleForWebrender() time: 89291499538233, now: 89291500130403. Using now
UpdateAnimation() aSampleTime: 89291500130403, mLastSampleTime: 89291499538233
SampleCompositedAsyncTransform() offset: (0,6822.48), prev: (0,6739.59), delta: (0,82.8895)
GenericFlingAnimation::DoSample() delta: 0.592170ms
SpatialNode::set_scroll_origin() new offset: (0.0, -6739.593), delta: (0.0, -85.933105)

update_document() frame_id: Some(610) has_built_scene: false, render_frame: true
SampleForWebrender() time: 89291516158247, now: 89291500776941. Using sample time
UpdateAnimation() aSampleTime: 89291516158247, mLastSampleTime: 89291500130403
SampleCompositedAsyncTransform() offset: (0,6825.39), prev: (0,6822.48), delta: (0,2.90987)
GenericFlingAnimation::DoSample() delta: 16.027844ms
SpatialNode::set_scroll_origin() new offset: (0.0, -6822.4824), delta: (0.0, -82.88965)
new_frame_ready() render=true

update_document() frame_id: Some(611) has_built_scene: false, render_frame: true
SampleForWebrender() time: 89291532779949, now: 89291516846556. Using sample time
UpdateAnimation() aSampleTime: 89291532779949, mLastSampleTime: 89291516158247
SampleCompositedAsyncTransform() offset: (0,6902.53), prev: (0,6825.39), delta: (0,77.1378)
GenericFlingAnimation::DoSample() delta: 16.621702ms
SpatialNode::set_scroll_origin() new offset: (0.0, -6825.392), delta: (0.0, -2.909668)
new_frame_ready() render=true

The first update_document is from the scene build and we use SampleTime::FromNow() for the sample time. We don't actually render this frame, but the way the sampling seems to work means we don't see the bad offset until 2 frames later. (The SpatialNode::set_scroll_origin() call.) It's not clear to me yet why a bad sample time for a non-rendered frame should matter: as long as the next timestamp is correct shouldn't the total delta between rendered frames should be okay? It's also not clear to me why we sample at all when the frame isn't going to rendered. Removing the sample might fix the issue (but maybe we could still hit the underlying problem for frames which are rendered.)

Markus, Botond, perhaps either of you might be able to point me in the direction of where to look next. Specifically, does using SampleTime::FromNow() when we've missed the composite seem correct? If so, can we make the subsequent samples more resilient to it? If not, then do we need an alternative fix for bug 1653796?

Flags: needinfo?(mstange.moz)

Flags: needinfo?(botond)

Mark

Reporter

Comment 37

•

3 years ago

Substantial improvement on my Galaxy A40 (Mali) and Galaxy S7 (Exynos) on bbc.co.uk/news which has a lot of image fade-ins. mozilla.org is perhaps better but far from perfect. I would say OpenGL is still better on both sites but the gap is narrowed.

I guess there might be other issues causing jank eg WebRender dropping a frame. My S7 has a more powerful GPU so I'd hoped it could give a smoother scroll but not so, the A40 and S7 are much the same on mozilla.org.

Let me fool around for a day or two, I might even do some videos, I'll try to see if the funny "half step" is still present.

Thanks!

Flags: needinfo?(mark.paxman99)

Jamie Nicol [:jnicol]

Assignee

Comment 38

•

3 years ago

Thanks Mark, much appreciated. That seems encouraging that it helps on the BBC, but yes it's probably not the full picture and/or there may be other issues at play too. Hopefully we can track them all down

Mark

Reporter

Comment 39

•

3 years ago

Attached image Jank.jpg — Details

I can't see any more of those funny half steps as discussed around comment 31. It's possible I missed them, they were hard to spot. But I think it's a step forward (if you'll pardon the pun). Great!

Does your proposed fix also apply to Desktop, I think I saw the same issue there?

I think some or all of the remaining jank on mozilla.org is WebRender choking the CPU and dropping frames. mozilla.org seems particularly difficult for WebRender (irony). When the big dome image fades in there is often a huge CPU spike 25-30 ms Frame CPU Total, then there are several similar spikes more or less coinciding with other jank. See attached screenshot.

Want me to file a separate bug?

Jamie Nicol [:jnicol]

Assignee

Comment 40

•

3 years ago

I don't think there's anything android specific about the code in question, so you might well have seen the same thing on desktop too.

Yes, please do file a new bug. Let's make this one about the sampling issue (the half steps), and deal with the slow frames separately.

Mark

Reporter

Comment 41

•

3 years ago

https://bugzilla.mozilla.org/show_bug.cgi?id=1694471

Markus Stange [:mstange]

Comment 42

•

3 years ago

(In reply to Jamie Nicol [:jnicol] from comment #36)

(In reply to Markus Stange [:mstange] from comment #31)

Maybe we're sampling with TimeStamp::Now() when a transaction comes in from content?

I've found one cause of an inconsistent timestamp: https://searchfox.org/mozilla-central/rev/f47a4b67643b3048ef9a2e2ac0c34edf6d1ebff3/gfx/layers/apz/src/APZSampler.cpp#119-123

Good find!

This seems to occur when we perform a sample after a (slow?) scene build. This was added in bug 1653796, and looks like it was intended to fix another source of inconsistent sample times.

Hmm, I think this code is showing some confusion from the fact that scene building and frame building aren't properly separated at the API level. In Firefox, scene building doesn't cause rendering. It only tells our compositor that a frame is needed. And then at vsync, we kick off a frame build. We should only sample animations during frame building, never during scene building.

It's not clear to me yet why a bad sample time for a non-rendered frame should matter: as long as the next timestamp is correct shouldn't the total delta between rendered frames should be okay?

I have the same question.

Flags: needinfo?(mstange.moz)

Timothy Nikkel (:tnikkel)

Comment 43

•

3 years ago

(In reply to Jamie Nicol [:jnicol] from comment #36)

I've found one cause of an inconsistent timestamp: https://searchfox.org/mozilla-central/rev/f47a4b67643b3048ef9a2e2ac0c34edf6d1ebff3/gfx/layers/apz/src/APZSampler.cpp#119-123

The build in comment 35 removes this, and according to my logging makes the scroll deltas more consistent. I think it appears smoother too, but I don't have the eye for this.

I tried your backout patch of bug 1653796 with the STR from bug 1653796 plus my own time stamp logging code and I can still reproduce the non-monotonicity in comment 0 of bug 1653796. (I can't seem to reproduce the original bug though.) ie

0x135430000 AsyncPanZoomController::StartAnimation mLastSampleTime 5816.370672 now
0x135430000 AsyncPanZoomController::UpdateAnimation mLastSampleTime 5284.127645 vsync
0x135430000 AsyncPanZoomController::UpdateAnimation mLastSampleTime 5833.106252 vsync

Where the last number on each line is the TimeStamp in mLastSampleTime minus the creation time of |this| saved in a new member, followed by the type of SampleTime it is.

So seems like backing out that patch is probably going to regress something even if it doesn't regress the original STR.

Jamie Nicol [:jnicol]

Assignee

Comment 44

•

3 years ago

(In reply to Markus Stange [:mstange] from comment #42)

We should only sample animations during frame building, never during scene building.

This was my conceptual understanding too. If we remove the sample after scene building, that should make the non-monotonic timestamps of bug 1653796 less likely. Nical thinks it was added to satisfy some APZ tests, so I will do a try run and see where we stand.

But it seems like the non-monotonicity in bug 1653796 is caused equally by the post-scene-build timestamp being in the past, but also because the animations start time was GetFrameTime() ie SampleTime::FromNow(). Shouldn't the start time be based off of vsync too? FWIW when I do a fling animation I can see that the 2nd or so frame has an unusual step size which appears to be due to this.

(In reply to Timothy Nikkel (:tnikkel) from comment #43)

So seems like backing out that patch is probably going to regress something even if it doesn't regress the original STR.

Thanks for testing Timothy. Yeah I figured we couldn't get a way with just reverting that. Could you test also removing the || has_built_scene condition from here and see if that avoids your problem?

Flags: needinfo?(timothygu99)

Timothy Nikkel (:tnikkel)

Comment 45

•

3 years ago

(In reply to Jamie Nicol [:jnicol] from comment #44)

Thanks for testing Timothy. Yeah I figured we couldn't get a way with just reverting that. Could you test also removing the || has_built_scene condition from here and see if that avoids your problem?

I wasn't able to reproduce if I also made the change to render_backend.rs.

Jamie Nicol [:jnicol]

Assignee

Comment 46

•

3 years ago

Great. A try run with the same change looks good, so whatever issue Nical added the sample for has gone away (or I ran the wrong tests).

I still have concerns about the fragility that a rogue sample which doesn't get rendered causes problems for later, rendered samples. And I think that the fact animations are started from SampleTime::FromNow() instead of being vsync based can cause unusual scroll deltas at the start of a fling. But, I think landing this change (to stop sampling after a scene build) makes sense. It's a step in the right direction and fixes a real issue. Does that sound reasonable?

Timothy Nikkel (:tnikkel)

Comment 47

•

3 years ago

(In reply to Jamie Nicol [:jnicol] from comment #46)

I still have concerns about the fragility that a rogue sample which doesn't get rendered causes problems for later, rendered samples. And I think that the fact animations are started from SampleTime::FromNow() instead of being vsync based can cause unusual scroll deltas at the start of a fling. But, I think landing this change (to stop sampling after a scene build) makes sense. It's a step in the right direction and fixes a real issue. Does that sound reasonable?

Not sure if this question is directed at me or not but that sounds fine to me.

Jamie Nicol [:jnicol]

Assignee

Comment 48

•

3 years ago

It was directed at anyone who knows more about APZ than me :)

Timothy Gu [:TimothyGu]

Comment 49

•

3 years ago

(clearing needinfo; don't think I'm the right Tim…)

Flags: needinfo?(timothygu99)

Botond Ballo [:botond]

Comment 50

•

3 years ago

(In reply to Jamie Nicol [:jnicol] from comment #46)

I still have concerns about the fragility that a rogue sample which doesn't get rendered causes problems for later, rendered samples.

Perhaps the proper solution here would be to do the thing that this comment describes as being hard to plumb in, i.e. "ideally we would be able to know when the current vsync interval ends, and use that timestamp"?

And I think that the fact animations are started from SampleTime::FromNow() instead of being vsync based can cause unusual scroll deltas at the start of a fling.

I assume you're referring to this line. I agree that this is inconsistent; it predates our use of vsync timestamps for sampling animations, and we likely didn't notice it thus far because it only affects one frame per animation.

But, I think landing this change (to stop sampling after a scene build) makes sense. It's a step in the right direction and fixes a real issue. Does that sound reasonable?

+1

Flags: needinfo?(botond)

Jamie Nicol [:jnicol]

Assignee

Comment 51

•

3 years ago

Attached file Bug 1658205 - Don't sample APZ post webrender scene build. r?botond — Details

Conceptually we should only sample APZ on frame build, never on scene
build. When a late scene build occured, this unnecessary sample was
the underlying cause of the non-monotonic sample times as seen in bug
1653796. The fix for bug 1653796 introduced non-vsync-aligned sample
times to work around this, but these lead to inconsistent scroll
deltas during fling animations, appearing as janky scrolling on
Android.

This patch fixes the underling issue, by removing the sample after
scene building. This means we no longer hit the case where sample
times are in the past after a late scene build, allowing us to revert
bug 1653796 to ensure that sample times remain vsync-aligned.

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → jnicol

Status: NEW → ASSIGNED

Pulsebot

Comment 52

•

3 years ago

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4e6e23275c5f
Don't sample APZ post webrender scene build. r=botond,nical

Andreea Pavel [:apavel]

Comment 53

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4e6e23275c5f

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox88: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 88 Branch

Mark

Reporter

Comment 54

•

3 years ago

Just dropped by to say scrolling is substantially smoother on bbc.co.uk/news and mozilla.org since this landed. Thanks! Main remaining issue on mozilla.org (and some other sites) is the jank when images fade in, that's bug 1694471.

Screenshot_20200809-150433_Firefox Nightly.jpg 4 years ago Mark 589.73 KB, image/jpeg		Details
Galaxy jank.jpg 4 years ago Mark 684.22 KB, image/jpeg		Details
Jank 1.jpg 4 years ago Mark 249.42 KB, image/jpeg		Details
Jank 2.jpg 4 years ago Mark 258.92 KB, image/jpeg		Details
Slide7.jpg 4 years ago Mark 388.97 KB, image/jpeg		Details
Slide8.jpg 4 years ago Mark 393.51 KB, image/jpeg		Details
Slide5.jpg 4 years ago Mark 222.74 KB, image/jpeg		Details
3lTW16J.mp4 4 years ago Mark 704.09 KB, video/mp4		Details
Slide9.jpg 3 years ago Mark 228.95 KB, image/jpeg		Details
Jank.jpg 3 years ago Mark 987.42 KB, image/jpeg		Details
Bug 1658205 - Don't sample APZ post webrender scene build. r?botond 3 years ago Jamie Nicol [:jnicol] 48 bytes, text/x-phabricator-request		Details \| Review