Rough scrolling in GeckoView

VERIFIED FIXED in Firefox 68

Status

defect
P1
normal
VERIFIED FIXED
2 months ago
a month ago

People

(Reporter: rbarker, Assigned: emilio)

Tracking

(Blocks 2 bugs, Regression, {regression})

unspecified
mozilla68
Unspecified
Android
Dependency tree / graph

Firefox Tracking Flags

(firefox66 unaffected, firefox67 unaffected, firefox68 fixed)

Details

(Whiteboard: [geckoview:fxr:p1] [geckoview:fenix:p1] [gvtv:p2])

Attachments

(4 attachments)

Reporter

Description

2 months ago

When scrolling in GeckoView, scrolling is now no longer smooth. Bisecting m-c has show https://bugzilla.mozilla.org/show_bug.cgi?id=1535788 to be the cause of the regression. This scrolling regression is particularly noticeable in Firefox Reality where scrolling now causes a strobing effect. Stepping through video it appears that the page no longer smoothly tracks the finger and will miss frames and then jump to catch up.

Reporter

Updated

2 months ago
Regressed by: 1535788
Whiteboard: [geckoview:fxr:p1]
Assignee

Comment 1

2 months ago

I'm happy to take a look at this, but FWIW I'm somewhat surprised bug 1535788 regressed something like this (it effectively just moved some object around, though arguably wasn't a small change).

Specially I'm surprised it regressed scrolling performance, given scrolling is asynchronous. Though Geckoview is the only one using some of the stylesheet service features, so... maybe?

Is there a chance to get a profile or some more detailed information?

Reporter

Comment 2

2 months ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

I'm happy to take a look at this, but FWIW I'm somewhat surprised bug 1535788 regressed something like this (it effectively just moved some object around, though arguably wasn't a small change).

Specially I'm surprised it regressed scrolling performance, given scrolling is asynchronous. Though Geckoview is the only one using some of the stylesheet service features, so... maybe?

Is there a chance to get a profile or some more detailed information?

I'm trying to capture video to show the issue. I was surprised as well after doing the bisect. To make double sure, I checked out a branch with 1535788 at the head. The issue is 100% reproducible. I then backed it out and the issue is gone. Unfortunately it is very hard to see on a phone. However it is very visible on the Oculus Go headset running Firefox Reality. I would guess one reason it is more noticeable in VR is that the pixel density is 1.

Assignee

Comment 3

2 months ago

I assume you mean https://hg.mozilla.org/mozilla-central/rev/60c20a0f320c? (I guess there's a bit of a chance of it not building, but the other one surely shouldn't have caused hazards).

Just wanted to double-check, since between the first and the second commit from that bug there's a bunch of other stuff that landed.

Reporter

Comment 4

2 months ago

This is the commit: https://phabricator.services.mozilla.com/D23903 GeckoView is able to build with that revision as the head and it exhibits the issue. If I back that revision out, the strobing and flickering go away.

Reporter

Comment 5

2 months ago
Posted video bad-trim720.mov

This video shows the finger moving and composite happening every other frame.

Reporter

Comment 6

2 months ago
Posted video good-trim720.mov

This shows the finger and composite happening in the same frame.

Reporter

Comment 7

2 months ago

Note: the videos run at 60Hz. In order to see the issue it is necessary to step through one frame at a time.

Assignee

Comment 8

2 months ago

Is there any chance to see a profile of that? I don't have any VR hardware handy.

Also, do you know if fxr does something with the stylesheet service or such to inject stylesheets? That's pretty much the only thing that I'd think can affect this.

Reporter

Comment 9

2 months ago

The page used in the video was https://webxr.today which is the default home page for Firefox Reality.(In reply to Emilio Cobos Álvarez (:emilio) from comment #8)

Is there any chance to see a profile of that? I don't have any VR hardware handy.

I can try and get some profiles tomorrow.

Also, do you know if fxr does something with the stylesheet service or such to inject stylesheets? That's pretty much the only thing that I'd think can affect this.

For this page it does not alter the css or run any content scripts. There are content scripts for youtube.com but they should not be active for the page in the videos.

The page used in the video is https://webxr.today which is the default home page for Firefox Reality.

Reporter

Comment 10

2 months ago

Profile with scrolling regression present: https://perfht.ml/2OYYL3b
Using GeckoView nightly 68.0.20190404063228

Reporter

Comment 11

2 months ago

Profile with out scrolling regression: https://perfht.ml/2uXFqGC
Using GeckoView nightly 68.0.20190403060632

Reporter

Comment 12

2 months ago

Profiles were captured on Oculus Go using https://webxr.today webpage.

Flags: needinfo?(emilio)

Huh, at first glance looking at the marker chart, both of those profiles have significant jank, and it's hard for me to see a significant difference.

Ehsan, anything obvious I'm missing?

Flags: needinfo?(ehsan)
Whiteboard: [geckoview:fxr:p1] → [geckoview:fxr:p1] [geckoview:fenix:p1]
Whiteboard: [geckoview:fxr:p1] [geckoview:fenix:p1] → [geckoview:fxr:p1] [geckoview:fenix:p1] [gvtv:p2]
Assignee

Comment 14

2 months ago

Styling time seems comparable in both cases, I don't see any significant difference there either.

Longest rasterization is slower in the profile with the regression. One of them is painting a bunch of gradients, the other one is painting a bunch of shadows. But I don't know how my patch could've potentially caused that, I think it's more likely that the content on-screen was different at the point they were taken...

Also, I thought GeckoView used e10s these days, and the profile claims to have been taken on Fennec, which is weird... Do you know what's up with that?

Flags: needinfo?(emilio) → needinfo?(rbarker)

Comment 15

2 months ago

(In reply to Bobby Holley (:bholley) from comment #13)

Huh, at first glance looking at the marker chart, both of those profiles have significant jank, and it's hard for me to see a significant difference.

Ehsan, anything obvious I'm missing?

I don't really have a lot of context for this problem, but I think you should start by finding what's interesting to compare in the two profiles. Randall is probably the best person to help walk you through what's happening on the FXR side, but from the existing info on the bug it looks like the problem is that we're running out of frame budget in some cases, so the first step would be to narrow down which parts of the before and after profile would correspond to the interesting times to pay attention to.

Flags: needinfo?(ehsan)
Reporter

Comment 16

2 months ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #14)

Also, I thought GeckoView used e10s these days, and the profile claims to have been taken on Fennec, which is weird... Do you know what's up with that?

GeckoView can use e10s but e10s is still broken on android for a number of WebVR sites so it is currently disable in FxR. I don't believe the profiler knows how to distinguish between GeckoView based apps and Fennec. I collected the profiles from the no-api version of FxR which runs on a standard phone.

Flags: needinfo?(rbarker)
Reporter

Comment 17

2 months ago

(In reply to :Ehsan Akhgari from comment #15)

(In reply to Bobby Holley (:bholley) from comment #13)

Huh, at first glance looking at the marker chart, both of those profiles have significant jank, and it's hard for me to see a significant difference.

Ehsan, anything obvious I'm missing?

I don't really have a lot of context for this problem, but I think you should start by finding what's interesting to compare in the two profiles. Randall is probably the best person to help walk you through what's happening on the FXR side, but from the existing info on the bug it looks like the problem is that we're running out of frame budget in some cases, so the first step would be to narrow down which parts of the before and after profile would correspond to the interesting times to pay attention to.

So maybe the regression is caused by this patch degrading performance enough that pages already on edge of missing a frame are pushed over the threshold?

Assignee

Comment 18

2 months ago

(In reply to Randall Barker [:rbarker] from comment #16)

GeckoView can use e10s but e10s is still broken on android for a number of WebVR sites so it is currently disable in FxR. I don't believe the profiler knows how to distinguish between GeckoView based apps and Fennec.

I see, thanks.

I collected the profiles from the no-api version of FxR which runs on a standard phone.

Do you know how can I build that locally? Do I need WebVR hardware to see the regression? I'd like to at least run a FxR build with some logging around the code I touched to confirm nothing silly is going on there.

(In reply to Randall Barker [:rbarker] from comment #17)

So maybe the regression is caused by this patch degrading performance enough that pages already on edge of missing a frame are pushed over the threshold?

Even if that was the case, my patch should be performance-neutral, so I still want to understand what's going on.

Reporter

Comment 19

2 months ago

FxR GitHub repository: https://github.com/MozillaReality/FirefoxReality

The README has instructions for building with a local build of GeckoView. If you select the no-api build variant in AndroidStudio you can run it on a regular phone with out any special VR hardware.

The regression is harder to see on a regular phone than in VR hardware, but the video I attached to this bug was captured from the no-api version so it does happen there.

It's probably more noticeable on the Oculus Go because the hardware is so under powered.

Flags: needinfo?(emilio)

Emilio, what are the next steps for this bug? Randall has STR, a regression range, and a video demonstrating the frame drops. We can get you an Oculus Go headset or a low-end Android phone if that will help with debugging.

Can we back out style set bug 1535788 in the meantime? Randall says the bug 1535788 can no longer be backed out cleanly because other style changes since landed.

(In reply to Chris Peterson [:cpeterson] from comment #20)

Emilio, what are the next steps for this bug?

It's on my list of things to do getting a FxR no-api build and try to repro / investigate the regression.

We can get you an Oculus Go headset or a low-end Android phone if that will help with debugging.

According to Randall it should repro on a regular phone, so I'll try to repro on my phone first, but I'll shout if I can't figure out.

Can we back out style set bug 1535788 in the meantime? Randall says the bug 1535788 can no longer be backed out cleanly because other style changes since landed.

I'd rather not back it out. It's not only a big change, but it's also a necessary step to fix some important compat issues like bug 1490401 (which is probably what causes it not to be rotten).

I think I can reproduce this with a simple geckoview_example build on the HD8 (no FxR required).

I'll try to find a moment tomorrow to take a look.

Flags: needinfo?(emilio) → needinfo?(bobbyholley)

So, to put here some of what we've been debugging over IRC, so I don't forget...

  • This reproduces in GVE.
  • We came up with a consistent way to "see" the regression (via layers.acceleration.draw-fps). In the "bad" build the "[CC] Build" time is consistently larger (in the "good" build it's always < 2ms).
  • The regression only happens on the first session. Anything that causes the session to be re-created (switching to private mode, toggling e10s...) makes the bug go away.
  • I thought that it was going to be because of this code calling into widget/ too early during startup, but it turns out not to be the case.

So this is likely some pre-existing bug that my patch happened to somehow trigger. But I still need to figure out what exactly and how.

Those profiles show the difference clearly, and they're from the same build.

In the "bad" case, we spend a lot of time waiting on the graphics driver. In the "good" case, we don't.

But I don't know yet why... Or how this relates to my change at all.

I'll figure this out no matter what, this is so bizarre :)

Assignee: nobody → emilio

Thanks emilio!

Flags: needinfo?(bobbyholley)

(In reply to Randall Barker [:rbarker] from comment #10)

Profile with scrolling regression present: https://perfht.ml/2OYYL3b
Using GeckoView nightly 68.0.20190404063228

(In reply to Randall Barker [:rbarker] from comment #11)

Profile with out scrolling regression: https://perfht.ml/2uXFqGC
Using GeckoView nightly 68.0.20190403060632

The marker chart shows a pretty clear difference in the timing of the composites.
Bad: https://perfht.ml/2UfXtlw Good: https://perfht.ml/2UfXyFQ

In the "good" profile, there is one composite per Vsync tick, which is what we would expect. 1 1 1 1
In the "bad" profile, on every other vsync tick, there are two composites! 1 2 1 2

I don't know what could cause those extra composites. Extra composites can definitely cause stuttering.

The other difference I can see is in the types of DOM events that get triggered on the main thread:
In the "good" profile, the types of DOM events on the main thread are "scroll", "scrollend" and "mozvisualscroll". https://perfht.ml/2Ug7SNY
In the "bad" profile, in addition to those events, we also seem to fire "touchmove" events. https://perfht.ml/2UgyGxw

Maybe the touchmove events cause extra composites somehow? And maybe those are only dispatched because of emilio's change?

(I think the profiles in comment 24 are easier to analyze, since they are from the same build, and repeating the same action, which is just scrolling about:support).

I manually bisected my patch, by writing it from scratch in smaller, potentially slightly broken but not too broken pieces, this is the piece that introduces the regression: https://hg.mozilla.org/try/rev/b4f0258b9ae0b3c55cfa8746b03c7b6a7ba3c07e

How, I don't know yet. I did notice yesterday while I was looking at the patch that that changed the order on which quirks.css is applied, but fixing that still doesn't fix the regression here, so not sure what's going on just yet, but at least I have it narrowed down...

Hmm, also, this affects XUL docs, which don't even have quirks mode. So the only thing in that patch that gets lost is the EnsureStyleFlush() call. If that fixes it I'd be sad :(

And bingo...

Assignee

Updated

a month ago
Blocks: 1544185

Sorry for the vague commit message, but I haven't dug yet on why or how the
Android code is depending on this.

This call used to be part of nsPresContext::CompatibilityModeChanged, which
unconditionally called PresShell::EnsureStyleFlush.

This was not (in theory, at least) always necessary. There's there's no point in
ensuring a style flush is going to happen if styles haven't changed, and
CompatibilityModeChanged() didn't actually guarantee that the compat mode was
different at all before my patch.

Styles only change if the compat mode actually changes (since then selectors
become case-sensitive or case-insensitive), or more obviously when you insert or
remove the quirks.css stylesheet, and in that case ApplicableStylesChanged makes
sure that the flush happens.

Yet here we are, and not having that early call to EnsureStyleFlush, even in the
case there's no quirks mode or quirks sheet change or anything of that sort
(this happens even on XUL docs, which are always FullStandards) makes the first
(and only the first) browsing session in Geckoview have terrible scrolling
performance.

I'm calling it a day for today, will investigate as time permits in bug 1544185.

Following the reasoning from the previous commit. Selectors may become case
sensitive or case insensitive as a result of this operation, something I forgot
about when writing the patch for bug 1535788.

Now Document::CompatibilityModeChanged is actually only called if the
compatibility mode actually changes, and thus we always need to restyle.
The Stylist already takes care of fully invalidating the document when it
changes (via mark_origins_dirty and co.), so we technically just need to
guarantee that a style flush will happen.

In practice I doubt this call will have any effect in practice given how early
in the document's lifetime the compatibility mode can change, but still I think
it's worth landing.

Depends on D27416

Can you confirm that just applying the first patch fixes the issue? The second one is a bit drive-by, while I reasoned about the shape of the code before my patch.

Flags: needinfo?(rbarker)

Comment 37

a month ago
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3b45bccf61cf
Re-introduce a call to EnsureStyleFlush from PresShell::Init to fix an Android scrolling regression. r=heycam
https://hg.mozilla.org/integration/autoland/rev/6ff81d173403
Document::CompatibilityModeChanged should call EnsureStyleFlush. r=heycam

Comment 38

a month ago
bugherder
Status: NEW → RESOLVED
Last Resolved: a month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Reporter

Comment 39

a month ago

Can confirm it fixes the issue in FxR. Thanks!

Flags: needinfo?(rbarker)

Yay

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.