Closed Bug 1276850 — Opened 8 years ago, Closed 8 years ago

When scrolling back to top of kryogenix.org, screen is entirely white for a second or two

Categories

Product/Component: Core :: Graphics
Type: defect
Version: 48 Branch
Platform: Unspecified
OS: Android
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla49
Tracking Status:
firefox47 --- unaffected
firefox48 --- unaffected
firefox49 --- fixed

People

(Reporter: dholbert, Assigned: kats)

Details

(Keywords: regression)

STR:
 1. Visit http://kryogenix.org/
 2. Scroll to the bottom (or at least down a few screen-worths of space)
 3. Rapidly scroll to the top.

ACTUAL RESULTS:
The top section of the page is just blank white for a second or so when you reach it by scrolling up.

EXPECTED RESULTS:
Top section of the page should repaint a low-res version perhaps, but not white.

I'm hitting this in recent Nightlies on Fennec (Firefox for Android). Working on a regression range.
(In reply to Daniel Holbert [:dholbert] from comment #0)
> EXPECTED RESULTS:
> Top section of the page should repaint a low-res version perhaps, but not
> white.

(Or at least it should repaint something non-blank pretty rapidly.)
mozregression says this was probably a regression from:
 
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=eca3a6b94f7934377b099ddb22ae9c91f340cbfc&tochange=9066ef31441959ae15b83d59395ed34d3ed03c0a
> Kartikaya Gupta — Bug 1192910 - Ensure we flush paints on the main thread during an APZ flush. r=mstange
> Markus Stange — Bug 1192910 - Stop triggering repaints when the displayport margins change without the displayport changing. r=kats

(I say "probably" because this doesn't reproduce 100% of the time -- I had to reload a few times to get it to trigger. The blamed bug's commit message does make it sound related to this, though.)

FWIW, I'm using a OnePlus One phone, running CyanogenMod 13 (Android 6.0.1).
Blocks: 1192910
Component: General → Panning and Zooming
Flags: needinfo?(bugmail.mozilla)
Product: Firefox for Android → Core
Version: unspecified → Trunk
I can reproduce this on Nightly pretty easily, but I can't repro it on Aurora. There we have low-res painting enabled, so that might be masking the problem. However, I also enabled the APZ minimap on both Nightly and Aurora, and there's a clear difference in behaviour there: on Nightly the painted rect stays "stuck" below the top of the page for a few seconds before snapping up, while on Aurora it paints the top much faster. So I'm inclined to mark 48/aurora as unaffected, even though bug 1192910 is on that branch.

I'll investigate to see why this is happening.
Assignee: nobody → bugmail.mozilla
Flags: needinfo?(bugmail.mozilla)
OS: Unspecified → Android
Summary: [Android] When scrolling back to top of kryogenix.org, screen is entirely white for a second or two → When scrolling back to top of kryogenix.org, screen is entirely white for a second or two
Version: Trunk → 48 Branch
Weird, I can't reproduce this on a local build from m-c tip, with a clean profile. I can still repro in Nightly.
I did some try pushes with logging [1] and I can reproduce on those builds (although they're using my Nightly profile). From the logs it looks like the APZ code is calling RequestContentRepaint as intended, but since it's calling it from the compositor thread, it needs to redispatch to the main thread at [2], and that redispatch doesn't happen until much later. So the main thread must be busy doing something else, though I don't know what. I was unsuccessful in running the profiler on it. I'll try to get a backtrace from gdb.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f6fb8da959f
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/AsyncPanZoomController.cpp?rev=ddd9b902f9e1#2784
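
(For illustration only: a toy, self-contained C++ sketch, not Gecko code, with made-up names like MainThreadQueue. It shows the delay pattern described above: the compositor thread posts its repaint request promptly, but the request can only run when the main thread returns to its event loop, so a long-running paint on the main thread delays it by the same amount.)

// Toy illustration (NOT Gecko code; MainThreadQueue and the thread roles are
// made up): a "compositor thread" dispatches a repaint task to a "main
// thread" whose event loop is blocked by a long-running paint, so the task
// only runs once that paint finishes -- roughly one second later.
#include <chrono>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class MainThreadQueue {
 public:
  // Called from any thread: queue a task for the "main thread".
  void Dispatch(std::function<void()> aTask) {
    std::lock_guard<std::mutex> lock(mMutex);
    mTasks.push(std::move(aTask));
  }
  // Called on the "main thread": run everything queued so far.
  void ProcessPending() {
    std::queue<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      std::swap(pending, mTasks);
    }
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
 private:
  std::mutex mMutex;
  std::queue<std::function<void()>> mTasks;
};

int main() {
  MainThreadQueue mainThread;
  auto start = std::chrono::steady_clock::now();
  auto ms = [&] {
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count());
  };

  // "Compositor thread": quickly decides new content must be painted and
  // asks the main thread for a repaint.
  std::thread compositor([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::printf("[%4lld ms] compositor: repaint request dispatched\n", ms());
    mainThread.Dispatch([&] {
      std::printf("[%4lld ms] main: repaint request finally runs\n", ms());
    });
  });

  // "Main thread": busy painting an expensive background/gradient, so it
  // only returns to its event loop after ~1000 ms.
  std::this_thread::sleep_for(std::chrono::seconds(1));
  mainThread.ProcessPending();

  compositor.join();
  return 0;
}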
Couldn't get gdb to work on the try build either, since it's built as a Nightly and appears to be non-debuggable.
Ok, I got a profile from a Nightly build (I didn't have my setup correct before).

https://cleopatra.io/#report=f321c5b18da81dc1b7302e01ac66cc896afcc015

We're basically spending a lot of time painting backgrounds and gradients, and that blocks the main thread from processing the new incoming paint requests.
Assignee: bugmail.mozilla → nobody
Component: Panning and Zooming → Graphics
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> Weird, I can't reproduce this on a local build from m-c tip, with a clean profile.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> I did some try pushes with logging [1] and I can reproduce on those builds
> (although they're using my Nightly profile).

(FWIW, I don't think there are any special normal-browsing-profile quirks required to trigger this -- I'm able to reproduce in "guest mode", as well as in mozregression which uses fresh profiles.)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> We're basically spending a lot of time painting backgrounds and gradients,
> and that blocks the main thread from processing the new incoming paint
> requests.

Thanks for taking a look at this!  So, the page does have a complex layered tiled background plus a partially-transparent gradient on the top section (see the rule for #basic).  I imagine that's what we're working to paint here.

Do you know how we managed to paint responsively before, despite this background being expensive?
(In reply to Daniel Holbert [:dholbert] from comment #8)
> Do you know how we managed to paint responsively before, despite this
> background being expensive?

I'm not sure, but considering that I see the issue on try build but not local builds, I suspect there might be something racy or otherwise intermittent going on. I also did a try build with bug 1192910 backed out [1] and I'm still able to reproduce the issue on that build. I wasn't able to install the 47 nightly build just before bug 1192910 landed (I'd have to blow away my profile which I don't want to do right now) so I couldn't verify the regression range, but I suspect it might be off?

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=81c027858c2e
Ok, so I verified your regression range, and confirmed that bug 1192910 did make things worse. However the specific thing that bug 1192910 regressed was fixed in bug 1263347. I think it regressed again since then, trying to track down the regression range for that.
The new regression range is https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7389f08fb91c83a239add2be15e648431ed645f1&tochange=d74e9d5a64c352187ab74b60255b571f6192a0e5

Are we now using rust code for any part of painting/layout on fennec?
Blocks: 1220307
No longer blocks: 1192910
Flags: needinfo?(nfroyd)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> The new regression range is
> https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7389f08fb91c83a239add2be15e648431ed645f1&tochange=d74e9d5a64c352187ab74b60255b571f6192a0e5
> 
> Are we now using rust code for any part of painting/layout on fennec?

Uh, no, we aren't.  That's a peculiar regression range.  My best guess is that when we link an empty rust crate, we drag in bits and pieces of support libraries (e.g. what one would find in libgcc) and those pieces get linked into libxul, rather than libxul depending on some outside library.  And then somehow, one of those pieces has a bug?  That seems a little unlikely, but...
Flags: needinfo?(nfroyd)
Hm. It's far-fetched, but it would also explain why try builds demonstrate the problem and my local build does not. Somebody should verify my regression range just to be sure though.

Assuming the range is valid, I don't really know where to go from here - is it useful to bisect within those three changesets? Or should we look at the profile in comment 7 more closely to see if there's things in there that wouldn't have been involved before?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> Hm. It's far-fetched, but it would also explain why try builds demonstrate
> the problem and my local build does not. Somebody should verify my
> regression range just to be sure though.
> 
> Assuming the range is valid, I don't really know where to go from here - is
> it useful to bisect within those three changesets? Or should we look at the
> profile in comment 7 more closely to see if there's things in there that
> wouldn't have been involved before?

I have two suggestions, one of which is more work than the other:

1. Look at the profile in comment 7 to see if you can find anything unusual, particularly with functions starting with '_' or '__'.  If we can at least identify *something*, we'll have more information to possibly take to the Rust team or even someplace else.

2. Back out the patches; nothing else we're building depends on them, because the only Rust pieces in the tree aren't being built on Android for tedious reasons (bug 1266792).

These are not mutually exclusive, of course.
(In reply to Nathan Froyd [:froydnj] from comment #14)
> 2. Back out the patches; nothing else we're building depends on them,
> because the only Rust pieces in the tree aren't being built on Android for
> tedious reasons (bug 1266792).

Well, at least back out the patch that flips on --enable-rust; the other ones can stay.
Here's a try push with --enable-rust disabled; let's see if it helps any.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ab94a6e08cb
(In reply to Nathan Froyd [:froydnj] from comment #14)
> 1. Look at the profile in comment 7 to see if you can find anything unusual,
> particularly with functions starting with '_' or '__'.  If we can at least
> identify *something*, we'll have more information to possibly take to the
> Rust team or even someplace else.

There are some symbols like __addXf3__, __mulXf3__, and __divdXf3__ which are showing up as pretty high in the profile. They're the top three nontrivial symbols in the profile if you select "invert callstack". Does that help?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> (In reply to Nathan Froyd [:froydnj] from comment #14)
> > 1. Look at the profile in comment 7 to see if you can find anything unusual,
> > particularly with functions starting with '_' or '__'.  If we can at least
> > identify *something*, we'll have more information to possibly take to the
> > Rust team or even someplace else.
> 
> There are some symbols like __addXf3__, __mulXf3__, and __divdXf3__ which
> are showing up as pretty high in the profile. They're the top three
> nontrivial symbols in the profile if you select "invert callstack". Does
> that help?

Well, it's information, at least. :)  What is gfxContext::Fill doing with adding, multiplying, and dividing floating point numbers?

OK, I don't know the answer to that question.  What I suspect is going on here is this: we call runtime functions __add{s,d}f3__ for doing addition of floating-point numbers, because ARM.  Prior to the --enable-rust push, references to these functions were being resolved to libgcc or similar on the system, which probably uses the floating-point hardware to do the work (possibly with some cost for massaging the data in and out of the integer calling convention registers, I think).  But after the --enable-rust push, those __add{s,d}f3__ references are getting resolved by using Rust's support library; __addXf3__ is the llvm internal implementation of the function we're actually calling.  And Rust's support library on the ARM target that we use is compiled to be fully general and run on any ARM chip...including chips that don't have a floating-point unit.  So the Rust routines are, I think, doing everything in software (!), which as you might imagine is somewhat expensive.

I'm really surprised a regression from this hasn't shown up sooner.

There's a Rust issue open on this: https://github.com/rust-lang/rust/issues/33278  Clearly, we need to use the correct Rust target for our Android builds. :)  I don't think the necessary changes to add an armv7 target have shown up in stable Rust, though, so perhaps the best course of action for now is simply to back out the --enable-rust changes?
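
(To make the cost concrete, here is a deliberately simplified, hypothetical sketch of what a generic software single-precision add along the lines of __addsf3 has to do without an FPU. It is not the actual compiler-rt/Rust support code: it handles only finite, positive, normal inputs, truncates instead of rounding, and skips NaN/Inf and subnormals. Even so, it shows why routing every float add through code like this is much slower than a single hardware VFP add instruction.)

// Hypothetical, simplified soft-float single-precision add, to illustrate
// the kind of work a generic __addsf3 does in software. Assumes both inputs
// are finite, positive, normal IEEE-754 floats; truncates instead of
// rounding; no NaN/Inf/subnormal handling. Real support-library routines do
// all of that and more on every float addition.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <utility>

static float SoftAddPositive(float aA, float aB) {
  uint32_t a, b;
  std::memcpy(&a, &aA, sizeof a);  // grab the raw IEEE-754 bit patterns
  std::memcpy(&b, &aB, sizeof b);

  // Unpack biased exponents and mantissas (restore the implicit leading 1).
  int32_t ea = (a >> 23) & 0xFF;
  int32_t eb = (b >> 23) & 0xFF;
  uint32_t ma = (a & 0x7FFFFF) | 0x800000;
  uint32_t mb = (b & 0x7FFFFF) | 0x800000;

  // Make (ea, ma) the operand with the larger exponent.
  if (ea < eb) {
    std::swap(ea, eb);
    std::swap(ma, mb);
  }

  // Align the smaller mantissa to the larger exponent (truncating).
  int32_t shift = ea - eb;
  mb = (shift < 24) ? (mb >> shift) : 0;

  // Integer-add the mantissas and renormalize on carry.
  uint32_t m = ma + mb;
  int32_t e = ea;
  if (m & 0x1000000) {
    m >>= 1;
    ++e;
  }

  // Repack sign (0), exponent, and mantissa.
  uint32_t bits = (static_cast<uint32_t>(e) << 23) | (m & 0x7FFFFF);
  float result;
  std::memcpy(&result, &bits, sizeof result);
  return result;
}

int main() {
  float a = 1.5f, b = 2.75f;
  // On a hard-float build, a + b is one FPU instruction; the software path
  // above is dozens of integer operations plus a function call.
  std::printf("software: %g  hardware: %g\n", SoftAddPositive(a, b), a + b);
  return 0;
}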
The try build with rust disabled fixed the issue. After discussion with Nathan on IRC we backed out part 3 of bug 1220307.
Whew, in under the 49 wire so we don't have to ask for uplifts. :)
Assignee: nobody → bugmail.mozilla
Oh whoa, this is indeed quite surprising! The armv7 Android target is in nightly Rust now and I'll try to work on producing nightlies for it once I get back to SF after the work week to ensure they start riding the trains.
This got merged to mozilla-central:

https://hg.mozilla.org/mozilla-central/rev/da10eecd0e76
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49