Bug 1276850 (Closed)
Opened 9 years ago • Closed 9 years ago
When scrolling back to top of kryogenix.org, screen is entirely white for a second or two
Categories
(Core :: Graphics, defect)
Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla49
Release | Tracking | Status
---|---|---
firefox47 | --- | unaffected
firefox48 | --- | unaffected
firefox49 | --- | fixed
People
(Reporter: dholbert, Assigned: kats)
Details
(Keywords: regression)
STR:
1. Visit http://kryogenix.org/
2. Scroll to the bottom (or at least down a few screen-worths of space)
3. Rapidly scroll to the top.
ACTUAL RESULTS:
The top section of the page is just blank white for a second or so when you reach it by scrolling up.
EXPECTED RESULTS:
The top section of the page should repaint, perhaps as a low-res version at first, but it should not stay white.
I'm hitting this in recent Nightlies on Fennec (Firefox for Android). Working on a regression range.
Reporter
Comment 1 • 9 years ago
(In reply to Daniel Holbert [:dholbert] from comment #0)
> EXPECTED RESULTS:
> Top section of the page should repaint a low-res version perhaps, but not
> white.
(Or at least it should repaint something non-blank pretty rapidly.)
Reporter
Comment 2 • 9 years ago
mozregression says this was probably a regression from:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=eca3a6b94f7934377b099ddb22ae9c91f340cbfc&tochange=9066ef31441959ae15b83d59395ed34d3ed03c0a
> Kartikaya Gupta — Bug 1192910 - Ensure we flush paints on the main thread during an APZ flush. r=mstange
> Markus Stange — Bug 1192910 - Stop triggering repaints when the displayport margins change without the displayport changing. r=kats
(I say "probably" because this doesn't reproduce 100% of the time -- I had to reload a few times to get it to trigger. Makes sense from the blamed bug's commit message that it would be related to this, though.)
FWIW, I'm using a OnePlus One phone, running CyanogenMod 13 (Android 6.0.1).
Blocks: 1192910
Component: General → Panning and Zooming
Flags: needinfo?(bugmail.mozilla)
Product: Firefox for Android → Core
Version: unspecified → Trunk
Assignee
Comment 3 • 9 years ago
I can reproduce this on Nightly pretty easily, but I can't repro it on Aurora. There we have low-res painting enabled, so that might be masking the problem. However, I also enabled the APZ minimap on both Nightly and Aurora, and there's a clear difference in behaviour: on Nightly the painted rect stays "stuck" below the top of the page for a few seconds before snapping up, while on Aurora it paints the top much faster. So I'm inclined to mark 48/Aurora as unaffected, even though bug 1192910 is on that branch.
I'll investigate to see why this is happening.
Assignee: nobody → bugmail.mozilla
status-firefox47: --- → unaffected
status-firefox48: --- → unaffected
status-firefox49: --- → affected
Flags: needinfo?(bugmail.mozilla)
OS: Unspecified → Android
Summary: [Android] When scrolling back to top of kryogenix.org, screen is entirely white for a second or two → When scrolling back to top of kryogenix.org, screen is entirely white for a second or two
Version: Trunk → 48 Branch
Assignee
Comment 4 • 9 years ago
Weird, I can't reproduce this on a local build from m-c tip, with a clean profile. I can still repro in Nightly.
Assignee
Comment 5 • 9 years ago
I did some try pushes with logging [1] and I can reproduce on those builds (although they're using my Nightly profile). From the logs it looks like the APZ code is calling RequestContentRepaint as intended, but since it's calling it from the compositor thread, it needs to redispatch to the main thread at [2], and that redispatch doesn't happen until much later. So the main thread must be busy doing something else; I don't know what. I was unsuccessful in running the profiler on it. I'll try to get a backtrace from gdb.
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f6fb8da959f
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/AsyncPanZoomController.cpp?rev=ddd9b902f9e1#2784
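(To illustrate the shape of the problem, here is a minimal, self-contained C++ sketch of the mechanism: the compositor thread can only queue the repaint request for the main thread, so if the main thread is stuck in a long paint, the request sits in the queue until that paint finishes. This is not the actual Gecko code; all names in the sketch are illustrative.)

```cpp
// Schematic only: a compositor thread that redispatches a repaint request
// to a busy main thread.  Not Gecko APIs.
#include <chrono>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class MainThreadQueue {
 public:
  void Dispatch(std::function<void()> aTask) {
    std::lock_guard<std::mutex> lock(mMutex);
    mTasks.push(std::move(aTask));
  }
  void ProcessPending() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mTasks.empty()) return;
        task = std::move(mTasks.front());
        mTasks.pop();
      }
      task();
    }
  }
 private:
  std::mutex mMutex;
  std::queue<std::function<void()>> mTasks;
};

int main() {
  MainThreadQueue mainThread;

  // Compositor thread: it cannot paint content itself, so it queues a
  // repaint request for the main thread and moves on.
  std::thread compositor([&] {
    mainThread.Dispatch([] { std::puts("main thread: painting new displayport"); });
  });

  // Main thread: busy rasterizing an expensive background/gradient, so the
  // queued repaint request is not seen until this "paint" finishes.
  std::this_thread::sleep_for(std::chrono::milliseconds(500));

  compositor.join();
  mainThread.ProcessPending();  // the repaint request finally runs here
  return 0;
}
```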
Assignee
Comment 6 • 9 years ago
Couldn't get gdb to work on the try build either, since it's built as a Nightly and appears to be non-debuggable.
Assignee
Comment 7 • 9 years ago
Ok, I got a profile from a Nightly build (I didn't have my setup correct before).
https://cleopatra.io/#report=f321c5b18da81dc1b7302e01ac66cc896afcc015
We're basically spending a lot of time painting backgrounds and gradients, and that blocks the main thread from processing the new incoming paint requests.
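(As a rough illustration of why gradient painting keeps the main thread this busy, and why the floating-point symbols that show up later in this bug sit under gfxContext::Fill, here is a naive per-pixel linear-gradient fill in C++. It is not Moz2D/gfxContext code; it just shows that a gradient fill does several float multiplies and adds for every pixel it covers.)

```cpp
// Illustrative only: a naive vertical linear-gradient rasterizer.
#include <cstdint>
#include <vector>

std::vector<uint32_t> FillVerticalGradient(int aWidth, int aHeight,
                                           uint32_t aTop, uint32_t aBottom) {
  std::vector<uint32_t> pixels(size_t(aWidth) * aHeight);
  for (int y = 0; y < aHeight; ++y) {
    // Interpolation factor for this row: one float divide per row...
    float t = aHeight > 1 ? float(y) / float(aHeight - 1) : 0.0f;
    for (int x = 0; x < aWidth; ++x) {
      uint32_t px = 0;
      for (int shift = 0; shift < 32; shift += 8) {
        float a = float((aTop >> shift) & 0xff);
        float b = float((aBottom >> shift) & 0xff);
        // ...and several float multiplies/adds per channel per pixel.
        // With software float emulation each one becomes a library call.
        px |= uint32_t(a + (b - a) * t) << shift;
      }
      pixels[size_t(y) * aWidth + x] = px;
    }
  }
  return pixels;
}

int main() {
  auto pixels = FillVerticalGradient(320, 480, 0xfff0f0f0, 0xff202020);
  return pixels.empty() ? 1 : 0;
}
```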
Assignee: bugmail.mozilla → nobody
Component: Panning and Zooming → Graphics
Reporter
Comment 8 • 9 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> Weird, I can't reproduce this on a local build from m-c tip, with a clean profile.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> I did some try pushes with logging [1] and I can reproduce on those builds
> (although they're using my Nightly profile).
(FWIW, I don't think there are any special normal-browsing-profile quirks required to trigger this -- I'm able to reproduce in "guest mode", as well as in mozregression which uses fresh profiles.)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> We're basically spending a lot of time painting backgrounds and gradients,
> and that blocks the main thread from processing the new incoming paint
> requests.
Thanks for taking a look at this! So, the page does have a complex layered tiled background plus a partially-transparent gradient on the top section (see the rule for #basic). I imagine that's what we're working to paint here.
Do you know how we managed to paint responsively before, despite this background being expensive?
Assignee
Comment 9 • 9 years ago
(In reply to Daniel Holbert [:dholbert] from comment #8)
> Do you know how we managed to paint responsively before, despite this
> background being expensive?
I'm not sure, but considering that I see the issue on try build but not local builds, I suspect there might be something racy or otherwise intermittent going on. I also did a try build with bug 1192910 backed out [1] and I'm still able to reproduce the issue on that build. I wasn't able to install the 47 nightly build just before bug 1192910 landed (I'd have to blow away my profile which I don't want to do right now) so I couldn't verify the regression range, but I suspect it might be off?
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=81c027858c2e
Assignee
Comment 10 • 9 years ago
Ok, so I verified your regression range, and confirmed that bug 1192910 did make things worse. However, the specific thing that bug 1192910 regressed was fixed in bug 1263347. I think it regressed again since then; I'm trying to track down the regression range for that.
Assignee
Comment 11 • 9 years ago
The new regression range is https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7389f08fb91c83a239add2be15e648431ed645f1&tochange=d74e9d5a64c352187ab74b60255b571f6192a0e5
Are we now using rust code for any part of painting/layout on fennec?
Comment 12 • 9 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> The new regression range is
> https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7389f08fb91c83a239add2be15e648431ed645f1&tochange=d74e9d5a64c352187ab74b60255b571f6192a0e5
>
> Are we now using rust code for any part of painting/layout on fennec?
Uh, no, we aren't. That's a peculiar regression range. My best guess is that when we link an empty rust crate, we drag in bits and pieces of support libraries (e.g. what one would find in libgcc) and those pieces get linked into libxul, rather than libxul depending on some outside library. And then somehow, one of those pieces has a bug? That seems a little unlikely, but...
Flags: needinfo?(nfroyd)
Assignee
Comment 13 • 9 years ago
Hm. It's far-fetched, but it would also explain why try builds demonstrate the problem and my local build does not. Somebody should verify my regression range just to be sure though.
Assuming the range is valid, I don't really know where to go from here. Is it useful to bisect within those three changesets? Or should we look at the profile in comment 7 more closely to see if there are things in there that wouldn't have been involved before?
Comment 14 • 9 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> Hm. It's far-fetched, but it would also explain why try builds demonstrate
> the problem and my local build does not. Somebody should verify my
> regression range just to be sure though.
>
> Assuming the range is valid, I don't really know where to go from here - is
> it useful to bisect within those three changesets? Or should we look at the
> profile in comment 7 more closely to see if there's things in there that
> wouldn't have been involved before?
I have two suggestions, one of which is more work than the other:
1. Look at the profile in comment 7 to see if you can find anything unusual, particularly with functions starting with '_' or '__'. If we can at least identify *something*, we'll have more information to possibly take to the Rust team or even someplace else.
2. Back out the patches; nothing else we're building depends on them, because the only Rust pieces in the tree aren't being built on Android for tedious reasons (bug 1266792).
These are not mutually exclusive, of course.
Comment 15 • 9 years ago
(In reply to Nathan Froyd [:froydnj] from comment #14)
> 2. Back out the patches; nothing else we're building depends on them,
> because the only Rust pieces in the tree aren't being built on Android for
> tedious reasons (bug 1266792).
Well, at least back out the patch that flips on --enable-rust; the other ones can stay.
Assignee
Comment 16 • 9 years ago
Here's a try push with --enable-rust disabled; let's see if it helps any.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ab94a6e08cb
Assignee
Comment 17 • 9 years ago
(In reply to Nathan Froyd [:froydnj] from comment #14)
> 1. Look at the profile in comment 7 to see if you can find anything unusual,
> particularly with functions starting with '_' or '__'. If we can at least
> identify *something*, we'll have more information to possibly take to the
> Rust team or even someplace else.
There are some symbols like __addXf3__, __mulXf3__, and __divdXf3__ which are showing up as pretty high in the profile. They're the top three nontrivial symbols in the profile if you select "invert callstack". Does that help?
Comment 18 • 9 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> (In reply to Nathan Froyd [:froydnj] from comment #14)
> > 1. Look at the profile in comment 7 to see if you can find anything unusual,
> > particularly with functions starting with '_' or '__'. If we can at least
> > identify *something*, we'll have more information to possibly take to the
> > Rust team or even someplace else.
>
> There are some symbols like __addXf3__, __mulXf3__, and __divdXf3__ which
> are showing up as pretty high in the profile. They're the top three
> nontrivial symbols in the profile if you select "invert callstack". Does
> that help?
Well, it's information, at least. :) What is gfxContext::Fill doing with adding, multiplying, and dividing floating point numbers?
OK, I don't know the answer to that question. What I suspect is going on here is this: we call runtime functions __add{s,d}f3__ for doing addition of floating-point numbers, because ARM. Prior to the --enable-rust push, references to these functions were being resolved to libgcc or similar on the system, which probably uses the floating-point hardware to do the work (possibly with some cost for massaging the data in and out of the integer calling convention registers, I think). But after the --enable-rust push, those __add{s,d}f3__ references are getting resolved by using Rust's support library; __addXf3__ is the llvm internal implementation of the function we're actually calling. And Rust's support library on the ARM target that we use is compiled to be fully general and run on any ARM chip...including chips that don't have a floating-point unit. So the Rust routines are, I think, doing everything in software (!), which as you might imagine is somewhat expensive.
I'm really surprised a regression from this hasn't shown up sooner.
There's a Rust issue open on this: https://github.com/rust-lang/rust/issues/33278. Clearly, we need to use the correct Rust target for our Android builds. :) I don't think the necessary changes to add an armv7 target have shown up in stable Rust, though, so perhaps the best course of action for now is simply to back out the --enable-rust changes?
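(A small, hedged illustration of the mechanism Nathan describes: the C++ below is ordinary double arithmetic, and the comments note, as assumptions about the toolchain rather than verified facts about this build, how it lowers on hard-float versus soft-float ARM and why resolving __adddf3/__muldf3 to a fully software implementation is so expensive.)

```cpp
// On a hard-float ARM build, the compiler turns the arithmetic below into
// VFP instructions; on a soft-float target it instead emits calls to runtime
// builtins such as __adddf3/__muldf3, and whichever linked library provides
// those symbols -- libgcc/compiler-rt, or here the builtins bundled with the
// generic Rust target -- determines whether the work happens on the FPU or
// entirely in integer software emulation.
#include <cstdio>

double ScaleAndOffset(double aValue, double aScale, double aOffset) {
  // Hardware FP: roughly one multiply instruction plus one add instruction.
  // Fully software __muldf3/__adddf3: dozens of integer instructions each,
  // unpacking and repacking sign/exponent/mantissa -- which is why these
  // builtins dominate the profile from comment 7.
  return aValue * aScale + aOffset;
}

int main() {
  std::printf("%f\n", ScaleAndOffset(3.0, 2.0, 1.0));
  return 0;
}
```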
Comment 19 • 9 years ago
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/da10eecd0e76
Back out part 3 of bug 1220307. rs=froydnj
Assignee
Comment 20 • 9 years ago
The try build with rust disabled fixed the issue. After discussion with Nathan on IRC we backed out part 3 of bug 1220307.
Comment 21 • 9 years ago
Whew, in under the 49 wire so we don't have to ask for uplifts. :)
Assignee
Updated • 9 years ago
Assignee: nobody → bugmail.mozilla
Comment 22 • 9 years ago
Oh whoa, this is indeed quite surprising! The armv7 Android target is in nightly Rust now and I'll try to work on producing nightlies for it once I get back to SF after the work week to ensure they start riding the trains.
Assignee
Comment 23 • 9 years ago
This got merged:
https://hg.mozilla.org/mozilla-central/rev/da10eecd0e76
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49