Closed Bug 931028 Opened 6 years ago Closed 6 years ago

Panning about:start is very janky in 10/24 nightly

Categories

(Core :: Layout, defect)

27 Branch
x86_64
Windows 8.1
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla29
Tracking Status
firefox27 --- unaffected
firefox28 + verified
firefox29 --- verified

People

(Reporter: jimm, Assigned: tnikkel)

References

Details

(Keywords: verified-aurora, Whiteboard: [metro] [beta28] [layout])

Attachments

(4 files)

I haven't updated this tablet in a while, so this could be a regression over the last few weeks.

str: 
1) surf around a bit to accumulate some history and bookmarks
2) open about:start and scroll
note in fx nightly, loading this video will start out broken, you have to move into it about 50% before you get video. Chrome can display the video fine.
Can you grab a profile using the profiler? I presume this is the same issue as in  bug 927121 but would like to confirm.
Flags: needinfo?(jmathies)
Version: 26 Branch → 27 Branch
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #3)
> Can you grab a profile using the profiler? I presume this is the same issue
> as in  bug 927121 but would like to confirm.

Sure. Don't have that set up yet but will, probably over the weekend or early next week.
Flags: needinfo?(jmathies)
Whiteboard: [blocker]
Whiteboard: [blocker] → [block28]
(In reply to Jim Mathies [:jimm] from comment #5)
> Janky profile - 
> 
> http://people.mozilla.org/~bgirard/cleopatra/
> #report=e77eef3004a8373e0c50c674c7ae10d4668142ba

If I'm reading that right composition is experiencing heavy jank, sometimes ~500 msec or more between renders.
Bas, curious if you have any comments on the posted profile. It looks like we spend a lot of time idle, although I have not spent a lot of time working with these profiles to know for sure.
Flags: needinfo?(bas)
A couple interesting points:

1) In some cases I see ~50msec delays between dispatching an async touch move and receiving the dispatched async runnable in DeliverNextQueuedTouchEvent, which delivers the event to the apzc.
2) In some cases I see a ~70msec delay between the apzc processing an input events the the compositor rendering to the screen.
Depends on: 935037
(In reply to Jim Mathies [:jimm] from comment #7)
> Bas, curious if you have any comments on the posted profile. It looks like
> we spend a lot of time idle, although I have not spent a lot of time working
> with these profiles to know for sure.

During the workweek in Paris I noticed something very interesting (on windows desktop), when scrolling with the scrollwheel on my machine, scrolling was -much- worse than with the arrow key. Our frame rates were much lower and jankier in that case. In neither case we were maxing out the CPU or the GPU so it wasn't clear what was limiting our framerate or causing the jank.

I didn't get the time to investigate the cause (nor do I suspect the cause is really up my alley), but it looked like we were spending a lot of time idle there as well.
Flags: needinfo?(bas)
(In reply to Jim Mathies [:jimm] from comment #8)
> A couple interesting points:
> 
> 1) In some cases I see ~50msec delays between dispatching an async touch
> move and receiving the dispatched async runnable in
> DeliverNextQueuedTouchEvent, which delivers the event to the apzc.

This turned out to be a side effect of the way the profiler takes snapshots. I added some tagging to our event dispatching code and found that touch events are getting dispatched in a timely fashion (especially with the patch in bug 935037 applied). 

> 2) In some cases I see a ~70msec delay between the apzc processing an input
> events the the compositor rendering to the screen.

Frequency of composition really seems to the the issue here. Will be looking at this next.
(In reply to Bas Schouten (:bas.schouten) from comment #9)
> (In reply to Jim Mathies [:jimm] from comment #7)
> > Bas, curious if you have any comments on the posted profile. It looks like
> > we spend a lot of time idle, although I have not spent a lot of time working
> > with these profiles to know for sure.
> 
> During the workweek in Paris I noticed something very interesting (on
> windows desktop), when scrolling with the scrollwheel on my machine,
> scrolling was -much- worse than with the arrow key. Our frame rates were
> much lower and jankier in that case. In neither case we were maxing out the
> CPU or the GPU so it wasn't clear what was limiting our framerate or causing
> the jank.
> 
> I didn't get the time to investigate the cause (nor do I suspect the cause
> is really up my alley), but it looked like we were spending a lot of time
> idle there as well.

Scrolling takes two different paths depending on how you scroll - scroll wheel / keyboard goes through the dom and comes back to the apzc. Touch scrolling goes through the apzc directly. I'm not sure why you would see differences between wheel and keyboard, since they both take the same path.
A couple of additional notes on this:

1) The apzc requests composition at relatively reasonable intervals. In the CompositorParent we will occasionally delay requests to composite to limit frame rate. One of the things I'm noticing is that delayed requests tend to exceed the requested callback time quite a bit. I commented out this delaying mechanism and was able to get improved scroll performance.

2) painting about:start takes a long time, usually around 50-80 msec. When we paint, we don't composite so it looks like the jank shown in the video is primarily the result of the pause while we paint. We paint on the input thread, so while painting, we also don't process input event afaict. This would explain the regular jump/jank in scrolling we see as you scroll the tab sideways.
Disconnecting the profiler and just logging deltas between calls to CompositorParent::Composite() looks pretty good.

ScheduleComposition: delta: 4.54 timeout: 10.46
Composite          : delta: 13.15
ScheduleComposition: delta: 1.55 timeout: 13.45
Composite          : delta: 13.55
ScheduleComposition: delta: 1.27 timeout: 13.73
Composite          : delta: 12.89
ScheduleComposition: delta: 1.25 timeout: 13.75
Composite          : delta: 12.82
ScheduleComposition: delta: 1.27 timeout: 13.73
Composite          : delta: 18.02
ScheduleComposition: delta: 2.86 timeout: 12.14
Composite          : delta: 12.56
ScheduleComposition: delta: 3.39 timeout: 11.61
Composite          : delta: 13.77
ScheduleComposition: delta: 2.66 timeout: 12.34
Composite          : delta: 12.82
ScheduleComposition: delta: 2.11 timeout: 12.89
Composite          : delta: 12.82
ScheduleComposition: delta: 2.70 timeout: 12.30
Composite          : delta: 13.12
ScheduleComposition: delta: 61.39 timeout: -46.39
Composite          : delta: 63.49
ScheduleComposition: delta: 2.26 timeout: 12.74
Composite          : delta: 10.44
http://people.mozilla.org/~bgirard/cleopatra/#report=e77eef3004a8373e0c50c674c7ae10d4668142ba

So since the compositor doesn't seem to be at issue here, I've been looking at painting. In the above profile you can put 'paintroot' (displaylist) or 'paint' (pressshell's paint) in the filter box to see exactly how much time we spend.

In general, 70-80 msec for every paint of the root view while scrolling.

I've stepped through the painting code a bit but I'm not sure what to look for. Any help from layout folks would be appreciated. 80msec with all other threads idle seems like a really long time.
I'm having trouble figuring out how to find out where most of the time is spent in this profile. It seems to be different from every other cleopatra profile I've looked at. Poking around like a child though I've seen box shadow come up a few times. Could try getting rid of all box shadows and see if that makes it better, then we'd have a culprit.

Another way to look at this would be to turn on paint flashing to see if we are repainting things unnecessarily. If we are that would be the first thing to look into.
(In reply to Timothy Nikkel (:tn) from comment #15)
> I'm having trouble figuring out how to find out where most of the time is
> spent in this profile. It seems to be different from every other cleopatra
> profile I've looked at.

Agreed. I think the sampling interval is too large on that profile. I have a profile of me panning slowly left/right on about:start here:

http://people.mozilla.org/~bgirard/cleopatra/#report=773a053be234d5bbd32639302f53a0278b442cb2

> Another way to look at this would be to turn on paint flashing to see if we
> are repainting things unnecessarily. If we are that would be the first thing
> to look into.

I turned on paint flashing and I see that there are vertical borders on the sides of each of the blocks (bookmarks, recent history, etc) that are constantly repainting as I scroll. If I can figure out how to take a screenshot I will attach it.
Attached image screenshot.png
Here is a screenshot with paint flashing turned on. Note the vertical lines around the bookmarks block, those are getting repainted like crazy.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)
> http://people.mozilla.org/~bgirard/cleopatra/
> #report=773a053be234d5bbd32639302f53a0278b442cb2
> 

Actually that profile isn't much better and is missing symbols in some places. I'll try to get a better one.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Created attachment 828639 [details]
> screenshot.png
> 
> Here is a screenshot with paint flashing turned on. Note the vertical lines
> around the bookmarks block, those are getting repainted like crazy.

This is something we've been curious about, when scrolling with the apz, we expect that layout would only paint the new areas added to the display port. But apparently this not the case, and the whole port often gets repainted.

Also I don't think this is a problem that just affects about:start.
I bet this is some stupid rounding thing. We're going to be triggering request-content-repaints regularly while panning (based on the apz.pan_repaint_interval pref) and gecko is probably going to repaint things like blurs/shadows (which is what I'm seeing in the profile) because of pixel alignment.
I tried snapping the displayport to an integer CSS pixel boundary but that didn't seem to help. I'm going to try to see if I can log why those borders or shadows or whatever getting invalidated as we pan around.
So after much breakpointing in Visual Studio I determined that when about:start is panned, the invalid region is coming from some nsDisplayTransform, nsXULTextBox, and nsDisplayBackgroundColor display items. The code path that actually populates the invalid region is the one at [1] and specifically [2]. The width of the two rects passed in to AccumulateRectDifference is off by 160 pixels (in the instance I caught it) although the x, y, and height are identical.

[1] http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp?rev=82ff69542540#2309
[2] http://mxr.mozilla.org/mozilla-central/source/layout/base/DisplayItemClip.cpp?rev=82ff69542540#302
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #23)
> nsDisplayBackgroundColor display

Sorry, that should be nsDisplayCanvasBackgroundColor
Just an update since I was doing some investigation here last week. The specific elements on about:start that are causing the problem, as best I can tell, are the XULTextBox elements (the ones that say "Bookmarks", "Recent History" etc.). I think the clip rect on those are wrong in layout but I don't know why. I was wondering if botond's patch in bug 936277 would fix it so I decided to update to latest m-c and apply that patch. Turns out that now (even without his patch) all of about:start is getting repainted on every RequestContentRepaint. But the paintflashing behaviour seems different somehow.. like after a few frames of composition I can't make out the underyling content any more.

At any rate both the original problem I was seeing and the new problem are probably in layout code.
Component: Panning and Zooming → Layout
The new behaviour I'm seeing is a result of bug 934860. With that backed out I see the old behaviour again.
Blocks: 934860
Matt woodrow pointed out that bug 934860 breaks paint-flashing which explains the behaviour I was seeing. It's not actually repainting everything but the paint-flash isn't clipped properly. Removing the dependency but leaving this bug in layout since the original invalidation/clip is a problem there. I might get back to this bug later but for the moment I have other things to look at.
No longer blocks: 934860
What does one need to reproduce this btw, windows 8? Or a metro tablet?
You need to run FF in metro mode, so yeah I think you need a metro tablet.
Assignee: nobody → ajones
Looks like most of the painting time (~75%) is spent in drawing box-shadow. That time is split fairly evenly between allocating the CPU surface and blurring it, and allocating a D2D texture and uploading it.

You could try the patches in bug 940845 to see how much this helps.
Sure that would make the repainting faster, but the fact that we have to redraw at all seems like a bug (from comment 16).
Based on the diagnosis thus far indicating this is something we can fix, and since this is the main tab users will interact with, requesting tracking28 so we get to this before we hit beta.
ajones not the right person - I'll look for somebody else.
Assignee: ajones → nobody
(In reply to Jim Mathies [:jimm] from comment #0)
> I haven't updated this tablet in a while, so this could be a regression over
> the last few weeks.
> 
> str: 
> 1) surf around a bit to accumulate some history and bookmarks
> 2) open about:start and scroll

Jim is this a regression or not?
(In reply to Milan Sreckovic [:milan] from comment #34)
> (In reply to Jim Mathies [:jimm] from comment #0)
> > I haven't updated this tablet in a while, so this could be a regression over
> > the last few weeks.
> > 
> > str: 
> > 1) surf around a bit to accumulate some history and bookmarks
> > 2) open about:start and scroll
> 
> Jim is this a regression or not?

It's hard to say since so much has changed in apzc land over the last couple months. I think kat hit on something though when he was looking at paint flashing (comment 16 / comment 31). Generally speaking it appears we are doing too much painting which is likely effecting more than just our about:start page. 

If we don't have any free resources in gfx/layout for this I'll take it again and see if we can sort it out.
Blocks: 909752
Whiteboard: [block28] → [beta28]
Whiteboard: [beta28] → [beta28][layout]
Blocks: metrov1backlog, metrov1omtc&apzc
No longer blocks: metro-apzc
Assignee: nobody → jmathies
jim,

We will take you up on that offer it would be great to have some attention on this.
(In reply to Benjamin Kerensa [:bkerensa] from comment #36)
> jim,
> 
> We will take you up on that offer it would be great to have some attention
> on this.

Someone from layout who understands how we decide what to paint when we scroll would be far better. I can look at this worst case but lets try to find someone who knows this code. (Besides I'm currently working on the front end beta blockers we have (45 bugs!) so I can't look at this currently anyway.)
Assignee: jmathies → nobody
Milan will be taking lead on the prioritization and resourcing of Metro's 11 Beta 28 blocking graphic/layout bugs.

(In reply to Benjamin Kerensa [:bkerensa] from comment #36)
> jim,
> 
> We will take you up on that offer it would be great to have some attention
> on this.
Assignee: nobody → milan
Tracked bugs must have owners so based on comment 38 passing this over to Milan
Assignee: milan → tnikkel
Whiteboard: [beta28][layout] → [beta28] [layout]
Looking at this now it looks like there has been a further regression in paint flashing since Nov 1. I see the one pixel bands link comment 17 in a nov 1 build, and I see them in a current nightly build. But I also see the chunks of content that make up about:start (recent sites, top sites, bookmarks) get repainted any time they either begin overlapping the faint watermark firefox logo in the background, or stop overlapping the watermark logo. The watermark logo didn't exist in nov 1 builds, so likely just the introduction of the logo caused this new regression.
Whiteboard: [beta28] [layout] → [metro] [beta28] [layout]
nsDisplayTransform::GetTransform rounds ToReferenceFrame() to the nearest pixel based on the passed in aAppUnitsPerPixel and bakes it into the transform. nsDisplayTransform::GetBounds passes app units per CSS pixels into GetTransform, so we round to the nearest CSS pixel. This causes the bounds to not match up when scrolling and our scroll position is snapped to dev pixels and we have to invalidate.

This fixes the majority of the bad painting happening, but not all of it. (I'm testing with a patch to remove the firefox watermark logo, it was added after this bug was filed. The watermark causes other bad painting behaviour, but it's independent, so can be tackled seperately.)
Attachment #8362197 - Flags: review?(matt.woodrow)
Big improvements not only on about:start but also on web pages like techmeme.com as well. Thanks for working on this Tim!
This also fixes bug 946502 for me.
No longer blocks: 909752
Duplicate of this bug: 909752
This also fixes bug 956690, and it addresses all the "black rect at the bounds" problems I've seen.
(In reply to Jim Mathies [:jimm] from comment #42)
> Big improvements not only on about:start but also on web pages like
> techmeme.com as well. Thanks for working on this Tim!

(In reply to Jim Mathies [:jimm] from comment #45)
> This also fixes bug 956690, and it addresses all the "black rect at the
> bounds" problems I've seen.

Hmm, actually, I think I had omtc and apzc disabled from testing bug 951120. Oops.
So when you test in the correct situation what does it all improve exactly? Just about:home panning?
The other thing to note that app units per dev pixel on my Surface Pro 2 is set to 43 by default on metro. I'm not sure if rounding to something a little less odd (like 45) would be a good idea in general or not.
Comment on attachment 8362197 [details] [diff] [review]
use dev pixels for transform matrices when painting

Review of attachment 8362197 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

I'm pretty sure this covers all the callers of GetTransform(), so we should remove the parameter and just retrieve the factor within the function.
Attachment #8362197 - Flags: review?(matt.woodrow) → review+
Good idea. Here is a patch doing that.

In several places this makes things less obvious because we still need the app units per pixel ratio to use the matrix we get back from GetTransform, so we just have to make sure it's right, and it's not obvious if someone uses the wrong one.
Attachment #8362298 - Flags: review?(matt.woodrow)
Comment on attachment 8362298 [details] [diff] [review]
clean up GetTransform

Review of attachment 8362298 [details] [diff] [review]:
-----------------------------------------------------------------

I think this worth having. We had the problem where you could use the wrong factor before anyway, so it's not much different in that regard.
Attachment #8362298 - Flags: review?(matt.woodrow) → review+
(In reply to Matt Woodrow (:mattwoodrow) from comment #52)
> Comment on attachment 8362298 [details] [diff] [review]
> I think this worth having. We had the problem where you could use the wrong
> factor before anyway, so it's not much different in that regard.

Before if you used the wrong factor you'd get slightly wrong rounded results, now you'd get completely incorrect results. So it's a little different.
That's true. I think that's an improvement then, since it should be easier to catch.
So remaining issues I see here that are not fixed yet. I think followups are appropriate.
-transforms sometimes change bounds by 1 single app unit (due to lack of precision or rounding I'm guessing), causing unneeded invalidation.
-the watermark jitters during panning, I think that is bug 946502.
-the content in front of the watermark gets invalidated when it begins intersecting, or stops intersecting, the watermark
https://hg.mozilla.org/mozilla-central/rev/94ff9aac5ef1
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
(In reply to Timothy Nikkel (:tn) from comment #53)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #52)
> > Comment on attachment 8362298 [details] [diff] [review]
> > I think this worth having. We had the problem where you could use the wrong
> > factor before anyway, so it's not much different in that regard.
> 
> Before if you used the wrong factor you'd get slightly wrong rounded
> results, now you'd get completely incorrect results. So it's a little
> different.

There's definite improvement in about:start panning. We still have some bounds drawing issues and the watermark issue you pointed out which roc has patches for. I think part of the remaining jankiness is perceived due to that watermark problem. I took that out and things felt much smoother.
(In reply to Timothy Nikkel (:tn) from comment #56)
> -transforms sometimes change bounds by 1 single app unit (due to lack of
> precision or rounding I'm guessing), causing unneeded invalidation.
> -the content in front of the watermark gets invalidated when it begins
> intersecting, or stops intersecting, the watermark

Do we need follow ups on these, and would you consider either of them blockers for our rollout?
No longer blocks: metrov1backlog
(In reply to Jim Mathies [:jimm] from comment #59)
> (In reply to Timothy Nikkel (:tn) from comment #56)
> > -transforms sometimes change bounds by 1 single app unit (due to lack of
> > precision or rounding I'm guessing), causing unneeded invalidation.

This one we should really fix, but I don't think we need to block on it, unless we find that it is causing more problems.

> > -the content in front of the watermark gets invalidated when it begins
> > intersecting, or stops intersecting, the watermark

Not sure if roc's patches will affect this or not (haven't tested). We do end up repainting a lot, it only happens half a dozen times during a full pan, but we shouldn't need to be painting at all. If panning perf in about:home is good enough we could get away with it not being a blocker, but it's a lot of useless painting.

> Do we need follow ups on these, and would you consider either of them
> blockers for our rollout?

Yeah, I think so.

(In reply to Jim Mathies [:jimm] from comment #58)
> (In reply to Timothy Nikkel (:tn) from comment #53)
> > (In reply to Matt Woodrow (:mattwoodrow) from comment #52)
> > > Comment on attachment 8362298 [details] [diff] [review]
> > > I think this worth having. We had the problem where you could use the wrong
> > > factor before anyway, so it's not much different in that regard.
> > 
> > Before if you used the wrong factor you'd get slightly wrong rounded
> > results, now you'd get completely incorrect results. So it's a little
> > different.
> 
> There's definite improvement in about:start panning. We still have some
> bounds drawing issues and the watermark issue you pointed out which roc has
> patches for. I think part of the remaining jankiness is perceived due to
> that watermark problem. I took that out and things felt much smoother.

Removing the watermark removes the issue, it's presence is what is causing the over painting. There is the jitteryness issue of the watermark, but also the fact that the presence of the watermark causes us to repaint the content in front of it as we pan.
Timothy - awesome, thanks for the fix - I think Metro team wants these uplifted to Aurora - can you set that up?
Flags: needinfo?(tnikkel)
Comment on attachment 8362197 [details] [diff] [review]
use dev pixels for transform matrices when painting

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been present for quite a while, only came to light with recent metro work
User impact if declined: unneeded painting leading to slower panning in some situations (about:start on metro is one)
Testing completed (on m-c, etc.): m-c for a few days
Risk to taking this patch (and alternatives if risky): should be low risk, the old behaviour was wrong
String or IDL/UUID changes made by this patch: none
Attachment #8362197 - Flags: approval-mozilla-aurora?
Flags: needinfo?(tnikkel)
Comment on attachment 8362298 [details] [diff] [review]
clean up GetTransform

[Approval Request Comment]
This is a refactoring patch on top of the fix patch. It should have no functional change, but we should uplift it as well so the code stays in sync.
Attachment #8362298 - Flags: approval-mozilla-aurora?
Attachment #8362197 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8362298 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Kamil, can you please verify this is fixed?
Flags: needinfo?(kamiljoz)
To verify you need to turn on nglayout.debug.paint_flashing then as you pan about:start the tall 1 or 2 px wide strips should change only sometimes, instead of all the time.
Keywords: verified-aurora
Went through the following verification process using the following builds:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-25-03-02-01-mozilla-central/
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-25-00-40-03-mozilla-aurora/
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/28.0b6/win32/en-US/

I downloaded an older version of the browser and enabled nglayout.debug.paint_flashing as per comment #67. Once enabled, I noticed the wide strips changed color a lot when scrolling through the about:start screen.

- Enabled nglayout.debug.paint_flashing once the browser as been installed using about:config
- Swiped through the about:start screen and ensured that the 1-2 px wide columns didn't change as often as before
- Added a few tiles and than swiped through the about:start screen and ensured that 1-2 px columns didn't change as often as before
- Added several tiles into the "Bookmarks" category and ensured that the 1-2xp column didn't change as often as before
- Used several of the scrolling methods (touch, mouse wheel, scroll bars, mouse)
- Used the above test cases in different variations of snapped view

Hardware used during testing:
- X1 Carbon
- Surface Pro 2

Panning through the about:start is still a bit janky but isn't as bad as original described in comment #0. Talked to :jimm and :tn last night and they're working on other solutions to fix the janky problems under about:start. Under Nightly, the columns are changing a bit more than Aurora/BETA but this is due to the mozilla watermark.
Status: RESOLVED → VERIFIED
Flags: needinfo?(kamiljoz)
OS: Windows 8 Metro → Windows 8.1
You need to log in before you can comment on or make changes to this bug.