Closed Bug 770144 Opened 12 years ago Closed 12 years ago

Investigate Android native Talos tcheckerboard regressions from DLBI

Categories

(Core :: Layout, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
Tracking Status
fennec - ---

People

(Reporter: mattwoodrow, Unassigned)

References

Details

(Whiteboard: [leave open])

Attachments

(1 file, 2 obsolete files)

      No description provided.
We're getting a lot of paints while idling on timecube.com from an animated image that is offscreen.

This patch lets us skip invalidating (and more importantly, scheduling a paint) if the frame doesn't have any display items.

It should always be correct to skip invalidation, but I'm less sure about scheduling a paint. Can't think of a concrete example that it will break though. We might need to move the logic into the callers of InvalidateFrame instead.

Unfortunately, this doesn't fix the regression at all though. :(
Ok, so looks like the nsDisplayCanvasBackground is being invalidated in it's entirety when scrolling.

Some of these are +/- 1 AppUnit differences and look like rounding bugs (we're rounding down by default I think) here:

http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#2046

The others are in the 50-100 range, and I have no idea what would cause the difference to be that big. Definitely computing something wrong, since the background doesn't move position relative to the ThebesLayer at all.

Pixel snapping maybe?

Will sleep on it.
(In reply to Matt Woodrow (:mattwoodrow) from comment #1)
> This patch lets us skip invalidating (and more importantly, scheduling a
> paint) if the frame doesn't have any display items.

But what if the invalidation would result in a display item being created for the frame?

I think we should go ahead and schedule a paint, but do as little work in that paint as possible. We should already not be painting any ThebesLayers. I guess the question is whether/how we can skip display list and layer construction. Maybe we could do that if there hasn't been any reflow and scanning the frame tree for invalid frames reveals only frames that are clipped out *and* not inside an SVG filter (since filters can expand the area that needs to be invalidated).
How about we add a second InvalidateFrame() function (or a flag passed to it), and then make some of the callers use that?

The problem here is nsImageFrame::FrameChanged(). That one in particular should be entirely safe, changing the frame of an animated image will never change whether it's visible or not.

We can add the infrastructure and fix this case pretty easily, and then audit the other callers as a separate patch when we have more time.
Still waiting on tryserver results to see if this fixes perf regressions. We should take it regardless though.
Attachment #638570 - Flags: review?(roc)
Comment on attachment 638570 [details] [diff] [review]
Restrict invalidation to the Layer's visible region

r+ assuming you take out the DEBUG_INVALIDATIONS changes
Attachment #638570 - Flags: review?(roc) → review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/9574d7725ec5 for making reftests/text-overflow/scroll-rounding.html unhappy.
https://hg.mozilla.org/integration/mozilla-inbound/rev/3175d72d8a40

And again, using the bounds of the visible region since that's what OGL layers paint.
Alright, well that didn't really help Talos (but it does help real world performance).

We're still repainting the entire page of timecube.com when scrolling because DLBI thinks that the display items have shifted.

Looks like this code isn't taking pixel snapping into account: http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#2049

I'm attaching a paste of values and calculations that show that our previous/current display items differ by around 0.3 device pixels, which I believe snaps to 0 and we shouldn't need to repaint.

The current code doesn't take this into account and instead sees a difference of 33 app units, causing most display items to be invalidated.

http://pastebin.mozilla.org/1689036

I'm not really sure how to fix this, I can't think of an easy way to shift just the oldGeometry such that it will have app units that match, but I might be missing something here. Note that we can't snap the new geometry values since they are calculated inside ComputeInvalidationRegion.

One option would be to modify ComputeInvalidationRegion() to take 2 nsDisplayGeometry pointers, and pass both the old and new geometries. Then we could adjust both of them to be relative to the ThebesLayer, and in snapped pixels.

This still doesn't take into account the fact that only some display items are snapped. And worse, some that are snapped, aren't marked as such. nsDisplayText::GetBounds is returning false for snapping, but if it wasn't snapped then we'd be expecting it to move slightly while scrolling, right?

roc: Do you have any other ideas here?
Whiteboard: [leave open]
Note that this only really affects android because of the scale factors involved.
I think we should not be taking pixel-snapping into account in any way in the stored display item geometries. So I would not expect to see large error here and I don't know why we are seeing it.

Basically everything we work with here should not be snapped except that when we scroll, we are trying to choose scroll values that result in layers moving by an integer number of layer pixels. (Is that working in this case?)
Er, sorry about that, posted to the wrong bug.
No longer depends on: 795576
The eideticker tests (http://eideticker.wrla.ch) don't seem to be working after about 9/25.  I would love to know what they have to say.
OK, eideticker results are in!

Since there weren't perf results for several days before dlbi landed, we don't know for sure whether dlbi is to be praised or blamed for moving needles, but it's certainly the most significant change in that timeframe.

There's a clear regression on taskjs

http://eideticker.wrla.ch/videos/video-1348655880.02.webm (http://hg.mozilla.org/mozilla-central/rev/08d435dedc7f)

to

http://eideticker.wrla.ch/videos/video-1349000905.17.webm (http://hg.mozilla.org/mozilla-central/rev/85f561c755f6)

There's also a small regression on nightly-zooming

http://eideticker.wrla.ch/videos/video-1348657229.94.webm (http://hg.mozilla.org/mozilla-central/rev/08d435dedc7f)

to

http://eideticker.wrla.ch/videos/video-1349001533.75.webm (http://hg.mozilla.org/mozilla-central/rev/85f561c755f6)

There might be a small regression on imgur, but it's hard to tell from one datapoint.

It looks like startup-abouthome-cold was improved, but dlbi shouldn't have affected that.
wlach / clint, to better understand the regressions here, Matt would like to see an eideticker run with paint flashing enabled.  To so, we need to set the pref "nglayout.debug.paint_flashing" to true.  Can you guys help us get this data?  Thanks!
Sorry about the dashboard being down. I triggered a manual run of the Galaxy Nexus results to get some profiling information to go along with the captures. From my brief glance at what we have so far, it does indeed look like we're spending more time painting in the taskjs benchmark at least.

E.g. compare this profile on the 24th (checkerboard 92): http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=profiles/sps-profile-1348512508.01.zip&videoCapture=videos/video-1348512508.01.webm

With this one of the 30th (checkerboard 156): http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=profiles/sps-profile-1349039877.12.zip&videoCapture=videos/video-1349039877.11.webm
(In reply to Chris Jones [:cjones] [:warhammer] from comment #18)
> wlach / clint, to better understand the regressions here, Matt would like to
> see an eideticker run with paint flashing enabled.  To so, we need to set
> the pref "nglayout.debug.paint_flashing" to true.  Can you guys help us get
> this data?  Thanks!

Sure, I'd need to rejigger some stuff so it's unfortunately not totally straightforward. I'll do it tomorrow first thing (EST).
OK, eideticker + profiler just blew my mind.  Nice work guys!
The profile suggests that we're spending a lot more time painting gradients.  That probably means we're over-invalidating the background here.
From the looks of it we now do 3 full page invalidation. On a page like task js we're so slow to paint that the first 2 paints don't even make it on screen on time so are completely wasted. Previously the invalidation was such that we did smaller invalidation+paint that would finish on time roughly 4 out 5 times to show up on the screen.

I don't think getting paint flashing will be helpful because all we see is the initial and final paint.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #18)
> wlach / clint, to better understand the regressions here, Matt would like to
> see an eideticker run with paint flashing enabled.  To so, we need to set
> the pref "nglayout.debug.paint_flashing" to true.  Can you guys help us get
> this data?  Thanks!

Ok, I just realized maybe I don't completely understand what's being ask here. Do you want a full run of the dsahboard with that pref on? Maybe I should just change eideticker to make that setting the default? Will it affect anything, except maybe the # of unique frames metric?

Just in case it's useful, here's two videos, one showing paintflashing on the latest nightly:

http://people.mozilla.com/~wlachance/cnn-nightly-paintflashing.webm

And one with the last beta:

http://people.mozilla.com/~wlachance/cnn-beta-paintflashing.webm

I also captured a run of taskjs with paintflashing enabled, but it's pretty boring:

http://people.mozilla.com/~wlachance/taskjs-paintflashing.webm

BTW, it's quite easy to run eideticker without the video capture portion if you just want to observe what's happening during a test. http://wrla.ch/blog/2012/04/eideticker-with-less-eideticker/
(In reply to William Lachance (:wlach) from comment #24)
> Just in case it's useful, here's two videos, one showing paintflashing on
> the latest nightly:
> 
> http://people.mozilla.com/~wlachance/cnn-nightly-paintflashing.webm
> 
> And one with the last beta:
> 
> http://people.mozilla.com/~wlachance/cnn-beta-paintflashing.webm
> 

This show a significant improvement. I don't see any redundant painting (excluding the touch area eideticker uses to fling).
Bug 795796 might have made a big difference here.  Need to re-measure with that applied.
Attachment #638328 - Attachment is obsolete: true
Attachment #638570 - Attachment is obsolete: true
I really have no idea how DLBI changed this, but we are ending up with a marginally different scale set on the layers (difference of about 0.00000003 if I counted right) and throwing all the layer contents away during load.

This fixes that and we now only do one full screen paint when loading taskjs.
Attachment #666813 - Flags: review?(roc)
And looks like timecube.com is improved too. We still do 2 full screen paints, once without the background image drawn, and then we invalidate an draw with the background. All obvious unnecessary paints are gone though.


(In reply to Benoit Girard (:BenWa) from comment #23)
> From the looks of it we now do 3 full page invalidation. On a page like task
> js we're so slow to paint that the first 2 paints don't even make it on
> screen on time so are completely wasted. Previously the invalidation was
> such that we did smaller invalidation+paint that would finish on time
> roughly 4 out 5 times to show up on the screen.
> 
> I don't think getting paint flashing will be helpful because all we see is
> the initial and final paint.

Good catch :)

Made this much easier to debug
https://hg.mozilla.org/mozilla-central/rev/396630ce4035

Leaving this open for now until we know if this fixed the issue.
Unfortunately this doesn't appear to have improved 'rck' at all :(

wlach: Would it be possible to get the paint flashing enabled runs done again with this changeset applied please.

I'm a bit stuck for ideas at the moment, I can't reproduce any obvious unneeded invalidation on taskjs or timecube.
Looks like the latest eideticker run didn't include your changes :(
(In reply to Matt Woodrow (:mattwoodrow) from comment #30)
> Unfortunately this doesn't appear to have improved 'rck' at all :(
> 
> wlach: Would it be possible to get the paint flashing enabled runs done
> again with this changeset applied please.
> 
> I'm a bit stuck for ideas at the moment, I can't reproduce any obvious
> unneeded invalidation on taskjs or timecube.

Sorry about being slow to reply to this.

Which tests did you want me to run with paintflashing enabled? CNN? TaskJs?
TaskJS was the site I was testing with locally, so that will be fine. Thanks!
Fennec tracking since this is a size-able performance regression :(
tracking-fennec: --- → ?
tracking-fennec: ? → 18+
Uploaded a video with paintflashing enabled:

http://people.mozilla.com/~wlachance/taskjs-paintflashing-nightly-oct4.webm

Not sure if the video is high quality enough to see the paintflashing, just in case I'm also uploading the full capture (which includes individual frames as pngs inside):

http://people.mozilla.com/~wlachance/capture-2012-10-04T18:21:23.836323.zip

(to be honest I'm not seeing much here but hopefully it will help you)
I'm out of ideas here.

I still can't reproduce any unnecessary painting on timecube.com.

I've attempted to run talos locally, and can't get it to work.

Log: http://pastebin.mozilla.org/1868123

If anyone can reproduce a regression outside of the harness, or even within the harness and wants to help, please let me know.

Otherwise, not sure what else I can do.
Yeah, talos can be a pain to run for this kind of case. :jmaher, :jhammel, or myself (:wlach) are good people to ask if you're having problems. Offhand I'm not sure what's going on just from your traceback -- a logcat would be really helpful.

As an alternative, I managed to reproduce the regression we're seeing on the taskjs benchmark fairly quickly using eideticker in non-video-capture-mode (http://wrla.ch/blog/2012/04/eideticker-with-less-eideticker/)

Results before:

=== Results for 2012-09-24 (b2867d82dcad) ===
  Internal Checkerboard Stats (sum of percents, not percentage):
  [26.416945000000005, 5.438986999999997, 20.605086]

Results after:

=== Results for 2012-10-01 (beee809b7ade) ===
  Internal Checkerboard Stats (sum of percents, not percentage):
  [96.474575, 98.755932, 94.79660799999999]

If you have a rooted Galaxy Nexus device, it should be fairly easy to reproduce my results (and instrument as you see fit). Or I can try things for you.
(In reply to William Lachance (:wlach) from comment #38)
> Yeah, talos can be a pain to run for this kind of case. :jmaher, :jhammel,
> or myself (:wlach) are good people to ask if you're having problems. Offhand
> I'm not sure what's going on just from your traceback -- a logcat would be
> really helpful.

Actually, since bug 754873 just landed, try updating your talos source and rerunning the test. You should get some logcat information on what went wrong.
Thanks wlach!

Taskjs is definitely affected by the final patch in bug 795674. I'll hopefully get that landed today.

Does that test work for timecube too?

I don't have a nexus, only a Galaxy S2, will that work?

I'll have a go at running updating talos too.
(In reply to Matt Woodrow (:mattwoodrow) from comment #40)
> Thanks wlach!
> 
> Taskjs is definitely affected by the final patch in bug 795674. I'll
> hopefully get that landed today.
> 
> Does that test work for timecube too?

No, I'd have to add a dedicated timecube test. Which perhaps I should do anyway, as people keep on asking for it...

Let me know if it would be helpful.

> I don't have a nexus, only a Galaxy S2, will that work?

It probably could without too much difficulty, but at present, no. :(

You could use the GN on the Mountain View VPN when it's not being used for the dashboard (true for most of the day). Ping me on irc and I can give you details on how to get access and run builds.

> I'll have a go at running updating talos too.

Might be the easiest route for now.
> 
> No, I'd have to add a dedicated timecube test. Which perhaps I should do
> anyway, as people keep on asking for it...
> 
> Let me know if it would be helpful.

That would definitely be helpful.

The latest talos still doesn't appear to obviously record logcat:

Talos output: http://pastebin.mozilla.org/1868813
Logcat: http://pastebin.mozilla.org/1868808
Looks like pastebin truncated that, here's the rest:

http://pastebin.mozilla.org/1868816
The last patch for bug 795674 is on inbound and improved tchk2/3 considerably. I'd expect to see taskjs improved too.
I reactivated bug 744724 to add timecube to the eideticker suite.
Depends on: 744724
(In reply to Matt Woodrow (:mattwoodrow) from comment #42)
> > 
> > No, I'd have to add a dedicated timecube test. Which perhaps I should do
> > anyway, as people keep on asking for it...
> > 
> > Let me know if it would be helpful.
> 
> That would definitely be helpful.
> 
> The latest talos still doesn't appear to obviously record logcat:
> 
> Talos output: http://pastebin.mozilla.org/1868813
> Logcat: http://pastebin.mozilla.org/1868808

Not really sure what's going on, aside from logcat not working because the su -c logcat command didn't take. Any ideas :jmaher? (background: :mattwoodrow is trying to run talos locally to debug some dlbi regressions...)
logcat is working great in inbound.  Please link me to a log, maybe there is a corner case where it doesn't work.

Here is an example of one from about 2 hours ago:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16214252&tree=Mozilla-Inbound&full=1
I added a timecube test today. One result with today's build is here:

http://eideticker.wrla.ch/#/samsung-gn/timecube/checkerboard

(checkerboard value of 276)

Here's results before the DLBI commit. I'm almost going to gather data on what we had after it (Oct 1st)

=== Results for 2012-09-24 (b2867d82dcad) ===
  Number of unique frames:
  [339, 333, 315]

  Average number of unique frames per second:
  [41.34146341463415, 41.28099173553719, 39.29313929313929]

  Checkerboard area/duration (sum of percents NOT percentage):
  [70.49449673583337, 37.39482705955063, 79.4987831300482]
Results for October 1st.

=== Results for 2012-10-01 (beee809b7ade) ===
  Number of unique frames:
  [275, 277, 281]

  Average number of unique frames per second:
  [34.375, 34.69728601252609, 34.9792531120332]

  Checkerboard area/duration (sum of percents NOT percentage):
  [278.4688778228467, 286.52152787150715, 266.38470707745233]
Thanks for doing those!

Interesting, so looks like the same regression shows up there.

Are these done in non-video-capture-mode? (ie, is it something I can repeat locally?)

Also, would it be possible to get paint flashing enabled for the before/after runs?
(In reply to Matt Woodrow (:mattwoodrow) from comment #50)
> Thanks for doing those!
> 
> Interesting, so looks like the same regression shows up there.
> 
> Are these done in non-video-capture-mode? (ie, is it something I can repeat
> locally?)

Those were done with video capture, but I have found that the non-video-capture-mode tends to produce roughly the same checkerboarding results (give or take).

> Also, would it be possible to get paint flashing enabled for the
> before/after runs?

Not sure how much good these will do but:

http://people.mozilla.com/~wlachance/timecube-sept24.webm
http://people.mozilla.com/~wlachance/timecube-oct1.webm
Getting timecube profiles pre/post regression.

Profile of timecube from Sept 24:

http://people.mozilla.com/~wlachance/profile-org.mozilla.fennec-1350670831.zip

Profile of timecube from Oct 1:

http://people.mozilla.org/~wlachance/profile-org.mozilla.fennec-1350670283.zip
(In reply to William Lachance (:wlach) from comment #52)
> Getting timecube profiles pre/post regression.

Oops, those are wrong. The urls should be reversed:

Profile of timecube from Sept 24:

http://people.mozilla.org/~wlachance/profile-org.mozilla.fennec-1350670283.zip

Profile of timecube from Oct 1:

http://people.mozilla.com/~wlachance/profile-org.mozilla.fennec-1350670831.zip
These profiles can be open by using the 'Import package' button. You may need addr2line/arm-eabi-addr2line in your PATH
What are the next steps here?
Flags: needinfo?(matt.woodrow)
(In reply to Brad Lassey [:blassey] from comment #55)
> What are the next steps here?

Matt?
Sorry, forgot about this.

I'm not sure really, we weren't able to find any evidence of over-invalidation. On top of that, these results have changed hugely with the landing of progressive painting, so I don't think this is really an issue any more.
Flags: needinfo?(matt.woodrow)
tracking-fennec: 18+ → -
Matt, do you want to close this out?
Flags: needinfo?(matt.woodrow)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(matt.woodrow)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: