770144 - Investigate Android native Talos tcheckerboard regressions from DLBI

Matt Woodrow (:mattwoodrow)

Reporter

Description

•

13 years ago

No description provided.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 1

•

13 years ago

Attached patch Skip invalidating frames that don't have display items (obsolete) — Details — Splinter Review

We're getting a lot of paints while idling on timecube.com from an animated image that is offscreen. This patch lets us skip invalidating (and more importantly, scheduling a paint) if the frame doesn't have any display items. It should always be correct to skip invalidation, but I'm less sure about scheduling a paint. Can't think of a concrete example that it will break though. We might need to move the logic into the callers of InvalidateFrame instead. Unfortunately, this doesn't fix the regression at all though. :(

Matt Woodrow (:mattwoodrow)

Reporter

Comment 2

•

13 years ago

Ok, so looks like the nsDisplayCanvasBackground is being invalidated in it's entirety when scrolling. Some of these are +/- 1 AppUnit differences and look like rounding bugs (we're rounding down by default I think) here: http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#2046 The others are in the 50-100 range, and I have no idea what would cause the difference to be that big. Definitely computing something wrong, since the background doesn't move position relative to the ThebesLayer at all. Pixel snapping maybe? Will sleep on it.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 3

•

13 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #1) > This patch lets us skip invalidating (and more importantly, scheduling a > paint) if the frame doesn't have any display items. But what if the invalidation would result in a display item being created for the frame? I think we should go ahead and schedule a paint, but do as little work in that paint as possible. We should already not be painting any ThebesLayers. I guess the question is whether/how we can skip display list and layer construction. Maybe we could do that if there hasn't been any reflow and scanning the frame tree for invalid frames reveals only frames that are clipped out *and* not inside an SVG filter (since filters can expand the area that needs to be invalidated).

Matt Woodrow (:mattwoodrow)

Reporter

Comment 4

•

13 years ago

How about we add a second InvalidateFrame() function (or a flag passed to it), and then make some of the callers use that? The problem here is nsImageFrame::FrameChanged(). That one in particular should be entirely safe, changing the frame of an animated image will never change whether it's visible or not. We can add the infrastructure and fix this case pretty easily, and then audit the other callers as a separate patch when we have more time.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 5

•

13 years ago

Attached patch Restrict invalidation to the Layer's visible region (obsolete) — Details — Splinter Review

Still waiting on tryserver results to see if this fixes perf regressions. We should take it regardless though.

Attachment #638570 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 6

•

13 years ago

Comment on attachment 638570 [details] [diff] [review] Restrict invalidation to the Layer's visible region r+ assuming you take out the DEBUG_INVALIDATIONS changes

Attachment #638570 - Flags: review?(roc) → review+

Matt Woodrow (:mattwoodrow)

Reporter

Comment 7

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/921c2e13b45e

Phil Ringnalda (:philor)

Comment 8

•

13 years ago

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/9574d7725ec5 for making reftests/text-overflow/scroll-rounding.html unhappy.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 9

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/3175d72d8a40 And again, using the bounds of the visible region since that's what OGL layers paint.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 10

•

13 years ago

Alright, well that didn't really help Talos (but it does help real world performance). We're still repainting the entire page of timecube.com when scrolling because DLBI thinks that the display items have shifted. Looks like this code isn't taking pixel snapping into account: http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#2049 I'm attaching a paste of values and calculations that show that our previous/current display items differ by around 0.3 device pixels, which I believe snaps to 0 and we shouldn't need to repaint. The current code doesn't take this into account and instead sees a difference of 33 app units, causing most display items to be invalidated. http://pastebin.mozilla.org/1689036 I'm not really sure how to fix this, I can't think of an easy way to shift just the oldGeometry such that it will have app units that match, but I might be missing something here. Note that we can't snap the new geometry values since they are calculated inside ComputeInvalidationRegion. One option would be to modify ComputeInvalidationRegion() to take 2 nsDisplayGeometry pointers, and pass both the old and new geometries. Then we could adjust both of them to be relative to the ThebesLayer, and in snapped pixels. This still doesn't take into account the fact that only some display items are snapped. And worse, some that are snapped, aren't marked as such. nsDisplayText::GetBounds is returning false for snapping, but if it wasn't snapped then we'd be expecting it to move slightly while scrolling, right? roc: Do you have any other ideas here?

Whiteboard: [leave open]

Matt Woodrow (:mattwoodrow)

Reporter

Comment 11

•

13 years ago

Note that this only really affects android because of the scale factors involved.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 12

•

13 years ago

I think we should not be taking pixel-snapping into account in any way in the stored display item geometries. So I would not expect to see large error here and I don't know why we are seeing it. Basically everything we work with here should not be snapped except that when we scroll, we are trying to choose scroll values that result in layers moving by an integer number of layer pixels. (Is that working in this case?)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 13

•

13 years ago

Suspected dup of bug 795576.

Depends on: 795576

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 14

•

13 years ago

Er, sorry about that, posted to the wrong bug.

No longer depends on: 795576

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 15

•

13 years ago

The eideticker tests (http://eideticker.wrla.ch) don't seem to be working after about 9/25. I would love to know what they have to say.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 17

•

13 years ago

OK, eideticker results are in! Since there weren't perf results for several days before dlbi landed, we don't know for sure whether dlbi is to be praised or blamed for moving needles, but it's certainly the most significant change in that timeframe. There's a clear regression on taskjs http://eideticker.wrla.ch/videos/video-1348655880.02.webm (http://hg.mozilla.org/mozilla-central/rev/08d435dedc7f) to http://eideticker.wrla.ch/videos/video-1349000905.17.webm (http://hg.mozilla.org/mozilla-central/rev/85f561c755f6) There's also a small regression on nightly-zooming http://eideticker.wrla.ch/videos/video-1348657229.94.webm (http://hg.mozilla.org/mozilla-central/rev/08d435dedc7f) to http://eideticker.wrla.ch/videos/video-1349001533.75.webm (http://hg.mozilla.org/mozilla-central/rev/85f561c755f6) There might be a small regression on imgur, but it's hard to tell from one datapoint. It looks like startup-abouthome-cold was improved, but dlbi shouldn't have affected that.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 18

•

13 years ago

wlach / clint, to better understand the regressions here, Matt would like to see an eideticker run with paint flashing enabled. To so, we need to set the pref "nglayout.debug.paint_flashing" to true. Can you guys help us get this data? Thanks!

William Lachance (:wlach)

Comment 19

•

13 years ago

Sorry about the dashboard being down. I triggered a manual run of the Galaxy Nexus results to get some profiling information to go along with the captures. From my brief glance at what we have so far, it does indeed look like we're spending more time painting in the taskjs benchmark at least. E.g. compare this profile on the 24th (checkerboard 92): http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=profiles/sps-profile-1348512508.01.zip&videoCapture=videos/video-1348512508.01.webm With this one of the 30th (checkerboard 156): http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=profiles/sps-profile-1349039877.12.zip&videoCapture=videos/video-1349039877.11.webm

William Lachance (:wlach)

Comment 20

•

13 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #18) > wlach / clint, to better understand the regressions here, Matt would like to > see an eideticker run with paint flashing enabled. To so, we need to set > the pref "nglayout.debug.paint_flashing" to true. Can you guys help us get > this data? Thanks! Sure, I'd need to rejigger some stuff so it's unfortunately not totally straightforward. I'll do it tomorrow first thing (EST).

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 21

•

13 years ago

OK, eideticker + profiler just blew my mind. Nice work guys!

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 22

•

13 years ago

The profile suggests that we're spending a lot more time painting gradients. That probably means we're over-invalidating the background here.

Benoit Girard (:BenWa)

Comment 23

•

13 years ago

From the looks of it we now do 3 full page invalidation. On a page like task js we're so slow to paint that the first 2 paints don't even make it on screen on time so are completely wasted. Previously the invalidation was such that we did smaller invalidation+paint that would finish on time roughly 4 out 5 times to show up on the screen. I don't think getting paint flashing will be helpful because all we see is the initial and final paint.

William Lachance (:wlach)

Comment 24

•

13 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #18) > wlach / clint, to better understand the regressions here, Matt would like to > see an eideticker run with paint flashing enabled. To so, we need to set > the pref "nglayout.debug.paint_flashing" to true. Can you guys help us get > this data? Thanks! Ok, I just realized maybe I don't completely understand what's being ask here. Do you want a full run of the dsahboard with that pref on? Maybe I should just change eideticker to make that setting the default? Will it affect anything, except maybe the # of unique frames metric? Just in case it's useful, here's two videos, one showing paintflashing on the latest nightly: http://people.mozilla.com/~wlachance/cnn-nightly-paintflashing.webm And one with the last beta: http://people.mozilla.com/~wlachance/cnn-beta-paintflashing.webm I also captured a run of taskjs with paintflashing enabled, but it's pretty boring: http://people.mozilla.com/~wlachance/taskjs-paintflashing.webm BTW, it's quite easy to run eideticker without the video capture portion if you just want to observe what's happening during a test. http://wrla.ch/blog/2012/04/eideticker-with-less-eideticker/

Benoit Girard (:BenWa)

Comment 25

•

13 years ago

(In reply to William Lachance (:wlach) from comment #24) > Just in case it's useful, here's two videos, one showing paintflashing on > the latest nightly: > > http://people.mozilla.com/~wlachance/cnn-nightly-paintflashing.webm > > And one with the last beta: > > http://people.mozilla.com/~wlachance/cnn-beta-paintflashing.webm > This show a significant improvement. I don't see any redundant painting (excluding the touch area eideticker uses to fling).

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 26

•

13 years ago

Bug 795796 might have made a big difference here. Need to re-measure with that applied.

Matt Woodrow (:mattwoodrow)

Reporter

Updated

•

13 years ago

Attachment #638328 - Attachment is obsolete: true

Matt Woodrow (:mattwoodrow)

Reporter

Updated

•

13 years ago

Attachment #638570 - Attachment is obsolete: true

Matt Woodrow (:mattwoodrow)

Reporter

Comment 27

•

13 years ago

Attached patch Use FuzzyEqual to compare scales — Details — Splinter Review

I really have no idea how DLBI changed this, but we are ending up with a marginally different scale set on the layers (difference of about 0.00000003 if I counted right) and throwing all the layer contents away during load. This fixes that and we now only do one full screen paint when loading taskjs.

Attachment #666813 - Flags: review?(roc)

Matt Woodrow (:mattwoodrow)

Reporter

Comment 28

•

13 years ago

And looks like timecube.com is improved too. We still do 2 full screen paints, once without the background image drawn, and then we invalidate an draw with the background. All obvious unnecessary paints are gone though. (In reply to Benoit Girard (:BenWa) from comment #23) > From the looks of it we now do 3 full page invalidation. On a page like task > js we're so slow to paint that the first 2 paints don't even make it on > screen on time so are completely wasted. Previously the invalidation was > such that we did smaller invalidation+paint that would finish on time > roughly 4 out 5 times to show up on the screen. > > I don't think getting paint flashing will be helpful because all we see is > the initial and final paint. Good catch :) Made this much easier to debug

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

13 years ago

Attachment #666813 - Flags: review?(roc) → review+

Matt Woodrow (:mattwoodrow)

Reporter

Comment 29

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/396630ce4035 Leaving this open for now until we know if this fixed the issue.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 30

•

13 years ago

Unfortunately this doesn't appear to have improved 'rck' at all :( wlach: Would it be possible to get the paint flashing enabled runs done again with this changeset applied please. I'm a bit stuck for ideas at the moment, I can't reproduce any obvious unneeded invalidation on taskjs or timecube.

Benoit Girard (:BenWa)

Comment 31

•

13 years ago

Looks like the latest eideticker run didn't include your changes :(

William Lachance (:wlach)

Comment 32

•

13 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #30) > Unfortunately this doesn't appear to have improved 'rck' at all :( > > wlach: Would it be possible to get the paint flashing enabled runs done > again with this changeset applied please. > > I'm a bit stuck for ideas at the moment, I can't reproduce any obvious > unneeded invalidation on taskjs or timecube. Sorry about being slow to reply to this. Which tests did you want me to run with paintflashing enabled? CNN? TaskJs?

Matt Woodrow (:mattwoodrow)

Reporter

Comment 33

•

13 years ago

TaskJS was the site I was testing with locally, so that will be fine. Thanks!

Benoit Girard (:BenWa)

Comment 34

•

13 years ago

Fennec tracking since this is a size-able performance regression :(

tracking-fennec: --- → ?

Brad Lassey [:blassey] (use needinfo?)

Updated

•

13 years ago

tracking-fennec: ? → 18+

William Lachance (:wlach)

Comment 35

•

13 years ago

Uploaded a video with paintflashing enabled: http://people.mozilla.com/~wlachance/taskjs-paintflashing-nightly-oct4.webm Not sure if the video is high quality enough to see the paintflashing, just in case I'm also uploading the full capture (which includes individual frames as pngs inside): http://people.mozilla.com/~wlachance/capture-2012-10-04T18:21:23.836323.zip (to be honest I'm not seeing much here but hopefully it will help you)

Matt Woodrow (:mattwoodrow)

Reporter

Comment 36

•

13 years ago

I'm out of ideas here. I still can't reproduce any unnecessary painting on timecube.com. I've attempted to run talos locally, and can't get it to work. Log: http://pastebin.mozilla.org/1868123 If anyone can reproduce a regression outside of the harness, or even within the harness and wants to help, please let me know. Otherwise, not sure what else I can do.

William Lachance (:wlach)

Comment 38

•

13 years ago

Yeah, talos can be a pain to run for this kind of case. :jmaher, :jhammel, or myself (:wlach) are good people to ask if you're having problems. Offhand I'm not sure what's going on just from your traceback -- a logcat would be really helpful. As an alternative, I managed to reproduce the regression we're seeing on the taskjs benchmark fairly quickly using eideticker in non-video-capture-mode (http://wrla.ch/blog/2012/04/eideticker-with-less-eideticker/) Results before: === Results for 2012-09-24 (b2867d82dcad) === Internal Checkerboard Stats (sum of percents, not percentage): [26.416945000000005, 5.438986999999997, 20.605086] Results after: === Results for 2012-10-01 (beee809b7ade) === Internal Checkerboard Stats (sum of percents, not percentage): [96.474575, 98.755932, 94.79660799999999] If you have a rooted Galaxy Nexus device, it should be fairly easy to reproduce my results (and instrument as you see fit). Or I can try things for you.

William Lachance (:wlach)

Comment 39

•

13 years ago

(In reply to William Lachance (:wlach) from comment #38) > Yeah, talos can be a pain to run for this kind of case. :jmaher, :jhammel, > or myself (:wlach) are good people to ask if you're having problems. Offhand > I'm not sure what's going on just from your traceback -- a logcat would be > really helpful. Actually, since bug 754873 just landed, try updating your talos source and rerunning the test. You should get some logcat information on what went wrong.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 40

•

13 years ago

Thanks wlach! Taskjs is definitely affected by the final patch in bug 795674. I'll hopefully get that landed today. Does that test work for timecube too? I don't have a nexus, only a Galaxy S2, will that work? I'll have a go at running updating talos too.

William Lachance (:wlach)

Comment 41

•

13 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #40) > Thanks wlach! > > Taskjs is definitely affected by the final patch in bug 795674. I'll > hopefully get that landed today. > > Does that test work for timecube too? No, I'd have to add a dedicated timecube test. Which perhaps I should do anyway, as people keep on asking for it... Let me know if it would be helpful. > I don't have a nexus, only a Galaxy S2, will that work? It probably could without too much difficulty, but at present, no. :( You could use the GN on the Mountain View VPN when it's not being used for the dashboard (true for most of the day). Ping me on irc and I can give you details on how to get access and run builds. > I'll have a go at running updating talos too. Might be the easiest route for now.

Matt Woodrow (:mattwoodrow)

Reporter

Comment 42

•

13 years ago

> > No, I'd have to add a dedicated timecube test. Which perhaps I should do > anyway, as people keep on asking for it... > > Let me know if it would be helpful. That would definitely be helpful. The latest talos still doesn't appear to obviously record logcat: Talos output: http://pastebin.mozilla.org/1868813 Logcat: http://pastebin.mozilla.org/1868808

Matt Woodrow (:mattwoodrow)

Reporter

Comment 43

•

13 years ago

Looks like pastebin truncated that, here's the rest: http://pastebin.mozilla.org/1868816

Matt Woodrow (:mattwoodrow)

Reporter

Comment 44

•

13 years ago

The last patch for bug 795674 is on inbound and improved tchk2/3 considerably. I'd expect to see taskjs improved too.

William Lachance (:wlach)

Comment 45

•

13 years ago

I reactivated bug 744724 to add timecube to the eideticker suite.

Depends on: 744724

William Lachance (:wlach)

Comment 46

•

13 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #42) > > > > No, I'd have to add a dedicated timecube test. Which perhaps I should do > > anyway, as people keep on asking for it... > > > > Let me know if it would be helpful. > > That would definitely be helpful. > > The latest talos still doesn't appear to obviously record logcat: > > Talos output: http://pastebin.mozilla.org/1868813 > Logcat: http://pastebin.mozilla.org/1868808 Not really sure what's going on, aside from logcat not working because the su -c logcat command didn't take. Any ideas :jmaher? (background: :mattwoodrow is trying to run talos locally to debug some dlbi regressions...)

Joel Maher ( :jmaher ) (UTC -8)

Comment 47

•

13 years ago

logcat is working great in inbound. Please link me to a log, maybe there is a corner case where it doesn't work. Here is an example of one from about 2 hours ago: https://tbpl.mozilla.org/php/getParsedLog.php?id=16214252&tree=Mozilla-Inbound&full=1

William Lachance (:wlach)

Comment 48

•

13 years ago

I added a timecube test today. One result with today's build is here: http://eideticker.wrla.ch/#/samsung-gn/timecube/checkerboard (checkerboard value of 276) Here's results before the DLBI commit. I'm almost going to gather data on what we had after it (Oct 1st) === Results for 2012-09-24 (b2867d82dcad) === Number of unique frames: [339, 333, 315] Average number of unique frames per second: [41.34146341463415, 41.28099173553719, 39.29313929313929] Checkerboard area/duration (sum of percents NOT percentage): [70.49449673583337, 37.39482705955063, 79.4987831300482]

William Lachance (:wlach)

Comment 49

•

13 years ago

Results for October 1st. === Results for 2012-10-01 (beee809b7ade) === Number of unique frames: [275, 277, 281] Average number of unique frames per second: [34.375, 34.69728601252609, 34.9792531120332] Checkerboard area/duration (sum of percents NOT percentage): [278.4688778228467, 286.52152787150715, 266.38470707745233]

Matt Woodrow (:mattwoodrow)

Reporter

Comment 50

•

13 years ago

Thanks for doing those! Interesting, so looks like the same regression shows up there. Are these done in non-video-capture-mode? (ie, is it something I can repeat locally?) Also, would it be possible to get paint flashing enabled for the before/after runs?

William Lachance (:wlach)

Comment 51

•

13 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #50) > Thanks for doing those! > > Interesting, so looks like the same regression shows up there. > > Are these done in non-video-capture-mode? (ie, is it something I can repeat > locally?) Those were done with video capture, but I have found that the non-video-capture-mode tends to produce roughly the same checkerboarding results (give or take). > Also, would it be possible to get paint flashing enabled for the > before/after runs? Not sure how much good these will do but: http://people.mozilla.com/~wlachance/timecube-sept24.webm http://people.mozilla.com/~wlachance/timecube-oct1.webm

William Lachance (:wlach)

Comment 52

•

13 years ago

Getting timecube profiles pre/post regression. Profile of timecube from Sept 24: http://people.mozilla.com/~wlachance/profile-org.mozilla.fennec-1350670831.zip Profile of timecube from Oct 1: http://people.mozilla.org/~wlachance/profile-org.mozilla.fennec-1350670283.zip

William Lachance (:wlach)

Comment 53

•

13 years ago

(In reply to William Lachance (:wlach) from comment #52) > Getting timecube profiles pre/post regression. Oops, those are wrong. The urls should be reversed: Profile of timecube from Sept 24: http://people.mozilla.org/~wlachance/profile-org.mozilla.fennec-1350670283.zip Profile of timecube from Oct 1: http://people.mozilla.com/~wlachance/profile-org.mozilla.fennec-1350670831.zip

Benoit Girard (:BenWa)

Comment 54

•

13 years ago

These profiles can be open by using the 'Import package' button. You may need addr2line/arm-eabi-addr2line in your PATH

Brad Lassey [:blassey] (use needinfo?)

Comment 55

•

13 years ago

What are the next steps here?

Flags: needinfo?(matt.woodrow)

Brad Lassey [:blassey] (use needinfo?)

Comment 57

•

13 years ago

(In reply to Brad Lassey [:blassey] from comment #55) > What are the next steps here? Matt?

Matt Woodrow (:mattwoodrow)

Reporter

Comment 58

•

13 years ago

Sorry, forgot about this. I'm not sure really, we weren't able to find any evidence of over-invalidation. On top of that, these results have changed hugely with the landing of progressive painting, so I don't think this is really an issue any more.

Flags: needinfo?(matt.woodrow)

Brad Lassey [:blassey] (use needinfo?)

Updated

•

13 years ago

tracking-fennec: 18+ → -

Brad Lassey [:blassey] (use needinfo?)

Comment 59

•

13 years ago

Matt, do you want to close this out?

Flags: needinfo?(matt.woodrow)

Matt Woodrow (:mattwoodrow)

Reporter

Updated

•

13 years ago

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → WORKSFORME

Matt Woodrow (:mattwoodrow)

Reporter

Updated

•

13 years ago

Flags: needinfo?(matt.woodrow)

Skip invalidating frames that don't have display items 13 years ago Matt Woodrow (:mattwoodrow) 7.17 KB, patch		Details \| Diff \| Splinter Review
Restrict invalidation to the Layer's visible region 13 years ago Matt Woodrow (:mattwoodrow) 2.00 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Use FuzzyEqual to compare scales 13 years ago Matt Woodrow (:mattwoodrow) 2.10 KB, patch	roc : review+	Details \| Diff \| Splinter Review