Power usage is worse with gfx.webrender.compositor=true on https://news.ycombinator.com/item?id=21750747 on Surface GO
Categories: Core :: Graphics: WebRender, defect, P3
People: Reporter: jrmuizel; Assigned: jrmuizel
References: Blocks 1 open bug
Attachments: 1 file (patch, 1.58 KB)
With gfx.webrender.compositor=true I see 50% GPU usage in Intel Power Gadget.
With gfx.webrender.compositor=false I see more like 25% GPU usage.
CPU usage is also slightly better with it off.
Comment 1•4 years ago
Did some testing (by scrolling HN) on a laptop with Intel Iris 550 and Nvidia GTX 1050:
| GPU | Compositor | Intel Power | CPU util | Intel GPU util |
|---|---|---|---|---|
| Iris 550 | true | 20 W | 30% | 85% |
| Iris 550 | false | 15.5 W | 25% | 75-80% |
| GTX 1050 | true | 13.3 W | 17% | 60-75% |
| GTX 1050 | false | 8.5 W | 17% | 70% |
Comment 2•4 years ago
Also interesting, it seems like much of the GPU usage has moved from the Firefox process to the DWM so that's sort of a good sign.
Comment 3•4 years ago
Another round of testing to compare DWM GPU usage, done by scrolling a wiki page:
| GPU | Compositor | FF GPU | DWM GPU |
|---|---|---|---|
| Iris 550 | false | 40% | 20% |
| Iris 550 | true | 27% | 40% |
| GTX 1050 | false | 25% | 22% |
| GTX 1050 | true | 6% | 52% |
Comment 4•4 years ago
High DWM GPU usage usually indicates overdraw / transparent layers / transparent windows. We should check how these numbers change if we make all surfaces opaque. We should also double-check whether our window itself is treated as opaque. Glenn mentioned something about drawing an opaque rectangle somewhere that might have an effect on this.
Comment 5•4 years ago
I added a number of modes to the example-compositor application (bug #1602992). We can use this as an initial test case to see what power usage looks like in various (fake) scenarios. Once we have reasonable numbers and conclusions there, we can compare to what we see in Gecko and see if / why there is a discrepancy.
Comment 6•4 years ago
One scenario we can test is comparing DC mode when we are just scrolling (no rasterization), and using vsync (the ideal case for DC mode). We can test this with the example application:
Simple mode: `compositor.exe none scroll swap 1920 1080`
DC mode: `compositor.exe native scroll flush 1920 1080`
On the machine I'm testing on (Intel HD530), Intel Power Gadget reports:
Simple mode:
Package Pwr0 ~3.8 W
GPU Util: ~71%
DC mode:
Package Pwr0 ~2.2 W
GPU Util: ~8%
So these numbers look good (idle power usage is ~1.5 W) so far.
We need to check that the comparison method is valid first; if so, we can compare a similar scenario in Gecko.
Comment 7•4 years ago
Further confirming the results above, I compared reported GPU usage in the GPU process and DWM while running the example-compositor application and Gecko on a very simple page at 4k.
Scrolling a simple page in Gecko sees:
- 1.3% GPU usage in GPU process
- 45% GPU usage in DWM
Running the scrolling benchmark in example-compositor sees:
- 1% GPU usage in GPU process
- 4% GPU usage in DWM
So, it does seem that most of the GPU time has been moved to DWM (good!) but that there is a very large amount of GPU time in DWM in Gecko that doesn't occur in the example-compositor application (bad!).
Comment 8•4 years ago
I can confirm that on the Surface Go with compositor.exe I see a massive difference in GPU usage as reported by Power Gadget: 54% (with none) vs 10% (with native).
Package power and DRAM power are also both noticeably lower.
Comment 9•4 years ago
The patch in bug 1603314 definitely helps - it fixes most of the current Talos regressions we saw [1], and anecdotally drops the reported GPU usage quite significantly.
However, it's not clear to me why this is required - all of the background tiles that we supply are already marked as opaque.
We're definitely on the right track, but I'm going to do a bit more investigation and experimenting before landing that patch - maybe there's a more efficient way to express what we want, or a reason that DC isn't already working out it can composite the existing tiles without considering what's behind them.
Comment 10•4 years ago
I've remeasured this and it doesn't seem as obviously worse.
Comment 11•4 years ago
Given that, should we close this? Or do you want to take this bug for now, if you're still investigating?
Comment 12•4 years ago
I'll take it
Comment 13•4 years ago
I had a look at this more today. I was able to turn off the chrome and the scroll bar and Firefox scrolling news.ycombinator.com still used a lot more memory bandwidth than the example compositor. This is actually the reverse of what I'd expect because the example compositor is scrolling a transparent layer. This suggests that further investigation will be fruitful.
Comment 14•4 years ago
I tried applying this patch to make the content of the example compositor more similar to scrolling Firefox, i.e. no transparency. With this patch, memory bandwidth goes up dramatically, which is the opposite of what I'd expect. We should try to understand why.
Interestingly, when you turn on picture cache debugging, we're doing a bunch of tile splitting. It's not obvious to me why.
Comment 15•4 years ago
Taking a look at the unexpected tile splitting, my first hunch is that `prim_clip_rect` keeps changing as we scroll up. For example, on the first frame it goes:
from `PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 0.0, w: 512.0, h: 512.0 } ...`
to `PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 1.0, w: 512.0, h: 511.0 }, ...`
So the tile keeps invalidating, and after 64 frames of that it splits up. That seems to happen for both `none swap` and `native flush`, so it is probably not related to the power differences.
Comment 16•4 years ago
Indeed. I reduced the size of the added opaque rect to avoid the invalidations, and that drastically reduced power usage.
I've since done some more testing and have a better model now for what's going on. Some of it seems obvious in retrospect, but I'm going to write it down anyway.
- The DWM seems to do partial draws if possible. It would be interesting to investigate further to confirm this, but what investigation I did do suggests that if only a small part of the screen is changing, the DWM will not actually composite the entire screen.
- In its default configuration, the example compositor only has a redraw region that is the bounds of the moving rectangles. This is significantly less than the entire window, which explains one large part of the lower memory bandwidth. Power usage went up when the moving rectangles were moved further apart.
- The Intel GPU seems to avoid writing out completely transparent pixels. This was suggested by an experiment that compared the memory bandwidth of two far-apart rectangles with the same two rectangles plus a partially transparent background.
- Making the partially transparent background opaque reduced memory bandwidth usage. It's not completely clear whether this reduction comes from an optimization in the DWM or on the GPU.
- I still see appreciably higher memory bandwidth usage with the non-invalidating, window-sized opaque rect moving than with Firefox scrolling, so it seems there may still be fruit to be picked here.
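For a rough sense of scale on the partial-draw point above, here is a back-of-the-envelope sketch. These are my own illustrative numbers (assuming 4 bytes per pixel at 60 Hz), not measurements from this bug: recompositing a full 1920x1080 window every frame costs roughly 50x the write bandwidth of redrawing only a small dirty region.

```rust
// Illustrative write-bandwidth estimate for a composited window.
// Assumes BGRA8 (4 bytes/pixel) at a 60 Hz refresh rate.
fn write_bandwidth_mb_per_s(width: u32, height: u32, hz: u32) -> f64 {
    (width as f64 * height as f64 * 4.0 * hz as f64) / 1.0e6
}

fn main() {
    // Full-window recomposite every frame:
    let full = write_bandwidth_mb_per_s(1920, 1080, 60);
    // Only a 200x200 dirty region (e.g. the moving rectangles) redrawn:
    let partial = write_bandwidth_mb_per_s(200, 200, 60);
    println!("full: {full:.0} MB/s, partial: {partial:.0} MB/s"); // ~498 vs ~10
}
```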
Comment 17•4 years ago
Firefox seems to have the lowest memory bandwidth usage when scrolling, compared to Chrome and Edge.
Comment 18•4 years ago
(continuing to chase down the invalidations; sorry if this is now tangential)
When we scroll, we create a `SpaceMapper` that's a `CoordinateSpaceMapping::ScaleOffset` with e.g. offset `(0.0, -10.0)`, and pass it as part of `TilePreUpdateContext`.
Then `Tile::pre_update` will map `self.local_tile_rect == Rect(1024.0×512.0 at (0.0,0.0))` with that space mapper into `self.world_tile_rect == Rect(1024.0×512.0 at (0.0,-10.0))`.
Which then gets `union()`ed into a `world_culling_rect` (returned by `tile.pre_update`) and passed into `update_visibility` as part of the recursive call of the `PrimitiveInstanceKind::Picture` case. `world_culling_rect` becomes an input to `build_clip_chain_instance`, and therefore an input like `pic_rect == Rect(512.0×512.0 at (0.0,0.0))` gets clipped to `Rect(512.0×502.0 at (0.0,10.0))`.
So now `clip_chain.pic_clip_rect` changes as we scroll, and is used to initialize a `PrimitiveDependencyInfo`, which goes into `tile.add_prim_dependency`, where it's used to calculate `prim_clip_rect`, which goes into `self.current_descriptor.prims.push(PrimitiveDescriptor ...)`, and hence the descriptors are changing and causing invalidations.
I don't know if this is all working exactly as intended and/or if Gecko does this any differently.
(Edit: tested at 512x512 resolution to reduce tile count.)
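The chain above can be condensed into a toy calculation. The `Rect` type and the coordinate spaces here are hypothetical simplifications (the real `SpaceMapper` / `build_clip_chain_instance` code maps between more spaces, and the resulting origin differs): the point is that intersecting the picture rect with a culling rect translated by the scroll offset yields a clipped rect whose height tracks the offset, so the derived `prim_clip_rect` changes on every scrolled frame. At offset -10 the clipped height is 502, matching the 512.0×502.0 rect quoted above.

```rust
// Simplified rect math; hypothetical Rect type, not WebRender's.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn translate(self, dx: f32, dy: f32) -> Rect {
        Rect { x: self.x + dx, y: self.y + dy, ..self }
    }
    fn intersect(self, o: Rect) -> Rect {
        let x0 = self.x.max(o.x);
        let y0 = self.y.max(o.y);
        let x1 = (self.x + self.w).min(o.x + o.w);
        let y1 = (self.y + self.h).min(o.y + o.h);
        Rect { x: x0, y: y0, w: (x1 - x0).max(0.0), h: (y1 - y0).max(0.0) }
    }
}

fn main() {
    // Numbers from the comment: 1024x512 tile rect, 512x512 picture rect.
    let local_tile_rect = Rect { x: 0.0, y: 0.0, w: 1024.0, h: 512.0 };
    let pic_rect = Rect { x: 0.0, y: 0.0, w: 512.0, h: 512.0 };
    for scroll in [0.0_f32, -10.0, -20.0] {
        // ScaleOffset with offset (0.0, scroll) maps the tile rect into world space.
        let world_culling_rect = local_tile_rect.translate(0.0, scroll);
        // The picture rect clipped against the culling rect varies with scroll,
        // which is what feeds the per-frame descriptor changes.
        let clipped = pic_rect.intersect(world_culling_rect);
        println!("scroll {scroll}: clipped = {clipped:?}");
    }
}
```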
Comment 19•4 years ago
Could you clarify how you're doing the bandwidth measurements, and how they compare to the power and/or GPU usage measurements?
Is there something specific you wanted me to investigate further, given your findings in https://bugzilla.mozilla.org/show_bug.cgi?id=1602803#c16?
Comment 20•4 years ago
I was measuring memory bandwidth using https://github.com/opcm/pcm. I don't think there's anything I need from you right now. I'm going to do some further investigation on Monday.
Comment 21•4 years ago
I looked at this some more and recorded some results here: https://docs.google.com/document/d/10x9N7iw5mPlGKhtfkwehoWrDG7No7BKcR9GEdPTbHI0/edit
Generally things are better or the same. We still have some mystery about why Edge uses less read bandwidth and more write bandwidth. We can look into that when we have time.