Closed Bug 1602803 Opened 4 years ago Closed 4 years ago

Power usage is worse with gfx.webrender.compositor=true on https://news.ycombinator.com/item?id=21750747 on Surface GO

Categories

(Core :: Graphics: WebRender, defect, P3)

Status

RESOLVED FIXED

People

(Reporter: jrmuizel, Assigned: jrmuizel)

References

(Blocks 1 open bug)

Attachments

(1 file)

With gfx.webrender.compositor=true I see 50% GPU usage in Intel Power Gadget.
With gfx.webrender.compositor=false I see more like 25% GPU usage.

CPU usage is also slightly better with it off.

Summary: Power usage is worse with gfx.webrender.compositor on https://news.ycombinator.com/item?id=21750747 on Surface GO → Power usage is worse with gfx.webrender.compositor=true on https://news.ycombinator.com/item?id=21750747 on Surface GO
Blocks: wr-73
Priority: -- → P3

Did some testing (by scrolling HN) on a laptop with Intel Iris 550 and Nvidia GTX 1050:

GPU        Compositor   Intel power   CPU util   Intel GPU util
Iris 550   true         20 W          30%        85%
Iris 550   false        15.5 W        25%        75-80%
GTX 1050   true         13.3 W        17%        60-75%
GTX 1050   false        8.5 W         17%        70%

Also interesting: it seems like much of the GPU usage has moved from the Firefox process to the DWM, which is sort of a good sign.

Another round of testing to compare DWM GPU usage, done by scrolling a wiki page:

GPU        Compositor   Firefox GPU   DWM GPU
Iris 550   false        40%           20%
Iris 550   true         27%           40%
GTX 1050   false        25%           22%
GTX 1050   true         6%            52%

High DWM GPU usage usually indicates overdraw / transparent layers / transparent windows. We should check how these numbers change if we make all surfaces opaque. We should also double-check whether our window itself is treated as opaque. Glenn mentioned something about drawing an opaque rectangle somewhere that might have an effect on this.
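To make the overdraw intuition concrete, here's a toy cost model (hypothetical types; not WebRender or DWM code): a surface the compositor knows is opaque only needs its own pixels written, while a (semi-)transparent surface also forces the destination underneath it to be read for blending, roughly doubling the memory traffic per covered pixel.

// Toy model only; assumed names, not real compositor APIs.
#[derive(Clone, Copy)]
struct Surface {
    width: u32,
    height: u32,
    is_opaque: bool,
}

// Rough pixel-traffic estimate: opaque surfaces are write-only, transparent
// ones also read the destination underneath for blending.
fn composite_cost_in_pixels(surfaces: &[Surface]) -> u64 {
    surfaces
        .iter()
        .map(|s| {
            let pixels = s.width as u64 * s.height as u64;
            if s.is_opaque { pixels } else { 2 * pixels }
        })
        .sum()
}

fn main() {
    let transparent = vec![Surface { width: 1920, height: 1080, is_opaque: false }; 4];
    let opaque = vec![Surface { width: 1920, height: 1080, is_opaque: true }; 4];
    println!(
        "transparent stack: {} px touched, opaque stack: {} px touched",
        composite_cost_in_pixels(&transparent),
        composite_cost_in_pixels(&opaque)
    );
}

If our tiles or the window are being treated as transparent, the DWM would pay that extra blending cost across the whole window every frame, which would be consistent with the high DWM numbers above.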

Assignee: nobody → gwatson
Depends on: 1602992

I added a number of modes to the example-compositor application (bug #1602992). We can use this as an initial test case to see what power usage looks like in various (fake) scenarios. Once we have reasonable numbers and conclusions there, we can compare to what we see in Gecko and see if / why there is a discrepancy.

One scenario we can test is comparing DC mode when we are just scrolling (no rasterization), and using vsync (the ideal case for DC mode). We can test this with the example application:

Simple mode: compositor.exe none scroll swap 1920 1080
DC mode: compositor.exe native scroll flush 1920 1080

On the machine I'm testing on (Intel HD530), Intel Power Gadget reports:

Simple mode:
  Package Pwr0: ~3.8 W
  GPU util:     ~71%

DC mode:
  Package Pwr0: ~2.2 W
  GPU util:     ~8%

So these numbers look good (idle power usage is ~1.5 W) so far.

We need to check whether the comparison method is valid first. If so, we can then compare a similar scenario in Gecko.

Further confirming the results above, I compared reported GPU usage in the GPU process and DWM while running the example-compositor application and Gecko on a very simple page at 4k.

Scrolling a simple page in Gecko sees:

  • 1.3% GPU usage in GPU process
  • 45% GPU usage in DWM

Running the scrolling benchmark in example-compositor sees:

  • 1% GPU usage in GPU process
  • 4% GPU usage in DWM

So, it does seem that most of the GPU time has been moved to DWM (good!) but that there is a very large amount of GPU time in DWM in Gecko that doesn't occur in the example-compositor application (bad!).

I can confirm that on the Surface GO with compositor.exe I see a massive difference in GPU usage as reported by Power Gadget: 54% (with none) vs. 10% (with native).

Package power and DRAM power are also both noticeably lower.

Depends on: 1603314

The patch in bug 1603314 definitely helps: it fixes most of the current Talos regressions we saw [1], and anecdotally it drops the reported GPU usage quite significantly.

However, it's not clear to me why this is required - all of the background tiles that we supply are already marked as opaque.

We're definitely on the right track, but I'm going to do a bit more investigation and experimenting before landing that patch. Maybe there's a more efficient way to express what we want, or a reason that DC isn't already working out that it can composite the existing tiles without considering what's behind them.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=188ac70dd800ae0dfbd4b24ff1682bdaec3bf228&framework=1&selectedTimeRange=172800

Blocks: wr-74
No longer blocks: wr-73

I've remeasured this and it doesn't seem as obviously worse.

Given that, should we close this? Or do you want to take this bug for now, if you're still investigating?

Flags: needinfo?(jmuizelaar)

I'll take it

Assignee: gwatson → jmuizelaar
Flags: needinfo?(jmuizelaar)

I had another look at this today. I was able to turn off the browser chrome and the scroll bar, and Firefox scrolling news.ycombinator.com still used a lot more memory bandwidth than the example compositor. This is actually the reverse of what I'd expect, because the example compositor is scrolling a transparent layer. This suggests that further investigation will be fruitful.

Attached patch full-scroll (Splinter Review)

I tried applying this patch to make the content of the example compositor more similar to scrolling Firefox, i.e. no transparency. With this patch, memory bandwidth goes up dramatically, which is the opposite of what I'd expect. We should try to understand why.

Interestingly, when you turn on picture cache debugging we're doing a bunch of tile splitting. It's not obvious to me why.

Flags: needinfo?(gwatson)
Flags: needinfo?(bpeers)

Taking a look at the unexpected tile splitting, my first hunch is that prim_clip_rect keeps changing as we scroll up. For example, on the first frame it goes

from: PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 0.0, w: 512.0, h: 512.0 } ...
to:   PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 1.0, w: 512.0, h: 511.0 }, ...

So the tile keeps invalidating, and after 64 frames of that it splits up.
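As a minimal sketch of that mechanism (hypothetical, simplified types; the real picture-cache code tracks much more state than this): the tile compares this frame's list of primitive descriptors against last frame's, and any difference, including a clip rect shifted by a single pixel, dirties the tile, so a scroll that changes prim_clip_rect every frame invalidates the tile every frame.

#[derive(PartialEq, Clone, Copy, Debug)]
struct RectKey { x: f32, y: f32, w: f32, h: f32 }

#[derive(PartialEq, Clone, Copy, Debug)]
struct PrimDescriptor {
    prim_uid: u64,
    origin: (f32, f32),
    prim_clip_rect: RectKey,
}

struct Tile {
    prev_descriptors: Vec<PrimDescriptor>,
    consecutive_invalidations: u32,
}

impl Tile {
    // Any change in the descriptor list (e.g. prim_clip_rect moving by 1px)
    // marks the tile dirty; many dirty frames in a row is what eventually
    // triggers the splitting seen with picture cache debugging enabled.
    fn update(&mut self, current: Vec<PrimDescriptor>) -> bool {
        let dirty = current != self.prev_descriptors;
        self.consecutive_invalidations =
            if dirty { self.consecutive_invalidations + 1 } else { 0 };
        self.prev_descriptors = current;
        dirty
    }
}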

That seems to happen for both "none swap" and "native flush", so it's probably not related to the power differences.

Flags: needinfo?(bpeers)

Indeed. I reduced the size of the added opaque rect to avoid the invalidations happening, and that drastically reduced power usage.

I've since done some more testing and now have a better model for what's going on. Some of it seems obvious in retrospect, but I'm going to write it down anyway.

  1. The DWM seems to do partial draws if possible. It would be interesting to do more investigation to confirm this, but what investigation I did do suggests that if only a small part of the screen is changing, the DWM will not actually composite the entire screen.

  2. In its default configuration the example compositor only has a redraw region that is the bounds of the moving rectangles. This is significantly less than the entire window, which explains one large part of the lower memory bandwidth; power usage went up when the moving rectangles were moved further apart. (See the sketch after this list.)

  3. The Intel GPU seems to avoid writing out completely transparent pixels. This was suggested by an experiment that compared the memory bandwidth of the two far-apart rectangles with that of the same two rectangles plus a partially transparent background.

  4. Making the partially transparent background opaque reduced the memory bandwidth usage. It's not completely clear whether this reduction comes from an optimization in the DWM or on the GPU.

  5. I still see appreciably higher memory bandwidth usage with a non-invalidating, window-sized opaque rect moving than with Firefox scrolling, so it seems like there may still be fruit to be picked here.

  6. Firefox seems to have the lowest memory bandwidth usage when scrolling, compared to Chrome and Edge.
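To make point 2 above concrete, here's a toy sketch (hypothetical; not the example-compositor code) of computing a per-frame redraw region as the union of a rectangle's previous and current bounds. When the animated rects are close together this region stays far smaller than the full window, which matches the bandwidth and power observations above.

#[derive(Clone, Copy, Debug)]
struct Rect { x: i32, y: i32, w: i32, h: i32 }

impl Rect {
    fn union(self, other: Rect) -> Rect {
        let x0 = self.x.min(other.x);
        let y0 = self.y.min(other.y);
        let x1 = (self.x + self.w).max(other.x + other.w);
        let y1 = (self.y + self.h).max(other.y + other.h);
        Rect { x: x0, y: y0, w: x1 - x0, h: y1 - y0 }
    }

    fn area(self) -> i64 {
        self.w as i64 * self.h as i64
    }
}

// Redraw region for one animated rectangle: the union of where it was and
// where it is now. The OS compositor only has to recomposite this region.
fn redraw_region(prev: Rect, next: Rect) -> Rect {
    prev.union(next)
}

fn main() {
    let prev = Rect { x: 100, y: 100, w: 200, h: 200 };
    let next = Rect { x: 110, y: 100, w: 200, h: 200 };
    let dirty = redraw_region(prev, next);
    println!("dirty = {:?}: {} px vs. full 1920x1080 = {} px",
             dirty, dirty.area(), 1920i64 * 1080);
}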

(continuing to chase down the invalidations, sorry if this is now tangential)

When we scroll, we create a SpaceMapper that's a CoordinateSpaceMapping::ScaleOffset with e.g. offset (0.0, -10.0), and pass it as part of TilePreUpdateContext.

Then Tile::pre_update will map self.local_tile_rect == Rect(1024.0×512.0 at (0.0,0.0)) with that SpaceMapper into self.world_tile_rect == Rect(1024.0×512.0 at (0.0,-10.0)).

Which then gets union()ed into a world_culling_rect (returned by tile.pre_update) and passed into update_visibility as part of the recursive call for the PrimitiveInstanceKind::Picture case. world_culling_rect becomes an input to build_clip_chain_instance, and as a result an input like pic_rect == Rect(512.0×512.0 at (0.0,0.0)) gets clipped to Rect(512.0×502.0 at (0.0,10.0)).

So now clip_chain.pic_clip_rect changes as we scroll, and is used to initialize a PrimitiveDependencyInfo, which goes into tile.add_prim_dependency, where it's used to calculate prim_clip_rect, which goes into self.current_descriptor.prims.push(PrimitiveDescriptor.. and hence the descriptors keep changing and causing invalidations.

I don't know if this is all working exactly as intended and/or if Gecko does this any differently.

(Edit: tested at 512x512 resolution to reduce tile count.)
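To illustrate the chain above with a minimal, self-contained sketch (hypothetical types and functions; not the actual SpaceMapper / build_clip_chain_instance code): intersecting a fixed picture rect with a culling rect that carries the scroll offset produces a clip rect whose origin/size track the scroll position, which is exactly the per-frame change that ends up in the descriptors.

#[derive(PartialEq, Clone, Copy, Debug)]
struct RectF { x: f32, y: f32, w: f32, h: f32 }

impl RectF {
    fn translate(self, dx: f32, dy: f32) -> RectF {
        RectF { x: self.x + dx, y: self.y + dy, ..self }
    }

    fn intersect(self, o: RectF) -> RectF {
        let x0 = self.x.max(o.x);
        let y0 = self.y.max(o.y);
        let x1 = (self.x + self.w).min(o.x + o.w);
        let y1 = (self.y + self.h).min(o.y + o.h);
        RectF { x: x0, y: y0, w: (x1 - x0).max(0.0), h: (y1 - y0).max(0.0) }
    }
}

fn main() {
    let pic_rect = RectF { x: 0.0, y: 0.0, w: 512.0, h: 512.0 };
    let tile_rect = RectF { x: 0.0, y: 0.0, w: 512.0, h: 512.0 };

    // Each frame the tile's world rect picks up the scroll offset, so the
    // culling rect (and therefore the clipped pic rect) changes every frame.
    for frame in 0..3 {
        let scroll_y = -(frame as f32); // scrolling up 1px per frame
        let world_culling_rect = tile_rect.translate(0.0, scroll_y);
        let clipped = pic_rect.intersect(world_culling_rect);
        println!("frame {}: prim_clip_rect = {:?}", frame, clipped);
    }
}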

Could you clarify how you're doing the bandwidth measurements, and how that compares to power and/or GPU usage measurements?

Is there something specific you wanted me to investigate further, based on your findings in https://bugzilla.mozilla.org/show_bug.cgi?id=1602803#c16?

Flags: needinfo?(gwatson) → needinfo?(jmuizelaar)

I was measuring memory bandwidth using https://github.com/opcm/pcm. I don't think there's anything I need from you right now. I'm going to do some further investigation on Monday.
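For reference on the measurement side: pcm's memory tool reports per-channel and total system memory read/write bandwidth at a fixed sampling interval. Something along these lines should reproduce the numbers (the exact binary name and argument are an assumption from memory and may vary between pcm versions):

pcm-memory.exe 1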

Flags: needinfo?(jmuizelaar)

I looked at this some more and recorded some results here: https://docs.google.com/document/d/10x9N7iw5mPlGKhtfkwehoWrDG7No7BKcR9GEdPTbHI0/edit

Generally things are better or the same. There is still some mystery about why Edge uses less read bandwidth and more write bandwidth; we can look into that when we have time.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
