Power usage is worse with gfx.webrender.compositor=true on https://news.ycombinator.com/item?id=21750747 on Surface GO
Categories: Core :: Graphics: WebRender, defect, P3
People: Reporter: jrmuizel; Assigned: jrmuizel
References: Blocks 1 open bug
Attachments: 1 file (patch, 1.58 KB)
With gfx.webrender.compositor=true I see 50% GPU usage in Intel Power Gadget.
With gfx.webrender.compositor=false I see more like 25% GPU usage.
CPU usage is also slightly better with it off.
Comment 1•4 years ago
Did some testing (by scrolling HN) on a laptop with Intel Iris 550 and Nvidia GTX 1050:
| GPU | Compositor | Intel Power | CPU util | Intel GPU util |
|---|---|---|---|---|
| Iris 550 | true | 20 W | 30% | 85% |
| Iris 550 | false | 15.5 W | 25% | 75-80% |
| GTX 1050 | true | 13.3 W | 17% | 60-75% |
| GTX 1050 | false | 8.5 W | 17% | 70% |
Comment 2•4 years ago
Also interesting, it seems like much of the GPU usage has moved from the Firefox process to the DWM so that's sort of a good sign.
Comment 3•4 years ago
Another round of testing to compare DWM GPU usage, done by scrolling a wiki page:
| GPU | Compositor | FF GPU | DWM GPU |
|---|---|---|---|
| Iris 550 | false | 40% | 20% |
| Iris 550 | true | 27% | 40% |
| GTX 1050 | false | 25% | 22% |
| GTX 1050 | true | 6% | 52% |
Comment 4•4 years ago
High DWM GPU usage usually indicates overdraw / transparent layers / transparent windows. We should check how these numbers change if we make all surfaces opaque. We should also double-check whether our window itself is treated as opaque. Glenn mentioned something about drawing an opaque rectangle somewhere that might have an effect on this.
Comment 5•4 years ago
I added a number of modes to the example-compositor application (bug #1602992). We can use this as an initial test case to see what power usage looks like in various (fake) scenarios. Once we have reasonable numbers and conclusions there, we can compare to what we see in Gecko and see if / why there is a discrepancy.
Comment 6•4 years ago
One scenario we can test is comparing DC mode when we are just scrolling (no rasterization), and using vsync (the ideal case for DC mode). We can test this with the example application:
Simple mode: `compositor.exe none scroll swap 1920 1080`
DC mode: `compositor.exe native scroll flush 1920 1080`
On the machine I'm testing on (Intel HD530), Intel Power Gadget reports:
Simple mode:
Package Pwr0 ~3.8 W
GPU Util: ~71%
DC mode:
Package Pwr0 ~2.2 W
GPU Util: ~8%
So these numbers look good (idle power usage is ~1.5 W) so far.
We need to check that the comparison method is valid first; if so, we can compare a similar scenario in Gecko.
Comment 7•4 years ago
Further confirming the results above, I compared reported GPU usage in the GPU process and DWM while running the example-compositor application and Gecko on a very simple page at 4k.
Scrolling a simple page in Gecko sees:
- 1.3% GPU usage in GPU process
- 45% GPU usage in DWM
Running the scrolling benchmark in example-compositor sees:
- 1% GPU usage in GPU process
- 4% GPU usage in DWM
So, it does seem that most of the GPU time has been moved to DWM (good!) but that there is a very large amount of GPU time in DWM in Gecko that doesn't occur in the example-compositor application (bad!).
Comment 8•4 years ago
I can confirm that on the Surface Go with compositor.exe I see a massive difference in GPU usage as reported by Power Gadget: 54% (with none) vs 10% (with native).
Package power and DRAM power are also both noticeably lower.
Comment 9•4 years ago
The patch in bug 1603314 definitely helps - it fixes most of the current Talos regressions we saw [1], and anecdotally drops the reported GPU usage quite significantly.
However, it's not clear to me why this is required - all of the background tiles that we supply are already marked as opaque.
We're definitely on the right track, but I'm going to do a bit more investigation and experimenting before landing that patch - maybe there's a more efficient way to express what we want, or a reason that DC isn't already working out it can composite the existing tiles without considering what's behind them.
Comment 10•4 years ago
I've remeasured this and it doesn't seem as obviously worse.
Comment 11•4 years ago
Given that, should we close this? Or do you want to take this bug for now, if you're still investigating?
Comment 12•4 years ago
I'll take it
Comment 13•4 years ago
I had a look at this more today. I was able to turn off the chrome and the scroll bar and Firefox scrolling news.ycombinator.com still used a lot more memory bandwidth than the example compositor. This is actually the reverse of what I'd expect because the example compositor is scrolling a transparent layer. This suggests that further investigation will be fruitful.
Comment 14•4 years ago
I tried applying this patch to make the content of the example compositor more similar to scrolling Firefox, i.e. no transparency. With this patch, memory bandwidth goes up dramatically, which is the opposite of what I'd expect. We should try to understand why.
Interestingly, when you turn on picture cache debugging, we're doing a bunch of tile splitting. It's not obvious to me why.
Comment 15•4 years ago
Taking a look at the unexpected tile splitting, my first hunch is that `prim_clip_rect` keeps changing as we scroll up. For example, on the first frame it goes:
from `PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 0.0, w: 512.0, h: 512.0 } ...`
to `PrimitiveDescriptor { prim_uid: ItemUid { uid: 1 }, origin: PointKey { x: 0.0, y: 0.0 }, prim_clip_rect: RectangleKey { x: 0.0, y: 1.0, w: 512.0, h: 511.0 }, ...`
So the tile keeps invalidating, and after 64 frames of that it splits up. That seems to happen for both `none swap` and `native flush`, so it is probably not related to the power differences.
Comment 16•4 years ago
Indeed. I reduced the size of the added opaque rect to avoid the invalidations, and that drastically reduced power usage.
I've since done some more testing and have a better model now for what's going on. Some of it seems obvious in retrospect, but I'm going to write it down anyway.
- The DWM seems to do partial draws if possible. It would be interesting to investigate further to confirm this, but what investigation I did do suggests that if only a small part of the screen is changing, the DWM will not actually composite the entire screen.
- In its default configuration, the example compositor only has a redraw region that is the bounds of the moving rectangles. This is significantly less than the entire window, which explains one large part of the lower memory bandwidth. Power usage went up when the moving rectangles were moved further apart.
- The Intel GPU seems to avoid writing out completely transparent pixels. This was suggested by an experiment that compared the memory bandwidth of two far-apart rectangles with the same two rectangles plus a partially transparent background.
- Making the partially transparent background opaque reduced memory bandwidth usage. It's not completely clear whether this reduction comes from an optimization in the DWM or on the GPU.
- I still see appreciably higher memory bandwidth usage with the non-invalidating, window-sized opaque rect moving than with Firefox scrolling, so it seems there may still be fruit to be picked here.
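For a rough sense of scale on the partial-draw point above, here is a back-of-the-envelope sketch. These are my own illustrative numbers (assuming 4 bytes per pixel at 60 Hz), not measurements from this bug: recompositing a full 1920x1080 window every frame costs roughly 50x the write bandwidth of redrawing only a small dirty region.

```rust
// Illustrative write-bandwidth estimate for a composited window.
// Assumes BGRA8 (4 bytes/pixel) at a 60 Hz refresh rate.
fn write_bandwidth_mb_per_s(width: u32, height: u32, hz: u32) -> f64 {
    (width as f64 * height as f64 * 4.0 * hz as f64) / 1.0e6
}

fn main() {
    // Full-window recomposite every frame:
    let full = write_bandwidth_mb_per_s(1920, 1080, 60);
    // Only a 200x200 dirty region (e.g. the moving rectangles) redrawn:
    let partial = write_bandwidth_mb_per_s(200, 200, 60);
    println!("full: {full:.0} MB/s, partial: {partial:.0} MB/s"); // ~498 vs ~10
}
```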
Comment 17•4 years ago
Firefox seems to have the lowest memory bandwidth usage when scrolling, compared to Chrome and Edge.
Comment 18•4 years ago
(continuing to chase down the invalidations; sorry if this is now tangential)
When we scroll, we create a `SpaceMapper` that's a `CoordinateSpaceMapping::ScaleOffset` with e.g. offset `(0.0, -10.0)`, and pass it as part of `TilePreUpdateContext`.
Then `Tile::pre_update` will map `self.local_tile_rect == Rect(1024.0×512.0 at (0.0,0.0))` with that space mapper into `self.world_tile_rect == Rect(1024.0×512.0 at (0.0,-10.0))`.
Which then gets `union()`ed into a `world_culling_rect` (returned by `tile.pre_update`) and passed into `update_visibility` as part of the recursive call of the `PrimitiveInstanceKind::Picture` case. `world_culling_rect` becomes an input to `build_clip_chain_instance`, and therefore an input like `pic_rect == Rect(512.0×512.0 at (0.0,0.0))` gets clipped to `Rect(512.0×502.0 at (0.0,10.0))`.
So now `clip_chain.pic_clip_rect` changes as we scroll, and is used to initialize a `PrimitiveDependencyInfo`, which goes into `tile.add_prim_dependency`, where it's used to calculate `prim_clip_rect`, which goes into `self.current_descriptor.prims.push(PrimitiveDescriptor ...)`, and hence the descriptors are changing and causing invalidations.
I don't know if this is all working exactly as intended and/or if Gecko does this any differently.
(Edit: tested at 512x512 resolution to reduce tile count.)
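The chain above can be condensed into a toy calculation. The `Rect` type and the coordinate spaces here are hypothetical simplifications (the real `SpaceMapper` / `build_clip_chain_instance` code maps between more spaces, and the resulting origin differs): the point is that intersecting the picture rect with a culling rect translated by the scroll offset yields a clipped rect whose height tracks the offset, so the derived `prim_clip_rect` changes on every scrolled frame. At offset -10 the clipped height is 502, matching the 512.0×502.0 rect quoted above.

```rust
// Simplified rect math; hypothetical Rect type, not WebRender's.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn translate(self, dx: f32, dy: f32) -> Rect {
        Rect { x: self.x + dx, y: self.y + dy, ..self }
    }
    fn intersect(self, o: Rect) -> Rect {
        let x0 = self.x.max(o.x);
        let y0 = self.y.max(o.y);
        let x1 = (self.x + self.w).min(o.x + o.w);
        let y1 = (self.y + self.h).min(o.y + o.h);
        Rect { x: x0, y: y0, w: (x1 - x0).max(0.0), h: (y1 - y0).max(0.0) }
    }
}

fn main() {
    // Numbers from the comment: 1024x512 tile rect, 512x512 picture rect.
    let local_tile_rect = Rect { x: 0.0, y: 0.0, w: 1024.0, h: 512.0 };
    let pic_rect = Rect { x: 0.0, y: 0.0, w: 512.0, h: 512.0 };
    for scroll in [0.0_f32, -10.0, -20.0] {
        // ScaleOffset with offset (0.0, scroll) maps the tile rect into world space.
        let world_culling_rect = local_tile_rect.translate(0.0, scroll);
        // The picture rect clipped against the culling rect varies with scroll,
        // which is what feeds the per-frame descriptor changes.
        let clipped = pic_rect.intersect(world_culling_rect);
        println!("scroll {scroll}: clipped = {clipped:?}");
    }
}
```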
Comment 19•4 years ago
Could you clarify how you're doing the bandwidth measurements, and how they compare to the power and/or GPU usage measurements?
Is there something specific you wanted me to investigate further, given your findings in https://bugzilla.mozilla.org/show_bug.cgi?id=1602803#c16?
Comment 20•4 years ago
I was measuring memory bandwidth using https://github.com/opcm/pcm. I don't think there's anything I need from you right now. I'm going to do some further investigation on Monday.
Comment 21•4 years ago
I looked at this some more and recorded some results here: https://docs.google.com/document/d/10x9N7iw5mPlGKhtfkwehoWrDG7No7BKcR9GEdPTbHI0/edit
Generally things are better or the same. We still have some mystery about why Edge uses less read bandwidth and more write bandwidth. We can look into that when we have time.