[WebRender Shield Study] Higher CPU usage with WebRender enabled on YouTube (Windows)

RESOLVED FIXED

Status

RESOLVED FIXED
Type: defect
Priority: P1
Severity: normal
Opened: 10 months ago
Closed: 7 months ago

People

(Reporter: acupsa, Assigned: nical)

Tracking

(Depends on 1 bug, Blocks 1 bug)

Version: 63 Branch
Platform: x86_64 Windows 10
Points: ---

Firefox Tracking Flags

(firefox63 affected)

Details

Attachments

(1 attachment)

Version: 63.0a1
Build ID: 20180709221247
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

[Affected Platforms]:
- Windows 10 64bit

[Prerequisites]
- Have Task Manager open at the Processes section.

[Steps to reproduce]:
1. Open Firefox Nightly (63.0a1) with a new profile and navigate to https://www.youtube.com/watch?v=aqz-KE-bpKQ
2. Observe the CPU usage of Firefox Nightly in Task Manager.
3. Go to about:config and create the "gfx.webrender.all.qualified" preference, set to true.
4. Restart the browser.
5. Navigate to https://www.youtube.com/watch?v=aqz-KE-bpKQ
6. Observe the CPU usage of Firefox Nightly in Task Manager.

[Expected result]:
- CPU usage of Firefox Nightly with WebRender enabled should be approximately the same as without it. 

[Actual result]:
- CPU usage is on average ~50% higher with WebRender enabled.

[Notes]:
- This issue is reproducible by performing any action in the browser. 
- Attached a link to CPU usage test results: https://tinyurl.com/yaesrl2d.
- Attached a copy of "about:support" from both systems this issue was tested on.
- Attached a screen recording of the issue: https://tinyurl.com/y9d79u69.
See Also: → 1430451
Forgot to add the about:support info file.
Priority: -- → P1
Summary: [WebRender Shield Study] Higher CPU usage with WebRender enabled → [WebRender Shield Study] Higher CPU usage with WebRender enabled on YouTube
Assignee: nobody → jmuizelaar
What are the actual before and after CPU usage numbers you see?
Flags: needinfo?(andreea.cupsa)
Just to clarify, this issue also reproduces on other websites such as Twitch, Amazon, and Wikipedia.
For example, livestreams on https://twitch.tv have an average ~19% CPU usage without WebRender and ~34% with WebRender enabled; navigating https://amazon.com with WebRender enabled almost doubles the average CPU usage.

Also, to answer your question: for the video in this bug, without WebRender the CPU usage ranges from ~5% up to ~21% (average ~14%), while with WebRender enabled it goes up to ~50% and won't drop below ~20% (average ~30%).
Flags: needinfo?(andreea.cupsa)
I can reproduce this locally. I see about 10% CPU with WebRender and 5% CPU without.

Here's a profile https://perfht.ml/2L40n8W. The extra time is being spent in the Renderer and the RenderBackend, about equally. It's expected that these numbers would be higher, because the scene that we're drawing is more complex with WebRender than it is with layers. That being said, I wouldn't be surprised if there's work we can do to get this number closer to our current levels.

Glenn, can you take a look and see what can be done?
Assignee: jmuizelaar → gwatson
Flags: needinfo?(gwatson)
Here's where time on the Renderer thread is being spent:
- 30% of the time is being spent in ANGLE, and 27% of the time in the nvidia driver (this goes up to 41% if you include all subtrees under the nvidia driver).
- The time spent in xul.dll only accounts for 6% of the Renderer time.
Depends on: 1474664

Comment 6

9 months ago
I haven't tested this on the target hardware yet (I will do so shortly), but on my Linux box, the youtube link runs at 30 fps for most of the video without WR, while it runs at a steady 60 fps the entire time with WR enabled.

Could it be something as simple as that? Would it be possible to measure the framerate with / without WR on the machines tested above, just to make sure we're doing a fair comparison?
Flags: needinfo?(gwatson)
See Also: → 1474532

Comment 7

9 months ago
Regardless of the frame rate question above, youtube.com does seem to be consuming more CPU time than most sites. On my profiling machine, it is ~3.2ms / frame in the backend, whereas nytimes.com is ~1.4ms / frame, and nytimes.com has far more primitives / vertices.

The primitive count on youtube is 653, which is quite low. There are 362 nodes in the CST, which is higher than ideal, but shouldn't be a massive issue. The reference frame count is 97 - this is *much* higher than most sites. I wouldn't expect this to cause a problem, but maybe there is something strange going on there.

The vertex count is ~12k - this seems high for this site, but it is much lower than nytimes.com, which handles ~30k vertices without a problem, so that seems unlikely to be the cause.

Still investigating...
The 60fps instead of 30fps rendering should be fixed by bug 1474532.

Comment 9

9 months ago
I was able to test on a Win10 + nVidia machine, and that does run at 60fps both with and without WR, so that's not the cause in this case (although it is interesting that non-WR runs at 30 fps on both Mac + Linux).

I can reproduce this, I think - but it's somewhat difficult to measure since the CPU usage jumps around so much depending on the video (and the difference doesn't seem to be as large on my test setup).

One interesting bit of data - the CPU backend and compositor times in WR are (almost exactly) constant throughout the video (1.2ms and 1.4ms, respectively). Does this imply that the CPU time variation is related to something happening outside WR? I'm not sure, but it seems possible.

Is the video decode path going to be the same between WR and non-WR? Is there a way to confirm this?
Huh, it occurs to me that in this case WR is actually doing some redundant work - the DL is not changing, just the contents of an external texture.

Thus, we should be able to detect this case and redraw the same built Frame as the previous frame. I think this should be quite easy to detect - I'll prototype this today and see whether (a) there are any gotchas I'm missing and (b) it measurably reduces the CPU time reported in Task Manager.
OK, this is slightly more involved than I thought, due to the way the texture cache update list is collected.

It is certainly feasible to make this case (no new DL, just external texture cache updates) handled significantly more efficiently in WR though.

This would help CPU usage in both the youtube and twitch cases. I'm not sure about the amazon.com case mentioned above - I couldn't repro the CPU usage difference there, and WR is doing almost no work on that page for me (it is idle except for the occasional banner animation that occurs once every few seconds).

Since this is more involved than originally thought, it becomes a question of priority. Do we want to implement this optimization ASAP or is it low priority compared to other correctness bugs?
Flags: needinfo?(jmuizelaar)
(In reply to Glenn Watson [:gw] from comment #9)
> I was able to test on a Win10 + nVidia machine, and that does run at 60fps
> both with and without WR, so that's not the cause in this case (although it
> is interesting that non-WR runs at 30 fps on both Mac + Linux).
> 
> I can reproduce this, I think - but it's somewhat difficult to measure since
> the CPU usage jumps around so much depending on the video (and the
> difference doesn't seem to be as large on my test setup).
> 
> One interesting bit of data - the CPU backend and compositor times in WR are
> (almost exactly) constant throughout the video (1.2ms and 1.4ms,
> respectively). Does this imply that the CPU time variation is related to
> something happening outside WR? I'm not sure, but it seems possible.
> 
> Is the video decode path going to be the same between WR and non-WR?

WR and non-WR use the same decoding path.

> Is there a way to confirm this?

:jya, do you know how to confirm it?
Flags: needinfo?(jyavenard)
(In reply to Sotaro Ikeda [:sotaro] from comment #12)

> > Is the video decode path going to be the same between WR and non-WR?
> 
> WR and non-WR use the same decoding path.
> 
> > Is there a way to confirm this?
> 
> :jya, do you know how to confirm it?

The easiest way to compare would be to install the media devtools: https://addons.mozilla.org/en-US/firefox/addon/devtools-media-panel/

Start playing a video, press Ctrl-Shift-I, go to the Media tab, and click on the URL shown.

There will be a line describing the decoder used.

Unfortunately, the about:support attachment here has been truncated, so I can't know for sure what's going on.

On YouTube, it is possible that in one case H264 (which is hardware accelerated) is used when webrender is off, but VP9 (software decoded on the nvidia 210) is used with webrender.

As the OP mentioned that the problem was also seen on other sites that use exclusively H264 (like Twitch), it could be that HW acceleration is used in one case but not the other.

I'll test that shortly.
Flags: needinfo?(jyavenard)
Flags: needinfo?(jyavenard)
I have an AMD Vega64 on this machine, with gfx.webrender.all set to true.

I don't see much difference in CPU usage between webrender on and off; it appears slightly higher with webrender on.
With webrender off, it oscillates between 1.9% and 3%;
with webrender on, it oscillates between 3% and 4.5%.

So yes, you could say that CPU usage is about 50% higher, but we're talking about a 1-2% difference in actual total CPU usage... do we care?
Flags: needinfo?(jyavenard)
in the test above, the video was hardware accelerated

this is what the media devtools show:
"Video Decoder(video/avc, 1920x1080 @ 60.00)":"wmf hardware video decoder - nv12 (remote)"
"Hardware Video Decoding":"enabled"
(In reply to Jean-Yves Avenard [:jya] from comment #14)
> I have a AMD Vega64 on this machine, setting gfx.webrender.all to true
> 
> I don't see much difference in CPU usage between webrender on and off.. It
> appears slightly higher with webrender on.
> With webrender off, it oscillates between 1.9% and 3%;
> with webrender on, it oscillates between 3% and 4.5%
> 
> so yes, you could say that CPU usage is about 50% more, but we're talking
> 1-2% difference in actual total CPU usage... do we care?

I think the answer to this ought to be 'yes'. That small increase may mean a big difference in battery life for long-lived browsing sessions on laptops, an area where Firefox could already use some improvement.
After watching the screen capture video, the difference is much more than 50%.

We see that when the video is just playing, once the page has been rendered, CPU usage is only 5-6%, while with webrender it's 50+%; if I didn't know about webrender, I would say that one is HW accelerated and the other is not.

Looking at the information displayed by the media devtools would confirm that theory.
(In reply to Glenn Watson [:gw] from comment #11)
> Since this is more involved than originally thought, it becomes a question
> of priority. Do we want to implement this optimization ASAP or is it low
> priority compared to other correctness bugs?

I think this work is pretty high priority. My feeling is that most of the correctness bugs are pretty rare and don't significantly impact user experience. This seems pretty noticeable to people.
Flags: needinfo?(jmuizelaar)
I can confirm that on my Win10 + nVidia test machine, I'm seeing the same hardware accelerated video decoder in both the non-WR and WR code paths, that is:

"Video Decoder(video/avc, 1920x1080 @ 60.00)":"wmf hardware video decoder - nv12 (remote)"
"Hardware Video Decoding":"enabled"

The way the YUV planes are being supplied to WR is via an ExternalImage::RawData callback. This means that WR issues a callback at frame render time, and the Gecko code supplies a pointer to the CPU-side YUV planes. WR then uploads the YUV planes from that pointer as a texture to the GPU each frame. (this is different from the WR native texture interface, which allows supplying a texture handle directly instead of a CPU texture upload).

I know very little about hardware media decoding - is this what would be happening in the non-WR path? Is it possible that in the non-WR path the decode is happening on the GPU and avoiding a readback / upload of the YUV planes? If the non-WR path doesn't do those extra copies, that could explain the CPU time difference?
Flags: needinfo?(jyavenard)
Flags: needinfo?(jmuizelaar)
My understanding is that the non-WR path should not be uploading from the CPU and should be using a texture directly from the decoder.
Flags: needinfo?(jmuizelaar)
Sotaro, do you know what might be going on here?
Flags: needinfo?(sotaro.ikeda.g)
FWIW, implementing the fast path mentioned above in WR is much easier if Gecko was supplying native texture handles, rather than using the CPU-side texture cache upload path it currently is. So if we're able to use native texture handles here, we could get a double CPU win from skipping the redundant WR work, in addition to skipping the readback / upload.
My comment in #19 is wrong - I had made a mistake in some debug logging code, and it does appear that the native texture handle code path is being hit, as we'd expect.

So, I'll try to prototype the fast path to skip the WR frame build in this case and see what difference it makes to the reported CPU usage (I suspect there may still be something else going on, as the WR CPU usage is quite low here anyway, but we can test this first and see).
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jyavenard)
Note that for the purposes of the Shield Study we don't really care about CPU usage/battery life; it's not one of the things we're measuring and we have plans to address this eventually. Also per the release criteria, battery life is not a constraint for the V1 release since we're targeting desktop users. So this shouldn't need to block the Shield Study, although I'd like to leave it as a P1 blocker for wr-stage-nightly so that we make sure to mitigate this before we enable by default on Nightly.
I've reviewed the top site performance concerns with Andreas Bovens from Product. He has approved the experiment to go out with Nightly 63.

Please let me know if anything else is needed to move forward.
I've put a static version of this test case up at https://trusting-kirch-5ba4e0.netlify.com/big-buck-bunny.html, which should give more reliable CPU usage numbers. It's possible to get even more stable CPU usage numbers by only looking at the GPU process (which you can identify from about:support).

On the reference machine without WebRender I get about 2-3% usage in the GPU process, with WebRender I get 6-8% usage.
The basic idea to improve this is that if the only thing changing is an external texture, we should be able to just call wr.render(), without doing a frame build. This is because the render() step does a lock()/unlock() to get the current external texture each composite.
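The decision described above could be sketched roughly as follows. This is a minimal standalone sketch with invented names (Transaction, Work, classify) - not the real WebRender types - just to illustrate the early-out: only an external-texture update means the built frame is still valid, so a composite alone suffices.

```rust
// Hypothetical sketch: decide whether a transaction needs a full frame
// build or can go straight to render(). Names are invented for
// illustration; the real WebRender API differs.

#[derive(Default)]
struct Transaction {
    new_display_list: bool,       // a new DL was sent
    scene_property_updates: bool, // dynamic scene properties changed
    external_image_updates: bool, // e.g. a new video frame texture
}

#[derive(Debug, PartialEq)]
enum Work {
    BuildFrameAndRender, // rebuild the frame, then composite
    RenderOnly,          // composite only; lock()/unlock() picks up the texture
    Nothing,
}

fn classify(txn: &Transaction) -> Work {
    if txn.new_display_list || txn.scene_property_updates {
        Work::BuildFrameAndRender
    } else if txn.external_image_updates {
        // Only an external texture changed: the previously built frame
        // is still valid, so render() alone is enough.
        Work::RenderOnly
    } else {
        Work::Nothing
    }
}
```

For a playing video with an unchanged DL, every per-frame transaction would classify as RenderOnly, skipping the frame build that this bug measures.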

There's a couple of things causing a frame build to occur in this test case.

(1) There is a transaction being sent which sets/updates the dynamic scene properties. Currently, this message unconditionally triggers a frame build. We can fairly easily and cheaply compare the new scene properties to the previous scene properties inside WR, and only set the 'render' field on DocumentOps if they have changed. Alternatively, if the information is already available in Gecko to know this, it may be simpler / more efficient for Gecko to just not include these messages in the transaction?

(2) The code here https://github.com/servo/webrender/blob/2cb682553816200bb74ce75d3851753bc122f488/webrender/src/render_backend.rs#L1086 sets op.render = true if there is a generate frame message. I *think* this is probably wrong - I would think generate frame should only specify to do a composite. However, there's lots of changes since I looked at this code, and lots of subtleties here, so there might be something I'm missing?

Unfortunately, if I hack in those changes locally, although the frame build doesn't happen, the external video texture no longer updates correctly in Gecko. I hacked the WR yuv example to animate the texture, and it seems to work as expected in this case. I wonder if something in Gecko is deciding not to advance the texture frame in that case - I need to investigate further.

kats, any thoughts on (1) and (2) ?
Flags: needinfo?(bugmail)
I confirmed that disabling op.render from the generate frame message is affecting which textures Gecko provides as video frames. However, in the WR YUV example code, it does appear to work as expected.

I suspect that not doing the build frame may be working fine in WR itself, but somehow affecting whether Gecko provides a new video frame. sotaro, where would I look to see how / when the external image handler would be providing new video frames?
Flags: needinfo?(sotaro.ikeda.g)
I do not know exactly how :gw checked comment 28, but it seems that async scene building and IFrame usage for video are related. I locally modified the source and seemed to reproduce the problem.

In current gecko, video transactions are sent via the render backend thread. AsyncImagePipelineManager::ApplyAsyncImagesOfImageBridge() adds the video transaction tasks, and it is called by WebRenderBridgeParent::CompositeToTarget().
  https://hg.mozilla.org/mozilla-central/file/tip/gfx/layers/wr/AsyncImagePipelineManager.cpp#l271
  https://hg.mozilla.org/mozilla-central/file/tip/gfx/layers/wr/WebRenderBridgeParent.cpp#l1524

The SceneBuilderResult::Transaction for a video transaction is handled on the render backend, but it does not trigger a render operation. That seems to cause the video problem when op.render is disabled for the generate frame message.
Flags: needinfo?(sotaro.ikeda.g)
As in comment 29, the current gecko implementation expects doc.render() to update video, which is not good.

Bug 1476846 is about sending only the image key add/update of video with the generate frame message, since the gecko side expects to update video frames (external images) using the same ImageKey.
forward_transaction_to_scene_builder() did not request a scene build because of the following check, so async scene building for the video transaction did not trigger doc.render():

>        let build_scene: bool = document_ops.build
>            && self.pending.scene.root_pipeline_id.map(
>                |id| { self.pending.scene.pipelines.contains_key(&id) }
>            ).unwrap_or(false);
In gecko, the external image is bound to the video buffer, and we try to minimize ImageKey recreation by using TransactionBuilder::UpdateExternalImage().
(In reply to Sotaro Ikeda [:sotaro] from comment #32)
> In gecko, the external image is bound to the video buffer, and we try to
> minimize ImageKey recreation by using TransactionBuilder::UpdateExternalImage().

Hmm, it seems that WebRender does not implement a way to update an ImageKey to point at a different external image without a scene build.
When using a native external texture, there is no need to update the ImageKey if the contents of the image changes - WR will always invoke the external handler callback on each render() call.

So just calling render() and providing the new image data in the callback should be enough, and this will allow us to completely skip a scene and frame rebuild.

The only time you should need to update the ImageKey for an external native texture handle is if the size or format changes.
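The lock()/unlock() flow described above could be sketched like this. The trait and types here are simplified stand-ins (ExternalImageHandler, VideoSource, composite are invented for illustration, not the real webrender API): the point is that each composite re-queries the handler, so the decoder only has to swap the texture it hands back, and the ImageKey never changes.

```rust
// Hypothetical sketch of the external-image flow: render() locks each
// external image, reads the current texture, and unlocks it, so new
// video frames are picked up without any ImageKey update. Simplified;
// the real webrender trait has more parameters and richer return types.

trait ExternalImageHandler {
    /// Called by the renderer on each composite to get the current texture.
    fn lock(&mut self, key: u64) -> u32; // returns a native texture handle
    fn unlock(&mut self, key: u64);
}

struct VideoSource {
    current_texture: u32, // updated by the decoder as frames arrive
}

impl ExternalImageHandler for VideoSource {
    fn lock(&mut self, _key: u64) -> u32 {
        self.current_texture
    }
    fn unlock(&mut self, _key: u64) {}
}

/// One composite: lock, draw with the returned handle, unlock.
fn composite(handler: &mut dyn ExternalImageHandler, key: u64) -> u32 {
    let tex = handler.lock(key);
    // ... draw using `tex` here ...
    handler.unlock(key);
    tex
}
```

Under this model, a render()-only fast path still shows the newest video frame, because the texture swap happens entirely inside the handler.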
(In reply to Glenn Watson [:gw] from comment #34)
> When using a native external texture, there is no need to update the
> ImageKey if the contents of the image changes - WR will always invoke the
> external handler callback on each render() call.

Yes, that could be done for video. I am going to file a bug for it.

It was not done yet because it only improves the native external texture case, not ExternalImageSource::RawData. And I thought webrender would optimize the "update the ImageKey" case so as not to request doc.render(); if that were done, performance would also be optimized for ExternalImageSource::RawData.

In the current implementation, the external image is always bound to one buffer, so when the video buffer is updated, gecko updates the ImageKey. This avoids several problems like the following, though they could be mitigated even when using the same external image id for different video frames:

- WebRenderBridgeParent::CompositeToTarget() could enqueue several frame generations, and the frames will be generated on the render thread at different timings. There is then a risk of inconsistent frame generation if the same external image id is used for different video frames.
- Each video frame could have a totally different video buffer type/format/size.
Depends on: 1477608
(In reply to Glenn Watson [:gw] from comment #27)
> (1) There is a transaction being sent which sets/updates the dynamic scene
> properties. Currently, this message unconditionally triggers a frame build.
> We can fairly easily and cheaply compare the new scene properties to the
> previous scene properties inside WR, and only set the 'render' field on
> DocumentOps if they have changed. Alternatively, if the information is
> already available in Gecko to know this, it may be simpler / more efficient
> for Gecko to just not include these messages in the transaction?

I think it would be easier to do this on the WR side. And we should do this both for the dynamic properties and the scroll offsets, if possible. On the gecko side we will trigger these "GenerateFrame" transactions anytime we feel like we should recomposite stuff, which will include dynamic property changes and async scroll animations. However, some of these "changes" are going to be no-ops and in those cases we can avoid the render.
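The no-op detection suggested for (1) could look something like the following. This is a sketch with invented names (DynamicProperties, Document, set_properties are not the real WR types), assuming the document keeps its last-applied property values and compares incoming updates against them:

```rust
// Hypothetical no-op detection for dynamic property updates: only
// request a render if the incoming values actually differ from the
// previous ones. Names invented for illustration.

#[derive(Clone, PartialEq)]
struct DynamicProperties {
    transforms: Vec<(u64, [f32; 2])>, // (property binding id, translation)
    opacities: Vec<(u64, f32)>,       // (property binding id, opacity)
}

struct Document {
    current: DynamicProperties,
}

impl Document {
    /// Returns true if a render is needed (i.e. something changed).
    fn set_properties(&mut self, new: DynamicProperties) -> bool {
        if new == self.current {
            false // no-op update: skip the frame build and render
        } else {
            self.current = new;
            true
        }
    }
}
```

The same equality check would apply to scroll offsets: a ScrollNodeWithId that lands on the offset already in effect should not flip the render bit.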

> (2) The code here
> https://github.com/servo/webrender/blob/
> 2cb682553816200bb74ce75d3851753bc122f488/webrender/src/render_backend.
> rs#L1086 sets op.render = true if there is a generate frame message. I
> *think* this is probably wrong - I would think generate frame should only
> specify to do a composite. However, there's lots of changes since I looked
> at this code, and lots of subtleties here, so there might be something I'm
> missing?

Yeah I think this should also be doable, as long as we ensure we do still render with scroll offset changes, as mentioned above. Gecko will send ScrollNodeWithId commands for each scrollable frame with each GenerateFrame transaction. Some of these will have actual changes and others may not. Right now doc.render_on_scroll [1] will not be true with Gecko, so the DocumentOps returned from a ScrollNodeWithId will always have "true" for scroll and "false" for render. If we drop the op.render = true line you're referring to, I think async scroll animations will stop rendering their intermediate frames because of this.

[1] https://searchfox.org/mozilla-central/rev/ad36eff63e208b37bc9441b91b7cea7291d82890/gfx/webrender/src/render_backend.rs#673
Flags: needinfo?(bugmail)
Depends on: 1477970
Assignee: gwatson → kats
Assignee: kats → sotaro.ikeda.g
Sotaro, can you take this over now that kats is on parental leave?
Yes, I could take this.

https://github.com/servo/webrender/pull/2951 might affect this.
See Also: → 1482699
Depends on: 1478566
Depends on: 1483610
Depends on: 1473290
Bug 1473290 is about CPU usage during scroll, so it seems very different from video playback.
No longer depends on: 1473290
No longer depends on: 1483610
Duplicate of this bug: 1430451
Comment 41 (Assignee)

8 months ago
I'm stealing this bug from Sotaro while he is away.
Assignee: sotaro.ikeda.g → nical.bugzilla
Comment 42 (Assignee)

7 months ago
https://github.com/servo/webrender/pull/3043 adds some basic infrastructure for avoiding redundant CPU work, on top of which we can optimize.
Since we've already improved things compared to the shield study, this doesn't need to block nightly.
Blocks: stage-wr-trains
No longer blocks: stage-wr-nightly
No longer depends on: 1478566
Duplicate of this bug: 1478566
Comment 45 (Assignee)

7 months ago
The situation is different on Windows than on the other platforms, so I'll repurpose this bug for Windows specifically, since it's our initial target.

I did some profiling, and right now the CPU side of things is in pretty good shape thanks to Sotaro's work, which lets us skip frame building altogether when the only thing that changes is a video frame that is already on the GPU (which is the case for youtube on windows).
If there's any improvement left to make, it would be on the GPU side (the CPU parts of webrender are pretty much idle the whole time). These improvements could come from Glenn's caching work or from direct composition, but that's not part of the MVP.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED
Summary: [WebRender Shield Study] Higher CPU usage with WebRender enabled on YouTube → [WebRender Shield Study] Higher CPU usage with WebRender enabled on YouTube (Windows)