Open Bug 1357571 Opened 2 years ago Updated 2 years ago

Youtube settings widget is still janky when playing fullscreen video

Categories

(Core :: Graphics, defect, P3)

defect

Tracking

()

Tracking Status
platform-rel --- ?

People

(Reporter: dmajor, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf, Whiteboard: [qf:meta][gfx-noted][platform-rel-youtube])

https://perfht.ml/2pz3HyM

This was on the nightly with the fix for bug 1351733.

My steps were:
- https://www.youtube.com/watch?v=jKGKX-lM1wM
- Clear out the ad
- Go fullscreen, set 1080p, let it buffer a bit, seek back to beginning
- Set speed to 1.5 -- the menu is already pretty slow
- Start profiler
- Set speed back to 1.0 -- painfully slow menu
- Stop profiler

About:support at bug 1351733 comment 5.
See Also: → 1351733
I see the second compositor process spending huge amounts of time in igd10umd64.dll, under:
mozilla::layers::DataTextureSourceD3D11::Update
CContext::TID3D11DeviceContext_UpdateSubresource_<1>

Does this activity block the other processes/threads?
(In reply to David Major [:dmajor] from comment #1)
> I see the second compositor process spending huge amounts of time in
> igd10umd64.dll, under:
> mozilla::layers::DataTextureSourceD3D11::Update
> CContext::TID3D11DeviceContext_UpdateSubresource_<1>
> 
> Does this activity block the other processes/threads?

Do you know David or Bas?
Flags: needinfo?(dvander)
Flags: needinfo?(bas)
(In reply to David Major [:dmajor] from comment #1)
> I see the second compositor process spending huge amounts of time in
> igd10umd64.dll, under:
> mozilla::layers::DataTextureSourceD3D11::Update
> CContext::TID3D11DeviceContext_UpdateSubresource_<1>
> 
> Does this activity block the other processes/threads?

Yes, it looks like something in the content process is not using Direct2D, so we're doing a buffer upload on the compositor thread.
Flags: needinfo?(dvander)
If you force your discrete GPU, do you get the same problem?
Flags: needinfo?(dmajor)
Oh, good catch. No problem with my discrete GPU.

I'll note that my Nvidia control panel apparently had a preset for firefox.exe to force the integrated GPU. I don't recall ever setting that myself; it may have come that way out of the box.
Flags: needinfo?(dmajor)
A note here that in profiles with intel gpus, you will get random 30-50ms flushes just cause. So the box shadow here might just be the thing tickling the driver's long pauses rather than anything fundamental. I tried removing creations of temporary surfaces and draw targets, but at the end of the day, we have to upload a blurred text to the GPU since we blur on the CPU right now. Removing the temporary surfaces just moved the pauses locally for me to the flushes / uploads instead.
(In reply to Mason Chang [:mchang] from comment #2)
> (In reply to David Major [:dmajor] from comment #1)
> > I see the second compositor process spending huge amounts of time in
> > igd10umd64.dll, under:
> > mozilla::layers::DataTextureSourceD3D11::Update
> > CContext::TID3D11DeviceContext_UpdateSubresource_<1>
> > 
> > Does this activity block the other processes/threads?
> 
> Do you know David or Bas?

That's a complicated question, it depends a lot on how this particular driver handles uploads. But the reality here is 'yes it can'. If a flush actually has to finish a certain batch of work, it may end up blocking on any flush on the upload to finish. This is a pretty fundamental issue of uploading large surfaces. What probably affects this here as well is that with the integrated GPU, we are already using a lot of bandwidth on the system for decoding the video and such work. Not to mention doing the actual blur (I believe our current blur code is bandwidth rather than instruction bound). It's quite possible the system's memory bandwidth is simply starved.
Flags: needinfo?(bas)
Has STR: --- → yes
Priority: -- → P3
Whiteboard: [qf] → [qf][gfx-noted]
Bas is this problem evident on Chrome? Does it occur regularly on YouTube? If so we may need an optimized path in FF or a different impl from YouTube. It sounds like it is p1 though.
Flags: needinfo?(bas)
(In reply to Naveed Ihsanullah [:naveed] from comment #8)
> Bas is this problem evident on Chrome? Does it occur regularly on YouTube?
> If so we may need an optimized path in FF or a different impl from YouTube.
> It sounds like it is p1 though.

We are using component alpha for the settings widget, this should give an improved 'look' particularly on low DPI displays, since the text in the widget gets subpixel anti-aliasing in Firefox. This does, however incur a 2x gfx performance cost. Chrome does not use subpixel AA for the text in the widget (at least on the two machines I tested on), we could make a different trade-off here and disable component alpha for things 'like' this widget, but I'm not sure what the heuristics there would be, and this is a decision made in layout.

Now having said that, we still seem to be spending a lot of time inside box blur during the rasterizations in this profile. This will likely consume a lot of memory bandwidth. If you put the speed at 0.25x, performance is fine, if you put the speed at 2.0x, performance is abysmal, considering the video decoding required for these speed changes is happening on another thread, everything indicates that video decoding is using memory bandwidth here that causes our composites, uploads and box blurs to be slow. (This video is VP9 for me, which essentially means that on my machine we are not getting any hardware video decoding)

I'll do some more profiling to see if there's obvious wins to be had on the graphics side of things. One thing I can tell you is that disabling client side uploads on intel (which we're doing in bug 1359416 for beta), is going to make things a little worse compared to nightly (not compared to release).
Flags: needinfo?(bas)
I've done extensive profiling here and confirmed that the performance issues are indeed related to the severe usage of memory bandwidth, the bus on my system is completely swamped when using the integrated GPU.

One thing I've noticed is that when using Ffmpeg, unlike when doing H.264 software decoding, we go through SharedPlanarYCbCrImage, and we incur considerably larger copying costs (One full copy of the YUV data in the decoder thread and another full copy of the YUV data on the compositor thread when uploading to the GPU) than for example the IMFYCbCrImage (which results in a DXGIYCbCrTextureData), which can even upload on the client side on some systems (and hopefully most in the near future).

This is causing, on my machines event he JS event handlers on the content main thread to use significant amounts of time. We could probably improve the situation by not using component alpha, but in reality we should be using less memory bandwidth in general.

Jean-Yves, is there a particular reason the choice was made not to use a more efficient YCbCr upload path or was it just time to market concerns? If there's no good reason I can probably create a patch to address the issue.
Flags: needinfo?(jyavenard)
Anything relate to the windows decoder predates me working for Mozilla...
So I don't know why and how YUV surface are uploaded to the GPU, and it's not something the media team would have worked on, but gfx.

I'm not sure I fully understand your comment about FFmpeg, it's only used with h264 on Linux. If VP9 and no hardware decoder, ffmpeg is used but then D3D11 should be irrelevant. 

From the comments from David, using nightly he may be on VP9 hardware decoding... so maybe his problem is that and so,liar to bug 1314458

:dmajor what's your hardware?
Can you paste here the output of right click on the video and select "stats for nerds", it's a text overlay.
Flags: needinfo?(jyavenard) → needinfo?(dmajor)
(In reply to Jean-Yves Avenard [:jya] from comment #11)
> Anything relate to the windows decoder predates me working for Mozilla...
> So I don't know why and how YUV surface are uploaded to the GPU, and it's
> not something the media team would have worked on, but gfx.

GFX offers several ways to do this, the one used here though, is one of the less efficient ones, I'm not sure who wrote the FFMPEG code in media, it may have been someone from GFX, or from media, but the one being used here is probably not the best.

> I'm not sure I fully understand your comment about FFmpeg, it's only used
> with h264 on Linux. If VP9 and no hardware decoder, ffmpeg is used but then
> D3D11 should be irrelevant. 
> 
> From the comments from David, using nightly he may be on VP9 hardware
> decoding... so maybe his problem is that and so,liar to bug 1314458
> 
> :dmajor what's your hardware?
> Can you paste here the output of right click on the video and select "stats
> for nerds", it's a text overlay.

I'm on software decoding and I can reproduce this problem quite easily. D3D11 is still relevant, as we use D3D11 to composite the window. Our FFMPEG codepath (used for VP9 software decoding) uses an inefficient codepath to upload the YCbCr surface to D3D11 though, which uses more memory bandwidth than strictly necessary.

When I have some time I will try to remedy this.
(In reply to Bas Schouten (:bas.schouten) from comment #12)
> (In reply to Jean-Yves Avenard [:jya] from comment #11)
> > Anything relate to the windows decoder predates me working for Mozilla...
> > So I don't know why and how YUV surface are uploaded to the GPU, and it's
> > not something the media team would have worked on, but gfx.
> 
> GFX offers several ways to do this, the one used here though, is one of the
> less efficient ones, I'm not sure who wrote the FFMPEG code in media, it may
> have been someone from GFX, or from media, but the one being used here is
> probably not the best.

The decoder simply copy the YUV buffer provided by FFmpeg's libavcodec. How else can you copy that buffer. it's memory based and the reference count provided by libavcodec is unusable for our needs. so we must make a copy.

Matt wrote a code that allows to immediately upload the data to the D3D buffer but I don't believe it went in.
(bug 1223270)

not sure why this never went in :(

> I'm on software decoding and I can reproduce this problem quite easily.
> D3D11 is still relevant, as we use D3D11 to composite the window. Our FFMPEG
> codepath (used for VP9 software decoding) uses an inefficient codepath to
> upload the YCbCr surface to D3D11 though, which uses more memory bandwidth
> than strictly necessary.

that would be good.

I'll check bug 1223270...
(In reply to Jean-Yves Avenard [:jya] from comment #13)
> (In reply to Bas Schouten (:bas.schouten) from comment #12)
> > (In reply to Jean-Yves Avenard [:jya] from comment #11)
> > > Anything relate to the windows decoder predates me working for Mozilla...
> > > So I don't know why and how YUV surface are uploaded to the GPU, and it's
> > > not something the media team would have worked on, but gfx.
> > 
> > GFX offers several ways to do this, the one used here though, is one of the
> > less efficient ones, I'm not sure who wrote the FFMPEG code in media, it may
> > have been someone from GFX, or from media, but the one being used here is
> > probably not the best.
> 
> The decoder simply copy the YUV buffer provided by FFmpeg's libavcodec. How
> else can you copy that buffer. it's memory based and the reference count
> provided by libavcodec is unusable for our needs. so we must make a copy.
> 
> Matt wrote a code that allows to immediately upload the data to the D3D
> buffer but I don't believe it went in.
> (bug 1223270)
> 
> not sure why this never went in :(

Right, yes, this is the copy I mean that at the very least can be avoided. :)
Depends on: 1223270
Flags: needinfo?(dmajor)
Whiteboard: [qf][gfx-noted] → [qf:p1][gfx-noted]
Looks like the work for this is happening on some of the blocking bugs, this feels like a meta.

One thing we could decide in this bug - Bas, if we can't get the performance we need, what do you think about disabling subpixel AA on Intel graphics (assuming that's the integrated scenario), or perhaps go with the blocklist approach so that we can tweak the list remotely.
Flags: needinfo?(bas)
(In reply to Milan Sreckovic [:milan] from comment #15)
> Looks like the work for this is happening on some of the blocking bugs, this
> feels like a meta.

I think you're right.

> One thing we could decide in this bug - Bas, if we can't get the performance
> we need, what do you think about disabling subpixel AA on Intel graphics
> (assuming that's the integrated scenario), or perhaps go with the blocklist
> approach so that we can tweak the list remotely.

I actually wouldn't be wholly against reducing the amount of places where we do subpixel AA. afaik no other browsers jump through the kinds of hoops we do to supply subpixel aa'ed text. And I have my doubts about how much this would be a reason for people to pick firefox.
Flags: needinfo?(bas)
(In reply to Bas Schouten (:bas.schouten) from comment #16)
> ...
> 
> > One thing we could decide in this bug - Bas, if we can't get the performance
> > we need, what do you think about disabling subpixel AA on Intel graphics
> > (assuming that's the integrated scenario), or perhaps go with the blocklist
> > approach so that we can tweak the list remotely.
> 
> I actually wouldn't be wholly against reducing the amount of places where we
> do subpixel AA. afaik no other browsers jump through the kinds of hoops we
> do to supply subpixel aa'ed text. And I have my doubts about how much this
> would be a reason for people to pick firefox.

Opened bug 1365772 for this.
David, can you set the pref for component alpha (layers.componentalpha.enabled) to false and see how the performance changes?
Flags: needinfo?(dmajor)
Maybe slightly less bad (it's hard to say) but overall still very unpleasant to use.
Flags: needinfo?(dmajor)
Whiteboard: [qf:p1][gfx-noted] → [qf:meta][gfx-noted]
Could you retry again with a build that has bug 1223270 in ?

thanks
Flags: needinfo?(dmajor)
> Could you retry again with a build that has bug 1223270 in ?

It's still somewhat laggy, but it's not horrible like before.
Flags: needinfo?(dmajor)
Keywords: perf
platform-rel: --- → ?
Whiteboard: [qf:meta][gfx-noted] → [qf:meta][gfx-noted][platform-rel-youtube]
You need to log in before you can comment on or make changes to this bug.