Closed Bug 1491703 Opened 6 years ago Closed 6 years ago

Extreme lag with download button when WebRender is enabled with lots of Windows open

Categories

(Core :: Graphics: WebRender, defect, P2)

64 Branch
defect


RESOLVED FIXED

People

(Reporter: loic.yhuel, Assigned: mattwoodrow)

Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0
Build ID: 20180916100118

Steps to reproduce:

Start a download, with the download button present in the toolbar.



Actual results:

Since WebRender is enabled (Nightly, GTX970 with 399.24 driver on Win10), the whole browser (both UI and chrome) becomes extremely laggy in this situation.


Expected results:

Even if the download button is shown in all windows, the graphics performance shouldn't be affected by the slowly changing integrated progress bar.
Hey Jeff, does this sound familiar? I've made it a P1 at least until we know more.
Flags: needinfo?(jmuizelaar)
Priority: -- → P1
Loïc, can you get a profile when this happens using the Gecko Profiler from https://perfht.ml? Please make sure "Threads" is set to "GeckoMain,Compositor,RenderBackend,Renderer,WebRender,Wr,Paint" in settings.
Flags: needinfo?(jmuizelaar) → needinfo?(hwti)
Before running the profiler, I looked at memory and GPU usage (I only looked at CPU previously, which was fine).
Even just after restarting Firefox (about 20 windows, only one tab per window should be loaded), one of the content processes has a working set of 3.3GB, and almost 3GB of dedicated GPU memory used. So there might be an issue here.

During download, the GPU is often fully used, but it's shown as "Copy" in the task manager, not "3D".
GPU core and graphics memory clocks go up to the maximum level (1342 MHz/1752 MHz).

During normal usage (scrolling, loading pages, ...) :
 - "3D" usage is a few percent (it can go higher, for example with WebGL or 4K video)
 - "Copy" usage is almost 0

Here is the profile : https://perfht.ml/2Nrignm
I started the download, then the profiler, waited 10s, then tried to scroll (it worked for 1s then started to stutter heavily, with roughly 1s between frames).
I had difficulty stopping the profiler, since the extension popup was obviously affected by the issue (it didn't even fully paint unless I moved the cursor over items).
Flags: needinfo?(hwti)
I'm unable to reproduce this while testing on the HP WR reference machine (Win10).

With the WR profiler enabled, the frame rate doesn't seem to change, whether I'm downloading or not. I tried with the download button on the toolbar, and also with the download window open / closed.

Is anyone else able to reproduce this?
Ah, interesting - so perhaps there is some kind of leak, perhaps specific to the site of the download URL? Does it occur on any file you download, or perhaps only specific pages?
When I try to download an Ubuntu ISO from a mirror site, the memory usage I'm seeing in nightly on Win10 is:

229 MB in the root process, ~35 MB in four of the content processes, and ~140 MB in the other child process.

So it seems likely that there is a memory leak, but it's only triggered by certain pages, perhaps?

One interesting test - could you enable gfx.webrender.debug.render-targets and gfx.webrender.debug.texture-cache, and either post here roughly how many of each of those targets show up, or a screenshot of the page with those debug options enabled? On my machine, I see 11 texture cache rectangles, and 4 render target rectangles, which seems quite normal.
Thanks for reporting this.  Can you help us by answering Glenn's questions above?
Flags: needinfo?(hwti)
(In reply to Glenn Watson [:gw] from comment #5)
> Ah, interesting - so perhaps there is some kind of leak, perhaps specific to
> the site of the download URL? Does it occur on any file you download, or
> perhaps only specific pages?

It affects all downloads, even when just pasting a URL in a blank tab.
When the download is slow, the issue is intermittent (fluid for a few seconds, then 3 frames with freezes, and so on), it's probably linked to the progress bar update.
Minimizing all windows but one doesn't avoid the issue.
However, doing the download in a private browsing window (so the other ones don't show the download button) avoids it: even the private window itself is fluid, and GPU Copy usage is almost 0%.

(In reply to Glenn Watson [:gw] from comment #6)
> So it seems likely that there is a memory leak, but it's only triggered by
> certain pages, perhaps?
The memory usage is high right from the start, so that would be almost at first paint.
But it wouldn't explain the GPU Copy usage, unless the driver is doing something crazy since the dedicated memory usage is high (total 3.2-3.4/4GB).
Outside of downloads the memory usage is similar without any major problem (only a few stutters from time to time, for example the first time the ctrl+F bar is shown).

But I just found one other case where the GPU Copy usage is near 100% : on startup, for a few seconds.
It probably explains why the startup seems slower with WebRender (but still hugely faster than a few months ago when enabling WebRender would make the startup take more than 30s).
It's interesting that (a) memory usage for you is high from startup (it's not on my test machine) and that (b) it doesn't seem to occur in a private window.

I wonder if it's a WR bug that is being triggered by something in an add-on? Would you be able to list any add-ons / themes you have enabled, or perhaps try and see if you can still reproduce this bug on your machine in a fresh profile that doesn't have any add-ons?
(In reply to Glenn Watson [:gw] from comment #6)
> One interesting test - could you enable gfx.webrender.debug.render-targets
> and gfx.webrender.debug.texture-cache, and either post here roughly how many
> of each of those targets show up, or a screenshot of the page with those
> debug options enabled? On my machine, I see 11 texture cache rectangles, and
> 4 render target rectangles, which seems quite normal.
On this page, 12 or 13 depending on the window size, and 4.
On a new window with a blank tab, 9 and 3.
Most windows have about the same number of texture cache rectangles (I didn't switch tabs).

The only exception seems to be https://forum.xda-developers.com, which starts at 18 and goes up/down when scrolling, which seems strange. Closing the tab seemed to free 300MB of GPU memory (and system memory too). I reproduced it several times, but after restarting Firefox I don't see it any more.

But even with the tab closed, the biggest content process still takes 3GB working set (many windows so perhaps expected) + 3GB GPU memory on start.
The system memory slowly goes down after some time (GC ?), it's now at 700MB after more than 10 min of almost idle, but can also increase quickly (perhaps background tab or extension activity).
On the other hand, the GPU memory doesn't decrease, except sometimes when closing tabs (but then I saw several times a 100MB-200MB increase just by opening a window with a blank tab, but again failed to reproduce it later, so I'm not sure).

But the issue with downloads is still present.

(In reply to Glenn Watson [:gw] from comment #9)
> It's interesting that (a) memory usage for you is high from startup (it's
> not on my test machine) and that (b) it doesn't seem to occur in a private
> window.
The private window doesn't reproduce the graphics lags, nothing about memory in this test.
I opened 7 private windows then started a download: the download button is shown in all of them, but there is no issue.

The memory usage might be a different issue from the graphics lags when downloading.
It might just be because I have many windows (and tabs, but only one tab per window should load on startup).
But the GPU part is worrying, unless there are caches which can be freed on demand (if the system is running low on GPU memory).
The GPU "Copy" usage on startup might still suggest a link, but there is probably so much done at this time it's difficult to say.

> 
> I wonder if it's a WR bug that is being triggered by something in an add-on?
> Would you be able to list any add-ons / themes you have enabled, or perhaps
> try and see if you can still reproduce this bug on your machine in a fresh
> profile that doesn't have any add-ons?
Add-ons (which seem to be enabled in the private window too):
 - Gecko profiler (installed to create the profile asked by [:jrmuizel] here, so not an issue)
 - Greasemonkey
 - Stylus
 - Test pilot
 - uBlock Origin
I use the Mozilla black theme (so not default, but it shouldn't be much different).
Flags: needinfo?(hwti)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #11)
> What if you try a fresh profile?
> https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
It works fine with a fresh profile (even creating many windows), exactly like with private windows.

On my profile, disabling the addons doesn't seem to change anything.
When the icons are added or removed, or switching themes, I see a short GPU Copy spike (much shorter than on startup), but no GPU 3D usage.
So there seem to be costly chrome redraws, but I wonder why it's the GPU "Copy".
It could be a texture upload from system to GPU memory, but if even a small update like the progress bar takes time, it suggests :
 - the uploaded size might be too large (no subtexture ?)
 - it might be repeated for all windows, without sharing a texture
 - even minimized windows would do it, instead of deferring the work
Flags: needinfo?(hwti)
(In reply to Loïc Yhuel from comment #12)
> It works fine with a fresh profile (even creating many windows), exactly like
> with private windows.
> 

So the profile does show very long composites. With all of the time being spent in NDXGI::CDevice::PresentImpl.

If you change the fresh profile to the dark theme does it happen? It would be really valuable to figure out what about your profile causes the problem.
Flags: needinfo?(hwti)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
> (In reply to Loïc Yhuel from comment #12)
> > It works fine with a fresh profile (even creating many windows), exactly like
> > with private windows.
> > 
> 
> So the profile does show very long composites. With all of the time being
> spent in NDXGI::CDevice::PresentImpl.
> 
> If you change the fresh profile to the dark theme does it happen? It would
> be really valuable to figure out what about your profile causes the problem.

In fact I didn't try enough with the fresh profile.

Each new window (with the default new tab) seems to allocate 200MB of graphics memory, until it hits the limit.
Then, it stops growing, but doesn't use shared memory either : I think it keeps data in system memory, and copies it to the card when needed.
So it's like swapping, but between CPU and GPU! After 13 windows it starts to slow down, and it gets worse and worse, and obviously it affects operations which need to update all windows, like the download button, switching theme, ...
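
The arithmetic behind this observation can be sketched roughly as follows. This is only a back-of-the-envelope model, not anything taken from Firefox or the driver: the function name `windows_before_paging` and the `other_usage_mb` parameter are hypothetical, and the 200MB-per-window and 4GB figures are the rough values quoted in this bug.

```python
def windows_before_paging(vram_mb, per_window_mb, other_usage_mb=0):
    """Return how many windows fit in dedicated GPU memory before paging starts.

    other_usage_mb is a hypothetical allowance for everything else that
    lives in VRAM (other applications, the desktop compositor, and so on).
    """
    available_mb = vram_mb - other_usage_mb
    return max(available_mb // per_window_mb, 0)


VRAM_MB = 4096        # 4GB of dedicated memory, as reported by the task manager
PER_WINDOW_MB = 200   # rough per-window allocation observed above

# Upper bound if nothing else used any GPU memory at all.
upper_bound = windows_before_paging(VRAM_MB, PER_WINDOW_MB)
print(upper_bound)
```

With these numbers the upper bound is 20 windows. A slowdown that starts earlier than that would be consistent with other consumers holding part of the memory, which this sketch only models through the assumed `other_usage_mb` parameter.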
Flags: needinfo?(hwti)
When the issue happens, the task manager still shows dedicated GPU memory usage at 3.xGB/4GB, so perhaps it's how Windows manages the memory, or there are peaks which aren't displayed (the GPU memory usage shown for dwm is strange: either there is a bug, or it temporarily allocates or takes ownership of hundreds of megabytes when rendering a frame).
I can reproduce the extreme lag when I have lots of windows open. Can you confirm that you don't see the problem when you have only one window open?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(hwti)
Summary: Extreme lag with download button when WebRender is enabled → Extreme lag with download button when WebRender is enabled with lots of Windows open
(In reply to Jeff Muizelaar [:jrmuizel] from comment #16)
> I can reproduce the extreme lag when I have lots of windows open. Can you
> confirm that you don't see the problem when you have only one window open?
Yes, no issue with only one window (even with more than 30 loaded tabs in it, the graphics memory usage doesn't really increase much).

The 200MB per window is strange, especially since they aren't even maximized.

If the code was able to defer the composition for non-visible windows (either minimized, or even overlapped ones), that would help, since even with the high graphics memory usage it would most likely only have to work with a smaller subset.
Obviously being able to release resources for minimized windows would be even better.
Flags: needinfo?(hwti)
It appears that the upload animation triggers scene building. It'd be interesting to see if we can make it a simple UpdateImage message somehow. If we manage to do this, another improvement could be to make it so we can update images without re-building the frame (which we'll want to do sooner or later because it will greatly improve software video playback on non-windows platforms).

In the meantime, we'll most likely get improvements from landing https://github.com/servo/webrender/pull/3092 because it will remove one frame build and some rendering work for each tick of the animation per window, which isn't most of the work but is not negligible nonetheless.
So the same jank problem happens with Gecko when you have lots of windows open. It's just much less extreme, perhaps because composition is cheaper. 

If you look at https://perfht.ml/2O5mzo5 during a composition flood, you can see we have a bunch of compositions but they take a really wide range of times (2.8ms to 99ms). I suspect there might be something wrong going on here causing some compositions to take a really long time.
Another interesting data point: Running an async animation (http://jsfiddle.net/GXPS8) in all of the windows gives the following profile: https://perfht.ml/2O8CFNQ. In that profile we're saturating with composites, but all of them take < 2ms. This suggests that we're able to present to all of these windows without everything going really bad and that there's something more special about the download animation. The next thing to try is a main-thread animation and see what happens with that.
Many windows of the sync animation https://jrmuizel.github.io/tmp/margin-anim.html seem to work ok too. So there's something even more mysterious going on. I wonder if it's related to updating of the OS progress animation.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #19)
> So the same jank problem happens with Gecko when you have lots of windows
> open. It's just much less extreme, perhaps because composition is cheaper. 
I remember sometimes seeing higher than expected CPU usage during downloads.
So the effect is probably different, and it might depend on the CPU performance.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
> Another interesting data point: Running an async animation
> (http://jsfiddle.net/GXPS8) in all of the windows gives the following
> profile: https://perfht.ml/2O8CFNQ. In that profile we're saturating with
> composites, but all of them take < 2ms. This suggests that we're able to
> present to all of these windows without everything going really bad and that
> there's something more special about the download animation. The next thing
> to try is a main-thread animation and see what happens with that.
Tried on a fresh profile (with WebRender) :
 - up to 9 windows, everything seems fine
 - with 10-15 windows, the performance slowly decreases, GPU Copy usage is still 0%
 - with 16 windows, the animation stutters a lot more, and GPU Copy usage is 55%

(In reply to Jeff Muizelaar [:jrmuizel] from comment #21)
> Many windows of the sync animation
> https://jrmuizel.github.io/tmp/margin-anim.html seem to work ok too. So
> there's something even more mysterious going on. I wonder if it's related to
> updating of the OS progress animation.
Tried on a fresh profile (with WebRender) :
 - up to 14 windows, everything seems fine
 - with 15 windows the framerate drops a little, but GPU Copy usage is still 0%
 - with 16 windows the framerate is very low, and GPU Copy usage is 75%

So there might be different causes, but the GPU memory usage definitely plays a big role in my case (obviously the threshold will depend on how much graphics memory is available).
How many windows did you try ?
I'm observing the same problem when having 6 windows run a simple canvas animation (with composite times ranging from 1.9 to 150ms, all time spent in NDXGI::CDevice::Present)
See Also: → 1494115
Assignee: nobody → jmuizelaar
GPUview seems to be working for me. I'll look into this tomorrow.
Jeff sent me a GPUView capture and what stands out to me (disclaimer: I am no GPUView expert) is a big pile of fences alive for long periods of time (in the hundreds of ms), which we already saw from profiling. The lifetimes of these fences correlate with work sitting in the hardware copy queue, and also correlate with DxgKrnl AllocationFault events and, interestingly, with DxgKrnl EvictAllocation events.

Looks like the hardware copy queue needs some memory to be allocated but has to wait for some memory to be deallocated before it can do that. Could be that we are running out of GPU memory (or just some poor scheduling on the driver's part).

There are long (80ms) VidMmOpMakeResident items in the VidMm. I'm assuming that this is the system that fetches back memory that was flushed out of the GPU to make room.
Also, the system paging context's queue is full of work during the same time.

Also, I'm not sure what to make of this, but in the profiles I was looking at the other day, present calls looked like they were periodically growing (2ms, 3ms, 5ms, 40ms, 150ms, back to 2ms, 3ms, 40ms, etc.), so there's some snowball effect that carries over frames until everything gets flushed. In GPUView we can see that at the end of this loooong 200ms GPU fence festival the whole pile of fences gets resolved in about 500 microseconds, so all of these long fences seem to have been blocked on a single thing. Maybe it's one of our throttling mechanisms kicking in after a few frames.


If I can make one uneducated guess here: we are just running out of memory, causing the GPU to flush GPU pages out to CPU memory (or worse), and then end up needing that memory again.
It's interesting that we are seeing this with canvas animations and not CSS animations. It could be a sign that the texture sharing code isn't cleaning its memory up properly, or if it is we aren't managing to get the driver to understand that the memory is available again (or maybe not? I don't know how the download animation's story fits there).

In any case the problem is definitely memory related. The rest of the hardware looks really bored.

Sorry for the messy brain dump.
Depends on: 1494760
(In reply to Nicolas Silva [:nical] from comment #25)
> It's interesting that we are seeing this with canvas animations and not CSS
> animations.
As soon as the whole working set doesn't fit in the GPU memory, any composite cause will trigger the issue.
Usually most windows are idle, so they can stay in the CPU memory. But not with the download animation, or on startup.

The question is : can the per-window memory usage be reduced enough it won't be a problem (on a reasonable number of windows), or should composition be skipped for minimized or even obscured windows ?
Matt -- Can you take on this one?  You're already working on a fix for at least part of it.
Assignee: jmuizelaar → matt.woodrow
Priority: P1 → P2
Depends on: 1495977
I think the memory usage reductions that bholley has been working on have helped a bit here.

I can reproduce the issue by running an animated <canvas> in lots of windows, but it's flawless up to 23 windows, and then degrades really quickly on 24+.
I can also reproduce this without WebRender, it just takes a lot more windows (~80).

I took a GPUView trace of the slow WR case, and there are massive entries for 'Evict Paging Queue' and 'Paging Queue' which aren't present in the trace for the fast case.

My understanding of this is that we're just running out of GPU memory in order to composite all the windows, and we're forcing the driver to page other window content out before we can composite the current window.

It looks like waiting on fences is the expected behaviour for when this happens: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/device-paging-queues

I'm also seeing massive numbers of DxgKrnl EvictAllocation events, as suggested by https://developer.nvidia.com/content/are-you-running-out-video-memory-detecting-video-memory-overcommitment-using-gpuview
I'll re-test once bug 1495977 lands, since in general I think the best solution here is to just be more careful with how much GPU memory we use.
The only solution I can think of is to detect the low-memory situation, and have WR delete the gpu cache textures.

We'd still have to re-upload when we want to use them again (which the driver is doing internally), but at least we'd stop the driver from doing a readback of the contents (since we already have a cpu side copy).

It'd be nice if we could tell the GPU to just overwrite the textures and notify us to upload them again when we need them, but I'm not aware of a way to do that.
(In reply to Matt Woodrow (:mattwoodrow) from comment #31)
> The only solution I can think of is to detect the low-memory situation, and
> have WR delete the gpu cache textures.

Do you mean the GpuCache textures, or the TextureCache textures?

If it's the texture cache, then the stuff I'm working on should do that automatically. We detect aggregate WR memory usage (across all windows, using a static atomic), and evict more and more aggressively the higher overall memory usage goes.

Bug 1495977 doesn't do any eviction, but the lazy growing alone is probably going to be a big help here.
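
As a rough illustration of the idea described above (a shared usage counter across all windows, with eviction that gets more aggressive as total usage rises), here is a minimal sketch. It is not WebRender's actual code: the class names, the linear scaling, and the `relaxed`/`strict` frame limits are all assumptions made for the example, and a lock-protected counter stands in for the static atomic.

```python
import threading


class AggregateUsage:
    """Byte counter shared by every window (stand-in for a static atomic)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bytes = 0

    def add(self, delta):
        with self._lock:
            self._bytes += delta
            return self._bytes

    def get(self):
        with self._lock:
            return self._bytes


def max_idle_frames(total_bytes, budget_bytes, relaxed=600, strict=1):
    """Frames an unused entry may survive; shrinks linearly as usage nears the budget."""
    pressure = min(max(total_bytes / budget_bytes, 0.0), 1.0)
    return max(strict, round(relaxed * (1.0 - pressure)))


class TextureCache:
    """Hypothetical per-window cache that reports its size to a shared counter."""

    def __init__(self, usage):
        self.usage = usage
        self.entries = {}  # key -> (size_bytes, last_used_frame)

    def touch(self, key, size_bytes, frame):
        # Only newly added entries change the aggregate total.
        if key not in self.entries:
            self.usage.add(size_bytes)
        self.entries[key] = (size_bytes, frame)

    def evict(self, frame, budget_bytes):
        limit = max_idle_frames(self.usage.get(), budget_bytes)
        for key, (size, last_used) in list(self.entries.items()):
            # Entries used in the current frame are never evicted.
            if last_used != frame and frame - last_used > limit:
                del self.entries[key]
                self.usage.add(-size)
```

The point of the shared counter is that an idle window gives memory back sooner when some other window is what pushes total usage up, while anything touched in the current frame is always kept.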
(In reply to Bobby Holley (:bholley) from comment #32)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #31)
> > The only solution I can think of is to detect the low-memory situation, and
> > have WR delete the gpu cache textures.
> 
> Do you mean the GpuCache textures, or the TextureCache textures?

Ultimately it's both, but on the synthetic workload I constructed (25 windows, all with the same simple page that has a simple <canvas> animation) it was primarily GpuCache using the memory.

I think the root problem is if the total GPU memory (from all sources) used to composite all currently animating windows exceeds the available vram, then the driver has to swap and things get really bad.

We can do our best to only reserve memory that we actually need, but that just moves the ceiling. Hopefully we can get it to only happen in the most pathological of cases.

> 
> If it's the texture cache, then the stuff I'm working on should do that
> automatically. We detect aggregate WR memory usage (across all windows,
> using a static atomic), and evict more and more aggressively the higher
> overall memory usage goes.

Will it evict even if we used a given texture in the previous composite and will use it again on the next?

> 
> Bug 1495977 doesn't do any eviction, but the lazy growing alone is probably
> going to be a big help here.

Lazy growth is the bit that I expected to help the most.
(In reply to Matt Woodrow (:mattwoodrow) from comment #33)
> (In reply to Bobby Holley (:bholley) from comment #32)
> > (In reply to Matt Woodrow (:mattwoodrow) from comment #31)
> > > The only solution I can think of is to detect the low-memory situation, and
> > > have WR delete the gpu cache textures.
> > 
> > Do you mean the GpuCache textures, or the TextureCache textures?
> 
> Ultimately it's both, but on the synthetic workload I constructed (25
> windows, all with the same simple page that has a simple <canvas> animation)
> it was primarily GpuCache using the memory.

I see. Anyway, my work hasn't done anything to decrease the size of the GPU cache. It's worth noting that the GPU cache doesn't do any shrinking though, even when you minimize memory usage. This seems suboptimal, so I've filed bug 1505449 to investigate it.

What kind of machine were you testing on, what was the overall webrender texture usage in about:memory, and how well did that align with the reported quantity of vram?

> I think the root problem is if the total GPU memory (from all sources) used
> to composite all currently animating windows exceeds the available vram,
> then the driver has to swap and things get really bad.
> 
> We can do our best to only reserve memory that we actually need, but that
> just moves the ceiling. Hopefully we can get it to only happen in the most
> pathological of cases.

Yeah. I do think the aggressiveness scaling is useful here. There's a minimum amount of texture usage needed for correctness for a given workload, but we can try harder to hug that minimum when pressure is high.

> > If it's the texture cache, then the stuff I'm working on should do that
> > automatically. We detect aggregate WR memory usage (across all windows,
> > using a static atomic), and evict more and more aggressively the higher
> > overall memory usage goes.
> 
> Will it evict even if we used a given texture in the previous composite and
> will use it again on the next?

It will not evict anything used in the given frame.

> > Bug 1495977 doesn't do any eviction, but the lazy growing alone is probably
> > going to be a big help here.
> 
> Lazy growth is the bit that I expected to help the most.

Yep. That just hit central, so should be in Nightly shortly.
(In reply to Bobby Holley (:bholley) from comment #34) 
> I see. Anyway, my work hasn't done anything to decrease the size of the GPU
> cache. It's worth noting that the GPU cache doesn't do any shrinking though,
> even when you minimize memory usage. This seems suboptimal, so I've filed
> bug 1505449 to investigate it.
> 
> What kind of machine were you testing on, what was the overall webrender
> texture usage in about:memory, and how well did that align with the reported
> quantity of vram?

Ugh, I did indeed get them mixed up. TextureCache is the issue, so your work *should* help.

I'm running this on my Lenovo P50, so it's an NVIDIA M2000M card (forcibly disabled the intel chip from the bios).

With a slightly less extreme version of the test case, about:memory reports 980mb total for the gfx section, with 722mb of that from the texture cache.

Windows is reporting 1500mb of vmem usage though, so there's a fairly big difference there. We don't report swapchain memory usage, which is significant on 11 windows (~200mb by my rough estimate), but still not all the difference.
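
A rough way to sanity-check the swap chain estimate and the remaining gap is sketched below. The window size (1920x1080), the buffer count (2) and the 4 bytes per pixel are assumptions chosen for illustration, not values measured on this machine; only the 1500mb, 980mb and ~200mb figures come from the numbers above.

```python
def swap_chain_bytes(width, height, buffer_count=2, bytes_per_pixel=4):
    """Approximate memory held by one window's swap chain back buffers."""
    return width * height * bytes_per_pixel * buffer_count


WINDOWS = 11

# Assumed window size and double buffering; real windows may differ.
per_window = swap_chain_bytes(1920, 1080)
total = per_window * WINDOWS
print(f"swap chains: ~{total / (1024 * 1024):.0f} MiB across {WINDOWS} windows")

# Gap between what Windows reports and what about:memory accounts for,
# using the rounded figures quoted above (all in MB).
reported_by_windows_mb = 1500
about_memory_gfx_mb = 980
swap_chain_estimate_mb = 200
unaccounted_mb = reported_by_windows_mb - about_memory_gfx_mb - swap_chain_estimate_mb
print(f"still unaccounted for: ~{unaccounted_mb} MB")
```

Under these assumptions the swap chains come out in the same ballpark as the ~200mb estimate, which still leaves roughly 320mb of the reported usage unexplained.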
Does process explorer show any other apps using significant vram?
Also, it would be nice to report swap chain memory. Is there an easy way to measure it? If so, maybe get a bug on file?
Bug 1495977 helps quite a lot, I can get up to 40 animating windows now before we run out of memory.

I'll have a go at adding a reporter for swap chain memory.
(In reply to Matt Woodrow (:mattwoodrow) from comment #38)
> Bug 1495977 helps quite a lot, I can get up to 40 animating windows now
> before we run out of memory.
> 
> I'll have a go at adding a reporter for swap chain memory.

FYI: We may well want to put this in MemoryReport, since that's the type that gets shuttled around through all the IPC. It's a type defined in WebRender, but you can also fill fields from C++. Note that adding fields requires a wider rebuild than usual, because the IPC serialization bits need to be rebuilt as well.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(In reply to Bobby Holley (:bholley) from comment #39)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #38)
> > Bug 1495977 helps quite a lot, I can get up to 40 animating windows now
> > before we run out of memory.
> > 
> > I'll have a go at adding a reporter for swap chain memory.
> 
> FYI: We may well want to put this in MemoryReport, since that's the type
> that gets shuttled around through all the IPC. It's a type defined in
> WebRender, but you can also fill fields from C++. Note that adding fields
> requires a wider rebuild than usual, because the IPC serialization bits need
> to be rebuilt as well.

Filed bug 1506492.