Massive shmem leak of textures in e10s (from canvas.captureStream()?)

RESOLVED WORKSFORME

Status

()

P1
normal
Rank:
12
RESOLVED WORKSFORME
2 years ago
2 years ago

People

(Reporter: jesup, Unassigned)

Tracking

(Depends on: 1 bug, {memory-leak, regression})

52 Branch
memory-leak, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 unaffected, firefox51 unaffected, firefox52+ fixed, firefox53+ fixed, firefox54+ fixed)

Details

(Whiteboard: [e10s][MemShrink], URL)

Attachments

(3 attachments, 2 obsolete attachments)

(Reporter)

Description

2 years ago
Found by pehrsons

We're leaking massive amounts of shmem when this page ends the MediaRecorder use and puts up the save requester.  This continues so long as the page is open.

The leak is in shmem-mapped. It's allocated in mozilla::ipc::CreateSegment().  We are continually allocating 2 frames; one from mozilla::layers::TextureClient::CreateForDrawing() (which leaks), one from mozilla::layers::TexClientFactory::Create() (which is freed).  Before the recording finishes, both allocations are freed (Create, Create, Unmap, Destroy, Unmap, Destroy).

I suspect the problem isn't really MediaRecorder, but perhaps might be that the MediaStream is no longer consuming the canvas.captureStream(). 

Note: requires e10s.

Go to the URL.  Try the "demo" song.  Hit "Export and Download", choose "Basic".  When the recording finishes, and the save requester pops up, the leak starts.  Dismissing the save dialog doesn't stop the leak.  reloading the page does, but the leaked shmems are still leaked.
(Reporter)

Updated

2 years ago
Rank: 12
(Reporter)

Comment 1

2 years ago
You can watch this using a separate window with about:memory, and check the shmem-mapped entry once the recording ends.  I was testing on local linux64 debug.  Note that a small fix is needed or this page tends to crash in an assert when loading it.  See bug 1329082 for the patch.  

On Win10 running Aurora 52 (Nightly opt build), I see a one-time leak, not a continuing one -- but in my test, it leaked 1.5GB.  However a second run within the same browser didn't appear to leak any more.  Not sure what's going on with that.  And not sure it's the same leak.
(Reporter)

Updated

2 years ago
status-firefox51: --- → ?
status-firefox52: --- → ?
status-firefox53: --- → affected
(Reporter)

Comment 2

2 years ago
Created attachment 8824323 [details]
GDB dumps of Create(), Create(), (Unmap(), Destroy())*2 before leak started
(Reporter)

Comment 3

2 years ago
Created attachment 8824324 [details]
GDB dumps of Create(), Create(), Unmap(), Destroy() after leak started
I'm not sure whether you know the problem.
Flags: needinfo?(pehrson)
I do, thanks. Haven't had time to dig in yet. I'm guessing the canvas is producing frames, but in some circumstance something is not consuming them properly - probably MediaRecorder, perhaps the compositor. The circumstance seems to be that main thread is completely occupied by the RefreshDriver.
Flags: needinfo?(pehrson)
Here's a jsfiddle that reproduces the problem (I feel pretty confident the leaking is for the same reason - though this also reveals a starvation issue).

STR
1. Click "Start". This starts drawing into the webgl canvas, which is already captured to a MediaStream.
2. Click "Remove Elements". This removes both the canvas and the video element from the dom, so no transaction to the compositor will now be made.

Expected: Main thread should be at least as responsive as before since we're doing less work.

Actual: Main thread is starved out by RefreshDriver ticks. This can be observed in DevTools' performance recorder - JS Flame Chart, with gecko platform data enabled.

When testing this, beware that if you run out of available RAM your computer might freeze, that has happened to me a couple of times now. I can reproduce this at a 100% rate with Firefox 52 and 53 on Ubuntu. Could be hardware and platform dependent since it's using WebGL.
status-firefox52: ? → affected
It does appear to be e10s-only.
Whiteboard: [e10s]
I think the starvation issue comes from bug 1306591. Smaug, could you take a look?
Flags: needinfo?(bugs)
Created attachment 8827548 [details]
Main thread is only running the refresh driver (with canvas capture being the slowest thing)

Here's a screenshot showing the issue in devtools. See comment 6 to repro.
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #9)
> Here's a screenshot showing the issue in devtools. See comment 6 to repro.

Note that I clicked start after ~1.6s and stop immediately after. It took ~10s for stop to get processed, which was just before I would have run out of memory.
bug 1306591 doesn't starve main thread. Normal events are processed too.
Sure, if refresh driver processing is slow, then we don't end up processing too many normal runnables.
Why we're so slow? That is the real bug.
Flags: needinfo?(bugs)
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #6)
> Here's a jsfiddle that reproduces the problem 
Where?

Updated

2 years ago
Depends on: 1331706

Updated

2 years ago
Flags: needinfo?(pehrson)
[Tracking Requested - why for this release]: large memory leaks when using canvas.captureStream()
status-firefox50: --- → unaffected
status-firefox51: ? → unaffected
tracking-firefox52: --- → ?
tracking-firefox53: --- → ?
Keywords: regression
Version: unspecified → 52 Branch
Whiteboard: [e10s] → [e10s][MemShrink]
Sorry, forgot the link. https://jsfiddle.net/pehrsons/o5xbu31o/
Flags: needinfo?(pehrson) → needinfo?(bugs)
(In reply to Olli Pettay [:smaug] (review request backlog because of a work week) from comment #11)
> Why we're so slow? That is the real bug.

Canvas captureStream's frame capture happens in the refresh driver, and with webgl this does a synchronous readback from the gpu. That can be very slow depending on the gpu bandwidth and canvas size.

The synchronous readback can be improved to be async, but that would need support from gfx and be a fairly large chunk of work I think. Though it would improve perf of canvas capture greatly too.
ok, so that is slow. Why are we then not releasing memory, that is another bug. main event loop doesn't starve totally, so it should be able to release whatever is needed.
Flags: needinfo?(bugs) → needinfo?(pehrson)
Not knowing about the setup at all, but I'd expect memory to be released before the next frame is captured.
Yeah, I'll look into that next. Frames are handled off the main thread but here it seems to only clean the allocations up during GC which is not what we want.

I can easily get main thread to not process anything else than vsync events for 10 seconds or more. It is completely oblivious to user input during this time. That doesn't seem ideal either.

Is there no generic solution to this? I imagine cases where drawing into a canvas is slow, thus causing the same issue, since the requestAnimationFrame handler runs in the refresh driver. Though it seems to be less of a problem when said canvas is visible for some reason.
Flags: needinfo?(pehrson)
Tracking 52/53+ to address these large memory leaks.
tracking-firefox52: ? → +
tracking-firefox53: ? → +
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #18)
> I can easily get main thread to not process anything else than vsync events
> for 10 seconds or more. 
I don't understand how that is possible, since in thread level we always try to process one non-prioritized runnable after a prioritized runnable (assuming there is something in the normal queue)

http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/xpcom/threads/nsThread.cpp#804,808-809,816,818-819

Perhaps we could process more normal events, like have also some timestamp of the end time of the high priority runnable handling and try to process normal events until x ms has elapsed.
That is a bit hacky solution, since one really shouldn't do super heavy stuff during refreshdriver tick. It should be moved to idle queue when possible, or some background thread or so.
(In reply to Olli Pettay [:smaug] (review request backlog because of a work week) from comment #20)
> (In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #18)
> > I can easily get main thread to not process anything else than vsync events
> > for 10 seconds or more. 
> I don't understand how that is possible, since in thread level we always try
> to process one non-prioritized runnable after a prioritized runnable
> (assuming there is something in the normal queue)

I don't know, but see the screenshot. During that whole time I was waiting for a mouse event that took 10+ seconds to run.
sure, because there are other things in the main event loop too.
I'm writing a hacky patch to let us process more normal events, but I still think 
'Canvas captureStream's frame capture happens in the refresh driver' is a real issue to fix, at least in webgl case. Isn't that close to as bad as sync I/O. Not sure how to fix it though? Can we readback in some background thread?

Updated

2 years ago
Depends on: 1332226
(In reply to Olli Pettay [:smaug] (review request backlog because of a work week) from comment #23)
> I'm writing a hacky patch to let us process more normal events, but I still
> think 
> 'Canvas captureStream's frame capture happens in the refresh driver' is a
> real issue to fix, at least in webgl case. Isn't that close to as bad as
> sync I/O. Not sure how to fix it though? Can we readback in some background
> thread?

There's async readback, but I don't think our graphics backend is up for it yet.

Comment 25

2 years ago
bug 1326311 seems to solve also the bug from the test in comment 0. A big leak was visible without the patches and disappear with it.

It is connected with the MediaRecorder and it seems to me logical that this bug 1326311 impact also the test in comment 0 here.

See the bug 1326311 for more information.
(In reply to Olli Pettay [:smaug] from comment #16)
> ok, so that is slow. Why are we then not releasing memory, that is another
> bug. main event loop doesn't starve totally, so it should be able to release
> whatever is needed.

The producer is in the refresh driver tick -- prioritized.

The last consumer (looking at the stack trace of the destructor) is nsOwningThreadSourceSurfaceRef at [1] which dispatches a runnable to destroy the SourceSurface on the owner thread (main) if Release() happened on another thread. 

This could explain why I don't see the same problems when the canvas is visible in the DOM -- CompositorChild might be the last consumer, releasing it on main thread, not causing any extra runnables in the regular queue.

[1] http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/gfx/layers/ImageContainer.h#90
The thing calling nsOwningThreadSourceSurfaceRef::Release() is StreamTracks::Track::ForgetUpTo(), so that bit seems to work. It just takes a long time for the SurfaceReleaser to run on main thread.

Smaug, what would take to dispatch the SurfaceReleaser with high priority?
I tried to look through your code but it only seems supported for IPC messages for now?
Flags: needinfo?(bugs)
high priority runnables must not be used for random things.
(but using them is simple, just implementing nsIRunnablePriority)
Flags: needinfo?(bugs)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
A quick manual test confirms that this works.
Assignee: nobody → pehrson
Status: NEW → ASSIGNED

Comment 32

2 years ago
mozreview-review
Comment on attachment 8830696 [details]
Bug 1329079 - Make the SurfaceReleaser runnable HIGH priority.

https://reviewboard.mozilla.org/r/107438/#review108540

No. We should not start to use high priority events for random things.

I'm in process to slowing down vsync processing when processing refresh driver ticks takes too much time, like here.
Attachment #8830696 - Flags: review?(bugs) → review-
Attachment #8830695 - Flags: review?(jmuizelaar)
Attachment #8830696 - Flags: review?(jmuizelaar)
Ok, I'll wait for a more general fix in bug 1332226.
Assignee: pehrson → nobody
Status: ASSIGNED → NEW
I think comment 23 is still valid. We're blocking refresh driver for too long.
IIUC, bug 1331706 should have fixed this for the time-being. Can you please confirm? We'll presumably want to leave this bug open for tracking future work needed after bug 1332226 lands, but hopefully we can at least get it off the immediate radar.
Flags: needinfo?(rjesup)
(Reporter)

Comment 36

2 years ago
andreas - forwarding to you; do you still see this?  We'll need to leave this open in any case
Flags: needinfo?(rjesup) → needinfo?(pehrson)
I don't see the issue per comment 0, but there's still the issue of allocating frames faster than they can be consumed in 53 and 54. Those should be fixed by bug 1332226, and we can revisit when a fix for that has landed.
Flags: needinfo?(pehrson)
Actually this was fixed by bug 1331706 in both 52 and 53.
Blocks: 1306591
status-firefox52: affected → unaffected
status-firefox53: affected → unaffected
status-firefox54: --- → affected
tracking-firefox54: --- → ?
That fix landed on 53 when it was still trunk, so this should be fixed across all versions at the moment. But yeah, we'll want to keep this open and keep an eye on things once bug 1332226 lands.
status-firefox52: unaffected → fixed
status-firefox53: unaffected → fixed
status-firefox54: affected → fixed
tracking-firefox54: ? → +
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #15)
> The synchronous readback can be improved to be async, but that would need
> support from gfx and be a fairly large chunk of work I think. Though it
> would improve perf of canvas capture greatly too.
Who would be the right person to look at that? We really must not block main thread longer than needed
(which is why we for example are removing all the sync IPC we can).
Blocking main thread mean nothing else there gets to run, so the thread is just idle for too long.
(Reporter)

Comment 41

2 years ago
(In reply to Olli Pettay [:smaug] from comment #40)
> (In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #15)
> > The synchronous readback can be improved to be async, but that would need
> > support from gfx and be a fairly large chunk of work I think. Though it
> > would improve perf of canvas capture greatly too.
> Who would be the right person to look at that? We really must not block main
> thread longer than needed
> (which is why we for example are removing all the sync IPC we can).
> Blocking main thread mean nothing else there gets to run, so the thread is
> just idle for too long.

Milan?  This is talking about a canvas readback,  From Comment 15:
> Canvas captureStream's frame capture happens in the refresh driver, and with webgl this does a 
> synchronous readback from the gpu. That can be very slow depending on the gpu bandwidth and canvas 
> size.

> The synchronous readback can be improved to be async, but that would need support from gfx and be a 
> fairly large chunk of work I think. Though it would improve perf of canvas capture greatly too.
Flags: needinfo?(milan)
(In reply to Olli Pettay [:smaug] from comment #40)
> Who would be the right person to look at that? We really must not block main
> thread longer than needed
> (which is why we for example are removing all the sync IPC we can).
> Blocking main thread mean nothing else there gets to run, so the thread is
> just idle for too long.

I reached out to Jeff Gilbert (WebGL peer) a week or two ago but haven't heard back yet.
We're now talking about the performance aspect of captureStream, rather than a memory leak, right?  Perhaps another bug for that, and we can link it to the WebGL performance meta bug 1207170, as we're currently evaluating the relative priority of different aspects of WebGL performance.
Flags: needinfo?(milan)
(Reporter)

Updated

2 years ago
Blocks: 1340142
(Reporter)

Updated

2 years ago
No longer blocks: 1340142
Depends on: 1340142

Comment 44

2 years ago
It looks like the memory portion has been fixed and we have bug 1340142 for the perf discussion. Lets go ahead and close this.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
Attachment #8830695 - Attachment is obsolete: true
Attachment #8830696 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.