Closed Bug 617453 Opened 14 years ago Closed 12 years ago

JS engine keeps unreferenced WebGLContexts around too long before GC'ing

Categories

(Core :: XPConnect, defect)

All
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla17

People

(Reporter: bjacob, Assigned: bjacob)

References

(Depends on 1 open bug)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P2] [See comment 35] webgl-next)

Attachments

(2 files)

This bug is split off from the remains of bug 613079.


*** What this testcase does ***

The attached test case is basically doing this:

   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
...
(repeated hundreds of times)

We have a single |webgl| variable here, so the intent of this JS code is to have only one WebGLContext at a time, and to have the old one go away whenever a new WebGLContext is assigned to |webgl| (which causes the previous WebGLContext to become unreferenced).


*** The bug (actual result of testcase) ***

However, the actual behavior of this code is that unreferenced WebGLContexts accumulate until the next GC run, and that GC run occurs so late that eventually we have 165 unreferenced WebGLContexts and the JS program starts failing to create new ones.

To confirm this, I forced a GC run between each new WebGL context creation, with this code,

    netscape.security.PrivilegeManager.enablePrivilege("UniversalXPConnect");
    window.QueryInterface(Components.interfaces.nsIInterfaceRequestor)
          .getInterface(Components.interfaces.nsIDOMWindowUtils)
          .garbageCollect();

and that did fix the problem.


*** Why WebGLContexts are special ***

Each WebGLContext holds a system resource known as an OpenGL context, and that is a limited resource. Even without any particular implementation limitation, there remains the fact that an OpenGL context has resources allocated in video memory, so it can potentially have an arbitrarily large cost, but that is not currently something the JS engine can know.


*** What's needed to fix this bug ***

So what we need to fix this bug is a JSAPI change whereby our WebGL implementation can inform the JS engine that WebGLContexts have an arbitrarily large cost in terms of graphics system resources, and that therefore WebGLContexts should be garbage-collected as soon as they become unreferenced.
(In reply to comment #0)
> So what we need to fix this bug, is a JSAPI change whereby our WebGL
> implementation can inform the JS engine that WebGLContext's have an arbitrary
> large cost in terms of graphics system resources, and therefore WebGLContext's
> should always be immediately garbage-collected as soon as they become
> unreferenced.

This is unfortunately tricky (impossible?), because the way to know whether something isn't referenced is to do a garbage collection pass... otherwise, we wouldn't need to GC at all!
You can't rely on when GC happens, ever. That's simply not how it conceptually works.
Can WebGLContext know when its reference count, from the point of view of the JS engine, hits 0? If we caught that and released all GL resources then that would fix the problem.
Yeah, we could use a weak pointer model here. That's a good idea. I will see if I can grab vlad upstairs.
There is no concept of "reference count" in the JS engine.  The way a JS object knows that nothing is referencing it is that we do a GC and the object doesn't get marked.

I'd love to look at the testcase, but it's compressed using some random compression utility that I don't have a decompression tool for and which macports, for one, knows nothing about.

But if the code in comment 0 correctly represents the testcase, why would this ever allocate more than one single webgl context?  Repeated context gets with the same context id on the same canvas just return the same object...
Yeah, you'd need to use different canvas objects.  At some point you would hit a resource limit (whether memory, or number of active contexts, or something), but it's really no different than hitting any other resource limit.
Ah, looks like the testcase in this bug keeps creating new <canvas> objects.  Is that actually something someone is likely to be doing?

If so, does a WebGLContext _have_ to hold on to an OpenGL context all the time?  I realize that it's clearly faster if it does, but could we treat this as a cache and drop some existing GLContexts if we detect failure to create a new one?
In particular, I'm more worried about 40 tabs each with 4 gl canvases in them all of which are in fact live than I am about synthetic testcases like the one attached.
(In reply to comment #8)
> Ah, looks like the testcase in this bug keeps creating new <canvas> objects. 
> Is that actually something someone is likely to be doing?
> 
> If so, does a WebGLContext _have_ to hold on to an OpenGL context all the time?
>  I realize that it's clearly faster if it does, but could we treat this as a
> cache and drop some existing GLContexts if we detect failure to create a new
> one?

No, there has to be a 1:1 mapping.  There's lots of state and objects that are created on the GL side that are specific to each instance.
Hmm.  And we have no way to pull them back browser-side if we run out of GLContexts?
Unfortunately no, not in any useful way (even if we were able to pull them back, it would be very expensive, and then we couldn't keep rendering there anyway... so not sure what good it would do us).
Well, GC is not a way to manage the lifetime of limited resources. We need an indirection here, similar to what we have for workers, and multiplex virtual contexts onto physical GL contexts if the GL side offers only a limited set of handles.
(In reply to comment #6)
> I'd love to look at the testcase, but it's compressed using some random
> compression utility that I don't have a decompression tool for and which
> macports, for one, knows nothing about.

I suppose you've figured it out already, but a Google search for "xz mac" gave this:
http://xz-devel.darwinports.com/
Depending on the platform, video memory may be virtualized, which means that regular RAM is used when video memory runs out. Here the testcase eats most of my RAM before it starts failing.

So I don't really know how much multiplexing WebGLContext's onto real OpenGL contexts would help: it wouldn't help with the memory usage issue.

The example with 40 tabs with 4 WebGLContexts each does not convince me, because at the end of the day we have to leave it to the user to realize that their computer has finite resources.

What we have here is different: a single tab with a JS program that only asked for a single WebGLContext at a time, and it's only because of a technical detail on our part that 160 WebGLContexts are in memory simultaneously.
> a single tab with a JS program that only asked for a single WebGLContext at a
> time

Not quite.  It asked for 160 different WebGLContexts, for 160 different canvases that may have different images, right?  The particular confluence of events that actually causes there to be that many canvases around seems pretty odd to me; I'd think most people just create a canvas and then use it, no?
(In reply to comment #16)
> > a single tab with a JS program that only asked for a single WebGLContext at a
> > time
> 
> Not quite.  It asked for 160 different WebGLContexts, for 160 different
> canvases that may have different images, right?  The particular confluence of
> events that actually causes there to be that many canvases around seems pretty
> odd to me; I'd think most people just create a canvas and then use it, no?

Let me know if something's wrong in my understanding:

But every time, this is assigned to the same |webgl| variable, right? Every time, the code is:

var webgl = initWebGL("canvas",
                      document.getElementById("vshader").text,
                      document.getElementById("fshader").text);

So every time this code happens again, the former WebGLContext referenced by |webgl| goes unreferenced, so there is only 1 WebGLContext referenced at a time, except during the initWebGL call, where there may be 2 WebGLContexts (|webgl| and the newly created context not yet assigned to it).
Everything you say is correct; the part that makes this testcase look odd to me is this part:

  //---[ new canvas object ]
  document.getElementById("canvas_container").innerHTML = backup;

If that line didn't get run, there would only be one webglcontext involved, ever, in this testcase.
And if that line were slightly modified to hold on to the old canvas instead of just forgetting it, then the old webglcontext could still be retrieved.  So it's not as simple as keeping track of the references to the webglcontext; you also have to keep track of references to the canvas it came from.
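
For illustration, a minimal sketch of that modification (assuming, as in the testcase, that the canvas has id "canvas" and |backup| holds the original markup):

  // Hold on to the old canvas before replacing the container's contents...
  var oldCanvas = document.getElementById("canvas");

  //---[ new canvas object ]
  document.getElementById("canvas_container").innerHTML = backup;

  // ...then the old WebGLContext can still be retrieved later, because
  // getContext() with the same context id returns the existing context:
  var oldWebgl = oldCanvas.getContext("experimental-webgl");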
(In reply to comment #17)
> But everytime, this is assigned to the same |webgl| variable, right? Everytime
> the code is:
> 
> var webgl = initWebGL("canvas",
>                       document.getElementById("vshader").text,
>                       document.getElementById("fshader").text);
> 
> So everytime this code happens again, the former WebGLContext referenced by
> |webgl| goes unreferenced, so there is only 1 WebGLContext referenced at a
> time, except during the initWebGL call where there may be 2 WebGLContext's
> (|webgl| and the newly created context not yet assigned to it).

That's actually not correct -- assuming that initWebGL creates a new context each time (which is what happens because of what bz points out above), assigning to this webgl variable just orphans the previous JS object on the heap.  The actual deref doesn't happen until GC occurs, which is when the orphaned object will be discovered and cleaned up.
Just had a discussion with bz on IRC... one thing we could do to fix this is to trigger a local GC whenever getContext would create a new canvas rendering context (perhaps only if that context is webgl, as a special case). That would resolve this in a pretty cheap way.
No, seriously, please stop triggering GCs from random points in the embedding. That is absolutely terrible. A GC is a global event, stopping the entire browser, all threads, flushing caches.
Fwiw, we can't do this with a local gc.  The canvas context and the canvas element hold refs to each other, so we'd need to CC to free them.

Andreas, any better ideas?  My other proposal was to GC when we fail to create a GLContext (similar to GC on OOM), but vlad points out that if we're using GL for our own rendering we might be pretty dead in the water by then anyway...
A garbage-collected heap object has a nondeterministic lifetime. Coupling a GL context to that is wrong by design. (We shouldn't do that with reference-counted objects either, btw.) I would still prefer a multiplexing layer.

Short of that: Allocating a GL context is an ... allocation. So I don't mind if you GC if you are out of GL contexts. That's fair game. It should trigger the memory pressure hook we have been talking about adding. The renderer path should do the same thing. Keep in mind it's not a guarantee that we will free up a GL context, but there is a chance.
> I would still prefer a multiplexing layer.

See comment 12?
I don't understand comment 12. A GL context is a rendering context associated with a state. We can save that state, and then let someone else use the context, then flip back. vlad, why exactly does that not work?
Scarce resources unrelated to JS heap memory should be managed explicitly, not by the GC (except as a leak backstop). Period, end of story.

/be
(In reply to comment #26)
> I don't understand comment 12. A GL context is a rendering context associated
> with a state. We can save that state, and then let someone else use the
> context, then flip back. vlad, why exactly does that not work?

Because it's incredibly expensive, and in some cases impossible.  The "state" includes things like the contents of all textures and other buffers.  In some cases it's impossible to retrieve the information without destroying it (e.g. multisample buffers must be resolved before they can be read).  Any kind of multiplexing is a non-starter.
Well, then we are solving an unsolvable problem here. If you build an array of rendering contexts and make sure they stay live, a single window can push the browser into out-of-GL-contexts.
(In reply to comment #27)
> Scarce resources unrelated to JS heap memory should be managed explicitly, not
> by the GC (except as a leak backstop). Period, end of story.

I agree.  We just don't have any such mechanism right now, and there's no support for anything like it in the DOM, either.  Even ignoring WebGL, you can get the same effect by allocating DOM objects with large internal data and discarding references to them, leading to out-of-memory conditions.

Triggering a GC seems reasonable.  The only reason a cycle relationship exists is to fix a security bug, but we can work around that in a different way.
(In reply to comment #29)
> Well then we are solving an unsolvable problem here. If you build an array with
> rendering contexts and make sure they stay live a single window can push the
> browser into out-of-GL-contexts.

Likewise, a simple script allocating huge JS arrays can give the same result, and there would be nothing to do about it.

The difference is that here, we have a testcase where manually running the GC (comment 0) solves the out-of-memory problem.
You are triggering a JS engine GC at a point where you think it serves your purpose; however, such a GC is a global system event, with tons of global effects. It's simply wrong, and future-hostile in an environment where we are planning major changes such as generational GC. I understand this seems like an easy fix instead of fixing this problem the right way, but if we go down this path, soon enough everyone is going to start abusing GC for resource management.

Don't believe me? Check out XREDirProvider.cpp. Someone is using explicit GCs to observe object finalization, which by definition and design is nondeterministic.

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsXREDirProvider.cpp#806

We want to use N+1 GL contexts but only have N. It seems multiplexing isn't an option, so someone has to give up their context or you refuse to hand out another one. There is no other fix.
I only pointed this out to explain why I believe that the bug discussed here is different from what you describe in comment 29, and is not necessarily "unsolvable".

I understand, as you already explained, that triggering a GC is a bad idea; I was not disputing this fact!
Blocks: mlk-fx4-beta
Keywords: footprint
Whiteboard: [MemShrink:P2]
Assignee: nobody → bjacob
IIRC there have been some changes to the way WebGL memory handling works.  Is this bug still relevant?  I'm hoping not, because I don't understand it at all and there doesn't seem to be any agreement about it.
Yes, this bug is still relevant. I'm sorry I don't understand the JS engine well enough to be able to explain the problem.
To see the problem, run the WebGL conformance tests:
https://cvs.khronos.org/svn/repos/registry/trunk/public/webgl/sdk/tests/webgl-conformance-tests.html
and in another tab, open about:memory, scroll down to webgl-context-count, and reload repeatedly. You'll see the webgl-context-count climb as high as 40 between two consecutive GC runs. This means that at some points in time there are 40 WebGL contexts, hence 40 OpenGL contexts, simultaneously alive. As a result, we run out of memory (note that video memory is virtualized) on some systems just by running the WebGL conformance tests. There is no need to have these 40 WebGL contexts simultaneously alive: this test runner just runs one WebGL test page after the other; typically there is only 1 WebGL context per test page.
For example, in the mochitest version of this WebGL conformance test suite (content/canvas/test/webgl) I have had to insert explicit GC calls before running each test page, to fix random mochitest-1 failures that turned out to be caused by out-of-memory when too many WebGL contexts had accumulated between two spontaneous GC runs.
Whiteboard: [MemShrink:P2] → [MemShrink:P2] [See comment 35]
Attachment #495976 - Attachment description: test case (xz-compressed html) → test case (xz-compressed html) but see Comment 35 instead for a real-world test case URL
It sounds like the particular problem of creating a bunch of GL contexts that become garbage could be dealt with by better memory pressure triggers.  You start creating contexts, then address space or whatever starts becoming tight, the browser notices this and triggers a GC, cleaning up the old junk GL contexts.  I'm not sure of the exact kind of memory pressure these contexts create, or what the status of memory pressure events is.
(In reply to Andrew McCreight [:mccr8] from comment #36)
> It sounds like the particular problem of creating a bunch of GL contexts
> that become garbage could be dealt with by better memory pressure triggers.

See comment 32. I 100% agree with Andreas that the GC is not an option to deal with this. 

Really, if we cannot multiplex the WebGL contexts, then let's restrict the number of them per domain and also restrict the total number of them.

Then when a page wants more than the limit, we can try to reuse the least recently used context, saving its old state if possible. And if that context will be accessed again through that old reference and it is impossible to recover its state, then just document that at any point the system can decide to invalidate the context. At least this puts pressure on web authors to adopt a more careful programming style.
I don't want to say more stupid things, as clearly I don't understand how GC works, but these bits of info may be useful:
1. GL contexts use memory, but that memory is typically not in our process's address space.
2. We have estimates of the memory usage, which we use in the WebGL about:memory reporters, but they don't account for all GL memory usage. For example, if you create 1000 GL contexts and don't create any GL resources on them (like textures) then you're consuming a lot of memory that's not reported at all in the about:memory reporters, except for the webgl-context-count reporter.
3. The main symptom of the bug I'm reporting here is that new WebGL contexts fail to be created. So we could use WebGL context creation failure as the event to respond to. That would already solve the concrete issues I've been experiencing; a sketch of what that could look like follows below.
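
For illustration only, here is roughly what responding to creation failure could look like in a privileged test harness, reusing the forced-GC snippet from comment 0 (ordinary web content cannot do this, and the helper name is made up):

    function createWebGLContextWithRetry(canvas) {
      var webgl = canvas.getContext("experimental-webgl");
      if (webgl)
        return webgl;
      // Creation failed: treat that as the signal that unreferenced contexts
      // have piled up, force a GC (privileged code only), and retry once.
      netscape.security.PrivilegeManager.enablePrivilege("UniversalXPConnect");
      window.QueryInterface(Components.interfaces.nsIInterfaceRequestor)
            .getInterface(Components.interfaces.nsIDOMWindowUtils)
            .garbageCollect();
      return canvas.getContext("experimental-webgl");
    }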
(In reply to Igor Bukanov from comment #37)
> Really, if we cannot multiplex the WebGL contexts

Indeed we can't, because of the excessive performance regression that it would incur (also, it would require complex and bug-prone code).

> then lets restrict the
> number of them per domain and also restrict the total number of them.
> 
> Then when a page wants more than the limit, we can try to reuse the least
> recently used context, saving, if possible its older state. And if that
> context will be accessed again through that old reference and it would be
> impossible to recover its state, then  just document that that at any point
> the system can decide to invalidate the context. At least this put a
> pressure to web authors for more careful programming style.

Ah, OK. Something loosely along these lines can work. We're not going to try to save and restore a WebGL context's state (too complex and bug-prone, as above), but WebGL contexts can be "lost", in which case the script is notified that it has to recreate the context from scratch (we consider that only scripts know how to recreate a WebGL context's state).
This would work: when a script tries to create a WebGL context, if there already are >= 15 WebGL contexts alive on this domain, lose the oldest-used one.
(In reply to Benoit Jacob [:bjacob] from comment #39)
> but WebGL contexts can be "lost", in which case the script is
> notified that it has to recreate it from scratch

So the WebGL API does provide a way for the embedding to invalidate contexts without running the GC but we have not yet implemented that part, right?
(In reply to Igor Bukanov from comment #40)
> So the WebGL API does provide a way for the embedding to invalidate contexts
> without running the GC but we have not yet implemented that part, right?

I don't understand the word "embedding" in this sentence, but yes, the WebGL API defines a special "lost" state for a WebGL context in which all of the context's resources are lost and no rendering can happen; the implementation is free to put a WebGL context into this "lost" state at any time; this is primarily intended for the case of powered-off mobile devices and offending (DoS) WebGL contexts but we're free to use this in other cases like here; scripts are supposed to be listening to "webglcontextlost" events; see this part of the spec: http://www.khronos.org/registry/webgl/specs/latest/#5.15.2
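
For reference, the handling pattern that the spec expects from scripts looks roughly like this (just a sketch; the two helper functions are placeholders, not spec API):

    var canvas = document.getElementById("canvas");  // any canvas with a WebGL context
    canvas.addEventListener("webglcontextlost", function (event) {
      // Without preventDefault(), the browser will not attempt restoration.
      event.preventDefault();
      stopRendering();  // placeholder: stop using the lost context
    }, false);
    canvas.addEventListener("webglcontextrestored", function (event) {
      // All GL resources are gone; the script must recreate them from scratch.
      reinitializeWebGLResources();  // placeholder: recreate shaders, buffers, textures
    }, false);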
(In reply to Benoit Jacob [:bjacob] from comment #41)
> I don't understand the word "embedding" in this sentence,

Embedding == browser

> but yes, the WebGL
> API defines a special "lost" state for a WebGL context in which all of the
> context's resources are lost and no rendering can happen; the implementation
> is free to put a WebGL context into this "lost" state at any time;

Do we implement this in any way currently? What about other browsers?
We do, in Firefox >= 11. We try to detect loss of the underlying GL context and lose the WebGL context in that case. We also support the lose_context extension, which allows artificially triggering context losses for testing purposes.
WebKit does as well, so at least Chrome does, and I suppose Safari (maybe a future version of it). Don't know about Opera.
(In reply to Benoit Jacob [:bjacob] from comment #43)
> We do, in Firefox >= 11. We try to detect loss of the underlying GL context
> and lose the WebGL context in that case. We also support the lose_context
> extension allowing to artificially trigger context losses, for testing
> purposes.

So we have all the infrastructure to lose older contexts if the page creates too many of them, and it is just a matter of utilizing that...
Yes. patch coming up.
Whiteboard: [MemShrink:P2] [See comment 35] → [MemShrink:P2] [See comment 35] webgl-next
Some new data points:
 - Chromium is now doing a nursery GC on WebGL context creation, to collect unreferenced WebGL contexts before creating new ones. (Indeed, in the typical cases where this matters, the WebGL contexts are still in the nursery.)
 - some mobile devices have a very low global limit on the number of GL contexts that can be live simultaneously. This makes this issue more severe than on desktop (it is blocking Vlad's ability to run WebGL performance tests on mobile devices).
 - as part of the games effort we _are_ considering giving other parts of Gecko more control over the GC, e.g. allowing the compositor to schedule incremental GC during compositing's GPU work.

So is this a good time to ask to reconsider allowing the WebGL implementation to trigger some kind of GC just before WebGL context creation?

I _am_ going to implement the other approach discussed above (lose the least-recently-used contexts) but that is not quite as good as GC'ing unreferenced contexts, and ideally I would like to have both.
Had a chat with billm + some ideas.

Vlad: one thing you can do immediately is this, at the end of every test page:

   gl.getExtension("MOZ_WEBGL_lose_context").loseContext();

This should destroy the underlying GL context and free associated resources immediately. If it doesn't, that's a bug.

Bill agreed with the other JS people here that relying on GC for that isn't the right solution. Instead, he suggested that the browser explicitly lose the WebGL contexts of any page that is navigated away from. So for example, in Vlad's case, when the test suite loads a new test page into its iframe, replacing the previous one, we could perhaps detect that, walk the DOM tree of the old page, find its canvases, and lose their WebGL contexts.
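
A rough sketch of the harness-side version of that idea, assuming each test page exposes its context as a global |gl| (a convention assumed here, not something the conformance suite guarantees), and using the MOZ_WEBGL_lose_context extension from the snippet above:

    // Hypothetical harness hook: before loading the next test page into the
    // iframe, explicitly lose the previous page's WebGL context so its GL
    // resources are released without waiting for a GC/CC.
    function losePreviousTestContext(iframe) {
      var win = iframe.contentWindow;
      var gl = win && win.gl;  // assumed convention, see above
      if (!gl)
        return;
      var ext = gl.getExtension("MOZ_WEBGL_lose_context");
      if (ext)
        ext.loseContext();
    }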
Blocks: 778006
The current limits are:

#ifdef MOZ_GFX_OPTIMIZE_MOBILE
    // some mobile devices can't have more than 8 GL contexts overall
    const size_t kMaxWebGLContextsPerPrincipal = 2;
    const size_t kMaxWebGLContexts             = 4;
#else
    const size_t kMaxWebGLContextsPerPrincipal = 8;
    const size_t kMaxWebGLContexts             = 16;
#endif

The current notion of 'last use' is basically last compositing in the sense of DidTransactionCallback, with the exception that we also count SetDimensions calls as 'use' (SetDimensions calls LoseOldestWebGLContextIfLimitExceeded, which calls UpdateLastUseTimestamp). This is far less expensive than doing this work in every WebGL call. If there were a concern about losing "offscreen" WebGL contexts, we could add UpdateLastUseTimestamp in e.g. readPixels or any other way that a WebGL context can matter.

Notice that this patch removes a DestroyResourcesAndContext call in SetDimensions. The reason is that by the time we called it, we already knew that |gl| was null, so DestroyResourcesAndContext was a no-op anyway.
Attachment #646622 - Flags: review?(vladimir)
With the WebGL mochitest reenabled on Android:
https://tbpl.mozilla.org/?tree=Try&rev=290002717fb4
Comment on attachment 646622 [details] [diff] [review]
lose the least-recently-used context when a certain limit is exceeded

This is potentially helpful, but doesn't actually solve the problem.  Calling LoseContext doesn't destroy the actual GL context -- we have no API to do that, other than letting its refcount go to 0.  That might not happen for a while, depending on who else has a ref to it.  It's a good start though, but I think we also need API to destroy a GLContext on demand.  This is a bit risky, but maybe not so much if we ensure that we never use a GLContext on a different thread (no races) and add appropriate checks in various places.
This API should be exactly loseContext(), right? I understand that loseContext can't guarantee that the context is immediately lost, as it might run during e.g. compositing. But aside from that, it seems like a bug that we are keeping references to the lost context indefinitely: we should fix that! So loseContext would have the behavior of freeing GL resources "as soon as possible".

Have you figured what is holding references to the GLContext? The tool in bug 704240 can find that out: do you have a testcase that I can run?
Comment on attachment 646622 [details] [diff] [review]
lose the least-recently-used context when a certain limit is exceeded

Looks good, but don't call things a Timestamp when they're not -- just call it a Counter or something.
Attachment #646622 - Flags: review?(vladimir) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/b9c0e16d1730

Renamed Timestamp->Index

I think we can close this bug now (once this hits central).
Target Milestone: --- → mozilla17
https://hg.mozilla.org/mozilla-central/rev/b9c0e16d1730
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 790138
Depends on: 1501142