Last Comment Bug 795186 - Declare refptrs between WebGL objects to the CC even if there is no cycle at C++ level, because JS could create cycles
: Declare refptrs between WebGL objects to the CC even if there is no cycle at ...
Status: RESOLVED FIXED
webgl-next [MemShrink:P1]
:
Product: Core
Classification: Components
Component: Canvas: WebGL (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla19
Assigned To: Benoit Jacob [:bjacob] (mostly away)
:
Mentors:
Depends on:
Blocks: 800488 802834
  Show dependency treegraph
 
Reported: 2012-09-27 19:18 PDT by Benoit Jacob [:bjacob] (mostly away)
Modified: 2013-03-20 11:05 PDT (History)
8 users (show)
continuation: in‑testsuite-
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed


Attachments
fix cycle collection in the WebGL implementation (9.50 KB, patch)
2012-10-18 15:30 PDT, Benoit Jacob [:bjacob] (mostly away)
bugs: review+
bajaj.bhavana: approval‑mozilla‑aurora+
Details | Diff | Review
cycle collect the mContextRestorer pointer (2.05 KB, patch)
2012-10-19 20:22 PDT, Benoit Jacob [:bjacob] (mostly away)
bugs: review-
Details | Diff | Review
fix the robustness timer, don't inherit the timer callback base class (6.70 KB, patch)
2012-10-22 14:15 PDT, Benoit Jacob [:bjacob] (mostly away)
bugs: review+
bajaj.bhavana: approval‑mozilla‑aurora+
Details | Diff | Review

Description Benoit Jacob [:bjacob] (mostly away) 2012-09-27 19:18:03 PDT
For example, WebGLUniformLocation has a nsRefPtr to WebGLProgram. While there is no cycle at the c++ level, js code could do

  program.foo = uniformloc;

thus creating a cycle.

Thus we need to declare all such refptrs to the cc.

Cycles that are entirely created by JS code do not need to be declared to the CC as the JS GC takes care of them, but if part of the cycle is a reference created by c++ code, then the CC needs to know about it.
Comment 1 Benoit Jacob [:bjacob] (mostly away) 2012-10-17 15:06:56 PDT
Smaug tells me that we have a leak in http://maps.nokia.com/webgl/ that looks like it might be caused by this bug. Prioritizing.
Comment 2 Benoit Jacob [:bjacob] (mostly away) 2012-10-17 15:19:50 PDT
To be clear, the only nontrivial thing in this bug is that WebGL inter-object strong references use a custom refptr class, WebGLRefPtr, so it's not clear beforehand that it's just a matter of using the usual CC macros -- have to look into these macros to see what they are doing.

WebGLRefPtr is here:
http://hg.mozilla.org/mozilla-central/file/39d70a2e0dea/content/canvas/src/WebGLContext.h#l300
Comment 3 Andrew McCreight [:mccr8] 2012-10-17 16:04:48 PDT
The macros are fairly simple: grab out the underlying pointer, do a callback.  Assuming your underlying classes are the usual nsISupports CCed classes, and the WebGLRefPtr supports .get(), then it looks like the usual NS_IMPL_CYCLE_COLLECTION_TRAVERSE_NSCOMPTR should just work.
Comment 4 Benoit Jacob [:bjacob] (mostly away) 2012-10-18 15:30:58 PDT
Created attachment 672984 [details] [diff] [review]
fix cycle collection in the WebGL implementation
Comment 5 Benoit Jacob [:bjacob] (mostly away) 2012-10-18 15:36:36 PDT
(Hit enter before I could write the explanatory comment)

Unfortunately, while this does all I can see needs doing here, the issue with Nokia Maps seems to persists: if I view that page and then go to about:memory in the same tab, then no matter how long I wait and click GC/CC buttons, the webgl metrics stay. Note: in about:memory, do NOT click 'minimize memory usage' as this would cause the WebGL context to destroy all its resources, artificially making the problem disappear.

Explanation on this patch:

The header file where all those classes are declared is WebGLContext.h.

Aside from usual refptr classes, the main custom refptr class here is WebGLRefPtr. There is a comment there explaining why it exists. Also, WebGLVertexAttribData stores a WebGLRefPtr member, so we had to add a .get() method to it to make CC macros work with it.

A nontrivial change in this patch is that it adds CC to WebGLUniformLocation, which wasn't CC'd so far. The reason why it wasn't CC'd was that contrary to most other classes there, WebGLUniformLocation objects can only ever be gotten by scripts by a [Creator] method, webgl.getUniformLocation, so there is no need for wrappercaching. That's why WebGLUniformLocation still doesn't inherit nsWrapperCache.
Comment 6 Olli Pettay [:smaug] 2012-10-18 15:42:59 PDT
(In reply to Benoit Jacob [:bjacob] from comment #5)
> Note: in about:memory, do NOT click
> 'minimize memory usage' as this would cause the WebGL context to destroy all
> its resources, artificially making the problem disappear.
Ah, this is interesting. Add a breakpoint to the context release and check what is releasing it when
minimizing memory.
Comment 7 Olli Pettay [:smaug] 2012-10-19 06:50:05 PDT
Comment on attachment 672984 [details] [diff] [review]
fix cycle collection in the WebGL implementation

The LinkedList member variables don't need traversing/unlinking, right?
They are just list of objects which have raw refs to each other?



Can there be lots of WebGL related objects alive?
If so, we should probably make them skippable. Something like
http://mxr.mozilla.org/mozilla-central/source/content/events/src/nsDOMEventTargetHelper.h?rev=d8798b6331be&mark=31-31#24
and
http://mxr.mozilla.org/mozilla-central/source/content/events/src/nsDOMEventTargetHelper.cpp?rev=5a6a36d109d7&mark=47-62#42

There are 3 different skip-phases, documented here
http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCycleCollectionParticipant.h?rev=1e5bf398fc88&mark=256-260,265-268,273-276#244

So, file a followup to add support for skippability.
Comment 8 Andrew McCreight [:mccr8] 2012-10-19 08:55:50 PDT
Can you write a test for this?
Comment 9 Benoit Jacob [:bjacob] (mostly away) 2012-10-19 20:22:21 PDT
Created attachment 673510 [details] [diff] [review]
cycle collect the mContextRestorer pointer

At a point, I thought that this fixed all the bugs (nokia maps and ro.me); but now I can't reproduce anymore. There are poltergeists around here. Anyway, this timer _is_ causing a call to WebGLContext::AddRef() so this seems like another cycle that we failed to declare.
Comment 10 Benoit Jacob [:bjacob] (mostly away) 2012-10-19 21:08:39 PDT
My reference analysis tool (bug 704240) finds only one persistent (not broken by the CC) cycle involving WebGL, and that appears to be the cycle between the WebGLContext and the HTMLCanvasElement:


Reference cycle 39:

Contains 2 blocks, listed below:

block 592393
has references to these blocks: 415899, 592399
allocation call stack:
at 0x402B1FE: malloc (vg_replace_malloc.c:271)
by 0x403BEE4: moz_xmalloc (mozalloc.cpp:57)
by 0x709865F: NS_NewHTMLCanvasElement(already_AddRefed<nsINodeInfo>, mozilla::dom::FromParser) (mozalloc.h:200)
by 0x70F2476: CreateHTMLElement(unsigned int, already_AddRefed<nsINodeInfo>, mozilla::dom::FromParser) (nsHTMLContentSink.cpp:497)
by 0x70F3DCA: NS_NewHTMLElement(nsIContent**, already_AddRefed<nsINodeInfo>, mozilla::dom::FromParser) (nsHTMLContentSink.cpp:480)
by 0x6F94058: NS_NewElement(nsIContent**, already_AddRefed<nsINodeInfo>, mozilla::dom::FromParser) (nsNameSpaceManager.cpp:201)
by 0x6F54C75: nsDocument::CreateElem(nsAString_internal const&, nsIAtom*, int, nsIContent**) (nsDocument.cpp:6943)
by 0x6F478EC: nsDocument::CreateElement(nsAString_internal const&, nsIContent**) (nsDocument.cpp:4561)
by 0x762AEF0: nsIDOMDocument_CreateElement(JSContext*, unsigned int, JS::Value*) (dom_quickstubs.cpp:1273)
by 0x821A89B: js::CallJSNative(JSContext*, int (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&) (jscntxtinlines.h:364)
by 0x822638E: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jsinterp.cpp:367)
by 0x82213D7: js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) (jsinterp.cpp:2370)

block 592399
has references to these blocks: 592393, 592400, 592871, 593172
allocation call stack:
at 0x402B1FE: malloc (vg_replace_malloc.c:271)
by 0x403BEE4: moz_xmalloc (mozalloc.cpp:57)
by 0x7011F1E: NS_NewCanvasRenderingContextWebGL(nsIDOMWebGLRenderingContext**) (mozalloc.h:200)
by 0x6C7BFFC: CreateCanvasRenderingContextWebGL(nsISupports*, nsID const&, void**) (in /hack/mozilla-central/obj-firefox-valgrind/toolkit/library/libxul.so)
by 0x7BFC628: nsComponentManagerImpl::CreateInstanceByContractID(char const*, nsISupports*, nsID const&, void**) (nsComponentManager.cpp:1006)
by 0x7BB5E8C: nsCreateInstanceByContractID::operator()(nsID const&, void**) const (nsComponentManagerUtils.cpp:178)
by 0x7098830: nsCOMPtr<nsICanvasRenderingContextInternal>::assign_from_helper(nsCOMPtr_helper const&, nsID const&) (nsCOMPtr.h:1286)
by 0x7098D46: nsCOMPtr<nsICanvasRenderingContextInternal>::nsCOMPtr(nsCOMPtr_helper const&) (nsCOMPtr.h:685)
by 0x7098EBB: nsHTMLCanvasElement::GetContextHelper(nsAString_internal const&, nsICanvasRenderingContextInternal**) (nsHTMLCanvasElement.cpp:628)
by 0x70991A3: nsHTMLCanvasElement::GetContext(nsAString_internal const&, JS::Value const&, nsISupports**) (nsHTMLCanvasElement.cpp:652)
by 0x76210D0: nsIDOMHTMLCanvasElement_GetContext(JSContext*, unsigned int, JS::Value*) (dom_quickstubs.cpp:16548)
by 0x821A89B: js::CallJSNative(JSContext*, int (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&) (jscntxtinlines.h:364)
Comment 11 Olli Pettay [:smaug] 2012-10-20 10:28:52 PDT
Comment on attachment 673510 [details] [diff] [review]
cycle collect the mContextRestorer pointer

Hmm, this shouldn't help.
nsTimerImpl isn't cycle collected.
Need to change timer handling, so that it doesn't keep context alive.
Use InitWithFuncCallback and make sure to cancel the timer when
context is deleted.
Comment 12 Benoit Jacob [:bjacob] (mostly away) 2012-10-22 14:15:26 PDT
Created attachment 674011 [details] [diff] [review]
fix the robustness timer, don't inherit the timer callback base class

There you go.
Comment 13 Benoit Jacob [:bjacob] (mostly away) 2012-10-22 14:17:02 PDT
(In reply to Andrew McCreight [:mccr8] from comment #8)
> Can you write a test for this?

How would you write a test, checking that things are correctly cycle collected?

A compiled-code test maybe?
Comment 14 Andrew McCreight [:mccr8] 2012-10-22 14:25:40 PDT
(In reply to Benoit Jacob [:bjacob] from comment #13)
> How would you write a test, checking that things are correctly cycle
> collected?

Well, these leaks are apparently bad enough to cause shutdown leaks, so you can just set up one of these leaky cycles, and the shutdown leak detection will catch it if it breaks.
Comment 15 Benoit Jacob [:bjacob] (mostly away) 2012-10-22 15:07:27 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/48502b61a63e
Comment 16 Benoit Jacob [:bjacob] (mostly away) 2012-10-22 15:09:01 PDT
(In reply to Andrew McCreight [:mccr8] from comment #14)
> (In reply to Benoit Jacob [:bjacob] from comment #13)
> > How would you write a test, checking that things are correctly cycle
> > collected?
> 
> Well, these leaks are apparently bad enough to cause shutdown leaks, so you
> can just set up one of these leaky cycles, and the shutdown leak detection
> will catch it if it breaks.

OK.

It seems like the two patches there together fix the leaks, and I was just using the wrong way to check if the leak was fixed -- I was navigating to about:memory in the SAME tab as the testcase and so the testcase was purposefully kept in memory.

I'll try to figured _what_ exactly caused the leak and then maybe I'll be able to make a test.
Comment 17 Ryan VanderMeulen [:RyanVM] 2012-10-22 18:59:53 PDT
https://hg.mozilla.org/mozilla-central/rev/48502b61a63e
Comment 18 Olli Pettay [:smaug] 2012-10-23 04:55:47 PDT
Comment on attachment 674011 [details] [diff] [review]
fix the robustness timer, don't inherit the timer callback base class

>+    void RobustnessTimerCallback(nsITimer* timer);
>+
>+    static void RobustnessTimerCallbackStatic(nsITimer* timer, void *thisPointer) {
>+        static_cast<WebGLContext*>(thisPointer)->RobustnessTimerCallback(timer);
>+    }
Looks like the file doesn't use Gecko coding style :(
But ok, let's not change that now.
(Parameters should be aParameterName, and indentation should be 2 spaces)
Comment 19 Benoit Jacob [:bjacob] (mostly away) 2012-10-23 09:24:25 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/15b7770dcfcc

Regarding the testcases, after some testing, it seems that Nokia Maps is fixed by the first patch (48502b61a63e) and RO.ME is fixed by the second patch (15b7770dcfcc).

I will try to think of a test for this, but as it's not clear yet to me why these particular pages trigger these bugs while others don't, it would take some more time to isolate a test case, and I'm sorry, I have too many things to do this week.

Can close the bug now.
Comment 20 Nicholas Nethercote [:njn] 2012-10-23 10:03:35 PDT
What's the effect on RO.ME?  Will bug 657115 be affected by this?
Comment 21 Benoit Jacob [:bjacob] (mostly away) 2012-10-23 10:17:28 PDT
The effect on RO.ME was that the WebGL context and its resources were staying around forever after closing the RO.ME tab ; now they are properly going away several seconds after the tab is closed. I haven't tested if it has any effect on bug 657115 which seems to be more about memory usage _during_ playing RO.ME.
Comment 22 Andrew McCreight [:mccr8] 2012-10-23 10:23:49 PDT
(In reply to Benoit Jacob [:bjacob] from comment #19)
> I will try to think of a test for this, but as it's not clear yet to me why
> these particular pages trigger these bugs while others don't, it would take
> some more time to isolate a test case, and I'm sorry, I have too many things
> to do this week.

No problem, I just was wondering if it might be easy to throw together a little test, but I guess not!

Nick: To reiterate what Benoit said, for RO.ME, there was a recent regression (bug 800488) from the new DOM bindings for WebGL stuff that caused a shutdown leak, which is what this fixes. It is possible that the old bindings for WebGL had some kind of temporary leak issue that will be fixed by the new DOM bindings, but I'd be surprised if that was the case.
Comment 23 Andrew McCreight [:mccr8] 2012-10-23 16:07:07 PDT
This needs to land on Aurora, too, right?  But not really earlier?
Comment 24 Benoit Jacob [:bjacob] (mostly away) 2012-10-23 18:24:12 PDT
Yes, let's backport to Aurora and not earlier. Backporting to Beta would be more intrusive as these classes were not CC'd before the WebIDL port (Aurora).
Comment 25 Benoit Jacob [:bjacob] (mostly away) 2012-10-23 18:26:18 PDT
Comment on attachment 672984 [details] [diff] [review]
fix cycle collection in the WebGL implementation

[Approval Request Comment]
Bug caused by (feature/regressing bug #): don't know. The RO.ME leak is claimed to be from the WebIDL port (bug 779611) but I don't see how this can have not leaked before. The cycles existed before.
User impact if declined: shutdown memory leaks on certain heavy WebGL applications
Testing completed (on m-c, etc.): m-c since yesterday
Risk to taking this patch (and alternatives if risky): seems low risk: rather standard CC stuff
String or UUID changes made by this patch: none
Comment 26 Benoit Jacob [:bjacob] (mostly away) 2012-10-23 18:26:42 PDT
Comment on attachment 674011 [details] [diff] [review]
fix the robustness timer, don't inherit the timer callback base class

[Approval Request Comment]
See previous request comment.
Comment 27 Ryan VanderMeulen [:RyanVM] 2012-10-23 19:58:11 PDT
https://hg.mozilla.org/mozilla-central/rev/15b7770dcfcc

Note You need to log in before you can comment on or make changes to this bug.