Closed Bug 731868 Opened 12 years ago Closed 12 years ago

cycles through CanvasRenderingContext2DUserData can leak everything

Categories

(Core :: Graphics: Canvas2D, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla15
Tracking Status
firefox14 + wontfix
firefox15 + fixed

People

(Reporter: luke, Assigned: roc)

References

Details

(Keywords: regression, Whiteboard: [MemShrink:P2])

Attachments

(5 files)

A side-effect of bug 730497 is to make UI_init (in tabview.js) a "heavyweight" function which seems to cause the BackstagePass and 1000 other things to get leaked at shutdown.  However, this is nothing special about bug 730497, to see this on trunk, just put an "eval('')" at the head of UI_init (browser/components/tabview/tabview.js:8800) and run the browser/components/tabview/test/browser_tabview_firstrun_pref.js browser-chrome test.

I spent a little bit of time looking at the code and the cycle collector dump.  The cause seems to be several event handlers that never get un-registered:
  - line 8829: iQ("#exit-button").click(function() { ... })
  - line 8837: iQ(gTabViewFrame.contentDocument).mousedown(function(e) { ... })
  - line 2016: $container.mousedown(function(e) { ... })
  - line 2627: handleKeyUp

dump() calls show UI_uninit() getting called so perhaps the cleanup is missing some bits?

I'd really appreciate any help anyone can give me understanding this leak.  Again, this is technically a bug on trunk, saved only by an optimization that we want to remove soon.  The optimization is fickle so that means any future modifications to this code may turn off the optimization and trigger the same leak.
Whiteboard: [MemShrink]
Blocks: 730497
Attached patch maybe fixSplinter Review
After looking at the code a bit more, I think the fix is simple.
Attachment #601892 - Flags: review?(ttaubert)
Thanks for providing a patch! While there's nothing wrong with adding some cleanup code I don't understand why we're actually leaking here. Sure, these event handlers get never unregistered but I thought that's not a problem because the whole window gets destroyed on shutdown anyway? Is that technically even a leak?
Yeah, I'm not really sure, I was just looking at the cycle collector dumps.  I'm not really sure what the lifetimes of all these things are here, but I'm thinking there is a long-lived object (that will live until shutdown) and some content windows (which should all be dead before shutdown) and the lexical closures of the lambdas that my path unregisters are creating edges from the long-lived object to the content windows.  Does that make sense?  It is quite possible this isn't the right fix.

Perhaps peterv has something to add?
Panorama's content window is contained in the property gWindow.Tabview._window which is first destroyed on shutdown. Does adding |this._window = null| to TabView.uninit() fix the problem for you?

http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser-tabview.js#163

As TabView.uninit() is supposed to clean up, this is clearly missing here.
I tried that and I still get the leak.  If you want to play around with it more, it is pretty easy to stick an eval('') in the iQ("#exit-button").click lambda and then run:

  python _tests/testing/mochitest/runtests.py --browser-chrome --autorun --close-when-done --test-path=browser/components/tabview/test/browser_tabview_firstrun_pref.js
Thanks for the precise STR, easy to reproduce. This line is causing the leak:

http://mxr.mozilla.org/mozilla-central/source/browser/components/tabview/ui.js#222

Removing that line fixes the leaks though I'm not exactly sure why. Maybe we're creating a circular reference here?
Argh, false alarm. I accidentally removed the eval() somehow.
Ok, think I found it now. I checked with about:cc(dump) to see what's holding the document alive and it was in fact a xhtml:canvas element of which we have one for every tab in Panorama. I'd say it's probably some kind of CC bug wrt to try/catch blocks?

Removing the try/catch block at this line makes the leak disappear:

http://mxr.mozilla.org/mozilla-central/source/browser/components/tabview/tabitems.js#1492
(In reply to Tim Taubert [:ttaubert] from comment #8)
> Removing the try/catch block at this line makes the leak disappear:

I mean of course removing only the try/catch stmt, not the code wrapped in it.
Comment on attachment 601892 [details] [diff] [review]
maybe fix

Marking the patch as obsolete, seems like we need a fix in the core.
Attachment #601892 - Attachment is obsolete: true
Attachment #601892 - Flags: review?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #8)
> I'd say it's probably some kind of CC bug wrt to try/catch blocks?

The CC shouldn't have any direct interaction with try/catch except through scope objects created.  Given that there are no nested closures/eval in this function, I don't think there would be any scope objects.  Removing the try/catch may also alter control flow (if an exception is thrown), can you confirm that the 'catch' is never executed?
Ok, sorry for the hassle. I could swear this fixed it but I can't reproduce it anymore. Latest discovery:

If I remove everything from TabCanvas.paint() except this line http://mxr.mozilla.org/mozilla-central/source/browser/components/tabview/tabitems.js#1480 then the leak still exists. Removing this line makes it disappear. Any chance that the 2d context can somehow hold a canvas alive?
2d context does keep its canvas element alive.
Assignee: nobody → ttaubert
Whiteboard: [MemShrink] → [MemShrink:P2]
The the original patch I posted, do you see the leaks via about:cc?
Comment on attachment 601892 [details] [diff] [review]
maybe fix

Tim, you mentioned earlier that these changes look good; can we land this and investigate this issue involving canvas in a followup bug?
Attachment #601892 - Attachment is obsolete: false
Attachment #601892 - Flags: review?(ttaubert)
It seems like we need someone closer to the JS engine to investigate this, since it doesn't appear to be a front-end bug?
Dao: Given that ttaubert said that my patch looks like something that belongs in uninit regardless, that sounds like something to investigate in the follow-up bug.  (Also, it seems like the issue isn't with the JS engine, but, iiuc, with canvas.)
(In reply to Luke Wagner [:luke] from comment #17)
> Dao: Given that ttaubert said that my patch looks like something that
> belongs in uninit regardless

I don't think he said that. The patch doesn't hurt but it shouldn't be needed.
On irc.
This patch is holding back the landing of dependent patches.
Can you explain why this belongs in uninit regardless?
Because uninit's job is to remove registered event handlers.  E.g., see all the removeEventListener calls registered with _cleanupFunctions.
Why is it uninit's job is to remove registered event handlers? (There's actually only one instance of _cleanupFunctions being used to explicitly call removeEventListener. AllTabs.unregister is probably a leftover from when AllTabs was a JS module.)
Comment on attachment 601892 [details] [diff] [review]
maybe fix

We need better understanding of what's going on here. Once we have that, we can possibly handle it in a followup bug. We shouldn't wallpaper over it just like this.
Attachment #601892 - Flags: review?(ttaubert) → review-
Alright, so are you going to help?
(In reply to Luke Wagner [:luke] from comment #17)
> (Also, it seems like the issue isn't with the JS engine, but, iiuc, with canvas.)

Note that the behavior of a canvas context holding the corresponding canvas element alive is by design.  If this is a problem in any particular case, then code should go to extra effort to not hold onto the context when it's no longer needed.
Judging by <http://mxr.mozilla.org/mozilla-central/source/browser/components/tabview/tabitems.js#1480> which I only glanced at, it doesn't look like something should be holding onto the context.
Moving to Core since based on what we know so far this doesn't seem to be a front-end bug, although we may take the front-end workaround if deemed appropriate.
Assignee: ttaubert → nobody
Component: Panorama → General
Product: Firefox → Core
QA Contact: panorama → general
So, how about a little help.  Where is the canvas created?  What is it associated with?  When should it become unreachable?
Using Andrew's find_roots.py, I get the following path to the HTMLDocument, starting at a canvas which has 1 unknown edge:

Parsing cc-edges-5.2788.log. Done loading graph. Reversing graph. Done.

0x7f43b25b1b60 [nsGenericElement (xhtml) canvas chrome://browser/content/tabview.html]
    --[GetParent()]-> 0x7f43b5b31780 [nsGenericElement (xhtml) div class='thumb' chrome://browser/content/tabview.html]
    --[GetParent()]-> 0x7f43b5b31600 [nsGenericElement (xhtml) div class='tab focus' chrome://browser/content/tabview.html]
    --[GetParent()]-> 0x7f43b0ae5e90 [nsGenericElement (xhtml) body chrome://browser/content/tabview.html]
    --[mAttrsAndChildren[i]]-> 0x7f43b076fe00 [nsGenericElement (xhtml) div class='groupItem iq-droppable iq-resizable activeGroupItem' chrome://browser/content/tabview.html]
    --[mAttrsAndChildren[i]]-> 0x7f43b0770000 [nsGenericElement (xhtml) div class='titlebar' chrome://browser/content/tabview.html]
    --[mAttrsAndChildren[i]]-> 0x7f43b0770100 [nsGenericElement (xhtml) div class='title-container' chrome://browser/content/tabview.html]
    --[mAttrsAndChildren[i]]-> 0x7f43b5adcd20 [nsGenericElement (xhtml) input class='name' chrome://browser/content/tabview.html]
    --[[via hash] mListenerManager]-> 0x7f43a73330d0 [nsEventListenerManager]
    --[mListeners[i] mListener]-> 0x7f43b0770590 [nsXPCWrappedJS (nsIDOMEventListener)]
    --[mJSObj]-> 0x7f43a566f160 [JS Object (Function)]
    --[upvars[0]]-> 0x7f43a451c180 [JS Object (Function)]
    --[fun_callscope]-> 0x7f43b09f09c0 [JS Object (Call)]
    --[**UNKNOWN SLOT 1**]-> 0x7f43b096f240 [JS Object (Function - GroupItem)]
    --[prototype]-> 0x7f43a50c1a40 [JS Object (Object)]
    --[showExpandControl]-> 0x7f43a501f6a0 [JS Object (Function - GroupItem_showExpandControl)]
    --[nativeReserved[1]]-> 0x7f43a50c1ac0 [JS Object (Object)]
    --[removeAll]-> 0x7f43a501f4c0 [JS Object (Function - GroupItem_removeAll)]
    --[script]-> 0x7f43b1766180 [JS Script]
    --[object, parent, script_global]-> 0x7f43a5053c40 [JS Object (ChromeWindow)]
    --[GroupItems]-> 0x7f43a50c1b00 [JS Object (Object)]
    --[_lastActiveList]-> 0x7f43a5006840 [JS Object (Object)]
    --[_list]-> 0x7f43a50470a0 [JS Object (Array)]
    --[element[0]]-> 0x7f43a451a240 [JS Object (Object)]
    --[$closeButton]-> 0x7f43a451a7c0 [JS Object (Object)]
    --[0]-> 0x7f43a56e7f70 [JS Object (HTMLDivElement)]
    --[parent]-> 0x7f43a5088a60 [JS Object (HTMLDocument)]

    Root 0x7f43b25b1b60 is a ref counted object with 1 unknown edge(s).
    known edges:
       0x7f43b5b31780 [nsGenericElement (xhtml) div class='thumb' chrome://browser/content/tabview.html] --[mAttrsAndChildren[i]]-> 0x7f43b25b1b60
       0x7f43a7dfb630 [nsCanvasRenderingContext2D] --[mCanvasElement]-> 0x7f43b25b1b60
       0x7f43af5dd7c0 [nsDOMCSSAttributeDeclaration] --[mElement]-> 0x7f43b25b1b60

Anyone have any clues about what could keep a <canvas> alive?
Oh, right, the *2d context*, from earlier discussions.  Yes, I can see the same behavior as ttaubert mentioned earlier: commenting out everything but the this.context.getCanvas("2d") leaks, commenting out that line does not leak.
The context (which seems to be in a cycle with the canvas also has 1 unknown edge:

0x7fba31eb1b60 [nsGenericElement (xhtml) canvas chrome://browser/content/tabview.html]
    --[mCurrentContext]-> 0x7fba26c6a6a0 [nsCanvasRenderingContext2D]

    Root 0x7fba31eb1b60 is a ref counted object with 1 unknown edge(s).
    known edges:
       0x7fba2ecf7a00 [nsDOMCSSAttributeDeclaration] --[mElement]-> 0x7fba31eb1b60
       0x7fba26c6a6a0 [nsCanvasRenderingContext2D] --[mCanvasElement]-> 0x7fba31eb1b60
       0x7fba35431580 [nsGenericElement (xhtml) div class='thumb' chrome://browser/content/tabview.html] --[mAttrsAndChildren[i]]-> 0x7fba31eb1b60
The canvas and canvas context hold references to each other.  But both those edges should be known to CC, in general...
I'm still digging.  It seems that the leaking version has a canvas elem with rc=4 and only 3 edges known to the CC.  In the non-leaking version, rc=3 so it gets cleaned up early.  I'm thinking getContext('2d') must be causing this outstanding ref-count (the context2d itself has no unaccounted incoming refs).
Alright, I dug in more and I think I have the offender and indeed I think it may be platform (canvas) code, but I'll feedback from someone who knows canvas.

So, a nsHTMLCanvasElement is being leaked. The CC dump shows 3 incoming edges but the ref-count is 4.  Doing a refcnt-trace diff (w/ and w/o the eval('') that causes the leak) shows that the non-leaking version has one more 'Release' than the leaking version.  I attached the callstack of this 'Release'; basically, there is a path from ~nsDOMEvent through ~CanvasLayer and finally the nsRefPtr memvar of CanvasRenderingContext2DUserData.  I don't see any cycle collection annotations for this edge, so I think that explains the outstanding refcount from C++.  (The path through layers also explains why getContext('2d') was necessary: I suspect a canvas doesn't get a CanvasLayer until it has a 2d context.)

So why isn't the nsDOMEvent getting destroyed?  Because it is held alive by the 'event' local variable of UI_init which is being closed over by event listeners on the window which is being kept alive by the canvas which is being kept alive by the event...

Setting 'event' to null (or narrowing the scope) fixes the leak, but it seems that the real fix is to annotate CanvasRenderingContext2DUserData for CC.  Can any CC-knowing folk comment on this?
Sounds right.
Blocks: 622072
Component: General → Canvas: 2D
QA Contact: general → canvas.2d
Making the layers not hold that strong reference to an element seems like a good change, can we also destroy the whole layer system when we destroy the layout frames?
ccing some folks who might know the answer to Peter's question in comment 37.
I'm wondering what's keeping the canvas layer alive, it should die shortly after we no longer need to paint the canvas.
(In reply to Timothy Nikkel (:tn) from comment #39)
The nsDOMEvent object (see destructor callstack in attachment to comment 35).

Any volunteers to take this bug?
Summary: tabview.js:UI_init() leaks everything when scope is heavyweight → cycles through CanvasRenderingContext2DUserData can leak everything
Could we release nsPrescontext::mDeviceContext earlier? Not in the dtor, but when the
presentation actually goes away. That should fix this, I think.
There is no reason the nsDOMEvent should keep alive a Layer. Layers should get deleted when they are no longer needed to paint to the screen. Is the canvas still visible on screen when you get that callstack (or the instant before it anyway)?
(In reply to Timothy Nikkel (:tn) from comment #42)
You can follow the chain of ref-counting in the destructor callstack in comment 25...
er, comment 35.
What I am suggesting is that either that layer needs to be alive that long because it needs to be painted to the screen until that point (in which case the canvas element node can't go away either because it is on screen), or there is a bug in the layer system that is keeping it alive longer than it should be.
Based on the stack layer system is keeping it alive very long, up until prescontext dtor.
If anyone would like to help, it is easy to reproduce this leak: just add "eval('');" to dist/bin/chrome/browser/content/browser/tabview.js:8830 and run:

python _tests/testing/mochitest/runtests.py --browser-chrome --autorun --close-when-done --test-path=browser/components/tabview/test/browser_tabview_firstrun_pref.js

Adding "event = null;" after tabview.js:8940 makes the leak go away.
Ok, what I could see being the case is that the last paint of the window has the canvas element visible. Then the window is closed. After this is doesn't get any more paints so the layers don't get updated. Then they get destroyed when the widget gets destroyed. Maybe we should flush the layers sometime before the widget destructor.
Try clearing out the layer manager in nsIWidget::Show(false)?
Was there a reason we only did that on widget destroy?
I don't recall a reason.
No longer blocks: 730497
tn, could you perhaps look at this. Leaks are bad, they increase cycle collection times
significantly.
I'll take it, but I'm not going to get to this right away.
Assignee: nobody → tnikkel
Bug 757749 has a test case that creates a similar leak.
(In reply to Luke Wagner [:luke] from comment #47)
> If anyone would like to help, it is easy to reproduce this leak: just add
> "eval('');" to dist/bin/chrome/browser/content/browser/tabview.js:8830 and
> run:
> 
> python _tests/testing/mochitest/runtests.py --browser-chrome --autorun
> --close-when-done
> --test-path=browser/components/tabview/test/browser_tabview_firstrun_pref.js
> 
> Adding "event = null;" after tabview.js:8940 makes the leak go away.

Can you give instructions that don't depend on line numbers that have probably changed now? :-)
Since bug 730497 landed, just comment out "event = null" in:
  http://mxr.mozilla.org/mozilla-central/source/browser/components/tabview/ui.js#255
(or the analogous line in dist/bin/chrome/browser/content/browser/tabview.js).  I just tested again and this still repros.
I couldn't reproduce this on Windows. Is it Mac-only like bug 757749 seems to be?
I repro'd on Linux.
OK, I reproduce now. Thanks!
Attachment #628224 - Flags: review? → review?(jmuizelaar)
Attachment #628225 - Flags: review? → review?(dietrich)
I'm almost certain this will fix bug 757749 as well.
Blocks: 757749
Attachment #628225 - Flags: review?(dietrich) → review+
Comment on attachment 628216 [details] [diff] [review]
Part 1: make CanvasRenderingContext2DUserData's reference to the context weak (non-Azure)

Review of attachment 628216 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/canvas/src/nsCanvasRenderingContext2D.cpp
@@ +440,5 @@
>      bool mIPC;
>  
>      // the canvas element we're a context of
>      nsCOMPtr<nsIDOMHTMLCanvasElement> mCanvasElement;
> +    nsTArray<CanvasRenderingContext2DUserData*> mUserDatas;

A LinkedList will avoid the O(n) RemoveElement in CanvasRenderingContext2DUserData. Maybe we don't have enough of these for this to matter?

@@ +758,5 @@
>  
>      friend struct nsCanvasBidiProcessor;
>  };
>  
> +class CanvasRenderingContext2DUserData : public LayerUserData {

It might be worth explicitly calling this out as a weak reference.
Attachment #628216 - Flags: review?(jmuizelaar) → review+
(In reply to Jeff Muizelaar [:jrmuizel] from comment #64)
> A LinkedList will avoid the O(n) RemoveElement in
> CanvasRenderingContext2DUserData. Maybe we don't have enough of these for
> this to matter?

Correct. 0 or 1 are the overwhelmingly common cases. We could have more than one in unusual situations like temporary layer managers constructed for drawWindow calls.

> @@ +758,5 @@
> >  
> >      friend struct nsCanvasBidiProcessor;
> >  };
> >  
> > +class CanvasRenderingContext2DUserData : public LayerUserData {
> 
> It might be worth explicitly calling this out as a weak reference.

OK.
Comment on attachment 628224 [details] [diff] [review]
Part 2: Azure fix

Assuming this is basically the same, it looks good to me.
Attachment #628224 - Flags: review?(jmuizelaar) → review+
I don't think we should uplift the patches here to an earlier release. I believe this leak can only happen during shutdown.
Actually I take that back. It might be possible to trigger this leak just by closing a window.
Comment on attachment 628216 [details] [diff] [review]
Part 1: make CanvasRenderingContext2DUserData's reference to the context weak (non-Azure)

Review of attachment 628216 [details] [diff] [review]:
-----------------------------------------------------------------

So I think we should uplift this to Aurora. Leaks are bad.
Attachment #628216 - Flags: approval-mozilla-aurora?
Comment on attachment 628216 [details] [diff] [review]
Part 1: make CanvasRenderingContext2DUserData's reference to the context weak (non-Azure)

Review of attachment 628216 [details] [diff] [review]:
-----------------------------------------------------------------

So I think we should uplift this to Aurora. Leaks are bad.
Attachment #628216 - Flags: approval-mozilla-aurora?
Seemed to cause high-frequency failure in test_videocontrols.html on Macos 10.7 debug only:
https://tbpl.mozilla.org/php/getParsedLog.php?id=12263937&tree=Mozilla-Inbound&full=1

Reftest failures across all Mac platforms:
https://tbpl.mozilla.org/php/getParsedLog.php?id=12264983&tree=Mozilla-Inbound&full=1
It looks like under some conditions drawing into a canvas isn't triggering invalidation. That suggests I somehow failed to call MarkContextClean when I need to.
Backed out: https://hg.mozilla.org/mozilla-central/rev/ebb3dd6d811a
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
jrmuizelaar: you have a Mac setup, can you debug this? It shouldn't be hard, just run the reftest layout/reftests/bugs/579323-1.html (e.g. by putting it in layout/reftests/bugs/tmp.list and running -reftest file://.../tmp.list) and see how it is we're getting into an infinite loop.
I applied the patches to an old rev around when the patches were posted (they didn't apply to current trunk) and I can't reproduce any infinite loop when running reftests on my mac 10.6 machine.
https://tbpl.mozilla.org/?tree=Try&rev=e3db41113451

Seems to me that using the same class name defined differently in two different files was a bad idea. I've fixed that and tests on Mac are passing now. I'll try relanding.
Comment on attachment 628216 [details] [diff] [review]
Part 1: make CanvasRenderingContext2DUserData's reference to the context weak (non-Azure)

Review of attachment 628216 [details] [diff] [review]:
-----------------------------------------------------------------

It stuck this time. Apart from my boneheaded mistake earlier, the patch is straightforward, and memory leaks are really bad, so I think taking this on Aurora would be wise.
Attachment #628216 - Flags: approval-mozilla-aurora?
Comment on attachment 628224 [details] [diff] [review]
Part 2: Azure fix

Review of attachment 628224 [details] [diff] [review]:
-----------------------------------------------------------------

It stuck this time. Apart from my boneheaded mistake earlier, the patch is straightforward, and memory leaks are really bad, so I think taking this on Aurora would be wise.
Attachment #628224 - Flags: approval-mozilla-aurora?
Comment on attachment 628216 [details] [diff] [review]
Part 1: make CanvasRenderingContext2DUserData's reference to the context weak (non-Azure)

[Triage Comment]
New memory leak in 14, let's skip up the trains and consider taking on beta if deemed low risk.
Attachment #628216 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #628224 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #85)
> Comment on attachment 628216 [details] [diff] [review]
> Part 1: make CanvasRenderingContext2DUserData's reference to the context
> weak (non-Azure)
> 
> Review of attachment 628216 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It stuck this time. Apart from my boneheaded mistake earlier, the patch is
> straightforward, and memory leaks are really bad, so I think taking this on
> Aurora would be wise.

How do you feel about getting this FF14 regression fix into our final beta?
Keywords: regression
While I'm confident in the patch, I'm not sure this bug is a big deal. No leaks on real sites have been traced to this. The only report we have of a bug in FF14 would be bug 757749. Bug 757749 appears to be Mac-only too.

So I'm not sure. I think on balance we should probably take it, but not taking it would also be reasonable.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #91)
> While I'm confident in the patch, I'm not sure this bug is a big deal. No
> leaks on real sites have been traced to this. The only report we have of a
> bug in FF14 would be bug 757749. Bug 757749 appears to be Mac-only too.
> 
> So I'm not sure. I think on balance we should probably take it, but not
> taking it would also be reasonable.

Given that, we'll wontfix for FF14.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: