Closed Bug 37639 Opened 24 years ago Closed 24 years ago

New XUL widgets (XBL) leak like a sieve!

Categories

(Core :: XBL, defect, P1)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bugs, Assigned: waterson)

References

Details

(Keywords: memory-leak)

Attachments

(3 files)

I checked in a new navigator.xul, containing menubuttons, buttons, etc., as 
required for nsbeta2/mozilla skinnability. The leak count has risen to ~740K. 
This isn't good ;)

Below is the new leak log just after my checkin:

--NEW-LEAKS--------------------------------
nsBookmarksService           84          -
nsTimerGtk                   88          -
MemoryElementSet::List       2148          -
nsJSEventListener          9376          -
nsJSContext                 144          -
LocalSearchDataSource         16          -
ContentIdTestNode::Element        304          -
nsXULContentUtils            12          -
nsJISx4501LineBreaker         12          -
nsCaseConversionImp2         12          -
HTMLStyleSheetImpl           56          -
nsBindingSet               1312          -
CompositeDataSourceImpl       1216          -
nsDOMAttributeMap           288          -
Value                      2960          -
nsXULPrototypeDocument        440          -
nsXULPDGlobalObject          96          -
nsHTMLElementFactory         24          -
nsXULKeyListenerImpl         28          -
InternetSearchDataSource         28          -
ConflictSet::MatchEntry       6912          -
nsXULTemplateBuilder       8512          -
CSSParserImpl               504          -
CSSLoaderImpl               152          -
ConflictSet::SupportEntry       7436          -
nsDOMWindowController         16          -
nsXULElement              71192          -
DateImpl                   1720          -
XULSortServiceImpl           12          -
nsXULElement::Slots       29276          -
URLKey                      288          -
CSSImportantRule            300          -
nsXULDocument               524          -
LiteralImpl               34768          -
nsXULPrototypeText           36          -
nsHTTPIndex                  48          -
Rule                       1080          -
ArenaImpl                    48          -
nsXULPrototypeElement      36288          -
XULPopupListenerImpl         48          -
nsBinding                  1848          -
nsXULPrototypeCache          28          -
Match                      3280          -
MemoryElementSet            656          -
nsJSDOMEventListener         84          -
SupportsKey                  32          -
nsXULCommandDispatcher         56          -
nsElementMap                 40          -
Key                        2592          -
InMemoryDataSource           36          -
Instantiation              1312          -
LocalStoreImpl               28          -
nsMenuBarListener            28          -
nsRDFResource              6440          -
nsWindowMediator             44          -
HTMLCSSStyleSheetImpl         32          -
RDFServiceImpl               48          -
nsCharsetMenu                32          -
nsElementMap::ContentListItem       3464          -
RDFXMLDataSourceImpl        288          -
FileSystemDataSource         16          -
nsDateTimeFormatUnix        156          -
nsRDFDOMNodeList             56          -
nsXULPrototypeScript        544          -
RDFPropertyTestNode::Element        480          -
nsCollationUnix              88          -
nsXULAttribute            69300          -
nsToolbarDragListener         96          -
nsBindingSet::List         3080          -
RDFContainerMemberTestNode::Element       2640          -
nsToolboxFrame::DragListenerDelegate         16          -
nsXBLAttributeEntry        2736          -
RDF_Assertion             16152          -
RDFContainerInstanceTestNode::Element        960          -
AttributeKey                 24          -
nsXULAttributes           44560          -
RDFContainerUtilsImpl         12          -
PropertySet                 192          -
MatchSet                   9648          -
GenericTableRule             32          -
nsAtomList                 7904   49300.00%
nsEventListenerManager      23688   16350.00%
CSSStyleSheetImpl          6432    6600.00%
nsCSSSelector             47872    5653.85%
nsSupportsArray           14440    4412.50%
CSSStyleSheetInner         1620    4400.00%
nsXBLBinding               3872    4300.00%
nsClassList                2936    3977.78%
nsCSSRule                 14048    3890.91%
nsStdURLParser              840    3400.00%
CSSDeclarationImpl        54112    3290.48%
nsStdURL                    308    2466.67%
nsVoidArray               29984    2398.67%
nsAttrSelector            16640    1980.00%
nsStr                     80980     776.41%
NameSpaceImpl              2632     754.55%
NameSpaceManagerImpl        288     500.00%
nsNoAuthURLParser            48     300.00%
nsXMLElement              11600     262.50%
nsXULControllers             60     200.00%
nsXPCComponents             240     150.00%
nsXPCWrappedNativeScope        140     150.00%
nsHTMLInputElement          176     100.00%
XMLElementFactoryImpl         24     100.00%
nsGenericFactory             40     100.00%
nsXBLEventHandler         11880      83.33%
nsTextNode                  480      60.00%
nsLocalFile                 812      40.00%
nsCommentNode               240      33.33%
HTMLAttributesImpl          600      11.11%
HTMLAttribute               304       5.56%
TOTAL                    724200

I'd rather these problems get fixed than me get backed out because my navigator 
changes need to happen and from a XUL point of view are perfectly valid. I did 
not change any observer behaviour. The button elements that observed various 
back/forward broadcasters were as prior to my checkin, the change just modifying 
their class name.
I've looked at this bug a bit yesterday and today.  I isolated the problem to
ben's changes to navigator.xul, navigatorOverlay.xul, and navigator.dtd
(versions 1.173, 1.76, and 1.61 respectively) at 2000-04-29 19:42.

The leak is ugly.  It looks the same as what radha caused a week ago.  The
"cause" of the leak is, I think, that the nsXULDocument does not get JS-garbage
collected.  That's the main difference on refcount balancer logs between the
before and after pictures.  I have a bunch of logs at
http://dbaron.student.harvard.edu/u/david/leaks/regress/ben/ when my computer is
on.  These use the refcount balancer, many including my experimental COMPtr
logging tools.  The files beginning with good-* are the "before" picture.  A
filename ending in balance-* is refcount balancer output, where the * can
contain ib meaning "--ignore-balance" and "ic" meaning that nsCOMPtrs (balanced
in all cases for the nsXULDocument logs) are cancelled out.

The leak is one of those things that looks like a circular reference
(nsXULDocument::GetScriptObject makes a script object using
NS_NewScriptXULDocument, which addrefs the nsXULDocument because it passed this
as the second argument), except it looked the same in the "before" picture and
it didn't leak then.

At least, I think everything goes back to the nsXULDocument.  The two JSContexts
leaked (serial#s 3 and 4) are owned by two of the 10 nsXULPrototypeDocuments,
which are owned by an nsXULDocument, presumably the one that leaked.

cc:ing jband in case he has any insight into what causes the garbage collection
not to release the nsXULDocument.
This is the same leak that Radha introduced and then fixed (by changing the XUL
back) last week.  I have a very simple minimum diff to trigger that leak (if you
don't have Ben's changes).

The differences between the addrefs/releases for a before/after picture of that
leak are at:
http://dbaron.student.harvard.edu/u/david/leaks/regress/radha/refcnt.diffs
wow, this is really cool. thanks, dbaron, you rock! 
Adding self & waterson to cc list. The bug I filed for the same leaks 36961 went 
to waterson. Maybe one s'd be closed out as dupe.
No clue.  It's not menubuttons.  Who knows?  Handing over to LeakBitch(tm) for 
triage.
Assignee: hyatt → waterson
My JS reentrancy bug theory doesn't hold water (I didn't actually mention it
above, but...).  js_DestroyContext is never called (in the leaky world) from
within js_GC.  (I think it was before the leaks...)

It's hard to isolate what XUL/JS caused the leak because Ben's changes were
rather large.  I can get it down to a chunk where removing the chunk fixes the
leak - but that could easily because the JS code that expects some object (in
the removed chunk) to exist, causes an error, and prevents the real problem code
from being executed.

However, I do know that the original navigatorOverlay.xul and navigator.dtd
changes did not cause the leak, but adding the navigator.xul changes does cause
the leak.
Here's some more guessing about what could be happening here.

I'm getting the feeling (although maybe I'm just a pessimist) we're getting into
more ugly issues (like bug 28570) relating to the ownership model of JS global
objects.  It's my understanding that the JS engine wasn't really designed for a
situation where things that own JS contexts are being held by other JS contexts.

js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131 )
does some extra-special stuff (which it seems to me is to be sure to free up
everything) if the context being destroyed is the last one.  Here the last
contexts to be destroyed are being held by something (in this case, the XUL
Prototype documents) that are not going to be released until another JS context
is fully cleaned up (the one associated with the main nsXULDocument).  Is it
possible that a context is not fully garbage-collected (does that phrase make
sense?) when it is destroyed if it is not the last context to be destroyed?

If this were the case, would it be easier to change the ownership model (i.e,
not have JS contexts own their global object, but make the global object, or
something it owns, responsible for destroying the context first) or the JS
engine?

Since I think jband is on sabbatical, cc:ing brendan.  Brendan:  Does this
theory make any sense at all?  What (else) could cause something not to be
garbage-collected?

(BTW, I think the circular reference issues that worried me before are properly
handled in DocumentViewerImpl::~DocumentViewerImpl,
GlobalWindowImpl::SetNewDocument, and nsXULDocument::SetScriptGlobalObject .)
*** Bug 36961 has been marked as a duplicate of this bug. ***
Status: NEW → ASSIGNED
Priority: P3 → P1
Target Milestone: --- → M16
>I'm getting the feeling (although maybe I'm just a pessimist) we're getting 
into
>more ugly issues (like bug 28570) relating to the ownership model of JS global
>objects.

Bug 28570 was fixed while I was on sabbatical, wasn't it?  It was a failure of 
the
JS_THREADSAFE mode of the JS garbage collector to cope with more than one 
JSContext
per thread.  It was easy to fix, once diagnosed.

>It's my understanding that the JS engine wasn't really designed for a
>situation where things that own JS contexts are being held by other JS 
contexts.

No, the JS engine was designed (by me) for the old "classic" codebase, which had
multiple contexts per thread, one context per window, just like the Mozilla 
modern
codebase.  Therefore in trees induced by the frameset-contains-frame relation, a
window held strong (owning) references to its frame children, although there 
could
be other strong refs.  Each window native data structure owned its JSContext.  
We
eliminated leaks, so the design did address the situation you describe.

It's probably worth recapping how the old scheme worked.  There were two heaps 
and
two storage management schemes: the JS heap, which was garbage collected; and 
the
libmocha (portion of the malloc) heap, which was reference-counted with strong 
and
weak reference counts.  JS heap objects on the left, libmocha on the right, 
strong
refs are -->, weak are ==>, property 'ids' along the arrows:

        window context <------------------+----------------------------< 
MWContext
        |                                 |
        v                                 v
    JS: window object ========> libmocha: window's MochaDecoder <======+
        |           |                                                  |
     'frames[i]'    'document'                                         |
        |           |                                                  |
        |           v                                                  |
        |           document object ====> document private data =======+
        |
        v
        frames[i] (window) object ======> frames[i]'s MochaDecoder
        ^                                 ^
        |                                 |
        frames[i] context <---------------+----------------------------< 
MWContext

Note the double-ended arrows between contexts and decoders -- JS contexts are 
not
reference counted, they are manually created and destroyed by a single owner, 
which
itself must be reference-counted in a world of multiple potential references to 
a
window data structure.

[Note that the JSContext struct contains no actual void * or otherwise-typed
pointer to the decoder.  Instead, there was another malloc-heap object, 
MWContext,
corresponding to the MochaDecoder, that pointed to the JSContext and held a 
strong
ref to the decoder that we attributed to the owning pointer (namely,
MWContext::mocha_context, of type JSContext *).]

The weak references are from JS objects, via their private data slots, to their
corresponding libmocha data structures.  These refs are weak because a JS object
may become garbage and be finalized, but its underlying libmocha structure 
should
not be destroyed until all strong and weak refs from native C/C++ code and from
other JS object private data slots have gone away.  In particular, references 
"up
the tree" from descendent DOM objects' private data to the window's private data
structure, the decoder, must be weak.

When a document is released from a window (caused by the first packet of new
document data arriving and displacing an old document already loaded in the 
window),
then the document's containing window has its scope and GC roots cleared, and 
the GC
is run.  If one of its frame kids is now unreachable by JS strong refs (which 
are
the only kind of JS refs), the frame object is finalized.  That decrements the 
weak
reference count (decoder->back_count) for the frame's decoder.  If that count
reaches 0, and if the strong ref count (forw_count) is already 0, the decoder 
*and
its JSContext* are destroyed.

When a native window (MWContext) is closed or otherwise destroyed (app exit), 
the
strong ref held indirectly from the MWContext via the JSContext is removed, and 
if
the strong reference count (decoder->forw_count) therefore goes to 0, then the 
GC
roots are cleared, the context's global object is set to null (rather than 
clearing
the window object's scope, which would frustrate attempts to discover properties 
set
in that window object for discovery by the caller to window.open, after the 
window
has been closed; and which would also waste cycles in the case that no one has a
strong ref to the JS window object [it will be GC'd soon enough, so no need to
clear its scope]), and the GC is run.

Around this GC run, the decoder's back_count is elevated to keep it alive in 
case
there are no strong or weak refs to it.  Then the weak ref-count is dropped, and 
if
zero, then *all* memory (GC roots registered with JS, other deep data 
structures)
owned exclusively by the decoder, and the decoder memory itself, are freed.  But 
if
there are JS refs to the window (say, in a running script's variables, or 
stashed
in another window's scope), then the weak ref from the window object's private 
data
slot to the decoder keeps its storage alive, even though it is mostly "closed" 
and
all-but-freed.  About the only thing it is good for is handling win.name 
property
gets and the like, and of course getting properties set by script in the opened 
or
child window.

Whew.

Ok, let's update the diagram to the new world, prior to "brutal sharing" and the
XUL prototype document: MWContext => webshell, MochaDecoder => GlobalWindowImpl,
forw_count => mRefCnt -- oops!  The JS window object no longer holds a weak ref
(decoder->back_count-accounted pointer in the old world).  It adds to the 
mRefCnt
pot of the GlobalWindowImpl.  That's big bad difference #1.

I can't get to dbaron.student.harvard.edu right now, but based on dbaron's next
to last comment about JS_DestroyContext not being called for a certain context,
I'm guessing the leak here is not just an nsXULDocument, but a whole DOM window
(GlobalWindowImpl).  Is that true?  If so, I can see why.  The lack of a weak
ref count in GlobalWindowImpl (back_count in the MochaDecoder in the classic
codebase) means that there can be a strong ref cycle, if a JS GC root in the
GlobalWindowImpl, or in another data structure reachable from its strong XPCOM
refs, points back to the JS object.

Note how in the classic world, because the window JS object uses the weak ref
count, when there are no strong refs to the native data structure, its roots are
all cleared and the GC is run.  But the weak ref count keeps the window object's
private data alive, albeit "closed".  In the modern Mozilla world, I see no such
code to clear roots.

I may be missing some root clearing and/or removing somewhere in the DOM, that
suffices at least for non-XUL (no brutal sharing) windows.  But I believe the 
bug
involves only stuck root(s) and refcnts.

If there is a stuck GC root, we should be able to find it with 
js_LiveThingToFind
and GC_MARK_DEBUG.  I'll help waterson with this tomorrow, if he has time.

>js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131 
)
>does some extra-special stuff (which it seems to me is to be sure to free up
>everything) if the context being destroyed is the last one.  Here the last
>contexts to be destroyed are being held by something (in this case, the XUL
>Prototype documents) that are not going to be released until another JS context
>is fully cleaned up (the one associated with the main nsXULDocument).

Then which context is "last"?  The answer is the one that goes through the code
at the start of js_DestroyContext, which finds the runtime's context-list empty
after removing the context.  I believe this means that what you call the "last"
context above won't be considered last, because if there are other contexts not
yet destroyed, they will still be on the runtime's contextList.

Anyway, all this JS engine "last context-destruction runtime cleanup" code is
way below the level at which I believe this leak bug lives: the world of refcnt
cycles and stuck GC roots.

>Is it
>possible that a context is not fully garbage-collected (does that phrase make
>sense?) when it is destroyed if it is not the last context to be destroyed?

A context is not a garbage-collected object, so no.  Nor is it leaking any 
roots,
as the context holds no explicit GC roots (added by JS_Add{,Named}Root), and its
struct members that are roots-by-definition (e.g., globalObject) are no longer
roots once the context is off the runtime's contextList (see jsgc.c).

>If this were the case, would it be easier to change the ownership model (i.e,
>not have JS contexts own their global object, but make the global object, or
>something it owns, responsible for destroying the context first) or the JS
>engine?

None of this Mozilla leak bugginess requires JS engine changes; my proof is the
classic codebase.

Dbaron, thanks for all your work here, anyway -- JS needs scrutiny, as bug 28570
shows.  It's just not at all likely to be at fault here.

/be
Sorry, my bad for using notepad and overflowing the 80th column.  Please don't 
bother reading all of my long blurb in-line in the bug -- try reading the 
attachment instead.

/be
Blocks: 8241
Blocks: 37480
Turns out to be a circularity caused by XBL. Specifically, if an XBL event gets
dispatched to anonymous content, then the anonymous content's script object gets
created. This circularity is *supposed* to be broken by
nsIContent::SetDocument(nsnull), but since the parent of the anonymous content
isn't aware of it's anonymous children, this never gets called.

Hyatt and I are working out a fix for this: I'll patch diffs soon. In the
meantime, I've checked in code that comments out "urlbar.focus()" in
navigator.js, which was causing this.
Attached patch fixSplinter Review
Sorry it took so long to get the patch up. Friggin' bugzilla. Anyway, the fix is
to make sure that nsXULElement::SetDocument() tells the anonymous content nodes
created from a binding to fix up *their* document backpointers. This'll cause
them to do the right thing when we're tearing down a document.

Hyatt sez this change is scary, so help me test it. Seems to work fine for me.
Checked in the changes. I'll wait to mark fix until I see leak stats go down.
ok, it worked! marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: