Closed Bug 294893 Opened 19 years ago Closed 19 years ago

Major leakage on Linux balsa Dep GTK1 (gcc 3.4)

Categories

(Core :: XPConnect, defect, P1)

x86
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla1.8beta2

People

(Reporter: Gijs, Assigned: bzbarsky)

References

()

Details

(Keywords: memory-leak, perf, regression)

Attachments

(9 files)

4.84 KB, patch
Details | Diff | Splinter Review
6.50 KB, patch
Details | Diff | Splinter Review
7.93 KB, patch
bzbarsky
: review+
bzbarsky
: superreview+
benjamin
: approval1.8b2+
Details | Diff | Splinter Review
2.25 KB, patch
jst
: review+
Details | Diff | Splinter Review
14.98 KB, patch
bzbarsky
: review+
bzbarsky
: superreview+
Details | Diff | Splinter Review
1.43 KB, patch
bzbarsky
: review+
bzbarsky
: superreview+
Details | Diff | Splinter Review
3.84 KB, patch
brendan
: review+
brendan
: approval1.8b2+
Details | Diff | Splinter Review
3.36 KB, patch
bzbarsky
: review+
bzbarsky
: superreview+
Details | Diff | Splinter Review
1.89 KB, patch
brendan
: review+
brendan
: approval1.8b3+
Details | Diff | Splinter Review
As of the 2005-05-19 20:26 build, trace_malloc_leakage on this build has
skyrocketed from 300k to 1.35MB

I've attempted to CC everyone who checked in something that might relate to this
bug. I apologize if I was overzealous here, I'm trying to do the best I can
while not knowing an awful lot about Tinderbox / build mechanisms.

Leak graph:
http://build-graphs.mozilla.org/graph/query.cgi?tbox=balsa&testname=trace_malloc_leaks&autoscale=1&size=&days=7&units=bytes&ltype=&points=&avg=1&showpoint=


Bonsai timeframe:
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2005-05-19+17%3A52%3A00&maxdate=2005-05-19+20%3A26%3A00&cvsroot=%2Fcvsroot

Note: this probably doesn't belong in Firefox > General, but I have no idea
where else it ought to go for now. In a somewhat similar timeframe, the
Seamonkey build also got 200k extra leakage - I don't know if this is related.
Regression from bug 281988 i think -- see bug 281988 comment 144
Keywords: regression
Marking depends per comment 1 and discussion on IRC
Depends on: 281988
Blocks: 281988
No longer depends on: 281988
Minimal testcase is in the URL bar....  For some reason, we end up with:

###!!! ASSERTION: nsGenericElement's event listener manager hash not empty at
shutdown!: 'sEventListenerManagersHash.entryCount == 0', file
../../../../mozilla/content/base/src/nsGenericElement.cpp, line 804

and leak the world when we load that testcase.  I've confirmed that disabling
the XPCNativeWrapper stuff (by just hacking NativeInterface2JSObject to never do
it) avoids the leak.  :(

We really don't want to ship 1.8b2 with this.  Debugging now...
Assignee: nobody → dbradley
Component: General → XPConnect
Flags: blocking1.8b2?
Keywords: mlk
Product: Firefox → Core
QA Contact: general → pschwartau
Assignee: dbradley → bzbarsky
Priority: -- → P1
Target Milestone: --- → mozilla1.8beta2
Flags: blocking1.8b2? → blocking1.8b2+
Flags: blocking1.8b2+ → blocking1.8b2?
Flags: blocking1.8b2? → blocking1.8b2+
I don't fully understand this yet, but this is IMO what we want regardless of
whether this really does fix the leak (which it appears to do with the trivial
testcase in this bug). What the current code does is to mark mNativeWrapper
when a wrapped native is marked, and that makes the nw stay around as long as
the wn stays around. I don't see a reason for that, if all script references to
a nw are gone then we might as well let the nw go, even if the wn is still
referenced (in a different scope, most likely).

I'm going to check this in now, there's noone around to r/sr, so bz, please
r+sr after the fact, and back this out if there's something really wrong with
this. I want testing for this in tomorrows build, and I want to see how this
affects the leak numbers ASAP. With some luck this'll make the leak go away and
we can ship with this even if we don't yet fully understand all aspects of this
leak etc.
Attachment #184387 - Flags: superreview?(bzbarsky)
Attachment #184387 - Flags: review?(bzbarsky)
Probable leak fix checked in, a=asa (over IRC).
The first cycle with the leak fix (double-checked with build log CVS output that
the fix is in) still leaks like before.
We need to understand the leak to fix it.  Also, I think we want that mark of
the nw from its wn, or auto-magically created nw's will be ephemeral.  That's
bloaty and (if you can stick a property on an nw, as you can for certain ids)
arguably user-hostile.

Did anyone get a GC-heap dump that matched the one bz saw, where the
BackstagePass object was implicated, and certain prototype chains had one too
many Object protos near the end?

/be
jst, I think comment 7 is right -- we want to keep the XPCNativeWrapper alive
just like we keep XPCWrappedNatives alive if they have random user-defined props
on them....

So this is what we found yesterday, just so we all have the same info.  Note
that I was testing with the minimal testcase in this bug, so we may also have
other issues (given that jst's change fixes this testcase but not tinderbox).

1)  The leak on this testcase went away if every time
XPCNativeWrapper::GetNewOrUsed was called I had it create a new wrapper _and_
set the new wrapper on the XPCWrappedNative.  If I just had it create a new
wrapper every time but left the wrapped native alone if it already had an
mNativeWrapper, we still leaked.  This is probably the issue that jst's patch
tries to address...

2)  I beat things about the head and shoulders till we stopped leaking the JS
runtime service (through XBL and XUL), so I got a printout of the remaining
roots at runtime shutdown.  I also hacked XBL to use slightly more informative
root names.  This is what I get at shutdown:

JS engine warning: leaking GC root ' --
chrome://global/content/bindings/scrollbar.xml#scrollbar --
nsXBLProtoImplMethod::mJSMethodObject' at 0x85a7d64
JS engine warning: leaking GC root 'res->input' at 0x876d7a4
JS engine warning: leaking GC root 'initScrollbar --
chrome://global/content/bindings/scrollbar.xml#scrollbar --
nsXBLProtoImplMethod::mJSMethodObject' at 0x85a80a4
JS engine warning: leaking GC root 'res->input' at 0x8598624
JS engine warning: 4 GC roots remain after destroying the JSRuntime.
                   These roots may point to freed memory. Objects reachable
                   through them have not been finalized.

So the leaking roots are two regexp-related things and the initScrollbar method
and constructor of the scrollbar binding....

I'm not really sure where to go with that.  The address 0x85a80a4 does not
appear in the gc heap dump that I (or emacs i-search, and I did just search for
85a80a4) can see.  What _does_ appear in the dump, is this:

08473ef0 object 0x87712d8 Window via global object(Window).

I'm not really sure what that means, exactly.  Is the global object an implicit
GC root or something?  Still trying to sort out how I can use this gc heap dump...

jst and I were also a little concerned about the calls to JS_NewObject for
XPCNativeWrappers whose parent is null (i.e. the wrapper for the window's
wrapped native).  We tried setting the parent to the window's JSObject (the one
off the wrapped native), but that didn't seem to help.
I posted the full output of a leaking build at
http://web.mit.edu/bzbarsky/www/gc_log.txt (and a gzipped version at
http://web.mit.edu/bzbarsky/www/gc_log.txt.gz if people want to download it via
"save link as" -- the file is about 4MB).  In addition to the GC heap dump, this
file contains two more types of debug output:

Lines like:

  Created wrapper: '[object Window @ 0x87712d8]', object is: 0x8473ef0

mean we created an XPCWrappedNative for a Window object, the address of the
XPCWrappedNative is 0x87712d8 and the address of its flat JSObject is 0x8473ef0.

Lines like:

  Created extra-wrapper: 0x8474f20, for: [object Window @ 0x87712d8]

mean we created an XPCNativeWrapper for the XPCWrappedNative at address
0x87712d8.  The address of the XCPNativeWrapper JSObject is 0x8474f20.

The gc dump is also somewhat hacked; I changed the path-dumping in gc_dump_thing
as follows:

     while (next) {
         path = JS_sprintf_append(path, "%s(%s).",
                                  next->name,
                                  gc_object_class_name(next->thing));
+        if ((*js_GetGCThingFlags(next->thing) & GCF_TYPEMASK) == GCX_OBJECT) {
+            path = JS_sprintf_append(path, "[was %p]",
+                                     (JSObject*)next->thing);
+        }
         next = next->next;
     }

in the hopes of getting addresses for things in the path.  Note that this puts
the address after the '.' because I was lazy...  So a line like:

  08473ef0 object 0x87712d8 Window via global object(Window).[was 8473ef0]

would mean that the JSObject at 0x08473ef0 (with private data 0x87712d8, so this
is exactly our Window object above) had on the path the JSObject at 0x08473ef0,
if I understand what this code is doing right...

I didn't include the actual log XPCOM_MEM_LEAK_LOG creates, since all it tells
me is that we leak a window and a document and everything that entails.

Note that this is all without jst's patch.  I'll try applying that now and
seeing whether I can create a minimal leak testcase with it applied.
(In reply to comment #7)
> We need to understand the leak to fix it.  Also, I think we want that mark of
> the nw from its wn, or auto-magically created nw's will be ephemeral.  That's
> bloaty and (if you can stick a property on an nw, as you can for certain ids)
> arguably user-hostile.

What id's would that be? I can't imagine we would ever want anyone to be able to
store anything on these wrappers. What am I missing?
Attachment #184428 - Flags: superreview?(bzbarsky)
Attachment #184428 - Flags: review?(bzbarsky)
Comment on attachment 184428 [details] [diff] [review]
Same as above w/ more newborn object protection.

r+sr=bzbarsky; requesting approval for yet more attempted leak-fixing.
Attachment #184428 - Flags: superreview?(bzbarsky)
Attachment #184428 - Flags: superreview+
Attachment #184428 - Flags: review?(bzbarsky)
Attachment #184428 - Flags: review+
Attachment #184428 - Flags: approval1.8b2?
Comment on attachment 184428 [details] [diff] [review]
Same as above w/ more newborn object protection.

per deerpark meeting, any reviewed patch for this bug is auto-approved
Attachment #184428 - Flags: approval1.8b2? → approval1.8b2+
Additional change landed now too.
Bz's the man, he figured this out (assuming that this *really* does fix the
leak, and we think we know why it would now). What's happening here is that we
end up with a wrapped native for a content window and a native wrapper for that
WN. Chrome code accesses content.document, which ends up wrapping the content
document and sticking it in the property cache on the NW. If the window is then
closed, we end up clearing the scope, but leaving the NW around with a
reference to the document. At least before my earlier changes, the cycle was
something along these lines:

  window > window::mNativeWrapper > nsHTMLDocument > child nodes >
  nsEventListenerManager > nsJSEventListener > nsJSContext > window

With my earlier changes the cycle must be different, but I haven't been able to
reproduce a leak w/ my earlier changes so I can't say for sure what the cycle
really is.

This patch is going in, r+sr=bzbarsky (over IRC).
Attachment #184466 - Flags: superreview+
Attachment #184466 - Flags: review+
Great work, sorry I wasn't around to help.

jst, I'm still wondering (comment 7): don't we want to mark mNativeWrapper from
XPCWrappedNative::MarkBeforeJSFinalize?  And therefore not need a finalize hook
in XPCNativeWrapper?  At least toString may be overridden on the nw, and it
seems like a waste to let the GC destroy all the "faulted-in" ids on an busy nw
for a live wn.

/be
balsa is green again, still leaks like before.
So in the range in which leaks went up, the following patches went in:

Bug 294808
Bug 294831
Bug 294630
Bug 294815
Bug 281988 -- first patch only (so wrappers NOT enabled at that point)
Bug 258650

I'm pretty sure the first three things on that list don't leak.

Bug 294815 could, in theory, have issues...
Bug 281988 could have issues if wrappers are being created manually or if
           somehow code was getting wrappers because it had a null script
           filename... or if there is a bug in the rest of the code somewhere,
           but I don't see one.
Bug 258650 is probably safe.

There _are_ some manual XPCNativeWrapper creations in the code -- see
<http://lxr.mozilla.org/seamonkey/search?string=XPCNativeWrapper%28>.  The only
ones I see in Firefox code are pageinfo and the DOMWillOpenModalDialog handler
in tabbrowser.  I'll try removing the latter and seeing what happens, I guess.

It really would be nice if I could reproduce the leak at this point...
I think I've finally reproduced this.  Balsa is reregistering components on
every single startup (why?  I have no idea).  If I remove my compreg.dat and
then start, I do see a leak.  It's not leaking through the wrappers added in bug
281988, though...
Attachment #184427 - Flags: superreview?(bzbarsky)
Attachment #184427 - Flags: review?(bzbarsky)
Attachment #184387 - Flags: superreview?(bzbarsky)
Attachment #184387 - Flags: review?(bzbarsky)
Regarding    #294893 [Core]-Major leakage on Linux balsa Dep GTK1 (gcc 3.4)
[Lin] Attempt 4....

GTK2+ is out, why haven't they updated?

I have noticed this recently myself, 512MB of ram and fresh reboot say, I am
using near 32% of it, then loads FF and it goes up to near 80% use. Does not
even release it either. Has to do a Cold reboot to clear it out.

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b2) Gecko/20050524 Firefox/1.0+
ID:2005052421

One last note, when I control T (open new Tab) then try to control v (paste)
tonew tab, it does not default to the URL like it did in the previous build I had.

Ok Graphical Representation here 
[url=http://images5.theimagehosting.com/Memory1.png]With FF Loaded[/url]

[url=http://images5.theimagehosting.com/Memory2.png]Without FF loaded and a
Reboot[/url]

would have shown here, but for some reason they are the size of my Display 1280x1024

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b2) Gecko/20050525 Firefox/1.0+
ID:2005052509

Since I did this, the Prometheus Build is what had the problem.
Comment on attachment 184556 [details] [diff] [review]
Also clear the scope of every explicitly created nw, and put back the marking of the "primary" nw when wn's are marked.

> Index: js/src/xpconnect/src/XPCNativeWrapper.cpp

+	 JS_ClearScope(d->cx, (JSObject*)entry->key);

::JS_ClearScope, please.

>Index: js/src/xpconnect/src/xpcjsruntime.cpp
>+   mExplicitNativeWrapperMap(XPCNativeWrapperMap::newMap(XPC_NATIVE_WRAPPER_MAP_SIZE)),

You need to delete this map in ~XPCJSRuntime.

You also need to null-check this map in newXPCJSRuntime().

r+sr=bzbarsky with those fixed.
Attachment #184556 - Flags: superreview?(bzbarsky)
Attachment #184556 - Flags: superreview+
Attachment #184556 - Flags: review?(bzbarsky)
Attachment #184556 - Flags: review+
So I've been debugging the leak on startup and registration.  We seem to be
leaking two chrome windows -- the hidden window and the main browser window.  I
can sorta tell them apart, since the hidden window is a ChromeWindow with an
HTMLDocument hanging off it:

086bc088 object 0x86e3b60 HTMLDocument via
XPCWrappedNativeProto::mJSProtoObject(XPC_WN_ModsAllowed_Proto_JSClass @
0x081c65f0).outerWidth getter(Function @ 0x08653580).__proto__(Function @
0x08653578).__proto__(Function @ 0x08502f20).__proto__(Object @
0x08502f50).__proto__(Object @ 0x08502e88).__parent__(ChromeWindow @
0x08502e48).document(HTMLDocument @ 0x086bc088).

08502e48 object 0x851cbf0 ChromeWindow via
XPCWrappedNativeProto::mJSProtoObject(XPC_WN_ModsAllowed_Proto_JSClass @
0x081c65f0).outerWidth getter(Function @ 0x08653580).__proto__(Function @
0x08653578).__proto__(Function @ 0x08502f20).__proto__(Object @
0x08502f50).__proto__(Object @ 0x08502e88).__parent__(ChromeWindow @ 0x08502e48).

The other window we're leaking is:

081c65f8 object 0x85ad1c0 ChromeWindow via atom(AnyName @
0x0812c438).constructor(Function @ 0x0821e930).__proto__(Function @
0x081c66d8).__proto__(Object @ 0x081c6c10).__proto__(Object @
0x081c6640).__parent__(ChromeWindow @ 0x081c65f8).

This is the root cause of the leaks; leaking this window means we leak various
stuff that entrains the chrome window as the compilation scope (you can see that
above).  This is the only path via which this window is marked.  The leak log
says "(1 references; 0 from COMPtrs)", so the only ref to it is held from the
XPCWrappedNative we're leaking here.

Brendan, what's this atom(AnyName) thing that claims to be a GCX_OBJECT in its
typemask (otherwise it wouldn't get an "@ pointer" thing to go with it)?

More importantly, why's it acting like a root?

Fwiw, I checked with auto-wrapping completely disabled (and there's no manual
wrapping in Firefox outside of page info, which I did not open), and the startup
leak was still there....
We (brendan, jst, I) have no idea why this wasn't a problem before, but this
AnyName atom is what's entraining the non-hidden chrome window...  Just landed
this patch, with r+a=brendan, sr=jst.
Attachment #184577 - Flags: superreview+
Attachment #184577 - Flags: review+
And it looks like that patch causes crashes during standard class init -- we
manage to somehow lose our just-constructed proto for the AnyName class to GC.
So the reason for the crashes is that js_GetAnyName returns
rt->atomState.lazy.anynameAtom if that's set.  If we don't pin the atom, that
can get GC'd, and then we're returning bogus stuff.  Since that atom is also the
proto for the class, as far as I can tell, bad things happen when we try to 
js_SetClassPrototype in JS_InitClass.  We have to be initing the class not for
the first time, but that's ok -- we have plenty of global objects around.

Brendan, this is sounding like an all-you kinda thing...
This just nulls out the atomstate entries when the atom is finalized.  With
this change, I don't leak and don't crash, even with TOO_MUCH_GC.
Comment on attachment 184597 [details] [diff] [review]
Better, stronger, faster

brendan, whatcha think?
Attachment #184597 - Flags: review?(brendan)
First balsa build after bz' checking.

RLk:816B
Lk:307KB
MH:9.48MB
A:259K
The following is for comparative purposes (presumably to be used in working out
whether this should be "Status -> RESOLVED, Resolution -> FIXED" or not.)

The previous values before this bug was filed according to
http://forums.mozillazine.org./viewtopic.php?t=271385&postdays=0&postorder=asc&postsperpage=15&start=15
(and backed-up by bug 281988 comment 144) were:

RLk:780B
Lk:308KB
MH:8.00MB
A:247K
Marking fixed, again pending Brendan's review.  Remaining issues, if any, should
get separate bugs.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Looks like both beast and pacifica tinderboxes crash on TP after this commit
Attachment #184615 - Flags: superreview?(bzbarsky)
Attachment #184615 - Flags: review?(bzbarsky)
Attachment #184615 - Flags: superreview?(bzbarsky)
Attachment #184615 - Flags: superreview+
Attachment #184615 - Flags: review?(bzbarsky)
Attachment #184615 - Flags: review+
finalize hook change landed.
Bug 295634 for random crashes on Win after last part landed
Comment on attachment 184597 [details] [diff] [review]
Better, stronger, faster

Great, thanks for covering.

/be
Attachment #184597 - Flags: review?(brendan)
Attachment #184597 - Flags: review+
Attachment #184597 - Flags: approval1.8b2+
His code should compile to the same machine code as mine, for release builds. 
I changed some names, registerized rt, and added an assertion.	I'll check this
in for 1.8b3.

/be
Attachment #184728 - Flags: review+
Attachment #184728 - Flags: approval1.8b3+
Blocks: 299593
*** Bug 295634 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: