Open Bug 842476 Opened 9 years ago Updated 9 months ago

leak in child process in test_browserElement_oop_SetVisibleFrames.html with bug 404077 patches

Categories

(Core :: DOM: Core & HTML, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: memory-leak, Whiteboard: [MemShrink:P3][leave open])

Attachments

(4 files, 1 obsolete file)

With a current mozilla-central tree plus the patch for bug 404077, i.e., https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/3d9faacd1603/mochitest-assertion-checking
I see orange due to leaks on all debug Mac OS X builds (not quite 100% of the time, but close) in the child process of the mochitest test_browserElement_oop_SetVisibleFrames.html .


I can reproduce the leak on Linux if I enable the test.

The nsDocShell is held by two nsCOMPtrs from an HttpChannelChild/HttpBaseChannel.  The channel is an nsHTMLDocument's mChannel.  The document leaked is the one for file_browserElement_SetVisibleFrames_Inner.html?child1

If I take refcnt+comptr logs on the nsHTMLDocument (3 references leaked, 0 from nsCOMPtrs), the leak goes away in a debug non-opt build, but in a debug+opt build I can still reproduce the leak if I take the logs with XPCOM_MEM_LOG_CLASSES but *not* XPCOM_MEM_LOG_OBJECTS.

That said, I haven't yet been able to make heads or tails of the logs.
Try run showing the leaks is https://tbpl.mozilla.org/?tree=Try&rev=359de52f597b
Attached file logs.tar.bz2
This is the full set of logs for the one successful set of refcount/comptr logs I managed to get, processed with fix-linux-stack, plus the corresponding cc/gc logs from the cycle collector.

It was generated with the command, in the root of my objdir:

MOZ_CC_LOG_DIRECTORY=$(pwd) XPCOM_CC_SHUTDOWN_COLLECTIONS=5 XPCOM_CC_LOG_SHUTDOWN=1 XPCOM_CC_ALL_TRACES_AT_SHUTDOWN=1 XPCOM_MEM_LOG_CLASSES=nsHTMLDocument XPCOM_MEM_REFCNT_LOG=nsHTMLDocument.refcnt XPCOM_MEM_COMPTR_LOG=nsHTMLDocument.comptr TEST_PATH=dom/browser-element/mochitest/test_browserElement_oop_SetVisibleFrames.html make mochitest-plain

which dumped the cc/gc logs in the objdir and the refcount logs in _tests/testing/mochitest/, and then by running fix-linux-stack over those refcount logs.
Whiteboard: [MemShrink]
dbaron: why does making assertions fatal trigger a leak?
Maybe SpecialPowers is involved?
Flags: needinfo?(dbaron)
Whiteboard: [MemShrink] → [MemShrink:P3]
The main change in the current patch I'm trying to land (see URL in the top 2 lines of comment 0) is actually loading an interstitial page between tests so we can attribute assertions that happen during test unload or load to the correct tests.  This probably changes timing of loading/unloading a bit, which given that this leak seems timing-sensitive anyway, seems likely to be the trigger.
Flags: needinfo?(dbaron)
So I *think* (though it's a little fuzzy since I'm searching for things in stacks from optimized builds... in which the things I'm looking for might not show up; though odds are that an nsCOMPtr-subtracted balance tree won't have many accidentally-zeroed branches, which are the only reason I'm searching in the first place) that the two references (other than the one the CC knows about from the nodeinfomanager) are from:
 * the global scope polluter, added at the end of nsWindowSH::InstallGlobalScopePolluter
 * the XPCWrappedNative's primary (non-tearoff) reference, added in XPCWrappedNative::GetNewOrUsed

I'm wondering if the live-document optimizations are interfering during the final CC; I'll see what happens if I disable them.
(In reply to David Baron [:dbaron] from comment #6)
> I'm wondering if the live-document optimizations are interfering during the
> final CC; I'll see what happens if I disable them.

That doesn't seem to be it; making nsCCUncollectableMarker::Observe return early before doing any of its marking doesn't make the bug go away.
It occurs to me that another reason those edges might not be showing up in the CC graph is if some relevant JS object were marked during the first phase (GC marking ignoring roots that come from CC-able things) of the CC (if my memory of how CC/GC interaction works is still at all valid), so I guess I should look at the gc-edges.
So the document and window were indeed black in the gc-edges log, but it's not clear to me how to figure out why; I'm not sure what the roots for that determination are.
Had a chat with amccreight and learned about https://github.com/amccreight/heapgraph , which points to the reason the edges from JS are black being an nsAutoJSValHolder owned by an nsXPCException.
So the leaked nsXPCException was put into an nsCOMPtr here:

nsCOMPtr<nsIException>::Assert_NoQueryNeeded() (/home/dbaron/builds/ssd/mozilla-central/obj/firefox-debugopt/xpcom/base/../../dist/include/nsCOMPtr.h:502)
nsXPCWrappedJSClass::CheckForException(XPCCallContext&, char const*, char const*, bool) (/home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp:1102)
nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) (/home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp:1467)
nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) (/home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJS.cpp:579)
PrepareAndDispatch (/home/dbaron/builds/ssd/mozilla-central/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcstubs_x86_64_linux.cpp:124)
SharedStub (xptcstubs_x86_64_linux.cpp:0)


Who's responsible for dealing with such a thing?  It shouldn't just sit there.
Flags: needinfo?(bobbyholley+bmo)
That said, it might be worth fixing the actual exceptions, in addition to the fact that they're not reported anywhere.  If I print them out in nsXPCWrappedJSClass::CheckForException with this diff (plus #include "nsPrintfCString.h"):

             if (NS_FAILED(e_result)) {
+                {
+                    nsXPIDLCString str;
+                    xpc_exception->ToString(getter_Copies(str));
+                    NS_WARNING(nsPrintfCString("STORING THE FOLLOWING nsIXPCException:\n======\n%s\n======\n", str.get()).get());
+                }
                 XPCJSRuntime::Get()->SetPendingException(xpc_exception);
                 return e_result;
             }

then I see these exceptions in the child process:

[Child 5600] WARNING: STORING THE FOLLOWING nsIXPCException:
======
[Exception... "'TypeError: can't access dead object' when calling method: [nsIPrompt::alert]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "JS frame :: http://mochi.test:8888/tests/dom/browser-element/mochitest/file_browserElement_SetVisibleFrames_Outer.html :: handlePrompt :: line 25"  data: no]
======
: file /home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp, line 1105
[Child 5600] WARNING: STORING THE FOLLOWING nsIXPCException:
======
[Exception... "'Error' when calling method: [nsIDOMEventListener::handleEvent]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
======
: file /home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp, line 1105
[Child 5600] WARNING: NS_ENSURE_TRUE(mCallback) failed: file /home/dbaron/builds/ssd/mozilla-central/mozilla/content/base/src/nsFrameMessageManager.cpp, line 417
[Child 5600] WARNING: STORING THE FOLLOWING nsIXPCException:
======
[Exception... "'TypeError: can't access dead object' when calling method: [nsIPrompt::alert]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "JS frame :: http://mochi.test:8888/tests/dom/browser-element/mochitest/file_browserElement_SetVisibleFrames_Inner.html?child1 :: <TOP_LEVEL> :: line 13"  data: no]
======
: file /home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp, line 1105
JavaScript error: , line 0: Error:
[Child 5600] WARNING: STORING THE FOLLOWING nsIXPCException:
======
[Exception... "'Error' when calling method: [nsIDOMEventListener::handleEvent]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
======
: file /home/dbaron/builds/ssd/mozilla-central/mozilla/js/xpconnect/src/XPCWrappedJSClass.cpp, line 1105
And I confirmed that the exception is the root of the leak; changing:

-                XPCJSRuntime::Get()->SetPendingException(xpc_exception);
+                XPCJSRuntime::Get()->SetPendingException(nullptr);

near the end of nsXPCWrappedJSClass::CheckForException makes the leak go away.


It's not clear to me what the purpose of stuffing these exceptions into global variables is, or who's supposed to be responsible for taking them out of said global variables and actually reporting them.
Make this test less sensitive to the timing of its own unloading by
passing on only the expected messages and not any messages that occur
during unloading.  Such messages might cause exceptions (e.g.,
"'TypeError: can't access dead object' when calling method:
[nsIPrompt::alert]"), and those exceptions might in turn trigger leaks
in nsXPConnect that last until shutdown because we store the most recent
exception from a call to an XPCWrappedJS on the XPCJSRuntime through
shutdown (rather than reporting it, as we probably should, and then
freeing it), leaking the window.
Attachment #717679 - Flags: review?(justin.lebar+bug)
Assignee: nobody → dbaron
This fixes the leak reported in bug 842476.

This approach seems to make sense to me, based on my understanding of
the purpose of storing the exception on the XPCJSRuntime, as explained
in the comment.  But I'm far from an expert on this code, so please
review carefully.
Attachment #717747 - Flags: review?(bobbyholley+bmo)
This refactors the computation of |reportable| by replacing some (but
not all) of the use of |if| repitition with logical operators.  It's
tempting to refactor more, though.
Attachment #717748 - Flags: review?(bobbyholley+bmo)
This fixes the leak reported in bug 842476.

This approach seems to make sense to me, based on my understanding of
the purpose of storing the exception on the XPCJSRuntime, as explained
in the comment.  But I'm far from an expert on this code, so please
review carefully.

(Is it guaranteed that the JS caller is in fact an XPConnect caller that
will do something with this exception?  If not, is there a better test?)
Attachment #717749 - Flags: review?(bobbyholley+bmo)
Attachment #717747 - Attachment is obsolete: true
Attachment #717747 - Flags: review?(bobbyholley+bmo)
I'm also still pretty disturbed that the exception never got printed to stdout (without debugging patches that I added).  The reason it didn't appears to be this chunk of code in XPCWrappedJSClass.cpp:

            // Try to use the error reporter set on the context to handle this
            // error if it came from a JS exception.
            if (reportable && is_js_exception &&
                JS_GetErrorReporter(cx) != xpcWrappedJSErrorReporter) {
                reportable = !JS_ReportPendingException(cx);
            }

which changes |reportable| from true to false, while in some cases reporting nothing to stdout, or in other cases reporting merely:
JavaScript error: , line 0: Error:
(as in comment 12), despite better information being available.
Comment on attachment 717679 [details] [diff] [review]
Work around leak resulting from exception that happens during unload by propagating only the prompts we expect and not the later ones.

This is totally fine, but might you want to leave it as-is to exercise your XPCOM patches?  That's not an ideal way to test the change, but perhaps it's better than nothing.

Do you think we should take your XPCOM changes on branches, so as to avoid this leak there?  I'm particularly concerned about b2g18.

Thanks a lot for tracking this down.
Attachment #717679 - Flags: review?(justin.lebar+bug) → review+
> Do you think we should take your XPCOM changes on branches, so as to avoid this leak there?

To be clear, I understand that the potential to leak here exists without the patches in bug 404077; is that right?
Whiteboard: [MemShrink:P3] → [MemShrink]
(In reply to Justin Lebar [:jlebar] from comment #19)
> Comment on attachment 717679 [details] [diff] [review]
> Work around leak resulting from exception that happens during unload by
> propagating only the prompts we expect and not the later ones.
> 
> This is totally fine, but might you want to leave it as-is to exercise your
> XPCOM patches?  That's not an ideal way to test the change, but perhaps it's
> better than nothing.

I think it's better to take this; I'd rather the test not throw exceptions.

> Do you think we should take your XPCOM changes on branches, so as to avoid
> this leak there?  I'm particularly concerned about b2g18.

I'm not that concerned about the leak; it potentially means holding on to a window a little longer than necessary, but it's only one window; it entrains at worst the window on which we've most recently executed JS.  If we run JS elsewhere, it'll fix itself.
> it entrains at worst the window on which we've most recently executed JS.  If we run JS elsewhere, 
> it'll fix itself.

Does it fix itself as soon as we run JS elsewhere, or as soon as such JS throws an exception?  If it's the former, I understand how this could be a shutdown leak unless this particular test was the last one we ran.  But that doesn't seem possible, since it's not lexicographically the last test in its directory.
(In reply to Justin Lebar [:jlebar] from comment #22)
> > it entrains at worst the window on which we've most recently executed JS.  If we run JS elsewhere, 
> > it'll fix itself.
> 
> Does it fix itself as soon as we run JS elsewhere, or as soon as such JS
> throws an exception?

Roughly the former, as long as XPConnect is involved.  XPConnect seems to clear its pending exception (XPCJSRuntime::Get()->SetPendingException(nullptr)) in a bunch of places, though maybe not everywhere it runs JS.

> If it's the former, I understand how this could be a
> shutdown leak unless this particular test was the last one we ran.  But that
> doesn't seem possible, since it's not lexicographically the last test in its
> directory.

The leak reported is a leak in a child process, which is specific to the test.
> The leak reported is a leak in a child process, which is specific to the test.

Aha, I see.  It used to be that we held ran all these tests in a single child process, but we changed our behavior in b2g a while ago, and it affected our tests as well.

Thanks!
Whiteboard: [MemShrink] → [MemShrink:P3]
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb5a9401e2d4
Whiteboard: [MemShrink:P3] → [MemShrink:P3][leave open]
Comment on attachment 717748 [details] [diff] [review]
Refactor tests to use boolean logic rather than if-chains.

Review of attachment 717748 [details] [diff] [review]:
-----------------------------------------------------------------

r=bholley
Attachment #717748 - Flags: review?(bobbyholley+bmo) → review+
Flags: needinfo?(bobbyholley+bmo)
Comment on attachment 717749 [details] [diff] [review]
XPCWrappedJS calls should only store exception when there's a JS caller.

Review of attachment 717749 [details] [diff] [review]:
-----------------------------------------------------------------

This seems strictly better than what we do now. Whether it's entirely correct, I don't really know. I don't grok most of XPCThrower, and I'm trying to avoid doing so until all of the DOM has been moved to new-bindings and I can proceed with simplifying it without worry about web-compat.

If you want to land this now, feel free. If you want to get to the bottom of the matter and figure out the most correct solution, I'd suggest roping in mrbkap or bent.
Attachment #717749 - Flags: review?(bobbyholley+bmo) → review+
Component: DOM: Mozilla Extensions → DOM
So I never landed this because I wasn't particularly confident in the patch.

However, https://hg.mozilla.org/mozilla-central/rev/6fc284339ae3 from bug 1275704 removed some of the code that this was modifying (I think on the assumption that having a scripted caller is an unusual case).

The conclusion I had here, though, was that we would leak an exception if we called XPCJSRuntime::Get()->SetPendingException() when we didn't have a scripted caller.

So at this point I wonder if we should be calling XPCJSRuntime::Get()->SetPendingException() at all.
Flags: needinfo?(bzbarsky)
Having a scripted caller in the JS_DescribeScriptedCaller sense _was_ an unusual case here until we just ripped out saved frame chains: it could only happen when the XPCWrappedJS was wrapping an object that came from a window (not a JS component), _and_ we were doing a JS -> C++ -> JS call where both of the "JS" bits came from the same window, more or less.  A try run I did that fatally asserted when JS_DescribeScriptedCaller returned true showed this happening only in a small number of tests and one place in browser code.

But for handling the exception on the XPCJSRuntime, it doesn't matter what JS_DescribeScriptedCaller reports.  What matters is whether as the stack unwinds we hit either dom::Throw or XPCThrower::CheckForPendingException.  And even then it's a guessing game as to whether either one is really trying to get our exception; hence the checks for the nsresult matching in both cases.  :(

Anyway, the XPCThrower::CheckForPendingException case will be hit whenever we have a JS -> C++ call via XPCWrappedNative.  The dom::Throw case will be hit when we have a JS -> C++ call via WebIDL bindings _and_ the C++ throws an nsresult exception instead of a TypeError or whatnot.  If C++ just calls into an XPCWrappedJS, we're kinda out of luck until the next one of those two things happening.  This is all horribly fragile and broken, but I don't have terribly bright ideas about how to fix it short of having explicit propagation of the actual exception (e.g. via ErrorResult) instead of stashing it somewhere and propagating the nsresult.  Basically, change all the signatures for all XPCOM methods.  ;)

Maybe the right answer that can be implemented in finite time and will fix the leaks in practice is just that whenever the event loop spins we should clear the pending exception on the XPCJSRuntime, if any.  I can't think of any good reasons to have values hang out there across the event loop spinning....  For example, we could clear it out in XPCJSRuntime::AfterProcessTask.
Flags: needinfo?(bzbarsky)
Component: DOM → DOM: Core & HTML
Assignee: dbaron → nobody
You need to log in before you can comment on or make changes to this bug.