Last Comment Bug 775676 - Leak of 1 or more nsXULWindows when running dom/tests/mochitests/bugs on Windows
: Leak of 1 or more nsXULWindows when running dom/tests/mochitests/bugs on Windows
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: All Windows XP
: -- normal (vote)
: mozilla17
Assigned To: Justin Lebar (not reading bugmail)
:
Mentors:
Depends on: 777063
Blocks: 769254 778355
  Show dependency treegraph
 
Reported: 2012-07-19 11:41 PDT by Justin Lebar (not reading bugmail)
Modified: 2012-07-27 16:44 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Fix timer-based leak in nsWebShellWindow. (2.67 KB, patch)
2012-07-20 09:31 PDT, Justin Lebar (not reading bugmail)
justin.lebar+bug: review-
Details | Diff | Splinter Review
Patch, v2 (4.76 KB, patch)
2012-07-20 10:28 PDT, Justin Lebar (not reading bugmail)
no flags Details | Diff | Splinter Review
Patch v2.5 - Use a strong ref (4.65 KB, patch)
2012-07-22 12:26 PDT, Justin Lebar (not reading bugmail)
roc: review+
Details | Diff | Splinter Review

Description Justin Lebar (not reading bugmail) 2012-07-19 11:41:24 PDT
My patch in bug 769254 was backed out due to windows mochitest-3 leaks (see bug 769254 comment 24).

I was able to reproduce the leak locally, and I figured out that the leaking set of tests is dom/tests/mochitests/bugs.  But then I discovered that when I run that directory's tests /without/ my changes, it still leaks!

So it appears that somehow bug 769254 is tickling the directory's existing leaks into turning permaorange, perhaps by changing which tests are part of mochitest-3.

We really need to figure this out, because bug 769254 fixes a critical browser API bug, and it's blocking other fixes.
Comment 1 Justin Lebar (not reading bugmail) 2012-07-19 11:42:04 PDT
I should add that I think the leak has to do with showModalDialog.
Comment 2 Justin Lebar (not reading bugmail) 2012-07-19 12:15:31 PDT
When I run dom/tests/mochitests, I don't leak, but when I run dom/tests/mochitests/bugs, I do leak.  So it seems that something is not getting cleaned up as quickly as it should be.
Comment 3 Justin Lebar (not reading bugmail) 2012-07-19 12:18:39 PDT
Here's an example of the leak I see locally (and on tinderbox):

>nsTraceRefcntImpl::DumpStatistics: 1092 entries
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 3816 bytes during test execution
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 10 instances of Mutex with size 12 bytes each (120 bytes total)
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of ReentrantMonitor with size 16 bytes
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 9 instances of nsStringBuffer with size 8 bytes each (72 bytes total)
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 10 instances of nsTArray_base with size 4 bytes each (40 bytes total)
>TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of nsThread with size 112 bytes
>TEST-INFO | automationutils.processLeakLog() | leaked 9 instances of nsTimerImpl with size 72 bytes each (648 bytes total)
>TEST-INFO | automationutils.processLeakLog() | leaked 9 instances of nsVoidArray with size 4 bytes each (36 bytes total)
>TEST-INFO | automationutils.processLeakLog() | leaked 9 instances of nsWebShellWindow with size 164 bytes each (1476 bytes total)
>TEST-INFO | automationutils.processLeakLog() | leaked 9 instances of nsXULWindow with size 144 bytes each (1296 bytes total)
Comment 4 Justin Lebar (not reading bugmail) 2012-07-20 09:01:45 PDT
At the risk of jinxing this, I think I figured it out:

nsWebShellWindow::SetPersistenceTimer addrefs this, and releases it on a timer.

In theory, nsWebShellWindow::Destroy cancels the timer and releases this.  But what appears to be happening is SetPersistenceTimer happens /again/ after the first call to destroy.  Now the window doesn't get released until the timer fires.
Comment 5 Justin Lebar (not reading bugmail) 2012-07-20 09:31:31 PDT
Created attachment 644369 [details] [diff] [review]
Fix timer-based leak in nsWebShellWindow.
Comment 6 Justin Lebar (not reading bugmail) 2012-07-20 09:35:03 PDT
Comment on attachment 644369 [details] [diff] [review]
Fix timer-based leak in nsWebShellWindow.

Actually, this isn't right.  It doesn't leak, but a test fails with this change.
Comment 7 Justin Lebar (not reading bugmail) 2012-07-20 10:28:41 PDT
Created attachment 644387 [details] [diff] [review]
Patch, v2
Comment 8 Justin Lebar (not reading bugmail) 2012-07-20 10:34:19 PDT
Comment on attachment 644387 [details] [diff] [review]
Patch, v2

I have no idea why we're doing this thing off a 500ms timer here; that seems supremely bogus.  But at least it doesn't leak anymore...
Comment 9 Justin Lebar (not reading bugmail) 2012-07-20 10:42:43 PDT
https://tbpl.mozilla.org/?tree=Try&rev=aedaec1bd709
Comment 10 Justin Lebar (not reading bugmail) 2012-07-20 15:18:38 PDT
The relevant test suite is green on winxp try (with the patches in bug 769254 which used to turn it orange).
Comment 11 Justin Lebar (not reading bugmail) 2012-07-20 21:36:43 PDT
To add on to our IRC conversation about this: Using a weak ref in the timer doesn't seem to be causing problems, but if you want to be conservative, I'd be happy to switch to a strong ref.  I certainly don't understand this code well enough to argue that a weak ref is correct beyond reasonable doubt -- I don't even understand why we're using a timer in the first place.
Comment 12 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-21 06:46:16 PDT
I assume we're using a timer to batch the property updates and to keep them off the critical path of event handling.

What did you find out about SetPersistenceTimer being called during or after Destroy()?

It seems to me that either a SetPersistenceTimer after/during Destroy needs to ensure SavePersistenceAttributes happens, or it does not. If it does, then your patch is incorrect when the nsWebShellWindow is released before the timer fires. If it does not, then we could just add a boolean flag to note whether Destroy has been called and if it has, make SetPersistenceTimer a no-op.
Comment 13 Justin Lebar (not reading bugmail) 2012-07-21 07:57:37 PDT
> What did you find out about SetPersistenceTimer being called during or after Destroy()?

It is almost certainly being called off the main thread.  The three sequential stacks look like this:

release
  xul.dll!nsCOMPtr<nsIXULWindow>::~nsCOMPtr<nsIXULWindow>()
  xul.dll!nsCOMPtr<nsIXULWindow>::Assert_NoQueryNeeded()
  xul.dll!nsCOMPtr<nsIXULWindow>::nsCOMPtr<nsIXULWindow>()
  xul.dll!nsXULWindow::Destroy()
  xul.dll!nsWebShellWindow::Destroy()
  xul.dll!nsContentTreeOwner::Destroy()
  xul.dll!nsGlobalWindow::ReallyCloseWindow()
  xul.dll!nsCloseEvent::Run()
  xul.dll!nsThread::ProcessNextEvent()
  xul.dll!NS_ProcessNextEvent_P()
  xul.dll!mozilla::ipc::MessagePump::Run()
  xul.dll!MessageLoop::RunInternal()
  xul.dll!MessageLoop::RunHandler()
  xul.dll!MessageLoop::Run()
  xul.dll!nsBaseAppShell::Run()
  xul.dll!nsAppShell::Run()
  xul.dll!nsAppStartup::Run()
  xul.dll!XREMain::XRE_mainRun()
  xul.dll!XREMain::XRE_main()
  xul.dll!XRE_main()
  firefox.exe!do_main()
  firefox.exe!NS_internal_main()
  firefox.exe!wmain()
  firefox.exe!__tmainCRTStartup()
  firefox.exe!wmainCRTStartup()
  kernel32.dll!769ced6c()

addref
  xul.dll!nsWebShellWindow::SetPersistenceTimer()
  xul.dll!nsWebShellWindow::HandleEvent()
  xul.dll!nsWindow::DispatchEvent()
  xul.dll!nsWindow::DispatchWindowEvent()
  xul.dll!nsWindow::OnMove()
  xul.dll!nsWindow::ProcessMessage()
  xul.dll!nsWindow::WindowProcInternal()
  xul.dll!CallWindowProcCrashProtected()
  xul.dll!nsWindow::WindowProc()
  user32.dll!764fc4e7()

addref
  xul.dll!nsXULWindow::QueryInterface()
  xul.dll!nsWebShellWindow::QueryInterface()
  xul.dll!nsChromeTreeOwner::GetInterface()
  xul.dll!NS_InvokeByIndex_P()
  xul.dll!CallMethodHelper::Invoke()
  xul.dll!CallMethodHelper::Call()
  xul.dll!XPCWrappedNative::CallMethod()
  xul.dll!XPC_WN_CallMethod()
  mozjs.dll!js::CallJSNative()
  mozjs.dll!js::InvokeKernel()
  mozjs.dll!js::Interpret()
  mozjs.dll!js::RunScript()
  mozjs.dll!js::InvokeKernel()
  mozjs.dll!js::Invoke()
  mozjs.dll!js::Invoke()
  mozjs.dll!JS_CallFunctionValue()
  xul.dll!nsJSContext::CallEventHandler()
  xul.dll!nsJSEventListener::HandleEvent()
  xul.dll!nsEventListenerManager::HandleEventSubType()
  xul.dll!nsEventListenerManager::HandleEventInternal()
  xul.dll!nsEventListenerManager::HandleEvent()
  xul.dll!nsEventTargetChainItem::HandleEvent()
  xul.dll!nsEventTargetChainItem::HandleEventTargetChain()
  xul.dll!nsEventDispatcher::Dispatch()
  xul.dll!DocumentViewerImpl::PageHide()
  xul.dll!nsDocShell::FirePageHideNotification()
  xul.dll!nsDocShell::Destroy()
  xul.dll!nsXULWindow::Destroy()
  xul.dll!nsWebShellWindow::Destroy()
  xul.dll!nsContentTreeOwner::Destroy()
  xul.dll!nsGlobalWindow::ReallyCloseWindow()
  xul.dll!nsCloseEvent::Run()
  xul.dll!nsThread::ProcessNextEvent()
  xul.dll!NS_ProcessNextEvent_P()
  xul.dll!mozilla::ipc::MessagePump::Run()
  xul.dll!MessageLoop::RunInternal()
  xul.dll!MessageLoop::RunHandler()
  xul.dll!MessageLoop::Run()
  xul.dll!nsBaseAppShell::Run()
  xul.dll!nsAppShell::Run()
  xul.dll!nsAppStartup::Run()
  xul.dll!XREMain::XRE_mainRun()
  xul.dll!XREMain::XRE_main()
  xul.dll!XRE_main()
  firefox.exe!do_main()
  firefox.exe!NS_internal_main()
  firefox.exe!wmain()
  firefox.exe!__tmainCRTStartup()
  firefox.exe!wmainCRTStartup()
  kernel32.dll!769ced6c()

The top and bottom stacks are clearly the main thread.  Destroy starts with two addrefs in RemoveProgressListener, so this final stack was from a new call to Destroy, it wouldn't look like this.
 
Unfortunately I don't know what test this is from, so I don't know what the JS code is doing here; I just know it's one of the tests in dom/tests/mochitest/bugs.

> It seems to me that either a SetPersistenceTimer after/during Destroy needs to ensure 
> SavePersistenceAttributes happens, or it does not. If it does, then your patch is incorrect 
> when the nsWebShellWindow is released before the timer fires.

A third possibility is that a SetPersistenceTimer after/during Destroy needs to ensure that SavePersistenceAttributes happens, if, at the point that its timer runs, the nsWebShellWindow is still alive.

The test that failed with patch v1 was reading the height of a window.opened() window.  It seems like what was happening is that someone set the height during Destroy (perhaps in an onunload handler?).  But if the window had a refcount of 0, then obviously nobody can read its height.

> If it does not, then we could 
> just add a boolean flag to note whether Destroy has been called and if it has, make 
> SetPersistenceTimer a no-op.

This is patch v1, and fails a test, so this is not the case.

Anyway, it seems you'd be more comfortable with a strong ref here?  I'm totally cool with that.
Comment 14 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-21 08:27:53 PDT
The second addref stack shows how the call to nsXULWindow::Destroy from nsWebShellWindow::Destroy can fire events that trigger SetPersistenceTimer. In your patch 1, you set mDestroyed before nsXULWindow::Destroy. What if you set mDestroyed and cancel the timer *after* nsXULWindow::Destroy? That would make sure, in the case of the second addref stack, we actually persist the data.
Comment 15 Justin Lebar (not reading bugmail) 2012-07-21 23:05:26 PDT
>  What if you set mDestroyed and cancel the timer *after* nsXULWindow::Destroy? That would make sure, 
> in the case of the second addref stack, we actually persist the data.

If I understand correctly, that wouldn't solve the original problem: If we call SetPersistenceTimer after we Destroy releases mSPTimer, then SetPersistenceTimer creates a new mSPTimer and addref's this, and Destroy never releases that addref.
Comment 16 Justin Lebar (not reading bugmail) 2012-07-22 12:26:07 PDT
Created attachment 644777 [details] [diff] [review]
Patch v2.5 - Use a strong ref

Pick your poison.

An alternate alternative would be to addref this before InitWithFuncCallback if mSPTimer.closure != NULL, and then to release this when the timer fires.  But that's effectively what this patch does, without the manual addref/release footgun.
Comment 17 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-22 14:59:18 PDT
(In reply to Justin Lebar [:jlebar] from comment #15)
> If I understand correctly, that wouldn't solve the original problem: If we
> call SetPersistenceTimer after we Destroy releases mSPTimer, then
> SetPersistenceTimer creates a new mSPTimer and addref's this, and Destroy
> never releases that addref.

No, I'm suggesting patch 1 plus moving the mSPTimer-related code in nsWebShellWindow::Destroy to happen after nsXULWindow::Destroy.
Comment 18 Justin Lebar (not reading bugmail) 2012-07-22 15:23:17 PDT
> No, I'm suggesting patch 1 plus moving the mSPTimer-related code in nsWebShellWindow::Destroy to 
> happen after nsXULWindow::Destroy.

Oh, I see.  That sounds sane.

But given the trickiness with the manual addref/release here, I'd rather use an approach which obviously doesn't leak, all things being equal.  What's the advantage of the approach you suggested?
Comment 19 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-22 16:47:10 PDT
Patch 2 seems a bit less robust than what I'm suggesting. I think it's much nicer to have SetPersistenceTimer always fail (or always succeed) if called after nsWebShellWindow::Destroy, rather than sometimes succeed and sometimes fail.

How does patch 2.5 avoid a leak if we shut down before the timer can fire? And it can't really make SetPersistenceTimer always succeed when called after Destroy, since shutdown would prevent it.
Comment 20 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-22 16:50:17 PDT
where by "succeed" I mean "SavePersistenceAttributes happens"
Comment 21 Justin Lebar (not reading bugmail) 2012-07-22 19:46:30 PDT
> I think it's much nicer to have SetPersistenceTimer always fail (or always succeed) if called 
> after nsWebShellWindow::Destroy, rather than sometimes succeed and sometimes fail.

My thinking is that, if nobody can /observe/ that SetPersistenceTimer failed, then it doesn't matter if it failed.  This observer might be held alive by the nsWebShellWindow, or it might be held alive by someone else.  But since when we leak these nsWebShellWindows, we don't transitively leak anything interesting, this observer must be held alive by someone else.  So then the question is: Does this observer also hold a strong reference to the nsWebShellWindow?  If so, then the weak reference is safe.

I don't understand this code well enough to claim that the weak ref is safe for this reason, though, and it's exactly this kind of complex logic I'm trying to avoid here, so I'm convinced that the weak ref isn't the way to go.

> How does patch 2.5 avoid a leak if we shut down before the timer can fire?

On shutdown we cancel all outstanding timers.  (TimerThread::Shutdown, called by nsTimerImpl::Shutdown, called in nsXPComInit.cpp::ShutdownXPCOM.)

> And it can't really make SetPersistenceTimer always [cause a call to 
> SavePersistenceAttributes] when called after Destroy, since shutdown would prevent it.

I must not understand this part -- surely shutdown can prevent the call to SavePersistenceAttributes in the current scheme as well.  If we shut down before the timer goes off, then we don't fire the timer, in either scheme.
Comment 22 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2012-07-22 20:36:56 PDT
Comment on attachment 644777 [details] [diff] [review]
Patch v2.5 - Use a strong ref

Review of attachment 644777 [details] [diff] [review]:
-----------------------------------------------------------------

OK
Comment 23 Justin Lebar (not reading bugmail) 2012-07-23 07:49:57 PDT
> OK

Thanks!

https://hg.mozilla.org/integration/mozilla-inbound/rev/4e95ccf43b6e
Comment 24 Ed Morley [:emorley] 2012-07-24 03:02:51 PDT
https://hg.mozilla.org/mozilla-central/rev/4e95ccf43b6e
Comment 25 Joshua Cranmer [:jcranmer] 2012-07-24 16:30:49 PDT
This patch is causing crashes in Thunderbird's compose window and is playing major havoc with our tests.

Note You need to log in before you can comment on or make changes to this bug.