Closed Bug 1238707 Opened 4 years ago Closed 4 years ago

[e10s] Frequent shutdown leak default process: 2482603 bytes leaked (AnimationTimeline, AsyncLatencyLogger, AsyncStatement, AtomImpl, BackstagePass, ...) on Win7 and OSX

Categories

(Core :: DOM: Service Workers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox46 --- wontfix
firefox47 + fixed
firefox48 --- fixed

People

(Reporter: jgriffin, Assigned: bkelly)

References

(Blocks 3 open bugs)

Details

(Keywords: memory-leak, Whiteboard: [e10s-orangeblockers] btpp-followup-2016-03-04)

Attachments

(2 files, 1 obsolete file)

On Windows 7 debug, mochitest-plain chunk 4 in e10s mode frequently suffers from a shutdown leak, default process: 2482603 bytes leaked (AnimationTimeline, AsyncLatencyLogger, AsyncStatement, AtomImpl, BackstagePass, ...) 

example log: https://treeherder.mozilla.org/logviewer.html#?job_id=15286412&repo=try

This is similar to but distinct from bug 1186675.
Flags: needinfo?(mrbkap)
I think I get this message when this happens:
WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!
Blocks: 933741
(In reply to Milan Sreckovic [:milan] from comment #1)
> I think I get this message when this happens:
> WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything
> alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!

That's just a symptom of a huge leak like this. If you leak tons of pages, then you can end up not shutting down the JS engine correctly, which then prints out this message.
I think RyanVM already filed something for this, and bkelly has been needinfo'd on that.
Flags: needinfo?(mrbkap)
Oh, this is older than that.
Flags: needinfo?(bkelly)
Duplicate of this bug: 1252241
Summary: [e10s] Frequent shutdown leak default process: 2482603 bytes leaked (AnimationTimeline, AsyncLatencyLogger, AsyncStatement, AtomImpl, BackstagePass, ...) → [e10s] Frequent shutdown leak default process: 2482603 bytes leaked (AnimationTimeline, AsyncLatencyLogger, AsyncStatement, AtomImpl, BackstagePass, ...) on Win7 and OSX
Whiteboard: btpp-followup-2016-03-04
Digging out from PTO and my windows machine is in a box at the moment.  I hope to get that setup over the weekend so I can look at these leak bugs next week.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8add5c2a0841
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(bkelly)
See Also: → 1186675
FWIW, there's a try run from kats from last week that shows this issue:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=9cd7ed658735&selectedJob=17714419
(In reply to Ben Kelly [:bkelly] from comment #7)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8add5c2a0841

FYI, you left a global Windows debug e10s skip-if in the top of the manifest.
(In reply to Andrew McCreight [:mccr8] from comment #8)
> FWIW, there's a try run from kats from last week that shows this issue:

You also might like the Ash branch, where I've got e10s tests running across all desktop platforms now. Plan is to keep it in sync with m-c, but haven't gotten that part automated yet.
https://treeherder.mozilla.org/#/jobs?repo=ash
There are known issues with http cache shutdown at the moment.  I'm inclined to wait for bug 1251130 to be fixed before spending time here.  I don't see any ServiceWorker objects in the leaked list, but I do see http cache objects.
Depends on: 1251130
I guess the dates don't make sense there.
No longer depends on: 1251130
I'll look at this locally and do a basic CC log analysis.
tracking-e10s: --- → +
Windows debug serviceworker tests were disabled across the board due to this bug. Interestingly, they never got disabled on OSX and appear to be running green on Ash. Might be worth a fresh Try push to see where we stand with them re-enabled on Windows too.
Whiteboard: btpp-followup-2016-03-04 → [e10s-orangeblockers] btpp-followup-2016-03-04
Yeah, I ran the test suite a few times locally on OSX and I wasn't able to reproduce the leak. I could look at CC logs from a Windows run if somebody generates them.
This will prevent us from running debug sw test suites on a number of platforms. Ben, can we get this assigned to someone and looked at?
Flags: needinfo?(bkelly)
(In reply to Jim Mathies [:jimm] from comment #16)
> This will prevent us from running debug sw test suites on a number of
> platforms. Ben, can we get this assigned to someone and looked at?

This is a major pain to work on when the whole platform is disabled in automation.  It would be helpful if we could enable the platform with the tests disabled.  We can then investigate the tests without fighting automation as well.
Flags: needinfo?(bkelly)
(In reply to Ben Kelly [:bkelly] from comment #17)
> (In reply to Jim Mathies [:jimm] from comment #16)
> > This will prevent us from running debug sw test suites on a number of
> > platforms. Ben, can we get this assigned to someone and looked at?
> 
> This is a major pain to work on when the whole platform is disabled in
> automation.  It would be helpful if we could enable the platform with the
> tests disabled.  We can then investigate the tests without fighting
> automation as well.

These suites are running and enabled on ash; feel free to work on debugging there.
Here's a Try push log with CC shutdown logging enabled. Sorry, Treeherder lost track of the jobs, but the log is still there.
http://archive.mozilla.org/pub/firefox/try-builds/ryanvm@gmail.com-be73f8e02191cbc098deb8fadcc2e617fb3d7bd3/try-win32-debug/try_win7-all-debug_test-mochitest-4-bm110-tests1-windows-build454.txt.gz

I believe that the cc-edges.2868 logs are the ones you want.
Flags: needinfo?(continuation)
These logs are weird. There are 7 CC logs. The first 5 don't have anything in the CC graph.

Anyways, all the stuff in those last two CC logs that I see that is being held alive from a single nsXPCWrappedJS (nsIBrowserDOMWindow), which has 1 missing reference. As far as I can see, there are only two places that hold nsIBrowserDOMWindow alive from C++. One is nsGlobalWindow, which is properly traced so that can't be it. The other is TabParent.

I also see some code related to nsIBrowserDOMWindow in OpenWindowRunnable::OpenWindow(), which is a service worker class, so maybe that is related? The test case is leaking for nsRunnable, so perhaps those are OpenWindowRunnable.
Flags: needinfo?(continuation)
Hmm, didn't we have some case where js is keeping nsITabParent alive too long. That would lead to
keeping mBrowserDOMWindow alive too long.

Looks like destroying TabParent is too messy atm.
The sanest please to clear the member variable is probably in
TabParent::Destroy() *before* if (mIsDestroyed) { check, so that we clear it even if the child process
has crashed and mIsDestroyed already set.
This implements Olli's suggestion from comment 21.  Ryan, can you see if this helps in automation?
Attachment #8735503 - Flags: feedback?(ryanvm)
Here's a try to see if it breaks any other existing tests:

   https://treeherder.mozilla.org/#/jobs?repo=try&revision=47c531053883
I think try is showing some issues with this patch.
It seems with this patch we are consistently hitting this on linux debug in M-e10s(2):

 12:42:20     INFO -  [Parent 4238] WARNING: '!browserDOMWin', file /builds/slave/try-l64-d-00000000000000000000/build/src/dom/ipc/ContentParent.cpp, line 5424
 12:42:20     INFO -  [Child 4296] ###!!! ASSERTION: attempted to open a new window with no WindowCreator: 'mWindowCreator', file /builds/slave/try-l64-d-00000000000000000000/build/src/embedding/components/windowwatcher/nsWindowWatcher.cpp, line 792
 12:43:06     INFO -  #01: nsWindowWatcher::OpenWindow2 [embedding/components/windowwatcher/nsWindowWatcher.cpp:453]
 12:43:06     INFO -  #02: nsGlobalWindow::OpenInternal [dom/base/nsGlobalWindow.cpp:11560]
 12:43:06     INFO -  #03: nsGlobalWindow::OpenNoNavigate [dom/base/nsGlobalWindow.cpp:7847]
 12:43:06     INFO -  #04: nsDocShell::InternalLoad [xpcom/string/nsTSubstring.h:95]
 12:43:06     INFO -  #05: nsDocShell::OnLinkClickSync [xpcom/string/nsTSubstring.h:95]
 12:43:06     INFO -  #06: OnLinkClickEvent::Run [docshell/base/nsDocShell.cpp:13496]
 12:43:06     INFO -  #07: nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:994]
 12:43:06     INFO -  #08: NS_ProcessNextEvent [xpcom/glue/nsThreadUtils.cpp:297]
 12:43:06     INFO -  #09: mozilla::ipc::MessagePump::Run [ipc/glue/MessagePump.cpp:98]
 12:43:06     INFO -  #10: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:235]
 12:43:06     INFO -  #11: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:520]
 12:43:06     INFO -  #12: nsBaseAppShell::Run [widget/nsBaseAppShell.cpp:158]
 12:43:06     INFO -  #13: XRE_RunAppShell [toolkit/xre/nsEmbedFunctions.cpp:801]
 12:43:06     INFO -  #14: mozilla::ipc::MessagePumpForChildProcess::Run [ipc/glue/MessagePump.cpp:263]
 12:43:06     INFO -  #15: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:235]
 12:43:06     INFO -  #16: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:520]
 12:43:06     INFO -  #17: XRE_InitChildProcess [toolkit/xre/nsEmbedFunctions.cpp:641]
 12:43:06     INFO -  #18: content_process_main [ipc/contentproc/plugin-container.cpp:240]
 12:43:06     INFO -  #19: libc.so.6 + 0x2176d
12:43:06 INFO - #20: _start
Comment on attachment 8735503 [details] [diff] [review]
Release the window immediately in TabParent::Destroy() to avoid leaks. r=smaug

I have good news and bad news. The good news is that it indeed appears to fix the leak-the-world situation in the serviceworkers tests!
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae444245de83&group_state=expanded

The bad news is that it appears there's another cache-related leak lurking behind those:
https://treeherder.mozilla.org/logviewer.html#?job_id=18936889&repo=try

I ran another push with leak stack walking enabled, and the leaked CacheEntry has the following stack:
19:00:51     INFO -  Serial Numbers of Leaked Objects:
19:00:51     INFO -  123 @1D2FB9D0 (1 references)
19:00:51     INFO -  allocation stack:
19:01:06     INFO -  #00: mozilla::net::CacheEntry::CacheEntry(nsACString_internal const &,nsIURI *,nsACString_internal const &,bool,bool,bool) [netwerk/cache2/CacheEntry.cpp:225]
19:01:06     INFO -  #01: mozilla::net::CacheStorageService::AddStorageEntry(nsACString_internal const &,nsIURI *,nsACString_internal const &,bool,bool,bool,bool,mozilla::net::CacheEntryHandle * *) [netwerk/cache2/CacheStorageService.cpp:1493]
19:01:06     INFO -  #02: mozilla::net::CacheStorageService::AddStorageEntry(mozilla::net::CacheStorage const *,nsIURI *,nsACString_internal const &,bool,mozilla::net::CacheEntryHandle * *) [netwerk/cache2/CacheStorageService.cpp:1411]
19:01:06     INFO -  #03: mozilla::net::CacheStorage::OpenTruncate(nsIURI *,nsACString_internal const &,nsICacheEntry * *) [netwerk/cache2/CacheStorage.cpp:131]
19:01:06     INFO -  #04: mozilla::net::nsHttpChannel::OpenCacheEntry(bool) [netwerk/protocol/http/nsHttpChannel.cpp:3110]
19:01:06     INFO -  #05: mozilla::net::nsHttpChannel::Connect() [netwerk/protocol/http/nsHttpChannel.cpp:380]
19:01:06     INFO -  #06: mozilla::net::nsHttpChannel::ContinueBeginConnectWithResult() [netwerk/protocol/http/nsHttpChannel.cpp:5560]
19:01:06     INFO -  #07: mozilla::net::nsHttpChannel::BeginConnect() [netwerk/protocol/http/nsHttpChannel.cpp:5463]
19:01:06     INFO -  #08: mozilla::net::nsHttpChannel::OnProxyAvailable(nsICancelable *,nsIChannel *,nsIProxyInfo *,nsresult) [netwerk/protocol/http/nsHttpChannel.cpp:5630]
19:01:06     INFO -  #09: nsAsyncResolveRequest::DoCallback() [netwerk/base/nsProtocolProxyService.cpp:253]
19:01:06     INFO -  #10: nsAsyncResolveRequest::OnQueryComplete(nsresult,nsCString const &,nsCString const &) [netwerk/base/nsProtocolProxyService.cpp:214]
19:01:06     INFO -  #11: ExecuteCallback::Run() [netwerk/base/nsPACMan.cpp:87]
19:01:06     INFO -  #12: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:994]
19:01:06     INFO -  #13: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/glue/nsThreadUtils.cpp:297]
19:01:06     INFO -  #14: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:98]
19:01:06     INFO -  #15: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:236]
19:01:06     INFO -  #16: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:230]
19:01:06     INFO -  #17: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:204]
19:01:06     INFO -  #18: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:158]
19:01:06     INFO -  #19: nsAppShell::Run() [widget/windows/nsAppShell.cpp:264]
19:01:06     INFO -  #20: nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:282]
19:01:06     INFO -  #21: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4342]
19:01:06     INFO -  #22: XREMain::XRE_main(int,char * * const,nsXREAppData const *) [toolkit/xre/nsAppRunner.cpp:4439]
19:01:06     INFO -  #23: XRE_main [toolkit/xre/nsAppRunner.cpp:4545]
19:01:06     INFO -  #24: do_main [browser/app/nsBrowserApp.cpp:220]
19:01:06     INFO -  #25: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:360]
19:01:06     INFO -  #26: wmain [toolkit/xre/nsWindowsWMain.cpp:138]
19:01:06     INFO -  #27: __scrt_common_main_seh [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:264]
19:01:06     INFO -  #28: kernel32.dll + 0x53c45
19:01:06     INFO -  #29: ntdll.dll + 0x637f5
19:01:06     INFO -  #30: ntdll.dll + 0x637c8

19:00:51     INFO -  Leaked URLs:
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/synthesized-redirect-twice-real-file.txt
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/index.html
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/fetch_worker_script.js
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/fetch_worker_script.js
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/synthesized-redirect-real-file.txt
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/synthesized-redirect-twice-real-file.txt
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/fetch_worker_script.js
19:00:51     INFO -    http://mochi.test:8888/tests/dom/workers/test/serviceworkers/fetch/synthesized-redirect-twice-real-file.txt
Attachment #8735503 - Flags: feedback?(ryanvm) → feedback+
Flags: needinfo?(bkelly)
Let's see if this fixes the issue from the previous try:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=a0340b4eca8d

I'm also looking at the revealed leak as well.  Its more interesting in that the leaked URLs are mainly from a single test case.  That test case performs a number of redirects which bounces back and forth between parent and child.  It would not surprise me if we had some corner case in there that did not close a cache entry properly.
Flags: needinfo?(bkelly)
Olli, this implements your suggestion from comment 21.

It also adjusts the error value returned if code tries to open a window after we clear the reference.  We need to use NS_ERROR_ABORT in order to signal a hard failure back to the nsWindowWatcher.  It checks for this value here:

  https://dxr.mozilla.org/mozilla-central/source/embedding/components/windowwatcher/nsWindowWatcher.cpp#738

Without adjusting the error value it would continue on its way and later hit the mWindowCreator assertion at:

  https://dxr.mozilla.org/mozilla-central/source/embedding/components/windowwatcher/nsWindowWatcher.cpp#792
Attachment #8735503 - Attachment is obsolete: true
Attachment #8738143 - Flags: review?(bugs)
It would be nice to backport your patch to at least Aurora, in case it is responsible for some of the mysterious high memory usage we're seeing with e10s enabled on the beta experiments. (We're pretty late in beta, so it probably isn't worth the risk to get it there.)
[Tracking Requested - why for this release]:
See comment 29.
Attachment #8738143 - Flags: review?(bugs) → review+
Blocks: 1262224
Comment on attachment 8738219 [details] [diff] [review]
Remove blanket disable of serviceworker mochitests on windows debug e10s. r=ryanvm

Review of attachment 8738219 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. We can handle the individual subtests in the follow-up bug you're filing for the Necko leak.
Attachment #8738219 - Flags: review?(ryanvm) → review+
The additional leak from comment 26 will be handled in bug 1262224.
https://hg.mozilla.org/mozilla-central/rev/281636a40854
https://hg.mozilla.org/mozilla-central/rev/5ea9c875d871
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment on attachment 8738143 [details] [diff] [review]
Release the window immediately in TabParent::Destroy() to avoid leaks. r=smaug

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: Memory leaks in automation due to windows being leaked by tabs that are not shut down cleanly.  Comment 29 suggests it might be useful to have this fix since the beta e10s experiments showed increased OOMs.
[Describe test coverage new/current, TreeHerder]: Existing mochitests and leak checks.
[Risks and why]: Minimal.
[String/UUID change made/needed]: None.
Attachment #8738143 - Flags: approval-mozilla-aurora?
Comment on attachment 8738143 [details] [diff] [review]
Release the window immediately in TabParent::Destroy() to avoid leaks. r=smaug

Mem leak fix that might help reduce e10s OOMs, Aurora47+
Attachment #8738143 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.