Closed Bug 1401883 Opened 7 years ago Closed 6 years ago

Intermittent leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)

Categories

(Core :: Widget, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- fixed
firefox60 --- wontfix
firefox61 --- fixed
firefox62 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: froydnj)

References

Details

(Keywords: intermittent-failure, memory-leak, Whiteboard: [stockwell disable-recommended])

Attachments

(3 files)

Filed by: archaeopteryx [at] coole-files.de https://treeherder.mozilla.org/logviewer.html#?job_id=132405319&repo=autoland https://queue.taskcluster.net/v1/task/alkWkeAoRoqKt4aUNZi1Hg/runs/0/artifacts/public/logs/live_backing.log 06:58:46 INFO - 365 INFO SimpleTest START 06:58:46 INFO - 366 INFO TEST-START | layout/mathml/tests/test_disabled_chrome.html 06:58:46 INFO - GECKO(10188) | ++DOMWINDOW == 20 (000001CECBC32000) [pid = 10188] [serial = 20] [outer = 000001CECBC19800] 06:58:47 INFO - GECKO(10188) | ++DOMWINDOW == 21 (000001CEC5B18800) [pid = 10188] [serial = 21] [outer = 000001CECBC19800] 06:58:47 INFO - GECKO(10188) | ++DOCSHELL 000001CEC5B20000 == 8 [pid = 10188] [id = {3e43cb67-9bdd-48c0-bbed-a203c015ec46}] 06:58:47 INFO - GECKO(10188) | ++DOMWINDOW == 22 (000001CEC5B20800) [pid = 10188] [serial = 22] [outer = 0000000000000000] 06:58:47 INFO - GECKO(10188) | ++DOMWINDOW == 23 (000001CEC5B21000) [pid = 10188] [serial = 23] [outer = 000001CEC5B20800] 06:58:47 INFO - GECKO(10188) | ++DOCSHELL 000001CEC5B2C000 == 9 [pid = 10188] [id = {684bcc33-2689-4d3b-95d1-a6f856936448}] 06:58:47 INFO - GECKO(10188) | ++DOMWINDOW == 24 (000001CEC5B2C800) [pid = 10188] [serial = 24] [outer = 0000000000000000] 06:58:47 INFO - GECKO(10188) | ++DOMWINDOW == 25 (000001CEC5B2E800) [pid = 10188] [serial = 25] [outer = 000001CEC5B2C800] 06:58:47 INFO - GECKO(10188) | [10188, Main Thread] WARNING: NS_ENSURE_TRUE(aSecondURI) failed: file z:/build/build/src/dom/base/ThirdPartyUtil.cpp, line 98 06:58:48 INFO - GECKO(10188) | ++DOMWINDOW == 26 (000001CEC5E65000) [pid = 10188] [serial = 26] [outer = 000001CEC5B2C800] 06:58:48 INFO - GECKO(10188) | MEMORY STAT | vsize 1813MB | vsizeMaxContiguous 132289253MB | residentFast 294MB | heapAllocated 123MB 06:58:48 INFO - 367 INFO TEST-OK | layout/mathml/tests/test_disabled_chrome.html | took 1206ms 06:58:48 INFO - 368 INFO None369 INFO TEST-START | Shutdown 06:58:48 INFO - 370 INFO Passed: 3 06:58:48 INFO - 371 INFO Failed: 0 06:58:48 INFO - 372 INFO Todo: 0 06:58:48 INFO - 373 INFO Mode: non-e10s 06:58:48 INFO - 374 INFO Slowest: 1206ms - chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html 06:58:48 INFO - 375 INFO SimpleTest FINISHED 06:58:48 INFO - 376 INFO TEST-INFO | Ran 1 Loops 06:58:48 INFO - 377 INFO SimpleTest FINISHED 06:58:48 INFO - GECKO(10188) | --DOCSHELL 000001CEC17E0000 == 8 [pid = 10188] [id = {4bf49a39-dcd7-4ec5-aa0d-53cf4ff6ceaf}] 06:58:48 INFO - GECKO(10188) | *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping 06:58:49 INFO - GECKO(10188) | [10188, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file z:/build/build/src/xpcom/threads/nsThread.cpp, line 797 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEBD8D2800 == 7 [pid = 10188] [id = {d1d3a87d-3d81-46c0-ad9a-8f9f60f519d1}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC100C000 == 6 [pid = 10188] [id = {4b741d05-b6dc-4d23-a5f0-b2d0ed54d0ea}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC2705800 == 5 [pid = 10188] [id = {75b79781-89b2-494d-8491-17986e13e5d6}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CECB74A800 == 4 [pid = 10188] [id = {b53f2ea2-ba5a-4326-bbb4-41c3272c1e15}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC5B2C000 == 3 [pid = 10188] [id = {684bcc33-2689-4d3b-95d1-a6f856936448}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC1260000 == 2 [pid = 10188] [id = {f6f356a5-bcfe-44f3-9389-d4325d468de3}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC5B20000 == 1 [pid = 10188] [id = {3e43cb67-9bdd-48c0-bbed-a203c015ec46}] 06:58:49 INFO - GECKO(10188) | --DOCSHELL 000001CEC3AB3800 == 0 [pid = 10188] [id = {d84f3041-290a-49ba-b0d5-a4f7a62d012f}] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 25 (000001CEC1261800) [pid = 10188] [serial = 6] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 24 (000001CEC5E65000) [pid = 10188] [serial = 26] [outer = 0000000000000000] [url = http://mochi.test:8888/chrome/layout/mathml/tests/mathml_example_test.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 23 (000001CEC5B2E800) [pid = 10188] [serial = 25] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 22 (000001CEC5B2C800) [pid = 10188] [serial = 24] [outer = 0000000000000000] [url = http://mochi.test:8888/chrome/layout/mathml/tests/mathml_example_test.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 21 (000001CEBD8D3000) [pid = 10188] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanityparent.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 20 (000001CEBD8D4000) [pid = 10188] [serial = 2] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 19 (000001CEC4A32000) [pid = 10188] [serial = 15] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 18 (000001CEC49DA000) [pid = 10188] [serial = 14] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 17 (000001CEC5B21000) [pid = 10188] [serial = 23] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/mathml_example_test.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 16 (000001CEC5B20800) [pid = 10188] [serial = 22] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/mathml_example_test.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 15 (000001CEC100C800) [pid = 10188] [serial = 3] [outer = 0000000000000000] [url = resource://gre-resources/hiddenWindow.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 14 (000001CEC100D800) [pid = 10188] [serial = 4] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 13 (000001CEC5B18800) [pid = 10188] [serial = 21] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 12 (000001CEBDC59000) [pid = 10188] [serial = 11] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 11 (000001CECBC32000) [pid = 10188] [serial = 20] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 10 (000001CEC1260800) [pid = 10188] [serial = 5] [outer = 0000000000000000] [url = chrome://browser/content/browser.xul] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 9 (000001CECBC1C800) [pid = 10188] [serial = 19] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 8 (000001CECBC19800) [pid = 10188] [serial = 18] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 7 (000001CEC271C800) [pid = 10188] [serial = 10] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 6 (000001CEC2706800) [pid = 10188] [serial = 9] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html] 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 5 (000001CEC3ABE800) [pid = 10188] [serial = 13] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 4 (000001CECA3E0800) [pid = 10188] [serial = 17] [outer = 0000000000000000] [url = chrome://mochikit/content/harness.xul] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 3 (000001CEC3AB4800) [pid = 10188] [serial = 12] [outer = 0000000000000000] [url = chrome://mochikit/content/harness.xul] 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 2 (000001CEC17E0800) [pid = 10188] [serial = 8] [outer = 0000000000000000] [url = about:blank] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 1 (000001CEC8E4C800) [pid = 10188] [serial = 16] [outer = 0000000000000000] [url = http://mochi.test:8888/redirect.html] 06:58:50 INFO - GECKO(10188) | --DOMWINDOW == 0 (000001CEC1567000) [pid = 10188] [serial = 7] [outer = 0000000000000000] [url = resource://gre-resources/hiddenWindow.html] 06:58:50 INFO - GECKO(10188) | [10188, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file z:/build/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562 06:58:50 INFO - GECKO(10188) | nsStringStats 06:58:50 INFO - GECKO(10188) | => mAllocCount: 81528 06:58:50 INFO - GECKO(10188) | => mReallocCount: 6717 06:58:50 INFO - GECKO(10188) | => mFreeCount: 81528 06:58:50 INFO - GECKO(10188) | => mShareCount: 104859 06:58:50 INFO - GECKO(10188) | => mAdoptCount: 5495 06:58:50 INFO - GECKO(10188) | => mAdoptFreeCount: 5495 06:58:50 INFO - GECKO(10188) | => Process ID: 10188, Thread ID: 928 06:58:50 INFO - TEST-INFO | Main app process: exit 0 06:58:50 INFO - runtests.py | Application ran for: 0:00:08.954000 06:58:50 INFO - zombiecheck | Reading PID log: c:\users\genericworker\appdata\local\temp\tmpmpcq2zpidlog 06:58:50 INFO - Stopping web server 06:58:50 INFO - Stopping web socket server 06:58:50 INFO - Stopping ssltunnel 06:58:50 INFO - TEST-INFO | leakcheck | default process: leak threshold set at 0 bytes 06:58:50 INFO - TEST-INFO | leakcheck | plugin process: leak threshold set at 0 bytes 06:58:50 INFO - TEST-INFO | leakcheck | tab process: leak threshold set at 10000 bytes 06:58:50 INFO - TEST-INFO | leakcheck | geckomediaplugin process: leak threshold set at 20000 bytes 06:58:50 INFO - TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes 06:58:50 INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 10188 06:58:50 INFO - |<----------------Class--------------->|<-----Bytes------>|<----Objects---->| 06:58:50 INFO - | | Per-Inst Leaked| Total Rem| 06:58:50 INFO - 0 |TOTAL | 33 176| 1163016 2| 06:58:50 INFO - 446 |Mutex | 80 80| 1748 1| 06:58:50 INFO - 913 |nsBaseAppShell | 96 96| 1 1| 06:58:50 INFO - nsTraceRefcnt::DumpStatistics: 1464 entries 06:58:50 INFO - TEST-INFO | leakcheck | default process: leaked 1 Mutex 06:58:50 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell 06:58:50 ERROR - 378 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)
this is in mochitest-chrome and appears to be in many different directories- this makes it harder to track the leak down. :gbrown, can you help track this down?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=72fa9760204dcb1f3255db15c02f809ee9ac5aec The nsBaseAppShell dtor would normally be called like so: 22:15:48 INFO - GECKO(7484) | #01: nsBaseAppShell::Release() [widget/nsBaseAppShell.cpp:22] 22:15:48 INFO - GECKO(7484) | #02: nsAppStartup::~nsAppStartup() [toolkit/components/startup/nsAppStartup.h:46] 22:15:48 INFO - GECKO(7484) | #03: nsAppStartup::Release() [toolkit/components/startup/nsAppStartup.cpp:230] 22:15:48 INFO - GECKO(7484) | #04: ScopedXPCOMStartup::~ScopedXPCOMStartup() [toolkit/xre/nsAppRunner.cpp:1524] 22:15:48 INFO - GECKO(7484) | #05: XREMain::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4884] In the leak case https://public-artifacts.taskcluster.net/fTjvtO82TeqU_37kCWoImw/0/public/logs/live_backing.log nsBaseAppShell is created: 22:26:03 INFO - GECKO(8748) | #01: nsAppShellInit [widget/nsAppShellSingleton.h:39] 22:26:03 INFO - GECKO(8748) | #02: nsComponentManagerImpl::KnownModule::Load() [xpcom/components/nsComponentManager.cpp:764] 22:26:03 INFO - GECKO(8748) | #03: nsFactoryEntry::GetFactory() [xpcom/components/nsComponentManager.cpp:1785] 22:26:03 INFO - GECKO(8748) | #04: nsComponentManagerImpl::CreateInstance(nsID const &,nsISupports *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1001] 22:26:03 INFO - GECKO(8748) | #05: nsComponentManagerImpl::GetService(nsID const &,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1246] 22:26:03 INFO - GECKO(8748) | #06: nsGetServiceByCIDWithError::operator()(nsID const &,void * *) [xpcom/components/nsComponentManagerUtils.cpp:266] 22:26:03 INFO - GECKO(8748) | #07: nsCOMPtr<nsIAppShell>::assign_from_gs_cid_with_error(nsGetServiceByCIDWithError const &,nsID const &) [xpcom/base/nsCOMPtr.h:1203] 22:26:03 INFO - GECKO(8748) | #08: nsAppStartup::Init() [toolkit/components/startup/nsAppStartup.cpp:164] 22:26:03 INFO - GECKO(8748) | #09: nsAppStartupConstructor [toolkit/components/build/nsToolkitCompsModule.cpp:66] 22:26:03 INFO - GECKO(8748) | #10: nsComponentManagerImpl::CreateInstanceByContractID(char const *,nsISupports *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1087] 22:26:03 INFO - GECKO(8748) | #11: nsComponentManagerImpl::GetServiceByContractID(char const *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1446] 22:26:03 INFO - GECKO(8748) | #12: nsGetServiceByContractID::operator()(nsID const &,void * *) [xpcom/components/nsComponentManagerUtils.cpp:281] 22:26:03 INFO - GECKO(8748) | #13: nsCOMPtr<nsIWindowCreator>::assign_from_gs_contractid(nsGetServiceByContractID,nsID const &) [xpcom/base/nsCOMPtr.h:1215] 22:26:03 INFO - GECKO(8748) | #14: ScopedXPCOMStartup::SetWindowCreator(nsINativeAppSupport *) [toolkit/xre/nsAppRunner.cpp:1657] 22:26:03 INFO - GECKO(8748) | #15: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4360] but the dtor is never called. I wonder why...
Sorry, I'm not finding much time for this. In the leak case, I am seeing ScopedXPCOMStartup::~ScopedXPCOMStartup() called (whew!) but not nsAppStartup::Release().
Bug 870460 makes the leakage occur frequently. However, I have no idea about the relationship between cookie and AppShell.
Blocks: 870460
The leakage depends on the life-cycle of thread. I do some experiment in cookie thread. The intermittent rate would be: Using StreamTransportService (short lifecycle thread): 2% [1] Shutdown on profile-before-change: 8% [2] Shutdown on xpcom-shutdown: ~30% [3] [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bfd99ad67846e6b54f24ad1f6107d0a03176063 [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=97aa32bc8e72153da74d4018b894beae446cee73 [3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6963a183d2032a456d6f58f031251e89601f5d0
I don't find any thread which shutdowns after xpcom-shutdown for the leakage case I get https://treeherder.mozilla.org/logviewer.html#?job_id=138198756&repo=try&lineNumber=44467
Sorry, I'm not getting anywhere with this. I note that all such leaks appear to be Windows-10 only and mochitest-chrome only. There may be fewer leaks in chunk 2, but all chunks are affected and I don't see a close correlation with any small subset of mochitest-chrome tests. This leak is fairly easy to reproduce on try: -p win64 -u mochitest-chrome --rebuild 10 I tried to apply https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing on try, but that introduced a lot of failures: https://treeherder.mozilla.org/#/jobs?repo=try&revision=67b9a022b0948eec857f66e923a1869551162c0e :junior's observations are interesting (thanks!), but I don't know what else to suggest.
Flags: needinfo?(gbrown)
Keywords: mlk
No longer blocks: 870460
:gbrown , :junior , do you think we can find anyone who could work on this?
Flags: needinfo?(gbrown)
:mccr8, can you help here?
Flags: needinfo?(gbrown) → needinfo?(continuation)
I think the problem with your previous attempt to get refcount logs was that it was attempting to create a file in the child process, to do refcount logging, but the sandbox is enabled there. I could disable the sandbox, but because the leak here is only in the parent process anyways, I disabled refcount logging in the child process instead. Hopefully that will work. https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cff28eb159c3d35c38332cbfd4e4fa3e741bdbd
Flags: needinfo?(continuation)
I'll move this to widget for now. There are a lot of places that use this base app shell, so I think a refcount log is the way to go. One problem will be that the refcount log will be unsymbolicated. I can get a leak log and filter out everything but the leaked object, but somebody with a Windows machine will have to do the symbolification. Assuming that it works on Windows like it works on Linux and OSX.
Component: XPCOM → Widget
I finally got a run that worked correctly. I still need to actually hit the leak to see a log. I've hit too other intermittent leaks, though... https://treeherder.mozilla.org/#/jobs?repo=try&revision=96c6077c741103b3e36068a203fd7a992804c373
Priority: -- → P3
There are 52 failures on the last week for this bug. :mccr8 did you had any time to look further into this?
Flags: needinfo?(continuation)
I came up with a patch to refcount log nsBaseAppShell that seemed to actually work, but after I retriggered around 200 times I wasn't able to reproduce the leak: https://treeherder.mozilla.org/#/jobs?repo=try&revision=96c6077c741103b3e36068a203fd7a992804c373 Possibly the slowdown from logging keeps the leak from happening. Another idea I had was to allocate a C++ wrapper for the nsBaseAppShell every time something requests a new reference, that forwards all calls. Then we could only log object creation and possibly still figure out what was leaking, with less overhead. I don't know if that will help. Jim, maybe you have some ideas how this widget leak could be investigated?
Flags: needinfo?(continuation) → needinfo?(jmathies)
There have been 32 failures in the last week. It seems the number of failures increased starting with February 5. This only occurs on windows10-64 / debug. Here is a recent log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=161017372&lineNumber=2265 18:00:42 INFO - TEST-PASS | browser/components/search/test/browser_oneOffHeader.js | Search header string is correct when a search term has been entered - 18:00:43 INFO - Buffered messages finished 18:00:43 INFO - TEST-UNEXPECTED-FAIL | browser/components/search/test/browser_oneOffHeader.js | Header has the correct index selected when a search engine has been selected - Got 1, expected 2 18:00:43 INFO - Stack trace: 18:00:43 INFO - chrome://mochikit/content/browser-test.js:test_is:1271 18:00:43 INFO - chrome://mochitests/content/browser/browser/components/search/test/browser_oneOffHeader.js:test_text:121 18:00:43 INFO - Not taking screenshot here: see the one that was previously logged :jimm, have you had the chance to look over this?
See Also: → 1447440
Update: There have been 56 total failures in the last 7 days. Almost all the failures occur on windows10-64 / debug, with 2 exceptions for windows10-64-ccov / debug. Here is a recent relevant log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=169402099&lineNumber=16022 10:58:25 INFO - TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes 10:58:25 INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 6056 10:58:25 INFO - |<----------------Class--------------->|<-----Bytes------>|<----Objects---->| 10:58:25 INFO - | | Per-Inst Leaked| Total Rem| 10:58:25 INFO - 0 |TOTAL | 34 176| 7560403 2| 10:58:25 INFO - 540 |Mutex | 80 80| 3446 1| 10:58:25 INFO - 1239 |nsBaseAppShell | 96 96| 1 1| 10:58:25 INFO - nsTraceRefcnt::DumpStatistics: 1854 entries 10:58:25 INFO - TEST-INFO | leakcheck | default process: leaked 1 Mutex 10:58:25 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell 10:58:25 ERROR - 1236 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell) 10:58:25 INFO - runtests.py | Running tests: end. 10:58:25 INFO - Buffered messages finished
There have been 127 failures in the last week. It fails on windows10-64 / debug. Recent log file and snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=170955498&lineNumber=10683 03:05:50 INFO - nsTraceRefcnt::DumpStatistics: 1849 entries 03:05:50 INFO - TEST-INFO | leakcheck | default process: leaked 1 Mutex 03:05:50 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell 03:05:50 ERROR - 658 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell) 03:05:50 INFO - runtests.py | Running tests: end. 03:05:50 INFO - Buffered messages finished 03:05:50 INFO - Running manifest: toolkit\components\windowcreator\test\chrome.ini 03:05:50 INFO - Z:\task_1522291025\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
I already looked at this back in November and I was unable to make any progress.
Flags: needinfo?(continuation)
this looks to be in the category of perma failing- we will look to disable something soon.
Assignee: nobody → apavel
Attachment #8964632 - Flags: review?(jmaher)
Comment on attachment 8964632 [details] Bug 1401883 - disabled on Win64 one reason for the r- is that on my try pushes: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=58a4fcd84bc875067bed6bf4f36e73520bd8d8da you can see that we have the same rate of failure before/after this patch (which is ~40%). To me we just need to disable these tests on windows, or we need to fix this issue. 40% failure rate is not acceptable. Unfortunately C1/C2/C3 all fail and it seems to be not just a couple tests or a couple directories, it seems to be wider spread.
Attachment #8964632 - Flags: review?(jmaher) → review-
I did some retriggers on treeherder for march 20th: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=windows%20x64%20mochitest-chrome&tochange=fae7d981440374495e890a5ecda4bb698a84f95b&fromchange=28088e11bfc458d12e809b79628d4c2dd942bd83 not every orange job is the same failure, but the majority are the 176 bytes leaked. It looks like bug 1439383 introduced something to cause this to fail quite frequently: https://hg.mozilla.org/integration/autoland/rev/5be07e86738e8993022e1c507b5b09e602872a33 :keeler, I see you authored the patch in bug 1439383- this bug is now the #1 intermittent failure on the whole tree- can you take a look at fixing this leak?
Blocks: 1439383
Flags: needinfo?(dkeeler)
We should back out bug 1439383 before we land a workaround like that. (Thanks so much for figuring out what caused the regression!)
I tried to figure out what from bug 1439383 could have made this more frequent, but since it's just shutting down a thread and doesn't have anything to do with nsBaseAppShell, I couldn't come up with any explanations. Bug 1439383 just got backed out anyway, so this should go back to pre-March 21 levels (which, of course, doesn't solve the intermittent, but at least we have an idea of potentially associated code).
Flags: needinfo?(dkeeler)
It is super annoying that this caused bug 1439383 to be backed out. Holding up memory wins for silly intermittent memory leaks is unfortunate. I took mccr8's patch from comment 24 and wrote a wrapper for the appshell constructor to hand out: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d575a6b0635a6769708cf5f9040396e2a3a27944 I haven't looked closely at the refcount logs, but I believe what the leak checker is saying is that we didn't leak any wrapper objects for nsBaseAppShell being used *as an nsIAppShell*. If we were leaking nsBaseAppShell as an nsIAppShell, then the wrapper objects would have leaked too (I think). If that's the case, the next step is to get the wrapper's QI to hand out different objects for nsIThreadObserver and nsIObserver, the other two interfaces implemented by nsBaseAppShell. One of those wrappers should leak, and then we should have a better idea of where to start diagnosing the leak. It's possible that idea doesn't work, in which case I think that means that somebody is casting ns{Base,}AppShell objects directly to the interface superclasses without going through QI...which also should narrow down the potential space of places to check?
FWIW my clang-cl try pushes hit this intermittent a *lot*. I don't know if the compiler change is tickling something or if simply everyone else is hitting this a lot too. (Sad to say, I saw the disable-recommended keyword and am quietly hoping that this bug just goes away before I need to worry about landing.)
(In reply to David Major [:dmajor] from comment #69) > FWIW my clang-cl try pushes hit this intermittent a *lot*. I don't know if > the compiler change is tickling something or if simply everyone else is > hitting this a lot too. This was the top intermittent failure, so you aren't alone. Please let us know if you are still seeing it a lot after bug 1439383 got backed out.
you can see the failure rate dropped significantly 2 days ago: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-03-30&endday=2018-04-06&tree=trunk&bug=1401883 The backout (or something else around the same time) did resolve the large majority of these failures.
OK, I don't understand what's going on here. Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2d35da556cb039b07e62873960aa7e8ed9a8f548 To make it clear, this push implements: * A wrapper returned for constructing nsIAppShell instances. The wrapper QIs itself to nsIAppShell and nsISupports. This *should* mean that when we leak nsBaseAppShell, we're also leaking this object. It's possible that the wrapper gets QI'd to nsI{Thread,}Observer instead, and we lose tracking of the wrapper, but this does not seem to be the case. (nsBaseAppShell sets itself up as a thread observer and registers itself with the observer service directly, and it seems kind of unusual for something else to QI to the observer classes directly. Previous try pushes intercepted QI to the observer classes and didn't see anything unusual. See also below.) * Wrappers are used for registering nsI{Thread,}Observer instances from inside nsBaseAppShell. Thus, if we were leaking through those references, we'd leak those wrappers as well. But the leaks (which are numerous, as you can see) never leak the wrapper objects; they continue to only leak the Mutex and the nsBaseAppShell. Which makes no sense to me! There are also a fair number of crashes in the nsTraceRefcnt code, which I'm puzzled about too. Usually when that happens, it means that two different classes are using the same name for their nsTraceRefcnt key...but that doesn't appear to be happening in this case: grepping around the tree doesn't produce any hits for the newly added class names besides the newly added classes themselves. Going to try asserting on peculiar QI requests and see if that turns up anything unusual.
OK, asserting on peculiar QI requests doesn't turn up anything unusual. That means that all the references code are getting are coming through XPCOM construction mechanisms. And those should all go through the wrapper handed out in the constructor...but the wrappers themselves are not leaking! I think this means that the only reference that can leak is the one held by nsAppShellSingleton.h itself...but that reference *should* go away by the time the final bit of the leak checker runs to check for not-freed stuff. Bug 1439383 suggests that an extra event causes this to occur more frequently, so maybe there's some runnable that's holding a reference, and the runnable is unceremoniously dropped on the floor at some point? But then the runnable would leak too, so that doesn't seem tenable. Oh! Here's the problem. Runnables get dispatched, and when they get dispatched, the thread observer (i.e. the appshell) is notified: https://searchfox.org/mozilla-central/source/widget/nsBaseAppShell.cpp#219 which calls the widget-specific ScheduleNativeEventCallback: https://searchfox.org/mozilla-central/source/widget/windows/nsAppShell.cpp#460 On Windows, that scheduling takes an extra reference to the appshell, and posts a message to the native Windows event loop. The corresponding handler will release the reference: https://searchfox.org/mozilla-central/source/widget/windows/nsAppShell.cpp#189 But what's probably happening in this bug is that ScheduleNativeEventCallback is called, and then we shut down processing of the native Windows event loop before calling EventWindowProc, thereby leaking the reference. *Those* refcounting operations are run on the actual object, and that's why the wrappers weren't being leaked. That is, this is just the reincarnation of bug 1220517. More tomorrow when I can think.
Update: There have been 42 failures in the last 7 days. All of the failures occur on debug. Platforms affected: windows10-64: 28 windows10-64-ccov: 14 Here is a recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=174251092&lineNumber=1942 Summary: Intermittent leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell) :froydnj, did you manage to have another look at this?
Flags: needinfo?(nfroyd)
Checking the logs we found that this leak is not on a specific directory.
the ccov failures are on this directory: toolkit/components/url-classifier/tests/mochitest and the normal windows10 failures are on: browser/base/content/test/chrome we should be able to bisect this down and skip some tests- luckily the failure rate is pretty low these days.
bisection didn't work so well for me, seemed that I couldn't repro or everything reproduced it. The faillure rate is pretty low where we could consider disabling the url-classifier tests for win10 ccov builds if needed- but I would think we will stay <20 failures/week until something triggers this again.
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Joel - did you try with the patch from bug 1439383? It seemed to make this much more common - might make it easier to narrow it down.
Flags: needinfo?(jmaher)
why did these get annotated on try :(
I tried applying the patch from bug 1439383 and did a bisection again: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=f49e9d82ae4c49327947b6f4e1ec1189e7bdab8e&filter-searchStr=c3&tochange=0cf3bc8396ce0ab411e899f3fd377052982e5969&selectedJob=179893480 I didn't see any on the ccov builds, and a few instances on the debug builds, although so few it is hard to know if there is a pattern or a fix.
Flags: needinfo?(jmaher)
When runnables are posted to the main thread's event loop, the event loop notifies any thread observers that this has been done. The app shell registers itself as just such a runnable, and posts messages to the native event loop that processing native events needs to be done. On Windows, this posting takes an extra reference to the app shell, to keep it alive until the message is processed by the native event loop, since app shell code needs to be invoked during that processing. The processing releases this extra reference, so everything stays balanced. Except that it's possible for messages to be posted to the native event loop, and then browser shutdown happens. Those messages are not processed and the associated references taken are not released. This imbalance means that in debug builds, we appear to be leaking the app shell, and that leaking results in intermittent oranges. This intermittent orange has manifested itself in a variety of ways over the years, depending on how big the app shell itself was (since that changes the number of bytes leaked) and how many leak-checked things the app shell was holding on to. This bug is merely the latest manifestation; the last serious work on analyzing the phenomenon and fixing it was done in bug 1220517. The solution proposed in that bug was that we simply stop the extra reference counting; when the app shell is destroyed normally, we shouldn't be processing the native event loop any more anyway. So even if the native event loop is holding (freed) pointers to the app shell, we'd never execute the callback and perform a use-after-free. Reading through the code suggests that this *ought* to be the case, but the potential for shooting ourselves in the foot seems awfully high. In any event, this is not a problem unique to Windows; we have seen this same sort of thing happen on OS X. See nsAppShell::ProcessGeckoEvents in widget/cocoa/nsAppShell.mm. Here we propose a slightly different solution: we keep track of the number of native event callbacks we have scheduled, incrementing when we schedule, and decrementing when we actually run one. When the app shell is destroyed, we simply set the number of outstanding events to zero, and we prohibit the callback from accessing the app shell if there are no outstanding events. This solution is analogous to dropping the extra reference counting, but avoids potential badness if we do wind up processing native events after the app shell is destroyed. This patch is really Ting-Yu's patch from bug 1220517, modified by me to be slightly safer. You can see some of the discussion in bug 1220517 comment 31 and following. bsmedberg and I were skeptical that the proposed fix of just dropping the refcounting was bulletproof, but it did fix the problem. This fix should work just as well, and be slightly less sketchy.
Attachment #8982625 - Flags: review?(jmathies)
Flags: needinfo?(nfroyd)
Comment on attachment 8982625 [details] [diff] [review] don't hold unnecessary references to the Windows app shell Review of attachment 8982625 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for the detail in the patch. Seems like a safe change.
Attachment #8982625 - Flags: review?(jmathies) → review+
Flags: needinfo?(jmathies)
Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/5fd3144d5d00 don't hold unnecessary references to the Windows app shell; r=jimm
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
No reports of this since this patch landed on the 7th. Is this something we could consider uplifting to Beta/ESR60?
Assignee: apavel → nfroyd
Flags: needinfo?(nfroyd)
Comment on attachment 8982625 [details] [diff] [review] don't hold unnecessary references to the Windows app shell (In reply to Ryan VanderMeulen [:RyanVM] from comment #99) > No reports of this since this patch landed on the 7th. Is this something we > could consider uplifting to Beta/ESR60? Sure, would basically only matter for tests, though. [Approval Request Comment] If this is not a sec:{high,crit} bug, please state case for ESR consideration: Makes sheriffs' lives easier, perhaps? User impact if declined: None. Fix Landed on Version: 62 Risk to taking this patch (and alternatives if risky): Very low risk. String or UUID changes made by this patch: None Approval Request Comment [Feature/Bug causing the regression]: Something lost to the sands of time. [User impact if declined]: None. [Is this code covered by automated tests?]: Yes. [Has the fix been verified in Nightly?]: Yes. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: Not really. [String changes made/needed]: None.
Flags: needinfo?(nfroyd)
Attachment #8982625 - Flags: approval-mozilla-esr60?
Attachment #8982625 - Flags: approval-mozilla-beta?
Comment on attachment 8982625 [details] [diff] [review] don't hold unnecessary references to the Windows app shell Makes Windows a bit less leaky in CI. Approved for 61.0b14 and ESR 60.1.
Attachment #8982625 - Flags: approval-mozilla-esr60?
Attachment #8982625 - Flags: approval-mozilla-esr60+
Attachment #8982625 - Flags: approval-mozilla-beta?
Attachment #8982625 - Flags: approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: