Closed Bug 1401883 Opened 2 years ago Closed 2 years ago

Intermittent leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)

Categories

(Core :: Widget, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- fixed
firefox60 --- wontfix
firefox61 --- fixed
firefox62 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: froydnj)

References

Details

(Keywords: intermittent-failure, memory-leak, Whiteboard: [stockwell disable-recommended])

Attachments

(3 files)

Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=132405319&repo=autoland

https://queue.taskcluster.net/v1/task/alkWkeAoRoqKt4aUNZi1Hg/runs/0/artifacts/public/logs/live_backing.log

06:58:46     INFO -  365 INFO SimpleTest START
06:58:46     INFO -  366 INFO TEST-START | layout/mathml/tests/test_disabled_chrome.html
06:58:46     INFO -  GECKO(10188) | ++DOMWINDOW == 20 (000001CECBC32000) [pid = 10188] [serial = 20] [outer = 000001CECBC19800]
06:58:47     INFO -  GECKO(10188) | ++DOMWINDOW == 21 (000001CEC5B18800) [pid = 10188] [serial = 21] [outer = 000001CECBC19800]
06:58:47     INFO -  GECKO(10188) | ++DOCSHELL 000001CEC5B20000 == 8 [pid = 10188] [id = {3e43cb67-9bdd-48c0-bbed-a203c015ec46}]
06:58:47     INFO -  GECKO(10188) | ++DOMWINDOW == 22 (000001CEC5B20800) [pid = 10188] [serial = 22] [outer = 0000000000000000]
06:58:47     INFO -  GECKO(10188) | ++DOMWINDOW == 23 (000001CEC5B21000) [pid = 10188] [serial = 23] [outer = 000001CEC5B20800]
06:58:47     INFO -  GECKO(10188) | ++DOCSHELL 000001CEC5B2C000 == 9 [pid = 10188] [id = {684bcc33-2689-4d3b-95d1-a6f856936448}]
06:58:47     INFO -  GECKO(10188) | ++DOMWINDOW == 24 (000001CEC5B2C800) [pid = 10188] [serial = 24] [outer = 0000000000000000]
06:58:47     INFO -  GECKO(10188) | ++DOMWINDOW == 25 (000001CEC5B2E800) [pid = 10188] [serial = 25] [outer = 000001CEC5B2C800]
06:58:47     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: NS_ENSURE_TRUE(aSecondURI) failed: file z:/build/build/src/dom/base/ThirdPartyUtil.cpp, line 98
06:58:48     INFO -  GECKO(10188) | ++DOMWINDOW == 26 (000001CEC5E65000) [pid = 10188] [serial = 26] [outer = 000001CEC5B2C800]
06:58:48     INFO -  GECKO(10188) | MEMORY STAT | vsize 1813MB | vsizeMaxContiguous 132289253MB | residentFast 294MB | heapAllocated 123MB
06:58:48     INFO -  367 INFO TEST-OK | layout/mathml/tests/test_disabled_chrome.html | took 1206ms
06:58:48     INFO -  368 INFO None369 INFO TEST-START | Shutdown
06:58:48     INFO -  370 INFO Passed:  3
06:58:48     INFO -  371 INFO Failed:  0
06:58:48     INFO -  372 INFO Todo:    0
06:58:48     INFO -  373 INFO Mode:    non-e10s
06:58:48     INFO -  374 INFO Slowest: 1206ms - chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html
06:58:48     INFO -  375 INFO SimpleTest FINISHED
06:58:48     INFO -  376 INFO TEST-INFO | Ran 1 Loops
06:58:48     INFO -  377 INFO SimpleTest FINISHED
06:58:48     INFO -  GECKO(10188) | --DOCSHELL 000001CEC17E0000 == 8 [pid = 10188] [id = {4bf49a39-dcd7-4ec5-aa0d-53cf4ff6ceaf}]
06:58:48     INFO -  GECKO(10188) | *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping
06:58:49     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file z:/build/build/src/xpcom/threads/nsThread.cpp, line 797
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEBD8D2800 == 7 [pid = 10188] [id = {d1d3a87d-3d81-46c0-ad9a-8f9f60f519d1}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC100C000 == 6 [pid = 10188] [id = {4b741d05-b6dc-4d23-a5f0-b2d0ed54d0ea}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC2705800 == 5 [pid = 10188] [id = {75b79781-89b2-494d-8491-17986e13e5d6}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CECB74A800 == 4 [pid = 10188] [id = {b53f2ea2-ba5a-4326-bbb4-41c3272c1e15}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC5B2C000 == 3 [pid = 10188] [id = {684bcc33-2689-4d3b-95d1-a6f856936448}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC1260000 == 2 [pid = 10188] [id = {f6f356a5-bcfe-44f3-9389-d4325d468de3}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC5B20000 == 1 [pid = 10188] [id = {3e43cb67-9bdd-48c0-bbed-a203c015ec46}]
06:58:49     INFO -  GECKO(10188) | --DOCSHELL 000001CEC3AB3800 == 0 [pid = 10188] [id = {d84f3041-290a-49ba-b0d5-a4f7a62d012f}]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 25 (000001CEC1261800) [pid = 10188] [serial = 6] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 24 (000001CEC5E65000) [pid = 10188] [serial = 26] [outer = 0000000000000000] [url = http://mochi.test:8888/chrome/layout/mathml/tests/mathml_example_test.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 23 (000001CEC5B2E800) [pid = 10188] [serial = 25] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 22 (000001CEC5B2C800) [pid = 10188] [serial = 24] [outer = 0000000000000000] [url = http://mochi.test:8888/chrome/layout/mathml/tests/mathml_example_test.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 21 (000001CEBD8D3000) [pid = 10188] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanityparent.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 20 (000001CEBD8D4000) [pid = 10188] [serial = 2] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 19 (000001CEC4A32000) [pid = 10188] [serial = 15] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 18 (000001CEC49DA000) [pid = 10188] [serial = 14] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 17 (000001CEC5B21000) [pid = 10188] [serial = 23] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/mathml_example_test.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 16 (000001CEC5B20800) [pid = 10188] [serial = 22] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/mathml_example_test.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 15 (000001CEC100C800) [pid = 10188] [serial = 3] [outer = 0000000000000000] [url = resource://gre-resources/hiddenWindow.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 14 (000001CEC100D800) [pid = 10188] [serial = 4] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 13 (000001CEC5B18800) [pid = 10188] [serial = 21] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 12 (000001CEBDC59000) [pid = 10188] [serial = 11] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 11 (000001CECBC32000) [pid = 10188] [serial = 20] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 10 (000001CEC1260800) [pid = 10188] [serial = 5] [outer = 0000000000000000] [url = chrome://browser/content/browser.xul]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 9 (000001CECBC1C800) [pid = 10188] [serial = 19] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 8 (000001CECBC19800) [pid = 10188] [serial = 18] [outer = 0000000000000000] [url = chrome://mochitests/content/chrome/layout/mathml/tests/test_disabled_chrome.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 7 (000001CEC271C800) [pid = 10188] [serial = 10] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 6 (000001CEC2706800) [pid = 10188] [serial = 9] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 5 (000001CEC3ABE800) [pid = 10188] [serial = 13] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 4 (000001CECA3E0800) [pid = 10188] [serial = 17] [outer = 0000000000000000] [url = chrome://mochikit/content/harness.xul]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 3 (000001CEC3AB4800) [pid = 10188] [serial = 12] [outer = 0000000000000000] [url = chrome://mochikit/content/harness.xul]
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 396
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 2 (000001CEC17E0800) [pid = 10188] [serial = 8] [outer = 0000000000000000] [url = about:blank]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 1 (000001CEC8E4C800) [pid = 10188] [serial = 16] [outer = 0000000000000000] [url = http://mochi.test:8888/redirect.html]
06:58:50     INFO -  GECKO(10188) | --DOMWINDOW == 0 (000001CEC1567000) [pid = 10188] [serial = 7] [outer = 0000000000000000] [url = resource://gre-resources/hiddenWindow.html]
06:58:50     INFO -  GECKO(10188) | [10188, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file z:/build/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562
06:58:50     INFO -  GECKO(10188) | nsStringStats
06:58:50     INFO -  GECKO(10188) |  => mAllocCount:          81528
06:58:50     INFO -  GECKO(10188) |  => mReallocCount:         6717
06:58:50     INFO -  GECKO(10188) |  => mFreeCount:           81528
06:58:50     INFO -  GECKO(10188) |  => mShareCount:         104859
06:58:50     INFO -  GECKO(10188) |  => mAdoptCount:           5495
06:58:50     INFO -  GECKO(10188) |  => mAdoptFreeCount:       5495
06:58:50     INFO -  GECKO(10188) |  => Process ID: 10188, Thread ID: 928
06:58:50     INFO -  TEST-INFO | Main app process: exit 0
06:58:50     INFO -  runtests.py | Application ran for: 0:00:08.954000
06:58:50     INFO -  zombiecheck | Reading PID log: c:\users\genericworker\appdata\local\temp\tmpmpcq2zpidlog
06:58:50     INFO -  Stopping web server
06:58:50     INFO -  Stopping web socket server
06:58:50     INFO -  Stopping ssltunnel
06:58:50     INFO -  TEST-INFO | leakcheck | default process: leak threshold set at 0 bytes
06:58:50     INFO -  TEST-INFO | leakcheck | plugin process: leak threshold set at 0 bytes
06:58:50     INFO -  TEST-INFO | leakcheck | tab process: leak threshold set at 10000 bytes
06:58:50     INFO -  TEST-INFO | leakcheck | geckomediaplugin process: leak threshold set at 20000 bytes
06:58:50     INFO -  TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes
06:58:50     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 10188
06:58:50     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:58:50     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:58:50     INFO -     0 |TOTAL                                 |       33      176| 1163016        2|
06:58:50     INFO -   446 |Mutex                                 |       80       80|    1748        1|
06:58:50     INFO -   913 |nsBaseAppShell                        |       96       96|       1        1|
06:58:50     INFO -  nsTraceRefcnt::DumpStatistics: 1464 entries
06:58:50     INFO -  TEST-INFO | leakcheck | default process: leaked 1 Mutex
06:58:50     INFO -  TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell
06:58:50    ERROR -  378 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)
this is in mochitest-chrome and appears to be in many different directories- this makes it harder to track the leak down.

:gbrown, can you help track this down?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=72fa9760204dcb1f3255db15c02f809ee9ac5aec

The nsBaseAppShell dtor would normally be called like so:

22:15:48     INFO -  GECKO(7484) | #01: nsBaseAppShell::Release() [widget/nsBaseAppShell.cpp:22]
22:15:48     INFO -  GECKO(7484) | #02: nsAppStartup::~nsAppStartup() [toolkit/components/startup/nsAppStartup.h:46]
22:15:48     INFO -  GECKO(7484) | #03: nsAppStartup::Release() [toolkit/components/startup/nsAppStartup.cpp:230]
22:15:48     INFO -  GECKO(7484) | #04: ScopedXPCOMStartup::~ScopedXPCOMStartup() [toolkit/xre/nsAppRunner.cpp:1524]
22:15:48     INFO -  GECKO(7484) | #05: XREMain::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4884]

In the leak case

https://public-artifacts.taskcluster.net/fTjvtO82TeqU_37kCWoImw/0/public/logs/live_backing.log

nsBaseAppShell is created:

22:26:03     INFO -  GECKO(8748) | #01: nsAppShellInit [widget/nsAppShellSingleton.h:39]
22:26:03     INFO -  GECKO(8748) | #02: nsComponentManagerImpl::KnownModule::Load() [xpcom/components/nsComponentManager.cpp:764]
22:26:03     INFO -  GECKO(8748) | #03: nsFactoryEntry::GetFactory() [xpcom/components/nsComponentManager.cpp:1785]
22:26:03     INFO -  GECKO(8748) | #04: nsComponentManagerImpl::CreateInstance(nsID const &,nsISupports *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1001]
22:26:03     INFO -  GECKO(8748) | #05: nsComponentManagerImpl::GetService(nsID const &,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1246]
22:26:03     INFO -  GECKO(8748) | #06: nsGetServiceByCIDWithError::operator()(nsID const &,void * *) [xpcom/components/nsComponentManagerUtils.cpp:266]
22:26:03     INFO -  GECKO(8748) | #07: nsCOMPtr<nsIAppShell>::assign_from_gs_cid_with_error(nsGetServiceByCIDWithError const &,nsID const &) [xpcom/base/nsCOMPtr.h:1203]
22:26:03     INFO -  GECKO(8748) | #08: nsAppStartup::Init() [toolkit/components/startup/nsAppStartup.cpp:164]
22:26:03     INFO -  GECKO(8748) | #09: nsAppStartupConstructor [toolkit/components/build/nsToolkitCompsModule.cpp:66]
22:26:03     INFO -  GECKO(8748) | #10: nsComponentManagerImpl::CreateInstanceByContractID(char const *,nsISupports *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1087]
22:26:03     INFO -  GECKO(8748) | #11: nsComponentManagerImpl::GetServiceByContractID(char const *,nsID const &,void * *) [xpcom/components/nsComponentManager.cpp:1446]
22:26:03     INFO -  GECKO(8748) | #12: nsGetServiceByContractID::operator()(nsID const &,void * *) [xpcom/components/nsComponentManagerUtils.cpp:281]
22:26:03     INFO -  GECKO(8748) | #13: nsCOMPtr<nsIWindowCreator>::assign_from_gs_contractid(nsGetServiceByContractID,nsID const &) [xpcom/base/nsCOMPtr.h:1215]
22:26:03     INFO -  GECKO(8748) | #14: ScopedXPCOMStartup::SetWindowCreator(nsINativeAppSupport *) [toolkit/xre/nsAppRunner.cpp:1657]
22:26:03     INFO -  GECKO(8748) | #15: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4360]

but the dtor is never called. I wonder why...
Sorry, I'm not finding much time for this. In the leak case, I am seeing ScopedXPCOMStartup::~ScopedXPCOMStartup() called (whew!) but not nsAppStartup::Release().
Bug 870460 makes the leakage occur frequently.
However, I have no idea about the relationship between cookie and AppShell.
Blocks: 870460
The leakage depends on the life-cycle of thread.
I do some experiment in cookie thread.

The intermittent rate would be:
Using StreamTransportService (short lifecycle thread): 2% [1]
Shutdown on profile-before-change: 8% [2]
Shutdown on xpcom-shutdown: ~30% [3]

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bfd99ad67846e6b54f24ad1f6107d0a03176063
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=97aa32bc8e72153da74d4018b894beae446cee73
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6963a183d2032a456d6f58f031251e89601f5d0
I don't find any thread which shutdowns after xpcom-shutdown for the leakage case I get
https://treeherder.mozilla.org/logviewer.html#?job_id=138198756&repo=try&lineNumber=44467
Sorry, I'm not getting anywhere with this.

I note that all such leaks appear to be Windows-10 only and mochitest-chrome only. There may be fewer leaks in chunk 2, but all chunks are affected and I don't see a close correlation with any small subset of mochitest-chrome tests.

This leak is fairly easy to reproduce on try: -p win64 -u mochitest-chrome --rebuild 10

I tried to apply https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing on try, but that introduced a lot of failures:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=67b9a022b0948eec857f66e923a1869551162c0e


:junior's observations are interesting (thanks!), but I don't know what else to suggest.
Flags: needinfo?(gbrown)
Keywords: mlk
No longer blocks: 870460
:gbrown , :junior , do you think we can find anyone who could work on this?
Flags: needinfo?(gbrown)
:mccr8, can you help here?
Flags: needinfo?(gbrown) → needinfo?(continuation)
I think the problem with your previous attempt to get refcount logs was that it was attempting to create a file in the child process, to do refcount logging, but the sandbox is enabled there. I could disable the sandbox, but because the leak here is only in the parent process anyways, I disabled refcount logging in the child process instead. Hopefully that will work.
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cff28eb159c3d35c38332cbfd4e4fa3e741bdbd
Flags: needinfo?(continuation)
I'll move this to widget for now.

There are a lot of places that use this base app shell, so I think a refcount log is the way to go. One problem will be that the refcount log will be unsymbolicated. I can get a leak log and filter out everything but the leaked object, but somebody with a Windows machine will have to do the symbolification. Assuming that it works on Windows like it works on Linux and OSX.
Component: XPCOM → Widget
I finally got a run that worked correctly. I still need to actually hit the leak to see a log. I've hit too other intermittent leaks, though...
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=96c6077c741103b3e36068a203fd7a992804c373
Priority: -- → P3
There are 52 failures on the last week for this bug.

:mccr8 did you had any time to look further into this?
Flags: needinfo?(continuation)
I came up with a patch to refcount log nsBaseAppShell that seemed to actually work, but after I retriggered around 200 times I wasn't able to reproduce the leak: https://treeherder.mozilla.org/#/jobs?repo=try&revision=96c6077c741103b3e36068a203fd7a992804c373

Possibly the slowdown from logging keeps the leak from happening.

Another idea I had was to allocate a C++ wrapper for the nsBaseAppShell every time something requests a new reference, that forwards all calls. Then we could only log object creation and possibly still figure out what was leaking, with less overhead. I don't know if that will help.

Jim, maybe you have some ideas how this widget leak could be investigated?
Flags: needinfo?(continuation) → needinfo?(jmathies)
There have been 32 failures in the last week.
It seems the number of failures increased starting with February 5.

This only occurs on windows10-64 / debug.

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=161017372&lineNumber=2265

18:00:42     INFO - TEST-PASS | browser/components/search/test/browser_oneOffHeader.js | Search header string is correct when a search term has been entered - 
18:00:43     INFO - Buffered messages finished
18:00:43     INFO - TEST-UNEXPECTED-FAIL | browser/components/search/test/browser_oneOffHeader.js | Header has the correct index selected when a search engine has been selected - Got 1, expected 2
18:00:43     INFO - Stack trace:
18:00:43     INFO - chrome://mochikit/content/browser-test.js:test_is:1271
18:00:43     INFO - chrome://mochitests/content/browser/browser/components/search/test/browser_oneOffHeader.js:test_text:121
18:00:43     INFO - Not taking screenshot here: see the one that was previously logged

:jimm, have you had the chance to look over this?
See Also: → 1447440
Update: 
There have been 56 total failures in the last 7 days.
Almost all the failures occur on windows10-64 / debug, with 2 exceptions for windows10-64-ccov / debug.

Here is a recent relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=169402099&lineNumber=16022

10:58:25     INFO -  TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes
10:58:25     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 6056
10:58:25     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
10:58:25     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
10:58:25     INFO -     0 |TOTAL                                 |       34      176| 7560403        2|
10:58:25     INFO -   540 |Mutex                                 |       80       80|    3446        1|
10:58:25     INFO -  1239 |nsBaseAppShell                        |       96       96|       1        1|
10:58:25     INFO -  nsTraceRefcnt::DumpStatistics: 1854 entries
10:58:25     INFO -  TEST-INFO | leakcheck | default process: leaked 1 Mutex
10:58:25     INFO -  TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell
10:58:25    ERROR -  1236 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)
10:58:25     INFO -  runtests.py | Running tests: end.
10:58:25     INFO -  Buffered messages finished
There have been 127 failures in the last week.
It fails on windows10-64 / debug.

Recent log file and snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=170955498&lineNumber=10683

03:05:50     INFO -  nsTraceRefcnt::DumpStatistics: 1849 entries
03:05:50     INFO -  TEST-INFO | leakcheck | default process: leaked 1 Mutex
03:05:50     INFO -  TEST-INFO | leakcheck | default process: leaked 1 nsBaseAppShell
03:05:50    ERROR -  658 ERROR TEST-UNEXPECTED-FAIL | leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)
03:05:50     INFO -  runtests.py | Running tests: end.
03:05:50     INFO -  Buffered messages finished
03:05:50     INFO -  Running manifest: toolkit\components\windowcreator\test\chrome.ini
03:05:50     INFO -  Z:\task_1522291025\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
I already looked at this back in November and I was unable to make any progress.
Flags: needinfo?(continuation)
this looks to be in the category of perma failing- we will look to disable something soon.
Assignee: nobody → apavel
Attachment #8964632 - Flags: review?(jmaher)
Comment on attachment 8964632 [details]
Bug 1401883 - disabled on Win64

one reason for the r- is that on my try pushes:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=58a4fcd84bc875067bed6bf4f36e73520bd8d8da

you can see that we have the same rate of failure before/after this patch (which is ~40%).

To me we just need to disable these tests on windows, or we need to fix this issue.  40% failure rate is not acceptable.  Unfortunately C1/C2/C3 all fail and it seems to be not just a couple tests or a couple directories, it seems to be wider spread.
Attachment #8964632 - Flags: review?(jmaher) → review-
I did some retriggers on treeherder for march 20th:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=windows%20x64%20mochitest-chrome&tochange=fae7d981440374495e890a5ecda4bb698a84f95b&fromchange=28088e11bfc458d12e809b79628d4c2dd942bd83

not every orange job is the same failure, but the majority are the 176 bytes leaked.

It looks like bug 1439383 introduced something to cause this to fail quite frequently:
https://hg.mozilla.org/integration/autoland/rev/5be07e86738e8993022e1c507b5b09e602872a33

:keeler, I see you authored the patch in bug 1439383- this bug is now the #1 intermittent failure on the whole tree- can you take a look at fixing this leak?
Blocks: 1439383
Flags: needinfo?(dkeeler)
We should back out bug 1439383 before we land a workaround like that. (Thanks so much for figuring out what caused the regression!)
I tried to figure out what from bug 1439383 could have made this more frequent, but since it's just shutting down a thread and doesn't have anything to do with nsBaseAppShell, I couldn't come up with any explanations. Bug 1439383 just got backed out anyway, so this should go back to pre-March 21 levels (which, of course, doesn't solve the intermittent, but at least we have an idea of potentially associated code).
Flags: needinfo?(dkeeler)
It is super annoying that this caused bug 1439383 to be backed out.  Holding up memory wins for silly intermittent memory leaks is unfortunate.

I took mccr8's patch from comment 24 and wrote a wrapper for the appshell constructor to hand out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d575a6b0635a6769708cf5f9040396e2a3a27944

I haven't looked closely at the refcount logs, but I believe what the leak checker is saying is that we didn't leak any wrapper objects for nsBaseAppShell being used *as an nsIAppShell*.  If we were leaking nsBaseAppShell as an nsIAppShell, then the wrapper objects would have leaked too (I think).  If that's the case, the next step is to get the wrapper's QI to hand out different objects for nsIThreadObserver and nsIObserver, the other two interfaces implemented by nsBaseAppShell.  One of those wrappers should leak, and then we should have a better idea of where to start diagnosing the leak.

It's possible that idea doesn't work, in which case I think that means that somebody is casting ns{Base,}AppShell objects directly to the interface superclasses without going through QI...which also should narrow down the potential space of places to check?
FWIW my clang-cl try pushes hit this intermittent a *lot*. I don't know if the compiler change is tickling something or if simply everyone else is hitting this a lot too. (Sad to say, I saw the disable-recommended keyword and am quietly hoping that this bug just goes away before I need to worry about landing.)
(In reply to David Major [:dmajor] from comment #69)
> FWIW my clang-cl try pushes hit this intermittent a *lot*. I don't know if
> the compiler change is tickling something or if simply everyone else is
> hitting this a lot too.

This was the top intermittent failure, so you aren't alone. Please let us know if you are still seeing it a lot after bug 1439383 got backed out.
you can see the failure rate dropped significantly 2 days ago:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-03-30&endday=2018-04-06&tree=trunk&bug=1401883

The backout (or something else around the same time) did resolve the large majority of these failures.
OK, I don't understand what's going on here.  Try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2d35da556cb039b07e62873960aa7e8ed9a8f548

To make it clear, this push implements:

* A wrapper returned for constructing nsIAppShell instances.  The wrapper QIs itself to nsIAppShell and nsISupports.  This *should* mean that when we leak nsBaseAppShell, we're also leaking this object.  It's possible that the wrapper gets QI'd to nsI{Thread,}Observer instead, and we lose tracking of the wrapper, but this does not seem to be the case.  (nsBaseAppShell sets itself up as a thread observer and registers itself with the observer service directly, and it seems kind of unusual for something else to QI to the observer classes directly.  Previous try pushes intercepted QI to the observer classes and didn't see anything unusual.  See also below.)
* Wrappers are used for registering nsI{Thread,}Observer instances from inside nsBaseAppShell.  Thus, if we were leaking through those references, we'd leak those wrappers as well.

But the leaks (which are numerous, as you can see) never leak the wrapper objects; they continue to only leak the Mutex and the nsBaseAppShell.  Which makes no sense to me!

There are also a fair number of crashes in the nsTraceRefcnt code, which I'm puzzled about too.  Usually when that happens, it means that two different classes are using the same name for their nsTraceRefcnt key...but that doesn't appear to be happening in this case: grepping around the tree doesn't produce any hits for the newly added class names besides the newly added classes themselves.

Going to try asserting on peculiar QI requests and see if that turns up anything unusual.
OK, asserting on peculiar QI requests doesn't turn up anything unusual.  That means that all the references code are getting are coming through XPCOM construction mechanisms.  And those should all go through the wrapper handed out in the constructor...but the wrappers themselves are not leaking!

I think this means that the only reference that can leak is the one held by nsAppShellSingleton.h itself...but that reference *should* go away by the time the final bit of the leak checker runs to check for not-freed stuff.

Bug 1439383 suggests that an extra event causes this to occur more frequently, so maybe there's some runnable that's holding a reference, and the runnable is unceremoniously dropped on the floor at some point?  But then the runnable would leak too, so that doesn't seem tenable.

Oh!  Here's the problem.  Runnables get dispatched, and when they get dispatched, the thread observer (i.e. the appshell) is notified:

https://searchfox.org/mozilla-central/source/widget/nsBaseAppShell.cpp#219

which calls the widget-specific ScheduleNativeEventCallback:

https://searchfox.org/mozilla-central/source/widget/windows/nsAppShell.cpp#460

On Windows, that scheduling takes an extra reference to the appshell, and posts a message to the native Windows event loop.  The corresponding handler will release the reference:

https://searchfox.org/mozilla-central/source/widget/windows/nsAppShell.cpp#189

But what's probably happening in this bug is that ScheduleNativeEventCallback is called, and then we shut down processing of the native Windows event loop before calling EventWindowProc, thereby leaking the reference.  *Those* refcounting operations are run on the actual object, and that's why the wrappers weren't being leaked.

That is, this is just the reincarnation of bug 1220517.  More tomorrow when I can think.
Update:

There have been 42 failures in the last 7 days.
All of the failures occur on debug.
Platforms affected:

windows10-64: 28
windows10-64-ccov: 14

Here is a recent log file: 
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=174251092&lineNumber=1942 

Summary: Intermittent leakcheck | default process: 176 bytes leaked (Mutex, nsBaseAppShell)

:froydnj, did you manage to have another look at this?
Flags: needinfo?(nfroyd)
Checking the logs we found that this leak is not on a specific directory.
the ccov failures are on this directory:
toolkit/components/url-classifier/tests/mochitest

and the normal windows10 failures are on:
browser/base/content/test/chrome


we should be able to bisect this down and skip some tests- luckily the failure rate is pretty low these days.
bisection didn't work so well for me, seemed that I couldn't repro or everything reproduced it.  The faillure rate is pretty low where we could consider disabling the url-classifier tests for win10 ccov builds if needed- but I would think we will stay <20 failures/week until something triggers this again.
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Joel - did you try with the patch from bug 1439383? It seemed to make this much more common - might make it easier to narrow it down.
Flags: needinfo?(jmaher)
why did these get annotated on try :(
I tried applying the patch from bug 1439383 and did a bisection again:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=f49e9d82ae4c49327947b6f4e1ec1189e7bdab8e&filter-searchStr=c3&tochange=0cf3bc8396ce0ab411e899f3fd377052982e5969&selectedJob=179893480

I didn't see any on the ccov builds, and a few instances on the debug builds, although so few it is hard to know if there is a pattern or a fix.
Flags: needinfo?(jmaher)
When runnables are posted to the main thread's event loop, the event
loop notifies any thread observers that this has been done.  The app
shell registers itself as just such a runnable, and posts messages to
the native event loop that processing native events needs to be done.

On Windows, this posting takes an extra reference to the app shell, to
keep it alive until the message is processed by the native event loop,
since app shell code needs to be invoked during that processing.  The
processing releases this extra reference, so everything stays balanced.

Except that it's possible for messages to be posted to the native event
loop, and then browser shutdown happens.  Those messages are not
processed and the associated references taken are not released.  This
imbalance means that in debug builds, we appear to be leaking the app
shell, and that leaking results in intermittent oranges.

This intermittent orange has manifested itself in a variety of ways over
the years, depending on how big the app shell itself was (since that
changes the number of bytes leaked) and how many leak-checked things the
app shell was holding on to.  This bug is merely the latest
manifestation; the last serious work on analyzing the phenomenon and
fixing it was done in bug 1220517.

The solution proposed in that bug was that we simply stop the extra
reference counting; when the app shell is destroyed normally, we
shouldn't be processing the native event loop any more anyway.  So even
if the native event loop is holding (freed) pointers to the app shell,
we'd never execute the callback and perform a use-after-free.  Reading
through the code suggests that this *ought* to be the case, but the
potential for shooting ourselves in the foot seems awfully high.

In any event, this is not a problem unique to Windows; we have seen this
same sort of thing happen on OS X.  See nsAppShell::ProcessGeckoEvents
in widget/cocoa/nsAppShell.mm.

Here we propose a slightly different solution: we keep track of the
number of native event callbacks we have scheduled, incrementing when we
schedule, and decrementing when we actually run one.  When the app shell
is destroyed, we simply set the number of outstanding events to zero,
and we prohibit the callback from accessing the app shell if there are
no outstanding events.  This solution is analogous to dropping the extra
reference counting, but avoids potential badness if we do wind up
processing native events after the app shell is destroyed.

This patch is really Ting-Yu's patch from bug 1220517, modified by me to be
slightly safer.  You can see some of the discussion in bug 1220517 comment 31
and following.  bsmedberg and I were skeptical that the proposed fix of just
dropping the refcounting was bulletproof, but it did fix the problem.  This fix
should work just as well, and be slightly less sketchy.
Attachment #8982625 - Flags: review?(jmathies)
Flags: needinfo?(nfroyd)
Comment on attachment 8982625 [details] [diff] [review]
don't hold unnecessary references to the Windows app shell

Review of attachment 8982625 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the detail in the patch. Seems like a safe change.
Attachment #8982625 - Flags: review?(jmathies) → review+
Flags: needinfo?(jmathies)
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5fd3144d5d00
don't hold unnecessary references to the Windows app shell; r=jimm
https://hg.mozilla.org/mozilla-central/rev/5fd3144d5d00
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Duplicate of this bug: 1447440
No reports of this since this patch landed on the 7th. Is this something we could consider uplifting to Beta/ESR60?
Assignee: apavel → nfroyd
Flags: needinfo?(nfroyd)
Comment on attachment 8982625 [details] [diff] [review]
don't hold unnecessary references to the Windows app shell

(In reply to Ryan VanderMeulen [:RyanVM] from comment #99)
> No reports of this since this patch landed on the 7th. Is this something we
> could consider uplifting to Beta/ESR60?

Sure, would basically only matter for tests, though.

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Makes sheriffs' lives easier, perhaps?
User impact if declined: None.
Fix Landed on Version: 62
Risk to taking this patch (and alternatives if risky): Very low risk.
String or UUID changes made by this patch: None

Approval Request Comment
[Feature/Bug causing the regression]: Something lost to the sands of time.
[User impact if declined]: None.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: Not really.
[String changes made/needed]: None.
Flags: needinfo?(nfroyd)
Attachment #8982625 - Flags: approval-mozilla-esr60?
Attachment #8982625 - Flags: approval-mozilla-beta?
Comment on attachment 8982625 [details] [diff] [review]
don't hold unnecessary references to the Windows app shell

Makes Windows a bit less leaky in CI. Approved for 61.0b14 and ESR 60.1.
Attachment #8982625 - Flags: approval-mozilla-esr60?
Attachment #8982625 - Flags: approval-mozilla-esr60+
Attachment #8982625 - Flags: approval-mozilla-beta?
Attachment #8982625 - Flags: approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.