Closed Bug 1368036 Opened 7 years ago Closed 5 years ago

Intermittent Windows TC TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!

Categories

(Core :: IPC, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure)

01:19:34     INFO -  Traceback (most recent call last):
01:19:34     INFO -    File "Z:\task_1496625151\build\tests\mochitest\runtests.py", line 2521, in doTests
01:19:34     INFO -      marionette_args=marionette_args,
01:19:34     INFO -    File "Z:\task_1496625151\build\tests\mochitest\runtests.py", line 2118, in runApp
01:19:34     INFO -      self.marionette.start_session(timeout=port_timeout)
01:19:34     INFO -    File "Z:\task_1496625151\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
01:19:34     INFO -      m._handle_socket_failure()
01:19:34     INFO -    File "Z:\task_1496625151\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
01:19:34     INFO -      return func(*args, **kwargs)
01:19:34     INFO -    File "Z:\task_1496625151\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1320, in start_session
01:19:34     INFO -      self.protocol, _ = self.client.connect()
01:19:34     INFO -    File "Z:\task_1496625151\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
01:19:34     INFO -      self.sock.connect((self.addr, self.port))
01:19:34     INFO -    File "c:\mozilla-build\python\Lib\socket.py", line 228, in meth
01:19:34     INFO -      return getattr(self._sock,name)(*args)
01:19:34     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
this failure started up on July 11th, it is occurring often enough, but not a high priority right now.
This is the log spew from the process that doesn't make a log:

08:31:22     INFO - GECKO(3852) | [Child 3348] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:31:22     INFO - GECKO(3852) | ###!!! [Child][RunMessage] Error: Channel error: cannot send/recv
08:31:22     INFO - GECKO(3852) | [Child 3348] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:31:22     INFO - GECKO(3852) | [Child 3348] WARNING: MsgDropped in ContentChild: file c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/dom/ipc/ContentChild.cpp, line 2184
08:31:22     INFO - GECKO(3852) | [Child 3348] WARNING: shutting down early because of crash!: file c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/dom/ipc/ContentChild.cpp, line 2145

We might be able to change shutdown in leak checking builds so it doesn't ever do that quick exit, but maybe we'd get so many leaks it wouldn't matter.

But the basic problem seems to be the pipe error. I'm not sure what that is from.
Component: General → IPC
I also see this a few times near there, which may or may not be related:
[Parent 3852] WARNING: Created child without a matching parent?: file c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/gfx/layers/ipc/CrossProcessCompositorBridgeParent.cpp, line 92
If this started on the 11th, maybe this is related to bug 1373660? Comment 7 is about a beta crash, but I looked at one on trunk, and the process kind of started up then didn't do anything, which sounds kind of like the preallocated process.
Flags: needinfo?(gkrizsanits)
(In reply to Andrew McCreight [:mccr8] from comment #9)
> If this started on the 11th, maybe this is related to bug 1373660? Comment 7
> is about a beta crash, but I looked at one on trunk, and the process kind of
> started up then didn't do anything, which sounds kind of like the
> preallocated process.

Looks suspicious. That patch was backed out first, and I think that date checks out as well...

Not sure about the failure on beta though, I don't think this patch landed there, and I don't think the preallocated process manager is active on try on the beta channel either (the default process count is still 1 there and that disables the ppm).

(In reply to Andrew McCreight [:mccr8] from comment #8)
> I also see this a few times near there, which may or may not be related:
> [Parent 3852] WARNING: Created child without a matching parent?: file
> c:/builds/moz2_slave/m-beta-w32-d-00000000000000000/build/src/gfx/layers/ipc/
> CrossProcessCompositorBridgeParent.cpp, line 92

I don't know what this warning stands for, so I'm not sure. David, do you think an empty content process (one without any tab) can trigger something like this? Or could you explain to me what this warning means and when can it happen?
Flags: needinfo?(dvander)
I saw the compositor message on beta, but not inbound. Maybe there are multiple issues.
That warning can happen ifthe UI process opens a remote tab, shows it, and then immediately closes it. By the time the content process gets the "open tab" message, it tries to create a layer tree, but the compositor can't find any window to attach it to because the tab has been closed.
Flags: needinfo?(dvander)
Note that it should be a harmless message, it's only bad if we expect a tab to be visible and it's not.
So the preallocated process manager was turned off on nightly in bug 1376895 on 18th of July. So my patch was probably a no-op, which makes it less likely that this problem was caused by it... Unless if that one tests which turns it on causes the leak.
Flags: needinfo?(gkrizsanits)
I looked at a couple of logs, and the content process was just started by random tests that don't seem to do anything specifically related to the preallocated process manager.
Priority: -- → P3
See Also: → 1407803
we have 10+ failures in this bug and 30+ failures in bug 1281306, we should take a look at this.

:billm, I see you are the triage owner here- is this something you could help find the right person to start looking at these leaks?
Flags: needinfo?(wmccloskey)
See Also: → 1281306
I looked at two of these failures. They're Windows-only and they seem to be caused by a content process printing "shutting down early because of crash!" and then dying. As a result, the content process doesn't dump out leak information.

We also get a MsgDropped error in some of the failures. And there's always a slew of "pipe error: 109" messages, but that seems to happen for all the content processes.

I'm going to forward this to Jim since he's really the triage owner for IPC. Maybe someone like Bob could look at this?
Flags: needinfo?(wmccloskey) → needinfo?(jmathies)
Flags: needinfo?(jmathies)
Whiteboard: low-volume
Over the last 7 days there are 32 failures present on this bug. These happen on Windows 7, windows10-64 and windows10-64-ccov.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=168143419&lineNumber=12636

Here is a relevant part of that log: 

06:51:22     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 4784
06:51:22     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:51:22     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:51:22     INFO -     0 |TOTAL                                 |       77        0|   49784        0|
06:51:22     INFO -  nsTraceRefcnt::DumpStatistics: 1049 entries
06:51:22     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
06:51:22    ERROR -  776 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
06:51:22     INFO -  TEST-INFO | leakcheck | missing output line from log file c:\users\genericworker\appdata\local\temp\tmpbqabpe.mozrunner\runtests_leaks_tab_pid5904.log
06:51:22     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 7012
Flags: needinfo?(jmathies)
I wonder if this is related to bug 1348361. The last time this spiked up it seemed like the preallocated process was involved. The timing isn't quite right, though. This started happening more on maybe the 11th or 13th, and bug 1348361 didn't land until the 16th. Bug 1445249 (one of the predecessor bugs to that) did land on the 13th. Maybe that's a red herring.
Over the last 7 days there are 41 failures on this bug. These happen on windows10-64 and Windows 7

Here is a recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=170657195&lineNumber=15114

Relevant part of the log: 

20:55:41     INFO -  nsTraceRefcnt::DumpStatistics: 926 entries
20:55:41     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
20:55:41    ERROR -  619 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
20:55:41     INFO -  TEST-INFO | leakcheck | missing output line from log file c:\users\genericworker\appdata\local\temp\tmp6vih7r.mozrunner\runtests_leaks_tab_pid8336.log
There are 63 failures in the past 7 days, all of them on Windows 7 and Windows10-64 debug.
Recent log failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=172030105&lineNumber=24311
Relevant part of the log:
06:43:27     INFO -  Stopping web server
06:43:27     INFO -  Stopping web socket server
06:43:27     INFO -  Stopping ssltunnel
06:43:27     INFO -  TEST-INFO | leakcheck | default process: leak threshold set at 0 bytes
06:43:27     INFO -  TEST-INFO | leakcheck | plugin process: leak threshold set at 0 bytes
06:43:27     INFO -  TEST-INFO | leakcheck | tab process: leak threshold set at 1000 bytes
06:43:27     INFO -  TEST-INFO | leakcheck | geckomediaplugin process: leak threshold set at 20000 bytes
06:43:27     INFO -  TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes
06:43:27     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 8136
06:43:27     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:43:27     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:43:27     INFO -     0 |TOTAL                                 |       41        0| 3384286        0|
06:43:27     INFO -  nsTraceRefcnt::DumpStatistics: 1866 entries
06:43:27     INFO -  TEST-PASS | leakcheck | default process: no leaks detected!
06:43:27     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 3320
06:43:27     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:43:27     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:43:27     INFO -     0 |TOTAL                                 |       77        0|   36226        0|
06:43:27     INFO -  nsTraceRefcnt::DumpStatistics: 822 entries
06:43:27     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
06:43:27    ERROR -  754 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
06:43:27     INFO -  TEST-INFO | leakcheck | missing output line from log file c:\users\task_1522909241\appdata\local\temp\tmp8igu5i.mozrunner\runtests_leaks_tab_pid3844.log
06:43:27     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 4684
06:43:27     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:43:27     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:43:27     INFO -     0 |TOTAL                                 |       72        0|   30565        0|
06:43:27     INFO -  nsTraceRefcnt::DumpStatistics: 693 entries
06:43:27     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
06:43:27     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 4852
06:43:27     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:43:27     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:43:27     INFO -     0 |TOTAL                                 |       45        0|  166632        0|
06:43:27     INFO -  nsTraceRefcnt::DumpStatistics: 1053 entries
06:43:27     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
06:43:27     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 5292
06:43:27     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:43:27     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
06:43:27     INFO -     0 |TOTAL                                 |       76        0|   52061        0|
06:43:27     INFO -  nsTraceRefcnt::DumpStatistics: 1024 entries
06:43:27     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
06:43:27     INFO -  runtests.py | Running tests: end.
06:43:27     INFO -  Buffered messages finished
:jimm Can you please take a look?
:mccr8, could you take a look at this failure- it seems to be win7/win10 specific, predominately on browser-chrome (oddly we don't see this on devtools which is also browser chrome).  Could this be a harness issue that we need to cleanup better or shutdown the browser differently?  Maybe we need to wait longer as the majority of the failure cases the browser completes the tests and closes down in <30 seconds.
Flags: needinfo?(continuation)
I looked through three logs. The processes that are not producing leak logs get created, then create and destroy a few windows and docshells. The windows are for URLs like about:blank and data:text/html,Test.

Then there is the "*** End BrowserChrome Test Results ***" message, and a little after that there are messages like this:

12:38:21     INFO -  GECKO(2832) | [Child 3504, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
12:38:21     INFO -  GECKO(2832) | [Child 3504, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
12:38:21     INFO -  GECKO(2832) | [Child 3504, Main Thread] WARNING: shutting down early because of crash!: file z:/build/build/src/dom/ipc/ContentChild.cpp, line 2355

I'm not sure why we're getting an abnormal shutdown. Maybe the parent shut down before the child and eventually the pipe breaks and the child shuts down?

I'm afraid I don't know too much about child process creation and shutdown. You could see if it reproduces with bug 1348361 backed out. That was my only theory when this started showing up again.

Blake, do you know of any other changes to how we create and destroy child processes that has landed around the 11th or 13th of March?
Flags: needinfo?(continuation) → needinfo?(mrbkap)
baseline had 12 instances

backout had 6 instances
:jmaher, :mccr8, :mrbkap do you guys have any updates for this bug?
Flags: needinfo?(jmaher)
Flags: needinfo?(continuation)
thanks for checking Arthur- I haven't looked into this more and the rate is down <15/week so far this week and last week.
Flags: needinfo?(jmaher)
Whiteboard: low-volume[stockwell disable-recommended] → low-volume[stockwell unknown]
I don't know any more about how to fix this.
Flags: needinfo?(continuation)
Flags: needinfo?(jmathies)
Whiteboard: low-volume[stockwell unknown] → [low-volume][stockwell unknown]
My first instinct was that bug 1423261 could be related here, but the dates for checkins in that bug don't seem to line up with the dates in this bug. I don't really have any more ideas, sadly.
Flags: needinfo?(mrbkap)
Update:
In the last 7 days, there have been 31 failures.
All of them occur on the debug build type and on the following platforms:
- windows10-64 - 15
- windows7-32 - 15
- linux32 - 1

Here is a recent relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?job_id=185202671&repo=mozilla-inbound&lineNumber=2467

16:50:14     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
16:50:14     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 5944
16:50:14     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
16:50:14     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
16:50:14     INFO -     0 |TOTAL                                 |       29        0|  348715        0|
16:50:14     INFO -  nsTraceRefcnt::DumpStatistics: 1036 entries
16:50:14     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
16:50:14    ERROR -  60 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
16:50:14     INFO -  TEST-INFO | leakcheck | missing output line from log file c:\users\task_1530117000\appdata\local\temp\tmpp9mde8.mozrunner\runtests_leaks_tab_pid7148.log
16:50:14     INFO -  runtests.py | Running tests: end.
16:50:14     INFO -  Buffered messages finished
16:50:14     INFO -  Running manifest: browser\base\content\test\tabcrashed\browser.ini
16:50:15     INFO -  Z:\task_1530117000\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL

:jmaher, this started to peak again on June 24th, with 19 failures in the last 3 days. Looking back over the last 30 days this had 73 failures. Can you please have a look?
Flags: needinfo?(jmaher)
not sure if that is the root cause, but it does seem around that point.  I find that windows7 and windows10 are the same frequency- predominately in the browser-chrome suite, but a mix of devtools and mochitest-plain are there as well.  I did a bunch of retriggers to help figure this out with no luck:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=browser-chrome%20win%20debug&tochange=f0f5158db88c8a9712b7e09fccaf10d004343e85&fromchange=721354ad7b1622ad9e681d69591a991b1d45a17d

lets see if there is a pattern emerging in the forthcoming week.
Flags: needinfo?(jmaher)
I did many retriggers:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=browser-chrome-e10s-2%20win%20debug&tochange=f0f5158db88c8a9712b7e09fccaf10d004343e85&fromchange=c24b5f6944bb62ba04d67af93702aed2743cecda&selectedJob=185371805

this is deceptive as we don't have a single test, but when this became more frequent, I see that we changed many of the browser_* files (which means tests shifted between chunks).

I am done spending time on this bug for now, will revisit with more data.
Over the last 7 days there are 41 failures on this bug. These happen on windows7-32, linux64, windows10-64

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=186814999&repo=mozilla-central&lineNumber=3393

11:09:08     INFO -  nsTraceRefcnt::DumpStatistics: 817 entries
11:09:08     INFO -  TEST-PASS | leakcheck | tab process: no leaks detected!
11:09:08    ERROR -  169 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
11:09:08     INFO -  TEST-INFO | leakcheck | missing output line from log file c:\users\task_1530874780\appdata\local\temp\tmplmf9cl.mozrunner\runtests_leaks_tab_pid5544.log
11:09:08     INFO -  == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 5948
11:09:08     INFO -       |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
11:09:08     INFO -       |                                      | Per-Inst   Leaked|   Total      Rem|
11:09:08     INFO -     0 |TOTAL                                 |       42        0|   35809        0|
Flags: needinfo?(jmathies)
Mike, do you have any idea what is causing the increase of intermittents here? The 5 logs I checked all run different test folders before the hit the issue. Thank you in advance.
Flags: needinfo?(mconley)
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #95)
> Mike, do you have any idea what is causing the increase of intermittents
> here?

No, I don't. The "shutting down early because of crash!" message is probably what needs to be investigated here. That message gets logged when a content process (or plugin process, although I think that's less relevant here) gets ContentChild::ActorDestroy called on it with an AbnormalShutdown. AbnormalShutdown, as I understand it, occurs when there's some kind of foul-up communicating across the IPC protocol.

So it might be worth understanding what's passing in the AbnormalShutdown.

I've pushed a patch to try to try to get a stack when that occurs.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b260b960f8737d579b74daabc27cad3dcc878600
I did some IPC logging on a try run where the problem reproduced.

Push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a8d17feda25a4a405177de8fba44891176db369
Full log: https://taskcluster-artifacts.net/VGltnHSaQrCV0-fojgGTag/0/public/logs/live_backing.log

Relevant snippet:

16:35:36     INFO -  GECKO(6392) | [Child 5852, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file z:/build/build/src/xpcom/base/nsTraceRefcnt.cpp, line 196
16:35:36     INFO -  GECKO(6392) | nsStringStats
16:35:36     INFO -  GECKO(6392) |  => mAllocCount:           7968
16:35:36     INFO -  GECKO(6392) |  => mReallocCount:          301
16:35:36     INFO -  GECKO(6392) |  => mFreeCount:            7968
16:35:36     INFO -  GECKO(6392) |  => mShareCount:           6221
16:35:36     INFO -  GECKO(6392) |  => mAdoptCount:            902
16:35:36     INFO -  GECKO(6392) |  => mAdoptFreeCount:        908
16:35:36     INFO -  GECKO(6392) |  => Process ID: 5852, Thread ID: 7976
16:35:36     INFO -  GECKO(6392) | [Child 4860: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 4860: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7528: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7528: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7528: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7528: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7528: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7528: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7528: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7528: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | --DOCSHELL 000001C89DF02800 == 4 [pid = 6392] [id = {bca66967-4d55-48a5-8e5e-7299d21028e7}]
16:35:36     INFO -  GECKO(6392) | --DOCSHELL 000001C89886F000 == 3 [pid = 6392] [id = {15382fcc-a583-447c-ac57-37532be31594}]
16:35:36     INFO -  GECKO(6392) | --DOCSHELL 000001C89CE5F000 == 2 [pid = 6392] [id = {e0111f91-baa5-49e1-952c-97524c7bf9d6}]
16:35:36     INFO -  GECKO(6392) | --DOCSHELL 000001C898870800 == 1 [pid = 6392] [id = {5d6bd7a8-febb-46a9-a589-6e669a6f4c11}]
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc OnChannelErrorFromLink
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Gecko_IOThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Parent 6392: Compositor]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 4860: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 4732: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | [Child 4860: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7528: Chrome_ChildThread]: D/ipc OnChannelErrorFromLink
16:35:36     INFO -  GECKO(6392) | [Child 4732: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | [Child 7164: Chrome_ChildThread]: D/ipc Receive on link thread; seqno=0, xid=-1, shouldWakeUp=0
16:35:36     INFO -  GECKO(6392) | --- Mutex : mozilla::mscom::LiveSet::mMutex (currently acquired)
16:35:36     INFO -  GECKO(6392) |  calling context
16:35:36     INFO -  GECKO(6392) |   [stack trace unavailable]
16:35:36     INFO -  GECKO(6392) | [Child 7528, Main Thread] WARNING: shutting down early because of crash!: file z:/build/build/src/dom/ipc/ContentChild.cpp, line 2365
16:35:36     INFO -  GECKO(6392) | [Child 7164: Main Thread]: D/ipc DispatchMessage: seqno=0, xid=-1
16:35:36     INFO -  GECKO(6392) | Assertion failure: false (OH SNAP - here we go.), at z:/build/build/src/dom/ipc/ContentChild.cpp:2366


I suspect we want someone who's more familiar with IPC to look over this to see if they can make heads or tails of it. I'm afraid I can't.
Flags: needinfo?(mconley)
Update: 
There have been 93 failures in the last 7 days.
Debug is the only build type.
Failures per platform and build type:
- windows10-64: 55
- windows7-32: 27
- windows10-64-msvc: 5
- windows10-64-ccov: 3
- windows7-32-msvc: 3

Recent log file:
https://treeherder.mozilla.org/logviewer.html#?job_id=193466986&repo=mozilla-inbound&lineNumber=27334
Over the last 7 days there are 63 failures on this bug. These happen on windows10-64, windows10-64-msvc, windows7-32, windows7-32-msvc

Here is the latest log example: https://treeherder.mozilla.org/logviewer.html#?job_id=194806861&repo=autoland&lineNumber=26098

08:46:19     INFO - 
08:46:19     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
08:46:19     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
08:46:19     INFO -    0 |TOTAL                                 |       42        0|   56624        0|
08:46:19     INFO - 
08:46:19     INFO - nsTraceRefcnt::DumpStatistics: 835 entries
08:46:19     INFO - 
08:46:19     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
08:46:19     INFO - 
08:46:19    ERROR - TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
08:46:19     INFO - TEST-INFO | leakcheck | missing output line from log file c:\users\task_1534753478\appdata\local\temp\tmplbm942.mozrunner\runtests_leaks_tab_pid6032.log
08:46:19     INFO - 
08:46:19     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6084
08:46:19     INFO - 
08:46:19     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
08:46:19     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
08:46:19     INFO -    0 |TOTAL                                 |       36        0|  171873        0|
08:46:19     INFO -
Over the last 7 days there are 139 failures on this bug. These happen on windows7-32 and windows10-64.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=196142528&repo=autoland&lineNumber=32407

18:05:12     INFO - nsTraceRefcnt::DumpStatistics: 907 entries
18:05:12     INFO - 
18:05:12     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
18:05:12     INFO - 
18:05:12    ERROR - TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
18:05:12     INFO - TEST-INFO | leakcheck | missing output line from log file c:\users\task_1535391153\appdata\local\temp\tmpugasq5.mozrunner\runtests_leaks_tab_pid6660.log
18:05:12     INFO - 
18:05:12     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6716
18:05:12     INFO - 
18:05:12     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
18:05:12     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
18:05:12     INFO -    0 |TOTAL                                 |       37        0|  324885        0|
18:05:12     INFO -
Over the last 7 days there are 157 failures on this bug. These happen on windows7-32 and windows10-64.

Log:
https://treeherder.mozilla.org/logviewer.html#?job_id=197542296&repo=autoland&lineNumber=17747

INFO - 
05:15:24     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
05:15:24     INFO - 
05:15:24    ERROR - TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
05:15:24     INFO - TEST-INFO | leakcheck | missing output line from log file c:\users\task_1536122844\appdata\local\temp\tmptqrz9k.mozrunner\runtests_leaks_tab_pid4892.log
05:15:24     INFO - 
05:15:24     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 5908
05:15:24     INFO - 
05:15:24     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
05:15:24     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
05:15:24     INFO -    0 |TOTAL                                 |       28        0|   34239        0|
05:15:24     INFO - 
05:15:24     INFO - nsTraceRefcnt::DumpStatistics: 929 entries
05:15:24     INFO - 
05:15:24     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!

jimm: Can you please take a look at this bug?
Whiteboard: [low-volume][stockwell disable-recommended] → [low-volume][stockwell disable-recommended] [stockwell needswork]
Update:
There have been 165 failures in the last week.
All the failures occur on the debug build type.
The most affected platforms are windows7-32 and windows10-64.

Summary: Intermittent Windows TC leakcheck | tab process: missing output line for total leaks!

Recent relevant log file:
https://treeherder.mozilla.org/logviewer.html#?job_id=199018973&repo=mozilla-central&lineNumber=33296
218 failures for the past week occurring on windows10-64 and windows7-32 debug.

Recent log file: https://treeherder.mozilla.org/logviewer.html#?job_id=200623450&repo=autoland&lineNumber=28214
There are 216 failures in the last 7 days. They occur on windows10-64 and windows7-32 debug.
This has a very high failure rate. As I see in IFV this started to increase from the 7th of September.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=203954806&repo=autoland&lineNumber=28219

jimm: I see that you are the triage owner, can you please take a look into this or point to someone that can? Thanks!
Looking through the history of this bug I see that there were many attempts of finding the root cause and fixing it. The last one was from mconley, 3 months ago, Comment 104. 
jmaher: Can we ni someone else on this bug more familiar with IPC? Thank you!
Flags: needinfo?(jmaher)
:jimm is the right person, I would like to leave this for him, but 3 months of a needinfo seems long
Flags: needinfo?(jmathies)
Flags: needinfo?(jmaher)
:selena, do you know if :jimm is on leave or has been on pto a lot in the last 3 months- we have some of our most frequent intermittents as tracked down to IPC related.  Maybe there is someone else besides :jimm who could help out?
Flags: needinfo?(sdeckelmann)
I've asked :jimm to have a look.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Priority: P3 → P2
Flags: needinfo?(jmathies)
From https://taskcluster-artifacts.net/eHydZWVNST6GLWnu8PNTdw/0/public/logs/live_backing.log, where pid 6032 is the one with the missing leak log:

08:44:40     INFO - GECKO(8120) | [Parent 8120, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:44:40     INFO - GECKO(8120) | [Child 6032, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:44:40     INFO - GECKO(8120) | [Child 6032, Main Thread] WARNING: shutting down early because of crash!: file z:/build/build/src/dom/ipc/ContentChild.cpp, line 2400

Normally I'd expect to see the broken pipe error from one side when the other one exits early/unexpectedly (see also comment #70) but this seems to be from both sides at once.  Slightly later in the log, around 08:44:42-43, there are similar collections of pipe error messages for other child pids, but none of the “shutting down early”; these might be toplevels other than PContent.

It's probably also significant that this happens only on Windows, where we use Windows named pipes instead of POSIX local-domain sockets.  (I see that there's been an occasional Linux or Mac failure assigned to this bug, but I suspect those are unrelated.)
Hi Matt, are there any updates here? 

Failure log:  https://treeherder.mozilla.org/logviewer.html#?job_id=204656914&repo=mozilla-inbound&lineNumber=27802

00:33:40     INFO - GECKO(4852) | MEMORY STAT | vsize 1483MB | vsizeMaxContiguous 130912257MB | residentFast 131MB | heapAllocated 26MB
00:33:40     INFO - TEST-OK | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_storage_manager_capabilities.html | took 2727ms
00:33:40     INFO - TEST-START | Shutdown
00:33:40     INFO - Passed:  15
00:33:40     INFO - Failed:  0
00:33:40     INFO - Todo:    0
00:33:40     INFO - Mode:    e10s
00:33:40     INFO - Slowest: 2727ms - /tests/toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_storage_manager_capabilities.html
00:33:40     INFO - SimpleTest FINISHED
00:33:40     INFO - TEST-INFO | Ran 1 Loops
00:33:40     INFO - SimpleTest FINISHED

0:33:41     INFO - nsTraceRefcnt::DumpStatistics: 929 entries
00:33:41     INFO - 
00:33:41     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
00:33:41     INFO - 
00:33:41    ERROR - TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
00:33:41     INFO - TEST-INFO | leakcheck | missing output line from log file c:\users\task_1539216339\appdata\local\temp\tmpuzcuui.mozrunner\runtests_leaks_tab_pid6116.log
00:33:41     INFO - 
00:33:41     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 7684
Flags: needinfo?(matt.woodrow)
Assignee: nobody → agashlin
Flags: needinfo?(jmathies)
Summary: Intermittent Windows TC leakcheck | tab process: missing output line for total leaks! → Intermittent Windows TC TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
Whiteboard: [low-volume][stockwell disable-recommended] [stockwell needswork] → [stockwell disable-recommended] [stockwell needswork]
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #170)
> Normally I'd expect to see the broken pipe error from one side when the
> other one exits early/unexpectedly (see also comment #70) but this seems to
> be from both sides at once.

It's not.

I did a Try run with extra logging, and they're different channels, and also neither one is the main (PContent) channel.  The channel where the child gets a pipe error immediately before or sometimes *after* the “shutting down early because of crash” is the process hang monitor; the one where the parent gets a pipe error right before that is the profiler.  Also, there sometimes *isn't* a pipe error message about the PContent channel (which is definitely getting some kind of error, because that's how we end up in the “because of crash” case), but in one test run there was.

I also tried changing that warning to a MOZ_CRASH, to try to reproduce this on opt builds (the early exit won't cause the leak checking to fail, because if I understand correctly we don't do that kind of leak checking on opt builds, so it makes sense that we never see it there), in the hope that I could get more repros per unit CPU time, but all that got me was a bunch of crashes with no backtrace or file/line info.
Flags: needinfo?(matt.woodrow)
There are 98 failures asociated to this bug in the last 7 days. These are occurring on Windows 7 and 10 on the debug builds.

:jimm any updates on this?
Flags: needinfo?(jmathies)
(In reply to Andrei Ciure[:andrei_ciure_] from comment #184)
> There are 98 failures asociated to this bug in the last 7 days. These are
> occurring on Windows 7 and 10 on the debug builds.
> 
> :jimm any updates on this?

Not yet, I have someone looking at it.
Flags: needinfo?(jmathies)
In the last 7 days there have been 111 occurrences Windows 7 and Windows 10 debug.
Jim  let us know when there are any update here, this continues to have a high occurrence rate. 
Thank you.
Flags: needinfo?(jmathies)
(In reply to Andreea Pavel [:apavel] from comment #194)
> Jim  let us know when there are any update here, this continues to have a
> high occurrence rate. 
> Thank you.

We haven't made any progress on this. Reproducing seems to be a problem. I currently don't have anyone working on this.
Flags: needinfo?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #201)
> (In reply to Andreea Pavel [:apavel] from comment #194)
> > Jim  let us know when there are any update here, this continues to have a
> > high occurrence rate. 
> > Thank you.
> 
> We haven't made any progress on this. Reproducing seems to be a problem. I
> currently don't have anyone working on this.

Thank you Jim. I looked over some failure logs, the tests where this fails differ, the only common thing is that this fails on windows debug. (failures on other platforms are missclassifications)
This bug has failed 194 times in the last 7 days. It occurs on Windows10-64, Windows7-32 on debug build types.

Log:
https://treeherder.mozilla.org/logviewer.html#?job_id=210502906&repo=autoland&lineNumber=27629

 TEST-PASS | leakcheck | tab process: no leaks detected!
06:16:10     INFO - 
06:16:10     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 5936
06:16:10     INFO - 
06:16:10     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:16:10     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
06:16:10     INFO -    0 |TOTAL                                 |       40        0|   39217        0|
06:16:10     INFO - 
06:16:10     INFO - nsTraceRefcnt::DumpStatistics: 937 entries
06:16:10     INFO - 
06:16:10     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
06:16:10     INFO - 
06:16:10     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6012
06:16:10     INFO - 
06:16:10     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:16:10     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
06:16:10     INFO -    0 |TOTAL                                 |       36        0|   51765        0|
06:16:10     INFO - 
06:16:10     INFO - nsTraceRefcnt::DumpStatistics: 869 entries
06:16:10     INFO - 
06:16:10     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
06:16:10     INFO - 
06:16:10    ERROR - TEST-UNEXPECTED-FAIL | leakcheck | tab process: missing output line for total leaks!
06:16:10     INFO - TEST-INFO | leakcheck | missing output line from log file c:\users\task_1541656647\appdata\local\temp\tmpwg7f7m.mozrunner\runtests_leaks_tab_pid6396.log
06:16:10     INFO - 
06:16:10     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6408
06:16:10     INFO - 
06:16:10     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:16:10     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
06:16:10     INFO -    0 |TOTAL                                 |       41        0|   33637        0|
06:16:10     INFO - 
06:16:10     INFO - nsTraceRefcnt::DumpStatistics: 912 entries
06:16:10     INFO - 
06:16:10     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
06:16:10     INFO - 
06:16:10     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6724
06:16:10     INFO - 
06:16:10     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
06:16:10     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
06:16:10     INFO -    0 |TOTAL                                 |       41        0|   34442        0|
06:16:10     INFO - 
06:16:10     INFO - nsTraceRefcnt::DumpStatistics: 915 entries
06:16:10     INFO - 
06:16:10     INFO - TEST-PASS | leakcheck | tab process: no leaks detected!
06:16:10     INFO - 
06:16:10     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 6772

jimm: Can you please take a look at this bug?
Flags: needinfo?(jmathies)
I did another Try push with instrumentation: https://treeherder.mozilla.org/#/jobs?repo=try&revision=833186543279cbf1a6ed1349f864ebd73abc1ed4

It doesn't tell me much, but it does show that the error is associated with reading, not writing.  And, both with these as well as the pipe errors that don't cause this bug, sometimes we get error 109 as part of the OS event that prompts us to read (or however that part works), but more often that's not returned until the actual read operation; I have no idea if that's meaningful.  


I have a vague idea: this might be a race between the parent and child closing the channel, where if the parent closes first then the child sees a broken pipe error and assumes the browser crashed, but most of the time the timing works out so that the child closes first and the parent handles the error indication more gracefully.

But that doesn't explain why we never see this on Unix, because an end-of-file on read would be reported as an error if I'm reading [1] correctly.  Maybe the timing is different, but it's surprising that we *never* see this there (and also seem to get a lot less of the spurious broken pipes).

And I don't yet understand how IPC channels are closed in a way that *isn't* considered an error, or if that's even possible.  (As always with IPC, there isn't a lot of documentation, and the people who created it are no longer here.)

[1] https://searchfox.org/mozilla-central/rev/d850d799a0009f851b5535580e0a8b4bb2c591d7/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#369
Assignee: agashlin → nobody
Flags: needinfo?(jmathies)
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #213)
> I did another Try push with instrumentation:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=833186543279cbf1a6ed1349f864ebd73abc1ed4
> 
> It doesn't tell me much, but it does show that the error is associated with
> reading, not writing.  And, both with these as well as the pipe errors that
> don't cause this bug, sometimes we get error 109 as part of the OS event
> that prompts us to read (or however that part works), but more often that's
> not returned until the actual read operation; I have no idea if that's
> meaningful.  
> 
> 
> I have a vague idea: this might be a race between the parent and child
> closing the channel, where if the parent closes first then the child sees a
> broken pipe error and assumes the browser crashed, but most of the time the
> timing works out so that the child closes first and the parent handles the
> error indication more gracefully.
> 
> But that doesn't explain why we never see this on Unix, because an
> end-of-file on read would be reported as an error if I'm reading [1]
> correctly.  Maybe the timing is different, but it's surprising that we
> *never* see this there (and also seem to get a lot less of the spurious
> broken pipes).
> 
> And I don't yet understand how IPC channels are closed in a way that *isn't*
> considered an error, or if that's even possible.  (As always with IPC, there
> isn't a lot of documentation, and the people who created it are no longer
> here.)
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> d850d799a0009f851b5535580e0a8b4bb2c591d7/ipc/chromium/src/chrome/common/
> ipc_channel_posix.cc#369

Thanks for looking Jed. This still has a high occurrence rate and since the tests that fail differ, disabling this is not the way to go.
Whiteboard: [stockwell disable-recommended] [stockwell needswork] → [stockwell needswork:owner]
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #213)
> And I don't yet understand how IPC channels are closed in a way that *isn't*
> considered an error, or if that's even possible.  (As always with IPC, there
> isn't a lot of documentation, and the people who created it are no longer
> here.)

We do: see bug 529005.  This is GoodbyeMessage and GOODBYE_MESSAGE_TYPE, which I didn't notice at first because they're defined as part of our MessageChannel overlay[1], whereas HELLO_MESSAGE_TYPE is part of the channel code from Chromium[2].

It looks like this is taken into account in the error-handling path[3][4], but there was some uncertainty on the topic of simultaneous Close().  Also, I wonder if there might be OS-specific quirks around sending a message and then immediately closing the pipe before it's read.

So, an idea for another Try run with logging: log when goodbye messages are sent and received, and when a channel error is/isn't turned into a channel close, and see if anything unusual happens here.


[1] https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/ipc/glue/ProtocolUtils.h#61
[2] https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/ipc/chromium/src/chrome/common/ipc_channel.h#176
[3] https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/ipc/glue/MessageChannel.cpp#2656
[4] https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/ipc/glue/MessageChannel.cpp#2672-2673
There are 149 failures asociated to this bug in the last 7 days. These are occurring on Windows 7 and 10 debug only.

Any updates on this?
Flags: needinfo?(jmathies)
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][stockwell needswork]
There are 113 total failures in the last 7 days and 682 total failures in the last 30 days, all on windows debug. 

Jed are you still working on this? do you have updates?
Flags: needinfo?(jld)

I tried to reproduce this and failed, and it looks like it may have been fixed.

The original problem here was spuriously getting missing output line for total leaks without an actual crash; that error is expected if a process exits abnormally. Also, there was always the shutting down early because of crash! error (indicating a content process thinking the parent process had crashed), and it was always just one content process, not a large number of processes.

All of the reports since about December 5th (between comment #237 and comment #238), as well as all the failures when I pushed to Try yesterday and today, have either had an actual PROCESS-CRASH, or something where apparently all the child processes (including RDD and GPU) fail to write leak logs; in no case is there the shutting down early because of crash! message.

Specifically, looking at the list in orangefactor, the last real failure is from revision 55b77d1533ab6 and the first lookalike is from revision d8a3f89d4bb67f. In between those there are a few IPC changes: bug 1487249, bug 1500944, bug 1509362, and bug 1512085.

Leaving the needinfo because I'd like to look at this some more (and maybe try bisecting to find the responsible commit) before declaring victory.

Flags: needinfo?(jmathies)
Priority: P2 → P3
Whiteboard: [stockwell disable-recommended][stockwell needswork]

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #242)

Leaving the needinfo because I'd like to look at this some more (and maybe try bisecting to find the responsible commit) before declaring victory.

I'm not going to have time for this. Resolving the bug so that people will stop assigning unrelated crashes to it.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(jld)
Resolution: --- → FIXED

Ideally leakcheck or something would be smarter and not flag this failure if there was already some other crash explicitly recognized.

You need to log in before you can comment on or make changes to this bug.