Mochitest harness sometimes hangs after browser is killed

NEW
Assigned to

Status

Testing
Mozbase
2 months ago
2 months ago

People

(Reporter: gbrown, Assigned: gbrown)

Tracking

(Depends on: 1 bug, Blocks: 1 bug)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

2 months ago
See https://bugzilla.mozilla.org/show_bug.cgi?id=1387222#c6. The browser has apparently been killed, but runtests.py remains hung. Is runtests.py running proc.wait(), and if so, what is it waiting for?

Let's take a closer look at that wait() and at least try to dump better diagnostics before hanging.
(Assignee)

Updated

2 months ago
Summary: Mochitest harness sometimes times out after browser is killed → Mochitest harness sometimes hangs after browser is killed
So what happens when the parent process gets killed but the child processes remain? Can this happen given the existence of bug 1404190? If yes, how does mozprocess behave here? AFAIK it waits for all output thread been done, which might hang until the global timeout/force kill by mozharness?
(Assignee)

Comment 2

2 months ago
(In reply to Henrik Skupin (:whimboo) from comment #1)

I don't have definitive answers to most of those questions, but I think you are on the right track: These hangs have something to do with waiting for output, possibly complicated by the multi-process browser.


I realized that my old friend, bug 1339568 is also related. There, mochitest-media-e10s-2 intermittently hangs on shutdown, the hang is detected and nsTerminator causes a crash, but telemetry remains alive and generates output indefinitely, defeating all of our output timeouts. When telemetry is disabled, that shutdown hang again triggers nsTerminator but now there is no output; the 370 second output timeout is triggered, but runtests.py's proc.wait() still does not return:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c20747167166421e61154e796ddaf425a13ba1f3

It turns out that mozprocess is waiting for the ProcessReader threads to end, at

https://dxr.mozilla.org/mozilla-central/rev/0bb0f14672fdda31c19aea1ed829e050d693b9af/testing/mozbase/mozprocess/mozprocess/processhandler.py#997

It appears that the ProcessReader thread is waiting at

https://dxr.mozilla.org/mozilla-central/rev/0bb0f14672fdda31c19aea1ed829e050d693b9af/testing/mozbase/mozprocess/mozprocess/processhandler.py#


I still don't feel I know definitively what has happened to the managed process. The psutil output suggests that the parent is alive after the nsTerminator crash but is successfully killed at the 370 second output timeout, and no children are found. I want to check into this more. If all processes are really dead but we are still waiting to read from a stream, perhaps we need a way to break out of that read; otherwise, perhaps more effective/aggressive killing is called for.
(Assignee)

Comment 3

2 months ago
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=f03a993a99e1fc59d4ef8055c83385bb46f4384f, I dump out a full process list as soon as the 370 second timeout is hit, and then again after killAndGetStack() and associated cleanup.

https://public-artifacts.taskcluster.net/NTsi2dJ9RDW99lwCjQZu2g/0/public/logs/live_backing.log

Initially, firefox is running as pid 3077, and there is a content proc with pid 3127, but no parent (ppid 1):

[task 2017-11-24T18:48:03.687Z] 18:48:03     INFO - {'username': 'worker', 'ppid': 986, 'pid': 3077, 'name': 'firefox'}
[task 2017-11-24T18:48:03.688Z] 18:48:03     INFO - {'username': 'worker', 'ppid': 1, 'pid': 3127, 'name': 'Web Content'}

From earlier in the log, we can see that 3127 is associated with 3077:

[task 2017-11-24T18:22:39.311Z] 18:22:39     INFO - GECKO(3077) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpECE0VA.mozrunner/runtests_leaks_tab_pid3127.log
[task 2017-11-24T18:22:39.529Z] 18:22:39     INFO - GECKO(3077) | ++DOMWINDOW == 9 (0xd8d14000) [pid = 3077] [serial = 9] [outer = 0xddf48400]
[task 2017-11-24T18:22:39.666Z] 18:22:39     INFO - GECKO(3077) | Couldn't convert chrome URL: chrome://branding/locale/brand.properties
[task 2017-11-24T18:22:40.081Z] 18:22:40     INFO - GECKO(3077) | ++DOCSHELL 0xe6a61c00 == 1 [pid = 3127] [id = {f0ba7f73-ec21-4448-89ae-3b0d4706c21c}]

Now we kill Firefox, but find no children:

[task 2017-11-24T18:48:03.690Z] 18:48:03    ERROR - Force-terminating active process(es).
[task 2017-11-24T18:48:03.691Z] 18:48:03     INFO - Determining child pids from psutil
[task 2017-11-24T18:48:03.693Z] 18:48:03     INFO - Found child pids: []
[task 2017-11-24T18:48:03.694Z] 18:48:03     INFO - Killing process: 3077
[task 2017-11-24T18:48:03.695Z] 18:48:03     INFO - TEST-INFO | started process screentopng
[task 2017-11-24T18:48:04.256Z] 18:48:04     INFO - TEST-INFO | screentopng: exit 0
[task 2017-11-24T18:48:04.258Z] 18:48:04     INFO - psutil found pid 3077 dead

Leaving the content process running:

[task 2017-11-24T18:48:04.320Z] 18:48:04     INFO - {'username': 'worker', 'ppid': 1, 'pid': 3127, 'name': 'Web Content'}
(In reply to Geoff Brown [:gbrown] from comment #3)
> Leaving the content process running:
> 
> [task 2017-11-24T18:48:04.320Z] 18:48:04     INFO - {'username': 'worker',
> 'ppid': 1, 'pid': 3127, 'name': 'Web Content'}

I wonder if we always start the content process with ppid 1, or is that maybe caused by a fork? It's weird that we actually loose the parent process of the content process.
I would have expected the browser process to terminate about 18:41:53, after the crash.
I don't know why the log at 18:48:03 still has 3077.

If the browser process has terminated as expected, then the content process is reparented to process 1.
(In reply to Karl Tomlinson (:karlt) from comment #5)
> If the browser process has terminated as expected, then the content process
> is reparented to process 1.

So what happens usually in such a situation with the content processes? Will they run forever? Or which process kills it? Isn't it when the parent process (browser) crashes, that all child processes should be killed? Aren't we doing that?
(Assignee)

Updated

2 months ago
See Also: → bug 1414495
(In reply to Henrik Skupin (:whimboo) from comment #6)
> (In reply to Karl Tomlinson (:karlt) from comment #5)
> > If the browser process has terminated as expected, then the content process
> > is reparented to process 1.
> 
> So what happens usually in such a situation with the content processes? Will
> they run forever?

I assume that content processes usually notice when the parent has crashed and
shut themselves down.  I don't know the details of the content process
shutdown sequence, nor whether this process runs as usual when the browser
process has crashed.  Content processes, however, don't use nsTerminator AFAIK
and so will run forever if they hang.

> Or which process kills it?

In builds without NS_FREE_PERMANENT_DATA (release builds), the browser process
will attempt to kill child processes on browser shutdown if the child
processes take longer than 2 seconds to shutdown.

This mechanism is not effected in other builds, nor when the browser process
crashes before this part of shutdown.

> Isn't it when the parent
> process (browser) crashes, that all child processes should be killed?

Yes, I think that would be sensible.

> Aren't we doing that?

We are not doing anything from outside the content process AFAIK.

I don't know whether or not a previous version of the test harness did this,
but I suspect the best solution would be for the content process to have
something similar to nsTerminator for bug 1404190.
(Assignee)

Updated

2 months ago
Depends on: 1421289
(Assignee)

Updated

2 months ago
Depends on: 1421295
You need to log in before you can comment on or make changes to this bug.