Closed Bug 1521447 Opened 6 years ago Closed 3 years ago

[mozprocess] kill sometimes fails on win32 to kill Firefox, and waits forever for process to shutdown

Categories

(Testing :: Mozbase, defect, P2)

x86
Windows 7
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: hang, regression)

Here is an example from bug 1391545:

06:32:16 INFO - 1547620336776 Marionette DEBUG 3 <- [1,6,null,{"value":null}]
06:32:16 INFO - Exiting due to channel error.
06:48:56 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\task_1547619310\build\venv\Scripts\python', '-u', 'Z:\task_1547619310\build\tests\marionette\harness\marionette_harness\runtests.py', '--gecko-log=-', '-vv', '--log-raw=Z:\task_1547619310\build\blobber_upload_dir\marionette_raw.log', '--log-errorsummary=Z:\task_1547619310\build\blobber_upload_dir\marionette_errorsummary.log', '--log-html=Z:\task_1547619310\build\blobber_upload_dir\report.html', '--binary=Z:\task_1547619310\build\application\firefox\firefox.exe', '--address=localhost:2828', '--symbols-path=Z:\task_1547619310\build\symbols', 'Z:\task_1547619310\build\tests\marionette\tests\testing\marionette\harness\marionette_harness\tests\unit-tests.ini']

It looks like the client might send another packet that never gets processed by Firefox. Interestingly, the client doesn't time out after 360s.

No longer blocks: 1391545

Currently this is Windows 7 32-bit only.

Blocks: 1391545
OS: Unspecified → Windows 7
Hardware: Unspecified → x86
Severity: normal → critical
Keywords: hang
Priority: -- → P2

Maybe this is related to some specific system setting (screensaver, or whatever) which we currently do not disable for Marionette jobs. Let's wait until bug 1520175 has been fixed.

Depends on: 1520175

Here are some details which I posted on bug 1526880:

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #1)

Failures here are all happening due to channel errors:

22:59:37 INFO - 1549753177844 Marionette DEBUG 3 -> [0,6,"WebDriver:DeleteSession",{}]
22:59:37 INFO - 1549753177847 Marionette DEBUG 3 <- [1,6,null,{"value":null}]
22:59:37 INFO - Exiting due to channel error.

Maybe this is a hang in ProcessChild::QuickExit()?

https://searchfox.org/mozilla-central/rev/5e3bffe964110b8d1885b4236b8abd5c94d8f609/ipc/glue/MessageChannel.cpp#2527

James, could you have a look at this problem, given that you added this feature? Currently this is a high-frequency intermittent failure on Windows 7 32-bit only.

Here is the current intermittent-failures view (note that all failures have been classified against bug 1391545 so far):

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-12-01&endday=2019-02-11&tree=trunk&bug=1391545

Blocks: 1526880
Flags: needinfo?(snorp)

Bug 1526880 covers the recent issues on Windows 7 32-bit.

No longer blocks: 1391545
See Also: → 1523583

We're only seeing "Exiting due to channel error." after the timeout has been logged, so this is probably just because the parent is getting killed by the harness at this point. I don't think it's related to the timeout itself.

Flags: needinfo?(snorp)

No, the timeout of mozprocess is logged after the channel error. See:

03:28:47 INFO - 1550028527157 Marionette DEBUG 3 -> [0,101,"WebDriver:DeleteSession",{}]
03:28:47 INFO - 1550028527158 Marionette DEBUG 3 <- [1,101,null,{"value":null}]
03:28:47 INFO - Exiting due to channel error.
03:45:27 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\task_1550026868\build\venv\Scripts

That means that the harness doesn't kill the parent process at the time when the channel error occurs.

Every time this happens it is right after WebDriver:DeleteSession, whose code is here:
https://searchfox.org/mozilla-central/rev/cb7faaf6b4ad2528390186f1ce64618dea71031e/testing/marionette/driver.js#2851

Here Marionette removes its frame scripts from each window, and also some other observers and handlers.

The most affected test job is firefox-ui-functional-local:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-10-16&endday=2019-02-13&tree=all&bug=1516519

And specifically the session restore tests, where we simulate a Windows restart.

Adam, given that you worked on that implementation, would you mind having a look at it? Note that this problem only happens on Windows 7 32-bit, and I myself don't have a chance to test on that platform. Thanks!

Flags: needinfo?(agashlin)

Given that we haven't gotten a reply from Adam for 19 days I had another look into this problem.

So far I noticed that all those hangs currently happen when we run the test_profile_management.py tests, i.e. when we mess around with the profile. So James might be right and we have a problem with mozprocess not being able to kill Firefox and waiting forever. Maybe child processes cannot be killed.

I will add a couple of debug lines and let it run on try builds. Maybe it will become a bit clearer.

Flags: needinfo?(agashlin)

A normal shutdown looks like the following:

21:37:02 INFO - ** before delete_session
21:37:02 INFO - ** after delete_session
21:37:02 INFO - ** before instance close
21:37:02 INFO - ** before runner stop
21:37:02 INFO - ** before process handler kill
21:37:02 INFO - calling TerminateJobObject
21:37:02 INFO - called TerminateJobObject
21:37:02 INFO - waiting with IO completion port
21:37:02 INFO - job object msg active processes zero
21:37:02 INFO - received 'FINISHED' from _procmgrthread
21:37:02 INFO - Exiting due to channel error.
21:37:02 INFO - Exiting due to channel error.
21:37:02 INFO - ** after process handler kill
21:37:02 INFO - ** after runner stop
21:37:02 INFO - ** after instance close

And here is one which fails:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231733498&repo=try&lineNumber=3996

21:37:05 INFO - ** before delete_session
21:37:05 INFO - ** after delete_session
21:37:05 INFO - ** before instance close
21:37:05 INFO - ** before runner stop
21:37:05 INFO - ** before process handler kill
21:37:05 INFO - calling TerminateJobObject
21:37:05 INFO - called TerminateJobObject
21:37:05 INFO - waiting with IO completion port
21:37:05 INFO - job object msg active processes zero
21:37:05 INFO - received 'FINISHED' from _procmgrthread
21:53:45 INFO - Automation Error: mozprocess timed out after 1000 seconds running

So it's clearly a bug in mozprocess, which is not able to kill Firefox. I assume other test harnesses are affected by that as well.

Component: Marionette → Mozbase
Summary: Marionette client hangs 1000s due to channel errors → [mozprocess] Sometimes fails to kill Firefox, and waits forever for process to shutdown
Summary: [mozprocess] Sometimes fails to kill Firefox, and waits forever for process to shutdown → [mozprocess] kill sometimes fails on win32 to kill Firefox, and waits forever for process to shutdown

And here is another try build with more debug logging for IO port and job handling:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f0e72f84db80ff65a94134fcb76c80b1d79e04e

Based on the following log output it looks like the problem is not related to killing Firefox, but to starting the next process, which maybe re-uses the IO completion port:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231898817&repo=try&lineNumber=919

Here is one more try build with further logging when starting the process:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf81b045de494ee079b26d294d5dfab60d661b5d

I was wrong that this happens during startup of Firefox; nothing in the new logs shows output for that. Actually it all happens inside the cleanup steps of mozprocess. In all the working cases the "cleanup job io port" line can be seen three times after the "after cleanup" line; that has to come from __del__. In the failing case it is output only twice.

To know which of those invocations is missing I pushed one more try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=837104880c51288a7e2a8e82fc0c9867c411d460

Here is a working shutdown:

15:09:00 INFO - before wait _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after wait _cleanup
15:09:00 INFO - before kill _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after kill _cleanup
15:09:00 INFO - Exiting due to channel error.
15:09:00 INFO - before wait _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after wait _cleanup
15:09:00 INFO - ** after process handler kill

And here is a failing one:

15:09:02 INFO - before wait _cleanup
15:09:02 INFO - before cleanup job io port
15:09:02 INFO - after cleanup job io port
15:09:02 INFO - after wait _cleanup
15:09:02 INFO - before kill _cleanup
15:09:02 INFO - before cleanup job io port
15:09:02 INFO - after cleanup job io port
15:09:02 INFO - after kill _cleanup
15:09:02 INFO - Exiting due to channel error.
15:09:02 INFO - Exiting due to channel error.
15:25:42 INFO - Automation Error: mozproc...

This means that in the failing case the final cleanup inside of wait is missing, and something hangs after the cleanup in Process.kill:

https://searchfox.org/mozilla-central/rev/b2d35912da5b2acecb0274eb113777d344ffb75e/testing/mozbase/mozprocess/mozprocess/processhandler.py#209

Given that kill() is called from different places, I pushed another try build with stack output for that method.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7b9a560a7b5b7aca63619d369e4210d6f25fb55

Due to shifted priorities over the next weeks I won't have the time to continue on this bug myself. Geoff or Joel, would one of you have the time to pick it up? It's the underlying reason why we have so many 1000s hangs across several test harnesses, which made it into the OF top list.

Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)

I won't have time. Also adding :bc, as he has an interest in cleaning up log output in general.

Flags: needinfo?(jmaher)

I'll take a look but my first impression is omg, the moving parts!

Flags: needinfo?(bob)

Thanks Bob! In case of questions I will still be around to answer them as best I can.

++

Give me a shout if you want me to look too.

Flags: needinfo?(gbrown)

https://bugs.python.org/msg292233

Using a Windows job object should be paired with the creation flag CREATE_SUSPENDED. Callers may also need CREATE_BREAKAWAY_FROM_JOB, but setting that creation flag shouldn't be integrated into Popen.

The child has to be created suspended to ensure it doesn't spawn another process and exit before it's added to the job. Once it's in the job, call ResumeThread to start it.

On Windows Vista and 7, the child may need to break away from Python's current job, if allowed. These older versions of Windows don't implement nested jobs, so adding the child to a job will fail if it's already in one. The job used by py.exe for python.exe isn't a problem in this case since it's configured for child processes to automatically break away, but python.exe may have been started directly and added to a job that's not configured for silent breakaway.
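
To illustrate the pattern described there, a minimal pywin32 sketch (not the mozprocess code, which drives these APIs through its own ctypes wrappers; error handling omitted) could look like:

    import win32con
    import win32job
    import win32process

    CREATE_BREAKAWAY_FROM_JOB = 0x01000000  # not exposed as a named constant

    def start_in_job(command_line):
        # Create the child suspended so it cannot spawn grandchildren (or
        # exit) before it has been assigned to the job object.
        flags = win32con.CREATE_SUSPENDED | CREATE_BREAKAWAY_FROM_JOB
        si = win32process.STARTUPINFO()
        hproc, hthread, pid, tid = win32process.CreateProcess(
            None, command_line, None, None, False, flags, None, None, si)

        job = win32job.CreateJobObject(None, "")
        win32job.AssignProcessToJobObject(job, hproc)

        # Only now let the child run; processes it spawns inherit the job
        # and can later be killed together via TerminateJobObject(job, ...).
        win32process.ResumeThread(hthread)
        return job, hproc, pid

On Windows 7 the breakaway flag only works if the job the Python process is already in allows breakaway, as noted above.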

https://searchfox.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#403

Windows 7 is 6.1 so it should return False for _can_nest_jobs()

The return code 572 appears to be ERROR_CONTROL_C_EXIT

Considering the issues are with Windows 7, I think my first guess would be to require can_create_job and can_nest_jobs instead of can_create_job or can_nest_jobs, which would make Windows 7 use "ProcessManager NOT managing child processes".

https://treeherder.mozilla.org/#/jobs?repo=try&revision=18f365e883faf6e0453a87de0eded66d6a893fd6

Well, that was a bust.

Added more debug prints but I'm at a loss:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0db9b1dca2cddfffd8f123045036c45eefe921fb

gbrown: Have any ideas?

Flags: needinfo?(bob) → needinfo?(gbrown)

I've tried several things but haven't gotten anything to work. I'm going to have to punt for a while.

(In reply to Bob Clary [:bc:] from comment #21)

Windows 7 is 6.1 so it should return False for _can_nest_jobs()

Agreed. I expect Windows 7 has can_create_jobs = True and can_nest_jobs = False.

The return code 572 appears to be ERROR_CONTROL_C_EXIT

I assume that ultimately derives from

https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/testing/mozbase/mozprocess/mozprocess/processhandler.py#156

Considering the issues are with Windows 7, I think my first guess would be require can_create_job and can_nest_jobs instead of can_create_job or can_nest_jobs which would make Windows 7 use "ProcessManager NOT managing child processes".

I think that was a bad idea (easy to say after test results are available!). I'm pretty vague on the exact issues here, but I think we should use jobs if we can; otherwise, processhandler is not managing child processes and I think terminating the parent with children running can hang.

I'd like to reproduce the original issue with debugging, but haven't managed to yet. Any tips?

Flags: needinfo?(gbrown)

Submit a single Mn test then retrigger several times and wait for a failure. Try 5 retriggers at a time to see if it will reproduce.

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234053480&repo=try&lineNumber=45989-46017

04:21:20     INFO -  >> GeckoInstance.restart start
04:21:20     INFO -  >> GeckoInstance.close start, clean=False
04:21:20     INFO -  >> runner.stop start
04:21:20     INFO -  >> Mixin.kill start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | Process.kill start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | calling TerminateJobObject
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | Process.wait start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _wait start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | waiting with IO completion port
04:21:20     INFO -  DBG::MOZPROC PID:1796 (Thread-33) | job object msg active processes zero
04:21:20     INFO -  1552623680191	Marionette	DEBUG	2 <- [1,14,null,{"value":null}]
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | received 'FINISHED' from _procmgrthread
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup closed _handle
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _wait end returncode 572
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | Process.wait end, returncode 572
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20     INFO -  DBG::MOZPROC PID:1796 (MainThread) | Process.kill end, returncode=572
04:21:20     INFO -  >> Mixin.wait start
04:21:20     INFO -  >> ProcessReader.join start
04:21:20     INFO -  Exiting due to channel error.
04:21:21     INFO -  << ProcessReader.join end
04:21:21     INFO -  >> ProcessReader.join start
04:21:22     INFO -  << ProcessReader.join end
04:21:22     INFO -  >> ProcessReader.join start
...
(continues until the task max-run-time is exceeded, because the logging output prevents the 1000 second timeout we usually see from firing).

So, in the marionette_driver, restart() has called close() which has called stop() on the mozrunner. stop() called ProcessHandlerMixin.kill() which delegated the killing to its Process class. afaict, Process.kill() executed normally. However, ProcessHandlerMixin.wait() hangs in this loop:

https://searchfox.org/mozilla-central/rev/f1c7ba91fad60bfea184006f3728dd6ac48c8e56/testing/mozbase/mozprocess/mozprocess/processhandler.py#881-885

        while self.reader.is_alive():
            self.reader.join(timeout=1)
            count += 1
            if timeout is not None and count > timeout:
                return None

So the reader is alive. Is that because the process we just killed is still alive? Or is a child process involved? Or is the reader just confused?

Hm, might this be related to bug 1421289? Maybe we could print out the process tree for each iteration, so we can see how many processes Firefox spawned and whether content processes fail to exit.
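
Something like the following (a sketch using psutil, which would need to be available in the test virtualenv; the helper name is made up) could dump that information on each iteration of the wait loop:

    import psutil

    def log_process_tree(parent_pid):
        # Hypothetical helper: list the killed parent (if it is still around),
        # its remaining children, and any other firefox.exe still running --
        # an orphaned content process would only show up via the latter.
        try:
            parent = psutil.Process(parent_pid)
            procs = [parent] + parent.children(recursive=True)
        except psutil.NoSuchProcess:
            procs = []
        seen = {p.pid for p in procs}
        procs += [p for p in psutil.process_iter(attrs=["name"])
                  if p.info["name"] == "firefox.exe" and p.pid not in seen]
        for proc in procs:
            try:
                print("pid=%d name=%s status=%s"
                      % (proc.pid, proc.name(), proc.status()))
            except psutil.NoSuchProcess:
                pass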

I think it is closely related to 1421289. That ProcessReader is troublesome, intermittently not shutting down after a kill.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=a295f639dbd9e1a0b591fa86506859c5628913d6

is like comment 29, but now we timeout in the wait after a kill, with a timeout of 63 seconds; when the timeout is hit, I dump process info with wmic.exe and activate additional logging for 15 seconds, then raise an exception (getting crazy, I know!).

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234236459&repo=try&lineNumber=45849-45874

21:29:40 INFO - >> GeckoInstance.restart start
21:29:40 INFO - >> GeckoInstance.close start, clean=False
21:29:40 INFO - >> runner.stop start
21:29:40 INFO - >> Mixin.kill start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.kill start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | calling TerminateJobObject
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.wait start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _wait start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | waiting with IO completion port
21:29:40 INFO - DBG::MOZPROC PID:3180 (Thread-33) | job object msg active processes zero
21:29:40 INFO - 1552685380314 Marionette DEBUG 2 <- [1,14,null,{"value":null}]
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | received 'FINISHED' from _procmgrthread
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup closed _handle
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _wait end returncode 572
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.wait end, returncode 572
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.kill end, returncode=572
21:29:40 INFO - >> Mixin.wait start
21:29:40 INFO - Exiting due to channel error.
21:30:44 INFO - ZZZ wait taking too long -- probably in trouble!
21:30:44 INFO - HandleCount Name Priority ProcessId ThreadCount WorkingSetSize
...
21:30:44 INFO - 222 python.exe 8 6120 7 35360768
21:30:44 INFO - 174 sppsvc.exe 8 2444 4 4800512
21:30:44 INFO - 179 python.exe 8 376 2 15585280
21:30:44 INFO - 241 python.exe 8 5708 5 31203328
21:30:44 INFO - 194 python.exe 8 4604 3 18919424
21:30:44 INFO - 193 python.exe 8 5368 4 19349504
21:30:44 INFO - 77 audiodg.exe 8 4160 3 3260416
21:30:44 INFO - 5 firefox.exe 8 3792 1 77824
21:30:44 INFO - 175 WMIC.exe 8 3128 5 7671808
21:30:44 INFO - waiting another 15 seconds to see if reader thread reports anything...
21:30:44 INFO - _read got line from queue
21:30:44 INFO - _read got line from queue
21:30:44 INFO - _read got line from queue
...

There are python and firefox processes running, but there's no sign of the killed pid, 3180.

"_read got line from queue" is a typo and actually means _read did NOT get a line from the queue:

https://hg.mozilla.org/try/rev/31e3cc1d33fe690fa54a3083c2457a766f9b4db6#l2.258

so that's expected. The reader threads are running, but not reading anything.

(In reply to Geoff Brown [:gbrown] from comment #32)

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=a295f639dbd9e1a0b591fa86506859c5628913d6

is like comment 29, but now we timeout in the wait after a kill, with a timeout of 63 seconds; when the timeout is hit, I dump process info with wmic.exe and activate additional logging for 15 seconds, then raise an exception (getting crazy, I know!).

Just noting this so that we don't forget about it: the internal background hang monitor has a shutdown time of 65s by default for opt and debug builds, and 180s for ASAN builds. We shouldn't make the timeout too low.

Also, I have noticed this interesting fact, which is POSIX-related and might warrant a different bug:

https://searchfox.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#86
https://searchfox.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#194-203

We only give the application 1s to shut down after sending SIGTERM before we send a SIGKILL. This will never result in a clean shutdown for Firefox, which then always gets a SIGKILL. Well, at least once bug 336193 has been fixed.
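
For comparison, a more forgiving POSIX shutdown (a sketch, not what mozprocess currently does; the 60s grace period is an arbitrary assumption) would poll for exit before escalating:

    import os
    import signal
    import time

    def terminate_gracefully(pid, grace=60.0, poll_interval=0.5):
        # Send SIGTERM and give the child up to `grace` seconds to shut down
        # cleanly before falling back to SIGKILL.
        os.kill(pid, signal.SIGTERM)
        deadline = time.time() + grace
        while time.time() < deadline:
            try:
                # WNOHANG: returns (0, 0) while our child is still running.
                if os.waitpid(pid, os.WNOHANG) != (0, 0):
                    return
            except OSError:
                return  # already reaped elsewhere
            time.sleep(poll_interval)
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError:
            pass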

See Also: → 1421289

(In reply to Geoff Brown [:gbrown] from comment #32)

21:30:44 INFO - 222 python.exe 8 6120 7 35360768
21:30:44 INFO - 174 sppsvc.exe 8 2444 4 4800512
21:30:44 INFO - 179 python.exe 8 376 2 15585280
21:30:44 INFO - 241 python.exe 8 5708 5 31203328
21:30:44 INFO - 194 python.exe 8 4604 3 18919424
21:30:44 INFO - 193 python.exe 8 5368 4 19349504
21:30:44 INFO - 77 audiodg.exe 8 4160 3 3260416
21:30:44 INFO - 5 firefox.exe 8 3792 1 77824
21:30:44 INFO - 175 WMIC.exe 8 3128 5 7671808

There are python and firefox processes running, but there's no sign of the killed pid, 3180.

I was distracted by bug 1421289 and the ProcessReader, but eventually came around to thinking that the real issue is this bug: firefox processes are still running after the kill. In the example above, the parent, 3180, was killed, but 3792 continued to run. Was 3792 a content process, or ...? Why wasn't it killed? Could/should mozprocess be more aggressive about tracking and explicitly killing child processes, on win32?
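
One way to be more aggressive (a sketch built on psutil rather than the job object machinery mozprocess currently uses; the helper is hypothetical) would be to snapshot the children and kill them explicitly along with the parent:

    import psutil

    def kill_process_tree(pid, timeout=10):
        # Hypothetical helper: kill the children as well as the parent so no
        # Firefox child process is left running after the parent is gone.
        try:
            parent = psutil.Process(pid)
        except psutil.NoSuchProcess:
            return []
        procs = parent.children(recursive=True) + [parent]
        for proc in procs:
            try:
                proc.kill()
            except psutil.NoSuchProcess:
                pass
        # Wait so we know whether anything survived the kill.
        gone, alive = psutil.wait_procs(procs, timeout=timeout)
        return alive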

Unfortunately, those logs are expired now, and I am not finding time to get back to this.

See Also: → 1705385

Geoff, given that we haven't seen this issue for quite some time, shall we close this bug as incomplete?

Flags: needinfo?(gbrown)

Sounds good. Thanks.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(gbrown)
Resolution: --- → INCOMPLETE