[mozprocess] kill sometimes fails on win32 to kill Firefox, and waits forever for process to shutdown
Categories: Testing :: Mozbase (defect, P2)
Tracking: Not tracked
People: Reporter: whimboo; Unassigned
Keywords: hang, regression
Here is an example from bug 1391545:
06:32:16 INFO - 1547620336776 Marionette DEBUG 3 <- [1,6,null,{"value":null}]
06:32:16 INFO - Exiting due to channel error.
06:48:56 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\task_1547619310\build\venv\Scripts\python', '-u', 'Z:\task_1547619310\build\tests\marionette\harness\marionette_harness\runtests.py', '--gecko-log=-', '-vv', '--log-raw=Z:\task_1547619310\build\blobber_upload_dir\marionette_raw.log', '--log-errorsummary=Z:\task_1547619310\build\blobber_upload_dir\marionette_errorsummary.log', '--log-html=Z:\task_1547619310\build\blobber_upload_dir\report.html', '--binary=Z:\task_1547619310\build\application\firefox\firefox.exe', '--address=localhost:2828', '--symbols-path=Z:\task_1547619310\build\symbols', 'Z:\task_1547619310\build\tests\marionette\tests\testing\marionette\harness\marionette_harness\tests\unit-tests.ini']
It looks like the client might send another packet that never gets processed by Firefox. Interestingly, the client doesn't time out after 360s.
Reporter
Comment 1 • 6 years ago
Currently this is Windows 7 32bit only.
Reporter
Updated • 6 years ago
Reporter
Comment 2 • 6 years ago
Maybe this is related to some specific system setting (screensaver, or whatever) which we currently do not disable for Marionette jobs. Let's wait until bug 1520175 has been fixed.
Reporter
Comment 3 • 6 years ago
Here are some details which I posted on bug 1526880:
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #1)
Failures here are all happening due to channel errors:
22:59:37 INFO - 1549753177844 Marionette DEBUG 3 -> [0,6,"WebDriver:DeleteSession",{}]
22:59:37 INFO - 1549753177847 Marionette DEBUG 3 <- [1,6,null,{"value":null}]
22:59:37 INFO - Exiting due to channel error.
Maybe this is a hang in ProcessChild::QuickExit()?
James, could you have a look at this problem, given that you added this feature? Currently this is a high-frequency intermittent failure on Windows 7 32-bit only.
Here is the current intermittent view (note that all failures have been classified for bug 1391545 so far):
Reporter
Comment 4 • 6 years ago
Bug 1526880 covers the recent issues on Windows 7 32bit.
Comment 5 • 6 years ago
We're only seeing "Exiting due to channel error." after the timeout has been logged, so this is probably just because the parent is getting killed by the harness at this point. I don't think it's related to the timeout itself.
Reporter
Comment 6 • 6 years ago
No, the timeout of mozprocess is logged after the channel error. See:
03:28:47 INFO - 1550028527157 Marionette DEBUG 3 -> [0,101,"WebDriver:DeleteSession",{}]
03:28:47 INFO - 1550028527158 Marionette DEBUG 3 <- [1,101,null,{"value":null}]
03:28:47 INFO - Exiting due to channel error.
03:45:27 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\task_1550026868\build\venv\Scripts
That means that the harness doesn't kill the parent process at the time when the channel error occurs.
Every time this happens it is right after WebDriver:DeleteSession, whose code is here:
https://searchfox.org/mozilla-central/rev/cb7faaf6b4ad2528390186f1ce64618dea71031e/testing/marionette/driver.js#2851
Here Marionette removes its frame scripts from each window, along with some other observers and handlers.
The most affected test job is firefox-ui-functional-local:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-10-16&endday=2019-02-13&tree=all&bug=1516519
Specifically, it's the session restore tests where we simulate a Windows restart.
Adam, given that you worked on that implementation, would you mind having a look at it? Note that this problem only happens on Windows 7 32-bit, and I myself don't have a chance to test on that platform. Thanks!
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 9 • 6 years ago
Given that we haven't gotten a reply from Adam in 19 days, I had another look into this problem.
So far I noticed that all those hangs currently happen when we run the test_profile_management.py tests, i.e. when we mess around with the profile. So James might be right that we have a problem with mozprocess not being able to kill Firefox and waiting forever. Maybe child processes cannot be killed.
I will add a couple of debug lines and let it run on try builds. Maybe it will become a bit clearer.
Reporter
Comment 10 • 6 years ago
Reporter
Comment 11 • 6 years ago
A normal shutdown looks like the following:
21:37:02 INFO - ** before delete_session
21:37:02 INFO - ** after delete_session
21:37:02 INFO - ** before instance close
21:37:02 INFO - ** before runner stop
21:37:02 INFO - ** before process handler kill
21:37:02 INFO - calling TerminateJobObject
21:37:02 INFO - called TerminateJobObject
21:37:02 INFO - waiting with IO completion port
21:37:02 INFO - job object msg active processes zero
21:37:02 INFO - received 'FINISHED' from _procmgrthread
21:37:02 INFO - Exiting due to channel error.
21:37:02 INFO - Exiting due to channel error.
21:37:02 INFO - ** after process handler kill
21:37:02 INFO - ** after runner stop
21:37:02 INFO - ** after instance close
And here one which fails:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231733498&repo=try&lineNumber=3996
21:37:05 INFO - ** before delete_session
21:37:05 INFO - ** after delete_session
21:37:05 INFO - ** before instance close
21:37:05 INFO - ** before runner stop
21:37:05 INFO - ** before process handler kill
21:37:05 INFO - calling TerminateJobObject
21:37:05 INFO - called TerminateJobObject
21:37:05 INFO - waiting with IO completion port
21:37:05 INFO - job object msg active processes zero
21:37:05 INFO - received 'FINISHED' from _procmgrthread
21:53:45 INFO - Automation Error: mozprocess timed out after 1000 seconds running
So it's clearly a bug in mozprocess not being able to kill Firefox. I assume other test harnesses are affected by that as well.
Reporter
Updated • 6 years ago
Reporter
Comment 12 • 6 years ago
And here is another try build with more debug logging for the IO completion port and job handling:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f0e72f84db80ff65a94134fcb76c80b1d79e04e
Reporter
Comment 13 • 6 years ago
From the following log output it looks like the problem is not related to killing Firefox, but to starting the next process, maybe by re-using the IO completion port:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231898817&repo=try&lineNumber=919
Here is one more try build with further logging when starting the process:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf81b045de494ee079b26d294d5dfab60d661b5d
Reporter
Comment 14 • 6 years ago
I was wrong that this happens during startup of Firefox; nothing in the new logs shows output for that. It's actually all inside the cleanup steps of mozprocess. In all working cases the "cleanup job io port" line can be seen three times after the "after cleanup" line. That has to come from __del__. In the failing case it is output only twice.
To know which of those invocations is missing I pushed one more try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=837104880c51288a7e2a8e82fc0c9867c411d460
Updated • 6 years ago
Reporter
Comment 15 • 6 years ago
Here is a working shutdown:
15:09:00 INFO - before wait _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after wait _cleanup
15:09:00 INFO - before kill _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after kill _cleanup
15:09:00 INFO - Exiting due to channel error.
15:09:00 INFO - before wait _cleanup
15:09:00 INFO - before cleanup job io port
15:09:00 INFO - after cleanup job io port
15:09:00 INFO - after wait _cleanup
15:09:00 INFO - ** after process handler kill
And here is a failing one:
15:09:02 INFO - before wait _cleanup
15:09:02 INFO - before cleanup job io port
15:09:02 INFO - after cleanup job io port
15:09:02 INFO - after wait _cleanup
15:09:02 INFO - before kill _cleanup
15:09:02 INFO - before cleanup job io port
15:09:02 INFO - after cleanup job io port
15:09:02 INFO - after kill _cleanup
15:09:02 INFO - Exiting due to channel error.
15:09:02 INFO - Exiting due to channel error.
15:25:42 INFO - Automation Error: mozproc...
It means that in the failing case the final cleanup inside of wait() is missing, and something hangs after the cleanup in Process.kill().
Given that kill() is called from different places, I pushed another try build with stack output for that method:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7b9a560a7b5b7aca63619d369e4210d6f25fb55
Reporter
Comment 16 • 6 years ago
Due to shifted priorities over the next weeks I won't have time to continue on this bug myself. Geoff or Joel, would one of you have time to pick it up? It's the underlying reason why we have that many 1000s hangs in several test harnesses, which made it into the top list of intermittent failures.
Comment 17 • 6 years ago
I won't have time. Also adding :bc, as he has an interest in cleaning up log output in general.
Comment 18 • 6 years ago
I'll take a look but my first impression is omg, the moving parts!
Reporter
Comment 19 • 6 years ago
Thanks Bob! In case of questions I will still be around to answer them as best I can.
Comment 21 • 6 years ago
https://bugs.python.org/msg292233
Using a Windows job object should be paired with the creation flag CREATE_SUSPENDED. Callers may also need CREATE_BREAKAWAY_FROM_JOB, but setting that creation flag shouldn't be integrated into Popen.
The child has to be created suspended to ensure it doesn't spawn another process and exit before it's added to the job. Once it's in the job, call ResumeThread to start it.
On Windows Vista and 7, the child may need to break away from Python's current job, if allowed. These older versions of Windows don't implement nested jobs, so adding the child to a job will fail if it's already in one. The job used by py.exe for python.exe isn't a problem in this case since it's configured for child processes to automatically break away, but python.exe may have been started directly and added to a job that's not configured for silent breakaway.
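The recipe above can be sketched in Python. This is an illustration only, not mozprocess code: spawn_into_job and assign_to_job are hypothetical helpers, and the flag values are the ones documented in the Windows SDK headers.

```python
import subprocess

# Win32 process-creation flags referenced above (Windows SDK values).
CREATE_SUSPENDED = 0x00000004
CREATE_BREAKAWAY_FROM_JOB = 0x01000000

def spawn_into_job(cmd, assign_to_job, in_unbreakable_job=False):
    """Hypothetical helper sketching the recipe from msg292233 (Windows-only).

    The child is created suspended so it cannot spawn a grandchild and
    exit before being assigned to our job object; assign_to_job stands in
    for AssignProcessToJobObject plus ResumeThread done via ctypes.
    """
    flags = CREATE_SUSPENDED
    if in_unbreakable_job:
        # Vista/7 cannot nest jobs, so the child may need to break away
        # from a job the Python process is already in.
        flags |= CREATE_BREAKAWAY_FROM_JOB
    proc = subprocess.Popen(cmd, creationflags=flags)  # Windows-only kwarg
    assign_to_job(proc)  # then ResumeThread on the primary thread handle
    return proc
```

The key ordering is: create suspended, assign to the job, and only then resume, so no window exists in which the child runs outside the job.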
Windows 7 is 6.1 so it should return False for _can_nest_jobs()
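That check boils down to a version-tuple comparison; a minimal sketch, as a stand-in for mozprocess's _can_nest_jobs rather than its actual implementation:

```python
def can_nest_jobs(windows_version):
    # Nested job objects arrived in Windows 8, which reports NT 6.2;
    # Windows 7 reports NT 6.1 and so compares below the threshold.
    return tuple(windows_version) >= (6, 2)
```

On a live system the (major, minor) pair would come from sys.getwindowsversion(); passing it in explicitly keeps the sketch testable anywhere.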
The return code 572 appears to be ERROR_CONTROL_C_EXIT
Considering the issues are with Windows 7, my first guess would be to require can_create_job and can_nest_jobs instead of can_create_job or can_nest_jobs, which would make Windows 7 use "ProcessManager NOT managing child processes".
https://treeherder.mozilla.org/#/jobs?repo=try&revision=18f365e883faf6e0453a87de0eded66d6a893fd6
Comment 22 • 6 years ago
Well, that was a bust.
Comment 23 • 6 years ago
Added more debug prints but I'm at a loss:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0db9b1dca2cddfffd8f123045036c45eefe921fb
gbrown: Have any ideas?
Comment 24 • 6 years ago
I've tried several things but haven't gotten anything to work. I'm going to have to punt for a while.
Comment 25 • 6 years ago
(In reply to Bob Clary [:bc:] from comment #21)
Windows 7 is 6.1 so it should return False for _can_nest_jobs()
Agreed. I expect Windows 7 has can_create_jobs = True and can_nest_jobs = False.
The return code 572 appears to be ERROR_CONTROL_C_EXIT
I assume that ultimately derives from
Considering the issues are with Windows 7, I think my first guess would be require can_create_job and can_nest_jobs instead of can_create_job or can_nest_jobs which would make Windows 7 use "ProcessManager NOT managing child processes".
I think that was a bad idea (easy to say after test results are available!). I'm pretty vague on the exact issues here, but I think we should use jobs if we can; otherwise, processhandler is not managing child processes and I think terminating the parent with children running can hang.
I'd like to reproduce the original issue with debugging, but haven't managed to yet. Any tips?
Comment 26 • 6 years ago
Submit a single Mn test then retrigger several times and wait for a failure. Try 5 retriggers at a time to see if it will reproduce.
Comment 27 • 6 years ago
Thanks. I hope I am tracking down something here, slowly, but I don't have much to report yet.
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=99893b5bff484daf4be1b82f0552898f16180298
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=3a53feadc2c5abccde3c77e5d776119c2b66cce6
Comment 28 • 6 years ago
Comment 29 • 6 years ago
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234053480&repo=try&lineNumber=45989-46017
04:21:20 INFO - >> GeckoInstance.restart start
04:21:20 INFO - >> GeckoInstance.close start, clean=False
04:21:20 INFO - >> runner.stop start
04:21:20 INFO - >> Mixin.kill start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | Process.kill start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | calling TerminateJobObject
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | Process.wait start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _wait start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | waiting with IO completion port
04:21:20 INFO - DBG::MOZPROC PID:1796 (Thread-33) | job object msg active processes zero
04:21:20 INFO - 1552623680191 Marionette DEBUG 2 <- [1,14,null,{"value":null}]
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | received 'FINISHED' from _procmgrthread
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup closed _handle
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _wait end returncode 572
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | Process.wait end, returncode 572
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup start
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | _cleanup end
04:21:20 INFO - DBG::MOZPROC PID:1796 (MainThread) | Process.kill end, returncode=572
04:21:20 INFO - >> Mixin.wait start
04:21:20 INFO - >> ProcessReader.join start
04:21:20 INFO - Exiting due to channel error.
04:21:21 INFO - << ProcessReader.join end
04:21:21 INFO - >> ProcessReader.join start
04:21:22 INFO - << ProcessReader.join end
04:21:22 INFO - >> ProcessReader.join start
...
(continues until task max-run-time exceeded, because logging is avoiding the 1000 second timeout we usually see).
So, in the marionette_driver, restart() has called close() which has called stop() on the mozrunner. stop() called ProcessHandlerMixin.kill() which delegated the killing to its Process class. afaict, Process.kill() executed normally. However, ProcessHandlerMixin.wait() hangs in this loop:
    while self.reader.is_alive():
        self.reader.join(timeout=1)
        count += 1
        if timeout is not None and count > timeout:
            return None
So the reader is alive. Is that because the process we just killed is still alive? Or is a child process involved? Or is the reader just confused?
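To make the failure mode of that loop concrete, here is a self-contained sketch with the same shape (not the actual mozprocess code): with timeout=None it spins forever whenever the reader thread never exits, which is exactly the hang seen here.

```python
import threading
import time

def wait_for_reader(reader, timeout=None):
    # Same shape as the loop above: poll in one-second slices, giving up
    # after roughly `timeout` seconds; with timeout=None this loops
    # forever if the reader thread never dies.
    count = 0
    while reader.is_alive():
        reader.join(timeout=1)
        count += 1
        if timeout is not None and count > timeout:
            return None  # gave up; reader still alive
    return True  # reader exited cleanly
```

A reader that finishes promptly returns True; one that outlives the timeout returns None, mirroring the harness falling through to its own 1000s watchdog.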
Reporter
Comment 30 • 6 years ago
Hm, might this be related to bug 1421289? Maybe we could print out the process tree for each iteration, so we can see how many processes Firefox spawned, and whether content processes don't exit.
Comment 31 • 6 years ago
I think it is closely related to bug 1421289. That ProcessReader is troublesome, intermittently not shutting down after a kill.
Comment 32 • 6 years ago
This is like comment 29, but now we time out in the wait after a kill, with a timeout of 63 seconds; when the timeout is hit, I dump process info with wmic.exe and activate additional logging for 15 seconds, then raise an exception (getting crazy, I know!).
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234236459&repo=try&lineNumber=45849-45874
21:29:40 INFO - >> GeckoInstance.restart start
21:29:40 INFO - >> GeckoInstance.close start, clean=False
21:29:40 INFO - >> runner.stop start
21:29:40 INFO - >> Mixin.kill start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.kill start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | calling TerminateJobObject
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.wait start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _wait start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | waiting with IO completion port
21:29:40 INFO - DBG::MOZPROC PID:3180 (Thread-33) | job object msg active processes zero
21:29:40 INFO - 1552685380314 Marionette DEBUG 2 <- [1,14,null,{"value":null}]
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | received 'FINISHED' from _procmgrthread
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup closed _handle
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _wait end returncode 572
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.wait end, returncode 572
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup start
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | _cleanup end
21:29:40 INFO - DBG::MOZPROC PID:3180 (MainThread) | Process.kill end, returncode=572
21:29:40 INFO - >> Mixin.wait start
21:29:40 INFO - Exiting due to channel error.
21:30:44 INFO - ZZZ wait taking too long -- probably in trouble!
21:30:44 INFO - HandleCount Name Priority ProcessId ThreadCount WorkingSetSize
...
21:30:44 INFO - 222 python.exe 8 6120 7 35360768
21:30:44 INFO - 174 sppsvc.exe 8 2444 4 4800512
21:30:44 INFO - 179 python.exe 8 376 2 15585280
21:30:44 INFO - 241 python.exe 8 5708 5 31203328
21:30:44 INFO - 194 python.exe 8 4604 3 18919424
21:30:44 INFO - 193 python.exe 8 5368 4 19349504
21:30:44 INFO - 77 audiodg.exe 8 4160 3 3260416
21:30:44 INFO - 5 firefox.exe 8 3792 1 77824
21:30:44 INFO - 175 WMIC.exe 8 3128 5 7671808
21:30:44 INFO - waiting another 15 seconds to see if reader thread reports anything...
21:30:44 INFO - _read got line from queue
21:30:44 INFO - _read got line from queue
21:30:44 INFO - _read got line from queue
...
There are python and firefox processes running, but there's no sign of the killed pid, 3180.
"_read got line from queue" is a typo and actually means _read did NOT get a line from the queue:
https://hg.mozilla.org/try/rev/31e3cc1d33fe690fa54a3083c2457a766f9b4db6#l2.258
so that's expected. The reader threads are running, but not reading anything.
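One plausible explanation for reader threads that run but read nothing: a reader of this style only exits when the child's stdout hits EOF, which requires every process holding the write end of the pipe to be gone, so a surviving child process keeps the reader alive even after the parent was killed. A simplified sketch of the pattern, not the actual ProcessReader code:

```python
import queue
import subprocess
import threading

def start_reader(proc, lines):
    # The thread ends when proc.stdout reaches EOF, i.e. when *every*
    # process that inherited the write end of the pipe has exited,
    # not merely the direct child we spawned.
    def _read():
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: stream closed
    t = threading.Thread(target=_read, daemon=True)
    t.start()
    return t

proc = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE)
lines = queue.Queue()
reader = start_reader(proc, lines)
reader.join(timeout=5)
first = lines.get(timeout=5)
```

If a content process inherited the pipe and outlives the parent, the for-loop above never sees EOF, which matches "reader threads are running, but not reading anything".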
Reporter
Comment 33 • 6 years ago
(In reply to Geoff Brown [:gbrown] from comment #32)
This is like comment 29, but now we time out in the wait after a kill, with a timeout of 63 seconds; when the timeout is hit, I dump process info with wmic.exe and activate additional logging for 15 seconds, then raise an exception (getting crazy, I know!).
Just noting this so we don't forget: the internal background hang monitor has a shutdown timeout of 65s by default for opt and debug builds, and 180s for ASAN builds. We shouldn't make the time too low.
Also I have seen this interesting fact, which is POSIX-related and might warrant a different bug:
https://searchfox.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#86
https://searchfox.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#194-203
We only give the application 1s to shut down after sending SIGTERM before sending a SIGKILL. This will never result in a clean shutdown for Firefox, which then always gets a SIGKILL. Well, unless bug 336193 gets fixed.
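A POSIX sketch of a friendlier escalation (a hypothetical helper, not mozprocess's implementation), giving the application a real grace period after SIGTERM before falling back to SIGKILL:

```python
import signal
import subprocess

def stop_gracefully(proc, grace=60.0):
    # Ask politely first; only escalate to SIGKILL if the process has
    # not exited within `grace` seconds (mozprocess currently waits 1s).
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL
        return proc.wait()
```

With a grace period sized to the application's real shutdown time, a cooperative process exits via SIGTERM and only a truly wedged one gets SIGKILL.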
Comment 34 • 6 years ago
(In reply to Geoff Brown [:gbrown] from comment #32)
21:30:44 INFO - 222 python.exe 8 6120 7 35360768
21:30:44 INFO - 174 sppsvc.exe 8 2444 4 4800512
21:30:44 INFO - 179 python.exe 8 376 2 15585280
21:30:44 INFO - 241 python.exe 8 5708 5 31203328
21:30:44 INFO - 194 python.exe 8 4604 3 18919424
21:30:44 INFO - 193 python.exe 8 5368 4 19349504
21:30:44 INFO - 77 audiodg.exe 8 4160 3 3260416
21:30:44 INFO - 5 firefox.exe 8 3792 1 77824
21:30:44 INFO - 175 WMIC.exe 8 3128 5 7671808
There are python and firefox processes running, but there's no sign of the killed pid, 3180.
I was distracted by bug 1421289 and the ProcessReader, but eventually came around to thinking that the real issue is this bug: firefox processes are still running after the kill. In the example above, the parent, 3180, was killed, but 3792 continued to run. Was 3792 a content process, or ...? Why wasn't it killed? Could/should mozprocess be more aggressive about tracking and explicitly killing child processes, on win32?
Unfortunately, those logs are expired now, and I am not finding time to get back to this.
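For comparison, the POSIX analogue of "kill the whole job" is signalling a process group; on win32 that role belongs to the job object. A sketch with hypothetical helpers showing the idea of killing the tree rather than only the direct child:

```python
import os
import signal
import subprocess

def launch_group(cmd):
    # Start the child as the leader of a new process group so the
    # whole tree, grandchildren included, can be signalled at once.
    return subprocess.Popen(cmd, preexec_fn=os.setsid)

def kill_group(proc):
    # Signal the group rather than just the direct child: the moral
    # equivalent of TerminateJobObject on Windows.
    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    return proc.wait()
```

If mozprocess had an equivalently reliable "whole tree" kill on win32, an orphaned firefox.exe like pid 3792 above could not survive the parent's termination.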
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 40 • 3 years ago
Geoff, given that we haven't seen this issue for quite some time shall we close this bug as incomplete?
Comment 41 • 3 years ago
Sounds good. Thanks.