mozcrash kill_pid() implementation on Windows could be better

RESOLVED FIXED in Firefox 58

Status

enhancement
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: gbrown, Assigned: gbrown)

Tracking

Trunk
mozilla59
Points:
---

Firefox Tracking Flags

(firefox58 fixed, firefox59 fixed)

Details

Attachments

(1 attachment)

The mochitest - and probably other - harnesses rely on mozcrash to clean up firefox processes. mozcrash ultimately uses kill_pid(), which has different implementation on Windows and non-Windows.

The Windows implementation of kill_pid() at:

https://dxr.mozilla.org/mozilla-central/rev/2535bad09d720e71a982f3f70dd6925f66ab8ec7/testing/mozbase/mozcrash/mozcrash/mozcrash.py#469

doesn't check return codes and doesn't try to verify process death. I think the calls should succeed, but recent observations in bug 1397201 have me checking assumptions.
Looks like this is an identical problem like bug 1410883, or even the same and we weren't thinking about mozcrash being faulty in that handling. Great observation Geoff!
See Also: → 1410883
My primary purpose here is to improve diagnostics when killing firefox following a harness timeout, especially on Windows.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=49eabf155c6cdad9fb4d4790dcc10800317a20c7 suggests this change does no harm.


There is at least one "370 seconds without output" timeout in that run, on Windows 7, and it demonstrates the extra diagnostics:

https://public-artifacts.taskcluster.net/XE7dDTJYRZS_89kvlnCXtw/0/public/logs/live_backing.log

22:51:11  WARNING -  TEST-UNEXPECTED-TIMEOUT | security/manager/ssl/tests/mochitest/mixedcontent/test_dynDelayedUnsecureXHR.html | application timed out after 370 seconds with no output
22:51:11     INFO -  1549 ERROR Force-terminating active process(es).
22:51:11     INFO -  Determining child pids from psutil
22:51:11     INFO -  Found child pids: [1840, 5180]
22:51:11     INFO -  Killing process: 1840
22:51:11     INFO -  TEST-INFO | started process screenshot
22:51:11     INFO -  TEST-INFO | screenshot: exit 0
22:51:11     INFO -  mozcrash kill_pid(): wait failed (-1) terminating pid 1840: error 5
22:51:11     INFO -  Killing process: 5180
22:51:11     INFO -  Not taking screenshot here: see the one that was previously logged
22:51:11     INFO -  mozcrash kill_pid(): wait failed (-1) terminating pid 5180: error 5
22:51:11     INFO -  psutil found pid 5180 dead
22:51:11     INFO -  psutil found pid 1840 dead
22:51:11     INFO -  Killing process: 4448
22:51:11     INFO -  Not taking screenshot here: see the one that was previously logged
22:51:12     INFO -  mozcrash kill_pid(): wait failed (-1) terminating pid 4448: error 5
22:51:12     INFO -  psutil found pid 4448 dead
22:51:12     INFO -  TEST-INFO | Main app process: exit 1
22:51:12     INFO -  Buffered messages finished
22:51:12    ERROR -  1550 ERROR TEST-UNEXPECTED-FAIL | security/manager/ssl/tests/mochitest/mixedcontent/test_dynDelayedUnsecureXHR.html | application terminated with exit code 1
22:51:12     INFO -  runtests.py | Application ran for: 0:10:15.209000

I don't understand why kill_pid() encounters those "wait failed"s, but at least now we can see those happening. And we also see that those processes are eventually killed anyway.

I contemplated using log.error() for some of these conditions, but generally decided on log.warning(): This bug is about adding diagnostics, and I don't want to risk introducing new test failures until I have a clearer understanding of what is happening.
Attachment #8926617 - Flags: review?(jmaher)
Attachment #8926617 - Flags: review?(jmaher) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c739516873c6
Check return codes in mozcrash kill_pid; r=jmaher
Comment on attachment 8926617 [details] [diff] [review]
improve diagnostics for process management during killAndGetStack

Review of attachment 8926617 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozbase/mozcrash/mozcrash/mozcrash.py
@@ +481,5 @@
> +            if kernel32.TerminateProcess(handle, 1):
> +                # TerminateProcess is async; wait up to 30 seconds for process to
> +                # actually terminate, then give up so that clients are not kept
> +                # waiting indefinitely for hung processes.
> +                status = kernel32.WaitForSingleObject(handle, 30000)

Geoff, please note that Firefox's background monitor thread uses a timeout of 60s or 65s, before it kills Firefox due to a long shutdown. Killing the process already after 30 will cause information loss for this extra crash report. Or is terminate like `SIGKILL` and not `SIGTERM`?
https://hg.mozilla.org/mozilla-central/rev/c739516873c6
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Comment on attachment 8926617 [details] [diff] [review]
> improve diagnostics for process management during killAndGetStack
> 
> Review of attachment 8926617 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: testing/mozbase/mozcrash/mozcrash/mozcrash.py
> @@ +481,5 @@
> > +            if kernel32.TerminateProcess(handle, 1):
> > +                # TerminateProcess is async; wait up to 30 seconds for process to
> > +                # actually terminate, then give up so that clients are not kept
> > +                # waiting indefinitely for hung processes.
> > +                status = kernel32.WaitForSingleObject(handle, 30000)
> 
> Geoff, please note that Firefox's background monitor thread uses a timeout
> of 60s or 65s, before it kills Firefox due to a long shutdown. Killing the
> process already after 30 will cause information loss for this extra crash
> report. Or is terminate like `SIGKILL` and not `SIGTERM`?

I'd say TerminateProcess is more like SIGKILL than SIGTERM. MSDN says:
----------
The TerminateProcess function is used to unconditionally cause a process to exit. The state of global data maintained by dynamic-link libraries (DLLs) may be compromised if TerminateProcess is used rather than ExitProcess.

This function stops execution of all threads within the process and requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been completed or canceled. When a process terminates, its kernel object is not destroyed until all processes that have open handles to the process have released those handles.

TerminateProcess is asynchronous; it initiates termination and returns immediately. If you need to be sure the process has terminated, call the WaitForSingleObject function with a handle to the process.

A process cannot prevent itself from being terminated.
----------

In my tests, the wait completes quickly (less than 1 second). 

Note that mozcrash only calls kill_pid() after generating the minidump.

My choice of 30 seconds was pretty arbitrary. I was thinking "since this seems to usually terminate immediately, it would be odd if it didn't terminate in 30 seconds, so let's report that". I don't mind seeing that time change in a follow-up if there's concern.
If it's like a SIGKILL we won't need it then. Thanks.
You need to log in before you can comment on or make changes to this bug.