Closed
Bug 1415290
Opened 6 years ago
Closed 6 years ago
mozcrash kill_pid() implementation on Windows could be better
Categories
(Testing :: Mozbase, enhancement)
Testing
Mozbase
Tracking
(firefox58 fixed, firefox59 fixed)
RESOLVED
FIXED
mozilla59
People
(Reporter: gbrown, Assigned: gbrown)
References
Details
Attachments
(1 file)
5.29 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
The mochitest - and probably other - harnesses rely on mozcrash to clean up firefox processes. mozcrash ultimately uses kill_pid(), which has different implementation on Windows and non-Windows. The Windows implementation of kill_pid() at: https://dxr.mozilla.org/mozilla-central/rev/2535bad09d720e71a982f3f70dd6925f66ab8ec7/testing/mozbase/mozcrash/mozcrash/mozcrash.py#469 doesn't check return codes and doesn't try to verify process death. I think the calls should succeed, but recent observations in bug 1397201 have me checking assumptions.
Comment 1•6 years ago
|
||
Looks like this is an identical problem like bug 1410883, or even the same and we weren't thinking about mozcrash being faulty in that handling. Great observation Geoff!
See Also: → 1410883
![]() |
Assignee | |
Comment 2•6 years ago
|
||
My primary purpose here is to improve diagnostics when killing firefox following a harness timeout, especially on Windows. https://treeherder.mozilla.org/#/jobs?repo=try&revision=49eabf155c6cdad9fb4d4790dcc10800317a20c7 suggests this change does no harm. There is at least one "370 seconds without output" timeout in that run, on Windows 7, and it demonstrates the extra diagnostics: https://public-artifacts.taskcluster.net/XE7dDTJYRZS_89kvlnCXtw/0/public/logs/live_backing.log 22:51:11 WARNING - TEST-UNEXPECTED-TIMEOUT | security/manager/ssl/tests/mochitest/mixedcontent/test_dynDelayedUnsecureXHR.html | application timed out after 370 seconds with no output 22:51:11 INFO - 1549 ERROR Force-terminating active process(es). 22:51:11 INFO - Determining child pids from psutil 22:51:11 INFO - Found child pids: [1840, 5180] 22:51:11 INFO - Killing process: 1840 22:51:11 INFO - TEST-INFO | started process screenshot 22:51:11 INFO - TEST-INFO | screenshot: exit 0 22:51:11 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 1840: error 5 22:51:11 INFO - Killing process: 5180 22:51:11 INFO - Not taking screenshot here: see the one that was previously logged 22:51:11 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 5180: error 5 22:51:11 INFO - psutil found pid 5180 dead 22:51:11 INFO - psutil found pid 1840 dead 22:51:11 INFO - Killing process: 4448 22:51:11 INFO - Not taking screenshot here: see the one that was previously logged 22:51:12 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 4448: error 5 22:51:12 INFO - psutil found pid 4448 dead 22:51:12 INFO - TEST-INFO | Main app process: exit 1 22:51:12 INFO - Buffered messages finished 22:51:12 ERROR - 1550 ERROR TEST-UNEXPECTED-FAIL | security/manager/ssl/tests/mochitest/mixedcontent/test_dynDelayedUnsecureXHR.html | application terminated with exit code 1 22:51:12 INFO - runtests.py | Application ran for: 0:10:15.209000 I don't understand why kill_pid() encounters those "wait failed"s, but at least now we can see those happening. And we also see that those processes are eventually killed anyway. I contemplated using log.error() for some of these conditions, but generally decided on log.warning(): This bug is about adding diagnostics, and I don't want to risk introducing new test failures until I have a clearer understanding of what is happening.
Attachment #8926617 -
Flags: review?(jmaher)
Attachment #8926617 -
Flags: review?(jmaher) → review+
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c739516873c6 Check return codes in mozcrash kill_pid; r=jmaher
Comment 4•6 years ago
|
||
Comment on attachment 8926617 [details] [diff] [review] improve diagnostics for process management during killAndGetStack Review of attachment 8926617 [details] [diff] [review]: ----------------------------------------------------------------- ::: testing/mozbase/mozcrash/mozcrash/mozcrash.py @@ +481,5 @@ > + if kernel32.TerminateProcess(handle, 1): > + # TerminateProcess is async; wait up to 30 seconds for process to > + # actually terminate, then give up so that clients are not kept > + # waiting indefinitely for hung processes. > + status = kernel32.WaitForSingleObject(handle, 30000) Geoff, please note that Firefox's background monitor thread uses a timeout of 60s or 65s, before it kills Firefox due to a long shutdown. Killing the process already after 30 will cause information loss for this extra crash report. Or is terminate like `SIGKILL` and not `SIGTERM`?
Comment 5•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/c739516873c6
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox59:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
![]() |
Assignee | |
Comment 6•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #4) > Comment on attachment 8926617 [details] [diff] [review] > improve diagnostics for process management during killAndGetStack > > Review of attachment 8926617 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: testing/mozbase/mozcrash/mozcrash/mozcrash.py > @@ +481,5 @@ > > + if kernel32.TerminateProcess(handle, 1): > > + # TerminateProcess is async; wait up to 30 seconds for process to > > + # actually terminate, then give up so that clients are not kept > > + # waiting indefinitely for hung processes. > > + status = kernel32.WaitForSingleObject(handle, 30000) > > Geoff, please note that Firefox's background monitor thread uses a timeout > of 60s or 65s, before it kills Firefox due to a long shutdown. Killing the > process already after 30 will cause information loss for this extra crash > report. Or is terminate like `SIGKILL` and not `SIGTERM`? I'd say TerminateProcess is more like SIGKILL than SIGTERM. MSDN says: ---------- The TerminateProcess function is used to unconditionally cause a process to exit. The state of global data maintained by dynamic-link libraries (DLLs) may be compromised if TerminateProcess is used rather than ExitProcess. This function stops execution of all threads within the process and requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been completed or canceled. When a process terminates, its kernel object is not destroyed until all processes that have open handles to the process have released those handles. TerminateProcess is asynchronous; it initiates termination and returns immediately. If you need to be sure the process has terminated, call the WaitForSingleObject function with a handle to the process. A process cannot prevent itself from being terminated. ---------- In my tests, the wait completes quickly (less than 1 second). Note that mozcrash only calls kill_pid() after generating the minidump. My choice of 30 seconds was pretty arbitrary. I was thinking "since this seems to usually terminate immediately, it would be odd if it didn't terminate in 30 seconds, so let's report that". I don't mind seeing that time change in a follow-up if there's concern.
Comment 7•6 years ago
|
||
If it's like a SIGKILL we won't need it then. Thanks.
Comment 8•6 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/1d4a6dc2023c
status-firefox58:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•