Open Bug 1201871 Opened 5 years ago Updated 3 years ago

[mozprocess] Add module for optional signal handling

Categories

(Testing :: Mozbase, defect)

defect
Not set
critical

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

I have seen recently that running our Firefox UI tests via mozharness that Marionette is not handling the SIGTERM signal. It means whenever the Python process gets killed Firefox will remain active and will not be closed. As result you will get a conflict with port 2828 because it is still in use.

Here a test script:
===================

import time
from marionette import MarionetteTestCase

class Test(MarionetteTestCase):

    def test_sleeps(self):
        for i in range(1, 100):
            time.sleep(1)

Run this test with Marionette and then do a "kill -TERM %pid%" via the command line. The result is that the Marionette Python process terminates, but Firefox is still open. This should not happen and Marionette has to cleanly shutdown AND close all opened subprocesses.

I had to dig into process handling and how all this works and it's pretty hard to find a page which is actually about this problem. But finally I found this pretty informative web page:

http://stefan.sofa-rockers.org/2013/08/15/handling-sub-process-hierarchies-python-linux-os-x/

So what we might need are signal handlers which do the same as what we have for Ctrl-C right now as implemented on bug 1104742.

This is a critical issue for us currently in Mozmill CI because each time Jenkins aborts a build, it will send a SIGTERM to the process. This leaves Firefox instances behind and boxes need to manually be cleaned-up.

I will try to get this fixed.
Is this a marionette runner bug or is this a mozprocess bug?
I will have to figure out if we actually can get this implemented in mozprocess directly or if each consumer has to take care of it.
I did some investigation on Friday but haven't had the time anymore to enter a comment. So here an example of how it looks like with and without those handlers. Whereby process.py uses the ProcessHandler to launch process_sub, whereby process_sub is launching Firefox the same way.

Without handlers and Ctrl+C:
----------------------------

^CTraceback (most recent call last):
  File "process.py", line 31, in <module>
    proc.wait()
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/mozprocess-0.22-py2.7.egg/mozprocess/processhandler.py", line 793, in wait
    self.reader.thread.join(timeout=1)
  File "/usr/lib/python2.7/threading.py", line 960, in join
    self.__block.wait(delay)
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt

As result process.py quits but leaves process_sub and Firefox running. You can then kill process_sub, but that also will leave Firefox open. If you don't kill the sub process and Firefox closes, process_sub will also quit given its child quit.

Without handlers and SIGTERM:
-----------------------------

Terminated

Same behavior as above. The sub process and Firefox are still running.

With handlers and Ctrl+C:
-------------------------

process.py: entered cleanup (got a SIGINT)
> process_sub.py: entered cleanup (got a SIGTERM)

Both files have handlers for those interrupts and safely shutdown the application via the cleanup() method.
 
With handlers and SIGTERM:
--------------------------

process.py: entered cleanup (got a SIGTERM)
> process_sub.py: entered cleanup (got a SIGTERM)

Same as above. With the handlers it works fine.


So for the world in Marionette you can assign the subprocess to the Marionette client process. If you terminate it via Ctrl+C we currently handle that, but a "kill -TERM pid" will leave Firefox open. If Mozharness is in use which will call the Marionette script via ProcessHandler you can understand it as the process.py file. 

So from my understanding Firefox should be correctly handling the SIGTERM interrupt and closes itself safely. The Marionette code and other scripts which call the Marionette script, will not be safe at the moment due to the missing interrupt handlers. 

All this has a massive affect on testing machines if those are not getting rebooted between test runs. In case of any aborts Firefox instances will still linger around and prevent us from connecting again to the wanted new Firefox process.

Now that I understand the behavior, next I will check is if we somehow can integrate this logic with mozprocess or if every application needs its own handlers. At the moment I feel that we have to do the latter.
Attached file POC.zip (obsolete) —
For easy of implementing I wrote a PoC outside of mozbase for now. It works on Linux and maybe on OS X too (to check), given that it misses the signals for Windows. I want to upload the PoC so others might want to test it.

Next I will check on OS X and then Windows. Once it is working I will first patch ProcessHandler, so that every consumer can take advantage of killing spawned processes - without any code modifications! Latter are only necessary for internal clean-up code. The PoC shows some examples.
Ok, so I had a chat with Andrew and with Ted about this. As suggested I should put some more detailed sentences up here which explains the problem better. So here it comes...

Our firefox-ui-tests are based on Marionette and subclass various Marionette client classes. Additional to its features we can also e.g. install/uninstall a Firefox build. So far we have a handler for SIGINT in Marionette which handles the KeyboardInterrupt and closes Firefox when Ctrl-C is pressed. But it does not handle SIGTERM yet. That means if our tests are getting a SIGTERM signal from the outer process (let this be mozharness) the script immediately quits. Firefox stays open, the temporary profile created by Marionette is not removed, our firefox-ui-tests python script is still running, and our installed binary will not be uninstalled. This messes up our slave machines intermittently. 

So the proposal with the signal handler for SIGTERM set in the Marionette code would only work for itself to close the subprocess and to cleanup the environment. Given that we also need a handler for the subclassed firefox-ui-tests and that only a single handler can be defined in the same process, there is no way for us to react on SIGTERM. It would only work if Marionette would let us hook on top of its own handler. Also all this custom signal handling code would need to be duplicated. It's needed by Marionette incl firefox-ui-tests, mozharness, and various other applications.

So my idea was to have a separate module in mozprocess like signalhandler.py, which will make it easy to register callbacks for specific signals. It would internally manage those callbacks to let even multiple code instances in the same process allow to have their own "handlers". In the PoC it's called InterruptHandler and can be found in processs_handler/process_handler.py. Right now it only covers OS X and Linux. For Windows it seems like changes for ProcessHandler are necessary to get it working. So it's not part of the PoC yet.

Ted, I hope that this comment explains it in detail as you requested. I would appreciate your feedback on it. Thanks!
Flags: needinfo?(ted)
Ted, it would be great if you could spend some minutes on this bug and give me your feedback. Thanks.
Whiteboard: [qa-automation-blocked]
(In reply to Henrik Skupin (:whimboo) from comment #3)
> So from my understanding Firefox should be correctly handling the SIGTERM
> interrupt and closes itself safely.

Just FYI, this is not strictly true. SIGTERM will cause Firefox to quit, but not safely (bug 336193). For automation purposes it's probably fine.

(In reply to Henrik Skupin (:whimboo) from comment #5)
> process, there is no way for us to react on SIGTERM. It would only work if
> Marionette would let us hook on top of its own handler. Also all this custom
> signal handling code would need to be duplicated. It's needed by Marionette
> incl firefox-ui-tests, mozharness, and various other applications.

You're asserting that any script that uses mozprocess wants to use this because otherwise it'll fail to kill its child processes if it gets killed via SIGTERM? I think my initial pushback on this is because this is not an unusual situation--any Python program using subprocess.Popen will have the same problem. That being said, providing a simple solution in mozprocess doesn't seem unreasonable.

I looked over your POC, it seems fairly reasonable, although I would suggest implementing a much smaller API to start unless we find need for it, maybe even just something like "mozprocess.register_signal_handlers()", that adds handlers for the various signals you need to catch, and raises an exception when it hits them, so that callers just need to do:
```
mozprocess.register_signal_handlers()
try:
  # launch a process...
except mozprocess.SignalException:
  # cleanup...

How does that sound?
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #7)
> > So from my understanding Firefox should be correctly handling the SIGTERM
> > interrupt and closes itself safely.
> 
> Just FYI, this is not strictly true. SIGTERM will cause Firefox to quit, but
> not safely (bug 336193). For automation purposes it's probably fine.

Interesting. Actually it can only become better with the changes here.

> (In reply to Henrik Skupin (:whimboo) from comment #5)
> > process, there is no way for us to react on SIGTERM. It would only work if
> > Marionette would let us hook on top of its own handler. Also all this custom
> > signal handling code would need to be duplicated. It's needed by Marionette
> > incl firefox-ui-tests, mozharness, and various other applications.
> 
> You're asserting that any script that uses mozprocess wants to use this
> because otherwise it'll fail to kill its child processes if it gets killed
> via SIGTERM? I think my initial pushback on this is because this is not an
> unusual situation--any Python program using subprocess.Popen will have the
> same problem. That being said, providing a simple solution in mozprocess
> doesn't seem unreasonable.
> 
> I looked over your POC, it seems fairly reasonable, although I would suggest
> implementing a much smaller API to start unless we find need for it, maybe
> even just something like "mozprocess.register_signal_handlers()", that adds
> handlers for the various signals you need to catch, and raises an exception
> when it hits them, so that callers just need to do:
> ```
> mozprocess.register_signal_handlers()
> try:
>   # launch a process...
> except mozprocess.SignalException:
>   # cleanup...
> 
> How does that sound?

Sounds great. I will get this implemented then and move this bug to mozprocess for now. A new Marionette bug will be filed once this feature has been added.
Component: Marionette → Mozbase
Summary: Marionette does not handle SIGTERM for proper shutdown → [mozprocess] Add module for optional signal handling
Attached patch WIP v1Splinter Review
Ted, this is a WIP of the proposal including your feedback. Can you please have a look at it? The question is if we want to introduce some signal groups like termination etc, so applications won't have to register for multiple signals (Linux vs. Windows). The WIP will only work for Linux and Mac and needs more work for Windows.
Attachment #8658156 - Attachment is obsolete: true
Attachment #8690145 - Flags: feedback?(ted)
Comment on attachment 8690145 [details] [diff] [review]
WIP v1

Trying to get additional feedback from Andrew and Chris for this addition. If that looks fine to you I can continue my work and get things in for Windows.

Consumer code which actually handles the SIGINT and SIGTERM signals would look like: 

try:
    register_signal_handler(signal.SIGTERM)

    marionette = Marionette()
    marionette.run_tests()

except (KeyboardInterrupt, TerminationInterrupt) as e:
    print 'marionette: aborting due to system interrupt: %s' % type(e)
    print 'SIGTERM -> kill process'
    marionette.proc.kill()

finally:
    print('marionette: cleanup handler')
    print('marionette: remove profile...')
Attachment #8690145 - Flags: feedback?(cmanchester)
Attachment #8690145 - Flags: feedback?(ahalberstadt)
Comment on attachment 8690145 [details] [diff] [review]
WIP v1

Review of attachment 8690145 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, please don't hate me, but I'm going to contradict ted a bit and suggest something somewhere in-between your original POC and ted's idea :p..

This looks more or less good to me, and is in line with ted's suggestion. But my problem with this approach is that if a signal handler already exists, it won't get called anymore. The API here is much simpler than your POC, but you lose the ability to 'append' handlers to existing ones rather than overwriting them. I think a middle solution that keeps the very simple API, but allows you to store a list of handlers (rather than raising an exception) is a better compromise. E.g, something like this:

_signal_handlers = defaultdict(list)

def _generic_handler(signum, frame):
    for handler in _signal_handlers[signum]:
        handler()

def register_signal_handler(signum, handler):
    orig_handler = signal.getsignal(signum)
    if orig_handler != _generic_handler:
        signal.signal(signum, _generic_handler)
        if orig_handler not in _signal_handlers[signum]:
            _signal_handlers[signum].append(orig_handlers)
    _signal_handlers[signum].insert(0, handler)

def unregister_signal_handler(...):
    ...


Thanks for looking into this.. I'm kind of surprised something like this isn't built-in to the signal module.
Attachment #8690145 - Flags: feedback?(ahalberstadt) → feedback+
Comment on attachment 8690145 [details] [diff] [review]
WIP v1

Review of attachment 8690145 [details] [diff] [review]:
-----------------------------------------------------------------

I think Andrew's proposal is generally reasonable. I don't know how often we'll end up needing to manage multiple signal handlers, but that's probably ok.

My only issue with the code here is how it relies on catching exceptions thrown by signal handlers. I know it's common practice in Python (KeyboardInterrupt, for instance), but I find it counter-intuitive because it takes an asynchronous idea and makes it seem sequential. The kind of clean up we want to do looks like it could just happen in the handler.
Attachment #8690145 - Flags: feedback?(cmanchester)
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> Ok, please don't hate me, but I'm going to contradict ted a bit and suggest
> something somewhere in-between your original POC and ted's idea :p..

Why should I? I really like to get different opinions in how to implement such a feature, so absolutely no worries! Better thanks for your input.

> But my problem with this approach is that if a signal handler already
> exists, it won't get called anymore. The API here is much simpler than your
> POC, but you lose the ability to 'append' handlers to existing ones rather
> than overwriting them. I think a middle solution that keeps the very simple
> API, but allows you to store a list of handlers (rather than raising an
> exception) is a better compromise. E.g, something like this:

So if we would take this route and would also call the original handler as set by some other code in that application, wouldn't it then make sense to have it enabled as a basic feature for ProcessHandler by default? I mean the main issue Ted mentioned was that the original handler is not called anymore and the application might malfunction. With your approach it will not happen. We would call our generic handler first, and afterward the original one. Here an example:

ProcessHandler.run():
   set handlers (eg. SIGINT and SIGTERM to safely kill the external proc)
   call proc.run()

ProcessHandler.wait():
   wait until returned
   unset handlers

The limitation is that you can only work with signals on the main thread. So if you want to call an external tool via a different thread it won't work. At least that would speak against a default signal handler usage.

(In reply to Chris Manchester [:chmanchester] from comment #12)
> I think Andrew's proposal is generally reasonable. I don't know how often
> we'll end up needing to manage multiple signal handlers, but that's probably
> ok.

Right. We would have at least two, which are the default and a custom one. I also don't see a reason why you would set multiple handlers for the same signal. Andrew, was there something specific you had in mind?

> My only issue with the code here is how it relies on catching exceptions
> thrown by signal handlers. I know it's common practice in Python
> (KeyboardInterrupt, for instance), but I find it counter-intuitive because
> it takes an asynchronous idea and makes it seem sequential. The kind of
> clean up we want to do looks like it could just happen in the handler.

We could make it working with handlers AND exceptions. In case you registered for a signal but without specifying a handler we would raise otherwise call the handler. How does this sound?
Flags: needinfo?(cmanchester)
Flags: needinfo?(ahalberstadt)
My(In reply to Henrik Skupin (:whimboo) from comment #13)
> Right. We would have at least two, which are the default and a custom one. I
> also don't see a reason why you would set multiple handlers for the same
> signal. Andrew, was there something specific you had in mind?

For example, you said firefox-ui-tests inherits from Marionette and has to set a signal handler. But what if marionette-core code is also setting a signal handler? Your signal handler would overwrite the one set by marionette-core and all the cleanup that happens there would be lost. Those are really the only two signal handlers I'm imagining.. though I guess a third application that consumes firefox-ui-tests could come along and make a third handler that overwrites the first two.
Flags: needinfo?(ahalberstadt)
Like I said raising exceptions in signal handlers is common practice in Python, if you want to have that functionality that's ok, but I think it would be simpler overall if this just offered a way to register/remove a handler.
Flags: needinfo?(cmanchester)
Comment on attachment 8690145 [details] [diff] [review]
WIP v1

I don't think I have any useful feedback to give above what ahal and chamanchester have already said.
Attachment #8690145 - Flags: feedback?(ted)
I tried to pick this up again and while testing with mozharness I noticed that the behavior as reported originally is gone. Whenever I quit the mozharness script via SIGINT or SIGTERM Firefox will close after some seconds. Not sure yet why it's working but I will investigate tomorrow.
Whiteboard: [qa-automation-blocked]
At the moment I really do not see this problem anymore on Linux and OS X. Only Windows is affected here, it continuous the tests even the Jenkins job got canceled. But that would mean a marionette process is left around which keeps Firefox open. So this bug is really not that critical anymore, and I would drop it from my list for now.
Assignee: hskupin → nobody
No longer blocks: 1283906
Status: ASSIGNED → NEW
Duplicate of this bug: 1351353
You need to log in before you can comment on or make changes to this bug.