From Lex:

> In the instance watcher, time_to_stop_trying is never cleaned up
> when an instance is set up or given up on. I _think_ that means
> that if the watcher fails to set up any instance once, it will
> never exit, because the check at spot_manager.py:310 will not
> succeed.
This problem has been observed in the wild and has not been solved. Specifically, this line sometimes never returns: https://github.com/klahnakoski/SpotManager/blob/edf94957d14dace23b97decd75b0e01a9f909cd0/examples/etl.py#L83

As you can see, some logging has been added for the next time it happens. The method is also called from a Thread so that we can kill it later if it takes too long.

To solve this properly, the instance manager should be called from its own thread and report progress back periodically. When the spot manager has not heard from the instance manager for a while, it should kill that instance manager and move on.
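A minimal sketch of that heartbeat idea, using only the standard `threading` module rather than SpotManager's own thread classes. Every name here (`Heartbeat`, `run_with_watchdog`, `healthy_setup`, `hung_setup`) is hypothetical and does not exist in the repository. Plain Python threads cannot be killed from outside, so "kill" is approximated by setting a stop flag and abandoning a daemon thread; a call that truly hangs would likely need a subprocess instead.

```python
import threading
import time


class Heartbeat:
    """Thread-safe record of the last time a worker reported progress."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = time.monotonic()

    def beat(self):
        with self._lock:
            self._last = time.monotonic()

    def seconds_since(self):
        with self._lock:
            return time.monotonic() - self._last


def run_with_watchdog(setup, silence_limit, poll_interval=0.05):
    """Run setup(heartbeat, please_stop) in its own thread.

    Returns True if setup finished without raising. Returns False if it
    raised, or if it stayed silent longer than silence_limit seconds and
    was abandoned.
    """
    heartbeat = Heartbeat()
    please_stop = threading.Event()
    outcome = {}

    def target():
        try:
            setup(heartbeat, please_stop)
            outcome["ok"] = True
        except Exception as cause:
            outcome["error"] = cause

    # daemon=True so an abandoned worker cannot keep the process alive
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    while worker.is_alive():
        if heartbeat.seconds_since() > silence_limit:
            please_stop.set()  # cooperative request; the worker may ignore it
            return False       # give up on this instance and move on
        worker.join(poll_interval)
    return outcome.get("ok", False)


def healthy_setup(heartbeat, please_stop):
    # Stand-in for instance setup that reports progress after each step
    for _ in range(3):
        if please_stop.is_set():
            return
        time.sleep(0.02)
        heartbeat.beat()


def hung_setup(heartbeat, please_stop):
    # Stand-in for a setup call that never reports progress
    please_stop.wait()
```

With this shape, the spot manager would call `run_with_watchdog` once per instance and treat `False` as "give up on this instance", which also gives it a natural place to discard any per-instance bookkeeping.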