Closed Bug 1325785 Opened 3 years ago Closed 3 years ago

Autophone - devices which go offline prevent shutdown and reboot

Categories

(Testing :: Autophone, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(3 files)

Recently the pixel-0{7,8} devices have gone 'offline' and entered an error state. autophone attempts to shutdown in order to reboot however the 'offline' device remains in the error state and does not shut down which prevents the full shutdown and reboot. This currently requires an operator to issue ap autophone-stop in order to recover which if not issued quickly enough will result in the pulse queues being deleted.
I have been testing a trial patch in production but the problem still persists. pixel-09 is the latest instance. The worker's state remains at SHUTTINGDOWN and is never set to SHUTDOWN and the worker process exited which is a requirement for the main process to complete the shutdown. worker.py:main_loop should handle this after the heartbeat is updated.
Finally captured the issue with DEBUG enabled. It appears that the Pixel device disconnection initiates a shutdown as expected. All of the device workers receive the shutdown message from autophone and with the exception of pixel-10 they all shut down normally by exiting the process via the main_loop.

pixel-10 is different however. It does not shut down via main_loop but instead after 15 minutes is forcefully stopped which results in the main process not receiving the update status PhoneStatus.SHUTDOWN for pixel-10 though the worker process is terminated. It appears the autophone process initiated the stop command on pixel-10 from the autophone process which successfully called PhoneWorkerSubProcess.stop from the autophone process which forcefully terminates the worker process without removing the worker from Autophone.phone_workers.

Autophone.stop() is called in:

* Autophone.worker_msg_loop which will forcefully stop all workers in the finally clause

* the sigterm handler set in Autophone.autophone_runner

* Autophone.check_for_unrecoverable_errors where PhoneWorker.stop is called if the heartbeat hasn't been updated.

Since the heartbeat is set at 900 seconds, I think we can point to check_for_unrecoverable_errors as being responsible for issuing the stop. I'm going to try to call Autophone.purge_worker to clear the stuck worker which should allow Autophone.worker_msg_loop to exit the while loop and complete the shutdown and restart.

I still don't know why the worker's process gets hung up and ends up requiring the heartbeat to kill it.
The basic issue is the failure to update the status of the worker and its associated tests in the event that a worker process becomes unresponsive or a test fails to initialize due to device issues.

The fix for the particular hang seen by the pixel devices in this bug is to call Autophone.purge_worker after the workers are stopped when the heartbeat message isn't updated. For consistency, purge_worker is also called in other locations where the worker process is stopped.

The patch also catches exceptions raised from the PhoneTest methods which may have not been handled properly.

A tag along to this bug is fixing several cases where we call the logger with string interpolation instead of passing arguments to the logger method.

This has been running in production as a trial patch over the weekend and appears to have resolve the issues with the autophone process hanging due to disonnected Pixel devices.
Attachment #8827473 - Flags: review?(jmaher)
Comment on attachment 8827473 [details] [diff] [review]
bug-1325785-shutdown-hangs-v1.patch

Review of attachment 8827473 [details] [diff] [review]:
-----------------------------------------------------------------

so much % cleanup, glad to see this will help improve logging.
Attachment #8827473 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/2b6760674fd092becc97edfb85eaed57b32e0049
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.