Closed Bug 1325785 Opened 8 years ago Closed 8 years ago

Autophone - devices which go offline prevent shutdown and reboot

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(3 files)

messages related to pixel-09 disconnecting 8 years ago Bob Clary [:bc] (inactive) 2.52 KB, text/plain		Details
logs showing stop of worker process instead of shutdown 8 years ago Bob Clary [:bc] (inactive) 6.95 KB, text/plain		Details
bug-1325785-shutdown-hangs-v1.patch 8 years ago Bob Clary [:bc] (inactive) 17.81 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review

Bob Clary [:bc] (inactive)

Assignee

Description

•

8 years ago

Recently the pixel-0{7,8} devices have gone 'offline' and entered an error state. autophone attempts to shutdown in order to reboot however the 'offline' device remains in the error state and does not shut down which prevents the full shutdown and reboot. This currently requires an operator to issue ap autophone-stop in order to recover which if not issued quickly enough will result in the pulse queues being deleted.

Bob Clary [:bc] (inactive)

Assignee

Comment 1

•

8 years ago

Attached file messages related to pixel-09 disconnecting — Details

I have been testing a trial patch in production but the problem still persists. pixel-09 is the latest instance. The worker's state remains at SHUTTINGDOWN and is never set to SHUTDOWN and the worker process exited which is a requirement for the main process to complete the shutdown. worker.py:main_loop should handle this after the heartbeat is updated.

Bob Clary [:bc] (inactive)

Assignee

Comment 2

•

8 years ago

Attached file logs showing stop of worker process instead of shutdown — Details

Finally captured the issue with DEBUG enabled. It appears that the Pixel device disconnection initiates a shutdown as expected. All of the device workers receive the shutdown message from autophone and with the exception of pixel-10 they all shut down normally by exiting the process via the main_loop. pixel-10 is different however. It does not shut down via main_loop but instead after 15 minutes is forcefully stopped which results in the main process not receiving the update status PhoneStatus.SHUTDOWN for pixel-10 though the worker process is terminated. It appears the autophone process initiated the stop command on pixel-10 from the autophone process which successfully called PhoneWorkerSubProcess.stop from the autophone process which forcefully terminates the worker process without removing the worker from Autophone.phone_workers. Autophone.stop() is called in: * Autophone.worker_msg_loop which will forcefully stop all workers in the finally clause * the sigterm handler set in Autophone.autophone_runner * Autophone.check_for_unrecoverable_errors where PhoneWorker.stop is called if the heartbeat hasn't been updated. Since the heartbeat is set at 900 seconds, I think we can point to check_for_unrecoverable_errors as being responsible for issuing the stop. I'm going to try to call Autophone.purge_worker to clear the stuck worker which should allow Autophone.worker_msg_loop to exit the while loop and complete the shutdown and restart. I still don't know why the worker's process gets hung up and ends up requiring the heartbeat to kill it.

Bob Clary [:bc] (inactive)

Assignee

Comment 3

•

8 years ago

Attached patch bug-1325785-shutdown-hangs-v1.patch — Details — Splinter Review

The basic issue is the failure to update the status of the worker and its associated tests in the event that a worker process becomes unresponsive or a test fails to initialize due to device issues. The fix for the particular hang seen by the pixel devices in this bug is to call Autophone.purge_worker after the workers are stopped when the heartbeat message isn't updated. For consistency, purge_worker is also called in other locations where the worker process is stopped. The patch also catches exceptions raised from the PhoneTest methods which may have not been handled properly. A tag along to this bug is fixing several cases where we call the logger with string interpolation instead of passing arguments to the logger method. This has been running in production as a trial patch over the weekend and appears to have resolve the issues with the autophone process hanging due to disonnected Pixel devices.

Attachment #8827473 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 4

•

8 years ago

Comment on attachment 8827473 [details] [diff] [review] bug-1325785-shutdown-hangs-v1.patch Review of attachment 8827473 [details] [diff] [review]: ----------------------------------------------------------------- so much % cleanup, glad to see this will help improve logging.

Attachment #8827473 - Flags: review?(jmaher) → review+

Bob Clary [:bc] (inactive)

Assignee

Comment 5

•

8 years ago

https://github.com/mozilla/autophone/commit/2b6760674fd092becc97edfb85eaed57b32e0049

Blocks: autophone-deployments

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

3 years ago

Product: Testing → Testing Graveyard

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Autophone - devices which go offline prevent shutdown and reboot

Categories

(Testing Graveyard :: Autophone, defect)

Tracking

(Not tracked)

People

(Reporter: bc, Assigned: bc)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Updated

Attachment

General

Description

File Name

Content Type