Closed Bug 1891815 (opened 10 months ago, closed 10 months ago)

Increase terminationGracePeriodSeconds for bitrisescript

Categories

(Release Engineering :: Release Automation, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Assigned: ahal)

Attachments

(4 files)

I recently enabled some bitrisescript tasks in Firefox-iOS, but noticed that many of them were failing with WORKER_SHUTDOWN. Basically, Kubernetes was terminating the workers before the tasks could complete. Johan pointed me toward bug 1791366, which had a similar symptom.

The issue is that Kubernetes sees that we don't have any work left to claim and signals to some of the replicas that they should shut down. Kubernetes has a config option called terminationGracePeriodSeconds, which is the amount of time a replica has to finish whatever it is doing before it is forcefully killed. This value was configured for 30 minutes, which means any task that took longer than that was at risk of being forcefully terminated before it could finish. For more info, see:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

Bitrisescript is a little different in that the length of the tasks can be arbitrary. So we should set the limit to something quite high, as this will act as the ceiling for all workflows we might want to implement in Bitrise. I'm thinking we should do two hours (7200 seconds) for starters.
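
For illustration, here is a rough sketch of where this setting lives in a Kubernetes Deployment manifest. This is not the actual bitrisescript configuration: the Deployment name, labels, and image are hypothetical placeholders, and only the terminationGracePeriodSeconds field and the two-hour value come from this bug.

```yaml
# Hypothetical Deployment fragment; names and image are placeholders,
# not the real bitrisescript configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bitrisescript-worker          # hypothetical name
spec:
  selector:
    matchLabels:
      app: bitrisescript-worker       # hypothetical label
  template:
    metadata:
      labels:
        app: bitrisescript-worker
    spec:
      # Time a pod gets between the termination signal and being
      # forcefully killed. 2 hours = 7200 seconds; the previous
      # 30-minute setting would correspond to 1800.
      terminationGracePeriodSeconds: 7200
      containers:
        - name: bitrisescript
          image: example.invalid/bitrisescript:latest   # placeholder image
```

The field is set per pod spec, so it applies to every replica created from the template. A task running longer than this ceiling would still be at risk of WORKER_SHUTDOWN.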

Looks like everything here was merged?

Flags: needinfo?(ahal)

Yep, this is confirmed fixed. Thanks!

Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Flags: needinfo?(ahal)
Resolution: --- → FIXED
