Closed Bug 1891815 Opened 10 months ago Closed 10 months ago

Increase terminationGracePeriodSeconds for bitrisescript

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: ahal, Assigned: ahal)

Attachments

(4 files)

[mozilla-releng/scriptworker-scripts] Bug 1891815 - [bitrisescript] Adjust POLL_DURATION for pre-stop.sh (#973), 10 months ago, BMO Github Automation, 63 bytes, text/x-github-pull-request
[mozilla-releng/scriptworker-scripts] Bug 1891815 - Increase bitrise MAX_TASK_TIMEOUT to 2 hours (#974), 10 months ago, BMO Github Automation, 63 bytes, text/x-github-pull-request
[mozilla-releng/scriptworker-scripts] Bug 1891815 - Increase global task timeout to two hours (#975), 10 months ago, BMO Github Automation, 63 bytes, text/x-github-pull-request
[mozilla-services/cloudops-infra] Bug 1891815 - [relengworker] Bump bitrise 'terminationGracePeriodSeco… (#5550), 10 months ago, Julien Cristau [:jcristau], 60 bytes, text/x-github-pull-request

Andrew Halberstadt [:ahal]

Assignee

Description

10 months ago

I recently enabled some bitrisescript tasks in Firefox-iOS, but noticed that many of them were failing with WORKER_SHUTDOWN. Essentially, Kubernetes was terminating the workers before the tasks could complete. Johan pointed me toward bug 1791366, which had a similar symptom.

The issue is that Kubernetes sees that we don't have any work left to claim and signals some of the replicas to shut down. Kubernetes has a config option called terminationGracePeriodSeconds, which is the amount of time a replica has to finish whatever it is doing before it is forcefully killed. This value was configured to 30 minutes, which means any tasks that took longer than that were at risk of being forcefully terminated before they could finish. For more info, see:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
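
To make the failure condition concrete, here is a minimal Python sketch of the check described above. It is only an illustration: the function name `at_risk` and the sample durations are assumptions, and the only value taken from this bug is the 30 minute grace period.

```python
from datetime import timedelta

# Grace period that was configured before this bug was fixed (30 minutes).
GRACE_PERIOD = timedelta(minutes=30)


def at_risk(remaining: timedelta, grace_period: timedelta = GRACE_PERIOD) -> bool:
    """Return True if a task that still needs `remaining` time when the
    replica is told to shut down would outlive the grace period and so
    could be forcefully killed before finishing."""
    return remaining > grace_period


# Hypothetical task durations, for illustration only.
print(at_risk(timedelta(minutes=10)))  # task finishes within the grace period
print(at_risk(timedelta(minutes=45)))  # task outlives the grace period
```

Any task whose remaining runtime exceeds the grace period at the moment of the shutdown signal falls into the second case, which matches the WORKER_SHUTDOWN failures seen here.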

Bitrisescript is a little different in that the length of its tasks can be arbitrary. So we should set the limit to something quite high, as this will act as the ceiling for all workflows we might want to implement in Bitrise. I'm thinking we should do two hours for starters.
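
As a rough sketch of what "ceiling" means here, the Python below converts the proposed two hours to seconds (the unit terminationGracePeriodSeconds uses) and checks whether a workflow timeout fits under it. The dict is a hypothetical stand-in for a pod spec fragment, and `fits_under_ceiling` and the sample timeouts are assumptions for illustration, not the actual configuration from the attached pull requests.

```python
# Proposed grace period from this bug: two hours, expressed in seconds.
TWO_HOURS_SECONDS = 2 * 60 * 60

# Hypothetical stand-in for the relevant part of a pod spec.
pod_spec_fragment = {"terminationGracePeriodSeconds": TWO_HOURS_SECONDS}


def fits_under_ceiling(workflow_timeout_seconds: int, spec: dict = pod_spec_fragment) -> bool:
    """Return True if a workflow with this timeout can always finish
    within the grace period after a shutdown signal."""
    return workflow_timeout_seconds <= spec["terminationGracePeriodSeconds"]


print(TWO_HOURS_SECONDS)  # 7200
# Hypothetical workflow timeouts, for illustration only.
print(fits_under_ceiling(90 * 60))       # a 90 minute workflow fits
print(fits_under_ceiling(3 * 60 * 60))   # a 3 hour workflow does not
```

Under this reading, any task timeout enforced by the worker needs to be no larger than the grace period, otherwise a task could still be killed mid-run.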

BMO Github Automation

Comment 1

10 months ago

Attached file [mozilla-releng/scriptworker-scripts] Bug 1891815 - [bitrisescript] Adjust POLL_DURATION for pre-stop.sh (#973)

BMO Github Automation

Comment 2

10 months ago

Attached file [mozilla-releng/scriptworker-scripts] Bug 1891815 - Increase bitrise MAX_TASK_TIMEOUT to 2 hours (#974)

BMO Github Automation

Comment 3

10 months ago

Attached file [mozilla-releng/scriptworker-scripts] Bug 1891815 - Increase global task timeout to two hours (#975)

Julien Cristau [:jcristau]

Comment 4

10 months ago

Attached file [mozilla-services/cloudops-infra] Bug 1891815 - [relengworker] Bump bitrise 'terminationGracePeriodSeco… (#5550)

Julien Cristau [:jcristau]

Comment 5

10 months ago

Looks like everything here was merged?

Flags: needinfo?(ahal)

Andrew Halberstadt [:ahal]

Assignee

Comment 6

10 months ago

Yep, this is confirmed fixed. Thanks!

Status: ASSIGNED → RESOLVED

Closed: 10 months ago

Flags: needinfo?(ahal)

Resolution: --- → FIXED

Categories

(Release Engineering :: Release Automation, defect)

Security

(public)
