Open
Bug 1973063
Opened 3 days ago
Updated 2 days ago
scriptworkers should accept a maxRunTime
Categories
(Release Engineering :: Release Automation, enhancement, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: bhearsum, Unassigned)
At the moment we have no way to configure timeouts for scriptworker tasks in the task payload, which means the only control over maximum run times exists at the scriptworker and kubernetes level. The expected run time of a task varies with its type, and we ought to have finer-grained control over this.
When this is addressed, it may also be a good time to look at the whole stack of timeouts and limits. As I understand it, we currently have two types of these:
- scriptworker has a task_max_timeout which specifies the maximum amount of time to wait for stderr/stdout from the scriptworker task subprocess. This appears to be hardcoded to 20 minutes, which makes me wonder if it works at all.
- We also have terminationGracePeriodSeconds in kubernetes, which specifies the amount of time a container is allowed to be shutting down for before it is forcibly killed. As far as I've been able to tell, we often enter this shutdown state immediately after starting a task (because k8s-autoscale will request shutdown of existing instances as soon as no more tasks are pending - it doesn't care that there are still tasks in the running state).
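To make the scale-down problem in the second item concrete, here is a toy model in Python. It is not taken from k8s-autoscale; the function names and the idea that the replica count is derived directly from queue counts are assumptions for illustration only. It just contrasts sizing on pending tasks alone with sizing on pending plus running tasks.

```python
def desired_replicas_pending_only(pending: int, running: int, max_replicas: int) -> int:
    """Hypothetical model of the behaviour described above.

    Only pending tasks are counted, so the target drops to zero as soon
    as every pending task has been claimed, even though those tasks are
    still running. `running` is accepted but deliberately ignored.
    """
    return min(pending, max_replicas)


def desired_replicas_with_running(pending: int, running: int, max_replicas: int) -> int:
    """Hypothetical alternative: workers busy with running tasks are kept."""
    return min(pending + running, max_replicas)


# Three tasks were just claimed: nothing is pending, three are running.
print(desired_replicas_pending_only(pending=0, running=3, max_replicas=10))
print(desired_replicas_with_running(pending=0, running=3, max_replicas=10))
```

Under the first model the three busy workers would be asked to shut down immediately, which is when terminationGracePeriodSeconds would start counting.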
My recommendation overall is:
- Allow task_max_timeout to be configurable at runtime, or replace it with something new that is.
- Accept maxRunTime in all scriptworkers, and ensure it gets passed along to task_max_timeout (or whatever new thing we come up with).
- Ensure that task_max_timeout works as expected (ie: that it allows the task-specific script to run for the allotted time, and no longer).
- Update k8s-autoscale to not shut down instances that are still running tasks.
- Set terminationGracePeriodSeconds to a much more reasonable value afterwards. (Probably a small value like 30 or 60 seconds, seeing as it would only apply after the task has been resolved?)
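A rough sketch of what the maxRunTime handling could look like, in Python since that is what scriptworker is written in. None of this is existing scriptworker code: resolve_timeout and run_with_timeout are hypothetical names, and capping the payload value at a worker-level maximum (rather than rejecting the task) is an assumption about the desired behaviour. It only uses the standard library asyncio subprocess API.

```python
import asyncio
from typing import Any, Mapping, Optional, Sequence


def resolve_timeout(payload: Mapping[str, Any], worker_max: int) -> int:
    """Return the run-time limit in seconds for one task (hypothetical helper).

    Uses payload["maxRunTime"] when present, capped at the worker-level
    maximum; falls back to the worker-level maximum when it is absent.
    """
    requested = payload.get("maxRunTime")
    if requested is None:
        return worker_max
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int) or requested <= 0:
        raise ValueError(f"maxRunTime must be a positive integer, got {requested!r}")
    return min(requested, worker_max)


async def run_with_timeout(cmd: Sequence[str], timeout: float) -> Optional[int]:
    """Run cmd and return its exit code, or None if it was killed on timeout.

    The limit applies to total wall-clock run time, not to gaps in output.
    """
    proc = await asyncio.create_subprocess_exec(*cmd)
    try:
        return await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()        # the task used up its allotted time
        await proc.wait()  # reap the process so no zombie is left behind
        return None
```

With something like this, a payload of {"maxRunTime": 600} on a worker capped at 1200 seconds would get a 600 second limit, and a payload without maxRunTime would keep the worker-level limit. Whether an over-large maxRunTime should be capped or should fail the task is an open question this sketch does not settle.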