Open Bug 1973063 Opened 3 days ago Updated 2 days ago

scriptworkers should accept a maxRunTime

Categories

(Release Engineering :: Release Automation, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Unassigned)

References

Details

At the moment, scriptworker tasks have no timeout that can be configured in the task payload, which means the only control over maximum run time exists at the scriptworker and Kubernetes level. The expected run time of a task varies with its type, and we ought to be able to have finer-grained control over this.

When this is addressed, it may also be a good time to look at the whole stack of timeouts and limits. As I understand it, we currently have two types of these:

  • scriptworker has a task_max_timeout which specifies the maximum amount of time to wait for stderr/stdout from the scriptworker task subprocess. This appears to be hardcoded to 20 minutes, which makes me wonder if it works at all.
  • We also have terminationGracePeriodSeconds in Kubernetes, which specifies how long a container may spend shutting down before it is forcibly killed. As far as I've been able to tell, we often enter this shutdown state immediately after starting a task (because k8s-autoscale requests shutdown of existing instances as soon as no more tasks are pending; it doesn't care that there are still tasks in the running state).
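
To illustrate the scale-down behaviour described in the second bullet, here is a minimal sketch. It is not k8s-autoscale's actual logic: the function name, its parameters, and the idea that the target is a simple sum are all assumptions made for illustration only.

```python
def desired_instances(pending: int, running: int, count_running: bool) -> int:
    """Hypothetical autoscaler target (illustration only, not k8s-autoscale code).

    count_running=False models the behaviour described above: only pending
    tasks are considered, so the target drops to zero while tasks still run.
    count_running=True keeps one instance per running task as well.
    """
    if pending < 0 or running < 0:
        raise ValueError("task counts must be non-negative")
    return pending + running if count_running else pending
```

With no pending tasks and two running tasks, the pending-only variant asks for zero instances (so the busy instances are told to shut down), while the variant that counts running tasks asks for two.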

My recommendation overall is:

  • Allow task_max_timeout to be configurable at runtime, or replace it with something new that is.
  • Accept maxRunTime in all scriptworkers, and ensure it gets passed along to task_max_timeout (or whatever new thing we come up with).
  • Ensure that task_max_timeout works as expected (i.e., that it allows the task-specific script to run for the allotted time, and no longer).
  • Update k8s-autoscale to not shut down instances that are still running tasks.
    • Set terminationGracePeriodSeconds to a much more reasonable value afterwards. (Probably a small value like 30 or 60 seconds, seeing as it would only apply after the task has been resolved?)
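
As a rough sketch of the first three recommendations (not scriptworker's actual implementation): the helper names below are hypothetical, the 1200-second default simply mirrors the 20-minute value mentioned above, and treating maxRunTime as a positive integer number of seconds is an assumption about the payload schema.

```python
import asyncio
import sys

# Assumed fallback, mirroring the hardcoded 20 minutes mentioned above.
DEFAULT_MAX_RUN_TIME = 1200


def effective_timeout(payload: dict, default: int = DEFAULT_MAX_RUN_TIME) -> int:
    """Return the run-time limit in seconds for a task payload (hypothetical helper)."""
    value = payload.get("maxRunTime", default)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"maxRunTime must be a positive integer, got {value!r}")
    return value


async def run_with_max_run_time(cmd: list[str], payload: dict) -> int:
    """Run cmd and return its exit code; kill it once the limit is exceeded."""
    timeout = effective_timeout(payload)
    proc = await asyncio.create_subprocess_exec(*cmd)
    try:
        return await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        # The limit applies to total wall-clock time, not to gaps in output.
        proc.kill()
        await proc.wait()  # reap the child so it does not linger as a zombie
        raise
```

For example, `asyncio.run(run_with_max_run_time([sys.executable, "-c", "pass"], {"maxRunTime": 30}))` returns the child's exit code, and a command that outlives its maxRunTime is killed and surfaces as `asyncio.TimeoutError`, which the worker could then report as a failed or timed-out task.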
Duplicate of this bug: 1973065