Open Bug 1529777 Opened 7 months ago Updated 9 days ago

Hard max-run-time invites intermittent failures

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

People

(Reporter: gbrown, Unassigned)

References

(Blocks 2 open bugs)

Details

Most (all?) tasks have a max-run-time parameter, e.g.:

https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/taskcluster/ci/test/misc.yml#9

When a task runs for longer than max-run-time, the task is forced down, generally resulting in an error report like "[taskcluster:error] Task timeout after 3600 seconds. Force killing container."

Task timeout is an important safeguard against hung tasks: if a task isn't proceeding, we don't want to wait indefinitely for it.

However, the resulting experience in bug 1411358 (and others; see the See Also bugs) is frustrating and wasteful: every week, 50 to 100 tasks "randomly" time out, despite ongoing efforts to update test chunks and max-run-time parameters to avoid common timeouts. In my experience, most of those tasks are making progress and would succeed if given more time. At the same time, the max-run-time is generally much longer than the average run-time of the task: a task might usually run in 40 minutes, but take 70+ minutes for 2% of runs. I think this is usually just a result of variance in machine performance and/or variance in product (or even test) performance.

Can we be more forgiving about max-run-time? If a task exceeds max-run-time but has logged something in the last 5 minutes, can we extend max-run-time by 30 minutes (or 50%, or ...)? That sort of strategy would be less efficient in some cases: If the task still times out after the extension, we've only wasted more time. But I think the majority of extension cases would succeed, avoiding intermittent failures, unnecessary retries, etc.
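To make the proposal concrete, here is a minimal Python sketch of the extension rule described above. It is not taken from any worker implementation: `TimeoutPolicy`, `next_deadline`, and the `max_extensions` cap are hypothetical names and assumptions introduced only for illustration, and all times are seconds since the task started.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutPolicy:
    """Hypothetical policy values; none of these names come from a real worker."""
    max_run_time: float = 3600.0    # the task's configured limit, in seconds
    activity_window: float = 300.0  # "logged something in the last 5 minutes"
    extension: float = 1800.0       # extend by 30 minutes
    max_extensions: int = 1         # assumed cap so a chatty hung task still ends


def next_deadline(policy: TimeoutPolicy, deadline: float, now: float,
                  last_log_time: float, extensions_used: int) -> Optional[float]:
    """Return the deadline to use from now on, or None to kill the task."""
    if now < deadline:
        # Not at the limit yet, so there is nothing to decide.
        return deadline
    recently_active = (now - last_log_time) <= policy.activity_window
    if recently_active and extensions_used < policy.max_extensions:
        # The task is still logging: grant one more extension.
        return deadline + policy.extension
    # Either the task went quiet or it already used its extensions.
    return None
```

With the defaults above, a task that reaches 3600 seconds having logged at 3500 seconds gets a new deadline of 5400 seconds, while a task whose last log line was at 3000 seconds is killed as before. The cap on extensions is one way to bound the extra cost when the task times out anyway.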

I believe one of the workers has an idle timeout, too. So perhaps this could be approximated by increasing max-run-time, while reducing the idle timeout?
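The approximation could look roughly like the following Python sketch. It assumes a worker that tracks the time of the last log line; `kill_reason`, its parameters, and the default values (a 2-hour hard limit, a 5-minute idle timeout) are hypothetical and chosen only to illustrate the idea, with all times in seconds since the task started.

```python
from typing import Optional


def kill_reason(now: float, last_log_time: float,
                max_run_time: float = 7200.0,
                idle_timeout: float = 300.0) -> Optional[str]:
    """Return why the task should be killed, or None to let it keep running."""
    if now >= max_run_time:
        # Generous hard limit: the last-resort safeguard.
        return "max-run-time"
    if now - last_log_time >= idle_timeout:
        # Nothing logged for a while: treat the task as hung and stop early.
        return "idle"
    return None
```

Under these assumed values, a slow task that is 70 minutes in and still logging keeps running, while a task that has been silent for 5 minutes is stopped well before the hard limit.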

Component: General → Workers

There are a couple of issues here:

  • For hung tasks, extending the timeout at all just wastes more resources. Time == money for CI resources.
  • If we're going to consider extending runtime based on logged output, why not go in both directions and kill tasks early if they don't log anything for 5 minutes?
  • If we automatically bump max runtimes under certain circumstances, does that lead to a corresponding perf regression for the runtime of that task? E.g., if we give a task 10 extra minutes to complete and it does eventually finish, does that appear as a 10-minute regression in perfherder?
Duplicate of this bug: 1333833