Open Bug 1529777 Opened 5 years ago Updated 2 years ago

Hard max-run-time invites intermittent failures [DO NOT USE FOR CLASSIFICATION]

Categories

(Firefox Build System :: Task Configuration, enhancement)

Tracking

(Not tracked)

People

(Reporter: gbrown, Unassigned)

Most (all?) tasks have a max-run-time parameter, eg:

https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/taskcluster/ci/test/misc.yml#9

When a task runs for longer than max-run-time, the task is forced down, generally resulting in an error report like "[taskcluster:error] Task timeout after 3600 seconds. Force killing container."

Task timeout is an important safeguard against hung tasks: if a task isn't making progress, we don't want to wait indefinitely for it.

However, the resulting experience in bug 1411358 (and others; see the See Also bugs) is frustrating and wasteful: every week, 50 to 100 tasks "randomly" time out, despite ongoing efforts to update test chunks and max-run-time parameters to avoid common timeouts. In my experience, most of those tasks are making progress and would succeed if given more time. At the same time, max-run-time is generally much longer than the average run time of the task: a task might usually run in 40 minutes, but take 70+ minutes for 2% of runs. I think this is usually just a result of variance in machine performance and/or variance in product (or even test) performance.

Can we be more forgiving about max-run-time? If a task exceeds max-run-time but has logged something in the last 5 minutes, can we extend max-run-time by 30 minutes (or 50%, or ...)? That sort of strategy would be less efficient in some cases: If the task still times out after the extension, we've only wasted more time. But I think the majority of extension cases would succeed, avoiding intermittent failures, unnecessary retries, etc.
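
The extension rule proposed above could be sketched roughly as follows. This is only an illustration of the idea, not an existing worker feature: the `TimeoutPolicy` class, its field names, and `deadline_after_check` are all hypothetical, and the limit of one extension is an assumption added so the sketch cannot extend forever.

```python
from dataclasses import dataclass


@dataclass
class TimeoutPolicy:
    """Hypothetical policy knobs; none of these are real Taskcluster settings."""
    max_run_time: int = 3600       # seconds, matching the example error message
    recent_log_window: int = 300   # "logged something in the last 5 minutes"
    extension: int = 1800          # "extend max-run-time by 30 minutes"
    max_extensions: int = 1        # assumption: extend at most once


def deadline_after_check(policy, deadline, last_log_age, extensions_used):
    """Return the new deadline in seconds, or None if the task should be killed.

    Intended to be called when the task's elapsed time reaches `deadline`.
    `last_log_age` is the number of seconds since the task last logged a line.
    """
    still_logging = last_log_age <= policy.recent_log_window
    if still_logging and extensions_used < policy.max_extensions:
        return deadline + policy.extension
    return None


policy = TimeoutPolicy()
# A task that logged 60 seconds ago gets 30 more minutes.
print(deadline_after_check(policy, policy.max_run_time, 60, 0))
# A task that has been silent for 10 minutes is killed.
print(deadline_after_check(policy, policy.max_run_time, 600, 0))
```

With these assumed defaults, a task that is still logging at the 3600-second mark gets a new deadline of 5400 seconds, and a second request for an extension is refused.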

I believe one of the workers has an idle timeout, too. So perhaps this could be approximated by increasing max-run-time, while reducing the idle timeout?

Component: General → Workers

There are a couple of issues here:

  • for hung tasks, extending the timeout at all just wastes more resources. Time == money for CI resources
  • if we're going to consider extending runtime based on logged output, why not go in both directions and kill tasks early if they don't log anything for 5min?
  • if we automatically bump max runtimes under certain circumstances, does that lead to a corresponding perf regression for the runtime of that task? e.g., if we give a task 10 extra minutes to complete and it does eventually finish, does that appear as a 10min regression in perfherder?
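
To make the "both directions" idea concrete, here is a rough simulation of a longer hard cap combined with a short idle timeout. It is a sketch under assumptions: `task_outcome`, its parameters, and the 300-second and 5400-second defaults are illustrative only and do not correspond to any actual worker configuration.

```python
def task_outcome(log_times, finish_time, idle_timeout=300, hard_cap=5400):
    """Classify a simulated task run (all names and defaults are hypothetical).

    log_times: seconds since task start at which the task logged a line.
    finish_time: second at which the task would finish if never killed.
    Returns (outcome, time), where outcome is "completed", "killed-idle",
    or "killed-hard".
    """
    # Only log lines emitted before the task finishes matter.
    events = sorted(t for t in log_times if t < finish_time) + [finish_time]
    last_activity = 0  # task start counts as activity
    for t in events:
        idle_deadline = last_activity + idle_timeout
        kill_at = min(idle_deadline, hard_cap)
        if t > kill_at:
            # The task was killed before this event could happen.
            reason = "killed-idle" if idle_deadline < hard_cap else "killed-hard"
            return (reason, kill_at)
        last_activity = t
    return ("completed", finish_time)


# Slow but progressing: logs every 4 minutes, finishes at 70 minutes.
print(task_outcome(range(240, 4200, 240), 4200))
# Hung: stops logging after 2 minutes, killed long before the hard cap.
print(task_outcome([60, 120], 10000))
# Chatty but runaway: still hits the hard cap.
print(task_outcome(range(60, 9000, 60), 9000))
```

In this model the slow task completes at 4200 seconds, the hung task is killed at 420 seconds instead of waiting for the hard cap, and the runaway task is still stopped at 5400 seconds. It does not address the perfherder question in the last bullet.
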
See Also: → 1589796

Geoff, the failures here are build failures and some xpcshell failures. For example:

build: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286275573&repo=autoland&lineNumber=568

Xpcshell: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286248830&repo=autoland&lineNumber=2663

Should there be separate bugs for these?

Flags: needinfo?(gbrown)

This bug was filed to investigate a strategy for dealing with intermittent-failure bugs like 1411358: Failures should be starred against bugs like 1411358 and 1589796 -- NOT this one. I do not think there is a need to open new bugs: 1411358 and 1589796 should cover almost all possibilities.

Flags: needinfo?(gbrown)

Thank you, we'll classify the failures correctly.

Component: Workers → Task Configuration
Product: Taskcluster → Firefox Build System
Summary: Hard max-run-time invites intermittent failures → Hard max-run-time invites intermittent failures [DO NOT USE FOR CLASSIFICATION]
Severity: normal → S3