1529777 - Hard max-run-time invites intermittent failures [DO NOT USE FOR CLASSIFICATION]

Reporter

Description

6 years ago

Most (all?) tasks have a max-run-time parameter, e.g.:

https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/taskcluster/ci/test/misc.yml#9

When a task runs for longer than max-run-time, the task is forced down, generally resulting in an error report like "[taskcluster:error] Task timeout after 3600 seconds. Force killing container."

Task timeout is an important safeguard against hung tasks: if a task isn't proceeding, we don't want to wait indefinitely for it.

However, the resulting experience in bug 1411358 (and others; see the See Also bugs) is frustrating and wasteful: every week, 50 to 100 tasks "randomly" time out, despite ongoing efforts to update test chunks and max-run-time parameters to avoid common timeouts. In my experience, most of those tasks are making progress and would succeed if given more time. At the same time, the max-run-time is generally much longer than the average run-time of the task: a task might usually run in 40 minutes, but take 70+ minutes for 2% of runs. I think this is usually just a result of variance in machine performance and/or variance in product (or even test) performance.

Can we be more forgiving about max-run-time? If a task exceeds max-run-time but has logged something in the last 5 minutes, can we extend max-run-time by 30 minutes (or 50%, or ...)? That sort of strategy would be less efficient in some cases: If the task still times out after the extension, we've only wasted more time. But I think the majority of extension cases would succeed, avoiding intermittent failures, unnecessary retries, etc.
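To make the proposal concrete, here is a rough Python sketch of the decision a worker could make when a task reaches max-run-time. It is only an illustration: `ExtensionPolicy`, `decide`, and the `max_extensions` cap are hypothetical names and parameters, not anything that exists in Taskcluster, and the numbers are the ones floated above (3600 seconds, 5 minutes, 30 minutes).

```python
from dataclasses import dataclass


@dataclass
class ExtensionPolicy:
    # All values in seconds; every field here is an illustrative assumption.
    max_run_time: float = 3600.0      # the task's configured limit
    recent_log_window: float = 300.0  # "logged something in the last 5 minutes"
    extension: float = 1800.0         # extend by 30 minutes
    max_extensions: int = 1           # hypothetical cap so chatty-but-stuck tasks still end


def decide(policy, elapsed, last_log_at, extensions_used):
    """Return (action, deadline), where action is 'continue', 'extend' or 'kill'.

    elapsed and last_log_at are seconds since the task started.
    """
    deadline = policy.max_run_time + extensions_used * policy.extension
    if elapsed < deadline:
        return ("continue", deadline)
    recently_logged = (elapsed - last_log_at) <= policy.recent_log_window
    if recently_logged and extensions_used < policy.max_extensions:
        return ("extend", deadline + policy.extension)
    return ("kill", deadline)
```

With these defaults, a task that hits 3600 seconds having logged 100 seconds earlier gets a new deadline of 5400 seconds, while one that has been silent for 10 minutes is killed as it is today. The cap on extensions bounds the extra time spent on a task that still fails.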

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

6 years ago

I believe one of the workers has an idle timeout, too. So perhaps this could be approximated by increasing max-run-time, while reducing the idle timeout?

Component: General → Workers

Edwin Takahashi (:egao | infrequent contributor)

Updated

6 years ago

Blocks: task-efficiency-test-overhead

Edwin Takahashi (:egao | infrequent contributor)

Updated

6 years ago

No longer blocks: task-efficiency-test-overhead

Chris Cooper [:coop] (he/him)

Comment 2

6 years ago

There are a couple of issues here:

- For hung tasks, extending the timeout at all just wastes more resources. Time == money for CI resources.
- If we're going to consider extending runtime based on logged output, why not go in both directions and kill tasks early if they don't log anything for 5 minutes?
- If we automatically bump max runtimes under certain circumstances, does that lead to a corresponding perf regression for the runtime of that task? E.g., if we give a task 10 extra minutes to complete and it does eventually finish, does that appear as a 10-minute regression in perfherder?
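The "both directions" idea can be sketched as an idle watchdog combined with a hard cap. The Python below is a hypothetical illustration, not worker code: `kill_time` and its parameters are made-up names, and it only computes when a task that never finishes would be killed, given the times at which it wrote log lines.

```python
def kill_time(log_times, idle_timeout=300.0, hard_cap=3600.0):
    """Seconds since start at which a never-finishing task would be killed.

    log_times: ascending timestamps (seconds since task start) of log lines.
    idle_timeout and hard_cap are illustrative values, not real defaults.
    """
    last = 0.0  # treat task start as the first sign of life
    for t in log_times:
        if t >= hard_cap:
            break  # the hard cap fires before this line is written
        if t - last > idle_timeout:
            break  # the idle watchdog fired during the gap before this line
        last = t
    # Killed by whichever limit is reached first after the last accepted line.
    return min(last + idle_timeout, hard_cap)
```

For example, a task that logs at 60 and 120 seconds and then hangs is killed at 420 seconds instead of 3600, which is where the resource saving for hung tasks would come from. A task that keeps logging every minute still runs into the hard cap.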

Comment hidden (Intermittent Failures Robot)

Geoff Brown [:gbrown]

Reporter

Updated

6 years ago

Comment 24

5 years ago

Geoff, the failures here are build failures and some xpcshell failures. For example:

build: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286275573&repo=autoland&lineNumber=568

Xpcshell: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286248830&repo=autoland&lineNumber=2663

Should there be separate bugs for these?

Flags: needinfo?(gbrown)

Keywords: intermittent-failure

Geoff Brown [:gbrown]

Reporter

Comment 25

5 years ago

This bug was filed to investigate a strategy for dealing with intermittent-failure bugs like 1411358: Failures should be starred against bugs like 1411358 and 1589796 -- NOT this one. I do not think there is a need to open new bugs: 1411358 and 1589796 should cover almost all possibilities.

Flags: needinfo?(gbrown)

Keywords: intermittent-failure

Andreea Pavel [:apavel]

Comment 26

5 years ago

Thank you, we'll classify the failures correctly.

Dustin J. Mitchell [:dustin] (he/him)

Updated

4 years ago

Component: Workers → Task Configuration

Product: Taskcluster → Firefox Build System

Natalia Csoregi [:nataliaCs]

Updated

4 years ago

Summary: Hard max-run-time invites intermittent failures → Hard max-run-time invites intermittent failures [DO NOT USE FOR CLASSIFICATION]

BMO Automation

Updated

3 years ago

Severity: normal → S3