Hard max-run-time invites intermittent failures [DO NOT USE FOR CLASSIFICATION]
Categories
(Firefox Build System :: Task Configuration, enhancement)
Tracking
(Not tracked)
People
(Reporter: gbrown, Unassigned)
Most (all?) tasks have a max-run-time parameter, eg:
When a task runs for longer than max-run-time, the task is forced down, generally resulting in an error report like "[taskcluster:error] Task timeout after 3600 seconds. Force killing container."
Task timeout is an important safeguard against hung tasks: If a task isn't proceeding, we don't want to wait indefinitely for it.
However, the resulting experience in bug 1411358 (and others -- see the See Also bugs) is frustrating and wasteful: Every week, 50 to 100 tasks "randomly" time out, despite ongoing efforts to update test chunks and max-run-time parameters to avoid common timeouts. In my experience, most of those tasks are making progress and would succeed if given more time. At the same time, max-run-time is generally much longer than the average run time of the task: A task might usually run in 40 minutes, but take 70+ minutes for 2% of runs. I think this is usually just a result of variance in machine performance and/or variance in product (or even test) performance.
Can we be more forgiving about max-run-time? If a task exceeds max-run-time but has logged something in the last 5 minutes, can we extend max-run-time by 30 minutes (or 50%, or ...)? That sort of strategy would be less efficient in some cases: If the task still times out after the extension, we've only wasted more time. But I think the majority of extension cases would succeed, avoiding intermittent failures, unnecessary retries, etc.
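To make the proposal concrete, here is a minimal sketch of the extension decision in Python. It is not how any Taskcluster worker behaves today; `TimeoutPolicy`, `next_deadline`, and all of their fields are hypothetical names, and the default values simply mirror the numbers mentioned above (3600 seconds, 5 minutes, 30 minutes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutPolicy:
    """Hypothetical policy object; none of these fields exist in Taskcluster."""
    max_run_time: float = 3600.0    # seconds, as in the error message quoted above
    activity_window: float = 300.0  # "logged something in the last 5 minutes"
    extension: float = 1800.0       # "extend max-run-time by 30 minutes"
    max_extensions: int = 1         # cap, so a chatty hung task cannot run forever


def next_deadline(policy: TimeoutPolicy, deadline: float, now: float,
                  last_log_time: float, extensions_used: int) -> Optional[float]:
    """Return the deadline to enforce next, or None if the task should be killed.

    All times are seconds since the task started.
    """
    if now < deadline:
        return deadline  # still within the allowed time; nothing to decide
    recently_active = (now - last_log_time) <= policy.activity_window
    if recently_active and extensions_used < policy.max_extensions:
        return deadline + policy.extension  # task looks alive: grant more time
    return None  # idle, or extensions exhausted: force the task down
```

With the defaults, a task that reaches 3600 seconds having logged 100 seconds earlier would get a new deadline of 5400 seconds, while one whose last log line is 10 minutes old would be killed at 3600 seconds as it is today. The `max_extensions` cap is an added assumption that bounds the extra time wasted when the task times out anyway.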
Comment 1•6 years ago
I believe one of the workers has an idle timeout, too. So perhaps this could be approximated by increasing max-run-time, while reducing the idle timeout?
Comment 2•5 years ago
There are a couple of issues here:
- for hung tasks, extending the timeout at all just wastes more resources. Time == money for CI resources.
- if we're going to consider extending runtime based on logged output, why not go in both directions and kill tasks early if they don't log anything for 5 minutes?
- if we automatically bump max runtimes under certain circumstances, does that lead to a corresponding perf regression for the runtime of that task? e.g., if we give a task 10 extra minutes to complete and it does eventually finish, does that appear as a 10-minute regression in perfherder?
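The "both directions" idea, which is also roughly what Comment 1 suggests (a more generous max-run-time combined with a shorter idle timeout), could be sketched as a single check. This is an illustration only: `kill_reason` and its parameters are hypothetical names, not an existing worker API, and the default values are assumptions picked to match the numbers in this bug.

```python
from typing import Optional


def kill_reason(now: float, started: float, last_log_time: float,
                idle_timeout: float = 300.0,
                hard_cap: float = 5400.0) -> Optional[str]:
    """Return why a task should be killed, or None to let it keep running.

    All times are in seconds on the same clock. idle_timeout kills quiet
    (probably hung) tasks early; hard_cap is a raised max-run-time that
    still bounds tasks which keep logging.
    """
    if now - last_log_time >= idle_timeout:
        return "idle"          # no output for too long: likely hung
    if now - started >= hard_cap:
        return "max-run-time"  # active but over the absolute limit
    return None
```

Under this sketch a hung task costs at most `idle_timeout` seconds after its last log line rather than the full max-run-time, which speaks to the resource concern in the first bullet. It does not address the perfherder question in the third bullet.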
Comment 24•5 years ago
Geoff, the failures here are build failures and some xpcshell failures. For example:
build: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286275573&repo=autoland&lineNumber=568
Xpcshell: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286248830&repo=autoland&lineNumber=2663
Should there be separate bugs for these?
Comment 25•5 years ago (Reporter)
This bug was filed to investigate a strategy for dealing with intermittent-failure bugs like 1411358: Failures should be starred against bugs like 1411358 and 1589796 -- NOT this one. I do not think there is a need to open new bugs: 1411358 and 1589796 should cover almost all possibilities.
Comment 26•5 years ago
Thank you, we'll classify the failures correctly.