Most (all?) tasks have a max-run-time parameter.
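For docker-worker and generic-worker tasks, this is the `maxRunTime` field in the task payload; an illustrative snippet (value is just an example):

```yaml
payload:
  maxRunTime: 3600   # seconds of wall time before the worker kills the task
```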
When a task runs longer than its max-run-time, it is forcibly terminated, generally producing an error report like "[taskcluster:error] Task timeout after 3600 seconds. Force killing container."
Task timeout is an important safeguard against hung tasks: if a task isn't making progress, we don't want to wait indefinitely for it.
However, the resulting experience in bug 1411358 (and others -- see the See Also bugs) is frustrating and wasteful: every week, 50 to 100 tasks "randomly" time out, despite ongoing efforts to adjust test chunks and max-run-time parameters to avoid common timeouts. In my experience, most of those tasks are still making progress and would succeed if given more time. At the same time, max-run-time is generally set much longer than a task's average run-time: a task might usually finish in 40 minutes but take 70+ minutes in 2% of runs. I think this is usually just variance in machine performance and/or variance in product (or even test) performance.
Can we be more forgiving about max-run-time? If a task exceeds max-run-time but has logged something in the last 5 minutes, could we extend max-run-time by 30 minutes (or 50%, or ...)? That strategy would be less efficient in some cases: if the task still times out after the extension, we've only wasted more time. But I think the majority of extended tasks would succeed, avoiding intermittent failures, unnecessary retries, etc.
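To make the proposal concrete, here is a minimal sketch of the decision the worker could make at its timeout check. Everything here is hypothetical (function name, the 5-minute activity window, the 30-minute one-shot extension); it is not existing worker code:

```python
def next_deadline(now, deadline, last_log_time,
                  activity_window=300, extension=1800):
    """Hypothetical timeout policy sketch.

    When the deadline passes but the task has written to its log within
    `activity_window` seconds, grant a single `extension` of the deadline
    instead of killing the task immediately.

    Returns the (possibly extended) deadline, or None to kill the task.
    All arguments are timestamps/durations in seconds.
    """
    if now < deadline:
        return deadline              # not timed out yet; keep current deadline
    if now - last_log_time <= activity_window:
        return now + extension       # recent output: assume progress, extend
    return None                      # silent past the window: force down
```

For example, a task hitting its 3600-second deadline that logged 60 seconds ago would get a new deadline of 5400, while one that has been silent for 10 minutes would still be killed. A real implementation would also cap the number of extensions so a live-but-stuck task can't extend forever.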