Closed Bug 1147867 Opened 8 years ago Closed 7 years ago

Run failed with no usable log output to Treeherder


(Taskcluster :: General, defect, P1)



(Not tracked)



(Reporter: ryanvm, Assigned: garndt)



Looks like TC timed out?

[taskcluster] Task timeout after 3600 seconds. Force killing container.
[taskcluster] Unsuccessful task run with exit code: -1 completed in 3601.9

As an aside - copying/pasting output from "Inspect Task" is pretty poor UX.
Flags: needinfo?(jlal)
See Also: → 1147977
> Looks like TC timed out?
The task failed to complete before its "task.payload.maxRunTime" was exhausted.
This is configurable in-tree, but in this case I doubt the limit itself was the problem.
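For reference, the limit lives in the task payload. A minimal in-tree fragment might look like the following; the `maxRunTime` key is the one named above, while the surrounding structure is illustrative and the value is just an example:

```yaml
task:
  payload:
    # Seconds the worker allows before force-killing the container
    maxRunTime: 7200
```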

From the log:
curl -L -o /home/worker/.tc-vcs/clones/ \                                                                                                           
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current                                         
                                 Dload  Upload   Total   Spent    Left  Speed                                           
100    29  100    29    0     0     24      0  0:00:01  0:00:01 --:--:--    25                                          
 11 2021M   11  233M    0     0  68153      0  8:38:29  0:59:52  7:38:37     0

It timed out while downloading a 2G artifact from S3.
I've filed bug 1147977 in order to look into this issue; presumably either S3 or
TCP congestion control is playing tricks on us. I suspect restarting slow downloads
would do the trick. And doing downloads in parallel would be even more awesome :)
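The "restart slow downloads" idea above could be sketched roughly as follows. This is not the actual tc-vcs code, just an assumed illustration: on each retry, resume from the bytes already on disk via an HTTP Range request instead of starting over from byte zero.

```python
# Hypothetical sketch: resumable download with retries, so a stalled
# S3 transfer loses only the stall, not the bytes already fetched.
import os
import urllib.request

def range_header(existing_bytes):
    """Build a Range header to resume from the given byte offset."""
    return {"Range": "bytes=%d-" % existing_bytes} if existing_bytes else {}

def download_with_resume(url, dest, max_attempts=5):
    """Download url to dest, resuming from the partial file on retry."""
    for attempt in range(max_attempts):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url, headers=range_header(offset))
        try:
            with urllib.request.urlopen(req, timeout=60) as resp, \
                 open(dest, "ab") as out:
                while True:
                    chunk = resp.read(64 * 1024)
                    if not chunk:
                        return  # download complete
                    out.write(chunk)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
```

Parallel downloads would layer on top of this by fetching disjoint byte ranges concurrently and stitching them together.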

### On-Topic:
From an infrastructure perspective this is a poorly coded task: the error happens
inside the task, so it is reported as a task failure. The fix is also in-tree, since
it involves updating the tc-vcs that lives in the docker image referenced in-tree.

That said, it's clearly a bug in the Taskcluster-specific utilities used inside the task.
But I don't think we can draw a meaningful distinction between these kinds of failures
and actual test failures. We could possibly report it via an artifact, but that gets complicated.
All of these are S3 being slow for some weird reason.

Except 5, which is a 500 error while downloading docker images. That is clearly a TC problem.
I believe we're planning to fix it in Q2 using S3 to store docker images.
As an update, we landed an improvement to our timeout logic here ... It's hard to say whether this is fixed until we monitor it for a while.
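The bug doesn't say what the landed timeout improvement was, but one plausible shape for such logic (purely an assumed sketch, not the actual worker code) is an idle watchdog: rather than a fixed wall-clock limit, kill the task only when it makes no progress for some window, so slow-but-advancing downloads like the one above survive.

```python
# Hypothetical "idle timeout" watchdog: expires only when no progress
# has been reported for idle_limit seconds.
import time

class IdleWatchdog:
    def __init__(self, idle_limit):
        self.idle_limit = idle_limit
        self.last_progress = time.monotonic()

    def progress(self):
        """Call whenever the task emits output or transfers bytes."""
        self.last_progress = time.monotonic()

    def expired(self):
        """True once idle_limit seconds pass with no progress() calls."""
        return time.monotonic() - self.last_progress > self.idle_limit
```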
Flags: needinfo?(jlal)
James, this lack of usable failure lines makes sheriffing TC jobs a big pain and leads to dumping-ground bugs like these. What can we do to better ensure that we get usable logs in the situations raised here?
Flags: needinfo?(jlal)
Priority: -- → P1