Closed Bug 1243020 Opened 8 years ago Closed 8 years ago

Infrastructure failed tasks should still upload logs of some sort

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: armenzg, Unassigned)

References

Details

Similar to bug 1227883

A bunch of tasks had infra issues [1]; however, there are no logs or artifacts.
If I look for the raw log from TH, I get an "artifact not found" message [2].
The live log is also not available [3].

Being able to know what happened on a retried job is very useful for starring jobs and keeping track of issues which increase in frequency.

This is important to fix in order to bring TC to par with Buildbot.

[1]
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f09cbcee02c7&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-searchStr=tc
[2] https://queue.taskcluster.net/v1/task/IYqffD7QSCCOLtzfdfEx6w/runs/0/artifacts/public/logs/live_backing.log
[3] https://g232jraaaaave7tvm6nan7yoiy4ktcu3fpjkpih2dgb6a3as.taskcluster-worker.net:32774/log/J3kVvqDYRQ-pSxHmuMBMvw
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #0)
> Similar to bug 1227883
> 
> A bunch of tasks had infra issues [1]; however, there are no logs or
> artifacts.
> If I look for the raw log from TH, I get an "artifact not found" message [2].
> The live log is also not available [3].
> 
> Being able to know what happened on a retried job is very useful for
> starring jobs and keeping track of issues which increase in frequency.
> 
> This is important to fix in order to bring TC to par with Buildbot.

I'm going to open up a discussion on this point: I'm not positive whether we want to emulate the existing behavior or do something different from what we do today.

The point of TaskCluster was to decouple (as much as is possible) environment configuration and infra issues from task configuration. There will never be a *perfect* decoupling.  

It's not clear to me, though, whether we should be generically uploading logs for infra issues, or instead solving the underlying problems that cause infra issues to be visible to anyone interested in task execution.
How can we know what the infra issues are? (For the record, this happens even when we hit maxRunTime, which makes the job run useless.)
Or how should we star the jobs if there are no logs?
Could we get some resources for this?
It makes issues like bug 1246176 impossible for developers to fix.

<jgraham> armenzg_dnd: Are there logs for the case where that wpt chunk is restarting the job?
<armenzg_dnd> jgraham, no, that's a bug of TC
<jgraham> armenzg_dnd: …
<jgraham> armenzg_dnd: So we have no way to tell what's going on at the moment?
<armenzg_dnd> jgraham, exactly
<jgraham> :'(
Flags: needinfo?(sdeckelmann)
Raising importance.
Severity: normal → major
So tasks are allowed to upload artifacts (and logs) after they have been resolved as "exception". Tasks have about 20 minutes to do so after being resolved as exception.

I suspect docker-worker isn't actually doing this. Obviously, doing so is always best-effort, as infra issues are a sign that the infrastructure itself is having problems.

Example: when a spot node disappears, we try to upload, but can't guarantee that it happens.

Note: tasks resolved as failed or completed should always have their logs uploaded prior to resolution, which ensures that they are present.
Slightly Unrelated: hitting maxRunTime should not cause an exception. It's a controlled failure.
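To make the post-exception upload window above concrete, here is a minimal sketch, assuming the taskcluster Python client and the S3 artifact flow that queue.taskcluster.net used at the time; the taskId, runId, credentials, and local log path are placeholders, and this is not docker-worker's actual code.

# Minimal sketch: upload a leftover log inside the grace window after a run
# was resolved as "exception". Assumes the taskcluster Python client and the
# S3 artifact flow of queue.taskcluster.net; names and paths are placeholders.
import datetime
import requests
import taskcluster

queue = taskcluster.Queue({
    # Placeholder credentials; newer clients also require a rootUrl option.
    'credentials': {'clientId': '...', 'accessToken': '...'},
})

task_id = 'IYqffD7QSCCOLtzfdfEx6w'  # example task from comment 0
run_id = 0
log_path = '/var/log/docker-worker/task.log'  # hypothetical local log file

expires = (datetime.datetime.utcnow() + datetime.timedelta(days=365)).isoformat() + 'Z'

# Ask the queue for a signed PUT URL for the artifact, then upload the file.
artifact = queue.createArtifact(task_id, run_id, 'public/logs/live_backing.log', {
    'storageType': 's3',
    'contentType': 'text/plain',
    'expires': expires,
})

# Best effort: if the node is already going away, this PUT may never finish,
# which is exactly the failure mode discussed in this bug.
with open(log_path, 'rb') as f:
    requests.put(artifact['putUrl'], data=f,
                 headers={'Content-Type': 'text/plain'}).raise_for_status()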
Component: General → Docker-Worker
Flags: needinfo?(sdeckelmann)
The tc-W(3) and tc-W-e10s(3) jobs are *not* hitting maxRunTime.
They're being retried and we have no way of knowing why.
It is likely not an infra issue unless the tests are causing the worker/docker to die and it is being retried as an infra issue.

In any case, the effect is that there are no logs to determine what is happening.
maxRunTime question moved to bug 1246197

The tests armen mentioned (which are covered in bug 1246176) are resolved as claim-expired, which basically means that docker-worker did not check back in with the queue in a timely fashion, suggesting that it, or the EC2 instance it was running on, crashed.  This is not a type of infra issue where we are hiding potentially useful logging from you -- rather, it's a case of things going badly wrong, and it is always going to require some investigation.

As such, I think this bug is ill-specified.  After diagnosing bug 1246176, we may find that there was some technical change we could make to give more feedback in this particular failure mode, and we'll spin off a new bug for that purpose.
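For context, a minimal sketch of the claim/reclaim cycle that "claim-expired" refers to, assuming the taskcluster Python client; this is not docker-worker's actual implementation, and the margin and task_is_running callback are placeholders. If the process running this loop dies (OOM, spot termination, crash), nothing reclaims the run and the queue eventually resolves it as claim-expired.

# Sketch only: a worker must call reclaimTask before the run's takenUntil
# deadline, otherwise the queue resolves the run as exception/claim-expired.
import datetime
import time
import dateutil.parser
import taskcluster

queue = taskcluster.Queue({'credentials': {'clientId': '...', 'accessToken': '...'}})

def keep_claim_alive(task_id, run_id, task_is_running, margin_seconds=120):
    claim = queue.reclaimTask(task_id, run_id)
    while task_is_running():
        taken_until = dateutil.parser.parse(claim['takenUntil'])
        now = datetime.datetime.now(datetime.timezone.utc)
        # Sleep until shortly before the takenUntil deadline, then reclaim.
        time.sleep(max((taken_until - now).total_seconds() - margin_seconds, 1))
        claim = queue.reclaimTask(task_id, run_id)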
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Note: spot nodes can die without warning.
So occasional claim-expired without any logs is to be expected.

Granted, we should look at them from time to time, just to ensure the node was actually spot-terminated. This is not a sheriff responsibility.
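A minimal sketch of confirming that a node really was spot-terminated, per the note above: EC2 publishes a termination notice (about two minutes in advance) at the instance-metadata path below. The polling interval and the flush_logs hook are placeholders, not part of docker-worker.

# Sketch: poll the EC2 spot termination notice and flush logs before shutdown.
import time
import requests

TERMINATION_URL = 'http://169.254.169.254/latest/meta-data/spot/termination-time'

def spot_termination_imminent():
    try:
        r = requests.get(TERMINATION_URL, timeout=2)
    except requests.RequestException:
        return False
    # 404 means no termination is scheduled; 200 returns the termination time.
    return r.status_code == 200

def watch_for_termination(flush_logs, poll_seconds=5):
    while not spot_termination_imminent():
        time.sleep(poll_seconds)
    flush_logs()  # best effort: push whatever log data exists off the node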
Can the logs be uploaded live to another service?
That way, if the spot instance is killed, the log would still be hosted somewhere else.

In general, I don't assume that the main issue will be the spot instance being killed.

In the case I reported in comment 3, the spot instance was not being killed, but docker/docker-worker might have been dying (I don't know if I'm right about the details).
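A minimal sketch of the idea in this comment: forward log output to a service that outlives the instance, so something is still readable if the node dies mid-task. The LOG_SINK URL is hypothetical; the existing livelog serves the log from the worker itself, which is why it vanishes with the instance.

# Sketch: forward a task log to a hypothetical external collector in batches.
import requests

LOG_SINK = 'https://example.org/logs'  # hypothetical external log collector

def forward_log(task_id, log_path, batch_lines=100):
    # A real implementation would follow the file as it grows; this simply
    # forwards the current contents in batches.
    with open(log_path, 'r') as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= batch_lines:
                requests.post('%s/%s' % (LOG_SINK, task_id), data=''.join(batch))
                batch = []
        if batch:
            requests.post('%s/%s' % (LOG_SINK, task_id), data=''.join(batch))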
In the case here, these instances had docker-worker crashing because the instance ran out of memory, which caused one of the modules it uses for checking disk space to fail.  The worker restarts, but loses all state of what was running, which is why nothing gets uploaded and the task eventually gets resolved as claim-expired.
In other words, in this case, additional task logging would not have been useful.
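Purely hypothetical sketch (not how docker-worker behaves): persist the claim for each running task to disk so that a worker which crashes and restarts, as in the OOM case described above, can at least resolve the orphaned run as an exception instead of leaving it to silently become claim-expired. The state directory and the chosen exception reason are placeholders.

# Hypothetical: remember claims on disk; report orphans on worker restart.
import json
import os
import taskcluster
from taskcluster.exceptions import TaskclusterRestFailure

STATE_DIR = '/var/lib/worker/claims'  # hypothetical location

def remember_claim(task_id, run_id):
    os.makedirs(STATE_DIR, exist_ok=True)
    with open(os.path.join(STATE_DIR, task_id), 'w') as f:
        json.dump({'taskId': task_id, 'runId': run_id}, f)

def resolve_orphans(queue):
    # Called on worker startup: report any claims left over from a crash.
    if not os.path.isdir(STATE_DIR):
        return
    for name in os.listdir(STATE_DIR):
        path = os.path.join(STATE_DIR, name)
        with open(path) as f:
            claim = json.load(f)
        try:
            # internal-error is one of the queue's accepted exception reasons.
            queue.reportException(claim['taskId'], claim['runId'],
                                  {'reason': 'internal-error'})
        except TaskclusterRestFailure:
            pass  # the run may already be resolved (e.g. claim-expired)
        os.remove(path)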
Component: Docker-Worker → Workers