Closed Bug 1243020 Opened 8 years ago Closed 8 years ago

Infrastructure failed tasks should still upload logs of some sort

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: armenzg, Unassigned)

References

Details

Similar to bug 1227883

A bunch of tasks had infra issues [1]; however, there are no logs or artifacts.
If I look for the raw log from TH, I get an "artifact not found" message [2].
The live log is also not available [3].

Being able to know what happened on a retried job is very useful for starring jobs and keeping track of issues which increase in frequency.

This is important to fix in order to bring TC to par with Buildbot.

[1]
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f09cbcee02c7&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-searchStr=tc
[2] https://queue.taskcluster.net/v1/task/IYqffD7QSCCOLtzfdfEx6w/runs/0/artifacts/public/logs/live_backing.log
[3] https://g232jraaaaave7tvm6nan7yoiy4ktcu3fpjkpih2dgb6a3as.taskcluster-worker.net:32774/log/J3kVvqDYRQ-pSxHmuMBMvw
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #0)
> Similar to bug 1227883
> 
> A bunch of tasks had infra issues [1]; however, there are no logs or
> artifacts.
> If I look for the raw log from TH, I get an "artifact not found" message [2].
> The live log is also not available [3].
> 
> Being able to know what happened on a retried job is very useful for
> starring jobs and keeping track of issues which increase in frequency.
> 
> This is important to fix in order to bring TC to par with Buildbot.

I'm going to open up a discussion on this point: I'm not positive whether we want to emulate the existing behavior or do something different from what we do today.

The point of TaskCluster was to decouple (as much as is possible) environment configuration and infra issues from task configuration. There will never be a *perfect* decoupling.  

It's not clear to me, though, whether we should be generically uploading logs for infra issues, or instead solving the underlying problems that cause infra issues to be visible to anyone interested in task execution.
How can we know what the infra issues are? (For the record, this happens even when we hit maxRunTime, which makes the job run useless.)
Or how should we star the jobs if there are no logs?
Could we get some resources for this?
It makes issues like bug 1246176 impossible for developers to fix.

<jgraham> armenzg_dnd: Are there logs for the case where that wpt chunk is restarting the job?
<armenzg_dnd> jgraham, no, that's a bug of TC
<jgraham> armenzg_dnd: …
<jgraham> armenzg_dnd: So we have no way to tell what's going on at the moment?
<armenzg_dnd> jgraham, exactly
<jgraham> :'(
Flags: needinfo?(sdeckelmann)
Raising importance.
Severity: normal → major
So tasks are allowed to upload artifacts (and logs) after they have been resolved as "exception". Tasks have about 20 minutes to do so after being resolved as exception.

I suspect docker-worker isn't actually doing this. Obviously, doing so is always best-effort, as infra issues are a sign that the infrastructure itself is having problems.

Example: when a spot node disappears, we try to upload, but can't guarantee that it happens.

Note: tasks resolved as failed or completed should always have their logs uploaded prior to resolution, which ensures that they are present.
Slightly Unrelated: hitting maxRunTime should not cause an exception. It's a controlled failure.
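To make the post-exception upload window above concrete, here is a minimal sketch, assuming the taskcluster Python client and the S3 artifact flow that queue.taskcluster.net used at the time; the taskId, runId, credentials, and local log path are placeholders, and this is not docker-worker's actual code.

# Minimal sketch: upload a leftover log inside the grace window after a run
# was resolved as "exception". Assumes the taskcluster Python client and the
# S3 artifact flow of queue.taskcluster.net; names and paths are placeholders.
import datetime
import requests
import taskcluster

queue = taskcluster.Queue({
    # Placeholder credentials; newer clients also require a rootUrl option.
    'credentials': {'clientId': '...', 'accessToken': '...'},
})

task_id = 'IYqffD7QSCCOLtzfdfEx6w'  # example task from comment 0
run_id = 0
log_path = '/var/log/docker-worker/task.log'  # hypothetical local log file

expires = (datetime.datetime.utcnow() + datetime.timedelta(days=365)).isoformat() + 'Z'

# Ask the queue for a signed PUT URL for the artifact, then upload the file.
artifact = queue.createArtifact(task_id, run_id, 'public/logs/live_backing.log', {
    'storageType': 's3',
    'contentType': 'text/plain',
    'expires': expires,
})

# Best effort: if the node is already going away, this PUT may never finish,
# which is exactly the failure mode discussed in this bug.
with open(log_path, 'rb') as f:
    requests.put(artifact['putUrl'], data=f,
                 headers={'Content-Type': 'text/plain'}).raise_for_status()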
Component: General → Docker-Worker
Flags: needinfo?(sdeckelmann)
The tc-W(3) and tc-W-e10s(3) jobs are *not* hitting maxRunTime.
They're being retried and we have no way of knowing why.
It is likely not an infra issue unless the tests are causing the worker/docker to die and it is being retried as an infra issue.

In any case, the effect is that there are no logs to determine what is happening.
maxRunTime question moved to bug 1246197

The tests armen mentioned (which are covered in bug 1246176) are resolved as claim-expired, which basically means that docker-worker did not check back in with the queue in a timely fashion, suggesting that it, or the EC2 instance it was running on, crashed.  This is not a type of infra issue where we are hiding potentially useful logging from you -- rather, it's a case of things going badly wrong, and it is always going to require some investigation.

As such, I think this bug is ill-specified.  After diagnosing bug 1246176, we may find that there was some technical change we could make to give more feedback in this particular failure mode, and we'll spin off a new bug for that purpose.
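For context, a minimal sketch of the claim/reclaim cycle that "claim-expired" refers to, assuming the taskcluster Python client; this is not docker-worker's actual implementation, and the margin and task_is_running callback are placeholders. If the process running this loop dies (OOM, spot termination, crash), nothing reclaims the run and the queue eventually resolves it as claim-expired.

# Sketch only: a worker must call reclaimTask before the run's takenUntil
# deadline, otherwise the queue resolves the run as exception/claim-expired.
import datetime
import time
import dateutil.parser
import taskcluster

queue = taskcluster.Queue({'credentials': {'clientId': '...', 'accessToken': '...'}})

def keep_claim_alive(task_id, run_id, task_is_running, margin_seconds=120):
    claim = queue.reclaimTask(task_id, run_id)
    while task_is_running():
        taken_until = dateutil.parser.parse(claim['takenUntil'])
        now = datetime.datetime.now(datetime.timezone.utc)
        # Sleep until shortly before the takenUntil deadline, then reclaim.
        time.sleep(max((taken_until - now).total_seconds() - margin_seconds, 1))
        claim = queue.reclaimTask(task_id, run_id)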
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Note: spot nodes can die without warning.
So occasional claim-expired without any logs is to be expected.

Granted, we should look at them from time to time, just to ensure the node was actually spot-terminated. This is not a sheriff responsibility.
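A minimal sketch of confirming that a node really was spot-terminated, per the note above: EC2 publishes a termination notice (about two minutes in advance) at the instance-metadata path below. The polling interval and the flush_logs hook are placeholders, not part of docker-worker.

# Sketch: poll the EC2 spot termination notice and flush logs before shutdown.
import time
import requests

TERMINATION_URL = 'http://169.254.169.254/latest/meta-data/spot/termination-time'

def spot_termination_imminent():
    try:
        r = requests.get(TERMINATION_URL, timeout=2)
    except requests.RequestException:
        return False
    # 404 means no termination is scheduled; 200 returns the termination time.
    return r.status_code == 200

def watch_for_termination(flush_logs, poll_seconds=5):
    while not spot_termination_imminent():
        time.sleep(poll_seconds)
    flush_logs()  # best effort: push whatever log data exists off the node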
Can the logs be uploaded live to another service?
That way, if the spot instance is killed, the log would still be hosted somewhere else.

In general, I don't assume that the main issue will be the spot instance being killed.

In the case I reported in comment 3, the spot instance was not being killed, but docker/docker-worker might have been dying (I don't know if I'm right about the details).
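A minimal sketch of the idea in this comment: forward log output to a service that outlives the instance, so something is still readable if the node dies mid-task. The LOG_SINK URL is hypothetical; the existing livelog serves the log from the worker itself, which is why it vanishes with the instance.

# Sketch: forward a task log to a hypothetical external collector in batches.
import requests

LOG_SINK = 'https://example.org/logs'  # hypothetical external log collector

def forward_log(task_id, log_path, batch_lines=100):
    # A real implementation would follow the file as it grows; this simply
    # forwards the current contents in batches.
    with open(log_path, 'r') as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= batch_lines:
                requests.post('%s/%s' % (LOG_SINK, task_id), data=''.join(batch))
                batch = []
        if batch:
            requests.post('%s/%s' % (LOG_SINK, task_id), data=''.join(batch))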
In the case here, these instances had docker-worker crashing because the instance ran out of memory, which caused one of the modules it uses for checking disk space to fail.  The worker restarts, but loses all state of what was running, which is why nothing gets uploaded and the task eventually gets resolved as claim-expired.
In other words, in this case, additional task logging would not have been useful.
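Purely hypothetical sketch (not how docker-worker behaves): persist the claim for each running task to disk so that a worker which crashes and restarts, as in the OOM case described above, can at least resolve the orphaned run as an exception instead of leaving it to silently become claim-expired. The state directory and the chosen exception reason are placeholders.

# Hypothetical: remember claims on disk; report orphans on worker restart.
import json
import os
import taskcluster
from taskcluster.exceptions import TaskclusterRestFailure

STATE_DIR = '/var/lib/worker/claims'  # hypothetical location

def remember_claim(task_id, run_id):
    os.makedirs(STATE_DIR, exist_ok=True)
    with open(os.path.join(STATE_DIR, task_id), 'w') as f:
        json.dump({'taskId': task_id, 'runId': run_id}, f)

def resolve_orphans(queue):
    # Called on worker startup: report any claims left over from a crash.
    if not os.path.isdir(STATE_DIR):
        return
    for name in os.listdir(STATE_DIR):
        path = os.path.join(STATE_DIR, name)
        with open(path) as f:
            claim = json.load(f)
        try:
            # internal-error is one of the queue's accepted exception reasons.
            queue.reportException(claim['taskId'], claim['runId'],
                                  {'reason': 'internal-error'})
        except TaskclusterRestFailure:
            pass  # the run may already be resolved (e.g. claim-expired)
        os.remove(path)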
Component: Docker-Worker → Workers