Closed Bug 1227883 Opened 7 years ago Closed 2 years ago

Cancelled tasks should upload logs/artifacts

Categories

(Taskcluster :: Workers, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: pmoore, Unassigned)

References

Details

(Whiteboard: [docker-worker])

For example:

https://tools.taskcluster.net/task-inspector/#U5pJgjUoTvmPDmyCmfsLHw/0

This task was claimed, and for two hours the claim was held (i.e. the reclaims succeeded) but for some reason there was no log attached.

There is a livelog artifact which is no longer valid (perhaps the worker has terminated), so no way to see what happened.

It might be preferable for cancelled tasks to still their logs and error/s3 artifacts uploaded. For example if a task should normally produce three file artifacts, one after the other, during task execution, but the task was cancelled during creation of second artifact, such that it was left in an incomplete state, then we'd upload:

  * an s3 artifact for the first completed file artifact
  * an s3 artifact for the incomplete second file artifact
  * an error artifact for the missing third file artifact
  * an s3 artifact for the (incomplete) log file

Obviously the docker worker cannot tell whether that the second artifact was incomplete or not, so should just upload what it finds as a regular s3 artifact.

The fact the task has an exception state is enough to know that the content of the artifacts cannot be trusted.

Publishing the artifacts despite the task cancellation allows problems to be inferred that occurred before the task was cancelled.
There was another bug for this that I just marked as a duplicate because this bug has a lot more useful information.  Thanks!

When cancellation was originally implemented it was not possible to upload an artifact past resolution (which was done by another component, not the worker).  However that has since changed to allow artifact creation 25 minutes after being resolved.
Duplicate of this bug: 1168809
I don't really care about this. But when someone calls cancelTask the implied santics is that the result of the task is irrelevant.

So why should we waste resources uploading logs and artifacts?
In fact cancellation used specifically to conserve resources.

Cancellation is a deliberate choice, not an exception that happens by accident.
s/santics/semantics/

Ie.clicking cancel task is the same as saying I'm nolonger interested in the result.

Using cancelTask to abort tasks that hang forever (beyond maxRunTime) is not intended. And we should fix the original bug that caused the hang instead.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #4)
> s/santics/semantics/
> 
> Ie.clicking cancel task is the same as saying I'm nolonger interested in the
> result.
> 
> Using cancelTask to abort tasks that hang forever (beyond maxRunTime) is not
> intended. And we should fix the original bug that caused the hang instead.

Absolutely agree with fixing the underlying bug.  In the cases I've seen so far, the issue is when docker is trying to start the live log container, which fails, and things get hung.  In the case of handling this error properly, the backing log would still be created that we can write to and when the task fails because the live log container didn't start, the bulking log would be uploaded.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #3)
> I don't really care about this. But when someone calls cancelTask the
> implied santics is that the result of the task is irrelevant.
> 
> So why should we waste resources uploading logs and artifacts?
> In fact cancellation used specifically to conserve resources.
> 
> Cancellation is a deliberate choice, not an exception that happens by
> accident.

For example, cancelling a task due to a security breach. In such a case you may be very much interested in logs.

Or a sheriff cancels your task in the night because it is either stuck, or hammering a service. You want to see the logs the next day when you wake up, to understand what led to this situation.

Often cancelling a task is about preserving resources - however, it can also be about shutting something down that is potentially causing damage or impacting other services - and in this case, having information from the run is key to resolving the problem.
This is important to bring parity with Buildbot.
Blocks: bb-to-tc
Whiteboard: [docker-worker]
Component: Docker-Worker → Worker
This is good for parity, but not required for the migration.
Blocks: 1080265
No longer blocks: bb-to-tc
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
What was the reason for closing?
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
At the bug-triage meeting multiple people said that we didn't actually want to do this afterall. I think wcosta and jonas had opinions on this.
Flags: needinfo?(wcosta)
Flags: needinfo?(jopsen)
(In reply to Brian Stack [:bstack] from comment #10)
> At the bug-triage meeting multiple people said that we didn't actually want
> to do this afterall. I think wcosta and jonas had opinions on this.

Thanks Brian.

My only reservation with closing this is stated in comment 6 - but perhaps this is an unlikely use case. The task definition would still be available, so we'd still be able to see what the task was meant to do, but if there was something compromised on the worker which would be detectable from e.g. logs, or other incriminating evidence in artifacts, we might potentially lose it. But we'd probably save a lot of overall resources by not uploading, so maybe it is right to take the hit, for the sake of saving resources.
The mindset of a canceled task is that nobody cares about it anymore. If the user changed his mind, he can always retrigger it.
Flags: needinfo?(wcosta)
What we could implement is a configuration in the task payload that says to upload the logs if the user cancels it.
(In reply to Wander Lairson Costa [:wcosta] from comment #12)
> The mindset of a canceled task is that nobody cares about it anymore. If the
> user changed his mind, he can always retrigger it.

Not always - see comment 6.
Comment 6 has valid arguments.

But if so, I think it's a very low priority.
I feel that if you cancel a task, I can't care that you are missing the log.

---
Either way, if we want it we could refile it as:
 A) queue: allow uploading artifacts after canceling of tasks
 B) tc-worker: upload log after canceling of task
 C) generic-worker: upload log after canceling of task

Note: I'm not sure abstractions in tc-worker will allow for implementation of (B).
I did abstractions to ensure that canceling aborts most connections :) Granted, I can't remember the details.
Flags: needinfo?(jopsen)
Priority: -- → P5
QA Contact: pmoore
Component: Worker → Workers
Blocks: 1585578

This is still something we could do, but not high-priority (already P5). As Jonas mentioned, it would require some changes from the queue to allow this.

Many of the comments against this are essentially: "if the user cancelled the task, nobody cares about it anymore". I think the only assumption that should be made when a task is cancelled is: "if the user cancelled the task, the user doesn't want it to run anymore".

In our case, it isn't true that cancelled task artifacts are irrelevant. Our tasks are not repeatable (randomized testing), and we usually only cancel tasks to free workers for a newer task (updating the docker image or target code). Wanting a code refresh doesn't mean I'm not interested in seeing the logs. Without artifacts, we must rely on external services for task logs.

Status: REOPENED → RESOLVED
Closed: 5 years ago2 years ago
Resolution: --- → MOVED
You need to log in before you can comment on or make changes to this bug.