generic-worker: publish error artifacts when artifacts cannot be uploaded, and log the causes of an exception state

RESOLVED FIXED

Status

Taskcluster
Worker
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: pmoore, Assigned: pmoore)

Tracking

Details

(Whiteboard: [generic-worker])

(Assignee)

Description

3 years ago
1) If we can't publish an artifact, we should publish an error artifact.
2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception) we should publish an error artifact with a log extract, where each log line is prefixed with [taskcluster].

If we do this for worker-shutdown (not sure if we should) we should make sure it is the last thing we do, since the publish could take some not insignificant amount of time, when a shutdown might be imminent.

Jonas, what are your views on this?
(Assignee)

Updated

3 years ago
Flags: needinfo?(jopsen)
(Assignee)

Updated

3 years ago
Blocks: 1178657
A few facts:
 A) After reportException you can still upload artifacts for up to about 25 minutes.
 B) error artifacts should only be used to indicate missing artifacts

-------------------------------
> 1) If we can't publish an artifact, we should publish an error artifact.
Yes, if an artifact is missing, upload an error artifact instead. Then call reportFailed once the
artifact and log uploads are done.

> 2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception) we
> should publish an error artifact with a log extract, where each log line is prefixed with
> [taskcluster].
Not an error artifact. Just a normal text log artifact with "[taskcluster] <json schema errors>".
Please do feel free to add a few lines explaining the error when doing this, so people understand it.
Then you reportException.
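
A minimal sketch of how such a log could be assembled, assuming the worker already has the schema errors as a list of strings. The function names and the wording of the explanatory lines are hypothetical, not generic-worker's actual output; only the "[taskcluster] " prefix comes from this discussion.

```python
PREFIX = "[taskcluster] "


def prefix_lines(text):
    """Prefix every non-empty line of text with '[taskcluster] '."""
    return "".join(
        PREFIX + line + "\r\n" for line in text.splitlines() if line.strip()
    )


def malformed_payload_log(schema_errors):
    """Build the text log to upload before reportException(malformed-payload).

    The two introductory lines are illustrative wording only.
    """
    intro = (
        "Task payload did not validate against the worker's JSON schema.\n"
        "The task will be resolved as an exception (malformed-payload)."
    )
    return prefix_lines(intro) + prefix_lines("\n".join(schema_errors))
```

The resulting string would be uploaded as an ordinary text log artifact, after which the worker calls reportException.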

> If we do this for worker-shutdown (not sure if we should) we should make sure it is the last
> thing we do, since the publish could take some not insignificant amount of time,
> when a shutdown might be imminent.
Yes. For worker-shutdown, we reportException worker-shutdown first, and then rely on Fact (A) to
allow us to upload logs afterwards.

Ideally, we always want logs and artifacts uploaded before we resolve a task. That way we can ensure
that any resolved task has its artifacts and logs. This is why you can't upload artifacts after calling
reportCompleted/reportFailed.
But for reportException we've made an exception. An exception implies something is inherently
wrong or unstable (potentially undefined behaviour), so it's reasonable to say that log/artifact upload
is best effort.

Also, if we tried to upload the log before reportException worker-shutdown, we would risk failing to
reportException at all, and the queue would then wait until takenUntil expires before retrying the task.
It's not the end of the world, just annoying :) So we prefer to avoid that.

-------------------------------

In the case of a spot node shutdown:
 1) reportException immediately with worker-shutdown (so the task is retried before takenUntil)
 2) Upload log from task and append message:
  "\r\n[taskcluster] Spot node shutdown\r\n"
  (or something like that)

If you have malformed-payload, then please:
 1) upload log saying "[taskcluster] <explanation of error>", and then
 2) reportException with malformed-payload.

If you have internal worker error, then:
 1) reportException internal-worker-error (not supported yet, not sure if we should)
 2) upload log and append an incidentId (we should assume internal error messages are sensitive)
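
The three exception cases above differ only in whether the log goes up before or after reportException. A rough sketch of that ordering as a lookup table follows; the step names and the function are hypothetical labels for illustration, not real queue client calls.

```python
# Ordered steps per exception reason, as described in the lists above.
# "reportException" and "uploadLog" are labels, not real API calls.
EXCEPTION_STEPS = {
    # Resolve first so the task is retried before takenUntil expires;
    # the log upload afterwards is best effort.
    "worker-shutdown": ("reportException", "uploadLog"),
    # Explain the schema errors in the log, then resolve.
    "malformed-payload": ("uploadLog", "reportException"),
    # Resolve first; the log only carries an incidentId because
    # internal error messages are assumed to be sensitive.
    "internal-worker-error": ("reportException", "uploadLog"),
}


def exception_steps(reason):
    """Return the ordered steps for resolving a task with this reason."""
    try:
        return EXCEPTION_STEPS[reason]
    except KeyError:
        raise ValueError(f"unhandled exception reason: {reason}") from None
```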

If you have executed a task:
 1) Upload artifacts declared in task.payload
 2) If a file listed in task.payload is missing, upload an error artifact
    (perhaps even add "[taskcluster] Artifact <path> is missing" lines to the log)
 3) Upload log
 4) If the program exited non-zero, or we created an error artifact, or we failed to upload the log, reportFailed
    otherwise we reportCompleted.

Note that the ordering here is rather important: it guarantees that a task which has been resolved already has its artifacts and logs.
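
The executed-task flow can be sketched as a pure function that returns the ordered actions a worker would take. This is an illustration under assumptions: the action tuples, parameter names, and the log_upload_ok flag are hypothetical, and a real worker would perform the uploads rather than list them.

```python
def resolve_executed_task(exit_code, declared_artifacts, files_on_disk,
                          log_upload_ok=True):
    """Return the ordered actions for a task whose command has finished.

    declared_artifacts: paths listed in task.payload
    files_on_disk: set of paths that actually exist after the run
    log_upload_ok: whether the log upload succeeded (assumed known here)
    """
    actions = []
    log_lines = []
    error_artifact_created = False

    # Steps 1 and 2: upload each declared artifact, or an error artifact
    # in its place if the file is missing.
    for path in declared_artifacts:
        if path in files_on_disk:
            actions.append(("artifact", path))
        else:
            actions.append(("error-artifact", path))
            log_lines.append(f"[taskcluster] Artifact {path} is missing")
            error_artifact_created = True

    # Step 3: upload the log, including any missing-artifact lines.
    actions.append(("log", tuple(log_lines)))

    # Step 4: resolve the task only after all uploads were attempted.
    failed = exit_code != 0 or error_artifact_created or not log_upload_ok
    actions.append(("reportFailed" if failed else "reportCompleted", None))
    return actions
```

For example, a run that exits 0 but is missing one declared file ends in reportFailed, because an error artifact was created.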
Flags: needinfo?(jopsen)
(Assignee)

Comment 2

3 years ago
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
(Assignee)

Updated

3 years ago
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
(Assignee)

Comment 3

2 years ago
Done!
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED