queue: docker-worker: generic-worker: reporting exceptions for intermittent errors

Status: RESOLVED FIXED
Product: Taskcluster
Component: Queue
Opened: 3 years ago
Closed: a year ago
Reporter: pmoore
Assignee: Unassigned

(Reporter)

Description

3 years ago
Currently there are two allowed reasons for reporting an exception from a worker:

* malformed-payload
* worker-shutdown

In addition, these are handled differently: malformed-payload exceptions are not retried, while worker-shutdown exceptions are retried (by another worker).
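For reference, resolving a run with one of these reasons is a single call to the queue's reportException endpoint. Below is a minimal Go sketch of roughly what a worker does; the helper name is mine, Hawk authentication is omitted, and the production queue URL and JSON payload shape are assumed from the public docs:

package worker

import (
	"bytes"
	"fmt"
	"net/http"
)

// reportException resolves a run as an exception with the given reason
// ("malformed-payload" or "worker-shutdown"). NOTE: this sketch omits the
// Hawk authentication headers a real worker must send.
func reportException(taskID string, runID int, reason string) error {
	url := fmt.Sprintf(
		"https://queue.taskcluster.net/v1/task/%s/runs/%d/exception", taskID, runID)
	payload := bytes.NewBufferString(fmt.Sprintf(`{"reason": %q}`, reason))
	resp, err := http.Post(url, "application/json", payload)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reportException returned HTTP %d", resp.StatusCode)
	}
	return nil
}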

There are several classes of intermittent and permanent errors that can occur during a task run. The end user is probably not concerned with a lot of them, but we are able to report on them. For example:

intermittent errors:
worker-shutdown
fetch-definition-failure
reclaim-failure
generate-command-failure
execute-command-failure
failed-to-report-as-failure
upload-failure
log-concatenation-failure
failed-to-report-as-successful

permanent errors:
malformed-payload <-- (btw doesn't malformed normally mean that the payload is not valid json, shouldn't this be invalid-payload instead? I guess the scheduler checks that the task graph definition is well-formed)

probably permanent, but maybe intermittent:
max-runtime-exceeded
task-crash

At the moment I am not sure how to report these in the generic worker. For example, let's say I claim a task, but then fail to reclaim it. Should I report worker-shutdown or malformed-payload? I guess for now I should use worker-shutdown so that the task gets retried somewhere else, but it feels like an abuse of the worker-shutdown reason.

What are we currently doing in docker-worker for reporting exceptions due to intermittent errors?
(Reporter)

Updated

3 years ago
Flags: needinfo?(jopsen)
Flags: needinfo?(garndt)
(Reporter)

Comment 1

3 years ago
(NB: log-concatenation-failure is a generic-worker specific exception. generic-worker takes a list of commands, creates a separate log file for each, and then uploads the individual logs plus a concatenated log. The reason for this is visibility into which commands caused the failure; being able to display the logs separately might also be useful later if we make command output collapsible, like you have in Travis etc.)
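For illustration, the concatenation step is essentially the following (a rough Go sketch; the function and file names are mine, not the actual generic-worker code):

package worker

import (
	"io"
	"os"
)

// concatenateLogs appends each per-command log file, in order, to a single
// combined log file, so both the individual logs and the combined log can
// be uploaded as artifacts.
func concatenateLogs(commandLogs []string, combinedLog string) error {
	out, err := os.Create(combinedLog)
	if err != nil {
		return err
	}
	defer out.Close()
	for _, name := range commandLogs {
		in, err := os.Open(name)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(out, in)
		in.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}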

Comment 2

3 years ago
Right now not much is done for reporting intermittent errors, and most things within docker-worker default to reporting the task as failed.

I'm curious to see what opinion Jonas has on multiple exceptions, but as long as there is a finite list of exceptions and clear expectations for each (should it be retried or not), I'm not sure it's entirely horrible. I do know that the discussion of various exception types has happened in the past, and it really boiled down to not wanting to pollute the exception space; we should be handling these exceptional states in better ways.
Flags: needinfo?(garndt)
First, the permanent errors are two:
 1) exception, reason: malformed-payload
     - JSON schema mismatch
     - Anything from task and task.payload that means you decide not to start execution
     - Missing scopes perhaps
     - Things that can be fixed by modifying the task definition
     - Formally: report this if (and only if) the error is declarative.
 2) task-failed
     - if task-specific code crashed, or exited non-zero
     - expected artifacts are missing
     - the worker did its job right, but task-specific code ran wild.
     - Formally: report this if (and only if) the error is caused by task-specific Turing-complete logic.

My opinion so far regarding intermittent errors has been that:
  The worker must retry whatever operation: access S3, docker pull, create windows user, etc.
  until the operation succeeds or the worker determines that the operation will never succeed.

  When the worker determines that the operation will never succeed, it must be due to one of these two
  cases:
   A) An unhandled exception occurred and the worker doesn't know how to deal with it.
      This implies undefined behaviour; the worker node is corrupt and should shut down.
      (Example: worker fails to create a new windows user)
   B) The operation will never succeed because the declared input is wrong (exception: malformed-payload)
      (Example: docker registry says that the referenced image doesn't exist, or is private, etc.)

In the case where we repeatedly can't access S3 or do "docker pull", it could be the case that S3 or the
docker registry is down. But this should not be our assumption; rather, we should assume the worker is
compromised/corrupted. We should not build on services that are less reliable than our workers.
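A rough Go sketch of that policy as worker-side code (the sentinel errors and helper are hypothetical, just to make the terminal cases explicit):

package worker

import (
	"errors"
	"time"
)

// Sentinel errors for this sketch (not real worker code): a declarative
// error in the task definition, and a failure of task-specific code.
var (
	errMalformedPayload = errors.New("malformed payload")
	errTaskFailure      = errors.New("task-specific code failed")
)

// retryOperation retries op until it succeeds, or until the worker decides
// it can never succeed. Declarative errors resolve the run as
// exception/malformed-payload, task failures resolve it as failed, and
// anything still failing after maxAttempts means the worker node itself is
// suspect and should shut down.
func retryOperation(op func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if errors.Is(err, errMalformedPayload) || errors.Is(err, errTaskFailure) {
			return err // permanent: retrying cannot help
		}
		time.Sleep(time.Duration(attempt) * time.Second) // transient: back off and retry
	}
	return err // still failing: treat the worker node as corrupt
}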

----
In practice though, maybe we do have a use-case for exception: internal-worker-error.
Mainly because we might not want to shut down the worker node just because of some unhandled error.
And we may want to report that this was caused by an unhandled exception in the worker.
I think of this as a 500 error. Perhaps we should retry it, perhaps not; that is the only part
I'm unsure of. I guess retrying would be the simplest and safest, in that it immediately shows us
whether this was specific to a weird node, or whether the task just triggers a corner case in the worker.
Flags: needinfo?(jopsen)
Regarding the list of intermittent errors:
fetch-definition-failure         -> worker must be horribly broken if it can't talk to queue
reclaim-failure                  -> same again (if reclaim returns 409 it's because you don't have it)
generate-command-failure         -> If command is generated from task.payload this is a malformed-payload
execute-command-failure          -> If running task specific code fails, reportFailed.
failed-to-report-as-failure      -> So you expect to be able to report exception?
                                    (If you can't talk to the queue something is very bad)
upload-failure                   -> So S3 is down? or your worker network config is messed up?
                                    (if file is missing then task-failed)
log-concatenation-failure        -> If worker fails string concatenation, do you trust it?
failed-to-report-as-successful   -> Again, you expect to be able to report exception?
                                    (if that call is successful, you should report success)

Note: failed-to-report-as-successful, failed-to-report-as-failure and reclaim-failure can all fail with a
409 if you no longer have the claim on the task. And if you fail to contact the queue, i.e.
a connection error, you can't really report an exception either, so you have to drop it on the floor.
(The same argument applies to fetch-definition-failure, more or less.)
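As a sketch, the worker-side handling of those cases might look like this (a hypothetical helper, not actual docker-worker or generic-worker code):

package worker

import "net/http"

// shouldAbandonRun decides what to do after a reclaimTask, reportCompleted
// or reportFailed call. A connection error means the queue is unreachable,
// so an exception can't be reported either; a 409 means the claim has been
// lost and the run now belongs to someone else. In both cases the result is
// dropped on the floor.
func shouldAbandonRun(resp *http.Response, err error) bool {
	if err != nil {
		return true // can't reach the queue at all
	}
	if resp.StatusCode == http.StatusConflict { // 409: claim expired or taken over
		return true
	}
	return false
}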
Regarding:
> malformed-payload <-- (btw doesn't malformed normally mean that the payload is not valid json)

Yes, the queue ensures this is valid JSON. But I see "malformed-payload" as covering JSON schema errors,
or otherwise invalid contents. Yeah, invalid-payload might have been a better name :)
(Reporter)

Updated

3 years ago
Component: TaskCluster → Queue
Product: Testing → Taskcluster
We now have:
  resource-unavailable
  internal-error
Neither is retried automatically, see docs for reportException:
  https://docs.taskcluster.net/reference/platform/queue/api-docs#reportException
Summary: workers are responsible for retrying things they can retry.
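For illustration, a worker might map its internal error classes onto those reasons roughly like this (the classifications on the left are hypothetical):

package worker

// exceptionReason maps hypothetical worker-side error classes onto the
// reasons reportException now accepts. resource-unavailable and
// internal-error are not retried automatically by the queue, so the worker
// retries what it can before resolving the run.
var exceptionReason = map[string]string{
	"invalid-task-definition": "malformed-payload",
	"spot-termination-notice": "worker-shutdown",
	"dependency-unreachable":  "resource-unavailable",
	"unhandled-worker-error":  "internal-error",
}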
--
This bug is closed as part of the Great Bugzilla Cleaning Sprint of 2016 by bstack and jonasfj; please reopen if you disagree.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED