1) If we can't publish an artifact, we should publish an error artifact.

2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception), we should publish an error artifact with a log extract, where each log line is prefixed with [taskcluster].

If we do this for worker-shutdown (not sure if we should), we should make sure it is the last thing we do, since publishing could take a significant amount of time, and a shutdown might be imminent.

Jonas, what are your views on this?
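The prefixing convention above can be sketched as follows. This is a minimal illustration, not actual generic-worker code; the helper name `taskcluster_log_extract` is hypothetical, and it assumes the log extract is already split into lines.

```python
def taskcluster_log_extract(lines):
    """Prefix each line of a log extract with [taskcluster], so that
    worker-generated messages are distinguishable from task output."""
    return ["[taskcluster] " + line for line in lines]


# Example: annotating json schema validation errors before upload.
extract = taskcluster_log_extract([
    "task payload is invalid",
    "task.payload.artifacts: expected an array",
])
print("\n".join(extract))
```

The fixed prefix means consumers (and humans reading the log) can filter worker commentary with a simple string match.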
A few facts:

A) After reportException you can still upload artifacts for up to 25 minutes or so.
B) Error artifacts should only be used to indicate missing artifacts.

> 1) If we can't publish an artifact, we should publish an error artifact.

Yes, if an artifact is missing, upload an error artifact instead. And reportFailed after artifact and log upload, etc.

> 2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception) we
> should publish an error artifact with a log extract, where each log line is prefixed with
> [taskcluster].

Not an error artifact. Just a normal text log artifact with "[taskcluster] <json schema errors>". Please do feel free to add a few lines explaining the error when doing this, so people understand it. Then you reportException.

> If we do this for worker-shutdown (not sure if we should) we should make sure it is the last
> thing we do, since the publish could take some not insignificant amount of time,
> when a shutdown might be imminent.

Yes, for worker-shutdown we reportException with worker-shutdown first, and then rely on fact (A) to allow us to upload logs afterwards.

Ideally, we always want logs and artifacts uploaded before we resolve a task. That way we can ensure that any resolved task has artifacts and logs. This is why you can upload artifacts after calling reportCompleted/reportFailed. But for reportException we've made an exception, because an exception implies something inherently wrong or unstable (potentially undefined behaviour), so it's reasonable to say that log/artifact upload is best effort.

Also, if we tried to upload the log before reportException with worker-shutdown, we risk failing to reportException, and the queue would then wait until takenUntil expires before retrying the task. It's not the end of the world, just annoying :) So we prefer to avoid that.

In case of spot node shutdown:

1) reportException immediately with worker-shutdown (so the task is retried before takenUntil).
2) Upload the log from the task and append the message: "\r\n[taskcluster] Spot node shutdown\r\n" (or something like that).

If you have a malformed payload:

1) Upload a log saying "[taskcluster] <explanation of error>", and then
2) reportException with malformed-payload.

If you have an internal worker error:

1) reportException with internal-worker-error (not supported yet, not sure if we should).
2) Upload the log and append an incidentId (we should assume internal error messages are sensitive).

If you have executed a task:

1) Upload the artifacts declared in task.payload.
2) If a file listed in task.payload is missing, upload an error artifact (perhaps even add "[taskcluster] Artifact <path> is missing" lines to the log).
3) Upload the log.
4) If the program exited non-zero, or we created an error artifact, or we failed to upload the log, reportFailed; otherwise reportCompleted.

Notice, the ordering here is rather important, as it guarantees properties for tasks that are resolved.
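The resolution rule in step 4 of the executed-task flow can be sketched as a small decision function. This is an illustrative sketch only, not generic-worker's implementation; the function name and parameters are hypothetical stand-ins for state the worker would track during the run.

```python
def resolve_task(exit_code, missing_artifacts, log_uploaded):
    """Decide how to resolve an executed task, per the ordering above:
    reportFailed if the program exited non-zero, if any declared artifact
    was missing (i.e. an error artifact was created), or if the log
    upload failed; otherwise reportCompleted.

    exit_code         -- the task process's exit status
    missing_artifacts -- list of declared artifact paths not found on disk
    log_uploaded      -- whether the task log upload succeeded
    """
    if exit_code != 0 or missing_artifacts or not log_uploaded:
        return "reportFailed"
    return "reportCompleted"


# A clean run resolves as completed; any defect resolves as failed.
print(resolve_task(0, [], True))                     # clean run
print(resolve_task(0, ["public/build.zip"], True))   # missing artifact
```

Because artifacts and the log are uploaded before this decision is made, a task resolved completed/failed is guaranteed to have its artifacts and logs available, which is exactly the property the ordering is meant to preserve.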
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Component: Generic-Worker → Worker
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED