Closed
Bug 1179920
Opened 9 years ago
Closed 8 years ago
generic-worker: publish error artifacts when artifacts cannot be uploaded, and log the causes of an exception state
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: pmoore)
References
Details
(Whiteboard: [generic-worker])
1) If we can't publish an artifact, we should publish an error artifact.
2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception) we should publish an error artifact with a log extract, where each log line is prefixed with [taskcluster].
If we do this for worker-shutdown (not sure if we should) we should make sure it is the last thing we do, since the publish could take some not insignificant amount of time, when a shutdown might be imminent.
Jonas, what are your views on this?
Assignee
Updated•9 years ago
Flags: needinfo?(jopsen)
Assignee
Updated•9 years ago
Blocks: generic-worker-fixes
Comment 1•9 years ago
A few facts:
A) After reportException you can still upload artifacts for up to 25 minutes or so.
B) Error artifacts should only be used to indicate missing artifacts.

-------------------------------

> 1) If we can't publish an artifact, we should publish an error artifact.

Yes, if an artifact is missing, upload an error artifact instead. And reportFailed after artifact and log upload, etc.

> 2) When we raise a malformed-payload exception (and possibly also a worker-shutdown exception) we
> should publish an error artifact with a log extract, where each log line is prefixed with
> [taskcluster].

Not an error artifact. Just a normal text log artifact with "[taskcluster] <json schema errors>". Please do feel free to add a few lines explaining the error when doing this, so people understand it. Then you reportException.

> If we do this for worker-shutdown (not sure if we should) we should make sure it is the last
> thing we do, since the publish could take some not insignificant amount of time,
> when a shutdown might be imminent.

Yes, for worker-shutdown, we reportException worker-shutdown first, and then rely on Fact (A) to allow us to upload logs afterwards.

Ideally, we always want logs and artifacts uploaded before we resolve a task. That way we can ensure that any resolved task has artifacts and logs. This is why you cannot upload artifacts after calling reportCompleted/reportFailed. But for reportException we've made an exception: an exception implies something is inherently wrong or unstable (potentially undefined behaviour), so it's reasonable to say that log/artifact upload is best effort.

Also, if we tried to upload the log before reportException worker-shutdown, we would risk failing to reportException, and the queue would then wait until takenUntil expires before retrying the task. It's not the end of the world, just annoying :) So we prefer to avoid that.

-------------------------------

In case of spot node shutdown:
1) reportException immediately with worker-shutdown (so the task is retried before takenUntil)
2) Upload the log from the task and append the message: "\r\n[taskcluster] Spot node shutdown\r\n" (or something like that)

If you have malformed-payload, then please:
1) Upload a log saying "[taskcluster] <explanation of error>", and then
2) reportException with malformed-payload.

If you have an internal worker error, then:
1) reportException internal-worker-error (not supported yet, not sure if we should)
2) Upload the log and append an incidentId (we should assume internal error messages are sensitive)

If you have executed a task:
1) Upload the artifacts declared in task.payload
2) If a file listed in task.payload is missing, upload an error artifact (perhaps even add "[taskcluster] Artifact <path> is missing" lines to the log)
3) Upload the log
4) If the program exited non-zero, or we created an error artifact, or we failed to upload the log, reportFailed; otherwise reportCompleted.

Notice, the ordering here is rather important, as it guarantees properties for tasks that are resolved.
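The ordering described in this comment can be sketched as follows. This is only an illustration of the sequencing, not the generic-worker implementation: it is written in Python for brevity, the function names (worker_log_lines, plan_worker_shutdown, plan_malformed_payload, plan_executed_task) are hypothetical, and the returned strings are descriptive labels for each step, not real queue API calls.

```python
from typing import Iterable, List, Set

# Prefix the comment suggests for worker-generated log lines.
PREFIX = "[taskcluster] "


def worker_log_lines(message: str) -> List[str]:
    """Prefix every line of a worker-generated message (hypothetical helper)."""
    return [PREFIX + line for line in message.splitlines()]


def plan_worker_shutdown() -> List[str]:
    # Resolve first so the task can be retried before takenUntil expires;
    # the log upload afterwards is best effort (fact A in the comment).
    return ["reportException(worker-shutdown)", "upload log"]


def plan_malformed_payload() -> List[str]:
    # Explain the error in the log first, then resolve the task.
    return ["upload log", "reportException(malformed-payload)"]


def plan_executed_task(
    declared: Iterable[str],
    present: Set[str],
    exit_code: int,
    log_upload_ok: bool = True,
) -> List[str]:
    """Return the ordered step labels for a task that actually ran."""
    steps: List[str] = []
    error_artifact_created = False
    for path in declared:
        if path in present:
            steps.append(f"upload artifact {path}")
        else:
            # A missing file becomes an error artifact instead.
            steps.append(f"upload error artifact {path}")
            error_artifact_created = True
    steps.append("upload log")
    # The task is resolved last, once artifacts and log have been handled.
    failed = exit_code != 0 or error_artifact_created or not log_upload_ok
    steps.append("reportFailed" if failed else "reportCompleted")
    return steps
```

For example, plan_executed_task(["a.txt", "b.txt"], {"a.txt"}, 0) ends with "reportFailed", because b.txt is missing and an error artifact stands in for it, even though the program exited with zero.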
Flags: needinfo?(jopsen)
Assignee
Comment 2•9 years ago
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
Assignee
Updated•9 years ago
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Updated•8 years ago
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
Assignee
Comment 3•8 years ago
Done!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•5 years ago
Component: Worker → Workers