Closed Bug 1562285 Opened 5 years ago Closed 4 years ago

Add a report-error capability to taskcluster-worker-runner and generic-worker

Categories

(Taskcluster :: Workers, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: dustin, Unassigned, Mentored)

References

Details

Bug 1558346 adds a workerManager.reportWorkerError API method. Let's set up tc-worker-runner to support calling that API method in response to protocol messages from the worker and in response to errors starting the worker.

Then modify generic-worker to support sending such messages when it encounters errors.

Since we decided that it should be possible for generic-worker to run independently of tc-worker-runner (i.e. tc-worker-runner is an optional deployment component), I wanted to check that you mean generic-worker should call workerManager.reportWorkerError directly rather than reporting failures to tc-worker-runner, and tc-worker-runner then forwarding the worker error on to workerManager.

If that's what you meant, then all is ok. :-)

Rationale: I see workerManager.reportWorkerError as a straight out replacement to the current sentry reporting in generic-worker, that avoids tying us to a particular crash reporting vendor (sentry), albeit with the acknowledged and accepted caveat that we will need to duplicate/maintain some features that existed already in sentry. tc-worker-runner should also report bootstrapping failures to workerManager, but environments without tc-worker-runner should still be able to report worker errors.

g-w can still run without tc-worker-runner, but by simply displaying the error to stderr in that case (or whatever it already does). It would use the presence of a protocol feature to determine when it can use the worker-runner reporting approach, and when there's no worker-runner then that feature is implicitly not available and it just reports as usual.

I don't think this is a sentry replacement -- as you mentioned, this is about bootstrapping failures that a worker-pool admin might want to know about (e.g., config parameter missing), rather than crashes.

Depends on: 1558525
Blocks: 1558525
No longer depends on: 1558525

This isn't blocking the tc-cloudops rollout, but is still a heckin' good thing to implement.

QA Whiteboard: [lang=go]
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED
You need to log in before you can comment on or make changes to this bug.