Closed Bug 1424383 Opened 6 years ago Closed 6 years ago

Workers should terminate sooner after worker definition change

Categories

(Taskcluster :: Workers, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

Currently, if you update a worker definition, old workers linger, and it can take hours, possibly days, for all workers in a pool to be replaced and pick up the new worker definition.

Last week, we updated AMIs for bug 1291940 and bug 1415725. The initial AMIs were buggy in multiple ways. However, it took ~24 hours for us to notice some of the failures because old workers were still running and the percentage of workers on the new AMIs was initially very small.

When we deploy something, it is better to have meaningful results on the success of that deployment sooner rather than later.

This bug is a request to have workers terminate after their worker definition changes: if a worker definition is modified, workers running the old definition should refuse to process any new tasks. This will ensure that any worker definition change results in (a) all new tasks running on the new worker configuration immediately and (b) the worker pool refresh taking no longer than the longest execution time of a running task.
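
To make the requested behavior concrete, here is a minimal sketch of the "refuse new tasks, then terminate" logic. It is an illustration only, not code from docker-worker or generic-worker; the `Worker` class, its method names, and the `ami-old`/`ami-new` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that drains once its definition changes."""

    definition_id: str  # definition the worker booted with (assumed identifier)
    running: set = field(default_factory=set)
    draining: bool = False

    def observe_definition(self, current_id: str) -> None:
        # Once a different definition is seen, stop accepting work for good.
        if current_id != self.definition_id:
            self.draining = True

    def claim(self, task_id: str) -> bool:
        # Refuse new tasks while draining; otherwise start running the task.
        if self.draining:
            return False
        self.running.add(task_id)
        return True

    def finish(self, task_id: str) -> None:
        self.running.discard(task_id)

    def should_terminate(self) -> bool:
        # Safe to shut down only when draining and nothing is still running.
        return self.draining and not self.running


worker = Worker(definition_id="ami-old")
worker.claim("task-1")
worker.observe_definition("ami-new")
```

In this sketch the worker stays alive only until `task-1` finishes, which is where the "no longer than the longest running task" bound comes from.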
I'm going to nominate this for the stability effort. Comment #0 should be self-explanatory as to why I think it is important for platform stability.
Blocks: tc-stability
Note that generic-worker workers check every 30 minutes whether there are new AMIs and self-terminate if there are.

This was implemented in bug 1298010 and rolled out in generic-worker 6.1.0 (see https://bugzilla.mozilla.org/show_bug.cgi?id=1298010#c16). The code changes are here: https://github.com/taskcluster/generic-worker/pull/27/files

We might want to use the same mechanism for docker-worker.
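
A rough sketch of what such a periodic check could look like is below. This is not the code from the pull request linked above; `DefinitionWatcher`, `fetch_current_ami`, and the AMI names are hypothetical, and only the 30-minute interval comes from this comment.

```python
CHECK_INTERVAL_SECONDS = 30 * 60  # the 30-minute interval mentioned above


class DefinitionWatcher:
    """Hypothetical helper: reports when the worker's AMI is out of date."""

    def __init__(self, fetch_current_ami, booted_ami, now):
        # fetch_current_ami is an assumed callable returning the AMI id
        # currently configured for this worker type.
        self.fetch_current_ami = fetch_current_ami
        self.booted_ami = booted_ami
        self.next_check = now + CHECK_INTERVAL_SECONDS

    def outdated(self, now):
        """Return True if a check was due and found a different AMI."""
        if now < self.next_check:
            return False  # not time to check yet; no lookup is made
        self.next_check = now + CHECK_INTERVAL_SECONDS
        return self.fetch_current_ami() != self.booted_ami


lookups = []


def fake_fetch():
    # Stand-in for a real lookup; records each call for illustration.
    lookups.append("called")
    return "ami-new"


watcher = DefinitionWatcher(fake_fetch, booted_ami="ami-old", now=0)
early = watcher.outdated(now=60)
due = watcher.outdated(now=CHECK_INTERVAL_SECONDS)
```

A worker loop would call `outdated()` between tasks and begin shutting down once it returns true.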
Workers now last 15 minutes without picking a job before shutdown.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(In reply to Wander Lairson Costa [:wcosta] from comment #3)
> Workers now last 15 minutes without picking a job before shutdown.

That is true and is a good start. However, if a worker is busy, it could accumulate tasks and stay alive for hours or days after a configuration change.

The original request/issue is still valid. I would prefer to see all workers behave like generic-worker and self-terminate after a worker configuration change, so there is an upper bound on the time between a configuration changing and tasks running on that configuration. So I encourage you to reopen this issue, or to resolve it as WONTFIX (since docker-worker's days are apparently numbered).
Component: Docker-Worker → Workers