Bug 1342263 - generic-worker should schedule a self-reboot at least every 96 hours
Opened 8 years ago · Closed 8 years ago
Component: Taskcluster :: Workers (defect)
Status: RESOLVED FIXED
Tracking: Not tracked
Reporter: dustin · Assignee: Unassigned

Description
Puppet / OCC will run on reboot only, and we need hosts to keep their config up to date. Per bug 1336050, we'll also need hosts to renew their credentials.
So, generic-worker will need a feature to shut down before 96 hours have expired. Maybe this can be a "soft stop" when idle after, say, 86 hours, or a hard stop at 96 hours.
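Roughly, the requested behaviour would look something like the following sketch (hypothetical Go only, not generic-worker's actual code; the poll interval, function names, and shutdown command are illustrative): a soft deadline after which the worker stops claiming new tasks once it is idle, and a hard deadline at which it reboots regardless.

```go
// Hypothetical sketch only -- not generic-worker's real code or API.
// Stop claiming new tasks once the soft deadline has passed and the worker
// is idle; force a reboot once the hard deadline has passed regardless.
package main

import (
	"os/exec"
	"time"
)

const (
	softStop = 86 * time.Hour // when idle after this, stop and reboot
	hardStop = 96 * time.Hour // reboot unconditionally after this
)

func workerLoop(started time.Time, claimTask func() bool) {
	for {
		age := time.Since(started)
		if age >= hardStop {
			break // hard stop: give up even if tasks are still available
		}
		busy := claimTask() // illustrative: returns false when no task was claimed
		if !busy && age >= softStop {
			break // soft stop: idle and past the soft deadline
		}
		if !busy {
			time.Sleep(30 * time.Second) // poll interval while idle (illustrative)
		}
	}
	// Reboot so puppet/OCC refreshes config and credentials on the way back up.
	exec.Command("shutdown", "-r", "now").Run()
}

func main() {
	// Stub claimTask that never finds work: the worker idles until the
	// soft deadline and then reboots.
	workerLoop(time.Now(), func() bool { return false })
}
```

The soft window avoids interrupting a running task; the hard stop is the safety net if the worker never goes idle.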
Comment 1 • 8 years ago
I think we should look at this in the bigger context of all the instance-update/instance-terminate use cases we'll have, since we may be able to craft a design that meets our needs for both provisioner-managed instances and puppet-managed hardware workers in one shot. Until now we've tended to attack each workflow requirement one at a time, but now might be a good time for us to stop and rethink the model more generally.
In particular, to meet the requirements of:
1) how do we keep config/binaries/worker state up-to-date
2) how do we handle immediate-rollout requirements (security patches/fixes etc)
3) how can we monitor rollout / visibility of versions in our worker pools
4) how should we version rollouts
5) what safety fallbacks do we want (e.g. AWS provisioner has 96 hour hard kill on instances in case they go AWOL)
6) can we harmonize procedures across workers and infrastructure
7) what should go inside the worker, and what should be managed by bootstrapping on worker instances
It might be that we do indeed implement this 86-hour soft kill / 96-hour hard kill, but I think it would be worth having the more general discussion / design proposal first.
If you like I can schedule a meeting for this, to get the ball rolling.
Note, until now a lot of the AWS-specific logic has been baked into the workers themselves; it might be that we can pull a lot of this out and have some exterior component manage some of these concerns.
Comment 2 (Reporter) • 8 years ago
Yes, please schedule a meeting. This is migration-critical, so we'll need to consider how we can get it accomplished quickly (1-2 weeks).
Comment 3 • 8 years ago
I think this isn't a blocker, since generic-worker can already be set up to run a single task and reboot (as we do for win7 AWS instances at the moment). If puppet runs on reboot, we'll get updated config after every task. We can also configure it to exit after a configurable period of inactivity, so we can guarantee that every hour or two the worker either reboots after completing a task or exits because it has been idle.
It would certainly be feasible to operate in this mode initially.
See https://github.com/taskcluster/generic-worker/blob/v8.0.1/main.go#L134 for full details.
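For concreteness, the two knobs described above might look something like this in the worker's JSON config, sketched here as a Go struct. This is a hedged sketch: idleTimeoutSecs matches the property named later in this bug, while numberOfTasksToRun is used as an illustrative name for the tasks-per-boot setting and should be checked against the linked main.go.

```go
// Hedged sketch: the relevant subset of the worker's JSON config, expressed
// as a Go struct. idleTimeoutSecs matches the property named later in this
// bug; numberOfTasksToRun is an illustrative name for the tasks-per-boot
// knob and should be checked against the linked main.go.
package main

import (
	"encoding/json"
	"fmt"
)

type workerConfig struct {
	IdleTimeoutSecs    int `json:"idleTimeoutSecs"`    // exit after this many idle seconds
	NumberOfTasksToRun int `json:"numberOfTasksToRun"` // exit/reboot after this many tasks
}

func main() {
	cfg := workerConfig{
		IdleTimeoutSecs:    7200, // "every hour or two": exit after two idle hours
		NumberOfTasksToRun: 1,    // one task per boot, as on the win7 AWS instances
	}
	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
}
```

With puppet running on every boot, either setting on its own puts a bound on how stale a host's config can get.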
Comment 4 • 8 years ago
Note, you can also currently configure it to run an arbitrary number of tasks before rebooting, so we could set that to, say, 10 or 12.
Comment 5 (Reporter) • 8 years ago
Sorry, I meant it's critical for taskcluster-worker. I think any design discussions should be focused on tc-worker or cover both.
Updated • 8 years ago
Component: Worker → Generic-Worker
Comment 6 • 8 years ago
This was implemented on Mac workers without adding any new features, simply by:
1) setting the generic-worker configuration property idleTimeoutSecs to 345600 (== 96 hours)[1]
2) rebooting between tasks
So the longest possible time between reboots should be (96 hours + longest task time). Typically tasks take no more than an hour or two, so this should be ok.
--
[1] https://hg.mozilla.org/build/puppet/raw-file/default/modules/generic_worker/templates/generic-worker.config.erb
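A quick back-of-envelope check of those numbers (plain Go arithmetic, not worker code):

```go
// Back-of-envelope check of the numbers above (plain Go, not worker code).
package main

import (
	"fmt"
	"time"
)

func main() {
	idleTimeout := 345600 * time.Second // the configured idleTimeoutSecs
	fmt.Println(idleTimeout.Hours())    // 96 hours

	longestTask := 2 * time.Hour // "no more than an hour or two"
	fmt.Println((idleTimeout + longestTask).Hours()) // 98: worst-case gap between reboots
}
```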
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: Generic-Worker → Workers