Closed Bug 1168798 Opened 10 years ago Closed 6 years ago

Managed/monitored rollout of Taskcluster components and worker types

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: pmoore, Assigned: pmoore)

Details

(Whiteboard: taskcluster-q32015-meeting)

This bug is entirely inspired by Morgan's brilliant and visionary blog post: http://linuxpoetry.com/blog/section/mozilla/19/

We currently have methods for rolling out Taskcluster changes (a new provisioner, new worker types with new AMIs, a new scheduler, new auth, ...), but I believe our deployment process does not stick around to monitor the impact a change has on key performance indicators. For example: after the change, are systems still performing as well? Have new jobs suddenly stopped being processed, or are task graph extensions now consistently failing? Should we keep the change, or roll it back?

I think we should employ systems that monitor key stats, such as the expected rate of job processing and the proportion of jobs we expect to fail, and check these as part of the deployment process for new changes. We could then automatically roll back in the case of carnage.

These ideas are all taken from the blog post, but they apply equally well to Taskcluster as they do to the RelEng systems Morgan talks about.
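A minimal sketch of the kind of post-deploy gate described above. The get_resolved_task_counts() helper is hypothetical (a real version would query the queue service or a metrics backend), and all thresholds are illustrative, not taken from any real Taskcluster configuration:

    import sys
    import time

    # Illustrative thresholds -- not from any real Taskcluster config.
    FAILURE_RATE_THRESHOLD = 0.15   # roll back if more than 15% of tasks fail
    MIN_TASKS_PER_CHECK = 10        # alert if throughput collapses entirely
    CHECK_INTERVAL_SECONDS = 300
    CHECKS_AFTER_DEPLOY = 6


    def get_resolved_task_counts(window_seconds):
        """Hypothetical helper: return (completed, failed) counts of tasks
        resolved in the last `window_seconds`. A real version would query
        the queue or a metrics backend; stubbed here for illustration."""
        raise NotImplementedError


    def deployment_is_healthy():
        completed, failed = get_resolved_task_counts(CHECK_INTERVAL_SECONDS)
        total = completed + failed
        if total < MIN_TASKS_PER_CHECK:
            return False  # jobs have suddenly stopped being processed
        return failed / total <= FAILURE_RATE_THRESHOLD


    def monitor_rollout(rollback):
        """Watch the KPIs for a while after a deploy; on carnage, call the
        supplied `rollback` callable and fail the deployment."""
        for _ in range(CHECKS_AFTER_DEPLOY):
            time.sleep(CHECK_INTERVAL_SECONDS)
            if not deployment_is_healthy():
                rollback()
                sys.exit("KPIs regressed after deploy; rolled back")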
Component: TaskCluster → General
Product: Testing → Taskcluster
Whiteboard: taskcluster-q32015-meeting
Component: General → Operations
Component: Operations → Redeployability
Depends on: 1427839
I think we can bake this sort of thing into the new cluster deployment stuff!
Assignee: nobody → pmoore
No longer depends on: 1427839
We talked about some related things this morning:

* Kubernetes uses health monitoring during service rollout, and won't finish a rollout until it's healthy (see the sketch after this comment).
* We are considering a cluster-wide integration-testing framework (bug 1492271).
* Overall task rate and other such metrics are something an operations team (like cloudops) would be concerned with, and adequate data about that could be extracted from Stackdriver.

So, given that this is all in progress and the original blog entry is now a 404, I'm going to close this bug.
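On the Kubernetes point: `kubectl rollout status` blocks until a rollout's new replicas pass their readiness probes (or a timeout expires), so the gate-then-undo pattern can be driven from a plain deploy script. A sketch; the deployment and image names are purely illustrative:

    import subprocess


    def deploy_and_gate(deployment, image, timeout="300s"):
        """Push a new image, wait for Kubernetes to declare the rollout
        healthy, and automatically undo it if it never becomes so."""
        # Point the deployment's container at the new image.
        subprocess.run(
            ["kubectl", "set", "image", f"deployment/{deployment}",
             f"{deployment}={image}"],
            check=True,
        )
        # Blocks until all new replicas are ready, or exits non-zero if
        # the rollout is still stuck when the timeout expires.
        status = subprocess.run(
            ["kubectl", "rollout", "status", f"deployment/{deployment}",
             f"--timeout={timeout}"],
        )
        if status.returncode != 0:
            subprocess.run(
                ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
                check=True,
            )
            raise RuntimeError(f"rollout of {deployment} failed and was undone")


    # Example (hypothetical names):
    # deploy_and_gate("taskcluster-queue", "example/queue:v2")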
(Apologies for the mid-air collision. Having read the blog post, I still think this is an operations concern that's out of our scope.)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Component: Redeployability → Services