Closed
Bug 1168798
Opened 10 years ago
Closed 6 years ago
Managed/monitored roll out of task cluster components and worker types
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: pmoore, Assigned: pmoore)
Details
(Whiteboard: taskcluster-q32015-meeting)
This bug is entirely inspired by Morgan's brilliant and visionary blog post: http://linuxpoetry.com/blog/section/mozilla/19/
We currently have methods for rolling out Taskcluster changes (new provisioner, new worker types with new AMIs, new scheduler, new auth, ...). But I believe our deployment process does not wait around to monitor the impact a change has on our KPIs, e.g. after the change, are systems still performing as well? Have all new jobs suddenly stopped being processed, or are task graph extensions now consistently failing? Should we keep the change, or roll back?
I think we should put in place some systems that track key stats, such as the expected rate of job processing and the proportion of jobs we expect to fail, and check these as part of the deployment process for new changes. We could then automatically roll back in the case of carnage. These ideas are all taken from the blog post, but apply equally well to Taskcluster as they do to the RelEng systems that Morgan talks about.
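To make the idea concrete, here is a minimal sketch of the kind of check a deployment step could run after a change, comparing stats from a window before the rollout with a window after it. Everything in it is an assumption for illustration: the `Kpis` shape, the `should_roll_back` name, and the threshold values are hypothetical and do not correspond to any existing Taskcluster API or agreed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpis:
    """Aggregate task stats for one observation window (hypothetical shape)."""
    tasks_succeeded: int
    tasks_failed: int
    window_minutes: float

    @property
    def processing_rate(self) -> float:
        # Tasks resolved (succeeded or failed) per minute over the window.
        return (self.tasks_succeeded + self.tasks_failed) / self.window_minutes

    @property
    def failure_ratio(self) -> float:
        # Fraction of resolved tasks that failed; 0.0 when nothing was resolved.
        total = self.tasks_succeeded + self.tasks_failed
        return self.tasks_failed / total if total else 0.0


def should_roll_back(
    baseline: Kpis,
    after: Kpis,
    min_rate_fraction: float = 0.5,      # assumed threshold, not a real policy
    max_failure_increase: float = 0.10,  # assumed threshold, not a real policy
) -> bool:
    """Return True if the post-deployment window looks bad enough to roll back."""
    # Processing has collapsed relative to the pre-deployment baseline.
    if after.processing_rate < baseline.processing_rate * min_rate_fraction:
        return True
    # The failure ratio has jumped by more than the allowed margin.
    if after.failure_ratio > baseline.failure_ratio + max_failure_increase:
        return True
    return False


baseline = Kpis(tasks_succeeded=950, tasks_failed=50, window_minutes=10.0)
healthy = Kpis(tasks_succeeded=900, tasks_failed=60, window_minutes=10.0)
stalled = Kpis(tasks_succeeded=0, tasks_failed=0, window_minutes=10.0)
failing = Kpis(tasks_succeeded=500, tasks_failed=500, window_minutes=10.0)
```

With these sample windows, `should_roll_back(baseline, healthy)` keeps the change, while the `stalled` and `failing` windows both trigger a rollback. A real check would presumably also need to account for normal variation in load (time of day, tree closures), which this sketch ignores.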
Updated•9 years ago
Component: TaskCluster → General
Product: Testing → Taskcluster
Updated•9 years ago
Whiteboard: taskcluster-q32015-meeting
Updated•7 years ago
Component: General → Operations
Comment 1•7 years ago
I think we can bake this sort of thing into the new cluster deployment stuff!
Updated•7 years ago
Assignee: nobody → pmoore
Comment 2•6 years ago
The brilliant and visionary blog post: https://web.archive.org/web/20150904004820/http://linuxpoetry.com/blog/section/mozilla/19/
Comment 3•6 years ago
We talked about some related things this morning:
* Kubernetes uses health monitoring during service rollout, and won't finish a rollout until the service is healthy
* We are considering a cluster-wide integration-testing framework (bug 1492271)
* Overall task rate and other such metrics are something an operations team (like CloudOps) would be concerned with, and adequate data about that could be extracted from Stackdriver.
So, given that this is all in progress and the original blog entry is now a 404, I'm going to close this bug.
Comment 4•6 years ago
(Apologies for the mid-air collision. Having read the blog post, I still think this is an operations thing that's out of our scope.)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Updated•6 years ago
Component: Redeployability → Services