Closed Bug 1168798 Opened 10 years ago Closed 6 years ago

Managed/monitored rollout of Taskcluster components and worker types

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: pmoore, Assigned: pmoore)

Details

(Whiteboard: taskcluster-q32015-meeting)

This bug is entirely inspired by Morgan's brilliant and visionary blog post: http://linuxpoetry.com/blog/section/mozilla/19/

We currently have methods for rolling out Taskcluster changes (a new provisioner, new worker types with new AMIs, a new scheduler, new auth, ...), but I believe our deployment process does not stick around to monitor the impact a change has on key performance indicators. For example: after the change, are systems still performing as well? Have new jobs suddenly stopped being processed, or are task graph extensions now consistently failing? Should we keep the change, or roll it back?

I think we should employ systems that monitor key stats, such as the expected rate of job processing and the proportion of jobs we expect to fail, and check these as part of the deployment process for new changes. We could then automatically roll back in the case of carnage.

These ideas are all taken from the blog post, but they apply equally well to Taskcluster as they do to the RelEng systems Morgan talks about.
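A minimal sketch of the kind of post-deploy gate described above. The get_resolved_task_counts() helper is hypothetical (a real version would query the queue service or a metrics backend), and all thresholds are illustrative, not taken from any real Taskcluster configuration:

    import sys
    import time

    # Illustrative thresholds -- not from any real Taskcluster config.
    FAILURE_RATE_THRESHOLD = 0.15   # roll back if more than 15% of tasks fail
    MIN_TASKS_PER_CHECK = 10        # alert if throughput collapses entirely
    CHECK_INTERVAL_SECONDS = 300
    CHECKS_AFTER_DEPLOY = 6


    def get_resolved_task_counts(window_seconds):
        """Hypothetical helper: return (completed, failed) counts of tasks
        resolved in the last `window_seconds`. A real version would query
        the queue or a metrics backend; stubbed here for illustration."""
        raise NotImplementedError


    def deployment_is_healthy():
        completed, failed = get_resolved_task_counts(CHECK_INTERVAL_SECONDS)
        total = completed + failed
        if total < MIN_TASKS_PER_CHECK:
            return False  # jobs have suddenly stopped being processed
        return failed / total <= FAILURE_RATE_THRESHOLD


    def monitor_rollout(rollback):
        """Watch the KPIs for a while after a deploy; on carnage, call the
        supplied `rollback` callable and fail the deployment."""
        for _ in range(CHECKS_AFTER_DEPLOY):
            time.sleep(CHECK_INTERVAL_SECONDS)
            if not deployment_is_healthy():
                rollback()
                sys.exit("KPIs regressed after deploy; rolled back")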
Component: TaskCluster → General
Product: Testing → Taskcluster
Whiteboard: taskcluster-q32015-meeting
Component: General → Operations
Component: Operations → Redeployability
Depends on: 1427839
I think we can bake this sort of thing into the new cluster deployment stuff!
Assignee: nobody → pmoore
No longer depends on: 1427839
We talked about some related things this morning:

* Kubernetes uses health monitoring during service rollout, and won't finish a rollout until it's healthy (see the sketch after this comment).
* We are considering a cluster-wide integration-testing framework (bug 1492271).
* Overall task rate and other such metrics are something an operations team (like cloudops) would be concerned with, and adequate data about that could be extracted from Stackdriver.

So, given that this is all in progress and the original blog entry is now a 404, I'm going to close this bug.
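On the Kubernetes point: `kubectl rollout status` blocks until a rollout's new replicas pass their readiness probes (or a timeout expires), so the gate-then-undo pattern can be driven from a plain deploy script. A sketch; the deployment and image names are purely illustrative:

    import subprocess


    def deploy_and_gate(deployment, image, timeout="300s"):
        """Push a new image, wait for Kubernetes to declare the rollout
        healthy, and automatically undo it if it never becomes so."""
        # Point the deployment's container at the new image.
        subprocess.run(
            ["kubectl", "set", "image", f"deployment/{deployment}",
             f"{deployment}={image}"],
            check=True,
        )
        # Blocks until all new replicas are ready, or exits non-zero if
        # the rollout is still stuck when the timeout expires.
        status = subprocess.run(
            ["kubectl", "rollout", "status", f"deployment/{deployment}",
             f"--timeout={timeout}"],
        )
        if status.returncode != 0:
            subprocess.run(
                ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
                check=True,
            )
            raise RuntimeError(f"rollout of {deployment} failed and was undone")


    # Example (hypothetical names):
    # deploy_and_gate("taskcluster-queue", "example/queue:v2")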
(Apologies for the mid-air collision. Having read the blog post, I still think this is an operations concern that's out of our scope.)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Component: Redeployability → Services