[tracking] Implement a comprehensive slave health tool

RESOLVED FIXED

Status

Release Engineering
Tools
4 years ago
2 years ago

People

(Reporter: coop, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi])

Description

4 years ago
Buildduty is a nightmare primarily because one needs to pull together so many disparate data sources to assess the current state of slaves.

Current data sources for slave information, some of which are redundant:
* nagios
* slavealloc
* buildbot_status db
* buildapi (e.g. recent builds, pulled from buildbot status)
* last_job_per_slave (combines slavealloc and buildbot status)
* briar-patch/kittenherder (reboot status)
* mobile_dashboard
* inventory
* devices.json (pulled from inventory)
* bugzilla

That list is ridiculous.

We have some small tools that assist with parts of the process, e.g. catlee's relengbot for filing slave bugs and special scripts for AWS instances. Armen has mentioned creating a new script to deal specifically with pandas. What we really need are some comprehensive tools.

kittenherder should be fixed or re-designed to properly reboot and recover all slave types. This includes AWS slaves and pandas. Rather than blindly rebooting and assuming success, it should also verify that a reboot was actually successful. The existing kittenherder scripts have patterns that can be lifted and reorganized to accomplish most of this.
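
To illustrate the "verify instead of assuming success" idea, here is a minimal sketch. It is not kittenherder code: `reboot_and_verify`, the `reboot` and `is_up` callables, and the three result strings are all hypothetical names chosen for this example. The caller supplies whatever mechanism actually reboots and probes a given slave type.

```python
import time
from typing import Callable


def reboot_and_verify(
    host: str,
    reboot: Callable[[str], None],   # hypothetical: triggers the reboot
    is_up: Callable[[str], bool],    # hypothetical: True if host answers
    timeout: float = 600.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Reboot `host`, then confirm it went down and came back up.

    Returns one of "recovered", "never_went_down", "did_not_return".
    """
    reboot(host)
    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_up(host):
            # The host dropped off, so the reboot request took effect.
            seen_down = True
        elif seen_down:
            # Down earlier, up now: the reboot completed.
            return "recovered"
        sleep(poll_interval)
    return "did_not_return" if seen_down else "never_went_down"
```

Passing `sleep` and `clock` in as parameters is only there so the polling loop can be exercised without real waiting. A stricter check (for example comparing uptime before and after) would catch reboots that finish between two polls.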

We should have a slaveinfo CGI script that, given a hostname, returns a webpage that pulls in all of the various data sources. This page should include the ability to change slavealloc information, adjust bug status (including the corresponding reboot bugs), and also a button to reboot the machine. Reboots that happen this way can be flagged differently than those performed by kittenherder, but should use the same libraries, and should appear in the reboot history for that slave. This is probably something that could be set up quickly with a few data sources to start, and we could add more as we go.
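
A rough sketch of the "start with a few data sources, add more as we go" shape, under stated assumptions: `collect_slave_info` and the per-source fetcher callables are hypothetical, and no real nagios, slavealloc, or bugzilla API is being called here. Each source is just a function that takes a hostname.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: each data source is a callable taking a hostname.
Source = Callable[[str], Any]


def collect_slave_info(hostname: str, sources: Dict[str, Source]) -> Dict[str, Any]:
    """Query every registered source for `hostname`.

    A failing source is recorded under "errors" instead of aborting,
    so the page can still show whatever the other sources returned.
    """
    report: Dict[str, Any] = {"hostname": hostname, "sources": {}, "errors": {}}
    for name, fetch in sources.items():
        try:
            report["sources"][name] = fetch(hostname)
        except Exception as exc:  # keep going if one source is unavailable
            report["errors"][name] = str(exc)
    return report
```

Adding a new data source would then mean registering one more entry in the `sources` dict, and the page rendering could iterate over the result without knowing which sources exist.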

Finally, we should have a dashboard where we can roll up the status of our various pools of slaves. We should be able to drill down to the individual slaveinfo page mentioned above.
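
The roll-up itself could be as simple as counting slaves per pool and status. The sketch below assumes each slave record is a mapping with `pool` and `status` keys. Those field names, and the `roll_up` function, are made up for illustration and are not taken from any existing tool.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Iterable, Mapping


def roll_up(slaves: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count slaves by status within each pool."""
    summary: DefaultDict[str, Counter] = defaultdict(Counter)
    for slave in slaves:
        summary[slave["pool"]][slave["status"]] += 1
    # Convert to plain dicts so the result is easy to serialize.
    return {pool: dict(counts) for pool, counts in summary.items()}
```

Each pool/status cell in the resulting table could link to the matching list of hostnames, which gives the drill-down path to the per-slave page.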

I would like to retire *all* of the other tools (data sources will remain, of course) once we have this new tool in place.
I had started this:
https://etherpad.mozilla.org/releng-manage-slaves
Removing from the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]

Comment 3

4 years ago
Armen, John: can we get some smaller, dependent bugs filed to cover the individual work you guys are both doing on the reboot tool, please?
Depends on: 832424, 829211
Depends on: 844195
Mozpool, or an adaptation of its model, might be a good choice here.
Depends on: 855419

Updated

4 years ago
Depends on: 859403
Depends on: 862009
Depends on: 862507
Depends on: 865727
Depends on: 868414
Depends on: 869132
Depends on: 870061
Depends on: 874957
Depends on: 874666
Depends on: 875941
Depends on: 875944
Depends on: 878086

Updated

4 years ago
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine
Product: mozilla.org → Release Engineering
Depends on: 914764

Updated

4 years ago
Whiteboard: [slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][slaveapi]

Comment 5

4 years ago
By the end of September (Q3), we plan to have slaveapi deployed officially as its own service (bug 913602), i.e. not running off of cruncher. Once it's deployed, we also plan to replace kittenherder with slaveapi (bug 914764).
Depends on: 913602

Updated

4 years ago
Depends on: 914805
I'd love to help out with the implementation of this project.  Please let me know how I can be useful.  I see it as a sort of "mozpool for slaves", and I think some of the lessons from mozpool vsn apply here.

Updated

4 years ago
Depends on: 921067

Updated

4 years ago
Assignee: nobody → coop

Updated

3 years ago
Whiteboard: [slaveduty][dashboard][slaveapi] → [slavehealth][dashboard][slaveapi]

Updated

3 years ago
Depends on: 911136

Updated

3 years ago
Depends on: 965855

Updated

3 years ago
Depends on: 965852

Updated

3 years ago
Depends on: 965859

Updated

3 years ago
Depends on: 965862

Updated

3 years ago
Whiteboard: [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi]

Updated

3 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi]

Updated

3 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi]

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED