Closed Bug 821883 Opened 12 years ago Closed 9 years ago

[tracking] Implement a comprehensive slave health tool

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: coop)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi])

Buildduty is a nightmare primarily because one needs to pull together so many disparate data sources to assess the current state of slaves.

Current data sources for slave information, some of which are redundant:
* nagios
* slavealloc
* buildbot_status db
* buildapi (e.g. recent builds, pulled from buildbot status)
* last_job_per_slave (combines slavealloc and buildbot status)
* briar-patch/kittenherder (reboot status)
* mobile_dashboard
* inventory
* devices.json (pulled from inventory)
* bugzilla

That list is ridiculous.

We have some small tools that assist with parts of the process, e.g. catlee's relengbot for filing slave bugs and special scripts for AWS instances. Armen has mentioned creating a new script to deal specifically with pandas. What we really need are some comprehensive tools.

kittenherder should be fixed or re-designed to properly reboot and recover all slave types. This includes AWS slaves and pandas. Rather than blindly rebooting and assuming success, it should also verify that a reboot was actually successful. The existing kittenherder scripts have patterns that can be lifted and reorganized to accomplish most of this.
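
The "verify rather than assume" behaviour could look roughly like the sketch below. It is not kittenherder's code: `reboot` and `is_alive` are hypothetical callables standing in for whatever mechanism each slave type needs (ssh, PDU, AWS API, and so on), and the timeout values are placeholders.

```python
import time
from typing import Callable


def reboot_and_verify(
    hostname: str,
    reboot: Callable[[str], None],    # hypothetical: triggers the reboot
    is_alive: Callable[[str], bool],  # hypothetical: e.g. ping or ssh probe
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Reboot a host and report success only if it went down and came back."""
    reboot(hostname)
    deadline = clock() + timeout_s
    seen_down = False
    while clock() < deadline:
        if not is_alive(hostname):
            # The host has actually dropped off, so the reboot took effect.
            seen_down = True
        elif seen_down:
            # Down earlier and reachable again: the reboot completed.
            return True
        sleep(poll_s)
    # Either the host never went down or it never came back in time.
    return False
```

One caveat with this approach: a host that reboots faster than the polling interval is never observed as down and would be reported as a failure, so a real implementation would probably also compare uptime or boot time before and after.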

We should have a slaveinfo cgi script that, given a hostname, will return a webpage that pulls in all of the various data sources. This page should include the ability to change slavealloc information, adjust bug status (including corresponding reboot bugs), and also a button to reboot the machine. Reboots that happen this way can be flagged differently than those performed by kittenherder, but should use the same libraries, and should appear in the reboot history for that slave. This is probably something that could be set up quickly to start with a few data sources, and we could add more as we go.
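
As a minimal sketch of the "start with a few data sources and add more as we go" idea, each source can be registered as a function keyed by name, so a broken source degrades the page rather than breaking it. The function name, the `sources` mapping, and the result shape are all assumptions for illustration, not an existing API.

```python
from typing import Any, Callable, Dict

# Hypothetical: each data source is a function taking a hostname.
Fetcher = Callable[[str], Any]


def collect_slave_info(
    hostname: str, sources: Dict[str, Fetcher]
) -> Dict[str, Dict[str, Any]]:
    """Query every registered source for one slave, tolerating failures."""
    report: Dict[str, Dict[str, Any]] = {}
    for name, fetch in sources.items():
        try:
            report[name] = {"ok": True, "data": fetch(hostname)}
        except Exception as exc:  # one dead source should not sink the page
            report[name] = {"ok": False, "error": str(exc)}
    return report
```

Adding a new source (nagios, slavealloc, bugzilla, ...) then means adding one entry to the mapping, and the page can render the failed sources as "unavailable" alongside the ones that answered.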

Finally, we should have a dashboard where we can roll up the status of our various pools of slaves. We should be able to drill down to the individual slaveinfo page mentioned above.
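
The roll-up itself is a simple grouping step. The sketch below assumes each slave record carries a `pool` and a `state` field; those field names and the example values are hypothetical, not taken from slavealloc or any other existing source.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def roll_up(slaves: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count slaves per state within each pool."""
    pools: Dict[str, Counter] = defaultdict(Counter)
    for slave in slaves:
        # "pool" and "state" are assumed field names for this sketch.
        pools[slave["pool"]][slave["state"]] += 1
    return {pool: dict(counts) for pool, counts in pools.items()}


# Hypothetical records, for illustration only.
example = [
    {"name": "slave-a", "pool": "build", "state": "working"},
    {"name": "slave-b", "pool": "build", "state": "broken"},
    {"name": "slave-c", "pool": "test", "state": "working"},
]
summary = roll_up(example)
```

Keeping the per-slave records around alongside the summary is what makes the drill-down cheap: the dashboard shows the counts and links each one to the matching hosts.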

I would like to retire *all* of the other tools (data sources will remain, of course) once we have this new tool in place.
Removing from the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]
Armen, John: can we get some smaller, dependent bugs filed to cover the individual work you guys are both doing on the reboot tool, please?
Depends on: 832424, 829211
Mozpool, or an adaptation of its model, might be a good choice here.
Depends on: 855419
Depends on: 859403
Depends on: 862009
Depends on: 862507
Depends on: 865727
Depends on: 868414
Depends on: 869132
Depends on: 870061
Depends on: 874957
Depends on: 874666
Depends on: 875941
Depends on: 875944
Depends on: 878086
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine
Product: mozilla.org → Release Engineering
Depends on: 914764
Whiteboard: [slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][slaveapi]
By the end of September (Q3), we plan to have slaveapi deployed officially as its own service (bug 913602), i.e. not running off of cruncher. Once it's deployed, we also plan to replace kittenherder with slaveapi (bug 914764).
Depends on: 913602
Depends on: 914805
I'd love to help out with the implementation of this project. Please let me know how I can be useful. I see it as a sort of "mozpool for slaves", and I think some of the lessons from mozpool vsn apply here.
Depends on: 921067
Assignee: nobody → coop
Whiteboard: [slaveduty][dashboard][slaveapi] → [slavehealth][dashboard][slaveapi]
Depends on: 911136
Depends on: 965855
Depends on: 965852
Depends on: 965859
Depends on: 965862
Whiteboard: [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi]
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Tools → General