Closed Bug 821883 Opened 7 years ago Closed 5 years ago
[tracking] Implement a comprehensive slave health tool
Buildduty is a nightmare primarily because one needs to pull together so many disparate data sources to assess the current state of slaves.

Current data sources for slave information, some of which are redundant:
* nagios
* slavealloc
* buildbot_status db
* buildapi (e.g. recent builds, pulled from buildbot status)
* last_job_per_slave (combines slavealloc and buildbot status)
* briar-patch/kittenherder (reboot status)
* mobile_dashboard
* inventory
* devices.json (pulled from inventory)
* bugzilla

That list is ridiculous.

We have some small tools that assist with parts of the process, e.g. catlee's relengbot for filing slave bugs, and special scripts for AWS instances. Armen has mentioned creating a new script to deal specifically with pandas.

What we really need are some comprehensive tools.

kittenherder should be fixed or re-designed to properly reboot and recover all slave types. This includes AWS slaves and pandas. Rather than blindly rebooting and assuming success, it should also verify that a reboot was actually successful. The existing kittenherder scripts have patterns that can be lifted and reorganized to accomplish most of this.

We should have a slaveinfo cgi script that, given a hostname, will return a webpage that pulls in all of the various data sources. This page should include the ability to change slavealloc information, adjust bug status (including corresponding reboot bugs), and also a button to reboot the machine. Reboots that happen this way can be flagged differently than those performed by kittenherder, but should use the same libraries, and should appear in the reboot history for that slave. This is probably something that could be set up quickly to start with a few data sources, and we could add more as we go.

Finally, we should have a dashboard where we can roll up the status of our various pools of slaves. We should be able to drill down to the individual slaveinfo page mentioned above.
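
To make the "verify that a reboot was actually successful" point concrete, here is a minimal sketch of what such a check might look like. It is not kittenherder code. The function name `reboot_and_verify` and the two callables `reboot` and `get_boot_time` are hypothetical placeholders for whatever mechanism a given slave type needs (ssh, PDU, AWS API, mozpool and so on). The only idea illustrated is: record the boot time, reboot, then poll until the host reports a newer boot time or a timeout expires.

```python
import time
from typing import Callable, Optional


def reboot_and_verify(
    hostname: str,
    reboot: Callable[[str], None],
    get_boot_time: Callable[[str], Optional[float]],
    timeout: float = 600.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Reboot a host and confirm that it came back up afterwards.

    `reboot` and `get_boot_time` are hypothetical, caller-supplied hooks.
    `get_boot_time` should return the host's boot timestamp, or None if
    the host cannot be reached.
    """
    # Boot time before the reboot; None means the host was already down.
    before = get_boot_time(hostname)
    reboot(hostname)

    deadline = clock() + timeout
    while clock() < deadline:
        after = get_boot_time(hostname)
        # Success: the host is reachable and its boot time is newer than
        # the one recorded earlier (or it was unreachable before).
        if after is not None and (before is None or after > before):
            return True
        sleep(poll_interval)

    # The host never reported a newer boot time within the timeout.
    return False
```

A caller would pass in slave-type-specific hooks, for example one pair for AWS instances and another for pandas, and record the boolean result in that slave's reboot history. Comparing boot times rather than just checking reachability is a design choice here: it avoids counting a host that never actually went down as a successful reboot.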
I would like to retire *all* of the other tools (data sources will remain, of course) once we have this new tool in place.
I had started this: https://etherpad.mozilla.org/releng-manage-slaves
Removing from the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]
Armen, John: can we get some smaller, dependent bugs filed to cover the individual work you guys are both doing on the reboot tool, please?
Mozpool, or an adaptation of its model, might be a good choice here.
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine
Product: mozilla.org → Release Engineering
Whiteboard: [slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][slaveapi]
By the end of September (Q3), we plan to have slaveapi deployed officially as its own service (bug 913602), i.e. not running off of cruncher. Once it's deployed, we also plan to replace kittenherder with slaveapi (bug 914764).
Depends on: 913602
I'd love to help out with the implementation of this project. Please let me know how I can be useful. I see it as a sort of "mozpool for slaves", and I think some of the lessons from mozpool vsn apply here.
Whiteboard: [slaveduty][dashboard][slaveapi] → [slavehealth][dashboard][slaveapi]
Whiteboard: [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi]
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED