Closed
Bug 821883
Opened 12 years ago
Closed 9 years ago
[tracking] Implement a comprehensive slave health tool
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: coop)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi])
Buildduty is a nightmare primarily because one needs to pull together so many disparate data sources to assess the current state of slaves.

Current data sources for slave information, some of which are redundant:
* nagios
* slavealloc
* buildbot_status db
* buildapi (e.g. recent builds, pulled from buildbot status)
* last_job_per_slave (combines slavealloc and buildbot status)
* briar-patch/kittenherder (reboot status)
* mobile_dashboard
* inventory
* devices.json (pulled from inventory)
* bugzilla

That list is ridiculous.

We have some small tools that assist with parts of the process, e.g. catlee's relengbot for filing slave bugs and special scripts for AWS instances. Armen has mentioned creating a new script to deal specifically with pandas.

What we really need are some comprehensive tools.

kittenherder should be fixed or re-designed to properly reboot and recover all slave types. This includes AWS slaves and pandas. Rather than blindly rebooting and assuming success, it should also verify that a reboot was actually successful. The existing kittenherder scripts have patterns that can be lifted and reorganized to accomplish most of this.

We should have a slaveinfo cgi script that, given a hostname, will return a webpage that pulls in all of the various data sources. This page should include the ability to change slavealloc information, adjust bug status (including corresponding reboots bugs), and also a button to reboot the machine. Reboots that happen this way can be flagged differently than those performed by kittenherder, but should use the same libraries, and should appear in the reboot history for that slave. This is probably something that could be set up quickly to start with a few data sources, and we could add more as we go.

Finally, we should have a dashboard where we can roll up the status of our various pools of slaves. We should be able to drill down to the individual slaveinfo page mentioned above.

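As an illustration of the "verify that a reboot was actually successful" requirement, here is a minimal Python sketch. It is not taken from kittenherder or slaveapi: the function name `reboot_and_verify` and the `reboot` and `is_up` callables are hypothetical placeholders for whatever mechanism (ssh, IPMI, a PDU, an AWS call) applies to a given slave type.

```python
import time
from typing import Callable


def reboot_and_verify(
    hostname: str,
    reboot: Callable[[str], None],      # hypothetical: triggers the reboot
    is_up: Callable[[str], bool],       # hypothetical: e.g. a ping or ssh probe
    timeout: float = 600.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Reboot a host and report success only if it went down and came back.

    Returns False if the host never went down, or never came back up,
    before the timeout expired.
    """
    reboot(hostname)
    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_up(hostname):
            # The host has actually dropped off, so the reboot took effect.
            seen_down = True
        elif seen_down:
            # Down earlier, up now: the reboot completed.
            return True
        sleep(poll_interval)
    return False
```

One limitation of this sketch: a host that reboots faster than `poll_interval` is never observed down and is reported as a failure. Comparing uptime or boot time before and after the reboot would be more robust, where the slave type exposes it.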
I would like to retire *all* of the other tools (data sources will remain, of course) once we have this new tool in place.
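The slaveinfo page and the pool dashboard described above both come down to merging per-host data from several sources. A rough Python sketch of that idea follows, under stated assumptions: `collect_slave_info` and `roll_up` are hypothetical names, each data source is modelled as a plain callable taking a hostname, and the `pool` and `status` keys are illustrative rather than the schema of slavealloc or any other existing tool.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, Mapping

# Hypothetical: a data source is any callable that takes a hostname.
Source = Callable[[str], Any]


def collect_slave_info(hostname: str, sources: Mapping[str, Source]) -> Dict[str, Any]:
    """Query every source for one host, recording failures instead of aborting."""
    info: Dict[str, Any] = {"hostname": hostname, "data": {}, "errors": {}}
    for name, fetch in sources.items():
        try:
            info["data"][name] = fetch(hostname)
        except Exception as exc:
            # One broken source should not blank out the whole page.
            info["errors"][name] = str(exc)
    return info


def roll_up(slaves: Iterable[Mapping[str, str]]) -> Dict[str, Counter]:
    """Count slaves per status within each pool, for a dashboard summary."""
    pools: Dict[str, Counter] = {}
    for slave in slaves:
        pools.setdefault(slave["pool"], Counter())[slave["status"]] += 1
    return pools
```

Recording per-source errors separately matches the suggestion to start with a few data sources and add more over time: a new or flaky source then degrades one section of the page rather than the whole view.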
Comment 1•12 years ago
I had started this: https://etherpad.mozilla.org/releng-manage-slaves
Comment 2•12 years ago
Removing from the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]
Comment 3•11 years ago
Armen, John: can we get some smaller, dependent bugs filed to cover the individual work you guys are both doing on the reboot tool, please?
Updated•11 years ago
Comment 4•11 years ago
Mozpool, or an adaptation of its model, might be a good choice here.
Updated•11 years ago
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•11 years ago
Whiteboard: [slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][slaveapi]
Comment 5•11 years ago
By the end of September (Q3), we plan to have slaveapi deployed officially as its own service (bug 913602), i.e. not running off of cruncher. Once it's deployed, we also plan to replace kittenherder with slaveapi (bug 914764).
Depends on: 913602
Comment 6•11 years ago
I'd love to help out with the implementation of this project. Please let me know how I can be useful. I see it as a sort of "mozpool for slaves", and I think some of the lessons from mozpool vsn apply here.
Updated•11 years ago
Assignee: nobody → coop
Updated•10 years ago
Whiteboard: [slaveduty][dashboard][slaveapi] → [slavehealth][dashboard][slaveapi]
Updated•10 years ago
Whiteboard: [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3291] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3296] [slavehealth][dashboard][slaveapi] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3300] [slavehealth][dashboard][slaveapi]
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
Component: Tools → General