Closed
Bug 929584
Opened 11 years ago
Closed 11 years ago
slaveapi should be able to file more types of IT bugs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
Attachments
(2 files, 2 obsolete files)
9.72 KB, patch | jhopkins: review+ | bhearsum: checked-in+
746 bytes, patch | bhearsum: review+ | bhearsum: checked-in+
Right now SlaveAPI only knows how to file bugs about reboots. Depending on the slave type and specific problem, something else may be needed. We talked about this at the buildduty meeting today and the idea of using a more generic "please poke this slave" bug was brought up. This got me thinking that maybe we should change our current reboot bugs into slave poking bugs - that way we reduce the bugspam on IT.
I'm going to chat with them before doing anything here.
Comment 1 • 11 years ago (Assignee)
One thing to take into consideration here is that rev3 machines (and maybe others) don't have any PDU or IPMI interfaces, and so their first steps should always be a reboot. We talked today about hardcoding that, but I realized just now that we can make this the case for any slaves without such interfaces. So the logic ends up being:
- Try SSH reboot
- If PDU/IPMI exists:
-- Try PDU/IPMI reboot
-- If it fails, file "poke this slave bug"
- Else:
-- File reboot bug
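The steps above can be sketched roughly as follows. This is only an illustration, not SlaveAPI code: the `escalate` function, the `has_pdu_or_ipmi` key, and the three callables passed in are all hypothetical names chosen for the example.

```python
def escalate(slave, ssh_reboot, hard_reboot, file_bug):
    """Walk the escalation steps and return the name of the step that ended them.

    All names here are hypothetical. `ssh_reboot` and `hard_reboot` are
    callables returning True on success; `file_bug(slave, kind)` files a bug.
    """
    # Always try an SSH reboot first.
    if ssh_reboot(slave):
        return "ssh"

    if slave["has_pdu_or_ipmi"]:
        # The slave has a management interface, so try a hard reboot.
        if hard_reboot(slave):
            return "pdu_ipmi"
        # The hard reboot failed too, so escalate with a "poke this slave" bug.
        file_bug(slave, "poke")
        return "poke_bug"

    # No PDU/IPMI interface (e.g. rev3 machines): file a reboot bug directly.
    file_bug(slave, "reboot")
    return "reboot_bug"
```

Passing the reboot and bug-filing steps in as callables keeps the branching testable without touching real hardware or Bugzilla.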
Comment 2 • 11 years ago
What is the workflow of "poke this slave"? I'm not sure what that means.
Comment 3 • 11 years ago (Assignee)
(In reply to John Hopkins (:jhopkins) from comment #2)
> What is the workflow of "poke this slave"? I'm not sure what that means.
TBD (that's why I used such a generic description). I'd like to talk to IT about what makes sense there.
Comment 4 • 11 years ago (Assignee)
Derek, since your group handles all the physical slave poking, I'd like to know the best thing to do for machines that we can't reboot through their IPMI/PDU interfaces: what type of bug should get filed? Currently, we sometimes block against the reboot bug, sometimes file a bug for diagnostics, and sometimes file a "slave XXX can't reboot" bug.
The background here is that we've got an automated system in place that reboots down slaves. If it can't, it needs to escalate through Bugzilla. We want to make sure that we're doing the right thing there and not making unnecessary work for anyone.
Flags: needinfo?(dmoore)
Comment 5 • 11 years ago (Assignee)
todo: re-enable notifications for production slaveapi when this is fixed: https://nagios.mozilla.org/releng-scl3/cgi-bin/extinfo.cgi?type=2&host=slaveapi1.srv.releng.scl3.mozilla.com&service=procs+-+python.slaveapi-server.py
Comment 6 • 11 years ago
you can file a break/fix ticket with a generic subject - "XXX host is unreachable". punt it over to the DCOPs queue and we'll make sure it's pingable and accessible via ssh before we resolve it.
let me know if I missed anything or if you'd like a different approach/process.
Flags: needinfo?(dmoore)
Comment 7 • 11 years ago (Assignee)
(In reply to Van Le [:van] from comment #6)
> you can file a break/fix ticket with a generic subject - "XXX host is
> unreachable". punt it over to the DCOPs queue and we'll make sure it's
> pingable and accessible via ssh before we resolve it.
>
> let me know if I missed anything or if you'd like a different
> approach/process.
Sounds great - thanks!
Comment 8 • 11 years ago (Assignee)
(In reply to Van Le [:van] from comment #6)
> you can file a break/fix ticket with a generic subject - "XXX host is
> unreachable". punt it over to the DCOPs queue and we'll make sure it's
> pingable and accessible via ssh before we resolve it.
>
> let me know if I missed anything or if you'd like a different
> approach/process.
Hmm, I just realized that this means we won't be using the shared reboot bugs for most things anymore - each slave will end up with its own bug. Is that what you want? I'm agreeable either way.
Flags: needinfo?(vle)
Comment 9 • 11 years ago
That's fine with me, I'm not a real big fan of those big reboot bugs.
Flags: needinfo?(vle)
Comment 10 • 11 years ago (Assignee)
This patch implements the new per-slave reboot bugs and also quiets down SlaveAPI to only make bug comments upon initial filing of the reboot bug.
I don't think that the code is terribly interesting beyond getting rid of the RebootBug class and replacing it with helper functions. Because we no longer have an alias to look for (like we had with "reboots-xxx" before), it doesn't make sense to subclass Bug.
Once this is deployed we should be able to start using the production SlaveAPI instance \o/.
Attachment #8363658 -
Flags: review?(jhopkins)
Comment 11 • 11 years ago (Assignee)
Here's an example of the new bug filing logic: https://bugzilla-dev.allizom.org/show_bug.cgi?id=751893
It's hard to demonstrate the other cases as they don't report anything outside of the log.
Comment 12 • 11 years ago
Comment on attachment 8363658 [details] [diff] [review]
improve the bugs
+def get_reboot_bug(slave):
+    qs = "?product=%s&component=%s" % (reboot_product, reboot_component)
+    qs += "&blocks=%s&resolution=---" % slave.bug.id_
+    for bug in bugzilla_client.request("GET", "bug" + qs)["bugs"]:
+        if "unreachable" in bug["summary"]:
+            return Bug(bug["id"])
Is "unreachable" some kind of magic string?
Flags: needinfo?(bhearsum)
Comment 13 • 11 years ago (Assignee)
(In reply to John Hopkins (:jhopkins) from comment #12)
> Comment on attachment 8363658 [details] [diff] [review]
> improve the bugs
>
> +def get_reboot_bug(slave):
> +    qs = "?product=%s&component=%s" % (reboot_product, reboot_component)
> +    qs += "&blocks=%s&resolution=---" % slave.bug.id_
> +    for bug in bugzilla_client.request("GET", "bug" + qs)["bugs"]:
> +        if "unreachable" in bug["summary"]:
> +            return Bug(bug["id"])
>
> Is "unreachable" some kind of magic string?
Yes - it's fairly unique text that's in the bug summary of the IT bug (see file_reboot_bug for the other use of it). I could probably factor that out by looking for the entire "XXXXX is unreachable" if you'd like.
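One way the factoring could look is sketched below. This is an assumption about the shape of the change, not the patch itself: `REBOOT_SUMMARY`, `reboot_summary`, and `find_reboot_bug` are hypothetical names, and the bugs are plain dicts rather than a real `bugzilla_client` response.

```python
# Hypothetical single source of truth for the IT bug summary, so the code
# that files the bug and the code that looks it up cannot drift apart.
REBOOT_SUMMARY = "%s is unreachable"


def reboot_summary(slave_name):
    """Build the full summary text for a slave's IT bug."""
    return REBOOT_SUMMARY % slave_name


def find_reboot_bug(slave_name, bugs):
    """Return the id of the first bug whose summary names this slave, else None.

    `bugs` is assumed to be a list of dicts with "id" and "summary" keys.
    """
    wanted = reboot_summary(slave_name)
    for bug in bugs:
        # Match on the full "<slave> is unreachable" text rather than the
        # bare word "unreachable".
        if wanted in bug["summary"]:
            return bug["id"]
    return None
```

Matching on the full text avoids picking up an unrelated blocking bug that merely mentions "unreachable".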
Flags: needinfo?(bhearsum)
Comment 14 • 11 years ago
Comment on attachment 8363658 [details] [diff] [review]
improve the bugs
I'd be happy with either the full "xxx is unreachable" or a code comment, whichever you prefer.
Attachment #8363658 -
Flags: review?(jhopkins) → review+
Comment 15 • 11 years ago (Assignee)
Attachment #8363658 -
Attachment is obsolete: true
Attachment #8363818 -
Flags: review?(jhopkins)
Comment 16 • 11 years ago
Comment on attachment 8363818 [details] [diff] [review]
factor out summary
Perfect - thanks.
Attachment #8363818 -
Flags: review?(jhopkins) → review+
Comment 17 • 11 years ago (Assignee)
Comment on attachment 8363818 [details] [diff] [review]
factor out summary
Landed, and I'm in the process of deploying.
Attachment #8363818 -
Flags: checked-in+
Comment 18 • 11 years ago (Assignee)
Just noticed that we don't re-open the slave bug when filing the IT one - whoops.
Attachment #8363875 -
Flags: review?(jhopkins)
Comment 19 • 11 years ago
Comment on attachment 8363875 [details] [diff] [review]
re-open slave bug when filing IT bug
Looks like an unmatched '}'
Attachment #8363875 -
Flags: review?(jhopkins) → review-
Comment 20 • 11 years ago (Assignee)
Fixed patch, as landed.
Attachment #8363875 -
Attachment is obsolete: true
Attachment #8363895 -
Flags: review+
Updated • 11 years ago (Assignee)
Attachment #8363895 -
Flags: checked-in+
Comment 21 • 11 years ago (Assignee)
Success! https://bugzilla.mozilla.org/show_bug.cgi?id=talos-r3-fed-068
Both dev & prod have been restarted for this now, and I'll be switching slaverebooter over in bug 962708.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 8 years ago
Component: Tools → General