Closed Bug 929584 Opened 7 years ago Closed 7 years ago

slaveapi should be able to file more types of IT bugs

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(2 files, 2 obsolete files)

Right now SlaveAPI only knows how to file bugs about reboots. Depending on the slave type and specific problem, something else may be needed. We talked about this at the buildduty meeting today and the idea of using a more generic "please poke this slave" bug was brought up. This got me thinking that maybe we should change our current reboot bugs into slave poking bugs - that way we reduce the bugspam on IT.

I'm going to chat with them before doing anything here.
One thing to take into consideration here is that rev3 machines (and maybe others) don't have any PDU or IPMI interfaces, and so their first steps should always be a reboot. We talked today about hardcoding that, but I realized just now that we can make this the case for any slaves without such interfaces. So the logic ends up being:
- Try SSH reboot
- If PDU/IPMI exists:
-- Try PDU/IPMI reboot
-- If it fails, file a "poke this slave" bug
- Else:
-- File reboot bug
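
For illustration only, the escalation order above could be sketched as follows. This is not the SlaveAPI code: the function name, the callables and the returned labels are all hypothetical, and it assumes that a successful SSH reboot ends the escalation (the list does not say so explicitly).

```python
from typing import Callable, Optional


def escalate_reboot(
    ssh_reboot: Callable[[], bool],
    hard_reboot: Optional[Callable[[], bool]],
    file_bug: Callable[[str], None],
) -> str:
    """Walk the escalation steps and return a label for the step that ended it.

    ssh_reboot  -- hypothetical callable, returns True if the SSH reboot worked
    hard_reboot -- hypothetical PDU/IPMI reboot callable, or None when the
                   slave has no such interface (e.g. rev3 machines)
    file_bug    -- hypothetical callable that files a bug of the given kind
    """
    # Assumption: a successful SSH reboot means nothing else is needed.
    if ssh_reboot():
        return "ssh"
    if hard_reboot is not None:
        # The slave has a PDU/IPMI interface: try it before bothering IT.
        if hard_reboot():
            return "pdu_ipmi"
        file_bug("poke")
        return "poke_bug"
    # No PDU/IPMI interface: go straight to a reboot bug.
    file_bug("reboot")
    return "reboot_bug"
```

Passing None for the PDU/IPMI callable is what keeps this generic: nothing about rev3 needs to be hardcoded, any slave without such an interface takes the same path.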
What is the workflow of "poke this slave"?  I'm not sure what that means.
(In reply to John Hopkins (:jhopkins) from comment #2)
> What is the workflow of "poke this slave"?  I'm not sure what that means.

TBD (that's why I used such a generic description). I'd like to talk to IT about what makes sense there.
Derek, since your group handles all the physical slave poking, I'd like to know the best thing to do for machines that we can't reboot through their IPMI/PDU interfaces: what type of bug should get filed? Currently, we sometimes block against the reboot bug, sometimes file a bug for diagnostics, and sometimes file a "slave XXX can't reboot" bug.

The background here is that we've got an automated system in place that reboots down slaves. If it can't, it needs to escalate through Bugzilla. We want to make sure that we're doing the right thing there and not making unnecessary work for anyone.
Flags: needinfo?(dmoore)
you can file a break/fix ticket with a generic subject - "XXX host is unreachable". punt it over to the DCOPs queue and we'll make sure it's pingable and accessible via ssh before we resolve it. 

let me know if I missed anything or if you'd like a different approach/process.
Flags: needinfo?(dmoore)
(In reply to Van Le [:van] from comment #6)
> you can file a break/fix ticket with a generic subject - "XXX host is
> unreachable". punt it over to the DCOPs queue and we'll make sure it's
> pingable and accessible via ssh before we resolve it. 
> 
> let me know if I missed anything or if you'd like a different
> approach/process.

Sounds great - thanks!
(In reply to Van Le [:van] from comment #6)
> you can file a break/fix ticket with a generic subject - "XXX host is
> unreachable". punt it over to the DCOPs queue and we'll make sure it's
> pingable and accessible via ssh before we resolve it. 
> 
> let me know if I missed anything or if you'd like a different
> approach/process.

Hmm, I just realized that this means we won't be using reboot bugs for most things anymore - each slave will end up with its own bug. Is that what you want? I'm agreeable either way.
Flags: needinfo?(vle)
that's fine with me, I'm not a real big fan of those big reboot bugs.
Flags: needinfo?(vle)
Attached patch improve the bugs (obsolete)
This patch implements the new per-slave reboot bugs and also quiets down SlaveAPI to only make bug comments upon initial filing of the reboot bug.

I don't think that the code is terribly interesting beyond getting rid of the RebootBug class and replacing it with helper functions. Because we no longer have an alias to look for (like we had with "reboots-xxx" before), it doesn't make sense to subclass Bug.

Once this is deployed we should be able to start using the production SlaveAPI instance \o/.
Attachment #8363658 - Flags: review?(jhopkins)
Here's an example of the new bug filing logic: https://bugzilla-dev.allizom.org/show_bug.cgi?id=751893

It's hard to demonstrate the other cases as they don't report anything outside of the log.
Comment on attachment 8363658 [details] [diff] [review]
improve the bugs

+def get_reboot_bug(slave):
+    qs = "?product=%s&component=%s" % (reboot_product, reboot_component)
+    qs += "&blocks=%s&resolution=---" % slave.bug.id_
+    for bug in bugzilla_client.request("GET", "bug" + qs)["bugs"]:
+        if "unreachable" in bug["summary"]:
+            return Bug(bug["id"])

Is "unreachable" some kind of magic string?
Flags: needinfo?(bhearsum)
(In reply to John Hopkins (:jhopkins) from comment #12)
> Comment on attachment 8363658 [details] [diff] [review]
> improve the bugs
> 
> +def get_reboot_bug(slave):
> +    qs = "?product=%s&component=%s" % (reboot_product, reboot_component)
> +    qs += "&blocks=%s&resolution=---" % slave.bug.id_
> +    for bug in bugzilla_client.request("GET", "bug" + qs)["bugs"]:
> +        if "unreachable" in bug["summary"]:
> +            return Bug(bug["id"])
> 
> Is "unreachable" some kind of magic string?

Yes - it's fairly unique text that's in the bug summary of the IT bug (see file_reboot_bug for the other use of it). I could probably factor that out by looking for the entire "XXXXX is unreachable" if you'd like.
Flags: needinfo?(bhearsum)
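
A rough sketch of what factoring out the summary could look like. This is not the landed patch: the helper names are made up, and the bug list is plain in-memory data standing in for a Bugzilla query result.

```python
def unreachable_summary(slave_name):
    """Build the IT bug summary in one place (hypothetical helper).

    Both the filing code and the lookup code would call this, so the
    text cannot drift apart between the two.
    """
    return "%s is unreachable" % slave_name


def find_reboot_bug(slave_name, bugs):
    """Return the id of the first bug whose summary contains the full
    "<slave> is unreachable" text, or None if there is no such bug.

    `bugs` is assumed to be a list of dicts with "id" and "summary" keys.
    """
    wanted = unreachable_summary(slave_name)
    for bug in bugs:
        if wanted in bug["summary"]:
            return bug["id"]
    return None
```

Matching on the whole summary rather than the bare word "unreachable" also avoids picking up an unrelated blocking bug that happens to mention the word.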
Comment on attachment 8363658 [details] [diff] [review]
improve the bugs

I'd be happy with either the full "xxx is unreachable" or a code comment, whichever you prefer.
Attachment #8363658 - Flags: review?(jhopkins) → review+
Attachment #8363658 - Attachment is obsolete: true
Attachment #8363818 - Flags: review?(jhopkins)
Comment on attachment 8363818 [details] [diff] [review]
factor out summary

Perfect - thanks.
Attachment #8363818 - Flags: review?(jhopkins) → review+
Blocks: 962708
Comment on attachment 8363818 [details] [diff] [review]
factor out summary

Landed, and I'm in the process of deploying.
Attachment #8363818 - Flags: checked-in+
Just noticed that we don't re-open the slave bug when filing the IT one - whoops.
Attachment #8363875 - Flags: review?(jhopkins)
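
As a minimal sketch of the intended behaviour (not the attached patch), using a made-up in-memory stand-in for a bug rather than the real Bug class or Bugzilla API. The class, its fields and the function name are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeBug:
    """Hypothetical stand-in for a tracked bug; not the SlaveAPI Bug class."""
    id_: int
    status: str = "RESOLVED"
    depends_on: List[int] = field(default_factory=list)


def link_it_bug(slave_bug, it_bug_id):
    """Record that the IT bug blocks the slave bug, and re-open the slave
    bug if it was resolved, so the slave shows up as having a problem."""
    if it_bug_id not in slave_bug.depends_on:
        slave_bug.depends_on.append(it_bug_id)
    if slave_bug.status == "RESOLVED":
        slave_bug.status = "REOPENED"
    return slave_bug
```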
Comment on attachment 8363875 [details] [diff] [review]
re-open slave bug when filing IT bug

Looks like an unmatched '}'
Attachment #8363875 - Flags: review?(jhopkins) → review-
Attached patch reopen bug
Fixed patch, as landed.
Attachment #8363875 - Attachment is obsolete: true
Attachment #8363895 - Flags: review+
Attachment #8363895 - Flags: checked-in+
Success! https://bugzilla.mozilla.org/show_bug.cgi?id=talos-r3-fed-068

Both dev & prod have been restarted for this now, and I'll be switching slaverebooter over in bug 962708.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Tools → General