Finish implementing kittenherder's PDU reboot support

RESOLVED FIXED

Status

RESOLVED FIXED
6 years ago
5 months ago

People

(Reporter: jhopkins, Assigned: jhopkins)

Tracking

Details

Attachments

(1 attachment)

(Assignee)

Description

6 years ago
Kittenherder has most of the PDU reboot code in place.  Need to finish it and test.
(Assignee)

Comment 1

6 years ago
Created attachment 747030 [details] [diff] [review]
[briar-patch] fetch PDU details from inventory, fetch lastSeen from buildapi, rewrite rebootIfNeeded logic, improve logging
Attachment #747030 - Flags: review?(coop)
(Assignee)

Comment 2

6 years ago
Patch deployment notes:
- need to "pip install pytz" into the kittenherder virtualenv
- add inventory details to secrets.cfg file
Comment on attachment 747030 [details] [diff] [review]
[briar-patch] fetch PDU details from inventory, fetch lastSeen from buildapi, rewrite rebootIfNeeded logic, improve logging

Review of attachment 747030 [details] [diff] [review]:
-----------------------------------------------------------------

A few nits, but no showstoppers. r+ with nits fixed.

We should have a conversation soon about fixing the mail output. Now that we're running more often (and you're fixing the actions we actually take), we should decide what the important issues are that we want to know about for each pass, and how we want to maintain state from one pass to the next. I'd like to get to the point where the emails are not going only to me, but are actually useful to buildduty. 

This will probably require the bugzilla integration bit first, but I don't necessarily want to gate on that.

::: releng/buildapi.py
@@ +38,5 @@
> +if __name__ == '__main__':
> +    # test
> +    rb = last_build_endtime('talos-r3-w7-029')
> +    import pprint
> +    pprint.pprint(rb)

Do we want to check rb before printing it here?

Nice to see at least basic tests being added for libraries, btw.

::: releng/remote.py
@@ +341,5 @@
> +                        (pdu, deviceID) = key_value['value'].split(':')
> +                        if not pdu.endswith('.mozilla.com'):
> +                            pdu = pdu + '.mozilla.com'
> +                        break
> +    

WS

@@ +353,2 @@
>          return False
> +        

WS

@@ +357,5 @@
> +        result = False
> +        remoteEnv = self.remoteEnv
> +
> +        if None in [self.pdu['pdu'], self.pdu['deviceID']]:
> +            return result

Is this a log-worthy error, or should we have already caught that error by this point?

@@ +849,4 @@
>          return result
>  
>      def rebootIfNeeded(self, host, lastSeen=None, indent='', dryrun=True, verbose=False):
> +        """ Reboot a host if needed. if lastSeen is None we will 

WS

@@ +869,5 @@
>                          data = host.tail_twistd_log(10)
>                          if not data or "Main loop terminated" in data or "ProcessExitedAlready" in data:
>                              break
>              else:
> +                # failed gracefull shutdown of buildbot client process

sp. graceful

@@ +929,5 @@
> +                if dryrun:
> +                    log.debug("would have hard-rebooted but dryrun is True")
> +                else:
> +                    if not (host.hasPDU or host.hasIPMI):
> +                        log.info("unreachable host does not have PDU or IPMI support")

We should exit out of the conditional at this point to avoid the subsequent checks that won't succeed anyway.

@@ +975,5 @@
> +                    tzinfo=timezone('UTC')).astimezone( \
> +                    timezone('US/Pacific')).replace(tzinfo=None)
> +                log.debug('defaulting lastseen to %s' % status['lastseen'])
> +        except requests.exceptions.HTTPError:
> +            pass

Should we still set status['lastseen'] here?
Attachment #747030 - Flags: review?(coop) → review+
(Assignee)

Comment 4

6 years ago
Comment on attachment 747030 [details] [diff] [review]
[briar-patch] fetch PDU details from inventory, fetch lastSeen from buildapi, rewrite rebootIfNeeded logic, improve logging

Addressed nits and landed in https://github.com/mozilla/briar-patch/commit/b4ad58a1496b607c8c8d64ae69adb74687cae215.  Deployed to cruncher.
Attachment #747030 - Flags: checked-in+
(Assignee)

Updated

5 years ago
Depends on: 871981
(Assignee)

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.