Closed Bug 869132 Opened 11 years ago Closed 11 years ago

Finish implementing kittenherder's PDU reboot support

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhopkins, Assigned: jhopkins)

References

Details

Attachments

(1 file)

Kittenherder has most of the PDU reboot code in place.  Need to finish it and test.
Patch deployment notes:
- need to "pip install pytz" into the kittenherder virtualenv
- add inventory details to secrets.cfg file
Comment on attachment 747030 [details] [diff] [review]
[briar-patch] fetch PDU details from inventory, fetch lastSeen from buildapi, rewrite rebootIfNeeded logic, improve logging

Review of attachment 747030 [details] [diff] [review]:
-----------------------------------------------------------------

A few nits, but no showstoppers. r+ with nits fixed.

We should have a conversation soon about fixing the mail output. Now that we're running more often (and you're fixing the actions we actually take), we should decide what the important issues are that we want to know about for each pass, and how we want to maintain state from one pass to the next. I'd like to get to the point where the emails are not going only to me, but are actually useful to buildduty. 

This will probably require the bugzilla integration bit first, but I don't necessarily want to gate on that.

::: releng/buildapi.py
@@ +38,5 @@
> +if __name__ == '__main__':
> +    # test
> +    rb = last_build_endtime('talos-r3-w7-029')
> +    import pprint
> +    pprint.pprint(rb)

Do we want to check rb before printing it here?

Nice to see at least basic tests being added for libraries, btw.

::: releng/remote.py
@@ +341,5 @@
> +                        (pdu, deviceID) = key_value['value'].split(':')
> +                        if not pdu.endswith('.mozilla.com'):
> +                            pdu = pdu + '.mozilla.com'
> +                        break
> +    

WS

@@ +353,2 @@
>          return False
> +        

WS

@@ +357,5 @@
> +        result = False
> +        remoteEnv = self.remoteEnv
> +
> +        if None in [self.pdu['pdu'], self.pdu['deviceID']]:
> +            return result

Is this a log-worthy error, or should we have already caught that error by this point?

@@ +849,4 @@
>          return result
>  
>      def rebootIfNeeded(self, host, lastSeen=None, indent='', dryrun=True, verbose=False):
> +        """ Reboot a host if needed. if lastSeen is None we will 

WS

@@ +869,5 @@
>                          data = host.tail_twistd_log(10)
>                          if not data or "Main loop terminated" in data or "ProcessExitedAlready" in data:
>                              break
>              else:
> +                # failed gracefull shutdown of buildbot client process

sp. graceful

@@ +929,5 @@
> +                if dryrun:
> +                    log.debug("would have hard-rebooted but dryrun is True")
> +                else:
> +                    if not (host.hasPDU or host.hasIPMI):
> +                        log.info("unreachable host does not have PDU or IPMI support")

We should exit out of the conditional at this point to avoid the subsequent checks that won't succeed anyway.

@@ +975,5 @@
> +                    tzinfo=timezone('UTC')).astimezone( \
> +                    timezone('US/Pacific')).replace(tzinfo=None)
> +                log.debug('defaulting lastseen to %s' % status['lastseen'])
> +        except requests.exceptions.HTTPError:
> +            pass

Should we still set status['lastseen'] here?
Attachment #747030 - Flags: review?(coop) → review+
Comment on attachment 747030 [details] [diff] [review]
[briar-patch] fetch PDU details from inventory, fetch lastSeen from buildapi, rewrite rebootIfNeeded logic, improve logging

Addressed nits and landed in https://github.com/mozilla/briar-patch/commit/b4ad58a1496b607c8c8d64ae69adb74687cae215.  Deployed to cruncher.
Attachment #747030 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: