Bug 921067 (Closed): Opened 7 years ago, Closed 6 years ago

Slaveapi should support mobile devices for reboots

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

References

Details

Attachments

(3 files, 3 obsolete files)

Slaveapi reboots currently escalate (ssh -> ipmi, etc.) with verification.

Our mobile devices would use an entirely different logic path for reboots: tegras would prefer SUTAgent -> PDU, while for pandas we should just go through mozpool (and/or the relay board, depending on timing for the reference implementation of mozpool's reboot magic).

This bug is to add reboot support for these devices.
Duplicate of this bug: 915231
From bug 915231:
(In reply to Ben Hearsum [:bhearsum] from comment #0)
> Right now, slaveapi reboots assume that SSH is always available, which isn't
> true for Pandas and Tegras. Need to fix this up and triple check that the
> PDU reboot logic works for them.
(In reply to Justin Wood (:Callek) from comment #0)
> While our mobile devices would use a different logic path for reboots
> entirely.

Actually, I don't think there's any reason to use a parallel code path for this. Other than SUTAgent there's nothing here that's mobile device specific. More on this below.

> Tegras would prefer SUTAgent->PDU, while pandas we should just go
> through mozpool (and/or relay board depending on timing for reference
> implementation of mozpool's reboot magic)

Dustin tells me that talking to relays directly will break them. Assuming that's true, we shouldn't do that.


Here are some high-level thoughts on how I think this could integrate with the existing code. Note that I'm still of the mind that we shouldn't have logic like "is a tegra/panda", but rather should look for the actual things we care about.
* Create a Mozpool client (in slaveapi/clients/mozpool.py) that knows how to ask for a reboot for a slave. Note that this shouldn't be Panda-specific - we've talked a fair amount about moving other/all slaves to Mozpool down the road.
* Create a SUTAgent client that knows how to reboot through it. This means that we need the SUTAgent code + devices.json. Ideally, the SUTAgent code will be pip installable. If it's not, and it's difficult to make it installable, we'll figure something else out. As for devices.json, we can probably request that from hg as required. The link to it should go in slaveapi.ini.
* Both of the above will need load_xxx_info methods like the other clients that a Slave uses. You probably know better than me how to figure out whether or not a slave has a sutagent or mozpool interface. I'm happy to help figure out how to do this if not though.
* Need to adjust the current logic to not assume that an SSH interface is available. This is a bit tricky without special casing Tegras/Pandas, but I _think_ that changing get_console (http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/slave.py;h=bfdbf8c6f2aef4d4bfa051bc325ffd1e7945247e;hb=HEAD#l130) to catch Paramiko exceptions (probably the root SSHException) and return None (instead of raising them) would be enough. Because Tegras/Pandas don't listen on 22 that should fail quite quickly. This has the added benefit of speeding up reboots of other slaves that are down because we won't cycle through the reboot commands even though we can't connect.
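As a sketch of the devices.json lookup described above (the key layout, device name, and the idea of passing the payload in as text are illustrative assumptions; the real client would fetch the URL from slaveapi.ini with requests):

```python
import json

def get_device(name, devices_json_text):
    """Return one device's record from a devices.json payload, or None.

    devices_json_text would normally come from something like
    requests.get(config["devices_json_url"]).text; a plain string is
    taken here so the sketch stays network-free.
    """
    all_devices = json.loads(devices_json_text)
    # Index by the *variable* name, not the string literal "name".
    return all_devices.get(name)

# Hypothetical payload shaped like a devices.json entry.
payload = json.dumps({"tegra-001": {"pdu": "pdu1.example.com", "pduid": ".AA1"}})
print(get_device("tegra-001", payload))
```

Returning None for an unknown device lets load_devices_info treat "no entry" the same way the other load_xxx_info methods treat a missing interface.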

Once we have the above, it should be simple to change the reboot escalation order to:
* SSH
* SUTAgent (calls slave.sutagent.reboot)
* Mozpool (calls slave.mozpool.reboot)
* IPMI
* PDU
* Bug
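The escalation order above could be driven by a simple loop. The method names, the (name, callable) pairs, and the return conventions here are illustrative, not slaveapi's actual API:

```python
# Sketch: try each reboot method in escalation order, skipping
# interfaces the slave doesn't have, stopping at the first success.
def escalating_reboot(slave, methods):
    """methods is an ordered list of (name, callable-or-None)."""
    for name, attempt in methods:
        if attempt is None:
            continue  # slave has no such interface (e.g. no SUTAgent)
        try:
            if attempt():
                return name
        except Exception:
            pass  # fall through to the next, more forceful method
    return None  # nothing worked -> file a bug

# Hypothetical slave where ssh fails, there is no SUTAgent,
# and mozpool succeeds before we ever reach ipmi/pdu.
order = [
    ("ssh", lambda: False),
    ("sutagent", None),
    ("mozpool", lambda: True),
    ("ipmi", lambda: True),
]
print(escalating_reboot(None, order))  # -> mozpool
```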
Attached patch WIP (obsolete) — Splinter Review
Feel free to be as bikesheddy as you think is worthy here.

This is strictly a WIP, but it should implement the devices.json support and allow PDU rebooting to work for tegras! It also lets the ssh connections fail gracefully.
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #820586 - Flags: feedback?(bhearsum)
Comment on attachment 820586 [details] [diff] [review]
WIP

Review of attachment 820586 [details] [diff] [review]:
-----------------------------------------------------------------

Did you forget to add files here? I don't see a slaveapi/clients/devices.py.
Attached patch WIP.2 (obsolete) — Splinter Review
This time with more code (and more files)
Attachment #820586 - Attachment is obsolete: true
Attachment #820586 - Flags: feedback?(bhearsum)
Attachment #821034 - Flags: feedback?(bhearsum)
Comment on attachment 821034 [details] [diff] [review]
WIP.2

Review of attachment 821034 [details] [diff] [review]:
-----------------------------------------------------------------

::: slaveapi/clients/devices.py
@@ +6,5 @@
> +def get_device(name, url):
> +    log.debug("%s - Requesting %s", name, url)
> +    all_devices = requests.get(url).json()
> +    if name in all_devices:
> +        return all_devices['name']

Style nit: use " not '. We've kept that consistent in this code base so far, I'm trying to keep it that way :).

::: slaveapi/slave.py
@@ +58,5 @@
> +        if oldPDU and oldPDU != self.pdu:
> +            # Debug logging if already set and not matching
> +            log.debug("Overriding old pdu ('%s')"
> +                      "with new value ('%s') from inventory." %
> +                      (oldPDU, self.pdu))

The fact that load_inventory_info() and load_devices_info() both set self.pdu is really bad. I hadn't realized this complication when we talked about this the other day. Rather than have both of these methods handle PDUs specially, how about:
- Strip load_devices_info down to just the part that loads the entire device info
- Replace load_inventory_info and the rest of the existing load_devices_info with a new load_pdu_info (load_inventory_info only deals with the pdu anyways, so we're not losing anything by doing that).
- load_pdu_info can centralize the pdu logic, and deal with overriding from devices_info if it exists. I'm not too fussy about whether load_pdu_info calls load_devices_info itself or not. That could be forced onto the caller if you want.

I'm open to any alternative approach that doesn't duplicate the PDU logic.
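The suggested refactor can be sketched like this; method names, keys, and the hard-coded stand-in values are all illustrative, not slaveapi's real code:

```python
# Sketch: a single load_pdu_info() owns all PDU logic, letting the
# devices.json record override the inventory value when present.
class Slave:
    def __init__(self):
        self.pdu = None

    def load_devices_info(self):
        # Stripped down: just return the raw device record (or None).
        # Stand-in for the HTTP fetch of devices.json.
        return {"pdu": "pdu2.example.com"}

    def load_pdu_info(self):
        self.pdu = "pdu1.example.com"  # stand-in for the inventory value
        device_info = self.load_devices_info()
        if device_info and device_info.get("pdu"):
            # devices.json wins over inventory; one place, no duplication
            self.pdu = device_info["pdu"]

s = Slave()
s.load_pdu_info()
print(s.pdu)  # -> pdu2.example.com
```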

@@ +154,5 @@
> +    console = SSHConsole(slave.ip, config["ssh_credentials"])
> +    try:
> +        console.connect()  # Make sure we can connect properly
> +        return console
> +    except:

Is there a more specific exception you can catch here? Maybe the root paramiko one? As it stands now, this could eat ValueError or other things if you pass in bad data to SSHConsole.

@@ +158,5 @@
> +    except:
> +        log.exception()
> +        return None  # No valid console
> +    finally:
> +        console.disconnect()  # Don't hold a ssh connection

Does disconnect() function correctly even when connect() raises?
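Both review points can be satisfied together: catch a narrow exception, and make disconnect() a no-op when no connection was established. FakeConsole is a made-up stand-in for SSHConsole so the sketch runs without paramiko:

```python
import socket

class FakeConsole:
    """Stand-in for SSHConsole; connect() may raise before anything opens."""
    def __init__(self, should_fail):
        self.should_fail = should_fail
        self.connected = False

    def connect(self):
        if self.should_fail:
            raise socket.error("connection refused")
        self.connected = True

    def disconnect(self):
        # Guarded, so it is safe to call even if connect() raised first.
        if self.connected:
            self.connected = False

def get_console(console):
    try:
        console.connect()  # just verify we *can* connect
        return console
    except socket.error:   # narrow catch; won't eat ValueError etc.
        return None        # no valid console
    finally:
        console.disconnect()  # don't hold an ssh connection open

print(get_console(FakeConsole(should_fail=True)))  # -> None
```

The real code would catch paramiko's root SSHException alongside socket.error rather than a bare except.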

@@ +167,5 @@
> +        # connects on instantiation with TCP connection
> +        dm = DeviceManagerSUT(slave.ip, retryLimit=1)
> +        dm.retryLimit = 5
> +        return dm
> +    except:

Same thing here - catching a more specific exception is better, if possible.
Attachment #821034 - Flags: feedback?(bhearsum) → feedback+
Duplicate of this bug: 956898
Status:

* I'm working on a refreshed patch.
** (Got slightly delayed due to a HD crash a few days into work, and TRIBE/other fires)
* Tegras are rebootable *today* but the checks and balances slaveapi does with other machines are not available for tegras [e.g. no verification that no job is running]

I hope to have a new patch up shortly after Bug 971780 is resolved (within the next week or two)
Attached patch WIP.3 (obsolete) — Splinter Review
I'm having some slight trouble getting this tested (local OS==windows issue)

but I wanted to get this up for feedback anyway.

Of note, I didn't do the devicemanager/SUTAgent stuff since it's proven not as reliable on tegras [for recovery], and plain bad for pandas.

This takes into account what look to be the relevant comments from the prior patch.
Attachment #8383450 - Flags: feedback?(bhearsum)
Comment on attachment 8383450 [details] [diff] [review]
WIP.3

Review of attachment 8383450 [details] [diff] [review]:
-----------------------------------------------------------------

I'm really happy with the progress on this, good job.

::: slaveapi/actions/shutdown_buildslave.py
@@ +48,5 @@
>      start = time.time()
> +    # Sometimes buildbot is run on different host than the slave
> +    # slave.basedir is still accurate, though.
> +    buildbotslave = slave.buildbotslave or slave
> +    console = get_console(buildbotslave)

I think this is a pretty decent way to handle this. I'd like to see the getting-the-proper-console moved to get_console though. Otherwise this will end up duplicated all over the place in the future.

::: slaveapi/clients/ssh.py
@@ +73,5 @@
> +                except socket.error, e:
> +                    from errno import errorcode
> +                    log.debug("%s - Socket Error (%s) - %s", self.fqdn, errorcode[e[0]], e[1])
> +                    last_exc = e
> +                    break

This looks like an improvement that will make us succeed more often, can you add a comment explaining it though?

::: slaveapi/slave.py
@@ +44,5 @@
>          self.master_url = None
> +        # used for hosts that have a different machine running buildbot than themselves
> +        # Valid buildbotslave value is a dict with keys of name,domain,ip
> +        # same as the self.{key} of same names above
> +        self.buildbotslave = None

This comment appears to be outdated. A valid buildbotslave needs to be a Slave (or subclass of it). Style nit: buildbot_slave instead of buildbotslave.

@@ +85,5 @@
>  
> +    def load_devices_info(self):
> +        log.info("%s - Getting devices.json info", self.name)
> +        device_info = devices.get_device(
> +            self.name, config["devices_json_url"]

Please update the config template to reflect this new parameter.

@@ +168,5 @@
> +    # e.g. a foopy
> +    def load_all_info(self):
> +        self.load_inventory_info()
> +        self.load_ipmi_info()
> +        self.load_bug_info()

I'm iffy on whether or not this class needs to exist. You could remove it and replace load_all_info with these three calls. I guess I'm fine either way.

@@ +212,5 @@
> +        log.exception()
> +        return None  # No valid console
> +    finally:
> +        console.disconnect() # Don't hold a ssh connection
> +    return None  # How did we get here?

Nice improvement.
Attachment #8383450 - Flags: feedback?(bhearsum) → feedback+
Depends on: 966371
Attached patch [slaveapi] v1Splinter Review
This has been thoroughly tested in -dev, including testing against other slave types than tegras.

I opted for doing the Buildslave/Slave() -> BaseSlave refactoring for the next patch which is adding panda support (using mozpool)
Attachment #8384860 - Flags: review?(bhearsum)
Attachment #8383450 - Attachment is obsolete: true
Attachment #821034 - Attachment is obsolete: true
Comment on attachment 8384860 [details] [diff] [review]
[slaveapi] v1

Review of attachment 8384860 [details] [diff] [review]:
-----------------------------------------------------------------

::: slaveapi/actions/shutdown_buildslave.py
@@ +51,4 @@
>          try:
>              rc, output = console.run_cmd("tail -n1 %s" % twistd_log)
>              if "Server Shut Down" in output:
> +                log.debug("%s - Shutdown succeeded on %s." % (slave.name, slave.buildbotslave.name))

Actually, this log message change was one of the last ones tested, and it won't work on things without a separate buildbotslave, so let's make it instead:

log.debug("%s - Shutdown succeeded." % slave.name)
Comment on attachment 8384860 [details] [diff] [review]
[slaveapi] v1

Review of attachment 8384860 [details] [diff] [review]:
-----------------------------------------------------------------

::: slaveapi/clients/ssh.py
@@ +88,5 @@
>              self.client.close()
>          self.connected = False
> +    
> +    def __del__(self):
> +        self.disconnect()

Is this important, or just an effort to be friendly to the remote system? If it's important, you need to do it somewhere other than __del__, because it's not guaranteed to be called.
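One guaranteed alternative to __del__ is a context manager, which CPython runs deterministically. This Console class is a minimal stand-in, not slaveapi's SSHConsole:

```python
class Console:
    """Sketch: deterministic cleanup via a context manager instead of
    __del__, which CPython does not guarantee will ever be called."""
    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True  # stand-in for opening the ssh session

    def disconnect(self):
        self.connected = False

    def __enter__(self):
        self.connect()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.disconnect()  # runs even when the body raised
        return False       # never swallow the exception

with Console() as c:
    print(c.connected)  # -> True
print(c.connected)      # -> False
```

If cleanup is merely friendly rather than important, __del__ as a best-effort fallback is harmless; the context manager is what makes it reliable.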
Attachment #8384860 - Flags: review?(bhearsum) → review+
$ git push
Counting objects: 22, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (12/12), done.
Writing objects: 100% (12/12), 2.45 KiB | 0 bytes/s, done.
Total 12 (delta 10), reused 0 (delta 0)
To ssh://git.mozilla.org/build/slaveapi.git
   d43a9a5..7971eb6  master -> master

...

http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=7971eb653f94dd6476b556bff8e79b148de2b4c1

Plus a followup fix I missed in my local copy (e.g. c#13) :/, from testing:

http://git.mozilla.org/?p=build/slaveapi.git;a=commitdiff;h=b43187b58d7f18e9945b68a110c0011eb69e6e65
Depends on: 979553
Depends on: 984721
Notes:

This hasn't yet been tested, but I presume it works fine. The Mozpool client will also need to be noted in puppet; its only dependency is requests >= 1.0, which we have.

Its power-cycle code also takes a "duration" param, which doesn't seem to do anything in mozpool, afaict by code inspection.

We're not currently planning on turning panda reboots on in slaverebooter so if this tests basic-ok I'd like to land it rather than spend too much time thinking on the following improvements:
* Improved checking (e.g. checking if a device is "out on request" in mozpool, or busy for other mozpool operations)
* Improved error-recovery (e.g. using MozpoolException and/or socket exception classes instead of an all-encompassing except)
* Improving the alive check to use mozpoolhandler rather than its current checks
** MozpoolHandler has a method "device_ping" which says: """ ping this device. Returns a JSON object with a `success` key, and value true or false. The ping happens synchronously, and takes around a half-second."""

Assuming you're ok with avoiding all that just to get the human-panda-reboots working via slave health, we're good!
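A thin mozpool client along these lines would cover the ping and power-cycle pieces. The URL paths follow the device_ping endpoint quoted above but should be checked against the running mozpool's API docs; the injected http_get/http_post callables (instead of calling requests directly) and the server name are assumptions of this sketch:

```python
import json

class MozpoolClient:
    """Minimal sketch of a mozpool client for slaveapi/clients/mozpool.py."""
    def __init__(self, server, http_get, http_post):
        self.base = "http://%s/api" % server
        # In real code these would be requests.get / requests.post.
        self.http_get = http_get
        self.http_post = http_post

    def ping(self, device):
        # device_ping returns a JSON object with a `success` key.
        body = self.http_get("%s/device/%s/ping/" % (self.base, device))
        return json.loads(body).get("success", False)

    def power_cycle(self, device):
        # mozpool's power-cycle accepts an optional PDU "duration";
        # per the note above it appears unused, so it is omitted here.
        self.http_post("%s/device/%s/power-cycle/" % (self.base, device), "{}")

calls = []
client = MozpoolClient("mozpool.example.com",
                       http_get=lambda url: '{"success": true}',
                       http_post=lambda url, data: calls.append(url))
print(client.ping("panda-0819"))  # -> True
client.power_cycle("panda-0819")
```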
Attachment #8394528 - Flags: review?(bhearsum)
Comment on attachment 8394528 [details] [diff] [review]
[slaveapi] panda v1

Review of attachment 8394528 [details] [diff] [review]:
-----------------------------------------------------------------

I'm happy to see Mozpool being used, rather than direct PDU reboots. I recall there was some disagreement on what to do for Panda reboots in the past.

Please test this before you land it.
Attachment #8394528 - Flags: review?(bhearsum) → review+
Comment on attachment 8394528 [details] [diff] [review]
[slaveapi] panda v1

$ git push
Counting objects: 42, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (29/29), done.
Writing objects: 100% (29/29), 4.22 KiB | 0 bytes/s, done.
Total 29 (delta 22), reused 0 (delta 0)
To ssh://git.mozilla.org/build/slaveapi.git
   1173310..ccce02a  master -> master

$ git log 1173310..ccce02a
commit ccce02ac8aa2bd7a000733a8d1211881d84fd927
Author: Justin Wood <Callek@gmail.com>
Date:   Tue Mar 25 22:46:10 2014 -0400

    Bump slaveapi ver for Bug 921067

commit ce5eb8a162100f825b910a92c43cc7f6cb10d8e5
Author: Justin Wood <Callek@gmail.com>
Date:   Tue Mar 25 22:44:17 2014 -0400

    Bug 921067 - Slaveapi should support panda devices for reboots. r=bhearsum

commit 8d13f0e58ce69436973a25ee2209cc09c08cf24a
Author: Justin Wood <Callek@gmail.com>
Date:   Tue Mar 25 22:41:45 2014 -0400

    Bug 984721 - Slaveapi should abstract out Machine from Slave. r=bhearsum

commit 7b3966fd3b2904a9b51edac38239bba432ac16cc
Author: Justin Wood <Callek@gmail.com>
Date:   Tue Mar 25 22:35:30 2014 -0400

    Bug 981039 - slaveapi should have a seperate knob for concurrency on buildapi calls alone. r=bhearsum
Attachment #8394528 - Flags: checked-in+
The mozpoolclient is needed for the package now; it's also in the package's setup.py, but this gets the venv updated.
Attachment #8396872 - Flags: review?(nthomas)
Comment on attachment 8396872 [details] [diff] [review]
[puppet] slaveapi version bump

The stamp is .... *spins wheel of colour* ... aquamarine today.
Attachment #8396872 - Flags: review?(nthomas) → review+
Comment on attachment 8396872 [details] [diff] [review]
[puppet] slaveapi version bump

Review of attachment 8396872 [details] [diff] [review]:
-----------------------------------------------------------------

bumped ver again to 1.0.18 assuming r=stamp is perfect!

(Due to https://bugzilla.mozilla.org/show_bug.cgi?id=984721#c6 )
deployed for real as part of current Slaveapi 1.0.18.1
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
doing a few panda reboots today: 

2014-04-14 14:07:40,000 - ERROR - panda-0819 - Caught exception during mozpool reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 64, in reboot
    mozpoolhandler = MozpoolHandler(self.mozpool_server)
NameError: global name 'self' is not defined
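The traceback points at a module-level reboot function using `self`, which only exists on methods. A sketch of the class of fix this calls for; the config key, MozpoolHandler stub, and names here are assumptions, not the actual followup:

```python
# slaveapi actions are module-level functions, so there is no `self`;
# the mozpool server has to come from configuration instead.
config = {"mozpool_server": "mozpool.example.com"}  # hypothetical key

class MozpoolHandler:
    """Stub standing in for the real mozpoolclient handler."""
    def __init__(self, server):
        self.server = server

def reboot(slave_name):
    # was: MozpoolHandler(self.mozpool_server)  -> NameError
    handler = MozpoolHandler(config["mozpool_server"])
    return handler.server

print(reboot("panda-0819"))  # -> mozpool.example.com
```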
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This should be all set as of the next slaveapi deploy. Got r=jhopkins over IRC

$ git push -v
Pushing to ssh://gitolite3@git.mozilla.org/build/slaveapi.git
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 470 bytes | 0 bytes/s, done.
Total 5 (delta 4), reused 0 (delta 0)
To ssh://gitolite3@git.mozilla.org/build/slaveapi.git
   4bb12b6..77d7470  master -> master
updating local tracking ref 'refs/remotes/origin/master'
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Component: Tools → General