Closed Bug 1340925 Opened 7 years ago Closed 7 years ago

Adjust master restart script for auth changes

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

References

Details

Attachments

(1 file, 1 obsolete file)

Bug 1323457 removed password auth for cltbld and root on the buildbot-masters, so the master restart script will need to be adjusted. Looks like we do most things as cltbld, and only reboot as root.

We could forward SSH credentials from the user but that needs a long-lived connection back to the local ssh agent, provision some ssh keys on buildduty-tools, or <your idea here>.
Also, the failure mode was poor. The script did
* disable master in slavealloc
* send graceful stop via web ui
* try to log in to create reconfig.lock file, fail to open a connection
* repeat over all masters

End result - all masters shutting down and disabled, with no hope of ssh connection to restart buildbot. We could mitigate this with checks that all credentials work before starting on any masters (I mean LDAP for slavealloc, ssh connections, etc, everything we're going to want to use).
I don't think we need any further checking for slavealloc access. We already exit if we fail to get the list of masters from slavealloc at the start of the script:

https://hg.mozilla.org/build/tools/file/2c156751ea19/buildfarm/maintenance/restart_masters.py#l231

For the individual masters, I think a two-step approach is warranted here:

1) Check the connection to a single master (first one in the list, or whatever) at the outset, just to make sure the credentials are valid.
2) Before disabling or gracefully stopping a master, test the connection to that master first so we can be reasonably certain the follow-up actions will work.

Does this make sense?
Assignee: nobody → coop
Status: NEW → ASSIGNED
Sounds good to me.
Comment on attachment 8842945 [details] [diff] [review]
[tools] Verify we can connect to a given master before disabling buildbot

>diff --git a/buildfarm/maintenance/restart_masters.py b/buildfarm/maintenance/restart_masters.py
...
>+    # Connect to a single master, just to make sure our supplied credentials are valid.
>+    log.debug("Verifying LDAP credentials by connecting to a single master...")
>+    master = buckets.itervalues().next()[0]
>+    if master and not check_credentials(master):
>+        log.error("Couldn't connect to master. Check you LDAP credentials.")
>+        sys.exit(4)

I thought we needed key-based auth for ssh and LDAP is just for slavealloc. Am I missing something ?
(In reply to Nick Thomas [:nthomas] from comment #5)
> I thought we needed key-based auth for ssh and LDAP is just for slavealloc.
> Am I missing something ?

That error msg is redundant anyway. I'll just remove it. New patch incoming.
Same as before, but removes the extra error log message that improperly referenced LDAP credentials.
Attachment #8842945 - Attachment is obsolete: true
Attachment #8842945 - Flags: review?(nthomas)
Attachment #8844082 - Flags: review?(nthomas)
Comment on attachment 8844082 [details] [diff] [review]
[tools] Verify we can connect to a given master before disabling buildbot, v2

Review of attachment 8844082 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/restart_masters.py
@@ +121,5 @@
> +                    log.warning("Error running remote command '%s' on master: %s" % (cmd, master['hostname']))
> +            except ssh.RemoteCommandError:
> +                log.warning("Caught exception while attempting remote command '%s' on master: %s" % (cmd, master['hostname']))
> +            else:
> +                log.error("Couldn't get console to %s" % master['hostname'])

Looks like the indent on the else should be reduced, to align with 'if console'.

@@ +509,5 @@
>  
>      put_masters_in_buckets(masters_json, master_list)
>  
> +    # Connect to a single master, just to make sure our supplied credentials are valid.
> +    log.debug("Verifying LDAP credentials by connecting to a single master...")

Please change this to refer to ssh.
Attachment #8844082 - Flags: review?(nthomas) → review+
https://hg.mozilla.org/build/tools/rev/d7c845338818d48f278478f58deeeaeaf0dc4a57
Bug 1340925 - Verify we can connect to a given master before disabling buildbot - r=nthomas
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
FTR, this fix works nicely for cltbld, provided you forward your ssh agent. The reboot functionality is broken because we can't ssh in as root; we'd need something based on sudo instead.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: