Closed
Bug 1340925
Opened 7 years ago
Closed 7 years ago
Adjust master restart script for auth changes
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: coop)
References
Details
Attachments
(1 file, 1 obsolete file)
5.94 KB,
patch
|
nthomas
:
review+
|
Details | Diff | Splinter Review |
Bug 1323457 removed password auth for cltbld and root on the buildbot-masters, so the master restart script will need to be adjusted. Looks like we do most things as cltbld, and only reboot as root. We could forward SSH credentials from the user but that needs a long-lived connection back to the local ssh agent, provision some ssh keys on buildduty-tools, or <your idea here>.
Reporter | ||
Comment 1•7 years ago
|
||
Also, the failure mode was poor. The script did * disable master in slavealloc * send graceful stop via web ui * try to log in to create reconfig.lock file, fail to open a connection * repeat over all masters End result - all masters shutting down and disabled, with no hope of ssh connection to restart buildbot. We could mitigate this with checks that all credentials work before starting on any masters (I mean LDAP for slavealloc, ssh connections, etc, everything we're going to want to use).
Assignee | ||
Comment 2•7 years ago
|
||
I don't think we need any further checking for slavealloc access. We already exit if we fail to get the list of masters from slavealloc at the start of the script: https://hg.mozilla.org/build/tools/file/2c156751ea19/buildfarm/maintenance/restart_masters.py#l231 For the individual masters, I think a two-step approach is warranted here: 1) Check the connection to a single master (first one in the list, or whatever) at the outset, just to make sure the credentials are valid. 2) Before disabling or gracefully stopping a master, test the connection to that master first so we can be reasonably certain the follow-up actions will work. Does this make sense?
Assignee: nobody → coop
Status: NEW → ASSIGNED
Reporter | ||
Comment 3•7 years ago
|
||
Sounds good to me.
Assignee | ||
Comment 4•7 years ago
|
||
Attachment #8842945 -
Flags: review?(nthomas)
Reporter | ||
Comment 5•7 years ago
|
||
Comment on attachment 8842945 [details] [diff] [review] [tools] Verify we can connect to a given master before disabling buildbot >diff --git a/buildfarm/maintenance/restart_masters.py b/buildfarm/maintenance/restart_masters.py ... >+ # Connect to a single master, just to make sure our supplied credentials are valid. >+ log.debug("Verifying LDAP credentials by connecting to a single master...") >+ master = buckets.itervalues().next()[0] >+ if master and not check_credentials(master): >+ log.error("Couldn't connect to master. Check you LDAP credentials.") >+ sys.exit(4) I thought we needed key-based auth for ssh and LDAP is just for slavealloc. Am I missing something ?
Assignee | ||
Comment 6•7 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #5) > I thought we needed key-based auth for ssh and LDAP is just for slavealloc. > Am I missing something ? That error msg is redundant anyway. I'll just remove it. New patch incoming.
Assignee | ||
Comment 7•7 years ago
|
||
Same as before, but removes the extra error log message that improperly referenced LDAP credentials.
Attachment #8842945 -
Attachment is obsolete: true
Attachment #8842945 -
Flags: review?(nthomas)
Attachment #8844082 -
Flags: review?(nthomas)
Reporter | ||
Comment 8•7 years ago
|
||
Comment on attachment 8844082 [details] [diff] [review] [tools] Verify we can connect to a given master before disabling buildbot, v2 Review of attachment 8844082 [details] [diff] [review]: ----------------------------------------------------------------- ::: buildfarm/maintenance/restart_masters.py @@ +121,5 @@ > + log.warning("Error running remote command '%s' on master: %s" % (cmd, master['hostname'])) > + except ssh.RemoteCommandError: > + log.warning("Caught exception while attempting remote command '%s' on master: %s" % (cmd, master['hostname'])) > + else: > + log.error("Couldn't get console to %s" % master['hostname']) Looks like the indent on the else should be reduced, to align with 'if console'. @@ +509,5 @@ > > put_masters_in_buckets(masters_json, master_list) > > + # Connect to a single master, just to make sure our supplied credentials are valid. > + log.debug("Verifying LDAP credentials by connecting to a single master...") Please change this to refer to ssh.
Attachment #8844082 -
Flags: review?(nthomas) → review+
Assignee | ||
Comment 9•7 years ago
|
||
https://hg.mozilla.org/build/tools/rev/d7c845338818d48f278478f58deeeaeaf0dc4a57 Bug 1340925 - Verify we can connect to a given master before disabling buildbot - r=nthomas
Assignee | ||
Updated•7 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 10•7 years ago
|
||
FTR, this fix works nicely for cltbld, provided you forward your ssh agent. The reboot functionality is broken because we can't ssh in as root; we'd need something based on sudo instead.
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•