Closed Bug 1309197 Opened 9 years ago Closed 9 years ago

[taskcluster-worker] Create puppet config to deploy taskcluster in releng machines

Categories

(Taskcluster :: Workers, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: mozilla54

People

(Reporter: wcosta, Assigned: wcosta)

References

Details

Attachments

(3 files, 1 obsolete file)

- taskcluster-worker will "own" the machine, more or less the same way the docker daemon does.
- Because of this, it is OK for the user that taskcluster-worker runs as to have some administrative privileges.
- To avoid future mistakes/confusion/mess/misunderstandings/black holes, we will create a new user for taskcluster-worker, instead of using root or cltbld.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Depends on: 1314977
Wander's got this mostly done in https://github.com/walac/build-puppet - I'll finish it up.
Assignee: wcosta → dustin
Here's what I have: https://github.com/mozilla/build-puppet/compare/master...djmitche:bug1309197?diff=unified&expand=1&name=bug1309197 However, it's not starting the worker, either in my version or in wander's. Which reminds me how much fun launchd is. In particular, I'm not sure what user the worker should be running as.
Callek, if you wouldn't mind running puppet on a few more t-yosemite testers in the 040-059 range, that'd be good. I'd like to see if any of them are running the worker.
Flags: needinfo?(bugspam.Callek)
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> Callek, if you wouldn't mind running puppet on a few more t-yosemite testers
> in the 040-059 range, that'd be good. I'd like to see if any of them are
> running the worker.

Running a for loop with 'puppet agent --test' on each of those nodes; 041 timed out in connecting, though...
Flags: needinfo?(bugspam.Callek)
Thanks! Looking on 0042, I see the worker running, as root:

root  1423  0.0  0.1  556691344  19372  ??  S  2Nov16  65:57.27 /usr/local/bin/taskcluster-worker daemon run /etc/taskcluster-worker.yml

so that answers that question! I rebooted the host to see if it will start the worker, as I can't see any differences in the launchd plist.
..and it's not starting automatically, so at least this isn't something I've broken :)
`launchctl load -w /Library/LaunchAgents/net.taskcluster.worker.plist` does seem to start the service.
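For reference, a LaunchAgent plist along these lines would produce the behavior seen here. This is a minimal sketch, not the actual file from build-puppet; only the label, binary path, and config path appear in this bug, and the remaining keys are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>net.taskcluster.worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/taskcluster-worker</string>
    <string>daemon</string>
    <string>run</string>
    <string>/etc/taskcluster-worker.yml</string>
  </array>
  <!-- Start the agent as soon as the user session comes up -->
  <key>RunAtLoad</key>
  <true/>
  <!-- Restart the worker if it exits -->
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Because this lives in /Library/LaunchAgents, launchd only starts it inside a logged-in user's session; a /Library/LaunchDaemons entry would start at boot without any login, but then the worker would not have access to the GUI session the tests need.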
A bit more investigation shows that autologin is not working. I have copied wander's secrets, so I'm not sure why this would be the case.

It looks like the root password is set to something insecure - I can login as root via SSH with that insecure password. The /etc/kcpassword file corresponds to that insecure password. The defaults are set correctly to login as root. I see that 0040 is logged in (there's a Finder process running as root) but on 0042 (which I have restarted but have not run against my puppet environment) and 0045 (which I have restarted and have run against my puppet environment) root is not logged in. So I wonder if the autologin never worked, and these were all logged into manually 28 days ago?

The worker is defined as a LaunchAgent, which means that it runs in the user context after user login, so it makes sense that with no user login, there is no running worker.

So, I think I'm stuck here until Wander is back to provide some context. Remaining to do:

* use a secure password
* get root autologin working, verify that worker starts on boot
* ensure mig runs in "checkin" mode -- probably best done in parallel with the puppet run
* change the provisionerId
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> A bit more investigation shows that autologin is not working. I have copied
> wander's secrets, so I'm not sure why this would be the case. It looks like
> the root password is set to something insecure - I can login as root via SSH
> with that insecure password. The /etc/kcpassword file corresponds to that
> insecure password. The defaults are set correctly to login as root. I see
> that 0040 is logged in (there's a Finder process running as root) but on
> 0042 (which I have restarted but have not run against my puppet environment)
> and 0045 (which I have restarted and have run against my puppet environment)
> root is not logged in. So I wonder if the autologin never worked, and these
> were all logged into manually 28 days ago?

Autologin was working last time I checked, I will look into it.

> The worker is defined as a LaunchAgent, which means that it runs in the user
> context after user login, so it makes sense that with no user login, there
> is no running worker.
>
> So, I think I'm stuck here until Wander is back to provide some context.
>
> Remaining to do:
>
> * use a secure password

Well, I was given this password and told to keep it while the machines are loaned.

> * get root autologin working, verify that worker starts on boot

The way I got it working was manually configuring autologin in System Preferences, then copying the kcpassword content in base64 format.

> * ensure mig runs in "checkin" mode -- probably best done in parallel with
> the puppet run

No idea what this means.

> * change the provisionerId

What's the correct provisionerId?
Flags: needinfo?(wcosta)
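For context on what /etc/kcpassword actually holds: the autologin password is not stored in the clear, but XORed against a fixed 11-byte magic key and padded, which is why the secrets here carry it base64-encoded (root_pw_kcpassword_base64). A sketch of that commonly documented encoding — this is my own illustration, not code from build-puppet, and the exact padding rule varies slightly between published implementations:

```python
import base64

# The fixed XOR key macOS uses for /etc/kcpassword (widely documented).
KC_MAGIC = bytes([0x7D, 0x89, 0x52, 0x23, 0xD2, 0xC8,
                  0x59, 0x5D, 0x8B, 0x4E, 0x57])

def kcpassword_encode(password: str) -> bytes:
    """XOR the password with the cycling magic key, NUL-terminated
    and zero-padded to a multiple of 12 bytes."""
    data = bytearray(password.encode("utf-8"))
    data.append(0)  # NUL terminator before padding
    while len(data) % 12 != 0:
        data.append(0)
    return bytes(b ^ KC_MAGIC[i % len(KC_MAGIC)] for i, b in enumerate(data))

def kcpassword_decode(blob: bytes) -> str:
    """Reverse the XOR (XOR is symmetric) and strip at the first NUL."""
    plain = bytes(b ^ KC_MAGIC[i % len(KC_MAGIC)] for i, b in enumerate(blob))
    return plain.split(b"\x00", 1)[0].decode("utf-8")

# What goes into a secret like root_pw_kcpassword_base64:
blob_b64 = base64.b64encode(kcpassword_encode("hunter2")).decode("ascii")
```

Setting autologin through System Preferences and then reading back /etc/kcpassword, as done later in this bug, sidesteps any padding-rule mismatch.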
From Jonas in email, regarding provisionerId:

---
I like the idea of a provisionerId: fixed-hardware or mozilla-scl3 or data-center

Ideally, the provisionerId is specific to the group of workerTypes that will be configured by the same people. So we can grant a set of people scopes like queue:worker-type:mozilla-scl3/*, and then that group of people can create the roles and credentials for those workers.
---

In light of the last paragraph, I think `releng-scl3` is probably a good choice.
I'd like to land that as-is (minus the environment pinning), and do the mig and provisionerId changes in subsequent patches.
Comment on attachment 8817546 [details] [review] https://github.com/mozilla/build-puppet/pull/21 I left a couple of comments in the PR, but it looks great overall!
Attachment #8817546 - Flags: review?(wcosta) → review+
So, Wander and I finally found a few minutes to sit down and work on this, and .. it just worked. We took a few otherwise-untouched machines and ran them against my (unchanged) puppet environment, then rebooted, and the autologin occurred. *BOGGLE* Once Callek double-checks this and lands it (I'm not on the whitelist anymore), I'll adjust the production secrets and reboot all of the machines into the production environment. Assuming they all successfully start taskcluster-worker, I'll try reimaging a few to make sure that still works. If so, then I'll move on to the to-do items in comment 8.
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta

https://reviewboard.mozilla.org/r/97804/#review98156

::: modules/users/manifests/builder/setup.pp
(Diff revision 1)
> -    class {
> -        'disableservices::user':
> -            username => $username,
> -            group    => $group,
> -            home     => $home;
> -    }

Ok, I don't *think* we can move this to autologin alone...

https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.

However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.

I think we may want to disable the services some other way to support the root autologin needs.
Attachment #8817555 - Flags: review?(bugspam.Callek) → review-
I think the disableservices::user is just for disabling services relevant to a user login. I can look in more detail to verify that. Alternately, we could just allow this to not be parallel between root autologin and builder login. Thanks -- I'll redraft or reply :)
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta

https://reviewboard.mozilla.org/r/97804/#review98156

> Ok, I don't *think* we can move this to autologin alone...
>
> https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.
>
> However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.
>
> I think we may want to disable the services some other way to support the root autologin needs.

...actually looking again, I think the only server class this would affect is OSX signing. And there is no users::builder on a signing server, so it's likely this is indeed fine, as long as we don't add more OSX server classes. I'll leave it to you for confirmation on what I'm seeing to 'drop' the issue. But given the confusion I think we should only land when more of us are around to watch for fallout.
Attachment #8817555 - Flags: review- → review+
The disableservices::user is all user-login-specific (screensaver, etc.) so I think if the user isn't logging in, it would have no effect. But I agree regarding the caution and I'll wait until next week to land.
Attachment #8817555 - Flags: review?(dustin)
Rail landed this about 35 minutes ago.
Wander, I ran this on 0044, after setting root_pw_kcpassword_base64!low-security to (I think) the appropriate value based on root_pw_cleartext!low-security.

It didn't work - no autologin.

Can you try logging in via VNC and setting the autologin password via the UI, then checking if the resulting /etc/kcpassword is different? If not, can you figure out why it's not doing the autologin? I don't seem to have the touch :(
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Wander, I ran this on 0044, after setting
> root_pw_kcpassword_base64!low-security to (I think) the appropriate value
> based on root_pw_cleartext!low-security
>
> It didn't work - no autologin.
>
> Can you try logging in via VNC and setting the autologin password via UI,
> then checking if the resulting /etc/kcpassword is different? If not, can
> you figure out why it's not doing the autologin? I don't seem to have the
> touch :(

ok, sure
Flags: needinfo?(wcosta)
Wander and Alin are working on this. It turns out I wasn't much help after all :)
Attachment #8817546 - Attachment is obsolete: true
Assignee: dustin → wcosta
I tried the bless-and-reboot trick on 00{40,42,43,44} and in all cases they are up and running tests with no further attention. There are some errors that cause the puppetize run to retry puppet a few times, but it converges eventually. So, I think this is actually done? VNC doesn't work, but from my perspective that is normal.
Same on 00{46,47,48,49,50,51,52,53,54,55,56,57,58,59}.

Not accessible via SSH:
0041
0045

Alin, can you try to resuscitate those two?
Flags: needinfo?(aselagea)
Depends on: 1325012, 1325010
Flags: needinfo?(aselagea)
(In reply to Dustin J. Mitchell [:dustin] from comment #26)
> Not accessible via SSH:
> 0041
> 0045
>
> Alin, can you try to resuscitate those two?

Both are unreachable and will need intervention from DCOps to bring them back online. Added those bugs as dependencies to this one.
macosx-engine is deprecated; switch to the native engine. Also add reboot plugin support.
This version switches to the native engine and adds the reboot plugin.
Attachment #8829877 - Flags: review?(dustin)
Attachment #8829878 - Flags: review?(dustin)
Attachment #8829877 - Flags: review?(dustin) → review+
Comment on attachment 8829878 [details] [diff] [review]
upgrade taskcluster-worker version to 0.0.7 r=dustin

Review of attachment 8829878 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
@@ +23,3 @@
>  logLevel: info
>  plugins:
> +  disabled: ['interactive', 'maxruntime']

why disable maxruntime?

@@ +30,5 @@
> +  LANG: 'en_US.UTF-8'
> +  LC_ALL: 'en_US.UTF-8'
> +  XPC_FLAGS: '0x0'
> +  XPC_SERVICE_NAME: '0'
> +  IDLEIZER_DISABLE_SHUTDOWN: 'true'

are we using idleizer??
Attachment #8829878 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #30)
> ::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
> @@ +23,3 @@
> >  logLevel: info
> >  plugins:
> > +  disabled: ['interactive', 'maxruntime']
>
> why disable maxruntime?

I just don't want to mess with timeouts atm.

> @@ +30,5 @@
> > +  LANG: 'en_US.UTF-8'
> > +  LC_ALL: 'en_US.UTF-8'
> > +  XPC_FLAGS: '0x0'
> > +  XPC_SERVICE_NAME: '0'
> > +  IDLEIZER_DISABLE_SHUTDOWN: 'true'
>
> are we using idleizer??

I have no idea what this is, I am just mirroring the buildbot config.
No longer depends on: 1314977
See Also: → 1314977
(In reply to Wander Lairson Costa [:wcosta] from comment #31)
> > are we using idleizer??
>
> I have no idea what this is, I am just mirroring the buildbot config.

idleizer is some code in buildbot itself that automatically reboots slaves when they are idle or not connected to a master. It's very BB-specific, so you can drop this env var next time you're patching this file - but it doesn't hurt anything.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Component: Worker → Workers