Closed Bug 1309197 Opened 9 years ago Closed 9 years ago

[taskcluster-worker] Create puppet config to deploy taskcluster in releng machines

Categories

(Taskcluster :: Workers, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: mozilla54

People

(Reporter: wcosta, Assigned: wcosta)

References

Details

Attachments

(3 files, 1 obsolete file)

- taskcluster-worker will "own" the machine, more or less the same way the docker daemon does.
- Because of this, it is OK for the user that taskcluster-worker runs as to have some administrative privileges.
- To avoid future mistakes/confusion/mess/misunderstandings/black holes, we will create a new user for taskcluster-worker, instead of using root or cltbld.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Depends on: 1314977
Wander's got this mostly done in https://github.com/walac/build-puppet - I'll finish it up.
Assignee: wcosta → dustin
Here's what I have: https://github.com/mozilla/build-puppet/compare/master...djmitche:bug1309197?diff=unified&expand=1&name=bug1309197 However, it's not starting the worker, either in my version or in wander's. Which reminds me how much fun launchd is. In particular, I'm not sure what user the worker should be running as.
Callek, if you wouldn't mind running puppet on a few more t-yosemite testers in the 040-059 range, that'd be good. I'd like to see if any of them are running the worker.
Flags: needinfo?(bugspam.Callek)
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> Callek, if you wouldn't mind running puppet on a few more t-yosemite testers
> in the 040-059 range, that'd be good. I'd like to see if any of them are
> running the worker.

Running a for loop with 'puppet agent --test' on each of those nodes; 041 timed out in connecting, though...
Flags: needinfo?(bugspam.Callek)
Thanks! Looking on 0042, I see the worker running, as root:

root  1423  0.0  0.1  556691344  19372  ??  S  2Nov16  65:57.27 /usr/local/bin/taskcluster-worker daemon run /etc/taskcluster-worker.yml

so that answers that question! I rebooted the host to see if it will start the worker, as I can't see any differences in the launchd plist.
..and it's not starting automatically, so at least this isn't something I've broken :)
`launchctl load -w /Library/LaunchAgents/net.taskcluster.worker.plist` does seem to start the service.
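For reference, a LaunchAgent plist along these lines would produce the behavior seen here. This is a minimal sketch, not the actual file from build-puppet; only the label, binary path, and config path appear in this bug, and the remaining keys are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>net.taskcluster.worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/taskcluster-worker</string>
    <string>daemon</string>
    <string>run</string>
    <string>/etc/taskcluster-worker.yml</string>
  </array>
  <!-- Start the agent as soon as the user session comes up -->
  <key>RunAtLoad</key>
  <true/>
  <!-- Restart the worker if it exits -->
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Because this lives in /Library/LaunchAgents, launchd only starts it inside a logged-in user's session; a /Library/LaunchDaemons entry would start at boot without any login, but then the worker would not have access to the GUI session the tests need.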
A bit more investigation shows that autologin is not working. I have copied wander's secrets, so I'm not sure why this would be the case.

It looks like the root password is set to something insecure - I can login as root via SSH with that insecure password. The /etc/kcpassword file corresponds to that insecure password. The defaults are set correctly to login as root. I see that 0040 is logged in (there's a Finder process running as root) but on 0042 (which I have restarted but have not run against my puppet environment) and 0045 (which I have restarted and have run against my puppet environment) root is not logged in. So I wonder if the autologin never worked, and these were all logged into manually 28 days ago?

The worker is defined as a LaunchAgent, which means that it runs in the user context after user login, so it makes sense that with no user login, there is no running worker.

So, I think I'm stuck here until Wander is back to provide some context. Remaining to do:

* use a secure password
* get root autologin working, verify that worker starts on boot
* ensure mig runs in "checkin" mode -- probably best done in parallel with the puppet run
* change the provisionerId
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> A bit more investigation shows that autologin is not working. I have copied
> wander's secrets, so I'm not sure why this would be the case. It looks like
> the root password is set to something insecure - I can login as root via SSH
> with that insecure password. The /etc/kcpassword file corresponds to that
> insecure password. The defaults are set correctly to login as root. I see
> that 0040 is logged in (there's a Finder process running as root) but on
> 0042 (which I have restarted but have not run against my puppet environment)
> and 0045 (which I have restarted and have run against my puppet environment)
> root is not logged in. So I wonder if the autologin never worked, and these
> were all logged into manually 28 days ago?

Autologin was working last time I checked, I will look into it.

> The worker is defined as a LaunchAgent, which means that it runs in the user
> context after user login, so it makes sense that with no user login, there
> is no running worker.
>
> So, I think I'm stuck here until Wander is back to provide some context.
>
> Remaining to do:
>
> * use a secure password

Well, I was given this password and told to keep it while the machines are loaned.

> * get root autologin working, verify that worker starts on boot

The way I got it working was manually configuring autologin in System Preferences, then copying the kcpassword content in base64 format.

> * ensure mig runs in "checkin" mode -- probably best done in parallel with
> the puppet run

No idea what this means.

> * change the provisionerId

What's the correct provisionerId?
Flags: needinfo?(wcosta)
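For context on what /etc/kcpassword actually holds: the autologin password is not stored in the clear, but XORed against a fixed 11-byte magic key and padded, which is why the secrets here carry it base64-encoded (root_pw_kcpassword_base64). A sketch of that commonly documented encoding — this is my own illustration, not code from build-puppet, and the exact padding rule varies slightly between published implementations:

```python
import base64

# The fixed XOR key macOS uses for /etc/kcpassword (widely documented).
KC_MAGIC = bytes([0x7D, 0x89, 0x52, 0x23, 0xD2, 0xC8,
                  0x59, 0x5D, 0x8B, 0x4E, 0x57])

def kcpassword_encode(password: str) -> bytes:
    """XOR the password with the cycling magic key, NUL-terminated
    and zero-padded to a multiple of 12 bytes."""
    data = bytearray(password.encode("utf-8"))
    data.append(0)  # NUL terminator before padding
    while len(data) % 12 != 0:
        data.append(0)
    return bytes(b ^ KC_MAGIC[i % len(KC_MAGIC)] for i, b in enumerate(data))

def kcpassword_decode(blob: bytes) -> str:
    """Reverse the XOR (XOR is symmetric) and strip at the first NUL."""
    plain = bytes(b ^ KC_MAGIC[i % len(KC_MAGIC)] for i, b in enumerate(blob))
    return plain.split(b"\x00", 1)[0].decode("utf-8")

# What goes into a secret like root_pw_kcpassword_base64:
blob_b64 = base64.b64encode(kcpassword_encode("hunter2")).decode("ascii")
```

Setting autologin through System Preferences and then reading back /etc/kcpassword, as done later in this bug, sidesteps any padding-rule mismatch.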
From Jonas in email, regarding provisionerId:

---
I like the idea of a provisionerId: fixed-hardware or mozilla-scl3 or data-center

Ideally, the provisionerId is specific to the group of workerTypes that will be configured by the same people. So we can grant a set of people scopes like queue:worker-type:mozilla-scl3/*, and then that group of people can create the roles and credentials for those workers.
---

In light of the last paragraph, I think `releng-scl3` is probably a good choice.
I'd like to land that as-is (minus the environment pinning), and do the mig and provisionerId changes in subsequent patches.
Comment on attachment 8817546 [details] [review] https://github.com/mozilla/build-puppet/pull/21 I left a couple of comments in the PR, but it looks great overall!
Attachment #8817546 - Flags: review?(wcosta) → review+
So, Wander and I finally found a few minutes to sit down and work on this, and .. it just worked. We took a few otherwise-untouched machines and ran them against my (unchanged) puppet environment, then rebooted, and the autologin occurred. *BOGGLE* Once Callek double-checks this and lands it (I'm not on the whitelist anymore), I'll adjust the production secrets and reboot all of the machines into the production environment. Assuming they all successfully start taskcluster-worker, I'll try reimaging a few to make sure that still works. If so, then I'll move on to the to-do items in comment 8.
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta

https://reviewboard.mozilla.org/r/97804/#review98156

::: modules/users/manifests/builder/setup.pp
(Diff revision 1)
> -    class {
> -        'disableservices::user':
> -            username => $username,
> -            group    => $group,
> -            home     => $home;
> -    }

Ok, I don't *think* we can move this to autologin alone...

https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.

However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.

I think we may want to disable the services some other way to support the root autologin needs.
Attachment #8817555 - Flags: review?(bugspam.Callek) → review-
I think the disableservices::user is just for disabling services relevant to a user login. I can look in more detail to verify that. Alternately, we could just allow this to not be parallel between root autologin and builder login. Thanks -- I'll redraft or reply :)
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta

https://reviewboard.mozilla.org/r/97804/#review98156

> Ok, I don't *think* we can move this to autologin alone...
>
> https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.
>
> However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.
>
> I think we may want to disable the services some other way to support the root autologin needs.

...actually looking again, I think the only server class this would affect is OSX signing. And there is no users::builder on a signing server, so it's likely this is indeed fine, as long as we don't add more OSX server classes. I'll leave it to you for confirmation on what I'm seeing to 'drop' the issue. But given the confusion I think we should only land when more of us are around to watch for fallout.
Attachment #8817555 - Flags: review- → review+
The disableservices::user is all user-login-specific (screensaver, etc.) so I think if the user isn't logging in, it would have no effect. But I agree regarding the caution and I'll wait until next week to land.
Attachment #8817555 - Flags: review?(dustin)
Rail landed this about 35 minutes ago.
Wander, I ran this on 0044, after setting root_pw_kcpassword_base64!low-security to (I think) the appropriate value based on root_pw_cleartext!low-security.

It didn't work - no autologin.

Can you try logging in via VNC and setting the autologin password via the UI, then checking if the resulting /etc/kcpassword is different? If not, can you figure out why it's not doing the autologin? I don't seem to have the touch :(
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Wander, I ran this on 0044, after setting
> root_pw_kcpassword_base64!low-security to (I think) the appropriate value
> based on root_pw_cleartext!low-security
>
> It didn't work - no autologin.
>
> Can you try logging in via VNC and setting the autologin password via UI,
> then checking if the resulting /etc/kcpassword is different? If not, can
> you figure out why it's not doing the autologin? I don't seem to have the
> touch :(

ok, sure
Flags: needinfo?(wcosta)
Wander and Alin are working on this. It turns out I wasn't much help after all :)
Attachment #8817546 - Attachment is obsolete: true
Assignee: dustin → wcosta
I tried the bless-and-reboot trick on 00{40,42,43,44} and in all cases they are up and running tests with no further attention. There are some errors that cause the puppetize run to retry puppet a few times, but it converges eventually. So, I think this is actually done? VNC doesn't work, but from my perspective that is normal.
Same on 00{46,47,48,49,50,51,52,53,54,55,56,57,58,59}.

Not accessible via SSH:
0041
0045

Alin, can you try to resuscitate those two?
Flags: needinfo?(aselagea)
Depends on: 1325012, 1325010
Flags: needinfo?(aselagea)
(In reply to Dustin J. Mitchell [:dustin] from comment #26)
> Not accessible via SSH:
> 0041
> 0045
>
> Alin, can you try to resuscitate those two?

Both are unreachable and will need intervention from DCOps to bring them back online. Added those bugs as dependencies to this one.
macosx-engine is deprecated; switch to the native engine. Also add reboot plugin support.
This version switches to the native engine and adds the reboot plugin.
Attachment #8829877 - Flags: review?(dustin)
Attachment #8829878 - Flags: review?(dustin)
Attachment #8829877 - Flags: review?(dustin) → review+
Comment on attachment 8829878 [details] [diff] [review]
upgrade taskcluster-worker version to 0.0.7 r=dustin

Review of attachment 8829878 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
@@ +23,3 @@
>  logLevel: info
>  plugins:
> +  disabled: ['interactive', 'maxruntime']

why disable maxruntime?

@@ +30,5 @@
> +  LANG: 'en_US.UTF-8'
> +  LC_ALL: 'en_US.UTF-8'
> +  XPC_FLAGS: '0x0'
> +  XPC_SERVICE_NAME: '0'
> +  IDLEIZER_DISABLE_SHUTDOWN: 'true'

are we using idleizer??
Attachment #8829878 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #30)
> ::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
> @@ +23,3 @@
> >  logLevel: info
> >  plugins:
> > +  disabled: ['interactive', 'maxruntime']
>
> why disable maxruntime?

I just don't want to mess with timeouts atm.

> @@ +30,5 @@
> > +  LANG: 'en_US.UTF-8'
> > +  LC_ALL: 'en_US.UTF-8'
> > +  XPC_FLAGS: '0x0'
> > +  XPC_SERVICE_NAME: '0'
> > +  IDLEIZER_DISABLE_SHUTDOWN: 'true'
>
> are we using idleizer??

I have no idea what this is, I am just mirroring the buildbot config.
No longer depends on: 1314977
See Also: → 1314977
(In reply to Wander Lairson Costa [:wcosta] from comment #31)
> > are we using idleizer??
>
> I have no idea what this is, I am just mirroring the buildbot config.

idleizer is some code in buildbot itself that automatically reboots slaves when they are idle or not connected to a master. It's very BB-specific, so you can drop this env var next time you're patching this file - but it doesn't hurt anything.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Component: Worker → Workers