Bug 1309197 (Closed) - [taskcluster-worker] Create puppet config to deploy taskcluster in releng machines
Opened 9 years ago; Closed 9 years ago
Component: Taskcluster :: Workers (defect)
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: mozilla54
People: (Reporter: wcosta, Assigned: wcosta)
Attachments: (3 files, 1 obsolete file)
58 bytes, text/x-review-board-request; Callek: review+
7.85 KB, patch; dustin: review+
2.88 KB, patch; dustin: review+
Description:
- taskcluster-worker will "own" the machine, more or less the same way the docker daemon does.
- Because of this, it is OK for the user that taskcluster-worker runs as to have some administrative privileges.
- To avoid future mistakes/confusion/mess/misunderstandings/black holes, we will create a new user for taskcluster-worker instead of using root or cltbld.
Updated•9 years ago
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Comment 1•9 years ago
Wander's got this mostly done in https://github.com/walac/build-puppet - I'll finish it up.
Assignee: wcosta → dustin
Comment 2•9 years ago
Here's what I have:
https://github.com/mozilla/build-puppet/compare/master...djmitche:bug1309197?diff=unified&expand=1&name=bug1309197
However, it's not starting the worker, either in my version or in wander's. Which reminds me how much fun launchd is.
In particular, I'm not sure what user the worker should be running as.
Comment 3•9 years ago
Callek, if you wouldn't mind running puppet on a few more t-yosemite testers in the 040-059 range, that'd be good. I'd like to see if any of them are running the worker.
Flags: needinfo?(bugspam.Callek)
Comment 4•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> Callek, if you wouldn't mind running puppet on a few more t-yosemite testers
> in the 040-059 range, that'd be good. I'd like to see if any of them are
> running the worker.
I ran a for loop with 'puppet agent --test' on each of those nodes; 041 timed out while connecting, though...
Flags: needinfo?(bugspam.Callek)
Comment 5•9 years ago
Thanks!
Looking on 0042, I see the worker running, as root:
root 1423 0.0 0.1 556691344 19372 ?? S 2Nov16 65:57.27 /usr/local/bin/taskcluster-worker daemon run /etc/taskcluster-worker.yml
so that answers that question! I rebooted the host to see if it will start the worker, as I can't see any differences in the launchd plist.
Comment 6•9 years ago
..and it's not starting automatically, so at least this isn't something I've broken :)
Comment 7•9 years ago
`launchctl load -w /Library/LaunchAgents/net.taskcluster.worker.plist` does seem to start the service..
Comment 8•9 years ago
A bit more investigation shows that autologin is not working. I have copied wander's secrets, so I'm not sure why this would be the case. It looks like the root password is set to something insecure - I can login as root via SSH with that insecure password. The /etc/kcpasswd file corresponds to that insecure password. The defaults are set correctly to login as root. I see that 0040 is logged in (there's a Finder process running as root) but on 0042 (which I have restarted but have not run against my puppet environment) and 0045 (which I have restarted and have run against my puppet environment) root is not logged in. So I wonder if the autologin never worked, and these were all logged into manually 28 days ago?
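For context, /etc/kcpassword is the autologin password obfuscated with a fixed, widely documented 11-byte XOR key; the puppet secret discussed later in this bug stores it base64-encoded. A minimal sketch of that encoding (the password below is a placeholder, not any real secret):

```python
import base64
import itertools

# Fixed 11-byte XOR key macOS uses for /etc/kcpassword (widely documented).
KEY = bytes([0x7D, 0x89, 0x52, 0x23, 0xD2, 0xBC, 0xDD, 0xEA, 0xA3, 0xB9, 0x1F])

def kcpassword_encode(password: str) -> bytes:
    """XOR the password against the cycled key, zero-padded to a multiple of 12."""
    raw = password.encode("utf-8")
    padded = raw + b"\x00" * (12 - len(raw) % 12)
    return bytes(b ^ k for b, k in zip(padded, itertools.cycle(KEY)))

def kcpassword_decode(blob: bytes) -> str:
    """XOR is symmetric, so decoding is the same operation plus stripping padding."""
    plain = bytes(b ^ k for b, k in zip(blob, itertools.cycle(KEY)))
    return plain.split(b"\x00", 1)[0].decode("utf-8")

# Placeholder password; the base64 form is the shape a secret like
# root_pw_kcpassword_base64 would hold.
blob = kcpassword_encode("placeholder")
print(base64.b64encode(blob).decode())
print(kcpassword_decode(blob))  # prints "placeholder"
```

Because the blob is deterministic given the password, comparing /etc/kcpassword against the encoding of the expected cleartext is a quick way to check whether the file matches the intended password.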
The worker is defined as a LaunchAgent, which means that it runs in the user context after user login, so it makes sense that with no user login, there is no running worker.
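To illustrate the LaunchAgent point, the plist loaded in comment 7 presumably looks roughly like the sketch below (generated with Python's plistlib; Label and the command line match the path and process seen in comment 5, while RunAtLoad/KeepAlive are assumptions about how such an agent is typically configured, not the verified production file):

```python
import plistlib

# Sketch of a LaunchAgent plist; Label and ProgramArguments come from this bug,
# RunAtLoad/KeepAlive are illustrative assumptions.
agent = {
    "Label": "net.taskcluster.worker",
    "ProgramArguments": [
        "/usr/local/bin/taskcluster-worker",
        "daemon",
        "run",
        "/etc/taskcluster-worker.yml",
    ],
    "RunAtLoad": True,  # start when the user session (autologin) begins
    "KeepAlive": True,  # relaunch the worker if it exits
}
data = plistlib.dumps(agent)  # serializes to the XML plist form launchd reads
print(data.decode())
```

Because the file lives in /Library/LaunchAgents rather than /Library/LaunchDaemons, launchd only starts it inside a logged-in user's session, which is why broken autologin means no running worker.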
So, I think I'm stuck here until Wander is back to provide some context.
Remaining to do:
* use a secure password
* get root autologin working, verify that worker starts on boot
* ensure mig runs in "checkin" mode -- probably best done in parallel with the puppet run
* change the provisionerId
Flags: needinfo?(wcosta)
Comment 9•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> A bit more investigation shows that autologin is not working. I have copied
> wander's secrets, so I'm not sure why this would be the case. It looks like
> the root password is set to something insecure - I can login as root via SSH
> with that insecure password. The /etc/kcpasswd file corresponds to that
> insecure password. The defaults are set correctly to login as root. I see
> that 0040 is logged in (there's a Finder process running as root) but on
> 0042 (which I have restarted but have not run against my puppet environment)
> and 0045 (which I have restarted and have run against my puppet environment)
> root is not logged in. So I wonder if the autologin never worked, and these
> were all logged into manually 28 days ago?
>
Autologin was working last time I checked, I will look into it.
> The worker is defined as a LaunchAgent, which means that it runs in the user
> context after user login, so it makes sense that with no user login, there
> is no running worker.
>
> So, I think I'm stuck here until Wander is back to provide some context.
>
> Remaining to do:
>
> * use a secure password
Well, I was given this password and was told to keep it while the machines are loaned.
> * get root autologin working, verify that worker starts on boot
The way I got it working was to manually configure autologin in System Preferences, then copy the kcpassword content in base64 format.
> * ensure mig runs in "checkin" mode -- probably best done in parallel with
> the puppet run
No idea what this means.
> * change the provisionerId
What's the correct provisionerId?
Flags: needinfo?(wcosta)
Comment 10•9 years ago
From Jonas in email, regarding provisionerId:
---
I like the idea of a provisionerId:
fixed-hardware
or
mozilla-scl3
or
data-center
Ideally, the provisionerId is specific to the group of workerTypes that will be configured by the same people.
So we can grant a set of people scopes like queue:worker-type:mozilla-scl3/*, and then that group of people
can create the roles and credentials for those workers.
---
In light of the last paragraph, I think `releng-scl3` is probably a good choice.
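Taskcluster scopes use trailing-asterisk prefix matching, so a grant like queue:worker-type:releng-scl3/* covers every workerType under that provisionerId. A hypothetical helper sketching that check:

```python
def scope_satisfies(granted: str, required: str) -> bool:
    """Taskcluster-style check: a granted scope ending in '*' satisfies
    any required scope sharing that prefix; otherwise exact match."""
    if granted.endswith("*"):
        return required.startswith(granted[:-1])
    return granted == required

# A releng-scl3 grant covers any workerType under that provisionerId:
print(scope_satisfies("queue:worker-type:releng-scl3/*",
                      "queue:worker-type:releng-scl3/t-yosemite"))  # True
# ...but not workerTypes under a different provisionerId:
print(scope_satisfies("queue:worker-type:releng-scl3/*",
                      "queue:worker-type:aws-provisioner-v1/b-linux"))  # False
```

This is why grouping workerTypes managed by the same people under one provisionerId makes delegation simple: one wildcard scope covers the whole group.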
Comment 11•9 years ago
Attachment #8817546 - Flags: review?(wcosta)
Comment 12•9 years ago
I'd like to land that as-is (minus the environment pinning), and do the mig and provisionerId changes in subsequent patches.
Comment 13•9 years ago
Comment on attachment 8817546 [details] [review]
https://github.com/mozilla/build-puppet/pull/21
I left a couple of comments in the PR, but it looks great overall!
Attachment #8817546 - Flags: review?(wcosta) → review+
Comment hidden (mozreview-request)
Comment 15•9 years ago
So, Wander and I finally found a few minutes to sit down and work on this, and .. it just worked. We took a few otherwise-untouched machines and ran them against my (unchanged) puppet environment, then rebooted, and the autologin occurred. *BOGGLE*
Once Callek double-checks this and lands it (I'm not on the whitelist anymore), I'll adjust the production secrets and reboot all of the machines into the production environment. Assuming they all successfully start taskcluster-worker, I'll try reimaging a few to make sure that still works. If so, then I'll move on to the to-do items in comment 8.
Comment 16•9 years ago
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta
https://reviewboard.mozilla.org/r/97804/#review98156
::: modules/users/manifests/builder/setup.pp
(Diff revision 1)
> - class {
> - 'disableservices::user':
> - username => $username,
> - group => $group,
> - home => $home;
> - }
Ok, I don't *think* we can move this to autologin alone...
https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.
However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.
I think we may want to disable the services some other way to support the root autologin needs.
Attachment #8817555 - Flags: review?(bugspam.Callek) → review-
Comment 17•9 years ago
I think the disableservices::user is just for disabling services relevant to a user login. I can look in more detail to verify that. Alternately, we could just allow this to not be parallel between root autologin and builder login. Thanks -- I'll redraft or reply :)
Comment 18•9 years ago
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta
https://reviewboard.mozilla.org/r/97804/#review98156
> Ok, I don't *think* we can move this to autologin alone...
>
> https://dxr.mozilla.org/build-central/search?q=path%3Apuppet+autologin&redirect=false shows we are using autologin only from "slave" style systems.
>
> However in https://dxr.mozilla.org/build-central/source/puppet/modules/aws_manager/manifests/cron.pp#7 at the least (there are other server class systems using users::builder too) we still have that builder setup.
>
> I think we may want to disable the services some other way to support the root autologin needs.
...actually, looking again, I think the only server class this would affect is OSX signing. And there is no users::builder on a signing server, so it's likely this is indeed fine, as long as we don't add more OSX server classes.
I'll leave it to you to confirm what I'm seeing before we 'drop' the issue. But given the confusion, I think we should only land when more of us are around to watch for fallout.
Comment 19•9 years ago
Comment on attachment 8817555 [details]
Bug 1309197: Add taskcluster-worker support; p=wcosta
https://reviewboard.mozilla.org/r/97804/#review98160
Attachment #8817555 - Flags: review- → review+
Comment 20•9 years ago
The disableservices::user is all user-login-specific (screensaver, etc.) so I think if the user isn't logging in, it would have no effect. But I agree regarding the caution and I'll wait until next week to land.
Updated•9 years ago
Attachment #8817555 - Flags: review?(dustin)
Comment 21•9 years ago
Rail landed this about 35 minutes ago.
Comment 22•9 years ago
Wander, I ran this on 0044, after setting root_pw_kcpassword_base64!low-security to (I think) the appropriate value based on root_pw_cleartext!low-security.
It didn't work - no autologin.
Can you try logging in via VNC and setting the autologin password via UI, then checking if the resulting /etc/kcpassword is different? If not, can you figure out why it's not doing the autologin? I don't seem to have the touch :(
Flags: needinfo?(wcosta)
Comment 23•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Wander, I ran this on 0044, after setting
> root_pw_kcpassword_base64!low-security to (I think) the appropriate value
> based on root_pw_cleartext!low-security
>
> It didn't work - no autologin.
>
> Can you try logging in via VNC and setting the autologin password via UI,
> then checking if the resulting /etc/kcpassword is different? If not, can
> you figure out why it's not doing the autologin? I don't seem to have the
> touch :(
ok, sure
Flags: needinfo?(wcosta)
Comment 24•9 years ago
Wander and Alin are working on this. It turns out I wasn't much help after all :)
Updated•9 years ago
Attachment #8817546 - Attachment is obsolete: true
Updated•9 years ago
Assignee: dustin → wcosta
Comment 25•9 years ago
I tried the bless-and-reboot trick on 00{40,42,43,44} and in all cases they are up and running tests with no further attention. There are some errors that cause the puppetize run to retry puppet a few times, but it converges eventually. So, I think this is actually done? VNC doesn't work, but from my perspective that is normal.
Comment 26•9 years ago
same on 00{46,47,48,49,50,51,52,53,54,55,56,57,58,59}
not accessible via SSH:
0041
0045
Alin, can you try to resuscitate those two?
Flags: needinfo?(aselagea)
Updated•9 years ago
Comment 27•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #26)
> not accessible via SSH:
> 0041
> 0045
>
> Alin, can you try to resuscitate those two?
Both are unreachable and will need intervention from DCOps to bring them back online. Added those bugs as dependencies to this one.
Comment 28•9 years ago
macosx-engine is deprecated; switch to the native engine.
Also add reboot plugin support.
Comment 29•9 years ago
This version switches to the native engine and adds the reboot plugin.
Updated•9 years ago
Attachment #8829877 - Flags: review?(dustin)
Updated•9 years ago
Attachment #8829878 - Flags: review?(dustin)
Updated•9 years ago
Attachment #8829877 - Flags: review?(dustin) → review+
Comment 30•9 years ago
Comment on attachment 8829878 [details] [diff] [review]
upgrade taskcluster-worker version to 0.0.7 r=dustin
Review of attachment 8829878 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
@@ +23,3 @@
> logLevel: info
> plugins:
> + disabled: ['interactive', 'maxruntime']
why disable maxruntime?
@@ +30,5 @@
> + LANG: 'en_US.UTF-8'
> + LC_ALL: 'en_US.UTF-8'
> + XPC_FLAGS: '0x0'
> + XPC_SERVICE_NAME: '0'
> + IDLEIZER_DISABLE_SHUTDOWN: 'true'
are we using idleizer??
Attachment #8829878 - Flags: review?(dustin) → review+
Comment 31•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #30)
> Comment on attachment 8829878 [details] [diff] [review]
> upgrade taskcluster-worker version to 0.0.7 r=dustin
>
> Review of attachment 8829878 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: modules/taskcluster_worker/templates/taskcluster-worker.yml.erb
> @@ +23,3 @@
> > logLevel: info
> > plugins:
> > + disabled: ['interactive', 'maxruntime']
>
> why disable maxruntime?
>
I just don't want to mess with timeouts atm.
> @@ +30,5 @@
> > + LANG: 'en_US.UTF-8'
> > + LC_ALL: 'en_US.UTF-8'
> > + XPC_FLAGS: '0x0'
> > + XPC_SERVICE_NAME: '0'
> > + IDLEIZER_DISABLE_SHUTDOWN: 'true'
>
> are we using idleizer??
I have no idea what this is; I am just mirroring the buildbot config.
Comment 32•9 years ago
Updated•9 years ago
Comment 33•9 years ago
Comment 34•9 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #31)
> > are we using idleizer??
>
> I have no idea what this is, I am just mirroring the buildbot config.
idleizer is some code in buildbot itself that automatically reboots slaves when they are idle or not connected to a master. It's very BB-specific, so you can drop this env var next time you're patching this file - but it doesn't hurt anything.
Comment 35•9 years ago
(bugherder)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Updated•7 years ago
Component: Worker → Workers