Closed Bug 891881 Opened 9 years ago Closed 9 years ago

Support 10.6 Talos with PuppetAgain

Categories

(Infrastructure & Operations :: RelOps: Puppet, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: coop)

Attachments

(4 files)

No description provided.
Depends on: 882869
Component: Server Operations: RelEng → RelOps: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → dustin
This is currently blocked on getting a clean 10.6 install, which is blocked on some particularly typical Apple insanity -- in particular, there are no retail versions of OS X that will install on this hardware, so we need to find restore DVDs.
Depends on: 894988
Duplicate of this bug: 873186
Depends on: 898007
OK, I think this is about ready, except that we don't have a means in place to image 10.6 systems with DeployStudio - bug 894988.

The system log is full of

Jul 29 08:48:17 r4-mini-001 edu.mit.Kerberos.krb5kdc[30221]: Can't get profile to fetch realms
Jul 29 08:48:17 r4-mini-001 com.apple.launchd[1] (edu.mit.Kerberos.krb5kdc[30221]): Exited with exit code: 1
Jul 29 08:48:17 r4-mini-001 com.apple.launchd[1] (edu.mit.Kerberos.krb5kdc): Throttling respawn: Will start in 10 seconds

but I'm hoping, based on some Googling, that it's an artifact of the imaging process (Casper, in this case) and not a problem with Puppet.
Attached patch bug891881.patch
Attachment #782608 - Flags: review?(coop)
Comment on attachment 782608
bug891881.patch

Review of attachment 782608:
-----------------------------------------------------------------

::: modules/users/manifests/signer/account.pp
@@ +35,5 @@
>              # relevant fixes.
>              # NOTE: this user is *not* an Administrator.  All admin-level access is granted via sudoers.
>              case $::macosx_productversion_major {
> +                '10.6': {
> +                    if (secret("signer_pw_paddedsha1") == '') {

Are we signing on 10.6 now?
Attachment #782608 - Flags: review?(coop) → review+
Yeah, we've been signing on 10.6 since we moved onto the r4 minis about a year and a half ago (bug 729077).
Even if we weren't, it's the right thing to do to update that manifest to support 10.6 -- puppetagain should not artificially limit itself to only the (OS, purpose) tuples that we use.
Comment on attachment 782608
bug891881.patch

https://hg.mozilla.org/build/puppet/rev/c93c03a7c848

(minus the manifests change, and the unused $service in vnc::init)
Attachment #782608 - Flags: checked-in+
OK, just waiting on the capacity to image puppetagain hosts with 10.6.
Depends on: 901534
Minor patch to avoid running screenresolution on every puppet run
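
The patch itself is not quoted in this bug, so the following is only a sketch of the general guard it describes: query the current state first, and only act when it differs from the desired state. The function name, the two callables, and the mode strings are all hypothetical stand-ins, not the real screenresolution interface or the actual Puppet change.

```python
def ensure_resolution(desired, get_current, set_resolution):
    """Change the display mode only when it differs from `desired`.

    `get_current` and `set_resolution` are hypothetical callables standing
    in for whatever actually queries and changes the mode.
    Returns True if a change was made, False if nothing needed doing.
    """
    if get_current() == desired:
        return False
    set_resolution(desired)
    return True


# Demo with fake callables; the mode strings are made up for illustration.
state = {"mode": "1024x768x32"}
calls = []


def fake_get():
    return state["mode"]


def fake_set(mode):
    calls.append(mode)
    state["mode"] = mode


ensure_resolution("1600x1200x32", fake_get, fake_set)  # changes the mode
ensure_resolution("1600x1200x32", fake_get, fake_set)  # no-op on the second run
print(calls)
```

In a Puppet manifest the same idea would normally be expressed as a guard condition on the resource rather than in Python.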
Attachment #785968 - Flags: review?(coop)
Attachment #785968 - Flags: review?(coop) → review+
I think we're ready to deploy this - 10.6 support in DeployStudio is set up IIRC.

Coop, do you want to do the deployment, or is it Armen's turn to have all the fun?
Flags: needinfo?(coop)
I'm happy to continue working through this.

Dustin: just to confirm the ask here, you're ready for me to try netboot-ing a 10.6 slave and then run it through staging?
Flags: needinfo?(coop) → needinfo?(dustin)
I'll do the netbooting initially, but yes.  Which should I reimage?
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> I'll do the netbooting initially, but yes.  Which should I reimage?

I've set aside talos-r4-snow-001. 

It's enabled in slavealloc and pointed at my test master, so it *should* end up in the correct state at the end of a netboot.
This is reimaged, but I forgot to add the node definitions until just now, so it's looping in puppetize.sh waiting.  It should go forward shortly.  I'll keep an eye on it.
OK, it's up, but it looks like its basedir is wrong:

cltbld     748   0.0  0.1  2460596   6652   ??  S     7:33AM   0:00.75 /tools/buildbot-0.8.4-pre-moz2/bin/python2.7 /tools/buildbot/bin/twistd --no_save --logfile /Users/cltbld/talos-slave/twistd.log --python /Users/cltbld/talos-slave/buildbot.tac

Can you fix that and reboot and see how it goes?

I'm leaving this afternoon for the rest of the week, but Amy or other relops folks can certainly reimage more hosts for you.  I still have my name on talos-r4-snow-079, so we can reimage that one too if you'd like more parallelism.
There seems to be some problem with sudo as well:

[cltbld@talos-r4-snow-001.build.scl1.mozilla.com talos-slave]$ sudo reboot
sudo: unknown defaults entry `umask_override'

...and then it prompts me for a password.
(In reply to Chris Cooper [:coop] from comment #17)
> ...and then it prompts me for a password.

...and then tells me:

cltbld is not in the sudoers file.  This incident will be reported.
Looks like snow leopard's sudoers doesn't support the #include syntax :(
[cltbld@talos-r4-snow-001.build.scl1.mozilla.com ~]$ sudo reboot
sudo: can't stat /etc/sudoers.d/*: No such file or directory
Segmentation fault

sudo is not an app I like to see segfaulting!
Attached patch bug891881.patch
Attachment #793600 - Flags: review?(coop)
Attachment #793600 - Flags: review?(coop) → review+
tested OK on:
 lion
 mtnlion
 ubuntu
 centos

So I'll land this on Monday unless someone's willing to land it earlier.
We still get the "sudo: unknown defaults entry `umask_override'" message, but I can 'sudo reboot' now as cltbld. Tested on talos-r4-snow-001.
Oh, I completely forgot about that once I saw things working.  I'll get a patch together.
talos-r4-snow-001 is connected to my dev-master and is running some tests now.
This converts the sudoers base files to a template, stripping the comments, and omits the umask_override on OS X 10.6.

tested OK on
  mtnlion
  centos
  ubuntu
  snow
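
For readers without the attachment, here is a rough sketch of the logic described above: drop comments from a base sudoers file, and leave out the umask_override line on 10.6. The real change is a Puppet template; this Python version only mimics the behaviour. The function name, the sample file contents, and the way the OS version is passed in are assumptions.

```python
def render_sudoers(base_text, os_major):
    """Return base_text without comments or blank lines.

    On OS X 10.6 (os_major == "10.6"), also drop any line that mentions
    umask_override, since sudo there rejects it as an unknown entry.
    """
    kept = []
    for line in base_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # "#include"/"#includedir" are directives, not comments, so keep them.
        is_directive = stripped.startswith("#include")
        if stripped.startswith("#") and not is_directive:
            continue
        if os_major == "10.6" and "umask_override" in stripped:
            continue
        kept.append(stripped)
    return "\n".join(kept) + "\n"


# Hypothetical base file, not the one in the puppet repository.
SAMPLE = """\
# base sudoers settings
Defaults env_reset
Defaults umask_override

root ALL=(ALL) ALL
"""

print(render_sudoers(SAMPLE, "10.6"))
```

Note that this sketch drops the whole line containing umask_override, which would be too coarse if the option shared a Defaults line with other settings.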
Attachment #795539 - Flags: review?(coop)
Comment on attachment 795539
bug891881-sudoers-2.patch

Review of attachment 795539:
-----------------------------------------------------------------

Yay for removing code.
Attachment #795539 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #26)
> talos-r4-snow-001 is connected to my dev-master and is running some tests
> now.

All tests passed in staging.

Dustin: should I start setting up batches of 10.6 machines for netbooting?
Sure, that'd be great.  I'll get them in the right deploystudio category, if they're not already.
Attachment #795539 - Flags: checked-in+
Attachment #785968 - Flags: checked-in+
All talos-r4-snow-* are now in the deploystudio group that will install them with puppetagain.  They only need to be blessed and rebooted.
OK, I will get the first batch set up for netbooting shortly.
Batches:

1. 002-027
2. 028-056 (41 doesn't exist)
3. 057-083 (81 doesn't exist)

Setting up batch 1 to netboot now.
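
The batch ranges above can be expanded into host names with a short helper like this. It is only an illustrative sketch: the function and constant names are made up, and the name pattern is inferred from hosts mentioned earlier in this bug (e.g. talos-r4-snow-001).

```python
# Inclusive host-number ranges for each batch, as listed above.
BATCHES = {1: (2, 27), 2: (28, 56), 3: (57, 83)}

# Host numbers noted above as not existing.
MISSING = {41, 81}


def batch_hosts(batch):
    """Return the host names in a batch, skipping numbers that do not exist."""
    first, last = BATCHES[batch]
    return [
        "talos-r4-snow-%03d" % n
        for n in range(first, last + 1)
        if n not in MISSING
    ]


for number in sorted(BATCHES):
    hosts = batch_hosts(number)
    print(number, len(hosts), hosts[0], hosts[-1])
```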
Batch 1 has had its basedirs updated in slavealloc and has been marked for netboot. The only idle slave was 026, so I rebooted that one manually.
Batch 2 has had basedirs updated in slavealloc and has been marked for netboot.
Assignee: dustin → coop
Batch 3 has had basedirs updated in slavealloc and has been marked for netboot.
Status: NEW → ASSIGNED
Priority: -- → P2
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Duplicate of this bug: 907794