Closed Bug 1134223 Opened 10 years ago Closed 10 years ago

reimage 23 10.8 machines as 10.10

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: dividehex)

References

Details

Attachments

(3 files)

I have patches in Bug 1126493 to enable 10.10 opt tests on trunk and at the same time disable 10.8 opt tests on the same branches. To enable this, I'd like to reimage some 10.8 machines as 10.10 and change the hostnames etc. Let's start with 23 machines so we have a total of 30 10.10 machines (currently have seven up). Let me know when you're ready to do this and I'll disable the 10.8 machines in slavealloc etc.
Blocks: 1126493
how about talos-mtnlion-r5-077 -> talos-mtnlion-r5-100
I've updated https://docs.google.com/a/mozilla.com/spreadsheets/d/1o4C9aUDmyIwn7VgAxur2_GIlOwx7f92hii_oL8lnjMU/edit#gid=66084899 Jake, can you modify nagios (rename, comment things out so nagios doesn't break when DNS changes), inventory (don't forget the CNAMEs), and get them rebuilt and added back to nagios?
Assignee: relops → jwatkins
jake: please ping me before you reimage them so I can disable them in slavealloc and we don't burn any jobs
Should we move to 10.10.2 base image before mass reimaging? The current base image is 10.10.0 https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=915041082&mt=12
If we think that some of the problems are caused by bugs they've released patches for, this is probably a good idea. Any .0 release is probably a bad idea, anyway.
We still run the original 10.8 release on the talos-mtnlion-r5-* slaves. But I think upgrading to the latest image shouldn't be a problem. Once the image is ready, I could image the existing 10.10 slaves and run the tests again on try to ensure there aren't any unexpected failures.
Ok. Sounds like a good plan. I'll build the new base image for 10.10.2 and ping :kmoir when ready to reimage some of the current yosemite systems
Kim, I have a new image captured. Do you have a host ready to be reimaged?
Flags: needinfo?(kmoir)
You can use t-yosemite-r5-0001
Flags: needinfo?(kmoir)
Depends on: 1134898
(In reply to Kim Moir [:kmoir] from comment #9)
> You can use t-yosemite-r5-0001
I reimaged t-yosemite-r5-0001 and it looks ok from an initial glance. Let me know if/when I can reimage the rest of the *current* yosemite machines.
I'll reenable 0001, disable the other 6, and push to try so it can have a shot at ALL THE TESTS once the current tree-closing-infra goes away.
Which would have been a fine plan, once I realized it would be faster just to retrigger everything on https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd8f5061d87f instead of pushing again, except that 0001 isn't actually taking any jobs.
It looks like the puppet run on t-yosemite-r5-0001 isn't finishing and thus buildbot isn't getting started. I'm investigating.
So this is what is happening if I try to run puppet manually on t-yosemite-r5-0001:
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com var]# puppet agent --test --server=releng-puppet1.srv.releng.scl3.mozilla.com
2015-02-20 07:17:44.717 system_profiler[557:4261] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:44.719 system_profiler[557:4261] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.065 system_profiler[564:4293] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.066 system_profiler[564:4293] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.445 system_profiler[573:4332] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.446 system_profiler[573:4332] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com var]# puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com
2015-02-20 07:18:01.099 system_profiler[588:4404] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.100 system_profiler[588:4404] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.463 system_profiler[594:4434] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.464 system_profiler[594:4434] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.844 system_profiler[603:4476] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.845 system_profiler[603:4476] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
If I run this command on the (10.10) yosemite hosts, it works fine. Is the new image missing the deploy password or something similar? The machine is missing the certs in /var/lib/puppet/ssl/certs
Flags: needinfo?(jwatkins)
Depends on: 1134790
Looks like it is getting hung up on installing this java developer package because of an untrusted cert. The package is pretty old and more than likely contains an expired cert. Our options are to upgrade the package to a current version or strip the cert from the package. Is the package even still needed? If not, we could just remove it.
Fri Feb 20 17:12:30 -0800 2015 Puppet (err): Execution of '/usr/sbin/installer -pkg /private/tmp/dmg.F9uOTD/JavaDeveloper.pkg -target /' returned 1: installer: Package name is Java Developer Package
installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
Fri Feb 20 17:12:30 -0800 2015 /Stage[main]/Packages::Javadeveloper_for_os_x/Packages::Pkgdmg[javadeveloper_for_os_x_2012003__11m3646]/Package[javadeveloper_for_os_x_2012003__11m3646.dmg]/ensure (err): change from absent to present failed: Execution of '/usr/sbin/installer -pkg /private/tmp/dmg.F9uOTD/JavaDeveloper.pkg -target /' returned 1: installer: Package name is Java Developer Package
installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
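For anyone checking other hosts for the same symptom, a minimal sketch that looks for the installer message quoted above in whatever log text is piped in. The function name is hypothetical, and the only thing it relies on is the exact error string from the log.

```shell
#!/bin/sh
# Sketch only: detect the untrusted-package error quoted in this comment.
# has_untrusted_pkg_error is an illustrative name, not an existing tool.
# It reads log text on stdin and exits 0 if the error string is present.
has_untrusted_pkg_error() {
    grep -qF "Certificate used to sign package is not trusted"
}
```

It could be fed from a saved puppet log, for example `has_untrusted_pkg_error < puppet.log && echo "untrusted package cert"`, where puppet.log stands for wherever the output was captured.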
Flags: needinfo?(jwatkins)
The trail of tears for installing Java is bug 790206 and bug 705570 (and a random assortment of weird dependencies). The short version is that there are two answers, "well, try not installing it and see whether we leak in crashtests and a couple of mochitest chunks, and if we don't good enough" and "yes, even if we don't still leak without it installed we still want to have it installed because we do have some Java-specific tests, which we do want to run to ensure we don't break those users who are stuck still having to install Java on their Macs." The latter is the much more responsible of the two answers.
Regarding comment #2. Yes, I saw the java error message before. This isn't the root cause, however, because in my staging master I disabled the java package in the manifest and ran puppet manually on t-yosemite-r5-0001 and the same issue occurred with the certificate. I ran it in debug mode and there was nothing about the package, just the same certificate error as above. If I look at the server logs (releng-puppet2.srv.releng.scl3.mozilla.com) there isn't anything about connecting to t-yosemite-r5-0001 today.
The cert errors in comment 14 are because the puppet agent can't find its certificates. On OS X, the default location for puppet's SSL stuff is wrong, so until we install puppet.conf (via a successful puppet run), we have to specify --ssldir=/var/lib/puppet/ssl. You'll see this in puppetize.sh, for example. So I'm guessing that puppet run would have succeeded -- or at least given a more helpful error message -- with that additional option.
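A minimal sketch of the manual invocation with that extra option, for illustration. The function name and argument order are made up here; the flags are the ones already quoted in this bug, and the command is only printed, not run.

```shell
#!/bin/sh
# Sketch only: print the manual puppet command for a host that has not yet
# had puppet.conf installed, so the SSL directory must be given explicitly.
SSLDIR="/var/lib/puppet/ssl"

# puppet_first_run_cmd MASTER [ENVIRONMENT]
# Hypothetical helper; ENVIRONMENT is optional (e.g. a user environment).
puppet_first_run_cmd() {
    master="$1"
    environment="${2:-}"
    cmd="puppet agent --test --server=${master} --ssldir=${SSLDIR}"
    if [ -n "$environment" ]; then
        cmd="${cmd} --environment=${environment}"
    fi
    printf '%s\n' "$cmd"
}

puppet_first_run_cmd releng-puppet2.srv.releng.scl3.mozilla.com
```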
No, I did try this and the same error message resulted:
puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir --pluginsync --ssldir=/var/lib/puppet/ssl
No ca.pem
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com ~]# puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir --pluginsync --ssldir=/var/lib/puppet/ssl
2015-02-23 09:14:37.387 system_profiler[616:4804] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.389 system_profiler[616:4804] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.753 system_profiler[622:4835] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.755 system_profiler[622:4835] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:38.145 system_profiler[631:4876] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:38.146 system_profiler[631:4876] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com ~]# ls -lR /var/lib/puppet/ssl
total 0
drwxr-xr-x 2 puppet puppet 68 20 Feb 17:10 certificate_requests
drwxr-xr-x 2 puppet puppet 68 20 Feb 20:10 certs
drwxr-x--- 2 puppet puppet 68 20 Feb 17:10 private
drwxr-x--- 3 puppet puppet 102 23 Feb 08:52 private_keys
drwxr-xr-x 3 puppet puppet 102 20 Feb 20:11 public_keys
/var/lib/puppet/ssl/certificate_requests:
/var/lib/puppet/ssl/certs:
/var/lib/puppet/ssl/private:
/var/lib/puppet/ssl/private_keys:
total 8
-rw-r----- 1 puppet puppet 3243 23 Feb 08:52 t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
/var/lib/puppet/ssl/public_keys:
total 8
-rw-r--r-- 1 puppet puppet 800 23 Feb 08:52 t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
:kmoir, did you happen to reboot the machine and then try puppet manually? If the system gets rebooted before puppet finishes its first run, the launchd entry will still be in place to run puppetize.sh again on start. The problem is that this time it doesn't have the deploypass on disk, since it wiped it on the first successful attempt at getting a cert installed. On the second attempt, puppetize.sh will wipe the certs and try again, but without the deploypass it hangs.
I'm downloading the latest package from apple but it is concerning that the latest version is from Oct 15, 2013
Odd, there should be a 'ca.pem' in certs. Odd in particular that one of the three files is missing! I re-ran puppetize.sh (after applying the fix in bug 1132198), then ran puppet again. It gave me lots of errors about files being forbidden, which suggests incorrect filesystem permissions on your environment on the puppetmaster.
Error: /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin/concatfragments.rb]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/concat/concatfragments.rb: Error 403 on SERVER: Forbidden request: t-yosemite-r5-0001.test.releng.scl3.mozilla.com(10.26.56.16) access to /file_metadata/modules/concat/concatfragments.rb [find] at :119
I'm not sure what would be causing that -- the filesystem permissions actually look correct. And somehow the SSL files disappeared again. I think that might have been Kim and me stepping on each other's feet.
Blocks: 1135751
I don't understand why the ssl certs are disappearing. If I run puppetize.sh on the machine, it runs and then seems to delete the certs if it is pinned to my user env. I've never seen this before. In any case, if I change the machine back so it synchronizes with a production master, puppet seems to continue to work but complains about that java package not satisfying dependencies. Puppet continues to cycle and does not start buildbot. I'll attach a patch so that we don't install the java package on 10.10 (temporarily) and see if this addresses the issue with the puppetizing process not completing in production. If this fixes the issue, I'll look at installing the new java package.
Attached patch bug1134223.patch
Attachment #8568109 - Flags: review?(jwatkins)
Attachment #8568109 - Flags: review?(jwatkins) → review+
If excluding java dev breaks things, here is a patch to install the new package on 10.10 only and still maintain the old package for the other OS X versions.
Attachment #8568112 - Flags: review?(kmoir)
The most likely theory should be "not having Java won't break things, it will just cause us to silently not run the tests that depend on Java on 10.10 and thus allow us to break it without knowing."
Attachment #8568112 - Flags: review?(kmoir) → review+
So I tried the patch to remove java and see if puppet eventually stopped running. It seemed that it lost the certificates again and I had to run puppetize.sh. I reverted the patch and am interested to see the result when we try the new package in dividehex's patch.
So the new java package is installed correctly. However, I had to run puppetize.sh manually because the ca.pem was missing again. Could we try to reimage this machine with 10.10.2 and rerun to see if this issue can be replicated? i.e. reimage it and let it puppetize again. I'm concerned that the machine isn't in a good state now, because we have been fiddling with it so much. (Or if you have the command to reimage it via netboot, I could do it :-) I don't know the address.)
(In reply to Kim Moir [:kmoir] from comment #27)
The command to reimage is the same across all the test machines:
/usr/sbin/bless --netboot --server bsdp://10.26.56.110; reboot
I've done that on t-yosemite-r5-0001, so we'll see how it fares.
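A hedged sketch of wrapping that command for more than one host. The ssh wrapper, the root login, and the function name are assumptions for illustration; the bless invocation and server address are the ones quoted in this comment. It only prints the commands so they can be reviewed before anything is rebooted.

```shell
#!/bin/sh
# Sketch only: print (do not run) the netboot reimage command per host.
NETBOOT_SERVER="bsdp://10.26.56.110"

# reimage_cmd HOST
# Hypothetical helper; assumes root ssh access to the test machine.
reimage_cmd() {
    host="$1"
    printf "ssh root@%s '/usr/sbin/bless --netboot --server %s; reboot'\n" \
        "$host" "$NETBOOT_SERVER"
}

for host in t-yosemite-r5-0001; do
    reimage_cmd "$host"
done
```

Printing first makes it harder to reboot a machine that has not been disabled in slavealloc yet.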
I reinstalled it and it seems to have come up fine. I re-ran puppet a few times to make sure it wasn't deleting the pem files. I do note that it's spitting out (what appear to be inconsequential) errors from system_profiler whenever we run puppet, though, so that should be fixed before we put this in production:
system_profiler[2642:12110] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
I then pinned it to your env, kim:
puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir -v --no-splay --no-daemonize
Notice: /File[/var/lib/puppet/lib/puppet]/ensure: removed
Notice: /File[/var/lib/puppet/lib/puppet_x]/ensure: removed
Notice: /File[/var/lib/puppet/lib/facter]/ensure: removed
But all of the cert stuff is still there on disk.
/var/lib/puppet/ssl//certs/ca.pem
/var/lib/puppet/ssl//certs/t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
/var/lib/puppet/ssl//private_keys/t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
And subsequent runs went fine.
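To make the "are the pem files still there" check repeatable, here is a minimal sketch that reports which of the three files listed above are missing. The function name is hypothetical; the relative paths are taken from the listing in this comment.

```shell
#!/bin/sh
# Sketch only: list which of the three expected puppet SSL files are absent.
# missing_ssl_files SSLDIR FQDN   (illustrative helper, not an existing tool)
missing_ssl_files() {
    ssldir="$1"
    fqdn="$2"
    for f in "certs/ca.pem" "certs/${fqdn}.pem" "private_keys/${fqdn}.pem"; do
        if [ ! -f "${ssldir}/${f}" ]; then
            printf '%s\n' "${ssldir}/${f}"
        fi
    done
}
```

On a slave it could be run after each puppet run as `missing_ssl_files /var/lib/puppet/ssl t-yosemite-r5-0001.test.releng.scl3.mozilla.com`; no output means all three files are present.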
https://tickets.puppetlabs.com/browse/FACT-724 for the complaining about system_profiler (dustin has been commenting there, I see)
Thanks Jake! Running tests on it now on try. I've disabled the other yosemite slaves temporarily in slavealloc so all the tests will run on the 10.10.2 slave.
Unfortunately your trychooser syntax was missing a "-u" so you just got 10.6 tests. 10.10 running in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3b34031784f
thanks philor
I reimaged the other yosemite slaves as 10.10.2
talos-mtnlion-r5-087 was already migrated to t-yosemite-r5-0003. I've removed it from the list and shifted the remaining up in line. This will give us exactly 23 new OSX 10.10 slaves for a total of 30.
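The arithmetic can be checked with a small sketch: enumerate talos-mtnlion-r5-077 through 100, skip 087, and add the seven 10.10 machines mentioned in the original request. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: enumerate the 10.8 hosts to be reimaged and count them.
# list_mtnlion_hosts FIRST LAST SKIP   (illustrative helper)
list_mtnlion_hosts() {
    i="$1"
    last="$2"
    skip="$3"
    while [ "$i" -le "$last" ]; do
        if [ "$i" -ne "$skip" ]; then
            printf 'talos-mtnlion-r5-%03d\n' "$i"
        fi
        i=$((i + 1))
    done
}

# 77..100 is 24 hosts; dropping 087 leaves 23.
count=$(list_mtnlion_hosts 77 100 87 | wc -l | tr -d ' ')
echo "to reimage: ${count}, total 10.10 after reimage: $((count + 7))"
```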
Just to make sure we really do have bustage from the newly reimaged slaves like it appears we do, and not just newly introduced bustage that happened at a very inconvenient time, I retriggered the debug mochitest-2 and browser-chrome-3 in https://treeherder.mozilla.org/#/jobs?repo=try&revision=83c654620c54, my last clear and clean and certain parent revision.
Sigh. Not 10.10.2 bustage, just someone sneaking in code bustage while we didn't have any slaves enabled and I wasn't pushing every m-c merge to try. Does the fun ever start?
talos-mtnlion-r5-077-100 (excluding 87) have been disabled. Once the tests are finished on these machines, Jake will update inventory and we can reimage them.
Noticed amy's spreadsheet renamed them as yosemite 29-52 instead of the earlier range I had entered for the machine we didn't get
Attachment #8569996 - Flags: review?(jwatkins)
Attachment #8569996 - Flags: review?(jwatkins) → review+
Nagios and inventory have been updated. I've also deleted the previous host entries in deploystudio and the yosemite group as default. We can start reimaging.
Attachment #8569996 - Flags: checked-in+
First batch has completed reimaging and puppetizing (t-yosemite-r5-{0029..0035}). Here are the batches that still need to be reimaged:
0036..0040
0041..0045
0046..0051
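A minimal sketch that expands the batch ranges above into hostnames and counts them, as a cross-check that the four batches add up to the 23 machines. The function name is hypothetical; the ranges are the ones listed in this comment.

```shell
#!/bin/sh
# Sketch only: expand a numeric range into t-yosemite-r5-NNNN hostnames.
# yosemite_batch FIRST LAST   (illustrative helper)
yosemite_batch() {
    n="$1"
    end="$2"
    while [ "$n" -le "$end" ]; do
        printf 't-yosemite-r5-%04d\n' "$n"
        n=$((n + 1))
    done
}

total=0
for range in "29 35" "36 40" "41 45" "46 51"; do
    first="${range% *}"
    last="${range#* }"
    size=$(yosemite_batch "$first" "$last" | wc -l | tr -d ' ')
    echo "batch ${first}..${last}: ${size} hosts"
    total=$((total + size))
done
echo "total: ${total} hosts"
```

The batch sizes are 7, 5, 5 and 6, which sum to 23.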
0036..0040 is done
0041..0045 is done except for 0042. It seems to be stuck on the first task of the workflow.
Depends on: 1137440
0046..0051 is done
:kmoir, looks like t-yosemite-r5-0042 got some attention and took the reimage ok. I'm going to go ahead and r/f this.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED