Closed
Bug 1134223
Opened 10 years ago
Closed 10 years ago
reimage 23 10.8 machines as 10.10
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Assigned: dividehex)
References
Details
Attachments
(3 files)
1.02 KB, patch | dividehex: review+
2.26 KB, patch | kmoir: review+, dividehex: checked-in+
716 bytes, patch | dividehex: review+, kmoir: checked-in+
I have patches in Bug 1126493 to enable 10.10 opt tests on trunk and at the same time disable 10.8 opt tests on the same branches. To enable this, I'd like to reimage some 10.8 machines as 10.10 and change the hostnames etc. Let's start with 23 machines so we have a total of 30 10.10 machines (currently have seven up). Let me know when you're ready to do this and I'll disable the 10.8 machines in slavealloc etc.
Reporter
Comment 1•10 years ago
how about talos-mtnlion-r5-077 -> talos-mtnlion-r5-100
Comment 2•10 years ago
I've updated https://docs.google.com/a/mozilla.com/spreadsheets/d/1o4C9aUDmyIwn7VgAxur2_GIlOwx7f92hii_oL8lnjMU/edit#gid=66084899
Jake, can you modify nagios (rename, comment things out so nagios doesn't break when DNS changes), inventory (don't forget the CNAMEs), and get them rebuilt and added back to nagios?
Assignee: relops → jwatkins
Reporter
Comment 3•10 years ago
jake: please ping me before you reimage them so I can disable them in slavealloc and we don't burn any jobs
Assignee
Comment 4•10 years ago
Should we move to 10.10.2 base image before mass reimaging? The current base image is 10.10.0
https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=915041082&mt=12
Comment 5•10 years ago
If we think that some of the issues are due to problems they've released patches for, this is probably a good idea. Any .0 release is probably a bad idea, anyway.
Reporter
Comment 6•10 years ago
We still run the original 10.8 release on the talos-mtnlion-r5-* slaves. But I think upgrading to the latest image shouldn't be a problem. Once the image is ready, I could image the existing 10.10 slaves and run the tests again on try to ensure there aren't any unexpected failures.
Assignee
Comment 7•10 years ago
Ok. Sounds like a good plan. I'll build the new base image for 10.10.2 and ping :kmoir when ready to reimage some of the current yosemite systems
Assignee
Comment 8•10 years ago
Kim, I have a new image captured. Do you have a host ready to be reimaged?
Flags: needinfo?(kmoir)
Assignee
Comment 10•10 years ago
(In reply to Kim Moir [:kmoir] from comment #9)
> You can use t-yosemite-r5-0001
I reimaged t-yosemite-r5-0001 and it looks ok from an initial glance. Let me know if/when I can reimage the rest of the *current* yosemite machines.
Comment 11•10 years ago
I'll reenable 0001, disable the other 6, and push to try so it can have a shot at ALL THE TESTS once the current tree-closing-infra goes away.
Comment 12•10 years ago
Which would have been a fine plan, once I realized it would be faster just to retrigger everything on https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd8f5061d87f instead of pushing again, except that 0001 isn't actually taking any jobs.
Reporter
Comment 13•10 years ago
It looks like the puppet run on t-yosemite-r5-0001 isn't finishing and thus buildbot isn't getting started. I'm investigating.
Reporter
Comment 14•10 years ago
So this is what is happening if I try to run puppet manually on t-yosemite-r5-0001
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com var]# puppet agent --test --server=releng-puppet1.srv.releng.scl3.mozilla.com
2015-02-20 07:17:44.717 system_profiler[557:4261] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:44.719 system_profiler[557:4261] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.065 system_profiler[564:4293] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.066 system_profiler[564:4293] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.445 system_profiler[573:4332] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:17:45.446 system_profiler[573:4332] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com var]# puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com
2015-02-20 07:18:01.099 system_profiler[588:4404] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.100 system_profiler[588:4404] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.463 system_profiler[594:4434] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.464 system_profiler[594:4434] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.844 system_profiler[603:4476] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-20 07:18:01.845 system_profiler[603:4476] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
If I run this command on the (10.10) yosemite hosts, it works fine. Is the new image missing the deploy password or something similar? The machine is missing the certs in /var/lib/puppet/ssl/certs
Flags: needinfo?(jwatkins)
Assignee
Comment 15•10 years ago
Looks like it is getting hung up on installing this Java developer package because of an untrusted cert. The package is pretty old and more than likely contains an expired cert. Our options are to upgrade the package to a current version or strip the cert from the package.
Is the package even still needed? If not, we could just remove it.
Fri Feb 20 17:12:30 -0800 2015 Puppet (err): Execution of '/usr/sbin/installer -pkg /private/tmp/dmg.F9uOTD/JavaDeveloper.pkg -target /' returned 1: installer: Package name is Java Developer Package
installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
Fri Feb 20 17:12:30 -0800 2015 /Stage[main]/Packages::Javadeveloper_for_os_x/Packages::Pkgdmg[javadeveloper_for_os_x_2012003__11m3646]/Package[javadeveloper_for_os_x_2012003__11m3646.dmg]/ensure (err): change from absent to present failed: Execution of '/usr/sbin/installer -pkg /private/tmp/dmg.F9uOTD/JavaDeveloper.pkg -target /' returned 1: installer: Package name is Java Developer Package
installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
Flags: needinfo?(jwatkins)
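For triage, the signature failure in the log above can be told apart from other installer errors by matching its message text. This is a minimal sketch, assuming the installer output has been captured on a pipe; the function name `is_untrusted_pkg_error` is hypothetical, and the matched string is copied from the log in this comment. On the Mac itself, `pkgutil --check-signature` on the package should show the signing certificate directly.

```shell
# Hypothetical helper: exits 0 when the installer output on stdin contains
# the untrusted-certificate message quoted in this bug, non-zero otherwise.
is_untrusted_pkg_error() {
    grep -q 'Certificate used to sign package is not trusted'
}

# Usage (not run here):
#   /usr/sbin/installer -pkg Some.pkg -target / 2>&1 | is_untrusted_pkg_error
```

A non-zero exit only means the message was absent; it says nothing about whether the install succeeded.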
Comment 16•10 years ago
The trail of tears for installing Java is bug 790206 and bug 705570 (and a random assortment of weird dependencies). The short version is that there are two answers, "well, try not installing it and see whether we leak in crashtests and a couple of mochitest chunks, and if we don't, good enough" and "yes, even if we don't still leak without it installed, we still want to have it installed because we do have some Java-specific tests, which we do want to run to ensure we don't break those users who are stuck still having to install Java on their Macs." The latter is the much more responsible of the two answers.
Updated•10 years ago
Blocks: t-yosemite-r5-0001
Reporter
Comment 17•10 years ago
Regarding comment #2. Yes, I saw the java error message before. This isn't the root cause, however, because in my staging master I disabled the java package in the manifest and ran puppet manually on t-yosemite-r5-0001 and the same issue occurred with the certificate. Ran it in debug mode and nothing about the package - just the same certificate error as above. If I look at the server logs (releng-puppet2.srv.releng.scl3.mozilla.com) there isn't anything about connecting to t-yosemite-r5-0001 today.
Comment 18•10 years ago
The cert errors in comment 14 are because the puppet agent can't find its certificates. On OS X, the default location for puppet's SSL stuff is wrong, so until we install puppet.conf (via a successful puppet run), we have to specify --ssldir=/var/lib/puppet/ssl. You'll see this in puppetize.sh, for example. So I'm guessing that puppet run would have succeeded -- or at least given a more helpful error message -- with that additional option.
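The advice above can be sketched as a small wrapper that adds `--ssldir` only while puppet.conf is not yet installed. This is an illustration, not what puppetize.sh does: the function name is hypothetical, and the puppet.conf location is passed in as a parameter because this bug does not say where it lives. The flags themselves are the ones already used in this thread.

```shell
# Sketch only: print the puppet agent arguments for a manual run.
# $1 = puppet master hostname
# $2 = path where puppet.conf is expected (hypothetical parameter; the real
#      location is not given in this bug)
puppet_agent_args() {
    server=$1
    conf=$2
    args="agent --test --server=$server"
    # Until puppet.conf exists, point the agent at the real SSL directory.
    if [ ! -f "$conf" ]; then
        args="$args --ssldir=/var/lib/puppet/ssl"
    fi
    printf '%s\n' "$args"
}

# Usage (not run here):
#   puppet $(puppet_agent_args releng-puppet2.srv.releng.scl3.mozilla.com /path/to/puppet.conf)
```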
Reporter
Comment 19•10 years ago
No, I did try this and the same error message resulted
puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir --pluginsync --ssldir=/var/lib/puppet/ssl
No ca.pem
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com ~]# puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir --pluginsync --ssldir=/var/lib/puppet/ssl
2015-02-23 09:14:37.387 system_profiler[616:4804] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.389 system_profiler[616:4804] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.753 system_profiler[622:4835] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:37.755 system_profiler[622:4835] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:38.145 system_profiler[631:4876] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
2015-02-23 09:14:38.146 system_profiler[631:4876] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled
[root@t-yosemite-r5-0001.test.releng.scl3.mozilla.com ~]# ls -lR /var/lib/puppet/ssl
total 0
drwxr-xr-x 2 puppet puppet 68 20 Feb 17:10 certificate_requests
drwxr-xr-x 2 puppet puppet 68 20 Feb 20:10 certs
drwxr-x--- 2 puppet puppet 68 20 Feb 17:10 private
drwxr-x--- 3 puppet puppet 102 23 Feb 08:52 private_keys
drwxr-xr-x 3 puppet puppet 102 20 Feb 20:11 public_keys
/var/lib/puppet/ssl/certificate_requests:
/var/lib/puppet/ssl/certs:
/var/lib/puppet/ssl/private:
/var/lib/puppet/ssl/private_keys:
total 8
-rw-r----- 1 puppet puppet 3243 23 Feb 08:52 t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
/var/lib/puppet/ssl/public_keys:
total 8
-rw-r--r-- 1 puppet puppet 800 23 Feb 08:52 t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
Assignee
Comment 20•10 years ago
:kmoir, did you happen to reboot the machine and then try puppet manually? If the system gets rebooted before puppet finishes its first run, the launchd entry would still be in place to run puppetize.sh again on start. The problem is that this time it doesn't have the deploypass on disk, since it wiped it on the first successful attempt at getting a cert installed. On the second attempt, puppetize.sh will wipe the certs and try again, but without the deploypass it hangs.
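The failure mode described here can be made visible with a pre-flight check before anything reruns puppetize.sh. The sketch below is hypothetical and is not part of the real puppetize.sh: the function name and the three state labels are invented for illustration, and the deploypass path is a parameter because this bug does not give it.

```shell
# Hypothetical pre-flight check, not part of the real puppetize.sh.
# $1 = path of the deploypass file (assumed; not stated in this bug)
# $2 = puppet SSL directory
# Prints one of: ready, certified-no-deploypass, stuck
puppetize_state() {
    deploypass=$1
    ssldir=$2
    if [ -f "$deploypass" ]; then
        echo ready                    # a certificate can still be requested
    elif [ -f "$ssldir/certs/ca.pem" ]; then
        echo certified-no-deploypass  # rerunning would wipe these certs, then hang
    else
        echo stuck                    # no deploypass and no certs left
    fi
}
```

In the second and third states, rerunning puppetize.sh is the wrong move according to the explanation above.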
Assignee
Comment 21•10 years ago
I'm downloading the latest package from Apple, but it is concerning that the latest version is from Oct 15, 2013.
Comment 22•10 years ago
Odd, there should be a 'ca.pem' in certs. Odd in particular that one of the three files is missing!
I re-ran puppetize.sh (after applying the fix in bug 1132198), then ran puppet again. It gave me lots of errors about files being forbidden, which suggests incorrect filesystem permissions on your environment on the puppetmaster.
Error: /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin/concatfragments.rb]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/concat/concatfragments.rb: Error 403 on SERVER: Forbidden request: t-yosemite-r5-0001.test.releng.scl3.mozilla.com(10.26.56.16) access to /file_metadata/modules/concat/concatfragments.rb [find] at :119
I'm not sure what would be causing that -- the filesystem permissions actually look correct.
And somehow the SSL files disappeared again. I think that might have been Kim and me stepping on each other's feet.
Reporter
Comment 23•10 years ago
I don't understand why the ssl certs are disappearing. If I run puppetize.sh on the machine, it runs and then seems to delete the certs if it is pinned to my user env. I've never seen this before.
In any case, if I change the machine back so it synchronizes with a production master, puppet seems to continue to work but complains about that java package not satisfying dependencies. Puppet continues to cycle and does not start buildbot.
I'll attach a patch so that we don't install the java package on 10.10 (temporarily) and see if this addresses the issue with the puppetizing process not completing in production. If this fixes the issue I'll look at installing the new java package.
Reporter
Comment 24•10 years ago
Attachment #8568109 -
Flags: review?(jwatkins)
Assignee
Updated•10 years ago
Attachment #8568109 -
Flags: review?(jwatkins) → review+
Assignee
Comment 25•10 years ago
If excluding Java dev breaks things, here is a patch to install the new package on 10.10 only and still maintain the old package for the other OS X versions.
Attachment #8568112 -
Flags: review?(kmoir)
Comment 26•10 years ago
The most likely theory should be "not having Java won't break things, it will just cause us to silently not run the tests that depend on Java on 10.10 and thus allow us to break it without knowing."
Reporter
Updated•10 years ago
Attachment #8568112 -
Flags: review?(kmoir) → review+
Reporter
Comment 27•10 years ago
So I tried the patch to remove java and see if puppet eventually stopped running. It seemed that it lost the certificates again and I had to run puppetize.sh. I reverted the patch and am interested to see the result when we try the new package in dividehex's patch.
Assignee
Comment 28•10 years ago
Comment on attachment 8568112 [details] [diff] [review]
bug1134223-1.patch
remote: https://hg.mozilla.org/build/puppet/rev/7b9f7c4be906
remote: https://hg.mozilla.org/build/puppet/rev/a63981adb9fa
Attachment #8568112 -
Flags: checked-in+
Reporter
Comment 29•10 years ago
So the new java package is installed correctly. However, I had to run puppetize.sh manually because the ca.pem was missing again. Could we try to reimage this machine with 10.10.2 and rerun to see if this issue can be replicated? i.e. reimage it and let it puppetize again. I'm concerned that the machine isn't in a good state now, because we have been fiddling with it so much. (Or if you have the command to reimage it via netboot, I could do it :-) I don't know the address.)
Comment 30•10 years ago
(In reply to Kim Moir [:kmoir] from comment #27)
The command to reimage is the same across all the test machines:
/usr/sbin/bless --netboot --server bsdp://10.26.56.110; reboot
I've done that on t-yosemite-r5-0001, so we'll see how it fares.
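To keep the reimage command from being mistyped across many hosts, it can be generated rather than retyped. This is a sketch with a hypothetical wrapper name; the command string itself is exactly the one quoted in this comment, with only the BSDP server address made a parameter. It prints the command instead of running it, since running it wipes the machine.

```shell
# Sketch: print the netboot reimage command for a given BSDP server address.
# The wrapper name is hypothetical; the command is the one quoted above.
netboot_reimage_cmd() {
    printf '/usr/sbin/bless --netboot --server bsdp://%s; reboot\n' "$1"
}

# Usage (destructive; run as root on the target Mac only):
#   eval "$(netboot_reimage_cmd 10.26.56.110)"
```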
Comment 31•10 years ago
I reinstalled it and it seems to have come up fine. I re-ran puppet a few times to make sure it wasn't deleting the pem files.
I do note that it's spitting out (what appear to be inconsequential) errors from system_profiler whenever we run puppet, though, so that should be fixed before we put this in production:
system_profiler[2642:12110] platformPluginDictionary: Can't get X86PlatformPlugin, return value 0
I then pinned it to your env, kim: puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=kmoir -v --no-splay --no-daemonize
Notice: /File[/var/lib/puppet/lib/puppet]/ensure: removed
Notice: /File[/var/lib/puppet/lib/puppet_x]/ensure: removed
Notice: /File[/var/lib/puppet/lib/facter]/ensure: removed
But all of the cert stuff is still there on disk.
/var/lib/puppet/ssl//certs/ca.pem
/var/lib/puppet/ssl//certs/t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
/var/lib/puppet/ssl//private_keys/t-yosemite-r5-0001.test.releng.scl3.mozilla.com.pem
And subsequent runs went fine.
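The "are the pem files still there" check done by hand above can be scripted. This is a minimal sketch, assuming the three files listed in this comment are the complete set a healthy host should have; the function name is hypothetical.

```shell
# Sketch: report which of the three expected puppet SSL files are missing.
# $1 = SSL directory, $2 = host FQDN.
# Returns non-zero if any file is missing.
check_puppet_ssl() {
    ssldir=$1
    fqdn=$2
    missing=0
    for f in "certs/ca.pem" "certs/$fqdn.pem" "private_keys/$fqdn.pem"; do
        if [ ! -f "$ssldir/$f" ]; then
            echo "missing: $ssldir/$f"
            missing=1
        fi
    done
    return $missing
}

# Usage (not run here):
#   check_puppet_ssl /var/lib/puppet/ssl t-yosemite-r5-0001.test.releng.scl3.mozilla.com
```

Running it after each puppet run would catch the disappearing ca.pem seen earlier in this bug as soon as it happens.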
Comment 32•10 years ago
https://tickets.puppetlabs.com/browse/FACT-724 for the complaining about system_profiler (dustin has been commenting there, I see)
Reporter
Comment 33•10 years ago
Thanks Jake! Running tests on it now on try. I've disabled the other yosemite slaves temporarily in slavealloc so all the tests will run on the 10.10.2 slave.
Comment 34•10 years ago
Unfortunately your trychooser syntax was missing a "-u" so you just got 10.6 tests. 10.10 running in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3b34031784f
Reporter
Comment 35•10 years ago
thanks philor
Reporter
Comment 36•10 years ago
I reimaged the other yosemite slaves as 10.10.2.
Assignee
Comment 37•10 years ago
talos-mtnlion-r5-087 was already migrated to t-yosemite-r5-0003. I've removed it from the list and shifted the remaining up in line. This will give us exactly 23 new OSX 10.10 slaves for a total of 30.
Comment 38•10 years ago
Just to make sure we really do have bustage from the newly reimaged slaves like it appears we do, and not just newly introduced bustage that happened at a very inconvenient time, I retriggered the debug mochitest-2 and browser-chrome-3 in https://treeherder.mozilla.org/#/jobs?repo=try&revision=83c654620c54, my last clear and clean and certain parent revision.
Comment 39•10 years ago
Sigh. Not 10.10.2 bustage, just someone sneaking in code bustage while we didn't have any slaves enabled and I wasn't pushing every m-c merge to try. Does the fun ever start?
Reporter
Comment 40•10 years ago
talos-mtnlion-r5-077-100 (excluding 87) have been disabled. Once the tests are finished on these machines, Jake will update inventory and we can reimage them.
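As a sanity check on the count, the disabled range can be listed mechanically: 077 through 100 is 24 hosts, and dropping 087 leaves the 23 mentioned in comment 37. A small sketch, with a hypothetical function name:

```shell
# Sketch: list talos-mtnlion-r5-077 through -100, skipping 087, which was
# already migrated according to comment 37.
mtnlion_hosts_to_reimage() {
    n=77
    while [ "$n" -le 100 ]; do
        if [ "$n" -ne 87 ]; then
            printf 'talos-mtnlion-r5-%03d\n' "$n"
        fi
        n=$((n + 1))
    done
}

# Usage: mtnlion_hosts_to_reimage | wc -l
```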
Reporter
Comment 41•10 years ago
Noticed amy's spreadsheet renamed them as yosemite 29-52 instead of the earlier range I had entered for the machine we didn't get
Attachment #8569996 -
Flags: review?(jwatkins)
Assignee
Updated•10 years ago
Attachment #8569996 -
Flags: review?(jwatkins) → review+
Assignee
Comment 42•10 years ago
Nagios and inventory have been updated. I've also deleted the previous host entries in deploystudio and set the yosemite group as default. We can start reimaging.
Reporter
Updated•10 years ago
Attachment #8569996 -
Flags: checked-in+
Assignee
Comment 43•10 years ago
First batch has completed reimaging and puppetizing (t-yosemite-r5-{0029..0035})
Here are the batches that still need to be reimaged
0036..0040
0041..0045
0046..0051
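The batch ranges above can be expanded into full hostnames for use in a loop. A sketch with a hypothetical function name; pass the bounds without leading zeros (29, not 0029), because shell arithmetic would treat a leading zero as octal.

```shell
# Sketch: print t-yosemite-r5-NNNN hostnames for an inclusive numeric range.
# $1 = first number, $2 = last number, both without leading zeros.
yosemite_batch() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf 't-yosemite-r5-%04d\n' "$n"
        n=$((n + 1))
    done
}

# Usage: yosemite_batch 36 40
```

The four batches (29 to 35, 36 to 40, 41 to 45, 46 to 51) add up to 7 + 5 + 5 + 6 = 23 hosts.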
Assignee
Comment 44•10 years ago
0036..0040 is done
Assignee
Comment 45•10 years ago
0041..0045 is done except for 0042. It seems to be stuck on the first task of the workflow.
Assignee
Comment 46•10 years ago
0046..0051 is done
Assignee
Comment 47•10 years ago
:kmoir, looks like t-yosemite-r5-0042 got some attention and took the reimage ok. I'm going to go ahead and r/f this.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED