Closed Bug 1141628 Opened 9 years ago Closed 9 years ago

Create an OS X 10.7 build host for DR

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dividehex)

Attachments

(1 file)

We need to create an OS X 10.7 build client host for DR.
Blocks: 1141631
coop/jordan, can you specify an r4 machine to take out of service and use for this?
Flags: needinfo?(jlund)
Flags: needinfo?(coop)
Depends on: t-snow-r4-0165
Flags: needinfo?(coop)
t-snow-r4-0165
Flags: needinfo?(jlund)
nagios checks removed

svn commit -m 'Bug 1141626 & Bug 1141628: remove nagios checks for DR minis'
Sending        releng/scl3.pp
Transmitting file data .
Committed revision 102585.
Depends on: 1148255
This host was renamed to bld-lion-r4-001.test.releng.scl3.mozilla.com for reimaging purposes.
I copied over the lion image from install.build and set up a new workflow (Restore lion-r4), but the host failed to come back from a reimage. I suspect it didn't like the Apple Core Storage stuff, so I've removed that and filed a bug to have DCOps netboot it. Hopefully it will recover after that.
DCOps got this to netboot and it took the image correctly. It has been puppetized to include toplevel::slave::releng::build::standard.

I'm going to have this moved to the build vlan so it can be tested against a dev master.
Depends on: 1150132
Attached patch bug1141628.patch
I had to change the host name because it still reflected the test network

scutil --set HostName bld-lion-r4-001.build.releng.scl3.mozilla.com

The build failed because the host is missing some ssh keys. I think it couldn't puppetize properly because the configs contain the incorrect host name.
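
As a rough illustration of the mismatch described above, the network label can be read straight off the FQDN. The helper name fqdn_network is hypothetical and not part of any releng tooling. The sketch only assumes that the second dot-separated field names the network (test vs. build), as it does in the two host names in this bug.

```shell
# Hypothetical helper: print the second dot-separated field of an FQDN,
# which in the host names from this bug is the network label.
fqdn_network() {
  printf '%s\n' "$1" | cut -d. -f2
}

# Host name before and after the move to the build vlan.
old=$(fqdn_network bld-lion-r4-001.test.releng.scl3.mozilla.com)
new=$(fqdn_network bld-lion-r4-001.build.releng.scl3.mozilla.com)
echo "old=$old new=$new"

# On the Mac itself, the configured name can be read back with:
#   scutil --get HostName
```

If the label printed for the configured name is still "test" after the vlan move, the host name was not updated.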
Attachment #8588579 - Flags: review?(jwatkins)
Comment on attachment 8588579 [details] [diff] [review]
bug1141628.patch

hmmm...  It was imaged and puppetized before it was moved to the build vlan.  In theory, it should have gotten all the same keys and configuration regardless of the fqdn
Attachment #8588579 - Flags: review?(jwatkins) → review+
Comment on attachment 8588579 [details] [diff] [review]
bug1141628.patch

and merged to production
Attachment #8588579 - Flags: checked-in+
I ran some more tests and I think it is okay. The warnings are from missing taskcluster credentials. A dmg was created and I was able to run the nightly build. How will it continue to be updated in TOR? I assume it will continue to be updated via puppet?
It won't get any updates. When we test the DR, it will be updated by hand (possibly by attaching it to a puppet master in AWS, but not necessarily during a test). The machine will be powered off in a closet until a regular test (see the resiliency/DR docs I added to).
Okay I think the missing credentials were because the machine was not in slavealloc.  I've fixed this and am running another test.
Still failing with this error message, even though this file exists on the filesystem

http://dev-master2.bb.releng.use1.mozilla.com:8039/builders/OS%20X%2010.7%20mozilla-central%20build/builds/28/steps/run_script/logs/stdio

10:03:39     INFO -  Error: /builds/slave/m-cen-m64-00000000000000000000/build/src/obj-firefox/i386/browser/installer/package-manifest:462: File missing in ../../dist: Nightly.app/Contents/Resources/modules/services/datareporting/sessions.jsm
10:03:55     INFO -  Traceback (most recent call last):
10:03:55     INFO -    File "/builds/slave/m-cen-m64-00000000000000000000/build/src/toolkit/mozapps/installer/packager.py", line 404, in <module>
10:03:55     INFO -      main()
10:03:55     INFO -    File "/builds/slave/m-cen-m64-00000000000000000000/build/src/toolkit/mozapps/installer/packager.py", line 353, in main
10:03:55     INFO -      copier.add(mozpath.join(respath, 'removed-files'), removals)
10:03:55     INFO -    File "/tools/python/lib/python2.7/contextlib.py", line 24, in __exit__
10:03:55     INFO -      self.gen.next()
10:03:55     INFO -    File "/builds/slave/m-cen-m64-00000000000000000000/build/src/python/mozbuild/mozpack/errors.py", line 129, in accumulate
10:03:55     INFO -      raise AccumulatedErrors()
10:03:55     INFO -  mozpack.errors.AccumulatedErrors
10:03:55     INFO -  make[3]: *** [stage-package] Error 1
10:03:55     INFO -  make[2]: *** [postflight_all] Error 2
10:03:55     INFO -  make[1]: *** [realbuild] Error 2
10:03:55     INFO -  make: *** [build] Error 2
10:03:55     INFO -  153 compiler warnings present.

:mshal do you have any ideas? I know you have more experience with the mac build side of things than I do.
Flags: needinfo?(mshal)
We chatted on IRC - this particular problem was gone on the most recent build, so I didn't investigate further. I think the only remaining issue is the credentials used for 'make upload'.
Flags: needinfo?(mshal)
(In reply to Michael Shal [:mshal] from comment #14)
> We chatted on IRC - this particular problem was gone on the most recent
> build, so I didn't investigate further. I think the only remaining issue is
> the credentials used for 'make upload'.

What is the status on this host?  Is there a reason the credentials are missing? Could these be manually added for DR purposes?
The current status is
* the correct keys are on the machine
* I can run the ssh command that fails during the build (connecting to dev-stage01.srv.releng.scl3.mozilla.com) from the command line, and it works
* I adjusted the staging server name, clobbered the builder dir on the machine, and am rerunning the build

I think the machine is fine; I'm just being careful to ensure it's in a valid state.
So the subsequent build failed, and I don't understand why. If I run the command ['ssh', '-o', 'IdentityFile=/Users/cltbld/.ssh/ffxbld_rsa', 'ffxbld@dev-stage01.srv.releng.scl3.mozilla.com', 'mktemp -d'] from the command line of the slave, it works fine. If anyone has suggestions on what to do, they would be appreciated.

12:09:17     INFO -  2015-04-15 12:09:17,378 - Copying ../../dist//firefox-40.0a1.en-US.mac.checksums.asc to cache /builds/slave/m-cen-m64-00000000000000000000/build/signing_cache/gpg/f4de22cdae03cd6036d153e4c82571d98b17c546
12:09:17     INFO -  /builds/slave/m-cen-m64-00000000000000000000/build/src/obj-firefox/i386/_virtualenv/bin/python /builds/slave/m-cen-m64-00000000000000000000/build/src/build/upload.py --base-path ../../dist \
12:09:17     INFO -  		'../../dist/firefox-40.0a1.en-US.mac.dmg'   '../../dist/mac/xpi/firefox-40.0a1.en-US.langpack.xpi'  '../../dist/firefox-40.0a1.en-US.mac.tests.zip' '../../dist/firefox-40.0a1.en-US.mac.crashreporter-symbols.zip'  '../../dist//firefox-40.0a1.en-US.mac.txt' '../../dist//firefox-40.0a1.en-US.mac.json' '../../dist//firefox-40.0a1.en-US.mac.mozinfo.json' '../../dist/jsshell-mac.zip'     ../../dist/host/bin/mar  ../../dist/host/bin/mbsdiff     \
12:09:17     INFO -  		'../../dist//firefox-40.0a1.en-US.mac.checksums' '../../dist//firefox-40.0a1.en-US.mac.checksums'.asc
12:09:17     INFO -  Permission denied (publickey,gssapi-with-mic,password).
12:09:28     INFO -  Permission denied (publickey,gssapi-with-mic,password).
12:09:44     INFO -  Permission denied (publickey,gssapi-with-mic,password).
12:10:08     INFO -  Permission denied (publickey,gssapi-with-mic,password).
12:10:45     INFO -  Permission denied (publickey,gssapi-with-mic,password).
12:10:45     INFO -  Command '['ssh', '-o', 'IdentityFile=/Users/cltbld/.ssh/ffxbld_rsa', 'ffxbld@dev-stage01.srv.releng.scl3.mozilla.com', 'mktemp -d']' returned non-zero exit status 255
-oIdentityFile won't disable use of an SSH agent.  If ssh-add -l shows your key on the command line, that might be why.  Try running it with 'env -i ssh -oIdentityFile...'
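
A minimal sketch of why env -i changes the outcome, assuming ssh locates the agent through the SSH_AUTH_SOCK environment variable (its documented behaviour). The socket path below is a made-up placeholder and no real agent or network connection is involved. The sketch only shows that a child started with env -i no longer sees the variable.

```shell
# Placeholder socket path, purely for illustration; no agent is listening here.
SSH_AUTH_SOCK=/tmp/fake-agent.sock
export SSH_AUTH_SOCK

# A normal child process inherits the agent socket...
inherited=$(/bin/sh -c 'echo "${SSH_AUTH_SOCK:-unset}"')
# ...while env -i starts the child with an empty environment.
isolated=$(env -i /bin/sh -c 'echo "${SSH_AUTH_SOCK:-unset}"')

echo "inherited=$inherited isolated=$isolated"

# Applied to the command from the build log, the isolated check would be:
#   env -i ssh -o IdentityFile=/Users/cltbld/.ssh/ffxbld_rsa \
#       ffxbld@dev-stage01.srv.releng.scl3.mozilla.com 'mktemp -d'
# Adding -o IdentitiesOnly=yes is another way to restrict ssh to the identity
# files given explicitly, even if an agent offers more keys (an addition here,
# not something suggested in this bug).
```

With the agent out of the picture, the command-line test exercises the same key file the build uses, so a failure there points at the key itself rather than at the build environment.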
Dustin: Yes this explains it. Thanks for the pointer.

So the problem was that I had the prod env specified in slavealloc, and thus the machine had the prod keys, not the staging keys. In any case, once I fixed this I was able to run an m-c nightly build that ran green.
The machine ran several m-c builds last night, all were green so I think the machine is good.
(In reply to Kim Moir [:kmoir] from comment #20)
> The machine ran several m-c builds last night, all were green so I think the
> machine is good.

Thanks Kim!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED