Closed Bug 1116210 Opened 9 years ago Closed 9 years ago

[mig agent] MacOS DMG packaging

Categories

(Enterprise Information Security Graveyard :: MIG, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jvehent, Assigned: dustin)

References

Details

Attachments

(3 files, 2 obsolete files)

MIG Agent needs proper packaging for MacOS.
After some testing, it seems that the best approach is to build the agent on MacOS directly. It would be nice to cross-compile from Linux, but FPM PKG generation and hdiutil for DMG creation are only available on darwin. (In the future, we should investigate alternatives, but that's not a priority.)

The following method works:

1. Compile the agent with OS=darwin
2. Using FPM, create a PKG file
3. Using hdiutil, create a DMG file that contains only the PKG file
4. To install, mount the DMG file, execute the installer command on the PKG file, and unmount the DMG file
Any install script contained in the PKG file will be executed by the installer.
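
Steps 2 and 3 could be scripted roughly as follows. This is only a sketch, starting from an already compiled agent staged in a directory tree: the staging and work directories are hypothetical, and the fpm and hdiutil options shown are one plausible choice, not necessarily what the attached Makefile entry uses. The function prints the commands instead of running them, so it can be inspected on any platform.

```shell
#!/bin/bash
# Sketch only. The stage/work paths are hypothetical; the tool names
# (fpm, hdiutil) and the volume name come from this bug.

# Compose the artifact base name, e.g. mig-agent-20141229+da443e6.dev-x86_64
artifact_name() {
    printf 'mig-agent-%s-%s\n' "$1" "$2"
}

# Print (do not run) the packaging commands for steps 2 and 3.
# Arguments: version, arch, staged install tree, work directory.
package_commands() {
    local version="$1" arch="$2" stage="$3" work="$4"
    local base
    base="$(artifact_name "$version" "$arch")"

    # The DMG source folder must contain nothing but the PKG.
    echo "mkdir -p '$work/dmg'"

    # Step 2: FPM's "osxpkg" target wraps the staged tree in a PKG.
    echo "fpm -s dir -t osxpkg -n mig-agent -v '$version' -C '$stage' -p '$work/dmg/$base.pkg' ."

    # Step 3: build a compressed DMG from that folder.
    echo "hdiutil create -volname 'Mozilla InvestiGator Agent' -srcfolder '$work/dmg' -ov -format UDZO '$work/$base.dmg'"
}
```

On a Mac, something like `package_commands 20141229+da443e6.dev x86_64 ./stage ./build | sh` would execute the printed commands.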

Sample run:

$ hdiutil mount /tmp/mig-agent-20141229+da443e6.dev-x86_64.dmg
/dev/disk1          	Apple_partition_scheme         	
/dev/disk1s1        	Apple_partition_map            	
/dev/disk1s2        	Apple_HFS                      	/Volumes/Mozilla InvestiGator Agent

$ ls /Volumes/Mozilla\ InvestiGator\ Agent/
mig-agent-20141229+da443e6.dev-x86_64.pkg

$ sudo installer -package /Volumes/Mozilla\ InvestiGator\ Agent/mig-agent-20141229+da443e6.dev-x86_64.pkg -target /
Password:
installer: Package name is mig-agent-20141229+da443e6.dev-x86_64
installer: Upgrading at base path /
installer: The upgrade was successful.

$ hdiutil unmount /Volumes/Mozilla\ InvestiGator\ Agent/
"/Volumes/Mozilla InvestiGator Agent/" unmounted successfully.

$ sudo /sbin/mig-agent -q=pid
6161

Makefile entry is in attachment for feedback.
Attachment #8542217 - Flags: feedback?(dustin)
Attachment #8542217 - Flags: feedback?(dustin) → feedback+
Component: Operations Security (OpSec): General → Operations Security (OpSec): MIG
Blocks: 1116685
No longer blocks: 896480
Retested the procedure today with latest mig-agent, manual install works fine.

    $ hdiutil mount ~/Code/build_mig/packages/opsec/mig-agent-20150402+b05e40b.prod-x86_64.dmg 
    /dev/disk1          	Apple_partition_scheme         	
    /dev/disk1s1        	Apple_partition_map            	
    /dev/disk1s2        	Apple_HFS                      	/Volumes/Mozilla InvestiGator Agent
    $ sudo installer -package /Volumes/Mozilla\ InvestiGator\ Agent/mig-agent-20150402+b05e40b.prod-x86_64.pkg -target /
    installer: Package name is mig-agent-20150402+b05e40b.prod-x86_64
    installer: Upgrading at base path /
    installer: The upgrade was successful.
    $ hdiutil unmount /Volumes/Mozilla\ InvestiGator\ Agent/
    "/Volumes/Mozilla InvestiGator Agent/" unmounted successfully.
    $ sudo /sbin/mig-agent
    [info] using builtin conf
    $ sudo /sbin/mig-agent -q=pid
    8759
    $ sudo /sbin/mig-agent -V
    20150402+b05e40b.prod
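
For reference, the manual procedure above could be wrapped in a small function along these lines. It is a sketch, not what puppet's pkgdmg does: the DRY_RUN switch is a hypothetical addition, and the default mount point assumes the volume name shown in the hdiutil output above.

```shell
#!/bin/bash
# Sketch only. DRY_RUN is a hypothetical switch; the mount point default
# is the volume name seen in the sample run.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mount the DMG, install the PKG it contains, then unmount again.
# Arguments: path to the DMG, optional mount point.
install_dmg() {
    local dmg="$1"
    local mnt="${2:-/Volumes/Mozilla InvestiGator Agent}"
    local pkg status=0
    # The PKG inside the DMG shares the DMG's base name.
    pkg="$mnt/$(basename "$dmg" .dmg).pkg"

    run hdiutil attach "$dmg" -nobrowse || return 1
    run sudo installer -package "$pkg" -target / || status=$?
    # Unmount even if the installer failed, then report its status.
    run hdiutil detach "$mnt"
    return "$status"
}
```

Calling it with `DRY_RUN=1` set prints the three commands without touching the system, which is handy for checking the paths before a real install.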

The configuration will be identical to the one already puppeted for Linux, so the only change needed is to deploy and install the DMG. How should this be done? I see some DMG install scripts in puppet that curl packages from various places. Should I host the DMG in an S3 bucket operated by opsec, or is there a more standard way to deploy it?
Sweet!!

The DMGs end up in /data/repos/DMGs, and are installed with the packages::pkgdmg define, e.g.,

class packages::mozilla::mig_agent {
    case $::operatingsystem {
        ...
        Darwin: {
            packages::pkgdmg {
                mig-agent:  # must match the DMG base name
                    version => "20150402+b05e40b.prod-x86_64",
                    os_version_specific => false;  # same binary works on multiple versions
            }
        }
        ...
    }
}
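
Since the resource title and version have to line up with the file name, a quick pre-flight check before copying a DMG over might look like this. It is a sketch that assumes pkgdmg looks for `<title>-<version>.dmg` under /data/repos/DMGs, which is inferred from the comment in the class above rather than from the define itself; `DMG_REPO` is a hypothetical override for testing.

```shell
#!/bin/bash
# Sketch only. Assumes the DMG is named <title>-<version>.dmg; DMG_REPO
# is a hypothetical override of the repository directory.

# Print the path where the DMG for a given title and version is expected.
dmg_path() {
    printf '%s/%s-%s.dmg\n' "${DMG_REPO:-/data/repos/DMGs}" "$1" "$2"
}

# Succeed if that DMG is present, otherwise report the missing path.
check_dmg() {
    local path
    path="$(dmg_path "$1" "$2")"
    if [ -f "$path" ]; then
        echo "found $path"
    else
        echo "missing $path" >&2
        return 1
    fi
}
```

For the class above, the call would be `check_dmg mig-agent 20150402+b05e40b.prod-x86_64`.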
Ok, I scp-ed the package into /data/repos/DMGs/mig-agent-20150402+b05e40b.prod-x86_64.dmg
Do I need to run a command to update the repository?

I'll send a patch, but it will need the config file change from bug 1149639 to be merged first.
Nope, OS X doesn't have "repositories" :(

So you should be good to go.
The attached patch provides the base support for darwin packages in the mig modules. It requires merging https://bug1149639.bugzilla.mozilla.org/attachment.cgi?id=8587388 first because of the dependency on the `api` config parameter.

I have not assigned the class mig::agent::daemon to any hosts yet. I'm assuming we want to test it on a single host first and see if it works as expected.

Dustin: can you perform the test deploy, and I'll help verify it?
Attachment #8587412 - Flags: review?(dustin)
Attachment #8587412 - Attachment is obsolete: true
Attachment #8587412 - Flags: review?(dustin)
Attachment #8587424 - Flags: review+
This actually deploys the patch.  I tried this on a buildslave and a server with no errors, although the buildslave hasn't rebooted yet (bld-lion-r5-007.try.releng.scl3.mozilla.com).

I had to add anchors here so that the restart could depend on pkgdmg on OS X.

Arr, given the risk of outage, is it OK to deploy this?
Attachment #8587470 - Flags: review?(jvehent)
Attachment #8587470 - Flags: feedback?(arich)
Comment on attachment 8587470 [details] [diff] [review]
bug1116210-deploy.patch

I'd install this on one or two clients of each type (10.6, 10.7, and 10.10) and make sure they go through a reboot cycle before rolling it out site-wide, so that we catch any issue where machines hang.
Attachment #8587470 - Flags: feedback?(arich) → feedback+
Assignee: jvehent → dustin
bld-lion-r5-007.try.releng.scl3.mozilla.com
mac-signing2.srv.releng.scl3.mozilla.com
t-yosemite-r5-0009.test.releng.scl3.mozilla.com
I can't see bld-lion-r5 or t-yosemite-r5. The other two are showing, plus my own:

mig=> select name from agents where status='online' and environment->>'os'='darwin';
                   name                   
------------------------------------------
 install.build.releng.scl3.mozilla.com
 mac-signing2.srv.releng.scl3.mozilla.com
 Juliens-Mac-mini.local
(3 rows)
Morgan, are there dashboards and things I can look at to see what's going on with the mig runner task on those platforms?

Also, how cool is it that this (almost) just runs on OS X without any additional futzing with startup scripts?
Flags: needinfo?(winter2718)
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Morgan, are there dashboards and things I can look at to see what's going on
> with the mig runner task on those platforms?
> 
> Also, how cool is it that this (almost) just runs on OS X without any
> additional futzing with startup scripts?

Quite cool :) You can use the runner dashboards. If you need another retry dashboard that's separated by platform I can make one in a jiffy. https://stats.taskcluster.net/grafana/#/dashboard/db/runner
Flags: needinfo?(winter2718)
So, running by hand:

[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# cat /opt/runner/tasks.d/1-mig_agent 
#!/bin/bash
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

# run mig-agent in checkin mode
/sbin/mig-agent -m agent-checkin || true
[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# /sbin/mig-agent -m agent-checkin
[info] Using external conf from /etc/mig/mig-agent.cfg
[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# echo $?
0

So it looks like it's running.  Julien, do you see this host now?
Yep. bld-lion-r5-007.try.releng.scl3.mozilla.com just showed up.
Apr 06 06:52:22 bld-lion-r5-007 1-mig_agent: starting (max time 600s)
Apr 06 06:52:23 bld-lion-r5-007 1-mig_agent: OK 

Can you see if it re-checked-in at that time?  If so, what's the latest status of t-yosemite?
Yosemite appears not to be logging to papertrail.  In its system.log I only see this about runner:

[root@t-yosemite-r5-0009.test.releng.scl3.mozilla.com ~]# grep runner /var/log/system.log
Apr  6 06:44:15 t-yosemite-r5-0009 com.apple.xpc.launchd[1] (com.mozilla.runner): This key does not do anything: OnDemand

Yet Buildbot is running, so I'm guessing runner ran.
Negative. The only hit I have is at 13:33 UTC, so 06:33 PDT.
Can you get syslog from that box? The agent sends INFO logs into the DAEMON facility.
I think there's something wrong with logging on OS X.

There's nothing for runner or mig in /var/log/system.log.  On papertrail, there's nothing whatsoever for t-yosemite-*, and for bld-lion-r5-007, the lines matching 'mig' are all from runner:

Apr 06 06:52:15 bld-lion-r5-007 tasks: ['0-darwin_clean_buildbot', '1-cleanslate', '1-mig_agent', '4-buildbot.py', '99-post_flight']
Apr 06 06:52:21 bld-lion-r5-007 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 06 06:52:22 bld-lion-r5-007 1-mig_agent: starting (max time 600s)
Apr 06 06:52:23 bld-lion-r5-007 1-mig_agent: OK
Apr 06 06:52:23 bld-lion-r5-007 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"}
OK, I was wrong about yosemite, but all I see is logs of runner running the task.
I'd suggest changing the config in /etc/mig/mig-agent.cfg to log to a file instead of syslog:

[logging]
    mode    = "file"
    level   = "debug"
    file    = "/tmp/mig_agent.log"
Morgan reminded me that on OS X, runner runs as cltbld, not root, and mig needs to be run as root.

Julien suggested something like

  if [ "$UID" != 0 ]; then PREFIX="sudo"; fi; $PREFIX /sbin/mig-agent -m agent-checkin

with /sbin/mig-agent added to sudoers.  That's probably the easiest course to making this work.
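
A slightly more defensive version of that one-liner, as a sketch: it assumes a sudoers entry along the lines of `cltbld ALL=(root) NOPASSWD: /sbin/mig-agent` (hypothetical, the actual entry is in the patch) and uses `sudo -n` so the task fails fast instead of waiting on a password prompt.

```shell
#!/bin/bash
# Sketch only. The sudoers entry mentioned above is an assumption.

# Print the check-in command for the given numeric UID: bare for root,
# prefixed with non-interactive sudo for anyone else.
checkin_command() {
    local uid="$1"
    if [ "$uid" -eq 0 ]; then
        echo "/sbin/mig-agent -m agent-checkin"
    else
        echo "sudo -n /sbin/mig-agent -m agent-checkin"
    fi
}

# In the runner task this would be executed as:
#   $(checkin_command "$(id -u)") || true
```

Keeping the `|| true` from the existing task means a failed check-in still does not block the buildbot job.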
OK, this worked in our testing.  r? for this patch and attachment 8587470 [details] [diff] [review]?
Attachment #8588759 - Flags: review?(jvehent)
Comment on attachment 8587470 [details] [diff] [review]
bug1116210-deploy.patch

Review of attachment 8587470 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
Attachment #8587470 - Flags: review?(jvehent) → review+
Comment on attachment 8588759 [details] [diff] [review]
bug1116210-sudo.patch

I think the r+ on the previous was meant for this patch.
Attachment #8588759 - Flags: review?(jvehent) → review+
Attachment #8587424 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
While I now have ~250 MacOS hosts checking in, it seems that bld-lion-r5-007 didn't rejoin the pool. Reopening to investigate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There are only two bld-lion-r5 hosts that have checked in, in fact.
Apr 07 10:26:40 bld-lion-r5-008 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 07 10:26:41 bld-lion-r5-008 1-mig_agent: starting (max time 600s)
Apr 07 10:26:42 bld-lion-r5-008 sudo:   cltbld : TTY=unknown ; PWD=/Users ; USER=root ; COMMAND=/sbin/mig-agent -m agent-checkin
Apr 07 10:26:42 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:47 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:48 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:49 bld-lion-r5-008 1-mig_agent: OK
Apr 07 10:26:49 bld-lion-r5-008 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"} 

So, it's running mig.  I also ran mig on this host by hand a few minutes ago, which I bet checked it in.  I'm assuming something is failing in mig when run from runner on this platform.  Do the 'nstat' messages mean anything to you, Julien?
dustin@euclid ~/tmp $  ./mig-agent-search "environment->>'os'='darwin'" | grep bld-lion-r5 | cut -d\; -f 1 | sort
"bld-lion-r5-007.try.releng.scl3.mozilla.com"; "2015-04-08T17:28:45Z"
"bld-lion-r5-008.try.releng.scl3.mozilla.com"; "2015-04-08T16:47:31Z"
"bld-lion-r5-015.try.releng.scl3.mozilla.com"; "2015-04-08T16:44:17Z"
"bld-lion-r5-050.build.releng.scl3.mozilla.com"; "2015-04-08T16:09:25Z"
"bld-lion-r5-051.build.releng.scl3.mozilla.com"; "2015-04-08T16:57:42Z"
"bld-lion-r5-053.build.releng.scl3.mozilla.com"; "2015-04-08T16:42:28Z"
"bld-lion-r5-055.build.releng.scl3.mozilla.com"; "2015-04-08T16:21:06Z"
"bld-lion-r5-057.build.releng.scl3.mozilla.com"; "2015-04-08T16:29:00Z"
"bld-lion-r5-061.build.releng.scl3.mozilla.com"; "2015-04-08T16:46:05Z"
"bld-lion-r5-065.build.releng.scl3.mozilla.com"; "2015-04-08T17:09:50Z"
"bld-lion-r5-068.build.releng.scl3.mozilla.com"; "2015-04-08T17:20:42Z"
"bld-lion-r5-070.build.releng.scl3.mozilla.com"; "2015-04-08T17:31:15Z"
"bld-lion-r5-071.build.releng.scl3.mozilla.com"; "2015-04-08T17:42:34Z"
"bld-lion-r5-072.build.releng.scl3.mozilla.com"; "2015-04-08T17:14:24Z"
"bld-lion-r5-076.build.releng.scl3.mozilla.com"; "2015-04-08T17:14:28Z"
"bld-lion-r5-080.build.releng.scl3.mozilla.com"; "2015-04-08T17:08:28Z"
"bld-lion-r5-082.build.releng.scl3.mozilla.com"; "2015-04-08T16:58:50Z"
"bld-lion-r5-083.build.releng.scl3.mozilla.com"; "2015-04-08T17:29:21Z"
"bld-lion-r5-085.build.releng.scl3.mozilla.com"; "2015-04-08T17:31:22Z"
"bld-lion-r5-092.build.releng.scl3.mozilla.com"; "2015-04-08T17:03:59Z"

So for about an hour yesterday, lion hosts could talk to mig.  ???!?
Sorry, today, about an hour ago.  I didn't change anything related during that time.
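
To spot hosts that have gone quiet in a listing like the one above, a filter along these lines might help. It is a sketch that assumes the `"name"; "timestamp"` layout shown in that output and relies on fixed-width ISO 8601 UTC timestamps comparing correctly as plain strings; the function name is made up.

```shell
#!/bin/bash
# Sketch only. Reads lines of the form
#   "host.example"; "2015-04-08T17:28:45Z"
# on stdin and prints hosts whose last check-in is before the cutoff.
stale_hosts() {
    local cutoff="$1"
    awk -F'; ' -v cutoff="$cutoff" '
        {
            # Strip the surrounding double quotes from both fields.
            gsub(/"/, "", $1)
            gsub(/"/, "", $2)
            # Same-format UTC timestamps sort lexicographically.
            if ($2 < cutoff) print $1
        }'
}
```

For example, piping the unfiltered search output through `stale_hosts 2015-04-08T17:00:00Z` would list every host that last checked in before 17:00 UTC.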
Just to pick one that's not on the list above:

[root@bld-lion-r5-095.try.releng.scl3.mozilla.com ~]# uptime
11:18  up 40 mins, 2 users, load averages: 7.81 6.13 3.29

Apr 08 10:41:17 bld-lion-r5-095 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 08 10:41:18 bld-lion-r5-095 1-mig_agent: starting (max time 600s)
Apr 08 10:41:18 bld-lion-r5-095 sudo:   cltbld : TTY=unknown ; PWD=/Users ; USER=root ; COMMAND=/sbin/mig-agent -m agent-checkin
Apr 08 10:41:23 bld-lion-r5-095 kernel: nstat_lookup_entry failed: 2
Apr 08 10:41:24 bld-lion-r5-095 kernel: nstat_lookup_entry failed: 2
Apr 08 10:41:25 bld-lion-r5-095 1-mig_agent: OK
Apr 08 10:41:25 bld-lion-r5-095 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"} 

So, somehow that mig agent didn't check in -- but others have??
I just approved 35 bld-lion hosts that had checked into the scheduler. Now that they are approved, they will show up in the mig-agent-search next time they check in.
MIG currently sees 75 bld-lion hosts. Most of them seem to check in every hour or so, and then there's this:

                     name                      |         checkin time          
-----------------------------------------------+-------------------------------
 bld-lion-r5-088.build.releng.scl3.mozilla.com | 2015-04-09 03:51:17.226116+00
 bld-lion-r5-088.build.releng.scl3.mozilla.com | 2015-04-09 10:56:11.958589+00

That host only checked in twice today, about 7 hours apart. I don't know if we run build jobs that last for 7 hours, or if this is a potential issue with MIG/Runner missing checkins. Morgan, any thoughts?
Flags: needinfo?(winter2718)
In the logs on the machine I'm seeing that mig only ran twice. Instead of running on a set schedule, mig only runs before a buildbot job starts, so if a host takes a long time to pick up a job, this sort of thing will happen. From /var/tmp/runner.err:

2015-04-09 03:56:03,358 - INFO - iteration 1
2015-04-09 03:56:03,377 - DEBUG - tasks: ['0-darwin_clean_buildbot', '1-cleanslate', '1-mig_agent', '4-buildbot.py', '99-post_flight']
2015-04-09 03:56:03,378 - DEBUG - Updating env with {'HG_SHARE_BASE_DIR': '/builds/hg-shared', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11', 'RUNNER_CONFIG_CMD': '/opt/runner/bin/python2.7 /opt/runner/bin/runner -c /opt/runner/runner.cfg', 'TWISTD_LOG_PATH': '/builds/slave/twistd.log', 'GIT_SHARE_BASE_DIR': '/builds/git-shared'}
2015-04-09 03:56:03,378 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-darwin_clean_buildbot", "result": "RUNNING"}
2015-04-09 03:56:04,381 - DEBUG - 0-darwin_clean_buildbot: starting (max time 600s)
2015-04-09 03:56:05,385 - DEBUG - 0-darwin_clean_buildbot: OK
2015-04-09 03:56:05,386 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-darwin_clean_buildbot", "result": "OK"}
2015-04-09 03:56:06,389 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "RUNNING"}
2015-04-09 03:56:07,393 - DEBUG - 1-cleanslate: starting (max time 600s)
2015-04-09 03:56:07,441 - DEBUG - No saved process list found, creating one at /var/tmp/cleanslate
2015-04-09 03:56:08,396 - DEBUG - 1-cleanslate: OK
2015-04-09 03:56:08,397 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "OK"}
2015-04-09 03:56:09,400 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
2015-04-09 03:56:10,403 - DEBUG - 1-mig_agent: starting (max time 600s)
[info] Using external conf from /etc/mig/mig-agent.cfg
2015-04-09 03:56:17,413 - DEBUG - 1-mig_agent: OK
2015-04-09 03:56:17,414 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"}
2015-04-09 03:56:19,418 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "4-buildbot.py", "result": "RUNNING"}
2015-04-09 03:56:20,422 - DEBUG - 4-buildbot.py: starting (max time 600s)
Error sending notice to nagios (ignored)
Flags: needinfo?(winter2718)
Ha, that's interesting. I somehow assumed that all buildbots are busy 100% of the time, and missed the case where an unused buildbot would just not be running mig.

MacOS support is working as expected, so I'm going to resolve this bug. Thanks for the help.

As a somewhat unrelated note, could we run MIG in daemon mode on those hosts, and use a pre-job hook to shut down the agent, then restart it with a post-job hook?
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Operations Security (OpSec): MIG → MIG
Product: mozilla.org → Enterprise Information Security
Product: Enterprise Information Security → Enterprise Information Security Graveyard