[mig agent] MacOS DMG packaging

RESOLVED FIXED

Status

Enterprise Information Security
MIG
3 years ago
2 years ago

People

(Reporter: ulfr, Assigned: dustin)

Attachments

(3 attachments, 2 obsolete attachments)

(Reporter)

Description

3 years ago
MIG Agent needs proper packaging for MacOS.
After some testing, it seems that the best approach is to build the agent on MacOS directly. It would be nice to cross-compile from Linux, but FPM's PKG generation and hdiutil (used for DMG creation) are only available on darwin. We should investigate alternatives in the future, but that's not a priority.

The following method works:

1. Compile the agent with OS=darwin
2. Using FPM, create a PKG file
3. Using hdiutil, create a DMG file that contains only the PKG file
4. To install, mount the DMG file, execute the installer command on the PKG file, and unmount the DMG file
Any install script contained in the PKG file will be executed by the installer.
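
Steps 2 and 3 could be scripted roughly as follows. This is a sketch and not the attached Makefile entry: the staging directory layout, the `DRY_RUN` switch, the output file names and the function names are all assumptions made for illustration. It relies on fpm's `osxpkg` output type and on `hdiutil create`, so it only does real work on a Mac.

```shell
# Sketch: wrap a staged mig-agent tree into a PKG, then into a DMG.
# Assumption (not from the bug): STAGING holds the files laid out as they
# should appear under /, for example $STAGING/sbin/mig-agent.

VOLNAME="Mozilla InvestiGator Agent"

# Run a command, or only print it when DRY_RUN=1 (useful off-Mac).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_pkg VERSION STAGING OUTDIR -> OUTDIR/mig-agent-VERSION.pkg
build_pkg() {
    run fpm -s dir -t osxpkg -n mig-agent -v "$1" \
        -p "$3/mig-agent-$1.pkg" -C "$2" .
}

# build_dmg VERSION OUTDIR -> OUTDIR/mig-agent-VERSION.dmg holding only the PKG
build_dmg() {
    dmgsrc="$2/dmgsrc"
    run mkdir -p "$dmgsrc"
    run cp "$2/mig-agent-$1.pkg" "$dmgsrc/"
    run hdiutil create -volname "$VOLNAME" -srcfolder "$dmgsrc" \
        -ov -format UDZO "$2/mig-agent-$1.dmg"
}
```

Running `DRY_RUN=1 build_pkg 20141229+da443e6.dev /tmp/stage /tmp/out` prints the fpm command line without executing it, which makes it easy to review the flags on a Linux box before trying them on darwin.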

Sample run:

$ hdiutil mount /tmp/mig-agent-20141229+da443e6.dev-x86_64.dmg
/dev/disk1          	Apple_partition_scheme         	
/dev/disk1s1        	Apple_partition_map            	
/dev/disk1s2        	Apple_HFS                      	/Volumes/Mozilla InvestiGator Agent

$ ls /Volumes/Mozilla\ InvestiGator\ Agent/
mig-agent-20141229+da443e6.dev-x86_64.pkg

$ sudo installer -package /Volumes/Mozilla\ InvestiGator\ Agent/mig-agent-20141229+da443e6.dev-x86_64.pkg -target /
Password:
installer: Package name is mig-agent-20141229+da443e6.dev-x86_64
installer: Upgrading at base path /
installer: The upgrade was successful.

$ hdiutil unmount /Volumes/Mozilla\ InvestiGator\ Agent/
"/Volumes/Mozilla InvestiGator Agent/" unmounted successfully.

$ sudo /sbin/mig-agent -q=pid
6161

Makefile entry is in attachment for feedback.
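
The manual install from the sample run can be wrapped in a small function so that the image is unmounted even when the installer fails. This is only a sketch: the function name and its two arguments are hypothetical, and the caller has to pass the volume path that hdiutil reports for the DMG.

```shell
# Sketch: install the PKG shipped inside a DMG, always unmounting the image.
# install_dmg DMG_PATH MOUNT_POINT
# MOUNT_POINT is the volume path reported by hdiutil, for example
# "/Volumes/Mozilla InvestiGator Agent".
install_dmg() {
    dmg=$1
    vol=$2
    hdiutil mount "$dmg" || return 1
    # Install every PKG found on the volume; remember any failure.
    status=0
    for pkg in "$vol"/*.pkg; do
        sudo installer -package "$pkg" -target / || status=1
    done
    # Unmount regardless of how the install went.
    hdiutil unmount "$vol" || status=1
    return "$status"
}
```

For the sample above, the call would be `install_dmg /tmp/mig-agent-20141229+da443e6.dev-x86_64.dmg "/Volumes/Mozilla InvestiGator Agent"`.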
(Reporter)

Comment 1

3 years ago
Created attachment 8542217 [details]
Makefile entry to build macos DMG of mig-agent
Attachment #8542217 - Flags: feedback?(dustin)
(Assignee)

Updated

3 years ago
Attachment #8542217 - Flags: feedback?(dustin) → feedback+
(Reporter)

Updated

3 years ago
Component: Operations Security (OpSec): General → Operations Security (OpSec): MIG
(Reporter)

Updated

3 years ago
Blocks: 1116685
No longer blocks: 896480
(Reporter)

Updated

3 years ago
Blocks: 1143250
(Reporter)

Updated

3 years ago
Blocks: 1149503
(Reporter)

Comment 2

3 years ago
Retested the procedure today with latest mig-agent, manual install works fine.

    $ hdiutil mount ~/Code/build_mig/packages/opsec/mig-agent-20150402+b05e40b.prod-x86_64.dmg 
    /dev/disk1          	Apple_partition_scheme         	
    /dev/disk1s1        	Apple_partition_map            	
    /dev/disk1s2        	Apple_HFS                      	/Volumes/Mozilla InvestiGator Agent
    $ sudo installer -package /Volumes/Mozilla\ InvestiGator\ Agent/mig-agent-20150402+b05e40b.prod-x86_64.pkg -target /
    installer: Package name is mig-agent-20150402+b05e40b.prod-x86_64
    installer: Upgrading at base path /
    installer: The upgrade was successful.
    $ hdiutil unmount /Volumes/Mozilla\ InvestiGator\ Agent/
    "/Volumes/Mozilla InvestiGator Agent/" unmounted successfully.
    $ sudo /sbin/mig-agent
    [info] using builtin conf
    $ sudo /sbin/mig-agent -q=pid
    8759
    $ sudo /sbin/mig-agent -V
    20150402+b05e40b.prod
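
The final version check could be automated along these lines. It is a minimal sketch: `verify_agent` and the `AGENT` override are hypothetical names, and the only agent behaviour it relies on is that `-V` prints the version string, as in the session above. On a real host it would need to run as root or through sudo.

```shell
# Sketch: check that the installed agent reports the expected version.
# AGENT defaults to the path used in this bug; override it for testing.
AGENT=${AGENT:-/sbin/mig-agent}

# verify_agent EXPECTED_VERSION -> exit 0 if "$AGENT -V" prints exactly that
verify_agent() {
    actual=$("$AGENT" -V) || return 1
    if [ "$actual" != "$1" ]; then
        echo "version mismatch: got '$actual', want '$1'" >&2
        return 1
    fi
}
```

Here `verify_agent 20150402+b05e40b.prod` would succeed on the host from the retest and fail on a host still running an older build.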

The configuration will be identical to the one already puppeted for linux, so the only change needed is to deploy & install the DMG. How should this be done? I see some dmg install scripts in puppet that curl packages from various places. Should I host the DMG on an S3 bucket operated by opsec, or is there a more standard way to deploy it?
(Assignee)

Comment 3

3 years ago
Sweet!!

The DMGs end up in /data/repos/DMGs, and are installed with the packages::pkgdmg define, e.g.,

class packages::mozilla::mig_agent {
    case $::operatingsystem {
        ...
        Darwin: {
            packages::pkgdmg {
                mig-agent:  # must match the DMG base name
                    version => "20150402+b05e40b.prod-x86_64",
                    os_version_specific => false;  # same binary works on multiple versions
            }
        }
        ...
    }
}
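
Since the resource title and version have to line up with the DMG file name, a quick pre-flight check before committing the Puppet change might look like this. It is a sketch that assumes a `<name>-<version>.dmg` naming convention in /data/repos/DMGs; how packages::pkgdmg actually resolves the file is not shown in this bug, and `dmg_path`, `check_dmg` and `DMG_REPO` are hypothetical names.

```shell
# Sketch: derive the DMG file a (name, version) pair should point at,
# assuming the "<name>-<version>.dmg" convention.
DMG_REPO=${DMG_REPO:-/data/repos/DMGs}

# dmg_path NAME VERSION -> prints the expected file path
dmg_path() {
    echo "$DMG_REPO/$1-$2.dmg"
}

# check_dmg NAME VERSION -> exit 0 if that file is present in the repo dir
check_dmg() {
    [ -f "$(dmg_path "$1" "$2")" ]
}
```

With the values from the class above, `dmg_path mig-agent 20150402+b05e40b.prod-x86_64` prints the path the uploaded file is expected to have.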
(Reporter)

Comment 4

3 years ago
Ok, I scp-ed the package into /data/repos/DMGs/mig-agent-20150402+b05e40b.prod-x86_64.dmg
Do I need to run a command to update the repository?

I'll send a patch, but it will need the config file change from bug 1149639 to be merged first.
(Assignee)

Comment 5

3 years ago
Nope, OS X doesn't have "repositories" :(

So you should be good to go.
(Reporter)

Comment 6

3 years ago
Created attachment 8587412 [details] [diff] [review]
mig darwin deployment for build-puppet

The attached patch provides the base support for darwin packages in the mig modules. It requires merging https://bug1149639.bugzilla.mozilla.org/attachment.cgi?id=8587388 first because of the dependency on the `api` config parameter.

I have not assigned the class mig::agent::daemon to any hosts yet. I'm assuming we want to test it on a single host first and see if it works as expected.

Dustin: can you perform the test deploy, and I'll help verify it?
Attachment #8587412 - Flags: review?(dustin)
(Reporter)

Comment 7

3 years ago
Created attachment 8587424 [details] [diff] [review]
deploy macos agent 20150402+1c880e7.prod-x86_64
Attachment #8587412 - Attachment is obsolete: true
Attachment #8587412 - Flags: review?(dustin)
(Assignee)

Updated

3 years ago
Attachment #8587424 - Flags: review+
(Assignee)

Comment 9

3 years ago
Created attachment 8587470 [details] [diff] [review]
bug1116210-deploy.patch

This actually deploys the patch.  I tried this on a buildslave and a server with no errors, although the buildslave hasn't rebooted yet (bld-lion-r5-007.try.releng.scl3.mozilla.com).

I had to add anchors here so that the restart could depend on pkgdmg on OS X.

Arr, given the risk of outage, is it OK to deploy this?
Attachment #8587470 - Flags: review?(jvehent)
Attachment #8587470 - Flags: feedback?(arich)
Comment on attachment 8587470 [details] [diff] [review]
bug1116210-deploy.patch

I'd install this on one or two clients of each type (10.6, 10.7, and 10.10) and put them through a reboot cycle before rolling it out site-wide, to make sure we aren't going to run into issues where machines hang.
Attachment #8587470 - Flags: feedback?(arich) → feedback+
(Assignee)

Updated

3 years ago
Assignee: jvehent → dustin
(Assignee)

Comment 11

3 years ago
bld-lion-r5-007.try.releng.scl3.mozilla.com
mac-signing2.srv.releng.scl3.mozilla.com
t-yosemite-r5-0009.test.releng.scl3.mozilla.com
(Reporter)

Comment 12

3 years ago
I can't see bld-lion-r5 or t-yosemite-r5. The other two are showing, plus my own:

mig=> select name from agents where status='online' and environment->>'os'='darwin';
                   name                   
------------------------------------------
 install.build.releng.scl3.mozilla.com
 mac-signing2.srv.releng.scl3.mozilla.com
 Juliens-Mac-mini.local
(3 rows)
(Assignee)

Comment 13

3 years ago
Morgan, are there dashboards and things I can look at to see what's going on with the mig runner task on those platforms?

Also, how cool is it that this (almost) just runs on OS X without any additional futzing with startup scripts?
Flags: needinfo?(winter2718)
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Morgan, are there dashboards and things I can look at to see what's going on
> with the mig runner task on those platforms?
> 
> Also, how cool is it that this (almost) just runs on OS X without any
> additional futzing with startup scripts?

Quite cool :) You can use the runner dashboards. If you need another retry dashboard that's separated by platform I can make one in a jiffy. https://stats.taskcluster.net/grafana/#/dashboard/db/runner
Flags: needinfo?(winter2718)
(Assignee)

Comment 15

3 years ago
So, running by hand:

[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# cat /opt/runner/tasks.d/1-mig_agent 
#!/bin/bash
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

# run mig-agent in checkin mode
/sbin/mig-agent -m agent-checkin || true
[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# /sbin/mig-agent -m agent-checkin
[info] Using external conf from /etc/mig/mig-agent.cfg
[root@bld-lion-r5-007.try.releng.scl3.mozilla.com ~]# echo $?
0

So it looks like it's running.  Julien, do you see this host now?
(Reporter)

Comment 16

3 years ago
Yep. bld-lion-r5-007.try.releng.scl3.mozilla.com just showed up.
(Assignee)

Comment 17

3 years ago
Apr 06 06:52:22 bld-lion-r5-007 1-mig_agent: starting (max time 600s)
Apr 06 06:52:23 bld-lion-r5-007 1-mig_agent: OK 

Can you see if it re-checked-in at that time?  If so, what's the latest status of t-yosemite?
(Assignee)

Comment 18

3 years ago
Yosemite appears not to be logging to papertrail.  In its system.log I only see this about runner:

[root@t-yosemite-r5-0009.test.releng.scl3.mozilla.com ~]# grep runner /var/log/system.log
Apr  6 06:44:15 t-yosemite-r5-0009 com.apple.xpc.launchd[1] (com.mozilla.runner): This key does not do anything: OnDemand

Yet Buildbot is running, so I'm guessing runner ran.
(Reporter)

Comment 19

3 years ago
Negative. The only hit I have is at 13:33 UTC, so 06:33 PDT.
Can you get syslog from that box? The agent sends INFO logs into the DAEMON facility.
(Assignee)

Comment 20

3 years ago
I think there's something wrong with logging on OS X.

There's nothing for runner or mig in /var/log/system.log.  On papertrail, there's nothing whatsoever for t-yosemite-*, and for bld-lion-r5-007, the lines matching 'mig' are all from runner:

Apr 06 06:52:15 bld-lion-r5-007 tasks: ['0-darwin_clean_buildbot', '1-cleanslate', '1-mig_agent', '4-buildbot.py', '99-post_flight']
Apr 06 06:52:21 bld-lion-r5-007 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 06 06:52:22 bld-lion-r5-007 1-mig_agent: starting (max time 600s)
Apr 06 06:52:23 bld-lion-r5-007 1-mig_agent: OK
Apr 06 06:52:23 bld-lion-r5-007 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"}
(Assignee)

Comment 21

3 years ago
OK, I was wrong about yosemite, but all I see is logs of runner running the task.
(Reporter)

Comment 22

3 years ago
I'd suggest changing the config in /etc/mig/mig-agent.cfg to log to a file instead of syslog:

[logging]
    mode    = "file"
    level   = "debug"
    file    = "/tmp/mig_agent.log"
(Assignee)

Comment 23

3 years ago
Morgan reminded me that on OS X, runner runs as cltbld, not root, and mig needs to be run as root.

Julien suggested something like

  if [ "$UID" != 0 ]; then PREFIX="sudo"; fi; $PREFIX /sbin/mig-agent -m agent-checkin

with /sbin/mig-agent added to sudoers.  That's probably the easiest course to making this work.
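
A slightly more defensive version of that one-liner is sketched below. It uses `id -u` instead of `$UID`, which is not set by every shell, and keeps the check-in from ever failing the runner task. The sudoers rule in the comment is an assumed form; the exact rule used in the patch is not shown in this bug, and the function names are hypothetical.

```shell
# Sketch of the runner task: check in as root, through sudo when needed.
# Assumed sudoers rule (the exact form in the patch may differ):
#   cltbld ALL=(root) NOPASSWD: /sbin/mig-agent

# checkin_cmd USER_ID -> prints the command line to use for that user id
checkin_cmd() {
    if [ "$1" -eq 0 ]; then
        echo "/sbin/mig-agent -m agent-checkin"
    else
        echo "sudo /sbin/mig-agent -m agent-checkin"
    fi
}

# run_checkin: run the check-in but never fail the task,
# like the "|| true" in the existing task script.
run_checkin() {
    $(checkin_cmd "$(id -u)") || true
}
```

Keeping the command selection in its own function makes it possible to check both branches without root access or an installed agent.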
(Assignee)

Comment 24

3 years ago
Created attachment 8588759 [details] [diff] [review]
bug1116210-sudo.patch

OK, this worked in our testing.  r? for this patch and attachment 8587470 [details] [diff] [review]?
Attachment #8588759 - Flags: review?(jvehent)
(Reporter)

Comment 25

3 years ago
Comment on attachment 8587470 [details] [diff] [review]
bug1116210-deploy.patch

Review of attachment 8587470 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
Attachment #8587470 - Flags: review?(jvehent) → review+
(Assignee)

Comment 26

3 years ago
Comment on attachment 8588759 [details] [diff] [review]
bug1116210-sudo.patch

I think the r+ on the previous was meant for this patch.
Attachment #8588759 - Flags: review?(jvehent) → review+
(Assignee)

Updated

3 years ago
Attachment #8587424 - Attachment is obsolete: true
(Assignee)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Comment 29

3 years ago
While I now have ~250 MacOS hosts checking in, it seems that bld-lion-r5-007 didn't rejoin the pool. Reopening to investigate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 30

3 years ago
There are only two bld-lion-r5 hosts that have checked in, in fact.
(Assignee)

Comment 31

3 years ago
Apr 07 10:26:40 bld-lion-r5-008 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 07 10:26:41 bld-lion-r5-008 1-mig_agent: starting (max time 600s)
Apr 07 10:26:42 bld-lion-r5-008 sudo:   cltbld : TTY=unknown ; PWD=/Users ; USER=root ; COMMAND=/sbin/mig-agent -m agent-checkin
Apr 07 10:26:42 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:47 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:48 bld-lion-r5-008 kernel: nstat_lookup_entry failed: 2
Apr 07 10:26:49 bld-lion-r5-008 1-mig_agent: OK
Apr 07 10:26:49 bld-lion-r5-008 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"} 

So, it's running mig.  I also ran mig on this host by hand a few minutes ago, which I bet checked it in.  I'm assuming something is failing in mig when run from runner on this platform.  Do the 'nstat' messages mean anything to you, Julien?
(Assignee)

Comment 32

3 years ago
dustin@euclid ~/tmp $  ./mig-agent-search "environment->>'os'='darwin'" | grep bld-lion-r5 | cut -d\; -f 1 | sort
"bld-lion-r5-007.try.releng.scl3.mozilla.com"; "2015-04-08T17:28:45Z"
"bld-lion-r5-008.try.releng.scl3.mozilla.com"; "2015-04-08T16:47:31Z"
"bld-lion-r5-015.try.releng.scl3.mozilla.com"; "2015-04-08T16:44:17Z"
"bld-lion-r5-050.build.releng.scl3.mozilla.com"; "2015-04-08T16:09:25Z"
"bld-lion-r5-051.build.releng.scl3.mozilla.com"; "2015-04-08T16:57:42Z"
"bld-lion-r5-053.build.releng.scl3.mozilla.com"; "2015-04-08T16:42:28Z"
"bld-lion-r5-055.build.releng.scl3.mozilla.com"; "2015-04-08T16:21:06Z"
"bld-lion-r5-057.build.releng.scl3.mozilla.com"; "2015-04-08T16:29:00Z"
"bld-lion-r5-061.build.releng.scl3.mozilla.com"; "2015-04-08T16:46:05Z"
"bld-lion-r5-065.build.releng.scl3.mozilla.com"; "2015-04-08T17:09:50Z"
"bld-lion-r5-068.build.releng.scl3.mozilla.com"; "2015-04-08T17:20:42Z"
"bld-lion-r5-070.build.releng.scl3.mozilla.com"; "2015-04-08T17:31:15Z"
"bld-lion-r5-071.build.releng.scl3.mozilla.com"; "2015-04-08T17:42:34Z"
"bld-lion-r5-072.build.releng.scl3.mozilla.com"; "2015-04-08T17:14:24Z"
"bld-lion-r5-076.build.releng.scl3.mozilla.com"; "2015-04-08T17:14:28Z"
"bld-lion-r5-080.build.releng.scl3.mozilla.com"; "2015-04-08T17:08:28Z"
"bld-lion-r5-082.build.releng.scl3.mozilla.com"; "2015-04-08T16:58:50Z"
"bld-lion-r5-083.build.releng.scl3.mozilla.com"; "2015-04-08T17:29:21Z"
"bld-lion-r5-085.build.releng.scl3.mozilla.com"; "2015-04-08T17:31:22Z"
"bld-lion-r5-092.build.releng.scl3.mozilla.com"; "2015-04-08T17:03:59Z"

So for about an hour yesterday, lion hosts could talk to mig.  ???!?
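
To see which hosts are absent from a listing like this, the search output can be compared against an inventory. The sketch below assumes a hypothetical file with one expected hostname per line, and search output saved in the `"host"; "timestamp"` format shown above; `missing_hosts` is an illustrative name, not part of MIG.

```shell
# Sketch: list expected hosts that are absent from mig-agent-search output.
# missing_hosts EXPECTED_FILE SEARCH_OUTPUT_FILE
# EXPECTED_FILE: one hostname per line (hypothetical inventory file).
# SEARCH_OUTPUT_FILE: lines like  "host"; "2015-04-08T17:28:45Z"
missing_hosts() {
    seen=$(mktemp)
    # Keep the first ;-separated field and strip the double quotes.
    cut -d';' -f1 "$2" | tr -d '"' | sort -u > "$seen"
    # comm -23 prints lines that appear only in the first sorted input.
    sort -u "$1" | comm -23 - "$seen"
    rm -f "$seen"
}
```

Both inputs are sorted with the same `sort` before `comm` runs, so the comparison does not depend on the order of either file.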
(Assignee)

Comment 33

3 years ago
Sorry, today, about an hour ago.  I didn't change anything related during that time.
(Assignee)

Comment 34

3 years ago
Just to pick one that's not on the list above:

[root@bld-lion-r5-095.try.releng.scl3.mozilla.com ~]# uptime
11:18  up 40 mins, 2 users, load averages: 7.81 6.13 3.29

Apr 08 10:41:17 bld-lion-r5-095 running: pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Apr 08 10:41:18 bld-lion-r5-095 1-mig_agent: starting (max time 600s)
Apr 08 10:41:18 bld-lion-r5-095 sudo:   cltbld : TTY=unknown ; PWD=/Users ; USER=root ; COMMAND=/sbin/mig-agent -m agent-checkin
Apr 08 10:41:23 bld-lion-r5-095 kernel: nstat_lookup_entry failed: 2
Apr 08 10:41:24 bld-lion-r5-095 kernel: nstat_lookup_entry failed: 2
Apr 08 10:41:25 bld-lion-r5-095 1-mig_agent: OK
Apr 08 10:41:25 bld-lion-r5-095 running: post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"} 

So, somehow that mig agent didn't check in -- but others have??
(Reporter)

Comment 35

3 years ago
I just approved 35 bld-lion hosts that had checked into the scheduler. Now that they are approved, they will show up in the mig-agent-search next time they check in.
(Reporter)

Comment 36

3 years ago
MIG currently sees 75 bld-lion hosts. Most of them seem to check in every hour or so, and then there's this:

                     name                      |         checkin time          
-----------------------------------------------+-------------------------------
 bld-lion-r5-088.build.releng.scl3.mozilla.com | 2015-04-09 03:51:17.226116+00
 bld-lion-r5-088.build.releng.scl3.mozilla.com | 2015-04-09 10:56:11.958589+00

That host only checked in twice today, 7 hours apart. I don't know if we run build jobs that last for 7 hours, or if this is a potential issue with MIG/Runner missing checkins. Morgan, any thoughts?
Flags: needinfo?(winter2718)
In the logs on the machine I'm seeing that mig only ran twice. Instead of running on a set schedule, mig only runs before a buildbot job starts, so if it takes a long time to pick up a job this sort of thing will happen. From /var/tmp/runner.err:

2015-04-09 03:56:03,358 - INFO - iteration 1
2015-04-09 03:56:03,377 - DEBUG - tasks: ['0-darwin_clean_buildbot', '1-cleanslate', '1-mig_agent', '4-buildbot.py', '99-post_flight']
2015-04-09 03:56:03,378 - DEBUG - Updating env with {'HG_SHARE_BASE_DIR': '/builds/hg-shared', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11', 'RUNNER_CONFIG_CMD': '/opt/runner/bin/python2.7 /opt/runner/bin/runner -c /opt/runner/runner.cfg', 'TWISTD_LOG_PATH': '/builds/slave/twistd.log', 'GIT_SHARE_BASE_DIR': '/builds/git-shared'}
2015-04-09 03:56:03,378 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-darwin_clean_buildbot", "result": "RUNNING"}
2015-04-09 03:56:04,381 - DEBUG - 0-darwin_clean_buildbot: starting (max time 600s)
2015-04-09 03:56:05,385 - DEBUG - 0-darwin_clean_buildbot: OK
2015-04-09 03:56:05,386 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-darwin_clean_buildbot", "result": "OK"}
2015-04-09 03:56:06,389 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "RUNNING"}
2015-04-09 03:56:07,393 - DEBUG - 1-cleanslate: starting (max time 600s)
2015-04-09 03:56:07,441 - DEBUG - No saved process list found, creating one at /var/tmp/cleanslate
2015-04-09 03:56:08,396 - DEBUG - 1-cleanslate: OK
2015-04-09 03:56:08,397 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "OK"}
2015-04-09 03:56:09,400 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
2015-04-09 03:56:10,403 - DEBUG - 1-mig_agent: starting (max time 600s)
[info] Using external conf from /etc/mig/mig-agent.cfg
2015-04-09 03:56:17,413 - DEBUG - 1-mig_agent: OK
2015-04-09 03:56:17,414 - DEBUG - running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "OK"}
2015-04-09 03:56:19,418 - DEBUG - running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "4-buildbot.py", "result": "RUNNING"}
2015-04-09 03:56:20,422 - DEBUG - 4-buildbot.py: starting (max time 600s)
Error sending notice to nagios (ignored)
Flags: needinfo?(winter2718)
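
Counting how often runner actually completed the mig task on a host can be done straight from that log. This is a small sketch that assumes the line format shown above, where a successful run is logged as `1-mig_agent: OK` after a 19-character timestamp; `mig_runs` is a hypothetical helper name.

```shell
# Sketch: list the times at which runner completed the mig check-in task.
# mig_runs LOGFILE -> prints "YYYY-MM-DD HH:MM:SS" for each successful run
mig_runs() {
    # -F matches the literal string; the hook lines use a different
    # format ("task": "1-mig_agent") and are therefore not counted.
    grep -F '1-mig_agent: OK' "$1" | cut -c1-19
}
```

Piping the result through `wc -l`, as in `mig_runs /var/tmp/runner.err | wc -l`, gives the number of check-in attempts, which can then be compared with the check-ins recorded by the scheduler.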
(Reporter)

Comment 38

3 years ago
Ha, that's interesting. I somehow assumed that all buildbots are busy 100% of the time, and missed the case where an unused buildbot would just not be running mig.

MacOS support is working as expected, so I'm going to resolve this bug. Thanks for the help.

As a somewhat unrelated note, could we run MIG in daemon mode on those hosts, and use a pre-job hook to shut down the agent, then restart it with a post-job hook?
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: Operations Security (OpSec): MIG → MIG
Product: mozilla.org → Enterprise Information Security