Closed Bug 891880 Opened 11 years ago Closed 11 years ago

Support 10.7 Talos with PuppetAgain

Categories

(Infrastructure & Operations :: RelOps: Puppet, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: coop)

Attachments

(2 files)

      No description provided.
Depends on: 882869
Attached patch bug891880.patch — — Splinter Review
This runs puppet just fine on (temporarily confusingly named!) bld-lion-r5-003.  runslave.py starts, although the slave is disabled so that's the end of that.  Apache runs and points to the right directory.

I'd like to land this and re-image one or two existing talos systems to test them out.  There are a few minor worries:

 * old puppet installs pyyaml, but I don't see it used
 * old puppet claims to install an xcode named, oddly, "python-sdk".  But gcc isn't installed on the machines, and Python-2.7 is.
 * The version of Java being installed here is the same as on mountain lion, and seems to install just fine

I don't remember discussing who would do the review/testing of these silos, so feel free to punt the r? to someone else, coop.
Attachment #774677 - Flags: review?(coop)
Component: Server Operations: RelEng → RelOps: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → dustin
Coop?
Comment on attachment 774677 [details] [diff] [review]
bug891880.patch

Review of attachment 774677 [details] [diff] [review]:
-----------------------------------------------------------------

Patch looks good. I probably won't have time to test the re-imaged nodes myself. I may punt that to Callek, but I'll find someone.
Attachment #774677 - Flags: review?(coop) → review+
I'll put this on
  talos-r4-lion-006
  talos-r4-lion-035
  talos-r4-lion-036
which slave-health currently indicates are waiting for a sane bringup procedure.
Comment on attachment 774677 [details] [diff] [review]
bug891880.patch

https://hg.mozilla.org/build/puppet/rev/11ada8651600

landed with a typo fixed, after verifying no changes on mtnlion talos.
Attachment #774677 - Flags: checked-in+
I reimaged a machine with a new DS workflow to bring it up with this config, and it *mostly* worked.
 - cltbld did not autologin after the puppetize restart
 - lots of
Jul 16 12:48:51 talos-r4-lion-006 ubd[1238]: _openLogFile: open("/Users/cltbld/Library/Logs/Ubiquity/cltbld/ubiquity.log") errno = 13 (Permission denied)
Jul 16 12:48:51: --- last message repeated 3 times ---
Jul 16 12:48:51 talos-r4-lion-006 com.apple.ubd[1238]: [ERROR]       abf29f1a27 [13/07/16 12:48:51.121]  1238.main ubd_main:1916 failed to mkdir "/Users/cltbld/Library/Application Support/Ubiquity" (Permission denied)
Jul 16 12:48:51 talos-r4-lion-006 com.apple.ubd[1238]: [ERROR]       abf2a37810 [13/07/16 12:48:51.121]  1238.main copy_mme_bag:156 copyPreferredMobileMeName failed
Jul 16 12:48:51 talos-r4-lion-006 com.apple.ubd[1238]: [ERROR]       abf2a44c9d [13/07/16 12:48:51.122]  1238.main ubd_main:2030 null personid
Jul 16 12:48:51 talos-r4-lion-006 com.apple.ubd[1238]: [ERROR]       abf2a5d3d0 [13/07/16 12:48:51.122]  1238.main get_uuid_and_data_dir:567 failed to mkdir "/Users/cltbld/Library/Application Support/Ubiquity" (Permission denied)
Jul 16 12:48:51 talos-r4-lion-006 com.apple.launchd.peruser.28[197] (com.apple.ubd[1238]): Exited with code: 254
Jul 16 12:48:51 talos-r4-lion-006 com.apple.launchd.peruser.28[197] (com.apple.ubd): Throttling respawn: Will start in 10 seconds

(Ubiquity is better known as iCloud)

 - perms seem wrong:

[root@talos-r4-lion-006.build.scl1.mozilla.com ~]# ls -al /Users/cltbld/
total 64
drwxr-xr-x+ 22 cltbld  staff   748 Jul 16 11:27 .
drwxr-xr-x   5 root    admin   170 Feb  2  2012 ..
-rw-r--r--   1 501     staff     3 Feb  2  2012 .CFUserTextEncoding
drwx------   2 501     staff    68 Jul 16 11:24 .Trash
-rw-------   1 501     staff   281 Apr 27  2012 .bash_history
-rw-r--r--   1 cltbld  staff   256 Jul 16 11:27 .bashrc
-rw-r--r--   1 cltbld  staff   272 Jul 16 11:27 .gitconfig
-rw-r--r--   1 cltbld  staff   385 Jul 16 11:27 .hgrc
drwxr-xr-x   3 cltbld  staff   102 Jul 16 12:38 .pip
-rw-r--r--   1 cltbld  staff  2643 Jul 16 11:27 .screenrc
drwx------   5 cltbld  staff   170 Jul 16 11:27 .ssh
-rw-------   1 root    staff  1468 Apr 26  2012 .viminfo
-rw-r--r--   1 cltbld  staff  1105 Jul 16 11:27 .vimrc
drwx------   4 501     staff   136 Feb  2  2012 Desktop
drwx------   4 501     staff   136 Feb  2  2012 Documents
drwx------   4 501     staff   136 Feb  2  2012 Downloads
drwx------@ 32 cltbld  staff  1088 Apr 27  2012 Library
drwx------   3 501     staff   102 Feb  2  2012 Movies
drwx------   3 501     staff   102 Feb  2  2012 Music
drwx------   4 501     staff   136 Feb  2  2012 Pictures
drwxr-xr-x   4 501     staff   136 Feb  2  2012 Public
drwxr-xr-x   3 501     staff   102 Feb  2  2012 Sites

I suspect that this is because cltbld already exists in the base image.  Will 'vestigate.
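For anyone checking other hosts for the same symptom, here is a minimal sketch that flags entries in `ls -al` output whose owner column is a bare numeric uid (which is what `ls` prints when no local account maps to that uid). The helper name `find_unmapped_owners` is hypothetical, and the sample is just a few lines copied from the listing above.

```python
# Sketch only: scan `ls -al` output for entries owned by a numeric uid.
# `find_unmapped_owners` is a hypothetical helper, not part of any existing tool.

def find_unmapped_owners(ls_output):
    """Return (owner, name) pairs for entries whose owner is a bare uid."""
    flagged = []
    for line in ls_output.splitlines():
        # Columns: mode, links, owner, group, size, month, day, time/year, name
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # skips the "total N" header and blank lines
        owner, name = fields[2], fields[8]
        if owner.isdigit():
            flagged.append((owner, name))
    return flagged


# A few lines copied from the listing in this comment.
SAMPLE = """\
total 64
drwxr-xr-x+ 22 cltbld  staff   748 Jul 16 11:27 .
-rw-r--r--   1 501     staff     3 Feb  2  2012 .CFUserTextEncoding
-rw-r--r--   1 cltbld  staff   256 Jul 16 11:27 .bashrc
drwx------   4 501     staff   136 Feb  2  2012 Desktop
"""

for owner, name in find_unmapped_owners(SAMPLE):
    print(owner, name)
```

This only reports the mismatch; it says nothing about which uid is the right one for cltbld.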
I was using the wrong base image.  Fixed now, and I'm reimaging again.
OK, -006 is reimaged and working, and 035 and 036 are in progress.  Coop, over to you to test them out and/or delegate.
Assignee: dustin → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Had a mix of red and green yesterday, with the failures stemming from timeouts AFAICT. Digging in to the results a bit more now to find out why.
I think I've found the source of the problem:

In /private/etc/apache2/other/talos.conf, Apache is looking for the talos-data at /builds/slave/talos-slave/talos-data/talos, but on Mac this should be /Users/cltbld/talos-slave/talos-data/talos.

Dustin: is this an easy fix?
Flags: needinfo?(dustin)
On PuppetAgain, everything's under /builds/slave, with talos (unfortunately) at /builds/slave/talos-slave.  So the basedir in slavealloc should get changed to point there.
Flags: needinfo?(dustin)
..you should see that the mtnlion hosts are configured this way.  I should have mentioned that when I handed them over -- sorry!
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> On PuppetAgain, everything's under /builds/slave, with talos (unfortunately)
> at /builds/slave/talos-slave.  So the basedir in slavealloc should get
> changed to point there.

Still had some problems with Apache after making that change, but I got it to run by making the following change in httpd.conf:

--- httpd.conf.orig	2013-07-19 09:19:16.000000000 -0700
+++ httpd.conf	2013-07-19 09:19:04.000000000 -0700
@@ -661,17 +661,17 @@ Include /private/etc/apache2/extra/httpd
 		DirectoryIndex index.html index.php
 	</IfModule>
 </IfModule>
 <IfDefine !MACOSXSERVER>
     Include /etc/apache2/other/*.conf
 </IfDefine>
 <IfDefine MACOSXSERVER>
     <IfDefine WEBSERVICE_ON>
-        Include /etc/apache2/sites/*.conf
+        Include /etc/apache2/other/*.conf
     </IfDefine>
     <IfDefine !WEBSERVICE_ON>
         Include /etc/apache2/sites/virtual_host_global.conf
         Include /etc/apache2/sites/*_.conf
         Include /etc/apache2/sites/*__shadow.conf
     </IfDefine>
 </IfDefine>

DocumentRoot is being set to /var/empty by configs under sites/

Dustin: are we managing Apache with puppet? 

Is there a better way to achieve the same thing here? Perhaps do away with the whole MACOSXSERVER check?
Flags: needinfo?(dustin)
Attached patch bug891880-p2.patch — — Splinter Review
We're not managing the core Apache configs with puppet on any platform, so that the basic OS configs remain in place.

This patch removes -D MACOSXSERVER from the plist, which /etc/apache2/ReadMe.txt suggests is a good solution.  I've run this on talos-r4-lion-006 already.  Did it help?
Attachment #778574 - Flags: review?(coop)
Flags: needinfo?(dustin)
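To make the effect of the flag easier to see, here is a rough Python model of the `<IfDefine>` nesting in the unmodified httpd.conf excerpt from the diff above. It is a sketch under assumptions: it covers only the lines quoted in that diff, the function name `active_includes` is made up for illustration, and the rest of httpd.conf is not modeled.

```python
# Sketch only: models the <IfDefine> blocks quoted in the httpd.conf diff above.
# `active_includes` is a hypothetical helper; `defines` stands in for the
# -D flags passed to httpd.

def active_includes(defines):
    """Return the Include patterns that apply for a given set of -D flags."""
    includes = []
    if "MACOSXSERVER" not in defines:
        # <IfDefine !MACOSXSERVER>
        includes.append("/etc/apache2/other/*.conf")
    else:
        # <IfDefine MACOSXSERVER>
        if "WEBSERVICE_ON" in defines:
            includes.append("/etc/apache2/sites/*.conf")
        else:
            includes.extend([
                "/etc/apache2/sites/virtual_host_global.conf",
                "/etc/apache2/sites/*_.conf",
                "/etc/apache2/sites/*__shadow.conf",
            ])
    return includes


for flags in (set(), {"MACOSXSERVER"}, {"MACOSXSERVER", "WEBSERVICE_ON"}):
    print(sorted(flags), "->", active_includes(flags))
```

In this model, other/*.conf (where talos.conf lives) is only pulled in when MACOSXSERVER is not defined, which is consistent with dropping the flag from the plist rather than editing httpd.conf.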
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> Created attachment 778574 [details] [diff] [review]
> bug891880-p2.patch
> 
> We're not managing the core Apache configs with puppet on any platform, so
> that the basic OS configs remain in place.
> 
> This patch removes -D MACOSXSERVER from the plist, which
> /etc/apache2/ReadMe.txt suggests is a good solution.  I've run this on
> talos-r4-lion-006 already.  Did it help?

Yep, that seems to be working. Thanks.
Attachment #778574 - Flags: review?(coop) → review+
Attachment #778574 - Flags: checked-in+
All green since the Apache plist fix on Friday. I think we're good here.
Great!  So, as far as deployment, each host needs the following prep:
 * move to the correct group in DeployStudio (requires a mac)
 * change basedir in slavealloc
 * bless to reimage

Followed by:
 * reboot

We don't need to disable at any point.  So, one option may be to move all hosts to the correct group as soon as possible (which means finding someone from relops with a mac - I don't have one today). Then, in whatever batch size you prefer, you can change basedirs and bless, then let each host reboot itself after a job, or reboot it manually.

Given our experience with lion builders, I think it makes sense to do at most 1/3 of the hosts, then wait a bit to detect any remaining issues.
(In reply to Dustin J. Mitchell [:dustin] from comment #18) 
> Given our experience with lion builders, I think it makes sense to do at
> most 1/3 of the hosts, then wait a bit to detect any remaining issues.

Good call.

There are 87 r4-lion testers, so the batches are pretty simple:

1) 01-30 (07 is on loan)
2) 31-60 (58 is decomm)
3) 61-90 (63 is decomm, 83 doesn't exist -> became signing server)

Dustin: can you let me know when you (or whoever) have moved all the lion testers to the correct DS group? I'll change the basedir for the first batch in slavealloc and then we can bless them and reboot.
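As a sanity check on the numbers, the batches above can be sketched as follows. This assumes the hostname pattern talos-r4-lion-NNN seen elsewhere in this bug, and it counts the host on loan (07) as still existing, since that is the only way the total comes to 87. The names `DECOMMISSIONED`, `NEVER_EXISTED` and `batch` are invented for the example.

```python
# Sketch only: enumerate the three re-imaging batches described above.
# Set and helper names are hypothetical.

DECOMMISSIONED = {58, 63}   # listed as "decomm" above
NEVER_EXISTED = {83}        # became the signing server


def batch(first, last):
    """Hostnames for slave numbers first..last, skipping hosts that are gone."""
    skip = DECOMMISSIONED | NEVER_EXISTED
    return ["talos-r4-lion-%03d" % n
            for n in range(first, last + 1)
            if n not in skip]


batches = [batch(1, 30), batch(31, 60), batch(61, 90)]

for number, hosts in enumerate(batches, 1):
    print("batch %d: %d hosts (%s .. %s)" % (number, len(hosts), hosts[0], hosts[-1]))
print("total:", sum(len(hosts) for hosts in batches))
```

The batches come out at 30, 29 and 28 hosts, so each is roughly a third of the pool, and the total matches the 87 testers.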
Sounds good.  Before we do that, I'd like to take care of bug 895995.
Depends on: 895995
All the r4-lion testers (87 total) in the DeployStudio group talos-r4-lion have been moved to the correct group in DeployStudio (talos-r4-lion-puppetagain).
No longer depends on: 895995
Depends on: 895995
Let me double-check timing with Hal. Lots of r5 minis are moving over the next few days, so re-imaging the r4-lion minis at the same time may be no big deal or the worst idea ever.
Flags: needinfo?(hwine)
The r5 mini move is complete now. Doing it in the same window would likely have been okay, as sheriffs knew there would be delays on OS X at that time.
Flags: needinfo?(hwine)
Dustin: could we get started on batch 1 today? (see comment #20 for batches)
Sure - let's change the basedir and then I (or you if you prefer) can bless them.
(In reply to Dustin J. Mitchell [:dustin] from comment #26)
> Sure - let's change the basedir and then I (or you if you prefer) can bless
> them.

basedir has been changed for batch #1. How do I bless the slaves?
Buildbot is not starting after the puppetize.sh-induced reboot.  It starts fine on the next reboot, though.

First boot:

> Aug  6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Unable to fetch my node definition, but the agent run will continue:
> Aug  6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Error 400 on SERVER: this master is not a CA
> Aug  6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet]/ensure) created
> Aug  6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider]/ensure) created
> Aug  6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider/file_line]/ensure) created
> ..
> Aug  6 13:04:39 talos-r4-lion-011 puppet-agent[234]: (/Stage[main]/Buildslave::Startup::Launchd/File[/Library/LaunchAgents/com.mozilla.buildslave.plist]/ensure) defined content as '{md5}bcf1fcd8bd1b38f2f839b146945f7301'
> ..
> Aug  6 13:05:25 talos-r4-lion-011 puppet-agent[234]: Finished catalog run in 168.92 seconds

Second boot:
> Aug  6 13:05:38 talos-r4-lion-011 reboot[2533]: rebooted by _locationd
> Aug  6 13:05:38 talos-r4-lion-011 reboot[2533]: BOOT_TIME: 1375819538 424994
> rc.server[ 8 ]: Tuning server for 8 GB (rounded down).
> Aug  6 13:05:48 localhost bootlog[0]: BOOT_TIME 1375819548 0
> Aug  6 13:06:01 localhost UserEventAgent[30]: starting CaptiveNetworkSupport as SystemEventAgent built May 25 2011 12:27:35
> Aug  6 13:05:49 localhost com.apple.launchd[1]: *** launchd[1] has started up. ***
> Aug  6 13:05:49 localhost com.apple.launchd[1]: *** Verbose boot, will log to /dev/console. ***
> Aug  6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Bug: launchd_core_logic.c:5174 (25247):2
> Aug  6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Path monitoring failed on "/var/pcast/server/xgridd/keepalive": No such file or directory
> Aug  6 13:06:01 localhost UserEventAgent[30]: CertsKeychainMonitor: configuring
> Aug  6 13:06:01 localhost UserEventAgent[30]: WirelessAirPortDeviceNameCopy(): no BSD interface name found for object 12551
> Aug  6 13:06:01 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCCopyWiFiDevices:388 WiFi Device Name == NULL
> Aug  6 13:06:03 localhost mDNSResponder[31]: mDNSResponder mDNSResponder-320.10 (Aug  2 2011 19:56:51) starting OSXVers 11
> Aug  6 13:06:03 localhost configd[34]: BLUETOOTH: Failed to start job com.apple.blued
> Aug  6 13:06:06 localhost airportd[51]: _processDLILEvent: en1 attached (down)
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CreateInterfaceWatchList:2788 WiFi Devices Found. :)
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order
> Aug  6 13:06:06: --- last message repeated 1 time ---
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order
> Aug  6 13:06:06: --- last message repeated 1 time ---
> Aug  6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug  6 13:06:06 localhost systemkeychain[46]: done file: /var/run/systemkeychaincheck.done
> Aug  6 13:06:06 localhost mDNSResponder[31]: D2D_IPC: Loaded
> Aug  6 13:06:06 localhost mDNSResponder[31]: D2DInitialize succeeded
> Aug  6 13:06:06 localhost com.apple.ucupdate.plist[72]: ucupdate: Checked 1 update, no match found.
> Aug  6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.out", ...): No such file or directory
> Aug  6 13:06:06 localhost com.apple.pfctl[81]: No ALTQ support in kernel
> Aug  6 13:06:06 localhost com.apple.pfctl[81]: ALTQ related functions disabled
> Aug  6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.err", ...): No such file or directory
> Aug  6 13:06:06 localhost emond[94]: SetUpLogs: uid = 0 gid = 0
> Aug  6 13:06:06 localhost emond[94]: SetUpLogs: opening /Library/Logs/EventMonitor/EventMonitor.error.log
> Aug  6 13:06:07 localhost com.mozilla.puppet[102]: Starting run-puppet.sh at Tue Aug 6 13:06:07 PDT 2013
> Aug  6 13:06:07 localhost com.mozilla.puppet[102]: No DNS configuration available
> Aug  6 13:06:07 localhost com.mozilla.puppet[102]: ..waiting for DNS
> HeadlessStartup: Already setup or this is an upgrade so we will not set the password.
> Aug  6 13:06:07 localhost com.apple.usbmuxd[71]: usbmuxd-211 built on May 16 2011 at 00:14:56 on May 16 2011 at 00:14:55, running 64 bit
> Aug  6 13:06:07 localhost loginwindow[85]: Login Window Application Started
> Aug  6 13:06:07 localhost freshclam[67]: Can't query current.cvd.clamav.net
> Aug  6 13:06:07 localhost freshclam[67]: Invalid DNS reply. Falling back to HTTP mode.
> Aug  6 13:06:08 localhost configd[34]: bootp_session_transmit: bpf_write(en1) failed: Network is down (50)
> Aug  6 13:06:08 localhost configd[34]: DHCP en1: INIT transmit failed
> Aug  6 13:06:08 localhost com.mozilla.puppet[102]: No DNS configuration available
> Aug  6 13:06:08 localhost com.mozilla.puppet[102]: ..waiting for DNS
> Aug  6 13:06:08 localhost UserEventAgent[30]: get_backup_share_points no AFP
> Aug  6 13:06:08 localhost UserEventAgent[30]: WebUserEventAgent: installed
> Aug  6 13:06:08 talos-r4-lion-011 configd[34]: setting hostname to "talos-r4-lion-011.build.scl1.mozilla.com"
> Aug  6 13:06:08 talos-r4-lion-011 configd[34]: network configuration changed.
> Aug  6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known
> Aug  6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: )
> Aug  6 13:06:09 talos-r4-lion-011 mds[83]: (Normal) FMW: FMW 0 0
> Aug  6 13:06:10 talos-r4-lion-011 loginwindow[85]: **DMPROXY** Found `/System/Library/CoreServices/DMProxy'.
> Aug  6 13:06:11 talos-r4-lion-011 netbiosd[137]: Unable to start NetBIOS name service:
> Aug  6 13:06:12 talos-r4-lion-011 loginwindow[85]: Login Window Started Security Agent
> Aug  6 13:06:12 talos-r4-lion-011 com.apple.launchctl.LoginWindow[141]: com.apple.findmymacmessenger: Already loaded
> Aug  6 13:06:12 talos-r4-lion-011 rpcsvchost[139]: sandbox_init: com.apple.msrpc.netlogon.sb succeeded
> Aug  6 13:06:12 talos-r4-lion-011 SecurityAgent[149]: Echo enabled
> Aug  6 13:06:13 talos-r4-lion-011 SecurityAgent[149]: User info context values set for cltbld
> Aug  6 13:06:13 talos-r4-lion-011 freshclam[67]: Can't query current.cvd.clamav.net
> Aug  6 13:06:13 talos-r4-lion-011 freshclam[67]: Invalid DNS reply. Falling back to HTTP mode.
> Aug  6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known
> Aug  6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: )
> Aug  6 13:06:14 talos-r4-lion-011 SecurityAgent[149]: Login Window login proceeding
> Aug  6 13:06:15 talos-r4-lion-011 configd[34]: network configuration changed.
> Aug  6 13:06:16 talos-r4-lion-011 ntpd[66]: proto: precision = 1.000 usec
> Aug  6 13:06:16 talos-r4-lion-011 XProtectUpdater[69]: Ignoring new signature plist: Not an increase in version
> Aug  6 13:06:16 talos-r4-lion-011 com.apple.launchd[1] (com.apple.xprotectupdater[69]): Exited with code: 252
> Aug  6 13:06:19 talos-r4-lion-011 loginwindow[85]: Login Window - Returned from Security Agent
> Aug  6 13:06:19 talos-r4-lion-011 loginwindow[85]: USER_PROCESS: 85 console
> Aug  6 13:06:19 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED!
> Aug  6 13:06:19 talos-r4-lion-011 freshclam[67]: Local version: 0.97.1 Recommended version: 0.97.8
> Aug  6 13:06:19 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.ReportCrash): Falling back to default Mach exception handler. Could not find: com.apple.ReportCrash.Self
> Aug  6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: load: option requires an argument -- D
> Aug  6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: usage: launchctl load [-wF] [-D <user|local|network|system|all>] paths...
> Aug  6 13:06:20 talos-r4-lion-011 UserEventAgent[30]: CaptiveNetworkSupport:CNSServerRegisterUserAgent:187 new user agent port: 29983
> Aug  6 13:06:20 talos-r4-lion-011 mds[83]: (Warning) Server: No stores registered for metascope "kMDQueryScopeComputerIndexed"
> Aug  6 13:06:20 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.launchctl.Aqua[205]): Exited with code: 1
> Aug  6 13:06:21 talos-r4-lion-011 sshd[197]: USER_PROCESS: 228 ttys000
> Aug  6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: ;; connection timed out; no servers could be reached
> Aug  6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: releng-puppet2.srv.releng.scl3.mozilla.com has address 10.26.48.50
> Aug  6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: Running puppet agent against server 'releng-puppet2.build.scl1.mozilla.com'
> Aug  6 13:06:23 talos-r4-lion-011 MiniLauncher[232]: Launching iCloud prefs for user 28
> Aug  6 13:06:27 talos-r4-lion-011 iCal Helper[247]: Could not find Meta Data for persistent Store
> Aug  6 13:06:28 talos-r4-lion-011 System Preferences[241]: contentsOfDirectoryAtPath returned:Error Domain=NSCocoaErrorDomain Code=260 "The folder “PreferencePanes” doesn’t exist." UserInfo=0x4004849e0 {NSFilePath=/Users/cltbld/Library/PreferencePanes, NSUserStringVariant=(
>             Folder
>         ), NSUnderlyingError=0x400484a20 "The operation couldn’t be completed. (OSStatus error -43.)"} for path:/Users/cltbld/Library/PreferencePanes
> Aug  6 13:06:32 talos-r4-lion-011 WindowServer[119]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
> Aug  6 13:06:36 talos-r4-lion-011 Dock[223]: Bookmark failed to issue extension for item /Developer/Applications (depth=1): No such file or directory
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.749 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.751 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:06:41 talos-r4-lion-011 mds[83]: (/)(Warning) IndexQuery in bool preIterate_FSI(SISearchCtx_FSI*):Throttling inefficient file system query
> Aug  6 13:06:46 talos-r4-lion-011 Finder[229]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.finder.savedState/window_1.data: No such file or directory (2)
> Aug  6 13:06:48 talos-r4-lion-011 UserEventAgent[30]: CertsKeychainMonitor: ready to process keychain & timer events
> Aug  6 13:06:49 talos-r4-lion-011 System Preferences[241]: Database file is missing: /Users/cltbld/Library/Application Support/AddressBook/AddressBook-v22.abcddb
> Aug  6 13:06:49 talos-r4-lion-011 System Preferences[241]: unable to dlopen /usr/lib/sasl2/pwauxprop.so: dlopen(/usr/lib/sasl2/pwauxprop.so, 2): no suitable image found.  Did find:
>                 /usr/lib/sasl2/pwauxprop.so: GC capability mismatch
> Aug  6 13:06:55 talos-r4-lion-011 System Preferences[241]: The IAAccountDiscovery completed notification
> Aug  6 13:06:56 talos-r4-lion-011 System Preferences[241]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.systempreferences.savedState/window_3.data: No such file or directory (2)
> Aug  6 13:07:00 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.AddressBook.abd): Throttling respawn: Will start in 8 seconds
> Aug  6 13:07:01 talos-r4-lion-011 AddressBookSync[306]: Could not replace account with identifier: _local
> Aug  6 13:07:01 talos-r4-lion-011 [0x0-0xc00c].com.apple.systempreferences[241]: 2013-08-06 13:07:01.398 AddressBookSync[306:707] Could not replace account with identifier: _local
> Aug  6 13:07:01 talos-r4-lion-011 genatsdb[252]: *GENATSDB* FontObjects generated = 324
> Aug  6 13:07:03 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 69.163.100.14)
> Aug  6 13:07:03 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug  6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 150.214.142.197)
> Aug  6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug  6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 207.57.106.31)
> Aug  6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug  6 13:07:04 talos-r4-lion-011 org.clamav.freshclam-init[67]: ERROR: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug  6 13:07:04 talos-r4-lion-011 freshclam[67]: Incremental update failed, trying to download daily.cvd
> Aug  6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv4 rules
> Aug  6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv6 rules
> Aug  6 13:07:25 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED!
> Aug  6 13:07:25 talos-r4-lion-011 freshclam[67]: Current functionality level = 61, recommended = 63
> Aug  6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Darwinuser[cltbld]/gid) gid changed '0' to 'staff'
> Aug  6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Exec[kill-builder-keychain]) Triggered 'refresh' from 1 events
> Aug  6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully
> Aug  6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/File[/usr/local/bin/changebackground.sh]/ensure) defined content as '{md5}32b1681c3cf1e2d83684d00fe233c430'
> Aug  6 13:08:09 talos-r4-lion-011 defaults[1085]:
>         The domain/default pair of (com.apple.desktop, Background) does not exist
> Aug  6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]/returns) executed successfully
> Aug  6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]) Triggered 'refresh' from 1 events
> Aug  6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) Triggered 'refresh' from 2 events
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[223]): Exited with code: 1
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): The GID of the account (20) changed out from under us (0)!
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): In a future build of the OS, this error will be fatal.
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.826 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.827 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.828 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.829 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug  6 13:08:11 talos-r4-lion-011 screenresolution[1097]: starting screenresolution argv=/usr/local/bin/screenresolution get
> Aug  6 13:08:11 talos-r4-lion-011 screenresolution[1097]: Display 0: 1600x1200x32
> Aug  6 13:08:13 talos-r4-lion-011 defaults[1145]:
>         The domain/default pair of (/Users/cltbld/Library/Preferences/.GlobalPreferences.plist, NSQuitAlwaysKeepsWindows) does not exist
> Aug  6 13:08:13 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Clean::Appstate/Osxutils::Defaults[builder-NSQuitAlwaysKeframework, Our status with coreservicesd is unusual, 2, and things eruser.28[203] (com.apple.cookied[1203]): The GID of the account (20) changed out from under us (0)!
> Aug  6 13:09:32 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.cookied[1203]): In a future build of the 

I suspect that the changing gid is the cause.  One fix, then, might be to switch to the newer version of the user provider that's now packaged with Puppet (bug 894900).
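To check other freshly imaged hosts for the same symptom, a small sketch like the following can pull the launchd "GID ... changed out from under us" messages out of a log. The regex is written against the message format seen in the log above; `gid_change_events` is a hypothetical helper, and the two numbers are returned in the order they appear without assuming which is the old or new gid.

```python
# Sketch only: find launchd "GID ... changed out from under us" messages.
# `gid_change_events` is a hypothetical helper; the pattern is derived from
# the log lines quoted above.
import re

GID_CHANGED = re.compile(
    r"\((?P<job>[\w.]+)\[\d+\]\): The GID of the account "
    r"\((?P<first>\d+)\) changed out from under us \((?P<second>\d+)\)!"
)


def gid_change_events(log_text):
    """Return (job, first_gid, second_gid) for each matching log line."""
    events = []
    for line in log_text.splitlines():
        match = GID_CHANGED.search(line)
        if match:
            events.append((match.group("job"),
                           int(match.group("first")),
                           int(match.group("second"))))
    return events


# Two lines copied from the second-boot log above.
SAMPLE = (
    "Aug  6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] "
    "(com.apple.Dock.agent[223]): Exited with code: 1\n"
    "Aug  6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] "
    "(com.apple.Dock.agent[1094]): The GID of the account (20) changed out "
    "from under us (0)!\n"
)

print(gid_change_events(SAMPLE))
```

A host that logs none of these after its first puppet run would be a hint (not proof) that the gid issue did not bite there.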
(In reply to Dustin J. Mitchell [:dustin] from comment #28) 
> I suspect that the changing gid is the cause.  One fix, then, might be to
> switch to the newer version of the user provider that's now packaged with
> Puppet (bug 894900).

I've set aside talos-r4-lion-005 to try this out on tomorrow.

Dustin: how involved is the patch? Will something similar need to happen for 10.6?
Fixing that will require a *lot* of testing.  I don't think we should delay this deployment on that account.

One possible short-term fix, other than just rebooting after imaging, is to run puppet twice from puppetize.sh.
I moved that issue, including the proposed short-term fix, to bug 902903.  This is a "soft" dependency, in the sense that we can continue re-imaging hosts even if the short-term fix doesn't work.
Depends on: 902903
I've blessed 034-060 and rebooted the slaves that are marked as disabled. I'll check in on them at the end of the day to see which ones need an extra reboot to return to production.
The batch #3 slaves have all been netbooted.

Fingers crossed for the solution in bug 902903 to close this out for good.
I suspect I didn't land that solution in time to make that batch work without a manual reboot, but hopefully it will prevent that need in subsequent re-images.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED