Closed Bug 902903 Opened 11 years ago Closed 11 years ago

Buildbot doesn't start after post-install reboot on Lion

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(1 file)

(moved from bug 891880) Buildbot is not starting after the puppetize.sh-induced reboot. It starts fine on the next reboot, though. First boot: > Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Unable to fetch my node definition, but the agent run will continue: > Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Error 400 on SERVER: this master is not a CA > Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet]/ensure) created > Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider]/ensure) created > Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider/file_line]/ensure) created > .. > Aug 6 13:04:39 talos-r4-lion-011 puppet-agent[234]: (/Stage[main]/Buildslave::Startup::Launchd/File[/Library/LaunchAgents/com.mozilla.buildslave.plist]/ensure) defined content as '{md5}bcf1fcd8bd1b38f2f839b146945f7301' > .. > Aug 6 13:05:25 talos-r4-lion-011 puppet-agent[234]: Finished catalog run in 168.92 seconds Second boot: > Aug 6 13:05:38 talos-r4-lion-011 reboot[2533]: rebooted by _locationd > Aug 6 13:05:38 talos-r4-lion-011 reboot[2533]: BOOT_TIME: 1375819538 424994 > rc.server[ 8 ]: Tuning server for 8 GB (rounded down). > Aug 6 13:05:48 localhost bootlog[0]: BOOT_TIME 1375819548 0 > Aug 6 13:06:01 localhost UserEventAgent[30]: starting CaptiveNetworkSupport as SystemEventAgent built May 25 2011 12:27:35 > Aug 6 13:05:49 localhost com.apple.launchd[1]: *** launchd[1] has started up. *** > Aug 6 13:05:49 localhost com.apple.launchd[1]: *** Verbose boot, will log to /dev/console. *** > Aug 6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Bug: launchd_core_logic.c:5174 (25247):2 > Aug 6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Path monitoring failed on "/var/pcast/server/xgridd/keepalive": No such file or directory > Aug 6 13:06:01 localhost UserEventAgent[30]: CertsKeychainMonitor: configuring > Aug 6 13:06:01 localhost UserEventAgent[30]: WirelessAirPortDeviceNameCopy(): no BSD interface name found for object 12551 > Aug 6 13:06:01 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCCopyWiFiDevices:388 WiFi Device Name == NULL > Aug 6 13:06:03 localhost mDNSResponder[31]: mDNSResponder mDNSResponder-320.10 (Aug 2 2011 19:56:51) starting OSXVers 11 > Aug 6 13:06:03 localhost configd[34]: BLUETOOTH: Failed to start job com.apple.blued > Aug 6 13:06:06 localhost airportd[51]: _processDLILEvent: en1 attached (down) > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CreateInterfaceWatchList:2788 WiFi Devices Found. :) > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order > Aug 6 13:06:06: --- last message repeated 1 time --- > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order > Aug 6 13:06:06: --- last message repeated 1 time --- > Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe > Aug 6 13:06:06 localhost systemkeychain[46]: done file: /var/run/systemkeychaincheck.done > Aug 6 13:06:06 localhost mDNSResponder[31]: D2D_IPC: Loaded > Aug 6 13:06:06 localhost mDNSResponder[31]: D2DInitialize succeeded > Aug 6 13:06:06 localhost com.apple.ucupdate.plist[72]: ucupdate: Checked 1 update, no match found. > Aug 6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.out", ...): No such file or directory > Aug 6 13:06:06 localhost com.apple.pfctl[81]: No ALTQ support in kernel > Aug 6 13:06:06 localhost com.apple.pfctl[81]: ALTQ related functions disabled > Aug 6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.err", ...): No such file or directory > Aug 6 13:06:06 localhost emond[94]: SetUpLogs: uid = 0 gid = 0 > Aug 6 13:06:06 localhost emond[94]: SetUpLogs: opening /Library/Logs/EventMonitor/EventMonitor.error.log > Aug 6 13:06:07 localhost com.mozilla.puppet[102]: Starting run-puppet.sh at Tue Aug 6 13:06:07 PDT 2013 > Aug 6 13:06:07 localhost com.mozilla.puppet[102]: No DNS configuration available > Aug 6 13:06:07 localhost com.mozilla.puppet[102]: ..waiting for DNS > HeadlessStartup: Already setup or this is an upgrade so we will not set the password. > Aug 6 13:06:07 localhost com.apple.usbmuxd[71]: usbmuxd-211 built on May 16 2011 at 00:14:56 on May 16 2011 at 00:14:55, running 64 bit > Aug 6 13:06:07 localhost loginwindow[85]: Login Window Application Started > Aug 6 13:06:07 localhost freshclam[67]: Can't query current.cvd.clamav.net > Aug 6 13:06:07 localhost freshclam[67]: Invalid DNS reply. Falling back to HTTP mode. > Aug 6 13:06:08 localhost configd[34]: bootp_session_transmit: bpf_write(en1) failed: Network is down (50) > Aug 6 13:06:08 localhost configd[34]: DHCP en1: INIT transmit failed > Aug 6 13:06:08 localhost com.mozilla.puppet[102]: No DNS configuration available > Aug 6 13:06:08 localhost com.mozilla.puppet[102]: ..waiting for DNS > Aug 6 13:06:08 localhost UserEventAgent[30]: get_backup_share_points no AFP > Aug 6 13:06:08 localhost UserEventAgent[30]: WebUserEventAgent: installed > Aug 6 13:06:08 talos-r4-lion-011 configd[34]: setting hostname to "talos-r4-lion-011.build.scl1.mozilla.com" > Aug 6 13:06:08 talos-r4-lion-011 configd[34]: network configuration changed. > Aug 6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known > Aug 6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: ) > Aug 6 13:06:09 talos-r4-lion-011 mds[83]: (Normal) FMW: FMW 0 0 > Aug 6 13:06:10 talos-r4-lion-011 loginwindow[85]: **DMPROXY** Found `/System/Library/CoreServices/DMProxy'. > Aug 6 13:06:11 talos-r4-lion-011 netbiosd[137]: Unable to start NetBIOS name service: > Aug 6 13:06:12 talos-r4-lion-011 loginwindow[85]: Login Window Started Security Agent > Aug 6 13:06:12 talos-r4-lion-011 com.apple.launchctl.LoginWindow[141]: com.apple.findmymacmessenger: Already loaded > Aug 6 13:06:12 talos-r4-lion-011 rpcsvchost[139]: sandbox_init: com.apple.msrpc.netlogon.sb succeeded > Aug 6 13:06:12 talos-r4-lion-011 SecurityAgent[149]: Echo enabled > Aug 6 13:06:13 talos-r4-lion-011 SecurityAgent[149]: User info context values set for cltbld > Aug 6 13:06:13 talos-r4-lion-011 freshclam[67]: Can't query current.cvd.clamav.net > Aug 6 13:06:13 talos-r4-lion-011 freshclam[67]: Invalid DNS reply. Falling back to HTTP mode. > Aug 6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known > Aug 6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: ) > Aug 6 13:06:14 talos-r4-lion-011 SecurityAgent[149]: Login Window login proceeding > Aug 6 13:06:15 talos-r4-lion-011 configd[34]: network configuration changed. > Aug 6 13:06:16 talos-r4-lion-011 ntpd[66]: proto: precision = 1.000 usec > Aug 6 13:06:16 talos-r4-lion-011 XProtectUpdater[69]: Ignoring new signature plist: Not an increase in version > Aug 6 13:06:16 talos-r4-lion-011 com.apple.launchd[1] (com.apple.xprotectupdater[69]): Exited with code: 252 > Aug 6 13:06:19 talos-r4-lion-011 loginwindow[85]: Login Window - Returned from Security Agent > Aug 6 13:06:19 talos-r4-lion-011 loginwindow[85]: USER_PROCESS: 85 console > Aug 6 13:06:19 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED! > Aug 6 13:06:19 talos-r4-lion-011 freshclam[67]: Local version: 0.97.1 Recommended version: 0.97.8 > Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.ReportCrash): Falling back to default Mach exception handler. Could not find: com.apple.ReportCrash.Self > Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: load: option requires an argument -- D > Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: usage: launchctl load [-wF] [-D <user|local|network|system|all>] paths... > Aug 6 13:06:20 talos-r4-lion-011 UserEventAgent[30]: CaptiveNetworkSupport:CNSServerRegisterUserAgent:187 new user agent port: 29983 > Aug 6 13:06:20 talos-r4-lion-011 mds[83]: (Warning) Server: No stores registered for metascope "kMDQueryScopeComputerIndexed" > Aug 6 13:06:20 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.launchctl.Aqua[205]): Exited with code: 1 > Aug 6 13:06:21 talos-r4-lion-011 sshd[197]: USER_PROCESS: 228 ttys000 > Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: ;; connection timed out; no servers could be reached > Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: releng-puppet2.srv.releng.scl3.mozilla.com has address 10.26.48.50 > Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: Running puppet agent against server 'releng-puppet2.build.scl1.mozilla.com' > Aug 6 13:06:23 talos-r4-lion-011 MiniLauncher[232]: Launching iCloud prefs for user 28 > Aug 6 13:06:27 talos-r4-lion-011 iCal Helper[247]: Could not find Meta Data for persistent Store > Aug 6 13:06:28 talos-r4-lion-011 System Preferences[241]: contentsOfDirectoryAtPath returned:Error Domain=NSCocoaErrorDomain Code=260 "The folder <E2><80><9C>PreferencePanes<E2><80><9D> doesn<E2><80><99>t exist." UserInfo=0x4004849e0 {NSFilePath=/Users/cltbld/Library/PreferencePanes, NSUserStringVariant=( > Folder > ), NSUnderlyingError=0x400484a20 "The operation couldn<E2><80><99>t be completed. (OSStatus error -43.)"} for path:/Users/cltbld/Library/PreferencePanes > Aug 6 13:06:32 talos-r4-lion-011 WindowServer[119]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged. > Aug 6 13:06:36 talos-r4-lion-011 Dock[223]: Bookmark failed to issue extension for item /Developer/Applications (depth=1): No such file or directory > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.749 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.751 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:06:41 talos-r4-lion-011 mds[83]: (/)(Warning) IndexQuery in bool preIterate_FSI(SISearchCtx_FSI*):Throttling inefficient file system query > Aug 6 13:06:46 talos-r4-lion-011 Finder[229]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.finder.savedState/window_1.data: No such file or directory (2) > Aug 6 13:06:48 talos-r4-lion-011 UserEventAgent[30]: CertsKeychainMonitor: ready to process keychain & timer events > Aug 6 13:06:49 talos-r4-lion-011 System Preferences[241]: Database file is missing: /Users/cltbld/Library/Application Support/AddressBook/AddressBook-v22.abcddb > Aug 6 13:06:49 talos-r4-lion-011 System Preferences[241]: unable to dlopen /usr/lib/sasl2/pwauxprop.so: dlopen(/usr/lib/sasl2/pwauxprop.so, 2): no suitable image found. Did find: > /usr/lib/sasl2/pwauxprop.so: GC capability mismatch > Aug 6 13:06:55 talos-r4-lion-011 System Preferences[241]: The IAAccountDiscovery completed notification > Aug 6 13:06:56 talos-r4-lion-011 System Preferences[241]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.systempreferences.savedState/window_3.data: No such file or directory (2) > Aug 6 13:07:00 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.AddressBook.abd): Throttling respawn: Will start in 8 seconds > Aug 6 13:07:01 talos-r4-lion-011 AddressBookSync[306]: Could not replace account with identifier: _local > Aug 6 13:07:01 talos-r4-lion-011 [0x0-0xc00c].com.apple.systempreferences[241]: 2013-08-06 13:07:01.398 AddressBookSync[306:707] Could not replace account with identifier: _local > Aug 6 13:07:01 talos-r4-lion-011 genatsdb[252]: *GENATSDB* FontObjects generated = 324 > Aug 6 13:07:03 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 69.163.100.14) > Aug 6 13:07:03 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net > Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 150.214.142.197) > Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net > Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 207.57.106.31) > Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net > Aug 6 13:07:04 talos-r4-lion-011 org.clamav.freshclam-init[67]: ERROR: getpatch: Can't download daily-17290.cdiff from database.clamav.net > Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: Incremental update failed, trying to download daily.cvd > Aug 6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv4 rules > Aug 6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv6 rules > Aug 6 13:07:25 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED! > Aug 6 13:07:25 talos-r4-lion-011 freshclam[67]: Current functionality level = 61, recommended = 63 > Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Darwinuser[cltbld]/gid) gid changed '0' to 'staff' > Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Exec[kill-builder-keychain]) Triggered 'refresh' from 1 events > Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully > Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/File[/usr/local/bin/changebackground.sh]/ensure) defined content as '{md5}32b1681c3cf1e2d83684d00fe233c430' > Aug 6 13:08:09 talos-r4-lion-011 defaults[1085]: > The domain/default pair of (com.apple.desktop, Background) does not exist > Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]/returns) executed successfully > Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]) Triggered 'refresh' from 1 events > Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) Triggered 'refresh' from 2 events > Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[223]): Exited with code: 1 > Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): The GID of the account (20) changed out from under us (0)! > Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): In a future build of the OS, this error will be fatal. > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.826 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.827 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.828 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.829 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication > Aug 6 13:08:11 talos-r4-lion-011 screenresolution[1097]: starting screenresolution argv=/usr/local/bin/screenresolution get > Aug 6 13:08:11 talos-r4-lion-011 screenresolution[1097]: Display 0: 1600x1200x32 > Aug 6 13:08:13 talos-r4-lion-011 defaults[1145]: > The domain/default pair of (/Users/cltbld/Library/Preferences/.GlobalPreferences.plist, NSQuitAlwaysKeepsWindows) does not exist > Aug 6 13:08:13 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Clean::Appstate/Osxutils::Defaults[builder-NSQuitAlwaysKeframework, Our status with coreservicesd is unusual, 2, and things eruser.28[203] (com.apple.cookied[1203]): The GID of the account (20) changed out from under us (0)! > Aug 6 13:09:32 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.cookied[1203]): In a future build of the I suspect that the changing gid is the cause. One fix, then, might be to switch to the newer version of the user provider that's now packaged with Puppet (bug 894900). A short-term fix, if that's the case, may be to run puppet twice on Darwin before the initial reboot.
I added the following to puppetize.sh on pxe1.build.scl1: # on Darwin, run puppet again, as the user provider doesn't quite get things (particularly gid) right on the first try - bug 902903 if [ "$OS" = "Darwin" ]; then while ! /usr/bin/puppet agent --no-daemonize --onetime --server=${PUPPET_SERVER:-puppet} --pluginsync --ssldir=/var/lib/puppet/ssl; do echo "Second puppet run failed; re-trying after 10m" sleep 600 done fi Coop will try later today and let me know if it helped.
Blocks: 891880
(In reply to Dustin J. Mitchell [:dustin] from comment #1) > Coop will try later today and let me know if it helped. I've blessed talos-r4-lion-031 and rebooted it.
(In reply to Chris Cooper [:coop] from comment #2) > I've blessed talos-r4-lion-031 and rebooted it. talos-r4-lion-031 came back from netboot properly, but buildbot didn't start up. It's currently sitting in that state. Did I miss a step here?
Nope, no steps missed. I'll have a look now.
Finished the first catalog run: > Aug 9 09:07:43 talos-r4-lion-031 puppet-agent[242]: Finished catalog run in 168.88 seconds Started the second, including the suspect gid change: > Aug 9 09:08:09 talos-r4-lion-031 puppet-agent[2532]: (/Stage[users]/Users::Builder::Account/Darwinuser[cltbld]/gid) gid changed '0' to 'staff' > Aug 9 09:08:09 talos-r4-lion-031 puppet-agent[2532]: (/Stage[users]/Users::Builder::Account/Exec[kill-builder-keychain]) Triggered 'refresh' from 1 events > Aug 9 09:08:10 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully > Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup[3437]): posix_spawn("/etc/deploystudio/sbin/ds_finalize_cleanup.sh", ...): No such file or directory > Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup[3437]): Exited with code: 1 > Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup): Throttling respawn: Will start in 10 seconds > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/File[/usr/local/bin/changebackground.sh]/ensure) defined content as '{md5}32b1681c3cf1e2d83684d00fe233c430' > Aug 9 09:08:45 talos-r4-lion-031 defaults[3456]: > The domain/default pair of (com.apple.desktop, Background) does not exist > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]/returns) executed successfully > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]) Triggered 'refresh' from 1 events > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]/returns) No matching processes were found > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) Failed to call refresh: /usr/bin/killall Dock returned 1 instead of one of [0] > Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) /usr/bin/killall Dock returned 1 instead of one of [0] it tries to set the screenresolution when it shouldn't because FACTER_PUPPETIZING is not set; this is harmless > Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: starting screenresolution argv=/usr/local/bin/screenresolution get > Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: 3891612: (connectAndCheck) Untrusted apps are not allowed to connect to or launch Window Server before login. > Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged. > Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: Error: failed to get list of active displays > Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3467]: starting screenresolution argv=/usr/local/bin/screenresolution set 1600x1200x32 > Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: 3891612: (connectAndCheck) Untrusted apps are not allowed to connect to or launch Window Server before login. > Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged. > Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: Error: failed to get list of active displays > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) 2013-08-09 09:08:45.996 screenresolution[3467:707] starting screenresolution argv=/usr/local/bin/screenresolution set 1600x1200x32 > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) 2013-08-09 09:08:46.000 screenresolution[3467:707] Error: failed to get list of active displays > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0] > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) change from notrun to 0 failed: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0] > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Puppet::Atboot/File[/etc/puppet/puppetmasters.txt]/content) content changed '{md5}22db71f3f0bca0d11d7bd94fc873341b' to '{md5}fdf71264736f9a8b0fcf316f384bc280' > Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Users::Builder::Setup/Python::User_pip_conf[cltbld]/File[/Users/cltbld/.pip/pip.conf]/content) content changed '{md5}a408bfe315e1832a457fc0e1193bc244' to '{md5}056e5ae36816c7b96ebd1e2fcc6adb95' > Aug 9 09:08:48 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Users::Root::Setup/Python::User_pip_conf[root]/File[/var/root/.pip/pip.conf]/content) content changed '{md5}a408bfe315e1832a457fc0e1193bc244' to '{md5}056e5ae36816c7b96ebd1e2fcc6adb95' > Aug 9 09:08:49 talos-r4-lion-031 puppet-agent[2532]: Finished catalog run in 52.19 seconds It reboots.. > Aug 9 09:09:03 talos-r4-lion-031 reboot[3555]: rebooted by _locationd > Aug 9 09:09:03 talos-r4-lion-031 reboot[3555]: BOOT_TIME: 1376064543 474837 And runs puppet.. > Aug 9 09:10:50 talos-r4-lion-031 puppet-agent[235]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully > Aug 9 09:11:22 talos-r4-lion-031 screenresolution[1058]: starting screenresolution argv=/usr/local/bin/screenresolution get > Aug 9 09:11:22 talos-r4-lion-031 screenresolution[1058]: Display 0: 1600x1200x32 > Aug 9 09:11:24 talos-r4-lion-031 defaults[1106]: > The domain/default pair of (/Users/cltbld/Library/Preferences/.GlobalPreferences.plist, NSQuitAlwaysKeepsWindows) does not exist > Aug 9 09:11:24 talos-r4-lion-031 puppet-agent[235]: (/Stage[main]/Clean::Appstate/Osxutils::Defaults[builder-NSQuitAlwaysKeepsWindows]/Exec[osx_defaults write /Users/cltbld/Library/Preferences/.GlobalPreferences.plist NSQuitAlwaysKeepsWindows=>0]/returns) executed successfully > Aug 9 09:11:27 talos-r4-lion-031 puppet-agent[235]: Finished catalog run in 53.52 seconds but doesn't run Buildbot. cltbld is logged in and its launchd is running and watching /var/tmp/puppet.finished And I just noticed the slave is disabled! So, that would explain it for this host at least.
We re-imaged 031, and it mysteriously started Buildbot after I ran dtruss on launchd and touched puppet.finished. I'll watch more closely next time (-033).
I mentioned in bug 902970, but it might be relevant here, too -- the image we're building these r4 minis out with was originally developed for r5 minis. That *might* be relevant. There may also be a timing issue, where puppet touches the semaphore file before cltbld's launchd is watching it.
The system logs for lion startup contain lots of errors, but those errors are the same on the old build machines and the new talos machines, so that's probably not a useful lead. In the system logs, I see lots of logs from com.apple.launchd.peruser.28 before I see the puppet run, so I think the race condition in comment 7 is also not present. I also don't see why a reboot would fix that issue. So, still stumped.
Aug 12 15:45:36 talos-r4-lion-005 puppet-agent[228]: Finished catalog run in 50.26 seconds Aug 12 15:45:39 talos-r4-lion-005 XProtectUpdater[69]: Ignoring new signature plist: Not an increase in version Aug 12 15:45:39 talos-r4-lion-005 com.apple.launchd[1] (com.apple.xprotectupdater[69]): Exited with code: 252 Aug 12 15:46:59 talos-r4-lion-005 helpd[243]: CarbonCore.framework, Our status with coreservicesd is unusual, 2, and things likely will stop working. Aug 12 15:46:59 talos-r4-lion-005 helpd[243]: CarbonCore.framework: Unable to connect to coreservicesd. Things likely won't work. Sorry. Aug 12 15:48:41 talos-r4-lion-005 servermgrd[77]: No requests in 300 seconds, shutting down Aug 12 15:48:50 talos-r4-lion-005 talagent[207]: PersistentUI: CSBackupSetItemExcluded() failed with OSStatus -61 [root@talos-r4-lion-005.build.scl1.mozilla.com ~]# ls -al /builds/slave/{,talos-slave} /var/tmp/puppet.finished -rw-r--r-- 1 root wheel 0 Aug 12 15:45 /var/tmp/puppet.finished /builds/slave/: total 0 drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 . drwxr-xr-x 3 root wheel 102 Aug 12 15:39 .. drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-slave /builds/slave/talos-slave: total 0 drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 . drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 .. drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-data It must be something about missing the "touch" in that file - some sort of launchd bug. If I touch puppet.finished again, it invokes runslave a second or so later: [root@talos-r4-lion-005.build.scl1.mozilla.com ~]# touch /var/tmp/puppet.finished [root@talos-r4-lion-005.build.scl1.mozilla.com ~]# ls -al /builds/slave/{,talos-slave} /var/tmp/puppet.finished -rw-r--r-- 1 root wheel 0 Aug 12 15:56 /var/tmp/puppet.finished /builds/slave/: total 0 drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 . drwxr-xr-x 3 root wheel 102 Aug 12 15:39 .. drwxr-xr-x 4 cltbld staff 136 Aug 12 15:57 talos-slave /builds/slave/talos-slave: total 8 drwxr-xr-x 4 cltbld staff 136 Aug 12 15:57 . drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 .. -rw-r--r-- 1 cltbld staff 1397 Aug 12 15:57 buildbot.tac drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-data I suspect we need to find a more robust way to start buildbot after puppet.
09:55 <@gneagle> djmitche: This fails because once a WatchPath doesn't exist (for example, if it's removed), launchd stops paying attention to it. 09:55 <@gneagle> djmitche: Instead, use KeepAlive:PathState 09:55 < djmitche> oh! 09:55 < djmitche> that explains why this only fails once - user's launchd starts *before* the path exists 09:55 < djmitche> thanks! 09:56 <@gneagle> djmitche: Example here: http://code.google.com/p/munki/source/browse/launchd/LaunchDaemons/com.googlecode.munki.managedsoftwareupdate-manualcheck.plist
09:57 <@gneagle> Also: http://code.google.com/p/munki/source/browse/launchd/LaunchAgents/com.googlecode.munki.ManagedSoftwareUpdate.plist 09:58 <@gneagle> I use this technique for user-level processes to trigger root-level processes and the other way around, which sounds like what you are doing as well.
So the process using PathState would be this: * puppet startup deletes the semaphore file (to avoid early startup of buildbot if a machine crashes while starting buildbot) * on success, puppet creates the semaphore file * buildbot script immediately deletes the semaphore file when it starts My worry is that launchd will kill the buildbot script when the semaphore file is removed, but we'll see.
Attached patch bug902903.patchSplinter Review
Launchd doesn't see buildbot - it just sees run-buildslave.sh/runslave.py exit, so it doesn't kill anything. I've only tested this on snow leopard so far, but let's get the review going while I finish testing.
Attachment #789884 - Flags: review?(coop)
Works fine on mountain lion, too: Aug 14 06:10:21 talos-mtnlion-r5-001.test.releng.scl3.mozilla.com puppet-agent[140]: Finished catalog run in 41.53 seconds Aug 14 06:10:22 talos-mtnlion-r5-001.test.releng.scl3.mozilla.com run-buildslave[974]: starting runslave.py Can I borrow a lion host briefly?
Comment on attachment 789884 [details] [diff] [review] bug902903.patch Review of attachment 789884 [details] [diff] [review]: ----------------------------------------------------------------- ::: modules/buildslave/manifests/startup/launchd.pp @@ +23,3 @@ > > $buildslave_path = "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin" > WS cleanup on landing, please? (I know it's not yours.)
Attachment #789884 - Flags: review?(coop) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #14) > Can I borrow a lion host briefly? talos-r4-lion-001 is ready for you. I disabled it in slavealloc previously, but it is now re-enabled and should connect to buildbot on the next reboot/netboot.
Looks good there, too: Aug 14 09:09:46 talos-r4-lion-001 puppet-agent[137]: Finished catalog run in 55.02 seconds Aug 14 09:09:48 talos-r4-lion-001 run-buildslave[1088]: starting runslave.py
Landed, and I see no problems cropping up. We'll see for sure if this fixed the problem with imaging next time we re-image a host.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: