Closed Bug 537748 Opened 15 years ago Closed 14 years ago

update leopard talos ref image

Categories: Release Engineering :: General
Type: defect, Priority: Not set, Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: anodelman, Assigned: anodelman
Attachments: 5 files, 2 obsolete files

With the new hardware we should take the opportunity to:

1) update to the latest dot release for leopard/tiger
2) change user from mozqa to cltbld
Blocks: 512533
Leopard upgraded to 10.5.8. Also:

- changed to cltbld user
- clean up twistd.logs on reboot
- clean up error_log apache log on reboot

Looks like tiger ref image won't need to be updated immediately as tiger will not run on the new hardware.
Summary: update leopard/tiger talos ref images → update leopard talos ref image
Depends on: 537974
Depends on: 538403
Depends on: 538587
* Installed Puppet
* Modified buildbot launcher NOT to run on boot
* Installed Puppet launcher that runs once, waits for exit, then runs Buildbot
* Installed buildbot-tac generator, with init script patch from attachment
* Modified buildbot plist file for new username

After spending 3 hours finding and fixing little bugs I got all of the above done. This machine is running in staging over the weekend now.
(In reply to comment #3)
> After spending 3 hours finding and fixing little bugs I got all of the above
> done. This machine is running in staging over the weekend now.

This comment was posted too soon. I thought I fixed the last issues, but I still haven't been able to get Buildbot launching after Puppet runs. I'm continuing to look for a solution to this this morning.
Finally got it working, after learning about WatchPaths for launchd plist files. This machine is running in staging now.
Attached patch updated buildbot plist file (obsolete) — Splinter Review
This is the updated version of the buildbot plist file that will prevent buildbot from launching until Puppet has finished. WatchPaths will attempt to launch buildbot whenever the watched file is modified. Note that we don't have a StartInterval anymore - I doubt launching this way will be flaky like launching at startup was. Incoming is the Puppet launcher that touches the watched file after Puppet runs (regardless of success or failure).
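For readers unfamiliar with WatchPaths, here is a minimal sketch of a launchd job of the shape described above, generated with Python's plistlib. The label, program arguments and trigger path are hypothetical placeholders, not the values in the attached plist, and setting RunAtLoad to false explicitly is an assumption made for clarity.

```python
import plistlib


def build_buildbot_job(trigger_path, program_args, label="org.example.buildbot-slave"):
    """Return launchd plist bytes for a job started by a file modification.

    All argument values are illustrative; the real plist attached to this
    bug may use different keys and paths.
    """
    job = {
        "Label": label,                      # hypothetical job label
        "ProgramArguments": list(program_args),
        # launchd starts the job whenever any listed path is modified.
        "WatchPaths": [trigger_path],
        # No StartInterval and no launch at load: the job only starts
        # once the trigger file has been touched.
        "RunAtLoad": False,
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    print(build_buildbot_job("/tmp/puppet-finished", ["/usr/bin/true"]).decode())
```

With a job like this, nothing starts at boot; the launcher only has to touch the trigger file once Puppet has run.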
Attached patch Puppet launcher — Splinter Review
Here's the script that launches Puppet. When Puppet runs for the first time on a new slave it will return 1 because the certificate will not be signed yet. Therefore, we run in a loop until the certificate gets signed and Puppet runs. By doing this we ensure that the slave will be up to date before it accepts any jobs.
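The retry logic can be sketched roughly as follows. This is not the attached shell script. It is a Python illustration that assumes any nonzero exit status means "try again" (the real script may distinguish the unsigned-certificate case from other failures). The command, trigger path and delay are hypothetical parameters.

```python
import subprocess
import time
from pathlib import Path


def run_until_success(command, trigger_path, delay_seconds=60,
                      run=subprocess.call, sleep=time.sleep):
    """Run `command` repeatedly until it exits 0, then touch `trigger_path`.

    `run` and `sleep` are injectable so the loop can be exercised without
    actually invoking Puppet or waiting. Returns the number of attempts.
    """
    attempts = 0
    while True:
        attempts += 1
        if run(command) == 0:
            break
        # Assumption: a nonzero exit means the certificate is not signed
        # yet (or the run failed), so wait and retry.
        sleep(delay_seconds)
    # Touching the file is what lets a WatchPaths-based launchd job start.
    Path(trigger_path).touch()
    return attempts


# Hypothetical usage; both paths are placeholders:
# run_until_success(["/path/to/puppet-run-once"], "/path/to/trigger-file")
```

Looping before touching the trigger file is what guarantees the slave is up to date before Buildbot starts and accepts jobs.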
Attached file Puppet plist launcher
This is pretty similar to the one we use for builds. We use the StartInterval/LaunchOnlyOnce combo to make sure it gets run if the first attempt doesn't get through. Note that we'll need a second copy of this when we have some staging slaves.
Password stuff has been moved up, because we need the same logic to handle the tac file.
Might need to update the IGNORE_HOSTS - I don't know yet what we're doing for a Leopard v3 image, we may just continue using qm-leopard-ref.
Attachment #420732 - Attachment is obsolete: true
Attachment #421054 - Flags: review?(catlee)
Comment on attachment 421042 [details] [diff] [review]
updated buildbot plist file

Details in comment #6
Attachment #421042 - Flags: review?(catlee)
Comment on attachment 421043 [details] [diff] [review]
Puppet launcher

Details in comment #7
Attachment #421043 - Flags: review?(catlee)
Comment on attachment 421044 [details]
Puppet plist launcher

Details in comment #8
Attachment #421044 - Flags: review?(catlee)
Attachment #421042 - Flags: review?(catlee) → review+
Attachment #421043 - Flags: review?(catlee) → review+
Comment on attachment 421044 [details]
Puppet plist launcher

tiny nit: I wouldn't call the script 'run-puppet-and-buildbot.sh', since it's not actually running buildbot.
Attachment #421044 - Flags: review?(catlee) → review+
Attachment #421054 - Flags: review?(catlee) → review+
Just hit a problem with FileDownload on the dirty profiles on this machine. I don't quite understand why though - we're running the same Buildbot as the other slaves. Looking into it.
Assignee: nobody → anodelman
Got the issue sorted out. Turns out that OS X 10.5.8 comes with Twisted 2.5.0 (10.5.2 comes with Twisted 2.4.0). After installing Twisted 2.4.0 and adjusting PYTHONPATH I managed to get the dirty profile tests working again. It's now running in staging. Initial numbers are looking good. There's a bit more deviation between runs, but nothing horrible, AFAICT:
leopard mozilla-1.9.1:
dhtml        856.1, 852.6
svg_opacity  12174.5, 12190.5
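To put the run-to-run deviation in perspective, the spread of each pair of results can be compared with its mean. A small sketch using only the two pairs of values quoted above (the function name is illustrative):

```python
def run_spread(values):
    """Return (absolute spread, spread as a percentage of the mean)."""
    lowest, highest = min(values), max(values)
    mean = sum(values) / len(values)
    spread = highest - lowest
    return spread, 100.0 * spread / mean


# The two pairs of results quoted in this comment.
samples = {
    "dhtml": [856.1, 852.6],
    "svg_opacity": [12174.5, 12190.5],
}

for name, values in samples.items():
    spread, percent = run_spread(values)
    print(f"{name}: spread {spread:.1f}, {percent:.2f}% of mean")
```

Both pairs differ by well under one percent of their mean, which fits "nothing horrible".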



More in the spreadsheet:
http://spreadsheets.google.com/ccc?key=0AhwotqoBm6lEdHJLN2w0U2dTa1o2bVpOLXNWZF9ZM0E&hl=en
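As an illustration of the PYTHONPATH adjustment mentioned above: directories listed in PYTHONPATH are searched before the system's default module locations, so putting the directory that holds the older Twisted first makes it win over the bundled one. A minimal sketch, where the install directory is a hypothetical placeholder:

```python
import os


def env_with_preferred_path(preferred_dir, base_env=None):
    """Return a copy of the environment with `preferred_dir` first in PYTHONPATH.

    `preferred_dir` stands in for wherever the wanted Twisted version was
    installed; the actual location used on the slave is not stated here.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH", "")
    # Keep existing entries, dropping blanks and any duplicate of the
    # preferred directory so it appears exactly once, at the front.
    others = [p for p in existing.split(os.pathsep) if p and p != preferred_dir]
    env["PYTHONPATH"] = os.pathsep.join([preferred_dir] + others)
    return env


# Hypothetical usage: pass the result as `env=` when spawning the slave process.
# env = env_with_preferred_path("/path/to/twisted-2.4.0")
```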
Just had a look at more numbers for talos-r3-leopard01 and the deviation shown is within the normal range (a couple ms for most tests, 100s of ms for nochrome gfx and svg_opacity).

However, there's a period of time overnight where the slave went through a ton of rapid disconnects/reconnects (from roughly 11:40pm PST yesterday to 5am PST today). I'm hoping this is happening because of some machine futzing last night, but we'll see...
Looks like somebody rebooted it:
Jan 12 23:48:19 talos-r3-leopard01 sudo[6666]:   cltbld : TTY=ttys001 ; PWD=/Users/cltbld/talos-slave/talos-data ; USER=root ; COMMAND=/sbin/reboot
Jan 12 23:48:20 talos-r3-leopard01 reboot[6666]: rebooted by cltbld
Jan 12 23:48:20 talos-r3-leopard01 reboot[6666]: SHUTDOWN_TIME: 1263368900 9105

Dunno why this would cause failures for 6 hours, though.
Attachment #421042 - Attachment is obsolete: true
Attachment #421446 - Flags: review?(catlee)
Comment on attachment 421043 [details] [diff] [review]
Puppet launcher

Landed with a better name:

Checking in run-puppet.sh;
/mofo/puppet-files/talos/mac/run-puppet.sh,v  <--  run-puppet.sh
initial revision: 1.1
done
Attachment #421043 - Flags: checked-in+
Comment on attachment 421044 [details]
Puppet plist launcher

Landed, with the name adjusted:
Checking in com.reductivelabs.puppet.plist;
/mofo/puppet-files/talos/mac/com.reductivelabs.puppet.plist,v  <--  com.reductivelabs.puppet.plist
initial revision: 1.1
done
Attachment #421044 - Flags: checked-in+
Comment on attachment 421054 [details] [diff] [review]
buildbot-tac updates for talos

buildbot-tac updates for talos
Attachment #421054 - Flags: checked-in+
(In reply to comment #16)
> Just had a look at more numbers for talos-r3-leopard01 and the deviation shown
> is within the normal range (a couple ms for most tests, 100s of ms for nochrome
> gfx and svg_opacity).
> 
> However, there's a period of time overnight where the slave went through a ton
> of rapid disconnects/reconnects (from roughly 11:40pm PST yesterday to 5am PST
> today). I'm hoping this happening because of some machine futzing last night,
> but we'll see...

Turns out that Phong had taken the machine down and cloned it to another one due to a miscommunication - so we had two machines trying to connect with the same name. So - this is a non-issue.
Nothing big here - just need to make Puppet aware of the new and incoming machines.
Attachment #421503 - Flags: review?
Attachment #421503 - Flags: review? → review?(anodelman)
Attachment #421503 - Flags: review?(anodelman) → review+
Comment on attachment 421503 [details] [diff] [review]
update puppet configs for new leopard slaves

changeset:   86:ba716aca3f32
Attachment #421503 - Flags: checked-in+
Phong, talos-r3-leopard-ref is ready to have its image taken. Feel free to shut it down whenever you're ready to take it.

A couple notes about imaging:
Please use 3 digit numbering - talos-r3-leopardNNN
Please be sure to set the full hostname with:
scutil --set HostName talos-r3-leopardNNN.mozilla.org

Our Puppet configs are dependent on that getting set properly.
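A quick way to sanity-check a hostname against the naming scheme above before imaging further machines. This is only a sketch of the stated convention (three digits, full mozilla.org name), and the function name is illustrative:

```python
import re

# talos-r3-leopardNNN.mozilla.org, where NNN is exactly three digits.
HOSTNAME_PATTERN = re.compile(r"talos-r3-leopard\d{3}\.mozilla\.org")


def is_valid_slave_hostname(name):
    """Return True if `name` follows the three-digit naming scheme."""
    return HOSTNAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("talos-r3-leopard001.mozilla.org",
                      "talos-r3-leopard01.mozilla.org",
                      "talos-r3-leopard001"):
        print(candidate, is_valid_slave_hostname(candidate))
```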
Attachment #421446 - Flags: review?(catlee) → review+
Comment on attachment 421446 [details]
buildbot plist - set PYTHONPATH

Checking in buildbot.start.talos.slave.plist;
/mofo/puppet-files/talos/mac/buildbot.start.talos.slave.plist,v  <--  buildbot.start.talos.slave.plist
initial revision: 1.1
done
Attachment #421446 - Flags: checked-in+
I'm pretty sure we're all done here.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
No longer blocks: 522528
Per IRC with catlee: not sure whether this is caused by the ref image or something else. We're getting a lot of errors like this across various talos suites:
=====
-------- Original Message --------
Subject: buildbot failure in Talos on MacOSX Leopard 10.5.8 mozilla-1.9.0 talos
Date: Tue, 26 Jan 2010 19:13:04 -0800


The Buildbot has detected a failed build of MacOSX Leopard 10.5.8 mozilla-1.9.0 talos on Talos.
Full details are available at:
 http://talos-master.mozilla.org:8012builders/MacOSX%20Leopard%2010.5.8%20mozilla-1.9.0%20talos/builds/268

Buildbot URL: http://talos-master.mozilla.org:8012

Buildslave for this Build: talos-r3-leopard-011

Build Reason: 
Build Source Stamp: unavailable
Blamelist: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/bm-xserve08-mozilla1.9.0/1264560300/

BUILD FAILED: failed cleanup slave lost

sincerely,
 -The Buildbot
Product: mozilla.org → Release Engineering