Closed Bug 683734 Opened 13 years ago Closed 12 years ago

Repurpose Rev3 10.6 machines

Categories

(Release Engineering :: General, defect, P3)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: armenzg)

References

Details

Attachments

(11 files)

Once we have replaced our 10.6 test machines with Rev4 hardware, we should use the freed rev3 hardware for other platforms.
Depends on: 690236
Blocks: 681748
Summary: Repurpose Rev3 machines when Rev4 machines are in production → Repurpose Rev3 machines when Rev4 machines are declared authoritative
Depends on: 693918
Depends on: 694251
Depends on: 695976
Depends on: 695979
Depends on: 696417
Depends on: 696453
This patch directs leopard slaves to the new talos_osx class implemented for the 10.6 rev4 and 10.7 rev4 machines.  This shouldn't be landed until the rev3 10.6 machines are turned off.

I have tested this on talos-r3-leopard-001 and the only change on boot was the com.apple.dock.plist going from 600 to 644 and the file being replaced.  This exact same change happens on every single reboot, as evidenced by:

talos-r3-leopard-001:~ cltbld$ grep e6bbe59dfd61a20cd007c0608729fac5 /var/puppet/log/puppet.out  | wc -l
    7119.

PUPPET OUTPUT FROM BOOT

notice: Starting catalog run
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/buildslave::cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/Exec[remove-index]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/Exec[disable-indexing]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/checksum: checksum changed '{md5}e6bbe59dfd61a20cd007c0608729fac5' to '{md5}8c117cfb1046e4fd8b2cb872cd8a84da'
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/source: replacing from source puppet://staging-puppet.build.mozilla.org/staging/darwin9-i386/test/Users/cltbld/Library/Preferences/com.apple.dock.plist with contents {md5}e6bbe59dfd61a20cd007c0608729fac5
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/mode: mode changed '600' to '644'
notice: Finished catalog run in 6.85 seconds
Attachment #569211 - Flags: review?(coop)
I generated the patch by doing:

cd puppet-manifests/os
cp talos_osx_rev4.pp talos_osx.pp
hg rm talos_osx_rev4.pp

then using sed to replace 'talos_osx_rev4' with 'talosslave' for the rev4 machines.
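
For the record, the sed invocation itself isn't quoted above; it was presumably something along these lines (GNU sed syntax; the exact list of manifest files touched is an assumption, not recorded in this bug):

# Hypothetical reconstruction of the rename step described above.
# Swap the old class name for the new one in the manifests that reference the rev4 machines:
sed -i -e 's/talos_osx_rev4/talosslave/g' *.pp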
I killed the buildslave on this machine and ran puppet with --test and --debug.  This is the output.  Most of the things it is doing are just checking that what is installed is correct.
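
(The full command line isn't quoted here; it was presumably of this form, with the staging server taken from the puppet source URL in the log above:)

puppetd --test --debug --server staging-puppet.build.mozilla.org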
Depends on: 696959
Comment on attachment 569211 [details] [diff] [review]
puppet-manifests v1

Review of attachment 569211 [details] [diff] [review]:
-----------------------------------------------------------------

r+, assuming class name is correct (or gets corrected).

::: os/talos_osx.pp
@@ +2,2 @@
>  
> +class talos_osx_rev4 {

You're not changing talosslave.pp AFAICT, so doesn't this need to stay as talos_osx (same for header above)?
Attachment #569211 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #4)
> Comment on attachment 569211 [details] [diff] [review]
> puppet-manifests v1
> 
> Review of attachment 569211 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+, assuming class name is correct (or gets corrected).
> 
> ::: os/talos_osx.pp
> @@ +2,2 @@
> >  
> > +class talos_osx_rev4 {
> 
> You're not changing talosslave.pp AFAICT, so doesn't this need to stay as
> talos_osx (same for header above)?

I'll correct the class name; it should remain 'talos_osx'.
This turns off the rev3 10.6 test machines.  It also removes snowleopard-r4 and moves the r4 snowleopard platforms to the 'snowleopard' slave platform.

Not sure if this is reconfig-safe on the test masters.
Attachment #571205 - Flags: review?(coop)
Comment on attachment 571205 [details] [diff] [review]
buildbot-configs v1

Review of attachment 571205 [details] [diff] [review]:
-----------------------------------------------------------------

Can I also ask that we clean out the unused rev4 10.6 slaves (81-160) from slavealloc once this lands?
Attachment #571205 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #7)
> Can I also ask that we clean out the unused rev4 10.6 slaves (81-160) from
> slavealloc once this lands?

Yep, I cleaned them out a couple of weeks ago.
When I am done turning the rev3 10.6 machines off, I'll kick this over to Armen.
Assignee: nobody → jhford
Depends on: 700503
No longer depends on: 700503
Here is some data to add to the discussion!

0.04% of tests are failing because of unresolved intermittent errors related to rev4 machines.  Given that, I think the trade-off of a small amount of random failure is worth accepting for significant improvements to wait times on other platforms.  We aren't giving up on fixing the random failures, either.  Bug 700672 is tracking some possible fixes to the resolution issues, like a new dongle design and boxes that simulate DVI monitors.

I am going to change the dependencies for these intermittent failure issues to bug 700672 so that work in this bug can proceed.
No longer depends on: 693918
No longer depends on: 696417
No longer depends on: 696453
This is an interim patch so that we stop scheduling 10.6 rev3 jobs.  My understanding is that because I am removing them from the scheduler master, running builds won't be interrupted, but new jobs won't queue up and render self-serve useless.

This patch won't be around for long.  Part of landing the patch to rename snowleopard-r4 to snowleopard will be to revert this patch.
Attachment #573947 - Flags: review?(catlee)
This landed in this morning's reconfig.
Comment on attachment 573947 [details] [diff] [review]
stop scheduling 10.6 rev3 jobs

Review of attachment 573947 [details] [diff] [review]:
-----------------------------------------------------------------

Please look into doing this in config.py or related files on the test scheduler master.
Attachment #573947 - Flags: review?(catlee)
Shifting bug to Armen as we discussed.  I am going to hold off on the puppet portion of this bug until we actually turn off the rev3 10.6 machines.

I think we should let these machines exist in their current state (but disabled in slavealloc) until Thursday Nov 24.
Assignee: jhford → armenzg
Summary: Repurpose Rev3 machines when Rev4 machines are declared authoritative → Repurpose Rev3 10.6 machines
The Rev3 MacOSX Snow machines are no longer running, as per:
https://tbpl.mozilla.org/?jobname=Rev3%20MacOSX%20Snow&rev=e7d5dd9efeca

The change landed with:
http://hg.mozilla.org/build/buildbot-configs/rev/1420fec41822

I had to remove these 3 pending jobs:
try	39857d1faeb7 	Rev3 MacOSX Snow Leopard 10.6.2 try debug test xpcshell
try	39857d1faeb7 	Rev3 MacOSX Snow Leopard 10.6.2 try debug test mochitest-other
try	39857d1faeb7 	Rev3 MacOSX Snow Leopard 10.6.2 try debug test reftest

As jhford mentioned, we'll hold off until the 24th "just in case" and move on from there.

TODO:
* file bugs for IT to re-purpose machines (not to be done before the 24th)
** remove from DNS/nagios
* determine how to distribute these slaves for the other OSes
* disable from slave-alloc
* remove from puppet
* remove from buildapi (util.py)
* anything else?
Priority: P4 → P3
There are 59 r3-snow machines, not including the ref machine.
There are 5 OSes to distribute these slaves across.

I grabbed data from mozilla-inbound which includes the 4 sets of PGO builds that get triggered for mozilla-central and mozilla-inbound.

I've got some data but I'm still not sure this is the right way of breaking it down.

          # of suites   Total SUM (secs)   Percentage   59 * percentage
Fedora         44            41817           20.36%          12.01
Fedora64       44            40178           19.57%          11.54
Leopard        29            22955           11.18%           6.60
Win7           44            53350           25.98%          15.33
Xp             44            47051           22.91%          13.52
TOTAL                       205351                           59 (SUM)
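
The last column is just each platform's share of the total suite time scaled to the 59 freed machines. A small bash/awk sketch re-doing that arithmetic (numbers copied from the table above, not from any tool used here):

# Re-derive the "59 * percentage" column from the per-platform suite times.
total=205351
for entry in Fedora:41817 Fedora64:40178 Leopard:22955 Win7:53350 Xp:47051; do
  os=${entry%%:*}; secs=${entry##*:}
  awk -v os="$os" -v secs="$secs" -v total="$total" \
      'BEGIN { printf "%-9s %5.2f\n", os, 59 * secs / total }'
done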

I will think this through a little more and have an answer tomorrow.

https://docs.google.com/spreadsheet/ccc?key=0ApOCAHvaMQSFdGlkYmpfOUpHcHBxTFNUbi1ILTY1bVE#gid=10
Priority: P3 → P2
I have filed a bug to get a tool that helps us do distributions like this at a later time.

I have enough data to make a somewhat informed decision, though not with the best data I could have gathered.

OS \ Data source   No PGO    Both    Wait times   Armen
Fedora             11.27    11.91      13.12        13
Fedora64           10.66    11.95      11.88        12
Leopard             9.55     6.54      10.87         7
Win7               14.51    15.20      11.93        14
Xp                 13.00    13.40      11.19        13

PGO only happens on mozilla-central and mozilla-integration 4 times a day.
Using the data from the wait times report is a little better, since on the try server we don't always trigger all builds and all tests. The try server represents 52% of the load.

Adding the numbers I saw for each silo, we would have this many production slaves:
fed     - 76-3=74 -> 21.08%
fed64   - 71-3=68 -> 19.37%
leopard - 66-3=63 -> 17.95%
w7      - 79-4=75 -> 21.37%
xp      - 75-4=71 -> 20.23%
TOTAL   -     351 rev3 prod slaves

This distribution is very similar to the distribution from the wait times report over a week's worth of data.
Fedora	 8946	22.24%
Fedora64 8101	20.14%
Leopard	 7409	18.42%
Win7	 8134	20.23%
Xp	 7626	18.96%
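
Both percentage columns above are simple shares of their respective totals: 351 production slaves, and the weekly wait-time counts (which sum to 40216). A quick sketch that reproduces them:

# Reproduce the two percentage breakdowns above: production slaves out of 351,
# and the weekly wait-time counts out of their 40216 total.
awk 'BEGIN {
  n = split("Fedora 74 8946 Fedora64 68 8101 Leopard 63 7409 Win7 75 8134 Xp 71 7626", f)
  for (i = 1; i <= n; i += 3)
    printf "%-8s %6.2f%% %6.2f%%\n", f[i], 100 * f[i+1] / 351, 100 * f[i+2] / 40216
}'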

Having said all that, here is what I think is the list of slaves we want:
01- talos-r3-fed-064
02- talos-r3-fed-065
03- talos-r3-fed-066
04- talos-r3-fed-067
05- talos-r3-fed-068
06- talos-r3-fed-069
07- talos-r3-fed-070
08- talos-r3-fed-071
09- talos-r3-fed-072
10- talos-r3-fed-073
11- talos-r3-fed-074
12- talos-r3-fed-075
13- talos-r3-fed-076
01- talos-r3-fed64-060
02- talos-r3-fed64-061
03- talos-r3-fed64-062
04- talos-r3-fed64-063
05- talos-r3-fed64-064
06- talos-r3-fed64-065
07- talos-r3-fed64-066
08- talos-r3-fed64-067
09- talos-r3-fed64-068
10- talos-r3-fed64-069
11- talos-r3-fed64-070
12- talos-r3-fed64-071
01- talos-r3-leopard-060
02- talos-r3-leopard-061
03- talos-r3-leopard-062
04- talos-r3-leopard-063
05- talos-r3-leopard-064
06- talos-r3-leopard-065
07- talos-r3-leopard-066
01- talos-r3-w7-066
02- talos-r3-w7-067
03- talos-r3-w7-068
04- talos-r3-w7-069
05- talos-r3-w7-070
06- talos-r3-w7-071
07- talos-r3-w7-072
08- talos-r3-w7-073
09- talos-r3-w7-074
10- talos-r3-w7-075
11- talos-r3-w7-076
12- talos-r3-w7-077
13- talos-r3-w7-078
14- talos-r3-w7-079
01- talos-r3-xp-063
02- talos-r3-xp-064
03- talos-r3-xp-065
04- talos-r3-xp-066
05- talos-r3-xp-067
06- talos-r3-xp-068
07- talos-r3-xp-069
08- talos-r3-xp-070
09- talos-r3-xp-071
10- talos-r3-xp-072
11- talos-r3-xp-073
12- talos-r3-xp-074
13- talos-r3-xp-075
Depends on: 705352
Work left in here:
- buildbot-configs
- slavealloc
- puppet work
- OPSI work
Status: NEW → ASSIGNED
Priority: P2 → P4
Whiteboard: waiting on IT's re-imaging work on bug 705352
Also, all of the Windows slaves need additional work done to them that the ref machine got only after these machines were imaged. This comment details exactly what needs to be done: https://bugzilla.mozilla.org/show_bug.cgi?id=704578#c17

Please let me know if there's any confusion.
Per RelEng/IT mtg, bug 705352 is now fixed.
Whiteboard: waiting on IT's re-imaging work on bug 705352
In case anyone pings me again about it: I already know I can get started.
I spoke about it with arr on IRC already.
Attached patch: config changes
Attachment #580215 - Flags: review?(coop)
Attachment #580217 - Flags: review?(coop)
I added this to the production DB and locked the slaves to my staging master.
Attachment #580215 - Flags: review?(coop) → review+
Attachment #580216 - Flags: review?(coop) → review+
Attachment #580217 - Flags: review?(coop) → review+
Attachment #580215 - Flags: checked-in+
Attachment #580216 - Flags: checked-in+
Attachment #580217 - Flags: checked-in+
Armen --

When you update puppet manifests, you *must*
 (a) hg pull -u on all masters (including master-puppet1)
 (b) watch /var/log/messages for a while to make sure the changes work
Bug 709591 is from a simple typo in 415eae655ad4.

While I was landing it, I also unintentionally added a number of other changesets (8 on mpt-production-puppet, for example) that had not been updated on the other masters -- any of which could have unexpected effects at an unexpected time.
(In reply to Dustin J. Mitchell [:dustin] from comment #27)
> Armen --
> 
> When you update puppet manifests, you *must*
>  (a) hg pull -u on all masters (including master-puppet1)
>  (b) watch /var/log/messages for a while to make sure the changes work
> Bug 709591 is from a simple typo in 415eae655ad4.
> 
> While I was landing it, I also unintentionally added a number of other
> changesets (8 on mpt-production-puppet, for example) that had not been
> updated on the other masters -- any of which could have unexpected effects
> at an unexpected time.

I'm very sorry about that. I created this section for future reference:
https://wiki.mozilla.org/ReleaseEngineering/Puppet/Usage#Deploy_changes

What is master-puppet1?
Do you have an idea on how to prevent pushing things like this live?
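
For future reference, a minimal sketch of the deploy routine dustin describes above; the master list and the manifest checkout path here are placeholders/assumptions, not the actual RelEng values:

# Hypothetical deploy loop -- master names and the manifests path are placeholders.
for master in mpt-production-puppet scl-production-puppet staging-puppet master-puppet1; do
    ssh "$master" 'cd /etc/puppet/manifests && hg pull -u'
done
# Then watch the syslog on each master for a while to confirm the change behaves:
ssh mpt-production-puppet 'tail -f /var/log/messages'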
I reconfig-ed with the landed changes from this bug today.
Followed instructions in:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave

For leopard slaves I had to:
* run the following: scutil --set HostName XXX 
* switch to root and run twice:
>  puppetd --test --server scl-production-puppet.build.scl1.mozilla.com

For XP slaves I had to:
NOTE: VNC in fullscreen on Lion does not allow you to type
* change the computer name
* add the DNS suffix and reboot
* I had to do the extra steps below because these machines got re-imaged before the latest changes were deployed to the ref image

Bug found: autoit is not installed on 64, 70, 71, 72, 73, 74 & 75
http://grab.by/bpXX

For Fedora and Fedora64 slaves:
* Some slaves were not reachable by ssh; I mentioned this on bug 705352.
* I fixed the hostname (/etc/sysconfig/network, sketched below) and rebooted
* run, wait, and run again until I got a signed certificate:
>  puppetd --test --server scl-production-puppet.build.scl1.mozilla.com
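
A minimal sketch of the hostname fix mentioned above, assuming the standard Fedora sysconfig layout (talos-r3-fed-NNN is a placeholder; the real value is each slave's own name):

# Set the slave's hostname in /etc/sysconfig/network, then reboot so it takes effect.
sed -i -e 's/^HOSTNAME=.*/HOSTNAME=talos-r3-fed-NNN/' /etc/sysconfig/network
reboot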

The Windows 7 slaves need to be activated in bug 705352.

== EXTRA STEPS ==
cd c:\
wget -O installservice.bat --no-check-certificate http://people.mozilla.com/~bhearsum/installservice.bat
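REM The next command registers a scheduled task (created as administrator) that runs c:\installservice.bat as SYSTEM on every boot: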
runas /user:administrator "schtasks /create /tn mozillamaintenance /tr \"c:\\windows\\system32\\cmd.exe /c \\\"c:\\installservice.bat\\\"\" /sc ONSTART /ru SYSTEM"
wget -O keys.reg --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=577617
regedit /s keys.reg
wget -O MozRoot.crt --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=577619

* Browse to download location
* Right-click the cert, choose "Install Certificate"
* Choose "Trusted Root Certificate Authorities" as the install location
Depends on: 712004
slave w7-065 is added on bug 676155.
Attachment #582885 - Flags: review?(coop)
Attached file sql statement for IT
rhelmer told me not to worry about writing a DELETE statement for the talos-r3-snow slaves. I can remove it from the previous patch as well if you want.
Attachment #582890 - Flags: review?(coop)
Attachment #582885 - Flags: review?(coop) → review+
Depends on: 712131
Attachment #582890 - Flags: review?(coop) → review+
Priority: P4 → P2
I have put the Win7 slaves on staging.
I will be putting the slaves from bug 712004 into staging as well.

Meanwhile I have moved these slaves to production and announced it on dev.tree-management:
talos-r3-fed-064    
talos-r3-fed-065    
talos-r3-fed-066    
talos-r3-fed-067    
talos-r3-fed-068    
talos-r3-fed-069    
talos-r3-fed-071    
talos-r3-fed-075    
talos-r3-fed-076    
talos-r3-fed64-060  
talos-r3-fed64-061  
talos-r3-fed64-062  
talos-r3-fed64-063  
talos-r3-fed64-065  
talos-r3-fed64-066  
talos-r3-fed64-067  
talos-r3-fed64-068  
talos-r3-fed64-069  
talos-r3-fed64-070  
talos-r3-leopard-060
talos-r3-leopard-061
talos-r3-leopard-062
talos-r3-leopard-063
talos-r3-leopard-065
talos-r3-leopard-066
talos-r3-leopard-074
talos-r3-xp-063 
talos-r3-xp-064 
talos-r3-xp-065 
talos-r3-xp-066 
talos-r3-xp-067 
talos-r3-xp-068 
talos-r3-xp-069 
talos-r3-xp-070 
talos-r3-xp-071 
talos-r3-xp-072 
talos-r3-xp-073 
talos-r3-xp-074
bhearsum asked me to double check the XP machines as he wasn't sure I picked up the latest changes.

I had to run these missing steps:
wget -O installservice.bat --no-check-certificate https://bug704578.bugzilla.mozilla.org/attachment.cgi?id=579099
wget -O add_cert.msc --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=579191
start add_cert.msc
* From menu: Action -> All Tasks -> Import... launches Certificate Import Wizard
* Click Next
* Browse and use C:\MozRoot.crt
* Next, Next, Finish
* Close the MMC window
Status update:
* all slaves are in production except:
 talos-r3-fed-072 - bug 712004
 talos-r3-w7-066
 talos-r3-w7-067
 talos-r3-w7-068
 talos-r3-w7-069
 talos-r3-w7-070
 talos-r3-w7-071
 talos-r3-w7-072
 talos-r3-w7-073
 talos-r3-w7-074
 talos-r3-w7-075
 talos-r3-w7-076
 talos-r3-w7-077
We had a couple of slaves that had trouble:
bug 713326 - Please get talos-r3-xp-067 and talos-r3-xp-066 out of production 

I have to have a look at what the win7 slaves' runs look like.
Depends on: 713326
More than a couple, because there were also bug 714392 and bug 714561, but they got the reimage treatment rather than just awaiting your return.
bhearsum, I would like to add the Windows 7 slaves to the pool, but I would like to know if the steps I have to run are these:
https://wiki.mozilla.org/ReferencePlatforms/Test/Win7#Mozilla_maintenance_service.2C_associated_registry_keys.2C_Mozilla_test_CA_root
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #38)
> bhearsum I would like to add the Windows 7 slaves into the pool but I would
> like to know if the steps I have to run are these:
> https://wiki.mozilla.org/ReferencePlatforms/Test/Win7#Mozilla_maintenance_service.2C_associated_registry_keys.2C_Mozilla_test_CA_root

Depends on which image they were cloned from. If they were cloned after the latest images in https://bugzilla.mozilla.org/show_bug.cgi?id=706344, no. You can find this out by looking for "Mozilla Maintenance Service" in services.msc.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #30)
> Bug found: autoit is not installed on 64, 70, 71, 72, 73, 74 & 75
> http://grab.by/bpXX

Filed as bug 717955.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #17)
> 01- talos-r3-w7-066
> 02- talos-r3-w7-067
> 03- talos-r3-w7-068
> 04- talos-r3-w7-069
> 05- talos-r3-w7-070
> 06- talos-r3-w7-071
> 07- talos-r3-w7-072
> 08- talos-r3-w7-073
> 09- talos-r3-w7-074
> 10- talos-r3-w7-075
> 11- talos-r3-w7-076
> 12- talos-r3-w7-077
> 13- talos-r3-w7-078
> 14- talos-r3-w7-079

All of these slaves got the "Mozilla Maintenance Service" installed.
They're now taking jobs on my development master to verify one last time.
I put the following slaves into the production pool:
* talos-r3-xp-035
* talos-r3-xp-066
* talos-r3-xp-067
* talos-r3-xp-070
* talos-r3-xp-075

I'm waiting on bug 705352 for talos-r3-w7-072 to be activated.
Priority: P2 → P4
talos-r3-fed64-065 was waiting for a reboot from IT but no one had enabled it in slave alloc (bug 715786).

talos-r3-xp-066 and talos-r3-xp-067 were synced with OPSI but I completely missed adding the maintenance service manually. They're now on staging again.

I'm waiting on bug 705352 for talos-r3-w7-072 to be activated.
Priority: P4 → P2
Priority: P2 → P3
Depends on: 718922
talos-r3-xp-070 - back to the pool after re-installing the certificate
talos-r3-xp-075 - back to the pool after re-installing the certificate
talos-r3-xp-066 - IT debugging file permission issues - bug 719892
talos-r3-xp-067 - IT debugging file permission issues - bug 719892
talos-r3-xp-068 - on preproduction; I will put it back on Monday
talos-r3-w7-072 - back to the pool
Depends on: 719892
No longer depends on: 718922
No longer depends on: 719892
IT will fix talos-r3-xp-066 and talos-r3-xp-067 in bug 719892.

Nothing left to be done.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering