Closed Bug 1229790 Opened 4 years ago Closed 4 years ago

Create AWS test buildbot masters for t-w732 and t-w10 slaves

Categories

(Infrastructure & Operations :: RelOps: General, task)

Unspecified
Windows
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: grenade)

References

Details

(Whiteboard: [windows][aws])

Attachments

(7 files, 8 obsolete files)

2.96 KB, patch
Callek
: review+
Details | Diff | Splinter Review
1.07 KB, patch
Callek
: review+
Details | Diff | Splinter Review
69.61 KB, text/plain
Details
10.27 KB, patch
kmoir
: review+
Details | Diff | Splinter Review
83.79 KB, patch
Details | Diff | Splinter Review
1.12 KB, patch
grenade
: review+
Callek
: checked-in+
Details | Diff | Splinter Review
751 bytes, patch
Callek
: review+
Details | Diff | Splinter Review
No description provided.
Attachment #8694753 - Flags: review?(catlee) → review?(bugspam.Callek)
Attachment #8694760 - Flags: review?(catlee) → review?(bugspam.Callek)
Attachment #8694753 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8694760 [details] [diff] [review]
puppet nodes for new bb masters: bm128 (use1), bm129 (usw2) for t-w732 and t-w10

Review of attachment 8694760 [details] [diff] [review]:
-----------------------------------------------------------------

the tools change needs to land before these instances are started up, so ideally before this puppet patch lands.
Attachment #8694760 - Flags: review?(bugspam.Callek) → review+
Depends on: 1229907
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attached file t-win-spot.csv
created spot instance names in slavealloc:
- use1: t-{w732,w10}-spot-{000-099,200-299}
- usw2: t-{w732,w10}-spot-{100-199,300-399}
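The region/range split above can be sketched as a small Python helper. The name format and ranges are taken from the comment; the function itself is purely illustrative and is not part of slavealloc:

```python
# Illustrative sketch of the slavealloc naming scheme described above.
# use1 gets ranges 000-099 and 200-299; usw2 gets 100-199 and 300-399,
# for each of the t-w732 and t-w10 platforms.

def spot_slave_names():
    regions = {
        'use1': [range(0, 100), range(200, 300)],
        'usw2': [range(100, 200), range(300, 400)],
    }
    names = {}
    for region, ranges in regions.items():
        names[region] = [
            't-%s-spot-%03i' % (platform, i)
            for platform in ('w732', 'w10')
            for r in ranges
            for i in r
        ]
    return names
```

Each region ends up with 200 indices per platform, so 400 slave names per region in total.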
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
this patch assumes that we want to replicate all 'win7-ix' config as 'win7-spot' in order to be able to configure these separately (to prevent ec2 from running perf tests).
win7 config is also updated to include new t-w732-spot-* slaves
win10 config is only updated to include new t-w10-spot-* slaves
Attachment #8695357 - Flags: review?(catlee)
Attachment #8695357 - Flags: review?(bugspam.Callek)
Attachment #8695357 - Flags: feedback?(q)
modified win7-spot platform config to include 'VM' in the name (for watch-pending)
Attachment #8695357 - Attachment is obsolete: true
Attachment #8695357 - Flags: review?(catlee)
Attachment #8695357 - Flags: review?(bugspam.Callek)
Attachment #8695357 - Flags: feedback?(q)
Attachment #8695375 - Flags: review?(catlee)
Attachment #8695375 - Flags: review?(bugspam.Callek)
Attached patch buildermap for t-w732 (obsolete) — Splinter Review
Attachment #8695392 - Flags: review?(catlee)
Attachment #8695375 - Flags: review?(bugspam.Callek)
Comment on attachment 8695375 [details] [diff] [review]
buildbot-configs support for aws t-w732-spot and t-w10-spot

Review of attachment 8695375 [details] [diff] [review]:
-----------------------------------------------------------------

I feel like I'm forgetting some tidbit with regard to how slave_platforms and such work (as in, are we duplicating the tests to run on both in-house and aws with this patch? I can't recall)...

But my comments should help give you a leaping point for your next steps at the least.

::: mozilla-tests/BuildSlaves.py.template
@@ +28,4 @@
>      "xp-ix": "pass",
>      "win7": "pass",
>      "win7-ix": "pass",
> +    "win7-spot": "pass",

there is a version of this file in puppet, and it will need to be modified (and deployed) prior to this patch landing, otherwise we'll fail a reconfig

::: mozilla-tests/config.py
@@ +219,4 @@
>  
>  ALL_TALOS_PLATFORMS = get_talos_slave_platforms(PLATFORMS, platforms=('linux64', 'win32', 'macosx64', 'win64', ))
>  LINUX_ONLY = get_talos_slave_platforms(PLATFORMS, platforms=('linux64', ))
> +WIN7_ONLY = ['win7-ix', 'win7-spot']

noteworthy: doing this will mark our aws windows instances as "ok for performance tests" (aiui); is that what we want?

::: mozilla-tests/production_config.py
@@ +30,5 @@
>  
>  for i in range(1, 11):
>      SLAVES['win10']['t-w1064-ix-%04i' % i] = {}
> +for i in range(0, 400):
> +    SLAVES['win10']['t-w10-spot-%03i' % i] = {}

we're intending aws w10 to not be splittable from in-house win10? (fwiw, if we need in-house for talos, we'll need to be able to split.) This comment is not blocking, since w10 is still on try only.
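A hedged sketch of what splitting the pools might look like; the 'win10-vm' slave-platform key is hypothetical here, and the loop bounds follow the diff above:

```python
# Hypothetical split: keep in-house win10 hardware and AWS spot
# instances in separate slave pools so they can be scheduled apart
# (e.g. talos on hardware only). 'win10-vm' is an invented key.
SLAVES = {'win10': {}, 'win10-vm': {}}

for i in range(1, 11):
    SLAVES['win10']['t-w1064-ix-%04i' % i] = {}      # in-house hardware
for i in range(0, 400):
    SLAVES['win10-vm']['t-w10-spot-%03i' % i] = {}   # AWS spot instances
```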

::: mozilla-tests/thunderbird_config.py
@@ +59,4 @@
>  PLATFORMS['win32']['env_name'] = 'win32-perf'
>  PLATFORMS['win32']['xp-ix'] = {'name': builder_prefix + "Windows XP 32-bit"}
>  PLATFORMS['win32']['win7-ix'] = {'name': builder_prefix + "Windows 7 32-bit"}
> +PLATFORMS['win32']['win7-spot'] = {'name': builder_prefix + "Windows 7 32-bit"}

"Windows 7 VM 32-bit" to match above

@@ +240,5 @@
> +                    'config_files': ["unittests/win_unittest.py"],
> +                },
> +                'mozmill': {
> +                    'config_files': ["unittests/win_unittest.py"],
> +                },

fwiw TB's Mozmill is the only TB-specific suite here (it doesn't run for any Firefox job), so it *could* be a unique form of failure for these that doesn't exist on Firefox jobs.
Attachment #8695375 - Flags: feedback+
Attached patch puppet BuildSlaves-tests.py (obsolete) — Splinter Review
Attachment #8695524 - Flags: review?(bugspam.Callek)
Backed out Win 10 slaves so that I can do a more thorough job of separating Win10 and Win10-VM; removed w732-spot from WIN7_ONLY as we don't want ec2 doing perf tests.
Attachment #8695375 - Attachment is obsolete: true
Attachment #8695375 - Flags: review?(catlee)
Attachment #8695530 - Flags: review?(bugspam.Callek)
Attachment #8695392 - Flags: review?(catlee) → review?(bugspam.Callek)
Attachment #8695392 - Attachment description: buildermap for t-732 → buildermap for t-w732
renamed var WIN7_ONLY > WIN7_HW_ONLY
Attachment #8695530 - Attachment is obsolete: true
Attachment #8695530 - Flags: review?(bugspam.Callek)
Attachment #8695549 - Flags: review?(bugspam.Callek)
Attachment #8695524 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8695549 [details] [diff] [review]
buildbot-configs support for aws t-w732-spot

Review of attachment 8695549 [details] [diff] [review]:
-----------------------------------------------------------------

ok, this is an r+ with condition, the only thing needing changing is the 'talos_slave_platforms' line...

::: mozilla-tests/config.py
@@ +95,5 @@
>  }
>  PLATFORMS['macosx64']['talos_slave_platforms'] = ['yosemite', 'yosemite_r7']
>  
> +PLATFORMS['win32']['slave_platforms'] = ['xp-ix', 'win7-ix', 'win7-spot']
> +PLATFORMS['win32']['talos_slave_platforms'] = ['xp-ix', 'win7-ix', 'win7-spot']

Like thunderbird's comment, we are explicitly running w7 tests in parallel. which is ok.

HOWEVER we shouldn't also add win7-spot to the 'talos_slave_platforms' line. Since that will run talos/performance tests on win7-spot (as per this patch, all talos suites that are not already windows only)
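What the reviewer asks for can be sketched as follows; the `runs_talos` helper is hypothetical (the real scheduling logic lives in buildbotcustom), and only the two config keys come from the diff:

```python
# Sketch: win7-spot joins the general test pool but is kept out of
# talos_slave_platforms, so no performance (talos) jobs land on EC2.
PLATFORMS = {'win32': {}}
PLATFORMS['win32']['slave_platforms'] = ['xp-ix', 'win7-ix', 'win7-spot']
PLATFORMS['win32']['talos_slave_platforms'] = ['xp-ix', 'win7-ix']

def runs_talos(slave_platform):
    # Only platforms listed for talos get performance suites scheduled.
    return slave_platform in PLATFORMS['win32']['talos_slave_platforms']
```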

::: mozilla-tests/thunderbird_config.py
@@ +55,4 @@
>      'reboot_command': ['/tools/buildbot/bin/python'] + MOZHARNESS_REBOOT_CMD,
>      'system_bits': '64',
>  }
> +PLATFORMS['win32']['slave_platforms'] = ['xp-ix', 'win7-ix', 'win7-spot']

Just to comment explicitly, this line's change will make us run two sets of the same tests, one set in AWS and one set on hardware.

This is ok per:
<arr> for the time being, since we're still testing things out, etsts[sic] will continue to run in the dc, too

@@ +105,4 @@
>      PLATFORMS['win32']['slave_platforms'] + \
>      PLATFORMS['macosx64']['slave_platforms']
>  
> +WIN7_HW_ONLY = ['win7']

not actually an issue for you, but if you feel like it, this line can be removed (it's not used in Thunderbird's config at all)
Attachment #8695549 - Flags: review?(bugspam.Callek) → review+
pushed:
- https://hg.mozilla.org/build/puppet/rev/30045e4ae70b
- https://hg.mozilla.org/build/buildbot-configs/rev/63fc4e290988
- https://github.com/mozilla/build-cloud-tools/commit/7cf248c28ad465a48acf4ef1969143747d1f73c9

I think this takes effect after the next reconfig
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Reopened as the config can't be enabled until the slaves are ready and running.
https://hg.mozilla.org/build/buildbot-configs/rev/00fc6d65fb2b
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'm hoping that this patch achieves the following:
- slave set "win7-vm" consisting of t-w732-spot-001 - t-w732-spot-400 instances
- new slaves restricted to try builds only
- new slaves just pick up existing test jobs intended for win7 32 (currently assigned to win7-ix only) without creating duplicate jobs.
Attachment #8695549 - Attachment is obsolete: true
Attachment #8698882 - Flags: review?(bugspam.Callek)
Attached patch rename test slave group (obsolete) — Splinter Review
Attachment #8695524 - Attachment is obsolete: true
Attachment #8698883 - Flags: review?(bugspam.Callek)
Attachment #8695392 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8698882 [details] [diff] [review]
buildbot-configs support for aws t-w732-spot

Review of attachment 8698882 [details] [diff] [review]:
-----------------------------------------------------------------

This patch is very close, but as we discussed, we wanted hardware and VMs to be used at once, as the same thing, not as two distinct things.

This patch as is makes the VMs used on try (except for talos, which will keep using the ix ones), rather than adding the VMs to the ix pool for try.

The way to fix that would be to basically do what I'll try to describe on line-level comments.

::: mozilla-tests/production_config.py
@@ +23,2 @@
>  for i in range(1, 244):   # Move some XP test machines to the Windows 7 pool from 234 to 243 // Bug 1226729 
>      SLAVES['win7-ix']['t-w732-ix-%03i' % i] = {}

e.g. with current naming you'd want:

    SLAVES['win7-vm']['t-w732-ix-%03i' % i] = {}

here as well, to make sure the ix's are included in the same set of machines as we use the VMs for.

That also begs the question of whether we want to name it 'win7-vm' as far as the configs go. If yes, that's fine; if not, we'll need to adjust the other places along the way too.

The only downside of this "don't run in parallel" plan is that, afaict, the slave platform's pretty name can't be different for the two sets of machines.
Attachment #8698882 - Flags: review?(bugspam.Callek) → review-
Comment on attachment 8698883 [details] [diff] [review]
rename test slave group

Review of attachment 8698883 [details] [diff] [review]:
-----------------------------------------------------------------

r+ (and stamp for any similar change pending the outcome of the previous patch I r-'d)
Attachment #8698883 - Flags: review?(bugspam.Callek) → review+
(In reply to Justin Wood (:Callek) from comment #19)
> Comment on attachment 8698882 [details] [diff] [review]
> buildbot-configs support for aws t-w732-spot
> 
> Review of attachment 8698882 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This patch is very close, but as we discussed we wanted hardware and vm's to
> be used at once, as the same thing. not as two distinct things.

Amy, I hear there's some time off going on until end of year; how important is this patch to flesh out before then? If it is important, I can flesh it out and get someone else on the team to review and land stuff. Otherwise I'll just defer until there's an update to it, so I can concentrate on release-promo.
Flags: needinfo?(arich)
Switching the needinfo to Q, since it's his work that depends on this.
Flags: needinfo?(arich) → needinfo?(q)
Callek,

 Having this done by the end of year would unblock other work I need to do. On an urgency scale of 1-10 I would put it at a 6/7. Any help would be greatly appreciated if you have the time.

Q 
(In reply to Justin Wood (:Callek) from comment #21)
> (In reply to Justin Wood (:Callek) from comment #19)
> > Comment on attachment 8698882 [details] [diff] [review]
> > buildbot-configs support for aws t-w732-spot
> > 
> > Review of attachment 8698882 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This patch is very close, but as we discussed we wanted hardware and vm's to
> > be used at once, as the same thing. not as two distinct things.
> 
> Amy, I hear theres some time off going on until end of year, how important
> is this patch to flesh out before then? if it is important I can flesh it
> out and get someone else on the team to review and land stuff. Otherwise
> I'll just defer until theres an update to it, so I can concentrate on
> release-promo.
Flags: needinfo?(q)
This patch updates the prior one that Rob posted, but follows the discussed expectations:

* Talos jobs stay with ix only, and will not be run on ec2 machines
* We are not running these "in parallel"; if we disabled all ec2 machines the world would not burn, it would chug along.
* We want these ec2 machines to be "additions" to the existing win7 pool, and take jobs based on how many are active/online/enabled at a given time. Expecting that we'll do a few and watch for issues, disable the lot and repeat until satisfied.
* Developers should not have to care that these are run in EC2 or in-house at all, nor should Treeherder make that a bold statement to them. Those who need to (sheriffs and releng) can easily identify based on the slave name, rather than the buildername.
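The merged-pool approach described above can be sketched as follows. The slave names and loop bounds follow earlier hunks in this bug; the `is_ec2` helper is hypothetical and just illustrates how sheriffs/releng would tell the machines apart:

```python
# Sketch of a single merged pool: ix hardware and EC2 spot instances
# share one slave-platform key, so jobs are distributed across both
# based on whichever slaves are active/online/enabled.
SLAVES = {'win7-ix': {}}

for i in range(1, 244):
    SLAVES['win7-ix']['t-w732-ix-%03i' % i] = {}    # in-house hardware
for i in range(1, 401):
    SLAVES['win7-ix']['t-w732-spot-%03i' % i] = {}  # EC2 spot additions

# Identification by slave name rather than buildername (hypothetical):
def is_ec2(slave_name):
    return '-spot-' in slave_name
```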

The sanity checking I've done is:
* Checked for `no` difference in builder_list output before and after patch
* Checked the dump_master output for any anomalies (we see an extra line for this prettyname in the output of dump_masters, but I `think` it's fine; I'll paste the diff in a bit)

This obsoletes both the prior puppet patch (due to the new slave name) and the prior cloud-tools patch (due to the builder name non-uniqueness), but I'm holding off on writing newer patches for both until after this gets an r+.
Attachment #8695392 - Attachment is obsolete: true
Attachment #8698882 - Attachment is obsolete: true
Attachment #8698883 - Attachment is obsolete: true
Attachment #8702770 - Flags: review?(kmoir)
Attachment #8702774 - Attachment is patch: true
Comment on attachment 8702770 [details] [diff] [review]
[configs] add win7 ec2 instances

lgtm
Attachment #8702770 - Flags: review?(kmoir) → review+
I need to land this before attachment 8702770 [details] [diff] [review]
Attachment #8704639 - Flags: review?(rthijssen)
Comment on attachment 8704639 [details] [diff] [review]
[puppet] rename the slave group

Review of attachment 8704639 [details] [diff] [review]:
-----------------------------------------------------------------

ship it
Attachment #8704639 - Flags: review?(rthijssen) → review+
Attachment #8704639 - Flags: checked-in+
Comment on attachment 8702770 [details] [diff] [review]
[configs] add win7 ec2 instances

Review of attachment 8702770 [details] [diff] [review]:
-----------------------------------------------------------------

On default: https://hg.mozilla.org/build/buildbot-configs/rev/4a61c10820eb

Will be live with next reconfig
Comment on attachment 8707464 [details] [diff] [review]
update buildermap for corrected win 7 bb config

Review of attachment 8707464 [details] [diff] [review]:
-----------------------------------------------------------------

fyi, you can also attach the url of the github PR as a bmo attachment, and bmo will link straight to github ;-)
Attachment #8707464 - Flags: review?(bugspam.Callek) → review+
Getting an error connecting to 128:

2016-01-27 10:21:56-0800 [-] Log opened.
2016-01-27 10:21:56-0800 [-] twistd 10.2.0 (C:\mozilla-build\buildbotve\scripts\python.exe 2.7.3) starting up.
2016-01-27 10:21:56-0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2016-01-27 10:21:56-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02237760>
2016-01-27 10:21:56-0800 [-] Connecting to buildbot-master128.bb.releng.use1.mozilla.com:9201
2016-01-27 10:21:56-0800 [-] Watching c:\slave\shutdown.stamp's mtime to initiate shutdown
2016-01-27 10:21:56-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2016-01-27 10:21:56-0800 [Broker,client] While trying to connect:
	Traceback (most recent call last):
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\spread\pb.py", line 514, in expressionReceived
	    method(*sexp[1:])
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\spread\pb.py", line 884, in proto_answer
	    d.callback(self.unserialize(netResult))
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\internet\defer.py", line 361, in callback
	    self._startRunCallbacks(result)
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\internet\defer.py", line 455, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\internet\defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\spread\pb.py", line 1112, in _cbResponse
	    return challenger.callRemote("respond", respond(challenge, password), client)
	  File "C:\mozilla-build\buildbotve\lib\site-packages\twisted\spread\pb.py", line 1001, in respond
	    m.update(password)
	exceptions.TypeError: must be string or buffer, not None
	
2016-01-27 10:21:56-0800 [Broker,client] Lost connection to buildbot-master128.bb.releng.use1.mozilla.com:9201
2016-01-27 10:21:56-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02237760>
2016-01-27 10:21:56-0800 [-] Main loop terminated.
2016-01-27 10:21:56-0800 [-] Server Shut Down.
2016-01-27 10:21:56-0800 [-] Server Shut Down.
This turned out to be a slavealloc db issue. The password for connection is set per pool. Callek found this via the query "select * from slave_passwords where poolid IN (63, 92, 94);". The password was updated and the slaves are getting the correct password in the tac file now.
(In reply to Q from comment #35)
> This tuned out to be a slavallocdb issue. The password for connection is set
> per pool. Callek found this via the query "select * from slave_passwords
> where poolid IN (63, 92, 94);". The password was updated and the slaves are
> getting the correct password in the tac file now.

Note that those magic numbers were from `select * from pools;` where 63 was the existing inhouse pool and 92 and 94 were the aws pools.
these are up and running happily now
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Blocks: 1254580