Closed
Bug 1304065
Opened 9 years ago
Closed 9 years ago
Schedule win10vm test jobs on try
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: rwood, Assigned: aobreja)
Attachments
(9 files, 8 obsolete files)
69.61 KB, text/csv | jmaher: review+
3.19 KB, patch | kmoir: feedback+, kmoir: checked-in+
25.31 KB, patch | coop: feedback+
937 bytes, patch | coop: review+
24.81 KB, patch | kmoir: review+, kmoir: checked-in-
12.88 KB, text/plain
3.67 KB, patch | kmoir: review+
3.52 KB, patch
60 bytes, text/x-github-pull-request
I've done some preliminary testing on an AWS Win10VM from Q, and it looks good.
The eventual goal is to migrate all the Win 8 jobs to Win10VM. For starters, can we please schedule the test jobs to run on try on Win10VM, using some special try syntax?
From an email from :jmaher:
I suspect we will need 2 pools:
20 machines as c3.xlarge
5 machines as g2 class machines (for g1/g2/g3, gpu, and reftest jobs)
Updated•9 years ago
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
Comment 1•9 years ago
we will need some entries in slavealloc for the new machines (possibly we need a bug to create the machines?)
will we need new test masters? currently we have 1+ hardware machines as win10, what master do they use?
the ideal scenario here is to copy the win7 data about which jobs run on hw, vm, vm-gfx.
Comment 2•9 years ago
Q, are there any changes required in cloud-tools before we start on slavealloc/buildbot-configs?
Flags: needinfo?(q)
Comment 3•9 years ago
Looks like we have 400 entries already in slavealloc for t-w10-spot-XXX. There are not yet any entries in buildbot-configs.
We should be good to go as userdata is shared across types. I can run another test run to be sure.
Flags: needinfo?(q)
| Assignee | ||
Comment 5•9 years ago
Add the 400 new win10 machines to buildbot-configs.
Attachment #8793772 -
Flags: review?(kmoir)
Updated•9 years ago
Attachment #8793772 -
Flags: review?(kmoir) → review+
Comment 6•9 years ago
All 400 t-w10-spot instances added in slavealloc are g2.2xlarge. I think we'll only enable a bunch of them as a first step, but the question here would be: do we want to run the tests on this VM type *only*? Or should we follow the example of Windows 7 VMs and use c3.2xlarge?
At the moment we have 600 t-w732-spot VMs and another 200 g-w732-spot VMs, which makes 800 in total. Looking at the buildbot masters they connect to, I found 6 of them, each serving 133-134 such VMs:
+----------------------+----------+
| nickname | count(*) |
+----------------------+----------+
| bm128-tests1-windows | 134 |
| bm129-tests1-windows | 134 |
| bm137-tests1-windows | 133 |
| bm138-tests1-windows | 133 |
| bm139-tests1-windows | 133 |
| bm140-tests1-windows | 133 |
+----------------------+----------+
6 rows in set (0.01 sec)
That means we'll likely need new buildbot masters too.
Flags: needinfo?(jmaher)
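The per-master counts in the table above are what an even split of 800 VMs across six masters produces. A minimal Python sketch of that arithmetic follows. The helper names `spread` and `masters_needed`, and the cap of 134 VMs per master, are assumptions made for illustration; they are not part of slavealloc or buildbot.

```python
import math


def spread(total_vms, masters):
    """Evenly split total_vms across masters, largest shares first."""
    base, extra = divmod(total_vms, masters)
    return [base + 1] * extra + [base] * (masters - extra)


def masters_needed(total_vms, per_master=134):
    """Masters required if each serves at most per_master VMs (assumed cap)."""
    return math.ceil(total_vms / per_master)


if __name__ == "__main__":
    # Current w732 pool spread over the six existing masters.
    print(spread(800, 6))
    # The w732 pool plus an equally sized w10 pool at the same density.
    print(masters_needed(1600))
```

Under that assumed cap, doubling the pool to 1600 VMs would roughly double the number of masters, which is the concern raised in the comment.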
| Assignee | ||
Updated•9 years ago
Assignee: nobody → aobreja
Comment 7•9 years ago
we want to do what we are doing for win7. The initial request is a small pool of c3.2xlarge and g2 machines so we can get things green on try.
Is it possible to do just 20 c3 and 5 g2 machines? Or do we need the full scale?
Flags: needinfo?(jmaher)
Comment 8•9 years ago
We've been frequently exhausting the pool of g2.2xlarge machines, so I think we need to be very careful about enabling a lot of them for w10. We should probably also rename these in the config to be g-w10 to indicate that they're g2 instances, not the c3/4 instances we're using for non-graphics work.
Comment 9•9 years ago
Let's only do a small pool of g2. And Amy reminded me that we have been having trouble getting enough g2 instances; we may need to run more on hardware in the future. For the purposes of this bug, let's keep it small until we are ready to roll out to integration branches; then we can make decisions :)
| Assignee | ||
Comment 10•9 years ago
Should we delete the current w10 machines and create something similar to what we have on Windows 7:
600 t-w10-spot machines (c3.2xlarge) and 200 g-w10-spot (g2.2xlarge)?
Or should we keep a small pool, and in that case how many t-w10-spot and how many g-w10-spot machines should we have?
Flags: needinfo?(q)
Flags: needinfo?(jmaher)
Comment 11•9 years ago
I am not sure what the current win10 machines are. I am looking to use the new AMI that Q has created, with at least 20 c3.2xlarge and 5 g2.2xlarge machines to test on, so we can access them with a try push.
600 and 200 machines seem like way too much, we are still making sure the test jobs run successfully and these are not scheduled by default on try.
Apologies if I don't know the terminology or details of what currently exists.
Flags: needinfo?(jmaher)
| Assignee | ||
Comment 12•9 years ago
Adding 800 machines to slavealloc:
600 t-w10-spot (c3.2xlarge) ( 300 us-east-1 and 300 us-west-2)
200 g-w10-spot (g2.2xlarge) ( 100 us-east-1 and 100 us-west-2)
Attachment #8794183 -
Flags: review?(jmaher)
Updated•9 years ago
Attachment #8794183 -
Flags: review?(jmaher) → review+
| Assignee | ||
Comment 13•9 years ago
The changes for buildbot-config.
Attachment #8794196 -
Flags: review?(jmaher)
Comment 14•9 years ago
Comment on attachment 8794196 [details] [diff] [review]
bug1304065_bb.patch
Review of attachment 8794196 [details] [diff] [review]:
-----------------------------------------------------------------
::: mozilla-tests/production_config.py
@@ +41,5 @@
> for i in range(102, 103): # Use win8's 102 for win10 // Bug 1191481
> SLAVES['win10']['t-w864-ix-%03i' % i] = {}
>
> +for i in range(1, 601) # added in Bug 1304065
> + SLAVES['win10']['t-w10-spot-%03i' % i] = {}
don't we want range(1, 20)
@@ +43,5 @@
>
> +for i in range(1, 601) # added in Bug 1304065
> + SLAVES['win10']['t-w10-spot-%03i' % i] = {}
> +
> +for i in range(1, 201) # added in Bug 1304065
don't we want range(1, 5)
Attachment #8794196 -
Flags: review?(jmaher) → review-
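A note on the ranges suggested in the review: Python's `range` excludes its stop value, so `range(1, 20)` yields 19 names, and 20 machines would need `range(1, 21)`. The sketch below illustrates this with a hypothetical helper, `spot_names`, that takes a count instead of a stop value. The `SLAVES` dict here is a stand-in shaped like the one in the quoted hunk, not the real production_config.py.

```python
def spot_names(prefix, count):
    """Return names prefix-001 .. prefix-<count>, zero-padded to three digits."""
    return ["%s-%03i" % (prefix, i) for i in range(1, count + 1)]


# Hypothetical stand-in for the SLAVES dict in production_config.py.
SLAVES = {"win10": {}}
for name in spot_names("t-w10-spot", 20) + spot_names("g-w10-spot", 5):
    SLAVES["win10"][name] = {}

if __name__ == "__main__":
    # range(1, 20) stops at 19, so it is one machine short of 20.
    print(len(list(range(1, 20))))
    print(sorted(SLAVES["win10"])[:2])
```

Passing a count rather than a stop value keeps the off-by-one out of the call site.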
| Assignee | ||
Comment 15•9 years ago
We can add all these machines because they will be disabled.
I will enable only 20 t-w10-spot and 5 g-w10-spot. As long as the rest of these machines are not enabled, everything should be fine.
Flags: needinfo?(jmaher)
Comment 16•9 years ago
Comment on attachment 8794196 [details] [diff] [review]
bug1304065_bb.patch
Review of attachment 8794196 [details] [diff] [review]:
-----------------------------------------------------------------
as per comment this adds the machines, but doesn't enable them.
Attachment #8794196 -
Flags: review- → review+
Updated•9 years ago
Flags: needinfo?(jmaher)
| Assignee | ||
Updated•9 years ago
Attachment #8794196 -
Flags: checked-in+
| Assignee | ||
Comment 17•9 years ago
400 t-w10-spot machines were deleted from slavealloc:
mysql> delete from slaves where name like 't-w10-spot%';
Query OK, 400 rows affected (0.05 sec)
mysql> select count(*) from slaves where name like 't-w10-spot%';
800 machines were added:
mysql> select count(*) from slaves where name like 'g-w10-spot%';
+----------+
| count(*) |
+----------+
| 200 |
+----------+
mysql> select count(*) from slaves where name like 't-w10-spot%';
+----------+
| count(*) |
+----------+
| 600 |
+----------+
600 t-w10-spot (c3.2xlarge) ( 300 us-east-1 and 300 us-west-2)
200 g-w10-spot (g2.2xlarge) ( 100 us-east-1 and 100 us-west-2)
All 800 of these new machines are disabled except 25 of them: t-w10-spot-[001-020] and g-w10-spot-[001-005].
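The slavealloc changes described above can be replayed against a throwaway database. The following is a minimal sketch using Python's built-in sqlite3 module rather than the production MySQL instance. The two-column `slaves` schema and the helper names are assumptions, since the real table layout is not shown in this bug. The sketch uses the standard `DELETE FROM ... WHERE` form, which takes no column list.

```python
import sqlite3


def build_pool(conn):
    """Recreate the w10 pool: drop 400 stale rows, add 600 t- and 200 g- rows."""
    conn.execute("CREATE TABLE slaves (name TEXT PRIMARY KEY, enabled INTEGER)")
    stale = [("t-w10-spot-%03i" % i, 0) for i in range(1, 401)]
    conn.executemany("INSERT INTO slaves VALUES (?, ?)", stale)
    conn.execute("DELETE FROM slaves WHERE name LIKE 't-w10-spot%'")
    # (prefix, total machines, how many of them start out enabled)
    for prefix, total, enabled in (("t-w10-spot", 600, 20), ("g-w10-spot", 200, 5)):
        rows = [("%s-%03i" % (prefix, i), int(i <= enabled))
                for i in range(1, total + 1)]
        conn.executemany("INSERT INTO slaves VALUES (?, ?)", rows)


def count_like(conn, pattern, enabled_only=False):
    """Count rows whose name matches a LIKE pattern, optionally enabled only."""
    sql = "SELECT COUNT(*) FROM slaves WHERE name LIKE ?"
    if enabled_only:
        sql += " AND enabled = 1"
    return conn.execute(sql, (pattern,)).fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_pool(conn)
    print(count_like(conn, "t-w10-spot%"), count_like(conn, "g-w10-spot%"))
    print(count_like(conn, "%-w10-spot%", enabled_only=True))
```

Counting the enabled rows separately is a quick way to confirm that only the small try pool is live.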
Comment 18•9 years ago
very cool!
:catlee, is the next step hooking these up to jobs in buildbot-configs?
Flags: needinfo?(catlee)
Comment 19•9 years ago
Yes, and we also need patches to https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg that will spin up these w10 instances when there are jobs to run.
Flags: needinfo?(catlee)
| Assignee | ||
Comment 20•9 years ago
Patch on https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg to spin up the new w10 instances when there are jobs to run.
Attachment #8794824 -
Flags: feedback?(kmoir)
| Reporter | ||
Comment 21•9 years ago
Am I on the right track? I just added win10_vm and win10_vm_gfx, same setup as win7_vm and win7_vm_gfx.
Attachment #8794849 -
Flags: feedback?(catlee)
Updated•9 years ago
Attachment #8794824 -
Flags: feedback?(kmoir) → feedback+
| Assignee | ||
Comment 22•9 years ago
Robert, can you pass the review to Coop or Joel Maher so we can resolve this bug more quickly?
I don't know whether Chris will have time to review this patch soon.
Flags: needinfo?(rwood)
| Reporter | ||
Comment 23•9 years ago
Flags: needinfo?(rwood)
Attachment #8794849 -
Flags: feedback?(catlee) → feedback?(coop)
Comment 24•9 years ago
Comment on attachment 8794849 [details] [diff] [review]
bug1304065.patch
Review of attachment 8794849 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. You'll need a corresponding patch to production_config.py to add the slave class to the SLAVES dict before it will work though.
Attachment #8794849 -
Flags: feedback?(coop) → feedback+
| Assignee | ||
Comment 25•9 years ago
The patch for the production_config.py.
Attachment #8797424 -
Flags: review?(coop)
| Assignee | ||
Comment 26•9 years ago
The patch with all modifications in buildbot-config.
Attachment #8797442 -
Flags: review?(coop)
Updated•9 years ago
Attachment #8794824 -
Flags: checked-in+
Comment 27•9 years ago
Comment on attachment 8794849 [details] [diff] [review]
bug1304065.patch
https://travis-ci.org/mozilla-releng/build-cloud-tools/builds/165258156
The patch actually had an extra comma in it, which I removed, and then the tests passed.
| Assignee | ||
Comment 28•9 years ago
Comment on attachment 8797442 [details] [diff] [review]
bug1304065_bb_1.patch
Review of attachment 8797442 [details] [diff] [review]:
-----------------------------------------------------------------
Kim, don't we also need to modify production_config.py to add the slave class? Those modifications are made in this patch.
Attachment #8797442 -
Flags: review?(coop) → review?(kmoir)
Comment 29•9 years ago
aobreja: I'm not sure what you're asking, there are two buildbot patches, both are needed
Updated•9 years ago
Attachment #8797424 -
Flags: review?(coop) → review+
Comment 30•9 years ago
Comment on attachment 8797442 [details] [diff] [review]
bug1304065_bb_1.patch
Can you provide a builder diff so I can compare the new/old builders?
Flags: needinfo?(aobreja)
| Assignee | ||
Comment 31•9 years ago
Attachment #8797442 -
Attachment is obsolete: true
Attachment #8797442 -
Flags: review?(kmoir)
| Assignee | ||
Comment 32•9 years ago
Sorry for the delay, Kim; I've attached the builder diff.
Flags: needinfo?(aobreja)
Comment 33•9 years ago
Comment on attachment 8799767 [details] [diff] [review]
bug1304065_bb.patch
I looked at the builder diff. I don't understand why we are making changes to other branches; I thought this patch was just about enabling new tests on try.
Flags: needinfo?(aobreja)
| Assignee | ||
Comment 34•9 years ago
The new patch, which enables the new tests for Windows 10 VM on try only.
Attachment #8793772 -
Attachment is obsolete: true
Attachment #8794196 -
Attachment is obsolete: true
Attachment #8799767 -
Attachment is obsolete: true
Flags: needinfo?(aobreja)
Attachment #8801694 -
Flags: review?(kmoir)
| Assignee | ||
Comment 35•9 years ago
The diff of the tests that run once the patch is applied.
Attachment #8799774 -
Attachment is obsolete: true
Comment 36•9 years ago
Comment on attachment 8801694 [details] [diff] [review]
bug1304065_bb_try.patch
Nit: there is an extra blank line at line 3885
Attachment #8801694 -
Flags: review?(kmoir) → review+
| Assignee | ||
Updated•9 years ago
Attachment #8801694 -
Flags: checked-in+
Comment 37•9 years ago
a lot of traffic in this bug, does this mean on the next reconfig we can run tests on win10 on a try push?
| Assignee | ||
Comment 38•9 years ago
The patch is in production:
https://hg.mozilla.org/build/buildbot-configs/rev/1e9f1a185891ecc98080a07bc5124eb06a6abdfe
Kim, can tests be run for win10 now, or do we need anything else? I think we also need to make these changes in the slave_health repository; I will create the patch and ask coop for a review.
Flags: needinfo?(kmoir)
| Assignee | ||
Comment 39•9 years ago
Patch for slave_health repository.
Attachment #8802048 -
Flags: review?(coop)
Comment 40•9 years ago
Comment on attachment 8802048 [details] [diff] [review]
bug1304065_slave_health.patch
Review of attachment 8802048 [details] [diff] [review]:
-----------------------------------------------------------------
r+ with small ordering nits fixed.
::: js/trends.js
@@ +21,5 @@
> 'g-w732-spot',
> 't-w732-ix',
> 't-w732-spot',
> + 't-w10-spot',
> + 'g-w10-spot',
Can we keep the g- and t- slavetypes contiguous here, please?
::: test_trends.html
@@ +48,1 @@
> <td><div id="t-w864-ixTrend"></div></td>
Can you split this row into two separate rows, each with a max of three slavetypes, please? That will keep it consistent with the rest.
Attachment #8802048 -
Flags: review?(coop) → review+
Comment 41•9 years ago
I think that is it, you could try a try run and see if it works.
Flags: needinfo?(kmoir)
Comment 42•9 years ago
Andrei:
This is not spinning up new instances (Joel tried a try run). I think the problem is that AMI generation is not configured in puppet, so the AMIs are not being generated.
See /modules/aws_manager/manifests/cron.pp
It needs entries for t-w10 and g-w10.
Configs also need to exist in cloud-tools for g-w10; I only see t-w10 here:
Kims-MacBook-Pro:configs kmoir$ git remote -v
origin git@github.com:mozilla/build-cloud-tools.git (fetch)
origin git@github.com:mozilla/build-cloud-tools.git (push)
Kims-MacBook-Pro:configs kmoir$ pwd
/Users/kmoir/git/build-cloud-tools/configs
Kims-MacBook-Pro:configs kmoir$ ls *w10*
t-w10 t-w10.user-data
Kims-MacBook-Pro:configs kmoir$
Once we get the AMIs generated, you should start a try run.
Flags: needinfo?(aobreja)
Updated•9 years ago
Attachment #8801694 -
Flags: checked-in+ → checked-in-
Comment 43•9 years ago
Also, it might be worth updating the doc for the next person
I wrote this a long time ago
https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_New_AWS_Worker_Class
It's probably really out of date or there are more recent documents to update regarding adding a new class of machines of AWS
Comment 44•9 years ago
I deleted all the pending 'Windows 10' jobs on try rev a127e128388b, since we were getting this error from watch_pending.py:
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: Traceback (most recent call last):
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 593, in <module>
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: main()
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 569, in main
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: latest_ami_percentage=args.latest_ami_percentage,
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 483, in aws_watch_pending
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: latest_ami_percentage=latest_ami_percentage)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 152, in request_spot_instances
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: instance_config = load_instance_config(moz_instance_type)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "/builds/aws_manager/lib/python2.7/site-packages/repoze/lru/__init__.py", line 287, in lru_cached
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: val = f(*arg)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "/builds/aws_manager/cloud-tools/cloudtools/aws/__init__.py", line 239, in load_instance_config
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: moz_instance_type)))
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: IOError: [Errno 2] No such file or directory: '/builds/aws_manager/cloud-tools/cloudtools/aws/../../configs/g-w10'
Landed https://github.com/mozilla-releng/build-cloud-tools/commit/36c8415ffdb716939841db5f96b359028f856d33 as a guard to prevent it recurring until the rest of the config lands.
Went unnoticed for about 7 hours, until we started to get a backlog alert from nagios and were seeing pending across several AWS-based pools.
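The traceback above comes from requesting an instance type that has no file under configs/. One way to avoid a single missing config stalling every pool is to check for the file before requesting instances. The sketch below is a hypothetical guard written for illustration; `split_by_config` is an invented name, and this is not a description of what the commit linked above does.

```python
import os
import tempfile


def split_by_config(instance_types, config_dir):
    """Partition types into (configured, missing) by presence of a config file."""
    configured, missing = [], []
    for name in instance_types:
        has_config = os.path.isfile(os.path.join(config_dir, name))
        (configured if has_config else missing).append(name)
    return configured, missing


if __name__ == "__main__":
    # Temporary directory standing in for cloud-tools/configs.
    config_dir = tempfile.mkdtemp()
    open(os.path.join(config_dir, "t-w10"), "w").close()
    ready, skipped = split_by_config(["t-w10", "g-w10"], config_dir)
    print("request:", ready)
    print("skip and warn:", skipped)
```

With a check like this, types without a config could be logged and skipped while the remaining pools continue to be serviced.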
| Assignee | ||
Comment 45•9 years ago
The patch for the build-cloud-tools repository (adds the g-w10 files in configs and reverts the change made by Nick ("g-w10": 1000)).
Flags: needinfo?(aobreja)
Attachment #8802526 -
Flags: review?(kmoir)
| Assignee | ||
Comment 46•9 years ago
The patch for puppet repository.
Attachment #8802530 -
Flags: review?(kmoir)
| Assignee | ||
Comment 47•9 years ago
From /modules/aws_manager/manifests/cron.pp, it seems the AMI for t-w732 is generated by the script, but we also have an AMI dated 2016-10-17-03-56, which I suspect was generated manually.
Amy, do you know how the AMI for g-w732 was generated, or could you pass the needinfo to someone who might know? Thanks.
Flags: needinfo?(arich)
Updated•9 years ago
Flags: needinfo?(arich) → needinfo?(q)
Comment 48•9 years ago
Comment on attachment 8802530 [details] [diff] [review]
bug1304065_puppet.patch
asking in #releng
kmoir> so I note that there are cron jobs in puppet to create w-732 amis, but none to create g-732 amis
11:31 AM we are trying to add t-w10 and g-w10 amis etc
11:31 AM to run tests on try
11:32 AM but it is a mystery to me how g-w732 amis are generated given there isn't any cron jobs for them
11:38 AM
<kmoir> Q or markco ^^
<Q> They should have them
11:41 AM → brson and •bhearsum (promoted to owner, opped) joined
11:41 AM
<Q> However they can bus the t-ami
<kmoir> Q what does bus the t-ami mean?
11:46 AM
<grenade> kmoir: t-w732 == normal tester, g-w732 == tester with gpu
11:47 AM
<kmoir> right so is the ami the same config but on a different instance type?
11:47 AM my understanding was that the amis were generated by puppet crons daily
11:47 AM
<grenade> i'm not sure if it's configured to take advantage but i know that there's no difference in whats on the ami
11:48 AM t and g could easily share an ami. i don't know if they do
11:48 AM
<kmoir> okay, I don't understand how the g* instances know to use the t* ami
11:48 AM → gerard-majax joined (Alexandre@moz-r23ptt.lmuk.1rfi.0450.2001.IP)
11:48 AM
<kmoir> is that something in cloud tools that I am missing?
11:49 AM
<grenade> yeah, sorry i don't know either
11:51 AM
<arr> Q is the one who can definitively answer that, and he's in line for the TSA at the moment
11:51 AM I switched the NI on the bug to him
11:51 AM oh, osrry, that was a different bug that I switched the NI on (1304065)
Updated•9 years ago
Attachment #8802526 -
Flags: review?(kmoir) → review+
Comment 49•9 years ago
The way this is supposed to work for t-w732 and g-w732:
They both use the same base AMI (specified in build-cloud-tools/configs/t-w732 and build-cloud-tools/configs/g-w732). There's a cron job (defined by puppet in modules/aws_manager/manifests/cron.pp) that generates the golden AMIs for use1 (and copies it to usw2) using that base AMI on aws-manager2. We were missing that last piece until today, so the g-w732 golden AMI was not being generated (see bug 1311430).
Flags: needinfo?(q)
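Several separate pieces came up in this bug as prerequisites for a new instance type: slavealloc entries, buildbot-configs, a cloud-tools config file, watch_pending.cfg, and a puppet cron entry for golden AMI generation. A small checklist helper makes the gaps visible. This is only a sketch: the component labels, the `missing_pieces` helper, and the example `status` data are one reading of the thread around comment 42, not output from any real tool.

```python
# Components mentioned in this bug as needed before an instance type can run jobs.
REQUIRED = (
    "slavealloc",
    "buildbot-configs",
    "cloud-tools config",
    "watch_pending.cfg",
    "puppet AMI cron",
)


def missing_pieces(instance_type, status):
    """List the required components where instance_type is not yet configured.

    status maps a component name to the set of instance types set up there.
    """
    return [c for c in REQUIRED if instance_type not in status.get(c, set())]


# Illustrative snapshot only (assumed, based on the comments in this bug).
status = {
    "slavealloc": {"t-w10", "g-w10"},
    "buildbot-configs": {"t-w10", "g-w10"},
    "cloud-tools config": {"t-w10"},
    "watch_pending.cfg": {"t-w10", "g-w10"},
    "puppet AMI cron": set(),
}

if __name__ == "__main__":
    for itype in ("t-w10", "g-w10"):
        print(itype, "->", missing_pieces(itype, status))
```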
Comment 50•9 years ago
okay thanks Amy. Alin, your patch for puppet needs to be updated to include g-w10 ami generation
Comment 51•9 years ago
oops, sorry I meant to say Andrei in previous comment :-)
| Assignee | ||
Comment 52•9 years ago
The new patch for puppet.
Attachment #8802530 -
Attachment is obsolete: true
Attachment #8802530 -
Flags: review?(kmoir)
Attachment #8802835 -
Flags: review?(kmoir)
| Assignee | ||
Comment 53•9 years ago
Attachment #8802048 -
Attachment is obsolete: true
Updated•9 years ago
Attachment #8802835 -
Flags: review?(kmoir) → review+
| Assignee | ||
Comment 54•9 years ago
Recreated the patch so that it does not include the g-w10 modification in watch_pending.cfg.
Attachment #8802526 -
Attachment is obsolete: true
Comment 55•9 years ago
can we get an update here?
| Assignee | ||
Comment 56•9 years ago
This is also going to be WONTFIX, since we aren't going to support w10 on buildbot (bug 1330999).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•8 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard