Closed Bug 1304065 Opened 9 years ago Closed 9 years ago

Schedule win10vm test jobs on try

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: rwood, Assigned: aobreja)

References

Details

Attachments

(9 files, 8 obsolete files)

69.61 KB, text/csv (jmaher: review+)
3.19 KB, patch (kmoir: feedback+; kmoir: checked-in+)
25.31 KB, patch (coop: feedback+)
937 bytes, patch (coop: review+)
24.81 KB, patch (kmoir: review+; kmoir: checked-in-)
12.88 KB, text/plain
3.67 KB, patch (kmoir: review+)
3.52 KB, patch
60 bytes, text/x-github-pull-request
I've done some preliminary testing on an AWS Win10VM from Q, and it looks good. The eventual goal is to migrate all the Win 8 jobs to Win10VM. For starters, can we please schedule the test jobs to run on try on Win10VM, using some special try syntax?

From an email from :jmaher:

I suspect we will need 2 pools:
20 machines as c3.xlarge
5 machines as g2 class machines (for g1/g2/g3, gpu, and reftest jobs)
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
We will need some entries in slavealloc for the new machines (possibly we need a bug to create the machines?). Will we need new test masters? Currently we have 1+ hardware machines as win10; what master do they use? The ideal scenario here is to copy the win7 data about which jobs run on hw, vm, and vm-gfx.
Q, are there any changes required in cloud-tools before we start on slavealloc/buildbot-configs?
Flags: needinfo?(q)
Looks like we have 400 entries already in slavealloc for t-w10-spot-XXX. There are not yet any entries in buildbot-configs.
We should be good to go as userdata is shared across types. I can run another test run to be sure.
Flags: needinfo?(q)
Attached patch bug1305065_bb.patch (obsolete) — Splinter Review
Add new 400 win10 machines in buildbot-config.
Attachment #8793772 - Flags: review?(kmoir)
Attachment #8793772 - Flags: review?(kmoir) → review+
All 400 t-w10-spot instances added in slavealloc are g2.2xlarge. I think we'll only enable a bunch of them as a first step, but the question here would be: do we want to run the tests on this VM type *only*? Or should we follow the example of Windows 7 VMs and use c3.2xlarge?

At the moment we have 600 t-w732-spot VMs and another 200 g-w732-spot VMs, which makes it 800 in total. Taking a look at the buildbot masters they connect to, I noticed 6 of them, each serving 133-134 such VMs:

+----------------------+----------+
| nickname             | count(*) |
+----------------------+----------+
| bm128-tests1-windows |      134 |
| bm129-tests1-windows |      134 |
| bm137-tests1-windows |      133 |
| bm138-tests1-windows |      133 |
| bm139-tests1-windows |      133 |
| bm140-tests1-windows |      133 |
+----------------------+----------+
6 rows in set (0.01 sec)

That means we'll likely need new buildbot masters too.
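To make the master capacity question concrete, here is a rough sketch of the arithmetic. It assumes slaves are spread as evenly as possible across masters, which is consistent with the counts above but is not confirmed to be how slavealloc assigns them. The `spread` helper and the "double the pool on the same six masters" scenario are illustrative assumptions only.

```python
def spread(total_vms, masters):
    """Distribute total_vms across masters as evenly as possible."""
    base, extra = divmod(total_vms, len(masters))
    # The first `extra` masters take one additional VM each.
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(masters)}


MASTERS = [
    "bm128-tests1-windows", "bm129-tests1-windows", "bm137-tests1-windows",
    "bm138-tests1-windows", "bm139-tests1-windows", "bm140-tests1-windows",
]

# Current Windows 7 pool: 600 t-w732-spot + 200 g-w732-spot.
current = spread(600 + 200, MASTERS)

# Hypothetical: a win10 pool of the same size lands on the same six masters.
with_w10 = spread(2 * (600 + 200), MASTERS)

for name in MASTERS:
    print(name, current[name], with_w10[name])
```

Under that assumption each master would go from 133-134 slaves to 266-267, which is the kind of load increase that suggests adding masters.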
Flags: needinfo?(jmaher)
Assignee: nobody → aobreja
We want to do what we are doing for win7. The initial request is a small pool of c3.2xlarge and g2 machines so we can get things green on try. Is it possible to just do 20 c3 and 5 g2 machines? Or do we need the full scale?
Flags: needinfo?(jmaher)
We've been frequently exhausting the pool of g2.2xlarge machines, so I think we need to be very careful about enabling a lot of them for w10. We should probably also rename these in the config to be g-w10 to indicate that they're g2 instances, not the c3/4 instances we're using for non-graphics work.
Let's only do a small pool of g2. And Amy reminded me that we have been having trouble getting enough g2 instances; possibly we need to run more on hardware in the future. For the purposes of this bug, let's keep it small until we are ready to roll out to integration branches; then we can make decisions :)
Should we delete the current w10 machines and create something similar to Windows 7: 600 t-w10-spot machines (c3.2xlarge) and 200 g-w10-spot machines (g2.2xlarge)? Or should we keep a small pool, and in that case how many t-w10-spot and how many g-w10-spot machines should we have?
Flags: needinfo?(q)
Flags: needinfo?(jmaher)
I am not sure what the current win10 machines are. I am looking to use the new AMI which Q has created to have at least 20 c3.2xlarge and 5 g2.2xlarge machines to test, so we can access them with a try push. 600 and 200 machines seem like way too many; we are still making sure the test jobs run successfully, and these are not scheduled by default on try. Apologies if I don't know the terminology or details of what currently exists.
Flags: needinfo?(jmaher)
Adding 800 machines in slavealloc:
600 t-w10-spot (c3.2xlarge): 300 in us-east-1 and 300 in us-west-2
200 g-w10-spot (g2.2xlarge): 100 in us-east-1 and 100 in us-west-2
Attachment #8794183 - Flags: review?(jmaher)
Attachment #8794183 - Flags: review?(jmaher) → review+
Attached patch bug1304065_bb.patch (obsolete) — Splinter Review
The changes for buildbot-config.
Attachment #8794196 - Flags: review?(jmaher)
Comment on attachment 8794196 [details] [diff] [review]
bug1304065_bb.patch

Review of attachment 8794196 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/production_config.py
@@ +41,5 @@
> for i in range(102, 103): # Use win8's 102 for win10 // Bug 1191481
>     SLAVES['win10']['t-w864-ix-%03i' % i] = {}
>
> +for i in range(1, 601) # added in Bug 1304065
> +    SLAVES['win10']['t-w10-spot-%03i' % i] = {}

don't we want range(1, 20)

@@ +43,5 @@
>
> +for i in range(1, 601) # added in Bug 1304065
> +    SLAVES['win10']['t-w10-spot-%03i' % i] = {}
> +
> +for i in range(1, 201) # added in Bug 1304065

don't we want range(1, 5)
Attachment #8794196 - Flags: review?(jmaher) → review-
We can add all these machines because they will be disabled. I will enable only 20 t-w10-spot and 5 g-w10-spot machines. As long as the rest are not enabled, everything should be fine.
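A small sketch of the naming and counting involved, for anyone checking the ranges. Python's `range()` excludes its stop value, so `range(1, 20)` covers 001 through 019 only; 20 machines need `range(1, 21)`. The `slave_names` helper and the `enabled` set are hypothetical and only illustrate the pattern; in this setup, enabling and disabling is done in slavealloc rather than in buildbot-configs.

```python
def slave_names(prefix, count):
    """Return names prefix-001 .. prefix-<count>, zero-padded to three digits."""
    # range() excludes its stop value, so the stop must be count + 1.
    return ["%s-%03i" % (prefix, i) for i in range(1, count + 1)]


# Full pools as defined in the patch.
all_t = slave_names("t-w10-spot", 600)
all_g = slave_names("g-w10-spot", 200)

# Hypothetical "enabled" subset: the first 20 testers and first 5 GPU testers.
enabled = set(all_t[:20]) | set(all_g[:5])

# For comparison: range(1, 20) stops at 019 and yields only 19 names.
short = ["t-w10-spot-%03i" % i for i in range(1, 20)]

print(all_t[0], all_t[-1], len(enabled), len(short))
```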
Flags: needinfo?(jmaher)
Comment on attachment 8794196 [details] [diff] [review] bug1304065_bb.patch Review of attachment 8794196 [details] [diff] [review]: ----------------------------------------------------------------- as per comment this adds the machines, but doesn't enable them.
Attachment #8794196 - Flags: review- → review+
Flags: needinfo?(jmaher)
The 400 t-w10-spot machines were deleted from slavealloc:

mysql> delete from slaves where name like 't-w10-spot%';
Query OK, 400 rows affected (0.05 sec)

mysql> select count(*) from slaves where name like 't-w10-spot%';

800 machines were added:

mysql> select count(*) from slaves where name like 'g-w10-spot%';
+----------+
| count(*) |
+----------+
|      200 |
+----------+

mysql> select count(*) from slaves where name like 't-w10-spot%';
+----------+
| count(*) |
+----------+
|      600 |
+----------+

600 t-w10-spot (c3.2xlarge): 300 in us-east-1 and 300 in us-west-2
200 g-w10-spot (g2.2xlarge): 100 in us-east-1 and 100 in us-west-2

All these 800 new machines are disabled except 25 of them: t-w10-spot-[001-020] and g-w10-spot-[001-005].
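For anyone who wants to rehearse the same delete/insert/count sequence without touching the real database, here is a rough sketch using an in-memory SQLite database as a stand-in for slavealloc's MySQL. The two-column `slaves` schema and the `enabled` flag are assumptions made for illustration; the real slavealloc schema is not shown in this bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real slavealloc table has more columns.
conn.execute("CREATE TABLE slaves (name TEXT PRIMARY KEY, enabled INTEGER)")

# Stand-in for the 400 pre-existing t-w10-spot rows.
conn.executemany(
    "INSERT INTO slaves VALUES (?, 0)",
    [("t-w10-spot-%03i" % i,) for i in range(1, 401)],
)

# MySQL and SQLite both spell this DELETE FROM (no column list).
deleted = conn.execute(
    "DELETE FROM slaves WHERE name LIKE 't-w10-spot%'"
).rowcount

# Re-create the pools: 600 testers and 200 GPU testers, first 20 and 5 enabled.
rows = [("t-w10-spot-%03i" % i, 1 if i <= 20 else 0) for i in range(1, 601)]
rows += [("g-w10-spot-%03i" % i, 1 if i <= 5 else 0) for i in range(1, 201)]
conn.executemany("INSERT INTO slaves VALUES (?, ?)", rows)


def count(pattern):
    """Count slaves whose name matches a LIKE pattern."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM slaves WHERE name LIKE ?", (pattern,)
    )
    return cur.fetchone()[0]


t_total = count("t-w10-spot%")
g_total = count("g-w10-spot%")
enabled_total = conn.execute(
    "SELECT COUNT(*) FROM slaves WHERE enabled = 1"
).fetchone()[0]

print(deleted, t_total, g_total, enabled_total)
```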
very cool! :catlee, is the next step hooking these up to jobs in buildbot-configs?
Flags: needinfo?(catlee)
Yes, and we also need patches to https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg that will spin up these w10 instances when there are jobs to run.
Flags: needinfo?(catlee)
Patch on https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg to spin up the new w10 instances when there are jobs to run.
Attachment #8794824 - Flags: feedback?(kmoir)
Attached patch bug1304065.patch — Splinter Review
Am I on the right track? I just added win10_vm and win10_vm_gfx, same setup as win7_vm and win7_vm_gfx.
Attachment #8794849 - Flags: feedback?(catlee)
Attachment #8794824 - Flags: feedback?(kmoir) → feedback+
Robert, can you pass the review to Coop or to Joel Maher so we can resolve this bug more quickly? I don't know if Chris will have time to review this patch very soon.
Flags: needinfo?(rwood)
Flags: needinfo?(rwood)
Attachment #8794849 - Flags: feedback?(catlee) → feedback?(coop)
Comment on attachment 8794849 [details] [diff] [review] bug1304065.patch Review of attachment 8794849 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. You'll need a corresponding patch to production_config.py to add the slave class to the SLAVES dict before it will work though.
Attachment #8794849 - Flags: feedback?(coop) → feedback+
The patch for the production_config.py.
Attachment #8797424 - Flags: review?(coop)
Attached patch bug1304065_bb_1.patch (obsolete) — Splinter Review
The patch with all modifications in buildbot-config.
Attachment #8797442 - Flags: review?(coop)
Attachment #8794824 - Flags: checked-in+
Comment on attachment 8794849 [details] [diff] [review] bug1304065.patch https://travis-ci.org/mozilla-releng/build-cloud-tools/builds/165258156 The patch actually had an extra comma in it, which I removed, and the tests passed.
Comment on attachment 8797442 [details] [diff] [review] bug1304065_bb_1.patch Review of attachment 8797442 [details] [diff] [review]: ----------------------------------------------------------------- Kim, don't we also need to modify production_config.py to add the slave class? Those modifications are made in this patch.
Attachment #8797442 - Flags: review?(coop) → review?(kmoir)
aobreja: I'm not sure what you're asking, there are two buildbot patches, both are needed
Attachment #8797424 - Flags: review?(coop) → review+
Comment on attachment 8797442 [details] [diff] [review] bug1304065_bb_1.patch Can you provide a builder diff so I can compare the new/old builders?
Flags: needinfo?(aobreja)
Attached patch bug1304065_bb.patch (obsolete) — Splinter Review
Attachment #8797442 - Attachment is obsolete: true
Attachment #8797442 - Flags: review?(kmoir)
Attached file diferenta (obsolete) —
Sorry for the delay, Kim. Here I attach the difference.
Flags: needinfo?(aobreja)
Comment on attachment 8799767 [details] [diff] [review] bug1304065_bb.patch I looked at the builder diff. I don't understand why we are making changes to other branches; I thought this patch was just about enabling new tests on try.
Flags: needinfo?(aobreja)
Flags: needinfo?(q)
The new patch, which enables the new tests for Windows 10 VM only on try.
Attachment #8793772 - Attachment is obsolete: true
Attachment #8794196 - Attachment is obsolete: true
Attachment #8799767 - Attachment is obsolete: true
Flags: needinfo?(aobreja)
Attachment #8801694 - Flags: review?(kmoir)
Attached file difference
The difference of the tests that are run after the patch is added.
Attachment #8799774 - Attachment is obsolete: true
Comment on attachment 8801694 [details] [diff] [review] bug1304065_bb_try.patch Nit: there is an extra blank line at line 3885
Attachment #8801694 - Flags: review?(kmoir) → review+
a lot of traffic in this bug, does this mean on the next reconfig we can run tests on win10 on a try push?
The patch is in production: https://hg.mozilla.org/build/buildbot-configs/rev/1e9f1a185891ecc98080a07bc5124eb06a6abdfe Kim, can tests be run for win10 now, or do we need anything else? I think we also need to make these changes in the slave_health repository; I will create the patch and ask coop for a review.
Flags: needinfo?(kmoir)
Attached patch bug1304065_slave_health.patch (obsolete) — Splinter Review
Patch for slave_health repository.
Attachment #8802048 - Flags: review?(coop)
Comment on attachment 8802048 [details] [diff] [review]
bug1304065_slave_health.patch

Review of attachment 8802048 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with small ordering nits fixed.

::: js/trends.js
@@ +21,5 @@
> 'g-w732-spot',
> 't-w732-ix',
> 't-w732-spot',
> + 't-w10-spot',
> + 'g-w10-spot',

Can we keep the g- and t- slavetypes contiguous here, please?

::: test_trends.html
@@ +48,1 @@
> <td><div id="t-w864-ixTrend"></div></td>

Can you split this row into two separate rows, each with a max of three slavetypes, please? That will keep it consistent with the rest.
Attachment #8802048 - Flags: review?(coop) → review+
I think that is it, you could try a try run and see if it works.
Flags: needinfo?(kmoir)
Andrei: This is not spinning up new instances (Joel tried a try run). I think the problem is that AMI generation is not configured in puppet, and thus the AMIs are not being generated. See /modules/aws_manager/manifests/cron.pp, which needs an entry for t-w10 and g-w10.

Also, configs need to be there in the cloud tools for g-w10; I just see t-w10 here:

Kims-MacBook-Pro:configs kmoir$ git remote -v
origin git@github.com:mozilla/build-cloud-tools.git (fetch)
origin git@github.com:mozilla/build-cloud-tools.git (push)
Kims-MacBook-Pro:configs kmoir$ pwd
/Users/kmoir/git/build-cloud-tools/configs
Kims-MacBook-Pro:configs kmoir$ ls *w10*
t-w10 t-w10.user-data
Kims-MacBook-Pro:configs kmoir$

Once we get the AMIs generated you should start a try run.
Flags: needinfo?(aobreja)
Attachment #8801694 - Flags: checked-in+ → checked-in-
Also, it might be worth updating the doc for the next person. I wrote this a long time ago: https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_New_AWS_Worker_Class It's probably really out of date, or there may be more recent documents to update regarding adding a new class of machines on AWS.
I deleted all the pending 'Windows 10' on try rev a127e128388b, since we were getting this error from watch_pending.py:

Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: Traceback (most recent call last):
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 593, in <module>
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: main()
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 569, in main
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: latest_ami_percentage=args.latest_ami_percentage,
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 483, in aws_watch_pending
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: latest_ami_percentage=latest_ami_percentage)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "aws_watch_pending.py", line 152, in request_spot_instances
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: instance_config = load_instance_config(moz_instance_type)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "/builds/aws_manager/lib/python2.7/site-packages/repoze/lru/__init__.py", line 287, in lru_cached
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: val = f(*arg)
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: File "/builds/aws_manager/cloud-tools/cloudtools/aws/__init__.py", line 239, in load_instance_config
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: moz_instance_type)))
Oct 18 17:31:06 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: IOError: [Errno 2] No such file or directory: '/builds/aws_manager/cloud-tools/cloudtools/aws/../../configs/g-w10'
Landed https://github.com/mozilla-releng/build-cloud-tools/commit/36c8415ffdb716939841db5f96b359028f856d33 as a guard to prevent it recurring until the rest of the config lands. Went unnoticed for about 7 hours, until we started to get a backlog alert from nagios and were seeing pending across several AWS-based pools.
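For illustration, one way to keep a single missing config from stalling every pool is to check that the per-instance-type config file exists before requesting instances, and to skip (and report) the types that lack one. This is only a sketch of that idea and is not the content of the commit above; `runnable_instance_types` and its arguments are hypothetical names, and the only detail taken from the traceback is that configs live in a `configs/` directory with one file per instance type.

```python
import os
import tempfile


def runnable_instance_types(pending, configs_dir):
    """Split instance types with pending jobs by whether a config file exists.

    `pending` maps an instance type name to its pending job count.
    Returns (ready, missing), both sorted lists of instance type names.
    """
    ready, missing = [], []
    for moz_instance_type in sorted(pending):
        path = os.path.join(configs_dir, moz_instance_type)
        if os.path.isfile(path):
            ready.append(moz_instance_type)
        else:
            # Skip instead of raising, so other pools are still serviced.
            missing.append(moz_instance_type)
    return ready, missing


# Usage: a throwaway configs directory that has t-w10 but not g-w10.
with tempfile.TemporaryDirectory() as configs_dir:
    open(os.path.join(configs_dir, "t-w10"), "w").close()
    ready, missing = runnable_instance_types(
        {"t-w10": 12, "g-w10": 3}, configs_dir
    )

print("ready:", ready, "missing config:", missing)
```

Logging the skipped types somewhere visible would matter in practice, since the failure here went unnoticed for hours.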
The patch for the build-cloud-tools repository (adding the g-w10 files in configs and reverting the change made by Nick ("g-w10": 1000)).
Flags: needinfo?(aobreja)
Attachment #8802526 - Flags: review?(kmoir)
Attached patch bug1304065_puppet.patch (obsolete) — Splinter Review
The patch for puppet repository.
Attachment #8802530 - Flags: review?(kmoir)
Checking /modules/aws_manager/manifests/cron.pp, it seems that the AMI for t-w732 was generated by the script, but we also have an AMI generated at 2016-10-17-03-56, which I suspect was generated manually. Amy, do you know how the AMI for g-w732 was generated, or could you pass the needinfo to someone who may know? Thanks.
Flags: needinfo?(arich)
Flags: needinfo?(arich) → needinfo?(q)
Comment on attachment 8802530 [details] [diff] [review] bug1304065_puppet.patch

Asking in #releng:

<kmoir> so I note that there are cron jobs in puppet to create w-732 amis, but none to create g-732 amis 11:31 AM
we are trying to add t-w10 and g-w10 amis etc 11:31 AM
to run tests on try 11:32 AM
but it is a mystery to me how g-w732 amis are generated given there isn't any cron jobs for them 11:38 AM
<kmoir> Q or markco ^^
<Q> They should have them 11:41 AM
→ brson and •bhearsum (promoted to owner, opped) joined 11:41 AM
<Q> However they can bus the t-ami
<kmoir> Q what does bus the t-ami mean? 11:46 AM
<grenade> kmoir: t-w732 == normal tester, g-w732 == tester with gpu 11:47 AM
<kmoir> right so is the ami the same config but on a different instance type? 11:47 AM
my understanding was that the amis were generated by puppet crons daily 11:47 AM
<grenade> i'm not sure if it's configured to take advantage but i know that there's no difference in whats on the ami 11:48 AM
t and g could easily share an ami. i don't know if they do 11:48 AM
<kmoir> okay, I don't understand how the g* instances know to use the t* ami 11:48 AM
→ gerard-majax joined (Alexandre@moz-r23ptt.lmuk.1rfi.0450.2001.IP) 11:48 AM
<kmoir> is that something in cloud tools that I am missing? 11:49 AM
<grenade> yeah, sorry i don't know either 11:51 AM
<arr> Q is the one who can definitively answer that, and he's in line for the TSA at the moment 11:51 AM
I switched the NI on the bug to him 11:51 AM
oh, sorry, that was a different bug that I switched the NI on (1304065)
Attachment #8802526 - Flags: review?(kmoir) → review+
The way this is supposed to work for t-w732 and g-w732: They both use the same base AMI (specified in build-cloud-tools/configs/t-w732 and build-cloud-tools/configs/g-w732). There's a cron job (defined by puppet in modules/aws_manager/manifests/cron.pp) that generates the golden AMIs for use1 (and copies it to usw2) using that base AMI on aws-manager2. We were missing that last piece until today, so the g-w732 golden AMI was not being generated (see bug 1311430).
Flags: needinfo?(q)
okay thanks Amy. Alin, your patch for puppet needs to be updated to include g-w10 ami generation
oops, sorry I meant to say Andrei in previous comment :-)
The new patch for puppet.
Attachment #8802530 - Attachment is obsolete: true
Attachment #8802530 - Flags: review?(kmoir)
Attachment #8802835 - Flags: review?(kmoir)
Attachment #8802835 - Flags: review?(kmoir) → review+
Recreated the patch so that it does not include the g-w10 modification in watch_pending.cfg.
Attachment #8802526 - Attachment is obsolete: true
Depends on: 1312044
can we get an update here?
This is also going to be WONTFIX since we aren't going to support w10 buildbot (bug 1330999).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard