Bug 1034055 - implement c3.xlarge slave class for Linux64 test spot instances
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Platform Support
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: Kim Moir [:kmoir]
QA Contact: Chris Cooper [:coop]
Mentors:
Depends on: 1035270 1035863
Blocks: 994920 1031083 1034034 1036609 1038320 1038941 1039227
 
Reported: 2014-07-03 06:40 PDT by Kim Moir [:kmoir]
Modified: 2014-07-17 11:13 PDT
CC List: 8 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
bug1034055.patch (8.57 KB, patch)
2014-07-03 09:02 PDT, Kim Moir [:kmoir]
rail: feedback+
bug1034055-2.patch (9.28 KB, patch)
2014-07-03 10:36 PDT, Kim Moir [:kmoir]
rail: feedback+
bug1034055puppet.patch (6.69 KB, patch)
2014-07-03 10:50 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055-3.patch (9.28 KB, patch)
2014-07-03 12:39 PDT, Kim Moir [:kmoir]
rail: review-
bug1034055-4.patch (9.44 KB, patch)
2014-07-03 13:45 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055bb.patch (5.31 KB, patch)
2014-07-03 14:17 PDT, Kim Moir [:kmoir]
no flags
bug1034055wp.patch (1.01 KB, patch)
2014-07-03 14:26 PDT, Kim Moir [:kmoir]
rail: review+
bug1034055bb.patch wip (6.77 KB, patch)
2014-07-07 19:34 PDT, Kim Moir [:kmoir]
no flags
bug1034055bb-2.patch (7.03 KB, patch)
2014-07-08 08:28 PDT, Kim Moir [:kmoir]
no flags
bug1034055puppet-2.patch (1.71 KB, patch)
2014-07-08 08:32 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055bb-3.patch (7.03 KB, patch)
2014-07-09 13:13 PDT, Kim Moir [:kmoir]
no flags
emulatorslave (20.19 KB, text/plain)
2014-07-09 13:15 PDT, Kim Moir [:kmoir]
rail: review+
bug10340550-4.patch (7.06 KB, patch)
2014-07-09 13:54 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055wp-2.patch (1.02 KB, patch)
2014-07-11 09:25 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055allbranches.patch (10.76 KB, patch)
2014-07-14 09:26 PDT, Kim Moir [:kmoir]
no flags
bug1034055wp-3.patch (1.11 KB, patch)
2014-07-14 09:30 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055-slavehealth.patch (1.15 KB, patch)
2014-07-14 10:06 PDT, Kim Moir [:kmoir]
rail: review-
bug1034055allbranches-2.patch (12.18 KB, patch)
2014-07-14 11:23 PDT, Kim Moir [:kmoir]
rail: review+
bug1034055puppetremove.patch (2.03 KB, patch)
2014-07-14 11:55 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
bug1034055allbranches-3.patch (12.92 KB, patch)
2014-07-14 12:08 PDT, Kim Moir [:kmoir]
kmoir: checked‑in+
bug1034055moarinstances.patch (800 bytes, patch)
2014-07-15 11:55 PDT, Kim Moir [:kmoir]
rail: review+
kmoir: checked‑in+
emulatorlist2.txt (19.81 KB, text/plain)
2014-07-15 11:56 PDT, Kim Moir [:kmoir]
rail: review+
emulatorlist2.txt (20.39 KB, text/plain)
2014-07-15 13:06 PDT, Kim Moir [:kmoir]
no flags

Description Kim Moir [:kmoir] 2014-07-03 06:40:02 PDT
We are having capacity issues on ix because Android 2.3 Armv6 jobs are now enabled on trunk. gbrown tested a c3.xlarge instance, and the reftests et al. now run to completion on this instance type; they didn't before. Thus we should enable a slave class in AWS for Linux64 test spot instances on this instance type, so that we can migrate the 2.3 tests that are currently running on ix to AWS.
Comment 1 Kim Moir [:kmoir] 2014-07-03 09:02:41 PDT
Created attachment 8450274 [details] [diff] [review]
bug1034055.patch

Initial patch for cloud-tools to enable c3.xlarge as a new slave class.

I'll add the patches for the other repos later. FYI, I plan to deploy these changes on ash first to ensure they don't break anything.

I'm not sure how to match this platform in slavealloc.py here so that it gets the correct slaves:
http://hg.mozilla.org/build/cloud-tools/file/bfd92f8744a5/cloudtools/slavealloc.py#l70
The same applies to the regular expression for the builder types in watch_pending.cfg, since the tests will still be split across the existing instance types and the new c3.xlarge slaves.

I'm not sure what the AMI ID should be in the configs for tst-emulator64. Does it get generated automatically when puppet creates the AMI?
Comment 2 Rail Aliiev [:rail] ⌚️ET 2014-07-03 09:16:53 PDT
Comment on attachment 8450274 [details] [diff] [review]
bug1034055.patch

Review of attachment 8450274 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm!

::: configs/instance2ami.json
@@ +21,5 @@
> +     "regions": ["us-east-1", "us-west-2"]},
> +    {"ami-config": "ubuntu-12.04-x86_64-desktop",
> +     "instance-config": "tst-emulator64",
> +     "ssh-key": "aws-releng",
> +     "ssh-user": "ubuntu",

This code is not used anymore, but won't hurt.

::: configs/tst-emulator64
@@ +5,5 @@
> +        "domain": "test.releng.use1.mozilla.com",
> +        "ami": "ami-e48e1e8d",
> +        "subnet_ids": ["subnet-ae35ccc4", "subnet-8f32cbe5", "subnet-ff3542d7",
> +                       "subnet-b8643190", "subnet-fb97bc8f", "subnet-844b7ec2",
> +                       "subnet-ed35cc87", "subnet-5cd0d828", "subnet-7ca5f03a"],

I hope we have enough IP addresses! :)

@@ +7,5 @@
> +        "subnet_ids": ["subnet-ae35ccc4", "subnet-8f32cbe5", "subnet-ff3542d7",
> +                       "subnet-b8643190", "subnet-fb97bc8f", "subnet-844b7ec2",
> +                       "subnet-ed35cc87", "subnet-5cd0d828", "subnet-7ca5f03a"],
> +        "security_group_ids": ["sg-f0f1239f"],
> +        "instance_type": "c3.xlarge",

FTR, instance_type from this file is used only for on-demand instances.

@@ +11,5 @@
> +        "instance_type": "c3.xlarge",
> +        "distro": "ubuntu",
> +        "ssh_key": "aws-releng",
> +        "use_public_ip": true,
> +        "instance_profile_name": "tst-emulator64",

TODO: This will require adding a new profile in Amazon IAM. A copy-paste of the tst-linux64 IAM role will work.

@@ +17,5 @@
> +            "/dev/sda1": {
> +                "size": 20,
> +                "volume_type": "gp2",
> +                "instance_dev": "/dev/xvda1"
> +            }

Since this instance type comes with ephemeral storage, it's worth copying the /dev/sdb and /dev/sdc entries from http://hg.mozilla.org/build/cloud-tools/file/bfd92f8744a5/configs/bld-linux64#l21 (they are harmless).

@@ +23,5 @@
> +        "tags": {
> +            "moz-type": "tst-emulator64"
> +        }
> +    },
> +    "us-west-2": {

The same comments as above apply to usw2.

::: configs/tst-emulator64.cloud-init
@@ +1,1 @@
> +#cloud-config

You can just symlink this file to tst-linux64 since they are identical

::: configs/watch_pending.cfg
@@ +23,5 @@
>          "^Ubuntu Mulet VM 12.04 x64.*": "tst-linux64",
>          "^Ubuntu ASAN VM 12.04 x64.*": "tst-linux64",
>          "^b2g_(emulator|ubuntu64)_vm": "tst-linux64",
>          "^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$": "tst-linux64",
> +        "^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$": "tst-emulator64",

This implies that no new tests are going to run on this instance type; we're only moving existing ones from m1.medium to c3.xlarge. Correct?

Everything except this line can (actually should) be landed in advance, so we generate AMIs and probably test them manually.
Comment 3 Rail Aliiev [:rail] ⌚️ET 2014-07-03 09:20:48 PDT
(In reply to Rail Aliiev [:rail] from comment #2)
> > +        "instance_profile_name": "tst-emulator64",
> 
> TODO: This will require adding a new profile in Amazon IAM. Copy paste of
> tst-linux64 IAM role will work.

Done.
Comment 4 Rail Aliiev [:rail] ⌚️ET 2014-07-03 09:31:38 PDT
Added the "golden" DNS entries (they will be used to puppetize the AMIs):

invtool A create --ip 10.134.50.244 --fqdn tst-emulator64-ec2-golden.test.releng.use1.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.134.50.244 --target tst-emulator64-ec2-golden.test.releng.use1.mozilla.com  --private --description "Golden AMI"

invtool A create --ip 10.132.49.72 --fqdn tst-emulator64-ec2-golden.test.releng.usw2.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.132.49.72 --target tst-emulator64-ec2-golden.test.releng.usw2.mozilla.com  --private --description "Golden AMI"
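Each golden host needs a matching pair of records: an A record (name to address) and a PTR record (address back to name). As a sketch only, the pair can be regenerated from a small host table. The helper names below are hypothetical; only the invtool flags are taken from the commands above, and reverse_pointer comes from Python's standard ipaddress module.

```python
import ipaddress

# Hostnames and addresses copied from the invtool commands above.
GOLDEN_HOSTS = {
    "tst-emulator64-ec2-golden.test.releng.use1.mozilla.com": "10.134.50.244",
    "tst-emulator64-ec2-golden.test.releng.usw2.mozilla.com": "10.132.49.72",
}


def invtool_commands(fqdn, ip):
    """Build the forward (A) and reverse (PTR) invtool commands for one host."""
    desc = '--description "Golden AMI"'
    return [
        f"invtool A create --ip {ip} --fqdn {fqdn} --private {desc}",
        f"invtool PTR create --ip {ip} --target {fqdn} --private {desc}",
    ]


def ptr_name(ip):
    """Reverse-DNS name under which the PTR record for ip is stored."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    for fqdn, ip in GOLDEN_HOSTS.items():
        for command in invtool_commands(fqdn, ip):
            print(command)
        print("# PTR lives at", ptr_name(ip))
```

If a golden host's address changes, both records have to be updated together, since a stale PTR would still point the old address at the golden name.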
Comment 5 Rail Aliiev [:rail] ⌚️ET 2014-07-03 09:53:16 PDT
(In reply to Kim Moir [:kmoir] from comment #1)
> I'll add the patches for the other repos later.  As a fyi I plan to deploy
> these changes on ash first to ensure it doesn't break things.
> 
> I'm not sure how to match this platform in slavealloc.py here so it gets the
> correct slaves
> http://hg.mozilla.org/build/cloud-tools/file/bfd92f8744a5/cloudtools/
> slavealloc.py#l70

We may need to use something else to distinguish these... Maybe speed... The best approach would be adding a new column in the slavealloc DB table, I think.
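A minimal sketch of the speed-based split, assuming each slavealloc record is a dict with "trustlevel" and "speed" keys and that the c3.xlarge slaves are registered with a distinct speed value. The function name and the speed strings are hypothetical. Only the trustlevel check mirrors a line visible in the slavealloc.py hunk reviewed later in this bug.

```python
# Hypothetical sketch of a speed-based split in a slavealloc-style classifier.
# The record keys and the speed values are assumptions for illustration only.

EMULATOR_SPEED = "c3.xlarge"  # assumed marker value for the new pool


def moz_type_for(slave):
    """Return the cloud-tools moz-type for a test slave record, or None."""
    if slave.get("trustlevel") != "try":
        return None  # this sketch only covers the test pools
    if slave.get("speed") == EMULATOR_SPEED:
        return "tst-emulator64"
    return "tst-linux64"
```

The weakness of this approach is that "speed" then carries two meanings. A dedicated column, as suggested above, would let the classifier read the pool name directly instead of inferring it.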

> Same thing applies to the regular expression for the builder types in
> watch_pending.cfg since the tests will still be split across the existing
> instance types and the new c3.xlarge slaves

Are we going to leave existing ones on m1.medium and use c3.xlarge for plain-reftest|crashtest|jsreftest?

> 
> I'm not sure what the ami id should be in the configs for tst-emulator64. 

The current value is OK. This is a base AMI, no puppet applied. This AMI is used to bootstrap and puppetize instances.

> Does this get generated automatically when puppet creates it?

Once we puppetize the "golden" AMIs (daily) we tag and publish them. Then aws_watch_pending.py uses http://hg.mozilla.org/build/cloud-tools/file/bfd92f8744a5/cloudtools/aws/ami.py#l127 to figure out the latest usable AMI.
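Stripped of the AWS calls, "figure out the latest usable AMI" is a filter followed by a max. Below is a minimal sketch over in-memory records, assuming each image carries its moz-type tag, a state, and a creation time. The record layout, the "available" check and the placeholder IDs are invented; this is not the code in ami.py.

```python
from datetime import datetime

# Hypothetical image records; cloud-tools gets the real data from EC2.
# Only the "moz-type" tag name is taken from the config in this bug.
IMAGES = [
    {"id": "ami-old", "state": "available", "created": datetime(2014, 7, 9),
     "tags": {"moz-type": "tst-emulator64"}},
    {"id": "ami-new", "state": "available", "created": datetime(2014, 7, 10),
     "tags": {"moz-type": "tst-emulator64"}},
    {"id": "ami-pending", "state": "pending", "created": datetime(2014, 7, 11),
     "tags": {"moz-type": "tst-emulator64"}},
    {"id": "ami-other", "state": "available", "created": datetime(2014, 7, 11),
     "tags": {"moz-type": "tst-linux64"}},
]


def latest_ami(images, moz_type):
    """Return the newest available image tagged with moz_type, or None."""
    candidates = [
        img for img in images
        if img["tags"].get("moz-type") == moz_type and img["state"] == "available"
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda img: img["created"])
```

This is also why the base "ami" value in the config does not need to be updated by hand: it only bootstraps the golden instance, and the newest published image is looked up at launch time.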
Comment 6 Kim Moir [:kmoir] 2014-07-03 10:05:44 PDT
Regarding comment #5: yes, I plan to leave the existing tests that can run on m1.medium where they are, because it isn't worth using a more expensive instance type than we need. Thanks for all the feedback; I'm working on new patches.
Comment 7 Kim Moir [:kmoir] 2014-07-03 10:36:26 PDT
Created attachment 8450372 [details] [diff] [review]
bug1034055-2.patch
Comment 8 Rail Aliiev [:rail] ⌚️ET 2014-07-03 10:44:19 PDT
Comment on attachment 8450372 [details] [diff] [review]
bug1034055-2.patch

Review of attachment 8450372 [details] [diff] [review]:
-----------------------------------------------------------------

::: configs/tst-emulator64
@@ +17,5 @@
> +            "/dev/sda1": {
> +                "size": 20,
> +                "volume_type": "gp2",
> +                "instance_dev": "/dev/xvda1"
> +            }

^ JSON syntax error :) missing comma

@@ +53,5 @@
> +            "/dev/sda1": {
> +                "size": 20,
> +                "volume_type": "gp2",
> +                "instance_dev": "/dev/xvda1"
> +            }

the same here
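A missing comma between two members is easy to catch mechanically before landing. A small sketch with Python's json module: the "/dev/sda1" entry is copied from the hunk above, while the "/dev/sdb" body is a hypothetical placeholder, not the actual bld-linux64 value.

```python
import json

# The /dev/sda1 entry from the patch, followed by a second device entry.
# The "/dev/sdb" body is a hypothetical placeholder.
BROKEN = """{
    "/dev/sda1": {
        "size": 20,
        "volume_type": "gp2",
        "instance_dev": "/dev/xvda1"
    }
    "/dev/sdb": {"ephemeral_name": "ephemeral0"}
}"""

# Same text with the missing comma added after the /dev/sda1 block.
FIXED = BROKEN.replace('    }\n    "/dev/sdb"', '    },\n    "/dev/sdb"')


def is_valid_json(text):
    """Return True if text parses as JSON, printing the parser error if not."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        print("invalid JSON:", err)
        return False
    return True


if __name__ == "__main__":
    print(is_valid_json(BROKEN), is_valid_json(FIXED))
```

For the broken text the parser reports "Expecting ',' delimiter" on the line after the closing brace. Running each config through python -m json.tool before committing gives the same check from the shell.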

::: configs/watch_pending.cfg
@@ +23,5 @@
>          "^Ubuntu Mulet VM 12.04 x64.*": "tst-linux64",
>          "^Ubuntu ASAN VM 12.04 x64.*": "tst-linux64",
>          "^b2g_(emulator|ubuntu64)_vm": "tst-linux64",
>          "^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$": "tst-linux64",
> +        "^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$": "tst-emulator64",

I think you need:

"^Android 2.3( Armv6)? Emulator.* (plain-reftest|crashtest|jsreftest).*": "tst-emulator64",

(?!blah) means "not blah".
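The difference between the two patterns is easiest to see by running them. (?:(?!X).)*$ is a tempered negative lookahead: it matches only if none of the listed suite names appears anywhere after "Emulator". Mapping that same pattern to tst-emulator64 would therefore route exactly the non-reftest jobs there. The check below uses Python's re module and assumes re.match semantics (anchored at the start of the name). How watch_pending itself applies the patterns is not shown in this bug.

```python
import re

# The two Android 2.3 patterns discussed in this review.
NOT_REFTEST = r"^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$"
REFTEST_ONLY = r"^Android 2.3( Armv6)? Emulator.* (plain-reftest|crashtest|jsreftest).*"

BUILDERMAP = {
    NOT_REFTEST: "tst-linux64",
    REFTEST_ONLY: "tst-emulator64",
}


def slave_types_for(builder):
    """List every slave class whose pattern matches the builder name."""
    return [kind for pattern, kind in BUILDERMAP.items() if re.match(pattern, builder)]


if __name__ == "__main__":
    for name in (
        # Builder name taken from the dev-master log linked in comment 28.
        "Android 2.3 Armv6 Emulator ash opt test plain-reftest-7",
        # Hypothetical non-reftest builder name.
        "Android 2.3 Emulator mozilla-central opt test mochitest-1",
    ):
        print(name, "->", slave_types_for(name))
```

Each builder name should land in exactly one list. An empty list or two entries would mean the patterns overlap or leave a gap.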

I'd test the regexp against the actual builder names... I know, this is bad :)

Once we switch to allthethings.json, this horrible block will be gone!
Comment 9 Kim Moir [:kmoir] 2014-07-03 10:50:00 PDT
Created attachment 8450384 [details] [diff] [review]
bug1034055puppet.patch
Comment 10 Kim Moir [:kmoir] 2014-07-03 12:39:23 PDT
Created attachment 8450470 [details] [diff] [review]
bug1034055-3.patch

I tested the regexp against the builder names and it worked. As you said, I won't land the buildermap change in watch_pending.cfg until we have tested the AMIs.
Comment 11 Rail Aliiev [:rail] ⌚️ET 2014-07-03 13:11:01 PDT
Comment on attachment 8450470 [details] [diff] [review]
bug1034055-3.patch

Review of attachment 8450470 [details] [diff] [review]:
-----------------------------------------------------------------

It looks fine to me as a non-final patch, for two reasons:

* we still need to distinguish tst-linux64 and tst-emulator64 in http://hg.mozilla.org/build/cloud-tools/file/bfd92f8744a5/cloudtools/slavealloc.py#l45

::: configs/watch_pending.cfg
@@ +23,5 @@
>          "^Ubuntu Mulet VM 12.04 x64.*": "tst-linux64",
>          "^Ubuntu ASAN VM 12.04 x64.*": "tst-linux64",
>          "^b2g_(emulator|ubuntu64)_vm": "tst-linux64",
>          "^Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$": "tst-linux64",
> +        "^Android 2.3( Armv6)? Emulator.* (plain-reftest|crashtest|jsreftest).*": "tst-emulator64",

This would immediately start trying to launch the new instance type before it is ready to go (i.e. before the AMIs exist).

Can you put this hunk in a separate patch?
Comment 12 Kim Moir [:kmoir] 2014-07-03 13:45:09 PDT
Created attachment 8450507 [details] [diff] [review]
bug1034055-4.patch
Comment 13 Kim Moir [:kmoir] 2014-07-03 14:17:21 PDT
Created attachment 8450530 [details] [diff] [review]
bug1034055bb.patch

buildbot patch, still need to test
Comment 14 Kim Moir [:kmoir] 2014-07-03 14:26:04 PDT
Created attachment 8450535 [details] [diff] [review]
bug1034055wp.patch

patch for watch pending after we test AMIs
Comment 15 Rail Aliiev [:rail] ⌚️ET 2014-07-03 18:55:46 PDT
Comment on attachment 8450507 [details] [diff] [review]
bug1034055-4.patch

Review of attachment 8450507 [details] [diff] [review]:
-----------------------------------------------------------------

::: cloudtools/slavealloc.py
@@ +76,3 @@
>         slave.get("trustlevel") == "try":
>          return "tst-linux64"
> +    

Please kill the trailing space when you land it.

Kim, can you also file a bug to make this method more straightforward? Using speed here is fine as a temporary workaround, but it may bite us at some point. Please assign it to me.
Comment 16 Pete Moore [:pmoore][:pete] 2014-07-04 00:29:11 PDT
Kim and Rail, thanks for working so quickly through all this yesterday - fantastic job! I can't quite believe how much you got done yesterday - really great work. =)
Comment 17 Kim Moir [:kmoir] 2014-07-04 11:26:45 PDT
Regarding comment 15, I opened bug 1034674.
Comment 18 Kim Moir [:kmoir] 2014-07-04 13:53:18 PDT
I couldn't get an AMI to generate and I'm stuck, so I think I'll wait until more people are back on Monday to help me debug.

I tried to invoke the puppet script by hand, but it failed because it couldn't create a c3.xlarge image in us-east-1b.

As an aside, I thought adding -r us-west-2 to the script parameters in cron.pp for this instance type would help, but when I invoke it I see no error messages or log output, and no AMI is created. So I'm not sure what's going on.
Comment 19 Kim Moir [:kmoir] 2014-07-07 07:50:34 PDT
I talked to rail this morning, and the problem was with the address assigned to tst-emulator64-ec2-golden.test.releng.use1.mozilla.com. He said to modify the config to use these subnets:

  "subnet_ids": ["subnet-35a9835e", "subnet-8f21eeee", "subnet-8c20efed",
                 "subnet-173ff076", "subnet-0aa98361", "subnet-33a98358"],
  "security_group_ids": ["sg-f0f1239f"],

then pick a new address with python free_ips.py -c /tmp/tst-emulator64 -n 1, and update the A and PTR records for tst-emulator64-ec2-golden.test.releng.use1.mozilla.com to reflect the new address. Once this has propagated, I should be able to rerun the AMI generation script successfully.
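Conceptually, the free_ips.py step picks an unused address inside one of the configured subnets. The sketch below is a rough stand-in, not the script's actual logic. It assumes the subnets are given as CIDR blocks (the config lists subnet IDs, and resolving IDs to CIDRs is left out) and that the set of used addresses is already known. It skips the addresses AWS reserves in every subnet: the first four and the last. The function name and the example subnet are hypothetical.

```python
import ipaddress


def free_ips(cidrs, used, n):
    """Return up to n unused addresses from the given subnets.

    AWS reserves the first four addresses and the last address of every
    subnet, so those are skipped along with anything in used.
    """
    taken = {ipaddress.ip_address(ip) for ip in used}
    found = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        for addr in net:
            if addr < net[4] or addr == net[-1] or addr in taken:
                continue
            found.append(str(addr))
            if len(found) == n:
                return found
    return found


if __name__ == "__main__":
    # Hypothetical subnet and used address, not the real releng ranges.
    print(free_ips(["10.134.48.0/24"], ["10.134.48.4"], 1))
```

Whatever tool picks the address, the A and PTR records then have to be updated to match it, as described above.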
Comment 20 Pete Moore [:pmoore][:pete] 2014-07-07 11:03:10 PDT
See https://bugzilla.mozilla.org/show_bug.cgi?id=1034034#c11

Maybe we can still run these tests on IX after all?
Comment 21 Chris Cooper [:coop] 2014-07-07 11:35:41 PDT
(In reply to Pete Moore [:pete][:pmoore] from comment #20)
> See https://bugzilla.mozilla.org/show_bug.cgi?id=1034034#c11
> 
> Maybe we can still run these tests on IX after all?

The builders are configured differently from the testers, so we can't just bulk out the existing pools by re-imaging the builders.

However, since we're effectively creating a new pool of AWS machines for mobile testing, there's no reason why we couldn't use a new hardware pool instead, provided the build machines meet the other prerequisites for use as test machines. I'm specifically worried about the minimum graphics requirements here because I don't think the builders have graphics cards and, because of their design/configuration, may not be able to accept new cards if required.

I'm still a bigger fan of pushing work into AWS if possible, but I'll let Kim decide the best course of action here. If we want to try an experiment, we can pull a Windows ix builder to test.
Comment 22 Kim Moir [:kmoir] 2014-07-07 11:46:29 PDT
I'm going to proceed with testing on AWS right now: get my patches tested on my dev-master and see if we can get that working on a branch, since we seem to be pretty close there and this would resolve some immediate capacity issues.

If we did decide to go with the builder ix machines, we would have to get the ateam to test them, since they are a different hardware ref than our existing ix pool.
Comment 23 Kim Moir [:kmoir] 2014-07-07 19:34:43 PDT
Created attachment 8452081 [details] [diff] [review]
bug1034055bb.patch wip

Patches to enable on ash. The builder diff looks good. I just can't get my AWS loaner to talk to my dev-master due to networking issues, so I can't invoke sendchanges yet. I'll have to talk to people tomorrow morning and see how I can get them to connect.
Comment 24 Kim Moir [:kmoir] 2014-07-07 20:13:55 PDT
Talked to nthomas and got my dev-master ports fixed. Invoked sendchanges and am running some tests overnight with this slave.
Comment 25 Kim Moir [:kmoir] 2014-07-08 07:08:00 PDT
The loaner script used the wrong instance type, so the tests timed out last night. I changed my loaner instance to c3.xlarge and now the tests are running green. I'll attach patches to enable this configuration on ash.
Comment 26 Kim Moir [:kmoir] 2014-07-08 08:28:36 PDT
Created attachment 8452349 [details] [diff] [review]
bug1034055bb-2.patch

patches to enable on ash
Comment 27 Kim Moir [:kmoir] 2014-07-08 08:32:27 PDT
Created attachment 8452352 [details] [diff] [review]
bug1034055puppet-2.patch

We need two names for the slave classes, just like last time: one for Armv6 and one for 2.3. I can delete the ix slave classes that I created earlier once this is in production.
Comment 28 Kim Moir [:kmoir] 2014-07-08 13:58:07 PDT
It turns out the slave I used last night didn't have the correct AMI or instance type. I had to recreate the loaner twice today because there were some problems in our configs. I'm now working through issues where the tests don't complete. I don't think it's a problem with the instance, just the test setup; still investigating.

http://dev-master1.srv.releng.scl3.mozilla.com:8036/builders/Android%202.3%20Armv6%20Emulator%20ash%20opt%20test%20plain-reftest-7/builds/3/steps/run_script/logs/stdio
Comment 29 Jonathan Griffin (:jgriffin) 2014-07-08 15:01:40 PDT
*** Bug 1031083 has been marked as a duplicate of this bug. ***
Comment 30 Kim Moir [:kmoir] 2014-07-09 08:47:12 PDT
So it seems ash is in a bad state, and this was the source of all the errors I saw. When I changed my dev-master to run these tests on m-c and invoked the associated sendchanges, the tests proceeded normally.
Comment 31 Kim Moir [:kmoir] 2014-07-09 12:59:35 PDT
Comment on attachment 8452349 [details] [diff] [review]
bug1034055bb-2.patch

I won't land this until the slavealloc entries, which I will attach shortly, are in place, but I'm asking for review since it's ready.
Comment 32 Kim Moir [:kmoir] 2014-07-09 13:00:39 PDT
Comment on attachment 8452352 [details] [diff] [review]
bug1034055puppet-2.patch

ubuntu64_vm_large will still be used for Android 2.3 tests, but I need a separate name for the Armv6 tests to avoid builder name contention.
Comment 33 Kim Moir [:kmoir] 2014-07-09 13:13:29 PDT
Created attachment 8453274 [details] [diff] [review]
bug1034055bb-3.patch
Comment 34 Kim Moir [:kmoir] 2014-07-09 13:15:07 PDT
Created attachment 8453275 [details]
emulatorslave

List of slaves to add to slavealloc
Comment 35 Kim Moir [:kmoir] 2014-07-09 13:54:56 PDT
Created attachment 8453319 [details] [diff] [review]
bug10340550-4.patch
Comment 36 Rail Aliiev [:rail] ⌚️ET 2014-07-09 13:59:13 PDT
Comment on attachment 8453319 [details] [diff] [review]
bug10340550-4.patch

Review of attachment 8453319 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/production_config.py
@@ +69,5 @@
>  for i in range(1,100) + range(301,400):
>      SLAVES['ubuntu64_vm']['tst-linux64-ec2-%03i' % i] = {}
>  
> +for i in range(1,100) + range(301,400):
> +    SLAVES['ubuntu64_vm_large']['tst-emulator64-spot-%03i' % i] = {}  

Can you kill the trailing spaces above.
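The hunk above is Python 2 code: there, range() returns a list, so two ranges can be joined with +. A standalone sketch of the same loop in Python 3 shows which hostnames it registers. The port via itertools.chain is this sketch's own choice, not part of the patch, and the single-key SLAVES dict is a stand-in for the real config. The loop yields 99 names numbered 001-099 and 99 numbered 301-399, 198 in total. spot-050 through spot-059 and spot-320 through spot-325 from the cron mail in comment 43 fall within these ranges.

```python
from itertools import chain

# Stand-in for the SLAVES dict in mozilla-tests/production_config.py.
SLAVES = {"ubuntu64_vm_large": {}}

# Python 3 form of: for i in range(1,100) + range(301,400):
# range() no longer returns a list, so the two ranges are chained instead.
for i in chain(range(1, 100), range(301, 400)):
    SLAVES["ubuntu64_vm_large"]["tst-emulator64-spot-%03i" % i] = {}
```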
Comment 37 Kim Moir [:kmoir] 2014-07-10 07:17:55 PDT
Comment on attachment 8453275 [details]
emulatorslave

imported to slavealloc
Comment 38 Kim Moir [:kmoir] 2014-07-10 07:21:05 PDT
Tests all ran green on m-c on my dev-master on this image. So I'll get ready to deploy on ash, then to other branches.
Comment 39 Kim Moir [:kmoir] 2014-07-10 07:46:19 PDT
First I have to finish setting up two new masters in AWS to handle the additional slaves in bug 1035863
Comment 40 Kim Moir [:kmoir] 2014-07-11 09:25:21 PDT
Created attachment 8454517 [details] [diff] [review]
bug1034055wp-2.patch
Comment 41 Kim Moir [:kmoir] 2014-07-14 07:56:22 PDT
In production and merged m-c to ash to see if the emulator images are spun up correctly for Android 2.3 Armv6 tests on this branch.
Comment 42 Rail Aliiev [:rail] ⌚️ET 2014-07-14 09:01:06 PDT
Once you deploy the changes to all branches, we should also adjust the slave_health regexes like in http://hg.mozilla.org/build/slave_health/rev/5a892e2e304f
Comment 43 Pete Moore [:pmoore][:pete] 2014-07-14 09:21:33 PDT
Hi Rail, Kim,

I noticed we are getting emails like these, but maybe this is solved by comment 42?


On 14 Jul 2014, at 17:41, Cron Daemon <root@cruncher.srv.releng.scl3.mozilla.com> wrote:

Unknown slave_type for test: tst-emulator64-spot-059
Unknown slave_type for test: tst-emulator64-spot-058
Unknown slave_type for test: tst-emulator64-spot-325
Unknown slave_type for test: tst-emulator64-spot-324
Unknown slave_type for test: tst-emulator64-spot-323
Unknown slave_type for test: tst-emulator64-spot-322
Unknown slave_type for test: tst-emulator64-spot-321
Unknown slave_type for test: tst-emulator64-spot-320
Unknown slave_type for test: tst-emulator64-spot-051
Unknown slave_type for test: tst-emulator64-spot-050
Unknown slave_type for test: tst-emulator64-spot-053
Unknown slave_type for test: tst-emulator64-spot-052
.....
.....
Comment 44 Kim Moir [:kmoir] 2014-07-14 09:26:06 PDT
Created attachment 8455396 [details] [diff] [review]
bug1034055allbranches.patch

Patch to enable on relevant branches.  The builder diff just shows an ordering difference.

I'll attach a patch for the cloud-tools too to remove Ash from the regexp.  Also, I'll remove the slave classes from puppet after we roll this out because there will be jobs in progress when we reconfig.
Comment 45 Kim Moir [:kmoir] 2014-07-14 09:30:46 PDT
Created attachment 8455406 [details] [diff] [review]
bug1034055wp-3.patch

watch all branches not just ash
Comment 46 Rail Aliiev [:rail] ⌚️ET 2014-07-14 09:45:36 PDT
Comment on attachment 8455406 [details] [diff] [review]
bug1034055wp-3.patch

I think "^Android 2.3( Armv6)? Emulator.*" is what you want, the rest is redundant.
Comment 47 Kim Moir [:kmoir] 2014-07-14 10:06:44 PDT
Created attachment 8455434 [details] [diff] [review]
bug1034055-slavehealth.patch

adjust slave health
Comment 48 Rail Aliiev [:rail] ⌚️ET 2014-07-14 10:21:44 PDT
Comment on attachment 8455434 [details] [diff] [review]
bug1034055-slavehealth.patch

Review of attachment 8455434 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/slave_health.js
@@ +91,5 @@
>  	} else if (pending.match(/(Ubuntu HW 12.04 x64|b2g_ics_armv7a_gecko_emulator_hw|b2g_emulator_hw)/)) {
>  	    slavetype = "talos-linux64-ix";
>  	} else if (pending.match(/Android (?:4.2 )?x86/)) {
>  	    slavetype = "talos-linux64-ix";
>  	} else if (pending.match(/Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$/)) {

You also need to change the regexp to match the same regexp in watch_pending.cfg.

@@ +92,5 @@
>  	    slavetype = "talos-linux64-ix";
>  	} else if (pending.match(/Android (?:4.2 )?x86/)) {
>  	    slavetype = "talos-linux64-ix";
>  	} else if (pending.match(/Android 2.3( Armv6)? Emulator(?:(?!plain-reftest|crashtest|jsreftest).)*$/)) {
> +	    slavetype = "tst-emulator64-spot";

Even though the line above looks ok, it's not enough. grepping slave_health gives me a lot of entries for tst-linux64, probably you need more changes to define this type. Coop may know more.
Comment 49 Rail Aliiev [:rail] ⌚️ET 2014-07-14 10:47:01 PDT
(In reply to Pete Moore [:pete][:pmoore] from comment #43)
> Hi Rail, Kim,
> 
> I noticed we are getting emails like these, but maybe this is solved by
> comment 42?
> 
> 
> On 14 Jul 2014, at 17:41, Cron Daemon
> <root@cruncher.srv.releng.scl3.mozilla.com> wrote:
> 
> Unknown slave_type for test: tst-emulator64-spot-059

root@cruncher?!! /me goes to get rid of this.
Comment 50 Rail Aliiev [:rail] ⌚️ET 2014-07-14 10:49:43 PDT
(In reply to Rail Aliiev [:rail] from comment #49)
> (In reply to Pete Moore [:pete][:pmoore] from comment #43)
> > Hi Rail, Kim,
> > 
> > I noticed we are getting emails like these, but maybe this is solved by
> > comment 42?
> > 
> > 
> > On 14 Jul 2014, at 17:41, Cron Daemon
> > <root@cruncher.srv.releng.scl3.mozilla.com> wrote:
> > 
> > Unknown slave_type for test: tst-emulator64-spot-059
> 
> root@cruncher?!! /me goes to get rid of this.


I think this is slave_health. phew...
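Messages like the cron mail above are what a hostname-to-type lookup produces when a new hostname prefix has not been registered yet. This is one reason the slave_health regexes need updating along with everything else. Below is a hypothetical illustration of that failure mode. slave_health itself is JavaScript, and its data structures are not shown in this bug, so the prefix table and helper names are invented.

```python
import re

# Hypothetical prefix table; the entries are slave types named in this bug.
KNOWN_TYPES = {"tst-linux64-ec2", "tst-linux64-spot", "talos-linux64-ix"}


def slave_type(hostname, known=KNOWN_TYPES):
    """Strip the trailing -NNN index and return the prefix if it is known."""
    prefix = re.sub(r"-\d+$", "", hostname)
    return prefix if prefix in known else None


def report_unknown(hostnames, known=KNOWN_TYPES):
    """Print a line like the cron mail for each host with no known type."""
    for name in hostnames:
        if slave_type(name, known) is None:
            print("Unknown slave_type for test:", name)
```

Adding "tst-emulator64-spot" to the known set is the counterpart of the slavetype = "tst-emulator64-spot" line in the slave_health patch reviewed above.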
Comment 51 Kim Moir [:kmoir] 2014-07-14 11:23:24 PDT
Created attachment 8455481 [details] [diff] [review]
bug1034055allbranches-2.patch
Comment 52 Rail Aliiev [:rail] ⌚️ET 2014-07-14 11:41:21 PDT
Comment on attachment 8455481 [details] [diff] [review]
bug1034055allbranches-2.patch

Review of attachment 8455481 [details] [diff] [review]:
-----------------------------------------------------------------

Can you also remove ubuntu64_hw_mobile from BuildSlaves.py.template when you land?

A separate patch to remove ubuntu64_hw_mobile from puppet is also appreciated.

::: mozilla-tests/mobile_config.py
@@ +1426,2 @@
>              'debug_unittest_suites': [],
>          },       

Can you also kill the trailing space above when you land this.

@@ +1599,5 @@
>          ANDROID_2_3_AWS_DICT['opt_unittest_suites'].append(suite)
>  
>  # enable android 2.3 tests to ride the trains bug 1004791
>  for name, branch in items_at_least(BRANCHES, 'gecko_version', 32):
>      # Loop removes it from any branch that gets beyond here

As a follow-up, can you file a bug to fix this loop using items_before()? It's much more convenient for merge duty patches.
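The two helpers act as complementary filters on a per-branch version key. One selects branches at or above a gecko version, the other those below it, so a loop over one can be rewritten as a loop over the other. A minimal sketch, assuming items_before() takes (dict, key, version) like the items_at_least() call above. The implementations and the branch table are illustrative, not the buildbot-configs code.

```python
# Hypothetical re-implementations of the helpers named in this review;
# the real buildbot-configs signatures may differ.


def items_at_least(d, key, minimum):
    """Yield (name, item) pairs whose item[key] is >= minimum."""
    for name, item in d.items():
        value = item.get(key)
        if value is not None and value >= minimum:
            yield name, item


def items_before(d, key, maximum):
    """Yield (name, item) pairs whose item[key] is < maximum."""
    for name, item in d.items():
        value = item.get(key)
        if value is not None and value < maximum:
            yield name, item


# Illustrative branch table; the version numbers are examples only.
BRANCHES = {
    "mozilla-central": {"gecko_version": 33},
    "mozilla-aurora": {"gecko_version": 32},
    "mozilla-beta": {"gecko_version": 31},
}
```

With the threshold at 32, items_at_least() yields mozilla-central and mozilla-aurora, and items_before() yields mozilla-beta.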
Comment 53 Kim Moir [:kmoir] 2014-07-14 11:55:32 PDT
Created attachment 8455508 [details] [diff] [review]
bug1034055puppetremove.patch

remove old slave classes from puppet once buildbot config patches are landed and reconfiged
Comment 54 Kim Moir [:kmoir] 2014-07-14 12:08:14 PDT
Created attachment 8455516 [details] [diff] [review]
bug1034055allbranches-3.patch

Better patch than the one reviewed in comment 52; the follow-up bug for the loop is bug 1038320.
Comment 55 Kim Moir [:kmoir] 2014-07-14 12:14:32 PDT
Comment on attachment 8455434 [details] [diff] [review]
bug1034055-slavehealth.patch

coop is going to add the stuff to slave health so I won't worry about writing a patch for it.  Thanks coop!
Comment 56 Kim Moir [:kmoir] 2014-07-14 12:50:42 PDT
Tests are green on ash so I'll land my patches to enable on all branches and reconfig again first thing tomorrow morning.  Reconfigs are fast when people in California are still sleeping and the load is light on the masters :-)
Comment 57 Kim Moir [:kmoir] 2014-07-15 07:35:00 PDT
Comment on attachment 8455406 [details] [diff] [review]
bug1034055wp-3.patch

like this as rail suggested
Android 2.3( Armv6)? Emulator.*
Comment 58 Kim Moir [:kmoir] 2014-07-15 07:35:19 PDT
In production
Comment 59 Kim Moir [:kmoir] 2014-07-15 09:20:34 PDT
We are seeing a few problems cloning hg on some spot instances, but this seems to be bug 1036176.

I can see in slave health that the entire pool of spot instances is up and running jobs.  However, there are still ~800 pending jobs so we'll have to wait and see if we need to expand the pool, especially given that there are b2g tests that want to run on this same instance type in bug 1031083.
Comment 60 Kim Moir [:kmoir] 2014-07-15 11:55:57 PDT
Created attachment 8456351 [details] [diff] [review]
bug1034055moarinstances.patch

more instances to reduce pending
Comment 61 Kim Moir [:kmoir] 2014-07-15 11:56:33 PDT
Created attachment 8456352 [details]
emulatorlist2.txt

add new instances to slavealloc
Comment 62 Rail Aliiev [:rail] ⌚️ET 2014-07-15 12:30:30 PDT
Comment on attachment 8456352 [details]
emulatorlist2.txt

conditional r+:
s/tst-linux64-spot/tst-emulator64-spot/

(it would fail inserting into the db)
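The s/tst-linux64-spot/tst-emulator64-spot/ fix is a plain substitution. A prefix check before importing would catch names that would otherwise fail in the database. A small sketch, assuming the import list has been reduced to one hostname per line (the actual emulatorlist2.txt format is not shown in this bug). The helper names and the three-digit suffix rule are assumptions.

```python
import re


def rename_hosts(lines, old="tst-linux64-spot-", new="tst-emulator64-spot-"):
    """Apply the s/old/new/ substitution from the review to every line."""
    return [line.replace(old, new) for line in lines]


def bad_hosts(lines, prefix="tst-emulator64-spot-"):
    """Return lines that are not exactly <prefix><three digits>."""
    pattern = re.compile(re.escape(prefix) + r"\d{3}$")
    return [line for line in lines if not pattern.match(line)]
```

Running bad_hosts() over the renamed list before the import should return an empty list; anything it returns is a hostname that does not follow the new pool's naming.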
Comment 63 Kim Moir [:kmoir] 2014-07-15 12:38:32 PDT
Comment on attachment 8455508 [details] [diff] [review]
bug1034055puppetremove.patch

and merged
Comment 64 Kim Moir [:kmoir] 2014-07-15 13:06:41 PDT
Created attachment 8456406 [details]
emulatorlist2.txt

actually the hostnames were wrong, this is what I added to the db
Comment 65 Kim Moir [:kmoir] 2014-07-16 06:26:57 PDT
Comment on attachment 8456406 [details]
emulatorlist2.txt

added to and enabled in slavealloc
Comment 66 Kim Moir [:kmoir] 2014-07-16 07:37:06 PDT
New slave pool in production
Comment 67 Kim Moir [:kmoir] 2014-07-16 08:20:56 PDT
New instances are up but the pending count is still high (~1200).  Will watch it over the next 24 hours and see how it keeps up with load.
Comment 68 Kim Moir [:kmoir] 2014-07-17 11:13:34 PDT
Pending count looks good.  Closing.
