1230330 - Switch to capacity 1 worker type for Firefox desktop tests

Assignee

Description

•

9 years ago

I would like to try beefier instances, since I currently need to chunk certain suites 3-5 times.

Armen [:armenzg]

Assignee

Comment 1

•

9 years ago

What do I need to do to get such instances?

Armen [:armenzg]

Assignee

Updated

•

9 years ago

Assignee: nobody → armenzg

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

•

9 years ago

We'll need to set up a new workerType or two for you.  What workerType are you using now?

Armen [:armenzg]

Assignee

Comment 3

•

9 years ago

b2gtest.

armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "from:" fx_test_base.yml 
  from: 'tasks/test.yml'
armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "workerType" ../test.yml
  workerType: b2gtest

fx_test_base.yml only shows up in my push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc0b3cc247cf

It is the renamed version of:
https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/tests/fx_unittest_base.yml

Dustin J. Mitchell [:dustin] (he/him)

Comment 4

•

9 years ago

OK -- that's running 4 jobs per host, and it's *3.xlarge:

    {
      "instanceType": "r3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    },
    {
      "instanceType": "m3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    },
    {
      "instanceType": "c3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    }

I don't know how we typically go about tweaking these things -- perhaps that instance type is correct, but the capacity needs to be dialed back?  Or maybe we just need bigger instances?  Or both?
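To make the capacity question concrete, here is a rough sketch of what each job gets when four of them share one host. The vCPU and memory figures are assumptions taken from AWS's published specs for these instance types and are not from the worker list above. The even split is also a simplification, since containers contend for CPU and IO dynamically.

```python
# Rough per-task share of a host when several tasks share one instance.
# ASSUMPTION: vCPU and memory figures come from AWS's published specs for
# these instance types; verify them before relying on the numbers.
INSTANCE_SPECS = {
    "r3.xlarge": {"vcpus": 4, "memory_gib": 30.5},
    "m3.xlarge": {"vcpus": 4, "memory_gib": 15.0},
    "c3.xlarge": {"vcpus": 4, "memory_gib": 7.5},
}


def per_task_share(instance_type, capacity):
    """Divide an instance's resources evenly among `capacity` concurrent tasks."""
    spec = INSTANCE_SPECS[instance_type]
    return {
        "vcpus": spec["vcpus"] / capacity,
        "memory_gib": spec["memory_gib"] / capacity,
    }


for name in INSTANCE_SPECS:
    # capacity=4 matches the workerType definition quoted above
    print(name, per_task_share(name, capacity=4))
```

Under these assumptions each job gets roughly one vCPU at capacity 4, so dropping to capacity 1 on the same instance type and moving to a bigger instance type are two different experiments.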

Armen [:armenzg]

Assignee

Comment 5

•

9 years ago

Is r3 the beefiest?
Which one are we running on?

I can't list workers.

garndt: do you know?

Flags: needinfo?(garndt)

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

9 years ago

Comment 4 is the information from the worker list.  We're running on r3.xlarge, m3.xlarge, and c3.xlarge.  The specs are available on Amazon's pricing page.  But we don't know if the issue is IO bandwidth, CPU contention between containers, or some other issue.  One thing we could try is to use the instance type releng uses for EC2 testers, and only use capacity: 1

Joel Maher ( :jmaher ) (UTC -8)

Comment 7

•

9 years ago

Using the same instance type as the buildbot ec2 testers seems the correct route to go.  This would give us a benchmark of the overhead of taskcluster/docker and we can do better math around that.

I agree we should understand if we are CPU or I/O bound.

Dustin J. Mitchell [:dustin] (he/him)

Comment 8

•

9 years ago

OK -- those are m1.medium.  I created a new 'desktop-test' workerType which uses m1.medium with a capacity of 1 (meaning only one test job at a time on the host).  I adjusted the permissions for level-1 repos to allow use of that workerType.  So you should be able to re-push with "b2gtest" replaced with "desktop-test" (which is a good change anyway) and use the m1.medium size.

I set the max capacity to 30 so that it won't spin up hundreds of instances for each try push.  We can increase that later once use of the workerType is in-tree (please remind me to do so!)

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

9 years ago

(you should be able to set that workertype in fx_test_base.yml)

Armen [:armenzg]

Assignee

Comment 10

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=221286127276

Armen [:armenzg]

Assignee

Comment 11

•

9 years ago

Switching to desktop-test has all jobs pending.
Any way to open the tap?

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

•

9 years ago

So, m1.medium isn't compatible with the AMI type.

m3.medium only has 3.9G instance storage, which is a little small!

I tried setting that to be OK:

    {
      "instanceType": "m3.medium",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true,
        "capacityManagement": {
          "diskspaceThreshold": 2000000000
        }
      },
      "launchSpec": {}
    }

and that seems to be able to actually run tasks:

https://tools.taskcluster.net/task-graph-inspector/#SxzC0KO5TOWyGR6ueZSo6Q/LNn2pK-eRJK1FM65wKnpBA/

I'm not sure what disk space we really need for these test jobs -- 2GB sounds a little tight!  Maybe we could set that to 3.5G?  In any case, I think we'll need to stick to capacity=1 for these instance types!  Note that they also only have 3.75G of RAM.
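For reference, a small sketch of how the 3.5G value would look in that userData block. It assumes diskspaceThreshold is a byte count in decimal gigabytes, which is only inferred from the 2000000000 value used for "2GB" above. The helper name gb_to_bytes is made up for illustration.

```python
import json


def gb_to_bytes(gb):
    """Convert decimal gigabytes to bytes (hypothetical helper).

    ASSUMPTION: the threshold is expressed in decimal bytes, inferred from
    the 2000000000 value used for 2GB in the definition above.
    """
    return int(gb * 1000**3)


# Same keys as the userData block above, with the threshold raised to 3.5G.
user_data = {
    "isolatedContainers": True,
    "capacityManagement": {"diskspaceThreshold": gb_to_bytes(3.5)},
}

print(json.dumps(user_data, indent=2))
```

With 3.9G of instance storage, a 3.5G threshold leaves very little headroom, which is one more reason to keep capacity at 1 on this instance type.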

Flags: needinfo?(garndt)

Armen [:armenzg]

Assignee

Comment 13

•

9 years ago

I think I'm starting to see a bit more green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ddb03c51c09f,221286127276&group_state=expanded&filter-resultStatus=success&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

Armen [:armenzg]

Assignee

Comment 14

•

9 years ago

However, the jobs run slower. Maybe 20-30% slower? It's hard to tell.

I think only four more jobs are green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ddb03c51c09f,221286127276&group_state=expanded&filter-resultStatus=success&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

•

9 years ago

It makes sense that they run slower -- it's a slower CPU.  How do they compare to running on Buildbot EC2 instances?

Armen [:armenzg]

Assignee

Comment 16

•

9 years ago

Pretty bad:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1,b06014e0ba00&group_state=expanded

We should probably conclude that there is something else making the tests run slow.

Could docker be getting in the way?

Dustin J. Mitchell [:dustin] (he/him)

Comment 17

•

9 years ago

Docker doesn't do anything active to interfere with a running container -- just sets up the container parameters.

What part of the task is taking longer?  Is it the actual test run, or downloads and things like that?

Armen [:armenzg]

Assignee

Comment 18

•

9 years ago

Side note, I've seen a couple of tasks with no logs: bug 1230942

Armen [:armenzg]

Assignee

Comment 19

•

9 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> What part of the task is taking longer?  Is it the actual test run, or
> downloads and things like that?

The test run.

Joel Maher ( :jmaher ) (UTC -8)

Comment 20

•

9 years ago

Isn't docker a container inside an OS, whereas the buildbot ec2 jobs just run directly on the OS?

Armen [:armenzg]

Assignee

Comment 21

•

9 years ago

Could we have a better worker type with a capacity of 1?

Armen [:armenzg]

Assignee

Comment 22

•

9 years ago

For the record, another push using 'desktop-test' has most instances being killed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d7a9c7f3b475

Greg Arndt [:garndt]

Comment 23

•

9 years ago

Looks like the tasks are getting terminated because they're exceeding the max run time set in the task definition (3600 seconds).

Armen [:armenzg]

Assignee

Comment 24

•

9 years ago

They weren't timing out in previous pushes. I didn't change anything that could produce this.

Anyway, I will go back to 'b2gtest' until I have another worker to use.

Greg Arndt [:garndt]

Comment 25

•

9 years ago

I am not exactly sure what changed, but it looks like they're timing out waiting for gnome to start:

+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ gnome-session
/home/worker/workspace/test-linux.sh: line 99: /home/worker/artifacts/public/gnome-session.log: No such file or directory
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!'

The above repeats until the task finally times out.
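The trace above is an unbounded poll: the loop only ends when the marker file appears or the task's max run time kills it. As an illustration only (this is not what test-linux.sh does, and the timeout and interval values are hypothetical), a bounded version of the same wait could look like this:

```python
import os
import time


def wait_for_file(path, timeout=60.0, interval=1.0):
    """Poll for `path` until it exists or `timeout` seconds have elapsed.

    Returns True if the file appeared, False if the wait timed out.
    The default timeout and interval are illustrative, not values taken
    from the test harness.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


if __name__ == "__main__":
    # Marker path taken from the trace above; short timeout for demonstration.
    if not wait_for_file("/tmp/gnome-session-started", timeout=5.0):
        print("gnome-session did not start in time")
```

Failing fast like this would surface the gnome-session problem within seconds instead of burning the full 3600-second run time.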

Dustin J. Mitchell [:dustin] (he/him)

Comment 26

•

9 years ago

In practice there's not a lot of difference between a process running "in a container" or "on the OS" -- they're both processes in the same kernel, just linked to different namespaces.

I think the medium instances are just too small to run -- probably gnome-desktop is getting OOM'd in those failing runs.  I just upgraded the desktop-test workerType to m3.large and re-pushed your latest try.

Dustin J. Mitchell [:dustin] (he/him)

Comment 27

•

9 years ago

   https://treeherder.mozilla.org/#/jobs?repo=try&revision=461bd2e184a9

Dustin J. Mitchell [:dustin] (he/him)

Comment 28

•

9 years ago

The gnome-session thing is a bug in bug 1228416

Joel Maher ( :jmaher ) (UTC -8)

Comment 29

•

9 years ago

Let's verify what we have here -- we should be able to analyze some logs and determine errors/failures/deltas.

Dustin J. Mitchell [:dustin] (he/him)

Comment 30

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fd657428a97f

Dustin J. Mitchell [:dustin] (he/him)

Comment 31

•

9 years ago

^^ including the patch on bug 1228416, so hopefully some better results!

Joel Maher ( :jmaher ) (UTC -8)

Comment 32

•

9 years ago

The try push looks good.  We have a lot of chunks there, but they don't seem to run long.

Dustin J. Mitchell [:dustin] (he/him)

Comment 33

•

9 years ago

OK, so let's stick to m3.large for the desktop-test workerType.  The workerType is set up -- do we need to do anything further on this bug?

Armen [:armenzg]

Assignee

Comment 34

•

9 years ago

I think we're fine.

I need to spend some time running the reftests locally to see what's going on.
I think getting to the point of trying bug 1221661 will be where we'll know how close (or not) we are.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Assignee

Comment 35

•

9 years ago

I will land the change to make this permanent.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Armen [:armenzg]

Assignee

Comment 36

•

9 years ago

Attached file: MozReview Request: Bug 1230330 - Switch to desktop-test workerType. r=dustin

Bug 1230330 - Switch to desktop-test workerType. r=dustin

Attachment #8698134 - Flags: review?(dustin)

Armen [:armenzg]

Assignee

Updated

•

9 years ago

Summary: Investigate using beefier instances for some Firefox desktop tests → Switch to capacity 1 worker type for Firefox desktop tests

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

9 years ago

Attachment #8698134 - Flags: review?(dustin) → review+

Dustin J. Mitchell [:dustin] (he/him)

Comment 37

•

9 years ago

Comment on attachment 8698134 [details]
MozReview Request: Bug 1230330 - Switch to desktop-test workerType. r=dustin

https://reviewboard.mozilla.org/r/27821/#review24985

Armen [:armenzg]

Assignee

Comment 38

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/748264bd31c02275c602fa5f2b4f2aefd4119dcc
Bug 1230330 - Switch to desktop-test workerType. DONTBUILD. r=dustin

Carsten Book [:Tomcat]

Comment 39

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/748264bd31c0

Status: REOPENED → RESOLVED

Closed: 9 years ago → 9 years ago

Resolution: --- → FIXED