Closed Bug 1230330 Opened 7 years ago Closed 7 years ago

Switch to capacity 1 worker type for Firefox desktop tests

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(1 file)

I would like to try beefier instances since I'm needing to chunk 3-5 times for certain suites.
What do I need to do to set that up?
Assignee: nobody → armenzg
We'll need to set up a new workerType or two for you.  What workerType are you using now?
b2gtest.

armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "from:" fx_test_base.yml 
  from: 'tasks/test.yml'
armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "workerType" ../test.yml
  workerType: b2gtest

fx_test_base.yml only shows up in my push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc0b3cc247cf

It is the renamed version of:
https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/tests/fx_unittest_base.yml
OK -- that's running 4 jobs per host, and it's *3.xlarge:

    {
      "instanceType": "r3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    },
    {
      "instanceType": "m3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    },
    {
      "instanceType": "c3.xlarge",
      "capacity": 4,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {}
    }

I don't know how we typically go about tweaking these things -- perhaps that instance type is correct, but the capacity needs to be dialed back?  Or maybe we just need bigger instances?  Or both?
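To put rough numbers on the capacity question, here is a minimal sketch of what each job gets if a host is fully loaded. The vCPU and RAM figures are Amazon's published specs for these instance types; the even split between jobs is an assumption for illustration only, since nothing in this bug says the containers are hard-limited to an equal share.

```python
# Published specs for the instance types in the workerType definition above:
# (vCPUs, RAM in GiB).
INSTANCE_SPECS = {
    "r3.xlarge": (4, 30.5),
    "m3.xlarge": (4, 15.0),
    "c3.xlarge": (4, 7.5),
}


def per_job_share(instance_type, capacity):
    """Return (vCPUs, GiB RAM) per job, assuming an even split on a full host."""
    vcpus, ram_gib = INSTANCE_SPECS[instance_type]
    return vcpus / capacity, ram_gib / capacity


for name in INSTANCE_SPECS:
    cpu, ram = per_job_share(name, 4)
    print(f"{name}: {cpu:g} vCPU and {ram:g} GiB per job at capacity 4")
```

Under that assumption, the same test job could see anywhere from under 2 GiB to over 7 GiB of RAM depending on which of the three instance types it lands on, which makes timings hard to compare.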
Is r3 the beefiest?
Which one are we running on?

I can't list workers.

garndt: do you know?
Flags: needinfo?(garndt)
Comment 4 has the information from the worker list.  We're running on r3.xlarge, m3.xlarge, and c3.xlarge.  The specs are available on Amazon's pricing page.  But we don't know if the issue is IO bandwidth, CPU contention between containers, or some other issue.  One thing we could try is to use the instance type releng uses for EC2 testers, and only use capacity: 1
Using the same instance type as the Buildbot EC2 testers seems like the right route to go.  This would give us a benchmark of the overhead of Taskcluster/Docker, and we can do better math around that.

I agree we should understand if we are CPU or I/O bound.
OK -- those are m1.medium.  I created a new 'desktop-test' workerType which uses m1.medium with a capacity of 1 (meaning only one test job at a time on the host).  I adjusted the permissions for level-1 repos to allow use of that workerType.  So you should be able to re-push with "b2gtest" replaced with "desktop-test" (which is a good change anyway) and use the m1.medium size.

I set the max capacity to 30 so that it won't spin up hundreds of instances for each try push.  We can increase that later once use of the workerType is in-tree (please remind me to do so!)
(you should be able to set that workertype in fx_test_base.yml)
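As a rough illustration of what the max capacity setting protects against, the sketch below models how many instances a try push could spin up. This is a simplified model, not the provisioner's actual algorithm: the function name is hypothetical, and treating the cap as a count of job slots is an assumption.

```python
import math


def instances_to_run(pending_jobs, capacity_per_instance, max_capacity):
    """Simplified model: enough instances to cover the pending jobs,
    never exceeding max_capacity total job slots (assumed semantics)."""
    needed = math.ceil(pending_jobs / capacity_per_instance)
    allowed = max_capacity // capacity_per_instance
    return min(needed, allowed)


# A try push with 200 pending test jobs on a capacity-1 workerType capped at 30.
print(instances_to_run(200, 1, 30))
```

With capacity 1, the cap translates directly into a maximum number of hosts, so any jobs beyond that simply stay pending until a host frees up.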
Switching to desktop-test has all jobs pending.
Any way to open the tap?
So, m1.medium isn't compatible with the AMI type.

m3.medium only has 3.9G instance storage, which is a little small!

I tried setting that to be OK:

    {
      "instanceType": "m3.medium",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true,
        "capacityManagement": {
          "diskspaceThreshold": 2000000000
        }
      },
      "launchSpec": {}
    }

and that seems to be able to actually run tasks:

https://tools.taskcluster.net/task-graph-inspector/#SxzC0KO5TOWyGR6ueZSo6Q/LNn2pK-eRJK1FM65wKnpBA/

I'm not sure what disk space we really need for these test jobs -- 2GB sounds a little tight!  Maybe we could set that to 3.5G?  In any case, I think we'll need to stick to capacity=1 for these instance types!  Note that they also only have 3.75G of RAM.
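For reference, a minimal sketch of the kind of check a disk space threshold implies, and of what "3.5G" would look like as a byte value. It assumes diskspaceThreshold means the minimum number of free bytes a host must have before taking a task; that reading comes from the context of this comment, not from the worker's documentation. The helper name is hypothetical.

```python
import shutil

GB = 1000 ** 3  # the 2000000000 in the config above reads as 2 GB in decimal units


def has_enough_disk(path, threshold_bytes):
    """True if the filesystem holding `path` has at least threshold_bytes free."""
    return shutil.disk_usage(path).free >= threshold_bytes


# The "3.5G" suggested above, expressed in the same units as the config value.
proposed_threshold = int(3.5 * GB)
print(proposed_threshold)
print(has_enough_disk(".", proposed_threshold))
```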
Flags: needinfo?(garndt)
It makes sense that they run slower -- it's a slower CPU.  How do they compare to running on Buildbot EC2 instances?
Pretty bad:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1,b06014e0ba00&group_state=expanded

We should probably conclude that there is something else making the tests run slow.

Could Docker be getting in the way?
Docker doesn't do anything active to interfere with a running container -- just sets up the container parameters.

What part of the task is taking longer?  Is it the actual test run, or downloads and things like that?
Side note, I've seen a couple of tasks with no logs: bug 1230942
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> What part of the task is taking longer?  Is it the actual test run, or
> downloads and things like that?

The test run.
Isn't Docker a container inside an OS, whereas the Buildbot EC2 jobs just run directly on the OS?
Could we have a better worker type with a capacity of 1?
For the record, another push using 'desktop-test' has most instances being killed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d7a9c7f3b475
Looks like the tasks are getting terminated because they're exceeding the max run time set in the task definition (3600 seconds).
They weren't timing out in previous pushes. I didn't change anything that could produce this.

Anyways, I will go back to 'b2gdesktop' until I have another worker to use.
I am not exactly sure what changed, but it looks like they're timing out waiting for gnome to start:

+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ gnome-session
/home/worker/workspace/test-linux.sh: line 99: /home/worker/artifacts/public/gnome-session.log: No such file or directory
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!'

The above repeats until the task finally times out.
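The trace shows a wait loop with no upper bound, so a session that never starts burns the whole max run time. A minimal sketch of a bounded wait follows, written in Python for illustration rather than as a patch to test-linux.sh; the function name and the timeout values are assumptions.

```python
import os
import time


def wait_for_file(path, timeout_s, poll_s=1.0):
    """Poll until `path` exists; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A caller could run `wait_for_file("/tmp/gnome-session-started", timeout_s=60)` and fail the task immediately with a clear message when it returns False, instead of looping until the task is killed.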
In practice there's not a lot of difference between a process running "in a container" or "on the OS" -- they're both processes in the same kernel, just linked to different namespaces.

I think the medium instances are just too small to run -- probably gnome-desktop is getting OOM'd in those failing runs.  I just upgraded the desktop-test workerType to m3.large and re-pushed your latest try.
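To make the namespace point concrete, here is a small Linux-only sketch that lists the namespaces a process belongs to by reading /proc. Running it on the host and inside a container would show different identifiers for the same namespace types, while both processes still run on one kernel. The function name is hypothetical, and it returns an empty dict where /proc is not available.

```python
import os


def namespace_ids(pid="self"):
    """Map namespace type (pid, net, mnt, ...) to its identifier for a process."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, or no such process
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable with current permissions
    return ids


for name, ident in namespace_ids().items():
    print(name, ident)
```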
The gnome-session thing is a bug in bug 1228416
Let's verify what we have here; we should be able to analyze some logs and determine errors/failures/deltas.
^^ including the patch on bug 1228416, so hopefully some better results!
The try push looks good.  We have a lot of chunks there, but they don't seem to be long.
OK, so let's stick to m3.large for the desktop-test workerType.  The workerType is set up -- do we need to do anything further on this bug?
I think we're fine.

I need to spend some time running the reftests locally to see what's going on.
I think getting to the point of trying bug 1221661 will be where we'll know how close (or not) we are.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I will land the change to make this permanent.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bug 1230330 - Switch to desktop-test workerType. r=dustin
Attachment #8698134 - Flags: review?(dustin)
Summary: Investigate using beefier instances for some Firefox desktop tests → Switch to capacity 1 worker type for Firefox desktop tests
Attachment #8698134 - Flags: review?(dustin) → review+
Comment on attachment 8698134 [details]
MozReview Request: Bug 1230330 - Switch to desktop-test workerType. r=dustin

https://reviewboard.mozilla.org/r/27821/#review24985
https://hg.mozilla.org/mozilla-central/rev/748264bd31c0
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED