Closed
Bug 1230330
Opened 9 years ago
Closed 9 years ago
Switch to capacity 1 worker type for Firefox desktop tests
Categories
(Taskcluster :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: armenzg)
References
Details
Attachments
(1 file)
I would like to try beefier instances, since I currently need to chunk certain suites 3-5 times.
Comment 1•9 years ago
What do I need to do for that?
Updated•9 years ago
Assignee: nobody → armenzg
Comment 2•9 years ago
We'll need to set up a new workerType or two for you. What workerType are you using now?
Comment 3•9 years ago
b2gtest.
armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "from:" fx_test_base.yml
from: 'tasks/test.yml'
armenzg@armenzg-thinkpad:~/repos/mozilla-central/testing/taskcluster/tasks/tests$ grep -r "workerType" ../test.yml
workerType: b2gtest
fx_test_base.yml only shows up in my push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc0b3cc247cf
It is the renamed version of:
https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/tests/fx_unittest_base.yml
Comment 4•9 years ago
OK -- that's running 4 jobs per host, and it's *3.xlarge:
{
"instanceType": "r3.xlarge",
"capacity": 4,
"utility": 1,
"secrets": {},
"scopes": [],
"userData": {
"isolatedContainers": true
},
"launchSpec": {}
},
{
"instanceType": "m3.xlarge",
"capacity": 4,
"utility": 1,
"secrets": {},
"scopes": [],
"userData": {
"isolatedContainers": true
},
"launchSpec": {}
},
{
"instanceType": "c3.xlarge",
"capacity": 4,
"utility": 1,
"secrets": {},
"scopes": [],
"userData": {
"isolatedContainers": true
},
"launchSpec": {}
}
I don't know how we typically go about tweaking these things -- perhaps that instance type is correct, but the capacity needs to be dialed back? Or maybe we just need bigger instances? Or both?
Comment 5•9 years ago
Is r3 the beefiest?
Which one are we running on?
I can't list workers.
garndt: do you know?
Flags: needinfo?(garndt)
Comment 6•9 years ago
Comment 4 is the information from the worker list. We're running on r3.xlarge, m3.xlarge, and c3.xlarge. The specs are available on Amazon's pricing page, but we don't know whether the issue is I/O bandwidth, CPU contention between containers, or something else. One thing we could try is using the instance type releng uses for EC2 testers, with only capacity: 1.
Comment 7•9 years ago
Using the same instance type as the Buildbot EC2 testers seems like the correct route. This would give us a benchmark of the overhead of taskcluster/docker, and we can do better math around that.
I agree we should understand whether we are CPU-bound or I/O-bound.
Comment 8•9 years ago
OK -- those are m1.medium. I created a new 'desktop-test' workerType which uses m1.medium with a capacity of 1 (meaning only one test job at a time on the host). I adjusted the permissions for level-1 repos to allow use of that workerType. So you should be able to re-push with "b2gtest" replaced with "desktop-test" (which is a good change anyway) and use the m1.medium size.
I set the max capacity to 30 so that it won't spin up hundreds of instances for each try push. We can increase that later once use of the workerType is in-tree (please remind me to do so!)
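For reference, the new workerType definition described above would look roughly like this. This is a sketch only: the field names follow the instanceTypes entries quoted in comment 4, and maxCapacity is assumed to be the provisioner-level setting being referred to, not something confirmed in this bug.

```json
{
  "maxCapacity": 30,
  "instanceTypes": [
    {
      "instanceType": "m1.medium",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": { "isolatedContainers": true },
      "launchSpec": {}
    }
  ]
}
```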
Comment 9•9 years ago
(you should be able to set that workertype in fx_test_base.yml)
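A sketch of that change in fx_test_base.yml, assuming the key simply overrides the workerType inherited from tasks/test.yml (as shown by the grep output in comment 3):

```yaml
# fx_test_base.yml (sketch) -- override the workerType inherited from tasks/test.yml
workerType: desktop-test
```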
Comment 10•9 years ago
Comment 11•9 years ago
Switching to desktop-test leaves all jobs pending.
Any way to open the tap?
Comment 12•9 years ago
So, m1.medium isn't compatible with the AMI type.
m3.medium only has 3.9G instance storage, which is a little small!
I tried setting that to be OK:
{
"instanceType": "m3.medium",
"capacity": 1,
"utility": 1,
"secrets": {},
"scopes": [],
"userData": {
"isolatedContainers": true,
"capacityManagement": {
"diskspaceThreshold": 2000000000
}
},
"launchSpec": {}
}
and that seems to be able to actually run tasks:
https://tools.taskcluster.net/task-graph-inspector/#SxzC0KO5TOWyGR6ueZSo6Q/LNn2pK-eRJK1FM65wKnpBA/
I'm not sure what disk space we really need for these test jobs -- 2GB sounds a little tight! Maybe we could set that to 3.5G? In any case, I think we'll need to stick to capacity=1 for these instance types! Note that they also only have 3.75G of RAM.
Flags: needinfo?(garndt)
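To put the diskspaceThreshold above in perspective: the value is in bytes, and a quick check shows how the configured 2000000000 compares to the 3.5G suggestion. This is illustrative arithmetic only, not a proposed config change.

```shell
# diskspaceThreshold is expressed in bytes.
# Compare the configured 2000000000 (just under 2 GiB) with a 3.5 GiB threshold.
GiB=$((1024 * 1024 * 1024))
current=2000000000
proposed=$((GiB * 7 / 2))   # 3.5 GiB = 3758096384 bytes
echo "$proposed"
```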
Comment 13•9 years ago
I think I'm starting to see a bit more green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ddb03c51c09f,221286127276&group_state=expanded&filter-resultStatus=success&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
Comment 14•9 years ago
However, the jobs run slower. Maybe 20-30% slower? It's hard to tell.
I think only four more jobs are green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ddb03c51c09f,221286127276&group_state=expanded&filter-resultStatus=success&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
Comment 15•9 years ago
It makes sense that they run slower -- it's a slower CPU. How do they compare to running on Buildbot EC2 instances?
Comment 16•9 years ago
Pretty bad:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1,b06014e0ba00&group_state=expanded
We should probably conclude that there is something else making the tests run slow.
Could Docker be getting in the way?
Comment 17•9 years ago
Docker doesn't do anything active to interfere with a running container -- just sets up the container parameters.
What part of the task is taking longer? Is it the actual test run, or downloads and things like that?
Comment 18•9 years ago
Side note, I've seen a couple of tasks with no logs: bug 1230942
Comment 19•9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> What part of the task is taking longer? Is it the actual test run, or
> downloads and things like that?
The test run.
Comment 20•9 years ago
Isn't Docker a container inside an OS, whereas the Buildbot EC2 jobs are just running directly on the OS?
Comment 21•9 years ago
Could we have a better worker type with a capacity of 1?
Comment 22•9 years ago
For the record, another push using 'desktop-test' has most instances being killed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d7a9c7f3b475
Comment 23•9 years ago
Looks like the tasks are getting terminated because they're exceeding the max run time set in the task definition (3600 seconds).
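The relevant task-definition field looks roughly like this. A sketch only: the exact payload shape is assumed to follow the docker-worker payload format, which is not quoted in this bug.

```yaml
payload:
  maxRunTime: 3600   # seconds; tasks running longer than this are terminated
```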
Comment 24•9 years ago
They weren't timing out in previous pushes, and I didn't change anything that could cause this.
Anyway, I will go back to 'b2gdesktop' until I have another worker to use.
Comment 25•9 years ago
I am not exactly sure what changed, but it looks like they're timing out waiting for gnome to start:
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ gnome-session
/home/worker/workspace/test-linux.sh: line 99: /home/worker/artifacts/public/gnome-session.log: No such file or directory
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!' -f /tmp/gnome-session-started ']'
+ sleep 1
+ '[' '!'
The above repeats until the task finally times out.
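Based on the log fragment above, the script appears to poll for a marker file in an unbounded loop. A bounded version of that wait is sketched below; the function name, the timeout value, and the structure are illustrative assumptions, not the actual contents of test-linux.sh.

```shell
#!/bin/sh
# Hypothetical sketch of a bounded wait, modeled on the polling loop seen in
# the test-linux.sh log above. wait_for_file and MAX_WAIT are illustrative.
wait_for_file() {
    file="$1"
    max_wait="$2"   # seconds to wait before giving up
    waited=0
    while [ ! -f "$file" ]; do
        if [ "$waited" -ge "$max_wait" ]; then
            echo "timed out waiting for $file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example: create the marker first, so the wait returns immediately.
touch /tmp/gnome-session-started
wait_for_file /tmp/gnome-session-started 5 && echo "session started"
```

A bounded loop like this would turn the silent hour-long hang into a fast, explicit failure in the task log.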
Comment 26•9 years ago
In practice there's not a lot of difference between a process running "in a container" or "on the OS" -- they're both processes in the same kernel, just linked to different namespaces.
I think the medium instances are just too small to run -- probably gnome-desktop is getting OOM'd in those failing runs. I just upgraded the desktop-test workerType to m3.large and re-pushed your latest try.
Comment 27•9 years ago
Comment 28•9 years ago
The gnome-session thing is a bug in bug 1228416
Comment 29•9 years ago
Let's verify what we have here; we should be able to analyze some logs and determine errors/failures/deltas.
Comment 30•9 years ago
Comment 31•9 years ago
^^ including the patch on bug 1228416, so hopefully some better results!
Comment 32•9 years ago
The try push looks good. We have a lot of chunks there, but they don't seem to be long.
Comment 33•9 years ago
OK, so let's stick to m3.large for the desktop-test workerType. The workerType is set up -- do we need to do anything further on this bug?
Comment 34•9 years ago
I think we're fine.
I need to spend some time running the reftests locally to see what's going on.
I think getting to the point of trying bug 1221661 will be where we'll know how close (or not) we are.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 35•9 years ago
I will land the change to make this permanent.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 36•9 years ago
Bug 1230330 - Switch to desktop-test workerType. r=dustin
Attachment #8698134 - Flags: review?(dustin)
Updated•9 years ago
Summary: Investigate using beefier instances for some Firefox desktop tests → Switch to capacity 1 worker type for Firefox desktop tests
Updated•9 years ago
Attachment #8698134 - Flags: review?(dustin) → review+
Comment 37•9 years ago
Comment on attachment 8698134 [details]
MozReview Request: Bug 1230330 - Switch to desktop-test workerType. r=dustin
https://reviewboard.mozilla.org/r/27821/#review24985
Comment 38•9 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/748264bd31c02275c602fa5f2b4f2aefd4119dcc
Bug 1230330 - Switch to desktop-test workerType. DONTBUILD. r=dustin
Comment 39•9 years ago
bugherder
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED