Closed Bug 1235889 Opened 8 years ago Closed 7 years ago

migrate linux desktop-test jobs from m1.medium to m3.large

Categories

(Release Engineering :: General, defect)

Type: defect, Priority: Not set, Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

A push to try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d8d18bab88b&group_state=expanded&selectedJob=14958916

Looking at the runtimes, many jobs take >3600s.  For example, Linux 64 Debug M(1) takes 74 minutes as a taskcluster job and ~54 minutes on the buildbot jobs (look on inbound for reference data).

Looking over the runtimes of individual tests, we see a lot of differences:
https://docs.google.com/spreadsheets/d/1I9v3Y9StLVauj1LvOlIQ5YCJdww5MdPlxaCO1dpaUOI/edit#gid=0

But reproducing them locally, comparing bare metal to docker, we don't see a lot of differences, just a few.

dustin did an experiment where he gave me an ec2 loaner (m3.large?) and I ran the docker commands as defined in the task inspector:
https://tools.taskcluster.net/task-inspector/#V-hdEQZPS_KDDmne1GspHg/

The only modifications I had to make were adding v4l2loopback on the system (including the modprobe) and passing --device=/dev/video0 on the docker run command line.
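
For reference, a rough sketch of what that manual run looked like on the loaner; the package name is the stock Ubuntu one, while the image name and task command are placeholders rather than values copied from the task definition:

# Install and load the v4l2loopback module on the host so the container
# has a video device to pass through
sudo apt-get install -y v4l2loopback-dkms
sudo modprobe v4l2loopback

# Run the task roughly as docker-worker would, but expose the loopback
# video device; the image name and command here are placeholders
docker run --rm -ti \
  --device=/dev/video0 \
  taskcluster/desktop-test \
  bash -c '<task command from the task inspector>'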

I did 2 runs of M(1) and got:
1954 seconds
1970 seconds

whereas the default run on taskcluster yields:
3976 seconds

that is half the runtime!
It's less a loaner than an instance I launched by hand and installed docker on (Ubuntu Trusty Server)

ubuntu@ip-10-36-24-170:~$ curl 169.254.169.254/2014-11-05/meta-data/instance-type; echo
m3.large

That's the same instance type configured for use with the desktop-test workerType.

The docker image in use here is, I think, one using gnome-shell instead of gnome-session.

Note that the run on taskcluster is basically the 3600s maxRunTime plus the time to download the image.

So we've held the container contents fixed, and the instance type fixed.  The host image differs, but not appreciably - docker-worker is based on http://thecloudmarket.com/image/ami-5189a661--ubuntu-images-hvm-ssd-ubuntu-trusty-14-04-amd64-server-20150325, so also Trusty server.

What could we be missing here?  Alternatively, what can we do to narrow down the problem?
Both my comparison points should be gnome-shell.  Could the taskcluster worker or other software add overhead, maybe chewing up RAM or something?  I am open to more experiments; the ec2 instance that we spun up runs the tests as expected, similar to what I see on my local desktop with the same image.
It's a possibility, but this instance type has 7.5G of RAM and the tests run in Buildbot with just 3.75G, so RAM contention seems unlikely.  The docker-worker does do a fair bit of housekeeping-type work, as well as log handling, but no more than Buildbot.
There was a fix to the worker we were using: garndt realized we were limiting our VM to a single core, despite it being the only thing running on the box.  With that restriction removed, we now have runtimes similar to buildbot's.  I know we have different ec2 instance types; there could be future work to consider there.

I will let :dustin close this out as he sees fit.  I am not sure where we keep the code that tracks the flag limiting cpus to a single core; it would be nice if we could link to a vcs repo with the change made :)
This change was done in the worker type definition in the provisioner UI.  Unfortunately there is no version control for those changes and no code to point at :(

The option that was removed was the "isolatedcontainers" option, which, when set to true, restricts the container to 1 cpu via docker's '--cpuset-cpus' flag.
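
Roughly, that option's effect is equivalent to pinning the container to a single core at docker run time (a sketch; the image and commands here are illustrative, not taken from the worker code):

# Pinned to CPU 0 only: roughly what "isolatedcontainers": true amounted to
docker run --rm --cpuset-cpus=0 ubuntu:14.04 nproc   # prints 1

# Without the flag the container sees every vCPU on the instance
docker run --rm ubuntu:14.04 nproc                   # prints 2 on m3.large
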
To test my theory about intermittents, :garndt created a dedicated worker type that has 1 cpu, and I have a try push comparing the desktop-test worker vs desktop-test-1cpu:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=dc7a5af7556e&tochange=52c996474c82

Ideally I can do more retriggers or follow-up pushes to determine how stable these things are.  I expect different sets of failures, as we will get some timeouts in the 1-cpu case.
The `m1.medium` that releng is using for testers is 1 vCPU / 3.75G RAM.  I suspect that whatever LXC feature Docker is using to constrain to one CPU is actually *worse* than just having one Amazon vCPU?  `m3.large` is 2 vCPU and 7.5G RAM.

Incidentally, the reason docker-worker didn't run on m1.medium (or m3.medium) is that there's only 4G of instance storage on those types, and that's not enough for the desktop-test image plus the necessary artifacts, test.zip, etc.

I don't think there's much to do here besides remembering that we need to not use isolatedcontainers.  I'm not sure a warning would do any good, if it's even possible to detect cpuset-cpus from inside the docker container.
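
(If we did want such a check, a minimal sketch, assuming the cgroup-v1 layout docker used at the time, would be for the harness to look at the CPUs it is actually allowed to run on:

# Inside the container: how many CPUs are we allowed to run on?
nproc

# Or read the cpuset cgroup directly (cgroup-v1 path)
cat /sys/fs/cgroup/cpuset/cpuset.cpus   # e.g. "0" when restricted, "0-1" on an unrestricted m3.large
)
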
Should we use m1.medium for the taskcluster desktop-test images?  It would be good to see if we can get comparable runtimes with that.
> Incidentally, the reason docker-worker didn't run on m1.medium (or m3.medium) is that there's only 4G of instance storage on
> those types, and that's not enough for the desktop-test image plus the necessary artifacts, test.zip, etc.

so, no :)
Summary: time to run taskcluster jobs take 20% longer than buildbot peers → time to run taskcluster jobs take 20% longer than buildbot peers - fix: remove "isolatedcontainers" from worker type definition
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> The `m1.medium` that releng is using for testers is 1 vCPU / 3.75G RAM.  I
> suspect that whatever LXC feature Docker is using to constrain to one CPU is
> actually *worse* than just having one Amazon vCPU?  `m3.large` is 2 vCPU and
> 7.5G RAM.

This is especially weird because m3.large has 6.5 compute units to m1.medium's 2, so even a single vCPU on m3.large should be faster.
I think that bug 1237663 has solved this.  I hope!

However, I want to leave this open so that we can loop back *after* all of the tests are migrated to TC and do some controlled experiments running suites side-by-side on m1's and m3's and comparing the results.  Given enough focus, hopefully we can figure out and address whatever is making them run slowly / timeout on the m3's.  We have to do this eventually, since m1's are already legacy and are likely to be removed from EC2 at some point.
This is not needed to make Linux64 debug test jobs tier 2 (bug 1171033).

Instead, when we're running all tier2 jobs we can work on this.

Reversing deps.
No longer blocks: tc-linux64-debug
Depends on: tc-linux64-debug
Summary: time to run taskcluster jobs take 20% longer than buildbot peers - fix: remove "isolatedcontainers" from worker type definition → migrate desktop-test jobs from m1.medium to m3.large
Component: General → General Automation
Product: Taskcluster → Release Engineering
QA Contact: catlee
Summary: migrate desktop-test jobs from m1.medium to m3.large → migrate linux desktop-test jobs from m1.medium to m3.large
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General