Closed Bug 1235889 Opened 8 years ago Closed 7 years ago

migrate linux desktop-test jobs from m1.medium to m3.large

Categories

(Release Engineering :: General, defect)

Type: defect, Priority: Not set, Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

A push to try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d8d18bab88b&group_state=expanded&selectedJob=14958916

Looking at the runtimes, many jobs take >3600s.  For example, Linux 64 Debug M(1) takes 74 minutes as a taskcluster job and ~54 minutes on the buildbot jobs (look on inbound for reference data).

Looking over the runtimes of individual tests, we see a lot of differences:
https://docs.google.com/spreadsheets/d/1I9v3Y9StLVauj1LvOlIQ5YCJdww5MdPlxaCO1dpaUOI/edit#gid=0

But reproducing them locally, comparing bare metal to docker, we don't see a lot of differences, just a few.

dustin did an experiment where he gave me an ec2 loaner (m3.large?) and I ran the docker commands as defined in the task inspector:
https://tools.taskcluster.net/task-inspector/#V-hdEQZPS_KDDmne1GspHg/

The only modifications I had to make were adding v4l2loopback on the system (including the modprobe) and passing --device=/dev/video0 on the docker run command line.
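
For reference, a rough sketch of what that manual run looked like on the loaner; the package name is the stock Ubuntu one, while the image name and task command are placeholders rather than values copied from the task definition:

# Install and load the v4l2loopback module on the host so the container
# has a video device to pass through
sudo apt-get install -y v4l2loopback-dkms
sudo modprobe v4l2loopback

# Run the task roughly as docker-worker would, but expose the loopback
# video device; the image name and command here are placeholders
docker run --rm -ti \
  --device=/dev/video0 \
  taskcluster/desktop-test \
  bash -c '<task command from the task inspector>'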

I did 2 runs of M(1) and got:
1954 seconds
1970 seconds

whereas the default run on taskcluster yields:
3976 seconds

that is half the runtime!
It's less a loaner than an instance I launched by hand and installed docker on (Ubuntu Trusty Server)

ubuntu@ip-10-36-24-170:~$ curl 169.254.169.254/2014-11-05/meta-data/instance-type; echo
m3.large

That's the same instance type configured for use with the desktop-test workerType.

The docker image in use here is, I think, one using gnome-shell instead of gnome-session.

Note that the run on taskcluster is basically the 3600s maxRunTime plus the time to download the image.

So we've held the container contents fixed, and the instance type fixed.  The host image differs, but not appreciably - docker-worker is based on http://thecloudmarket.com/image/ami-5189a661--ubuntu-images-hvm-ssd-ubuntu-trusty-14-04-amd64-server-20150325, so also Trusty server.

What could we be missing here?  Alternatively, what can we do to narrow down the problem?
Both my comparison points should be gnome-shell.  Could the taskcluster worker or other software add overhead, maybe chewing up RAM or something?  I am open to more experiments; the ec2 instance that we spun up runs the tests as expected, similar to what I see on my local desktop with the same image.
It's a possibility, but this instance type has 7.5G of RAM and the tests run in Buildbot with just 3.75G, so RAM contention seems unlikely.  The docker-worker does do a fair bit of housekeeping-type work, as well as log handling, but no more than Buildbot.
There was a fix to the worker we were using: garndt realized we were limiting our VM to a single core, despite it being the only thing running on the box.  With that restriction removed, we now have runtimes similar to buildbot's.  I know we have different ec2 instance types; there could be future work to consider there.

I will let :dustin close this out as he sees fit.  I am not sure where we keep the code that tracks the flag limiting cpus to a single core; it would be nice if we could link to a vcs repo with the change made :)
This change was done in the worker type definition in the provisioner UI.  Unfortunately there is no version control for those changes and no code to point at :(

The option that was removed was the "isolatedcontainers" option, which, when set to true, restricts the container to 1 cpu via docker's '--cpuset-cpus' flag.
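
Roughly, that option's effect is equivalent to pinning the container to a single core at docker run time (a sketch; the image and commands here are illustrative, not taken from the worker code):

# Pinned to CPU 0 only: roughly what "isolatedcontainers": true amounted to
docker run --rm --cpuset-cpus=0 ubuntu:14.04 nproc   # prints 1

# Without the flag the container sees every vCPU on the instance
docker run --rm ubuntu:14.04 nproc                   # prints 2 on m3.large
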
To test my theory about intermittents, :garndt created a dedicated worker type that has 1 cpu, and I have a try push comparing the desktop-test worker vs desktop-test-1cpu:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=dc7a5af7556e&tochange=52c996474c82

Ideally I can do more retriggers or follow-up pushes to determine how stable these things are.  I expect different sets of failures, as we will get some timeouts in the 1-cpu case.
The `m1.medium` that releng is using for testers is 1 vCPU / 3.75G RAM.  I suspect that whatever LXC feature Docker is using to constrain to one CPU is actually *worse* than just having one Amazon vCPU?  `m3.large` is 2 vCPU and 7.5G RAM.

Incidentally, the reason docker-worker didn't run on m1.medium (or m3.medium) is that there's only 4G of instance storage on those types, and that's not enough for the desktop-test image plus the necessary artifacts, test.zip, etc.

I don't think there's much to do here besides remembering that we need to not use isolatedcontainers.  I'm not sure a warning would do any good, if it's even possible to detect cpuset-cpus from inside the docker container.
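
(If we did want such a check, a minimal sketch, assuming the cgroup-v1 layout docker used at the time, would be for the harness to look at the CPUs it is actually allowed to run on:

# Inside the container: how many CPUs are we allowed to run on?
nproc

# Or read the cpuset cgroup directly (cgroup-v1 path)
cat /sys/fs/cgroup/cpuset/cpuset.cpus   # e.g. "0" when restricted, "0-1" on an unrestricted m3.large
)
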
Should we use m1.medium for the taskcluster desktop-test images?  It would be good to see if we can get comparable runtimes with that.
> Incidentally, the reason docker-worker didn't run on m1.medium (or m3.medium) is that there's only 4G of instance storage on
> those types, and that's not enough for the desktop-test image plus the necessary artifacts, test.zip, etc.

so, no :)
Summary: time to run taskcluster jobs take 20% longer than buildbot peers → time to run taskcluster jobs take 20% longer than buildbot peers - fix: remove "isolatedcontainers" from worker type definition
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> The `m1.medium` that releng is using for testers is 1 vCPU / 3.75G RAM.  I
> suspect that whatever LXC feature Docker is using to constrain to one CPU is
> actually *worse* than just having one Amazon vCPU?  `m3.large` is 2 vCPU and
> 7.5G RAM.

This is especially weird because m3.large has 6.5 compute units to m1.medium's 2, so even a single vCPU on m3.large should be faster.
I think that bug 1237663 has solved this.  I hope!

However, I want to leave this open so that we can loop back *after* all of the tests are migrated to TC and do some controlled experiments running suites side-by-side on m1's and m3's and comparing the results.  Given enough focus, hopefully we can figure out and address whatever is making them run slowly / timeout on the m3's.  We have to do this eventually, since m1's are already legacy and are likely to be removed from EC2 at some point.
This is not needed to make Linux64 debug test jobs tier 2 (bug 1171033).

Instead, when we're running all tier2 jobs we can work on this.

Reversing deps.
No longer blocks: tc-linux64-debug
Depends on: tc-linux64-debug
Summary: time to run taskcluster jobs take 20% longer than buildbot peers - fix: remove "isolatedcontainers" from worker type definition → migrate desktop-test jobs from m1.medium to m3.large
Component: General → General Automation
Product: Taskcluster → Release Engineering
QA Contact: catlee
Summary: migrate desktop-test jobs from m1.medium to m3.large → migrate linux desktop-test jobs from m1.medium to m3.large
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General