Bug 1237024 (Open, NEW) - tracking bug for tests which act differently on single core vs multi core in a docker environment
Opened 10 years ago · Updated 3 years ago
Categories: Testing :: General, defect
Tracking: Not tracked
People: Reporter: jmaher; Unassigned
References: Depends on 4 open bugs
in our efforts to get tests running in taskcluster we are focusing on getting them running in a docker container. Last week while investigating runtime differences we saw a 20% runtime difference (bug 1235889), and it turned out that our ec2 vm was artificially limited to a single core. Removing this restriction put us roughly at parity with our existing buildbot runtimes.
Now we see far fewer timeouts, but instead we see more intermittents. Before the change we had pretty consistent test results; afterwards the results seemed a bit more random.
One piece of evidence here is that xpcshell tests were running side by side on buildbot and taskcluster for a couple of weeks. When we removed the single core restriction they became unstable and we had to turn them off (see bug 1236909).
In addition, I did a couple of try pushes:
multi core: https://treeherder.mozilla.org/#/jobs?repo=try&revision=52c996474c82
single core: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc7a5af7556e
these are identical code/patches except for using a different ec2 vm type. I took some time to look over the failures and annotate them:
https://docs.google.com/spreadsheets/d/1aH7OL1HC6UvdkkC6oo0XTs34ABXlOs87YGCh2b-bpGE/edit#gid=144705895
we have 10% more failures on multi core than single core.
So what is the reason for this? Is this a root cause of many of our intermittents on the tree? Can we use this testing to our advantage to find bad tests or features of the browser?
I will file bugs for a few of the specific test cases which showed up frequently as failing and acted differently in single vs multi core.
Comment 1•10 years ago
So for xpcshell tests there's a fairly straightforward explanation here: the test harness will run 4 * cpu_count xpcshell processes in parallel, so with more cores we run more tests in parallel:
https://dxr.mozilla.org/mozilla-central/rev/0771c5eab32f0cee4f7d12bc382298a81e0eabb2/testing/xpcshell/runxpcshelltests.py#1317
Presumably there are just bad test interactions that show up when you run more tests in parallel there.
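As a rough sketch of that sizing rule (illustrative only; the names here are not the harness's actual variables):

from multiprocessing import cpu_count

# the harness sizes its worker pool from the machine's core count, so a
# 1-vcpu container runs 4 tests at a time while a 2-vcpu instance runs 8,
# which surfaces more cross-test interactions
NUM_PARALLEL_TESTS = int(cpu_count() * 4)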
For other test suites yeah, I'd say non-determinism from multicore scheduling is a *huge* source of intermittents, whether due to bugs in Gecko or bugs in tests.
No longer depends on: 1237034
Comment 2•10 years ago
Oh, I linked to the wrong place in comment 1:
https://dxr.mozilla.org/mozilla-central/rev/0771c5eab32f0cee4f7d12bc382298a81e0eabb2/testing/xpcshell/runxpcshelltests.py#50
Comment 3•10 years ago
Do we have a worker type we can use to push to try and only use 1 core?
On another note, why is it more intermittent on TC multi core while not on Buildbot EC2 instances?
Can we please have the exact specs and settings for both images? Thanks.
Reporter
Comment 4•10 years ago
if you are pushing to taskcluster, I have a cset from try which shows how to get a single core:
https://hg.mozilla.org/try/rev/dc7a5af7556e
maybe :garndt could weigh in on the specifics.
Flags: needinfo?(garndt)
Comment 5•10 years ago
I also suggested that (for non-xpcshell tests) we could try launching Firefox under taskset to peg it to a single core:
http://manpages.ubuntu.com/manpages/wily/man1/taskset.1.html
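As a rough, untested sketch of that idea from the Python side (the firefox path, profile, and CPU mask below are placeholders):

import subprocess

firefox_cmd = ["/path/to/firefox", "-profile", "/tmp/test-profile"]
# taskset -c 0 restricts the process (and its children) to CPU 0
subprocess.run(["taskset", "-c", "0"] + firefox_cmd, check=True)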
Comment 6•10 years ago
I believe I saw it stated (I might be mistaken) that the buildbot instances are a single vcpu, so when comparing taskcluster and buildbot, tests running in taskcluster were not more intermittent than those running in buildbot when a task was restricted to 1 vcpu. However, the major problem we were trying to solve was the time it took to execute those tests within taskcluster when restricted to 1 vcpu.
When comparing 1 vs >1 cpu used for a task, the instance type did not change, nor did the AMI. The thing that changed was the option to use the docker flag "cpuset-cpus". This option is not used unless the terribly named docker-worker option "isolatedContainers" is specified for that worker, which it originally was for 'desktop-test'. Now 'desktop-test' does not specify that option, and 'desktop-test-1cpu' has it enabled for comparison.
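For context, the effect of that flag is roughly what the following sketch shows. This uses the docker Python SDK purely for illustration (docker-worker itself sets the option through its own code path), and the image/command names are placeholders:

import docker

client = docker.from_env()
# pinning the container to CPU 0 reproduces the old single-core behaviour;
# the image name and command below are placeholders, not the real desktop-test setup
container = client.containers.run(
    "placeholder-test-image",
    "run-tests.sh",
    cpuset_cpus="0",   # same effect as `docker run --cpuset-cpus=0`
    detach=True,
)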
The instance we're running is based on an Ubuntu 14.04.2 AMI, ami-5189a661 (found here [1]). Both worker types are using a m3.large instance (details here [2]). The scripts we use to modify and provision the AMIs are in the docker-worker repo [3].
Here is a brief summary of the changes made to install packages and enable things to run:
1. syslog logs are forwarded to papertrail
2. ubuntu packages installed: lxc-docker-1.6.1 btrfs-tools lvm2 curl build-essential linux-image-extra git-core pbuilder python-mock python-configobj python-support cdbs python-pip jq rsyslog-gnutls openvpn v4l2loopback-utils lxc python-statsd influxdb
3. node package installed: babel@4.7.16
4. kernel support enabled for video and audio loopback
5. docker-worker source extracted and run on boot (after docker is ready)
[1] http://cloud-images.ubuntu.com/releases/14.04/release-20150325/
[2] https://aws.amazon.com/ec2/instance-types/
[3] https://github.com/taskcluster/docker-worker/tree/master/deploy/packer
Flags: needinfo?(garndt)
Reporter
Comment 7•10 years ago
we should run some buildbot tests on m3.large and compare the runtime to m3.medium; that would give us the data to see what the overhead of docker is. Since we are using a single core on buildbot m3.medium machines, it raises the question of why a single core on an m3.large was timing out. adding in >1 core for testing is scope creep IMO- it might be worth taking, but it introduces more intermittents.
Comment 8•10 years ago
I plan on trying to profile some things on the servers themselves to see if I can determine what might be going crazy on them to cause this variance. On buildbot, with m3.medium, these jobs would complete in under 1 hour, correct?
Reporter
Comment 9•10 years ago
all jobs but M2 complete in <1 hour- that makes it easy to tell when something is timing out. I think we should stick with a 1 hour limit and see what fails. We could do 5400 seconds, and then everything should pass even with the old single core issues.
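if we bump the limit, the change would presumably just be the task's max run time; a hedged sketch of what that payload tweak might look like (field names assumed from docker-worker's payload format, the rest are placeholders):

task_payload = {
    "image": "placeholder-test-image",   # placeholder worker image
    "command": ["./run-tests.sh"],       # placeholder command
    "maxRunTime": 5400,                  # 90 minutes instead of the 3600-second (1 hour) limit
}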
Comment 10•10 years ago
More random Linux x64 debug Cpp unit test oranges on taskcluster:
TestCSPParser: https://treeherder.mozilla.org/logviewer.html#?job_id=6469124&repo=fx-team
TestCertDB: https://treeherder.mozilla.org/logviewer.html#?job_id=19383472&repo=mozilla-inbound
test_AsXXX_helpers: https://treeherder.mozilla.org/logviewer.html#?job_id=19385666&repo=mozilla-inbound
Comment 11•10 years ago
Maybe there are other related multi-core issues? We're soon going to move back to single-core and roll the change out via a commit in the future.
Reporter
Comment 12•10 years ago
I believe there are more; it will take time to fully investigate. Trying to work on a few of these now might make our tests more stable in general. I do wonder if windows/osx is running successfully with >1 core; all linux to date is single core.
Comment 13•10 years ago
(In reply to Joel Maher (:jmaher) from comment #12)
> I believe there are more, it will take time to fully investigate. Trying to
> work on a few of these now might make our tests more stable in general. I
> do wonder if windows/osx is running successfully in >1core; all linux to
> date is single core.
The Windows/Mac testers are all hardware machines, so they are almost certainly >1 core. (What's our intermittent rate look like for Windows/Mac vs. Linux?)
Reporter
Comment 14•10 years ago
keep in mind we have different tests enabled/disabled on each platform. According to orange factor:
osx: 1.15
windows: 3.69
linux: 5.39
looking at the specific bugs for linux, we have ASAN issues, a bunch of leaks, and a heavy weighting towards infrastructure issues.
So I think this is something maybe specific to docker or specific instance types on ec2.
Comment 15•10 years ago
Bug 945981 strongly suggests that this is specific to instance types. We seem to agree that it's due to core count, but it could also be related to CPU type or other more subtle differences between those instance types.
Reporter
Comment 16•10 years ago
the comparison here is the same worker type- but the reason we see different results than other automation could easily be the instance/worker type!
Updated•3 years ago
Severity: normal → S3