Bug 1237024 (Open) · Opened 10 years ago · Updated 3 years ago

tracking bug for tests which act differently on single core vs multi core in a docker environment

Categories

(Testing :: General, defect)

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

References

(Depends on 4 open bugs)

Details

In our efforts to get tests running in taskcluster we are focusing on getting them running in a docker container. Last week, while investigating runtime differences, we saw a 20% runtime difference (bug 1235889), and it turned out that our EC2 VM was artificially limited to a single core. Removing this restriction put us roughly at parity with our existing buildbot runtimes.

Now we see far fewer timeouts, but instead we see more intermittents. Before, test results were fairly consistent; afterwards they seemed a bit more random. One piece of evidence here: xpcshell tests were running side by side on buildbot and taskcluster for a couple of weeks, and when we removed the single-core restriction they became unstable and we had to turn them off (see bug 1236909).

In addition, I did a couple of try pushes:
multi core: https://treeherder.mozilla.org/#/jobs?repo=try&revision=52c996474c82
single core: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc7a5af7556e

These are identical code/patches except for using a different EC2 VM type. I took some time to look over the failures and annotate them:
https://docs.google.com/spreadsheets/d/1aH7OL1HC6UvdkkC6oo0XTs34ABXlOs87YGCh2b-bpGE/edit#gid=144705895

We have 10% more failures on multi core than on single core. So what is the reason for this? Is this a root cause of many of our intermittents on the tree? Can we use this testing to our advantage to find bad tests or features of the browser?

I will file bugs for a few of the specific test cases which showed up frequently as failing and acted differently in single vs multi core.
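(For anyone poking at this: a minimal diagnostic sketch, assuming Linux, of how to check what a process inside the container actually sees. The host CPU count and the set of CPUs the process is allowed to run on can disagree under a cpuset/affinity restriction; the script is illustrative only, not part of any harness.)

```python
# Minimal diagnostic sketch (illustrative only, not harness code; assumes Linux).
import os
import multiprocessing

# Logical CPUs on the host - this is what most harnesses query.
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())

# CPUs this process may actually run on - a docker cpuset or taskset
# restriction shows up here (Linux only).
print("usable CPUs:", len(os.sched_getaffinity(0)))
```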
Depends on: 1237028
Depends on: 1237030
Depends on: 1237034
So for xpcshell tests there's a fairly straightforward explanation here: the test harness will run 4 * cpu_count xpcshell processes in parallel, so with more cores we run more tests in parallel:
https://dxr.mozilla.org/mozilla-central/rev/0771c5eab32f0cee4f7d12bc382298a81e0eabb2/testing/xpcshell/runxpcshelltests.py#1317

Presumably there are just bad test interactions that show up when you run more tests in parallel there.

For other test suites, yeah, I'd say non-determinism from multicore scheduling is a *huge* source of intermittents, whether due to bugs in Gecko or bugs in tests.
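(Paraphrasing what that harness line amounts to - a sketch only; the real code is at the dxr link above:)

```python
# Sketch of the harness's parallelism choice, paraphrased from
# runxpcshelltests.py (see the dxr link above for the real code).
from multiprocessing import cpu_count

NUM_THREADS = int(cpu_count() * 4)

# Single-core worker:     1 * 4 = 4 xpcshell processes at once.
# m3.large (2 vCPUs):     2 * 4 = 8, so twice as many tests running
# side by side and many more chances for bad test interactions.
print(NUM_THREADS)
```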
No longer depends on: 1237034
Depends on: 1237039
Depends on: 1237046
Do we have a worker type we can use to push to try and only use 1 core? On another note, why is it more intermittent on TC multi core while not on buildbot EC2 instances? Can we please have the exact specs and settings for both images? Thanks.
If you are pushing to taskcluster, I have a cset from try which shows how to get a single core: https://hg.mozilla.org/try/rev/dc7a5af7556e

Maybe :garndt could weigh in on the specifics.
Flags: needinfo?(garndt)
I also suggested that (for non-xpcshell tests) we could try launching Firefox under taskset to peg it to a single core: http://manpages.ubuntu.com/manpages/wily/man1/taskset.1.html
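Roughly like this on the harness side (a sketch only; the firefox path and arguments below are placeholders, not real harness config):

```python
# Rough sketch of launching the browser pinned to one core via taskset.
# The binary path and arguments are placeholders for illustration.
import subprocess

firefox_cmd = ["/path/to/firefox", "-profile", "/tmp/test-profile", "-no-remote"]

# "taskset -c 0" restricts the launched process (and its children)
# to CPU 0 without touching anything else in the container.
subprocess.check_call(["taskset", "-c", "0"] + firefox_cmd)
```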
I believe I saw it stated (I might be mistaken) that the buildbot instances are a single vcpu, so when comparing taskcluster and buildbot, tests running in taskcluster were not more intermittent than those running in buildbot when a task was restricted to 1 vcpu. However, the major problem we were trying to solve was the time it took to execute those tests within taskcluster when restricted to 1 vcpu.

When comparing 1 vs >1 cpu used for a task, neither the instance type nor the AMI changed. The only thing that changed was the option to use the docker flag "cpuset-cpus" (illustrated in the sketch below). This option is not used unless the terribly named docker-worker option "isolatedContainers" is specified for that worker, which it originally was for 'desktop-test'. Now 'desktop-test' does not specify that option, and 'desktop-test-1cpu' has it enabled for comparison.

The instance we're running is based on an Ubuntu 14.04.2 AMI, ami-5189a661 (found here [1]). Both worker types are using an m3.large instance (details here [2]). The scripts we use to modify and provision the AMIs are in the docker-worker repo [3].

Here is a brief summary of the changes that occur as far as installing packages and enabling things to run:
1. syslog logs are forwarded to papertrail
2. ubuntu packages installed: lxc-docker-1.6.1 btrfs-tools lvm2 curl build-essential linux-image-extra git-core pbuilder python-mock python-configobj python-support cdbs python-pip jq rsyslog-gnutls openvpn v4l2loopback-utils lxc python-statsd influxdb
3. node package installed: babel@4.7.16
4. kernel support enabled for video and audio loopback
5. docker-worker source extracted and run on boot (after docker is ready)

[1] http://cloud-images.ubuntu.com/releases/14.04/release-20150325/
[2] https://aws.amazon.com/ec2/instance-types/
[3] https://github.com/taskcluster/docker-worker/tree/master/deploy/packer
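For reference, the docker CLI equivalent of what that option restricts a task to looks roughly like the sketch below (illustrative only - docker-worker drives the Docker API directly rather than shelling out, and the image and command here are just examples):

```python
# Illustrative only: the docker CLI equivalent of the cpuset restriction
# that the "isolatedContainers" option turns on. The image and command
# are examples, not the real test image or task command.
import subprocess

subprocess.check_call([
    "docker", "run", "--rm",
    "--cpuset-cpus", "0",   # confine the container to CPU 0
    "ubuntu:14.04",         # example image
    "nproc",                # should report 1 inside the restricted container
])
```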
Flags: needinfo?(garndt)
We should run some buildbot tests on m3.large and compare the runtime to m3.medium; that would give us the data to see what the overhead of docker is. Since we are running single core on buildbot m3.medium machines, it raises the question of why a single core on an m3.large was timing out. Adding in >1 core for testing is scope creep IMO - it might be worth taking, but it introduces more intermittents.
I plan on trying to profile some things on the servers themselves to see if I can determine what might be going crazy on them and causing this variance. On buildbot, with m3.medium, these jobs would complete in under 1 hour, correct?
All jobs but M2 complete in <1 hour - it is easy to compare if something is timing out. I think we should stick with a 1 hour limit and see what fails. We could do 5400 seconds and everything should pass, even with the old single-core issues.
Depends on: 1235944
Depends on: 1237112
Depends on: 1237316
Maybe there are other related multi-core issues? We're soon going to move back to single core and roll the change out through a commit in the future.
I believe there are more; it will take time to fully investigate. Trying to work on a few of these now might make our tests more stable in general. I do wonder if windows/osx is running successfully in >1 core; all linux to date is single core.
(In reply to Joel Maher (:jmaher) from comment #12)
> I believe there are more; it will take time to fully investigate. Trying to
> work on a few of these now might make our tests more stable in general. I
> do wonder if windows/osx is running successfully in >1 core; all linux to
> date is single core.

The Windows/Mac testers are all hardware machines, so they are almost certainly >1 core. (What's our intermittent rate look like for Windows/Mac vs. Linux?)
Keep in mind we have different tests enabled/disabled on each platform. According to Orange Factor:

osx: 1.15
windows: 3.69
linux: 5.39

Looking at the specific bugs for linux, we have ASAN, a bunch of leaks, and a heavy weighting towards infrastructure issues. So I think this is maybe something specific to docker or to specific instance types on EC2.
Bug 945981 strongly suggests that this is specific to instance types. We seem to agree that it's due to core count, but it could also be related to CPU type or other more subtle differences between those instance types.
The comparison here is within the same worker type - but why we see different results than other automation could easily come down to the instance/worker type!
Severity: normal → S3