Closed Bug 1281241 Opened 4 years ago Closed 2 years ago

run some, if not all, linux unittests on m3.medium instances; consider upgrading llvmpipe/mesa as well

Categories

(Testing :: General, defect)

49 Branch
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: acomminos)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

let the good times roll!

We have a problem where many tests are experiencing failures; it looks like this is an issue with the hardware or the software in the container running the tests.

To solve this we will try:
* running unittests on try with m3.medium instances (instead of m1.medium) so we have more than 1 core available to the docker container and, theoretically, to the tests
* upgrading mesa/llvmpipe in the docker image
first try push is up here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b8b07888730

ideally before we go live, I would like to see us run 50 instances of any job and see no more than 3 jobs reported as orange, ideally fewer. Let's see if this stuff runs as expected; we can retrigger jobs to see trends.
I'm trying to understand the context.

Did we disable a bunch of tests to make things work on m1.medium?
Where are you experiencing failures?
m3.medium is >1 core and it had a larger percentage of test failures (iirc 5% more failures) than m1.medium.  I cannot find the spreadsheet right now.  Jeff has a use case, possibly he can outline some more specific failures he is seeing.
Would we want to narrow this down to a number of suites? (the ones Jeff cares about)
(In reply to Joel Maher (:jmaher) from comment #3)
> m3.medium is >1 core and it had a larger percentage of test failures (iirc
> 5% more failures) than m1.medium.  I cannot find the spreadsheet right now. 
> Jeff has a use case, possibly he can outline some more specific failures he
> is seeing.

m3.medium is still 1 core. However, m3.medium is faster and cheaper than m1.medium. We wanted the extra performance and newer cpu instructions to reduce the likelihood of some timeouts during test runs when using llvmpipe.
These jobs all failed with:
 [taskcluster:error] Pulling docker image {"path":"public/image.tar","type":"task-image","taskId":"Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the registry, image name, or an authentication error. Try pulling the image locally to ensure image exists. ENOSPC, write
Flags: needinfo?(jmaher)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> These jobs all failed with:
>  [taskcluster:error] Pulling docker image
> {"path":"public/image.tar","type":"task-image","taskId":
> "Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the
> registry, image name, or an authentication error. Try pulling the image
> locally to ensure image exists. ENOSPC, write

Unfortunately this sometimes happens: docker dies when importing an image. Usually retriggering the task solves the issue.
odd, I am not able to figure out how to get this going.  Using the taskcluster/desktop-test image (the latest two tags from hub.docker.com), I see failures where we fail to download test-linux.sh, as it is at a hardcoded path that has since moved:
curl --fail -o ./test-linux.sh --retry 10 https://hg.mozilla.org/try//raw-file/a1c987825a8ffd17026792296f58583cc95011cc/testing/taskcluster/scripts/tester/test-linux.sh

^ testing/taskcluster -> taskcluster/ci/legacy

So it appears we need the magic generated image.tar instead.  We have dozens of failed jobs with the same error message; I don't think this is an intermittent failure.

:garndt, I could use your help to solve this.  You can see the changes I made to run this on try:
https://hg.mozilla.org/try/rev/8b8b0788873027aa5f1116fbd4e8403d7678445a

possibly there is something else on the host os or tc-worker that we need to setup properly before running jobs on there?
Flags: needinfo?(jmaher) → needinfo?(garndt)
Joel, the location of the script is coded here:
https://dxr.mozilla.org/mozilla-central/source/testing/docker/desktop-test/bin/test.sh#32

What we need is a new bug with a fix for the new location of the script, pushed to inbound. The desktop-test task will automatically be triggered.

Also, do you run your tests from in-tree or externally? If in-tree, you shouldn't have to create the docker image but could reference its last task id.
I had tried to reference a static image on hub.docker.com/taskcluster/desktop-test, but those images are out of date by months and do not include recent changes.  I don't know the process for updating those, and it is a ~6 hour cycle for me to update docker images if I were to test on my own.

the dynamically generated docker images (image.tar) use the correct location for test-linux.sh; possibly I just need some education on how to use the image.tar from the in-tree auto-generated images.
I will walk through this with Joel on IRC. Let's see if we can get this working before garndt comes online.
It's mystifying. Something is clearly not working, and when I look at the live log I would say it's a malformed download of the docker image. To test this I used the same task id and triggered a run of our firefox-ui-functional tests via my external script in mozmill-ci. Interestingly, this task doesn't seem to fail in extracting the docker archive:

https://tools.taskcluster.net/task-inspector/#NjzHovwMSwSnw66fzxClKA/0

Also I wonder why it takes so long until tasks for the medium worker get started. Even if no other test is running and the queue is empty, you have to wait up to 18 min before the task gets started. Maybe it's the special hardware specs, which are different from the desktop-test workers? The workers for our firefox-ui-tests take about 4-5 min to spin up.

https://tools.taskcluster.net/task-inspector/#G1VdyT1cSwqIzIH4ZBZNUA/0
ok, doing a fresh push seemed to resolve this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c42f80fbc92b38ea45f8d601494375b4943d460f

now to do another push with all the mochitests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3e72eee9c87
Flags: needinfo?(garndt)
ok, we hit a roadblock here: the failure to extract the image is because we run out of disk space. We have 4GB total on m3.medium and, get this, to get more disk we need to go with 2 vCPUs.

Jeff, do you have a preference on m3.large vs c3.large?  c3.large has half the RAM, but a slightly faster processor.  Possibly we could restrict docker to use only 1 cpu, although that would not be ideal.  Keep in mind that some of our tests (like web-platform-tests) already run on larger instances, I think c3.xlarge, so we could try that out without getting any custom AMI images set up.
Flags: needinfo?(jmuizelaar)
2 vCPU is even better for us. My preference would be c4.large, c3.large, m3.large.
Flags: needinfo?(jmuizelaar)
running on m3.large:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8b1440a46186b5e5cf6e0ce87247bc5cbe5ca57&filter-searchStr=tc

Jeff, if there are certain jobs/tests that you want to look at, please retrigger away.

This weekend, if all is well so far, I will ensure I have enough clients and do 50 retriggers on each job; that will give us enough data to ensure we are not introducing noisier jobs/tests.
it appears that retriggers are not working on taskcluster jobs right now, so I cannot determine whether these failed tests are permafail or not.  I assume this will get fixed today.

Jeff, please give me some direction here to know what issues to look for.
Flags: needinfo?(jmuizelaar)
Joel, I left a comment on the other bug about retriggers, but it seems that you can retrigger them from the staging site so that you're unblocked.
The job results don't look too bad. I'm going to have Andrew investigate to see if we can't make things more green (especially bc2).
Flags: needinfo?(jmuizelaar)
Most of these seem not too terrible. bc3 on e10s appears to be permafailing due to perf-related changes.

For the intermittents, many of the crashes appear to be happening within native cairo:

https://public-artifacts.taskcluster.net/emzV5D06SIe6iXYZpp4Ukw/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/YmFWEXy_SL6fomYoTJzRnA/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/Igf_QOQVTza4aNuKg92akg/0/public/logs/live_backing.log

Unfortunately, we don't have frame pointers, but the calls are almost certainly being made from within GTK. Going to see if I can reproduce on an old build of gtk+cairo.
I would say we have a list of tests to fix; here they are by job type:
linux64 opt:
* mochitest-3: dom/events/test/test_bug659071.html, dom/filesystem/tests/test_basic.html
* a11y: all kinds of badness
* bc4: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc7: oddness in browser/base/content/test/general/*
* c3: layout/xul/test/test_windowminmaxsize.xul 
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html 
* mochitest-e10s-5: dom/push/test/test_serviceworker_lifetime.html
* mochitest-e10s-8: gfx/layers/apz/test/mochitest/test_group_touchevents.html
* mochitest-e10s-10: toolkit/components/extensions/test/mochitest/test_ext_notifications.html
* bc-e10s-1: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc-e10s-3: toolkit/components/perfmonitoring/tests/browser/browser_compartments.js
* bc-e10s-5: browser/components/privatebrowsing/test/browser/*
* bc-e10s-6: browser/components/sessionstore/test/*
* bc-e10s-7: oddness in browser/base/content/test/general/*

linux64 debug:
* mochitest-10: toolkit/components/prompts/test/test_subresources_prompts.html
* c3: toolkit/content/tests/chrome/test_popup_anchoratrect.xul (this fails frequently on m1.medium)
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html
* mochitest-e10s-7: leak in /tests/dom/workers/test/serviceworkers

keep in mind that this could be partially related to the base revision I pushed with; the tree and tests change all the time. Overall I think most of these are valid and unique.  While they might exist as *known* failures, there is a good chance they are happening much more frequently.

luckily, for these mochitests there are fewer than 15 issues to sort out, though a few look hard.
just checking in here, are there any updates?
(In reply to Andrew Comminos [:acomminos] from comment #21)
> Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> due to perf-related changes.
> 
> For the intermittents, many of the crashes appear to be happening within
> native cairo;

We should be able to get debug symbols for these libraries. Someone just needs to grab a loaner test instance and follow these steps:
https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30

(If someone else does that work, you can give me the symbols.zip and I'll upload it.)
(In reply to Joel Maher (:jmaher: pto- back july 7th) from comment #23)
> just checking in here, are there any updates?

I believe we're going to go ahead with this- I'll be starting on fixing the failures in the near future, once my backlog is clear.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> (In reply to Andrew Comminos [:acomminos] from comment #21)
> > Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> > due to perf-related changes.
> > 
> > For the intermittents, many of the crashes appear to be happening within
> > native cairo;
> 
> We should be able to get debug symbols for these libraries. Someone just
> needs to grab a loaner test instance and follow these steps:
> https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30
> 
> (If someone else does that work, you can give me the symbols.zip and I'll
> upload it.)

I've asked for a loaner to symbolicate, thanks!
Depends on: 1285561
Bug 1285561 appears to have fixed up the a11y crashes; not sure if other tests were affected.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978
there was a nice list of tests in comment 22; do you think most of those are fixed?  Should we be pushing and collecting a large volume of data to see where we stand?
Not quite yet, I think; my recent push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978) suggests that many of the intermittents documented in comment 22 are still valid, save for a11y (and some other nonspecific oranges). I'm going to continue investigating these, particularly high-volume failures such as dom/html/test/test_fullscreen-api.html.
oh great, maybe in a short while this will be ready to go live :)
Depends on: 1240643, 1131576
Depends on: 1284742
Depends on: 1284038
Most of the high-volume intermittents (above 30%) should be fixed now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8d2a3e1e038&filter-searchStr=tc-m

There are still some more frequently occurring failures; these seem to be quite hard to track down.

Since our primary reason for using dual-core instances is to run llvmpipe faster for GL composition, jrmuizel and I were discussing potentially running a tier-2 "tc-gl-M-(e10s)" set of tests. An additional benefit of this would be ensuring that we still test the basic composition path on Linux. These tests would run on the m3.large instances with layers.acceleration.force-enabled set to true.

Since the failure rate is saner now, what do you think, Joel?
Flags: needinfo?(jmaher)
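A tier-2 GL variant along the lines proposed above might be sketched roughly like this in the in-tree test definitions; note the suite name, the "tier" key, and the --setpref option are illustrative assumptions, not confirmed tests.yml schema:

```yaml
# Hypothetical sketch of a tier-2 GL test variant.
# The suite name, "tier" key, and --setpref option are
# illustrative placeholders, not the actual schema.
mochitest-gl:
    description: "Mochitest GL suite with forced GL composition"
    instance-size: large
    tier: 2
    mozharness:
        extra-options:
            - "--setpref=layers.acceleration.force-enabled=true"
```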
we could switch just the 'gl' jobs to m3.medium for now, and eventually move other jobs.  The e10s tests look more problematic, and with debug it gets even messier.

in fact, everything but browser-chrome (bc*), mochitest-media (mda), and mochitest-plain could be ported over, given the results of the try push from comment 30.  Maybe doing that, seeing how it sorts out, and reassessing a week later to get a list of tests to clean up for the remaining jobs would be a good route to go.  Ideally we can get to 100% on the m3.mediums.

to fix this we would need to modify:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml

and for each test we care about add:
instance-size: large

for example, webgl:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#291
Flags: needinfo?(jmaher)
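The per-suite change described in the previous comment would look roughly like this; only `instance-size: large` is the actual setting being added, and the suite key shown is an illustrative placeholder for an entry in taskcluster/ci/desktop-test/tests.yml:

```yaml
# Sketch of the change described above; "mochitest-webgl" stands in
# for whichever suite entry is being edited in tests.yml.
mochitest-webgl:
    # run this suite on the larger (multi-core) instance type
    instance-size: large
```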
Assignee: nobody → andrew
Status: NEW → ASSIGNED
Depends on: 1262702
Comment on attachment 8782149 [details]
Bug 1281241 - Use large desktop-test instances by default on TaskCluster.

https://reviewboard.mozilla.org/r/72376/#review69984

to confirm, this will run all tests on the new instance type except for mochitest-plain and mochitest-browser-chrome?  This means marionette, cppunit, jittests, mochitest-other, mochitest-devtools, reftest, crashtest, web-platform-tests, etc. will all run on the new instance type.  I do like the usage of 'legacy'.

Lastly, do we have a current list of bugs or tests which are holding us back from running mochitest-plain and browser-chrome on desktop-test-large?
Attachment #8782149 - Flags: review?(jmaher) → review+
Yup, everything except for plain and browser-chrome mochitest suites.

I'm currently working on making this bug depend on the remaining intermittents.
Keywords: leave-open
Depends on: 1202200
Depends on: 1280290
Pushed by acomminos@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dd82326944a4
Use large desktop-test instances by default on TaskCluster. r=jmaher
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f5ed7f38160e
followup - Use legacy taskcluster instances for XPCShell and ASAN devtools. r=philor
I assume this change also affected our firefox-ui-tests, which use desktop-test with the Ubuntu 16.04 docker images? I ask because since this patch landed ALL of our intermittent test failures are gone! Not a single one has re-appeared! This is freaking cool!

Would we have to make a further change to allow our qa-3-linux-fx-tests workers to use the same?
Flags: needinfo?(jmaher)
I am not sure what the qa-3-linux-fx-tests workers are, but this change should be easy to test out on try; look at the changesets and try it out on the try server.

that is exciting that changing the worker type to a multi-core machine solved a lot of the intermittents!
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #39)
> I am not sure what the qa-3-linux-fx-tests workers are, but this change
> should be easy to test out on try- look at the changesets and try it out on
> the try server.

Those are workers we use for the firefox-ui-tests as triggered by mozmill-ci. Basically they use the desktop-test docker image too. From recent fxfn jobs I can see worker types of desktop-test-large:

https://queue.taskcluster.net/v1/task/Bvdxx6KZQ625cBEdIfDmXA

So that means some TC admins would have to update our workers? I don't see a way to do that via the task definition.

> that is exciting that changing the worker type to a multi core machine
> solved a lot of the intermittents!

Not only a lot, but all of them for us! Since Friday we haven't had a single intermittent failure for fx-ui-tests!
Flags: needinfo?(jmaher)
not clear what information is needed from me
Flags: needinfo?(jmaher)
Sorry, I actually wanted to ni? dustin. Dustin, can you please have a look at comment 40? If it warrants a new bug I can file one. Thanks.
There are two options, really: we can update the `qa-3-linux-fx-tests` workerType, or just use the desktop-test-large workerType.  The latter makes more sense, as there's not much reason to segregate these tasks into their own workerType.

In fact, that appears to be the case already; I don't see any indication of a different workerType here:

https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#59

I also don't see "linux-fx-tests" anywhere in dxr.  I think we set up that workerType earlier, when you were scheduling this out-of-tree, and now that it's in-tree you're already completely migrated.  So as far as the firefox-ui-tests are concerned, you've already won :)
It's hard to keep this conversation about fx-ui-tests going in two different bugs (see also bug 1296547). In short, we initially set up a different worker type at your request, so we could also get our own queue. If we start running tests for Nightly builds for all locales, I don't want to blow up the desktop-test-large queue that heavily. Shall we continue here or on bug 1296547?
Here is good.

If we're not talking about 10,000 tasks, then there's no real worry about blowing up that queue; it regularly has over 2000 instances running.  I think it's better to use one workerType than to run the same tasks on different workerTypes between nightly and on-commit.
Dustin, I created the following Github issue to start this conversion:
https://github.com/mozilla/mozmill-ci/issues/812. Please have a look. Thanks.
Just to note here, the desktop-test-large worker type has drastically improved the download of the docker image compared to the old desktop-test worker. Maybe it's not the case for all machines, but in the following case the download took only about 3 min compared to ~20 min before!

https://tools.taskcluster.net/task-inspector/#WW_mn3EqTZKyRGfZHdRV1g/0
this will leave two remaining suites (out of the original 5):
* browser-chrome
* asan devtools

here is a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=79642bc50a9a8628564638c853c71147c39d53df

browser-chrome could move if we address bug 1395539 and bug 1384879.  Possibly there are other hurdles for browser-chrome tests with more retriggers, etc.

I haven't tested devtools on asan yet.
Attachment #8918208 - Flags: review?(gbrown)
Depends on: 1408384
Depends on: 1408387
Depends on: 1408389
Attachment #8918208 - Flags: review?(gbrown) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1df54aa8b21
move mochitest-plain, screenshots, and xpcshell off m1.medium. r=gbrown
Depends on: 1408506
Depends on: 1411334
Blocks: 1411344
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Blocks: 1429595
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open