Closed
Bug 1281241
Opened 8 years ago
Closed 7 years ago
run some if not all linux unittests on m3.medium instances; consider upgrading llvmpipe/mesa as well
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: acomminos)
References
(Depends on 1 open bug)
Details
Attachments
(2 files)
Let the good times roll!
We have a problem where many tests are experiencing failures; it looks like this is an issue with the hardware or the software in the container running the tests.
To solve this we will try:
* running unittests on try with m3.medium instances (instead of m1.medium) so that more than one core is available to the docker container and, in theory, to the tests
* upgrading mesa/llvmpipe in the docker image
Reporter
Comment 1•8 years ago
first try push is up here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b8b07888730
Ideally, before we go live, I would like to see us run 50 instances of any job and see no more than 3 jobs reported as orange, ideally fewer. Let's see if this stuff runs as expected; we can retrigger jobs to see trends.
Comment 2•8 years ago
I'm trying to understand the context.
Did we disable a bunch of tests to make things work on m1.medium?
Where are you experiencing failures?
Reporter
Comment 3•8 years ago
m3.medium is >1 core and it had a larger percentage of test failures (iirc 5% more failures) than m1.medium. I cannot find the spreadsheet right now. Jeff has a use case; possibly he can outline some of the more specific failures he is seeing.
Comment 4•8 years ago
Would we want to narrow this down to a number of suites? (the ones Jeff cares about)
Comment 5•8 years ago
(In reply to Joel Maher (:jmaher) from comment #3)
> m3.medium is >1 core and it had a larger percentage of test failures (iirc
> 5% more failures) than m1.medium. I cannot find the spreadsheet right now.
> Jeff has a use case, possibly he can outline some more specific failures he
> is seeing.
m3.medium is still 1 core. However, m3.medium is faster and cheaper than m1.medium. We wanted the extra performance and newer cpu instructions to reduce the likelihood of some timeouts during test runs when using llvmpipe.
Comment 6•8 years ago
These jobs all failed with:
[taskcluster:error] Pulling docker image {"path":"public/image.tar","type":"task-image","taskId":"Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the registry, image name, or an authentication error. Try pulling the image locally to ensure image exists. ENOSPC, write
Flags: needinfo?(jmaher)
Comment 7•8 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> These jobs all failed with:
> [taskcluster:error] Pulling docker image
> {"path":"public/image.tar","type":"task-image","taskId":
> "Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the
> registry, image name, or an authentication error. Try pulling the image
> locally to ensure image exists. ENOSPC, write
Unfortunately this sometimes happens: docker dies when importing an image. Usually retriggering the task solves the issue.
Reporter
Comment 8•8 years ago
Odd, I am not able to figure out how to get this going. Using the taskcluster/desktop-test image (the latest two tags from hub.docker.com), I see failures where we fail to download test-linux.sh, as it is fetched from a hardcoded path that has since moved:
curl --fail -o ./test-linux.sh --retry 10 https://hg.mozilla.org/try//raw-file/a1c987825a8ffd17026792296f58583cc95011cc/testing/taskcluster/scripts/tester/test-linux.sh
^ testing/taskcluster -> taskcluster/ci/legacy
So it appears we need the magic generated image.tar instead. We have dozens of failed jobs with the same error message, so I don't think this is an intermittent failure.
:garndt, I could use your help to solve this. You can see the changes I made to run this on try:
https://hg.mozilla.org/try/rev/8b8b0788873027aa5f1116fbd4e8403d7678445a
Possibly there is something else on the host OS or tc-worker that we need to set up properly before running jobs there?
Flags: needinfo?(jmaher) → needinfo?(garndt)
Comment 9•8 years ago
Joel, the location of the script is coded here:
https://dxr.mozilla.org/mozilla-central/source/testing/docker/desktop-test/bin/test.sh#32
What we need is a new bug with a fix for the new location of the script, pushed to inbound; the desktop-test task will then be triggered automatically.
Also, do you run your tests from in-tree or externally? If in-tree, you shouldn't have to create the docker image yourself but could reference its last task id.
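As a rough sketch, an in-tree task references such an image in its docker-worker payload using the same structure that shows up in the error from comment 6 (the surrounding payload key is my assumption here; the image block itself is taken from that error message):

  payload:
    image:
      type: task-image                 # image built by an in-tree docker-image task
      path: public/image.tar           # artifact that contains the image
      taskId: Yw8NRCthSbK5HC_6tJTIMA   # task id of the image build (example value from comment 6)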
Reporter
Comment 10•8 years ago
I had tried to reference a static image on hub.docker.com/taskcluster/desktop-test, but those images are out of date by months and do not include recent changes. I don't know the process for updating those, and it is a ~6 hour cycle for me to update docker images if I were to test on my own.
The dynamically generated docker image (image.tar) uses the correct location for test-linux.sh; possibly I just need some education on how to use the image.tar from the in-tree auto-generated images.
Comment 11•8 years ago
I will walk through this with Joel on IRC. Let's see if we can get this working before garndt comes online.
Comment 12•8 years ago
It's mysterious. Something is clearly not working, and when I look at the live log I would say it's a malformed download of the docker image. To test this I used the same task id and triggered a run of our firefox-ui-functional tests via my external script in mozmill-ci. Interestingly, this task doesn't seem to fail while extracting the docker archive:
https://tools.taskcluster.net/task-inspector/#NjzHovwMSwSnw66fzxClKA/0
I also wonder why it takes so long for tasks on the medium worker to start. Even if no other test is running and the queue is empty, you have to wait up to 18 minutes before the task gets started. Maybe it's the special hardware specs, which differ from those of the desktop-test workers? The workers for our firefox-ui-tests take about 4-5 minutes to spin up.
https://tools.taskcluster.net/task-inspector/#G1VdyT1cSwqIzIH4ZBZNUA/0
Reporter
Comment 13•8 years ago
OK, doing a fresh push seemed to resolve this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c42f80fbc92b38ea45f8d601494375b4943d460f
Now to do another push with all the mochitests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3e72eee9c87
Flags: needinfo?(garndt)
Reporter
Comment 14•8 years ago
OK, we hit a roadblock here: the failure to extract the image is because we run out of disk space; we have 4GB total on m3.medium. And get this: to get more disk we need to go with 2 vCPUs.
Jeff, do you have a preference on m3.large vs c3.large? c3.large has half the RAM, but a slightly faster processor. Possibly we could restrict docker to use only 1 CPU, although that would not be ideal. Keep in mind that some of our tests (like web-platform-tests) already run on larger instances, I think c3.xlarge, so we could try that out without getting any custom AMI images set up.
Flags: needinfo?(jmuizelaar)
Comment 15•8 years ago
2 vCPU is even better for us. My preference would be c4.large, c3.large, m3.large.
Flags: needinfo?(jmuizelaar)
Reporter
Comment 16•8 years ago
running on m3.large:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8b1440a46186b5e5cf6e0ce87247bc5cbe5ca57&filter-searchStr=tc
Jeff, if there are certain jobs/tests that you want to look at, please retrigger away.
This weekend, if all is well so far, I will ensure I have enough clients and do 50 retriggers on each job; that will give us enough data to ensure we are not introducing noisier jobs/tests.
Reporter
Comment 17•8 years ago
It appears that retriggers are not working on taskcluster jobs right now, so I cannot determine whether these failed tests are permafail or not. I assume this will get fixed today.
Jeff, please give me some direction here to know what issues to look for.
Flags: needinfo?(jmuizelaar)
Comment 18•8 years ago
Joel, I left a comment on the other bug about retriggers, but it seems that you can retrigger them from the staging site so that you're unblocked.
Reporter
Comment 19•8 years ago
Comment 20•8 years ago
The job results don't look too bad. I'm going to have Andrew investigate to see if we can't make things greener (especially bc2).
Flags: needinfo?(jmuizelaar)
Assignee
Comment 21•8 years ago
Most of these seem not too terrible. bc3 on e10s appears to be permafailing due to perf-related changes.
For the intermittents, many of the crashes appear to be happening within native cairo;
https://public-artifacts.taskcluster.net/emzV5D06SIe6iXYZpp4Ukw/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/YmFWEXy_SL6fomYoTJzRnA/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/Igf_QOQVTza4aNuKg92akg/0/public/logs/live_backing.log
Unfortunately, we don't have frame pointers, but the calls are almost certainly being made from within GTK. Going to see if I can reproduce on an old build of gtk+cairo.
Reporter
Comment 22•8 years ago
I would say we have a list of tests to fix; here they are by job type:
linux64 opt:
* mochitest-3: dom/events/test/test_bug659071.html, dom/filesystem/tests/test_basic.html
* a11y: all kinds of badness
* bc4: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc7: oddness in browser/base/content/test/general/*
* c3: layout/xul/test/test_windowminmaxsize.xul
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html
* mochitest-e10s-5: dom/push/test/test_serviceworker_lifetime.html
* mochitest-e10s-8: gfx/layers/apz/test/mochitest/test_group_touchevents.html
* mochitest-e10s-10: toolkit/components/extensions/test/mochitest/test_ext_notifications.html
* bc-e10s-1: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc-e10s-3: toolkit/components/perfmonitoring/tests/browser/browser_compartments.js
* bc-e10s-5: browser/components/privatebrowsing/test/browser/*
* bc-e10s-6: browser/components/sessionstore/test/*
* bc-e10s-7: oddness in browser/base/content/test/general/*
linux64 debug:
* mochitest-10: toolkit/components/prompts/test/test_subresources_prompts.html
* c3: toolkit/content/tests/chrome/test_popup_anchoratrect.xul (this fails frequently on m1.medium)
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html
* mochitest-e10s-7: leak in /tests/dom/workers/test/serviceworkers
Keep in mind that this could be partially related to the base revision I pushed with (the tree and tests change all the time), but overall I think most of these are valid and unique. While they might exist as *known* failures, there is a good chance they are happening much more frequently here.
Luckily, for these mochitests there are <15 issues to sort out, though a few look hard.
Reporter
Comment 23•8 years ago
Just checking in here: are there any updates?
Comment 24•8 years ago
(In reply to Andrew Comminos [:acomminos] from comment #21)
> Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> due to perf-related changes.
>
> For the intermittents, many of the crashes appear to be happening within
> native cairo;
We should be able to get debug symbols for these libraries. Someone just needs to grab a loaner test instance and follow these steps:
https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30
(If someone else does that work, you can give me the symbols.zip and I'll upload it.)
Assignee
Comment 25•8 years ago
(In reply to Joel Maher (:jmaher: pto- back july 7th) from comment #23)
> just checking in here, are there any updates?
I believe we're going to go ahead with this; I'll start fixing the failures in the near future, once my backlog is clear.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> (In reply to Andrew Comminos [:acomminos] from comment #21)
> > Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> > due to perf-related changes.
> >
> > For the intermittents, many of the crashes appear to be happening within
> > native cairo;
>
> We should be able to get debug symbols for these libraries. Someone just
> needs to grab a loaner test instance and follow these steps:
> https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30
>
> (If someone else does that work, you can give me the symbols.zip and I'll
> upload it.)
I've asked for a loaner to symbolicate, thanks!
Assignee
Comment 26•8 years ago
Bug 1285561 appears to have fixed up the a11y crashes; not sure if other tests were affected.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978
Reporter
Comment 27•8 years ago
There was a nice list of tests in comment 22; do you think most of those are fixed? Should we be pushing and collecting a large volume of data to see where we stand?
Assignee
Comment 28•8 years ago
Not quite yet, I don't think; my recent push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978) suggests that many of the intermittents documented in comment 22 are still valid, save for a11y (and some other nonspecific oranges). I'm going to continue investigating these, particularly high-volume failures such as dom/html/test/test_fullscreen-api.html.
Reporter
Comment 29•8 years ago
Oh great, maybe in a short while this will be ready to go live :)
Assignee
Updated•8 years ago
Assignee
Comment 30•8 years ago
Most of the high volume intermittents (above 30%) should be fixed now;
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8d2a3e1e038&filter-searchStr=tc-m
There are still some more frequently occurring failures; these seem to be quite hard to track down.
Since our primary purpose in using dual-core instances is to run llvmpipe faster for GL composition, jrmuizel and I were discussing potentially running a tier-2 "tc-gl-M-(e10s)" set of tests. An additional benefit of this would be ensuring that we still test the basic composition path on Linux. These tests would run on the m3.large instances with layers.acceleration.force-enabled set to true.
Since the failure rate is saner now, what do you think, Joel?
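To make the idea concrete, here is a purely hypothetical sketch of what such a tier-2 variant could look like in tests.yml; the entry name and keys below (other than instance-size) are invented for illustration and are not the real schema:

  mochitest-gl-forced:          # hypothetical name for the tier-2 GL variant
      tier: 2                   # proposed to run as tier 2
      instance-size: large      # run on the dual-core (m3.large-backed) workers
      # the test profile would additionally need layers.acceleration.force-enabled=true
      # so that the GL compositor path is exercised on llvmpipe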
Assignee
Updated•8 years ago
Flags: needinfo?(jmaher)
Reporter
Comment 31•8 years ago
We could switch just the 'gl' jobs to m3.medium for now, and eventually move other jobs. The e10s tests look more problematic, and with debug added it gets even messier.
In fact, everything but browser-chrome (bc*), mochitest-media (mda), and mochitest-plain could be ported over, given the results of the try push from comment 30. Maybe doing that, seeing how it sorts out, and reassessing a week later to get a list of tests to clean up for the remaining jobs would be a good route to go. Ideally we can get to 100% on the m3.mediums.
To make this change we would need to modify:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml
and for each test we care about add:
instance-size: large
For example, webgl:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#291
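The entry would end up looking roughly like this; the keys other than instance-size are my guess at the surrounding schema and are shown only for illustration:

  mochitest-webgl:
      description: "Mochitest webgl run"   # illustrative only
      suite: mochitest/mochitest-gl        # illustrative only
      instance-size: large                 # schedule these tests on the larger worker type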
Flags: needinfo?(jmaher)
Comment hidden (mozreview-request)
Assignee
Updated•8 years ago
Assignee: nobody → andrew
Status: NEW → ASSIGNED
Reporter
Comment 33•8 years ago
mozreview-review
Comment on attachment 8782149 [details]
Bug 1281241 - Use large desktop-test instances by default on TaskCluster.
https://reviewboard.mozilla.org/r/72376/#review69984
To confirm, this will run all tests on the new instance type except for mochitest-plain and mochitest-browser-chrome? This means marionette, cppunit, jittests, mochitest-other, mochitest-devtools, reftest, crashtest, web-platform-tests, etc. will all run on the new instance type. I do like the usage of 'legacy'.
Lastly, do we have a current list of bugs or tests which are holding us back from running mochitest-plain and browser-chrome on desktop-test-large?
Attachment #8782149 - Flags: review?(jmaher) → review+
Assignee
Comment 34•8 years ago
Yup, everything except for plain and browser-chrome mochitest suites.
I'm currently working on making this bug depend on the remaining intermittents.
Keywords: leave-open
Comment 35•8 years ago
Pushed by acomminos@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dd82326944a4
Use large desktop-test instances by default on TaskCluster. r=jmaher
Comment 36•8 years ago
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f5ed7f38160e
followup - Use legacy taskcluster instances for XPCShell and ASAN devtools. r=philor
Comment 37•8 years ago
bugherder
Comment 38•8 years ago
I assume this change also affected our firefox-ui-tests, which use desktop-test with the Ubuntu 16.04 docker images? I ask because since this patch landed ALL of our intermittent test failures are gone! Not a single one has re-appeared! This is freaking cool!
Would we have to make a further change to allow our qa-3-linux-fx-tests workers to use the same setup?
Flags: needinfo?(jmaher)
Reporter
Comment 39•8 years ago
I am not sure what the qa-3-linux-fx-tests workers are, but this change should be easy to test out on try; look at the changesets and try it out on the try server.
It is exciting that changing the worker type to a multi-core machine solved a lot of the intermittents!
Flags: needinfo?(jmaher)
Comment 40•8 years ago
(In reply to Joel Maher ( :jmaher) from comment #39)
> I am not sure what the qa-3-linux-fx-tests workers are, but this change
> should be easy to test out on try- look at the changesets and try it out on
> the try server.
Those are workers we use for the firefox-ui-tests as triggered by mozmill-ci. Basically they use the desktop-test docker image too. From recent fxfn jobs I can see worker types of desktop-test-large:
https://queue.taskcluster.net/v1/task/Bvdxx6KZQ625cBEdIfDmXA
So does this mean that some TC admins would have to update our workers? I don't see a way to do so via a task definition.
> that is exciting that changing the worker type to a multi core machine
> solved a lot of the intermittents!
Not only a lot, but all of them for us! Since Friday we haven't had a single intermittent failure for fx-ui-tests!
Updated•8 years ago
Flags: needinfo?(jmaher)
Reporter
Comment 41•8 years ago
It's not clear what information is needed from me.
Flags: needinfo?(jmaher)
Comment 42•8 years ago
Sorry, I actually wanted to ni? dustin. Dustin, can you please have a look at comment 40? If it warrants a new bug I can file one. Thanks.
Comment 43•8 years ago
There are two options, really -- we can update the `qa-3-linux-fx-tests` workerType, or just use the desktop-test-large workerType. The latter makes more sense, as there's not much reason to segregate these tasks into their own workerType.
In fact, that appears to be the case already -- I don't see any indication of a different workerType here:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#59
I also don't see "linux-fx-tests" anywhere in dxr. I think we set up that workerType earlier, when you were scheduling this out-of-tree, and now that it's in-tree you're already completely migrated. So as far as the firefox-ui-tests are concerned, you've already won :)
Comment 44•8 years ago
It's hard to keep this conversation about fx-ui-tests in two different bugs (see also bug 1296547). In short, we set up a different worker type initially at your request, so that we could also get our own queue. If we start running tests for Nightly builds for all locales, I don't want to overload the desktop-test-large queue that heavily. Shall we continue here or on bug 1296547?
Comment 45•8 years ago
Here is good.
If we're not talking about 10,000 tasks, then there's no real worry about blowing up that queue -- it regularly has over 2000 instances running. I think it's better to use one workerType than to run the same tasks on different workerTypes between nightly and on-commit.
Comment 46•8 years ago
Dustin, I created the following Github issue to start this conversion:
https://github.com/mozilla/mozmill-ci/issues/812. Please have a look. Thanks.
Comment 47•8 years ago
Just to note here, the desktop-test-large worker type has drastically improved the download of the docker image compared to the old desktop-test worker. Maybe it's not the case for all machines, but in the following example it took only about 3 minutes compared to ~20 minutes before!
https://tools.taskcluster.net/task-inspector/#WW_mn3EqTZKyRGfZHdRV1g/0
Updated•8 years ago
Blocks: skia-linux
Reporter
Comment 48•7 years ago
This will leave two remaining suites (out of 5 originally):
* browser-chrome
* asan devtools
Here is a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=79642bc50a9a8628564638c853c71147c39d53df
browser-chrome could move if we address bug 1395539 and bug 1384879. Possibly there are other hurdles for browser-chrome tests with more retriggers, etc.
I haven't tested devtools on asan yet.
Attachment #8918208 - Flags: review?(gbrown)
Updated•7 years ago
Attachment #8918208 - Flags: review?(gbrown) → review+
Comment 49•7 years ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1df54aa8b21
move mochitest-plain, screenshots, and xpcshell off m1.medium. r=gbrown
Comment 50•7 years ago
bugherder
Reporter
Updated•7 years ago
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 51•7 years ago
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open