Closed Bug 1281241 Opened 4 years ago Closed 2 years ago

run some, if not all, linux unittests on m3.medium instances; consider upgrading llvmpipe/mesa as well

Categories

(Testing :: General, defect)

49 Branch
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: acomminos)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

let the good times roll!

We have a problem where many tests are experiencing failures; it looks like this is an issue with the hardware or the software in the container running the tests.

To solve this we will try:
* running unittests on try with m3.medium instances (instead of m1.medium) so we have more than 1 core available to the docker container and, theoretically, to the tests
* upgrading mesa/llvmpipe in the docker image
first try push is up here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b8b07888730

ideally before we go live, I would like to see us run 50 instances of any job and see no more than 3 jobs reported as orange, ideally fewer. Let's see if this stuff runs as expected; we can retrigger jobs to see trends.
I'm trying to understand the context.

Did we disable a bunch of tests to make things work on m1.medium?
Where are you experiencing failures?
m3.medium is >1 core and it had a larger percentage of test failures (iirc 5% more failures) than m1.medium.  I cannot find the spreadsheet right now.  Jeff has a use case, possibly he can outline some more specific failures he is seeing.
Would we want to narrow this down to a number of suites? (the ones Jeff cares about)
(In reply to Joel Maher (:jmaher) from comment #3)
> m3.medium is >1 core and it had a larger percentage of test failures (iirc
> 5% more failures) than m1.medium.  I cannot find the spreadsheet right now. 
> Jeff has a use case, possibly he can outline some more specific failures he
> is seeing.

m3.medium is still 1 core. However, m3.medium is faster and cheaper than m1.medium. We wanted the extra performance and newer cpu instructions to reduce the likelihood of some timeouts during test runs when using llvmpipe.
These jobs all failed with:
 [taskcluster:error] Pulling docker image {"path":"public/image.tar","type":"task-image","taskId":"Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the registry, image name, or an authentication error. Try pulling the image locally to ensure image exists. ENOSPC, write
Flags: needinfo?(jmaher)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> These jobs all failed with:
>  [taskcluster:error] Pulling docker image
> {"path":"public/image.tar","type":"task-image","taskId":
> "Yw8NRCthSbK5HC_6tJTIMA"} has failed. This may indicate an error with the
> registry, image name, or an authentication error. Try pulling the image
> locally to ensure image exists. ENOSPC, write

Unfortunately this sometimes happens: docker dies when importing an image. Usually retriggering the task solves the issue.
odd, I am not able to figure out how to get this going.  Using the taskcluster/desktop-test image (the latest two tags from hub.docker.com), I see failures where we fail to download test-linux.sh, as it is at a hardcoded path that has since moved:
curl --fail -o ./test-linux.sh --retry 10 https://hg.mozilla.org/try//raw-file/a1c987825a8ffd17026792296f58583cc95011cc/testing/taskcluster/scripts/tester/test-linux.sh

^ testing/taskcluster -> taskcluster/ci/legacy

So it appears we need the magic generated image.tar instead.  We have dozens of failed jobs with the same error message; I don't think this is an intermittent failure.

:garndt, I could use your help to solve this.  You can see the changes I made to run this on try:
https://hg.mozilla.org/try/rev/8b8b0788873027aa5f1116fbd4e8403d7678445a

possibly there is something else on the host os or tc-worker that we need to setup properly before running jobs on there?
Flags: needinfo?(jmaher) → needinfo?(garndt)
Joel, the location of the script is coded here:
https://dxr.mozilla.org/mozilla-central/source/testing/docker/desktop-test/bin/test.sh#32

What we need is a new bug with a fix for the new location of the script, pushed to inbound. The desktop-test task will automatically be triggered.

Also, do you run your tests from in-tree or externally? If in-tree, you shouldn't have to create the docker image but could reference its last task id.
I had tried to reference a static image on hub.docker.com/taskcluster/desktop-test, but those images are out of date by months and do not include recent changes.  I don't know the process for updating those, and it is a ~6 hour cycle for me to update docker images if I were to test on my own.

the dynamically generated docker images (image.tar) use the correct location for test-linux.sh; possibly I just need some education on how to use the image.tar from the in-tree auto-generated images.
I will walk through this with Joel on IRC. Let's see if we can get this working before garndt comes online.
It's mystifying. Something is clearly not working, and when I look at the live log I would say it's a malformed download of the docker image. To test this I used the same task id and triggered a run of our firefox-ui-functional tests via my external script in mozmill-ci. Interestingly, this task doesn't seem to fail in extracting the docker archive:

https://tools.taskcluster.net/task-inspector/#NjzHovwMSwSnw66fzxClKA/0

Also I wonder why it takes so long until tasks for the medium worker get started. Even if no other test is running and the queue is empty, you have to wait up to 18 min before the task gets started. Maybe it's the special hardware specs, which are different from the desktop-test workers? The workers for our firefox-ui-tests take about 4-5 min to spin up.

https://tools.taskcluster.net/task-inspector/#G1VdyT1cSwqIzIH4ZBZNUA/0
ok, doing a fresh push seemed to resolve this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c42f80fbc92b38ea45f8d601494375b4943d460f

now to do another push with all the mochitests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3e72eee9c87
Flags: needinfo?(garndt)
ok, we hit a roadblock here: the failure to extract the image is because we run out of disk space. We have 4GB total on m3.medium and, get this, to get more disk we need to go with 2 vCPUs.

Jeff, do you have a preference on m3.large vs c3.large?  c3.large has half the RAM, but a slightly faster processor.  Possibly we could restrict docker to use only 1 cpu, although that would not be ideal.  Keep in mind that some of our tests (like web-platform-tests) already run on larger instances, I think c3.xlarge, so we could try that out without getting any custom AMI images set up.
Flags: needinfo?(jmuizelaar)
2 vCPU is even better for us. My preference would be c4.large, c3.large, m3.large.
Flags: needinfo?(jmuizelaar)
running on m3.large:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8b1440a46186b5e5cf6e0ce87247bc5cbe5ca57&filter-searchStr=tc

Jeff, if there are certain jobs/tests that you want to look at, please retrigger away.

This weekend, if all is well so far, I will ensure I have enough clients and do 50 retriggers on each job; that will give us enough data to ensure we are not introducing noisier jobs/tests.
it appears that retriggers are not working on taskcluster jobs right now, so I cannot determine whether these failed tests are permafail or not.  I assume this will get fixed today.

Jeff, please give me some direction here to know what issues to look for.
Flags: needinfo?(jmuizelaar)
Joel, I left a comment on the other bug about retriggers, but it seems that you can retrigger them from the staging site so that you're unblocked.
The job results don't look too bad. I'm going to have Andrew investigate to see if we can't make things more green (especially bc2).
Flags: needinfo?(jmuizelaar)
Most of these seem not too terrible. bc3 on e10s appears to be permafailing due to perf-related changes.

For the intermittents, many of the crashes appear to be happening within native cairo:

https://public-artifacts.taskcluster.net/emzV5D06SIe6iXYZpp4Ukw/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/YmFWEXy_SL6fomYoTJzRnA/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/Igf_QOQVTza4aNuKg92akg/0/public/logs/live_backing.log

Unfortunately, we don't have frame pointers, but the calls are almost certainly being made from within GTK. Going to see if I can reproduce on an old build of gtk+cairo.
I would say we have a list of tests to fix; here they are by job type:
linux64 opt:
* mochitest-3: dom/events/test/test_bug659071.html, dom/filesystem/tests/test_basic.html
* a11y: all kinds of badness
* bc4: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc7: oddness in browser/base/content/test/general/*
* c3: layout/xul/test/test_windowminmaxsize.xul 
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html 
* mochitest-e10s-5: dom/push/test/test_serviceworker_lifetime.html
* mochitest-e10s-8: gfx/layers/apz/test/mochitest/test_group_touchevents.html
* mochitest-e10s-10: toolkit/components/extensions/test/mochitest/test_ext_notifications.html
* bc-e10s-1: toolkit/modules/tests/browser/browser_FinderHighlighter.js
* bc-e10s-3: toolkit/components/perfmonitoring/tests/browser/browser_compartments.js
* bc-e10s-5: browser/components/privatebrowsing/test/browser/*
* bc-e10s-6: browser/components/sessionstore/test/*
* bc-e10s-7: oddness in browser/base/content/test/general/*

linux64 debug:
* mochitest-10: toolkit/components/prompts/test/test_subresources_prompts.html
* c3: toolkit/content/tests/chrome/test_popup_anchoratrect.xul (this fails frequently on m1.medium)
* mochitest-e10s-3: dom/html/test/test_fullscreen-api.html
* mochitest-e10s-7: leak in /tests/dom/workers/test/serviceworkers

keep in mind that this could be partially related to the base revision I pushed with; the tree and tests change all the time. Overall I think most of these are valid and unique.  While they might exist as *known* failures, there is a good chance they are happening much more frequently.

luckily, for these mochitests there are fewer than 15 issues to sort out, though a few look hard.
just checking in here, are there any updates?
(In reply to Andrew Comminos [:acomminos] from comment #21)
> Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> due to perf-related changes.
> 
> For the intermittents, many of the crashes appear to be happening within
> native cairo;

We should be able to get debug symbols for these libraries. Someone just needs to grab a loaner test instance and follow these steps:
https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30

(If someone else does that work, you can give me the symbols.zip and I'll upload it.)
(In reply to Joel Maher (:jmaher: pto- back july 7th) from comment #23)
> just checking in here, are there any updates?

I believe we're going to go ahead with this- I'll be starting on fixing the failures in the near future, once my backlog is clear.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> (In reply to Andrew Comminos [:acomminos] from comment #21)
> > Most of these seem not too terrible. bc3 on e10s appears to be permafailing
> > due to perf-related changes.
> > 
> > For the intermittents, many of the crashes appear to be happening within
> > native cairo;
> 
> We should be able to get debug symbols for these libraries. Someone just
> needs to grab a loaner test instance and follow these steps:
> https://bugzilla.mozilla.org/show_bug.cgi?id=528231#c30
> 
> (If someone else does that work, you can give me the symbols.zip and I'll
> upload it.)

I've asked for a loaner to symbolicate, thanks!
Depends on: 1285561
Bug 1285561 appears to have fixed up the a11y crashes; not sure if other tests were affected.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978
there was a nice list of tests in comment 22; do you think most of those are fixed?  Should we be pushing and collecting a large volume of data to see where we stand?
Not quite yet, I think; my recent push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=1777b364e978) suggests that many of the intermittents documented in comment 22 are still valid, save for a11y (and some other nonspecific oranges). I'm going to continue investigating these, particularly high-volume failures such as dom/html/test/test_fullscreen-api.html.
oh great, maybe in a short while this will be ready to go live :)
Depends on: 1240643, 1131576
Depends on: 1284742
Depends on: 1284038
Most of the high-volume intermittents (above 30%) should be fixed now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8d2a3e1e038&filter-searchStr=tc-m

There are still some more frequently occurring failures; these seem to be quite hard to track down.

Since our primary reason for using dual-core instances is to run llvmpipe faster for GL composition, jrmuizel and I were discussing potentially running a tier-2 "tc-gl-M-(e10s)" set of tests. An additional benefit of this would be ensuring that we still test the basic composition path on Linux. These tests would run on the m3.large instances with layers.acceleration.force-enabled set to true.

Since the failure rate is saner now, what do you think, Joel?
Flags: needinfo?(jmaher)
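A tier-2 GL variant along the lines proposed above might be sketched roughly like this in the in-tree test definitions; note the suite name, the "tier" key, and the --setpref option are illustrative assumptions, not confirmed tests.yml schema:

```yaml
# Hypothetical sketch of a tier-2 GL test variant.
# The suite name, "tier" key, and --setpref option are
# illustrative placeholders, not the actual schema.
mochitest-gl:
    description: "Mochitest GL suite with forced GL composition"
    instance-size: large
    tier: 2
    mozharness:
        extra-options:
            - "--setpref=layers.acceleration.force-enabled=true"
```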
we could switch just the 'gl' jobs to m3.medium for now, and eventually move other jobs.  The e10s tests look more problematic, and with debug it gets even messier.

in fact, everything but browser-chrome (bc*), mochitest-media (mda), and mochitest-plain could be ported over, given the results of the try push from comment 30.  Maybe doing that, seeing how it sorts out, and reassessing a week later to get a list of tests to clean up for the remaining jobs would be a good route to go.  Ideally we can get to 100% on the m3.mediums.

to fix this we would need to modify:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml

and for each test we care about add:
instance-size: large

for example, webgl:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#291
Flags: needinfo?(jmaher)
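The per-suite change described in the previous comment would look roughly like this; only `instance-size: large` is the actual setting being added, and the suite key shown is an illustrative placeholder for an entry in taskcluster/ci/desktop-test/tests.yml:

```yaml
# Sketch of the change described above; "mochitest-webgl" stands in
# for whichever suite entry is being edited in tests.yml.
mochitest-webgl:
    # run this suite on the larger (multi-core) instance type
    instance-size: large
```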
Assignee: nobody → andrew
Status: NEW → ASSIGNED
Depends on: 1262702
Comment on attachment 8782149 [details]
Bug 1281241 - Use large desktop-test instances by default on TaskCluster.

https://reviewboard.mozilla.org/r/72376/#review69984

to confirm, this will run all tests on the new instance type except for mochitest-plain and mochitest-browser-chrome?  This means marionette, cppunit, jittests, mochitest-other, mochitest-devtools, reftest, crashtest, web-platform-tests, etc. will all run on the new instance type.  I do like the usage of 'legacy'.

Lastly, do we have a current list of bugs or tests which are holding us back from running mochitest-plain and browser-chrome on desktop-test-large?
Attachment #8782149 - Flags: review?(jmaher) → review+
Yup, everything except for plain and browser-chrome mochitest suites.

I'm currently working on making this bug depend on the remaining intermittents.
Keywords: leave-open
Depends on: 1202200
Depends on: 1280290
Pushed by acomminos@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dd82326944a4
Use large desktop-test instances by default on TaskCluster. r=jmaher
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f5ed7f38160e
followup - Use legacy taskcluster instances for XPCShell and ASAN devtools. r=philor
I assume this change also affected our firefox-ui-tests, which use desktop-test with the Ubuntu 16.04 docker images? I ask because since this patch landed ALL of our intermittent test failures are gone! Not a single one has re-appeared! This is freaking cool!

Would we have to make a further change to allow our qa-3-linux-fx-tests workers to use the same?
Flags: needinfo?(jmaher)
I am not sure what the qa-3-linux-fx-tests workers are, but this change should be easy to test out on try; look at the changesets and try it out on the try server.

that is exciting that changing the worker type to a multi-core machine solved a lot of the intermittents!
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #39)
> I am not sure what the qa-3-linux-fx-tests workers are, but this change
> should be easy to test out on try- look at the changesets and try it out on
> the try server.

Those are workers we use for the firefox-ui-tests as triggered by mozmill-ci. Basically they use the desktop-test docker image too. From recent fxfn jobs I can see worker types of desktop-test-large:

https://queue.taskcluster.net/v1/task/Bvdxx6KZQ625cBEdIfDmXA

So that means some TC admins would have to update our workers? I don't see a way to do that via the task definition.

> that is exciting that changing the worker type to a multi core machine
> solved a lot of the intermittents!

Not only a lot, but all of them for us! Since Friday we haven't had a single intermittent failure for fx-ui-tests!
Flags: needinfo?(jmaher)
not clear what information is needed from me
Flags: needinfo?(jmaher)
Sorry, I actually wanted to ni? dustin. Dustin, can you please have a look at comment 40? If it warrants a new bug I can file one. Thanks.
There are two options, really: we can update the `qa-3-linux-fx-tests` workerType, or just use the desktop-test-large workerType.  The latter makes more sense, as there's not much reason to segregate these tasks into their own workerType.

In fact, that appears to be the case already; I don't see any indication of a different workerType here:

https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#59

I also don't see "linux-fx-tests" anywhere in dxr.  I think we set up that workerType earlier, when you were scheduling this out-of-tree, and now that it's in-tree you're already completely migrated.  So as far as the firefox-ui-tests are concerned, you've already won :)
It's hard to keep this conversation about fx-ui-tests going in two different bugs (see also bug 1296547). In short, we initially set up a different worker type at your request, so we could also get our own queue. If we start running tests for Nightly builds for all locales, I don't want to blow up the desktop-test-large queue that heavily. Shall we continue here or on bug 1296547?
Here is good.

If we're not talking about 10,000 tasks, then there's no real worry about blowing up that queue; it regularly has over 2000 instances running.  I think it's better to use one workerType than to run the same tasks on different workerTypes between nightly and on-commit.
Dustin, I created the following Github issue to start this conversion:
https://github.com/mozilla/mozmill-ci/issues/812. Please have a look. Thanks.
Just to note here, the desktop-test-large worker type has drastically improved the download of the docker image compared to the old desktop-test worker. Maybe it's not the case for all machines, but in the following case the download took only about 3 min compared to ~20 min before!

https://tools.taskcluster.net/task-inspector/#WW_mn3EqTZKyRGfZHdRV1g/0
this will leave two remaining suites (out of the original 5):
* browser-chrome
* asan devtools

here is a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=79642bc50a9a8628564638c853c71147c39d53df

browser-chrome could move if we address bug 1395539 and bug 1384879.  Possibly there are other hurdles for browser-chrome tests with more retriggers, etc.

I haven't tested devtools on asan yet.
Attachment #8918208 - Flags: review?(gbrown)
Depends on: 1408384
Depends on: 1408387
Depends on: 1408389
Attachment #8918208 - Flags: review?(gbrown) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1df54aa8b21
move mochitest-plain, screenshots, and xpcshell off m1.medium. r=gbrown
Depends on: 1408506
Depends on: 1411334
Blocks: 1411344
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Blocks: 1429595
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open