Closed Bug 1171033 (tc-linux64-debug) Opened 9 years ago Closed 8 years ago

[tracking] Schedule linux64 desktop tests via TaskCluster on try and enable with tier2 status

Categories: Taskcluster :: General (defect)
Severity: normal
Status: RESOLVED FIXED
People: Reporter: ahal; Assigned: armenzg
References: Blocks 1 open bug
Whiteboard: [bb2tc] [milestone1][leave-open]
Attachments: 1 file, 1 obsolete file

Linux64 opt builds are currently scheduled on try (but hidden). It's a good time to look into scheduling its tests as well.

This will double the load until we can decommission the buildbot-scheduled tests, so I'll try to get them scheduled, but not by default. Apparently this might be hard in TC.
Perhaps we can add a special flag to the TC try parser to activate non-default jobs.
I don't know.
Blocks: bb-to-tc
Depends on: 1171140
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #1)
> Perhaps we can add a special flag to the TC try parser to activate
> non-default jobs.
> I don't know.

I realized that, for the time being, I can just keep pushing the change that schedules the tests to try. If it turns out they need a lot of greening up and multiple people start working on them, then we might need to land it permanently; but until then, that's a problem for another time.
Depends on: 1171390
I got mochitests scheduled but they're failing because they can't find the tests.zip. I haven't looked into it yet.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d83136a649e9&exclusion_profile=false
Morgan, do you think this is caused by that artifact problem you mentioned in the meeting? If so, is there a bug I can follow along?
Flags: needinfo?(winter2718)
Talked to Morgan on irc, it's likely the same problem. Tests.zip is an artifact in taskcluster, so it isn't getting uploaded by the builds yet due to bug 1172107.
Depends on: 1172107
Flags: needinfo?(winter2718)
Bug 1171033 - Schedule linux64 mochitest with taskcluster
Quick update: 
tests.zip is now found and the harness runs, but Firefox fails to start due to the following error:
firefox: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.17' not found (required by /home/worker/build/application/firefox/firefox)

I added "sudo apt-get update && sudo apt-get install -y libc6" to the test image's Dockerfile, but that didn't seem to work, or else I'm not testing it properly. Here's the latest try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fa1f76ee1bd&exclusion_profile=false
Depends on: 1175938
Depends on: 1176031
OK, I have a better understanding of what needs to happen regarding the glibc issue now. There are two ways to solve the problem:

1. Link to the proper glibc in the tester or builder. The quick hack workaround is to download/unzip 2.17 in the tester and prepend it to LD_LIBRARY_PATH (see the sketch after this list). The proper solution is to fix bug 1179818. Adding that as a blocker, because even if I hack around the issue, it's probably a prerequisite to scheduling live and un-hidden.

2. Upgrade the tester image to 14.04. This might be tricky, however, as I suspect it will cause a lot of test failures. I'd rather not worry about that at the same time as migrating to taskcluster. That being said, it's probably worth at least pushing to try and seeing how things go.
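A minimal sketch of the LD_LIBRARY_PATH hack from option 1, assuming a pre-built glibc 2.17 has already been unpacked somewhere inside the tester image; the paths and the way the environment is set up here are hypothetical, not the actual mozharness code:

    import os
    import subprocess

    # Hypothetical location where a pre-built glibc 2.17 was unpacked in the tester image.
    GLIBC_DIR = "/home/worker/glibc-2.17/lib"
    FIREFOX = "/home/worker/build/application/firefox/firefox"

    env = os.environ.copy()
    # Prepend the newer glibc so the dynamic linker resolves GLIBC_2.17 symbols first;
    # without this, the loader fails with "version `GLIBC_2.17' not found".
    env["LD_LIBRARY_PATH"] = GLIBC_DIR + os.pathsep + env.get("LD_LIBRARY_PATH", "")

    subprocess.check_call([FIREFOX, "--version"], env=env)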
Depends on: 1179818
Depends on: 1182142
Component: TaskCluster → General
Product: Testing → Taskcluster
Depends on: 1184084
Depends on: 1187047
Depends on: 1189892
Andrew -- I think this is ready for you to jump back in.  We have green builds!  They produce artifacts with simple names, and there's already code in the `mach taskcluster-graph` command for turning those simple names into artifact URLs for test tasks.  This is how B2G links its builds and tests.

I think the tricky bit will be getting an operating system image that we're happy with, but IIRC you'd already made some progress on that front?
Flags: needinfo?(ahalberstadt)
Woohoo, thanks!

I doubt the progress I made on the image will be very relevant anymore since the OS has changed, but it wasn't much anyway. From here, just scheduling and pushing to try to see what happens is the best way to go. I'm trying to finish something else up this quarter, but I'm sure I'll get started on this again in Q4.
Flags: needinfo?(ahalberstadt)
The *build* image changed.  The tests can run in whatever OS you prefer.
Right, for some reason I thought I'd need to change the tester image to match the toolchain of the build... but that's what you just fixed by making the build image match the test image.

In that case yes, I have a slightly modified image I can work off.
Depends on: 1209064
On Friday I got some tests running:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=844213339596

The tests all passed, but the job is still orange because of a 'pactl' (pulseaudio-related tool) failure. I don't understand why it's happening. Hopefully it's just a matter of updating some config in the image.

The good news is that it's mochitest-specific, so it shouldn't block other test suites. The code path that causes the failure can also be turned off, if we want, by not passing in --use-test-media-devices.
ahal: I'm trying to land this piece of code [1], which allows test jobs to work even if --read-buildbot-config is used (since Buildbot jobs can only be run with that action).

For TC tasks, I assume that the builds specify ['extra']['locations']['build'].

I assume you will be defining the call to Mozharness with --installer-url; nevertheless, I wanted to let you know what I'm doing.

[1] https://reviewboard.mozilla.org/r/21137
            if parent_task['extra'].get('locations'):
                # Build tasks generated under TC specify where they upload their builds
                installer_path = parent_task['extra']['locations']['build']

                # Resolve the installer, the test packages manifest and the
                # crashreporter symbols against the parent build task's artifacts.
                self.set_artifacts(
                    self.url_to_artifact(parent_id, installer_path),
                    self.url_to_artifact(parent_id, 'public/build/test_packages.json'),
                    self.url_to_artifact(parent_id, 'public/build/target.crashreporter-symbols.zip')
                )
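For context, a hedged sketch of what a helper like url_to_artifact could look like; the queue URL format is an assumption based on how public TaskCluster artifacts were commonly addressed at the time, and the task id and artifact path in the example are made up:

    def url_to_artifact(task_id, path):
        """Build a public artifact URL for a given task (assumed URL scheme)."""
        return "https://queue.taskcluster.net/v1/task/{}/artifacts/{}".format(task_id, path)

    # Example: resolving the installer uploaded by a hypothetical parent build task.
    installer_url = url_to_artifact("abc123TaskId", "public/build/target.linux-x86_64.tar.bz2")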
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Schedule linux64 mochitest with taskcluster
Whiteboard: [bb2tc] [milestone1]
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Schedule linux64 mochitest with taskcluster
Depends on: 1213314
Depends on: 1214194
Here's a try run with the latest patch that also schedules mochitest bc, dt and gl:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6210c226e2e0

They all fail with the pactl issue (bug 1214194), but tests still seem to be run and passing.
Also note the need for some creative try syntax. Fixing that is bug 1213314.
No longer depends on: 1213325
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Schedule linux64 taskcluster tests on try
Attachment #8621613 - Attachment description: MozReview Request: Bug 1171033 - Schedule linux64 mochitest with taskcluster → MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try
Here's a new try run with reftest and xpcshell:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6529e38db5d2

Some notes:
1. A green reftest job! The other jobs all have test failures, but seemingly no harness-wide issues.
2. These are chunked based on the debug builds in buildbot... I still have to figure out how to distinguish opt vs debug chunking in the configs (or ignore opt for now like originally planned).
3. This patch will likely need to be refactored once :dustin's image refactor lands. Results may also vary on the new image.
4. There's some info missing in the "Job details" pane that shows up for buildbot jobs (e.g. passed/failed/skipped, artifacts uploaded, etc.). Though I think this is a wider taskcluster issue.
Depends on: 1213325
Sweet!

For #4, is there a bug filed that you're aware of? Last quarter we managed to get some decent traction on raising the sheriffability level, and I think this is one of those issues.
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Schedule linux64 taskcluster tests on try
FYI, at least one of the xpcshell errors you're seeing looks like a misconfigured system encoding:
 TEST-UNEXPECTED-FAIL | dom/plugins/test/unit/test_bug455213.js | run_test - [run_test : 75] "Plug-in for testing purposes.â„¢ (हिनà¥\x8Dदी 中文 العربية)" == "Plug-in for testing purposes.™ (हिन्दी 中文 العربية)" 

You might just need LANG=en_US.UTF-8 in the environment.
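A tiny sketch of the suggested workaround, assuming the environment can be set up before the harness starts (where exactly to set it in the task definition is not shown):

    import os

    # Force a UTF-8 locale if the container doesn't provide one; without it,
    # non-ASCII test strings get mangled, as in the failure above.
    os.environ.setdefault("LANG", "en_US.UTF-8")
    os.environ.setdefault("LC_ALL", "en_US.UTF-8")

    # Note: en_US.UTF-8 also has to be generated in the image (e.g. via the
    # locales package) for the setting to take effect.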
I want to get the task configs landed without the branch configs that actually schedule them, so that:

a) they don't bitrot
b) it's easier for other people to push them to try
c) when we're ready to enable them, it's just a small and easy-to-understand patch that needs to land

Here's an actually pretty green-looking try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b1d20fc2248

Aside from the orange jobs, we'll need to fix:
1. Add --use-test-media-devices back (dustin has the fix, we just need to test it out)
2. There are failures in the debug bc5 job; investigate why those aren't turning the job orange.
3. xpcshell tasks seem to just abruptly end with no output indicating why.

I'll file new bugs to tackle these problems in time.
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Add taskcluster linux64 test configs, r=dustin

This adds test configs for desktop linux64 unittests, including: mochitest-plain,
mochitest-browser-chrome, mochitest-devtools-chrome, reftest and xpcshell. It
also does a minor refactor of the b2g configs to remove some b2g-specific logic
from the base 'test.yml' config.

This does *not* schedule these tests anywhere just yet.
Attachment #8621613 - Attachment description: MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try → MozReview Request: Bug 1171033 - Add taskcluster linux64 test configs, r=dustin
Attachment #8621613 - Flags: review?(dustin)
Schedule taskcluster linux64 tests on try
Also note that the patch cargo-cults a lot of how the b2g configs were set up. E.g., we may want to start organizing the configs into subdirectories, and/or try to use less inheritance.

It also currently doesn't handle different build types. E.g., if you want opt to have different chunking from debug, you currently have to override that in the branch configs; there's no way to set it directly in the test configs (see the sketch below for what that could look like).
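A hypothetical sketch of how per-build-type chunking could be expressed and resolved; the config keys and layout here are made up for illustration and are not the actual testing/taskcluster schema:

    # Hypothetical test config: chunk counts keyed by build type instead of a flat value.
    MOCHITEST_CONFIG = {
        "suite": "mochitest-plain",
        "chunks": {"opt": 5, "debug": 10},
    }

    def chunks_for(config, build_type):
        """Pick the chunk count for a build type, falling back to a flat integer."""
        chunks = config["chunks"]
        if isinstance(chunks, dict):
            return chunks.get(build_type, 1)
        return chunks

    print(chunks_for(MOCHITEST_CONFIG, "debug"))  # -> 10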
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

https://reviewboard.mozilla.org/r/11013/#review20493

This looks good!  A few comments, but I'd be happy to land this as-is, perhaps farming these out to low-priority bugs?

::: testing/taskcluster/tasks/test.yml:27
(Diff revision 6)
>          loopbackAudio: true

It'd be nice to not have these defined for every job, if they're not required.

::: testing/taskcluster/tasks/test.yml:33
(Diff revision 6)
>        tc-vcs: '/home/worker/.tc-vcs'

I'm confident that we don't need the tc-vcs cache for firefox tests.  I don't know about B2G, but maybe this can move to the b2g base file?

As for linux-cache -- I have no idea what that's for.  Is it necessary/useful in this case?  Caching brings risks of cache poisoning and the potential need to clobber, so unless this is a known win I think we should leave it out.
Attachment #8621613 - Flags: review?(dustin) → review+
To the note -- yeah, it's kind of a mess.  I've been avoiding modifying a lot of the B2G stuff because I don't know how it works and don't want to get involved in maintaining it; and I think that once we've loaded our task definitions in using this system, we will have a better idea of the requirements for a system that generates them in a more maintainable fashion.
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Add taskcluster linux64 test configs, r=dustin

This adds test configs for desktop linux64 unittests, including: mochitest-plain,
mochitest-browser-chrome, mochitest-devtools-chrome, reftest and xpcshell. It
also does a minor refactor of the b2g configs to remove some b2g-specific logic
from the base 'test.yml' config.

This does *not* schedule these tests anywhere just yet.
Comment on attachment 8677585 [details]
MozReview Request: Schedule taskcluster linux64 tests on try

Schedule taskcluster linux64 tests on try
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Add taskcluster linux64 test configs, r=dustin

This adds test configs for desktop linux64 unittests, including: mochitest-plain,
mochitest-browser-chrome, mochitest-devtools-chrome, reftest and xpcshell. It
also does a minor refactor of the b2g configs to remove some b2g-specific logic
from the base 'test.yml' config.

This does *not* schedule these tests anywhere just yet.
Comment on attachment 8677585 [details]
MozReview Request: Schedule taskcluster linux64 tests on try

Schedule taskcluster linux64 tests on try
Fixed review comments. Here's a try run proving that b2g emulator and mulet didn't get broken, which is all I care about for now:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4517d152da44

The tc mochitest failures are because I didn't remove --use-test-media-devices this time, and the image hasn't been updated with the fix yet.
Whiteboard: [bb2tc] [milestone1] → [bb2tc] [milestone1][leave-open]
Had to back this out for breaking the decision task in:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d351ee79b4e4

Still investigating why.
Attachment #8621613 - Attachment description: MozReview Request: Bug 1171033 - Add taskcluster linux64 test configs, r=dustin → MozReview Request: Bug 1171033 - Add taskcluster linux64 test configs (but not scheduled anywhere yet), r=dustin
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Add taskcluster linux64 test configs (but not scheduled anywhere yet), r=dustin

This adds test configs for desktop linux64 unittests, including: mochitest-plain,
mochitest-browser-chrome, mochitest-devtools-chrome, reftest and xpcshell. It
also does a minor refactor of the b2g configs to remove some b2g-specific logic
from the base 'test.yml' config.

This does *not* schedule these tests anywhere just yet.
Comment on attachment 8677585 [details]
MozReview Request: Schedule taskcluster linux64 tests on try

Schedule taskcluster linux64 tests on try
Pretty sure I found and fixed the issue, but I was never able to reproduce by running |mach taskcluster-graph| locally :/. Here's a try run that includes gaia_build_tests this time:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3c25591967ee
Here's the latest try run with --use-test-media-devices added back in and dustin's fix for the pactl issue. It seems to work now, though it looks considerably less green than it did before:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b84d34d7166a

Also, the jobs got scheduled twice for some reason?

But either way, we're at a point where it's worth talking about how we want to get these things turned on. E.g., it looks like we could move mochitest-gl over immediately if we want. Do we run it side by side with buildbot for a while? Do we just disable the buildbot job right away? Is there anything else (e.g. sheriff-wise) blocking us from turning it on?
Depends on: 1218537
Depends on: 1218542
I think that's a question for Selena.
Flags: needinfo?(sdeckelmann)
(In reply to Andrew Halberstadt [:ahal] from comment #43)

> But either way, we're at a point where it's worth talking about how we want
> to get these things turned on. E.g., it looks like we could move mochitest-gl
> over immediately if we want. Do we run it side by side with buildbot for a
> while? Do we just disable the buildbot job right away? Is there anything
> else (e.g. sheriff-wise) blocking us from turning it on?

\o/  I am overjoyed by this question!  There are no sheriff blockers.

I suggest we make these Tier2 initially -- for our evaluation period.

The idea was to run jobs side-by-side for a while. I don't think we had a fixed period identified, so looping :jgriffin in.
Flags: needinfo?(sdeckelmann) → needinfo?(jgriffin)
I agree about running them side-by-side as Tier 2 for a while. I think two weeks should be enough time to compare failure rates to give us confidence that we can turn the buildbot jobs off. We should let the sheriffs know what we're doing, so they can look for problems during that window as well.
Flags: needinfo?(jgriffin)
Alias: tc-linux64
Summary: Schedule linux64 desktop tests via TaskCluster on try → [tracking] Schedule linux64 desktop tests via TaskCluster on try
Depends on: 1218791
Depends on: 1218841
Comment on attachment 8621613 [details]
MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try

Bug 1171033 - Schedule linux64 taskcluster tests on try
Attachment #8621613 - Attachment description: MozReview Request: Bug 1171033 - Add taskcluster linux64 test configs (but not scheduled anywhere yet), r=dustin → MozReview Request: Bug 1171033 - Schedule linux64 taskcluster tests on try
Attachment #8677585 - Attachment is obsolete: true
(crap, latest mozreview patch overwrote the one that already landed)

For anyone wanting to help green up the tests, here's the latest state:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=603b29218a37

Just push the attached patch to try along with test fix ups or disablings.
Depends on: 1221553
Depends on: 1221661
How relevant are these?
[dix] Could not init font path element /usr/share/fonts/X11/100dpi/:unscaled, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/75dpi/:unscaled, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/Type1, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/100dpi, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/75dpi, removing from list!
[dix] Could not init font path element /var/lib/defoma/x-ttcidfont-conf.d/dirs/TrueType, removing from list!
I will be grabbing this.
Assignee: ahalberstadt → armenzg
Depends on: 1223123
It seems that my push is using my images.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=30eb0ab890a2&filter-searchStr=TC <- armenzg
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f28fce077ec4&filter-searchStr=TC <- jmaher

The results are looking similar (good!).

A weird thing I have noticed is that we have to filter jobs with "TC".
The reason is that the Linux jobs we cancelled through TH's UI are not actually being cancelled (bug 1213520).
FTR, ignore my previous push as it was missing my docker changes. I'm currently working on bug 1223123 and more try pushes will happen there.

Current status summary
######################
Tier-2 blockers are bugs that, if fixed, will make the jobs green.
Tier-1 blockers are bugs which block switching the equivalent Buildbot jobs off.

Tier-2 blockers:
* Bug 1222162 - browser/components/search/test has a few tests which are failing when run on task cluster
** might be a dupe of bug 1223123
* Bug 1223123 - We need a window manager of some type while running test jobs on linux

Tier-1 blockers:
* Bug 1218537 - Taskcluster jobs don't print information into the treeherder "Job details" pane
* Bug 1221553 - TaskCluster test jobs are skipping blobber uploads
* Bug 1221661 - task cluster test results seem to fail on crash reporter tests- probably a common cause
We're probably going to need Mesa as well (bug 1220658).
Depends on: 1224641
Depends on: 1224724
Depends on: 1226282
Depends on: 1227637
Depends on: 1227652
Depends on: 1227657
Depends on: 1228289
Depends on: 1228416
Depends on: 1225484
Depends on: 1228632
Feedback wanted in this comment. Thanks :)

This is the current set of results:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1,96923331f0ef,042f329cbbb1&group_state=expanded

The first push shows Buildbot jobs; the second has *most* Buildbot jobs and most harnesses working.
The last push addresses any remaining suites which the second push failed to address correctly.

NOTE:
* I don't know of a *clean* way to run a chunked suite by default without running all the other ones
* We will probably need to separate the yml files into opt/debug to control chunking separately (since debug runs so slowly)
* I have needed to chunk a lot more than on Buildbot

I think I will prep patches to enable the jobs which are green by default and run them with --times 10 to see how stable they are.
After that I think I will focus on fixing the upload of artifacts.

My current worry is bug 1221661 with the crashreporter not working.

I'm also considering creating a tier-1 tracker bug to help separate out the issues which are not blockers for making this tier 2.

Suites that came back green:
* cpp
* Jit1&2
* mochitest-push
* crashtest opt
* marionette opt
* jsreftest
* wr

Unknown (waiting on last push):
* luciddream
* mochitest-other
Should we aim to run the "opt" test jobs which come back green? (All my pushes have already been doing that.)

This would be even though in Q1 we will only be swapping the debug builds between Buildbot and TC.

The counter-argument would be that we would be running jobs side by side without a clear intent of replacing the opt builds.

I need to know this so I can adjust my patches if we won't.
Depends on: 1229893
Depends on: 1226751
Depends on: 1230330
Depends on: 1232070
Depends on: 1232316
Depends on: 1232407
Depends on: 1233044
Depends on: 1233054
Depends on: 1233554
Depends on: 1233716
Depends on: 1233725
Depends on: 1231618
Since I started looking at this in November, here's where we stand.

Anyone reading this, please let me know if you have any questions about where we currently stand.

NOTE: Not all test issues have been filed yet, as some of them likely share a common cause. In January we should press the pedal down to file everything and get lots of developer involvement. By then we will have all *known* remaining docker image issues ironed out.

Latest clean push (still running atm):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fbe188bab4b0

* We're running 8 jobs side by side on the integration trees: http://mzl.la/1NH8qCA
** Cpp, Jit, mochitest gl/push, JsReftest, web platform reftest, xpcshell (4 chunks) and crashtest
** Crashtests will show up in the next merge from inbound to central
* bug 1221553 (dustin) - fix upload of artifacts
** It shows under the task inspector, not TH (bug 1218537 & bug 1218537)
* bug 1221661 (dustin) - allow enabling ptrace for the worker so that the crash reporter works
** dumping of crashes still needs work (bug 1233716)
* bug 1223123 (armenzg) - we added a window manager
* jmaher helped fix various test failures and filed bugs for developers to investigate
** FIXED: bug 1224641, bug 1232316 and bug 1232979
* bug 1227637 (dustin) - install the latest patched mesa
* bug 1227657 (armenzg) - Removed Ubuntu's update prompt as it would steal focus
* bug 1228289 (glandium) - Avoid l10n-check overwriting the final package when MOZ_SIMPLE_PACKAGE_NAME is set
* bug 1228416 (dustin) - Redirect gnome-session's output into its own artifact to reduce noise intertwined with Mozharness' execution
* bug 1230330 (armenzg) - Switch from the b2gtest worker type to the desktop-test worker type
** b2gtest workers were running with a capacity of 4, which seemed to be affecting tests

Current issues:
* Bug 1231618 - tc-vcs should change paths.default in repository's hgrc
* Bug 1232407 (armenzg) - Allow starting desktop-test images with VNC if requested
* Bug 1233716 (armenzg) - Fix dumping of crashes in docker containers

Test failures (these will have to be reviewed to determine if they're still valid):
* Bug 1222162 - bc1 issue
* Bug 1224724 - reftest issue
* Bug 1226282 - probably dupe of 1221661
* Bug 1226751 - intermittent issue (hopefully to be backed out)
* Bug 1232981 - bidi/83958-1*.html reftests fail on new linux64 docker container
* Bug 1232983 - border-radius/clipping-6.html is failing on linux64 docker container
* Bug 1232985 - /bugs/321402-4|5|6.xul fail on linux64 when running from a docker container
* Bug 1233054 - Luciddream jobs failing in desktop-test container to load libfreetype.so.6
* Bug 1232980 - many bidi/with-first-letter-*.html reftests fail on new linux64 docker container
* Bug 1233554 - Linux x64 debug crashtest e10s crash in docker image

Innocuous:
* bug 1227652 - pygtk import issue (probably from another utility using python's logging)
Depends on: 1234352
Analysis per suite [1]:
* Web platform tests
** focus issues and time outs
* Luciddream (bug 1233054) - issues with packages
* Mochitests (plain, browser-chrome, devtools, other)
** focus issues and time outs
* Reftests
** lots of issues with pixels differing

There are also crashes happening in some of those jobs.
Similar situations are found for e10s equivalent jobs.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=fbe188bab4b0&group_state=expanded
No longer depends on: 1226751
Depends on: 1237068
Current plan (timeline: 1 week)
* (dustin) We're going to try to run docker-worker on m1.medium
** Bug to be filed
** We want to compare the results against Buildbot's current setup
** m1.medium is single core. xpcshell tests will start working again
** In the future, we will be able to run some jobs on multi-core versus single-core if we would like to
** NOTE: We don't know what we will get when running on m1.medium
* (armenzg) Switch from gnome-session to xsession to match the Buildbot setup
** Also reported better VNC results
** Not too much change wrt test results [1]
* (armenzg) fix the crash that mochitests are hitting (file a bug)
* (armenzg) downgrade mesa
* (armenzg) disable screen saver and locking
* (armenzg) Fix dumping of crashes in docker containers


[1] https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=050d6b6dd77d,c63b24f60dd
Depends on: 1237663
Depends on: 1238948
Depends on: 1239301
Depends on: 1239327
Depends on: 1238739
No longer depends on: 1197642
No longer depends on: 1011171
For the curious, this is where we currently stand:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=3c9a54d68c95

Last week's plan (7 days ago)
* [DONE] (dustin) Run jobs on m1.medium (par with releng)
* [DONE] (armenzg) Switch from gnome-session to xsession to match the Buildbot setup
* [DONE] (armenzg) Disable apport crash reporter (stealing mochitest focus)
* [DONE] (armenzg) downgrade mesa
* [DONE] (armenzg) disable screen saver and locking
* [DONE] (armenzg) Fix dumping of crashes in docker containers

Fixed in the last week:
* [DONE] (armenzg) Split mochitests into 8 chunks
* reviewed current dependencies

Current plan (1-2 weeks timeline)
* Find the root issue for the reftest failures (we assume this needs some more docker image work)
* Investigate and file bugs for current test issues
* Aim to get green all the way with m1.medium

Ongoing:
* Split mochitest-other into mochitest-a11y and mochitest-chrome
* Add gtest
* Filed a11y issues - bug 1239301
* m-8 test-alerts.html - bug 1236036
* e10s crash in crashtest - bug 1233554
* Luciddream (this is an opt *only* job) - bug 1233554
* R1 - bug 1232985
* Build tier1 issue - bug 1231618
* Intermittent m-3 (I don't see it happening anymore; removing dep) - bug 1197642
* Intermittent bc7 (I don't see it happening anymore; removing dep) - bug 1011171
No longer depends on: 1227652
Depends on: 1239766
Depends on: 1240056
Blocks: 1240062
I just did a scan of the tester-specific stuff that we do in PuppetAgain:

 * need to start the window manager session (duh, but it did take us a long time to figure this out)
 * tweaks::fonts
   ---> bug 1240056
 * clean::appstate
   ---> only on OS X
 * EDID data
   ---> only for GPUs
 * gnome-settings-daemon upgrade to 3.4.2-0ubuntu0.6.2 (bug 846348)
   ---> we are already running 3.4.2-0ubuntu0.6.6
 * disable jockey-gtk and deja-dup-monitor (bug 9849444)
   ---> bug 1240084
Depends on: 1240171
Depends on: 1241277
Depends on: 1241280
Depends on: 1241297
Depends on: 1241506
Here's where we are with greenness:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e96bb453f67

We've enabled a lot of jobs on inbound.

I need to spend time filing bugs for R3, R-e10s2 and R-e10s3.
C-e10s is already filed (bug 1233554).
Depends on: 1241942
selena: we need to determine the scope of this bug.

A - make them run on try
B - run side by side *green* as tier-2 on all trunk trees
C - run side by side *green* as tier-1 on all trunk trees
D - replace the Buildbot jobs on trunk trees

At some point, we will have to determine when this project stops being an Engineering Productivity effort and becomes a releng/taskcluster effort.

I'm going to file a bug for D and assume that we're doing B in here [as per dustin [1]].

[1] <dustin> armenzg: for your part, I think it'd be OK if you left things with the tests enabled in TC at tier 2

Current known blockers for debug builds being tier1:
* bug 1174263
* bug 1231320
* bug 1234929
* bug 1231618
Flags: needinfo?(sdeckelmann)
Depends on: 1241979
Depends on: 1242023
Depends on: 1242033
Depends on: 1242502
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #62)
> selena: we need to determine the scope of this bug.
> 
> A - make them run on try
> B - run side by side *green* as tier-2 on all trunk trees
> C - run side by side *green* as tier-1 on all trunk trees
> D - replace the Buildbot jobs on trunk trees
> 
> At some point, we will have to determine when this project stops being an
> Engineering Productivity effort and becomes a releng/taskcluster effort.
> 
> I'm going to file a bug for D and assume that we're doing B in here [as per
> dustin [1]].

Sounds good!  Thank you for going through the details there. 

I'll file a bug for C and make the bugs you mentioned blockers.
Flags: needinfo?(sdeckelmann)
Summary: [tracking] Schedule linux64 desktop tests via TaskCluster on try → [tracking] Schedule linux64 desktop tests via TaskCluster on try and enable with tier2 status
Depends on: 1242682
This is the current breakdown:

To be discussed:
* Pending jobs reports for sheriffs [1]
* Proper integration under Treeherder -> Infra menu [2]
* SETA support

(armenzg) Make TC Linux64 *debug* test jobs tier2
* bug 1232985: /bugs/321402-4|5|6.xul fail on linux64 when running from a docker container
* bug 1233554: Linux x64 debug crashtest e10s crash in docker image (tests/reftest/tests/dom/canvas/crashtests/780392-1.html)
* bug 1241297: wpt-3 e10s always takes as long as the max runtime allows it to
   * bug 1238435: Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html, /html/dom/reflection-grouping.html followed by busting the whole run
* bug 1242033: Linux x64 debug e10s reftest 3 - element-paint-native-widget.html element-paint-native-widget-ref.html
* bug 1242682: Separate dom/media into its own subsuite

Make TC Linux64 *debug* test jobs tier1 (dropping associated Buildbot builders)
* bug 1218537: It's not possible to submit multiple "Job Info" artifacts
* bug 1241280: Fix web platform tests grouping
* bug 1242023: Cannot schedule buildbot bridge builds
   * This would allow us to schedule the jobs on the Buildbot-generated build without having to wait for the L64 debug builds to replace the Buildbot ones

Make TC Linux64 *OPT* test jobs tier1
* bug 1233054: [opt] Luciddream jobs failing in desktop-test container to load libfreetype.so.6

Make TC Linux64 *debug* _BUILD_ jobs tier1
* bug 1231618: (tier1 issue) tc-vcs should change paths.default in repository's hgrc
   * bug 1241111: Allow overriding SOURCE_REV_URL, SOURCE_REPO, SOURCE_CHANGESET

Optimizations:
* Evaluate switching to m3 once we are completely green on m1
   * bug 1235889: time to run taskcluster jobs take 20% longer than buildbot peers


[1]
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
[2] http://people.mozilla.org/~armenzg/sattap/99d8ad71.png
Depends on: 1243005
No longer depends on: 1231618
No longer depends on: 1218537
No longer depends on: 1242023
No longer depends on: 1233054
Blocks: 1235889
No longer depends on: 1235889
The list of dependencies is now up to date.
We need to verify runtimes and intermittent rates as compared to buildbot.
Depends on: 1243039
Depends on: 1243080
Alias: tc-linux64 → tc-linux64-debug
Depends on: 1244233
Depends on: 1244720
After the experiment from this weekend, jmaher discovered a few discrepancies between what we are running and what we should actually be running.
For instance, a lot of e10s jobs were *not* running as e10s due to a bug in the task definitions (the payload was defined twice).
Because of this, our greenness has gone down [1].

In the same fashion as crashtest e10s, we're seeing crashes for:
* m-4
* d-t{1,2,9}

New issues:
* marionette - lots of test issues
* dt8 - it used to take 30mins. It now times out
* Wr - various test failures

e10s issues:
* m10 - test_alerts.html is back
* bc1 - test/alerts/browser_notification_close.js
* bc6 - (intermittent) sessionstore/test/browser_crashedTabs.js and sessionstore/test/browser_579879.js
* dt6 - inspector/test/browser_inspector_initialization.js

Hopefully to be fixed in my next push:
* JP - wrong values in the test definitions
* m-other - I increased the max run time

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c9397ac87d91&filter-searchStr=e10s&group_state=expanded
Depends on: 1244936
Depends on: 1245243
Bug 1245243 might fix most of the e10s issues we're seeing.
catlee asked me to see if I can help with the issues you guys are running into.  Can someone please tell me where to look for the latest state of things?  I have been looking at this bug and bug 1237024 but getting details straight in my mind is pretty difficult...  Thanks!
Thanks, Ehsan! Right now we are waiting on bug 1245243; we suspect this will fix some of the e10s failures we are seeing (at the very least the crashtest one in bug 1233554). On top of that, there is a list of failures to figure out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=aba896a9c4e2

We were waiting for the shared-memory mount increase and then planned on retesting. Keep in mind the m-e10s(dt*) jobs are scheduled on TC but are not greened up on buildbot, so ignore those.

Outside of greening up jobs, we need to resolve how to get stats for the sheriffs on wait times and other machine-related info. Maybe you could allocate a day next week to help look at any remaining mochitest/browser-chrome tests that are still failing after we fix bug 1245243.
Removing bug 1242682 and bug 1243080 as they're not real blockers, just improvements.
No longer depends on: 1242682, 1243080
Depends on: 1246152
Depends on: 1246019
Sadly I only have tomorrow and then will go on vacation.  Can you please ping me when I get back?
(In reply to :Ehsan Akhgari (Away 2/10-2/19) from comment #72)
> Sadly I only have tomorrow and then will go on vacation.  Can you please
> ping me when I get back?

We will; however, we hope to be done by then. Enjoy your break!
Latest push [1]

Status summary:
* Since last week we properly pass the --e10s flag to e10s jobs
** This is why we had a sudden increase of oranges
* We are now using a newer version of docker which allows us to modify /dev/shm (see the sketch below)
** This takes away a couple of crashes on e10s
* Wr is now passing (it is rather intermittent)
* Wr-e10s is now being scheduled and running green

Remaining issues
* Marionette still has a bunch of test failures - bug 1246283
* s/bc1/bc4/ - test/alerts/browser_notification_close.js (bug 1244936)
* Bumping mochitest's timeout from 45 seconds to 90 seconds fixes some tests - bug 1246152 (still to land)
** This is because running mochitests inside of docker is slower
* We will be increasing the global timeout for devtools (fixes the dt8 timeout - bug 1246279)

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4d8d791a8e5d&exclusion_profile=false&group_state=expanded
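A rough sketch of checking the larger /dev/shm locally, assuming Docker 1.10+ (which added the --shm-size flag) and a placeholder image name; this is not the actual docker-worker configuration:

    import subprocess

    # Hypothetical desktop-test image name; the real image is pulled by docker-worker.
    IMAGE = "example/desktop-test:latest"

    # Run the container with a 1 GiB /dev/shm (the Docker default of 64 MB is too small
    # for e10s content processes) and print the mount size to confirm it took effect.
    subprocess.check_call([
        "docker", "run", "--rm", "--shm-size", "1g", IMAGE,
        "df", "-h", "/dev/shm",
    ])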
* m10 e10s issue - bug 1246019
Depends on: 1247033
Depends on: 1227730
Depends on: 1246947
Blocks: 1243024
Depends on: 1247382
Remaining test issues:
* m10 e10s - test_alerts.html - bug 1246019 - bug 1227730
* bc4 e10s - test/alerts/browser_notification_close.js (bug 1244936)
* wpt reftests - bdi-paragraph-level-container.html - bug 1247033

Fixed:
* Marionette - bug 1246283
* Some higher mochitest intermittency - mount the workspace under the host's SSD - bug 1246947
Depends on: 1248028
Status summary:
* bug 1227730 fixes all remaining perma-failures; we're waiting on the developer to get the patch reviewed and landed
* bug 1227637 - the releng hosts have a new mesa and we have to match it

This weekend we would like to run another experiment to compare Buildbot to TaskCluster/docker, since we've mounted ~/workspace on the host's SSD disk.
It seems that mochitest-push has become a mess both for Buildbot and TC
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=mochitest-push&fromchange=ee60dc3d0655&exclusion_profile=false&tochange=c5d6c3e00c91

For TC, it was green until d320678c4fab (4th push on that view).

They're currently all hidden.
I think we should start looking into enabling mochitest-plain and mochitest-browser-chrome while kits gets bug 1227730 fixed.
We can hide m10 and bc4 until then.

This is our current latest *greenest* run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=55098f406c33&group_state=expanded
Everything is running side by side and is visible on Treeherder.

The only work left here is finishing up the documentation and improving how to run this locally.
Depends on: 1245254
Depends on: 1251734
Depends on: 1251693
Nine months later, all dependencies have been solved.
We're now running L64 debug test jobs as tier-2 on most integration/trunk repos (except a few - bug 1252471).

We can now close this and deal with other platforms.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Congrats! And you managed to get it done before you turned into a pumpkin. :)
Awesome, that is an impressive dependency tree!
Depends on: 1270885