Open Bug 1565332 (debian-10) Opened 4 months ago Updated 4 days ago

Update desktop1604-test image to use Debian 10 (buster)

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(Not tracked)

People

(Reporter: egao, Assigned: egao)

References

(Depends on 6 open bugs, Blocks 2 open bugs)

Details

(Keywords: leave-open)

Attachments

(5 files, 1 obsolete file)

Update the docker image currently used for Ubuntu 16.04 tests to use Ubuntu 18.04.

Assignee: nobody → egao

I know you spent time on this, but I think we shouldn't do this. We should instead switch the test image to Debian. Why? Because we can actually reproducibly build Debian images, but we can't do that for Ubuntu. Not being able to do that means things can break whenever some unrelated change triggers a new docker image build, which happened a lot until we finally bailed and used a hack (bug 1503756).

:glandium - thanks for your input. I did spend some time on this, but not a significant amount (it was on the side while I did other things).

To confirm, you are proposing that we migrate our linux32/linux64 testing to run on the Debian platform, correct? That doesn't sound like too much work, though I might be underestimating the difficulty here.

:jmaher - what do you think? Migrating from one distro of linux to another is not a small change. Who do we have signing off on these sorts of decisions?

Flags: needinfo?(jmaher)

I agree that getting a reliable Ubuntu image via docker is more difficult than it should be. The question is what we need for test coverage. It is my understanding that Firefox is the default browser for Ubuntu distributions, and it is hard to tell for Fedora. I wasn't able to easily get numbers on the breakdown of Linux distributions, which could help drive the decision.

A few things to consider:

  1. is it easy to get x86 libraries installed in Fedora as we currently do in Ubuntu?
  2. are there issues getting DRM installed in order to test media playback?
  3. what version of Fedora would we use?
  4. how often do we need to update it? - Fedora only supports software for ~13 months, whereas Ubuntu LTS supports for 5 years
  5. we should have parity with our hardware installs (driver support, install scripts) - need relops
  6. possibly bitbar android docker container could be the same? maybe packet.net host os?

Currently we upgrade every few years, which isn't ideal but on a more regular schedule we would need a team to officially support this, both as a docker image and as hardware installs.

:dhouse, in terms of installation on physical hardware in the datacenter (currently the moonshots) do you have preferences or concerns with either Ubuntu or Fedora?

Flags: needinfo?(jmaher) → needinfo?(dhouse)

Why are you talking of Fedora when I was talking about Debian?

Flags: needinfo?(jmaher)

Sorry, Debian has 2-year windows for LTS support, which is better than Fedora but not as long as Ubuntu's. The rest of my questions still stand.

Flags: needinfo?(jmaher)

Debian actually has longer LTS support in some cases, and it's always possible to stay on older versions - we're currently using Debian 7 for build tasks, for example.

Also note that LTS support in Ubuntu is what has broken us on many occasions, because they like to upgrade stuff. Because of that, in practice, we haven't upgraded the still-supported Ubuntu 16.04 image for > 6 months.

BTW, extended LTS support for Debian 7 finished this year, after 6 years. The just released Debian 10 is set to be supported for 5 years.

One of the issues I've identified with :jmaher in our discussion is that, as far as I am aware, there isn't a single point of authority that makes the call on which distribution, and which version of that distribution, to use.

I see two factors that will have an impact on the decision to use a certain distribution/version of an operating system:

  1. ease of automation - deterministic builds, package availability, driver stability, etc.
  2. representativeness - based on usage numbers for the given distribution/version

Personally, I am inclined to value the latter, which would mean sacrificing some of the advantages of using Debian (noted by :glandium). Several factors play a role in my reasoning.

Popular usage

Let me preface this by noting that Linux distribution statistics are difficult, if not impossible, to find, so estimates of the market share of various distributions are a combination of anecdotal evidence, sample polls, and general interest.

With that said, it's generally agreed that Ubuntu is the dominant Linux distribution, and this is backed up by various data sources.

It isn't known whether 18.04 has a majority within Ubuntu, but it's likely a fair assumption since it's the latest LTS release. I was not able to retrieve data on this, despite spending an hour or so with telemetry data.

Familiarity

Current CI seems to have been designed and written assuming Ubuntu, so there's a lot of familiarity with using Ubuntu as the base image. Debian is similar, but different enough that it may cause issues.

Driver support

Admittedly I am not 100% certain about this point, but if tests are ever run on hardware machines or gpu-accelerated machines the difference in driver availability might be problematic.

Package support

There are Debian equivalents to some of the packages that are required, but Ubuntu packages it nicely in a metapackage. For example, see multiverse and ubuntu-restricted-extras.

I'll be posting a discussion at mozilla.dev.platform in the coming days to gather feedback and comments regarding this proposal.

From the lack of response at https://groups.google.com/forum/#!topic/mozilla.dev.platform/HCYoPiBUi8M, I take it that there isn't really a strong case to be made for switching the distribution from Ubuntu despite its pitfalls.

I will wait another week prior to making a call, given that there seems to be no one that is responsible for making a final decision.

(In reply to Edwin Gao (:egao) from comment #10)

From the lack of response at https://groups.google.com/forum/#!topic/mozilla.dev.platform/HCYoPiBUi8M, I take it that there isn't really a strong case to be made for switching the distribution from Ubuntu despite its pitfalls.

I'd take the opposite conclusion from that, i.e. no one has strong opinions, and glandium already expressed his here (which I agree with).

I'm ni?ing :RyanVM for input from release management.

Flags: needinfo?(ryanvm)

I don't have a strong opinion on this either assuming Debian is able to run all the test suites we need it to run.

In general, I think our standards for Linux testing have been lower under the assumption that most users are getting their builds from distros anyway and the number of combinations of components in the wild is mind-bogglingly huge. Also, I don't recall the decision for changing the base OS for Linux tests in the past being one that went outside the various teams responsible for maintaining our automated test infrastructure.

So I guess my tl;dr is to say that going with whatever makes the most sense and is easiest to maintain going forward sounds like the reasonable option here and I don't see any reason to avoid the change based on what's been said here and (not) said on dev.platform.

Flags: needinfo?(ryanvm)

Intent
My intent is to spend two weeks (maximum) to bring the Debian tests to a similar state as Ubuntu 18.04.

Current state of Debian push: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=cd2e7656a270507ca2beb6a7373c0ec6b334c3eb
Current state of Ubuntu 18.04: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=35e5ff46e184fff0305a891a216c31e41aac6895&selectedJob=257973741

Reason
After a similar amount of effort spent bootstrapping Ubuntu 18.04 and Debian 10 images, there is a discrepancy in how suitable the resulting images are for running tests.

Some of the challenges faced are:

  • the Debian image can be built, but many dependencies are missing (e.g. alsa-base, ubuntu-restricted-extras)
  • tweaks are needed in the test harnesses and scripts (e.g. test-linux.sh)
  • lack of a window manager

Once the underlying issues (resulting from switching to Debian) are resolved, the image can be considered on an equal footing with Ubuntu 18.04.

What may happen
If I am not able to produce a working image that is ready to run a test within two weeks of full-time work, then all work will revert to Ubuntu 18.04, which provides a nearly ready-to-use image after a couple of hours of work.

FWIW, ubuntu-restricted-extras doesn't provide much that is useful to Firefox. Only libavcodec-extra, which it depends on, is.

Firefox doesn't support ALSA anymore, so alsa-base shouldn't be necessary.

So far, mixed outcomes.

Initial focus has been on getting the mochitest suites to run. In the initial baseline push https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=257490869&revision=cd2e7656a270507ca2beb6a7373c0ec6b334c3eb it is possible to see that nearly all of the mochitest suites fail due to either:

  • pactl list short modules subprocess call returning an error
  • pactl load-module module-null-sink call in test-linux.sh returning an error

The former scenario means the mochitest test harness has initialized, parsed the manifest and performed TEST-SKIP on annotated tests. We're getting further with this scenario.

The latter scenario is something I cannot seem to resolve. I've ensured that pulseaudio is installed, but it seems to have difficulty initializing.

Example of the push can be seen here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=10fa9e6091f5592dd290312dfea25e9b1a0494af

pulseaudio apparently starts, and the first pactl load-module module-null-sink works. So something might be killing pulseaudio later.

It's also worth noting that a few things are missing: dbus-launch (package dbus-x11), gnome-keyring-daemon (package gnome-keyring), and compiz (though I would advise using something other than compiz, because it's no longer in prominent use).
The script is also setting DESKTOP_SESSION=ubuntu; not sure what side effect that might have...

Also, you'll probably want to do something about bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8). The locales package is installed, but nothing is set in /etc/locale.gen. So you need something like echo en_US.UTF-8 UTF-8 > /etc/locale.gen ; dpkg-reconfigure --frontend=noninteractive locales.
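For a docker image, the locale fix described above can be sketched as the following Dockerfile fragment. This is a sketch, not the actual patch; it only assumes the locales package was installed in an earlier layer.

```dockerfile
# Generate and activate en_US.UTF-8; assumes the `locales` package is
# already installed by an earlier RUN instruction.
RUN echo 'en_US.UTF-8 UTF-8' > /etc/locale.gen \
    && dpkg-reconfigure --frontend=noninteractive locales
ENV LC_ALL=en_US.UTF-8 \
    LANG=en_US.UTF-8
```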

pulseaudio is exiting on its own:

I: [pulseaudio] module-suspend-on-idle.c: Sink null idle for too long, suspending ...
D: [pulseaudio] sink.c: null: suspend_cause: (none) -> IDLE
D: [pulseaudio] sink.c: null: state: IDLE -> SUSPENDED
D: [pulseaudio] source.c: null.monitor: suspend_cause: (none) -> IDLE
D: [pulseaudio] source.c: null.monitor: state: IDLE -> SUSPENDED
D: [pulseaudio] core.c: Hmm, no streams around, trying to vacuum.
I: [pulseaudio] module-device-restore.c: Synced.
I: [pulseaudio] core.c: We are idle, quitting...
I: [pulseaudio] main.c: Daemon shutdown initiated.

There's a --exit-idle-time option that could be passed to avoid this.
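A minimal sketch of that startup, as it might appear in test-linux.sh (flag placement is an assumption, not the actual patch):

```shell
# Disable the idle-exit timer so "We are idle, quitting..." can no
# longer fire between test chunks (-1 means never exit on idle; a
# positive value is a timeout in seconds).
pulseaudio --daemonize --start --exit-idle-time=-1
pactl load-module module-null-sink
```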

I found that the Debian image was having issues setting the LC_ALL=en_US.UTF-8 locale which caused other cascading failures.

After installing the required dependencies like gnome-keyring and dbus-x11, and making the docker container generate and set the locales, at least a couple of the mochitest subsuites now run to completion:

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=98be45094ed0267dae5b32175620c09acb3282e4

I will attempt to move the pulseaudio-related initialization in test-linux.sh back into a function, and add the extra argument --exit-idle-time with a value of 600.

That seems to help, though a bunch of media tests still fail due to pactl not being initialized when the tests are run.
However, just prior to the failure in the mochitest harness with pactl, there is this peculiar error that is not observed on Ubuntu 16.04:

(gst-launch-1.0:866): GStreamer-CRITICAL **: 05:09:26.190: gst_object_unref: assertion '((GObject *) object)->ref_count > 0' failed

I have reason to suspect that gstreamer1.0, the current version, is not compatible or has significantly changed behavior.
Currently I am attempting to restore some of the Debian jessie repositories so that I can install gstreamer0.10 and see how the tests behave.

You should probably use -1 as a value for exit-idle-time.

Since the last comment, the following were done:

  • enable jessie repositories in apt
  • install 0.10 version of gstreamer and relevant libraries

Initially, the mochitest harness was not able to locate the gst-launch-0.10 binary, because a couple of places (such as this line) were looking for gst-launch-0.1 (note the missing 0), leading to:

[task 2019-08-01T18:07:13.280Z] 18:07:13     INFO -  usage: runtests.py [options] [test paths]
[task 2019-08-01T18:07:13.280Z] 18:07:13     INFO -  runtests.py: error: Missing gst-launch-{0.1,1.0}, required for --use-test-media-devices
[task 2019-08-01T18:07:13.336Z] 18:07:13    ERROR - Return code: 2
[task 2019-08-01T18:07:13.338Z] 18:07:13    ERROR - No checks run.
[task 2019-08-01T18:07:13.339Z] 18:07:13    ERROR - No suite end message was emitted by this harness.
[task 2019-08-01T18:07:13.339Z] 18:07:13     INFO - TinderboxPrint: mochitest-mochitest-plain-chunked<br/><em class="testfail">T-FAIL</em>
[task 2019-08-01T18:07:13.340Z] 18:07:13    ERROR - # TBPL FAILURE #
[task 2019-08-01T18:07:13.341Z] 18:07:13  WARNING - setting return code to 2
[task 2019-08-01T18:07:13.341Z] 18:07:13    ERROR - The mochitest suite: mochitest-plain-chunked ran with return status: FAILURE

Once the string is corrected, the issue with gstreamer not being found is resolved.

That brings the status back to mochitest suites failing with pactl not found errors.

I will take another look at the dependencies and how pulseaudio is being initialized.

So far, I haven't had luck in having pulseaudio/pactl remain initialized when the test harness is run.

When lucky, pactl initialization passes and the harness begins running tests:

[task 2019-08-06T20:04:38.368Z] 20:04:38     INFO - Running manifest: browser/components/extensions/test/mochitest/mochitest.ini
[task 2019-08-06T20:04:38.765Z] 20:04:38     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-06T20:04:38.765Z] 20:04:38     INFO -  Pipeline is PREROLLING ...
[task 2019-08-06T20:04:38.766Z] 20:04:38     INFO -  Pipeline is PREROLLED ...
[task 2019-08-06T20:04:38.766Z] 20:04:38     INFO -  Setting pipeline to PLAYING ...
[task 2019-08-06T20:04:38.766Z] 20:04:38     INFO -  New clock: GstSystemClock
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Got EOS from element "pipeline0".
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Execution ended after 33416930 ns.
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Setting pipeline to READY ...
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Setting pipeline to NULL ...
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  Freeing pipeline ...
[task 2019-08-06T20:04:38.802Z] 20:04:38     INFO -  /usr/bin/pactl
[task 2019-08-06T20:04:38.809Z] 20:04:38     INFO -  0	module-device-restore
[task 2019-08-06T20:04:38.810Z] 20:04:38     INFO -  1	module-stream-restore
[task 2019-08-06T20:04:38.810Z] 20:04:38     INFO -  2	module-card-restore
[task 2019-08-06T20:04:38.811Z] 20:04:38     INFO -  3	module-augment-properties
[task 2019-08-06T20:04:38.811Z] 20:04:38     INFO -  4	module-udev-detect
[task 2019-08-06T20:04:38.812Z] 20:04:38     INFO -  6	module-native-protocol-unix
[task 2019-08-06T20:04:38.812Z] 20:04:38     INFO -  7	module-default-device-restore
[task 2019-08-06T20:04:38.813Z] 20:04:38     INFO -  8	module-rescue-streams
[task 2019-08-06T20:04:38.813Z] 20:04:38     INFO -  9	module-always-sink
[task 2019-08-06T20:04:38.814Z] 20:04:38     INFO -  11	module-intended-roles
[task 2019-08-06T20:04:38.814Z] 20:04:38     INFO -  12	module-suspend-on-idle
[task 2019-08-06T20:04:38.815Z] 20:04:38     INFO -  13	module-position-event-sounds
[task 2019-08-06T20:04:38.815Z] 20:04:38     INFO -  14	module-filter-heuristics
[task 2019-08-06T20:04:38.816Z] 20:04:38     INFO -  15	module-filter-apply
[task 2019-08-06T20:04:38.816Z] 20:04:38     INFO -  16	module-switch-on-port-available
[task 2019-08-06T20:04:38.816Z] 20:04:38     INFO -  17	module-null-sink
[task 2019-08-06T20:04:38.822Z] 20:04:38     INFO -  0	module-device-restore
[task 2019-08-06T20:04:38.823Z] 20:04:38     INFO -  1	module-stream-restore
[task 2019-08-06T20:04:38.823Z] 20:04:38     INFO -  2	module-card-restore
[task 2019-08-06T20:04:38.823Z] 20:04:38     INFO -  3	module-augment-properties
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  4	module-udev-detect
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  6	module-native-protocol-unix
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  7	module-default-device-restore
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  8	module-rescue-streams
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  9	module-always-sink
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  11	module-intended-roles
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  12	module-suspend-on-idle
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  13	module-position-event-sounds
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  14	module-filter-heuristics
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  15	module-filter-apply
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  16	module-switch-on-port-available
[task 2019-08-06T20:04:38.826Z] 20:04:38     INFO -  17	module-null-sink
[task 2019-08-06T20:04:39.161Z] 20:04:39     INFO -  pk12util: PKCS12 IMPORT SUCCESSFUL

Normally however, the following occurs:

[task 2019-08-06T21:23:01.106Z] 21:23:01     INFO - Running manifest: browser/components/extensions/test/mochitest/mochitest.ini
[task 2019-08-06T21:23:01.489Z] 21:23:01     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-06T21:23:01.489Z] 21:23:01     INFO -  Pipeline is PREROLLING ...
[task 2019-08-06T21:23:01.490Z] 21:23:01     INFO -  Pipeline is PREROLLED ...
[task 2019-08-06T21:23:01.490Z] 21:23:01     INFO -  Setting pipeline to PLAYING ...
[task 2019-08-06T21:23:01.491Z] 21:23:01     INFO -  New clock: GstSystemClock
[task 2019-08-06T21:23:01.527Z] 21:23:01     INFO -  Got EOS from element "pipeline0".
[task 2019-08-06T21:23:01.527Z] 21:23:01     INFO -  Execution ended after 33426459 ns.
[task 2019-08-06T21:23:01.527Z] 21:23:01     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-06T21:23:01.528Z] 21:23:01     INFO -  Setting pipeline to READY ...
[task 2019-08-06T21:23:01.528Z] 21:23:01     INFO -  Setting pipeline to NULL ...
[task 2019-08-06T21:23:01.528Z] 21:23:01     INFO -  Freeing pipeline ...
[task 2019-08-06T21:23:01.528Z] 21:23:01     INFO -  /usr/bin/pactl
[task 2019-08-06T21:23:01.535Z] 21:23:01     INFO -  Connection failure: Connection refused
[task 2019-08-06T21:23:01.536Z] 21:23:01     INFO -  pa_context_connect() failed: Connection refused
[task 2019-08-06T21:23:01.536Z] 21:23:01     INFO -  Traceback (most recent call last):
[task 2019-08-06T21:23:01.537Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3176, in <module>
[task 2019-08-06T21:23:01.537Z] 21:23:01     INFO -      sys.exit(cli())
[task 2019-08-06T21:23:01.537Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3172, in cli
[task 2019-08-06T21:23:01.538Z] 21:23:01     INFO -      return run_test_harness(parser, options)
[task 2019-08-06T21:23:01.538Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3157, in run_test_harness
[task 2019-08-06T21:23:01.538Z] 21:23:01     INFO -      result = runner.runTests(options)
[task 2019-08-06T21:23:01.539Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2660, in runTests
[task 2019-08-06T21:23:01.539Z] 21:23:01     INFO -      res = self.runMochitests(options, tests_in_manifest)
[task 2019-08-06T21:23:01.540Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2440, in runMochitests
[task 2019-08-06T21:23:01.540Z] 21:23:01     INFO -      result = self.doTests(options, testsToRun)
[task 2019-08-06T21:23:01.540Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2721, in doTests
[task 2019-08-06T21:23:01.540Z] 21:23:01     INFO -      devices = findTestMediaDevices(self.log)
[task 2019-08-06T21:23:01.541Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 814, in findTestMediaDevices
[task 2019-08-06T21:23:01.541Z] 21:23:01     INFO -      if not null_sink_loaded():
[task 2019-08-06T21:23:01.541Z] 21:23:01     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 810, in null_sink_loaded
[task 2019-08-06T21:23:01.542Z] 21:23:01     INFO -      [pactl, 'list', 'short', 'modules'])
[task 2019-08-06T21:23:01.542Z] 21:23:01     INFO -    File "/usr/lib/python2.7/subprocess.py", line 223, in check_output
[task 2019-08-06T21:23:01.543Z] 21:23:01     INFO -      raise CalledProcessError(retcode, cmd, output=output)
[task 2019-08-06T21:23:01.543Z] 21:23:01     INFO -  subprocess.CalledProcessError: Command '['/usr/bin/pactl', 'list', 'short', 'modules']' returned non-zero exit status 1
[task 2019-08-06T21:23:01.571Z] 21:23:01    ERROR - Return code: 1

The condition that differentiates the lucky instance from the normal instance is unknown. Even for the same revision on try, one instance of a test may successfully initialize pactl while another may fail and throw the subprocess error.

Your attempts that I found on try don't set the exit timeout for pulseaudio.

(In reply to Mike Hommey [:glandium] from comment #25)

Your attempts that I found on try don't set the exit timeout for pulseaudio.

I've had many pushes; in the recent ones (today) I removed the exit timeout to go back to a known state where it somewhat worked.

The default idle exit timeout is too short. See comment 22.

The problem does not appear to be that the exit timeout is too short: I made the change to try -1 as the value last week, but it failed to keep pactl and pulseaudio initialized. For an example from last week, see https://hg.mozilla.org/try/rev/2e099b2a9f8d71e8c7e36b5abd778f4402b116f8.

What worked was :tomprince's suggestion to use 0 as the timeout in test-linux.sh, possibly combined with a modification to mochitest/runtests.py.

Changes involved in the file:

  • wrap the subprocess.check_call call in a try/except block that returns a boolean
  • an additional call to start, daemonize, and set the exit timer prior to pactl load-module module-null-sink

Ideally, instead of initializing again in runtests.py, the better practice would be to check the daemon's status using pulseaudio --check, and execute the required actions based on the outcome of that call.
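A sketch of the try/except change described above (the function mirrors what the comment describes, but the check_output parameter is an illustrative addition for testability, not part of the actual runtests.py patch):

```python
import subprocess


def null_sink_loaded(pactl="/usr/bin/pactl", check_output=subprocess.check_output):
    """Return True if pactl reports module-null-sink among the loaded modules.

    The subprocess call is wrapped in try/except so that a dead pulseaudio
    daemon yields False (letting the caller re-initialize the daemon)
    instead of an unhandled CalledProcessError aborting the whole harness.
    """
    try:
        output = check_output([pactl, "list", "short", "modules"])
    except (subprocess.CalledProcessError, OSError):
        return False
    return b"module-null-sink" in output
```

In the real harness the caller would, on a False return, run pulseaudio --check and restart/daemonize the daemon before retrying pactl load-module module-null-sink.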

Regardless, with this change I've gotten some mochitest suites to green status: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=a55741c38bce60fe642d78b9213703296869bfc5

Using mozilla-central 2f9fcfd57416a8424ff12a11c9734ee9a2fb6ed0 as baseline, roughly half the tests are green:

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=119fc9a78c7c62c7bace963a4bb88673ee9cab51

This will provide a good starting point to begin filing bugs for developers to address.

Summary: Update desktop1604-test image to use Ubuntu 18.04 → Update desktop1604-test image to use Debian 10 (buster)

Cancel :dhouse's needinfo.

Flags: needinfo?(dhouse)

Note bug 1562627 is a similar-ish pulseaudio issue that's already happening on Ubuntu 16.04.

(In reply to Ryan VanderMeulen [:RyanVM][PTO Aug 5-9] from comment #12)

I don't have a strong opinion on this either assuming Debian is able to run all the test suites we need it to run.

In general, I think our standards for Linux testing have been lower under the assumption that most users are getting their builds from distros anyway and the number of combinations of components in the wild is mind-bogglingly huge. Also, I don't recall the decision for changing the base OS for Linux tests in the past being one that went outside the various teams responsible for maintaining our automated test infrastructure.

So I guess my tl;dr is to say that going with whatever makes the most sense and is easiest to maintain going forward sounds like the reasonable option here and I don't see any reason to avoid the change based on what's been said here and (not) said on dev.platform.

+1 Sorry I missed the NI about Ubuntu vs {insert base distro here} on hardware.

I think we had picked Ubuntu for the large user base and the simplicity of getting drivers and packages. But I don't think we are using anything Ubuntu-specific, and we could move closer to upstream and use Debian on the hardware as well (I'd like to be using the same distro for the docker worker and the hardware so that we can share some knowledge and config).

jmaher, what do you think? Would it benefit us to switch the perf tests to match the docker worker?

Flags: needinfo?(jmaher)

I would prefer all our linux test machines use the same image. This means:

  • 32-bit linux AWS
  • 64-bit linux AWS
  • linux hardware in the datacenter
  • (possibly bitbar)
  • (possibly the packet.net base image for emulators)

Flags: needinfo?(jmaher)
Blocks: 1572739
Keywords: leave-open

I have a prototype patch attached to this bug that adds the necessary piping for Debian 10 to be built and used on CI as a drop-in replacement for Ubuntu 16.04.

Please note that since this is still a prototype, I'm sure there are inefficiencies and unnecessary steps in the scripts, but this patch will put our CI infra in a state where Debian 10 can be built, used for test images, and have some test suites pass.

My goal is to have this patch in mozilla-central so that I can focus on greening test suites, filing bugs for developers, and directing them to make a few changes to enable testing on Debian 10. Once a stable, green baseline is achieved, the Dockerfile and test system setup script can be optimized so that unnecessary packages, files, and such are not included.

Attachment #9087809 - Attachment is obsolete: true
Depends on: 1503785

:kinetik - I'm not sure who else I could reach out to regarding the pulseaudio issues that I'm still running into; I cannot seem to get it to reliably start and remain running in my test image, even after the fixes in bug 1572311 (setting pulseaudio --exit-idle-time=-1). I've kept that bug closed since this is not a GTest-specific question.

GTest - https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=fb9298e8a6a5811b82cdb60aa8448ec6088d597a
Other suites (mochitest in particular): https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=263525493&revision=3618c4ddabcba50c66b937a2be770354455a941b

When test-linux.sh initializes pulseaudio, the following is the log:

[task 2019-08-26T19:34:43.889Z] + pulseaudio --fail --daemonize --start -vvvv
[task 2019-08-26T19:34:43.912Z] D: [pulseaudio] conf-parser.c: Parsing configuration file '/etc/pulse/client.conf'
[task 2019-08-26T19:34:43.912Z] D: [pulseaudio] conf-parser.c: Parsing configuration file '/etc/pulse/client.conf.d/00-disable-autospawn.conf'
[task 2019-08-26T19:34:43.952Z] I: [pulseaudio] main.c: Daemon startup successful.
[task 2019-08-26T19:34:43.953Z] + pulseaudio --check
[task 2019-08-26T19:34:43.960Z] + '[' 0 -eq 0 ']'
[task 2019-08-26T19:34:43.960Z] + echo 'Pulseaudio successfully initialized'
[task 2019-08-26T19:34:43.960Z] Pulseaudio successfully initialized
[task 2019-08-26T19:34:43.960Z] + pactl load-module module-null-sink
[task 2019-08-26T19:34:43.969Z] 16

When the test is then run in the harness, despite my added code to check for pulseaudio and perform the same initialization as in test-linux.sh if it is not running, the following is the output:

[task 2019-08-26T19:35:58.029Z] 19:35:58     INFO - Running manifest: accessible/tests/browser/browser.ini
[task 2019-08-26T19:35:58.306Z] 19:35:58     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-26T19:35:58.306Z] 19:35:58     INFO -  libv4l2: error getting pixformat: Invalid argument
[task 2019-08-26T19:35:58.307Z] 19:35:58     INFO -  Pipeline is PREROLLING ...
[task 2019-08-26T19:35:58.307Z] 19:35:58     INFO -  Pipeline is PREROLLED ...
[task 2019-08-26T19:35:58.307Z] 19:35:58     INFO -  Setting pipeline to PLAYING ...
[task 2019-08-26T19:35:58.307Z] 19:35:58     INFO -  New clock: GstSystemClock
[task 2019-08-26T19:35:58.335Z] 19:35:58     INFO -  Got EOS from element "pipeline0".
[task 2019-08-26T19:35:58.335Z] 19:35:58     INFO -  Execution ended after 33401858 ns.
[task 2019-08-26T19:35:58.335Z] 19:35:58     INFO -  Setting pipeline to PAUSED ...
[task 2019-08-26T19:35:58.335Z] 19:35:58     INFO -  Setting pipeline to READY ...
[task 2019-08-26T19:35:58.335Z] 19:35:58     INFO -  Setting pipeline to NULL ...
[task 2019-08-26T19:35:58.336Z] 19:35:58     INFO -  Freeing pipeline ...
[task 2019-08-26T19:35:58.344Z] 19:35:58     INFO -  Connection failure: Connection refused
[task 2019-08-26T19:35:58.344Z] 19:35:58     INFO -  pa_context_connect() failed: Connection refused
[task 2019-08-26T19:35:58.351Z] 19:35:58     INFO -  D: [pulseaudio] conf-parser.c: Parsing configuration file '/etc/pulse/client.conf'
[task 2019-08-26T19:35:58.352Z] 19:35:58     INFO -  D: [pulseaudio] conf-parser.c: Parsing configuration file '/etc/pulse/client.conf.d/00-disable-autospawn.conf'
[task 2019-08-26T19:35:58.353Z] 19:35:58     INFO -  N: [pulseaudio] main.c: User-configured server at {689cfabc30776e6bfe2e7477e81eaa6d}unix:/tmp/pulse-qrpFpnpvYwVl/native, which appears to be local. Probing deeper.
[task 2019-08-26T19:35:58.354Z] 19:35:58     INFO -  I: [pulseaudio] main.c: Daemon startup successful.
[task 2019-08-26T19:35:58.360Z] 19:35:58     INFO -  Connection failure: Connection refused
[task 2019-08-26T19:35:58.360Z] 19:35:58     INFO -  pa_context_connect() failed: Connection refused
[task 2019-08-26T19:35:58.360Z] 19:35:58     INFO -  Traceback (most recent call last):
[task 2019-08-26T19:35:58.360Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3191, in <module>
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -      sys.exit(cli())
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3187, in cli
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -      return run_test_harness(parser, options)
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3172, in run_test_harness
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -      result = runner.runTests(options)
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2675, in runTests
[task 2019-08-26T19:35:58.361Z] 19:35:58     INFO -      res = self.runMochitests(options, tests_in_manifest)
[task 2019-08-26T19:35:58.362Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2454, in runMochitests
[task 2019-08-26T19:35:58.362Z] 19:35:58     INFO -      result = self.doTests(options, testsToRun)
[task 2019-08-26T19:35:58.364Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2736, in doTests
[task 2019-08-26T19:35:58.364Z] 19:35:58     INFO -      devices = findTestMediaDevices(self.log)
[task 2019-08-26T19:35:58.365Z] 19:35:58     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 830, in findTestMediaDevices
[task 2019-08-26T19:35:58.366Z] 19:35:58     INFO -      'module-null-sink'
[task 2019-08-26T19:35:58.367Z] 19:35:58     INFO -    File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
[task 2019-08-26T19:35:58.368Z] 19:35:58     INFO -      raise CalledProcessError(retcode, cmd)
[task 2019-08-26T19:35:58.368Z] 19:35:58     INFO -  subprocess.CalledProcessError: Command '['/usr/bin/pactl', 'load-module', 'module-null-sink']' returned non-zero exit status 1
[task 2019-08-26T19:35:58.389Z] 19:35:58    ERROR - Return code: 1

Would you have any ideas or hints for me to try? As noted in bug 1572311, changing the timer to -1 seems to fix GTest, but only about 50% of the time. If you know someone who might be better suited to help, let me know - as this bug thread shows, this pulseaudio issue has plagued my efforts from the beginning.
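For what it's worth, the failing call in the traceback above is a bare `subprocess.check_call` of `pactl load-module module-null-sink`. One workaround sketch (an assumption, not the actual harness code: it simply retries in case the daemon needs a moment before accepting connections; `runner` and `sleep` are injectable for testing):

```python
import subprocess
import time

def load_null_sink(runner=subprocess.check_call, retries=5, delay=1.0,
                   sleep=time.sleep):
    """Retry `pactl load-module module-null-sink` until the daemon answers.

    Sketch only: assumes the Connection refused errors come from racing
    the daemon's startup, so a short retry loop may be enough.
    """
    cmd = ['/usr/bin/pactl', 'load-module', 'module-null-sink']
    for attempt in range(retries):
        try:
            runner(cmd)
            return True
        except subprocess.CalledProcessError:
            if attempt == retries - 1:
                raise  # give up: surface the real failure to the caller
            sleep(delay)
    return False
```

This only papers over a startup race, of course; it would not help if the daemon has actually exited.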

Flags: needinfo?(kinetik)
See Also: → 1531916

I don't have any specific advice... Starting PA from multiple places seems like the wrong approach. If test-linux.sh is responsible for setting the environment up, PA should be started there, once, and nowhere else. If PA is exiting after that, the reason should be present in the logs, so configure PA to run with verbose logging to somewhere, then examine the logs to find out what caused the exit and address that. We should be able to use the same code path on Debian and Ubuntu - any differences in PA behaviour should then be normalized by using the same command line and config everywhere.

Flags: needinfo?(kinetik)
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4d4271a0ccad
add new dockerfile for debian 10 (buster) test image and add necessary piping without switching the main CI pipeline from ubuntu 16.04 r=jmaher

Backed out changeset 4d4271a0ccad (bug 1565332) for Android Mochitest failures on a CLOSED TREE.

Backout link: https://hg.mozilla.org/integration/autoland/rev/f6714b862df3110ec4435d152ddbdbe6f3555527

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=4d4271a0ccad1e5a561ca93bd4144767cf0503a4&selectedJob=264342207

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=264342207&repo=autoland&lineNumber=525

Log snippet:
[task 2019-08-30T22:33:31.146Z] script.py exitcode 127
[taskcluster 2019-08-30T22:33:31.160Z] Exit Code: 1
[taskcluster 2019-08-30T22:33:31.160Z] User Time: 307.239ms
[taskcluster 2019-08-30T22:33:31.160Z] Kernel Time: 113.971ms
[taskcluster 2019-08-30T22:33:31.160Z] Wall Time: 11.883975031s
[taskcluster 2019-08-30T22:33:31.160Z] Result: FAILED
[taskcluster 2019-08-30T22:33:31.160Z] === Task Finished ===
[taskcluster 2019-08-30T22:33:31.161Z] Task Duration: 11.885089203s
[taskcluster 2019-08-30T22:33:31.682Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/Gr-uUzz9TTidG7B3E_go7Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-08-29T22:28:11.106Z
[taskcluster:error] exit status 1

Flags: needinfo?(egao)
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a425cc70e4de
add new dockerfile for debian 10 (buster) test image and add necessary piping without switching the main CI pipeline from ubuntu 16.04 r=jmaher
Attachment #9089490 - Attachment description: Bug 1565332: Add option to run linux desktop tests on debian; → Bug 1565332 - add option to toggle linux desktop tests to run on debian 10

The base debian10 patch is now merged into mozilla-central.

Two odd issues remain:

  • failed suites report success (non-zero exit codes are overridden to 0); this is addressed by Attachment 9090622 [details].
  • GTest still suffers from intermittent `error initializing cubeb library` failures despite pactl, pulseaudio and pacmd functioning; this occurs in roughly 30% of pushes

Other concerns:

  • desktop environment differences may be responsible for multiple failures
Flags: needinfo?(egao)
Attachment #9090622 - Attachment description: Bug 1565332 - restore set -e after the debian-specific block in test-linux.sh → Bug 1565332 - restore set -e in the debian-specific block in test-linux.sh
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7c9167e4f8fb
add option to toggle linux desktop tests to run on debian 10 r=ahal
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ccf8a603df02
restore set -e in the debian-specific block in test-linux.sh r=gbrown
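For context on the `set -e` fix: a minimal sketch (illustrative only, not the actual test-linux.sh contents) of how a distro-specific block can tolerate its own failures while still restoring fail-fast behaviour afterwards, so later non-zero exit codes are not silently swallowed:

```shell
#!/bin/sh
# Illustrative sketch only: a distro-specific block may need to tolerate
# failures, but `set -e` must be restored after it, or later genuine
# failures no longer abort the script (suites would "succeed" with status 0).
set -e

set +e                                  # tolerate failure in this block only
maybe_missing_command() { return 1; }   # stand-in for an optional setup step
maybe_missing_command
optional_status=$?
set -e                                  # the fix: restore fail-fast here

echo "optional step exited with status $optional_status"
# From this point on, any failing command aborts the script again.
```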

According to my work in bug 1572311 comment 24 and a few other try pushes, we can safely conclude the following:

The next courses of action are:

  1. investigate whether pulseaudio initialization should be stripped from test-linux.sh;
    • if so, decide where it should live instead:
      • run-task
      • desktop_unittest.py
      • elsewhere
  2. identify the tests that actually require pulseaudio, and restrict its initialization to just those tests
    • this is tracked under bug 1518930
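Point 2 could be sketched roughly like this (hypothetical: a boolean-ish `audio` key in the test manifest entries and an injectable `starter` callback are assumptions, not the real manifest schema or harness API):

```python
def needs_pulseaudio(active_tests):
    """Return True if any selected test declares it needs audio.

    Sketch only: assumes a hypothetical "audio" key in manifest entries.
    """
    return any(str(t.get("audio", "")).lower() == "true" for t in active_tests)


def maybe_start_pulseaudio(active_tests, starter):
    # Only pay the pulseaudio startup (and flakiness) cost when required.
    if needs_pulseaudio(active_tests):
        starter()
        return True
    return False
```

This way suites with no audio tests never touch pulseaudio, shrinking the surface of the intermittent cubeb failures.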
Alias: debian-10
Attachment #9092527 - Attachment description: Bug 1565332 - change how pulseaudio is initialied for Debian 10 test image → Bug 1565332 - change how pulseaudio is initialized
Attachment #9092527 - Attachment description: Bug 1565332 - change how pulseaudio is initialized → Bug 1565332 - change how pulseaudio is initialied for Debian 10 test image
Attachment #9092527 - Attachment description: Bug 1565332 - change how pulseaudio is initialied for Debian 10 test image → Bug 1565332 - change how pulseaudio is initialized for Debian 10 test image
Attachment #9092527 - Attachment description: Bug 1565332 - change how pulseaudio is initialized for Debian 10 test image → Bug 1565332 - change how pulseaudio is initialized for Debian 10 test image without affecting existing Ubuntu 16.04 process
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/8b5c572d7695
Pin pip to 19.2.3 to avoid breaking docker image. a=bustage-fix CLOSED TREE
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fd0d2f340380
change how pulseaudio is initialized for Debian 10 test image without affecting existing Ubuntu 16.04 process r=jlund,dustin

I did a try push with xrandr invoked during the mochitest runtests.py setup phase to confirm the screen resolution is as expected, and this is what I get:

[task 2019-10-21T23:38:16.417Z] xrandr
[task 2019-10-21T23:38:16.417Z] + xrandr
[task 2019-10-21T23:38:16.442Z] xrandr: Failed to get size of gamma for output screen
[task 2019-10-21T23:38:16.442Z] Screen 0: minimum 1 x 1, current 1600 x 1200, maximum 1600 x 1200
[task 2019-10-21T23:38:16.443Z] screen connected 1600x1200+0+0 0mm x 0mm
[task 2019-10-21T23:38:16.444Z]    1600x1200      0.00* 

This value is in line with what I expect, yet screen-resolution-related failures are still scattered across various suites, so something else may be at play.
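A small helper like this (a sketch, not existing harness code) could turn that manual xrandr check into an assertion, parsing output like the snippet above:

```python
import re

def current_resolution(xrandr_output):
    """Parse `xrandr` output for the current screen resolution.

    Sketch only: matches the "current W x H" field of the Screen line,
    as seen in the log snippet above; returns None if absent.
    """
    m = re.search(r"current (\d+) x (\d+)", xrandr_output)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))
```

Called during setup, it could fail fast (or at least log loudly) when the resolution differs from the expected 1600x1200, which would help tell real resolution problems apart from unrelated failures.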

Depends on: 1543337
Depends on: 1593059
Depends on: 1596526
Depends on: 1596586