Open Bug 1693151 Opened 4 years ago Updated 4 years ago

Intermittent failure : WARNING: 7fbe3f20e4a0 OpenCubeb() failed to init cubeb: file /builds/worker/checkouts/gecko/dom/media/AudioStream.cpp:323

Categories

(Firefox Build System :: Task Configuration, defect)

defect

Tracking

(Not tracked)

People

(Reporter: alwu, Unassigned)

References

(Blocks 1 open bug)

Details

Recently we're getting this failure more on the automation test on the try server, the error looks like following

WARNING: 7fbe3f20e4a0 OpenCubeb() failed to init cubeb: file /builds/worker/checkouts/gecko/dom/media/AudioStream.cpp:323

There are some logs for this issue, now this issue can be seen in bug1683016, bug1662614 and bug1661187.

[1]
https://bugzilla.mozilla.org/show_bug.cgi?id=1683016#c17
https://bugzilla.mozilla.org/show_bug.cgi?id=1683016#c20

(In reply to Alastor Wu [:alwu] from comment #0)

https://bugzilla.mozilla.org/show_bug.cgi?id=1683016#c20

The last successful stream is at 2021-01-19T23:22:33.703Z, the first failing stream is at 2021-01-19T23:22:43.182Z. The first failing stream happens in a new content process (note the cubeb prefs logged immediately before the stream init happens).

AudioIPC's stream init was changed in late October in bug 1661398, but I don't see an obvious link to this issue.

It could be a PulseAudio server issue - there might be an error from cubeb_core::context::ContextRef::stream_init in the AudioIPC server, which would need additional logs added to the stream init path in cubeb-pulse-rs or possibly logs from the test VM's PulseAudio server (if available).

(In reply to Matthew Gregan [:kinetik] from comment #1)

AudioIPC's stream init was changed in late October in bug 1661398, but I don't see an obvious link to this issue.

I don't think it's related to the root cause of this bug, but I did notice that since bug 1661398 AudioIPC is leaking a small ServerStream state object when stream_init fails, I'll fix that in audioipc-2 PR 114.

Severity: -- → S3
Priority: -- → P3

(In reply to Matthew Gregan [:kinetik] from comment #2)

I don't think it's related to the root cause of this bug, but I did notice that since bug 1661398 AudioIPC is leaking a small ServerStream state object when stream_init fails, I'll fix that in audioipc-2 PR 114.

That'll be fixed via bug 1694777.

Depends on: 1694777

It seems bug1694777 not fixing this issue, I still see this failure on the autoland on March 1st.

Blocks: 1673527

I commented on one of the blocked bugs, but think it's worth highlighting here -- I think it's possible as we disable tests that we shift the signature we're reporting failures against to the next mediacontrol test in the list. If this is the case, it's worth keeping in mind that increases we see in the different failures could be driven by that rather than any changes to media code.

Per the discussion in bug1673527 comment27, we might need to disable all those media control related browser tests on Linux if we keep encountering this error. But before doing that, I would like to see if kinetik has any idea about the possible solution for this bug.

Thank you.

Flags: needinfo?(kinetik)

Per bryce's analysis, it seems that the error indeed extends to other failure signature, which directly failed on some PulseAudio related check in bug 1695748. Could we raise this error to a higher priority or we have to disable all those tests on Linux, and I guess that error would still be possible to happen for other browser tests, which plays audio on Linux.

It's unclear if this is a CI or cubeb issue, but I'm leaning towards CI.

From cubeb's perspective, the issue appears to be that the PulseAudio server is not running when we expect it to be and fail to connect. Since we're seeing the connection failure from Gecko (OpenCubeb() failed to init) and PulseAudio's pactl utility (pa_context_connect() failed: Connection refused, Could not list currently loaded modules), the connection issue is not limited internally to Gecko/cubeb.

It's possible our code triggers a fatal exit or crashes in the PA server, but I haven't seen that happen before and there's no evidence in the logs.

PA is explicitly launched in test-linux.sh. Note that verbose logging is enabled, but I can't see any (stderr) output from the PA server after the --dump-conf (stdout) output. Redirecting PA's log output to a new artifact results in a single empty pulseaudio.log being produced, so it seems like PA is silent despite a verbose logging config.

This issue is reported to start around 2020-12-16. Is it possible it was showing up elsewhere, earlier?

I can't see an obvious link to any recent changes in cubeb-related code. In terms of test infra changes: gst-launch was tweaked in bug 1679491 to turn a hang into a crash and the ubuntu1804 VM configuration was changed in bug 1678451, but those may not be related.

Bug 1641395 made changes to the way PulseAudio is started (possibly affecting PA's autospawn behaviour, since client.conf is no longer used), but these changes landed back in June 2020. One comment in that bug is concerning: noting PulseAudio was (unexpectedly) not running due to unrelated changes in the config.

Hi, Joel,
I wonder if you have any thought about comment8, is that possible a CI issue? Did you aware any possible change that could affect PA on the try server? (This happened since 2020/12/16)
Thank you so much.

Flags: needinfo?(jmaher)

I've been trying to get reliable logging out of the PulseAudio server on Try. Confusingly, the PA server is launched in different locations for Ubuntu (test-linux.sh) and Debian (desktop_unittest.py) test images. Note that the Debian code explicitly terminates any existing server but the Ubuntu code will prefer any existing server over starting a new one.

It turns out the PA launch in test-linux.sh has been broken since bug 1641395, since --dump-conf and --cleanup-shm are both single-shot arguments that terminate the server. That means the Ubuntu tests have been relying on a different PA server since 2020-06-02 - are we picking up a PA launched from elsewhere or left running after a prior test run? Perhaps this is a bad interaction with PA's auto-spawn functionality - it looks like that's explicitly disabled on Debian (and was on Ubuntu, but this was removed in bug 1641395 for some reason).

Undoing the change from bug 1641395 reveals that the PA server launched from test-linux.sh fails to start sometimes due to an existing PA server being present. In other runs, the PA server launched from test-linux.sh is the one used by the tests - I suspect this reduces the incidence of this test failure.

Aside from the mess of which PA server is being used by the tests, here is an example of a bc7 task that used the test-linux.sh PA server and the mediacontrols tests still failed, so I'm able to look at the PA logs.

In this case, we see the PA server start at 2021-03-08T23:28:33.756Z. pulse-a.log shows the PA server running for at least 1067.402 seconds, at the end of which we see some GNOME stuff disconnecting. This puts the PA server's last log time around 2021-03-08T23:46:21Z, which the main logs reveal is when the tests start failing:

[task 2021-03-08T23:46:19.240Z] 23:46:19     INFO - GECKO(9701) | MEMORY STAT | vsize 3100MB | residentFast 312MB | heapAllocated 104MB
[task 2021-03-08T23:46:19.241Z] 23:46:19     INFO - TEST-OK | dom/media/mediacontrol/tests/browser/browser_media_control_supported_keys.js | took 8356ms
[task 2021-03-08T23:46:19.262Z] 23:46:19     INFO - checking window state
[task 2021-03-08T23:46:19.278Z] 23:46:19     INFO - TEST-START | dom/media/mediacontrol/tests/browser/browser_nosrc_and_error_media.js
[task 2021-03-08T23:46:19.902Z] 23:46:19     INFO - GECKO(9701) | [Child 9934, MediaDecoderStateMachine #1] WARNING: Decoder=7fbf309c5400 state=DECODING_METADATA Decode metadata failed, shutting down decoder: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:366
[task 2021-03-08T23:46:19.903Z] 23:46:19     INFO - GECKO(9701) | [Child 9934, MediaDecoderStateMachine #1] WARNING: Decoder=7fbf309c5400 Decode error: NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006): file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:3476
[task 2021-03-08T23:46:21.549Z] 23:46:21     INFO - GECKO(9701) | JavaScript error: resource:///actors/DOMFullscreenParent.jsm, line 130: Error: TelemetryStopwatch: key "FULLSCREEN_CHANGE_MS" was already initialized
[task 2021-03-08T23:46:21.659Z] 23:46:21     INFO - GECKO(9701) | [Child 9934, MediaDecoderStateMachine #1] WARNING: Decoder=7fbf340ad000 state=DECODING_METADATA Decode metadata failed, shutting down decoder: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:366
[task 2021-03-08T23:46:21.659Z] 23:46:21     INFO - GECKO(9701) | [Child 9934, MediaDecoderStateMachine #1] WARNING: Decoder=7fbf340ad000 Decode error: NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006): file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:3476
[task 2021-03-08T23:46:21.921Z] 23:46:21     INFO - GECKO(9701) | JavaScript error: resource:///actors/DOMFullscreenParent.jsm, line 130: Error: TelemetryStopwatch: key "FULLSCREEN_CHANGE_MS" was already initialized
[task 2021-03-08T23:46:21.986Z] 23:46:21     INFO - GECKO(9701) | MEMORY STAT | vsize 2899MB | residentFast 308MB | heapAllocated 104MB
[task 2021-03-08T23:46:21.988Z] 23:46:21     INFO - TEST-OK | dom/media/mediacontrol/tests/browser/browser_nosrc_and_error_media.js | took 2712ms
[task 2021-03-08T23:46:22.004Z] 23:46:22     INFO - checking window state
[task 2021-03-08T23:46:22.020Z] 23:46:22     INFO - TEST-START | dom/media/mediacontrol/tests/browser/browser_remove_controllable_media_for_active_controller.js
[task 2021-03-08T23:46:22.337Z] 23:46:22     INFO - GECKO(9701) | [Child 9762, MediaDecoderStateMachine #1] WARNING: 7f27d1e59120 OpenCubeb() failed to init cubeb: file /builds/worker/checkouts/gecko/dom/media/AudioStream.cpp:323

This error in xsession-errors.log happened between the last PA log message and the beginning of the test failures:

(gnome-shell:10256): Gdk-ERROR **: 23:46:21.183: The program 'gnome-shell' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadAlloc (insufficient resources for operation)'.
  (Details: serial 818 error_code 11 request_code 12 (core protocol) minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the GDK_SYNCHRONIZE environment
   variable to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
== Stack trace for context 0x55b853f2c4d0 ==
gnome-session-binary[59]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 5
gnome-session-binary[59]: WARNING: App 'org.gnome.Shell.desktop' respawning too quickly
gnome-session-binary[59]: CRITICAL: We failed, but the fail whale is dead. Sorry....

Disabling the PulseAudio server startup in test-linux.sh shows the tests still pass (and fail): https://treeherder.mozilla.org/jobs?repo=try&revision=e87e8e540e2a07c27ce4ca377dce5dd93d04520c

Disabling autospawn in addition to the above shows the tests fail due to no running PA server: https://treeherder.mozilla.org/jobs?repo=try&revision=dc5834ef931e04ecd507fd94986e34dc5050bfdd (expected result)

The autospawn change mentioned in comment 10 changed the behaviour here, but there may have been a further change required before the test failures became common enough to be noticed.

If autospawned PA server lifetime is tied to the desktop session lifetime and this gnome-shell crash is recently introduced, that may explain the other component of this test failure appearing when it did.

With both of above changes plus launching PA from desktop_unittest.py (as Debian test images do), no mediacontrols test failures: https://treeherder.mozilla.org/jobs?repo=try&revision=0cdc4a4e0404fa1e3e4a0c3873da675002d6277c (need to fix logging, stderr doesn't work for daemonized servers)

(In reply to Matthew Gregan [:kinetik] from comment #13)

With both of above changes plus launching PA from desktop_unittest.py (as Debian test images do), no mediacontrols test failures: https://treeherder.mozilla.org/jobs?repo=try&revision=0cdc4a4e0404fa1e3e4a0c3873da675002d6277c (need to fix logging, stderr doesn't work for daemonized servers)

The failing tests on this run are mostly in webauthn and have the gnome-shell crash in common. The mediacontrols tests immediately precede the webauthn tests, and the gnome-shell crash timing lines up with

[task 2021-03-09T03:55:18.680Z] 03:55:18     INFO - TEST-START | dom/media/mediacontrol/tests/browser/browser_nosrc_and_error_media.js
[task 2021-03-09T03:55:19.309Z] 03:55:19     INFO - GECKO(9609) | [Child 9873, MediaDecoderStateMachine #1] WARNING: Decoder=7f46f37d2000 state=DECODING_METADATA Decode metadata failed, shutting down decoder: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:366

...in the log every time.

(In reply to Matthew Gregan [:kinetik] from comment #14)

[task 2021-03-09T03:55:18.680Z] 03:55:18     INFO - TEST-START | dom/media/mediacontrol/tests/browser/browser_nosrc_and_error_media.js
[task 2021-03-09T03:55:19.309Z] 03:55:19     INFO - GECKO(9609) | [Child 9873, MediaDecoderStateMachine #1] WARNING: Decoder=7f46f37d2000 state=DECODING_METADATA Decode metadata failed, shutting down decoder: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:366

...in the log every time.

This is expected, because in that test we're testing that the media control won't be triggered on a decoded error file.

(In reply to Alastor Wu [:alwu] from comment #15)

This is expected, because in that test we're testing that the media control won't be triggered on a decoded error file.

From 25 test runs: 9 failed. Of those, 6 show the same pattern of webauthn failures immediately after the mediacontrols tests passed - 6/6 of those have gnome-shell crashes present in the xsession-errors.log, all with a crash timestamp that lines up with dom/media/mediacontrol/tests/browser/browser_nosrc_and_error_media.js. 0/3 of the other (non-webauthn) failing tests have gnome-shell crashes in xsession-errors.log.

Sorry, I just meant the error shown in the MediaDecoderStateMachine was expected.

See Also: → 1641395

I don't know if there is a linux expert here at Mozilla (1.5 years ago I wouldn't have said that), but this sounds like we need to get it fixed on our core docker image. I am cc'ing :glandium and :aerickson who might know more about linux configuration than me.

Flags: needinfo?(jmaher)

Gut reaction: maybe it's about time we just stopped trying to hack around the fact that we don't run a full desktop and instead run a full desktop, including all the system services that entails.

To summarize so far:
mediacontrols tests in test-linux1804-64/opt-mochitest-browser-chrome-e10s-7 (and other ubuntu1804-test images) fail midway through due to PulseAudio suddenly becoming unavailable. This is not a Gecko/cubeb or PulseAudio issue, but appears to be the result of gnome-shell crashing while mediacontrols tests are running (seems to correlate with the start of browser_nosrc_and_error_media.js), causing dbus and PA to exit as part of the desktop session torn down by the gnome-shell crash.

It's not clear if browser_nosrc_and_error_media.js is directly related or just happens to be running at the moment of the gnome-shell crash.

If I modify PA to run from desktop_unittest.py instead of test-linux.sh, PA does not seem to exit during the gnome-shell crash, so the mediacontrols tests pass but the following webauthn tests always fail (presumably due to relying on gnome-keyring-daemon, which is likely also killed by the gnome-shell crash). It seems like PA needs to be launched from test-linux.sh, since the wpt tests fail due to a missing PA when run from desktop_unittest.py so I assume the wpt tests aren't run in that environment.

I think it's possible to run PA independent of the desktop session from test-linux.sh by explicitly configuring a Unix named pipe in the config, but I haven't tested this and it wouldn't solve the main issue of the gnome-shell crash. And :glandium's likely right that the best solution is to run these images like a real user desktop session anyway.

The changes to how PA was launched in bug 1641395 are ultimately unrelated to all of this, but made enabling logging for PA difficult due to the unclear/multiple locations PA could run from, including autospawning which we don't want enabled for a predictable test environment.

Flags: needinfo?(kinetik)

Huge thank for the help and analysis from :kinetik, based on the comment 19 and 20, we should fix this issue by running a full desktop, instead of doing some hacks. So this bug should be moved to another component and wait for the next action.

Component: Audio/Video: cubeb → Task Configuration
Product: Core → Firefox Build System

The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --
You need to log in before you can comment on or make changes to this bug.