Open Bug 1441563 (gUM_timeout) Opened 6 years ago Updated 2 months ago

getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)

Tracking

()

Status:

NEW

People

(Reporter: bwc, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(9 files)

PulseAudio verbose logs 6 years ago Alex Chronopoulos [:achronop] 493.29 KB, text/x-log		Details
Logs from a TSAN build 6 years ago Alex Chronopoulos [:achronop] 2.40 MB, application/zip		Details
A deadlock warning in MediaManager 6 years ago Alex Chronopoulos [:achronop] 26.99 KB, text/plain		Details
Firefox parent process bt all when deadlock was happening 6 years ago Alex Chronopoulos [:achronop] 194.31 KB, text/plain		Details
Firefox child process bt all when deadlock was happening 6 years ago Alex Chronopoulos [:achronop] 144.60 KB, text/plain		Details
PA server bt all when deadlock was happening 6 years ago Alex Chronopoulos [:achronop] 8.00 KB, text/plain		Details
Backtrace in audio process when a deadlock seems to be happening 6 years ago Arun Raghavan 128.33 KB, text/plain		Details
Backtrace in audio process a little after the previous backtrace 6 years ago Arun Raghavan 35.95 KB, text/plain		Details
bt-all-4-times-audioipc-off.txt 6 years ago Alex Chronopoulos [:achronop] 593.41 KB, text/plain		Details

Byron Campen [:bwc]

Reporter

Description

•

6 years ago

Many of our intermittent failures are caused by getUserMedia never resolving, but because the failures are spread out it is hard to see how big this problem is.

Byron Campen [:bwc]

Reporter

Comment 1

•

6 years ago

This is causing (at least some of) the failures in bug 1059240, 1440169, 1432025, 1210206, 1292976, 1300686, 1293910, 1424141, 1296075, 1297611, 1300644, 1309419, 1440669, 1441179, 1227386, 1127828, and 1436171,

Might be related to bug 1432786, 1439030, 1439491, and 1441512 (NotFoundError in getUserMedia).

Alias: gUM_timeout

Blocks: 1059240, 1440169, 1432025, 1210206, 1292976, 1300686, 1293910, 1424141, 1296075, 1297611, 1300644, 1309419, 1440669, 1441179, 1227386, 1127828, 1436171

See Also: → 1432786, 1439030, 1439491, 1441512

Summary: getUserMedia timeouts causing lots of intermittent failures → getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)

Byron Campen [:bwc]

Reporter

Updated

•

6 years ago

Flags: needinfo?(jib)

Noemi Erli[:noemi_erli]

Comment 2

•

6 years ago

This bug blocks 1059240 which has reached 40 failures in the past week.
:jib any updates here?

Michael Froman [:mjf]

Updated

•

6 years ago

Rank: 19

Priority: -- → P2

Nils Ohlmeier [:drno]

Updated

•

6 years ago

Blocks: 1416432

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 3

•

6 years ago

A lot of the dependent bug reports are years old, but I agree there appears to be a recent pattern. I see two kinds:

In bug 1432025 comment 6: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163285091&lineNumber=79792

[task 2018-02-20T20:58:48.224Z] 20:58:48     INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-20T20:58:48.225Z] 20:58:48     INFO - Buffered messages finished
[task 2018-02-20T20:58:48.226Z] 20:58:48    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_transceivers.html | application timed out after 370 seconds with no output
[task 2018-02-20T20:58:48.227Z] 20:58:48    ERROR - Force-terminating active process(es).

and bug 1432786 comment 5: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163090755&lineNumber=47758

[task 2018-02-19T21:34:25.579Z] 21:34:25     INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-19T21:34:25.582Z] 21:34:25     INFO - Buffered messages finished
[task 2018-02-19T21:34:25.590Z] 21:34:25     INFO - TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_transceivers.html | Error executing test: NotFoundError: The object can not be found here.  
[task 2018-02-19T21:34:25.594Z] 21:34:25     INFO -     runTestWhenReady/<@dom/media/tests/mochitest/head.js:454:7
[task 2018-02-19T21:34:25.595Z] 21:34:25     INFO -     Async*runTestWhenReady@dom/media/tests/mochitest/head.js:452:10
[task 2018-02-19T21:34:25.596Z] 21:34:25     INFO -     async*@dom/media/tests/mochitest/test_peerConnection_transceivers.html:2197:3

Alex, can you take a look? cc Andreas and Bryce as well for help.

Assignee: nobody → achronop

Flags: needinfo?(jib)

Alex Chronopoulos [:achronop]

Comment 4

•

6 years ago

This is the big issue we introduced with loopback devices. Unfortunately, there is no easy explanation. I have to create some time to engage PulseAudio guys on this. I will do it right after the push I am currently working on.

There are 2 possible workarounds. The first and most reasonable is not to use Loopback device if you do not need it. For example if the test does not make use of the AudioAnalyzer to verify the receiving tone you can disable it (by setting on the top of `runTest()` method `DISABLE_LOOPBACK_TONE = true). The second and less preferable workaround is to make use of virtual devices.

Paul Adenot (:padenot)

Comment 5

•

6 years ago

Please try to mitigate this as soon as possible, this is getting really annoying and burns quite a lot of time of a lot of people.

Alex Chronopoulos [:achronop]

Comment 6

•

6 years ago

Attached file PulseAudio verbose logs — Details

Following command repro the error reliably:

taskset -c 0 ./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html

Andreea Pavel [:apavel]

Updated

•

6 years ago

Blocks: 1443774

Alex Chronopoulos [:achronop]

Comment 7

•

6 years ago

Attached file Logs from a TSAN build — Details

Command to repro with a TSAN build:

./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html |& tee ../tsan_log.txt

Alex Chronopoulos [:achronop]

Comment 8

•

6 years ago

Attached file A deadlock warning in MediaManager — Details

This is part of the TSAN log uploaded in previous post

Alex Chronopoulos [:achronop]

Comment 9

•

6 years ago

When I replace the input device test is becoming green. Loopback device setup makes use of "Monitor of Null Output" device in order to capture whatever is going to the output. If change the input device back to the "Sine source at 440 Hz" the test runs successfully. I have already provided logs to the PulseAudio people and I am waiting for their response. I will discuss this extra finding with them.

Alex Chronopoulos [:achronop]

Comment 10

•

6 years ago

Profiling PulseAudio server on failed case:

Samples: 11K of event 'cycles:uppp', Event count (approx.): 9974038159
Overhead  Command          Shared Object             Symbol                                                                                                                                  
  31.21%  alsa-sink-CS420  [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
  19.25%  pulseaudio       [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   4.54%  null-sink        [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   1.44%  pulseaudio       libpulse.so.0.20.2        [.] pa_mainloop_dispatch                                                                                                                
   1.23%  alsa-source-CS4  [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   1.01%  alsa-sink-CS420  libpulsecore-11.1.so      [.] _init                                                                                                                               
   0.91%  alsa-sink-CS420  libpulsecommon-11.1.so    [.] _init                                                                                                                               
   0.81%  alsa-sink-CS420  libalsa-util.so           [.] _init                                                                                                                               
   0.71%  alsa-sink-CS420  libpulsecore-11.1.so      [.] pa_rtpoll_run                                                                                                                       
   0.71%  alsa-sink-CS420  libalsa-util.so           [.] thread_func                                                                                                                         
   0.55%  alsa-sink-CS420  libpulse.so.0.20.2        [.] _init                                                                                                                               
   0.42%  pulseaudio       libpulse.so.0.20.2        [.] _init                                                                                                                               
   0.41%  pulseaudio       libpulsecommon-11.1.so    [.] _init

Andreas Pehrson [:pehrsons]

Comment 11

•

6 years ago

If this is an issue on our side it could be triggered by different behavior for the different input devices. I recall when testing the loopback device patches early on that the null monitor's callbacks are much more frequent than those of the sine source.

That could explain a higher orange factor if we're looking at some sort of race involving the audio callback.

Arun Raghavan

Comment 12

•

6 years ago

I can't seem to reproduce the problem locally. Would it be possible to get a backtrace from when the deadlock happens? And is the deadlock in PA or the test (does pactl stat work when the deadlock happens)?

Alex Chronopoulos [:achronop]

Comment 13

•

6 years ago

Thanks for looking at it. I think you do not repro because you have an optimized build. I can repro on a debug build only. I tried this morning an opt build and the test passes. I can explain you in IRC how to build with debug enabled. It's very easy.

I tried the `pactl stat` while the deadlock was happening, that's the results:
$ taskset -c 0 pactl stat
Currently in use: 2041 blocks containing 4.1 MiB bytes total.
Allocated during whole lifetime: 1026601 blocks containing 638.6 MiB bytes total.
Sample cache size: 0 B

I also attached the debugger on PA process: 
0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
39	  return SYSCALL_CANCEL (ppoll, fds, nfds, timeout, sigmask, _NSIG / 8);
warning: File "/home/achronop/myrepos/mozilla/firefox/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /home/achronop/myrepos/mozilla/firefox/.gdbinit
line to your configuration file "/home/achronop/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/achronop/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
Missing separate debuginfos, use: dnf debuginfo-install alsa-lib-1.1.4.1-1.fc26.x86_64 libgcrypt-1.8.2-1.fc26.x86_64 nss-softokn-freebl-3.35.0-1.0.fc26.x86_64 pcre-8.41-5.fc26.x86_64 speexdsp-1.2-0.10.rc3.fc26.x86_64
(gdb) bt
#0  0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00007f1e498d2641 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  pa_mainloop_poll (m=m@entry=0x558ca602eb30) at pulse/mainloop.c:852
#3  0x00007f1e498d2c50 in pa_mainloop_iterate (m=0x558ca602eb30, block=<optimized out>, retval=0x7ffffa011dd8) at pulse/mainloop.c:926
#4  0x00007f1e498d2ce0 in pa_mainloop_run (m=0x558ca602eb30, retval=0x7ffffa011dd8) at pulse/mainloop.c:944
#5  0x0000558ca496133e in main (argc=<optimized out>, argv=<optimized out>) at daemon/main.c:1142

Alex Chronopoulos [:achronop]

Comment 14

•

6 years ago

Attached file Firefox parent process bt all when deadlock was happening — Details

Alex Chronopoulos [:achronop]

Comment 15

•

6 years ago

Attached file Firefox child process bt all when deadlock was happening — Details

Alex Chronopoulos [:achronop]

Comment 16

•

6 years ago

Attached file PA server bt all when deadlock was happening — Details

Alex Chronopoulos [:achronop]

Comment 17

•

6 years ago

Arun, following your question about which threads are involved in the scenario I uploaded the bt off all threads for the Firefox and PulseAudio when the deadlock was happening. For PA I continued and re-paused once to check if it was "moving". Feel free to ping me in IRC for more questions.

Arun Raghavan

Comment 18

•

6 years ago

Attached file Backtrace in audio process when a deadlock seems to be happening — Details

Arun Raghavan

Comment 19

•

6 years ago

Attached file Backtrace in audio process a little after the previous backtrace — Details

I attached to the audio process (sorry, not sure what the correct terminology is) and dumped the backtrace a couple of times (once when it looked deadlocked, then again close to timeout). I see two audio threads.

One of the threads has a different trace in the second backtrace (more or less waiting in the mainloop), suggesting that it is not where the deadlock is.

Will try to unravel where the common backtrace is in the code (it's in a pa_threaded_mainloop_wait()).

Arun Raghavan

Comment 20

•

6 years ago

(In reply to Arun Raghavan from comment #19)
[...]
> Will try to unravel where the common backtrace is in the code (it's in a
> pa_threaded_mainloop_wait()).

I meant pa_threaded_mainloop_lock().

Arun Raghavan

Comment 21

•

6 years ago

I'm quite puzzled by what's happening. Thread 13 in the backtraces I pasted seems stuck in a wait that should not happen. My machine seems to be quite slow at doing backtraces, so _maybe_ there's something wrong there. I've asked achronop to get me some backtraces from the deadlock.

From what I can tell, the problem is occurring on the client side. Next steps after I have looked at the backtraces (if nothing is different), is to see if the problem occurs with cubeb remote disabled, and finally with rust disabled if it does.

Since this is a bit of a mystery, my current paranoid suspicions include single-threaded weirdness in the Rust async code (since it only happens with taskset -c 0) or a problem in null-sink (maybe if all else fails, we can try to just write a simple loopback and try to make it do the same thing).

I did quickly look at cubeb-pulse-rs context code and all the locking code looked kosher, as expected.

Arun Raghavan

Comment 22

•

6 years ago

The other thing I was thinking of was instrumenting all the lock/unlock wait/signal things in cubeb-pulse-rs (ideally with some way to identify the caller), but my worry is that it might be a _lot_ of output to parse (and might affect behaviour of the test).

Alex Chronopoulos [:achronop]

Comment 23

•

6 years ago

NI myself to provide the info requested by Arun.

Flags: needinfo?(achronop)

Alex Chronopoulos [:achronop]

Comment 24

•

6 years ago

Attached file bt-all-4-times-audioipc-off.txt — Details

This file contains 4 backtracks from all threads of content process. Remote cubeb is disabled.

Flags: needinfo?(achronop)

Alex Chronopoulos [:achronop]

Comment 25

•

6 years ago

I executed the test using two different pairs of Sink Devices and Monitor Sources. Sink devices was real devices, the built in sink and an HDMI sink. In all cases the error is reproducible. I wonder if the the way PA implements the Monitor Sources uncovers that issue.

Christian Holler (:decoder)

Comment 26

•

6 years ago

This could part of the symptoms we see in fuzzing (quite a few cubeb processes and a lot of processes locked in I/O).

Arun Raghavan

Comment 27

•

6 years ago

Are these backtraces from when the process is hung?

Alex Chronopoulos [:achronop]

Comment 28

•

6 years ago

Yes, I paused the debugger 4 times, every 15 secs, started at the point that test was not progressing further.

Byron Campen [:bwc]

Reporter

Updated

•

5 years ago

Blocks: 1480942

Byron Campen [:bwc]

Reporter

Updated

•

5 years ago

Blocks: 1538566

Byron Campen [:bwc]

Reporter

Updated

•

3 years ago

Blocks: 1645930

BugBot [:suhaib / :marco/ :calixte]

Comment 29

•

2 years ago

The bug assignee didn't login in Bugzilla in the last 7 months.
:jib, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: achronop → nobody

Flags: needinfo?(jib)

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Byron Campen [:bwc]

Reporter

Updated

•

5 months ago

Blocks: 1781624

Byron Campen [:bwc]

Reporter

Updated

•

3 months ago

Depends on: 1876526

Jim Mathies [:jimm]

Updated

•

2 months ago

Flags: needinfo?(jib)

You need to log in before you can comment on or make changes to this bug.