Open Bug 1441563 (gUM_timeout) Opened 6 years ago Updated 2 months ago

getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)

Categories

(Core :: WebRTC: Audio/Video, defect, P2)

59 Branch
defect

Tracking

()

People

(Reporter: bwc, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(9 files)

Many of our intermittent failures are caused by getUserMedia never resolving, but because the failures are spread out it is hard to see how big this problem is.
This is causing (at least some of) the failures in bug 1059240, 1440169, 1432025, 1210206, 1292976, 1300686, 1293910, 1424141, 1296075, 1297611, 1300644, 1309419, 1440669, 1441179, 1227386, 1127828, and 1436171,

Might be related to bug 1432786, 1439030, 1439491, and 1441512 (NotFoundError in getUserMedia).
Alias: gUM_timeout
See Also: → 1432786, 1439030, 1439491, 1441512
Summary: getUserMedia timeouts causing lots of intermittent failures → getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)
Flags: needinfo?(jib)
This bug blocks 1059240 which has reached 40 failures in the past week.
:jib any updates here?
Rank: 19
Priority: -- → P2
Blocks: 1416432
A lot of the dependent bug reports are years old, but I agree there appears to be a recent pattern. I see two kinds:

In bug 1432025 comment 6: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163285091&lineNumber=79792

[task 2018-02-20T20:58:48.224Z] 20:58:48     INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-20T20:58:48.225Z] 20:58:48     INFO - Buffered messages finished
[task 2018-02-20T20:58:48.226Z] 20:58:48    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_transceivers.html | application timed out after 370 seconds with no output
[task 2018-02-20T20:58:48.227Z] 20:58:48    ERROR - Force-terminating active process(es).

and bug 1432786 comment 5: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163090755&lineNumber=47758

[task 2018-02-19T21:34:25.579Z] 21:34:25     INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-19T21:34:25.582Z] 21:34:25     INFO - Buffered messages finished
[task 2018-02-19T21:34:25.590Z] 21:34:25     INFO - TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_transceivers.html | Error executing test: NotFoundError: The object can not be found here.  
[task 2018-02-19T21:34:25.594Z] 21:34:25     INFO -     runTestWhenReady/<@dom/media/tests/mochitest/head.js:454:7
[task 2018-02-19T21:34:25.595Z] 21:34:25     INFO -     Async*runTestWhenReady@dom/media/tests/mochitest/head.js:452:10
[task 2018-02-19T21:34:25.596Z] 21:34:25     INFO -     async*@dom/media/tests/mochitest/test_peerConnection_transceivers.html:2197:3

Alex, can you take a look? cc Andreas and Bryce as well for help.
Assignee: nobody → achronop
Flags: needinfo?(jib)
This is the big issue we introduced with loopback devices. Unfortunately, there is no easy explanation. I have to create some time to engage PulseAudio guys on this. I will do it right after the push I am currently working on.

There are 2 possible workarounds. The first and most reasonable is not to use Loopback device if you do not need it. For example if the test does not make use of the AudioAnalyzer to verify the receiving tone you can disable it (by setting on the top of `runTest()` method `DISABLE_LOOPBACK_TONE = true). The second and less preferable workaround is to make use of virtual devices.
Please try to mitigate this as soon as possible, this is getting really annoying and burns quite a lot of time of a lot of people.
Following command repro the error reliably:

taskset -c 0 ./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html
Blocks: 1443774
Attached file Logs from a TSAN build
Command to repro with a TSAN build:

./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html |& tee ../tsan_log.txt
This is part of the TSAN log uploaded in previous post
When I replace the input device test is becoming green. Loopback device setup makes use of "Monitor of Null Output" device in order to capture whatever is going to the output. If change the input device back to the "Sine source at 440 Hz" the test runs successfully. I have already provided logs to the PulseAudio people and I am waiting for their response. I will discuss this extra finding with them.
Profiling PulseAudio server on failed case:

Samples: 11K of event 'cycles:uppp', Event count (approx.): 9974038159
Overhead  Command          Shared Object             Symbol                                                                                                                                  
  31.21%  alsa-sink-CS420  [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
  19.25%  pulseaudio       [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   4.54%  null-sink        [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   1.44%  pulseaudio       libpulse.so.0.20.2        [.] pa_mainloop_dispatch                                                                                                                
   1.23%  alsa-source-CS4  [unknown]                 [.] 0xffffffff9ba0013c                                                                                                                  
   1.01%  alsa-sink-CS420  libpulsecore-11.1.so      [.] _init                                                                                                                               
   0.91%  alsa-sink-CS420  libpulsecommon-11.1.so    [.] _init                                                                                                                               
   0.81%  alsa-sink-CS420  libalsa-util.so           [.] _init                                                                                                                               
   0.71%  alsa-sink-CS420  libpulsecore-11.1.so      [.] pa_rtpoll_run                                                                                                                       
   0.71%  alsa-sink-CS420  libalsa-util.so           [.] thread_func                                                                                                                         
   0.55%  alsa-sink-CS420  libpulse.so.0.20.2        [.] _init                                                                                                                               
   0.42%  pulseaudio       libpulse.so.0.20.2        [.] _init                                                                                                                               
   0.41%  pulseaudio       libpulsecommon-11.1.so    [.] _init
If this is an issue on our side it could be triggered by different behavior for the different input devices. I recall when testing the loopback device patches early on that the null monitor's callbacks are much more frequent than those of the sine source.

That could explain a higher orange factor if we're looking at some sort of race involving the audio callback.
I can't seem to reproduce the problem locally. Would it be possible to get a backtrace from when the deadlock happens? And is the deadlock in PA or the test (does pactl stat work when the deadlock happens)?
Thanks for looking at it. I think you do not repro because you have an optimized build. I can repro on a debug build only. I tried this morning an opt build and the test passes. I can explain you in IRC how to build with debug enabled. It's very easy.

I tried the `pactl stat` while the deadlock was happening, that's the results:
$ taskset -c 0 pactl stat
Currently in use: 2041 blocks containing 4.1 MiB bytes total.
Allocated during whole lifetime: 1026601 blocks containing 638.6 MiB bytes total.
Sample cache size: 0 B

I also attached the debugger on PA process: 
0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
39	  return SYSCALL_CANCEL (ppoll, fds, nfds, timeout, sigmask, _NSIG / 8);
warning: File "/home/achronop/myrepos/mozilla/firefox/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /home/achronop/myrepos/mozilla/firefox/.gdbinit
line to your configuration file "/home/achronop/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/achronop/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
Missing separate debuginfos, use: dnf debuginfo-install alsa-lib-1.1.4.1-1.fc26.x86_64 libgcrypt-1.8.2-1.fc26.x86_64 nss-softokn-freebl-3.35.0-1.0.fc26.x86_64 pcre-8.41-5.fc26.x86_64 speexdsp-1.2-0.10.rc3.fc26.x86_64
(gdb) bt
#0  0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00007f1e498d2641 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  pa_mainloop_poll (m=m@entry=0x558ca602eb30) at pulse/mainloop.c:852
#3  0x00007f1e498d2c50 in pa_mainloop_iterate (m=0x558ca602eb30, block=<optimized out>, retval=0x7ffffa011dd8) at pulse/mainloop.c:926
#4  0x00007f1e498d2ce0 in pa_mainloop_run (m=0x558ca602eb30, retval=0x7ffffa011dd8) at pulse/mainloop.c:944
#5  0x0000558ca496133e in main (argc=<optimized out>, argv=<optimized out>) at daemon/main.c:1142
Arun, following your question about which threads are involved in the scenario I uploaded the bt off all threads for the Firefox and PulseAudio when the deadlock was happening. For PA I continued and re-paused once to check if it was "moving". Feel free to ping me in IRC for more questions.
I attached to the audio process (sorry, not sure what the correct terminology is) and dumped the backtrace a couple of times (once when it looked deadlocked, then again close to timeout). I see two audio threads.

One of the threads has a different trace in the second backtrace (more or less waiting in the mainloop), suggesting that it is not where the deadlock is.

Will try to unravel where the common backtrace is in the code (it's in a pa_threaded_mainloop_wait()).
(In reply to Arun Raghavan from comment #19)
[...]
> Will try to unravel where the common backtrace is in the code (it's in a
> pa_threaded_mainloop_wait()).

I meant pa_threaded_mainloop_lock().
I'm quite puzzled by what's happening. Thread 13 in the backtraces I pasted seems stuck in a wait that should not happen. My machine seems to be quite slow at doing backtraces, so _maybe_ there's something wrong there. I've asked achronop to get me some backtraces from the deadlock.

From what I can tell, the problem is occurring on the client side. Next steps after I have looked at the backtraces (if nothing is different), is to see if the problem occurs with cubeb remote disabled, and finally with rust disabled if it does.

Since this is a bit of a mystery, my current paranoid suspicions include single-threaded weirdness in the Rust async code (since it only happens with taskset -c 0) or a problem in null-sink (maybe if all else fails, we can try to just write a simple loopback and try to make it do the same thing).

I did quickly look at cubeb-pulse-rs context code and all the locking code looked kosher, as expected.
The other thing I was thinking of was instrumenting all the lock/unlock wait/signal things in cubeb-pulse-rs (ideally with some way to identify the caller), but my worry is that it might be a _lot_ of output to parse (and might affect behaviour of the test).
NI myself to provide the info requested by Arun.
Flags: needinfo?(achronop)
This file contains 4 backtracks from all threads of content process. Remote cubeb is disabled.
Flags: needinfo?(achronop)
I executed the test using two different pairs of Sink Devices and Monitor Sources. Sink devices was real devices, the built in sink and an HDMI sink. In all cases the error is reproducible. I wonder if the the way PA implements the Monitor Sources uncovers that issue.
This could part of the symptoms we see in fuzzing (quite a few cubeb processes and a lot of processes locked in I/O).
Are these backtraces from when the process is hung?
Yes, I paused the debugger 4 times, every 15 secs, started at the point that test was not progressing further.
Blocks: 1480942
Blocks: 1538566
Blocks: 1645930

The bug assignee didn't login in Bugzilla in the last 7 months.
:jib, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: achronop → nobody
Flags: needinfo?(jib)
Severity: normal → S3
Blocks: 1781624
Depends on: 1876526
Flags: needinfo?(jib)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: