Closed Bug 1814692 Opened 2 years ago Closed 2 years ago

Video crash firefox, no crashlog

Categories

(Core :: Audio/Video: cubeb, defect)

Firefox 111
x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
111 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox109 --- disabled
firefox110 --- disabled
firefox111 --- fixed

People

(Reporter: grin, Assigned: pehrsons)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

Steps to reproduce:

Firefox crashes on various video chat applications. My test was done with
https://melonapp.com/ because it was crashing pretty fast on setup screen (usually within 1 mnute).

Actual results:

Firefox crashes. No crash log, only this:

Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
CPU time limit exceeded

Bisect told me that:

23:47.89 INFO: Last good revision: 23705e7275490be1e8422adcf07ab7df296858ce
23:47.89 INFO: First bad revision: 1fde17a7764ba61cf1f6449c46627a28c54d97bf

I really hope that it was really good, not just "not yet crashed" (waited for 3-4 minutes and did resolution changes in hope to crash faster).

Expected results:

A crashlog would be nice. Or something suggesting what's happening. Best would be no crash.

The latest Nightly 111.0a1 doesn't crash on my side on https://melonapp.com/, tested on macOS 11.7 and on Ubuntu 20.04.

Setting this to Core: WebRTC so that our developers could take a look at this issue. If this is not the right component, please assign a more appropriate one. Thanks!

Component: Untriaged → WebRTC
OS: Unspecified → Linux
Product: Firefox → Core
Hardware: Unspecified → x86_64

I have been talking about this with the Internet, and possible wild guesses:

  • FF may have been killed due to prlimit, aka. RTTIME timeout for real-time tasks
  • based on bisect it may or may not be related to audio (AND possibly video)
  • melonapp was just an easy example, it crashes on microsoft© teams™ or possibly many of the video chat services. It definitely doesn't crash on bbb, but I cannot really offer experience with others.

Hi, thanks for filing!

I guess this is linux? Which distro and version? Also is this a distro-packaged Firefox or did you get it off Mozilla's server?

"CPU time limit exceeded" means we blew the budget for a realtime thread and got killed by SIGXCPU. We do use realtime threads for audio (input/output for "realtime audio" aka MediaStream/WebAudio based audio) . Your wild guesses are not so wild! We do however get a heads up in content processes that we are taking too long in the callback and thereby should disable the realtime mode. Probably this crash happens in the parent process then.

Thanks for bisecting this, however I have not been able to make sense of the revision hashes. They don't seem valid for nightly, beta or release. Could you help me pinpoint them?

Also if this is fairly easy to reproduce, could you try to reproduce it under gdb to get a stack? I reckon the parent process is where it is crashing, so it should be enough to start firefox through gdb and you're already hooked to the right process.

Note, here is an explanation of how we use realtime threads for audio. I also see bug 1795719 which is for a SIGXCPU crash for screen capture. This is surprising because that thread should not be a realtime thread.

Flags: needinfo?(grin)
See Also: → 1795719

(In reply to Andreas Pehrson [:pehrsons] from comment #3)

I guess this is linux? Which distro and version? Also is this a distro-packaged Firefox or did you get it off Mozilla's server?

Oh, sorry, I fogot that there is no detailed crashlog. :-)

Debian sid/amd64, and it includes various bits and pieces from Debian-multimedia.
I do not think v111 is packaged anywhere, so no, this is Nightly autoupdating itself, but if you look at the bisect you'll see that this possibly originates way back in the past.

Thanks for bisecting this, however I have not been able to make sense of the revision hashes. They don't seem valid for nightly, beta or release. Could you help me pinpoint them?

This was offered by the arch's automagic bisect tool and it had to get deep into the revs, so I guess these are not release tags but changesets: this one is the bad and I guess these are the changes before it, two of them seem to be inbetween. I cannot readily tell you how to get unifiied difference of it.

Also if this is fairly easy to reproduce, could you try to reproduce it under gdb to get a stack? I reckon the parent process is where it is crashing, so it should be enough to start firefox through gdb and you're already hooked to the right process.

I hope this helps (I worry that there are multiple subsystems, namely audio and video is interacting):

Thread 210 "CaptureThread" received signal SIGXCPU, CPU time limit exceeded.
[Switching to Thread 0x7fffc85ba6c0 (LWP 30807)]
0x00007fffee8dd690 in CopyRow_ERMS () from /mnt/storage/firefox/libxul.so
(gdb) bt
#0  0x00007fffee8dd690 in CopyRow_ERMS () at /mnt/storage/firefox/libxul.so
#1  0x00007fffee8d11d1 in CopyPlane () at /mnt/storage/firefox/libxul.so
#2  0x00007fffee8ce601 in ScalePlane () at /mnt/storage/firefox/libxul.so
#3  0x00007fffee8c9793 in libyuv::I4xxToI420(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*, int, unsigned char*, int, unsigned char*, int, int, int, int, int) () at /mnt/storage/firefox/libxul.so
#4  0x00007fffee8c96ff in I422ToI420 () at /mnt/storage/firefox/libxul.so
#5  0x00007fffee8d22f7 in libyuv::JpegI422ToI420(void*, unsigned char const* const*, int const*, int) () at /mnt/storage/firefox/libxul.so
#6  0x00007fffee8cc3f7 in libyuv::MJpegDecoder::DecodeToCallback(void (*)(void*, unsigned char const* const*, int const*, int), void*, int, int) () at /mnt/storage/firefox/libxul.so
#7  0x00007fffee8d21a6 in MJPGToI420 () at /mnt/storage/firefox/libxul.so
#8  0x00007fffee8d2dce in ConvertToI420 () at /mnt/storage/firefox/libxul.so
#9  0x00007fffee4b33cc in webrtc::videocapturemodule::VideoCaptureImpl::IncomingFrame(unsigned char*, unsigned long, webrtc::VideoCaptureCapability const&, long) () at /mnt/storage/firefox/libxul.so
#10 0x00007fffee4b259c in webrtc::videocapturemodule::VideoCaptureModuleV4L2::CaptureProcess() () at /mnt/storage/firefox/libxul.so
#11 0x00007fffee4b2784 in std::_Function_handler<void (), webrtc::videocapturemodule::VideoCaptureModuleV4L2::StartCapture(webrtc::VideoCaptureCapability const&)::$_2>::_M_invoke(std::_Any_data const&) () at /mnt/storage/firefox/libxul.so
#12 0x00007fffee505b37 in std::_Function_handler<void (), rtc::PlatformThread::SpawnThread(std::function<void ()>, absl::string_view, rtc::ThreadAttributes, bool)::$_1>::_M_invoke(std::_Any_data const&) () at /mnt/storage/firefox/libxul.so
#13 0x00007fffee505a46 in rtc::(anonymous namespace)::RunPlatformThread(void*) () at /mnt/storage/firefox/libxul.so
#14 0x000055555560182e in set_alt_signal_stack_and_start(PthreadCreateParams*) ()
#15 0x00007ffff7aa7fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007ffff7b2866c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Flags: needinfo?(grin)

Thanks! This is an rtc::PlatformThread SIGXCPU like in bug 1795719. I'm not sure why this thread would have a cpu time limit, since it should not be a realtime thread. But audioipc2 as a regressor makes some sense since it sets realtime priority for its audio threads. Could you run through the steps in bug 1795719 comment 9 also (though in your case use camera instead of screen capture, and "Capture" instead of "Screen" in step 5 when finding the TID)?

Component: WebRTC → Audio/Video: cubeb
Flags: needinfo?(grin)
Regressed by: 1792785

For the record, this link shows the bisection range properly.

~$ chrt -m
SCHED_OTHER min/max priority    : 0/0
SCHED_FIFO min/max priority     : 1/99
SCHED_RR min/max priority       : 1/99
SCHED_BATCH min/max priority    : 0/0
SCHED_IDLE min/max priority     : 0/0
SCHED_DEADLINE min/max priority : 0/0
$ ulimit -Sa
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 35
file size                   (blocks, -f) unlimited
pending signals                     (-i) 127554
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 99
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 127554
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
$ ps -Lp 18719  | grep Cap
18719 28994 pts/77   00:00:00 VideoCapture
18719 29071 pts/77   00:00:00 CaptureThread

$ chrt -ap 28994 | grep 28994
pid 28994's current scheduling policy: SCHED_OTHER
pid 28994's current scheduling priority: 0

$ chrt -ap 28994 | grep 29071
pid 29071's current scheduling policy: SCHED_FIFO
pid 29071's current scheduling priority: 96

Thanks! This makes some sense because "CaptureThread" is a libwebrtc thread with priority kHigh which ends up having a sched_param.sched_priority of in this case max(99 - 1 - 2, 2)=96. Then it ends up in pthread_setschedparam(pthread_self(), policy, &param), and AH policy is SCHED_FIFO.

kinetik, paul, while I can work around this by patching libwebrtc, do you have any clue why audioipc2 would regress this? As in, would it affect whether other threads than those it manages itself may set an rt scheduler? Does it do something differently from audioipc1?

Flags: needinfo?(padenot)
Flags: needinfo?(kinetik)
Flags: needinfo?(grin)

That SCHED_FIFO has been around since the dawn of time, fwiw. We had a local fix back in the FxOS days but it soon got lost. Fixing this upstream was also dismissed.

I'm guessing we were not normally allowed to set SCHED_FIFO on desktop linux systems, so this never had an effect. It doesn't on my local Ubuntu 22.04 either.

What's happening here is that audioipc2 sets an rlimit(RLIMIT_RTTIME, ...)` equivalent to roughly 5ms, while the previous code used the default value of about 50ms. 5ms is too short for copying large buffers around, especially desktop capture in YUV422 as it seems to be the case, because that's going to be multiple megabytes.

This was never intended to be used on anything else but the real-time audio threads, but looks like it works well enough, we should move it back to 50ms. We're doing cleanups in the audio remoting later this year (after the removal of the old branch), and we'll be able to make this nicer, probably.

My ulimit -Sa reports real-time priority (-r) 0, so setting this to 99 would be a prerequisite to reproducing.

It might still make sense to remove these libwebrtc threads' ability to set SCHED_FIFO.

A fix mitigation / fix for this is up at https://github.com/mozilla/audioipc/pull/165.

Flags: needinfo?(padenot)
Assignee: nobody → apehrson
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(kinetik)
Pushed by pehrsons@gmail.com: https://hg.mozilla.org/integration/autoland/rev/9e64a965e26c Don't attempt realtime scheduling rtc::PlatformThreads. r=webrtc-reviewers,bwc
Duplicate of this bug: 1795719
See Also: 1795719
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch
QA Whiteboard: [qa-111b-p2]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: