Open Bug 1401389 Opened 4 years ago Updated 3 years ago
WebRTC Externals 1.0.0 can cause parent process to hang
Attachments (text/plain): 403.78 KB, 161.89 KB, 315.23 KB
I have Firefox Nightly 57.0a1 (2017-09-19) (64-bit) on OS X with the WebRTC Externals web extension installed (https://github.com/fippo/webrtc-externals). With the extension installed and the following STR, I get a lockup of the parent process in about 75% of attempts and can only recover by force-quitting Firefox.

- open a room on https://appear.in
- join the same room with Firefox with the extension installed
- wait for the call to connect
- open a new tab via command+t
- close the new tab via command+w
- open a new tab via command+t
- close the new tab via command+w
- click on the enlarging arrow in the upper right corner of the remote video
- open a new tab via command+t
- close the new tab via command+w
- leave the call by clicking on the red X at the bottom of the video rendering area

At one of these steps the browser stops responding. After some time the cursor turns into the beach ball of death. The call continues for a while, then disconnects. The only way to recover is to force-quit Firefox.

Note: it might take two or three attempts with the above steps to repro.
Here it looks to me like the content process is waiting forever on the answer to a blocking IPC request.
Can't reproduce on Windows after a few tries.
Can reproduce on OS X, it gets pretty bad.
Can reproduce on Linux too, in a somewhat simpler case: just make a call in two tabs, then switch between the tabs a couple of times. The extension calls getStats() every second and then does a window.postMessage plus a channel.postMessage. Possibly that goes wrong if the tab is in the background?
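For context, the per-second polling described above might look roughly like the following sketch. All names here (`startStatsPolling`, `forward`, `pc`) are illustrative stand-ins, not the extension's actual code; the real source is in the linked repository.

```javascript
// Hypothetical sketch of a per-second getStats() polling loop.
// `pc` stands in for an RTCPeerConnection; `forward` stands in for the
// window.postMessage / channel.postMessage forwarding step mentioned above.
function startStatsPolling(pc, forward, intervalMs = 1000) {
  const timer = setInterval(async () => {
    // getStats() resolves to an RTCStatsReport, which is map-like.
    const report = await pc.getStats();
    const stats = {};
    report.forEach((value, id) => { stats[id] = value; });
    forward(stats); // in the extension: postMessage to the page and the channel
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop polling
}
```

Note that a loop like this fires on every tick regardless of tab visibility, which would fit the observation that backgrounded tabs are involved.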
Andy, should this bug be under WebExtensions, or do you think there's something that can be done to the add-on to fix this?
Priority: -- → P3
Sure, we don't really know where the problem is though at this point.
Component: Extension Compatibility → WebExtensions: Untriaged
Product: Firefox → Toolkit
I can repro on Ubuntu 17.04 with a recent Nightly by just entering an appear.in room with one remote participant while having the extension enabled. A couple of seconds into the call, all UI interaction freezes, but audio/video runs fine. Attaching gdb shows the parent process main thread in weird places:

> (gdb) bt
> #0 0x000029d6648ef5b0 in ()
> #1 0xfff8800000000000 in ()
> #2 0x7ff0000000000000 in ()
> #3 0x4018000000000000 in ()
> #4 0xfff8800000000018 in ()
> #5 0x7ff0000000000000 in ()
> #6 0x0000000000000000 in ()

Resuming the thread and breaking again appears to just change the last few digits of frame #0 a bit. One core is pegged at 100%, presumably running the main thread.

The activity trace from drno looks more sane, though. It includes this bit, which looks bad (a sync dispatch on the main thread):

> 2224 nsFrameMessageManager::SendMessage(nsTSubstring<char16_t> const&, JS::Handle<JS::Value>, JS::Handle<JS::Value>, nsIPrincipal*, JSContext*, unsigned char, JS::MutableHandle<JS::Value>, bool) (in XUL) + 684 [0x1157dc99c]
> 2224 mozilla::dom::TabChild::DoSendBlockingMessage(JSContext*, nsTSubstring<char16_t> const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*, bool) (in XUL) + 323 [0x1167cfcf3]
> 2224 mozilla::dom::PBrowserChild::SendSyncMessage(nsTString<char16_t> const&, mozilla::dom::ClonedMessageData const&, nsTArray<mozilla::jsipc::CpowEntry> const&, IPC::Principal const&, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) (in XUL) + 595 [0x115144503]
> 2224 mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) (in XUL) + 2131 [0x114f25343]
> 2224 mozilla::ipc::MessageChannel::WaitForSyncNotify(bool) (in XUL) + 154 [0x114f25a8a]
> 2224 mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) (in libmozglue.dylib) + 28 [0x10e3834dc]

Any chance we can bump priority on this? Or could you tell me what to look for in order to debug it further?
Sorry, at this point we do not have the people to dig more into this one.
Slightly above the fragment pasted in comment 9 is a stack frame for:

```
nsGlobalWindow::DispatchDOMWindowCreated() (in XUL) + 101 [0x1157ed005]
```

The WebExtensions framework does handle that event, but I believe only in extension content processes (and it sounds like this is a web content process that is pegged). In any case, it doesn't look to me like the handler for that event does anything that can trigger synchronous IPC. It's possible that grabbing a profile during one of these events (see https://perf-html.io/) would give us more information, but I think the best bet is to get this in front of somebody more skilled at analyzing and triaging these sorts of problems, perhaps by changing the component to something like Firefox:Untriaged.
Since I just had to debug a coworker's machine: should we (I) pull the extension from the store until this is fixed?
(In reply to Philipp Hancke [:fippo] from comment #12)
> since I just had to debug a coworker's machine: should we (I) pull the
> extension from the store until this is fixed?

That is probably a good idea, as it currently makes Firefox appear unstable, with little chance that users of the extension will figure out that the extension is the cause.
Deactivated. The source is available from https://github.com/fippo/webrtc-externals if anyone wants to give it a try.
I've spent some more time debugging this. It doesn't happen when I deactivate the stats graphs. Those cause issues in Edge as well, so I suspect the issue is *somewhere* in the graph library (which, to be fair, was written as part of Chrome's internals). I will remove the graphs in Firefox and republish. Shall we resolve as "works for me"?
Personally, I'd still like to understand the underlying problem here: whether the graph lib is doing something crazy or we are reacting to it in crazy ways. Perhaps that knowledge could let us shrink the testcase and STR a bit, though.
Bulk move of bugs per https://bugzilla.mozilla.org/show_bug.cgi?id=1483958