Deadlock in SharedMemoryBasic_mach → AV1 playback doesn't work in Firefox Nightly on macOS 10.14.4
Categories
(Core :: IPC, defect, P1)
Tracking
()
People
(Reporter: beingalink, Assigned: haik)
References
(Regression)
Details
(Keywords: regression)
Crash Data
Attachments
(1 file)
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0
Steps to reproduce:
Open https://www.singhkays.com/blog/its-time-replace-gifs-with-av1-video/
Actual results:
All videos play except for the av1 ones (only fixed picture is shown). All context menu items on the av1 video are greyed out except for "Loop".
Expected results:
The av1 video should play.
Updated•6 years ago
|
Comment 1•6 years ago
|
||
Single videos appear to play back fine by themselves:
https://www.singhkays.com/blog/its-time-replace-gifs-with-av1-video/media/scene2/av1_200k.mp4
So presumably it's an issue with multiple concurrent decodes?
Updated•6 years ago
|
Updated•6 years ago
|
Comment 2•6 years ago
|
||
This looks like an RDD issue. I don't get the error when I disable RDD. I get the following warnings when I visit that link (I am building with MacOSX SDK 10.11):
[Child 75554, RemVidChild] WARNING: short timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 597
[RDD 75557, RemVidParent] WARNING: short timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 597
[Child 75546, Main Thread] WARNING: No active window: file /Users/achronopoulos/repos/mozilla/firefox/js/xpconnect/src/XPCJSContext.cpp, line 662
[Child 75554, RemVidChild] WARNING: long timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 601
[RDD 75557, RemVidParent] WARNING: long timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 601
[RDD 75557, Chrome_ChildThread] WARNING: pipe error: Broken pipe: file /Users/achronopoulos/repos/mozilla/firefox/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 727
[Child 75554, RemVidChild] WARNING: Serials do not match up: 89 vs 93: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 613
[Child 75554, MediaPlayback #1] WARNING: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaFormatReader.cpp, line 1657
[Child 75554, MediaPlayback #1] WARNING: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaFormatReader.cpp, line 1657
[Child 75554, MediaPlayback #4] WARNING: Decoder=1666e7800 Decode error: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaDecoderStateMachine.cpp, line 3307
Updated•6 years ago
|
Comment 3•6 years ago
|
||
I tried this under Linux, and it doesn't happen. It does happen for me under macOS 10.14.4. I have some logging in place locally, and I can see RDD process decoding av1 on that page before the warnings. Those are all Shmem related warnings, which is a bit interesting. I'll add some more logging to see if we're failing to create the shmem to pass back the decoded frames.
Comment 4•6 years ago
|
||
I see similar things. It only happens on OSX for me too. Also I see that we create three av1 decoders. Some decoding is taking place for the first and the second decoder when the third one is created and everything stops. By everything stops I mean RemoteDecoderParent::RecvInput
stop getting new input.
Comment 5•6 years ago
|
||
I believe everything stops because RDD is stopped waiting for AllocShmem to return during the long timeout in file /Users/mfroman/mozilla/moz-central/ipc/glue/SharedMemoryBasic_mach.mm, line 601
Comment 6•6 years ago
|
||
We are definitely hitting the error case in RemoteVideoDecoder.cpp here[1].
[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteVideoDecoder.cpp#191
Comment 7•6 years ago
|
||
CreateSharedMemory is failing here[1], but I don't know why yet.
[1] https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#748
Comment 8•6 years ago
|
||
Jed, do you have any thoughts as to why we'd start failing to alloc Shmem midway through a working av1 decode, failing on the long timeout here[1]?
[1] https://searchfox.org/mozilla-central/source/ipc/glue/SharedMemoryBasic_mach.mm#601
Comment 9•6 years ago
|
||
So the way this works is, if process A is sending a shared memory handle to process B, then B should have a thread dedicated to handling these messages from A, waiting in PortServerThread
. Communication between child processes is set up through the parent process; see ForwardGetPortsMessage
. What I'd do if I were trying to debug this is find the thread that's supposed to be answering and see if it's blocked somewhere.
That Serials do not match up: 89 vs 93
error is also a little odd — it looks like the message thread eventually woke up and answered one of the requests that had timed out? It's possible that this protocol will permanently lose sync in that case, if it's always n>0
messages behind the reply it's expecting.
I'm also a little suspicious of this non-atomic static; if the compiler rematerializes my_serial
as serial - 1
after the serial++
, which I believe it's allowed to do, there's a race condition which could cause spurious serial mismatches, but I don't think that would explain the timeout. That code was added in bug 1205164.
Comment 10•6 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #9)
I'm also a little suspicious of this non-atomic static; if the compiler rematerializes
my_serial
asserial - 1
after theserial++
, which I believe it's allowed to do, there's a race condition which could cause spurious serial mismatches, but I don't think that would explain the timeout. That code was added in bug 1205164.
I tried making the non-atomic static into an atomic, like:
static mozilla::Atomic<uint64_t> serial(0);
That modification did not change the behavior. We still see the timeouts.
Comment 11•6 years ago
|
||
I can reproduce this, and with some help from LLDB I think I know what's going on here:
- The content and RDD processes try to
ShareToProcess
to each other at the same time - Both processes take their
gMutex
. - Both processes send to each other.
- Both processes'
PortServerThread
s for each other wake up, receive the message, and block on theirgMutex
es. - Deadlock.
- Later, the
ShareToProcess
es time out and release the lock, allowing the messages to finally be processed. - Replies are sent, and then received by the next (wrong)
ShareToProcess
. Everything is now broken.
Comment 12•6 years ago
|
||
(Obviously AV1 support wasn't regressed 4 years ago, because it didn't exist then, but I believe the underlying bug is that old.)
Comment 13•6 years ago
|
||
\o/ Thank you for tracking that down!
I'd been adding logging to see what else I could find. I'd just gotten to the point of knowing that all three decoders were decoding at least some av1 content, but I had not figured out why we stalled.
Is this something I can change from our usage or is this something that must be changed in the shared memory implementation?
![]() |
||
Updated•6 years ago
|
![]() |
||
Updated•6 years ago
|
Comment 14•6 years ago
|
||
Looks like a similar issue may be happening on OSX CCov tests:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=246683182&repo=try&lineNumber=9154-9162
Comment 15•6 years ago
|
||
This signature first started showing up in 68, in the 20190417214729 build.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 16•6 years ago
|
||
(In reply to Michael Froman [:mjf] from comment #13)
Is this something I can change from our usage or is this something that must be changed in the shared memory implementation?
Theoretically, it might be possible to work around this by only ever sending shared memory in one direction. See also GMP, which arranges for all shared memory to be allocated in the parent process by using intr
messages (the only use of that feature outside of NPAPI, I think) to request creation, combined with its own caching layer. So that would probably be more work than just fixing this properly.
On that topic, there are two ways to deal with this:
-
Go ahead with bug 1465669 and use POSIX shm instead. That's a major change and I wouldn't want to try uplifting it, but I have a patch that used to work, and I'm in the process of un-bit-rotting it so that we at least have it as an option.
-
Use finer-grained locking in
SharedMemoryBasic_mach
so that sending share requests doesn't block handling incoming ones. I think that should be doable (the two directions already have separate sets of Mach ports), but there's a bunch of process-global state and I haven't gone through all the code in detail. Haik is going to take a look at that.
Comment 17•6 years ago
|
||
(In reply to Liz Henry (:lizzard) (use needinfo) from comment #15)
This signature first started showing up in 68, in the 20190417214729 build.
I'm not so sure about that: I don't see any PortServerThread
s blocked on a mutex. I think we can also end up blocked in ShareToProcess
like that if the other process already exited, but there are a lot of things that happen during shutdown and I don't know if they can happen out of order enough to cause that.
Assignee | ||
Comment 18•6 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #16)
- Use finer-grained locking in
SharedMemoryBasic_mach
so that sending share requests doesn't block handling incoming ones. I think that should be doable (the two directions already have separate sets of Mach ports), but there's a bunch of process-global state and I haven't gone through all the code in detail. Haik is going to take a look at that.
Thanks for the writeup. The analysis was really helpful. I've been looking at the locking today and have some ideas. I'm still familiarizing myself with the code, but I noticed that in SharedMemoryBasic::ShareToProcess() we hold gMutex while we call SendMessage() and then WaitForMessage(), but even with the locking, there doesn't seem to be a guarantee that the message received with WaitForMessage() is the reply to the message that was sent because (for example) there could be an existing message ready to be read before SendMessage() is called. And PortServerThread isn't holding gMutex when it calls WaitForMessage() so we have two threads racing calling WaitForMessage() in parallel. If PortServerThread() is given the incoming message, and there are no subsequent messages, ShareToProcess() would block indefinitely while holding gMutex.
Ignoring that problem, I think we could replace gMutex with a RW lock. The ShareToProcess and PortServerThread code paths could use the reader lock. More analysis needed to determine if the additional locks (one each for sending and receiving) would be needed here.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 19•6 years ago
|
||
(In reply to Haik Aftandilian [:haik] from comment #18)
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #16)
Thanks for the writeup. The analysis was really helpful. I've been looking at the locking today and have some ideas. I'm still familiarizing myself with the code, but I noticed that in SharedMemoryBasic::ShareToProcess() we hold gMutex while we call SendMessage() and then WaitForMessage(), but even with the locking, there doesn't seem to be a guarantee that the message received with WaitForMessage() is the reply to the message that was sent because (for example) there could be an existing message ready to be read before SendMessage() is called. And PortServerThread isn't holding gMutex when it calls WaitForMessage() so we have two threads racing calling WaitForMessage() in parallel. If PortServerThread() is given the incoming message, and there are no subsequent messages, ShareToProcess() would block indefinitely while holding gMutex.
Jed explained how we use a different set of ports in PortServerThread (setup in GetMemoryPortsForPid() and SetupMachMemory()) so the ordering issue I described above is not a problem.
Assignee | ||
Comment 20•6 years ago
|
||
(In reply to Haik Aftandilian [:haik] from comment #18)
Ignoring that problem, I think we could replace gMutex with a RW lock. The ShareToProcess and PortServerThread code paths could use the reader lock. More analysis needed to determine if the additional locks (one each for sending and receiving) would be needed here.
In PortServerThread(), we acquire gMutex after returning from WaitForMessage() and before calling HandleSharePortsMessage() to handle the kSharePortsMsg message which causes the deadlock. For HandleSharePortsMessage(), it appears we don't need to hold gMutex. If that's the case, not holding gMutex during HandleSharePortsMessage() would be a simple fix. I'm looking into this before trying a more complex locking scheme.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b73fb1440e55546ea8a06ba2b9b42b832381c431
Assignee | ||
Comment 21•6 years ago
|
||
Don't hold gMutex when calling HandleSharePortsMessage() from PortServerThread to avoid deadlock.
Comment 22•6 years ago
|
||
Comment 23•6 years ago
|
||
bugherder |
Updated•6 years ago
|
Comment 24•6 years ago
|
||
Please request uplift to beta if and when you're comfortable doing so. Thanks!
Assignee | ||
Comment 25•6 years ago
|
||
Comment on attachment 9065827 [details]
Bug 1550771 - Deadlock in SharedMemoryBasic_mach triggered by AV1 playback r?jld
Beta/Release Uplift Approval Request
- User impact if declined: On Mac systems, users might experience browser hangs, tab spinners, and tab crashes on sites that have more than one AV1 video decoding streams.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: See description. The page listed in the description reliably causes a hang on macOS. Sometimes the page appears to work as expected, but switching to a new tab and then returning to the original page triggers the problem.
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): The change is in a risky area, but the change is small in scope.
- String changes made/needed: None
Assignee | ||
Updated•6 years ago
|
Comment 26•6 years ago
|
||
For some additional info on Comment 25:
- I have verified this fix in Nightly (2019-05-23) using the reporter's STR.
- This can also affect macOS in non-media situations with browser hangs and spinners, although much less often.
Reporter | ||
Comment 27•6 years ago
|
||
Just to confirm as I reported the original bug, the site with the av1 videos works fine now in nightly.
Updated•6 years ago
|
Comment 28•6 years ago
|
||
Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?
Comment 29•6 years ago
|
||
Comment on attachment 9065827 [details]
Bug 1550771 - Deadlock in SharedMemoryBasic_mach triggered by AV1 playback r?jld
macos deadlock fix, approved for 68.0b5
Assignee | ||
Comment 30•6 years ago
|
||
(In reply to Hani Yacoub from comment #28)
Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?
This bug fix only changed code used on macOS so it should not have affected Linux playback.
The videos should play on Ubuntu 18.04, but for some you may need to install extra video codec packages first. On an 18.04 system I tested, all the video types appear to play and that system has libavcodec-extra and chromium-codecs-ffmpeg-extra installed. You can test AV1 alone on https://demo.bitmovin.com/public/firefox/av1/
@mjf, Michael, do you know which codec packages are recommended to be installed on Ubuntu 18.04?
Comment 31•6 years ago
|
||
bugherder uplift |
Comment 32•6 years ago
|
||
(In reply to Haik Aftandilian [:haik] from comment #30)
(In reply to Hani Yacoub from comment #28)
Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?@mjf, Michael, do you know which codec packages are recommended to be installed on Ubuntu 18.04?
Poking around on google a bit, I found this link:
https://askubuntu.com/questions/171026/how-to-install-all-existing-media-codecs
tl;dr sudo apt-get install ubuntu-restricted-extras
For more info on that package:
https://packages.ubuntu.com/bionic/ubuntu-restricted-extras
Anecdotally, I don't remember doing anything special at all for my desktop Ubuntu 18.04 install, and those videos play for me. However, on a recent Ubuntu 18.04 VM I built (using VMWare's easy install feature), I can see they don't play until installing libavcodec-extra (which is included in ubuntu-restricted-extras). I suppose it is possible that something else I installed on my desktop had libavcodec-extra as a dependency.
Comment 33•6 years ago
|
||
The issue is not fixed on MacBook pro using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
The issue is fixed on iMac using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
Shouldn't this be fixed on both MacBook pro and on iMac?
Assignee | ||
Comment 34•6 years ago
|
||
(In reply to Hani Yacoub from comment #33)
The issue is not fixed on MacBook pro using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
The issue is fixed on iMac using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.Shouldn't this be fixed on both MacBook pro and on iMac?
Yes, it should be fixed on all Macs. Please double check you have 68.05b and then there are some debugging steps you could take to confirm what's happening.
Comment 35•6 years ago
|
||
My initial Beta download was 68.04b and it did not work. I did the update to 68.05b, and it works correctly for me.
So to confirm, MacBook Pro w/ 68.05b works for me.
Comment 36•6 years ago
|
||
After looking more carefully into this bug I confirm that the bug is fixed on all Macs using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
Updated•6 years ago
|
Updated•6 years ago
|
Updated•3 years ago
|
Description
•