Closed Bug 1550771 Opened 7 months ago Closed 7 months ago

Deadlock in SharedMemoryBasic_mach → AV1 playback doesn't work in Firefox Nightly on macOS 10.14.4

Categories

(Core :: IPC, defect, P1)

68 Branch
Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox66 --- wontfix
firefox67 --- wontfix
firefox67.0.1 --- wontfix
firefox68 --- verified
firefox69 --- verified

People

(Reporter: beingalink, Assigned: haik)

References

(Regression)

Details

(Keywords: regression)

Crash Data

Attachments

(1 file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

Open https://www.singhkays.com/blog/its-time-replace-gifs-with-av1-video/

Actual results:

All videos play except for the av1 ones (only fixed picture is shown). All context menu items on the av1 video are greyed out except for "Loop".

Expected results:

The av1 video should play.

Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core

Single videos appear to play back fine by themselves:
https://www.singhkays.com/blog/its-time-replace-gifs-with-av1-video/media/scene2/av1_200k.mp4

So presumably it's an issue with multiple concurrent decodes?

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(achronop)
Priority: -- → P2

This looks like an RDD issue. I don't get the error when I disable RDD. I get the following warnings when I visit that link (I am building with MacOSX SDK 10.11):

[Child 75554, RemVidChild] WARNING: short timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 597
[RDD 75557, RemVidParent] WARNING: short timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 597
[Child 75546, Main Thread] WARNING: No active window: file /Users/achronopoulos/repos/mozilla/firefox/js/xpconnect/src/XPCJSContext.cpp, line 662
[Child 75554, RemVidChild] WARNING: long timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 601
[RDD 75557, RemVidParent] WARNING: long timeout didn't get an id (ipc/rcv) timed out 10004003
: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 601
[RDD 75557, Chrome_ChildThread] WARNING: pipe error: Broken pipe: file /Users/achronopoulos/repos/mozilla/firefox/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 727
[Child 75554, RemVidChild] WARNING: Serials do not match up: 89 vs 93: file /Users/achronopoulos/repos/mozilla/firefox/ipc/glue/SharedMemoryBasic_mach.mm, line 613
[Child 75554, MediaPlayback #1] WARNING: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaFormatReader.cpp, line 1657
[Child 75554, MediaPlayback #1] WARNING: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaFormatReader.cpp, line 1657
[Child 75554, MediaPlayback #4] WARNING: Decoder=1666e7800 Decode error: NS_ERROR_DOM_MEDIA_DECODE_ERR (0x806e0004): file /Users/achronopoulos/repos/mozilla/firefox/dom/media/MediaDecoderStateMachine.cpp, line 3307
Flags: needinfo?(achronop)
Flags: needinfo?(mfroman)

I tried this under Linux, and it doesn't happen. It does happen for me under macOS 10.14.4. I have some logging in place locally, and I can see RDD process decoding av1 on that page before the warnings. Those are all Shmem related warnings, which is a bit interesting. I'll add some more logging to see if we're failing to create the shmem to pass back the decoded frames.

Flags: needinfo?(mfroman)

I see similar things. It only happens on OSX for me too. Also I see that we create three av1 decoders. Some decoding is taking place for the first and the second decoder when the third one is created and everything stops. By everything stops I mean RemoteDecoderParent::RecvInput stop getting new input.

I believe everything stops because RDD is stopped waiting for AllocShmem to return during the long timeout in file /Users/mfroman/mozilla/moz-central/ipc/glue/SharedMemoryBasic_mach.mm, line 601

We are definitely hitting the error case in RemoteVideoDecoder.cpp here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteVideoDecoder.cpp#191

CreateSharedMemory is failing here[1], but I don't know why yet.

[1] https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#748

Jed, do you have any thoughts as to why we'd start failing to alloc Shmem midway through a working av1 decode, failing on the long timeout here[1]?

[1] https://searchfox.org/mozilla-central/source/ipc/glue/SharedMemoryBasic_mach.mm#601

Flags: needinfo?(jld)

So the way this works is, if process A is sending a shared memory handle to process B, then B should have a thread dedicated to handling these messages from A, waiting in PortServerThread. Communication between child processes is set up through the parent process; see ForwardGetPortsMessage. What I'd do if I were trying to debug this is find the thread that's supposed to be answering and see if it's blocked somewhere.

That Serials do not match up: 89 vs 93 error is also a little odd — it looks like the message thread eventually woke up and answered one of the requests that had timed out? It's possible that this protocol will permanently lose sync in that case, if it's always n>0 messages behind the reply it's expecting.

I'm also a little suspicious of this non-atomic static; if the compiler rematerializes my_serial as serial - 1 after the serial++, which I believe it's allowed to do, there's a race condition which could cause spurious serial mismatches, but I don't think that would explain the timeout. That code was added in bug 1205164.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #9)

I'm also a little suspicious of this non-atomic static; if the compiler rematerializes my_serial as serial - 1 after the serial++, which I believe it's allowed to do, there's a race condition which could cause spurious serial mismatches, but I don't think that would explain the timeout. That code was added in bug 1205164.

I tried making the non-atomic static into an atomic, like:
static mozilla::Atomic<uint64_t> serial(0);

That modification did not change the behavior. We still see the timeouts.

I can reproduce this, and with some help from LLDB I think I know what's going on here:

  1. The content and RDD processes try to ShareToProcess to each other at the same time
  2. Both processes take their gMutex.
  3. Both processes send to each other.
  4. Both processes' PortServerThreads for each other wake up, receive the message, and block on their gMutexes.
  5. Deadlock.
  6. Later, the ShareToProcesses time out and release the lock, allowing the messages to finally be processed.
  7. Replies are sent, and then received by the next (wrong) ShareToProcess. Everything is now broken.
Component: Audio/Video: Playback → IPC
Summary: av1 playback doesn't work in firefox nightly on macOS 10.14.4 → Deadlock in SharedMemoryBasic_mach → AV1 playback doesn't work in Firefox Nightly on macOS 10.14.4

(Obviously AV1 support wasn't regressed 4 years ago, because it didn't exist then, but I believe the underlying bug is that old.)

\o/ Thank you for tracking that down!

I'd been adding logging to see what else I could find. I'd just gotten to the point of knowing that all three decoders were decoding at least some av1 content, but I had not figured out why we stalled.

Is this something I can change from our usage or is this something that must be changed in the shared memory implementation?

Flags: needinfo?(jld)
Flags: needinfo?(haftandilian)
Crash Signature: [@ shutdownhang | mozilla::ipc::SharedMemoryBasic::ShareToProcess ]

(In reply to Michael Froman [:mjf] from comment #13)

Is this something I can change from our usage or is this something that must be changed in the shared memory implementation?

Theoretically, it might be possible to work around this by only ever sending shared memory in one direction. See also GMP, which arranges for all shared memory to be allocated in the parent process by using intr messages (the only use of that feature outside of NPAPI, I think) to request creation, combined with its own caching layer. So that would probably be more work than just fixing this properly.

On that topic, there are two ways to deal with this:

  1. Go ahead with bug 1465669 and use POSIX shm instead. That's a major change and I wouldn't want to try uplifting it, but I have a patch that used to work, and I'm in the process of un-bit-rotting it so that we at least have it as an option.

  2. Use finer-grained locking in SharedMemoryBasic_mach so that sending share requests doesn't block handling incoming ones. I think that should be doable (the two directions already have separate sets of Mach ports), but there's a bunch of process-global state and I haven't gone through all the code in detail. Haik is going to take a look at that.

Flags: needinfo?(jld)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #15)

Maybe this is related: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20mozilla%3A%3Aipc%3A%3ASharedMemoryBasic%3A%3AShareToProcess&date=%3E%3D2019-05-10T00%3A22%3A00.000Z&date=%3C2019-05-17T00%3A22%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_columns=url&_columns=email&_sort=-date&page=1#summary

This signature first started showing up in 68, in the 20190417214729 build.

I'm not so sure about that: I don't see any PortServerThreads blocked on a mutex. I think we can also end up blocked in ShareToProcess like that if the other process already exited, but there are a lot of things that happen during shutdown and I don't know if they can happen out of order enough to cause that.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #16)

  1. Use finer-grained locking in SharedMemoryBasic_mach so that sending share requests doesn't block handling incoming ones. I think that should be doable (the two directions already have separate sets of Mach ports), but there's a bunch of process-global state and I haven't gone through all the code in detail. Haik is going to take a look at that.

Thanks for the writeup. The analysis was really helpful. I've been looking at the locking today and have some ideas. I'm still familiarizing myself with the code, but I noticed that in SharedMemoryBasic::ShareToProcess() we hold gMutex while we call SendMessage() and then WaitForMessage(), but even with the locking, there doesn't seem to be a guarantee that the message received with WaitForMessage() is the reply to the message that was sent because (for example) there could be an existing message ready to be read before SendMessage() is called. And PortServerThread isn't holding gMutex when it calls WaitForMessage() so we have two threads racing calling WaitForMessage() in parallel. If PortServerThread() is given the incoming message, and there are no subsequent messages, ShareToProcess() would block indefinitely while holding gMutex.

Ignoring that problem, I think we could replace gMutex with a RW lock. The ShareToProcess and PortServerThread code paths could use the reader lock. More analysis needed to determine if the additional locks (one each for sending and receiving) would be needed here.

Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)
Priority: P2 → P1

(In reply to Haik Aftandilian [:haik] from comment #18)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #16)
Thanks for the writeup. The analysis was really helpful. I've been looking at the locking today and have some ideas. I'm still familiarizing myself with the code, but I noticed that in SharedMemoryBasic::ShareToProcess() we hold gMutex while we call SendMessage() and then WaitForMessage(), but even with the locking, there doesn't seem to be a guarantee that the message received with WaitForMessage() is the reply to the message that was sent because (for example) there could be an existing message ready to be read before SendMessage() is called. And PortServerThread isn't holding gMutex when it calls WaitForMessage() so we have two threads racing calling WaitForMessage() in parallel. If PortServerThread() is given the incoming message, and there are no subsequent messages, ShareToProcess() would block indefinitely while holding gMutex.

Jed explained how we use a different set of ports in PortServerThread (setup in GetMemoryPortsForPid() and SetupMachMemory()) so the ordering issue I described above is not a problem.

(In reply to Haik Aftandilian [:haik] from comment #18)

Ignoring that problem, I think we could replace gMutex with a RW lock. The ShareToProcess and PortServerThread code paths could use the reader lock. More analysis needed to determine if the additional locks (one each for sending and receiving) would be needed here.

In PortServerThread(), we acquire gMutex after returning from WaitForMessage() and before calling HandleSharePortsMessage() to handle the kSharePortsMsg message which causes the deadlock. For HandleSharePortsMessage(), it appears we don't need to hold gMutex. If that's the case, not holding gMutex during HandleSharePortsMessage() would be a simple fix. I'm looking into this before trying a more complex locking scheme.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b73fb1440e55546ea8a06ba2b9b42b832381c431

Don't hold gMutex when calling HandleSharePortsMessage() from PortServerThread to avoid deadlock.

Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aba5bfdd772b
Deadlock in SharedMemoryBasic_mach triggered by AV1 playback r=jld
Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

Please request uplift to beta if and when you're comfortable doing so. Thanks!

Flags: qe-verify+
Flags: needinfo?(haftandilian)

Comment on attachment 9065827 [details]
Bug 1550771 - Deadlock in SharedMemoryBasic_mach triggered by AV1 playback r?jld

Beta/Release Uplift Approval Request

  • User impact if declined: On Mac systems, users might experience browser hangs, tab spinners, and tab crashes on sites that have more than one AV1 video decoding streams.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: See description. The page listed in the description reliably causes a hang on macOS. Sometimes the page appears to work as expected, but switching to a new tab and then returning to the original page triggers the problem.
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The change is in a risky area, but the change is small in scope.
  • String changes made/needed: None
Flags: needinfo?(haftandilian)
Attachment #9065827 - Flags: approval-mozilla-beta?

For some additional info on Comment 25:

  1. I have verified this fix in Nightly (2019-05-23) using the reporter's STR.
  2. This can also affect macOS in non-media situations with browser hangs and spinners, although much less often.

Just to confirm as I reported the original bug, the site with the av1 videos works fine now in nightly.

QA Whiteboard: [qa-triaged]

Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?

Flags: needinfo?(haftandilian)

Comment on attachment 9065827 [details]
Bug 1550771 - Deadlock in SharedMemoryBasic_mach triggered by AV1 playback r?jld

macos deadlock fix, approved for 68.0b5

Attachment #9065827 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Hani Yacoub from comment #28)

Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?

This bug fix only changed code used on macOS so it should not have affected Linux playback.

The videos should play on Ubuntu 18.04, but for some you may need to install extra video codec packages first. On an 18.04 system I tested, all the video types appear to play and that system has libavcodec-extra and chromium-codecs-ffmpeg-extra installed. You can test AV1 alone on https://demo.bitmovin.com/public/firefox/av1/

@mjf, Michael, do you know which codec packages are recommended to be installed on Ubuntu 18.04?

Flags: needinfo?(haftandilian) → needinfo?(mfroman)

(In reply to Haik Aftandilian [:haik] from comment #30)

(In reply to Hani Yacoub from comment #28)

Verified as fixed on Firefox Nightly 69.0a1 (2019-05-23) on Mac OS X 10.14.
I observed that non of the videos were played on Ubuntu 18.04 x64. Is this intended?

@mjf, Michael, do you know which codec packages are recommended to be installed on Ubuntu 18.04?

Poking around on google a bit, I found this link:
https://askubuntu.com/questions/171026/how-to-install-all-existing-media-codecs

tl;dr sudo apt-get install ubuntu-restricted-extras

For more info on that package:
https://packages.ubuntu.com/bionic/ubuntu-restricted-extras

Anecdotally, I don't remember doing anything special at all for my desktop Ubuntu 18.04 install, and those videos play for me. However, on a recent Ubuntu 18.04 VM I built (using VMWare's easy install feature), I can see they don't play until installing libavcodec-extra (which is included in ubuntu-restricted-extras). I suppose it is possible that something else I installed on my desktop had libavcodec-extra as a dependency.

Flags: needinfo?(mfroman)

The issue is not fixed on MacBook pro using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
The issue is fixed on iMac using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.

Shouldn't this be fixed on both MacBook pro and on iMac?

Flags: needinfo?(haftandilian)

(In reply to Hani Yacoub from comment #33)

The issue is not fixed on MacBook pro using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.
The issue is fixed on iMac using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.

Shouldn't this be fixed on both MacBook pro and on iMac?

Yes, it should be fixed on all Macs. Please double check you have 68.05b and then there are some debugging steps you could take to confirm what's happening.

Flags: needinfo?(haftandilian)

My initial Beta download was 68.04b and it did not work. I did the update to 68.05b, and it works correctly for me.

So to confirm, MacBook Pro w/ 68.05b works for me.

After looking more carefully into this bug I confirm that the bug is fixed on all Macs using Firefox 68.0b5 on Mac OS X 10.14 and on Mac OS X 10.13.

Status: RESOLVED → VERIFIED
OS: Unspecified → macOS
You need to log in before you can comment on or make changes to this bug.