Crash in [@ mozilla::MediaDecoderStateMachine::OnAudioPopped]

NEW
Assigned to

Status

()

defect
P2
critical
3 months ago
2 months ago

People

(Reporter: marcia, Assigned: bryce)

Tracking

({crash, leave-open, regression})

Trunk
Unspecified
macOS
Points:
---

Firefox Tracking Flags

(firefox66 wontfix, firefox67 unaffected, firefox68 fix-optional)

Details

(crash signature)

Attachments

(1 attachment)

This bug is for crash report bp-f10c2589-c955-4101-b874-738a70190320.

MacOS crash visible in nightly, but may go as far back as FF 65: https://bit.ly/2UYGwR5

The few URLs are https://hangouts.google.com/ and https://www.youtube.com/watch?v=m3sV7daQRls

Top 10 frames of crashing thread:

0 XUL mozilla::MediaDecoderStateMachine::OnAudioPopped clang/include/c++/v1/algorithm:721
1 XUL mozilla::detail::RunnableMethodImpl<mozilla::detail::Listener<RefPtr<mozilla::AudioData> >*, void  xpcom/threads/nsThreadUtils.h:1122
2 XUL mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run xpcom/threads/TaskDispatcher.h:197
3 XUL mozilla::TaskQueue::Runner::Run xpcom/threads/TaskQueue.cpp:199
4 XUL nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:241
5 XUL non-virtual thunk to nsThreadPool::Run xpcom/threads/nsThreadPool.cpp
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1179
7 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:482
8 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:333
9 XUL nsThread::ThreadFunc ipc/chromium/src/base/message_loop.cc:315

Looks like we could be dereferencing a nullptr here. We shouldn't end up with null data in the audio queue, but perhaps there's an edge case we're missing.

Alastor, any idea on what could be going on here?

Flags: needinfo?(alwu)
Priority: -- → P2

Keep NI, will check it later.

As we've ensured that the samples which are pushed into media queue would always be not a nullptr [1], so I'm looking for if there is any chance the data was freed before popped out.

And I found that the ref-counting for the MediaData is kind of tricky. The AudioQueue holds a raw point to data [2], not a ref counting. So after pushing data into queue, it seems that the only place inside the media queue keeping data alive is here [3].

So after MediaEventListener notifies the event, which would hold the reference to data, there seems no any other places to keep this data alive? (if the data was not popped out during the push event)

In AudioSink case, it would pop out the data when it received the push event [5], which has kept the reference to data. So the strong reference is always kept between pushing and poping data.

However, in DecodedStream case, it won't pop data directly [6], it would pop the data until the DecodedStream::NotifyOutput is called [7]. So, who is keeping the reference between pushing and poping data in this case? I couldn't find it, but I'm sure there is one place to keep the reference, otherwise, DecodedStream could not be able to work (I just couldn't find it)

[1] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/MediaQueue.h#45-47
[2] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/MediaDecoderStateMachine.h#559
[3] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/MediaQueue.h#49
[4] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/mediasink/AudioSink.cpp#333
[5] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/mediasink/AudioSink.cpp#319
[6] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/mediasink/DecodedStream.cpp#539-583
[7] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/mediasink/DecodedStream.cpp#809

Assignee: nobody → alwu
Flags: needinfo?(alwu)

Here are some questions,
(1) Is it possible to release data before poping it out from media queue?

  • If so, it might be a possible root cause of this crash.

(2) Do you guys know who is keeping the data alive for the DecodedStream case?

(3) Why the AudioQueue and VideoQueue in MDSM only keep the raw pointer, not a strong reference for media data?


Hi, Bryce, Jya,
Do you guys have any idea about these questions?
Thank you!

Flags: needinfo?(jyavenard)
Flags: needinfo?(bvandyk)

Discussed a bit on IRC: I think this is handled by the queue manually adding refs here. This also leads to the patterns used when popping from the queue where a RefPtr is returned, but the reference count isn't changed as the queue already added a ref and is now passing that ownership via the returned ptr.

Flags: needinfo?(jyavenard)
Flags: needinfo?(bvandyk)

(In reply to Alastor Wu [:alwu] from comment #3)

As we've ensured that the samples which are pushed into media queue would always be not a nullptr [1], so I'm looking for if there is any chance the data was freed before popped out.

And I found that the ref-counting for the MediaData is kind of tricky. The AudioQueue holds a raw point to data [2], not a ref counting. So after pushing data into queue, it seems that the only place inside the media queue keeping data alive is here [3].

https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/dom/media/MediaQueue.h#46

When the data is held in the queue, AddRef is called on it. It may not be held in a RefPtr, but the refcount is incremented as it should.

MediaQueue override nsDequeue and provides a method of what occurred when the element is removed from the queue:
https://searchfox.org/mozilla-central/source/dom/media/MediaQueue.h#22

Which will call Release on the element, as such, MediaQueue doesn't need to use RefPtr, the handling is the same.

After some investigation, it seems we have almost no possible to release the sample before the MediaEventListener gets the notification.

In [1], we keep the ref count of the sample is 1 still. And then send it via runnable in order to notify all listeners, we would added another reference in the runnable by creating a new RefPtr [2]. That will be done after doing a type convertion, because the type we sent to MediaEventProducer::Notify() is RefPtr&, but it would be converted to RefPtr later by using Delay() in [4].

That means, when we pop out the sample from queue, we have 2 ref count on the sample. One is for the caller of PopFront(), who owns the orignal ref count, which was tranfered by calling forget() [5]. Another one is for the notify runnable which would ensure the sample alive until all listeners have received the notification.

[1] https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/dom/media/MediaQueue.h#59
[2] https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/ipc/chromium/src/base/task.h#293
[3] https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/dom/media/MediaEventSource.h#438
[4] https://searchfox.org/mozilla-central/rev/ d143f8ce30d1bcfee7a1227c27bf876a85f8cede/dom/media/MediaEventSource.h#139
[5] https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/dom/media/MediaQueue.h#63

We have MOZ_ASSERTs to guard against nulls at various points. If this case is rare enough, could we be missing it in assert enabled builds, and occasionally placing null data into our queue in release builds? Then we only see it when we do the bad deref?

We could use a diagnostic assert here to see if this is happening and to get a better idea what is giving us null data.

Adding leave-open so we can land such a patch and keep an eye on this.

Keywords: leave-open

We're rarely getting nullptrs out of MediaQueues. It's not clear where these are
coming from, as we have many guards against them. Upgrade this assert to a
diagnostic to help us track the source and determine if the value is null before
entering the queue.

Assignee: alwu → bvandyk
Pushed by bvandyk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b9bd90dd38bf
Use a diagnostic assert to check for nullptr when pushing into MediaQueue. r=alwu

Hi Bryce, is the diagnostic assert showing anything helpful here? (tracking for 68). Thanks!

Flags: needinfo?(bvandyk)

(In reply to Patricia Lawless from comment #12)

Hi Bryce, is the diagnostic assert showing anything helpful here? (tracking for 68). Thanks!

Kind of. I've been keeping an eye on crashes, and it looks like we're not hitting that assert. It's possible that we'd hit it with such a low rate that we need to keep watching it for the case where it happens. On the other hand, us not hitting the assert could suggest the data is being nulled after we push into the queue and that we've overlooked something above.

We've had two crashes with this sig since the assert was added, but they're both in old builds, and don't look like the null deref I suspect (I'm not sure what to make of these new ones yet). My inclination is to keep watching this and see if we get more crashes.

Flags: needinfo?(bvandyk)

Most recent crash against this signature is interesting. It is happening in a build that should have the diagnostic assert added above. This suggests that this is not a null being pushed into the MediaQueue. If this were the case the diagnostic assert should have fired before we could reach this later failure.

So we may need to take another look at if our sample could somehow be nulled while in the queue.

You need to log in before you can comment on or make changes to this bug.