Closed Bug 1273769 Opened 8 years ago Closed 7 years ago

Crash in mozilla::MP4Demuxer::~MP4Demuxer


(Core :: Audio/Video: Playback, defect, P1)




Tracking Status
firefox48 --- affected
firefox49 --- affected
firefox-esr45 --- affected
firefox50 --- affected
firefox51 --- affected


(Reporter: n.nethercote, Assigned: jya)



(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is report bp-09b9bf6c-82f3-455b-8af8-424a32160516.

Crashes with this signature have occurred 102 times in the past week, all on Mac OS X, in versions ranging from Firefox 42 to 49.

Here's a search giving all crashes since May 10th:

Looking at the more recent ones, some of them crash at address 0x0, and some crash at addresses like 0x101 or 0x10135.

Looking at the stack traces, the top frame sometimes points to the first line of the MP4Demuxer class, and sometimes to the |NS_INLINE_DECL_THREADSAFE_REFCOUNTING(Stream);| line inside Stream. My guess is that something is going wrong with the destruction and refcounting of this class.

jya, any ideas?
Flags: needinfo?(jyavenard)
In the few reports I've seen, whenever the crash occurred, the backtrace shows two threads accessing the same MP4 object (the Rust demuxer always appears in the stack trace).

Now of course, it could be a coincidence, and there may genuinely be multiple MP4/MSE instances in use at once. (In this one it's mp4parse_get_track_info.)

It's just surprising that every crash in 49 I looked at shows concurrent access: mozilla::MP4Demuxer::Init() on one thread and mozilla::MP4Demuxer::~MP4Demuxer() on the other.
Flags: needinfo?(jyavenard)
JW, you're very good at finding weird stuff like this... could you have a look?
Flags: needinfo?(jwwang)
Nathan, there's a long list of very weird crashes that have plagued media for a while, ones that never made any sense and only ever affect Mac.

I've never seen these types of crashes on Linux or Windows, despite the latter being by far where the majority of our users are.

The only logical explanation when I look at those crashes is that a data race occurred between two threads. But by design this should never happen, because everything runs via a series of runnables dispatched sequentially on a single common TaskQueue.

I've looked at the code over and over, and the above paradigm is never broken: by design, we can't have data races.
And yet, those crashes almost always show two threads playing with the same objects, or tasks running out of order, always on Mac...

It should be impossible for a TaskQueue/ThreadPool to run more than one task at a time.

I noticed this comment in TaskQueue.cpp

That we rely on unlocking/locking to apply memory barriers. 
Could it be that on Mac there's something slightly different about how monitor/memory barriers work?

It's a shot in the dark, but I can't think of anything else at this stage...
Flags: needinfo?(nfroyd)
(In reply to Jean-Yves Avenard [:jya] from comment #3)
> That we rely on unlocking/locking to apply memory barriers. 
> Could it be that on Mac there's something slightly different about how
> monitor/memory barriers work?

No.  TaskQueue relies on the locking primitives providing the guarantees that POSIX provides:

While it's possible that OS X doesn't provide those guarantees, it's extremely unlikely, and it'd be hard to write correct parallel programs without them. (Note that those guarantees don't apply to the Windows locking primitives, but again, it's hard to imagine how general locking primitives could do their job without providing POSIX-like guarantees.)

One unlikely possibility is that locking the mutex is actually returning an error for some reason, which means that you could easily have multiple threads inside a critical section, since NSPR's locking primitives don't really do anything for errors.  But then again, if you look at the errors that mutex locking and unlocking can return:

virtually none of them can happen, since we're not using recursive locks, and I really hope we're not attempting to lock the mutex on the same thread multiple times...
Flags: needinfo?(nfroyd)
I strongly suspect bug 1281295 (TSan: data race involving ~TrackBuffersManager() -> ShutdownDemuxers() on main thread) here too!

E.g., in one report, the main thread is in ~SourceBufferResource().
One of the members of TrackBuffersManager is 'RefPtr<SourceBufferResource> mCurrentInputBuffer'.
So I'm guessing TrackBuffersManager is being destroyed, including its mInputDemuxer (which should be destroyed right before mCurrentInputBuffer), which would explain the crash in ~MP4Demuxer in thread 39.

So I think we should get bug 1281295 fixed, and see how many other signatures benefit from it.
Depends on: 1281295
Actually, looking again, the runnable (that calls TrackBuffersManager::SegmentParserLoop() in the crashing thread) keeps an owning reference to the TrackBuffersManager, so I don't see how the last reference could have been released on the main thread.
Will need to investigate further...
Sorry: Bug 1281295 was a red herring, nothing to do with TrackBuffersManager.
Assignee: nobody → jyavenard
Crash volume for signature 'mozilla::MP4Demuxer::~MP4Demuxer':
 - nightly (version 51): 1 crash from 2016-08-01.
 - aurora  (version 50): 1 crash from 2016-08-01.
 - beta    (version 49): 8 crashes from 2016-08-02.
 - release (version 48): 53 crashes from 2016-07-25.
 - esr     (version 45): 21 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       1
 - aurora        0       0       0
 - beta          3       1       2
 - release      14       9       6
 - esr           1       1       1

Affected platform: Mac OS X

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora            #905
 - beta    #4117
 - release #798
 - esr     #1758
No crashes since bug 1315631 landed... will continue to monitor for one week; after that we can close as a duplicate (hopefully).
See Also: → 1315631
Per comment 9, keep monitoring.
Flags: needinfo?(jwwang)