shutdownhang in mozilla::MediaShutdownManager::Shutdown() with a thread in WMFMediaDataDecoder::Flush()

RESOLVED FIXED

Status

()

Core
Audio/Video: Playback
P1
critical
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: Robert Kaiser, Assigned: karlt)

Tracking

(Depends on: 1 bug, {crash})

Trunk
x86
Windows NT
crash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox37+ wontfix, firefox38+ wontfix, firefox39+ fixed, firefox40+ wontfix, firefox41 affected)

Details

(crash signature)

(Reporter)

Description

3 years ago
This bug was filed from the Socorro interface and is 
report bp-80eb894c-0602-4e2e-be37-13c412150128.
=============================================================

It seems that some of our shutdownhangs are in media shutdown code, we should see if we can get those fixed. Unsurprisingly, many different YouTube URLs are on those crash reports.

For stats and reports, please see https://crash-stats.mozilla.com/report/list?signature=shutdownhang%20%7C%20WaitForSingleObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_Wait%20%7C%20nsThread%3A%3AProcessNextEvent%28bool%2C%20bool%2A%29%20%7C%20NS_ProcessNextEvent%28nsIThread%2A%2C%20bool%29%20%7C%20mozilla%3A%3AMediaShutdownManager%3A%3AShutdown%28%29
(Reporter)

Updated

3 years ago
Summary: crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown() → shutdownhang in mozilla::MediaShutdownManager::Shutdown()
Blocks: 778617

Updated

3 years ago
Priority: -- → P1
I still see some crashes in crash-stats on Aurora - https://crash-stats.mozilla.com/report/index/aa6779b5-e05d-40fe-8fde-a75122150205 is one example. The most recent crashes in this stack are buildid=2015020700.
(Reporter)

Updated

3 years ago
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown()] [@ shutdownhang | ntdll.dll@0x3c6bc]
Tracking all MSE P1 bugs for Firefox 37.
status-firefox37: --- → affected
status-firefox38: --- → affected
tracking-firefox37: --- → +
Anthony - Can you please help find an owner for this bug?
Flags: needinfo?(ajones)
Assignee: nobody → karlt
Flags: needinfo?(ajones)
I looked at these crashes - most of the recent ones seem audio-related, and almost certainly fallout from padenot's stuff.

Comment 5

3 years ago
If that's the case, it's my understanding that all the stuff is now backed out so this should be gone. Kairo, Marcia can you check if we are still seeing any of these? I searched and couldn't find anything with a build id later than 2/16 but I might have missed one of the signatures.
(Assignee)

Updated

3 years ago
See Also: → bug 1135562
(Assignee)

Comment 7

3 years ago
Of the hangs in comment 6 not covered by bug 1135562, the majority seem to have threads in MediaTaskQueue::FlushAndDispatch().

e.g. https://crash-stats.mozilla.com/report/index/5836e867-b5e4-42a1-9e6c-f5dce2150221

That suggests that perhaps bug 1125970 may have an effect here.

There are also threads in WMFMediaDataDecoder::Flush(), and typically MFTDecoder::Output() and BaseOutputPin::GetDeliveryBuffer().

Those stacks are from non-MSE code paths.  The MSE code path in comment 0 also involved WMFMediaDataDecoder::Flush(), but FlushAndDispatch() was not on any stack.
Depends on: 1125970
(Assignee)

Updated

3 years ago
Summary: shutdownhang in mozilla::MediaShutdownManager::Shutdown() → shutdownhang in mozilla::MediaShutdownManager::Shutdown() with a thread in WMFMediaDataDecoder::Flush()
(In reply to Karl Tomlinson (:karlt) from comment #7)
> Those stacks are from non-MSE code paths.  The MSE code path in comment 0
> also involved WMFMediaDataDecoder::Flush(), but FlushAndDispatch() was not
> on any stack.

The first thing that WMFMediaDataDecoder::Flush does is invoke mTaskQueue->Flush(), so there's a good chance it's related. Note that one odd side-effect of flushing is that the runnable gets destroyed on a different thread than it would otherwise, so if the runnable is holding onto things that do interesting things in their destructor, there's a potential for trouble. Additionally, the decoder might be expecting its tasks to run, and get confused in ::Flush if those tasks disappeared.
(Assignee)

Comment 9

3 years ago
(In reply to Bobby Holley (:bholley) from comment #8)
> The first thing that WMFMediaDataDecoder::Flush does is invoke
> mTaskQueue->Flush(), so there's a good chance it's related.

Yes, thanks for your comments.

Related, but this task queue won't be directly affected by changes bug 1125970.

> Note that one
> odd side-effect of flushing is that the runnable gets destroyed on a
> different thread than it would otherwise, so if the runnable is holding onto
> things that do interesting things in their destructor, there's a potential
> for trouble.

Runnables on this task queue hold references to WMFMediaDataDecoder and
sometimes auto pointers to mp4_demuxer::MP4Sample.  Since the WMFMediaDataDecoder owns the task queue, and MediaDataDecoder has thread-safe ref counting, I think that should be fine.  I assume it is expected that MP4Sample can deleted on different threads.

> Additionally, the decoder might be expecting its tasks to run,
> and get confused in ::Flush if those tasks disappeared.

There are only a few different runnables queued on this task queue.  I'm looking at those and when they are called to see if I can identify a sequence that would be problematic.  A try run says we could run those tasks instead of emptying them, but I don't have evidence yet to indicate that would help.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc72e2711a5b
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4da9a9fb8648
(In reply to Karl Tomlinson (:karlt) from comment #9)
> There are only a few different runnables queued on this task queue.  I'm
> looking at those and when they are called to see if I can identify a
> sequence that would be problematic.  A try run says we could run those tasks
> instead of emptying them, but I don't have evidence yet to indicate that
> would help.

In general, cpearce wants to remove flushing from all the PDMs. It's possible that this would adversely affect latency in the WMF case, in which case we might need a different mechanism.
(Assignee)

Updated

3 years ago
Depends on: 1138229
During triage today, I was asked to report back on crash volume - this data is for the last 7 days:

The latest beta in FF 37 has 151 crashes. There were 231 crashes in B2 and 150 in B1.
Sheila - This is still marked as an MSE P1. It's not a high volume crash but is a shutdown crash. Is this still on the list of MSE bugs to fix in 37? If so, are we going to have a fix by Thursday for Beta 7?
Flags: needinfo?(smooney)
(Assignee)

Comment 13

3 years ago
I don't know the cause of these hangs, and so we have only speculative fixes and no real expectation of a fix by Thursday.

The patch uplifted from bug 1139380 may help, but that is speculation.

The number of similar but different sets of thread stacks here, suggests that either there are a number of different issues or the threads are busy processing media frames to drain their queues and taking a very long time to do so.

Something like the ResetDecode changes in https://hg.mozilla.org/mozilla-central/rev/5f2bef7e0b6c from bug 1135170 may help by flushing / dropping samples instead of processing the remaining queue.  However, that patch, as is, is too involved to be a candidate for uplift, and any reduced version would be speculative and not well tested.  Emptying queues should not take as long as our 60s shutdown timeout, so I suspect there is another undiagnosed confounding issue.
Depends on: 1139380
I reviewed this bug with the media team today. While this is a priority bug, it won't block the 37 release. Given that we're building the final 37 Beta on Thursday, I'm marking this as wontfix in 37.
status-firefox37: affected → wontfix
status-firefox39: --- → affected
tracking-firefox38: --- → +
tracking-firefox39: --- → +
Marking this for followup later in aurora and after Karl gets back.
Whiteboard: [followup-2015-04-20]
Karl, we are now past the middle of the beta cycle and  shutdownhang crashes are an important part of our crashes. Could you have a look asap? Thanks
Flags: needinfo?(karlt)
Whiteboard: [followup-2015-04-20]
(Assignee)

Updated

3 years ago
Depends on: 1155432
We need to check when bug 1155432 makes it to beta to see if it has moved the needle.
(Assignee)

Comment 18

3 years ago
This bug still exists on 38.0b7/20150424064750 which includes the changes from
bug 1155432.

These hang reports have a threads in different stages of
WMFVideoMFTManager::Output(), while WMFMediaDataDecoder::Flush() is Wait()ing
on the monitor condition variable.

https://crash-stats.mozilla.com/report/index/b901bff8-3d02-4fda-bcb8-dec762150427#allthreads
https://crash-stats.mozilla.com/report/index/09ccddd4-36fd-4633-b19c-e722e2150426#allthreads

Searching the threads for media/ identifies the threads involved.

A similar report for 39.0a2/20150417004243 indicates that changes from bug 1135170 have not fixed this. e.g.
https://crash-stats.mozilla.com/report/index/ae2376d6-48a5-4b76-9652-4da972150417#allthreads
No longer depends on: 1125970, 1138229, 1139380, 1155432
Flags: needinfo?(karlt)
Like 37, wontfix for 38.

Karl, any news on this for 39? Thanks
status-firefox38: affected → wontfix
Flags: needinfo?(karlt)
(Assignee)

Comment 20

3 years ago
I filed bug 1161402 in case that gives us some clues as to what is happening.
Bug 1159456 may help with some of these, but not most.

I'm going to have a look into what info might be worth recording on the stacks, to extract for clues.
Depends on: 1161402
Flags: needinfo?(karlt)
(Assignee)

Updated

3 years ago
Depends on: 1162364
(Assignee)

Updated

3 years ago
Depends on: 1161892
(Assignee)

Updated

3 years ago
See Also: → bug 1166107
Flags: needinfo?(smooney)
Karl, I know it may be hard to tell where this is at right now because we don't have a lot of crash data for 39 (yet), but do you think that bug 1159456 fixed this issue?
Flags: needinfo?(karlt)
(Assignee)

Comment 22

3 years ago
I don't know whether or not fixing bug 1159456 had much impact.  It was only going to fix some of the hangs here anyway.

I think fixing bug 1161892 made some difference, because that eliminated one class of hangs, but there were still plenty of hangs matching the description here.

I'm looking out for changes from fixing bug 1162364, but need more crash data from aurora builds newer than 20150529004007 to make any analysis yet.
Flags: needinfo?(karlt)
From discussion above and with k17e it sounds like this may become more clear as some of the other shutdown hangs get resolved. So we are waiting-and-seeing for a bit longer.
There are crashes from 39.0b2, but none from 39.0b3, 4, or 5. Seems like a good sign! 

I'm going to mark this as fixed; Karl, if you think too hasty, please comment and change the flag back to "affected" for 39.
status-firefox39: affected → fixed
status-firefox40: --- → affected
status-firefox41: --- → ?
tracking-firefox40: --- → +
Flags: needinfo?(karlt)
status-firefox40: affected → unaffected
status-firefox41: ? → unaffected
tracking-firefox40: + → ---
(Assignee)

Updated

3 years ago
See Also: → bug 1174592
(Assignee)

Comment 25

3 years ago
I see some hangs in 39.0b3 which may be addressed by bug 1162364, which is fixed in beta 5.
https://crash-stats.mozilla.com/report/index/593db919-cb94-4d84-9f6d-617732150608#allthreads
https://crash-stats.mozilla.com/report/index/2ecd41d1-d552-4db5-b5eb-a19442150608#allthreads

There are also a number of MediaShutdownManager::Shutdown() hangs where there is no media code in the stack traces of the other threads, but that would not be what is tracked in this bug.

e.g. https://crash-stats.mozilla.com/report/index/f4e1f4a5-bf56-4dcb-b427-967192150608#allthreads

I can't explain the lack of MediaShutdownManager::Shutdown() reports on b4 and b5, because the complete lack of these signatures would imply that several different bugs have been fixed concurrently.  I suspect that is unlikely.

However, we do seem to have a much lower relative incidence of this bug, and it may now be less important than other MediaShutdownManager::Shutdown() hangs.

I suspect there may still be something to fix in this bug, but I'm short on ideas and I don't think we have time to address what remains of this bug before 39 release anyway.
(Assignee)

Updated

3 years ago
Flags: needinfo?(karlt)
Tracking this for 40 just in case and to follow up once 40 moves to beta.
status-firefox40: unaffected → ?
tracking-firefox40: --- → +
Anthony, as I mentioned in my email, I do see this shutdown hang on FF40.0b4 on crash-stats page. Could you please help find an owner so we can address this in time for FF40 release?
Flags: needinfo?(ajones)
(Assignee)

Comment 28

3 years ago
Looking at reports from 40.0b4, the variation in stack traces seems to have
reduced considerably.

The common situation is now that one thread is in WaitForSingleObject from
CCoreH264Decoder::GetDisplayFrame() of msmpeg2vdec.dll.  This is usually from
MFTDecoder::Output() msmpeg2vdec.dll, but I saw one instance from
MFTDecoder::Flush().

A typical report is in [1].

The only other threads with msmpeg2vdec.dll on the stack are in
WaitForMultipleObjects from ThreadedDecode.

I don't know what these are waiting for, but deadlock seems a possibility.

In 40.0b4 this hang is about 3 time more common than bug 1135562.
On 41.0a2, however, bug 1135562 seems to be about 4 times more common than
this hang.  I don't have an explanation for that.

342 hangs with the signatures of these bugs on 331 installations implies that
very few installations are seeing these hangs more than once.  The data is
similar on 39.0.

[1] https://crash-stats.mozilla.com/report/index/851f59f2-bc42-45f4-b2da-28ce42150719#allthreads
(Assignee)

Comment 29

3 years ago
Bug 662444 would be the sure way to fix this.
Depends on: 662444
(Assignee)

Comment 30

3 years ago
Ritu, I don't see this signature in the top 50 on release, beta, or aurora.
Do you have data indicating that this bug needs to be prioritized?
Flags: needinfo?(rkothari)
Karl, the only reason this bug came up in my search is because we (RelMan) are following up on FF40 tracked bugs and pinging devs on bugs that have been unchanged for a while. 

I don't see this crash signature in top 25 on beta channel. Based on that, do we want to mark status-firefox40: "fixed" instead of "affected"?
Flags: needinfo?(rkothari)

Updated

3 years ago
status-firefox40: ? → affected
(Reporter)

Comment 32

3 years ago
(In reply to Ritu Kothari (:ritu) from comment #31)
> I don't see this crash signature in top 25 on beta channel. Based on that,
> do we want to mark status-firefox40: "fixed" instead of "affected"?

It's not fixed if no fix has landed and it's still happening at low volume, but it may make sense to flip it to "wontfix" for that train if we don't find it important enough to fix.
We've fixed a lot of things properly in 42 so it is a better release target for further efforts to reduce crashes/hangs. If this is low enough volume I'd rather not try to tackle it right now.
Flags: needinfo?(ajones)
Can we untrack this?
I have marked 40 as wontfix and not tracked 41+ as there is a low incidence of occurrence for the associated crash report.
status-firefox40: affected → wontfix
status-firefox41: unaffected → affected
Component: Audio/Video → Audio/Video: Playback
this can't be a MSE issue
No longer blocks: 778617

Updated

3 years ago
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown()] [@ shutdownhang | ntdll.dll@0x3c6bc] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown()] [@ shutdownhang | ntdll.dll@0x3c6bc] [@ shutdow…
Not seeing this signature in 42 or later.
With no crashes in 28 days and 42 now in release I think we can close this bug.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
It happens again, see thread 42.
https://crash-stats.mozilla.com/report/index/39fd3300-7278-4661-a74b-11c162160517#allthreads

In addition, bug 1274498 is also stuck in MFTDecoder::Output, but it calls msmpeg2vdec.dll instead of MFH264Dec.dll.
Blocks: 1267752
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Also: → bug 1274498
mfh264dec.dll is the DLL in which the WMF H.264 decoder resides on Windows Vista. On Windows 7 and later, it's msmpeg2vdec.dll.
(In reply to Alastor Wu [:alwu] from comment #40)
> See thread 68.
> https://crash-stats.mozilla.com/report/index/c98046a7-2a43-4d63-8b53-
> 3c8ce2160513#allthreads

alwu: This is a Firefox 46 crash report. Are there any crashes in 47 or later? I don't see any in the Crash Signature URLs above. Maybe this is fixed.
Flags: needinfo?(alwu)
According the comment42, we didn't see the crash signature in FF47, it might be fixed.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago2 years ago
Flags: needinfo?(alwu)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.