Closed Bug 1540748 Opened 5 months ago Closed 4 months ago

Crash in [@ mozilla::CheckedInt<T>::value] via MediaDecoderStateMachine::GetDecodedAudioDuration

Categories

(Core :: Audio/Video: Playback, defect, P1, critical)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 + fixed

People

(Reporter: mccr8, Assigned: alwu)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [geckoview:fenix:m5][gvtv:p1][media-q2])

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug is for crash report bp-2a830582-a8d1-4c28-97f2-1cabf0190401.

Top 10 frames of crashing thread:

0 libxul.so mozilla::CheckedInt<long long>::value const mfbt/CheckedInt.h:535
1 libxul.so mozilla::MediaDecoderStateMachine::GetDecodedAudioDuration dom/media/TimeUnits.h:88
2 libxul.so mozilla::MediaDecoderStateMachine::HaveEnoughDecodedAudio dom/media/MediaDecoderStateMachine.cpp:2731
3 libxul.so mozilla::MediaDecoderStateMachine::DecodingState::DispatchDecodeTasksIfNeeded dom/media/MediaDecoderStateMachine.cpp:2352
4 libxul.so mozilla::MediaDecoderStateMachine::DecodingState::HandleAudioDecoded dom/media/MediaDecoderStateMachine.cpp:569
5 libxul.so mozilla::MozPromise<RefPtr<mozilla::AudioData>, mozilla::MediaResult, true>::ThenValue<mozilla::MediaDecoderStateMachine::RequestAudioData dom/media/MediaDecoderStateMachine.cpp:3078
6 libxul.so mozilla::MozPromise<RefPtr<mozilla::AudioData>, mozilla::MediaResult, true>::ThenValueBase::ResolveOrRejectRunnable::Run 
7 libxul.so mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run xpcom/threads/TaskDispatcher.h:197
8 libxul.so mozilla::TaskQueue::Runner::Run xpcom/threads/TaskQueue.cpp:199
9 libxul.so nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:244

Alex made integer overflow checks into release asserts, and this is one of the crashes. It is Android-only.

Rank: 9
Priority: -- → P1
Whiteboard: [geckoview:fenix:m4][gvtv:p1]

This CheckedInt assertion failure is a top crash for Fennec 68 Nightly.

Whiteboard: [geckoview:fenix:m4][gvtv:p1] → [geckoview:fenix:m4][gvtv:p1][media-q2]

Alastor could you please have a look at this?

Flags: needinfo?(alwu)

Alastor can you please check if bugs 1540746, 1540744 and 1540740 can all be solved via TimeUnits.h or if each needs an individual patch a level above TimeUnits.h.

It seems to me that we have to submit individual patches for those bugs.

Assignee: nobody → alwu
Flags: needinfo?(alwu)

This one might be fixed by bug1534993, because it looks like we getting overflow in AudioSink::GetPosition().

I just landed the bug1534993, we can see whether it can fix this crash as well.

This crash still happens [1] after landing bug1534993, will have a patch for it later.

[1] https://crash-stats.mozilla.com/report/index/4566e0f5-53e5-439a-b875-c7b8b0190414

As the crash happened in [1], the possible root cause is that we have improper endTime, which is super large positive value, and the startTime is negative (it's possible). As we've ensured that endTime is always larger than startTime, so a positive value substracting a negative value is the only possible situation to cause a overflow.

[1] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/dom/media/MediaQueue.h#120

The description in comment8 was not quite right. As we would do the time adjustment in ReaderProxy [1], the audio sample's mTime should be adjusted to be larger than 0, unless the first sample's start time is less than the start time we got from metadata. So usually the start and end time of audio sample are both larger than zero.

If this crash were caused by negative start time (which might cause overflow if we substract it from a super large positive end time). We should force that all sample's start time in MediaQueue should be larger than zero.

[1] https://searchfox.org/mozilla-central/rev/d302c3058330a57f238be4062fddea629311ce66/dom/media/ReaderProxy.cpp#46

The purpose of substracting sample's time by the start time, which we got from the metadata, is to ensure that the start time of the first sample can start from 0 and align all following samples to the correct position.

However, if the first sample's time is earlier than the metadata's start time, the calculated result would be negative, which is not we want.

In this case, we should update the start time and manually adjust first sample's time to 0.

(In reply to Alastor Wu [:alwu] from comment #8)

As the crash happened in [1], the possible root cause is that we have improper endTime, which is super large positive value, and the startTime is negative (it's possible). As we've ensured that endTime is always larger than startTime, so a positive value substracting a negative value is the only possible situation to cause a overflow.

[1] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/dom/media/MediaQueue.h#120

There's another issue that will occur now however.

the buffered range is calculated by the demuxer, so now the buffered range reported will be negative. If we were to adjust the start time offset, really we need to keep two start time, one to use with the buffered range, and one used for the time values.

(In reply to Jean-Yves Avenard [:jya] from comment #11)

the buffered range is calculated by the demuxer, so now the buffered range reported will be negative. If we were to adjust the start time offset, really we need to keep two start time, one to use with the buffered range, and one used for the time values.

Sorry, I didn't understand this part, did we adjust the buffered range? or you mean we should adjust buffered range as well?

Flags: needinfo?(jyavenard)

(In reply to Jean-Yves Avenard [:jya] from comment #11)

There's another issue that will occur now however.

the buffered range is calculated by the demuxer, so now the buffered range reported will be negative. If we were to adjust the start time offset, really we need to keep two start time, one to use with the buffered range, and one used for the time values.

I guess that the adjustment for buffered range you mentioned is here [1]?
[1] https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/dom/media/MediaFormatReader.cpp#2947

Flags: needinfo?(jyavenard)
Attachment #9059416 - Attachment description: Bug 1540748 - the start time of the first sample should always be zero or positive. → Bug 1540748 - part1 : Android decoder should throw an error when the decoded sample's time is smaller than the time of first demuxed sample.

As the purpose of adjusting time is to make the start of the playback to align to zero, so the result should not be negative.

Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/27bcf73686ed
part1 : Android decoder should throw an error when the decoded sample's time is smaller than the time of first demuxed sample. r=jolin
https://hg.mozilla.org/integration/autoland/rev/d8cb636260a0
part2 : adjusted time should always be non-negative. r=jolin
Backout by shindli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b9a625eff9e3
Backed out 2 changesets for causing perma assertion failure  !mTime.IsNegative(), at /builds/worker/workspace/build/src/dom/media/MediaData.h:303 CLOSED TREE

It's possible that the time of the first demuxed sample is different from the time of the first decoded sample.

This would happen when the first video frame is B frame, which might have larger PTS than other frames.

That means it's not the actually first frame for presentation, so we should update the start time after we get the time from both the first decoded audio and video sample to ensure the we won't hit the assertion in patch2.

We hit the assertion in patch2 when running the crash test 1389304.html [1], because the first demuxed video frame of the video in that test is not the first frame for presentation, which causes the result of adjustment being negative. Therefore, we have to adjust our start time manually after we get the first decoded sample from both audio and video.

Will ask for a review if there is no any problem on the try result [2].

[1] https://searchfox.org/mozilla-central/source/dom/media/test/crashtests/1389304.html
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=a51efb3c7bde8102944da0877a293772f2f6697e

I found that the crash test (1389304.html) has different decoded result in different platforms. On OSX, the first frame decoder returned has smallest timestamp, but it hasn't on Linux and Windows. So even I added patch3, which is used to adjust the start time by checking the first decoded sample, we would still hit the assertion in media data.

So it seems to me that we can't always guarantee the first decoded frame has smallest timestamp, so the assertion I added in patch2 actually is not a good assumption, because it's not about logic, it depends on file's content.

After discussed with jolin, I am going to remove the assertion in patch2, and print warning instead. Hoping patch1 is enough for fixing the crash.

Attachment #9060844 - Attachment is obsolete: true
Attachment #9059979 - Attachment description: Bug 1540748 - part2 : adjusted time should always be non-negative. → Bug 1540748 - part2 : add warning for negative adjusted time.
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8c9fc50e48c2
part1 : Android decoder should throw an error when the decoded sample's time is smaller than the time of first demuxed sample. r=jolin
https://hg.mozilla.org/integration/autoland/rev/57b12599afe2
part2 : add warning for negative adjusted time. r=jolin
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Depends on: 1547604
Regressions: 1547436
Status: RESOLVED → REOPENED
Flags: needinfo?(alwu)
Resolution: FIXED → ---
Target Milestone: mozilla68 → ---

It seems to me that Android decoder would return output after the flush() is called, where mFirstDemuxedSampleTime would be reset.

From Android medio codec document [1],

Note, however, that there may still be outstanding onOutputBufferAvailable callbacks that were not handled prior to calling flush. The indices returned via these callbacks also become invalid upon calling flush and should be discarded.

It's possible that Android decoder returns sample in this situation, so we should discard sample if mFirstDemuxedSampleTime contains nothing.

[1] https://developer.android.com/reference/android/media/MediaCodec#flush()

Flags: needinfo?(alwu)
Whiteboard: [geckoview:fenix:m4][gvtv:p1][media-q2] → [geckoview:fenix:m5][gvtv:p1][media-q2]

There functions won't change any internal variables, so they should be marked as const.

Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8a834fa8a29f
part0 : mark functions const. r=jya
https://hg.mozilla.org/integration/autoland/rev/3835a514150a
part1 : Android decoder should throw an error when the decoded sample's time is smaller than the time of first demuxed sample. r=jolin,jya
https://hg.mozilla.org/integration/autoland/rev/7a04d9e59754
part2 : add warning for negative adjusted time. r=jolin
Status: REOPENED → RESOLVED
Closed: 4 months ago4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Regressions: 1567421
Regressions: 1565838
You need to log in before you can comment on or make changes to this bug.