Decoders should be drained whenever a gap is encountered playing MSE content

RESOLVED FIXED in Firefox 45



4 years ago
4 years ago


(Reporter: jya, Assigned: jya)


(Blocks 1 bug)

Dependency tree / graph

Firefox Tracking Flags

(firefox43 wontfix, firefox44 affected, firefox45+ fixed, firefox46 fixed)



(5 attachments, 6 obsolete attachments)

5.82 KB, patch
: review+
Details | Diff | Splinter Review
1.19 KB, patch
: review+
Details | Diff | Splinter Review
12.96 KB, patch
: review+
Details | Diff | Splinter Review
17.87 KB, patch
Details | Diff | Splinter Review
5.36 KB, patch
: review+
Details | Diff | Splinter Review
from bug 1194624 comment 41

"We can't at this stage be 100% compliant with the MSE buffer monitoring spec due to the underlying decoders which may hold frames.

The MSE specs define the status of readyState entirely based on the buffered range.

so readyState should be HAVE_CURRENT_DATA if our source buffer contained a single frame at position 0.
this is covered by the case "If HTMLMediaElement.buffered contains a TimeRange that ends at the current playback position and does not have a range covering the time immediately after the current position"

However, decoders hold frame, the WMF h264 decoder won't output anything until at least 25 frames have been input.
So the only way for us to have readyState == HAVE_CURRENT_DATA, is by having a buffered range containing at least 25 frames.

Same with HAVE_FUTURE_DATA, the spec require that we only have one single frame *buffered* past the currentTime to have readyState = HAVE_FUTURE_DATA. But on windows this will never occur until we have at least 26 frames buffered.

All this indicates that to be compliant with the spec, the buffered range must represent frames we *can* play.

The only way to achieve this is by draining the decoders when we reach any gap in the media data range. So that every single frame contained in the buffered range is playable.
Unfortunately this is something we have considered several times in the past but decided that we wouldn't do it."

To be compliant with the MSE spec, ultimately every frames reported in the buffered range must be playable, including frames located prior a data gap.

The only way to achieve this is draining the decoder upon reaching a gap or discontinuity and when resuming seeking backward automatically and dropping frames until the previous point.

The major issue here is ensuring that the data required to perform a seek backward will never get evicted and hold them or "pinned" them somehow
When seeking, the next keyframe timet would always be set to the seek time (as the next sample to be retrieved would be a keyframe). This could lead to the next key frame logic to be activated too aggressively.
Attachment #8696342 - Flags: review?(gsquelart)
Assignee: nobody → jyavenard
When seeking, the next keyframe time would always be set to the seek time (as the next sample to be retrieved would be a keyframe). This could lead to the next key frame logic to be activated too aggressively.

refOr() executes its argument obviously, which here is a bad thing !
Attachment #8696344 - Flags: review?(gsquelart)
When seeking, the next keyframe time would always be set to the seek time (as the next sample to be retrieved would be a keyframe). This could lead to the next key frame logic to be activated too aggressively.

Jamais deux sans trois
Attachment #8696345 - Flags: review+
Attachment #8696342 - Attachment is obsolete: true
Attachment #8696342 - Flags: review?(gsquelart)
Attachment #8696344 - Attachment is obsolete: true
Attachment #8696344 - Flags: review?(gsquelart)
This allows for all buffered frames to be playable.
Attachment #8696747 - Flags: review?(cpearce)
Comment on attachment 8696748 [details] [diff] [review]
P3. Update mochitests and add new one verifying behavior.

change waiting time value to be more explicit.

Interestingly, when there's an audio stream currentTime is always constanst: that is buffered.end(buffered.length-1) - 1/30

while if it's video only, it's buffered.end(buffered.length-1) - 1/30 +- 40ms
Attachment #8696748 - Attachment is obsolete: true
Attachment #8696748 - Flags: review?(cpearce)
try run:

I'll submit a P4 which is a MDSM change that makes it behave with WAITING_FOR_DATA the same was  as EOS does: stop pre-roll and force displaying the frame.
This allows to start displaying frames even when we have less than 5 frames.
Attachment #8696960 - Flags: review?(cpearce)
Attachment #8696961 - Flags: review?(cpearce)
Blocks: 1191138
Comment on attachment 8696961 [details] [diff] [review]
P5. Update mochitest to use only 3 frames.

on win7 try run, the currentTime varies to greatly to be meaningfull when you have only 3 frames. so no point testing it
Attachment #8696961 - Attachment is obsolete: true
Attachment #8696961 - Flags: review?(cpearce)
Attachment #8696941 - Attachment is obsolete: true
Attachment #8696941 - Flags: review?(cpearce)

All green even with the now re-enabled win7 tests.
Comment on attachment 8696747 [details] [diff] [review]
P2. Drain decoder when encountering gap.

Review of attachment 8696747 [details] [diff] [review]:

::: dom/media/MediaFormatReader.cpp
@@ +1046,5 @@
>        }
>        while (decoder.mOutput.Length()) {
>          RefPtr<MediaData> output = decoder.mOutput[0];
>          decoder.mOutput.RemoveElementAt(0);
>          decoder.mSizeOfQueue -= 1;

Just as an aside, can you use decoder.mOutput.size() instead of maintaining decoder.mSizeOfQueue?

@@ +1066,5 @@
> +    }
> +  }
> +  if (decoder.HasPromise() && decoder.mOutput.IsEmpty()) {
> +    if (decoder.mDrainComplete) {
> +      bool wasdraining = decoder.mDraining;

Nit: wasDraining
Attachment #8696747 - Flags: review?(cpearce) → review+
Comment on attachment 8697157 [details] [diff] [review]
P3. Update mochitests and add new one verifying behavior.

Review of attachment 8697157 [details] [diff] [review]:

::: dom/media/mediasource/test/test_DrainOnMissingData_mp4.html
@@ +27,5 @@
> +      info("Invoking play()");
> +      var promises = [];
> +      promises.push(once(el, 'playing'));
> +;
> +      return Promise.all(promises);

Can't you just "return once(el, 'playing')" here instead of creating an array of 1 element and calling Promises.all() on it?
Attachment #8697157 - Flags: review?(cpearce) → review+
Comment on attachment 8696960 [details] [diff] [review]
P4. Stop pre-rolling when encountering WAITING_FOR_DATA.

Review of attachment 8696960 [details] [diff] [review]:

::: dom/media/MediaDecoderStateMachine.cpp
@@ +782,5 @@
>               [self] (WaitForDataRejectValue aRejection) -> void {
>                 self->WaitRequestRef(aRejection.mType).Complete();
>               }));
> +    if (isAudio) {

You need a comment explaining *why* we're doing this.
Attachment #8696960 - Flags: review?(cpearce) → review+
Turns out that a relatively slow day when one person is sheriffing for 17 hours is not the time to slip in a patch that causes intermittent failures. Backed out in for ASan e10s timeouts in test_playback.html,, Mac timeouts in test_BufferingWait_mp4.html,, and Win8 failures in test_WaitingToEndedTransition_mp4.html,
All green try was posted above. 

there can't be any relation to the timeout in test_playback.html as all changes are limited to MSE
Yeah, given the frequency of the Mac and Win8 failures while I was retriggering, I wouldn't have expected you to see them on try - they're both probably around 5%, an awkward rate that's too much when you are looking at a day worth of m-i runs, but is hard to hit without extreme triggering on a try push.

Since I'm not burdened with even the knowledge of what MSE *is*, all I know about test_playback.html is what I saw: had 6 instances in 77 runs on your push, and none in 51 on the push below, but I'm happy to try another 49 on that push to make it an even hundred.
I believe that I've found the cause.
Asan builds are particularly slow, it may happen that NotifyDataArrived was being called after we had entered mDemuxEOS which would reset the mDemuxEOS flag ; causing the internal seek to kick in and we wouldn't then resolve the data promise with EOS => time out
JW, something I'm trying to get my head around is this.

From time to time, the test dom/media/mediasource/test/test_BufferingWait_mp4.html will timeout.
The reason for this is that it's waiting for the waiting event to be fired a 2.41s - 1/30. The logic for that value is that buffered range ends at 2.41s, less the duration of a video frame.
That particular MSE has an audio and video track.
The last audio sample has a start time of 2.368435s
The last video sample has a start time of 2.465000s

Locally, I managed to reproduce it (albeit after about 300 runs). Yet the waiting event was fired at 2.368434s. That value has been consistent across all try failures where that mochitest failed:
like: (OSX) (win10)

always 2.368434s.

Why not 2.368435s, and actually seeing that we have audio until 2.414875s, why isn't the waiting event fired at 2.414875s?
Flags: needinfo?(jwwang)
MediaDecoderStateMachine::MaybeStartBuffering() is called when a sample is popped, not when the sample is consumed by the underlying audio engine. So it is possible for 'waiting' to fire before the last sample is drained.

Why does waitUntilTime() need to check |v.currentTime >= targetTime|? It appear to me that the promise should be resolved once 'waiting' is received.
Flags: needinfo?(jwwang)
(In reply to JW Wang [:jwwang] from comment #24)
> Why does waitUntilTime() need to check |v.currentTime >= targetTime|? It
> appear to me that the promise should be resolved once 'waiting' is received.

This is mostly to deal with multiple waiting being fired. Like if play() starts before any data is buffered, a waiting event will be fired immediately.

So waiting for a particular waiting event at a given time threshold was the most convenient way to ignore the waiting events we do not care about.
JW, looking at this try run:

The win8 error, are all about "###!!! ASSERTION: We should have got duration already: 'mState > DECODER_STATE_DECODING_METADATA', file c:/builds/moz2_slave/try-w64-d-00000000000000000000/build/src/dom/media/MediaDecoderStateMachine.cpp, line 1497 "

This seems to happen when seek starts there:

I can't explain how this could ever happen, seeing that loadedmetadata has already been fired.

Got any ideas on what's going on?
Flags: needinfo?(jwwang)
I guess the state machine went dormant and mState became DECODER_STATE_DECODING_NONE.
Flags: needinfo?(jwwang)
hmmm.. isn't dormant enabled after at least 60s idle and if the video is not visible?

If we can't seek on a video that is dormant simply because it's been idle for a short time, how come we aren't seeing that bug more often ???
the MediaDecoder does appear to turn off dormant when seeking

However, it seems that MediaDecoder::UpdateDormantState() will have no effect if the video element considers it as inactive.

Still dormant shouldn't have kicked in before 60s...
ok.. can reproduce... if I set media.decoder.heuristic.dormant.timeout to 1ms only , seeks never complete.

Will take this in a separate bug
actually, no.. was a one off.. damn

I noticed the test case timed out before hitting the assertion. I guess the media element went dormant while the test runner was removing the page and our test script was still running. (INFO TEST-UNEXPECTED-FAIL | dom/media/mediasource/test/test_PlayEvents.html | got canplaythrough event - Result logged after SimpleTest.finish())
ah ok.. so it is a red herring then..

so I need to find out what's causing the timeout there... oh joy...
JW, you can obviously read the logs better than I do.

Can you tell what's causing the timeout? which step of the script?

I'm starting to think I'll just temporarily disable it because I don't want this to get in the way of this work.
These are the messages before timeout:
02:08:13     INFO -  82 INFO SourceBuffer buffered ranges grew from TimeRanges: [0, 3.203333)[6, 6.801666) to TimeRanges: [0, 3.203333)[6, 7.601666)
02:08:13     INFO -  83 INFO TEST-PASS | dom/media/mediasource/test/test_PlayEvents.html | got seeked and canplay event
02:08:13     INFO -  84 INFO TEST-PASS | dom/media/mediasource/test/test_PlayEvents.html | seeked to 6s
02:08:13     INFO -  85 INFO TEST-PASS | dom/media/mediasource/test/test_PlayEvents.html | readyState is HAVE_FUTURE_DATA
02:08:13     INFO -  86 INFO TEST-PASS | dom/media/mediasource/test/test_PlayEvents.html | fetchWithXHR load uri='bipbop/bipbop_video3.m4s' status=200
02:08:13     INFO -  87 INFO TEST-PASS | dom/media/mediasource/test/test_PlayEvents.html | fetchWithXHR load uri='bipbop/bipbop_video4.m4s' status=200
02:08:13     INFO -  88 INFO Loading buffer: [0, 23429)
02:08:13     INFO -  89 INFO SourceBuffer buffered ranges grew from TimeRanges: [0, 3.203333)[6, 7.601666) to TimeRanges: [0, 3.203333)[6, 8.403333)
02:08:13     INFO -  90 INFO Loading buffer: [0, 22446)
02:08:13     INFO -  91 INFO SourceBuffer buffered ranges grew from TimeRanges: [0, 3.203333)[6, 8.403333) to TimeRanges: [0, 3.203333)[6, 9.203333)
02:08:13     INFO -  92 INFO got stalled event
02:08:13     INFO -  93 INFO TEST-UNEXPECTED-FAIL | dom/media/mediasource/test/test_PlayEvents.html | Test timed out.

corresponding to

It looks like 'canplaythrough' didn't come.
I've managed to reproduce it consistently in a win8 VM with high core count and lots of RAM (8 cores, 24GB)
I'm still puzzled on why canplaythrough arrives but only after timing out
it does appear to be dormant related though:
dormant disabled: : all green
with dormant enabled: : timeouts
This allows for all buffered frames to be playable.

This is a significant rewrite from the first patch.
The current patches actually exposed a race problem that was there all along, because the patch delayed some actions, the consequence of the race was more obvious.

It could be the cause of a few intermittent in particular on slow machines.

The asan intermittent timeout was due to NotifyDataArrived being called well after we had detected an end of stream. So seeing that new data is available (the buffered range got extended) it clears the EOS flag and is waiting on the MDSM to call Request*Data again, call that will never come.

The other intermittent error (particularly obvious on windows 8) was due that under some circumstances the cached buffered range wouldn't be immediately refreshed. The MediaDecoder, right after calling NotifyDataArrived will query the buffered range. If the buffered range has changed then the HTMLMediaElement is signaled to refresh the readyState value. It was possible that the buffered range wasn't updated in time; so GetBuffered() would get an old time ranges which would prevent the StateMirror to kick in. And as such readyState wouldn't be modified and no event (canplay/canplaythrough) would be fired.

Finally, I also noticed a case where the MediaFormatReader could attempt to decode continuously in a loop even if WAITING_FOR_DATA was encountered.
We would only start to drop frames once we had a request for data. On platforms such as windows WMF, following a drain we would typically hold over 33 decoded frames which would have been held resulting in unnecessarily high memory usage.
This also results in a simplified logic for the following steps.
Attachment #8699647 - Flags: review?(cpearce)
Attachment #8696747 - Attachment is obsolete: true
Comment on attachment 8699647 [details] [diff] [review]
P5. Drop frames during internal seeking early.

Review of attachment 8699647 [details] [diff] [review]:

::: dom/media/MediaFormatReader.cpp
@@ +1042,5 @@
>    AbstractMediaDecoder::AutoNotifyDecoded a(mDecoder);
> +  // Drop any frames found prior our internal seek target.
> +  while (decoder.mTimeThreshold && decoder.mOutput.Length()) {
> +    auto& output = decoder.mOutput[0];

Your code would be easier (and faster!) to read and review if you used the actual type for your variables here instead of 'auto'. For example, you're assigning target = decoder.mTimeThreshold.ref(), so does that mean target is a reference to mTimeThreshold?

Please only use 'auto' where it improves readability; for example in heavily templated code. Then it'd be easier for me to r+ it faster.
Attachment #8699647 - Flags: review?(cpearce) → review+
Ok.. I see why we timeout like on run like this:

When playback encounters the end of the data (but mediasource isn't ended, so we're just supposed to stall), we drain the decoder and then the MDSM calls RequestVideoData repetitively which are almost immediately resolved.
28 frames were returned when the decoder was drained. We get 28 RequestVideoData. The 29th one that should be coming to reject the promise with: WAITING_FOR_DATA never comes.

Here are the log

JW, what do you think? do you have an explanation on why the MDSM stops requesting video ?

The only thing I can think of is that the audio track is 3.900952s long while the video track is 4.005s long and from time to time, depending on how currentTime progress it doesn't attempt to decode past the audio stream.

The difference with before is that we would have returned WAITING_FOR_DATA for the video track at 3.136666s (4.05 - 28 frames) so the MDSM would have always required more data.
Flags: needinfo?(jwwang)
Video frames are popped when current playback position (which is report by audio sink) passes the time of video frames. So MDSM stops requesting video when video queue is ample.
Flags: needinfo?(jwwang)
Isn't that a bug though?

It means that depending on timing we will never fire ended if the video track is slightly longer than the audio track (which can be rather common)
No, it is already handled by audio sink. When audio track reaches the end, audio sink will report system time instead which will keep advancing.
well, it doesn't appear that way (I opened bug 1233639)
See Also: → 1233639
The current problem happens when there is a gap in the stream, right? It is not about EOS.
not quite.

The gap is temporary and the MediaFormatReader will return WAITING_FOR_DATA once the MDSM request more than what's available.

Unfortunately, the MDSM never requests more data.

If you look at the log you see that it calls RequestAudioData -> which is rejected with WAITING_FOR_DATA
it then calls RequestVideoData until it's right pass the audio and stop there.
I then end the mediasource which causes the MediaFormatReader to resolve the audio waiting promise and the MDSM calls RequestAudioData which is now rejected with EOS.

But the request for more video never resumes (it's neither in WAITING_FOR_DATA mode nor END_OF_STREAM)
it seems that the MDSM doesn't properly handle the transition buffering -> completed
try run:

that's with the  dom/media/mediasource/test/test_WaitingToEndedTransition_mp4.html mochitest modified to ensure the video track is slightly smaller than the audio track so it never triggers bug 1233639
[Tracking Requested - why for this release]:
Two side bugs were discovered in the process of implementing this feature. All fixed. Worth spreading I believe
Jean-Yves, are you planning to fill an uplift request to 45? If you do, could you do that asap? Thanks
Flags: needinfo?(jyavenard)
Comment on attachment 8696345 [details] [diff] [review]
[MSE] P1. Ensure next random access point properly calculated after seek.

Approval is for all patches

Approval Request Comment
[Feature/regressing bug #]: 1229987
[User impact if declined]: YouTube stalls, some events not fired, intermittent mochitest timeout
[Describe test coverage new/current, TreeHerder]: In central for over a week, extra mochitest added
[Risks and why]: Low. We're doing what we used to do before the new MSE was introduced in 42.
[String/UUID change made/needed]: None
Flags: needinfo?(jyavenard)
Attachment #8696345 - Flags: approval-mozilla-aurora?
Attachment #8696345 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8696960 - Flags: approval-mozilla-aurora+
Attachment #8697157 - Flags: approval-mozilla-aurora+
Attachment #8699459 - Flags: approval-mozilla-aurora+
Attachment #8699647 - Flags: approval-mozilla-aurora+
Blocks: 1234742
You need to log in before you can comment on or make changes to this bug.