
Mp4/h.264 video stuttering with Media Source Extensions

RESOLVED INVALID

Status


Core
Audio/Video: Playback
RESOLVED INVALID
a year ago
a year ago

People

(Reporter: Snowflake, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

a year ago
User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0
Build ID: 20160623154057

Steps to reproduce:

In JavaScript, we feed the media source smoothly with a new 30 fps MP4 H.264 frame, quite precisely every 1/30 second.
The video stream is created with the equivalent of ffmpeg using "-movflags empty_moov+default_base_moof+frag_keyframe".
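
For illustration, a minimal sketch of the feeding loop (the WebSocket URL and the codec string below are placeholders, not our actual player code):

const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', () => {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  const ws = new WebSocket('wss://example.invalid/live');  // placeholder endpoint
  ws.binaryType = 'arraybuffer';
  const queue = [];

  ws.onmessage = (e) => {
    // One fragmented-MP4 moof+mdat fragment arrives roughly every 1/30 s.
    queue.push(e.data);
    pump();
  };
  sb.addEventListener('updateend', pump);

  function pump() {
    // appendBuffer() must not be called while the previous append is pending.
    if (!sb.updating && queue.length > 0) {
      sb.appendBuffer(queue.shift());
    }
  }
});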


Actual results:

a) We want low latency, but cannot get rid of the video element's buffering/delay of at least 1/2 second.
b) The longer the video plays, the more it stutters, even though we are still feeding the media source quite smoothly.

Contact me via email to get a URL to a non-public video player which receives live video data from one of our servers. By simply opening that URL you should be able to see the effect.


Expected results:

Smooth video display with close-to-zero latency.

Comment 1

a year ago
how far apart are the keyframes?
(Reporter)

Comment 2

a year ago
(In reply to Jean-Yves Avenard [:jya] from comment #1)
> how far apart are the keyframes?

Give me a sec
(Reporter)

Comment 3

a year ago
(In reply to Jean-Yves Avenard [:jya] from comment #1)
> how far apart are the keyframes?

A single keyframe at the beginning, then P-frames only.

Comment 4

a year ago
You mean a single keyframe ever? Ouch.

It's no wonder you're seeing more and more pauses as you move forward.
Because the decoder attempts to decode frames ahead of time, and you're feeding the sourcebuffer slowly, it will often reach the end of the buffered data once decoding has caught up. At that point the decoder is drained so that it can output all frames already received.
As a drained decoder is no longer able to process more frames until a keyframe is seen, we have to seek back to the previous keyframe and start decoding again in a loop until we reach the point where it paused.

If the distance to the first keyframe keeps getting bigger, the time required to seek will grow over time. I am not surprised you're seeing pauses (and CPU usage would likely go very high if no hardware decoder is present).

We could, I guess, perform this attempt to output as many frames as possible only if we've seen multiple keyframes in the media.

But in any case, having a single keyframe in a source buffer will not work well with MSE, as it then becomes impossible to evict any data from the source buffer.
When the sourcebuffer reaches its maximum size (currently 100 MiB), and no data can be evicted, you will get an out-of-memory error.
As the eviction strategy is to evict frames already played, the first frame (which is the keyframe) would get evicted, causing, as per spec, all following frames that depend on it to also be evicted. So you end up with an empty source buffer which has to be repopulated.

Your encoding is not compatible with MSE as MSE is designed.
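
To make the eviction point concrete, a hypothetical sketch (reusing the sb sourcebuffer and video element from the reporter's setup, which are assumptions): with periodic keyframes, already-played data can be removed so the sourcebuffer stays bounded; with a single keyframe, the same removal would evict everything.

// Hypothetical sketch: keep only the last ~10 seconds of played data.
// Per the coded frame removal algorithm this only frees memory if a later
// random access point (keyframe) remains in the buffer; with a single
// keyframe, removing it forces every dependent frame out as well.
setInterval(() => {
  const keepSeconds = 10;
  if (!sb.updating && video.currentTime > keepSeconds) {
    sb.remove(0, video.currentTime - keepSeconds);
  }
}, 5000);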
(Reporter)

Comment 5

a year ago
I see.

"As a drained decoder is no longer able to process more frames until a keyframe is seen"
-> is this a limitation of the WMF?

So we cannot avoid emitting a key frame every N frames?
What value of N do you recommend?

Unfortunately emitting key frames increases the bandwidth quickly -> N should be as high as possible.
And doesn't this also cause a delay of N frames, because of the seeking algorithm you have described -> N should be 1 or as low as possible, right?

Isn't there a way without any seeking (only stalling instead) and working with P-frames only, therefore no need for keeping the old frames anyway ( -> being fine with MSE)?
(Reporter)

Comment 6

a year ago
Background:
Our use-case is personalized low-latency live streaming (for interactive applications like video chatting or cloud gaming) 
-> no need for any explicit seeking ever 
-> I would expect no need for non-P-frames (except the first frame).
(Reporter)

Comment 7

a year ago
Btw, a simple non-browser-based player prototype built on the ffmpeg libraries, which simply decodes and renders the same live incoming H.264 data stream as fast as possible, works quite well: nearly no delay, and stuttering only as much as the incoming network packets.
So ideally the same behavior is also possible with the video tag / MSE :-)

Updated

a year ago
Component: Audio/Video → Audio/Video: MediaStreamGraph

Updated

a year ago
Component: Audio/Video: MediaStreamGraph → Audio/Video: Playback

Comment 8

a year ago
(In reply to Folker from comment #5)
> I see.
> 
> "As a drained decoder is no longer able to process more frames until a
> keyframe is seen"
> -> is this a limitation of the WMF?

almost all decoders behave the same. They can't resume decoding from the last frame drained.
Some will enter into an error mode if the following frame isn't a keyframe, or they will just drop the following frames.

> 
> So we cannot avoid emitting a key frame every N frames?
> What value of N do you recommend?

You could follow the HTTP Live Streaming suggestions. IIRC, one keyframe every 120 frames is what Apple recommends.

> 
> Unfortunately emitting key frames increases the bandwidth quickly -> N
> should be as high as possible.
> And doesn't this also cause an delay of N frames because of the seeking
> algorithm you have described -> N should be 1 or as low as possible, right?

The delay will only be seen if decoding catches up with the end of the buffered range. In practice we see no noticeable pauses with live streams as provided by the HLS.js or DASH.js players.
But that's typically because they append data by chunks of 5s or more at a time, and playback is typically behind live by several seconds.

Appending one frame at a time is also processed extremely inefficiently by our MSE stack; the overhead would be huge. (Also make sure that the frames aren't all tagged as keyframes.)
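
As an illustration of appending in larger chunks, a hypothetical sketch (the sb sourcebuffer and the 64 kB threshold are assumptions; batching obviously trades some latency for lower per-append overhead):

// Hypothetical sketch: coalesce incoming fragments and issue one
// appendBuffer() per batch instead of one per frame.
let pending = [];
let pendingBytes = 0;

function onFragment(buf) {            // buf: ArrayBuffer with one moof+mdat
  pending.push(new Uint8Array(buf));
  pendingBytes += buf.byteLength;
  if (pendingBytes >= 64 * 1024) flush();   // arbitrary threshold, or use a timer
}

function flush() {
  if (sb.updating || pending.length === 0) return;
  const merged = new Uint8Array(pendingBytes);
  let offset = 0;
  for (const chunk of pending) {
    merged.set(chunk, offset);
    offset += chunk.byteLength;
  }
  pending = [];
  pendingBytes = 0;
  sb.appendBuffer(merged);
}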

> 
> Isn't there a way without any seeking (only stalling instead) and working
> with P-frames only, therefore no need for keeping the old frames anyway ( ->
> being fine with MSE)?

With only one keyframe, you lose the ability for the MSE sourcebuffer to run the Prepare Append Algorithm (https://w3c.github.io/media-source/index.html#sourcebuffer-prepare-append) and the Coded Frame Eviction Algorithm (https://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-eviction). Once you reach the maximum size a sourcebuffer can contain, playback will break.
I'm guessing you'll see the same issue with other browsers such as Chrome (which uses a slightly higher sourcebuffer size threshold but it's still not infinite).
The process of eviction is described here:
https://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-removal
"Remove all possible decoding dependencies on the coded frames removed in the previous step by removing all coded frames from this track buffer between those frames removed in the previous step and the next random access point after those removed frames.
"

Once a keyframe is removed, and the frames depending on it have been removed, then per step 3.5 of the coded frame removal algorithm ("set the need random access point flag on all track buffers to true"), the next packet you append *must* start with a keyframe.

So if you remove the first keyframe, every frame in the sourcebuffer will have to be evicted as well (as they all depend on that first keyframe).

MSE is designed to allow bitrate adaptation, which you don't do. It allows the JS to manage the buffered data, evicting ranges and the like.

For what you're intending to do, a streaming container such as Ogg or WebM would be much more appropriate.

What is certain is that our own MSE implementation wasn't optimised for your use case, and I doubt it will work well with any other web browser either.

I can try to add a small tweak that reduces how compliant we are with the MSE spec when it comes to playing all frames up to a buffered range gap, so no seeking is required, but that won't resolve the eviction issue.
(Reporter)

Comment 9

a year ago
Since we are using Nvidia NVENC on server-side, we are stuck with H.264, preventing Ogg and WebM. But what difference does the container format make, since the problems are caused by the MSE spec?

We also have problems with other browsers; e.g. Edge is much smoother, without stuttering, but has a much larger delay of multiple seconds.

Ok, are my following summary and conclusions correct?

Short-term:

1. a) Emitting a key-frame every 120 frames as workaround for the eviction issue. Unfortunately, while the additional keyframes may not matter for buffered non-low-latency VOD/live-stream players, they can easily cause stuttering in low-latency streaming, since a larger keyframe packet needs much longer to transmit. b) An alternative would be to violate the MSE spec in case of not seeing multiple keyframes by not evicting the whole buffer when reaching the internal limit. Isn't this an interesting option since following the MSE spec in that case makes no sense anyway?
2. Your small tweaks proposal would help avoiding the stuttering, right? This would be great!
3. a) All this does not help reducing the latency, right? Which currently is around 1/2 second, well beyond low-latency. b) Or the solution of using the MSE together with the code path currently used also for WebRTC, which you already indicated that you would be not happy with. (But nevertheless, that's somehow the frustrating part, that basically we only want to get the low-latency WebRTC is already providing, but with H.264 and without all the high-level WebRTC stuff. But I understand that you don't like your current WebRTC implementation and want to avoid going this road further down.)

Mid-term:
4. Add i) low-latency and ii) keyframe-less video streams support to the MSE specs. And consequently optimize the video tag implementation for that use-case. Btw, low-latency obviously inherently requires adding one frame at a time.

Long-term:
5. Is MSE the long-term way to go for low-latency?

Thanks for all the information and discussion, that's very useful!

Comment 10

a year ago
(In reply to Folker from comment #9)
> Since we are using Nvidia NVENC on server-side, we are stuck with H.264,
> preventing Ogg and WebM. But what difference does the container format make,
> since the problems are caused by the MSE spec?

You can also use a fragmented MP4 directly as your source attribute.
See, MSE uses a container too. It's handled in the same fashion by our media architecture.

However, MSE works at a frame level. When you append a buffer to a sourcebuffer, the content is first demuxed and all demuxed frames are stored in a vector which can then be retrieved.
Working at a frame level allows very accurate manipulation of the content, such as removing a GOP from the stream. You can add overlapping data with a different resolution, and so on.
Because it works at a frame level, it must know the keyframes, and a frame is only ever kept in the sourcebuffer if it can be decoded.

Plain containers, however, work at a binary blob level. The content is only demuxed as it's needed. Data is downloaded into the media cache as needed, and when no longer used the binary blob is evicted from the media cache (which works with 32 kB blobs).

Last year we had a different MSE implementation that worked with binary blobs rather than demuxed frames, but you would have had the same issue with the inability to evict anything.
So while, due to the content you are appending, you must evict it all if using MSE, with a plain container that's one thing you don't need to worry about.
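
As a hypothetical illustration of the frame-level manipulation mentioned above (the URLs and the sb sourcebuffer are assumptions, not anything from this bug):

// Hypothetical sketch: overlap-append a lower-resolution rendition; the
// overlapping demuxed frames are replaced by the coded frame processing
// algorithm.
function appendOnce(sb, data) {
  return new Promise((resolve) => {
    sb.addEventListener('updateend', resolve, { once: true });
    sb.appendBuffer(data);
  });
}

async function switchToLowRes(sb) {
  const init = await (await fetch('/low/init.mp4')).arrayBuffer();
  const seg = await (await fetch('/low/segment.m4s')).arrayBuffer();
  await appendOnce(sb, init);  // new initialization segment (smaller resolution)
  await appendOnce(sb, seg);   // overlapping media segment replaces buffered frames
}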


> 1. a) Emitting a key-frame every 120 frames as workaround for the eviction
> issue. Unfortunately, while the additional keyframes may not matter for
> buffered non-low-latency VOD/live-stream players, they can easily cause
> stuttering in low-latency streaming, since a larger keyframe packet needs
> much longer to transmit.

MSE is designed to allow for adaptive streaming. On a low-bandwidth network you would typically drop the resolution, and then your keyframes would be much smaller.

> b) An alternative would be to violate the MSE spec
> in case of not seeing multiple keyframes by not evicting the whole buffer
> when reaching the internal limit. Isn't this an interesting option since
> following the MSE spec in that case makes no sense anyway?

Not really; again, I don't believe MSE is usable for the content you are appending. It's also much more complicated than it needs to be.


> 2. Your small tweaks proposal would help avoiding the stuttering, right?
> This would be great!

only temporarily as you would still run out of memory sooner or later.

> 3. a) All this does not help reducing the latency, right? Which currently is
> around 1/2 second, well beyond low-latency. b) Or the solution of using the
> MSE together with the code path currently used also for WebRTC, which you
> already indicated that you would be not happy with. (But nevertheless,
> that's somehow the frustrating part, that basically we only want to get the
> low-latency WebRTC is already providing, but with H.264 and without all the
> high-level WebRTC stuff. But I understand that you don't like your current
> WebRTC implementation and want to avoid going this road further down.)

You can use WebRTC with H.264!
When did I ever state I didn't like our WebRTC implementation?

Maybe WebRTC has a broadcast option. :jesup, is that something you could achieve with WebRTC - broadcasting a live program to an unlimited number of users?

> 
> Mid-term:
> 4. Add i) low-latency and ii) keyframe-less video streams support to the MSE
> specs. And consequently optimize the video tag implementation for that
> use-case. Btw, low-latency obviously inherently requires adding one frame at
> a time.
but then your data is bigger than it needs to be.

> 
> Long-term:
> 5. Is MSE the long-term way to go for low-latency?
> 
> Thanks for all the information and discussion, that's very useful!
Flags: needinfo?(rjesup)
(Reporter)

Comment 11

a year ago
(In reply to Jean-Yves Avenard [:jya] from comment #10)
> (In reply to Folker from comment #9)
> > Since we are using Nvidia NVENC on server-side, we are stuck with H.264,
> > preventing Ogg and WebM. But what difference does the container format make,
> > since the problems are caused by the MSE spec?
> 
> You can also use a fragmented MP4 directly as your source attribute.
> See, MSE uses a container too. It's handled in the same fashion by our
> media architecture.
> 
> However, MSE works at a frame level. When you append a buffer to a
> sourcebuffer, the content is first demuxed and all demuxed frames are stored
> in a vector which can then be retrieved.
> Working at a frame level allows very accurate manipulation of the content,
> such as removing a GOP from the stream. You can add overlapping data with a
> different resolution, and so on.
> Because it works at a frame level, it must know the keyframes, and a frame is
> only ever kept in the sourcebuffer if it can be decoded.
> 
> Plain containers, however, work at a binary blob level. The content is only
> demuxed as it's needed. Data is downloaded into the media cache as needed,
> and when no longer used the binary blob is evicted from the media cache
> (which works with 32 kB blobs).
> 
> Last year we had a different MSE implementation that worked with binary
> blobs rather than demuxed frames, but you would have had the same issue with
> the inability to evict anything.
> So while, due to the content you are appending, you must evict it all if
> using MSE, with a plain container that's one thing you don't need to worry
> about.

By plain container you mean putting the URL directly in the source attribute, not using MSE, right?

But good low latency inherently requires a push model, where the server can send a single new frame as soon as available to the client. Putting a URL in the source attribute is a pull model. While of course you can still try to optimize a pull model for low-latency to some degree, this approach seems to be fundamentally limited and flawed for low latency.

MSE allows pushing the data into the media source, which is fundamentally the right approach for low latency, even if the current spec and the current implementations are not (yet) designed for it.

> > 1. a) Emitting a key-frame every 120 frames as workaround for the eviction
> > issue. Unfortunately, while the additional keyframes may not matter for
> > buffered non-low-latency VOD/live-stream players, they can easily cause
> > stuttering in low-latency streaming, since a larger keyframe packet needs
> > much longer to transmit.
> 
> MSE is designed to allow for adaptive streaming. On a low-bandwidth network
> you would typically drop the resolution, and then your keyframes would be
> much smaller.
> 
> > b) An alternative would be to violate the MSE spec
> > in case of not seeing multiple keyframes by not evicting the whole buffer
> > when reaching the internal limit. Isn't this an interesting option since
> > following the MSE spec in that case makes no sense anyway?
> 
> Not really; again, I don't believe MSE is usable for the content you are
> appending. It's also much more complicated than it needs to be.

The response to https://github.com/w3c/media-source/issues/133 seems to indicate that the intention of MSE is in fact to also support low latency in the future. So even if MSE is not there yet, it may go in that direction, which seems good to me.

Does https://github.com/w3c/media-source/issues/21 mean that we can expect Firefox to optimize MSE also for low latency? Would be great!

> > 2. Your small tweaks proposal would help avoiding the stuttering, right?
> > This would be great!
> 
> only temporarily as you would still run out of memory sooner or later.

I mean in combination with emitting keyframes from time to time.

> > 3. a) All this does not help reducing the latency, right? Which currently is
> > around 1/2 second, well beyond low-latency. b) Or the solution of using the
> > MSE together with the code path currently used also for WebRTC, which you
> > already indicated that you would be not happy with. (But nevertheless,
> > that's somehow the frustrating part, that basically we only want to get the
> > low-latency WebRTC is already providing, but with H.264 and without all the
> > high-level WebRTC stuff. But I understand that you don't like your current
> > WebRTC implementation and want to avoid going this road further down.)
> 
> You can use WebRTC with H.264!
> When did I ever state I didn't like our WebRTC implementation?

I meant the OpenH264 decoder you mentioned, which my (mis)understanding was you would like to replace with the WMF code path as much as possible.

> Maybe WebRTC has a broadcast option. :jesup, is that something you could
> achieve with WebRTC - broadcasting a live program to an unlimited number of
> users?

Yes, in principle WebRTC is also an option for us. However, it is much too high-level for our use-case. WebRTC is primarily designed for video chatting via proxies etc., none of which we need. It would force us to unnecessarily implement all the WebRTC protocol stuff we don't need on the server side, only to get access to the low-latency video rendering code path.

So we would like to "buy" direct access to the obviously existing low-latency code path of the video tag, but without having to also "buy" the huge unnecessary WebRTC baggage. https://github.com/w3c/media-source/issues/21 makes me hope that MSE will be a good answer to that.

> > 
> > Mid-term:
> > 4. Add i) low-latency and ii) keyframe-less video streams support to the MSE
> > specs. And consequently optimize the video tag implementation for that
> > use-case. Btw, low-latency obviously inherently requires adding one frame at
> > a time.
> but then your data is bigger than it needs to be.

The H.264 data, which we are encoding in real time anyway, doesn't get bigger because of that. We obviously get some overhead from TCP and WebSocket packets, but this is the unavoidable price for low latency. Fortunately it is not that much anyway.

Comment 12

a year ago
(In reply to Folker from comment #11)
> 
> But good low latency inherently requires a push model, where the server can
> send a single new frame as soon as available to the client. Putting a URL
> in the source attribute is a pull model. While of course you can still try to
> optimize a pull model for low-latency to some degree, this approach seems to
> be fundamentally limited and flawed for low latency.
> 
> MSE allows pushing the data into the media source, which is fundamentally
> the right approach for low latency, even if the current spec and the current
> implementations are not (yet) designed for it.

you are fundamentally ignoring an important fact, and one I mentioned earlier.
As far as the playback code is concerned, be it MSE or a plain media, it's identical.

The player will retrieve the data when it needs it. It doesn't matter if you're pushing to the sourcebuffer; it won't be used until the player needs it.

Latency-wise, there is ZERO difference between how MSE behaves and traditional HTML5 video playback.

If you think that by pushing the data to a sourcebuffer you'll reduce latency: you're wrong.

I have provided solutions and feedback on various approaches you can use to address your problem. MSE doesn't fit in that picture with the kind of content you want to use.

If you want low latency, use WebRTC, that's what it's designed for.
If you want to not have to worry about keyframes and evictions, assign the src attribute to a fragmented mp4.
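
For completeness, a minimal sketch of the second suggestion (the URL is a placeholder for a progressively served fragmented MP4):

// Hypothetical sketch: no MSE and no JS buffer management; the media cache
// handles eviction of already-played data.
const video = document.querySelector('video');
video.src = 'https://example.invalid/live.mp4';   // placeholder fMP4 stream
video.play();
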
Status: UNCONFIRMED → RESOLVED
Last Resolved: a year ago
Flags: needinfo?(rjesup)
Resolution: --- → INVALID
(Reporter)

Comment 13

a year ago
(In reply to Jean-Yves Avenard [:jya] from comment #12)
> (In reply to Folker from comment #11)
> > 
> > But good low latency inherently requires a push model, where the server can
> > send a single new frame as soon as available to the client. Putting a URL
> > in the source attribute is a pull model. While of course you can still try to
> > optimize a pull model for low-latency to some degree, this approach seems to
> > be fundamentally limited and flawed for low latency.
> > 
> > MSE allows pushing the data into the media source, which is fundamentally
> > the right approach for low latency, even if the current spec and the current
> > implementations are not (yet) designed for it.
> 
> you are fundamentally ignoring an important fact, and one I mentioned
> earlier.
> As far as the playback code is concerned, be it MSE or a plain media, it's
> identical.
> 
> The player will retrieve the data when it needs it. It doesn't matter if
> you're pushing to the sourcebuffer; it won't be used until the player needs
> it.
> 
> Latency-wise, there is ZERO difference between how MSE behaves and
> traditional HTML5 video playback.
> 
> If you think that by pushing the data to a sourcebuffer you'll reduce
> latency: you're wrong.

I am aware of that. But it seems to me "only" a limitation of the current MSE design.

My understanding of https://github.com/w3c/media-source/issues/21 is that this may change and MSE may also be designed for low latency in the future. Yes, this can be considered a significant change/expansion of the original MSE intention. But it would be good, I think.

I am aware that this is not the case today, but hopefully it is not too far in the future, and maybe browser developers will adopt it soon. :-) It seems, for example, that the Chrome people are already working on a special low-latency code path, as I understand it.

> I have provided solutions and feedback on various approaches you can use to
> address your problem. MSE doesn't fit in that picture with the kind of
> content you want to use.
> 
> If you want low latency, use WebRTC, that's what it's designed for.
> If you want to not have to worry about keyframes and evictions, assign the
> src attribute to a fragmented mp4.

The question for our situation is whether one of these approaches is worth doing for us, since they have the fundamental disadvantages mentioned above, and - as long as my understanding of https://github.com/w3c/media-source/issues/21 is correct - MSE will solve this in a better way in the (hopefully not so distant) future.