MSE audio playback not gapless

RESOLVED FIXED in Firefox 67

Status


defect
RESOLVED FIXED
Opened: 4 years ago
Last modified: 3 months ago

People

(Reporter: tomerlahav, Assigned: jya)

Tracking

Version: 45 Branch
Target Milestone: mozilla67
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox67 fixed)

Details

Attachments

(1 attachment)

Reporter

Description

4 years ago
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20151105030433

Steps to reproduce:

I've set up a small webpage that reproduces the issue:
http://melatonin64.github.io/gapless-mse-audio/

This webpage uses MSE to append several audio files into a continuous audio stream.
The stream should be gapless, as we use SourceBuffer's timestampOffset, appendWindowStart and appendWindowEnd to discard padding.

In Firefox however, the audio does not play gaplessly.


Actual results:

Audible gaps in audio can be heard.


Expected results:

Audio should play gaplessly, as it does in Chrome.
Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core
Assignee

Comment 1

4 years ago
It appears to play in Chrome because it does things that are not per spec, and their implementation of timestampOffset is wrong.

First, your media segments aren't one second long; they are 1.068118s long and made of 2 moof+mdat pairs.

You then set the appendWindow to [0,1] and timestampOffset to -0.046440:

You then perform an appendBuffer of a media segment spanning [0, 1.068117).
Due to timestampOffset being -0.046440, this is as if you had appended [-0.046440, 1.021678) with an append window of [0, 1). So [-0.046440, 0) at the beginning is truncated (that's 2 samples), and [1, 1.021678) should be evicted; moreover, the sample covering the range [0.998458, 1.021678) falls under this condition of the spec:
http://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-processing
step 9:
"If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame."

and as such, is evicted.

So really, from your original [0, 1.06s) you're left with [0.046440s, 1.044898s) of it, and your SourceBuffer's buffered range is now [0.000000, 0.998458).

You then set the appendWindowStart to 1s, appendWindowEnd to 2s and timestampOffset to 0.953560s.

Rinse and repeat exactly as above, and your buffered range becomes [0.000000, 0.998458), [1.023219, 1.998457), etc.

This just can't play gaplessly, simply because you handle 1.06s segments as if they were 1s long, and you're evicting too much data from them. To not hear the gap, your samples must really be contiguous: here, exactly 1s long.
A workaround would be to set your appendWindowStart/appendWindowEnd in multiples of 1.068118s rather than 1s.

Now, we allow a leeway of +-1 sample on the append window; this is to let streams with poorly muxed content and a negative start time work (otherwise, especially with video, the first sample, typically a keyframe, would be evicted; with, say, a keyframe every 4s, that would cause a starting gap of 4s).

We could probably apply the fuzz only to the start time, not to appendWindowEnd.
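The suggested workaround (sizing the append window to the true segment duration instead of a nominal 1s, so no coded frame straddles appendWindowEnd) can be sketched as follows. This is a minimal illustration, not code from this bug; SEGMENT_DURATION and windowForSegment are my names.

```javascript
// Minimal sketch of the workaround: size the append window to the real
// segment duration so no coded frame crosses appendWindowEnd.
const SEGMENT_DURATION = 1.068118; // real duration of each media segment, in seconds

function windowForSegment(i) {
  return {
    appendWindowStart: i * SEGMENT_DURATION,
    appendWindowEnd: (i + 1) * SEGMENT_DURATION,
    timestampOffset: i * SEGMENT_DURATION, // each segment starts at 0 internally
  };
}

// In a page these values would configure the SourceBuffer before each append:
//   Object.assign(sourceBuffer, windowForSegment(i));
//   sourceBuffer.appendBuffer(segmentBytes);
console.log(windowForSegment(1));
```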
Assignee: nobody → jyavenard
Status: UNCONFIRMED → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID

Comment 2

4 years ago
Thanks for your extremely detailed reply, I appreciate it!

If I understand correctly, you're saying that it's possible to achieve gapless MSE playback on Firefox.
Do you have any examples of how I should go about it?

Let me tell you how I created these audio files, and maybe you'll be able to tell me where I went wrong.
I created a sine wave as uncompressed 16-bit PCM audio at 44.1kHz - 10 seconds of uncompressed audio.
I then sliced these 10 seconds into 10 segments of PCM data - exactly 44100 samples each (1 second).

I encoded the segments to AAC using fdk-aac.
Now, AAC uses blocks of 1024 samples each.
Specifically, fdk-aac adds two blocks of silence ahead of the actual audio data.
This accounts for 2048 samples, which at a 44100Hz sample rate is ~0.046440 seconds.
That's why I have to get rid of the first 0.046440 seconds of each file.

Now, AAC must also round up to a whole number of frames/blocks of data, i.e. a multiple of 1024 samples.
That's why junk audio is also added to the end of the file, which needs to be cut out as well.

The actual audio data that I care about (sine wave) is exactly 1 second long.
These 1 second (44100 samples) segments should be able to connect gaplessly to each other, provided that we get rid of the excess "junk" audio at the start and end of the encoded file.

This exact procedure seems to work on Chrome, but not on FF - would love to know what I should do differently in order to get the same result in FF as well.

Also - I'm not sure why the fact that my files have 2 moof+mdat pairs is relevant - these files comply with http://www.w3.org/2013/12/byte-stream-format-registry/isobmff-byte-stream-format.html.
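The trim arithmetic described in this comment can be checked with a short script. The frame size (1024 samples) and the fdk-aac priming delay (2 frames = 2048 samples) are taken from the comment; the function name is illustrative.

```javascript
// Checking the padding arithmetic from the comment above.
const SAMPLE_RATE = 44100;
const FRAME_SIZE = 1024;        // one AAC frame = 1024 PCM samples
const PRIMING = 2 * FRAME_SIZE; // encoder delay prepended by fdk-aac

function aacPadding(pcmSamples) {
  const total = pcmSamples + PRIMING;
  const frames = Math.ceil(total / FRAME_SIZE); // padded up to whole frames
  return {
    leadingTrimSec: PRIMING / SAMPLE_RATE,           // silence to cut at the start
    trailingPadSamples: frames * FRAME_SIZE - total, // junk appended at the end
    encodedDurationSec: (frames * FRAME_SIZE) / SAMPLE_RATE,
  };
}

const p = aacPadding(44100); // one 1 s segment
console.log(p); // leadingTrimSec ≈ 0.046440, encodedDurationSec ≈ 1.068118
```

Note that the resulting encoded duration, 47104/44100 ≈ 1.068118s, matches the segment duration measured in comment 1.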

Comment 3

4 years ago
I re-read your comment, and I think I might understand the issue now (let me know if this is nonsense).
Basically, you're saying that Firefox handles eviction at the encoded-stream level.
So you can get rid of blocks/frames, but cannot operate at the resolution of actual audio samples.
Your atomic unit is in fact 1024 samples.

I think that the streams should be first decoded, and then eviction could be applied at the PCM level, where you have more granular control (basic unit is 1 sample instead of frame/1024 samples).
This would enable 100% gapless audio playback.
Does this make sense?
Assignee

Comment 4

4 years ago
Our MSE stack is *exactly* as per the W3C spec.

We deal on an encoded-sample basis, as muxed in the container; we demux all samples as the media segments are appended.

As per spec, eviction is done per the timestamps found in the container, for each sample.
There is no handling in the spec for partially compressed samples, etc.

If you have compressed/muxed your audio content to have 1024 audio frames per compressed sample, then this is the best granularity you're going to get.

Handling decoded samples instead would cause extreme memory usage, in particular with video.

But in this particular case, even if we did as you suggested, you would still get gaps, as you are truncating 1.06s of audio into a 1s block. The missing 0.06s would itself cause audible artefacts.

So what I suggested before was to not use a 1s window, but a 1.06s window, and to not set your starting timestampOffset to a negative value at startup. I'll make a quick demo page for you later.

The main reason it works in Chrome isn't that they decode first and then evict, but that they massage the timestamps. For example, if you appended a media segment spanning [0.09, 1)s, the resulting buffered range would become [0, 0.91).

And when you set timestampOffset, the end result is never what's in the encoded stream, but what they have massaged instead. The other browsers do not do that, hence why you feel that the other browsers are broken. They aren't. Your code is :)
Comment hidden (obsolete)
Comment hidden (obsolete)
Assignee

Comment 7

4 years ago
Oh, I see what you are doing in the code. Sorry, I read too quickly (I did so while at lunch).
You're identifying the silence in the mp3 content.

Let me give it some thought...
Assignee

Comment 8

4 years ago
From the spec:
The definition of a coded frame is "http://w3c.github.io/media-source/index.html#coded-frame"
"A unit of media data that has a presentation timestamp, a decode timestamp, and a coded frame duration."

the definition of a coded frame duration being:
"The duration of a coded frame. [...] For audio, the duration represents the sum of all the samples contained within the coded frame. For example, if an audio frame contained 441 samples @44100Hz the frame duration would be 10 milliseconds."

The handling of frames, and the behaviour regarding appendWindowStart and appendWindowEnd, is described in http://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-processing

steps 8 and 9:
"8. If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
9. If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame."

Given the definition of what a coded frame is, and more importantly what the duration of a coded frame is, it is clear to me that the current MSE API doesn't provide the sample granularity required to perform the task you want to do.

I suggest that you open a bug at W3C and submit your amendment.

In the meantime, what Chrome does is unique to their implementation and won't work in any other browser. It's unfortunate that they break the standard, really.
Assignee

Comment 9

4 years ago
I should add that in this particular example, removing the gap at the beginning is possible (which is what the Chrome example is doing); it's the gap at the end that can't be removed, as it's only partial content of a coded frame.
Reporter

Comment 10

4 years ago
Thanks, I really appreciate your replies.

Looks like the spec does not allow for sample-accurate gapless audio then...
I bet that in the Chrome implementation, instead of discarding the entire frame, they decode it and push only the samples that fall within the append window.

Or, another alternative is that they do not discard the last frame (which is partially within the append window), and appending the next segment overwrites the audio chunk that's past 1 second, i.e. [1.0, 1.021678).

I'll try to file a bug at W3C - thanks again for all your help!
Assignee

Comment 11

4 years ago
(In reply to Tomer Lahav from comment #10)
> Or, another alternative, is that they do not discard the last frame (which
> is partially within the append window), and appending the next segment
> overrides the audio chunk that's past 1 second [1.0, 1.021678). 

You'd probably still have the same issue here; in your particular example of 1s segments that may work, but for all media, probably not.

The spec states that when appending overlapping frames, if the overlap is greater than 1us (https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing step 13), then you don't remove the existing frame.
In your case, the overlap would be much greater (21678us), so the end won't be removed either (instead they will live side by side).

Implementing it on our side would require some change, but nothing too significant. Rather than storing only the sample duration, we could store an extra validity window; when decoding, the decoder could then accurately drop the decoded frames outside that window, giving you extremely accurate control.

> 
> I'll try to file a bug at W3C - thanks again for all your help!

that would be good.
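The validity-window idea mentioned above could look something like this. This is a hypothetical sketch for illustration, not Gecko code; all names are mine.

```javascript
// Hypothetical sketch of the "validity window" idea: each coded sample
// carries a window of valid media time, and after decoding we keep only
// the PCM frames that fall inside that window.
function trimDecoded(pcm, sampleRate, sampleStartSec, validity) {
  const first = Math.max(0, Math.round((validity.start - sampleStartSec) * sampleRate));
  const last = Math.min(pcm.length, Math.round((validity.end - sampleStartSec) * sampleRate));
  return pcm.subarray(first, Math.max(first, last));
}

// The problematic last packet from comment 1: 1024 frames starting at
// 0.998458 s, but only valid up to the 1 s append window end.
const decoded = new Float32Array(1024);
const kept = trimDecoded(decoded, 44100, 0.998458, { start: 0.998458, end: 1.0 });
console.log(kept.length); // 68 of the 1024 frames survive the trim
```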
Assignee

Updated

4 years ago
Depends on: 1226934
Assignee

Updated

4 years ago
No longer depends on: 1226934
Assignee

Updated

4 years ago
Depends on: 1226931
Assignee

Updated

5 months ago
See Also: → 1524890
Assignee

Comment 13

4 months ago

(In reply to Tomer Lahav from comment #2)

> The actual audio data that I care about (sine wave) is exactly 1 second long.
> These 1 second (44100 samples) segments should be able to connect gaplessly
> to each other, provided that we get rid of the excess "junk" audio at the
> start and end of the encoded file.

So with bug 1524890, we trim the frames: rather than dropping the entire frame, we make it fit within appendWindowStart and appendWindowEnd.

We end up with a perfect [0, 10s] buffered range. No more holes.

And yet, you can hear gaps in the audio.

You can give it a try.
Windows 64: https://queue.taskcluster.net/v1/task/CDvmPWdhQZq5Fwlk4lIUBg/runs/0/artifacts/public/build/install/sea/target.installer.exe
Mac: https://queue.taskcluster.net/v1/task/VC8v6OYJSnOjgjgRv2tBkQ/runs/0/artifacts/public/build/target.dmg

I'll investigate further what's going on here.

Assignee

Updated

4 months ago
Status: RESOLVED → REOPENED
Ever confirmed: true
Flags: needinfo?(tomerlahav)
Resolution: INVALID → ---
Assignee

Comment 14

4 months ago

So after a bit of investigation, this is what's happening, and why we still hear a gap.

So the first two audio packets (1024 audio frames each) are dropped because they are fully outside the append window. We rely on the demuxed data, which states that the packets start at 0 and 0.023220s respectively (each with a duration of 0.023220s). By that info, neither of those packets is needed.

So we start decoding from the 3rd packet.

When decoding that 3rd packet (and starting from there), the first 449 decoded audio frames are silence (0s). That leads to an audible gap of 10.2ms every second, as this pattern repeats with each one-second segment.
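A quick check of the figure above (449 silent frames at 44100Hz):

```javascript
// Sanity-checking the gap length: starting decode "cold" at the 3rd packet
// yields 449 silent output frames, which at 44100 Hz is the ~10.2 ms gap
// heard at the start of every one-second segment.
const SAMPLE_RATE = 44100;
const silentFrames = 449; // silent decoded frames reported in the comment
const gapMs = (silentFrames / SAMPLE_RATE) * 1000;
console.log(gapMs.toFixed(1) + ' ms'); // 10.2 ms
```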

Now, if I don't drop those first two packets and instead feed them to the decoder (either FFmpeg or Apple's CoreAudio AAC decoder):
The first decoded packet results in 1024 frames of silence.
The 2nd decoded packet has only its first 512 frames silent (well, more accurately, the first 450 are 0s; the rest isn't audible), while the remaining 512 frames are signal.
The 3rd decoded packet outputs 1024 frames of signal.

So to get the behaviour we're expecting (apparent gaplessness), we would need to keep that 2nd packet and somehow make it invisible to MSE, but still use it for decoding and then drop all of its decoded frames.

I'll have to think about how that could be done with the existing structures in place.

Assignee

Comment 15

4 months ago

Some audio decoders, such as AAC and Opus, need pre-roll content. As such, in order to fully reconstruct the content of the first frame, we keep the frame just prior to it, which would normally have been dropped.

We set this frame to have a duration of 1us so that it will be dropped later by the decoding pipeline. The starting time of the first frame is adjusted so that we have continuous data, without a gap in the buffered range.
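The approach described here can be sketched as follows. This is an illustrative model, not the actual Gecko patch; times are in samples and all names are mine.

```javascript
// Illustrative model of the fix: packets fully before the append window are
// dropped, except the last one, which is kept as decoder pre-roll. The real
// fix also gives that frame a 1 us duration so the decoding pipeline
// discards its output after it has primed the decoder.
function applyAppendWindow(packets, windowStartSamples) {
  const kept = [];
  for (let i = 0; i < packets.length; i++) {
    const p = packets[i];
    if (p.start + p.duration <= windowStartSamples) {
      const next = packets[i + 1];
      // Keep only the packet immediately before the window, flagged as pre-roll.
      if (next && next.start + next.duration > windowStartSamples) {
        kept.push({ ...p, preRoll: true });
      }
    } else {
      kept.push(p);
    }
  }
  return kept;
}

// Two 1024-sample packets of encoder silence before a window starting at 2048:
const packets = [
  { start: 0, duration: 1024, data: 'p0' },
  { start: 1024, duration: 1024, data: 'p1' },
  { start: 2048, duration: 1024, data: 'p2' },
];
console.log(applyAppendWindow(packets, 2048).map(p => p.data)); // [ 'p1', 'p2' ]
```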

Reporter

Comment 17

4 months ago

Awesome, glad to hear!
I'd be happy to test as well once you have a build to share...

Flags: needinfo?(tomerlahav)

Comment 18

4 months ago
Pushed by jyavenard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/754776b7161c
Keep the one dropped frame prior the first one to prime the decoder. r=bryce
Assignee

Updated

4 months ago
Depends on: 1530234

Comment 19

4 months ago
bugherder
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Assignee

Comment 20

4 months ago

(In reply to Tomer Lahav from comment #17)

> Awesome, glad to hear!
> I'd be happy to test as well once you have a build to share...

This feature is now available in Firefox Nightly 67, thank you for testing :)

Flags: needinfo?(tomerlahav)
Reporter

Comment 21

4 months ago

Tested it on nightly - sounds perfectly seamless, thank you!!

Flags: needinfo?(tomerlahav)
QA Whiteboard: [qa-67b-p2]