Closed Bug 1222851 Opened 5 years ago Closed 2 years ago
MSE audio playback not gapless
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20151105030433

Steps to reproduce:

I've set up a small webpage that reproduces the issue: http://melatonin64.github.io/gapless-mse-audio/ This webpage uses MSE to append several audio files into a continuous audio stream. The stream should be gapless, as we use SourceBuffer's timestampOffset, appendWindowStart and appendWindowEnd to discard padding. In Firefox, however, the audio does not play gaplessly.

Actual results:

Audible gaps can be heard in the audio.

Expected results:

Audio should play gaplessly, as it does in Chrome.
Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core
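For illustration, the per-segment append parameters the demo page presumably uses can be sketched as follows (a hypothetical reconstruction inferred from the values discussed in this thread, not the actual page source; `appendParamsForSegment` is a made-up helper name):

```javascript
// Hypothetical sketch of the demo page's per-segment parameters: each 1 s
// logical segment carries ~0.046440 s of encoder-delay padding, which the
// page tries to trim using appendWindow* and a negative timestampOffset.
const ENCODER_DELAY_S = 2048 / 44100; // ~0.046440 s of AAC priming samples

function appendParamsForSegment(i) {
  return {
    appendWindowStart: i,                 // keep only [i, i + 1) on the timeline
    appendWindowEnd: i + 1,
    timestampOffset: i - ENCODER_DELAY_S, // shift the padding before the window
  };
}
```

For segment 1 this yields an appendWindow of [1, 2) and a timestampOffset of about 0.953560 s, matching the values described in the analysis below by the assignee.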
It appears to play in Chrome because Chrome does things that are not per spec; their implementation of timestampOffset is wrong.

First, your media segments aren't one second long: they are 1.068118 s long and made of 2 moof+mdat pairs. You then set the appendWindow to [0, 1) and timestampOffset to -0.046440, and perform an appendBuffer of a media segment covering [0, 1.068117). Because timestampOffset is -0.046440, this is as if you had appended [-0.046440, 1.021678). With an append window of [0, 1), the range [-0.046440, 0) at the beginning is truncated (that's 2 samples), and [1, 1.021678) should be evicted. The sample covering the range [0.998458, 1.021678) falls under this condition of the spec, http://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-processing step 9: "If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame." As such, it is evicted.

So from your original [0, 1.06s) you're really left with [0.046440, 1.044898) of it, and your SourceBuffer's buffered range is now [0.000000, 0.998458). You then set appendWindowStart to 1 s, appendWindowEnd to 2 s and timestampOffset to 0.953560 s. Rinse and repeat exactly as above, and your buffered range becomes [0.000000, 0.998458), [1.023219, 1.998457), etc.

This simply can't play gaplessly, because you are handling 1.06 s segments as if they were 1 s long and evicting too much data from them. For there to be no audible gap, your samples must really be contiguous: here, exactly 1 s long. A workaround would be to set your appendWindowStart/appendWindowEnd in multiples of 1.068118 rather than 1 s.
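The trimming arithmetic described above can be checked with a small sketch (this is not Gecko's actual code, just the per-frame drop rules of spec steps 8 and 9 applied to a 1.068118 s segment made of 46 AAC frames of 1024 samples at 44100 Hz):

```javascript
// Apply the coded-frame-processing drop rules to a list of frames
// ({pts, duration} in seconds) and return the resulting kept time range.
function keptRange(frames, timestampOffset, winStart, winEnd) {
  let start = null, end = null;
  for (const f of frames) {
    const pts = f.pts + timestampOffset;
    const frameEnd = pts + f.duration;
    if (pts < winStart) continue;    // step 8: starts before appendWindowStart
    if (frameEnd > winEnd) continue; // step 9: ends after appendWindowEnd
    if (start === null) start = pts;
    end = frameEnd;
  }
  return [start, end];
}
```

With 46 frames of duration 1024/44100 s, a timestampOffset of -2048/44100 and a window of [0, 1), this reproduces the buffered range [0, 0.998458) computed above: the two priming frames are dropped at the front, and the last frame is dropped because its end (1.021678) crosses appendWindowEnd.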
Note that we allow a leeway of ±1 sample on the append window; this is to allow streams with poorly muxed content and a negative start time to work (otherwise, especially with video, the first sample, typically a keyframe, would be evicted; with, say, a keyframe every 4 s, that would cause a starting gap of 4 s). We could probably apply the fuzz only to the start time, not to appendWindowEnd.
Assignee: nobody → jyavenard
Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Thanks for your extremely detailed reply, I appreciate it! If I understand correctly, you're saying that it's possible to achieve gapless MSE playback in Firefox. Do you have any examples of how I should go about it?

Let me tell you how I created these audio files, and maybe you'll be able to tell me where I went wrong. I created a sine wave as uncompressed 16-bit PCM audio at 44.1 kHz: 10 seconds of uncompressed audio. I then sliced these 10 seconds into 10 segments of PCM data, exactly 44100 samples (1 second) each, and encoded the segments to AAC using fdk-aac.

Now, AAC uses blocks of 1024 samples each. Specifically, fdk-aac adds two blocks of silence ahead of the actual audio data. This accounts for 2048 samples, which at a 44100 Hz sample rate is ~0.046440 seconds. That's why I have to get rid of the first 0.046440 seconds of each file. AAC must also round up to a whole frame/block of data, a multiple of 1024 samples, so junk audio is also added to the end of the file, which needs to be cut out as well.

The actual audio data that I care about (the sine wave) is exactly 1 second long. These 1-second (44100-sample) segments should be able to connect gaplessly to each other, provided that we get rid of the excess "junk" audio at the start and end of the encoded file. This exact procedure seems to work in Chrome, but not in FF; I'd love to know what I should do differently to get the same result in FF as well.

Also, I'm not sure why the fact that my files have 2 moof+mdat is relevant; these files comply with http://www.w3.org/2013/12/byte-stream-format-registry/isobmff-byte-stream-format.html.
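The encoding arithmetic described above works out as follows (a direct computation of the numbers quoted in this thread, assuming the stated fdk-aac behaviour of two priming blocks):

```javascript
// fdk-aac prepends two 1024-sample priming blocks, and the encoded segment
// is padded up to a whole number of 1024-sample frames.
const sampleRate = 44100;
const frameSize = 1024;                      // PCM samples per AAC frame
const encoderDelay = 2 * frameSize;          // two priming blocks = 2048 samples

const delaySeconds = encoderDelay / sampleRate;          // ~0.046440 s to trim at the front
const contentFrames = Math.ceil(sampleRate / frameSize); // 44 frames needed to cover 1 s
const totalDuration =
  (encoderDelay + contentFrames * frameSize) / sampleRate; // ~1.068118 s per encoded segment
```

This recovers both figures from the thread: the ~0.046440 s of leading silence, and the 1.068118 s total segment duration (46 frames) that the appendWindow math has to contend with.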
I re-read your comment, and I think I might understand the issue now (let me know if this is nonsense). Basically, you're saying that Firefox handles eviction at the encoded-stream level. So you can get rid of whole blocks/frames, but cannot operate at the resolution of individual audio samples; your atomic unit is in fact 1024 samples. I think the streams should first be decoded, and then eviction could be applied at the PCM level, where you have more granular control (the basic unit is 1 sample instead of a 1024-sample frame). This would enable 100% gapless audio playback. Does this make sense?
Our MSE stack follows the W3C spec *exactly*. We operate on an encoded-sample basis, as muxed in the container: we demux all samples as the media segments are appended. As per the spec, eviction is done according to the timestamps found in the container, for each sample. There is no handling in the spec for partial compressed samples and the like. If you have compressed/muxed your audio content with 1024 audio samples per compressed frame, then that is the best granularity you're going to get. Handling decoded samples instead would cause extreme memory usage, in particular with video.

But in this particular case, even if we did as you suggested, you would still get gaps, as you are truncating 1.06 s of audio into a 1 s block. The missing 0.06 s itself would cause audible artefacts. So what I suggested before was to not use a 1 s window, but a 1.06 s window, and to not set your starting timestampOffset to a negative value at startup. I'll make a quick demo page for you later.

The main reason it works in Chrome isn't that they decode first and then evict, but that they massage the timestamps. For example, if you appended a media segment spanning 0.09-1 s, the resulting buffered range would become [0, 0.91]. When you set timestampOffset, the end result is never what's in the encoded stream but what they have massaged instead. The other browsers do not do that, which is why you feel that the other browsers are broken. They aren't; your code is :)
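The suggested workaround can be sketched like this (a hedged illustration of the advice above, not code from the promised demo page; `workaroundParamsForSegment` is a made-up helper name):

```javascript
// Workaround sketch: size the append window to the real encoded segment
// duration (46 AAC frames of 1024 samples at 44100 Hz, ~1.068118 s) instead
// of 1 s, so no trailing coded frame crosses appendWindowEnd and gets dropped,
// and use no negative timestampOffset at startup.
const SEGMENT_DURATION_S = 47104 / 44100; // 46 * 1024 samples

function workaroundParamsForSegment(i) {
  return {
    appendWindowStart: i * SEGMENT_DURATION_S,
    appendWindowEnd: (i + 1) * SEGMENT_DURATION_S,
    timestampOffset: i * SEGMENT_DURATION_S,
  };
}
```

The trade-off, as discussed above, is that the ~0.06 s of padding in each segment is then kept and played, so the result is gapless in the buffering sense but still contains the encoder's padding audio.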
Oh, I see what you are doing in the code. Sorry, I read too quickly (did so while at lunch): you're identifying the silence in the mp3 content. Let me give it some thought...
From the spec, the definition of a coded frame (http://w3c.github.io/media-source/index.html#coded-frame) is:

"A unit of media data that has a presentation timestamp, a decode timestamp, and a coded frame duration."

The definition of coded frame duration is:

"The duration of a coded frame. [...] For audio, the duration represents the sum of all the samples contained within the coded frame. For example, if an audio frame contained 441 samples @44100Hz the frame duration would be 10 milliseconds."

The handling of frames, and what to do with respect to appendWindowStart and appendWindowEnd, is in http://w3c.github.io/media-source/index.html#sourcebuffer-coded-frame-processing steps 8 and 9:

"8. If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame."

"9. If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame."

Seeing the definition of what a coded frame is, and more importantly what its duration is, it is clear to me that the current MSE API doesn't provide the sample granularity required to perform the task you want. I suggest that you open a bug at W3C and submit your amendment. In the meantime, what Chrome describes is unique to their implementation and won't work with any other browser. It's unfortunate that they break the standard, really.
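The spec's coded-frame-duration example can be computed directly (a trivial sketch of the definition quoted above; the function name is made up):

```javascript
// Coded frame duration for audio per the MSE spec: the sum of the samples
// in the frame, expressed as time. 441 samples @ 44100 Hz -> 10 ms.
function codedFrameDurationMs(sampleCount, sampleRate) {
  return (sampleCount / sampleRate) * 1000;
}
```

For the segments in this bug, each coded frame is 1024 samples at 44100 Hz, i.e. about 23.22 ms, which is the smallest unit the appendWindow logic can keep or drop.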
I should add that in this particular example, removing the gap at the beginning is possible (which is what the Chrome example is doing); it's the gap at the end that can't be removed, as it's only partial content of a coded frame.
Thanks, I really appreciate your replies. It looks like the spec does not allow for sample-accurate gapless audio, then... I bet that in the Chrome implementation, instead of discarding the entire frame, they decode it and push only the samples that fall within the append window. Alternatively, perhaps they do not discard the last frame (which is partially within the append window), and appending the next segment overrides the audio chunk that's past 1 second, [1.0, 1.021678). I'll try to file a bug at W3C - thanks again for all your help!
(In reply to Tomer Lahav from comment #10)
> Or, another alternative, is that they do not discard the last frame (which
> is partially within the append window), and appending the next segment
> overrides the audio chunk that's past 1 second [1.0, 1.021678).

You'd probably still have the same issue there; in your particular example of 1 s segments it may work, but for all media, probably not. The spec states that when appending overlapping frames, if the overlap is greater than 1 µs (https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing step 13), you don't remove the existing frame. In your case, the overlap would be much greater (21678 µs), so the end won't be removed either; instead the frames will live side by side.

Implementing it on our side would require some change, but nothing too significant. Rather than storing only the sample duration, we could store an extra validity window, and when decoding, the decoder could accurately drop the decoded samples outside that window, giving you extremely accurate control.

> I'll try to file a bug at W3C - thanks again for all your help!

That would be good.
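The "validity window" idea above could look something like this at the PCM level (an illustrative assumption about a possible implementation, not Gecko's actual code; `trimDecodedFrame` is a made-up name):

```javascript
// Trim a decoded PCM frame to a stored validity window with single-sample
// granularity: keep only samples whose timestamps fall in [winStartS, winEndS).
function trimDecodedFrame(pcm, frameStartS, sampleRate, winStartS, winEndS) {
  const first = Math.max(0, Math.ceil((winStartS - frameStartS) * sampleRate));
  const last = Math.min(pcm.length,
                        Math.floor((winEndS - frameStartS) * sampleRate));
  return pcm.slice(first, Math.max(first, last));
}
```

The boundary coded frame would be kept (and even decoded, to prime the decoder) but only the in-window portion of its decoded output would be played, which is what sample-accurate gapless playback needs.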
Filed a bug with W3C: https://github.com/w3c/media-source/issues/37
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INVALID → ---
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/754776b7161c Keep the one dropped frame prior the first one to prime the decoder. r=bryce