Closed Bug 576539 Opened 14 years ago Closed 14 years ago

Omitted audio samples should be treated as silence

Categories

(Core :: Audio/Video, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: kbrosnan, Assigned: cpearce)

References

()

Details

(Whiteboard: [Fx4Beta1Testday] [4b1])

Attachments

(4 files, 10 obsolete files)

14.43 KB, patch
kinetik
: review+
Details | Diff | Splinter Review
1.99 KB, patch
kinetik
: review+
Details | Diff | Splinter Review
39.98 KB, patch
cpearce
: review+
Details | Diff | Splinter Review
8.61 KB, patch
cpearce
: review+
Details | Diff | Splinter Review
Cannot play the whole video without it stuttering. I can play about 60s - 90s of the video perfectly. After that the video and audio start stuttering and does not recover without moving the playback slider.

Mozilla/5.0 (Windows; U; Windows NT 6.1; WOW64; en-US; rv:2.0b1) Gecko/20100630 Firefox/4.0b1
Summary: http://people.mozilla.com/~faaborg/files/20100625-tabsOnTop/Tabs.webmvp8.webm → Cannot play Faaborg's tabs on top webm video to the end
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:2.0b1) Gecko/20100628 Firefox/4.0b1

(In reply to comment #0)
> Cannot play the whole video without it stuttering. I can play about 60s - 90s
> of the video perfectly. 

Same here

>After that the video and audio start stuttering and
> does not recover without moving the playback slider.

Actually, if I move the sliders, everything's well again.
Whiteboard: [Fx4Beta1Testday] → [4b1]
Whiteboard: [4b1] → [Fx4Beta1Testday] [4b1]
The audio's timestamps in Alex's video are wrong. Alex: What tool did you use to produce your "tabs on top" WebM video (see URL)?

The first chunk of audio samples has a timestamp of 4s (and we actually incorrectly skip the first 4s of video frames, which is an unrelated bug I think) and then subsequent audio chunks do not have timestamps which are the first audio chunk's timestamp plus the sum of audio samples in preceding chunks.

Matthew tells Opera and VLC play that file fine, but Chrome exhibits the same problem we do.

Perhaps we should change the audio timestamps to be start time of the first chunk + duration of samples in preceding chunks? Even though the file is funnily encoded, it would be nice to play it. I'm unsure if encoders can legally miss chunks of audio data in order to encode silence though; if such files were valid, we'd play these files incorrectly, e.g. we'd not play silence.
I asked the experts if a WebM encoder can omit Vorbis Blocks to imply silence in the stream.

https://groups.google.com/a/webmproject.org/group/webm-discuss/browse_thread/thread/be8ba3f3bfa68fc4#

Lets see what they say before we implement the change I suggest in comment 2.
>Alex: What tool did you use
>to produce your "tabs on top" WebM video

The file was encoded using the Miro Video Converter: http://www.mirovideoconverter.com/

I can also provide the source file if that helps to determine the problem.
(In reply to comment #3)
> I asked the experts if a WebM encoder can omit Vorbis Blocks to imply silence
> in the stream.
> 

Steve Lhomme replied that an encoder can legally omit Vorbis Blocks to imply silence, so we must handle this case. He also pointed out this can happen in Ogg too.

In Alex's video, there are several omitted audio blocks, and it so happens that when we reach 1:19 into the video we've omitted enough audio that we go into an audio-underrun state.

There is actually two problems we need to solve here. Firstly, we need to interpret the absence of samples as silence during playback. When we encounter missing audio samples, we must interpret them as implied silence, and push the appropriate number of 0-valued sound samples into the audio hardware. Secondly, we must be able to seek into silent-by-omission blocks.
Assignee: nobody → chris
Status: NEW → ASSIGNED
Summary: Cannot play Faaborg's tabs on top webm video to the end → Omitted audio samples should be treated as silence
Patch 1 v1.
* Add a new field to the SoundData class, mIsSilence. When a SoundData object represents silence, all its normal fields are valid, except its mAudioData field, which is empty. When we come to play a silent sample, we play in a loop, playing samples from a 0-filled array until we've played the appropriate number of samples.
* Add nsBuiltinDecoderReader::PushAudio(SoundData*), which pushes SoundData objects onto nsBuiltinDecoderReader::mAudioQueue. We call this instead of pushing directly to the audio queue. PushAudio() detects if there's a gap between the incoming SoundData object and the previous one, and if so pushes a chunk of silent samples onto the audio queue to ensure we play over the gap.
* While decoding we treat blocks of omitted audio samples larger than 30ms as silence-implied-by-omission. We need to have this threshold, since at least the miro encoder/converter (and maybe other encoders) will sometimes omit extra samples up front, and then omit fewer samples in subsequent "frames". Without this threshold, we can be overeager in injecting silence, causing audio crackles.
* This patch is based on the buffered and OggIndex patches from bug 462957 and bug 519897 respectively.
Patch 2 v1
* Move nsOggReader::DecodeToFrame(PRInt64 aTarget) to nsBuiltinDecoderReader.
* After we seek in nsWebMReader::Seek(), call DecodeToFrame(). This ensures that we advance the audio decode to the seek target (previously we'd have begun playback from around the nearest previous keyframe?), and ensures that we've got at least one audio sample decoded after a seek.
* After a seek in nsBuiltinDecoderStateMachine::Run() in the SEEKING state, check the next decoded audio sample, and if it's after the seek target, inject silence into the audio stream.
* This is based on patch 1.

With this patch we can seek into silent-by-omission blocks, for example the first 4 seconds of Alex's video is a silent-by-omission block.
Attachment #457244 - Flags: review?(kinetik)
Attachment #457247 - Flags: review?(kinetik)
Although it may be possible to encode Ogg files with gaps in the timestamps, it is not legal. However, corrupt pages could lead to a similar results (gaps in the stream). libvorbis needs to know about such gaps in order to properly reinitialize the lapping machinery. In libogg, this is done by tracking the page sequence numbers, which are used to insert gaps into the packet numbers when pages arrive out of sequence. However, the WebM reader does not use libogg, and just continually increments mPacketCount every time a packet is passed to libvorbis. This counter should be incremented an extra time when there's a gap.
Ok, so after discussing this on IRC with Tim, I think we'll have to move the implied-silence detection up into the individual backends, since we should behave differently for Ogg and WebM.

For WebM... libvorbis needs to know if there's a gap before it decodes a packet, else it may use data from what it thinks is the preceeding packet while decoding, but which may actually not be. So we must detect the gaps in nsWebMReader's audio decode in advance of decoding audio data, using the timecodes given to us by libnestegg.

nsOggReader uses libogg, which uses the ogg page sequence numbers to detect gaps (since gaps in ogg aren't legal) so we don't need to detect the gaps in advance of decode in ogg, but we should still add silent samples in order to play across any gaps due to corruption or funnily/illegally encoded oggs.
Attachment #457244 - Flags: review?(kinetik)
Attachment #457247 - Flags: review?(kinetik)
Patch with Tim's feedback addressed.
* We now do the silence detection in the ogg and webm backends separately, and reset the vorbis synthesis in the webm backend before decoding after a gap.
* When we play silence, we play at most 100ms of silence, and then push any remaining silence onto the front of the audio queue in a new silent SoundData object. This ensures the current playback position advances smoothly across gaps.
* Add ogg testcase to the playback tests.
Attachment #457244 - Attachment is obsolete: true
Attachment #457705 - Flags: review?(kinetik)
As Patch 2 v1, but rebased on Patch 1 v2.
Attachment #457247 - Attachment is obsolete: true
Attachment #457706 - Flags: review?(kinetik)
Attached patch Patch 1 v3: With small fix (obsolete) — Splinter Review
I found a bug in patch 1, which caused us to end up skipping all audio on some videos. Fixed in this patch. Sorry for the bugspam!
Attachment #457705 - Attachment is obsolete: true
Attachment #457741 - Flags: review?(kinetik)
Attachment #457705 - Flags: review?(kinetik)
Attached patch Patch 1 v4: Another small fix (obsolete) — Splinter Review
I found another weirdly encoded video which triggered a bug, fix is in this patch. It should totally work now... Sorry for the bugspam!
Attachment #457741 - Attachment is obsolete: true
Attachment #457757 - Flags: review?(kinetik)
Attachment #457741 - Flags: review?(kinetik)
Comment on attachment 457706 [details] [diff] [review]
Patch 2 v2: Rebased on new patch 1

+    nsAutoPtr<SoundData> audio;
...
+      audio = mAudioQueue.PeekFront();

Make this:

+      nsAutoPtr<SoundData> audio(mAudioQueue.PeekFront());

With that, r+ on the chunks that share the decode-after-seek logic.  Can you split that chunk out from the chunk that queues silence after seeking?  The code sharing makes sense no matter what, but I have some issues with the silence handling code I'll address in the next comment.
Drive by review of patch 1:

+#define AUDIO_GAP_THRESHOLD 30

Document why choose '30'.

+    mIsSilence(PR_FALSE)

Why use a silence flag instead of just storing 'x' bytes of silence in
the array? Is it to avoid large memory allocations if there's a big
gap? If so please note in a comment.

+  // Play up to 100ms of silence, or 32KB of sound data, whichever is less.
+  const PRUint32 len = NS_MIN((info.mAudioRate / 10) * info.mAudioChannels,
+                              static_cast<PRUint32>(32 * 1024));

Why 100ms and 32KB of data? Document how those numbers were chosen. Use constants instead of magic numbers.

+
+  PRUint32 x = NS_MIN(aSound->AudioDataLength(), len);

Needs a better name than 'x'.

Ogg backend:
+    PRInt64 gap = chunks[0]->mTime - mAudioEndMs;
+    if (mAudioEndMs != -1 && gap > AUDIO_GAP_THRESHOLD) {
+      LOG(PR_LOG_DEBUG, ("nsOggReader::DecodeAudioData() injecting %lldms "
+                         "of silence to fill gap in audio\n", gap));
+      PRInt64 samples = (gap * mInfo.mAudioRate) / 1000;

Is it possible for an overflow to occur here if gap is large? Same in the WebM backend.
Drive by review of patch 2:

+                const nsVideoInfo& info = mReader->GetInfo();
+                PRInt64 duration = audio->mTime - seekTime;
+                PRUint32 samples = (duration * info.mAudioRate) / 1000;

Overflow possibilities here? Multiplying a 64 bit value by a 32 bit
value and storing it in a 32 bit value.
I have two problems with the current approach:

- AUDIO_GAP_THRESHOLD appears to use an arbitrary number, and seems to be papering over a lack of accurate detection of gaps (i.e. it should be possible to detect all real gaps, and they would all need silence handling--if we're detecting "gaps" that aren't gaps then the detection is flawed.)

As an example of a case where this threshold would not work is a 50 fps file where the encoder periodically dropped a video frame duration of audio.  Since the gap would be less than 30ms, it wouldn't be handled by silence insertion, and (after enough of these gaps) we'd eventually either suffer A/V desync or audio underrun.

I've written a standalone WebM/Vorbis decoder tool to experiment with accurate gap detection yet--I hope to post the results before I went on leave, but I ran out of time before I got it working properly.

- Rather than the decoder producing silent audio data, it seems like it'd be cleaner for the audio playback code to detect and silence-fill gaps at playback time.  This would reduce the gap detection in the decoder to dealing with the Vorbis lapping/reset issue (which would only be required in the WebM code).

One question:

+    // Reset the vorbis sync state, so that the vorbis decoder does not
+    // try to use data from before the gap to decode after the gap.
+    vorbis_synthesis_restart(&mVorbisDsp);
+    mPacketCount = 0;

Why reset Vorbis rather than just bumping the packet count by an extra increment and letting it handle this internally as Tim mentioned in comment 8?
(In reply to comment #17)
> - AUDIO_GAP_THRESHOLD appears to use an arbitrary number, and seems to be
> papering over a lack of accurate detection of gaps (i.e. it should be possible
> to detect all real gaps, and they would all need silence handling--if we're
> detecting "gaps" that aren't gaps then the detection is flawed.)

Ok, we can accurately detect gaps if we count played samples on the state machine. I have updated my patch and have it working. I'll post a new patch when I've cleaned it up.

> One question:
> 
> +    // Reset the vorbis sync state, so that the vorbis decoder does not
> +    // try to use data from before the gap to decode after the gap.
> +    vorbis_synthesis_restart(&mVorbisDsp);
> +    mPacketCount = 0;
> 
> Why reset Vorbis rather than just bumping the packet count by an extra
> increment and letting it handle this internally as Tim mentioned in comment 8?

I had thought both of these were required to force the vorbis decode to throw away prior data, but with accurate gap detection I see that this actually causes more gaps. Looking at vorbis_synthesis_blockin(), I think we do indeed want to increment the packetno (not reset it) rather than restart synthesis when there's a gap.
I think Chris Double's right in that we should worry about int overflow in the timestamp code. This patch moves the int overflow code out of static functions and into its own file so we can use the int-overflow aware add and multiply functions from other compilation units easier.

This patch is based on current trunk.
Attachment #461439 - Flags: review?(kinetik)
Based on patch 0, on current trunk - no longer based on buffered, or OggIndex patches.
Attachment #457757 - Attachment is obsolete: true
Attachment #461440 - Flags: review?(kinetik)
Attachment #457757 - Flags: review?(kinetik)
Attached patch Patch 2 v3: Decode to frame (obsolete) — Splinter Review
Rebased on trunk.
Attachment #457706 - Attachment is obsolete: true
Attachment #461441 - Flags: review?(kinetik)
Attachment #457706 - Flags: review?(kinetik)
This should block, videos with this encoding quirk are relatively common, ffmpeg outputs them, so we should take this fix. Requesting blocking 2.0.
blocking2.0: --- → ?
See bug 582328 which was raised against having the overflow code in VideoUtils.h. You're patch affects that (and fixes it I think).
Comment on attachment 461441 [details] [diff] [review]
Patch 2 v3: Decode to frame

Shouldn't nsOggReader::Seek be using this too?
Attachment #461441 - Flags: review?(kinetik) → review+
Attachment #461439 - Flags: review?(kinetik) → review+
(In reply to comment #24)
> Comment on attachment 461441 [details] [diff] [review]
> Patch 2 v3: Decode to frame
> 
> Shouldn't nsOggReader::Seek be using this too?

Oh yeah, I lost it when I rebased. I'll add it back in. Thanks for review!
Attached patch Patch 2 v4: Decode to frame (obsolete) — Splinter Review
With nsOggReader::Seek() using DecodeToTarget().
Attachment #461441 - Attachment is obsolete: true
Attachment #463069 - Flags: review+
I realized that the assertion that I added in patch 1 "NS_ASSERTION(audioStartTime != -1,...) in AudioLoop() is triggering during mochitest test_seek. This happens when we seek to the same time as the current playback position, e.g. "video.currentTime = video.currentTime;". This is because the SEEKING case in nsBuiltinDecoderStateMachine::Run() calls ResetPlayback(), which sets mAudioStartTime to -1, but Run()'s SEEKING case only reinitializes mAudioStartTime when the seek target is different to the current playback position. I think we should only stop playback and destroy the decode/audio threads when seeking if the seek target is different to the current playback position. The HTML5 spec still requires us to run the seek algorithm (and send all events etc) in this case, even though the current playback position isn't changing.
Attachment #463987 - Flags: review?(kinetik)
Comment on attachment 461440 [details] [diff] [review]
Patch 1 v5: Sample accurate audio gap detection

I'm generally a little worried about round-off errors with this,
since we're doing some calculations using milliseconds and others
using samples.  The code does look fine (and doesn't seem to
contain any places where the error can accumulate), so perhaps my
concern is unfounded.

+const PRUint32 SILENCE_BYTES_CHUNNK = 32 * 1024;

CHUNK typo here and where it's used.

+      missingSamples = NS_MIN(static_cast<PRInt64>(PR_UINT32_MAX), missingSamples);
+      audioDuration += PlaySilence(static_cast<PRUint32>(missingSamples), channels);

The comment and implementation of PlaySilence require it to be
called in a loop as it writes no more than 32kB at a time, but
it's not being called in a loop and seems to assuming
|missingSamples| will be written, no matter how large.

PlayFromAudioQueue has:
+    if (!mAudioStream->IsPaused()) {
...
+    } else {
+      mReader->mAudioQueue.PushFront(sound);
+      sound.forget();
+    }

...and it seems like PlaySilence needs the same logic (preferably shared)?

+      mAudioSamples += samples;

Do we need to worry about overflow here?  If there are no gaps,
this will never be reset to 0 and will overflow eventually.  On
the other hand, it'll take a fairly long time before it does.

+  PRInt64 mAudioStartMs;

The timestamps stored here are unsigned.  Can we make this
unsigned too?  I know you're using -1 as a sentinel to signal it
hasn't been used yet, but it seems like we could get away with
using 0 for that.

+  PRInt64 mAudioSamples;

Same here, I don't think this ever makes sense as a negative
value.  The only thing is that you need to be careful about
underflow here if you make it unsigned:

+  if (tstamp_samples - decoded_samples > 0) {
No longer depends on: 519897
Attachment #463987 - Flags: review?(kinetik) → review+
I know you didn't touch this, but since it's so close to the code you're changing:

     PRUint32 samples = 0;
     while ((samples = vorbis_synthesis_pcmout(&mVorbisDsp, &pcm)) > 0) {

...but...

extern int      vorbis_synthesis_pcmout(vorbis_dsp_state *v,float ***pcm);

It'd be really bad if pcmout returned an error and we didn't notice.

nsOggReader.cpp has the same problem (and also the same unnecessary while/if that you correctly reduced to just the while here).
Thanks for review!

(In reply to comment #28)
> Comment on attachment 461440 [details] [diff] [review]
> Patch 1 v5: Sample accurate audio gap detection
> 
> I'm generally a little worried about round-off errors with this,
> since we're doing some calculations using milliseconds and others
> using samples.  The code does look fine (and doesn't seem to
> contain any places where the error can accumulate), so perhaps my
> concern is unfounded.

At least some versions of ffmpeg encode WebM videos with timestamps be rounded to the nearest ms (despite them being stored in ns) anyway.

If you want to be more precise, we'd have to store our VideoData and SoundData timestamps in ns, or as fractions, which seems like overkill to me

The error won't accumulate, so we should be ok.


> The comment and implementation of PlaySilence require it to be
> called in a loop as it writes no more than 32kB at a time, but
> it's not being called in a loop and seems to assuming
> |missingSamples| will be written, no matter how large.

It's being called in AudioLoop()'s loop. We need to run the logic in the rest of AudioLoop() to ensure the current time updates smoothly, hence a limit on the sound write size.

> PlayFromAudioQueue has:
> +    if (!mAudioStream->IsPaused()) {
> ...
> +    } else {
> +      mReader->mAudioQueue.PushFront(sound);
> +      sound.forget();
> +    }
> 
> ...and it seems like PlaySilence needs the same logic (preferably shared)?

Good catch!

> 
> +      mAudioSamples += samples;
> 
> Do we need to worry about overflow here?  If there are no gaps,
> this will never be reset to 0 and will overflow eventually.  On
> the other hand, it'll take a fairly long time before it does.

I didn't think it was worth worrying about. mAudioSamples is a PRInt64, so at typical bitrates we'd have several million years of playback before this overflowed.

> +  PRInt64 mAudioStartMs;
> 
> The timestamps stored here are unsigned.  Can we make this
> unsigned too?  I know you're using -1 as a sentinel to signal it
> hasn't been used yet, but it seems like we could get away with
> using 0 for that.

We couldn't use 0 unless we added a flag to denote whether it had been initialized. 0 is a valid start time after all. We could use PR_UINT64_MAX as the sentinel value I suppose.

 
> +  PRInt64 mAudioSamples;

Sure, no worries, this can be unsigned.
Patch 1 with review comments addressed.

I'm still doing the audio gaps' timestamp calculation in PRInt64. All our timestamps are PRInt64s, and since all the timestamps we get out of libnestegg/WebM are divided by NS_PER_MS, they must be less than PR_INT64_MAX, so it's safe to store them in PRInt64s. If I switch to using PRUint64s for the audio gaps timestamp calculation, the unsigned-ness propagates through our overflow arithmetic functions to affect all our timestamps. I see no technical reason why we shouldn't make all timestamps unsigned though, but I think we should do that in another bug.
Attachment #461440 - Attachment is obsolete: true
Attachment #464298 - Flags: review?(kinetik)
Attachment #461440 - Flags: review?(kinetik)
Comment on attachment 464298 [details] [diff] [review]
Patch 1 v6: review comments addressed

(In reply to comment #30)
> At least some versions of ffmpeg encode WebM videos with timestamps be rounded
> to the nearest ms (despite them being stored in ns) anyway.

Yeah.  They're actually stored in milliseconds.  TimecodeScale is 1e6 nanoseconds (as recommended in the WebM spec), so the timecodes stored in the container only have millisecond resolution.

> We couldn't use 0 unless we added a flag to denote whether it had been
> initialized. 0 is a valid start time after all. We could use PR_UINT64_MAX as
> the sentinel value I suppose.

I was thinking you could treat 0 and uninitialized as the same and just reinitialize if it's 0, but that would break the case where it had initialized to 0 and then advanced to the next decode call (where it would reinit at the next tstamp, and then work from then on).  My mistake, sorry.

I'm not sure if using -1 is more or less ugly than using the UNINITIALIZED constant with PR_UINT64_MAX, especially since the prevailing style in the code is to use -1 (for better or worse), so it might be better to go back to using -1 if you don't mind... sorry.

(In reply to comment #31)
> I'm still doing the audio gaps' timestamp calculation in PRInt64. All our
> timestamps are PRInt64s, and since all the timestamps we get out of
> libnestegg/WebM are divided by NS_PER_MS, they must be less than PR_INT64_MAX,
> so it's safe to store them in PRInt64s. If I switch to using PRUint64s for the
> audio gaps timestamp calculation, the unsigned-ness propagates through our
> overflow arithmetic functions to affect all our timestamps. I see no technical
> reason why we shouldn't make all timestamps unsigned though, but I think we
> should do that in another bug.

Yeah, you're right, not worth changing it here.

     PRUint32 samples = 0;
     while ((samples = vorbis_synthesis_pcmout(&mVorbisDsp, &pcm)) > 0) {

You changed the nsOggReader version of this to PRInt32 but not the nsWebMReader version.  Or did you change total_samples to PRInt32 instead of samples?

+PRUint32 nsBuiltinDecoderStateMachine::PlaySilence(PRUint32 aSamples, PRUint32 aChannels)
+  PRUint32 samples;
...
+  return samples;

Declare this where it's first assigned and return it from inside the lock scope?  Actually, just get rid of the extra scope braces--they're not necessary.
Attachment #464298 - Flags: review?(kinetik) → review+
With signed mAudioStartMs and -1 sentinel, scope removed in PlaySilence(), and storing return value of all vorbis_synthesis_pcmout() calls in PRInt32. Carrying forward r=kinetik.
Attachment #464298 - Attachment is obsolete: true
Attachment #464613 - Flags: review+
Keywords: checkin-needed
Whiteboard: [Fx4Beta1Testday] [4b1] → [Fx4Beta1Testday] [4b1][needs landing]
Patch 2 didn't build with gcc, as I was assigning to an autoptr. Fixed in this patch, builds and tests pass on TryServer.
Attachment #463069 - Attachment is obsolete: true
Attachment #464960 - Flags: review+
Landed on m-c
http://hg.mozilla.org/mozilla-central/rev/117166848ffa
http://hg.mozilla.org/mozilla-central/rev/94eacfad6472
http://hg.mozilla.org/mozilla-central/rev/b88ed3964100
http://hg.mozilla.org/mozilla-central/rev/5eb803404899
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [Fx4Beta1Testday] [4b1][needs landing] → [Fx4Beta1Testday] [4b1]
Depends on: 587480
Depends on: 587481
Depends on: 587482
See Also: → 1853278
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: