Open Bug 1941303 Opened 1 year ago Updated 1 year ago

Audio Codec Delay isn't trimmed away when using MSE

Categories

(Core :: Audio/Video: Playback, defect)

Unspecified
All
defect

Tracking

()

Tracking Status
firefox-esr115 --- wontfix
firefox-esr128 --- affected
firefox134 --- wontfix
firefox135 --- wontfix
firefox136 --- wontfix

People

(Reporter: pehrsons, Unassigned)

References

(Regression)

Details

(Keywords: regression)

When looking into bug 1940733 I found that cubeb was fed more audio than the duration as reported by the demuxer.

This is because the trimming info from the demuxer is stripped away when using MSE.

I observed this visually sometimes as we would skip a few video frames when reaching the end of videos from svt.se.

First reproduced on Android. Then on linux, and I captured a profile and recorded the same run with rr: Pernosco. Fx Profiler. The video used is here.

In the beginning of the audio, packets are 2048 frames. Codec delay is 5058 frames, so the last 1086 frames of the 3rd packet are the first that should not be trimmed. For some reason we don't only push those 5058 frames extra. Cubeb receives 2048*3=6144 frames too many.

See Also: → 1940733

Set release status flags based on info from the regressing bug 1838058

:padenot, since you are the author of the regressor, bug 1838058, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Andreas noticed that this site appends two overlapping audio fragments.
The first ends at 185406/48000 and the second starts at 184320/48000, which is an overlap of 1086/48000.

There are two sources of additional audio samples, both a result of playing all audio samples sequentially rather than at their indicated presentation times.
The first is the 5058/48000 before the presentation time 0 and
the second is the 1086/48000 overlap of the two audio fragments.
I expect we can correct to remove the first, but I'm not clear what to do about the second.

The elst atom in the first fragment indicates that the first 5058/48000 of leading audio should not rendered. This is recorded when MP4TrackDemuxer::GetNextSample() overwrites the presentation time and duration of the first 3 packets.

The stripping of the mOriginalPresentationWindow from the MediaRawData in TrackBuffersManager::OnAudioDemuxCompleted() means that TrackBuffersManager believes that the packet has the reduced presentation interval, but instead all audio samples contained are played.

The stripping was introduced for bug 1838058, where incorrect fragment durations were seen. Applying the fragment duration for a MediaSource resource is somewhat at odds with the way the coded frame processing should extend the duration as more frames arrive:

If the media segment contains data beyond the current duration, then run the duration change algorithm with new duration set to the maximum of the current duration and the [[group end timestamp]].

Arguably the trimming of padding from the end of the audio would mean that there is never "data beyond the current duration" and so the duration should never be extended if a trailing padding start point is specified, but I think there may exist a reasonable case for not applying this trailing padding with MediaSource resources.

The trimming of leading padding is important, however, because the leading padding of AAC audio should not be rendered if the intention is to splice with a previous fragment.
Content could use appendWindowStart to trim the leading padding, but I don't see why it should need to do this if the resource specifies the duration of leading padding.

Perhaps support could implemented for the leading padding but not the trailing.

The media resource does not specify any trimming around the overlap of the two fragments, and appendWindowStart and appendWindowEnd are not involved at the overlap (on this site).

MSE has an audio splice algorithm, but this would be activated only "If last decode timestamp for track buffer is unset" and the discontinuity between the fragments is not large enough to trigger a reset of last decode timestamp.

This kind of situation does not matter for video frames because the duration of the overlapped frame is implicitly truncated when the overlapping frame is presented. For audio, the consequence in Gecko is that the samples for both packets are rendered in sequence.

If the overlapped frame were truncated, then it would resolve A/V sync problems, but the join would still be undesirable if non-silent. When two fragments each have init segments, then whether to treat them as separate streams (and so restart decoding on the second stream) is not necessarily clear. In this case, the sample rate differs, so Gecko has good reason to treat the second fragment as a new stream. The misalignment of the fragments also seems a good reason to treat the second fragment as a new stream (though Gecko does not consider such small misalignments as a reason to do so). When the second fragment is a new stream, rendering its initial samples is not desirable (unless they are silent), so we need to either depend on the site to indicate padding via the resource or appendWindowStart or guess that the historical 2112 sample intervals of padding is what needs to be removed. Trimming 2112 samples here would leave a gap.

I don't think this site is a good example to motivate our choice of what to do around audio overlaps.

I think this pernosco view is quite telling of the original issue in bug 1838058. We should fix this proper and remove the wallpaper fix landed there, to fix this bug.

I reached out to SVT on the overlap issue. I'm not convinced it's something we can fix, because even if we handled the overlap, switching back to the original stream would mean there's a gap instead. We'd be forced to invent some data, or skip, both options resulting in an audible glitch.

MSE related

Flags: needinfo?(padenot)
Severity: -- → S2
Severity: S2 → S3
You need to log in before you can comment on or make changes to this bug.