Closed Bug 1400254 Opened 7 years ago Closed 7 years ago

Losing Audio sync with video by using timestampOffset in MSE SourceBuffer

Categories

(Core :: Audio/Video: Playback, defect)

55 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: mkleinschmidt, Assigned: jya)

References

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
Build ID: 20170824053622

Steps to reproduce:

Precondition
SampleRate=48000Hz, BitRate=192000, BitsPerSample=16

* add an audio part to SourceBuffer timeline with e.g. timestampOffset=-293.72, appendWindowStart=0, appendWindowEnd=5.36
* add the same part to timeline with timestampOffset=-288.36 (audio is placed after the previous), appendWindowStart=5.36, appendWindowEnd=10.72
* Example created: http://www.digamailboxdepot.net:8090/mse-audio-splice.html


Actual results:

A little gap is between the audio parts, which leads to loose the synchronisation with video with mutliple clips (more than 2 as in STR)


Expected results:

No gap shall be there as in Chrome Browser.
I forgot to tell, that the gap is visualized in the example http://www.digamailboxdepot.net:8090/mse-audio-splice.html in the "Buffer:" line. In FF there are 2 parts, in Chrome there is only one part.
Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core
Chrome implementation of the buffered attribute is incorrect, as well as their timestampOffset..
So how it behaves and comparing to Firefox is meaningless.
See Also: → 1400598
Assignee: nobody → jyavenard
Some analysis:

Caveat: the use of appendWindowStart and appendWindowEnd do not allow to add partial frames to the source buffer.
See https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing
if a frame start time is before appendWindowStart or if the frame end time is after appendWindowEnd, the entire frame is dropped.
(see https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing step 8 and 9)

There is a provision to perform audio splices, but this is a "MAY support" and isn't a normative requirement.


Step by steps run:
first appendBuffer criterias are:
timestampOffset = -293.72s
appendWindowStart = 0s
appendWindowEnd = 5.36s

One thing to keep in mind, is that samples use integers for their timestamp in a mp4, while timestampOffset is a floating point number. -293.72 has no exact representation in integers microseconds, the closest value is -293719999us. So this is the value we will use in the calculations below (this rounding has no effect in the actual calculations, even if we had used floating point the end results would have been the same).
The use of integers isn't prohibited by the spec:
"Implementations don't have to internally store timestamps in a double precision floating point representation. This representation is used here because it is the represention for timestamps in the HTML spec."


So the first samples added to the source buffer will be: (times are in microseconds)
ProcessFrames: Processing audio/mp4a-latm frame(pts:293738666 end:293760000, dts:293738666, duration:21334, kf:1)

as: 293738666 - 293719999 = 18667us which is >= appendWindowStart and < appendWindowEnd

The previous sample in the stream had a pts of 293717333us:
293717333 - 293719999 = -2666us, as this is less than append window start, the sample is discarded
"If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame. "

The last sample added to the interval:
frame(pts:299050666 end:299072000, dts:299050666, duration:21334, kf:1)
299050666 - 293719999 = 5330667 and 299072000 = 5352001

the next sample times are:
frame(pts:299072000 end:299093333, dts:299072000, duration:21333, kf:1)
299072000 - 293719999us = 5352001 and 299093333 - 293719999us = 5373334
while 5352001 is >= appendWindowStart , we have the end time > appendWindowEnd
per spec:
"If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame. "

so the frame is dropped.

As such, the first operation gives us a buffered range of [0.018667, 5.352001]

The second appendBuffer uses the following criteria:
timestampOffset is -288.36s (rounded to -288359999us)
appendWindowStart = 5.36s
appendWindowEnd = 10.72s

The first frame added with those criterias is:
frame(pts:293738666 end:293760000, dts:293738666, duration:21334, kf:1)
as
293738666 - 288359999us = 5378667us and 293760000 - 288359999 = 5400001us
5378667us is >= appendWindowEnd and 5400001us is < appendWindowEnd

the previous frame being:
frame(pts:293717333 end:293738666, dts:293717333, duration:21333, kf:1)
293717333 - 288359999us = 5357334us and 293738666 - 288359999us = 5378667us
5357334us is less than appendWindowEnd so the frame is dropped
"If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame. "

So this explains why we have a buffered range with a gap from 5.352001s and 5.378667s 

Consequently the last frame added to the source buffer is:
frame(pts:299050666 end:299072000, dts:299050666, duration:21334, kf:1)
299050666 - 288359999 = 10690667 and 299072000 - 288359999 = 10712001

the next frame being:
frame(pts:299072000 end:299093333, dts:299072000, duration:21333, kf:1)
299072000 - 288359999 = 10712001 and 299093333 - 288359999 = 10733334us
10733334us is > 10.72s, and so the frame is dropped.

And so our final buffered range is:
[0.018667, 5.352001), (5.378667, 10.712001)
Exactly as per spec.

So the issue here is that no partial frames will be added to a source buffer when using appendWindowStart and appendWindowEnd.

*However*, you can still achieve what you want to do, because while no partial frame is ever added, a partial frame is never removed either, even when adding frames that overlaps (https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing step 13)
We only need to ensure that a frame in the interval [5.352001, 5.378667] isn't removed.

This can be done by setting appendWindowEnd to 5.378667 in the first place.

So this bug should be closed as invalid. the behaviour is exactly as per spec.

I will make an amendment to our code however, we have a workaround to allow frames *near* appendWindowStart, and so here we would allow to add to the source buffer the sample with a start time of -2666us..
I will remove this workaround if appendWindowEnd is set to a different value than the default.

Also note that the gap should have no impact on A/V sync. The timestamps are still valid.
Summary: Loosing Audio sync with video by using timestampOffset in MSE SourceBuffer → Losing Audio sync with video by using timestampOffset in MSE SourceBuffer
Comment on attachment 8909106 [details]
Bug 1400254 - Don't apply fuzz workaround when appendWindowEnd is set.

https://reviewboard.mozilla.org/r/180700/#review185774
Attachment #8909106 - Flags: review?(gsquelart) → review+
Pushed by jyavenard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/601946972c44
Don't apply fuzz workaround when appendWindowEnd is set. r=gerald
https://hg.mozilla.org/mozilla-central/rev/601946972c44
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
(In reply to Jean-Yves Avenard [:jya] from comment #3)
> Some analysis:
> 
> Caveat: the use of appendWindowStart and appendWindowEnd do not allow to add
> partial frames to the source buffer.
> See https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing
> if a frame start time is before appendWindowStart or if the frame end time
> is after appendWindowEnd, the entire frame is dropped.
> (see https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing
> step 8 and 9)
> 
> There is a provision to perform audio splices, but this is a "MAY support"
> and isn't a normative requirement.
> 
> 
> Step by steps run:
> first appendBuffer criterias are:
> timestampOffset = -293.72s
> appendWindowStart = 0s
> appendWindowEnd = 5.36s
> 
> One thing to keep in mind, is that samples use integers for their timestamp
> in a mp4, while timestampOffset is a floating point number. -293.72 has no
> exact representation in integers microseconds, the closest value is
> -293719999us. So this is the value we will use in the calculations below
> (this rounding has no effect in the actual calculations, even if we had used
> floating point the end results would have been the same).
> The use of integers isn't prohibited by the spec:
> "Implementations don't have to internally store timestamps in a double
> precision floating point representation. This representation is used here
> because it is the represention for timestamps in the HTML spec."
> 
> 
> So the first samples added to the source buffer will be: (times are in
> microseconds)
> ProcessFrames: Processing audio/mp4a-latm frame(pts:293738666 end:293760000,
> dts:293738666, duration:21334, kf:1)
> 
> as: 293738666 - 293719999 = 18667us which is >= appendWindowStart and <
> appendWindowEnd
> 
> The previous sample in the stream had a pts of 293717333us:
> 293717333 - 293719999 = -2666us, as this is less than append window start,
> the sample is discarded
> "If presentation timestamp is less than appendWindowStart, then set the need
> random access point flag to true, drop the coded frame, and jump to the top
> of the loop to start processing the next coded frame. "
> 
> The last sample added to the interval:
> frame(pts:299050666 end:299072000, dts:299050666, duration:21334, kf:1)
> 299050666 - 293719999 = 5330667 and 299072000 = 5352001
> 
> the next sample times are:
> frame(pts:299072000 end:299093333, dts:299072000, duration:21333, kf:1)
> 299072000 - 293719999us = 5352001 and 299093333 - 293719999us = 5373334
> while 5352001 is >= appendWindowStart , we have the end time >
> appendWindowEnd
> per spec:
> "If frame end timestamp is greater than appendWindowEnd, then set the need
> random access point flag to true, drop the coded frame, and jump to the top
> of the loop to start processing the next coded frame. "
> 
> so the frame is dropped.
> 
> As such, the first operation gives us a buffered range of [0.018667,
> 5.352001]
> 
> The second appendBuffer uses the following criteria:
> timestampOffset is -288.36s (rounded to -288359999us)
> appendWindowStart = 5.36s
> appendWindowEnd = 10.72s
> 
> The first frame added with those criterias is:
> frame(pts:293738666 end:293760000, dts:293738666, duration:21334, kf:1)
> as
> 293738666 - 288359999us = 5378667us and 293760000 - 288359999 = 5400001us
> 5378667us is >= appendWindowEnd and 5400001us is < appendWindowEnd
> 
> the previous frame being:
> frame(pts:293717333 end:293738666, dts:293717333, duration:21333, kf:1)
> 293717333 - 288359999us = 5357334us and 293738666 - 288359999us = 5378667us
> 5357334us is less than appendWindowEnd so the frame is dropped
> "If presentation timestamp is less than appendWindowStart, then set the need
> random access point flag to true, drop the coded frame, and jump to the top
> of the loop to start processing the next coded frame. "
> 
> So this explains why we have a buffered range with a gap from 5.352001s and
> 5.378667s 
> 
> Consequently the last frame added to the source buffer is:
> frame(pts:299050666 end:299072000, dts:299050666, duration:21334, kf:1)
> 299050666 - 288359999 = 10690667 and 299072000 - 288359999 = 10712001
> 
> the next frame being:
> frame(pts:299072000 end:299093333, dts:299072000, duration:21333, kf:1)
> 299072000 - 288359999 = 10712001 and 299093333 - 288359999 = 10733334us
> 10733334us is > 10.72s, and so the frame is dropped.
> 
> And so our final buffered range is:
> [0.018667, 5.352001), (5.378667, 10.712001)
> Exactly as per spec.
> 
> So the issue here is that no partial frames will be added to a source buffer
> when using appendWindowStart and appendWindowEnd.
> 
> *However*, you can still achieve what you want to do, because while no
> partial frame is ever added, a partial frame is never removed either, even
> when adding frames that overlaps
> (https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing
> step 13)
> We only need to ensure that a frame in the interval [5.352001, 5.378667]
> isn't removed.
> 
> This can be done by setting appendWindowEnd to 5.378667 in the first place.
> 
> So this bug should be closed as invalid. the behaviour is exactly as per
> spec.
> 
> I will make an amendment to our code however, we have a workaround to allow
> frames *near* appendWindowStart, and so here we would allow to add to the
> source buffer the sample with a start time of -2666us..
> I will remove this workaround if appendWindowEnd is set to a different value
> than the default.
> 
> Also note that the gap should have no impact on A/V sync. The timestamps are
> still valid.

Thanks very much for the explanation! What I also noticed from you is that appendWindowEnd value has to be larger to get the expected audio (>= appendWindowStart, < appendWindowEnd).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: