Decouple video and audio frame delivery

Status: NEW
Assignee: Unassigned
Component: Core :: Audio/Video: Playback
Reported: 4 years ago
Last modified: 2 years ago
People: (Reporter: jesup, Unassigned)

Tracking Flags: (Not tracked)

Whiteboard: [getusermedia]

(Reporter)

Description

4 years ago
This is related to a number of other bugs and issues surrounding the differences between realtime video input (where frames do not have pre-known durations) and recorded video (where durations are typically known).

Roc has some ideas on how to handle this. This is a hook for that work.

Comment 1

4 years ago
jesup: can you elaborate on what you think is needed here?
Flags: needinfo?(rjesup)
Comment 2

(In reply to Eric Rescorla (:ekr) from comment #1)
> jesup: can you elaborate on what you think is needed here?

Rob - Based on your discussion with Randell tonight in #media, I think you're planning to get to this soon, so I'm reassigning to you.  Can you describe what you're planning to do and when you're hoping to do it? (In time for Firefox 27?  Firefox 28?)  Thanks!
Assignee: nobody → roc
Flags: needinfo?(roc)
Comment 3

Firefox 27.

The plan: (not necessarily in order)
-- Extend ImageContainer to contain a set of images, each with a TimeStamp (real time) which is the presentation time of the image. Teach the compositor to pick the first image whose TimeStamp is <= the current time when compositing.
-- Remove the VideoSegment stuff. Video tracks no longer contain Image references.
-- MediaStreamGraph computes at each tick, for each SourceMediaStream, the set of video output sinks (VideoFrameContainers) that that stream's video frames are going to. It gives each SourceMediaStream the complete list of ImageContainers its video frames need to be placed in. It also computes the window of time into the future for which the current list of output ImageContainers is guaranteed to be valid; SourceMediaStreams should not submit frames whose presentation time is beyond that time, because streams might get disconnected, making those frames invalid. We'll have to do something here to account for potential stream blocking, as well.
-- The code feeding each SourceMediaStream (gUM, MediaDecoderStateMachine) is responsible for stuffing video frames into the SourceMediaStream's output ImageContainers. It is allowed to overwrite any and all images already in the ImageContainer, so for example a camera can replace any current image(s) with new ones regardless of what the TimeStamps say. There won't be any waiting for the MSG thread to iterate.
-- Add a multi-thread-safe API on ImageContainer to dispatch notifications when the image set is updated in any way. This will give a video encoder immediate access to any new images.
Flags: needinfo?(roc)
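A rough sketch of the first bullet of the plan, with invented names (TimestampedImage, PickImageForCompositing) rather than the real Gecko interfaces; it assumes the image set is kept sorted by ascending TimeStamp and reads "pick the first image whose TimeStamp is <= the current time" as the latest image that is already due:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stand-ins for mozilla::TimeStamp and layers::Image.
using TimeStamp = double; // seconds on the system clock
struct Image {};

struct TimestampedImage {
  std::shared_ptr<Image> mImage;
  TimeStamp mTimeStamp; // real-time presentation time of this frame
};

// Compositor-side selection: given the container's image set (sorted by
// ascending TimeStamp), return the latest image that is due at aNow, or
// nullptr if no image is due yet.
const TimestampedImage*
PickImageForCompositing(const std::vector<TimestampedImage>& aImages,
                        TimeStamp aNow)
{
  const TimestampedImage* due = nullptr;
  for (const auto& entry : aImages) {
    if (entry.mTimeStamp <= aNow) {
      due = &entry; // later due frames supersede earlier ones
    } else {
      break; // sorted: everything past here is still in the future
    }
  }
  return due;
}
```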

Comment 4

4 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #3)
> Firefox 27.
> 
Comment 4 continues below.
> The plan: (not necessarily in order)
> -- Extend ImageContainer to contain a set of images, each with a TimeStamp
> (real time) which is the presentation time of the image.

Would this be the capture time of the image for gUM? In what timebase?


> Teach the
> compositor to pick the first image whose TimeStamp is <= the current time
> when compositing.

While you're in there, it would be good to have it have multiple
representations of the same image at the same timestamp, in order
to accommodate hardware-encoding cameras which simultaneously
deliver media in encoded form and I420.

> -- Remove the VideoSegment stuff. Video tracks no longer contain Image
> references.
> -- MediaStreamGraph computes at each tick, for each SourceMediaStream, the
> set of video output sinks (VideoFrameContainers) that that stream's video
> frames are going to. It gives each SourceMediaStream the complete list of
> ImageContainers its video frames need to be placed in.

Each sink has its own ImageContainer?


> It also computes the
> window of time into the future for which the current list of output
> ImageContainers is guaranteed to be valid; SourceMediaStreams should not
> submit frames whose presentation time is beyond that time, because streams
> might get disconnected, making those frames invalid. We'll have to do
> something here to account for potential stream blocking, as well.
> -- The code feeding each SourceMediaStream (gUM, MediaDecoderStateMachine)
> is responsible for stuffing video frames into the SourceMediaStream's output
> ImageContainers. It is allowed to overwrite any and all images already in
> the ImageContainer, so for example a camera can replace any current image(s)
> with new ones regardless of what the TimeStamps say. There won't be any
> waiting for the MSG thread to iterate.

What keeps the image containers locked while I'm doing that?

More generally, I'm trying to get clear on the data ownership
model here. As I understand it, in a number of cases we are
passing around references to shared memory buffers in order
to reduce copies. How does that work in this environment?

> -- Add a multi-thread-safe API on ImageContainer to dispatch notifications
> when the image set is updated in any way. This will give a video encoder
> immediate access to any new images.
Comment 5

Thanks for the feedback! It's very helpful.

(In reply to Eric Rescorla (:ekr) from comment #4)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #3)
> > Firefox 27.
> > 
> > The plan: (not necessarily in order)
> > -- Extend ImageContainer to contain a set of images, each with a TimeStamp
> > (real time) which is the presentation time of the image.
> 
> Would this be the capture time of the image for gUM? In what timebase?

By the time the images are stored in ImageContainer they should be stamped with their real-time presentation time (system clock). This is what the compositor needs. When the MSG tells a SourceMediaStream about its destination ImageContainer(s), it will pass along the mapping from video track time to real time that the SourceMediaStream should use, which will take into account any necessary sync with audio output.

A gUM video source should probably just ignore that mapping and use the capture time, since we won't be trying to sync with audio output and we just want to display the latest frame available at all times. Using the capture time (which must be in the past) will achieve that.
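A minimal sketch of that mapping, assuming a simple linear track-time to real-time relationship; none of these names are real Gecko interfaces, and per the paragraph above, a gUM source would skip the mapping and stamp frames with their capture time instead:

```cpp
#include <cassert>

// Hypothetical linear mapping from video-track time to system-clock time,
// as the MSG might hand to a SourceMediaStream; both values are seconds
// here for simplicity.
struct TrackTimeToRealTime {
  double mTrackTimeOrigin; // track time corresponding to mRealTimeOrigin
  double mRealTimeOrigin;  // system-clock time

  double ToRealTime(double aTrackTime) const {
    return mRealTimeOrigin + (aTrackTime - mTrackTimeOrigin);
  }
};

// A non-gUM source stamps each frame through the mapping, which bakes in
// any audio sync; a gUM source would ignore it and use the capture time.
double PresentationTimeFor(const TrackTimeToRealTime& aMap,
                           double aTrackTime) {
  return aMap.ToRealTime(aTrackTime);
}
```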

> > Teach the
> > compositor to pick the first image whose TimeStamp is <= the current time
> > when compositing.
> 
> While you're in there, it would be good to have it have multiple
> representations of the same image at the same timestamp, in order
> to accommodate hardware-encoding cameras which simultaneously
> deliver media in encoded form and I420.

We've discussed some ideas about piping through encoded data. In fact, didn't you and I discuss them? I think we should do that as a separate project though.

> > -- MediaStreamGraph computes at each tick, for each SourceMediaStream, the
> > set of video output sinks (VideoFrameContainers) that that stream's video
> > frames are going to. It gives each SourceMediaStream the complete list of
> > ImageContainers its video frames need to be placed in.
> 
> Each sink has its own ImageContainer?

Yes, I think that's the best way to do it. We could go the other way and give each source an ImageContainer and propagate those to sinks, but we don't have the ability to change the current ImageContainer for an ImageLayer off the main thread, so I think graph changes would be harder to handle well.

> > It also computes the
> > window of time into the future for which the current list of output
> > ImageContainers is guaranteed to be valid; SourceMediaStreams should not
> > submit frames whose presentation time is beyond that time, because streams
> > might get disconnected, making those frames invalid. We'll have to do
> > something here to account for potential stream blocking, as well.
> > -- The code feeding each SourceMediaStream (gUM, MediaDecoderStateMachine)
> > is responsible for stuffing video frames into the SourceMediaStream's output
> > ImageContainers. It is allowed to overwrite any and all images already in
> > the ImageContainer, so for example a camera can replace any current image(s)
> > with new ones regardless of what the TimeStamps say. There won't be any
> > waiting for the MSG thread to iterate.
> 
> What keeps the image containers locked while I'm doing that?

The consumers of image data from an ImageContainer either call LockCurrentImage or GetCurrentImageAsSurface; these are basically the same but the former returns an Image and the latter returns a gfxASurface and can only be used on the main thread. Those calls are mutually exclusive with the SourceMediaStream's updates to the ImageContainer. My current idea for the update API is a method that simply replaces the entire set of images with a new set, atomically. That will work fine on the assumption that there's only one writer to an ImageContainer at a time.

Note that despite its name, for our purposes LockCurrentImage is just a getter and does not actually lock the container.
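Under those assumptions (single writer, atomic wholesale replacement of the image set, LockCurrentImage as a plain getter), the update API might look roughly like this; the class and method bodies are illustrative only, not the real gfx code:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

struct Image {}; // stand-in for layers::Image (immutable, ref-counted)

class ImageContainerSketch {
public:
  // Writer side: atomically replace the entire image set. With a single
  // writer, readers always observe either the old set or the new one.
  void SetImages(std::vector<std::shared_ptr<Image>> aImages) {
    std::lock_guard<std::mutex> lock(mMutex);
    mImages = std::move(aImages);
  }

  // Reader side: despite the name, this is just a getter -- it takes a
  // reference to the newest image and releases the mutex immediately.
  std::shared_ptr<Image> LockCurrentImage() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mImages.empty() ? nullptr : mImages.back();
  }

private:
  mutable std::mutex mMutex;
  std::vector<std::shared_ptr<Image>> mImages;
};
```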

> More generally, I'm trying to get clear on the data ownership
> model here. As I understand it, in a number of cases we are
> passing around references to shared memory buffers in order
> to reduce copies. How does that work in this environment?

Unchanged. We'll stick with the idea that Image objects are immutable and reference-counted. We will do whatever's necessary to ensure that the lifetimes of Image references are short; e.g. it's OK for code to get a reference to an ImageContainer's Image for compositing, or for drawing to a canvas, or for encoding a video frame, but not OK for a DOM object to hold onto an Image reference indefinitely.

The hard case is when Images wrap a scarce resource, e.g. a gralloc buffer from the camera's small pool of capture buffers (perhaps just 2 buffers). In that case an ImageContainer fed by gUM will always store exactly one image, the latest one. When the SourceMediaStream puts a new image into the container, that image has to be forwarded to the compositor thread/process via IPDL, so the compositor might still be using the old image for a little while. Likewise we might have called GetCurrentImageAsSurface on the main thread to draw the image into a canvas, in which case the old image again might stay alive a little while. But by bounding these "little whiles" we should be able to ensure the old image is destroyed in time to recycle its gralloc buffer to capture the next camera frame. (Video encoding might be the hardest thing to bound, but if we can't encode video fast enough to keep up with camera capture ... that's a bigger problem :-).)
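The gralloc case can be illustrated with a tiny ref-counted pool; everything here (CapturePool, CaptureBuffer) is invented for illustration, but it shows the key property: an old Image holds its capture buffer only while some consumer still references it, and the buffer recycles when the last reference drops.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct CaptureBuffer { int mId; };

// Tiny pool standing in for the camera's small set of gralloc capture
// buffers. A buffer handed out here returns to the pool automatically
// when the last shared_ptr reference to it is dropped.
class CapturePool {
public:
  explicit CapturePool(int aCount) {
    for (int i = 0; i < aCount; ++i) {
      mFree.push_back(CaptureBuffer{i});
    }
  }

  std::shared_ptr<CaptureBuffer> Acquire() {
    if (mFree.empty()) {
      return nullptr; // capture must stall until a consumer releases one
    }
    CaptureBuffer buf = mFree.back();
    mFree.pop_back();
    // Custom deleter recycles the buffer on final release; the pool must
    // outlive every buffer it hands out.
    return std::shared_ptr<CaptureBuffer>(
        new CaptureBuffer(buf),
        [this](CaptureBuffer* aBuf) { mFree.push_back(*aBuf); delete aBuf; });
  }

  size_t FreeCount() const { return mFree.size(); }

private:
  std::vector<CaptureBuffer> mFree;
};
```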
(Reporter)

Comment 6

4 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #5)
> Thanks for the feedback! It's very helpful.
> 
> (In reply to Eric Rescorla (:ekr) from comment #4)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #3)
> > > Firefox 27.
> > > 
> > > The plan: (not necessarily in order)
> > > -- Extend ImageContainer to contain a set of images, each with a TimeStamp
> > > (real time) which is the presentation time of the image.
> > 
> > Would this be the capture time of the image for gUM? In what timebase?
> 
> By the time the images are stored in ImageContainer they should be stamped
> with their real time presentation time (system clock). This is what the
> compositor needs. When the MSG tells a SourceMediaStream about its
> destination ImageContainer(s), it will pass along the mapping from video
> track time to real time that the SourceMediaStream should use, which will
> take into account any necessary sync with audio output.

I'm not clear how this would work, and how it would deal with input-to-graph resampling when clock domains are crossed (though I'm sure it can be worked out).

> A gUM video source should probably just ignore that mapping and use the
> capture time, since we won't be trying to sync with audio output and we just
> want to display the latest frame available at all times. Using the capture
> time (which must be in the past) will achieve that.

gUM video needs to be synced with audio as well, even if routed directly to a <video> element. The only time you wouldn't is with a video-only capture.

In video-only capture, you can choose between always-latest -- which, due to jitter and beating of the camera capture frequency against the compositor frequency, can cause drops and dups of frames at times -- or a smoothed display with error tracking to decide when to drop or dup a frame. The smoothed display will generally look better, as any dups or drops will be regular and not random-ish (and you won't get beats where the capture and composite almost line up and you get a series of dup/drop/dup/drops). However, the smoothed display is slightly more complex, though perhaps it can piggyback off existing sync mechanisms.
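The smoothed-display option amounts to an error accumulator that decides, at each composite tick, whether to dup, show, or drop a frame. A hypothetical sketch, not existing Gecko code:

```cpp
#include <cassert>

// Error-tracking pacer for the "smoothed display" option. At each
// composite tick it decides how many captured frames to consume:
// 0 means dup the previous frame, 1 is the normal case, 2+ means drop
// frame(s). Because the decision comes from an accumulated error term,
// dups and drops land at regular intervals instead of beating randomly.
class FramePacer {
public:
  FramePacer(double aCaptureFps, double aCompositeFps)
      : mCaptureInterval(1.0 / aCaptureFps),
        mCompositeInterval(1.0 / aCompositeFps) {}

  int FramesToConsume() {
    mError += mCompositeInterval;
    int consumed = 0;
    while (mError >= mCaptureInterval) {
      mError -= mCaptureInterval;
      ++consumed;
    }
    return consumed;
  }

private:
  double mCaptureInterval;   // seconds per captured frame
  double mCompositeInterval; // seconds per composite tick
  double mError = 0.0;       // capture time owed to the display
};
```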

gUM routed to a PeerConnection will want to maintain source timestamps. Also note that video "source" timestamps are rarely times we actually measure from the system or hardware; to avoid problems, you want a 30fps source to lie and say the frames are captured exactly 3000 timestamp ticks (@90000Hz) apart (and at the hardware level they likely more-or-less are). The trick here is that you need to know the FPS of the camera. If the hardware doesn't capture at exactly 30fps, normal clock drift handling (i.e. timestamp->NTP correlation) will compensate. This is common, as cameras usually have their own clocks and so may capture at, say, 30.001fps -- after an hour that drift becomes noticeable, nearly 4 frames off.
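Spelling out the arithmetic above (90000 Hz is the standard RTP video timestamp clock; 30.001fps is the example figure; helper names are invented for illustration):

```cpp
#include <cassert>
#include <cmath>

// Ticks a source should claim between frames at a nominal frame rate.
int TicksPerFrame(int aClockHz, int aNominalFps) {
  return aClockHz / aNominalFps;
}

// Frames of drift accumulated per hour when the camera's own clock runs
// at aActualFps instead of the nominal rate.
double DriftFramesPerHour(double aActualFps, double aNominalFps) {
  return (aActualFps - aNominalFps) * 3600.0;
}
```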

> 
> > > Teach the
> > > compositor to pick the first image whose TimeStamp is <= the current time
> > > when compositing.
> > 
> > While you're in there, it would be good to have it have multiple
> > representations of the same image at the same timestamp, in order
> > to accommodate hardware-encoding cameras which simultaneously
> > deliver media in encoded form and I420.
> 
> We've discussed some ideas about piping through encoded data. In fact,
> didn't you and I discuss them? I think we should do that as a separate
> project though.

Agreed. Also, Google just ripped out the support for VideoCaptureImpl::DeliverEncodedCapturedFrame().
We could in theory use such a mechanism for primitive discontinuous mesh calls, but bandwidth control and RTCP handling in general (FIR/PLI/etc) are a bit of a nightmare. At best this is low priority right now.

> > > -- MediaStreamGraph computes at each tick, for each SourceMediaStream, the
> > > set of video output sinks (VideoFrameContainers) that that stream's video
> > > frames are going to. It gives each SourceMediaStream the complete list of
> > > ImageContainers its video frames need to be placed in.
> > 
> > Each sink has its own ImageContainer?
> 
> Yes, I think that's the best way to do it. We could go the other way and
> give each source an ImageContainer and propagate those to sinks, but we
> don't have the ability to change the current ImageContainer for an
> ImageLayer off the main thread, so I think graph changes would be harder to
> handle well.

Is there one ImageContainer per sink, or one per source that feeds the sink? (What happens when multiple sources with video feed a single sink via, say, TrackUnion?) I'll note that I presume a sink is a track, not a stream, so a stream with 4 video tracks added to a peerconnection would have (I presume) 4 MediaPipeline sinks.

> > > It also computes the
> > > window of time into the future for which the current list of output
> > > ImageContainers is guaranteed to be valid; SourceMediaStreams should not
> > > submit frames whose presentation time is beyond that time, because streams
> > > might get disconnected, making those frames invalid. We'll have to do
> > > something here to account for potential stream blocking, as well.
> > > -- The code feeding each SourceMediaStream (gUM, MediaDecoderStateMachine)
> > > is responsible for stuffing video frames into the SourceMediaStream's output
> > > ImageContainers. It is allowed to overwrite any and all images already in
> > > the ImageContainer, so for example a camera can replace any current image(s)
> > > with new ones regardless of what the TimeStamps say. There won't be any
> > > waiting for the MSG thread to iterate.
> > 
> > What keeps the image containers locked while I'm doing that?
> 
> The consumers of image data from an ImageContainer either call
> LockCurrentImage or GetCurrentImageAsSurface; these are basically the same
> but the former returns an Image and the latter returns a gfxASurface and can
> only be used on the main thread. Those calls are mutually exclusive with the
> SourceMediaStream's updates to the ImageContainer. My current idea for the
> update API is a method that simply replaces the entire set of images with a
> new set, atomically. That will work fine on the assumption that there's only
> one writer to an ImageContainer at a time.
> 
> Note that despite its name, for our purposes LockCurrentImage is just a
> getter and does not actually lock the container.
> 
> > More generally, I'm trying to get clear on the data ownership
> > model here. As I understand it, in a number of cases we are
> > passing around references to shared memory buffers in order
> > to reduce copies. How does that work in this environment?
> 
> Unchanged. We'll stick with the idea that Image objects are immutable and
> reference-counted. We will do whatever's necessary to ensure that the
> lifetimes of Image references are short; e.g. it's OK for code to get a
> reference to an ImageContainer's Image for compositing, or for drawing to a
> canvas, or for encoding a video frame, but not OK for a DOM object to hold
> onto an Image reference indefinitely.
> 
> The hard case is when Images wrap a scarce resource, e.g. a gralloc buffer
> from the camera's small pool of capture buffers (perhaps just 2 buffers). In
> that case an ImageContainer fed by gUM will always store exactly one image,
> the latest one. When the SourceMediaStream puts a new image into the
> container, that image has to be forwarded to the compositor thread/process
> via IPDL, so the compositor might still be using the old image for a little
> while. Likewise we might have called GetCurrentImageAsSurface on the main
> thread to draw the image into a canvas, in which case the old image again
> might stay alive a little while. But by bounding these "little whiles" we
> should be able to ensure the old image is destroyed in time to recycle its
> gralloc buffer to capture the next camera frame. (Video encoding might be
> the hardest thing to bound, but if we can't encode video fast enough to keep
> up with camera capture ... that's a bigger problem :-).)
Flags: needinfo?(rjesup)
Comment 7

(In reply to Randell Jesup [:jesup] from comment #6)
> Is there one ImageContainer per sink, or one per source that feeds the sink?
> (What happens when multiple sources with video feed a single sink via, say,
> TrackUnion?) I'll note that I presume a sink is a track, not a stream, so a
> stream with 4 video tracks added to a peerconnection would have (I presume)
> 4 MediaPipeline sinks.

For video elements, it only makes sense to have one ImageContainer. For PeerConnection I guess you create one ImageContainer for every video track you've negotiated with the other side. So yeah, multiple ImageContainers per sink.
Component: Audio/Video → Audio/Video: Playback
Assignee: roc → nobody