474748 - Optimize video frame rendering

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Description

•

16 years ago

From a quick Shark profile of playing a 720p stream, 19% of the time is spent in oggplay_yuv2bgr. But ignoring that for now (separate bug), the next biggest chunk is 14% in cairo_surface_fill, from nsMediaDecoder::Paint. From what I can tell, the decode process looks like this: OggDecodeStateMachine NextFrame -> HandleVideoData -> allocates a w*h*4 buffer, mVideoData in the frame -> calls yuv2bgr do to yuv -> rgb into the new buffer PlayFrame -> PlayVideo -> grabs video update lock -> calls mDecoder->SetRGBData(w, h, rate, rgb) -> SetRGBData copies rgb into its own buffer -> releases video update lock -> deletes the frame (and the buffer allocated in NextFrame) Then in Paint, it grabs the video update lock for the duration of the operation, creates a gfxImageSurface that references the latest bits received in SetRGBData, and draws with EXTEND_PAD, a pixel-snapped rectangle, and a Fill(). We're hitting the slowest fill path on OSX at least; the EXTEND_PAD is probably not helping win32 (and certainly not linux). I'm looking into OSX now; in an ideal world, we would be able to quickly render an rgb buffer to any destination surface. So that's a bug if that's not happening. However, there are some misc things that we could do with the above, mainly related to memory usage: 1) Allow SetRGBData to take the buffer instead of having to copy it each time. 2) Reuse the buffer instead of allocating a new one for each frame; this would help especially in cases where we have to drop some frames. 3) Set up a two-buffer round robin queue; this would require some interlock in Paint(), but I think it would be a net win. I think #3 is the most interesting option. What I'm thinking is to allocate two buffers, A and B. Keep decoding frames into A, until the main thread grabs the lock and asks for a frame. At that point give the main thread A, and set up B as the decode target. When the main thread grabs the lock and asks for a frame again, hand it back B, and swap back to decoding to A. This ensures that the maximum memory consumption for frame data will always be the size of two decoded frames. The SetRGB call could be replaced with a "frame available" call that would just increment a frame counter; Paint() could compare that counter with the last frame that it had, so it knows when it already has the latest frame and can avoid requesting it again. This call wouldn't even require the lock (could just be an atomic increment).

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 1

•

16 years ago

More on the OSX issue (apologies for mixing a few different things in this bug): - If EXTEND_PAD is set, we go down the slow path on Quartz. The rendering isn't any different than EXTEND_NONE though, so we should just ignore it anyway. Fix: treat EXTEND_PAD as EXTEND_NONE for surface sources. We never rely on the full PAD behaviour anyway. - When we pass in an image surface, we call _cairo_surface_to_cgimage which has no choice but to call surface_snapshot, which results in a copy of the image bits being created. The reason it has to create a snapshot is because it needs to have control over when the data buffer is deallocated; when printing and when we go through the pattern path, the actual paint can happen much later after our actual Paint call (because the printer does random magic stuff, buffers commands, whatever). It assumes that it can take a reference to a CGImageRef and have the data stick around; this isn't the case when we create a CGImage with an externally allocated buffer. The fix here is to have platform-specific ways to allocate the A/B buffers I mentioned above; for OSX, this would be a QuartzImageSurface. These will create the CGImage with the CGImage owning the data, and we can write to it as needed. There still could be a race condition here if we update the data while there's an outstanding paint with it, but at worst that will result in the wrong video frame being printed or something if print happens while playback is still going. Doesn't seem like a big deal. If we had gfxQuartzImageSurfaces here, then there would be no copies involved in rendering, just a very fast blit. We'd be decoding straight into the correct buffer to begin with as well. On Windows we just want gfxWindowsSurfaces.

cajbir (:cajbir)

Comment 2

•

16 years ago

We could also avoid the current cost of allocating the RGB buffer on each frame by storing the YUV data in the frame instead. This is already allocated and given to us by liboggplay. We'd have to reorganise things so liboggplay doesn't delete the YUV data until we're done with it. This would avoid the cost of doing the yuv2rgb conversion when the frame is decoded and we could move this to when the frame is actually about to be painted. This would also allow us to keep the frame queue of decoded frames, rather than limiting to two decoded frames only. Keeping more allows us to decode ahead during idle time and helps lower the cost of decoding key frames just before they are about to be displayed. We used to decode these frames on a separate thread so a frame that took a long time to decode wouldn't cause audio to skip. Maybe it's worthwhile revisiting that. I notice that if I play the 720p video and open a new tab, such that the video is not displayed, then the audio plays smoothly.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 3

•

16 years ago

(In reply to comment #2) > We could also avoid the current cost of allocating the RGB buffer on each > frame by storing the YUV data in the frame instead. This is already allocated > and given to us by liboggplay. We'd have to reorganise things so liboggplay > doesn't delete the YUV data until we're done with it. That would mean doing the YUV2RGB on the main thread, at paint time, and currently it's rather slow, so that could be a problem. Maybe if we have SSE2 code on all platforms, and that's a lot faster, we can do this. > We used to decode these frames on a separate thread so a frame that took a long > time to decode wouldn't cause audio to skip. Maybe it's worthwhile revisiting > that. I'd hate to do thatl. > I notice that if I play the 720p video and open a new tab, such that the video > is not displayed, then the audio plays smoothly. But we're still decoding the video in that case, right? So that can't be a decoding issue.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 4

•

16 years ago

(In reply to comment #3) > (In reply to comment #2) > > We could also avoid the current cost of allocating the RGB buffer on each > > frame by storing the YUV data in the frame instead. This is already allocated > > and given to us by liboggplay. We'd have to reorganise things so liboggplay > > doesn't delete the YUV data until we're done with it. > > That would mean doing the YUV2RGB on the main thread, at paint time, and > currently it's rather slow, so that could be a problem. Maybe if we have SSE2 > code on all platforms, and that's a lot faster, we can do this. Yeah, I'd like to avoid moving that to the main thread. At some point in the future YUV -> RGB might become "free", when we can allocate an actual hardware YUV surface type and have that conversion happen on the video card, but we're a little ways away from that. Though Chris is right, I didn't consider the case where we'd be decoding faster than we're actually playing. Perhaps in that case we should have some kind of upper bound on memory usage, and have us stop decoding if we ever hit that until we actually process some more frames.

cajbir (:cajbir)

Comment 5

•

16 years ago

> That would mean doing the YUV2RGB on the main thread, > at paint time, and currently it's rather slow, so that > could be a problem. We could do it on the current decoder thread, at the time we call SetRGBData. This would at least avoid calling it when we skip frames. > But we're still decoding the video in that case, right? > So that can't be a decoding issue. We're also still doing the memory allocation and yuv>rgb conversion. Which is why I thought it interesting. Is the only thing not happening in that case the actual painting? > Perhaps in that case we should have some kind of > upper bound on memory usage, and have us stop decoding > if we ever hit that until we actually process some more frames. This is what happens at the moment. We keep a maximum of 20 frames around.

Biju

Updated

•

16 years ago

Blocks: 476397

Matthew Gregan [:kinetik]

Assignee

Comment 6

•

16 years ago

I was poking at this today and found that EXTEND_PAD is really hurting us on Linux (no surprise), to the point where a 720x480 video won't play back without stuttering on a 1.6GHz Core Duo because we spend upwards of 70ms inside the call to aContext->Fill() in nsMediaDecoder::Paint. A quick hack to change this to EXTEND_NONE drops the time to around 4ms. With EXTEND_NONE in place, the next culprit is yuv2rgb, which can be fixed by fixing bug 474749. With the MMX yuv2rgb in place, the inefficient copying of frame data is near the top, but only around 22% of the profile (yuv2rgb is down to around 10% at this point).

Christopher Blizzard (:blizzard)

Updated

•

16 years ago

Blocks: 482641

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 7

•

16 years ago

alright, let's take a patch to turn off EXTEND_PAD for xlib and xcb target surfaces. Some reftests will probably need to be disabled for GTK.

Matthew Gregan [:kinetik]

Assignee

Comment 8

•

16 years ago

Attached patch EXTEND_NONE patch v0 (obsolete) — Details — Splinter Review

Use EXTEND_NONE if the target surface is X11. I've got an experimental patch that keeps liboggplay's YUV and decodes it during the SetRGB call. Doing that removes an alloc/copy/free for each frame, and reduces the memory used by the FrameQueue. I'll clean up the patch and post it soon.

Attachment #367556 - Flags: superreview?(roc)

Attachment #367556 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 9

•

16 years ago

+ gfxASurface::gfxSurfaceType type = aContext->OriginalSurface()->GetType(); Don't we want CurrentSurface here?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 10

•

16 years ago

Why EXTEND_NONE only if it's X11? We may as well just do the extra work to see if we'd actually need PAD, and turn it off everywhere if not... though I guess we may want it off unconditionally on X11.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 11

•

16 years ago

EXTEND_PAD is the right thing to use. Video should draw like images: the rectangle snapped to device pixels, EXTEND_PAD used so that there are no seams between video content and adjacent images, videos or background colors. So, we should use PAD everywhere we can. Here we are disabling PAD just to work around an X bug. I don't see any cases where leaving PAD enabled on X targets would be desirable.

Matthew Gregan [:kinetik]

Assignee

Comment 12

•

16 years ago

Attached patch EXTEND_NONE patch v1 — Details — Splinter Review

Use current surface rather than original surface. Sorry I missed that.

Attachment #367556 - Attachment is obsolete: true

Attachment #367666 - Flags: superreview?(roc)

Attachment #367666 - Flags: review?(roc)

Attachment #367556 - Flags: superreview?(roc)

Attachment #367556 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

16 years ago

Attachment #367666 - Flags: superreview?(roc)

Attachment #367666 - Flags: superreview+

Attachment #367666 - Flags: review?(roc)

Attachment #367666 - Flags: review+

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

16 years ago

Flags: wanted1.9.1+

Whiteboard: [needs landing]

Matthew Gregan [:kinetik]

Assignee

Comment 13

•

16 years ago

Attached patch defer YUV patch v0 — Details — Splinter Review

Defer YUV conversion until we're going to paint the frame. FrameData now holds a pinned reference to liboggplay's decoded YUV buffer, reducing memory used by queued frames. YUV conversion moves from HandleVideoData (in the decode stage) to PaintVideo (called when we want to paint a video frame). SetRGBData now assumes ownership of the decoded RGB buffer passed in. I moved locking mVideoUpdateLock inside SetRGBData since there doesn't seem to be any reason to force the caller to take the lock. This makes a worthwhile difference to memory use, but doesn't improve decode/render speed much. Combined with Chris's patch to skip video frames when they miss the desired presentation time (bug 470639), it'll make a bigger difference for videos we're already struggling to play. With this change, it'll be easier to move to a scheme where we avoid the allocation of the RGB buffer for each frame we paint. As Vlad suggested, the decoder could have a small queue of native surfaces which we could YUV decode directly into.

Assignee: nobody → kinetik

Attachment #367716 - Flags: superreview?(roc)

Attachment #367716 - Flags: review?(chris.double)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

16 years ago

Attachment #367716 - Flags: superreview?(roc) → superreview+

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 14

•

16 years ago

Comment on attachment 367716 [details] [diff] [review] defer YUV patch v0 PaintVideo is a crappy name since we may not actually paint this frame and we're certainly not going to do it right now. It's more like SetCurrentVideoFrame.

Matthew Gregan [:kinetik]

Assignee

Comment 15

•

16 years ago

It's actually PlayVideo, my comment is wrong (I thought I fixed that before I submitted). It's still not the best name, shall I rename it to SetCurrentFrame and resubmit the patch?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 16

•

16 years ago

Yes, if you think SetCurrentFrame is a better name.

cajbir (:cajbir)

Updated

•

16 years ago

Attachment #367716 - Flags: review?(chris.double) → review+

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 17

•

16 years ago

Oh well, I checked it in as-is by accident. Guess that's OK. http://hg.mozilla.org/mozilla-central/rev/fd0eac910c25 http://hg.mozilla.org/mozilla-central/rev/3de6334ba553 I'm going to mark this as fixed since we fixed some real stuff here. Let's file new bugs on new issues.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Whiteboard: [needs landing] → [needs 191 landing]

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

16 years ago

Attachment #367666 - Flags: approval1.9.1+

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

16 years ago

Attachment #367716 - Flags: approval1.9.1+

Alfred Kayser

Comment 18

•

16 years ago

It is worthwhile to create a bug for the last comment from #13? "avoid the allocation of the RGB buffer for each frame we paint. As Vlad suggested, the decoder could have a small queue of native surfaces which we could YUV decode directly into."

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 19

•

16 years ago

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/11db14251011 http://hg.mozilla.org/releases/mozilla-1.9.1/rev/3d966bdfac88

Keywords: fixed1.9.1

Whiteboard: [needs 191 landing]

Matthew Gregan [:kinetik]

Assignee

Updated

•

16 years ago

Depends on: 497837

EXTEND_NONE patch v0 16 years ago Matthew Gregan [:kinetik] 2.16 KB, patch		Details \| Diff \| Splinter Review
EXTEND_NONE patch v1 16 years ago Matthew Gregan [:kinetik] 2.20 KB, patch	roc : review+ roc : superreview+ roc : approval1.9.1+	Details \| Diff \| Splinter Review
defer YUV patch v0 16 years ago Matthew Gregan [:kinetik] 7.74 KB, patch	cajbir : review+ roc : superreview+ roc : approval1.9.1+	Details \| Diff \| Splinter Review