Closed Bug 474748 Opened 16 years ago Closed 16 years ago

Optimize video frame rendering

Categories: Core :: Audio/Video, defect
Platform: x86, All
Priority: Not set
Severity: normal
Status: RESOLVED FIXED

People: (Reporter: vlad, Assigned: kinetik)

Keywords: fixed1.9.1

Attachments: (2 files, 1 obsolete file)

From a quick Shark profile of playing a 720p stream, 19% of the time is spent in oggplay_yuv2bgr. But ignoring that for now (separate bug), the next biggest chunk is 14% in cairo_surface_fill, from nsMediaDecoder::Paint.

From what I can tell, the decode process looks like this:

OggDecodeStateMachine NextFrame -> HandleVideoData:
- allocates a w*h*4 buffer, mVideoData, in the frame
- calls yuv2bgr to do the yuv -> rgb conversion into the new buffer

PlayFrame -> PlayVideo:
- grabs the video update lock
- calls mDecoder->SetRGBData(w, h, rate, rgb); SetRGBData copies rgb into its own buffer
- releases the video update lock
- deletes the frame (and the buffer allocated in NextFrame)

Then in Paint, it grabs the video update lock for the duration of the operation, creates a gfxImageSurface that references the latest bits received in SetRGBData, and draws with EXTEND_PAD, a pixel-snapped rectangle, and a Fill(). We're hitting the slowest fill path on OSX at least; the EXTEND_PAD is probably not helping win32 (and certainly not linux). I'm looking into OSX now; in an ideal world, we would be able to quickly render an rgb buffer to any destination surface, so it's a bug if that's not happening.

However, there are some miscellaneous things we could do with the above, mainly related to memory usage:

1) Allow SetRGBData to take ownership of the buffer instead of having to copy it each time.
2) Reuse the buffer instead of allocating a new one for each frame; this would help especially in cases where we have to drop some frames.
3) Set up a two-buffer round-robin queue; this would require some interlock in Paint(), but I think it would be a net win.

I think #3 is the most interesting option. What I'm thinking is to allocate two buffers, A and B. Keep decoding frames into A until the main thread grabs the lock and asks for a frame. At that point, give the main thread A and set up B as the decode target. When the main thread grabs the lock and asks for a frame again, hand it back B and swap back to decoding into A. This ensures that the maximum memory consumption for frame data is always the size of two decoded frames.

The SetRGB call could be replaced with a "frame available" call that would just increment a frame counter; Paint() could compare that counter with the last frame it painted, so it knows when it already has the latest frame and can avoid requesting it again. This call wouldn't even require the lock (it could just be an atomic increment).
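As a concrete illustration, here is a minimal sketch of that two-buffer scheme using NSPR primitives. The class and all member names (DoubleBuffer, GrabFrame, and so on) are hypothetical, not code from the tree:

// Hypothetical sketch of the two-buffer round robin (option #3).
// All names below are illustrative; nothing here exists in the tree.
#include "prlock.h"
#include "pratom.h"

class DoubleBuffer {
public:
  DoubleBuffer(PRUint32 aFrameBytes)
    : mLock(PR_NewLock()), mDecodeIndex(0), mFrameCounter(0)
  {
    mBuffers[0] = new unsigned char[aFrameBytes];
    mBuffers[1] = new unsigned char[aFrameBytes];
  }

  ~DoubleBuffer()
  {
    delete[] mBuffers[0];
    delete[] mBuffers[1];
    PR_DestroyLock(mLock);
  }

  // Decoder thread: the buffer to decode the next frame into.
  unsigned char* DecodeTarget() { return mBuffers[mDecodeIndex]; }

  // Decoder thread: signal a finished frame. No lock needed, just an
  // atomic bump; Paint() compares this against the last frame it drew
  // to avoid requesting a frame it already has.
  void FrameAvailable() { PR_AtomicIncrement(&mFrameCounter); }
  PRInt32 FrameCounter() { return mFrameCounter; }

  // Main thread: take the freshly decoded buffer and flip decoding to
  // the other one. (Synchronization with an in-progress decode into the
  // target is elided; the real interlock in Paint() would cover that.)
  unsigned char* GrabFrame()
  {
    PR_Lock(mLock);
    unsigned char* frame = mBuffers[mDecodeIndex];
    mDecodeIndex ^= 1;
    PR_Unlock(mLock);
    return frame;
  }

private:
  PRLock* mLock;
  unsigned char* mBuffers[2];
  int mDecodeIndex;
  PRInt32 mFrameCounter;
};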
More on the OSX issue (apologies for mixing a few different things in this bug):

- If EXTEND_PAD is set, we go down the slow path on Quartz. The rendering isn't any different from EXTEND_NONE, though, so we should just ignore it. Fix: treat EXTEND_PAD as EXTEND_NONE for surface sources; we never rely on the full PAD behaviour anyway (see the sketch at the end of this comment).

- When we pass in an image surface, we call _cairo_surface_to_cgimage, which has no choice but to call surface_snapshot, which results in a copy of the image bits being created. It has to create a snapshot because it needs control over when the data buffer is deallocated: when printing, and when we go through the pattern path, the actual paint can happen long after our Paint call (the printer does random magic stuff, buffers commands, whatever). It assumes it can take a reference to a CGImageRef and have the data stick around, which isn't the case when we create a CGImage with an externally allocated buffer.

The fix for the second problem is to have platform-specific ways to allocate the A/B buffers I mentioned above; for OSX, this would be a gfxQuartzImageSurface. That creates the CGImage with the CGImage owning the data, and we can write to it as needed. There could still be a race condition if we update the data while there's an outstanding paint using it, but at worst that results in the wrong video frame being printed if a print happens while playback is still going; that doesn't seem like a big deal. With gfxQuartzImageSurfaces there would be no copies involved in rendering, just a very fast blit, and we'd be decoding straight into the correct buffer to begin with. On Windows we just want gfxWindowsSurfaces.
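As a rough illustration of the first fix, here is what treating EXTEND_PAD as EXTEND_NONE for surface sources might look like inside the Quartz backend. The helper is hypothetical; only the cairo_pattern_* calls are real API:

/* Hypothetical helper inside cairo's Quartz backend: for surface-pattern
 * sources, rendering with EXTEND_PAD is identical to EXTEND_NONE in the
 * cases we hit, but PAD forces the slow fallback path, so treat it as
 * NONE. */
static cairo_extend_t
_effective_extend(cairo_pattern_t *pattern)
{
    cairo_extend_t extend = cairo_pattern_get_extend(pattern);
    if (extend == CAIRO_EXTEND_PAD &&
        cairo_pattern_get_type(pattern) == CAIRO_PATTERN_TYPE_SURFACE) {
        return CAIRO_EXTEND_NONE;
    }
    return extend;
}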
We could also avoid the current cost of allocating the RGB buffer on each frame by storing the YUV data in the frame instead. This is already allocated and given to us by liboggplay; we'd have to reorganise things so liboggplay doesn't delete the YUV data until we're done with it. This would avoid the cost of doing the yuv2rgb conversion when the frame is decoded, and we could move it to when the frame is actually about to be painted.

This would also allow us to keep the frame queue of decoded frames, rather than limiting ourselves to two decoded frames only. Keeping more allows us to decode ahead during idle time and helps absorb the cost of decoding a key frame just before it's about to be displayed.

We used to decode these frames on a separate thread so a frame that took a long time to decode wouldn't cause audio to skip. Maybe it's worthwhile revisiting that. I notice that if I play the 720p video and open a new tab, such that the video is not displayed, then the audio plays smoothly.
(In reply to comment #2)
> We could also avoid the current cost of allocating the RGB buffer on each
> frame by storing the YUV data in the frame instead. This is already allocated
> and given to us by liboggplay. We'd have to reorganise things so liboggplay
> doesn't delete the YUV data until we're done with it.

That would mean doing the YUV2RGB on the main thread, at paint time, and currently it's rather slow, so that could be a problem. Maybe if we have SSE2 code on all platforms, and that's a lot faster, we can do this.

> We used to decode these frames on a separate thread so a frame that took a
> long time to decode wouldn't cause audio to skip. Maybe it's worthwhile
> revisiting that.

I'd hate to do that.

> I notice that if I play the 720p video and open a new tab, such that the
> video is not displayed, then the audio plays smoothly.

But we're still decoding the video in that case, right? So that can't be a decoding issue.
(In reply to comment #3)
> That would mean doing the YUV2RGB on the main thread, at paint time, and
> currently it's rather slow, so that could be a problem. Maybe if we have
> SSE2 code on all platforms, and that's a lot faster, we can do this.

Yeah, I'd like to avoid moving that to the main thread. At some point in the future YUV -> RGB might become "free", when we can allocate an actual hardware YUV surface type and have the conversion happen on the video card, but we're a little ways away from that.

Chris is right, though: I didn't consider the case where we'd be decoding faster than we're actually playing. Perhaps in that case we should have some kind of upper bound on memory usage, and stop decoding if we ever hit it until we actually process some more frames.
> That would mean doing the YUV2RGB on the main thread, at paint time, and
> currently it's rather slow, so that could be a problem.

We could do it on the current decoder thread, at the time we call SetRGBData. This would at least avoid calling it when we skip frames.

> But we're still decoding the video in that case, right? So that can't be a
> decoding issue.

We're also still doing the memory allocation and the yuv -> rgb conversion, which is why I thought it interesting. Is the only thing not happening in that case the actual painting?

> Perhaps in that case we should have some kind of upper bound on memory usage,
> and stop decoding if we ever hit it until we actually process some more
> frames.

This is what happens at the moment. We keep a maximum of 20 frames around.
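For reference, the existing cap amounts to something like the following; the function and member names are made up for illustration, the real limit lives in the Ogg decode state machine:

// Illustrative only: the decode-ahead bound described above.
static const PRUint32 MAX_DECODED_FRAMES = 20;

PRBool OggDecodeStateMachine::HaveRoomToDecode()
{
  // Stop decoding ahead once 20 frames are queued; decoding resumes as
  // playback consumes frames from the queue.
  return mDecodedFrameCount < MAX_DECODED_FRAMES;
}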
Blocks: 476397
I was poking at this today and found that EXTEND_PAD is really hurting us on Linux (no surprise), to the point where a 720x480 video won't play back without stuttering on a 1.6GHz Core Duo, because we spend upwards of 70ms inside the call to aContext->Fill() in nsMediaDecoder::Paint. A quick hack to change this to EXTEND_NONE drops the time to around 4ms.

With EXTEND_NONE in place, the next culprit is yuv2rgb, which bug 474749 addresses. With the MMX yuv2rgb in place, the inefficient copying of frame data is near the top, but only around 22% of the profile (yuv2rgb is down to around 10% at this point).
alright, let's take a patch to turn off EXTEND_PAD for xlib and xcb target surfaces. Some reftests will probably need to be disabled for GTK.
Attached patch EXTEND_NONE patch v0 (obsolete) — Splinter Review
Use EXTEND_NONE if the target surface is X11.

Separately, I've got an experimental patch that keeps liboggplay's YUV data and converts it during the SetRGB call. Doing that removes an alloc/copy/free for each frame and reduces the memory used by the FrameQueue. I'll clean up the patch and post it soon.
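The gist of the patch, as a rough sketch rather than the exact diff (the helper is hypothetical; Paint() would call it to pick the extend mode before the Fill()):

static gfxPattern::GraphicsExtend
ExtendModeForTarget(gfxContext* aContext)
{
  // EXTEND_PAD hits cairo's slow fallback path on X11 target surfaces,
  // so fall back to EXTEND_NONE there and keep PAD everywhere else.
  nsRefPtr<gfxASurface> surface = aContext->CurrentSurface();
  gfxASurface::gfxSurfaceType type = surface->GetType();
  if (type == gfxASurface::SurfaceTypeXlib ||
      type == gfxASurface::SurfaceTypeXcb) {
    return gfxPattern::EXTEND_NONE;
  }
  return gfxPattern::EXTEND_PAD;
}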
Attachment #367556 - Flags: superreview?(roc)
Attachment #367556 - Flags: review?(roc)
+ gfxASurface::gfxSurfaceType type = aContext->OriginalSurface()->GetType();

Don't we want CurrentSurface here?
Why EXTEND_NONE only if it's X11? We may as well just do the extra work to see if we'd actually need PAD, and turn it off everywhere if not... though I guess we may want it off unconditionally on X11.
EXTEND_PAD is the right thing to use. Video should draw like images: the rectangle snapped to device pixels, EXTEND_PAD used so that there are no seams between video content and adjacent images, videos or background colors. So, we should use PAD everywhere we can. Here we are disabling PAD just to work around an X bug. I don't see any cases where leaving PAD enabled on X targets would be desirable.
Use current surface rather than original surface. Sorry I missed that.
Attachment #367556 - Attachment is obsolete: true
Attachment #367666 - Flags: superreview?(roc)
Attachment #367666 - Flags: review?(roc)
Attachment #367556 - Flags: superreview?(roc)
Attachment #367556 - Flags: review?(roc)
Attachment #367666 - Flags: superreview?(roc)
Attachment #367666 - Flags: superreview+
Attachment #367666 - Flags: review?(roc)
Attachment #367666 - Flags: review+
Flags: wanted1.9.1+
Whiteboard: [needs landing]
Defer YUV conversion until we're going to paint the frame.

FrameData now holds a pinned reference to liboggplay's decoded YUV buffer, reducing the memory used by queued frames. YUV conversion moves from HandleVideoData (in the decode stage) to PaintVideo (called when we want to paint a video frame). SetRGBData now takes ownership of the decoded RGB buffer passed in; I moved locking of mVideoUpdateLock inside SetRGBData, since there doesn't seem to be any reason to force the caller to take the lock.

This makes a worthwhile difference to memory use, but doesn't improve decode/render speed much. Combined with Chris's patch to skip video frames when they miss the desired presentation time (bug 470639), it'll make a bigger difference for videos we're already struggling to play.

With this change, it'll be easier to move to a scheme where we avoid allocating an RGB buffer for each frame we paint. As Vlad suggested, the decoder could keep a small queue of native surfaces which we could YUV decode directly into.
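A rough sketch of the ownership change, with the signature inferred from this discussion rather than copied from the patch:

// Approximate shape of the new SetRGBData: it adopts aRGBBuffer rather
// than copying it, and takes mVideoUpdateLock itself. Member names are
// from the bug discussion; the exact patch may differ.
void nsMediaDecoder::SetRGBData(PRInt32 aWidth, PRInt32 aHeight,
                                float aFramerate,
                                unsigned char* aRGBBuffer)
{
  nsAutoLock lock(mVideoUpdateLock);
  if (mRGBWidth != aWidth || mRGBHeight != aHeight) {
    mRGBWidth = aWidth;
    mRGBHeight = aHeight;
    mSizeChanged = PR_TRUE;
  }
  mFramerate = aFramerate;
  // mRGB is assumed to be an nsAutoArrayPtr<unsigned char>; assignment
  // frees the previous frame's buffer and adopts the caller's.
  mRGB = aRGBBuffer;
}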
Assignee: nobody → kinetik
Attachment #367716 - Flags: superreview?(roc)
Attachment #367716 - Flags: review?(chris.double)
Attachment #367716 - Flags: superreview?(roc) → superreview+
Comment on attachment 367716 [details] [diff] [review]
defer YUV patch v0

PaintVideo is a crappy name, since we may not actually paint this frame and we're certainly not going to do it right now. It's more like SetCurrentVideoFrame.
It's actually PlayVideo; my comment is wrong (I thought I fixed that before I submitted). It's still not the best name, though. Shall I rename it to SetCurrentFrame and resubmit the patch?
Yes, if you think SetCurrentFrame is a better name.
Attachment #367716 - Flags: review?(chris.double) → review+
Oh well, I checked it in as-is by accident. Guess that's OK.

http://hg.mozilla.org/mozilla-central/rev/fd0eac910c25
http://hg.mozilla.org/mozilla-central/rev/3de6334ba553

I'm going to mark this as fixed, since we fixed some real stuff here. Let's file new bugs on new issues.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing] → [needs 191 landing]
Is it worthwhile to create a bug for the last suggestion in comment #13? "avoid allocating an RGB buffer for each frame we paint. As Vlad suggested, the decoder could keep a small queue of native surfaces which we could YUV decode directly into."
Depends on: 497837