Closed Bug 474749 Opened 16 years ago Closed 15 years ago

oggplay_yuv2bgr needs to be faster

Categories

(Core :: Audio/Video, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: vlad, Assigned: jrmuizel)

References

Details

Attachments

(1 file, 1 obsolete file)

Taking up 19% of a profile.  Way too much in this world of SSE crazyness, I think; I bet we can find faster yuv to rgb code somewhere (mplayer, X both come to mind).
I've CC'd one of the liboggplay maintainers. There is optimized code for yuv2rgb but I've turned it off as it produces incorrect colours. There is a liboggplay bug for this:

http://trac.annodex.net/ticket/421

There is an issue that this is enabled by a compile time flag. Should it be a runtime detection thing?
Yeah, whatever code we do needs to do runtime detection.. we really just need a SSE2 variant and a straight C variant for fallback.  Ideally also an ARM variant, with and without NEON.
Attached patch Fix yuv2bgr and yuv2rgb (obsolete) — Splinter Review
The attached patch fixes the color problems with the MMX code. It looks like the MMX code is less precise then the c code so it probably won't be bit identical.

The code also doesn't really look that fast (although it does have a noticeable impact on the profiles). Moonlight has code to do the conversion that at http://anonsvn.mono-project.com/viewvc/trunk/moon/src/yuv-converter.cpp?revision=120533&view=markup under the MIT/X license that's probably worth taking a look at.
Jeff's patch helps some, though not a huge amount; maybe 5% or so.  Playback still wasn't smooth.  For an experiment, I plugged in the YUV2RGB from intel's IPP package (which was remarkably easy to do, too bad IPP's commercial); the yuv2rgb step went from ~58% of total firefox cpu time to ~17%.  There's clearly a lot of work that can be done :)

However, I've heard that there is someone working on this for liboggplay that should be done very soon.. so hopefully we can take that.

Another option I've been pointed at has been OpenMAX DL -- yuv2rgb is part of the spec there, and ARM at least has contributed NEON-optimized libraries for this.  I can't find anything for x86, though.
VC intrinsics using mmx are horribly bad, there's no point trying to use them. 

Either inline assembler or external (yasm?) are the only possibilities which work.

I'd volunteer to write this conversion if I had a working environment. By the way, code which is LGPL-ed only (mencoder/libavcodec) is not enough right?
Blocks: 476397
(In reply to comment #4)
> Jeff's patch helps some, though not a huge amount; maybe 5% or so.  Playback
> still wasn't smooth.  For an experiment, I plugged in the YUV2RGB from intel's
> IPP package (which was remarkably easy to do, too bad IPP's commercial); the
> yuv2rgb step went from ~58% of total firefox cpu time to ~17%.  There's clearly
> a lot of work that can be done :)

What platform/compiler was this on? The total time in yuv2rgb was much for me on my mac.

> Another option I've been pointed at has been OpenMAX DL -- yuv2rgb is part of
> the spec there, and ARM at least has contributed NEON-optimized libraries for
> this.  I can't find anything for x86, though.

There's BSD licensed code for ARMv6 at http://labs.movenda.com/blog/?p=84 that we can use.
(In reply to comment #5)
> VC intrinsics using mmx are horribly bad, there's no point trying to use them. 
> 
> Either inline assembler or external (yasm?) are the only possibilities which
> work.

I'm not sure how easy it is to add a new build requirement, but I think we should certainly consider yasm. Having to write and maintain inline assembler for gcc and msvc is a pain and we have this problem in more places in the tree.

> I'd volunteer to write this conversion if I had a working environment. By the
> way, code which is LGPL-ed only (mencoder/libavcodec) is not enough right?

The code probably needs to be be at least GPL/LGPL/MPL tri-licensed. However, the existing code is MIT licensed so that's probably the preferred license.
(In reply to comment #6)
> (In reply to comment #4)
> > Jeff's patch helps some, though not a huge amount; maybe 5% or so.  Playback
> > still wasn't smooth.  For an experiment, I plugged in the YUV2RGB from intel's
> > IPP package (which was remarkably easy to do, too bad IPP's commercial); the
> > yuv2rgb step went from ~58% of total firefox cpu time to ~17%.  There's clearly
> > a lot of work that can be done :)
> 
> What platform/compiler was this on? The total time in yuv2rgb was much for me
> on my mac.

WinXP, MSVC9.  "was much for me" -- which word is missing there? "lower"? :)

That percentage isn't directly comparable to a whole-system profile because of vtune's weird module/thread/etc handling, but the relative change of 58%->17% is correct.

I have no problem with adding a build req on yasm; we'll have to sell a few people on it, but I think it can be done.
Lower was indeed the missing word :)
(In reply to comment #7)
> (In reply to comment #5)
> > VC intrinsics using mmx are horribly bad, there's no point trying to use them. 
> > 
> > Either inline assembler or external (yasm?) are the only possibilities which
> > work.
> 
> I'm not sure how easy it is to add a new build requirement, but I think we
> should certainly consider yasm. Having to write and maintain inline assembler
> for gcc and msvc is a pain and we have this problem in more places in the tree.

i've just finished the SSE2/MMX inline assembler code in GAS for yuv2rgb conversion...
about the MASM: basically i was thinking to define macros for the various instructions (i.e. see http://www.videolan.org/developers/vlc/doc/doxygen/html/mmx_8h-source.html) and depending on the compiler type the right asm code would be defined. currently i see no reason why this approach wouldn't work...
any comments?

btw: or YASM is a really option here?
(In reply to comment #10)
> i've just finished the SSE2/MMX inline assembler code in GAS for yuv2rgb
> conversion...

Is this code available somewhere?

> about the MASM: basically i was thinking to define macros for the various
> instructions (i.e. see
> http://www.videolan.org/developers/vlc/doc/doxygen/html/mmx_8h-source.html) and
> depending on the compiler type the right asm code would be defined. currently i
> see no reason why this approach wouldn't work...
> any comments?

It seems like this should work... Do you have any idea when it will be ready? It would be really nice to have in the upcoming beta.

> btw: or YASM is a really option here?

YASM probably isn't an option in the short term as it requires a new build tool.
Comment on attachment 359172 [details] [diff] [review]
Fix yuv2bgr and yuv2rgb

We should try to land the fixed version as a stop gap until we have good and fast code.
Attachment #359172 - Flags: review?(chris.double)
Comment on attachment 359172 [details] [diff] [review]
Fix yuv2bgr and yuv2rgb

Please also add a .patch file containing this patch in the media/liboggplay directory, update update.sh so that this patch is applied next time I refresh the liboggplay code, and change README_MOZILLA so that it describes the patch and the bug it came from.

See how the other .patch files are handled in the media/liboggplay directory.
Attachment #359172 - Flags: review?(chris.double)
Add the patch and update update.sh and README_MOZILLA
Attachment #359172 - Attachment is obsolete: true
Attachment #363190 - Flags: review?(chris.double)
Attachment #363190 - Flags: review?(chris.double) → review+
Keywords: checkin-needed
i've just commited the assembly based yuv to rgb conversion methods into liboggplay's repository (rev 3850) and i've tested it with the latest trunk of mozilla, and it seemed to work fine.

unfortunately this does not solve every issue; e.g. HD ogg playback is still lagging. after these changes it's definitely not the conversion functions' fault, as i've tried to use libswscale - mainly this is used in all major video playback programs, e.g. mplayer - just to compare it's speed and it's roughly the same - no wonder as the algo to solve the problem is similar...
http://hg.mozilla.org/mozilla-central/rev/690209fc5b6b
Assignee: nobody → jmuizelaar
Status: NEW → RESOLVED
Closed: 15 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2a1
Backed this out due to test failures on OS X.  They could have been unrelated, but I don't have time to watch the tree more.  They passed on windows.  Linux hasn't completed at the time of this post.

REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/aspect-ratio-1a.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/aspect-ratio-1b.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/aspect-ratio-2a.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/aspect-ratio-2b.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/basic-1.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/canvas-1a.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/canvas-1b.html |
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/moz2_slave/mozilla-central-macosx-unittest/build/layout/reftests/ogg-video/zoomed-1.html |

http://hg.mozilla.org/mozilla-central/rev/6e008e161a0f
http://hg.mozilla.org/mozilla-central/rev/4092901442b7
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I had a quick look at this.  The reftests fail for me on OS X with the patch applied.  Comparing them using the reftest analyzer and Pixie.app, the black (and single pixel line of darker gray at the bottom of the video) differ by #010101.  The single pixel line of lighter gray at the top of the video remains #c0c0c0.

It might pass on Windows (and other platforms) because MMX/SSE (c.f. oggplay_yuv2rgb.c:60) isn't defined by the build system.  It's always defined on OS X x86 because the minimum supported CPU is Intel Core.
If we can encode pure white and get white out, then we can change the reftests to use that instead (and stick a non-white background on the reftests).

Not sure where those grays are coming from, there shouldn't be any.
Maybe you were looking at the test images in the browser and Firefox was out-zooming-out?
It does seem that Firefox was zoomed in. I used the reftest analyzer to load the failed tests, then Pixie to magnify (which uses nearest-neighbour) and query RGB values.  Does the reftest analyzer zoom the images by default?  If I load the reftests directly in the browser and magnify using Pixie, there are no grey lines, just black.  And the post-patch version is all near-black (0x010101).

I tried creating a white video for the reftests, but that fails with and without the patch, because the decoded frame is a mix of 0xffffff, 0xfefefe, and 0xfdfdfd.

The command I used to create the video was:
convert -size 140x100 xc:white a.png && convert a.png -evaluate set 100% +matte -depth 8 b.png && ffmpeg2theora -f image2 b.png -o out.ogv
There may be some SVG filter issue affecting the reftest analyzer.

I'm not sure what to do here. I suppose one option is to use <video> in the references as well. That should be reasonably for all the tests except basic-1; for that one, I think we could just make it a != test to ensure that <video> is rendering something non-white.
A few notes from my poking around this evening:

oggplay_yuv2rgb.c includes xmmintrin.h when !WIN32, but it only needs mmintrin.h.  xmmintrin.h requires SSE, but the yuv2rgb routine only uses MMX.

Unless we can assume MMX is always available on Linux and Win32, we need runtime feature detection to use this on those platforms.

I noticed that the MMX routine is significantly (>3x) slower than the native C routine in debug builds on Linux.  It turns out this is because GCC ends up furiously spilling and reloading registers between calls to the MMX intrinsics.  We might need to do something to avoid this, because for a 720x480 video, MMX yuv2rgb was taking upwards of 60ms per call on a 1.6GHz Core Duo.
The latest liboggplay svn has all new code for yuv conversion that does runtime detection. I've not yet had a chance to look at it or test it. If you're keen try it out.
Thanks, I gave that a shot and it looks promising.  It also doesn't suffer from debug-build slowdowns, because it uses inline asm rather than intrinsics.  Both the MMX and SSE2 implementations are faster than the existing MMX one.

As an aside, the SSE2 implementation was crashing on me due to alignment issues.  liboggplay's configure has a new test that defines ATTRIBUTE_ALIGNED_MAX.  I missed that initially, because the build still works.  Without it defined, some static data used by the SSE2 code ends up naturally aligned rather than 16-byte aligned as required by the CPU.
Depends on: 485291
Bug 485291 landed: http://hg.mozilla.org/mozilla-central/rev/46c478fbd908
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: