Investigate replacing gfx/ycbcr with libyuv

Status: NEW
Assignee: Unassigned
Product: Core
Component: Graphics
Priority: P4
Severity: normal
Opened: 5 years ago
Modified: 2 years ago
Reporter: kinetik
Keywords: perf
Version: Trunk
Tracking flags: blocking-basecamp:-

Description

5 years ago
Many moons ago we replaced the colour conversion code from liboggplay with faster code imported from the Chromium tree.  That code has since been spun off into a standalone library, libyuv, where it has seen significant optimization.  It's worth noting that we already have libyuv in the tree via WebRTC (media/webrtc/trunk/third_party/libyuv), so it'd be nice to share that.

Also see bug 656185 comment 30, which I'll paste here:

(In reply to Frank Barchard from comment #30)
> Hi, sorry I'm not familiar with your source/bugs reporting system.
> But I notice this code is based on code I wrote originally for chromium.
> 
> I've since rewritten it for webrtc, but in a separate package libyuv, if you
> just want the conversion and/or scaling functions.  Its SSSE3 and Neon
> optimized now and roughly 2x faster than the original MMX table method.
> 
> If you consider using libyuv, please post an issue
> http://code.google.com/p/libyuv/issues/list
> with your requirements/changes.  I'd also welcome your direct contribution.

Comment 1

(In reply to Matthew Gregan [:kinetik] from comment #0)
> > If you consider using libyuv, please post an issue
> > http://code.google.com/p/libyuv/issues/list
> > with your requirements/changes.  I'd also welcome your direct contribution.

I've posted <http://code.google.com/p/libyuv/issues/detail?id=92>.

Comment 2

5 years ago
The basic conversion is here
http://code.google.com/p/libyuv/source/search?q=I420ToARGB&origq=I420ToARGB&btnG=Search+Trunk

The low levels are implemented for SSSE3 and Neon
http://code.google.com/p/libyuv/source/search?q=I422ToARGBRow&origq=I422ToARGBRow&btnG=Search+Trunk
The _Any_ variations of row functions handle odd widths.
row_posix supports 64 bit as well as 32 bit, clang and gcc.
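
To illustrate the idea of an "any width" wrapper around a fixed-width row function, here is a minimal sketch. It is not libyuv's code: `InvertRow8` and `InvertRowAny` are hypothetical names, and a scalar loop stands in for the SIMD kernel. The wrapper runs the kernel over the largest multiple-of-8 prefix, then pushes the leftover pixels through a padded temporary buffer so the kernel never reads or writes past the row.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-width kernel: inverts 8 grey pixels per step.
// It stands in for a SIMD row function that requires width % 8 == 0.
void InvertRow8(const uint8_t* src, uint8_t* dst, int width) {
  assert(width % 8 == 0);
  for (int x = 0; x < width; x += 8) {
    for (int i = 0; i < 8; ++i) {
      dst[x + i] = static_cast<uint8_t>(255 - src[x + i]);
    }
  }
}

// Hypothetical "any width" wrapper: handles rows whose width is not a
// multiple of 8 without letting the kernel touch memory past the row.
void InvertRowAny(const uint8_t* src, uint8_t* dst, int width) {
  assert(width >= 0);
  const int body = width & ~7;  // largest multiple of 8 <= width
  if (body > 0) {
    InvertRow8(src, dst, body);
  }
  const int tail = width - body;  // 0..7 leftover pixels
  if (tail > 0) {
    uint8_t in[8] = {0};  // zero padding beyond the real pixels
    uint8_t out[8];
    std::memcpy(in, src + body, tail);
    InvertRow8(in, out, 8);
    std::memcpy(dst + body, out, tail);  // copy back only the real pixels
  }
}
```

Calling `InvertRowAny(src, dst, 11)` processes 8 pixels directly and 3 through the temporary buffer, leaving `dst[11]` onwards untouched.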

A libyuv issue has been opened for feature requests
http://code.google.com/p/libyuv/issues/detail?id=92

Comment 3

(In reply to Frank Barchard from comment #2)
> The low levels are implemented for SSSE3 and Neon
> http://code.google.com/p/libyuv/source/
> search?q=I422ToARGBRow&origq=I422ToARGBRow&btnG=Search+Trunk

So does this mean there is 4:2:2 support? That would be good news. I may have been looking at an old version of the code (I was looking at the copy we have in webrtc when making that list).

Comment 4

5 years ago
The conversion functionality includes 422.
There is conversion for I422ToARGB and many other YUV to/from RGB formats.
I400, I411, I422, I420, I444 and NV12 / NV21 all have optimized conversion to ARGB.
I420 can be converted to BGRA, ARGB, ABGR, RGBA, RGB565, ARGB1555, ARGB4444, RGB24 and BGR24.
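
For readers unfamiliar with what such a conversion computes, here is a scalar reference sketch of I420 to ARGB. It is an illustration only, not libyuv's implementation: the function name `I420ToArgbScalar` is hypothetical, the coefficients are the commonly used 8-bit fixed-point BT.601 (video range) ones, and the output byte order B, G, R, A is an assumption about what "ARGB" means in memory on little-endian machines.

```cpp
#include <cassert>
#include <cstdint>

// Shift a fixed-point value down by 8 bits and clamp it to 0..255.
static uint8_t ClampShift8(int v) {
  if (v < 0) return 0;
  v >>= 8;
  return static_cast<uint8_t>(v > 255 ? 255 : v);
}

// Hypothetical scalar reference: I420 (4:2:0 planar) to 32-bit ARGB.
// Each chroma sample is shared by a 2x2 block of luma samples.
void I420ToArgbScalar(const uint8_t* y, int y_stride,
                      const uint8_t* u, int u_stride,
                      const uint8_t* v, int v_stride,
                      uint8_t* argb, int argb_stride,
                      int width, int height) {
  assert(width > 0 && height > 0);
  for (int row = 0; row < height; ++row) {
    const uint8_t* y_row = y + row * y_stride;
    const uint8_t* u_row = u + (row / 2) * u_stride;
    const uint8_t* v_row = v + (row / 2) * v_stride;
    uint8_t* out = argb + row * argb_stride;
    for (int col = 0; col < width; ++col) {
      const int c = y_row[col] - 16;       // luma, video range
      const int d = u_row[col / 2] - 128;  // Cb
      const int e = v_row[col / 2] - 128;  // Cr
      out[4 * col + 0] = ClampShift8(298 * c + 516 * d + 128);            // B
      out[4 * col + 1] = ClampShift8(298 * c - 100 * d - 208 * e + 128);  // G
      out[4 * col + 2] = ClampShift8(298 * c + 409 * e + 128);            // R
      out[4 * col + 3] = 255;                                             // A
    }
  }
}
```

With neutral chroma (U = V = 128), Y = 16 comes out as black and Y = 235 as white, which makes a convenient sanity check for any optimized replacement.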

There are two scalers: an I420 one and an ARGB one.  But there's no one-step YUV-to-RGB scaling like in the old Chromium code.
For rendering, hopefully you can just convert to a texture format and let the GPU do the scaling.

Comment 5

(In reply to Frank Barchard from comment #4)
> There are 2 scaler - an I420 one and an ARGB one.  But theres no YUV scaling
> to RGB in one step like the old chromium code.
> For rendering, hopefully you can use just a conversion to a texture format
> and let GPU to scaling.

We already have software RGB scaling via pixman. For the most part, when we have a GPU we use it to do the conversion as well as the scaling. The software conversion is only the fallback path for places where that doesn't work (usually where the GPU driver has been blacklisted because it is too old and/or too buggy).

Is conversion and scaling in separate steps in libyuv still faster than the old one-step approach from the Chromium code? Is that true even for old (ARMv6) devices?

Comment 6

5 years ago
(In reply to Timothy B. Terriberry (:derf) from comment #5)
> (In reply to Frank Barchard from comment #4)
> > There are 2 scaler - an I420 one and an ARGB one.  But theres no YUV scaling
> > to RGB in one step like the old chromium code.
> > For rendering, hopefully you can use just a conversion to a texture format
> > and let GPU to scaling.
> 
> We already have software RGB scaling via pixman. For the most part, when we
> have a GPU we use it to do the conversion as well as the scaling. The
> software conversion is only the fallback path for places where that doesn't
> work (usually where the GPU driver has been blacklisted because it is too
> old and/or too buggy).
> 
> Is conversion and scaling in separate steps in libyuv still faster than the
> old one-step approach from the Chromium code? Is that true even for old
> (ARMv6) devices?

If you require scaling, as well as conversion, the one step approach is faster.
The old code is pretty inefficient though, so I suspect two-step libyuv is faster than Chromium's slow one-step scale.
1. The horizontal filtering used an imul and did one pixel at a time.  libyuv produces multiple pixels at a time with SSSE3.
2. The conversion was done with a table, one pixel at a time.  The table causes cache misses; switching to pmaddubsw gives consistently fast blending and produces many pixels at a time: 8 output pixels for I420ToARGB conversion.
3. libyuv scaling has specialized code for cases that are common with webcams, e.g. 1/2 size for taking 640x480 down to 320x240.  Pointers need to be aligned, but it's a big speedup.
The one step scale + convert with left edge subpixel accurate clipping is a requirement for chromoting, which renders updated rectangles.
The low level row functions support a subpixel starting position, which is a step toward efficient clipping.  Chromoting branched off a scaler that does low levels with a left clip parameter that skips pixels, but that's less efficient, so I'm reluctant to take on support of their code/approach.
The upsampler centering is off (for performance) but is a requirement for webrtc.
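
As a rough picture of the half-size special case in point 3 (e.g. 640x480 down to 320x240), the sketch below averages each 2x2 block of one plane with rounding. It is a plain scalar illustration under my own assumptions, not libyuv's specialized code; `HalfSizePlaneBox` is a hypothetical name. A fixed 2:1 ratio needs no per-pixel position arithmetic, which is part of why it vectorizes well.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2:1 box downscale of one 8-bit plane.
// The source must be at least (2 * dst_width) x (2 * dst_height).
void HalfSizePlaneBox(const uint8_t* src, int src_stride,
                      uint8_t* dst, int dst_stride,
                      int dst_width, int dst_height) {
  assert(dst_width > 0 && dst_height > 0);
  for (int y = 0; y < dst_height; ++y) {
    const uint8_t* top = src + (2 * y) * src_stride;
    const uint8_t* bottom = top + src_stride;
    uint8_t* out = dst + y * dst_stride;
    for (int x = 0; x < dst_width; ++x) {
      // Rounded average of the 2x2 source block.
      const int sum = top[2 * x] + top[2 * x + 1] +
                      bottom[2 * x] + bottom[2 * x + 1];
      out[x] = static_cast<uint8_t>((sum + 2) >> 2);
    }
  }
}
```

For I420 the same routine would be run once per plane, with the chroma planes at half the luma dimensions.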

Re pixman
Google Talk Plugin (for Hangouts) uses O3D, which has pixman as a software fallback for blacklisted GPUs.  GTP converts the I420 content (from VP8) to ARGB and lets O3D render it with the GPU, or with pixman as a fallback.
The recent pixman has SSE2 at least.

Just say no to ARMv6 :-)
Neon is a big win and available on newer ARM CPUs.
Without SIMD, you're just trying to write code better than the compiler, and it won't be a big win.  It's more constructive to have good compilers.

Comment 7

(In reply to Frank Barchard from comment #6)
> 1. the horizontal filtering was an imul and does one pixel at a time. 
> libyuv produces multiple pixels at a time with ssse3.

Assuming you have a CPU with SSSE3, of course. I'd need to look to see what % of our users this covers.

> The one step scale + convert with left edge subpixel accurate clipping is a
> requirement for chromoting, which renders updated rectangles.
> The low level row functions support a subpixel starting position, which is a
> step toward efficient clipping.  Chromoting branched off a scaler that does
> low levels with a left clip parameter that skips pixels, but thats less
> efficient, so I'm reluctant to take on support of their code/approach.

I'm not sure what you're saying is less efficient than what here. Can you elaborate? If you have subpixel starting offsets, I don't see why you can't implement clipping+conversion+scaling with negligible performance impact. ScaleYCbCrToRGB565 <http://mxr.mozilla.org/mozilla-central/source/gfx/ycbcr/ycbcr_to_rgb565.cpp#283> does it. You're welcome to steal whatever you want from there.

> The upsampler centering is off (for performance) but is a requirement for
> webrtc.

I couldn't parse that. Having the centering be off is a requirement for WebRTC, or having it be correct is a requirement? Again, this should just be a matter of getting the starting offsets right. See the same code I mentioned above.

> Just say no to ARMv6 :-)

Tell that to our users.

> Neon is a big win and available on newer arm CPUs.

There are still plenty of Tegra2's out there with ARMv7 and no NEON.

> Without simd, you're just try to write code better than the compiler - it
> won't be a big win...  more constructive to have good compilers.

The point of the ARMv6 media instructions is that you still get SIMD, just constrained to 4-byte registers. It's more work for less win, but it can still be a pretty substantial win. Also, ARM compilers are terrible (gcc being one of the worst offenders). "Wait for someone to fix gcc" is probably not a good plan.
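
To make "SIMD constrained to 4-byte registers" concrete, here is a portable sketch that averages four packed 8-bit values at once inside one 32-bit word, which is the kind of operation an ARMv6 packed halving add (UHADD8) performs in a single instruction. This is an emulation in plain C++ for illustration; the function names are hypothetical and real ARMv6 code would use the instruction itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Per-byte floor((a + b) / 2) on four 8-bit lanes packed in 32 bits.
// Uses a + b == 2 * (a & b) + (a ^ b); the mask stops bits shifted out
// of one lane from leaking into the lane below it.
uint32_t HalvingAdd4x8(uint32_t a, uint32_t b) {
  return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}

// Average two rows of 8-bit samples, four samples per step.
void AverageRows(const uint8_t* top, const uint8_t* bottom,
                 uint8_t* dst, int width) {
  assert(width % 4 == 0);
  for (int x = 0; x < width; x += 4) {
    uint32_t a, b;
    std::memcpy(&a, top + x, 4);  // memcpy avoids unaligned-access issues
    std::memcpy(&b, bottom + x, 4);
    const uint32_t avg = HalvingAdd4x8(a, b);
    std::memcpy(dst + x, &avg, 4);
  }
}
```

Because every lane is handled independently, the result does not depend on byte order.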
(Reporter)

Updated

5 years ago
OS: Linux → All
Hardware: x86_64 → All
Blocks: 794061

Comment 8

5 years ago
This is not a B2G blocker. We would happily take a patch though.
blocking-basecamp: --- → -
Priority: -- → P4

Comment 9

5 years ago
Update on libyuv

1. Most functions have been ported to Neon, about 90%: conversions, scaling and some effects.  Neon is especially good with RGB formats.  Core functions have also been ported to MIPS DSPR2.

2. I420ToRGB565 (and other variations) is done in a single pass in Neon and SSSE3.  This was a pretty big change, around 8000 lines of code.  It's very fast, but doesn't do dithering or scaling, just conversion.
On x86, doing 1 pass vs 2 passes (per row) was less than 10% faster.  But on Neon it's more than 2x faster doing 1 pass.

3. Upsampling centering: currently 0.0, 0.0 represents the center of the first pixel.  We tried changing this so that 0.5 is the center, in Chromium's bilinear filter, on just the vertical axis.  It's not checked in because layout tests would require recalibration, but the vertical change doesn't hurt performance too much, with clamping of the first and last rows.
Horizontal centering is a lot harder.
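
For reference, the undithered RGB565 output mentioned in point 2 amounts to truncating each channel to 5, 6 and 5 bits. The sketch below shows only that packing step, starting from ARGB; it does not reproduce the single-pass YUV path, the function names are hypothetical, and the B, G, R, A input byte order is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Pack 8-bit R, G, B into RGB565 by truncation (no dithering).
uint16_t PackRgb565(uint8_t r, uint8_t g, uint8_t b) {
  return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Convert one row of ARGB (assumed stored as B, G, R, A bytes) to RGB565.
void ArgbRowToRgb565(const uint8_t* argb, uint16_t* dst, int width) {
  assert(width >= 0);
  for (int x = 0; x < width; ++x) {
    const uint8_t b = argb[4 * x + 0];
    const uint8_t g = argb[4 * x + 1];
    const uint8_t r = argb[4 * x + 2];
    dst[x] = PackRgb565(r, g, b);
  }
}
```

Truncation discards the low 3 (or 2) bits of each channel, which is where banding comes from when no dithering is applied.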

Comment 10

(In reply to Frank Barchard from comment #9)
> On x86, doing 1 pass vs 2 pass (per row) was less than 10% faster.  But on
> Neon its more than 2x faster doing 1 pass.

This makes me suspect that there will be a similar performance regression for doing conversion+scaling in two passes instead of one, like we currently do for reasonable scale factors (i.e., roughly 0.67x to 2.0x for 4:2:0). If there's anything I can do to help make the code we have for this useful to you, please let me know.

> Horizontal centering is alot harder.

Can you explain to me why? Having implemented this, as I referenced above, I'm not seeing what the difficulty is.

Comment 11

5 years ago
Re 2 pass
For YUV to RGB565 I got roughly 15x performance in 2 passes and 30x performance in 1 pass, compared to C.  Even 15x would do :-)

Re horizontal harder
I mean that with SIMD, where you produce 16 pixels at a time, it's difficult to handle edges while remaining optimally aligned.

Comment 12

5 years ago
Re centering.  Currently, 0.0 is considered the center of the first pixel, i.e. 100% that pixel.  This is for performance, so clipping on the top/left is not needed.  But we're talking about changing it, as it shifts the image by 1/2 pixel.
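
The two conventions can be compared with a small sketch of the destination-to-source coordinate mapping. This is my reading of the description above rather than libyuv's code, and both function names are hypothetical. Positions are expressed in source-pixel units where an integer value means "exactly that source pixel".

```cpp
#include <algorithm>
#include <cassert>

// Convention described as current: position 0.0 is the centre of the
// first pixel, so destination pixel 0 samples source pixel 0 exactly
// and no clamping is needed on the top/left edge.
double SrcPosTopLeft(int dst_x, int src_w, int dst_w) {
  assert(src_w > 0 && dst_w > 0);
  return static_cast<double>(dst_x) * src_w / dst_w;
}

// Centre-aligned convention: pixel centres sit at +0.5, which can map
// edge pixels outside the source, so the result is clamped.
double SrcPosCentered(int dst_x, int src_w, int dst_w) {
  assert(src_w > 0 && dst_w > 0);
  const double pos = (dst_x + 0.5) * src_w / dst_w - 0.5;
  return std::min(std::max(pos, 0.0), static_cast<double>(src_w - 1));
}
```

When upscaling 4 pixels to 8, destination pixel 1 samples position 0.5 under the first convention and 0.25 under the second; the centred mapping is the one that needs clamping at both edges.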

Neon optimization of the library is complete - on par with x86.
Next round of optimization is AVX2.

Comment 13

5 years ago
Breaking dep chain with bug 794061, we appear to have fixed that in a different way.
No longer blocks: 794061

Comment 14

5 years ago
Update on libyuv with respect to your requirements.

conversion:
YUV to RGB is now AVX2 optimized.  39% faster than SSSE3.

scaling:
ARGB scale now has SSSE3 column filtering and an upsampler, which does columns first, then rows.  10x faster upsampling on Intel and 3x on ARM.
On upsampling, filtering now centers the last texel on the last pixel, avoiding the need for extrusion.
RGB565 is optimized for Neon, but with no dithering.

Near term I plan to add mirroring to scale for local view of webcams, and Neon column optimization.

Comment 15

Duplicate of this bug: 976695

Comment 16

Frank: what's the current state of libyuv?  I'd filed bug 976695 after noticing FastConvertYUVToRGB32Row() taking a fair bit of CPU in a profile.

I recently updated our copy of libyuv to rev 971 (pulled Jan 14 2014), and also moved it to media/libyuv and out of webrtc proper to ease reuse elsewhere in mozilla.
Flags: needinfo?(fbarchard)
Keywords: perf

Comment 17

4 years ago
Improvements have been made to both ARGBScale and I420Scale, but there is no direct replacement for FastConvertYUVToRGB32Row.
ARGBScale now has horizontal mirroring.
Vertical-only scaling has been added to ARGBScale, which can be used on a variety of formats (e.g. YUY2).
Linear filtering has been added, which is bilinear horizontally but point sampled vertically, for better performance than full bilinear.
Upsampling specialization has been added to YUV scaling (3x faster).
Source is now buildable for NaCl and PNaCl.
Scaler is now C89, allowing integration into C projects.

The conversion+scaling that you're looking for has been started.  I wrote a low level function to convert a row of YUV to ARGB and then scale it.  The advantage of this is that the conversion only happens when the source row changes, which is less often than once per output row when scaling up.  The ARGB scaler is more efficient than the YUV one, since filtering is done on 4 channels at a time, whereas planar YUV requires 3 planes to be filtered.
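
The row-caching idea can be sketched as follows. This is an illustration under my own assumptions, not the function described above: `ConvertThenScaleRows` and the `ConvertRowFn` callback are hypothetical, and point sampling is used in place of filtering to keep the example short. The function returns how many row conversions it performed so the saving is visible.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback: convert source row `src_row` into `argb_row`,
// which holds one 32-bit pixel per source column.
using ConvertRowFn = std::function<void(int src_row, uint32_t* argb_row)>;

// Convert-then-scale, one row at a time. A source row is converted only
// when the source row index changes, so upscaling reuses converted rows.
int ConvertThenScaleRows(int src_w, int src_h, int dst_w, int dst_h,
                         const ConvertRowFn& convert_row,
                         uint32_t* dst, int dst_stride_pixels) {
  assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
  std::vector<uint32_t> row(src_w);  // cache of the last converted row
  int cached_row = -1;
  int conversions = 0;
  for (int dy = 0; dy < dst_h; ++dy) {
    const int sy = static_cast<int>(static_cast<int64_t>(dy) * src_h / dst_h);
    if (sy != cached_row) {
      convert_row(sy, row.data());
      cached_row = sy;
      ++conversions;
    }
    uint32_t* out = dst + dy * dst_stride_pixels;
    for (int dx = 0; dx < dst_w; ++dx) {
      // Point sampling horizontally; a real scaler would filter here.
      out[dx] = row[static_cast<int>(static_cast<int64_t>(dx) * src_w / dst_w)];
    }
  }
  return conversions;
}
```

Scaling a 2x2 source up to 4x4 this way converts 2 rows rather than 4.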
Flags: needinfo?(fbarchard)

Comment 18

3 years ago
Status update.  The emphasis in libyuv for the last few months has been on the AArch64 port of the Neon code and on AVX2 versions of the top 30 functions, for GCC/NaCl.

Updated

2 years ago
See Also: → bug 1256475