Closed Bug 1414440 Opened 2 years ago Closed 2 years ago

ScaleYUVToARGB is too slow on 32bit Unix

Categories

(Core :: Graphics, defect, P3)

x86
FreeBSD
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- fixed

People

(Reporter: jbeich, Assigned: jbeich)

References

Details

(Whiteboard: [gfx-noted])

Attachments

(3 files)

0. Use hardware with at least AVX2 support
1. Set media.ffvpx.enabled=false in about:config (bug 1306521 workaround)
2. Open https://www.youtube.com/watch?v=1La4QzGeaaQ
3. Switch video quality to 4k 60fps
4. Switch view to theater mode or expand to fullscreen
5. Notice Firefox never reaches 60fps unless built with -msse2
   or uses layers.acceleration.force-enabled=true

https://searchfox.org/mozilla-central/rev/af86a58b157f/media/libyuv/libyuv/include/libyuv/row.h#34

31.16%  [6706150]  I422ToARGBRow_C @ objdir/toolkit/library/libxul.so
08.31%  [1787639]  ff_vp9_put_8tap_1d_h_8_8_ssse3 @ /usr/local/lib/libavcodec.so.57.107.100
07.84%  [1686477]  InterpolateRow_C @ objdir/toolkit/library/libxul.so
06.53%  [1406309]  decode_coeffs_8bpp @ /usr/local/lib/libavcodec.so.57.107.100
 00.00%  [4]         ff_vp9_decode_block
05.95%  [1280904]  ff_vp9_put_8tap_1d_v_8_8_ssse3 @ /usr/local/lib/libavcodec.so.57.107.100
05.88%  [1265198]  ScaleARGBFilterCols_C @ objdir/toolkit/library/libxul.so
04.71%  [1014235]  ff_vp9_loop_filter_h_16_16_avx @ /usr/local/lib/libavcodec.so.57.107.100
03.95%  [850666]   decode_mode @ /usr/local/lib/libavcodec.so.57.107.100
 00.00%  [23]        ff_vp9_decode_block
03.64%  [783967]   ff_vp9_loop_filter_v_16_16_avx @ /usr/local/lib/libavcodec.so.57.107.100
01.39%  [299071]   ff_vp9_loopfilter_sb @ /usr/local/lib/libavcodec.so.57.107.100
(In reply to Jan Beich from comment #0)
> (bug 1306521 workaround)

Typo: bug 1412558 workaround
Why libyuv asm requires -msse2 on x86? Is TestCpuFlag() broken? https://webrtc-codereview.appspot.com/44389004 doesn't explain why.
Flags: needinfo?(fbarchard)
Comment on attachment 8925178 [details]
Bug 1414440 - Always build libyuv x86 optimizations with Clang.

https://reviewboard.mozilla.org/r/196418/#review201660

::: media/libyuv/libyuv/include/libyuv/row.h:105
(Diff revision 1)
>  #define HAS_BGRATOUVROW_SSSE3
>  #define HAS_BGRATOYROW_SSSE3
>  #define HAS_COPYROW_ERMS
>  #define HAS_COPYROW_SSE2
>  #define HAS_H422TOARGBROW_SSSE3
> +#ifdef __SSE2__

Gecko doesn't use HalfFloat yet but the workaround needs to be limited to Clang.
```
$ ./mach bootstrap
$ echo "export CC=clang40 CXX=clang++40" >>.mozconfig
$ echo "export COMPILER_PATH=/usr/local/bin" >>.mozconfig
$ echo "ac_add_options --disable-debug-symbols" >>.mozconfig
$ ./mach build
[...]
/usr/local/bin/clang++40 -std=gnu++11 -o Unified_cpp_media_libyuv_libyuv1.o -c -I/tmp/mozilla-central/obj-i386-unknown-freebsd10.3/dist/system_wrappers -include /tmp/mozilla-central/config/gcc_hidden.h -DNDEBUG -DTRIMMED=1 -D_FILE_OFFSET_BITS=64 -DCHROMIUM_BUILD -DUSE_LIBJPEG_TURBO=1 -DUSE_NSS=1 -DGTK_DISABLE_SINGLE_INCLUDES=1 -DENABLE_REMOTING=1 -DENABLE_WEBRTC=1 -DENABLE_CONFIGURATION_POLICY -DENABLE_INPUT_SPEECH -DENABLE_NOTIFICATIONS -DENABLE_GPU=1 -DENABLE_EGLIMAGE=1 -DUSE_SKIA=1 -DENABLE_TASK_MANAGER=1 -DENABLE_WEB_INTENTS=1 -DENABLE_EXTENSIONS=1 -DENABLE_PLUGIN_INSTALLATION=1 -DENABLE_PROTECTOR_SERVICE=1 -DENABLE_SESSION_SERVICE=1 -DENABLE_THEMES=1 -DENABLE_BACKGROUND=1 -DENABLE_AUTOMATION=1 -DENABLE_PRINTING=1 -DENABLE_CAPTIVE_PORTAL_DETECTION=1 -DHAVE_JPEG -D__STDC_FORMAT_MACROS -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/tmp/mozilla-central/media/libyuv/libyuv -I/tmp/mozilla-central/obj-i386-unknown-freebsd10.3/media/libyuv/libyuv/libyuv_libyuv -I/tmp/mozilla-central/media/libyuv/libyuv/include -I/tmp/mozilla-central/media/libyuv/libyuv -I/tmp/mozilla-central/obj-i386-unknown-freebsd10.3/ipc/ipdl/_ipdlheaders -I/tmp/mozilla-central/ipc/chromium/src -I/tmp/mozilla-central/ipc/glue -I/tmp/mozilla-central/obj-i386-unknown-freebsd10.3/dist/include -fPIC -DMOZILLA_CLIENT -include /tmp/mozilla-central/obj-i386-unknown-freebsd10.3/mozilla-config.h -Qunused-arguments -I/usr/local/include -Qunused-arguments -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++11-compat-pedantic -Wc++14-compat -Wc++14-compat-pedantic -Wc++1z-compat -Wcomma -Wimplicit-fallthrough -Wstring-conversion -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wformat -Wno-gnu-zero-variadic-macro-arguments -Wformat-security -Wno-unknown-warning-option -Wno-return-type-c-linkage -fno-exceptions -fno-strict-aliasing -fno-rtti -fno-exceptions -fno-math-errno -pipe -O -fno-omit-frame-pointer -m32 -mmmx  -MD -MP -MF .deps/Unified_cpp_media_libyuv_libyuv1.o.pp   /tmp/mozilla-central/obj-i386-unknown-freebsd10.3/media/libyuv/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv_libyuv1.cpp
In file included from /tmp/mozilla-central/obj-i386-unknown-freebsd10.3/media/libyuv/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv_libyuv1.cpp:65:
/tmp/mozilla-central/media/libyuv/libyuv/source/row_gcc.cc:5457:5: error: couldn't allocate input reg
      for constraint 'x'
    "pshufd      $0x0,%3,%%xmm4                \n"
    ^
Unknown type!
UNREACHABLE executed at /wrkdirs/usr/ports/devel/llvm40/work/llvm-4.0.1.src/lib/IR/ValueTypes.cpp:285!
clang-4.0: error: unable to execute command: Abort trap
clang-4.0: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 4.0.1 (tags/RELEASE_401/final)
Target: i386-portbld-freebsd10.3
Thread model: posix
InstalledDir: /usr/local/llvm40/bin
```
Dimitry, can you check why inline SSE2 without -msse crashes Clang?
Flags: needinfo?(dimitry)
Attachment #8925222 - Attachment mime type: application/x-shellscript → text/plain
Nevermind, it seems GCC's -m32 defaults to SSE2 on x86_64.
Flags: needinfo?(dimitry)
OS: Linux → FreeBSD
Does this affect 32bit MinGW builds as well? I've noticed those don't enable SSE thus end up with unoptimized libyuv.
Flags: needinfo?(tom)
32bit MinGW with SSE2 enabled appear to be too broken. OTOH, Clang allows inline SSE* assembly even without -msse.
Flags: needinfo?(tom)
(In reply to Jan Beich from comment #7)
> Nevermind, it seems GCC's -m32 defaults to SSE2 on x86_64.

Yes, for compatibility reasons the i386-freebsd triple still uses i486 as its default CPU.  The options in the .sh file seem to indicate that MMX was turned on explicitly, though:

-target-cpu i486 -target-feature +mmx

A minimal test case for this is just:

// clang -cc1 -triple i386 -S -target-cpu i486 -target-feature +mmx testcase.cpp
short a;
int b;
void c() { asm("" : "+r"(a), "+r"(b) : "x"(1.9259299444e34f)); }

The "x" constraint tells it to use an SSE register.  The actual value (kScaleBias in the original source) doesn't really matter...
Priority: -- → P3
Whiteboard: [gfx-noted]
Hi, I'm the author of libyuv.  As you found 'x' requires -msse2  This was unintentional and not used very much, so it should be fixable.  The intent is you can compile sse, avx etc as long as you have a new enough compiler.

1. we should file a bug in libyuv issues with steps to reproduce it.
2. To keep the code that uses 'x' the normal way in libyuv is to disable those functions in row.h and ifdef the code.
3. Most uses of 'x' can be changed to 'm' and load the register.  That will be a better fix, but will take a few days.

I tried to build without sse2 within chromium. but its getting increasingly difficult.  Perhaps I'll just use linux.mk
Do you really have CPUs without sse2?  would be nice to remove the sse2 cpu detect/dispatching and rely on sse2 being there.

For gcc, it may work by enabling sse2 for the assembler.  -Wa,-msse2
Doesnt work for clang and will likely break on future versions of gcc, but I did this for awhile to enable Neon for assembly without enabling the compiler to generate it for C code.

mingw has an ifdef bug... was fixed oct 17.
https://chromium-review.googlesource.com/722319
But it was for AOSP and hasnt rolled into that yet.
Flags: needinfo?(fbarchard)
Ive got a possible fix:
https://chromium-review.googlesource.com/#/c/libyuv/libyuv/+/758043/

To reproduce

1. remove line 34 of row.h etc
#if defined(__pnacl__) || defined(__CLR_VER) || \
    (defined(__i386__) && !defined(__SSE2__))
#define LIBYUV_DISABLE_X86
#endif

2. build with make

CC=clang CXX=clang++  CFLAGS="-m32 -mtune=pentium -mno-sse -O2" CXXFLAGS="-m32 -mtune=pentium -mno-sse -O2" make -f linux.mk 

A fix is to change "x" to "mx"

// clang -c -mno-sse testcase.cpp
short a;
int b;
void c() { asm("" : "+r"(a), "+r"(b) : "mx"(1.9259299444e34f)); }

For gcc, I think we'll need -msse but not -msse2
Otherwise the clobber registers cant be declared.
Thanks for finishing and upstreaming the fix. I confirm, applying libyuv@01e994d74e4e makes Firefox 58 on FreeBSD 10.3 i386 play 2160p 60fps smoothly (tested on i7-6700K).
Comment on attachment 8925178 [details]
Bug 1414440 - Always build libyuv x86 optimizations with Clang.

https://reviewboard.mozilla.org/r/196418/#review202724

Looks good!
Attachment #8925178 - Flags: review?(sotaro.ikeda.g) → review+
The new version had only cosmetic changes (see below). Assuming r+ still applies.
- Regen the patch with -U8 (NPOTB)
- Adjust commit message to note it mainly affects BSD systems (NPOTB)
Keywords: checkin-needed
Assignee: nobody → jbeich
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/848aaca92e63
Always build libyuv x86 optimizations with Clang. r=sotaro
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/848aaca92e63
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
"m" with pshufd compiles but can crash if the memory is misaligned.
This affects 32 bit SSE2 version.  x64 and AVX2 are unaffected.
Only matters if you actually use the HalfFloat function.
I've done a CL to fix this
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/840459

diff --git a/source/row_gcc.cc b/source/row_gcc.cc
index d322c7b..70f72b2 100644
--- a/source/row_gcc.cc
+++ b/source/row_gcc.cc
@@ -5910,7 +5910,8 @@
 void HalfFloatRow_SSE2(const uint16* src, uint16* dst, float scale, int width) {
   scale *= kScaleBias;
   asm volatile (
-    "pshufd      $0x0,%3,%%xmm4                \n"
+    "movd        %3,%%xmm4                     \n"
+    "pshufd      $0x0,%%xmm4,%%xmm4            \n"
     "pxor        %%xmm5,%%xmm5                 \n"
     "sub         %0,%1                         \n"
 
@@ -5935,11 +5936,7 @@
   : "+r"(src),    // %0
     "+r"(dst),    // %1
     "+r"(width)   // %2
-#if defined(__x86_64__)
-  : "x"(scale)   // %3
-#else
   : "m"(scale)   // %3
-#endif
   : "memory", "cc",
     "xmm2", "xmm3", "xmm4", "xmm5"
   );

ps recent changes are for 10 bit YUV to 10 bit RGB formats.
You need to log in before you can comment on or make changes to this bug.