[Meta] Evaluate OpenCV performance with asm.js (compiled by Emscripten)

Status: NEW
Assignee: Unassigned
Product: Core
Component: JavaScript Engine
Reported: 3 years ago
Last modified: 2 years ago

People

(Reporter: kaku, Unassigned)

Tracking

(Depends on: 3 bugs, Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Attachments (3)

- OpenCV performance evaluation(1).xlsx (99.15 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
- profiling_data.tar.gz (4.17 MB, application/gzip)
- operation_compare_32SCX.tar.gz (1.04 MB, application/gzip)
(Reporter)

Description

3 years ago
Created attachment 8542790 [details]
OpenCV performance evaluation(1).xlsx

I recently evaluated the performance of OpenCV compiled to asm.js. Thanks to the OpenCV team, each OpenCV module already ships with a large set of performance tests. What I did was compile the OpenCV library and its performance tests both natively and to asm.js, then run the native performance tests and the JavaScript performance tests (with the Firefox JavaScript engine). The statistics can be found at the following Google Sheet link (an MS Excel copy is also attached):
https://docs.google.com/spreadsheets/d/1w8gr1_q_dQclTaGyiMtAhlwj_5_HPVKbHWfwUszPU40/edit?usp=sharing

In summary, about 20% of the tests run as fast as or faster than native, and about 60% run within 3x of native performance.

Some patterns are observed. First, tests with operations that convert floating-point values to integer values are much slower. Second, in the native environment a single operation is much faster on integer data than on floating-point data; in the JavaScript environment, however, the same operation performs about equally on every data type, which leads to a relatively larger performance drop for the non-floating-point data types.

Updated

3 years ago
Blocks: 1100203
Comment 1

3 years ago
Azakai, could you please provide some feedback on these findings?
We are evaluating running OpenCV-asm.js on Firefox OS.
It would be great if you could give us some suggestions.
Thanks.

(In reply to Tzuhao Kuo [:tkuo] from comment #0)
Flags: needinfo?(azakai)

Comment 2

3 years ago
A few high-level questions:
 - do the OpenCV workloads use single-precision floats and, if so, are you compiling in Emscripten with PRECISE_F32?
 - do you see any difference in native performance when you disable auto-vectorization in the native compiler?  If so, this might be another good testcase for SIMD.js
 - regarding the float-to-int-conversion-performance are you talking about numeric conversions from double/float to integer or bitwise reinterpretations of doubles/floats as ints?  The former case is a known area where we are worse than native (extra branching to handle the out-of-bounds cases).  The latter would, iiuc, require Emscripten to emit a store-as-float/load-as-int so this could be a pattern we add to asm.js (so that we could emit a vmov directly w/o loads/stores).
 - regarding the second point you made of float vs non-float operation performance: that is quite surprising; did you determine this by looking at the float vs non-float benchmarks or from some other experiments?
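The two conversion cases distinguished above can be sketched in plain JavaScript (a hedged illustration, not Emscripten's actual output; variable names are mine): `~~` performs the numeric double-to-int32 conversion, while aliasing two typed-array views over one buffer performs the store-as-float/load-as-int bitwise reinterpretation.

```javascript
// Case 1: numeric conversion, double -> int32, truncating toward zero.
// asm.js spells this as ~~x; the out-of-range inputs are where the extra
// branching relative to native code comes in.
const numeric = ~~2.9;            // 2

// Case 2: bitwise reinterpretation of a double's IEEE-754 bits as ints,
// done by storing through one typed-array view and loading through another.
const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const u32 = new Uint32Array(buf);
f64[0] = 1.5;                     // bit pattern 0x3FF8000000000000
const loBits = u32[0];            // low 32 bits on little-endian hardware
const hiBits = u32[1];            // 0x3FF80000 on little-endian hardware
```

The second pattern is exactly the store/load pair that a dedicated asm.js idiom could collapse into a single register move.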

Comment 3

3 years ago
Luke covered all the questions I have.

In addition, if there is something still surprisingly slow, I can take a look. Please make a build with  -O3 -profiling for that.
Flags: needinfo?(azakai)
(Reporter)

Comment 4

3 years ago
(In reply to Luke Wagner [:luke] from comment #2)

The Google sheet is updated with new experiment data.

> A few high-level questions:
>  - do the OpenCV workloads using single-precision floats and, if so, are you
> compiling in Emscripten using PRECISE_F32?
Not exactly. In the test suite, a single operation (e.g., comparing two matrices) operates on several data types (UINT8, INT8, UINT16, INT16, INT32, FLOAT32, DOUBLE64). I re-compiled the program with -s PRECISE_F32=1, and the test cases operating on FLOAT32 data got a boost (about 10% of the test cases shift from the "2x~3x slower" category to the "1x~2x slower" category). However, the overall distribution does not change much, since the main performance drop does not come from the test cases operating on FLOAT32 data.
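For context, PRECISE_F32 makes Emscripten wrap float-typed expressions in Math.fround, which lets the engine use genuine float32 arithmetic instead of doubles. A minimal sketch of the idiom (hypothetical function, not from the OpenCV code):

```javascript
// With PRECISE_F32, float expressions are wrapped in Math.fround, telling
// the engine the result only needs float32 precision.
function scaleF32(x, factor) {
  return Math.fround(Math.fround(x) * Math.fround(factor));
}

// float32 arithmetic is visibly less precise than double:
const f = Math.fround(0.1);   // not exactly 0.1
```

Engines that recognize this pattern can keep the whole expression in single-precision registers, which is why FLOAT32-heavy tests benefit.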

>  - do you see any difference in native performance when you disable
> auto-vectorization in the native compiler?  If so, this might be another
> good testcase for SIMD.js
Two experiments were conducted:
1. Re-compiled the program with -fno-tree-vectorize. The result of this experiment confused me: several test cases perform better after disabling auto-vectorization. I will keep investigating this.
2. Re-compiled the program with ENABLE_SSE, ENABLE_SSE2, ENABLE_SSE3. The performance is better in several cases.

>  - regarding the float-to-int-conversion-performance are you talking about
> numeric conversions from double/float to integer or bitwise
> reinterpretations of doubles/floats as ints?  The former case is a known
> area where we are worse than native (extra branching to handle the
> out-of-bounds cases).  The latter would, iiuc, require Emscripten to emit a
> store-as-float/load-as-int so this could be a pattern we add to asm.js (so
> that we could emit a vmov directly w/o loads/stores).
I am dealing with the former, numeric conversion. Good to know that it is already being handled.

>  - regarding the second point you made of float vs non-float operation
> performance: that is quite surprising; did you determine this by looking at
> the float vs non-float benchmarks or from some other experiments?
Again, in the test suite, a single operation operates on several data types (UINT8, INT8, UINT16, INT16, INT32, FLOAT32, DOUBLE64). This pattern can be observed in the performance results of a single operation across the different data types; the pattern is especially obvious in the "compare" and "normalize" operations (see rows 469~552 and rows 1121~1156 in the Google sheet).

Comment 5

3 years ago
(In reply to Tzuhao Kuo (Kaku) [:tkuo] from comment #4)
Thanks for all your help.  Those are pretty surprising results wrt -fno-tree-vectorize.  (CC'ing sunfish, who might find this data point useful in the more general SIMD.js vs. autovectorization discussion.)

> Again, in the test suite, a single operation operates on several data types
> (UINT8, INT8, UINT16, INT16, INT32, FLOAT32, DOUBLE64). This pattern can be
> observed in the performance results of a single operation across the
> different data types; the pattern is especially obvious in the "compare" and
> "normalize" operations (see rows 469~552 and rows 1121~1156 in the Google sheet).

For some of these worst cases, do you suppose you could profile the individual benchmark with the builtin FF profiler with "Show Gecko platform data" checked (in general devtools settings) and see if there is any significant self time in any frames named "trampoline"?  A common source of this is calling a short-running asm.js function from non-asm.js from a loop (which trampolines into asm.js each time).  The trampoline can take 1000 cycles so this can easily lead to 10x slowdowns in the case of trivial asm.js functions.
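The trampoline pattern described above can be sketched with a toy module (hypothetical function names, using the standard asm.js module shape): calling a tiny asm.js function from a plain-JS loop pays the boundary-crossing cost on every iteration, while moving the loop inside the module pays it once.

```javascript
function TinyModule(stdlib, foreign, heap) {
  "use asm";
  // Trivial function: calling this from outside in a tight loop crosses
  // the JS <-> asm.js boundary (the "trampoline") on every call.
  function addOne(x) {
    x = x | 0;
    return (x + 1) | 0;
  }
  // Keeping the loop inside asm.js crosses the boundary only once.
  function sumAddOne(n) {
    n = n | 0;
    var i = 0, s = 0;
    for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0)
      s = (s + (addOne(i) | 0)) | 0;
    return s | 0;
  }
  return { addOne: addOne, sumAddOne: sumAddOne };
}

const m = TinyModule(globalThis, {}, new ArrayBuffer(0x10000));

// Slow shape: a JS loop trampolining into asm.js each iteration.
let slow = 0;
for (let i = 0; i < 1000; i++) slow += m.addOne(i);

// Fast shape: one boundary crossing, loop runs inside the module.
const fast = m.sumAddOne(1000);
```

Both shapes compute the same result; only the number of JS-to-asm.js transitions differs, which is what the profiler's "trampoline" frames measure.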

More generally, if you have time to work with us on this, it would be really helpful to take individual bad microbenchmarks, profile them (to verify it isn't trampoline time), and file individual bugs with the microbenchmark code.
(Reporter)

Comment 6

3 years ago
Created attachment 8548149 [details]
profiling_data.tar.gz

Hi Luke and Alon,

The attachment contains profiling data and runnable HTML/JS files for 4 different cases:
1) "compare" operation on 640x480 unsigned char data
2) "compare" operation on 640x480 floating point data
3) "normalize" operation on 640x480 unsigned char data
4) "normalize" operation on 640x480 floating point data

As shown in the Google Sheet, case (1) is about 13x slower than native and case (2) only about 2.5x slower. However, the running times of cases (1) and (2) in asm.js are not significantly different, so I would say the performance gap comes from case (1) getting better optimization in the native environment.

Cases (3) and (4) show the pattern where a single operation performs better on floating-point data than on non-floating-point data. I do find that case (3) spends a period of time in the so-called "trampoline", but not significantly; this does not happen in case (4).

Please take a look at the attached files, which include the HTML/JS files and the corresponding profiling files. If any further information is needed, please let me know.

Comment 7

3 years ago
(In reply to Tzuhao Kuo (Kaku) [:tkuo] from comment #6)
Thanks for the specific examples!

For #1, I don't see any problems for the one hot function where 80% of the time is spent.  I wonder if there are uint8-specific optimizations we're missing out on, though.  By any chance is there a uint32 version of this benchmark?
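For reference, the hot loop in question presumably has the shape of OpenCV's element-wise compare: one byte store of 255 or 0 per element. A hypothetical JS reconstruction of that kernel shape (not the actual compiled code):

```javascript
// Hypothetical sketch of an OpenCV-style compare kernel on uint8 data:
// out[i] = 255 if a[i] > b[i], else 0.
function compareGtU8(a, b, out, n) {
  for (let i = 0; i < n; i++) {
    // The per-element byte store is where uint8-specific codegen matters.
    out[i] = a[i] > b[i] ? 255 : 0;
  }
}

const a = Uint8Array.from([1, 5, 3]);
const b = Uint8Array.from([2, 4, 3]);
const out = new Uint8Array(3);
compareGtU8(a, b, out, 3);   // out becomes [0, 255, 0]
```

An int32 variant would use Int32Array views with the same loop body, which is what the uint8-vs-int32 comparison below probes.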

For #3, the profile shows 73% of the time is under the FFI call to _lrint which in turn does an apply call to _rint which does a bunch of slow-looking stuff (floating-point modulus).  By comparison, when I step into glibc's lrint, it performs a single cvtsd2si.  Perhaps Emscripten could emit ~~ in this case?  By comparison, #4 doesn't appear to spend very much time under _lrint.
Comment 8

3 years ago
If this is 32-bit x86, the useByteOpRegister code where stores always use %al is one uint8-specific optimization we're missing out on, though I don't know if that's the issue here.

Comment 9

3 years ago
(In reply to Luke Wagner [:luke] from comment #7)
> For #3, the profile shows 73% of the time is under the FFI call to _lrint
> which in turn does an apply call to _rint which does a bunch of slow-looking
> stuff (floating-point modulus).  By comparison, when I step into glibc's
> lrint, it performs a single cvtsd2si.  Perhaps Emscripten could emit ~~ in
> this case?  By comparison, #4 doesn't appear to spend very much time under
> _lrint.

On emscripten master (1.29.0), that code was replaced by musl's C code, so there wouldn't be an FFI call, and it should be much faster. However, it still does more than just ~~.

First, it checks for float exceptions, which always do nothing, but LLVM can only wipe those out if LTO is enabled (then it can also wipe out the call of lrint to rint).

After that, it inspects the bits to decide how to round the number,

https://github.com/kripken/emscripten/blob/master/system/lib/libc/musl/src/math/rint.c

so it's not doing a naive (int)a_double which would correspond to ~~a_double. I assume they have a good reason for doing so, but I don't know offhand why; if we think this is important, I could look into it.
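To illustrate the gap (a sketch, not musl's actual algorithm): ~~ truncates toward zero, while rint in the default floating-point environment rounds to nearest with ties to even, so the two disagree on most non-integral values. The `rintLike` helper below is a hypothetical emulation built on Math.round.

```javascript
// ~~ truncates toward zero, which is C's (int)x behavior:
const t1 = ~~2.7;    // 2
const t2 = ~~-2.7;   // -2

// rint (default rounding mode) rounds to nearest, ties to even.
// A small emulation on top of Math.round (which rounds ties toward +Infinity):
function rintLike(x) {
  const r = Math.round(x);
  const isTie = Math.abs(x - Math.trunc(x)) === 0.5;
  // On an exact tie, push an odd Math.round result to its even neighbor.
  return isTie && r % 2 !== 0 ? r - 1 : r;
}
// rintLike(2.7) === 3, but ~~2.7 === 2, so ~~ is not a drop-in rint.
```

This is why replacing the lrint/rint pair with a bare ~~ would change results, even before considering the float-exception bookkeeping.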

Comment 10

3 years ago
It'd be nice to re-benchmark with a newer Emscripten build to see what the native/asm.js ratio is afterwards; perhaps even more performance faults will have been fixed.

But, on the subject of optimizing lrint, I noticed that MUSL is calling rint from lrint, which seems to add unnecessary double->int conversion.  Also, I noticed that, for lrint, glibc uses roundsd (which is SSE4 but glibc uses cpu feature testing to choose which .so to load).  I guess we'd need a new stdlib primitive... would Math.round work?  Lastly, I don't really understand all this fp mode/exception stuff, but since the web hides these choices, perhaps Emscripten would be justified in hacking MUSL accordingly?
Comment 11

3 years ago
Math.round isn't usable here, as it handles ties in a manner unlike any of the standard rounding modes.
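Concretely (a quick illustration of the mismatch): Math.round rounds ties toward +Infinity, whereas C's round rounds ties away from zero and rint's default mode rounds ties to even, so Math.round matches none of the standard modes.

```javascript
// Math.round: ties go toward +Infinity.
const a = Math.round(2.5);    // 3
const b = Math.round(-2.5);   // -2  (round-half-away-from-zero would give -3)
const c = Math.round(3.5);    // 4   (round-half-to-even agrees here...)
const d = Math.round(4.5);    // 5   (...but round-half-to-even would give 4)
```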
(Reporter)

Comment 12

3 years ago
(In reply to Luke Wagner [:luke] from comment #10)
> It'd be nice to re-benchmark with a newer Emscripten build to see what the
> native/asm ratio afterwards; perhaps even more performance faults will have
> been fixed.

Thank you all for keeping an eye on this bug.

I re-benchmarked the test set with emsdk 1.29 and several test cases got a boost. In short, test cases with numeric conversion from floating-point to non-floating-point data got around a 50% speed-up. More detailed data are shown in the Google sheet: the new benchmark is in column I and the comparison to native performance is in column L; statistics charts are shown on page 2. The distribution is now more concentrated in the 0x~5x section, and the mass is in the 1x~2x range.

Next, I am going to post other clearly observed patterns with performance drops, including numeric issues as well as logical-operation issues. I hope these cases also help.

The Google sheet: https://docs.google.com/spreadsheets/d/1w8gr1_q_dQclTaGyiMtAhlwj_5_HPVKbHWfwUszPU40/edit?usp=sharing
(Reporter)

Updated

3 years ago
Depends on: 1121860
(Reporter)

Updated

3 years ago
Depends on: 1121877
(Reporter)

Comment 13

3 years ago
(In reply to Luke Wagner [:luke] from comment #7)
> (In reply to Tzuhao Kuo (Kaku) [:tkuo] from comment #6)
> Thanks for the specific examples!
> 
> For #1, I don't see any problems for the one hot function where 80% of the
> time is spent.  I wonder if there are uint8-specific optimizations we're
> missing out on, though.  By any chance is there a uint32 version of this
> benchmark?

Sorry, the OpenCV library limits its array data types to the following primitive types, not including uint32:
* 8-bit unsigned integer (uchar)
* 8-bit signed integer (schar)
* 16-bit unsigned integer (ushort)
* 16-bit signed integer (short)
* 32-bit signed integer (int)
* 32-bit floating-point number (float)
* 64-bit floating-point number (double)

Would any of the types above also help?
Flags: needinfo?(luke)
(Reporter)

Updated

3 years ago
Depends on: 1121908

Comment 14

3 years ago
Oh, int32 would be great as well.  And again I'm just interested to see if Emscripten/asm.js do relatively better on int32 (vs. native) than on int8 (vs. native).  Thanks!
Flags: needinfo?(luke)
(Reporter)

Comment 15

3 years ago
Created attachment 8555023 [details]
operation_compare_32SCX.tar.gz

(In reply to Luke Wagner [:luke] from comment #14)
> Oh, int32 would be great as well.  And again I'm just interested to see if
> Emscripten/asm.js do relatively better on int32 (vs. native) than on int8
> (vs. native).  Thanks!

Sorry for my late reply.
The newly attached file is the profiling result for data type int32. In this case, int32 does not perform better than uint8.
Flags: needinfo?(luke)

Comment 16

3 years ago
Ah, that and your bug 1121860 comment 10 would suggest that our x86-byte-store issue probably isn't the problem here.  These are pretty tight loops, so probably we just need to dig into the code to see if it's a codegen/regalloc issue.

Btw, I see 21% time in the profile under __ZN2cvL11randBits_8uEPhiPyPKNS_3VecIiLi2EEEb (distributed periodically through the profile) doing a lot of what looks like int64 math in a rand function.  Is this by any chance part of the testing harness and not part of the reported benchmark time?
Flags: needinfo?(luke)
(Reporter)

Comment 17

3 years ago
(In reply to Luke Wagner [:luke] from comment #16)
> Ah, that and your bug 1121860 comment 10 would suggest that our
> x86-byte-store issue probably isn't the problem here.  These are pretty
> tight loops, so probably we just need to dig into the code to see if it's a
> codegen/regalloc issue.
> 
> Btw, I see 21% time in the profile under
> __ZN2cvL11randBits_8uEPhiPyPKNS_3VecIiLi2EEEb (distributed periodically
> through the profile) doing a lot of what looks like int64 math in a rand
> function.  Is this by any chance part of the testing harness and not part of
> the reported benchmark time?

Yes, you are right. This function is used to prepare the test data, and its execution time is not included in the benchmark report.

Comment 18

3 years ago
By the way, are there any macro OpenCV benchmarks (that don't test single operations in tight loops but rather use OpenCV on a set of large, realistic tasks)?  I was thinking that, if so, it'd be nice to include this as an asmjs-apps-*-* workload (so we can track it on http://arewefastyet.com/#machine=28&view=breakdown&suite=asmjs-apps).