Closed Bug 1512491 Opened 6 years ago Closed 5 years ago

Access to Uint8Array is 25% slower

Categories

(Core :: JavaScript Engine: JIT, defect, P2)

64 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla68
Performance Impact none
Tracking Status
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- fixed

People

(Reporter: CoolCmd, Assigned: anba, Mentored)

Details

(Keywords: good-first-bug, perf, testcase, Whiteboard: [lang=c++])

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

navigate to https://jsfiddle.net/CoolCmd/f0x1huma/
click 'Run Tests' button at right


Actual results:

benchmark result:
new Uint8Array() x 49.49 ops/sec
new Uint8Array(new ArrayBuffer()) x 39.71 ops/sec


Expected results:

both tests should have identical ops/sec values, as in Firefox 50, Edge and Chrome.

compare the test1() and test2(). they differ only in the way the array is created.
note: test measures only array access, not array creation.
Thank you for opening this issue.
I am forwarding it to Kannan to investigate.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(kvijayan)
Keywords: perf, testcase
Priority: -- → P2
Whiteboard: [qf]
Ion only optimizes the single-integer-argument case, not the single-buffer-argument case:

https://searchfox.org/mozilla-central/source/js/src/jit/MCallOptimize.cpp#2929

Comment 1 is slightly confusing because it suggests we are comparing `new Uint8Array()` and `new UintArray(new ArrayBuffer())`.

It's actually comparing `new Uint8Array(I32)` vs `new Uint8Array(new ArrayBuffer(I32))`.

The apparent cost of separately allocating an ArrayBuffer is likely not the problem here - the ArrayBuffer will need to be allocated regardless in either case.

The problem here most likely arises from the fact that we are not Ion-optimizing the construction with single-buffer-argument.  Instead of inlining the call, Ion would presumably emit an IC or a CallVM for the `new Uint8Array()`, which is the likely source of the performance differential.

The fix is to handle this case in Ion.  This will require checking the type of the incoming argument using TI, and if it's an `ArrayBuffer` instance, emitting an optimized inline MInstruction to construct the array.

This performance delta should exist on all TypedArray kinds - not just Uint8Array.

This might be a good first-bug for somebody wanting to work on Ion stuff.
Flags: needinfo?(kvijayan)
Whiteboard: [qf] → [qf][lang=c++]
Mentor: kvijayan
(In reply to Kannan Vijayan [:djvj] from comment #2)
> The problem here most likely arises from the fact that we are not
> Ion-optimizing the construction with single-buffer-argument.  Instead of
> inlining the call, Ion would presumably emit an IC or a CallVM for the `new
> Uint8Array()`, which is the likely source of the performance differential.

This is actually not relevant for the test case, because only a single typed array is constructed for each test case. 

Standalone shell test case:
---
const BUFFER_SIZE = 16 * 1024 * 1024;
var a1 = new Uint8Array(BUFFER_SIZE);
a1.fill(1);
var a2 = new Uint8Array(new ArrayBuffer(BUFFER_SIZE));
a2.fill(1);

function f() {
  var c = 0;
  var t = performance.now();
  for (var i = 0; i < 20; ++i) {
    for (var j = 0; j < BUFFER_SIZE; j++) {
      c += a1[j];
    }
  }
  return [performance.now() - t, c];
}

function g() {
  var c = 0;
  var t = performance.now();
  for (var i = 0; i < 20; ++i) {
    for (var j = 0; j < BUFFER_SIZE; j++) {
      c += a2[j];
    }
  }
  return [performance.now() - t, c];
}

print(f());
print(g());
---


The important bit here is that the length for the typed array |a2| exceeds SINGLETON_BYTE_LENGTH [1], which means |a2| will be instantiated as a singleton whereas |a1| won't be marked as a singleton [2,3]. If |a2| is not created as a singleton, for example by commenting out [2], the performance will be the same for both typed arrays. 

Jan mentioned in bug 1437471 that the singleton code is mostly a performance optimization for Mandreel and we may want to consider to remove this code to avoid unexpected performance cliffs (like in bug 1437471 or in this bug), if we're willing to take a ~4000 regression in Mandreel (bug 1437471, comment 3).


[1] https://searchfox.org/mozilla-central/rev/fd32b3a6fa3eff1468311f6fcf32b45c117136df/js/src/vm/TypedArrayObject.h#132-137
[2] https://searchfox.org/mozilla-central/rev/fd32b3a6fa3eff1468311f6fcf32b45c117136df/js/src/vm/TypedArrayObject.cpp#749-753
[3] https://searchfox.org/mozilla-central/rev/fd32b3a6fa3eff1468311f6fcf32b45c117136df/js/src/vm/TypedArrayObject.cpp#383-390
Flags: needinfo?(kvijayan)
Ugh, I can't believe I completely missed the point of the perf.  Just focused on the construction.

Thanks for catching my mistake Andre.
Flags: needinfo?(kvijayan)
:djvj, in fact you are the second who completely missed, see bug 1506480.
Ah, there's always safety in numbers :)
Whiteboard: [qf][lang=c++] → [qf-][lang=c++]
Hi,

I am interested in this bug and would like to solve it. If possible, could you tell me how to proceed with this and what files to modify to solve this bug. Is there a way to test my code after making changes?

Thank you!!
(In reply to sharath.savasere from comment #7)
> I am interested in this bug and would like to solve it. If possible, could
> you tell me how to proceed with this and what files to modify to solve this
> bug. Is there a way to test my code after making changes?

The first step is to check if you can reproduce this issue. Which implies that you build SpiderMonkey from the sources.

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Build_Documentation

Then after building the JavaScript shell, you should try to see if you can reproduce the performance issue with the code from comment 3.
(In reply to Nicolas B. Pierron [:nbp] (away until 02/01/2019) from comment #8)
> (In reply to sharath.savasere from comment #7)
> > I am interested in this bug and would like to solve it. If possible, could
> > you tell me how to proceed with this and what files to modify to solve this
> > bug. Is there a way to test my code after making changes?
> 
> The first step is to check if you can reproduce this issue. Which implies
> that you build SpiderMonkey from the sources.
> 
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/
> Build_Documentation
> 
> Then after building the JavaScript shell, you should try to see if you can
> reproduce the performance issue with the code from comment 3.

I did the installation through mozilla-central repo. I built the js shell and ran the code in comment 3. I got these results as output:

504.696044921875,335544320
637.552001953125,335544320

Can you tell me the next steps for tackling this bug!
Thanks!
Great, now, with a debug build try to identify what is the difference in the generated code, and see if it would be easy to hoist a repeated set of instructions/conditions out of the slow loop. If you cannot reproduce any differences with a debug build, then toggle the ion-spew on with the configure option on an optimized build.

Then to dump the Assembly, use the environment variable IONFLAGS=help and IONFILTER=foo.js:4, to list all different outputs and filter them based on a given location.

This page[1] might also help you to continue your investigation, or learn a few debugging tricks specific (or not) to SpiderMonkey.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips

(In reply to Nicolas B. Pierron [:nbp] from comment #10)

Great, now, with a debug build try to identify what is the difference in the
generated code, and see if it would be easy to hoist a repeated set of
instructions/conditions out of the slow loop. If you cannot reproduce any
differences with a debug build, then toggle the ion-spew on with the
configure option on an optimized build.

Then to dump the Assembly, use the environment variable IONFLAGS=help and
IONFILTER=foo.js:4, to list all different outputs and filter them based on a
given location.

This page[1] might also help you to continue your investigation, or learn a
few debugging tricks specific (or not) to SpiderMonkey.

[1]
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/
Hacking_Tips

I looked at the output assembly code of the two functions and the assembly code for function f() is more compared to function g(). I asked in #jsapi channel and was told that its something to do with inline functional call. I am not sure which files to look into and what functions to change. I saw the iongraph plots and its very hard to find out what is causing this difference in time.

Thank you!

IIRC you already have generated the iongraph pdfs/pngs. So for the simplified test case from comment #3, iongraph should have outputted four functions:

 function 0 (self-hosted:6695): success; 29 passes.
 function 1 (self-hosted:6695): success; 30 passes.
 function 2 (/tmp/f.js:7): success; 29 passes.
 function 3 (/tmp/f.js:17): success; 29 passes.

function 0 and function 1 can be ignored here, and we only need to look at function 2 and function 3, where function 2 is function f from the test case and function 3 is function g.

We know that function 2/f is the fast one and function 3/g is the slow one, now we only need to identify how these two functions differ resp. which compiler optimization path led to a different output. So let's take a look at "func02-pass26-Add KeepAlive Instructions-mir.gv.pdf" and "func03-pass26-Add KeepAlive Instructions-mir.gv.pdf", which is the output of the very last MIR optimization path before starting to lower to a more machine level representation (LIR).

The inner loop for (var j = 0; j < BUFFER_SIZE; j++) { ... } is represented in the blocks 4-6, where loading the typed array element c += a2[j]; is represented in block 6.

In the fast case for function 2/f, the following instructions are used:

spectremaskindex phi62 typedarraylength56
loadunboxedscalar typedarrayelements57 spectremaskindex67 uint8

Whereas the slow case function 3/g uses:

constant 0x1000000
boundscheck phi56 constant60
spectremaskindex phi56 constant60
constantelements 0x7efce2100000
loadunboxedscalar constantelements64 spectremaskindex63 uint8

That means in the slower function we're always emitting a boundscheck MBoundsCheck instruction, which is probably the cause of the slow down. In the fast function the boundscheck was somehow moved outside of block 6 and is now present in block 4:

boundscheck loadslot53 typedarraylength56

Now we need to identify which optimization is responsible for moving the boundscheck for function 2/f. Eventually we'll end up in "func02-pass13-Range Analysis-mir.gv.pdf", which is the very first pass which has the bounds check in block 4 instead of block 6. This optimization pass is implemented in RangeAnalysis.h and RangeAnalysis.cpp, so you'll need to inspect why the range analysis was able to hoist the bounds check only for function 2/f, but not for function 3/g. Maybe this function is a good starting point for further debugging?

So I just took another quick look at this.

This seems to be a misbehaving interaction between LICM and Range Analysis. LICM doesn't hoist MConstant (MConstant will only be hoisted if its user is hoisted, but the user, MBoundsCheck in this case, can't be hoisted, because one of its operands isn't loop invariant) and not hoisting the MConstant leads to rejecting to hoist MBoundsCheck in the Range Analysis phase, because its length (the MConstant) wasn't hoisted and is therefore considered as not loop invariant.

And additionally LICM also never hoists MConstantElements, which leads to another slow down in function g when compared to function f.

That means there are two action points:

  1. We need to teach RangeAnalysis to hoist MBoundsCheck with non-hoisted MConstants.
  2. And we need to find out why MConstantElements is considered as not being worth hoistable in LICM.
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/07d5d42519f3
Part 1: Hoist bound checks with constants. r=nbp,jandem
https://hg.mozilla.org/integration/autoland/rev/27b59f0c6630
Part 2: Hoist access to MConstantElements. r=nbp

Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Performance Impact: --- → -
Whiteboard: [qf-][lang=c++] → [lang=c++]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: