Open Bug 771106 Opened 12 years ago Updated 2 years ago

Meta: improve memory access performance in Emscripten-translated code

Categories

(Core :: JavaScript Engine, defect)

Other Branch
x86
macOS
defect

People

(Reporter: bhackett1024, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [js:t])

Attachments

(4 files)

Currently, memory access performance is a major gap between native code and Emscripten-translated JS (and other autotranslators).  For a native access like:

x[i]

From C this is compiled to a single base+index instruction.  The translated JS for this looks something like:

Mem[(x + (i << 2)) >> 2]

In some cases optimizations (I don't know whether these live in Emscripten or in the Closure compiler) introduce new variables that can eliminate one or both of these shifts, but those rewrites don't seem to be widely applicable and aren't used much on the (simple) fannkuch benchmark.

This metabug is about changes to the JITs and/or Emscripten that would let the JIT compile the translation of x[i] down to a base+index instruction plus a bounds check (hopefully hoisted out of loops).
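To make the gap concrete, here is a minimal sketch of how such a translated access behaves over a typed array. The heap layout, and the names Mem and x, are hypothetical placeholders, not Emscripten output:

```javascript
// Minimal sketch of an Emscripten-style flat heap (hypothetical layout).
const Mem = new Int32Array(new ArrayBuffer(64));

// Pretend a C int32 array `x` starts at byte offset 16 in the heap.
const x = 16;                    // byte address of x[0]

Mem[(x + (0 << 2)) >> 2] = 7;    // translated form of `x[0] = 7`
Mem[(x + (1 << 2)) >> 2] = 9;    // translated form of `x[1] = 9`

// Every access pays for the shift-and-add address arithmetic that a
// native compiler would fold into a single base+index addressing mode:
const v = Mem[(x + (1 << 2)) >> 2];
```

Natively, both the << 2 and the >> 2 disappear into the addressing mode; in JS they are explicit integer operations unless the JIT pattern-matches them away.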

While Emscripten is the focus here, the goal is not to overly constrain the input, so that other autotranslators can adapt to (or be adapted for) the same scheme.
Translated version of the fannkuch-11 benchmark produced by emscripten -O3 (this code is 1s faster than the -O2 output).  I currently get 10.2s in JM and 10.0s in IM.
The above benchmark hand-optimized to eliminate most shifts.  I get 9.3s in JM and 23.9s in IM (a weird performance fault that needs investigating).

This changes the representation of 'x' in the above to be an index into Mem[] rather than an absolute byte offset.  I don't know how hard this change would be to make in Emscripten, but generating code like this would require less pattern matching in the JIT and would apply more easily to other JS engines.
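A sketch of that index-based representation (names are hypothetical), assuming pointers are stored pre-divided by the element size:

```javascript
const Mem = new Int32Array(16);

// Index-based representation: `x` holds an index into Mem, not a byte address.
const x = 4;           // x[0] lives at Mem[4]

Mem[x + 1] = 9;        // `x[1] = 9` with no shifts at all
const v = Mem[x + 1];  // a plain base+index access the JIT can compile directly
```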
Attached file C++ fannkuch
Original C++.  gcc -O3 is 3.3s for me.
Attachment #639312 - Attachment is patch: false
Depends on: 771285
Depends on: 771383
Regarding things like

Mem[(x + (i << 2)) >> 2]

it is possible to split off x >> 2 and, if it recurs, to define x2 = x >> 2. However, for (i << 2) >> 2 it is not trivial to replace it with, say, i | 0, since the shift pair can clear the top few bits. This might become easier once Emscripten has a C++ LLVM backend (see bug 771285 comment 5), because inside LLVM it is likely straightforward to detect when such operations act on pointers (where we can reasonably assume the top few bits are not needed).
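A quick illustration of why the | 0 rewrite is unsafe in general: the two forms agree for small indices but diverge once the top bits of the operand are set.

```javascript
// Small index: the shift pair and | 0 agree.
const small = 5;
const shifted = (small << 2) >> 2;   // 5
const ored = small | 0;              // 5

// Bit 30 set: << 2 discards the top bits, so >> 2 cannot restore them.
const big = 0x40000005;
const shiftedBig = (big << 2) >> 2;  // 5: the top bits are gone
const oredBig = big | 0;             // 0x40000005, unchanged
```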
Attached file fannkuches
Ok, we already have the infrastructure in the current compiler to optimize similar expressions; I ran some tests now. Attached are 4 versions of fannkuch, with an example of the generated code from each. Closure was not run, to keep things readable.

src.0.js  HEAP32[($i_23_i << 2 >> 2) + $20$s2]
src.1.js  HEAP32[($i_23_i & 1073741823) + $20$s2]
src.2.js  HEAP32[($i_23_i | 0) + $20$s2]
src.3.js  HEAP32[$i_23_i + $20$s2]

So, src.0.js is the output of the original, unmodified compiler. src.1.js replaces the << >> pair with a single & using the proper mask, which is a safe transformation that seems like it could be useful (1 operation instead of 2). src.2.js does an unsafe transformation of << >> into | 0, which is valid for pointers and happens to be ok here. Finally, src.3.js is the same as the previous one but without the |0, for the smallest possible code.
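For reference, 1073741823 is 0x3FFFFFFF, i.e. the low 30 bits; the masked form in src.1.js matches the shift pair whenever the index fits in 29 bits (so no sign extension kicks in), which is the case for any in-range heap index. A small sanity check:

```javascript
// 1073741823 === 0x3FFFFFFF: keeps the low 30 bits, clears the top two.
const i = 123456;                 // an in-range heap index (bit 29 clear)
const shiftPair = (i << 2) >> 2;
const masked = i & 1073741823;
// For such indices the single & replaces the two-shift sequence.
```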

time mozjs --ion -n src.*.js 11

gives

src.0.js  9.869 seconds
src.1.js  9.797
src.2.js  9.793
src.3.js  10.273

1 and 2 give slightly less than a 1% speedup. That is very different from the hand-optimized version mentioned earlier, so I guess the hand-optimized one did something important that these simple optimizations did not.

Note that 1% is not a bad thing in itself. However 1 generates larger code and 2 is unsafe, so for now I won't use these optimizations in emscripten.

3 is slower, which is not surprising, I guess: the JIT needs to add checks on the type of the variable used to index into HEAP.
Btw, $20$s2 is a helper variable the optimizer generated: after seeing that $20 was used via >> 2 several times, it defined $20$s2 = $20 >> 2.
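That hoisted-shift helper pattern looks roughly like this (heap contents and offsets here are illustrative, not taken from the attached code):

```javascript
const HEAP32 = new Int32Array(16);

const $20 = 32;             // a byte pointer used repeatedly via >> 2
const $20$s2 = $20 >> 2;    // the hoisted helper: shift once, reuse everywhere

HEAP32[$20$s2 + 0] = 1;     // instead of HEAP32[($20 + 0) >> 2]
HEAP32[$20$s2 + 1] = 2;     // instead of HEAP32[($20 + 4) >> 2]
```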
Ah, what are the times with -m -n? Ion still behaves weirdly here and is slower on the optimized benchmark.
Ok, with -m -n I get

src.0.js  9.761
src.1.js  9.713
src.2.js  9.749
src.3.js  10.141
Depends on: 771835
Depends on: 771864
Blocks: 767238
Whiteboard: [js:t]
Assignee: general → nobody
Severity: normal → S3