Closed Bug 864214 Opened 7 years ago Closed 3 years ago

IonMonkey: Generate asm.js heap load/stores when possible

Categories

(Core :: JavaScript Engine, defect, P5)

Other Branch
x86
macOS
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bhackett, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [leave open])

Attachments

(2 files, 2 obsolete files)

Attached patch patch (obsolete) — Splinter Review
asm.js uses specialized MIR nodes for heap accesses, which are faster than what IonMonkey normally generates.  It would be good if vanilla IonMonkey also generated these where possible, i.e. when the load/store can execute infallibly on a statically known array.  The attached patch does this for x86.  ARM should be straightforward but I don't have a way to test changes to it.  x64 is considerably more complicated, as it depends on signal handlers to be able to execute loads/stores without the bounds checks that x86/ARM require.

It would be possible on x86 and (presumably) ARM to get things going even faster than they are using these nodes, as bounds checks could be hoisted from the loop and instructions cut from the loads/stores.  Saving that for followup, maybe.
Attachment #740153 - Flags: review?(luke)
Attachment #740153 - Flags: review?(jdemooij)
Comment on attachment 740153 [details] [diff] [review]
patch

Optimizing typed array accesses in general sounds great, but I'd rather not reuse the AsmJS*Heap nodes since it makes them more complicated.  Instead, can you add new or extend the existing *TypedArray* nodes.  If there is duplication in the backend codegen, that can be factored out.
Attachment #740153 - Flags: review?(luke) → review-
Attached patch x86 patch (obsolete) — Splinter Review
Well, this patch adds {Load,Store}TypedArrayElementStatic nodes which are identical in behavior to AsmJS{Load,Store}Heap except for how the typed array is baked in.  It's also twice as large, thanks to all the new boilerplate.
Attachment #740153 - Attachment is obsolete: true
Attachment #740153 - Flags: review?(jdemooij)
Attachment #740887 - Flags: review?(luke)
Attachment #740887 - Flags: review?(jdemooij)
Comment on attachment 740887 [details] [diff] [review]
x86 patch

Review of attachment 740887 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/IonBuilder.cpp
@@ +6201,5 @@
> +    // conversion, so that out of bounds accesses do not need to bail out.
> +    bool intConversion;
> +    if (*next == JSOP_POS)
> +        intConversion = false;
> +    else if (*next == JSOP_ZERO && *(next + JSOP_ZERO_LENGTH) == JSOP_BITOR)

Your overall stated goal, which I like, is to optimize patterns on general (non-asm.js) principles, but this bytecode pattern matching contradicts that.  For one thing, this matches a pattern more narrow than asm.js (loads produce a type "intish" or "doublish" which can be consumed by many operations (that perform ToInt32/ToNumber on their operand).  It seems like you'd want to hook into the more general truncation analysis that is already performed.
Attachment #740887 - Flags: review?(luke)
(In reply to Brian Hackett (:bhackett) from comment #2)
> Created attachment 740887 [details] [diff] [review]
> x86 patch
> 
> Well, this patch adds {Load,Store}TypedArrayElementStatic nodes which are
> identical in behavior to AsmJS{Load,Store}Heap except for how the typed
> array is baked in.  It's also twice as large, thanks to all the new
> boilerplate.

I volunteer to implement the ARM support if this helps.

Forgive my ignorance, but it appears that the JIT code is
specialized on the typed array size and this is given by the
ins->mir->length()?

Could I also ask if constant indexes use these code paths?
Bug 865516 is optimizing out the bounds checks for small
constant indexes in asm.js code.

If the array length is known when compiling the JIT code
and the index is constant then the check might be avoided.
Further, is the index variable range type information
available to the backend to make code choice decisions?
Attached patch x86 patchSplinter Review
I think my overall stated goal is that code should run at the same speed whether asm.js is used or not, but, yeah, being robust against the patterns matched against by asm.js is part of that.  This patch allows LoadTypedArrayElementStatic to be fallible.  The fallible and infallible versions should run at about the same speed, provided that the associated snapshots are not keeping things alive longer than necessary (another TODO to get to soon).

Both the truncation analysis and the bytecode sniffing are used to mark the loads as infallible.  The truncation analysis is not sufficient by itself because (a) it doesn't do anything with general numeric conversions, and (b) it interacts badly with the expression folding done in GVN and the 'x[y]' in 'x[y] | 0' will not usually be marked as truncated.  The latter is a more involved fix and outside the bounds of this patch, but since it also causes 'x + y' in '(x + y) | 0' ops to usually require overflow checks it's another upcoming TODO.
Attachment #740887 - Attachment is obsolete: true
Attachment #740887 - Flags: review?(jdemooij)
Attachment #741817 - Flags: review?(luke)
Attachment #741817 - Flags: review?(jdemooij)
(In reply to Douglas Crosher [:dougc] from comment #4)
> I volunteer to implement the ARM support if this helps.

Sure, this would be great.

> Forgive my ignorance, but it appears that the JIT code is
> specialized on the typed array size and this is given by the
> ins->mir->length()?

Yes, the typed array being specialized on is statically known; ins->mir->length() is the total length (in bytes) of the array.

> Could I also ask if constant indexes use these code paths?
> Bug 865516 is optimizing out the bounds checks for small
> constant indexes in asm.js code.
> 
> If the array length is known when compiling the JIT code
> and the index is constant then the check might be avoided.
> Further, is the index variable range type information
> available to the backend to make code choice decisions?

As with AsmJSLoadHeap, LoadTypedArrayElementStatic uses a register to hold the pointer, but by inspecting the MIR you can determine whether that register will always hold a constant value.

The backend has range information for the index, which you can use to eliminate the bounds checks even if the index is not constant.
Thank you for the information.

The ARM bounds check implementation uses a logical shift to
mask the index and only works for lengths of 2^n.  Asm.js
imposes some restrictions on the heap size.  Would you be
happy with the compiler giving up on the optimization if
the length is not 2^n for the ARM?

The x64 implementation for asm.js requires the heap array
to be prepared - mapped to a specific layout in memory.
Did you really want to attempt this?  Might it be possible
to specialize the JIT code on the array being already
prepared?    How would the array get prepared?  Alternatively
the x64 could just copy the x86 approach.
Attachment #741817 - Flags: review?(luke) → review+
(In reply to Douglas Crosher [:dougc] from comment #7)
> The ARM bounds check implementation uses a logical shift to
> mask the index and only works for lengths of 2^n.  Asm.js
> imposes some restrictions on the heap size.  Would you be
> happy with the compiler giving up on the optimization if
> the length is not 2^n for the ARM?

This would be fine.

> The x64 implementation for asm.js requires the heap array
> to be prepared - mapped to a specific layout in memory.
> Did you really want to attempt this?  Might it be possible
> to specialize the JIT code on the array being already
> prepared?    How would the array get prepared?  Alternatively
> the x64 could just copy the x86 approach.

I think the x64 layout could be used, though it would be more invasive to the engine in general.  Right now I'm mostly interested in x86 and ARM performance, since those are the architectures used by the vast majority of our users currently.
Comment on attachment 741817 [details] [diff] [review]
x86 patch

Review of attachment 741817 [details] [diff] [review]:
-----------------------------------------------------------------

I'm a bit uneasy about making major optimizations x86-only (most of our OS X and Linux users do use x64 builds), please file a bug for adding it.

::: js/src/ion/Lowering.cpp
@@ +2092,5 @@
>  
>  bool
> +LIRGenerator::visitLoadTypedArrayElementStatic(MLoadTypedArrayElementStatic *ins)
> +{
> +    LLoadTypedArrayElementStatic *lir =

Nit: JS_ASSERT(ins->ptr()->type() == MIRType_Int32);
Attachment #741817 - Flags: review?(jdemooij) → review+
If no one beats me to it then I'll take care of the x64 backend support by adapting the x86 code.  This will allow the x64 to also use this optimization for an arbitrary array length. The ARM support will also be written to work with an arbitrary length, but it will bit a little faster for a length of 2^n.

Supporting this optimization for a length that is not just 2^n might help support a common pattern.  If the length is 2^n+g then a 2^n-1 mask of a pointer index at the entry to a function would declare a useful range for the pointer and allow the bounds checks to be optimized away even in the case of small positive offsets, up to 'g'.  This would fit a pattern of a pointer to a structure stored in the array.

It would help if an 'inRange' flag could be added to the MIR object because all the backends will want to know if the index is known to be within the extent of the array length so that they can avoid the bounds check in this case.  This could just a well be a follow up patch.
(In reply to Douglas Crosher [:dougc] from comment #10)
> Supporting this optimization for a length that is not just 2^n might help
> support a common pattern.  If the length is 2^n+g then a 2^n-1 mask of a
> pointer index at the entry to a function would declare a useful range for
> the pointer and allow the bounds checks to be optimized away even in the
> case of small positive offsets, up to 'g'.  This would fit a pattern of a
> pointer to a structure stored in the array.

We already consolidate and loop hoist bounds checks for the other kinds of array and typed array access ops.  I'm planning to expand that to these static accesses in a followup; this should improve both the 'embedded structure with many nearby offsets' and the 'embedded array within the typed array' cases.  The relevant code is around MBoundsCheck and friends if you're interested.
It looks like this regressed Octane-mandreel on AWFY by at least 5%. Can you take a look?
Attached patch ARM supportSplinter Review
This patch adds support for the ARM backend.  It handles an arbitrary sized array length, but will be a little faster for a length of 2^n.  The code paths have been tested, but this is not proposed to be committed yet because the performance difference is hard to measure.  A constant is currently used for the array base and this requires two instructions, or a constant pool load, on the ARM - it might be better to keep the base in a register for the ARM.

The use of a shift and zero test for the 2^n bounds check might be implemented in general code as an optimization for general comparisons and this might improve performance a little even without this patch.  The patch takes advantage of the ARM's conditional instructions and more general support for these in the ARM assembler could help code in general.
(In reply to Jan de Mooij [:jandem] from comment #14)
> It looks like this regressed Octane-mandreel on AWFY by at least 5%. Can you
> take a look?

Any news on this? The regression on mandreel is still on AWFY and likely affecting various non-asm.js emscripten and mandreel code on the web.
I compared outputs with -D and after this bug we are executing about 8% more Ion instructions on mandreel than before.  The actual LIR opcode counts are unchanged except for the ones related to the typed array accesses, so this is just a benchmark where the asm.js loads/stores are less efficient than in the usual typed array case.  That seems likely to be because we can hoist/consolidate bounds checks for normal typed array accesses, while asm.js heap accesses (on x86/arm at least) always require bounds checks.  For mandreel, we used to eliminate 2/3 of bounds checks with these optimizations; now that is near zero.  Fixing this and even improving on the earlier state will happen before long (see comment 11) but right now I'm more interested in other parity issues; if you're happy with asm.js x86 perf as is, I don't see a reason to rush a fix in.
asm.js x86 bounds checking was shoehorned in at the last minute when I had to rip out the use of segmentation because of Win64, so I'm not surprised it's worse than bounds-check hoisting.
(In reply to Brian Hackett (:bhackett) from comment #17)
> I compared outputs with -D and after this bug we are executing about 8% more
> Ion instructions on mandreel than before.  The actual LIR opcode counts are
> unchanged except for the ones related to the typed array accesses, so this
> is just a benchmark where the asm.js loads/stores are less efficient than in
> the usual typed array case.  That seems likely to be because we can
> hoist/consolidate bounds checks for normal typed array accesses, while
> asm.js heap accesses (on x86/arm at least) always require bounds checks. 
> For mandreel, we used to eliminate 2/3 of bounds checks with these
> optimizations; now that is near zero.  Fixing this and even improving on the
> earlier state will happen before long (see comment 11) but right now I'm
> more interested in other parity issues; if you're happy with asm.js x86 perf
> as is, I don't see a reason to rush a fix in.

Sounds fine to me, this is important but not urgent IMO.
Depends on: 891400
Depends on: 897202
Assignee: general → nobody
Depends on: 1132290
Is there anything left to do here for now, or can we close this bug?
Flags: needinfo?(bhackett1024)
Priority: -- → P5
We should be doing this for x86 now, any ARM support should be done in another bug.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bhackett1024)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.