1316820 - Wasm baseline: Better instruction selection for zero-initialization

Assignee

Description

•

8 years ago

Right now to initialize locals we clear a register and then store 32 bits at a time into the frame.  There are possible improvements:

On x64, at least, scratch will be a 64-bit register and we can move 64 bits at a time.

On SSE2 or better SIMD systems we may be able to store 128 bits at a time.  (I suppose on some systems we have 512-bit SIMD for that matter.)

If we have only one initializing store then it's better to store a zero literal, probably.

It is possible that allocating locals to registers (bug 1316808) will lead to some constant-tracking optimizations that will make fast initialization less important, but for the sake of performance and code density we should probably investigate fast initialization without waiting for better register allocation.

Lars T Hansen [:lth]

Assignee

Comment 1

•

8 years ago

Speaking of code density: Another thing is that frames can be fairly large in the baseline compiler since there's always one slot per variable.  When we have more than a few slots to initialize we should definitely emit a partially unrolled loop, not a fully unrolled loop like now.  ("A few" might be 32-ish, and we might do 32-at-a-time initialization.  Having data from real programs about frame sizes would be helpful here.)  For the pathological example in bug 1317678 we emit over 30000 individual movl instructions to initialize the frame.

Lars T Hansen [:lth]

Assignee

Updated

•

8 years ago

Mentor: lhansen

Michelangelo De Simone [Offline] [:mds]

Updated

•

7 years ago

Assignee: nobody → mdesimone

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/102034/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/102034/

Michelangelo De Simone [Offline] [:mds]

Comment 3

•

7 years ago

:lth the patch is __NOT__ for review just yet, it's incomplete and I've pushed it only to get some feedback.

The basic ideas here are:

1. if we have enough locals to justify it AND we are on (x64 OR x86 w/ SSE2+): store DOUBLE QUADWORDS via movdqu until we are done; otherwise

2. if we are are on x64, store QUADWORDS (like in the sample patch) until we are done; otherwise

3. move doublewords (as it is now)

Is it ok or I've made a huge mess?:)

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 4

•

7 years ago

The patch is OK.  What you have is a generic fallback 64-bit path; instead of being specific to x64 it should be possible to use platform-independent 64-bit operations, and then #ifdef on JS_64BIT (defined in ../jstypes.h).  This would follow specific paths for x64 and x86 probably, that would use platform-specific sse instructions.  It may be that the code needs a little restructuring to accomplish this, but it doesn't seem hard.

I am keen to also solve the other two problems I mentioned above:

- If it's a single 4-byte slot, just store an immediate 0.  This will be better on x86 and x64, probably; on ARM it won't make a difference.

- If we'd end up with many stores, don't unroll the init loop completely like we do now, but instead generate a partially unrolled loop.  This is just good hygiene.

Flags: needinfo?(lhansen)

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/1-2/

Michelangelo De Simone [Offline] [:mds]

Comment 6

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review104422

Draft patch, not for review/landing.

Attachment #8823506 - Flags: review-

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/2-3/

Michelangelo De Simone [Offline] [:mds]

Comment 8

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review104428

Attachment #8823506 - Flags: review-

Michelangelo De Simone [Offline] [:mds]

Comment 9

•

7 years ago

(In reply to Lars T Hansen [:lth] from comment #4)

Draft patch; not for review just yet (I've r-'ed it myself to be sure).

> - If it's a single 4-byte slot, just store an immediate 0.  This will be
> better on x86 and x64, probably; on ARM it won't make a difference.
> 
> - If we'd end up with many stores, don't unroll the init loop completely
> like we do now, but instead generate a partially unrolled loop.  This is
> just good hygiene.

So:

1. an immediate 0 is being stored directly if there's only one dword slot;
2. if on 64-bit capable arch, .move64 is used
3. the store loop (inside the #ifdef js_64bit) is partially unrolled (the block is of 8 operations for readability)

If I didn't do something epically stupid here, in the final patch I'll unroll 32 operations-at-a-time on both the 64 and 32 bit ifdefs.

Michelangelo De Simone [Offline] [:mds]

Updated

•

7 years ago

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 10

•

7 years ago

(In reply to Michelangelo De Simone [:mds] from comment #9)
> (In reply to Lars T Hansen [:lth] from comment #4)
> 
> Draft patch; not for review just yet (I've r-'ed it myself to be sure).
> 
> > - If it's a single 4-byte slot, just store an immediate 0.  This will be
> > better on x86 and x64, probably; on ARM it won't make a difference.
> > 
> > - If we'd end up with many stores, don't unroll the init loop completely
> > like we do now, but instead generate a partially unrolled loop.  This is
> > just good hygiene.
> 
> So:
> 
> 1. an immediate 0 is being stored directly if there's only one dword slot;
> 2. if on 64-bit capable arch, .move64 is used
> 3. the store loop (inside the #ifdef js_64bit) is partially unrolled (the
> block is of 8 operations for readability)

Point 3 is not right.  What we need here is a loop that's unrolled *in the emitted code*, not in the compiler, which is what you have here.  (Sorry if I was unclear about that.)  That is, if we have fewer slots than the unrollLimit (say, 32 slots) then the emitted code is this:

   ;; zero is a local scratch register
   zero = 0
   store(zero, slot0)
   store(zero, slot1)
   store(zero, slot2)

but if we have more slots than the unrollLimit then the *emitted* code is roughly this:

   ;; zero, lim, and p are local scratch registers
   ;; numSlotsToInit and unrollLimit are compile-time constants

   #define WORDSIZE 8 // for example
   #define ITERATIONS numSlotsToInit / unrollLimit
   ASSERT(ITERATIONS > 0)

   p = pointer to the lowest address in locals area to init
   lim = p + ITERATIONS * unrollLimit * WORDSIZE

   zero = 0

  L1:
   ;; here we assume the offset can be encoded in the store instruction
   ;; at zero cost so these additions in the store instructions are implicit.
   ;; The loop body should be 8 store instructions, followed by a single
   ;; add-immediate / compare / branch triple.
   store(zero, p+0)
   store(zero, p+WORDSIZE)
   store(zero, p+WORDSIZE*2)
   ...
   store(zero, p+(unrollLimit-1)*WORDSIZE)
   p += unrollLimit*WORDSIZE
   if (p < lim) goto L1

   #define TAIL numSlotsToInit - ITERATIONS * unrollLimit

   store(zero, p+0)
   ...
   store(zero, p+(TAIL*WORDSIZE))

(Now I also wonder why I used "i+4" rather than "i" as the slot offset for the frame init in the existing code but that's my problem, not yours (yet).  I'll get back to you on this.)

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 11

•

7 years ago

A couple more remarks.

- On x64 there is a "move 32 bits sign extended to r/m64" instruction.  (REX.W + C7 /0 in the Intel docs.)
  Since we're storing zero, we can use this to store a single 8-byte constant zero on that architecture.
  This is obviously small potatoes; don't worry about it unless you fell like dealing with the extra
  complexity.  Notably this is an x64-only thing, no other architectures will benefit from it.

- In the existing initialization code, it is not obvious that the code clears the stack frame from the
  high addresses toward the low addresses and does not initialize the padding at the lowest addresses,
  if there is any, but that is what happens.  It would be nice to preserve the don't-init-the-padding
  behavior in the non-unrolled case but I won't insist on it (especially since we wastefully init any
  padding in the middle of the frame already).

- In your patch you have the line 
    masm.store32(Imm32(0), Address(StackPointer, localOffsetToSPOffset(varLow_)));
  That should almost certainly be varLow_ + 4, following the existing logic.

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/3-4/

Michelangelo De Simone [Offline] [:mds]

Comment 13

•

7 years ago

The patch emits a partially unrolled loop according to "UNROLL_LIMIT".

I've looked for generic ways to deal with stack pointers (growing upward/downward) according to the specific architecture but I haven't found anything clear so the patch may be wrong in certain conditions (on ARM the stack can grow either ways but I've only dealt with the downward case).

AFAICT modern *nixes on ARM(64) have the stack growing downward so this should be ok but what if it's not the case?

Hopefully this is a step in the right direction and sorry it took me so long: I've found myself unexpectedly booked with other things.

(In reply to Lars T Hansen [:lth] from comment #10)

> (Now I also wonder why I used "i+4" rather than "i" as the slot offset for

Likely explanation is that as the SP on x86/x86_64 grows downward you need to offset the address of 1 word otherwise the first address you write to would actually be the second and the last instruction might end up operating beyond the barrier (and likely segfault).

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/5-6/

Lars T Hansen [:lth]

Assignee

Comment 16

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review116336

The structure is right here.  There are some style issues and some bugs though.

::: js/src/wasm/WasmBaselineCompile.cpp:2181
(Diff revision 6)
>                default:
>                  MOZ_CRASH("Function argument type");
>              }
>          }
>  
> -        // Initialize the stack locals to zero.
> +        if (varHigh_ - varLow_)

Use "!=" because it's easier to understand and because it corresponds to the assert in emitInitStackLocals.

::: js/src/wasm/WasmBaselineCompile.cpp:7309
(Diff revision 6)
> +void
> +BaseCompiler::emitInitStackLocals()
> +{
> +    MOZ_ASSERT(varLow_ != varHigh_, "there should be stack locals to initialize");
> +
> +#ifndef UNROLL_LIMIT

Better style is to use a const uint32_t (or size_t, whatever you feel is appropriate) for UNROLL_LIMIT.  You can use a #define but we're not likely to change this anywhere except within the code, and if you use a #define you have to #undef it again later, or we risk breaking unified builds.

::: js/src/wasm/WasmBaselineCompile.cpp:7315
(Diff revision 6)
> +#define UNROLL_LIMIT 32
> +#endif
> +
> +    // local_slots represents the number of all the virtual
> +    // i32 locals.
> +    uint32_t local_slots = (varHigh_ - varLow_) / 4;

Do you really want to divide by 4 here and not by WORDSIZE?  Also the comment is incorrect, because there could be locals of other types.  Something like 'init_words' might be more indicative of the actual function.

::: js/src/wasm/WasmBaselineCompile.cpp:7317
(Diff revision 6)
> +
> +    // local_slots represents the number of all the virtual
> +    // i32 locals.
> +    uint32_t local_slots = (varHigh_ - varLow_) / 4;
> +
> +    if (!local_slots)

Can't happen because `varLow_` != `varHigh_` and alignment is at least 4.

::: js/src/wasm/WasmBaselineCompile.cpp:7322
(Diff revision 6)
> +    if (!local_slots)
> +        return;
> +
> +    // In case we have only one local to initialize, it's just
> +    // faster/easier to store an immediate zero.
> +    if (local_slots == 1) {

On x64 it is possible to do this even for eight bytes because we can move a 32-bit immediate sign-extended to 64 bits to memory directly.

::: js/src/wasm/WasmBaselineCompile.cpp:7327
(Diff revision 6)
> +    if (local_slots == 1) {
> +        masm.store32(Imm32(0), Address(StackPointer, localOffsetToSPOffset(varLow_ + 4)));
> +        return;
> +    }
> +
> +#if defined(JS_64BIT)

You have cases here and below for 32 and 64 bits - basically the word size of the architecture.  There is a poorly known trick here: on 64-bit systems, a Register (and a RegI32) is known to have a 64-bit personality, you don't need to use RegI64.  This would be more obvious if there were abstractions for this using the tag 'Ptr' or 'Word'.  But because of the trick, needI32() will allocate a 64-bit register on a 64-bit platform.  This removes the special-casing of allocating and freeing the "zero" register and means you don't have to go to allocGPR() for "p" and "lim".  If you introduce a storeToFrameWord() function you also get rid of the special cases for storeToFrameI{32,64}.

::: js/src/wasm/WasmBaselineCompile.cpp:7328
(Diff revision 6)
> +        masm.store32(Imm32(0), Address(StackPointer, localOffsetToSPOffset(varLow_ + 4)));
> +        return;
> +    }
> +
> +#if defined(JS_64BIT)
> +#define WORDSIZE 8

Again, use a const variable for WORDSIZE.

::: js/src/wasm/WasmBaselineCompile.cpp:7341
(Diff revision 6)
> +
> +    // Full unroll if we have less than UNROLL_LIMIT locals.
> +    // Keeping this case explicit, instead of merging it with the
> +    // partial unroll, helps reducing the emitted instructions.
> +    // In this case we emit at most UNROLL_LIMIT+2 instructions.
> +    if (local_slots < UNROLL_LIMIT)  {

See comment on local_slots above, we unroll only up to 16 instructions on x64 (though of course we'll still initialize up to 64 bytes this way).  Seems a bit unintuitive to me.

::: js/src/wasm/WasmBaselineCompile.cpp:7366
(Diff revision 6)
> +    Label L1;
> +    masm.bind(&L1);
> +
> +    // At this point the register 'p' contains the pointer to the first slot to initialize,
> +    // meaning the effective address of "varLow_".
> +    for (int32_t i = 0; i < UNROLL_LIMIT; ++i) {

Suppose the stack frame size is 128.  Then local_slots==32 which also equals UNROLL_LIMIT, so we come here, we don't take the fully-unrolled path.  Now we unroll 32 times (UNROLL_LIMIT), storing a word every time.  Suppose we're on a 64-bit systems.  Then we will store 8*64=256 bytes, scribbling over the stack frame.

Probably just a consequence of the mismatch between WORDSIZE and local_slots.

::: js/src/wasm/WasmBaselineCompile.cpp:7379
(Diff revision 6)
> +    // Adjust the base (p) for branching.
> +    masm.subPtr(ImmWord(UNROLL_LIMIT * WORDSIZE), p);
> +
> +    // The upper (lower) bound is computed from varHigh_: branch back to L1 if
> +    // p is still to hit lim.
> +    Register lim = allocGPR();

Stylistically I would move the allocation and computation of lim to where you compute p, since the computation is invariant and nothing interesting is going on with register pressure anyway - p and zero remain live and disjoint from lim.

Attachment #8823506 - Flags: review?(lhansen)

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/6-7/

Michelangelo De Simone [Offline] [:mds]

Comment 18

•

7 years ago

mozreview-review-reply

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review116336

> Can't happen because `varLow_` != `varHigh_` and alignment is at least 4.

I have a doubt here (warning: it might dull): if the alignment is "at least" 4, we're on JS_64, and we only have one 32 bit local to initialize, we'd have:

wordsize = 8
init_words = 4/8 = 0

In such a case we'd branch in the "full unroll" loop performing one single iteration and crossing the boundary by writing 8 bytes instead of 4.

Most likely I'm missing something exquisitely obvious here...

> On x64 it is possible to do this even for eight bytes because we can move a 32-bit immediate sign-extended to 64 bits to memory directly.

This means that masm.store32 will REX-prefix the opcode automagically if on JS_64?

> You have cases here and below for 32 and 64 bits - basically the word size of the architecture.  There is a poorly known trick here: on 64-bit systems, a Register (and a RegI32) is known to have a 64-bit personality, you don't need to use RegI64.  This would be more obvious if there were abstractions for this using the tag 'Ptr' or 'Word'.  But because of the trick, needI32() will allocate a 64-bit register on a 64-bit platform.  This removes the special-casing of allocating and freeing the "zero" register and means you don't have to go to allocGPR() for "p" and "lim".  If you introduce a storeToFrameWord() function you also get rid of the special cases for storeToFrameI{32,64}.

I've replaced everything with 'Register'; a clarification, though: you mean that the special handling for I32/I64 should only be performed by the storeToFrameWord() function (as I've done in the revised patch) or there's some other "poorly known trick" for masm.store32 too?:)

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/7-8/

Michelangelo De Simone [Offline] [:mds]

Comment 20

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review123240

Clearing r? myself: three tests are failing (baseline-opt.js, func.wast.js, get_local.wast.js); checking...

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/8-9/

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/9-10/

Lars T Hansen [:lth]

Assignee

Comment 23

•

7 years ago

Michelangelo, do you still want a review on this?  I see your previous comment says you're clearing the review, but the bug still has an r? flag on it.  (Sorry for taking so long to respond to this.)

Flags: needinfo?(mdesimone)

Michelangelo De Simone [Offline] [:mds]

Comment 24

•

7 years ago

The patch should be ready to review, at least it's passing the tests now. :)

Flags: needinfo?(mdesimone)

Lars T Hansen [:lth]

Assignee

Comment 25

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review130372

The overall logic seems right and the code is nice and clean.  There's a bug that needs fixing.  Other than that it would be nice to make storeToFrameWord pay for itself by using it more places (with a slightly different API) but if that's too awkward we can just inline it where it's used.

::: js/src/wasm/WasmBaselineCompile.cpp:631
(Diff revision 10)
>  
>      void storeToFrameF32(FloatRegister r, int32_t offset) {
>          masm.storeFloat32(r, Address(StackPointer, localOffsetToSPOffset(offset)));
>      }
>  
> +    void storeToFrameWord(Register r, int32_t offset) {

This function isn't currently paying for itself, having just a single use.  I think I had hoped that it could also be used in the unrolled loops, not just in the fast base case.

Could it take as now a register r holding the value to store, and an Address for the second operand?  In that case it should be usable from more sites, I think, and none of the ifdefs below for stores need exist.  (If that becomes too messy maybe you should just re-inline this function where it's called.)

::: js/src/wasm/WasmBaselineCompile.cpp:7352
(Diff revision 10)
> +
> +    Register p = needI32();
> +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varLow_ + wordsize)), p);
> +
> +    RegI32 lim = needI32();
> +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varHigh_)), lim);

I think this is wrong.  The loop test tests whether p has advanced below lim, and lim is exactly beyond the last address we should be initializing.  But if the unroll_limit does not divide the number of words evenly, then we'll do an extra iteration before we then do the last stores following the loop.  Suppose init_words is 33 and wordsize is 4.  We'll do one iteration and then subtract 32*4 from p.  p is now still above lim, so we'll do another 32 stores.  And then we'll do the last unrolled store.

::: js/src/wasm/WasmBaselineCompile.cpp:7358
(Diff revision 10)
> +
> +    Label L1;
> +    masm.bind(&L1);
> +
> +    // At this point the register 'p' contains the pointer to the first slot to initialize,
> +    // meaning the effective address of "varLow_".

"... on the first iteration."  Or something to that effect, since p will be updated at the end of the loop.

::: js/src/wasm/WasmBaselineCompile.cpp:7380
(Diff revision 10)
> +    // p is still to hit lim.
> +    masm.branchPtr(Assembler::LessThan, lim, p, &L1);
> +
> +    // Zero the remaining locals.
> +    const uint32_t iterations = init_words / unroll_limit;
> +    const uint32_t tail = init_words - (iterations * unroll_limit);

Isn't this just init_words % unroll_limit?

Attachment #8823506 - Flags: review?(lhansen) → review-

Michelangelo De Simone [Offline] [:mds]

Comment 26

•

7 years ago

(In reply to Lars T Hansen [:lth] from comment #25)

> ::: js/src/wasm/WasmBaselineCompile.cpp:7352
> (Diff revision 10)
> > +
> > +    Register p = needI32();
> > +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varLow_ + wordsize)), p);
> > +
> > +    RegI32 lim = needI32();
> > +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varHigh_)), lim);
> 
> I think this is wrong.  The loop test tests whether p has advanced below
> lim, and lim is exactly beyond the last address we should be initializing. 
> But if the unroll_limit does not divide the number of words evenly, then
> we'll do an extra iteration before we then do the last stores following the
> loop.  Suppose init_words is 33 and wordsize is 4.  We'll do one iteration
> and then subtract 32*4 from p.  p is now still above lim, so we'll do
> another 32 stores.  And then we'll do the last unrolled store.

Does it make more sense to assert INITWORDS % WORDSIZE == 0, have something like ceiling(lim) instead of lim, or something else?

Flags: needinfo?(lhansen)

Michelangelo De Simone [Offline] [:mds]

Comment 27

•

7 years ago

To clarify the "ceiling(lim)" comment above: I mean setting lim as the next address divisible by wordsize.

Lars T Hansen [:lth]

Assignee

Comment 28

•

7 years ago

I should think so.  If, in the initialization of lim, instead of varHigh_ one uses a value as you describe, then we'll get the correct number of unrolled iterations, followed by the correct unrolled tail.  It should not be necessary to do more work at runtime, only to adjust how we compute the loop exit condition in the compiler.

Flags: needinfo?(lhansen)

Comment hidden (mozreview-request)

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/102034/diff/10-11/

Lars T Hansen [:lth]

Assignee

Comment 30

•

7 years ago

mozreview-review

Comment on attachment 8823506 [details]
Bug 1316820 - [WASM] The way stack locals are initialized has been optimized.

https://reviewboard.mozilla.org/r/102034/#review160440

Apart from one bug stemming from confusion between varHigh and varLow, and some minor cleanup, this looks pretty close to done.  Let's do one more iteration, though.

It's tricky to test this directly without looking at assembly language output for specific cases and we don't have any infrastructure for that...  But it would be good if you could write some wasm test cases with a variety of local variables so that we could at least look at the output manually.  Let me know if you don't know how to do that, it's not actually difficult.

::: js/src/wasm/WasmBaselineCompile.cpp:2263
(Diff revision 11)
>                default:
>                  MOZ_CRASH("Function argument type");
>              }
>          }
>  
> -        // Initialize the stack locals to zero.
> +        if (varHigh_ != varLow_)

Suggest you keep this as varLow_ < varHigh_, as it was.  See my next comment.

::: js/src/wasm/WasmBaselineCompile.cpp:7392
(Diff revision 11)
>  }
>  
> +void
> +BaseCompiler::emitInitStackLocals()
> +{
> +    MOZ_ASSERT(varLow_ != varHigh_, "there should be stack locals to initialize");

Suggest you assert varLow_ < varHigh_ here so that we not only check that they are different but that they have the correct relationship.  varLow_ is the low byte offset in the frame for true locals, varHigh_ the high offset.  (localOffsetToSPOffset will invert this relationship.)

::: js/src/wasm/WasmBaselineCompile.cpp:7418
(Diff revision 11)
> +    // Full unroll if we have less than unroll_limit locals.
> +    // Keeping this case explicit, instead of merging it with the
> +    // partial unroll, helps reducing the emitted instructions.
> +    // In this case we emit at most unroll_limit+2 instructions.
> +    if (init_words < unroll_limit)  {
> +        for (int32_t i = varLow_; i < varHigh_; i += wordsize)

Coding style requires a { here and a } after the closing line of this loop.  And the two lines should be indented properly.

::: js/src/wasm/WasmBaselineCompile.cpp:7430
(Diff revision 11)
> +        return;
> +    }
> +
> +    Register p = needI32();
> +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varLow_ + wordsize)), p);
> +

Here I would MOZ_ASSERT((varHigh_ - varLow_) % wordsize == 0), and if that fails we'll fix it elsewhere in the compiler, I can't think of a reason why that should not be true.

::: js/src/wasm/WasmBaselineCompile.cpp:7431
(Diff revision 11)
> +    }
> +
> +    Register p = needI32();
> +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varLow_ + wordsize)), p);
> +
> +    uint32_t fixValue = (varLow_ - varHigh_) % (unroll_limit * wordsize);

This must be varHigh_ - varLow_ since varLow_ < varHigh_, see comment above.

Then we see that this is the same value as init_words % unroll_limit, which is the same value you also use for tail, further down.

::: js/src/wasm/WasmBaselineCompile.cpp:7432
(Diff revision 11)
> +
> +    Register p = needI32();
> +    masm.computeEffectiveAddress(Address(StackPointer, localOffsetToSPOffset(varLow_ + wordsize)), p);
> +
> +    uint32_t fixValue = (varLow_ - varHigh_) % (unroll_limit * wordsize);
> +    uint32_t newVarHigh_ = varHigh_ + (fixValue * wordsize);

And so this is really varHigh_ - (fixValue * wordsize).

Attachment #8823506 - Flags: review?(lhansen)

Michelangelo De Simone [Offline] [:mds]

Updated

•

7 years ago

Assignee: michel → nobody

Lars T Hansen [:lth]

Assignee

Comment 31

•

7 years ago

Would be a shame not to land this now that we're so close.

Assignee: nobody → lhansen

Status: NEW → ASSIGNED

Lars T Hansen [:lth]

Assignee

Comment 32

•

7 years ago

Attached patch bug1316820-rabaldr-frame-init.patch (obsolete) — Details — Splinter Review

(Substantially Michelangelo's work.)

I believe this is correct and complete, it is a litte bit more complicated than I anticipated because of a couple of corner cases I did not see before.

Still working on comprehensive test cases.  This passes local testing with the wasm test suite, but I expect we have very few tests with enough locals to really stress test this.

Attachment #8823506 - Attachment is obsolete: true

Lars T Hansen [:lth]

Assignee

Comment 33

•

7 years ago

Attached file bug1316820-test-cases.tar — Details

Test cases for various unroll lengths.  Not meant to land, but for inspecting the output.  See comments in the tests for how to do that.

Lars T Hansen [:lth]

Assignee

Comment 34

•

7 years ago

Attached patch bug1316820-rabaldr-frame-init.patch — Details — Splinter Review

Implement the frame zeroing in Rabaldr as a partially unrolled loop.

As explained in comment 0 and comment 1, the justification for this change is:

- optimize the one-local case to store a literal
- use 8-byte stores when possible to reduce instruction count (and open
  up for even wider stores later, I left a comment behind for this)
- improve code locality by not generating a ton of code even if there are many
  locals

Comments in the code should explain pretty well how the unrolling is being done.

This is not a high priority work exactly but since we had an almost-finished patch it seems a shame to waste it.

Regarding testing, I have not written any landable test cases for this.  Our test suite + real apps should test it fairly well, as for so many other aspects of codegen.  The test cases I have attached will just substantiate - by inspection - that the unrolling works as it should.  This set can be expanded in various ways.

Attachment #8900242 - Attachment is obsolete: true

Attachment #8900415 - Flags: review?(bbouvier)

Benjamin Bouvier [:bbouvier] (inactive)

Comment 35

•

7 years ago

Comment on attachment 8900415 [details] [diff] [review]
bug1316820-rabaldr-frame-init.patch

Review of attachment 8900415 [details] [diff] [review]:
-----------------------------------------------------------------

Looks sensible. Thanks for the patch.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +7484,5 @@
> +BaseCompiler::emitInitStackLocals()
> +{
> +    MOZ_ASSERT(varLow_ < varHigh_, "there should be stack locals to initialize");
> +
> +    static const uint32_t wordSize = JS_BITS_PER_WORD/8;

nit: spaces around /

Is that different from sizeof(void*)?

@@ +7541,5 @@
> +    // stores, though.)
> +
> +    // Fully-unrolled case.
> +
> +    if (initWords < 2*unrollLimit)  {

nit: spaces around *

Attachment #8900415 - Flags: review?(bbouvier) → review+

Lars T Hansen [:lth]

Assignee

Comment 36

•

7 years ago

(In reply to Benjamin Bouvier [:bbouvier] (away from Aug 26th to Sept 10th) from comment #35)
> Comment on attachment 8900415 [details] [diff] [review]
> bug1316820-rabaldr-frame-init.patch
> 
> Review of attachment 8900415 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> > +    static const uint32_t wordSize = JS_BITS_PER_WORD/8;
> 
> Is that different from sizeof(void*)?

No, it isn't.  Good point.

Pulsebot

Comment 37

•

7 years ago

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/633389558a27
Rabaldr, optimized zeroing of locals. patch=mds+lth, r=bbouvier

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 38

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/633389558a27

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox57: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Bug 1316820 - [WASM] The way stack locals are initialized has been optimized. 7 years ago Michelangelo De Simone [Offline] [:mds] 59 bytes, text/x-review-board-request		Details
bug1316820-rabaldr-frame-init.patch 7 years ago Lars T Hansen [:lth] 5.34 KB, patch		Details \| Diff \| Splinter Review
bug1316820-test-cases.tar 7 years ago Lars T Hansen [:lth] 20.00 KB, application/x-tar		Details
bug1316820-rabaldr-frame-init.patch 7 years ago Lars T Hansen [:lth] 5.93 KB, patch	bbouvier : review+	Details \| Diff \| Splinter Review