Closed Bug 1232205 Opened 8 years ago Closed 8 years ago

wasm: Baseline JIT

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: lth, Assigned: lth)

References

()

Details

(Whiteboard: [games:p1][platform-rel-Games])

Attachments

(9 files, 14 obsolete files)

4.09 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
9.89 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
3.79 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
9.28 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
22.63 KB, patch
luke
: review+
lth
: review+
Details | Diff | Splinter Review
14.77 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
13.42 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
4.15 KB, patch
luke
: review+
Details | Diff | Splinter Review
191.86 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
Implement the baseline JIT compiler for webassembly.
I think I signed up for implementing Debugger support for this.
Depends on: 1259295
Summary: Baseline JIT for webassembly → wasm: Baseline JIT
Attached file bug1232205-wasm-baseline.bundle (obsolete) —
A bundle containing my patch queue, including the 2016-04-25 postorder patch at the root.  x64 only, and still somewhat messy.  See comments at the head of asmjs/WasmBaselineCompile.cpp for more status information.

This passes all our in-tree wasm tests (with an adjustment to hasI64, included here) and all our in-tree asm.js tests except (a) those that use non-signaling bounds or interrupt checks, and (b) one modulo test, commented out here.
Depends on: 1272252
Attached file bug1232205-wasm-baseline.bundle (obsolete) —
Current snapshot.
Attachment #8745230 - Attachment is obsolete: true
Attached file bug1232205-wasm-baseline.bundle (obsolete) —
Current snapshot.  A little messy again, but i32, i64, f32, and f64 are in place except for copysign, trunc, nearest, and some corner cases.  Passes almost all wasm and asm.js tests in the current repo.  No SIMD, no atomics; x64-only.
Attachment #8751658 - Attachment is obsolete: true
Depends on: 1272640
Attached patch combined.patch (obsolete) — Splinter Review
For convenience for people who want to take a peek, here's a combined version of the current bundle as a single patchfile.
Attached file bug1232205-wasm-baseline.bundle (obsolete) —
Cleaned up and rebased to current m-i.  A few tests remain commented out (see separate patches in this set); it passes all other wasm and asm.js tests and runs AngryBots.
Attachment #8754941 - Attachment is obsolete: true
Attachment #8755117 - Attachment is obsolete: true
Attached file bug1232205-wasm-baseline.bundle (obsolete) —
Additionally compiles on win32, win64, linux32, linux32-arm-sim, linux64, linux64-arm64-sim, mac64.
Attachment #8755356 - Attachment is obsolete: true
Attached patch bug1232205-infrastructure.patch (obsolete) — Splinter Review
Part 1, infrastructure.

This hooks the baseline compiler into the rest of the engine and creates the necessary infrastructure to trigger baseline compilation when appropriate, but the baseline compiler is just a stub (and the mechanism that determines whether to trigger it always returns false).
Attachment #8756853 - Flags: review?(bbouvier)
Might it be practical to note ranges of local variables that hold a constant and defer emitting the constant until necessary?

The main motivation would be an alternative to determining the live ranges where a `set_local` operation dominates uses and thus avoid flushing a register to memory. In some cases the value might be unused and then the constant need never be materialized.

This would support a `get_and_zero_local` operator that could be used to help reduce flushing of locals cached in registers. Oh, and could the baseline compiler cache local variables in registers too, not just the values stack?

With this optimization I could explore the expressionless encoding again and since it is not constrained by the stack ordering it might be able to optimize operator ordering to reduce register pressure and then this may well be the encoding that gives the best performance from a baseline compiler. Something to explore.

Even being able to defer emitting a constant in a local variable would help work around the stack ordering constraints and could help reduce register pressure, so it seems useful even without the above use case.

For decoding to SSA, if wasm code can use `get_and_zero_local` on the last use of a local then at control flow merge points the locals not live will be flagged as constant zero rather than having potentially different definitions, and this would avoid the need to emit a phi for them rather a single definition could be carried forward flagged as zero (do you think that would work?).

What do you think?
(In reply to Douglas Crosher [:dougc] from comment #11)

> Might it be practical to note ranges of local variables that hold a constant
> and defer emitting the constant until necessary?

While it is certainly possible for the wasm code to contain assignments of constants to locals, I would expect the generator of the wasm code to perform constant propagation and get rid of the bulk of such code.  (Initializations excepted.)

I expect that for your expressionless encoding this will be a much more acute problem, but as it is not a problem for the existing wasm format I'm probably disinclined to try to do anything about it.

> Oh, and could the
> baseline compiler cache local variables in registers too, not just the
> values stack?

Yes, I do want to do something in that direction, but I've not decided exactly how to do it.  The two obvious choices are to assign some fixed locals to registers (eg incoming parameters that are in registers + the first few locals, depending on how many registers we want to use for this) or to cache local variables in straight-line code.  A combination of the two is appealing.  It may be a little while before I start looking into this optimization though, there are many other things that are higher priority.

> With this optimization I could explore the expressionless encoding again and
> since it is not constrained by the stack ordering it might be able to
> optimize operator ordering to reduce register pressure and then this may
> well be the encoding that gives the best performance from a baseline
> compiler. Something to explore.

Be my guest.  The baseline compiler's goal is to generate acceptable code at high speed - we don't want "good" code if that will significantly impact the compiler's speed.
(In reply to Douglas Crosher [:dougc] from comment #11)

> Might it be practical to note ranges of local variables that hold a constant
> and defer emitting the constant until necessary?
...
> Oh, and could the baseline compiler cache local variables in registers too, not just the values stack?

These might be two sides of the same coin, as they are about having alternative "locations" for locals.  The baseline compiler could probably do something clever in straight-line code and notably across calls, but it would probably want to spill at control flow boundaries (if / block / loop).
(In reply to Lars T Hansen [:lth] from comment #12)
> (In reply to Douglas Crosher [:dougc] from comment #11)
> 
> > Might it be practical to note ranges of local variables that hold a constant
> > and defer emitting the constant until necessary?
> 
> While it is certainly possible for the wasm code to contain assignments of
> constants to locals, I would expect the generator of the wasm code to
> perform constant propagation and get rid of the bulk of such code. 
> (Initializations excepted.)

Good point, but fwiw here are some numbers:
zlib: 10% of set_locals have a constant value.
AngryBots: also 10% of set_locals have a constant value.

Guess this is relatively small, but these numbers help a little to make a case for this optimization, just as constants on the values stack are optimized.

> I expect that for your expressionless encoding this will be a much more
> acute problem, but as it is not a problem for the existing wasm format I'm
> probably disinclined to try to do anything about it.

Yes, it's an 'acute problem' for the expressionless style. Keep in might that some producers might want to emit code in this style to avoid the burden of shoe horning their output into the stack order.

Another optimization that would help for the expressionless style is to note the result local variables for operations in general as these locals would then be known to not be live after reading the arguments and might help avoid flushing some values to memory. This might require looking ahead with the current encoding, so might be a bit of a burden.

> > Oh, and could the
> > baseline compiler cache local variables in registers too, not just the
> > values stack?
> 
> Yes, I do want to do something in that direction, but I've not decided
> exactly how to do it.  The two obvious choices are to assign some fixed
> locals to registers (eg incoming parameters that are in registers + the
> first few locals, depending on how many registers we want to use for this)
> or to cache local variables in straight-line code.  A combination of the two
> is appealing.  It may be a little while before I start looking into this
> optimization though, there are many other things that are higher priority.

That sounds great. I think binaryen can already sort locals to bias high use frequency locals into low indexes which might fit with this strategy. But a LRU cache might be simple enough and the registers are a resource that might be best shared between local variables and the expression stack values so might need a common cache.

> > With this optimization I could explore the expressionless encoding again and
> > since it is not constrained by the stack ordering it might be able to
> > optimize operator ordering to reduce register pressure and then this may
> > well be the encoding that gives the best performance from a baseline
> > compiler. Something to explore.
> 
> Be my guest.  The baseline compiler's goal is to generate acceptable code at
> high speed - we don't want "good" code if that will significantly impact the
> compiler's speed.

Better code at high compilation speed is what I have in mind. So the producer would have the burden of replacing get_local by get_and_zero_local and all the baseline compiler needs to do is defer emitting these zero constants which does not appear to be a large performance burden and something already done for the expression stack values.

It will be a smaller step for me after the baseline compiler can cache locals in registers, so if that is on the horizon then I will keep following.

But I can start exploring the wasm input code generation and studying what potential there is to reduce the live set by reordering operators etc.

Do you see any merit in the approach though?

Do you think it could minimize unnecessary writes of values cached in registers?

Can you see any obvious and significant cases it would not handle?

Do you think there is any potential for a high speed compiler to emit better code if the producer optimizes the operation order to minimize the live set which could reduce register pressure?
(In reply to Douglas Crosher [:dougc] from comment #14)

> zlib: 10% of set_locals have a constant value.
> AngryBots: also 10% of set_locals have a constant value.

That's interesting, but it's hard to know what that means without knowing what happens next.  If the locals are read in the same basic block then we could optimize but I'd expect the C++ compiler to have constant-propagated already.

> Another optimization that would help for the expressionless style is to note
> the result local variables for operations in general as these locals would
> then be known to not be live after reading the arguments and might help
> avoid flushing some values to memory. This might require looking ahead with
> the current encoding, so might be a bit of a burden.

I see the appeal of that, but I don't quite see it happening :)

> But I can start exploring the wasm input code generation and studying what
> potential there is to reduce the live set by reordering operators etc.
> 
> Do you see any merit in the approach though?
> 
> Do you think it could minimize unnecessary writes of values cached in
> registers?
> 
> Can you see any obvious and significant cases it would not handle?

For all those questions: I don't have the data.  It's somewhat plausible.  It may or may not matter.

> Do you think there is any potential for a high speed compiler to emit better
> code if the producer optimizes the operation order to minimize the live set
> which could reduce register pressure?

Yes, I think that is probably the case.  But I don't know if the producer should do that, because surely the producer wants to produce code that makes the 2nd-tier jit produce better code?  After all the point of the baseline compiler is to get the application up and running (throughput), but the expectation is that all truly performance-sensitive code will hit the 2nd-tier jit pretty quickly.
(In reply to Lars T Hansen [:lth] from comment #15)
> (In reply to Douglas Crosher [:dougc] from comment #14)
...
> > Do you think there is any potential for a high speed compiler to emit better
> > code if the producer optimizes the operation order to minimize the live set
> > which could reduce register pressure?
> 
> Yes, I think that is probably the case.  But I don't know if the producer
> should do that, because surely the producer wants to produce code that makes
> the 2nd-tier jit produce better code?  After all the point of the baseline
> compiler is to get the application up and running (throughput), but the
> expectation is that all truly performance-sensitive code will hit the
> 2nd-tier jit pretty quickly.

fwiw Nothing comes to mind that would disadvantage a higher performance compiler by having the operators ordered to reduce the live set. Might argue that the produce should not have to bother with this and should be free to arrange the order to optimize encoded size, or even that the encoding should be optimize for interpretation. Anyway it might be interesting to get some data, and I'll keep exploring it.
Comment on attachment 8756853 [details] [diff] [review]
bug1232205-infrastructure.patch

Review of attachment 8756853 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks! I assume this would land with the entire baseline compiler set of patches. Otherwise, it'd be nice to know how the booleans atomics/simdObserved are getting used, or just defer them to another patch.

::: js/src/asmjs/AsmJS.cpp
@@ +1682,5 @@
>      ImportMap             importMap_;
>      ArrayViewVector       arrayViews_;
>      bool                  atomicsPresent_;
> +    bool                  atomicsObserved_; // Within function body
> +    bool                  simdObserved_;    // Within function body

Might be useful to indicate this can be set before any function is seen (e.g. globals section) but will be unset before we run into the first function.

@@ +2529,5 @@
>  ExtractSimdValue(ModuleValidator& m, ParseNode* pn)
>  {
>      MOZ_ASSERT(IsSimdLiteral(m, pn));
>  
> +    m.simdObserved();

nit: setSimdObserved()

::: js/src/asmjs/WasmBaselineCompile.h
@@ +1,4 @@
> +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*-
> + * vim: set ts=8 sts=4 et sw=4 tw=99:
> + *
> + * Copyright 2015 Mozilla Foundation

(uber nit: 2016)

@@ +20,5 @@
> +#define asmjs_wasm_baseline_compile_h
> +
> +#include "asmjs/WasmBinary.h"
> +#include "asmjs/WasmIonCompile.h"
> +#include "jit/MacroAssembler.h"

jit/MacroAssembler.h looks like it's not needed right now.

::: js/src/asmjs/WasmGenerator.h
@@ +260,5 @@
> +    void setUsesAtomics() {
> +        usesAtomics_ = true;
> +    }
> +
> +    bool usesSignalsForInterrupts() const {

I guess the purpose of usesSimd(), usesAtomics(), usesSignalsForInterrupts() is to prevent baseline compilation of functions, right? It is not used right now, but I am pretty sure it is going to get used in the patch series, so fine to keep in this patch.

Do we abort baseline compilation whenever any of these three booleans is set to true? If so, we could just have one boolean canBaselineCompile. Or something else? I would love to see how these are used in BaselineCanCompile already.

::: js/src/asmjs/WasmIonCompile.cpp
@@ +3475,5 @@
> +      case wasm::IonCompileTask::CompileMode::Ion:
> +        return wasm::IonCompileFunction(task);
> +      case wasm::IonCompileTask::CompileMode::Baseline:
> +        return wasm::BaselineCompileFunction(task);
> +      default:

There are not too many modes: can we just explicitly mention None here, so that any new mode (!) would trigger a compilation warning/error, please?
Attachment #8756853 - Flags: review?(bbouvier) → review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #17)
> Comment on attachment 8756853 [details] [diff] [review]
> bug1232205-infrastructure.patch
> 
> Thanks! I assume this would land with the entire baseline compiler set of
> patches. Otherwise, it'd be nice to know how the booleans
> atomics/simdObserved are getting used, or just defer them to another patch.

This will not land by itself, not to worry.

> Do we abort baseline compilation whenever any of these three booleans is set
> to true? If so, we could just have one boolean canBaselineCompile. Or
> something else? I would love to see how these are used in BaselineCanCompile
> already.

The point here is to defer the decision to the baseline compiler, which knows what it can do on the particular platform; these flags may be used by the baseline compiler, or they may not - it is platform dependent.  (Right now they are used on x64 but not elsewhere because the baseline compiler turns itself off on other platforms, for example.)  This is a moderately clean separation of concerns.

The followup patch should arrive today or tomorrow, I'm cleaning it up now.
Patch 2/8: Define AllSingleMask on all platforms.

The baseline compiler uses this to manage float32 registers.  It was previously defined only on MIPS (!) but I've been unable to find a reason why it could not be defined generally.
Attachment #8755851 - Attachment is obsolete: true
Attachment #8757946 - Flags: review?(bbouvier)
Patch 3/8: Factor out int64ToFloatingPoint and truncateToInt64, on x64.

This code was previously private to CodeGenerator-x64, but I've moved it to  MacroAssembler-x64 so that the baseline compiler can also use it.

For this and the following two patches I had a choice: I could move the code to the MacroAssembler, or I could turn it into a static method on the CodeGenerator.  The former seemed more natural, but feedback is welcome.
Attachment #8757947 - Flags: review?(bbouvier)
Patch 4/8: Factor out floating min and max, on x86-shared.

Again, code private to CodeGenerator-x86-shared, moved to MacroAssembler-x86-shared so that the baseline compiler can access it.
Attachment #8757948 - Flags: review?(bbouvier)
Attached patch bug1232205-ool-truncate.patch (obsolete) — Splinter Review
Patch 5/8: Factor outOfLineWasmTruncateCheck and outOfLineTruncateSlow out from CodeGenerator-x86-shared to MacroAssembler-x86-shared.

Same drill as previous two patches.
Attachment #8757949 - Flags: review?(bbouvier)
Patch 6/8: Let the jit-test driver recognize a directive to run wasm-baseline tests, and make it trigger wasm-baseline tests if also-noasmjs is seen.

The purpose here is to allow wasm tests to be flagged as test-also-wasm-baseline (see next patch), so that we test the baseline compiler when possible.

But since we really want to test the baseline compiler's ability to handle generated asm.js code we take test-also-noasmjs as a signal that the file contains asm.js code, and just run the wasm baseline compiler on it.  This isn't the only way to do it, but it seemed reasonable.
Attachment #8757951 - Flags: review?(bbouvier)
Patch 7/8: add test-also-wasm-baseline directives to the wasm test cases.
Attachment #8757952 - Flags: review?(bbouvier)
Patch 8/8: Wasm baseline compiler (only).

This is a big one, so let me offer a few guideposts.

- The structure is very similar to WasmIonCompile.cpp, and where it is
  not it's mainly for performance reasons (eg, there's an emitBody that
  contains the decoding loop, not an emitExpr that is called for each
  expression) and frequently based on profiling results.

- There is plenty of in-line platform-ifdeffed masm code.  As we add more
  platforms much of this code will migrate into Masm itself (and I'll be
  taking on that job).  A simple example: xor32 is x86-only today but that's
  just arbitrary.  A more complicated example: anything to do with 64-bit math.
  I mention this only so that not too many eyebrows are raised at this,
  it is IMO a reasonable design at the moment - but not long term.

- There are *many* "TODO / OPTIMIZE" comments.  I intend to fix all of these
  (one way or another) but landing the compiler does not block on those,
  and having the comments is useful.  I *can* file bugs for them individually
  but wasn't planning to do that for a while yet.

- The thing I am most afraid of here is whether I'm handling calls
  properly - this means frame setup/teardown and register management.
  The register allocator currently flushes its value stack before
  calls so no registers are live across a call, but remember that we can
  intercall between Ion and Baseline code, so things need to be done
  properly.

(I fear these may be like David Lynch's hints to understanding Mulholland Drive: "Observe that the lamp is red.")
Attachment #8757957 - Flags: review?(bbouvier)
Depends on: 1277008
Depends on: 1277011
Attached patch bug1232205-saveregs-bugfix.patch (obsolete) — Splinter Review
Patch 9/8: Small bug fix.

This fixes a bug that was discovered while testing on x86 - saving and restoring registers around an OOL call mistakenly preserved the return register.

(I can roll this into the baseline compiler if you prefer.)
Attachment #8758613 - Flags: review?(bbouvier)
Comment on attachment 8757946 [details] [diff] [review]
bug1232205-allsinglemask.patch

Review of attachment 8757946 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! What about MIPS?

::: js/src/jit/arm/Architecture-arm.h
@@ +291,5 @@
>      static uint32_t ActualTotalPhys();
>  
>      typedef uint64_t SetType;
>      static const SetType AllDoubleMask = ((1ull << 16) - 1) << 32;
> +    static const SetType AllSingleMask = ((1ull << 32) - 1);

(think the parens around the subtract are not needed, but...)

@@ +296,1 @@
>      static const SetType AllMask = ((1ull << 48) - 1);

... could we take advantage of this patch to make this code easier to understand for newcomers?

Double check me, but I think we'd have:

AllDoubleMask = ((1ull << TotalDouble) - 1) << TotalSingle;
AllSingleMask = (1ull << TotalSingle) - 1;
AllMask = (1ull << invalid_freg); // not sure this one makes it clearer
Attachment #8757946 - Flags: review?(bbouvier) → review+
Comment on attachment 8757947 [details] [diff] [review]
bug1232205-long-to-float-and-back.patch

Review of attachment 8757947 [details] [diff] [review]:
-----------------------------------------------------------------

I think we could split the masm function and not need to have inputIsUnsigned/inputIsDouble/outputIsDouble args, to make the API simpler.

If that means having many dummy MacroAssembler implementations for other platforms, maybe the check_masm_style annotations could be extended to provide default implementations which do MOZ_CRASH("NYI").

::: js/src/jit/x64/MacroAssembler-x64.h
@@ +864,5 @@
>      void loadConstantSimd128Int(const SimdConstant& v, FloatRegister dest);
>      void loadConstantSimd128Float(const SimdConstant& v, FloatRegister dest);
>  
> +    void convertInt64ToFloatingPoint(Register input, bool inputIsUnsigned,
> +                                     FloatRegister output, bool outputIsDouble);

We have an opportunity to split the code even better here: how about removing the outputIsDouble and splitting this method into

convertInt64ToFloat32(...)
convertInt64ToDouble(...)

Or even better: convert{Int64,UInt64}To{Float32,Double}(...)
Even more methods, but cleaner separation of types.

@@ +867,5 @@
> +    void convertInt64ToFloatingPoint(Register input, bool inputIsUnsigned,
> +                                     FloatRegister output, bool outputIsDouble);
> +    void wasmTruncateToInt64(FloatRegister input, bool inputIsDouble, Register output,
> +                             bool isUnsigned, Label* oolEntry, Label* oolRejoin,
> +                             FloatRegister tempDouble);

Ditto: wasmTruncate{Float32,Double}To{Uint64,Int64}
Attachment #8757947 - Flags: review?(bbouvier)
Comment on attachment 8757948 [details] [diff] [review]
bug1232205-floating-min-max.patch

Review of attachment 8757948 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: js/src/jit/x86-shared/MacroAssembler-x86-shared.cpp
@@ +358,5 @@
> +
> +    bind(&done);
> +}
> +
> +

nit: extra white line

::: js/src/jit/x86-shared/MacroAssembler-x86-shared.h
@@ +102,5 @@
>  
>      bool asmMergeWith(const MacroAssemblerX86Shared& other);
>  
> +    // Evaluate first = minmax<isMax>(first, second).  Check for equality and NaN if
> +    // specialHandling is true.

specialHandling is actually special handling for NaN values only (not for equality). Could we rename it here and in the definitions? maybeNaNInput, inputMaybeNaN, handleNaN, or any other more specific idea you'd have.
Attachment #8757948 - Flags: review?(bbouvier) → review+
Patch 8/8, v2: Rolls in the bugfix (see patch 9/8 above) and also some desirable cleanup.

Luke, adding you as a reviewer after discussions with Benjamin.
Attachment #8757957 - Attachment is obsolete: true
Attachment #8758613 - Attachment is obsolete: true
Attachment #8757957 - Flags: review?(bbouvier)
Attachment #8758613 - Flags: review?(bbouvier)
Attachment #8758708 - Flags: review?(luke)
Attachment #8758708 - Flags: review?(bbouvier)
Comment on attachment 8757952 [details] [diff] [review]
bug1232205-test-directives.patch

Review of attachment 8757952 [details] [diff] [review]:
-----------------------------------------------------------------

What if, instead, we just tested baseline for everything under jit-tests/tests/asm.js?
Comment on attachment 8757949 [details] [diff] [review]
bug1232205-ool-truncate.patch

Review of attachment 8757949 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: js/src/jit/MacroAssembler.cpp
@@ +1807,5 @@
> +                                      bool compilingAsmJS)
> +{
> +#if defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_ARM64)
> +    if (needFloat32Conversion) {
> +        convertFloat32ToDouble(src, ScratchDoubleReg);

While we're here, can you use a ScratchScope for the double scratch, please?

@@ +1813,5 @@
> +    }
> +#else
> +    FloatRegister srcSingle = src.asSingle();
> +    if (needFloat32Conversion) {
> +        MOZ_ASSERT(src.isSingle());

pre-existing: out of curiosity, shouldn't this pass even if we use srcSingle = src? If needFloat32Conversion and the input is not into a float32 register, we're gonna have a bad time. Otherwise, this assertion is self-evident because of the asSingle() call before.

::: js/src/jit/x86-shared/MacroAssembler-x86-shared.cpp
@@ +423,5 @@
> +    // Handle special values (not needed for unsigned values).
> +    if (!isUnsigned) {
> +        if (toType == MIRType::Int32) {
> +            // MWasmTruncateToInt32
> +            if (fromType == MIRType::Double) {

I thought to myself it'd be nice to have a clean API for each cross-product of types here too, but there is some common code (check for NaN, jump to error) that would be duplicated in each combination. It's a bit ugly though, but well. If you feel like doing it, please submit another patch and r?me (you'd get bonus points, eternal glory and whatnot).
Attachment #8757949 - Flags: review?(bbouvier) → review+
Comment on attachment 8757951 [details] [diff] [review]
bug1232205-test-driver.patch

Review of attachment 8757951 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC: it'd be better to have per-directory directives that add a list of other testing modes to use for a given test file. Lars agreed to do this later. In the meanwhile, let's have the most test coverage we can and use this. The implication can go away when we have per-directory directives (and test-also-noasmjs and test-also-wasm-baseline could go too). Thanks for the patch.
Attachment #8757951 - Flags: review?(bbouvier) → review+
Comment on attachment 8757952 [details] [diff] [review]
bug1232205-test-directives.patch

Review of attachment 8757952 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks. Out of curiosity, would putting |test-also-wasm-baseline| in lib/wasm.js work too? I don't see any reason why it would, but the testing harness has surprised me a few times already with nice features :)
Attachment #8757952 - Flags: review?(bbouvier) → review+
Comment on attachment 8756853 [details] [diff] [review]
bug1232205-infrastructure.patch

Review of attachment 8756853 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by nits:

::: js/src/asmjs/AsmJS.cpp
@@ +6991,5 @@
>  
> +    if (m.simdObserved())
> +        f.setUsesSimd();
> +    if (m.atomicsObserved())
> +        f.setUsesAtomics();

The FunctionValidator is available everywhere that simd/atomics; can you remove the state/methods from ModuleValidator and call the FunctionValidator methods directly?  This has the advantage of not requiring any explicit clearing.

::: js/src/asmjs/WasmBaselineCompile.h
@@ +31,5 @@
> +// compiler to have different capabilities on different platforms
> +// and defer to the full Ion compiler if capabilities are missing.
> +// The FunctionGenerator contains information about the capabilities
> +// that are required to compile its code.)
> +bool BaselineCanCompile(const FunctionGenerator* fg);

\n after return type (here and below)

::: js/src/shell/js.cpp
@@ +6758,5 @@
>      enableIon = !op.getBoolOption("no-ion");
>      enableAsmJS = !op.getBoolOption("no-asmjs");
>      enableNativeRegExp = !op.getBoolOption("no-native-regexp");
>      enableUnboxedArrays = op.getBoolOption("unboxed-arrays");
> +    enableWasmBaseline = op.getBoolOption("wasm-baseline");

I think the name for this property should be "wasm-always-baseline" / wasmAlwaysBaseline since, when activated, it always chooses baseline.  Thus, this property is inherently test/debug-only.  Later, when we have tiering and are able to enable baseline by default, I think we'd add a separate property (with the current name), set to true by default, that can be used to disable using baseline/tiering for test/debug.
(In reply to Luke Wagner [:luke] from comment #35)
> Comment on attachment 8756853 [details] [diff] [review]
> bug1232205-infrastructure.patch
> 
> Review of attachment 8756853 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Drive-by nits:
> 
> ::: js/src/asmjs/AsmJS.cpp
> @@ +6991,5 @@
> >  
> > +    if (m.simdObserved())
> > +        f.setUsesSimd();
> > +    if (m.atomicsObserved())
> > +        f.setUsesAtomics();
> 
> The FunctionValidator is available everywhere that simd/atomics;

Several places I only have a ModuleValidator, eg, in IsCoercionCall() and IsSimdLiteral(), is there a trick to get to a FunctionValidator from the ModuleValidator or from another value available there?  Those are both static functions.

> can you
> remove the state/methods from ModuleValidator and call the FunctionValidator
> methods directly?  This has the advantage of not requiring any explicit
> clearing.

If there's a way to get the FunctionValidator, absolutely happy to do this.
(In reply to Lars T Hansen [:lth] from comment #36)
> Several places I only have a ModuleValidator, eg, in IsCoercionCall() and
> IsSimdLiteral(), is there a trick to get to a FunctionValidator from the
> ModuleValidator or from another value available there?  Those are both
> static functions.

Ah, I see, b/c those functions are also called in a module-level context for globals.  Really, the root issue here is detecting SIMD local types.  Perhaps instead you could have the FunctionValidator, once it finishes the function body (in FV::finish(), effectively the same place you're currently propagating those flags from MV to FV), just scan its list of local types and flag SIMD?
(In reply to Luke Wagner [:luke] from comment #37)
> (In reply to Lars T Hansen [:lth] from comment #36)
> > Several places I only have a ModuleValidator, eg, in IsCoercionCall() and
> > IsSimdLiteral(), is there a trick to get to a FunctionValidator from the
> > ModuleValidator or from another value available there?  Those are both
> > static functions.
> 
> Ah, I see, b/c those functions are also called in a module-level context for
> globals.  Really, the root issue here is detecting SIMD local types. 
> Perhaps instead you could have the FunctionValidator, once it finishes the
> function body (in FV::finish(), effectively the same place you're currently
> propagating those flags from MV to FV), just scan its list of local types
> and flag SIMD?

Just to clarify, examining the types by themselves is not enough for all cases but along with examining operations this seems to work fine.  Thanks!  Expect a review coming up, just to be on the safe side...
This addresses all nits from Benjamin and Luke.  Carrying forward Benjamin's r+.

Luke, can you look over the cleaned-up attribute computation in AsmJS.cpp?  Thanks.
Attachment #8756853 - Attachment is obsolete: true
Attachment #8759163 - Flags: review?(luke)
Attachment #8759163 - Flags: review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #27)
> Comment on attachment 8757946 [details] [diff] [review]
> bug1232205-allsinglemask.patch
> 
> 
> @@ +296,1 @@
> >      static const SetType AllMask = ((1ull << 48) - 1);
> 
> ... could we take advantage of this patch to make this code easier to
> understand for newcomers?
> 
> Double check me, but I think we'd have:
> 
> AllDoubleMask = ((1ull << TotalDouble) - 1) << TotalSingle;
> AllSingleMask = (1ull << TotalSingle) - 1;
> AllMask = (1ull << invalid_freg); // not sure this one makes it clearer

That's all correct, though the last one doesn't smell too good.  AllDoubleMask | AllSingleMask is better, and is what MIPS uses.  I'll tidy it up, there's no cost to doing so.
Blocks: 1277008
No longer depends on: 1277008
Blocks: 1277011
No longer depends on: 1277011
Comment on attachment 8759163 [details] [diff] [review]
bug1232205-infrastructure-v2.patch

Review of attachment 8759163 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent, thank you!

::: js/src/asmjs/AsmJS.cpp
@@ +2867,5 @@
>          MOZ_ASSERT(continuableStack_.empty());
>          MOZ_ASSERT(breakLabels_.empty());
>          MOZ_ASSERT(continueLabels_.empty());
> +
> +        for ( auto iter = locals_.all(); !iter.empty(); iter.popFront() ) {

nit: no spaces before "auto" or after "popFront()".

::: js/src/asmjs/WasmGenerator.h
@@ +261,5 @@
> +        usesAtomics_ = true;
> +    }
> +
> +    bool usesSignalsForInterrupts() const {
> +        return m_ && m_->args().useSignalHandlersForInterrupt;

I think the "m_ &&" is unnecessary (m_ is initialized immediately by MG::startFuncDef()).  Also it'd be creepy if the value of this seemingly-immutable compilation parameter had a varying value.

::: js/src/asmjs/WasmIonCompile.cpp
@@ +3476,5 @@
> +        return wasm::IonCompileFunction(task);
> +      case wasm::IonCompileTask::CompileMode::Baseline:
> +        return wasm::BaselineCompileFunction(task);
> +      default:
> +        MOZ_CRASH("Uninitialized task");

Can you add the third case; 'default' silences the compiler's "you forgot a case" warning.
Attachment #8759163 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #41)
> Comment on attachment 8759163 [details] [diff] [review]
> bug1232205-infrastructure-v2.patch
> 
> 
> ::: js/src/asmjs/WasmIonCompile.cpp
> @@ +3476,5 @@
> > +        return wasm::IonCompileFunction(task);
> > +      case wasm::IonCompileTask::CompileMode::Baseline:
> > +        return wasm::BaselineCompileFunction(task);
> > +      default:
> > +        MOZ_CRASH("Uninitialized task");
> 
> Can you add the third case; 'default' silences the compiler's "you forgot a
> case" warning.

Ho hum.  As discussed on IRC with Benjamin, gcc 5.2.1 on my Linux box will then complain that there are switch arms that do not return a value, and I'm not sure if that's better.  (I assume this is a gcc or glibc bug, since MOZ_CRASH calls abort and abort "should" be marked as noreturn.)  But sure, I can hack something up.
Comment on attachment 8758708 [details] [diff] [review]
bug1232205-wasm-baseline-compiler-v2.patch

Review of attachment 8758708 [details] [diff] [review]:
-----------------------------------------------------------------

First of all, this code is really nice, clean, well-documented and readable.  Very excited to land it.

Here's a quick first-pass of comments just looking at the global structure/style.  I'll try to take a few more passes over the next few days; please feel free to post updated versions of the patch as you go.  Ultimately, I don't think it's realistic for me or bbouvier to fully review every single function here, so I think the best plan is to rubberstamp and land after a few passes and iterate on trunk.  That way we also benefit from early fuzzing.  Speaking of, we should reach out to the fuzzing team after landing and to tell them about the new flag.

::: js/src/asmjs/WasmBaselineCompile.cpp
@@ +23,5 @@
> + * additional tag to indicate the area of improvement.
> + *
> + * Mostly the following naming abbreviation is used: "I"=Int32,
> + * "X"=Int64, "F"=Float32, "D"=Float64.  If an unsigned interpretation
> + * is to be used, a "U" is appended.

We generally use I32/I64/F32/F64 for this distinction (matching the wasm spec in naming convention).  Would it be ok to use that here too?

@@ +97,5 @@
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +#  include "jit/x86-shared/CodeGenerator-x86-shared.h"
> +#endif
> +
> +//#define DEBUG_TRAIL

I usually try to avoid landing quick-n-dirty debug stuff like this, keeping it in a local patch instead.

@@ +105,5 @@
> +using mozilla::SpecificNaN;
> +
> +namespace js {
> +namespace wasm {
> +namespace baseline {

I'm generally inclined not to have a sub-namespace inside wasm.  It both seems like it doesn't really give you full power to reuse names and you still end up needing to prefix things with "Baseline" anyway.  We could also switch to Base* as a prefix without, I feel, any real loss of readability.  Also, by nesting so many types inside FunctionCompiler, there's not that many global names anyway.  So for FunctionCompiler, perhaps we could rename to BaseCompiler?  (And then later, for symmetry, we could rename FunctionCompiler inside WasmIonCompile.cpp to IonCompile.)  The ModuleX/FunctionX split is really only useful in files which have both.

@@ +124,5 @@
> +    // The baseline compiler tracks control items on a stack of its
> +    // own as well.
> +    //
> +    // TODO / REDUNDANT: It would be nice if we could make use of the
> +    // iterator's ControlItems and not require our own stack for that.

It'd be nice talk with Dan about this to see if there's something we could do to generalize ExprIter b/c I agree this would be great to do at some point in the future (after landing :).

@@ +127,5 @@
> +    // TODO / REDUNDANT: It would be nice if we could make use of the
> +    // iterator's ControlItems and not require our own stack for that.
> +
> +    typedef NoVal Value;
> +    typedef NoCtl ControlItem;

I think you can delete this: ExprIterPolicy already typedefs these to a special Nothing type that is template-specialized to be a bit more efficient.

@@ +130,5 @@
> +    typedef NoVal Value;
> +    typedef NoCtl ControlItem;
> +};
> +
> +typedef ExprIter<BaselineCompilePolicy> BaselineWasmIterator;

I generally try not to say "Wasm" in names (especially for statics/internals of Wasm* files).  How about BaselineExprIter?

@@ +161,5 @@
> +static const Register ScratchRegX86 = ebx;
> +
> +class ScratchRegisterScope
> +{
> +public:

At least until we change SM style, the style is two-spaces before 'public'/'private' (and 'case').

@@ +4293,5 @@
> +    return pushControl(&blockEnd);
> +}
> +
> +void
> +FunctionCompiler::endBlock()

Just as a style question: do you think it makes sense to have *all* non-helper codegen methods out-of-line (like you've done here and below for some of the bigger methods), grouped by category with these nice little categorical comment blocks like you have above for "blocks and loops"?

@@ +6177,5 @@
> +#ifdef JS_CODEGEN_ARM64
> +    // FIXME: There is a hack up at the top to allow the baseline
> +    // compiler to compile on ARM64 (by defining StackPointer), but
> +    // the resulting code cannot run.  So prevent it from running.
> +    MOZ_CRASH("Several adjustments required for ARM64 operation");

Heh, I think we'll crash 20 ways if we try to run ARM64 in asm.js/wasm, so I wouldn't spend any code/comments on ARM64 other than keeping it building.  Or maybe a good place to keep a list of "don't forgot to.." is in an #elif in BaselineCanCompile.

@@ +6383,5 @@
> +    MOZ_ASSERT(done(), "all bytes must be consumed");
> +    MOZ_ASSERT(func_.callSiteLineNums().length() == lastReadCallSite_);
> +}
> +
> +static LiveRegisterSet volatileReturnGPR() {

static LiveRegisterSet\n
VolatileReturnGPR()\n
{
(In reply to Lars T Hansen [:lth] from comment #42)
> Ho hum.  As discussed on IRC with Benjamin, gcc 5.2.1 on my Linux box will
> then complain that there are switch arms that do not return a value.

Yeah, I commonly see that (I guess gcc doesn't let itself assume that if you handle all the visible enumerators that a new non-enumerator integral value won't flow through) and just stick a MOZ_CRASH() (or MOZ_MAKE_COMPILER_ASSUME_IS_UNREACHABLE if I'm worried about code size impact of a hot inline function) as the last line of the function; that shuts the compiler up.
(In reply to Benjamin Bouvier [:bbouvier] from comment #32)
> Comment on attachment 8757949 [details] [diff] [review]
> bug1232205-ool-truncate.patch
> 
> Review of attachment 8757949 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/MacroAssembler.cpp
> @@ +1807,5 @@
> > +                                      bool compilingAsmJS)
> > +{
> > +#if defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_ARM64)
> > +    if (needFloat32Conversion) {
> > +        convertFloat32ToDouble(src, ScratchDoubleReg);
> 
> While we're here, can you use a ScratchScope for the double scratch, please?

That's probably not appropriate here because src flows out of that scope (src=ScratchDoubleReg in the next line after the ones you quoted), and it seems inappropriate to use ScratchDoubleScope for the whole function body.  I agree this is just a bug waiting to happen, though.

(Also the arm64 assembler doesn't even have ScratchDoubleScope but that's not an excuse.)

> @@ +1813,5 @@
> > +    }
> > +#else
> > +    FloatRegister srcSingle = src.asSingle();
> > +    if (needFloat32Conversion) {
> > +        MOZ_ASSERT(src.isSingle());
> 
> pre-existing: out of curiosity, shouldn't this pass even if we use srcSingle
> = src? If needFloat32Conversion and the input is not into a float32
> register, we're gonna have a bad time. Otherwise, this assertion is
> self-evident because of the asSingle() call before.

I agree, this code seems very confused.  I'll try to tidy up.  I also think "needFloat32Conversion" is not very indicative of what's going on here: "widenFloat32ToDouble" or "inputIsFloat32" would be better.  I'll change that too.

The code will also not be correct on MIPS if push() pushes the single register (two singles can alias a double) but hopefully it pushes the wider double register?

(Soapbox: when doing platform ifdefs, ALWAYS have an #else MOZ_CRASH() case and opt in per-platform so that you know that somebody has thought about it.)

(I will forego the glory of your last suggestion for now.)
Patch 5/8, v2: out-of-line truncate code.

I'm going to ask for a re-review here because I cleaned up the code in MacroAssembler.cpp (outOfLineTruncateSlow) as discussed in my previous comment.
Attachment #8757949 - Attachment is obsolete: true
Attachment #8759210 - Flags: review?(bbouvier)
(In reply to Luke Wagner [:luke] from comment #43)
> Comment on attachment 8758708 [details] [diff] [review]
> bug1232205-wasm-baseline-compiler-v2.patch
> 
> Review of attachment 8758708 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Ultimately, I don't think it's realistic for me or bbouvier to fully review
> every single function here, so I think the best plan is to rubberstamp and
> land after a few passes and iterate on trunk.  That way we also benefit from
> early fuzzing.  Speaking of, we should reach out to the fuzzing team after
> landing and to tell them about the new flag.

Good plan.

> ::: js/src/asmjs/WasmBaselineCompile.cpp
> @@ +23,5 @@
> > + * additional tag to indicate the area of improvement.
> > + *
> > + * Mostly the following naming abbreviation is used: "I"=Int32,
> > + * "X"=Int64, "F"=Float32, "D"=Float64.  If an unsigned interpretation
> > + * is to be used, a "U" is appended.
> 
> We generally use I32/I64/F32/F64 for this distinction (matching the wasm
> spec in naming convention).  Would it be ok to use that here too?

I know, and I chose the abbreviation with some trepidation for that reason, but decided to go with it because it is less noisy.  Initially my reluctance to the standard notation came from writing things like pop2I32(...) and going slightly insane.

Not saying "no", not saying "yes" either, not yet.  I may experiment to see how bad I think it is.

> @@ +105,5 @@
> > +using mozilla::SpecificNaN;
> > +
> > +namespace js {
> > +namespace wasm {
> > +namespace baseline {
> 
> I'm generally inclined not to have a sub-namespace inside wasm.

I can live with that.

> So for FunctionCompiler, perhaps we could rename
> to BaseCompiler?

And that.  (And so on for other naming suggestions you make later.)

> @@ +124,5 @@
> > +    // The baseline compiler tracks control items on a stack of its
> > +    // own as well.
> > +    //
> > +    // TODO / REDUNDANT: It would be nice if we could make use of the
> > +    // iterator's ControlItems and not require our own stack for that.
> 
> It'd be nice talk with Dan about this to see if there's something we could
> do to generalize ExprIter b/c I agree this would be great to do at some
> point in the future (after landing :).

Yes, there's this functionality, and then also a sniff-next-opcode function that would be nice to have for work you've not seen yet and that I currently have to simulate (which is bearable).

> @@ +4293,5 @@
> > +    return pushControl(&blockEnd);
> > +}
> > +
> > +void
> > +FunctionCompiler::endBlock()
> 
> Just as a style question: do you think it makes sense to have *all*
> non-helper codegen methods out-of-line (like you've done here and below for
> some of the bigger methods), grouped by category with these nice little
> categorical comment blocks like you have above for "blocks and loops"?

Well... We could do that.  The current situation is a little messy since there's really no clear dividing line between what's inline and out-of-line.  I'd want to measure performance along the way because I'm a little concerned that the C++ compiler will not as readily in-line an out-of-line method, but I could just be paranoid, and again, the current situation is messy and not obviously optimal in any way.

> 
> @@ +6177,5 @@
> > +#ifdef JS_CODEGEN_ARM64
> > +    // FIXME: There is a hack up at the top to allow the baseline
> > +    // compiler to compile on ARM64 (by defining StackPointer), but
> > +    // the resulting code cannot run.  So prevent it from running.
> > +    MOZ_CRASH("Several adjustments required for ARM64 operation");
> 
> Heh, I think we'll crash 20 ways if we try to run ARM64 in asm.js/wasm, so I
> wouldn't spend any code/comments on ARM64 other than keeping it building. 

Belt and suspenders... :)
Patch 3/8 v2: Refactored as requested earlier.  Looking good, IMO - clearly an improvement.
Attachment #8757947 - Attachment is obsolete: true
Attachment #8759252 - Flags: review?(bbouvier)
(In reply to Benjamin Bouvier [:bbouvier] from comment #34)
> Comment on attachment 8757952 [details] [diff] [review]
> bug1232205-test-directives.patch
> 
> Review of attachment 8757952 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks. Out of curiosity, would putting |test-also-wasm-baseline| in
> lib/wasm.js work too? I don't see any reason why it would, but the testing
> harness has surprised me a few times already with nice features :)

There's no evidence of that going on.  Given that wasm.js is loaded dynamically based on not-statically-computable information there would at least have to be an instrumented load command that feeds information from files it loads back into the (currently running) test harness, which seems like a tall order.

Work on an improved test runner has been spun out as bug 1277770.
(In reply to Lars T Hansen [:lth] from comment #47)
> (In reply to Luke Wagner [:luke] from comment #43)
> > Comment on attachment 8758708 [details] [diff] [review]
> > bug1232205-wasm-baseline-compiler-v2.patch
> > 

> > @@ +4293,5 @@
> > > +    return pushControl(&blockEnd);
> > > +}
> > > +
> > > +void
> > > +FunctionCompiler::endBlock()
> > 
> > Just as a style question: do you think it makes sense to have *all*
> > non-helper codegen methods out-of-line (like you've done here and below for
> > some of the bigger methods), grouped by category with these nice little
> > categorical comment blocks like you have above for "blocks and loops"?
> 
> Well... We could do that.  The current situation is a little messy since
> there's really no clear dividing line between what's inline and out-of-line.
> I'd want to measure performance along the way because I'm a little concerned
> that the C++ compiler will not as readily in-line an out-of-line method, but
> I could just be paranoid, and again, the current situation is messy and not
> obviously optimal in any way.

I did measure performance.  There's no discernible difference (sample of one, but it's what we have).  I'll go with your suggestion.
This version addresses most of Luke's concerns on the previous patch, and cleans up comments and a few other things.

Outstanding items:

- I've started an email thread about reusing ExprIter's control stack.
- I've not concluded about renaming I => I32, X => I64, etc, but I'm working
  on it.
Attachment #8758708 - Attachment is obsolete: true
Attachment #8758708 - Flags: review?(luke)
Attachment #8758708 - Flags: review?(bbouvier)
Attachment #8759677 - Flags: review?(luke)
Attachment #8759677 - Flags: review?(bbouvier)
Comment on attachment 8759677 [details] [diff] [review]
bug1232205-wasm-baseline-compiler-v3.patch

Review of attachment 8759677 [details] [diff] [review]:
-----------------------------------------------------------------

Spent a few more hours scanning today, I think it looks good to land.  It'll be interesting (later, after x86/ARM/debugger/tiering) to look into the lines you have optimized with TODO/OPTIMIZE since many look quite promising.

::: js/src/asmjs/WasmBaselineCompile.cpp
@@ +107,5 @@
>  
> +#include "jit/MacroAssembler-inl.h"
> +#include "jit/shared/CodeGenerator-shared.h"
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +#  include "jit/x86-shared/CodeGenerator-x86-shared.h"

Are these two CodeGenerator #includes still necessary?  I'd hope that we only had a dependency on the MacroAssembler (and anything in the CG we needed would be moved to the MA).

@@ +191,5 @@
> +    };
> +
> +    typedef Vector<PooledLabel*, 8, SystemAllocPolicy> LabelVector;
> +
> +    struct UniquePooledLabelFreePolicy {

{ on newline

@@ +315,5 @@
> +    // code density and branch prediction friendliness will be less
> +    // important.
> +
> +    class OutOfLineCode : public TempObject {
> +    private:

2-space indent for public/private labels (here and below)

@@ +504,5 @@
> +    // Stack-allocated local slots.
> +
> +    int32_t pushLocal(size_t nbytes) {
> +        if (nbytes == 8)
> +            localSize_ = (localSize_ + 7) & ~7;

To match other places in baldr, could you write
  localSize_ = AlignBytes(localSize_, nbytes)
here and about 4 places below?

@@ +527,5 @@
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +        uint32_t space = GeneralRegisterSet::NonVolatile().size() * sizeof(intptr_t)
> +            + FloatRegisterSet::NonVolatile().getPushSizeInBytes();
> +#else
> +        // See definitions of FramePushedAfterSave and NonVolatileRegs in WasmStubs.cpp,

Once we're inside Ion/Baseline code (which we enter via GenerateEntry), there are no non-volatile registers.  GenerateEntry takes care of saving the (C++) caller's non-volatiles and after that, nothing is preserved by the Ion ABI.  So I'd just remove all non-volatile-saving code in baseline, incl saveSize_.

@@ +529,5 @@
> +            + FloatRegisterSet::NonVolatile().getPushSizeInBytes();
> +#else
> +        // See definitions of FramePushedAfterSave and NonVolatileRegs in WasmStubs.cpp,
> +        // as well as SavedNonVolatileRegisters() in RegisterSets.h:  Who is responsible
> +        // for saving the link register, and where does it get saved?  If it gets pushed

fwiw, lr is saved by PushRetAddr() by GenerateFunctionPrologue as part of the AsmJSFrame.

@@ +1013,5 @@
> +    void sync() {
> +        size_t start = 0;
> +        size_t lim=stk_.length();
> +
> +        for ( size_t i=lim ; i > 0 ; i-- ) {

SM whitespace convention is "for (size_t i = lim; i > 0; i--)".  I think quite a few of the loops/ifs do this; could you convert the rest too?

@@ +1701,5 @@
> +                    storeToFrameI(i->gpr(), l.offs);
> +                break;
> +              case MIRType::Int64:
> +                if (i->argInRegister())
> +                    storeToFrameX(Register64(i->gpr()), l.offs);

It would really be nice to see I32 here instead of X, given that nowhere else uses this I/X/D/F convention so it's a bit jarring if you're hopping in and out.
Attachment #8759677 - Flags: review?(luke) → review+
The sniffing of SIMD usage in the first patch ("infrastructure") is not quite sufficient: SIMD can be used even if it does not show up in the types, operations, or constructors, namely as literals, as in these two cases:

   function f() { return i16x8(1,2,3,4,5,6,7,8); }
   function f() { i4(1,2,3,4) }

I found these by running all the asm.js tests with --wasm-always-baseline, since some SIMD tests are not yet run automatically.
Attachment #8760145 - Flags: review?(luke)
Comment on attachment 8759210 [details] [diff] [review]
bug1232205-ool-truncate-v2.patch

Review of attachment 8759210 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: js/src/jit/MacroAssembler.cpp
@@ +1816,5 @@
> +    if (widenFloatToDouble) {
> +        MOZ_ASSERT(src.isSingle());
> +        srcSingle = src;
> +        src = src.asDouble();
> +        push(srcSingle);

Really not a big deal, but do we need the srcSingle variable at all?

if (widen) {
  MOZ_ASSERT(src.isSingle());
  push(src);
  convertFloat32ToDouble(src, src.asDouble());
  src = src.asDouble();
}

And below:

if (widen) {
  MOZ_ASSERT(src.isDouble());
  pop(src.asSingle());
}

For what it's worth, MIPS32 and MIPS64 (and also at least x64) push the double's content of a FloatRegister, as shows e.g. https://dxr.mozilla.org/mozilla-central/source/js/src/jit/mips32/MacroAssembler-mips32.cpp#739 (push(FloatRegister) calls ma_push(FloatRegister) on ARM-masm-like platforms).

Feel free to make a more specific #ifdef as you expressed in the comment (ifdef x64 and x86), it definitely makes sense to me to express a clearer intent here. Probably mips can do as ARM does, as ARM has the same aliasing constraints? (unless we use the upper single of FloatRegister as independent virtual registers)

::: js/src/jit/MacroAssembler.h
@@ +1596,5 @@
>      void convertTypedOrValueToFloatingPoint(TypedOrValueRegister src, FloatRegister output,
>                                              Label* fail, MIRType outputType);
>  
> +    void outOfLineTruncateSlow(FloatRegister src, Register dest, bool widenFloatToDouble,
> +                               bool compilingAsmJS);

Feel free to rename the last arg as compilingWasm.
Attachment #8759210 - Flags: review?(bbouvier) → review+
Addresses all of Luke's concerns from his review of v3, except for the thing about the control stack which I'll do in a followup bug.  No other functional changes from v3.  (An interdiff with v3 is likely to be larger than the patch, due to the completely pervasive naming changes.)

Carrying Luke's r+.
Attachment #8759677 - Attachment is obsolete: true
Attachment #8759677 - Flags: review?(bbouvier)
Attachment #8760251 - Flags: review?(bbouvier)
Attachment #8760251 - Flags: review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #54)
> Comment on attachment 8759210 [details] [diff] [review]
> bug1232205-ool-truncate-v2.patch
> 
> Review of attachment 8759210 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks!
> 
> ::: js/src/jit/MacroAssembler.cpp
> @@ +1816,5 @@
> > +    if (widenFloatToDouble) {
> > +        MOZ_ASSERT(src.isSingle());
> > +        srcSingle = src;
> > +        src = src.asDouble();
> > +        push(srcSingle);
> 
> Really not a big deal, but do we need the srcSingle variable at all?

Potato-potahto... a matter of taste.  I happen to dislike the non-caching style you suggest above, but there's nothing objective about that, just a C programmer grumbling about kids these days.

> For what it's worth, MIPS32 and MIPS64 (and also at least x64) push the
> double's content of a FloatRegister, as shows e.g.
> https://dxr.mozilla.org/mozilla-central/source/js/src/jit/mips32/
> MacroAssembler-mips32.cpp#739 (push(FloatRegister) calls
> ma_push(FloatRegister) on ARM-masm-like platforms).

Yes.  I did minimal cleanup here but I don't know how general the logic *really* needs to be.  It's likely this code really should be in platform files.

> Feel free to make a more specific #ifdef as you expressed in the comment
> (ifdef x64 and x86), it definitely makes sense to me to express a clearer
> intent here.

OK.

> Probably mips can do as ARM does, as ARM has the same aliasing
> constraints? (unless we use the upper single of FloatRegister as independent
> virtual registers)

I will let the MIPS build MOZ_CRASH and let the MIPS people clean it up.  cc'ing now :)

> ::: js/src/jit/MacroAssembler.h
> @@ +1596,5 @@
> >      void convertTypedOrValueToFloatingPoint(TypedOrValueRegister src, FloatRegister output,
> >                                              Label* fail, MIRType outputType);
> >  
> > +    void outOfLineTruncateSlow(FloatRegister src, Register dest, bool widenFloatToDouble,
> > +                               bool compilingAsmJS);
> 
> Feel free to rename the last arg as compilingWasm.

Will do so.
Attachment #8760145 - Flags: review?(luke) → review+
Comment on attachment 8759252 [details] [diff] [review]
bug1232205-long-to-float-and-back-v2.patch

Review of attachment 8759252 [details] [diff] [review]:
-----------------------------------------------------------------

Cool, thank you.

::: js/src/jit/x64/MacroAssembler-x64.cpp
@@ +103,5 @@
> +    j(Assembler::Signed, &isSigned);
> +    vcvtsq2sd(input, output, output);
> +    jump(&done);
> +
> +    bind(&isSigned);

I think here and below in the float32 equivalent, we've lost the comment that we divide by 2, convert and multiply by two. It's just weird to have a comment for the unsigned case and not for the signed case, so we can either remove both or keep both.
Attachment #8759252 - Flags: review?(bbouvier) → review+
Blocks: 1278635
Incorporates bug fixes for OOM handling and non-unified builds, and refactors the remainderI32 and remainderI64 operations for simplicity.
Attachment #8761123 - Flags: review?(bbouvier)
Attachment #8760251 - Attachment is obsolete: true
Attachment #8760251 - Flags: review?(bbouvier)
Note to reviewers, the patches on this bug as attached are from various rebasings, not all of them the same, and they may not apply together to any particular repo version.  For a link to a Mercurial bundle containing the most recent coherent set, with a defined base patch in m-i, see the URL field of this bug.
Note to self: Something broke in the test runner: the wast tests are no longer being run with the baseline compiler despite spec.js containing the correct directive.
Comment on attachment 8761123 [details] [diff] [review]
bug1232205-wasm-baseline-compiler-v4.patch

Review of attachment 8761123 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice, thank you for doing this! The code is very clean, comments are really helpful, the control flow is simple to follow.

Below "a few" style nits, remarks and suggestions about the code. Some are good, some are bad, some are just OK. I'd like to see them addressed at some point, but that could be done in a follow-up bug, for those that are not errors.

I have only glanced at the emit* instructions, but the structure looks good. (I guess it's time for the inquisition to implement differential wasm testing)

Also, not sure if it would make a big difference, but experimenting with the goto-dynamic-label optimization in the main for/switch loop could bring some benefits (in CPUs older than skylake).

::: js/src/asmjs/WasmBaselineCompile.cpp
@@ +174,5 @@
> +// EBX not being one of the WasmTableCall registers; and needing a
> +// temp register for load/store that has a single-byte persona.
> +static const Register ScratchRegX86 = ebx;
> +
> +class ScratchRegisterScope

Probably for a later patch: if you make a new kind of ScratchRegisterScope-like class, you could pass it the BaseCompiler struct, which could check isAvailable(ScratchRegX86) and mark it as taken in the ctor, and give it back in the dtor.

@@ +208,5 @@
> +
> +    // The strongly typed register wrappers have saved my bacon a few
> +    // times; though they are largely redundant they stay, for now.
> +
> +    struct RegI32

Is it a case where inheritance would be better than composition? You'd keep the strong typing but wouldn't need to redefine operator==/operation!= and access the register through the reg member anytime. (plus inheritance makes sense: a RegI32 really *is* a Register, for instance)

@@ +268,5 @@
> +    };
> +
> +    struct Local
> +    {
> +        Local() : type(MIRType::None), offs(UINT32_MAX), size(0) {}

It might be some hassle, but if you initialize offs to a sentinel value, might as well have a getter that MOZ_ASSERT that we're not reading the sentinel value? (so make offs private and have a getter on it). Ditto for type.

@@ +360,5 @@
> +    const FuncBytes&            func_;
> +    size_t                      lastReadCallSite_;
> +    TempAllocator&              alloc_;
> +    const ValTypeVector&        locals_;         // Types of parameters and locals
> +    int32_t                     localSize_;      // Size of local area (stable after beginFunction)

Could we precise "(in bytes)"? I wondered whether this was in bytes or local number, so let's prevent the future reader from checking.

@@ +372,5 @@
> +    Label                       outOfLinePrologue_;
> +    Label                       bodyLabel_;
> +
> +    FuncCompileResults&         compileResults_;
> +    MacroAssembler&             masm;            // No '_' suffix - too tedious...

(Heard about v8's __ macro? :-))

@@ +432,5 @@
> +            return nullptr;
> +        return ool;
> +    }
> +
> +    bool generateOutOfLineCode() {

nit: MOZ_MUST_USE

@@ +528,5 @@
> +
> +    Register allocGPR() {
> +        MOZ_ASSERT(hasGPR());
> +        Register r = availGPR_.takeAny();
> +        return r;

nit: r's definition can be inlined here

@@ +575,5 @@
> +            return FloatRegisters::AllSingleMask;
> +          case MIRType::Double:
> +            return FloatRegisters::AllDoubleMask;
> +          default:
> +            MOZ_CRASH("TODO: maskFromTypeFPU");

You can even static_assert here.

@@ +648,5 @@
> +            ConstI64,             // 64-bit integer constant ("i64val")
> +            ConstF32,             // 32-bit floating constant ("f32val")
> +            ConstF64,             // 64-bit floating constant ("f64val")
> +
> +            None                  // Uninitialized or void

Could we split the cross-product into two enums, Kind / Type? (or any clearer naming than Kind vs Type)
With SIMD, you'd have 32 new enum values otherwise.

@@ +852,5 @@
> +#endif
> +    }
> +
> +    void loadConstI32(Register r, Stk& src) {
> +        masm.mov(ImmWord((uint32_t)src.i32val & 0xFFFFFFFFU), r);

It's a general remark about the usage of Stk: it seems rather dangerous to have unchecked usage of the union value without looking at the Kind first (maybe it's done in the callers?). Could we either have small getters for the values that would MOZ_ASSERT on the kind, or MOZ_ASSERT the kind here? (I can think of some cases like *Reinterpret opcodes that don't want to check the kind, but these are the exceptions, not the common cases).

Or if these methods are only used internally, maybe they should be marked as private, as a token of "callers know what they're doing"?

@@ +983,5 @@
> +    //    sync.  That type of fix would go into needI32().
> +
> +    void sync() {
> +        size_t start = 0;
> +        size_t lim=stk_.length();

nit: spaces before and after the =

@@ +1008,5 @@
> +                v.kind = Stk::MemI32;
> +                v.offs = masm.framePushed();
> +                break;
> +              }
> +              case Stk::RegisterI32:

nit: by our style guide, if any switch case takes curlies, all of them must have curlies.

@@ +1087,5 @@
> +
> +    bool hasLocal(uint32_t slot) {
> +        for (size_t i = stk_.length(); i > 0; i--) {
> +            // Memory opcodes are first in the enum, single check against MemLast is fine.
> +            Stk::Kind kind = stk_[i-1].kind;

general style nit: spaces between operands and operator! i - 1, but works also for +, etc.

@@ +1089,5 @@
> +        for (size_t i = stk_.length(); i > 0; i--) {
> +            // Memory opcodes are first in the enum, single check against MemLast is fine.
> +            Stk::Kind kind = stk_[i-1].kind;
> +            if (kind <= Stk::MemLast)
> +                return false;

Don't we want to continue, in this case? (unless there's an invariant that elements with Mem kind are positioned before the Local kind elements in stk_)

@@ +1141,5 @@
> +
> +    // Push the value onto the stack.
> +
> +    void pushI32(int32_t v) {
> +        Stk& x = push();

I don't know if this matters, but if the stack already contains a constant, it doesn't need to be re-generated. Maybe an optimization opportunity for later or something to explore. It might not be that important because such constants should be already LICM'd and GVN'd by the emitter/emscripten, but an experiment is needed to confirm.

@@ +1208,5 @@
> +          case Stk::LocalI32:
> +            loadLocalI32(r.reg, v);
> +            break;
> +          case Stk::MemI32:
> +            masm.Pop(r.reg);

I understand this works and we don't need to reference by offset because v is at the back of the stk_ vector. Can we assert this here?

@@ +1242,5 @@
> +        return r;
> +    }
> +
> +    RegI32 popI32(RegI32 specific) {
> +        Stk& v = stk_.back();

I wonder, can you do a popCopy here? I think this would reduce stack churn in the call to needI32 below (in the case v.kind == Register), + we don't need to popBack() at the bottom. And if you remove the return value that is always the unmodified unique parameter, we could also simplify the code by inverting the condition of the if. (Although I guess returning specific is nice for reusing the popI32() as a parameter)

@@ +1281,5 @@
> +        }
> +    }
> +
> +    MOZ_MUST_USE
> +    RegI64 popI64() {

We could probably reduce the amount of new code by using macros...

@@ +1513,5 @@
> +    // values on the value stack.
> +
> +    size_t stackConsumed(size_t numval) {
> +        size_t size = 0;
> +        for (uint32_t i = stk_.length()-1; numval > 0; numval--, i--) {

MOZ_ASSERT(numval <= stk_.length());

@@ +1592,5 @@
> +
> +    Vector<Control, 8, SystemAllocPolicy> ctl_;
> +
> +    MOZ_MUST_USE
> +    bool pushControl(UniquePooledLabel* label, UniquePooledLabel* otherLabel=nullptr) {

nit: spaces before and after =

@@ +1610,5 @@
> +        return true;
> +    }
> +
> +    void popControl() {
> +        Control& last = ctl_.back();

You could use popCopy here.

@@ +1645,5 @@
> +
> +        wasm::GenerateFunctionPrologue(masm, localSize_, mg_.funcSigIndex(func_.index()),
> +                                       &compileResults_.offsets());
> +
> +        maxFramePushed_ = masm.framePushed();

Shouldn't this include sizeof(AsmJSFrame)?

@@ +1689,5 @@
> +
> +        // Initialize the stack locals to zero.
> +        //
> +        // TODO / OPTIMIZE: on x64, at least, scratch will be a 64-bit
> +        // register and we can move 64 bits at a time.

Wondering: you could use a FloatRegister and a xorps and move 128 bits at a time on platforms supporting SIMD (x64 always has SSE2 so yes for sure, x86 if it has SSE2, ARM nope not yet), but I don't know if the latency/throughput will be better than using GPR/stack moves.

@@ +1727,5 @@
> +        // it.  rax/eax is a volatile non-argument register in all
> +        // x86/x64 ABIs and is not live when we're in the function prologue.
> +
> +        masm.movePtr(masm.getStackPointer(), eax);
> +        masm.subPtr(Imm32(maxFramePushed_ - (localSize_ + sizeof(AsmJSFrame))), eax);

MOZ_ASSERT(maxFramePushed_ >= localSize_ + sizeof(AsmJSFrame)); before this please.

@@ +1728,5 @@
> +        // x86/x64 ABIs and is not live when we're in the function prologue.
> +
> +        masm.movePtr(masm.getStackPointer(), eax);
> +        masm.subPtr(Imm32(maxFramePushed_ - (localSize_ + sizeof(AsmJSFrame))), eax);
> +        masm.branchPtr(Assembler::AboveOrEqual,

If you invert the condition, you can avoid the unconditional jump thereafter and the stackOverflow label.

@@ +1784,5 @@
> +        {}
> +
> +        uint32_t lineOrBytecode_;
> +        ABIArgGenerator abi_;
> +        bool saveState_;

This is true when we need to restore the state, because we saved it. How about renaming to restoreState_, or savedState_, or mustRestoreState_?

@@ +1787,5 @@
> +        ABIArgGenerator abi_;
> +        bool saveState_;
> +        size_t stateSize_;
> +        size_t adjust_;
> +        size_t outgoingSize_;

outgoingSize_ is the size of the arguments passed onto the stack. How about renaming argStackSize_ or something more explicit?

At least if you don't rename, some comments would be useful for saveState_ (what is the "state" we're talking about? it could be arguments, after all) and outgoingSize_.

@@ +1827,5 @@
> +#endif
> +            }
> +            masm.freeStack(size + call.stateSize_ + call.adjust_);
> +        } else if (!call.calleePopsArgs_) {
> +            masm.freeStack(call.outgoingSize_ + call.adjust_);

We know that call.adjust_ is 0 here or we would have entered the previous branch.

@@ +1845,5 @@
> +
> +    void startCallArgs(FunctionCall& call, size_t outgoingSize)
> +    {
> +        call.outgoingSize_ = outgoingSize;
> +        if (call.stateSize_  || call.adjust_) {

nit: remove one space before ||

@@ +1857,5 @@
> +                MOZ_CRASH("BaseCompiler platform hook: startCallArgs");
> +#endif
> +            }
> +        } else if (outgoingSize > 0) {
> +            masm.reserveStack(outgoingSize + call.adjust_);

Same remark as above.

@@ +1887,5 @@
> +          case ValType::I32: {
> +            ABIArg argLoc = call.abi_.next(MIRType::Int32);
> +            if (argLoc.kind() == ABIArg::Stack) {
> +#if defined(JS_CODEGEN_X64) || defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_ARM)
> +                ScratchRegisterScope scratch(masm);

I guess that's something you've found out in the x86 patch, but there's no reason on x86 that ebx be not used at this point, so it might get clobbered. A comment for the x86 patch, though :-)

@@ +2230,5 @@
> +
> +    void checkDivideByZeroI64(RegI64 rhs, RegI64 srcDest, Label* done) {
> +#ifdef JS_CODEGEN_X64
> +        masm.testq(rhs.reg.reg, rhs.reg.reg);
> +        if (isCompilingAsmJS()) {

asm.js doesn't have int64, so this can be removed.

@@ +2275,5 @@
> +        if (zeroOnOverflow) {
> +            masm.j(Assembler::NotEqual, &notMin);
> +            masm.xorq(srcDest.reg.reg, srcDest.reg.reg);
> +            masm.jump(done);
> +        } else if (isCompilingAsmJS()) {

this branch is dead code too

@@ +2719,5 @@
> +        MOZ_CRASH("BaseCompiler platform hook: extendU32ToI64");
> +#endif
> +    }
> +
> +    class OutOfLineTruncateFOrF64ToI32 : public OutOfLineCode

Name might be incomplete: OutOfLineTruncateF32OrF64ToI32

@@ +2727,5 @@
> +      public:
> +        OutOfLineTruncateFOrF64ToI32(AnyReg src, RegI32 dest) : src(src), dest(dest) {}
> +
> +        virtual void generate(MacroAssembler& masm) {
> +            bool isAsmJS = true; // Wasm passes for AsmJS in this context

I think this is not correct, see below...

@@ +2742,5 @@
> +    bool truncateF32ToI32(RegF32 src, RegI32 dest) {
> +        OutOfLineCode* ool = addOutOfLineCode(new (alloc_) OutOfLineTruncateFOrF64ToI32(AnyReg(src), dest));
> +        if (!ool)
> +            return false;
> +        masm.branchTruncateFloat32(src.reg, dest.reg, ool->entry());

Actually, AsmJS and Wasm don't have the same truncation semantics: asm.js does a truncation modulo the int32 range (try assertEq(2**32 + 1 | 0, 1) in a shell), but in Wasm, it's an invalid conversion. (See EmitTruncate in WasmIonCompile.cpp)

So you just need to use wasmTruncateF32ToInt32 or branchTruncateFloat32 whether we're in wasm or asm.js, and have the correct OOL generated too.

@@ +2752,5 @@
> +    bool truncateF64ToI32(RegF64 src, RegI32 dest) {
> +        OutOfLineCode* ool = addOutOfLineCode(new (alloc_) OutOfLineTruncateFOrF64ToI32(AnyReg(src), dest));
> +        if (!ool)
> +            return false;
> +        masm.branchTruncateDouble(src.reg, dest.reg, ool->entry());

ditto

@@ +3267,5 @@
> +{
> +    int32_t c;
> +    if (popConstI32(c)) {
> +        RegI32 r = popI32();
> +        masm.add32(Imm32(c), r.reg);

(another optimization idea) Is it worth it for commutative instructions in general to also check if the constant isn't the just-below-top-of-stack value and commute them before trying to use an imm32?

@@ +3281,5 @@
> +
> +void
> +BaseCompiler::emitAddI64()
> +{
> +    // Ditto check for constant here

Maybe add a TODO here?

@@ +4628,5 @@
> +    // These registers are free in the remainder of the block.
> +    freeI32(rc);
> +    freeJoinReg(r);
> +
> +    pushVoid();

There was a discussion on the design repo about this, but I think that if brIf has a value and the branch is not taken, it returns this value. We're not relying on this in current tests, but this might be something to keep in mind for future changes.

@@ +4733,5 @@
> +        return false;
> +
> +    ExprType type = func_.sig().ret();
> +
> +    if (IsVoid(type)) {

This could be a case in the switch below, for less control flow.

@@ +6168,5 @@
>  
>  bool
> +BaseCompiler::emitFunction()
> +{
> +    if (!stk_.reserve(8))

I think the preallocated number of Stk is 8 in the declaration of stk_, already?

@@ +6292,5 @@
> +    // needs to also test single-low/single-high registers.
> +
> +#if defined(DEBUG) && (defined(JS_CODEGEN_X64) || defined(JS_CODEGEN_X32))
> +    FloatRegister f = allocFPU<MIRType::Float32>();
> +    MOZ_ASSERT(!isAvailable(f.asDouble()));

This makes me think that there are many FloatRegister comparisons (FLoatRegister1 == FloatRegister2). They compare the type of the content (single, double, simd), so maybe overriding the comparison operator here would lead to fewer stack spilling.

::: js/src/asmjs/WasmTypes.h
@@ +642,5 @@
>  {
>      uint32_t insnOffset_;
>    public:
>      HeapAccess() = default;
> +    static const uint32_t NoLengthCheck = UINT32_MAX;

This is unused in this patch, can you please remove it?
Attachment #8761123 - Flags: review?(bbouvier) → review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #63)
> Comment on attachment 8761123 [details] [diff] [review]
> bug1232205-wasm-baseline-compiler-v4.patch
> 
> Review of attachment 8761123 [details] [diff] [review]:
> -----------------------------------------------------------------

Thanks for comments, some select follow-ups here.

> ... experimenting with the
> goto-dynamic-label optimization in the main for/switch loop could bring some
> benefits (in CPUs older than skylake).

Nice idea!  It's a little tricky to do that right now because each iteration does a song and dance with ensureBallast() and reserve(), and there's more trickiness coming with a token lookahead mechanism, but it would be interesting to discuss (later) how to avoid that.  (Other reactions: (1) I'm on a Haswell and you have to go far down the profile before emitBody shows up.  (b) We need to start worrying about optimizing for ARM more.)

> @@ +208,5 @@
> > +
> > +    // The strongly typed register wrappers have saved my bacon a few
> > +    // times; though they are largely redundant they stay, for now.
> > +
> > +    struct RegI32
> 
> Is it a case where inheritance would be better than composition? You'd keep
> the strong typing but wouldn't need to redefine operator==/operation!= and
> access the register through the reg member anytime. (plus inheritance makes
> sense: a RegI32 really *is* a Register, for instance)

Right, I like this.  The existing structure just grew out of a prototype and I probably never considered inheritance.  There are cases where inheritance hides errors (conversion to FloatRegister) but I think the current design hides errors in the same ways (the register field is always called "reg"), so...

> @@ +268,5 @@
> > +    };
> > +
> > +    struct Local
> > +    {
> > +        Local() : type(MIRType::None), offs(UINT32_MAX), size(0) {}
> 
> It might be some hassle, but if you initialize offs to a sentinel value,
> might as well have a getter that MOZ_ASSERT that we're not reading the
> sentinel value? (so make offs private and have a getter on it). Ditto for
> type.

Sigh, yes.  Probably the same for Stk, as you remark later.
> 
> @@ +648,5 @@
> > +            ConstI64,             // 64-bit integer constant ("i64val")
> > +            ConstF32,             // 32-bit floating constant ("f32val")
> > +            ConstF64,             // 64-bit floating constant ("f64val")
> > +
> > +            None                  // Uninitialized or void
> 
> Could we split the cross-product into two enums, Kind / Type? (or any
> clearer naming than Kind vs Type).  With SIMD, you'd have 32 new enum values otherwise.

Possibly.  Worth revisiting when we add SIMD (which is long-tail for the baseline jit, since not in MVP).

> @@ +1089,5 @@
> > +        for (size_t i = stk_.length(); i > 0; i--) {
> > +            // Memory opcodes are first in the enum, single check against MemLast is fine.
> > +            Stk::Kind kind = stk_[i-1].kind;
> > +            if (kind <= Stk::MemLast)
> > +                return false;
> 
> Don't we want to continue, in this case? (unless there's an invariant that
> elements with Mem kind are positioned before the Local kind elements in stk_)

I think we covered that on IRC, if we're scanning down the stack from the top and we get to any Mem opcode we know that the stack has been pushed up to that point, hence there won't be any Local slots below it and the search can stop.

There are various opportunities for optimizing sync().  The current invariant is simple, and hence reduces compile time, but the generated code might suffer as a result.

> @@ +1141,5 @@
> > +
> > +    // Push the value onto the stack.
> > +
> > +    void pushI32(int32_t v) {
> > +        Stk& x = push();
> 
> I don't know if this matters, but if the stack already contains a constant,
> it doesn't need to be re-generated.

It's cheaper to regenerate it than to look for it, and it keeps the invariants much simpler.  Also remember the constant is never put in a register until it is popped.  So at that point, in order for there to be an optimization here, we'd have to note that the constant is in a register.

I think that a general optimization of "what's in a register now" would probably handle this too, and I hope to handle it that way eventually, for constants and locals.

> @@ +1645,5 @@
> > +
> > +        wasm::GenerateFunctionPrologue(masm, localSize_, mg_.funcSigIndex(func_.index()),
> > +                                       &compileResults_.offsets());
> > +
> > +        maxFramePushed_ = masm.framePushed();
> 
> Shouldn't this include sizeof(AsmJSFrame)?

I don't think so, the intent is for maxFramePushed_ only to track the value of framePushed(), and for all uses of maxFramePushed_ to make appropriate adjustments.  I will investigate a little deeper.

> @@ +1728,5 @@
> > +        // x86/x64 ABIs and is not live when we're in the function prologue.
> > +
> > +        masm.movePtr(masm.getStackPointer(), eax);
> > +        masm.subPtr(Imm32(maxFramePushed_ - (localSize_ + sizeof(AsmJSFrame))), eax);
> > +        masm.branchPtr(Assembler::AboveOrEqual,
> 
> If you invert the condition, you can avoid the unconditional jump thereafter
> and the stackOverflow label.

Good eye!  I had it that way initially, then I added some stuff for saving non-volatile registers so it got inverted, then Luke had me remove that again because it was redundant, but I forgot to change the condition back.

> @@ +1887,5 @@
> > +          case ValType::I32: {
> > +            ABIArg argLoc = call.abi_.next(MIRType::Int32);
> > +            if (argLoc.kind() == ABIArg::Stack) {
> > +#if defined(JS_CODEGEN_X64) || defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_ARM)
> > +                ScratchRegisterScope scratch(masm);
> 
> I guess that's something you've found out in the x86 patch, but there's no
> reason on x86 that ebx be not used at this point, so it might get clobbered.
> A comment for the x86 patch, though :-)

So... I have not had problems with this.  It is my understanding that x86 does not pass any arguments in registers, and we always -- currently, though that needs to be changed -- sync() before a call, so ebx is not on the value stack, not that it would be anyhow.

Are you thinking of something else?

> @@ +2727,5 @@
> > +      public:
> > +        OutOfLineTruncateFOrF64ToI32(AnyReg src, RegI32 dest) : src(src), dest(dest) {}
> > +
> > +        virtual void generate(MacroAssembler& masm) {
> > +            bool isAsmJS = true; // Wasm passes for AsmJS in this context
> 
> I think this is not correct, see below...
> 
> @@ +2742,5 @@
> > +    bool truncateF32ToI32(RegF32 src, RegI32 dest) {
> > +        OutOfLineCode* ool = addOutOfLineCode(new (alloc_) OutOfLineTruncateFOrF64ToI32(AnyReg(src), dest));
> > +        if (!ool)
> > +            return false;
> > +        masm.branchTruncateFloat32(src.reg, dest.reg, ool->entry());
> 
> Actually, AsmJS and Wasm don't have the same truncation semantics: asm.js
> does a truncation modulo the int32 range (try assertEq(2**32 + 1 | 0, 1) in
> a shell), but in Wasm, it's an invalid conversion. (See EmitTruncate in
> WasmIonCompile.cpp)
> 
> So you just need to use wasmTruncateF32ToInt32 or branchTruncateFloat32
> whether we're in wasm or asm.js, and have the correct OOL generated too.

OK, I will investigate.  Sort of sad if we don't have test cases that trigger this, will follow up on that too.

> @@ +3267,5 @@
> > +{
> > +    int32_t c;
> > +    if (popConstI32(c)) {
> > +        RegI32 r = popI32();
> > +        masm.add32(Imm32(c), r.reg);
> 
> (another optimization idea) Is it worth it for commutative instructions in
> general to also check if the constant isn't the just-below-top-of-stack
> value and commute them before trying to use an imm32?

It might be!  What I want to do is to run some analyses on wasm code, once we have a reasonable volume of it, to find out which optimizations might be beneficial.  Each test in the baseline compiler slows it down, so there's a trade-off.  And in these cases I would have expected the front-end to typically create Expr+Const rather than Const+Expr.  But I could be wrong.  A corpus analysis will tell me.

> @@ +6292,5 @@
> > +    // needs to also test single-low/single-high registers.
> > +
> > +#if defined(DEBUG) && (defined(JS_CODEGEN_X64) || defined(JS_CODEGEN_X32))
> > +    FloatRegister f = allocFPU<MIRType::Float32>();
> > +    MOZ_ASSERT(!isAvailable(f.asDouble()));
> 
> This makes me think that there are many FloatRegister comparisons
> (FLoatRegister1 == FloatRegister2). They compare the type of the content
> (single, double, simd), so maybe overriding the comparison operator here
> would lead to fewer stack spilling.

I don't understand that comment, but in general I am a little worried about the generality of the register allocation machinery in the AllocatableWhateverRegisterSets and what that costs us.  It has not shown up in my profiles so far but I've only profiled AngryBots.  It'll be interesting to pay attention to it over time though.  With 64-bit support on 32-bit platforms we'll be hitting that logic for GPRs too, I think.
(In reply to Benjamin Bouvier [:bbouvier] from comment #63)
> Comment on attachment 8761123 [details] [diff] [review]
> bug1232205-wasm-baseline-compiler-v4.patch
> 
> Review of attachment 8761123 [details] [diff] [review]:
> -----------------------------------------------------------------

> @@ +1242,5 @@
> > +        return r;
> > +    }
> > +
> > +    RegI32 popI32(RegI32 specific) {
> > +        Stk& v = stk_.back();
> 
> I wonder, can you do a popCopy here?

I don't think so.  Consider the entire function:

1    RegI32 popI32(RegI32 specific) {
2        Stk& v = stk_.back();
3        if (!(v.kind() == Stk::RegisterI32 && v.i32reg() == specific)) {
4            needI32(specific);
5            popI32(v, specific);
6            if (v.kind() == Stk::RegisterI32)
7                freeI32(v.i32reg());
8        }
9        stk_.popBack();
A        return specific;
B    }

The call to popI32 on line 5 will look at the kind of v and move v's value into specific.  Suppose that initially the slot v is in a register but not equal to specific and that there are no more registers or that specific is already allocated.  In this case needI32() may sync the stack.  That will generally update the stack slots, and that update must be reflected in v for the popI32 on line 5 to work.  Hence we need a reference here, not a copy.
(In reply to Lars T Hansen [:lth] from comment #64)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #63)
> > Comment on attachment 8761123 [details] [diff] [review]
> > bug1232205-wasm-baseline-compiler-v4.patch
> > 
> > Review of attachment 8761123 [details] [diff] [review]:
> > -----------------------------------------------------------------

> > @@ +1645,5 @@
> > > +
> > > +        wasm::GenerateFunctionPrologue(masm, localSize_, mg_.funcSigIndex(func_.index()),
> > > +                                       &compileResults_.offsets());
> > > +
> > > +        maxFramePushed_ = masm.framePushed();
> > 
> > Shouldn't this include sizeof(AsmJSFrame)?
> 
> I don't think so, the intent is for maxFramePushed_ only to track the value
> of framePushed(), and for all uses of maxFramePushed_ to make appropriate
> adjustments.  I will investigate a little deeper.

This code is correct, but the value was being used ineptly.  Thanks!
(In reply to Benjamin Bouvier [:bbouvier] from comment #63)
> Comment on attachment 8761123 [details] [diff] [review]
> bug1232205-wasm-baseline-compiler-v4.patch
> 
> Review of attachment 8761123 [details] [diff] [review]:
> -----------------------------------------------------------------

> @@ +2727,5 @@
> > +      public:
> > +        OutOfLineTruncateFOrF64ToI32(AnyReg src, RegI32 dest) : src(src), dest(dest) {}
> > +
> > +        virtual void generate(MacroAssembler& masm) {
> > +            bool isAsmJS = true; // Wasm passes for AsmJS in this context
> 
> I think this is not correct, see below...
> 
> @@ +2742,5 @@
> > +    bool truncateF32ToI32(RegF32 src, RegI32 dest) {
> > +        OutOfLineCode* ool = addOutOfLineCode(new (alloc_) OutOfLineTruncateFOrF64ToI32(AnyReg(src), dest));
> > +        if (!ool)
> > +            return false;
> > +        masm.branchTruncateFloat32(src.reg, dest.reg, ool->entry());
> 
> Actually, AsmJS and Wasm don't have the same truncation semantics: asm.js
> does a truncation modulo the int32 range (try assertEq(2**32 + 1 | 0, 1) in
> a shell), but in Wasm, it's an invalid conversion. (See EmitTruncate in
> WasmIonCompile.cpp)
> 
> So you just need to use wasmTruncateF32ToInt32 or branchTruncateFloat32
> whether we're in wasm or asm.js, and have the correct OOL generated too.
> 
> @@ +2752,5 @@
> > +    bool truncateF64ToI32(RegF64 src, RegI32 dest) {
> > +        OutOfLineCode* ool = addOutOfLineCode(new (alloc_) OutOfLineTruncateFOrF64ToI32(AnyReg(src), dest));
> > +        if (!ool)
> > +            return false;
> > +        masm.branchTruncateDouble(src.reg, dest.reg, ool->entry());
> 
> ditto

Forked as bug 1279876, since we have no test cases that expose the bug and it could look like some refactoring work is needed to share code with the back-ends.
(In reply to Lars T Hansen [:lth] from comment #67)
> Full try run:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f8d799fa607

Errors for Linux-64 Opt SM-tc "cgc" and "r" builds and Windows 8 plain shell builds, all along these lines:

Assertion failure: CurrentJitContext(), at c:/builds/moz2_slave/try_w64-d_sm-plaindebug-000000/src/js/src/jit/Ion.cpp:86
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\basic-conversion.js | Assertion failure: CurrentJitContext(), at c:/builds/moz2_slave/try_w64-d_sm-plaindebug-000000/src/js/src/jit/Ion.cpp:86 (code -2147483645, args "--wasm-always-baseline")
"Once more unto the breach, dear friends, once more; Or close the wall up with our English dead."

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4044abf559fb
https://hg.mozilla.org/integration/mozilla-inbound/rev/893294e2a38721cca2347f567eb6776d4ab21bfb
Bug 1232205 - Wasm baseline: infrastructure.  r=bbouvier, r=luke

https://hg.mozilla.org/integration/mozilla-inbound/rev/5d7856a5357f3caed90111a198406634031886d7
Bug 1232205 - Wasm baseline: Add AllSingleMask for x86, ARM, ARM64, and none. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/5b8a3144277a729820c941b31ca65ca29fec34aa
Bug 1232205 - Wasm baseline: Factor out int64ToFloatingPoint and truncateToInt64, on x64. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/6653d66ac84936566aee9d33360ca86f0dace319
Bug 1232205 - Wasm baseline: Factor out floating min and max, on x86-shared. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/9dc732679461f35a1a69edc3a02199db1a9adcb0
Bug 1232205 - Wasm baseline: Factor outOfLineWasmTruncateCheck and outOfLineTruncateSlow, on x86-shared. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/3ffd09eaea63fdeb9adba4cda00478dc1ebc2980
Bug 1232205 - Wasm baseline: Add wasm baseline logic to test driver. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/394f71580dca6026a38e64294eee2f8620d74f92
Bug 1232205 - Wasm baseline: Add wasm baseline directives to wasm tests. r=bbouvier

https://hg.mozilla.org/integration/mozilla-inbound/rev/bc6465bc355ff1a29016d7199c143a56c114625b
Bug 1232205 - Wasm baseline compiler. r=luke, r=bbouvier
This needs fuzzing.  Note, it is x64-only at this point.

Normally, the Ion backend compiles all wasm code.  To select the baseline compiler, run the shell with --wasm-always-baseline.
Flags: needinfo?(choller)
Whiteboard: [games:p1] → [games:p1][platform-rel-Games]
Flags: needinfo?(choller)
Depends on: 1281961
You need to log in before you can comment on or make changes to this bug.