Odin: Split parsing/validation from codegen

RESOLVED FIXED in Firefox 42

Status

defect
RESOLVED FIXED
Opened 4 years ago
Closed 3 years ago
People

(Reporter: bbouvier, Assigned: bbouvier)

Tracking

Version: Trunk
Target Milestone: mozilla42
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags: (firefox40 affected, firefox42 fixed)


Attachments

(11 attachments, 26 obsolete attachments)

- 4.13 KB, patch (luke: review+)
- 8.41 KB, patch (luke: review+)
- 6.09 KB, patch (luke: review+)
- 5.29 KB, patch (luke: review+)
- 13.76 KB, patch (luke: review+)
- 3.97 KB, patch (luke: review+)
- 7.74 KB, patch (luke: review+)
- 39.70 KB, patch (luke: review+)
- 198.97 KB, patch (luke: review+)
- 356 bytes, application/javascript
- 281.52 KB, patch (bbouvier: review+)
Odin currently parses, validates, emits MIR, and performs the final codegen step linearly. We could run parsing/validation on one thread and have another thread convert the AST into MIR and perform codegen (and use the remaining helper threads, when the number of cores is > 2, to execute Ion background compilation jobs). The net effect would be to pipeline parsing+validation with MIR generation and code generation, which are all currently performed on the same thread. This is a first step toward an asm.js baseline compiler, and it would reduce asm.js compile times on systems with 4 or more cores.
Duplicate of this bug: 959263
Random small cleanups.
Attachment #8598106 - Flags: review?(luke)
Posted patch wip.patch (obsolete) — Splinter Review
This is *very* WIP-y, but it should help validate the architecture. In its current state, it can compile functions of this form:

  function f() { return 2 * 4 | 0; }

(that is, using int32 literals, int32 mul, and ret)

Here are the newly introduced data structures:
- wasm::FunctionBuilder, which takes an asm.js module as an in-param and outputs a wasm::Function, described below.
- wasm::Function, which should contain the binary form of a Function, from the number of locals (NYI) to the compact form itself.
- FunctionCompiler can now read the binary form and then emit the associated MIR graph. For now, the FunctionCompiler contains the FunctionBuilder (for quick implementation purposes), but of course they'll be independent in the future, and ultimately the FunctionCompiler should just receive the wasm::Function from the FunctionBuilder.

Please don't mind the XXX / TODO items in the file, which are mostly reminders for myself.

The plan is the following:
- continue converting statements into the binary form.
- when that's done, replace the FunctionCompiler argument of all Check* functions with a FunctionBuilder.
- remove the MDefinition argument from all Check* functions.
- move the FunctionCompiler in the wasm:: namespace.

How does that look?
Attachment #8598113 - Flags: feedback?(luke)
Attachment #8598106 - Flags: review?(luke) → review+
Posted patch wip.part2.patch (obsolete) — Splinter Review
That's the diff between the wip.patch and the big folded one that I am going to post in a sec.
Posted patch folded.patch (obsolete) — Splinter Review
That's wip + wip part 2, but it's really big, so not asking for feedback on this one.
Comment on attachment 8598113 [details] [diff] [review]
wip.patch

The patch has changed a lot since this one, so no need to give feedback on it right now.
Attachment #8598113 - Attachment is obsolete: true
Attachment #8598113 - Flags: feedback?(luke)
Posted patch wip.patch (obsolete) — Splinter Review
Big WIP patch that I'll try to split into several patches later. All tests excluding SIMD / atomics / conditionals in if conditions seem to pass; testBullet doesn't pass yet. Need to investigate the testBullet issue before measuring overhead.
Attachment #8599970 - Attachment is obsolete: true
Attachment #8599972 - Attachment is obsolete: true
Posted patch wip.patch (obsolete) — Splinter Review
An updated patch that passes testBullet, which had two issues (one at compile time, easy to debug; the other a difference in behavior, harder to debug). I've added tests corresponding to these two issues, as they would have helped a lot with debugging.

Now, keeping on with ternary conditionals, as they're used in our benchmarks.
Attachment #8605530 - Attachment is obsolete: true
I've solved the latest issue (an issue with ternary conditionals, totally untested in our code base, so I've added a test), so I could benchmark a bit. I've run each benchmark only once, so there could be some noise. This isn't an apples-to-apples comparison, as the patched version doesn't have the ternary optimization yet.

Results seem to indicate that

1) either we stay in the noise neighborhood of the baseline (with some slowdowns, which I assume are caused by the lack of ternary optimizations -- although this hypothesis needs to be confirmed)
2) or there is some very *light* speedup (perhaps due to code locality? or noise).

| Benchmark                      | Baseline   | Patched   |
| ------------------------------ | ---------- | --------- |
| Massive (higher is better)     |            |           |
| Massive overall score          | 7,285      | 7,357     |
| main-thread-poppler-cold       | 6,129      | 6,322     |
| main-thread-poppler-warm       | 6,048      | 6,257     |
| box2d-throughput               | 30,448     | 30,821    |
| box2d-throughput-f32           | 34,106     | 33,322    |
| lua-binarytrees                | 17,102     | 16,434    |
| lua-scimark                    | 18,945     | 19,155    |
| poppler-throughput             | 22,626     | 22,835    |
| sqlite-throughput              | 12,767     | 13,063    |
| poppler-cold-preparation       | 704        | 725       |
| poppler-warm-preparation       | 3,350      | 3,436     |
| sqlite-cold-preparation        | 185        | 188       |
| sqlite-warm-preparation        | 4,008      | 4,132     |
| asmjs-apps (lower is better)   |            |           |
| box2d-loadtime                 | 202        | 213       |
| box2d-throughput               | 1309       | 1318      |
| bullet-loadtime                | 840        | 906       |
| bullet-throughput              | 1567       | 1611      |
| lua_binarytrees-loadtime       | 173        | 160       |
| lua_binarytrees-throughput     | 1047       | 1066      |
| zlib-loadtime                  | 200        | 204       |
| zlib-throughput                | 1272       | 1334      |
| asmjs-ubench (lower is better) |            |           |
| copy                           | 3322       | 3323      |
| corrections                    | 5122       | 5118      |
| fannkuch                       | 2430       | 2519      |
| fasta                          | 8279       | 8323      |
| life                           | 3145       | 2996      |
| memops                         | 4432       | 4476      |
| primes                         | 4981       | 5127      |
| skinning                       | 8740       | 8933      |
Posted patch wip.patch (obsolete) — Splinter Review
Attachment #8607491 - Attachment is obsolete: true
Just so I understand what's being measured: parsing/validation is generating an IR that is consumed on the same thread to do MIR-gen?  If so, that's fantastic since it means we can land w/o waiting for pipelining speedups.  I'm not super concerned with single-core devices, but do you get the same results for 2 and 4 hardware threads (fixing 'numCpus' and using 'taskset')?
(In reply to Luke Wagner [:luke] from comment #13)
> Just so I understand what's being measured: parsing/validation is generating
> an IR that is consumed on the same thread to do MIR-gen?
Yes, exactly.

> If so, that's
> fantastic since it means we can land w/o waiting for pipelining speedups.
> I'm not super concerned with single-core devices, but do you get the same
> results for 2 and 4 hardware threads (fixing 'numCpus' and using 'taskset')?
I'll do these measurements tomorrow.
Attachment #8598106 - Attachment description: cleanups.patch → 1 - Random cleanups.patch
When building the MIR graph for asm.js, we did some short-circuiting optimizations for ternaries that include constant values (e.g. x ? 1 : y is semantically equivalent to x || y, which can be short-circuited if x is true). This was before bhackett's work in bug 1028580, which optimized these short-circuiting ternaries as an optimization pass in Ion. So far, this optimization pass was disabled for asm.js, as Odin already had done the work when building the MIR graph.

However, with the work being done in this bug, doing the short-circuiting optimizations when decoding the function compact encoding is tedious.

In the meantime, benchmarking (asmjs-apps / asmjs-ubench) shows that disabling the short-circuiting optimizations in Odin while enabling the FoldTest pass all the time yields almost the same results (within noise levels, ~2%). So let's just do this here.

Benchmarks results (20-100 runs).
Legend: "benchmark_name: baseline / with_patch (relative speedup)"

asmjs-apps
zlib-loadtime: 200.32 / 200.4 (-0.04%)
box2d-throughput: 1308.51 / 1293.45 (1.15%)
bullet-loadtime: 846.15 / 840.6 (0.66%)
zlib-throughput: 1283.5 / 1276.2 (0.57%)
lua_binarytrees-loadtime: 151.75 / 152.6 (-0.56%)
lua_binarytrees-throughput: 1039.45 / 1018.25 (2.04%)
bullet-throughput: 1534.77 / 1520.5 (0.93%)
box2d-loadtime: 196.12 / 197.55 (-0.73%)

asmjs-ubench
life: 3166.66 / 3071.0 (3.02%)
memops: 4457.49 / 4408.0 (1.11%)
fannkuch: 2434.38 / 2395.5 (1.6%)
mandelbrot-polyfill: 413.42 / 421.0 (-1.83%)
skinning: 8806.53 / 9020.0 (-2.42%)
fbirds-native: 98.01 / 102.5 (-4.58%)
fbirds-polyfill: 288.19 / 287.5 (0.24%)
mandelbrot-native: 310.43 / 311.0 (-0.18%)
corrections: 5115.75 / 5065.0 (0.99%)
fasta: 8233.16 / 8209.0 (0.29%)
copy: 3343.36 / 3285.0 (1.75%)
primes: 5024.45 / 4956.0 (1.36%)
Attachment #8612908 - Flags: review?(luke)
Attachment #8612908 - Flags: review?(luke) → review+
Posted patch all.patch (obsolete) — Splinter Review
So that's the big patch, not split yet (I plan to do so tomorrow), if you want to have a first look.

Here are a few hints for understanding the changes:
- FunctionCompiler has been split into FunctionBuilder (which is used during encoding) and FunctionCompiler (which is used during decoding).
- Each CheckX function does the type checking and adds the compact encoded form to the code stream. For most functions, next to each CheckX function there's the equivalent EmitX function, which consumes the compact form from the code stream. That helps reasoning about the layout and operands order, in general. There's a nice symmetry between Check/Emit even with respect to their arguments: first arg of CheckX is the FunctionBuilder, first arg of EmitX is the FunctionCompiler.
- FunctionCompiler used ParseNode* at several places as map keys or set values. At all these places, we needed to ensure that the key was unique and monotonically increasing with the position in the code stream, so I've simply used the position itself in all these places.

The patch isn't final yet (there are a few TODOs and FIXMEs still in the code, but most relate to the metadata debug information, mainly used for retrieving the source coordinates from the position in the code stream).
Attachment #8611249 - Attachment is obsolete: true
Attachment #8613688 - Flags: feedback?(luke)
Posted patch all.patch (obsolete) — Splinter Review
There was a slight issue with the patch above: a For loop was basically a condition and a maybe<increment> (so 2 opcodes to distinguish between them), as the statement in the init part could stand as a statement of its own. However, this doesn't mix well with Blocks, which expect a fixed number of statements (and this init statement wouldn't be counted in the statements count, hence the issue). Fortunately, asm.js/simd-fbirds had such a control flow example (a block containing a for loop). I've isolated and reduced this test case in testControlFlow as well.

We now have 4 opcodes for For loops: (with or without init statement) x (with or without increment statement), but this is temporary, as I plan to rewrite the encoding of all loops with a "forever" opcode.

Also: I've added a debug-only opcode, DebugCheckPoint, that's inserted after a for loop and at the end of a block, so that we can ensure basic consistency of the function encoding. If this debugging approach proves useful, it can be extended to other control flow statements.
Attachment #8613688 - Attachment is obsolete: true
Attachment #8613688 - Flags: feedback?(luke)
Attachment #8614250 - Flags: feedback?(luke)
Comment on attachment 8614250 [details] [diff] [review]
all.patch

I am currently splitting the patch.
Attachment #8614250 - Flags: feedback?(luke)
Posted patch 3. Folded patch (obsolete) — Splinter Review
Folded patch, for getting a global view of how components interact.
Attachment #8614250 - Attachment is obsolete: true
Let's take it easy.
Attachment #8615632 - Flags: review?(luke)
When building the compact format from functions, we now identify functions / globals / function tables by a vector index in the ModuleCompiler, so all three of these structures need to be stored in vectors inside the MC. This patch contains all changes to the MC.
Attachment #8615637 - Flags: review?(luke)
Posted patch 3.3. OpcodesSplinter Review
Opcodes. Note that I will implement (later) the For / While loops as Forever loops (I wanted to start with something as close as possible to the current implementation, to make sure that there won't be any behavior differences). So among the weirdos:

- Id: the encoding often applies the following scheme:
  - put a temporary opcode that can be patched
  - encode the sub-expression
  - deduce from the sub-expression the type of the operation and then patch the temporary opcode.
  However, for coercions e.g., we sometimes put a temporary opcode although we didn't need one (typical example: +(+f()); the second + is useless), so we just pass the current definition, hence the need for Id.
  Note that Stmt::Id is needed when we call CoerceResult, expecting void and the actual retType is void (this only happens for Atomics Fence).

- As loops aren't implemented via a Forever opcode (yet), we need to make a distinction between the For loops: whether they do have an initializer or not; whether they do have an increment or not. This leads to 2 x 2 = 4 opcodes.

- This version has expression statements, for simplicity of implementation (I'll remove them in a follow-up, if that's fine).

- Stmt::Noop happens when we have an expression statement with nothing inside (only example: ;;;;;;;;;;;;). The caller pre-allocates the temporary opcode, but the expression statement is empty, so we need a no-op for that case.

- Stmt::DebugCheckPoint is an idea I've got for helping debugging (put checkpoints at some defined places in the code stream during encoding, and then check that they are at these positions when decoding). This helped me find one edge case in the implementation.
Attachment #8615649 - Flags: review?(luke)
Posted patch 3.4. Function object (obsolete) — Splinter Review
The Function object is the data passed from the FunctionBuilder (during encoding) to the FunctionCompiler (which will use it for decoding). It has a very simple interface for reading and writing.
Attachment #8615650 - Flags: review?(luke)
Posted patch 3.5. Function Builder (obsolete) — Splinter Review
The FunctionBuilder is the new first argument to all Check* functions, and it is used for encoding the asm.js code. As a result, it inherited a few functions from the FunctionCompiler that were used during parsing / validation.
Attachment #8615652 - Flags: review?(luke)
Posted patch 3.6. Function Compiler (obsolete) — Splinter Review
Notable changes to the FunctionCompiler:

- we don't use ParseNode* anymore when building new blocks. ParseNodes were used for retrieving the source coordinates and annotating the block; I do think that, using the position in the bytecode stream, we could retrieve the source coordinates via debug metadata. However, as this is a debug-only thing, just used for the (Linux) JIT perf support, maybe this can wait a bit.

- The LabeledBlockMap can't use a PropertyName* anymore as its key. Instead, a unique u32 identifier of the label is used as the key.
- Ditto for the LabeledBlockMap / NodeStack, which were using ParseNode* and now use the position in the bytecode (which has the property of uniquely identifying statements in a monotonically increasing fashion, which seems to be sufficient for the tasks these ADTs need to accomplish).

The FunctionCompiler is still used as an advanced data structure by the Emit* functions, which mirror the Check* functions by consuming the bytecode and emitting the MIR.
Attachment #8615662 - Flags: review?(luke)
Posted patch 3.7. Encoding and Decoding (obsolete) — Splinter Review
Unfortunately, this one is quite large and can't really be split. This is pretty mechanical: it uses the FunctionBuilder (Check* functions) and the FunctionCompiler (Emit* functions) to do the same task that was done before this patch series.
Attachment #8615663 - Flags: review?(luke)
Comment on attachment 8613700 [details]
MozReview Request: A few free tests.

A few free tests.
Attachment #8613700 - Attachment description: MozReview Request: Bug 1157624: WIP → MozReview Request: A few free tests.
Changes to ModuleCompiler
Function interface.
Posted file MozReview Request: FunctionBuilder. (obsolete) —
FunctionBuilder.
FunctionCompiler.
I've pushed to reviewboard to make the folded patch easier to understand, as mozreview does a good job of showing code that has moved and changes within a single line of code:

https://reviewboard.mozilla.org/r/9771/diff/2/#index_header
So lua_binarytrees doesn't like that exitsSignatures_ stores the address of the Signature that's been added in the ExitMap. Terrence told me on IRC that keys could move, so we actually need to make a copy of the signature before storing it in the exitsSignatures_ array.
Attachment #8615637 - Attachment is obsolete: true
Attachment #8615667 - Attachment is obsolete: true
Attachment #8615668 - Attachment is obsolete: true
Attachment #8615669 - Attachment is obsolete: true
Attachment #8615670 - Attachment is obsolete: true
Attachment #8615671 - Attachment is obsolete: true
Attachment #8615672 - Attachment is obsolete: true
Attachment #8615637 - Flags: review?(luke)
Attachment #8615699 - Flags: review?(luke)
I've compiled the whole browser with debug and optimized symbols, to see whether I would hit any assertion added in the decoding code. No assertions were hit when running Massive, Tappy Chicken and BananaBread. Tappy Chicken and BananaBread seem to have the same behavior with or without the patch. Moreover, all tests pass in our test suite. As a consequence, I consider that most behavior differences have been ruled out.

Benchmarking, now. I've tried the simplest setup (not modifying the core configuration in SM yet - though I've set the CPU scaling governors to "performance" at the OS level). One issue I had with benchmarking is that the OS detects that the CPUs get really hot, and then throttles down the CPU frequencies, leading to weird results. Fortunately, this happens both with and without the patch series applied, so we can assume both configurations are affected equally.

In a nutshell, for the asmjs-apps and asmjs-ubench benchmarks, there seems to be no difference above the level of noise. Massive shows weird differences for sqlite-warm-preparation (and others which get faster with the patch!), but my machine has so much variance that I'm not even sure it's due to my patch. I'll retest a few times.

For each line, we have:
BENCHMARK_NAME - avg(baseline scores) / avg(scores with patch applied) (relative speedup or slowdown)

AsmJS-Apps (10 runs each)
zlib-loadtime: 209.4 / 211.6 (-1.05%)
box2d-throughput: 1366.5 / 1355.4 (0.81%)
bullet-loadtime: 904.2 / 873.3 (3.42%)
zlib-throughput: 1370.2 / 1332.5 (2.75%)
lua_binarytrees-loadtime: 154.5 / 154.3 (0.13%)
lua_binarytrees-throughput: 1067.8 / 1129.9 (-5.82%)
bullet-throughput: 1644.9 / 1587.2 (3.51%)
box2d-loadtime: 208.7 / 197.1 (5.56%)

AsmJS-Apps (20 runs each)
zlib-loadtime: 211.15 / 208.6 (1.21%)
box2d-throughput: 1361.0 / 1312.7 (3.55%)
bullet-loadtime: 891.95 / 857.75 (3.83%)
zlib-throughput: 1338.15 / 1288.5 (3.71%)
lua_binarytrees-loadtime: 160.9 / 158.7 (1.37%)
lua_binarytrees-throughput: 1074.95 / 1043.15 (2.96%)
bullet-throughput: 1617.75 / 1550.05 (4.18%)
box2d-loadtime: 207.45 / 203.6 (1.86%)

AsmJS-ubench (10 runs each)
life: 3150.0 / 3185.5 (-1.13%)
memops: 4526.41666667 / 4621.58333333 (-2.1%)
fannkuch: 2478.91666667 / 2508.25 (-1.18%)
mandelbrot-polyfill: 425.5 / 423.916666667 (0.37%)
skinning: 8805.75 / 8945.33333333 (-1.59%)
fbirds-native: 98.8333333333 / 103.833333333 (-5.06%)
fbirds-polyfill: 290.333333333 / 308.25 (-6.17%)
mandelbrot-native: 310.083333333 / 312.5 (-0.78%)
corrections: 5134.5 / 5220.16666667 (-1.67%)
fasta: 8316.91666667 / 8293.41666667 (0.28%)
copy: 3353.41666667 / 3522.58333333 (-5.04%)
primes: 5119.25 / 5150.5 (-0.61%)

AsmJS-Ubench (20 runs)
life: 3163.1 / 3191.5 (-0.9%)
memops: 4521.95 / 4500.6 (0.47%)
fannkuch: 2491.95 / 2506.2 (-0.57%)
mandelbrot-polyfill: 425.8 / 419.7 (1.43%)
skinning: 8875.2 / 8885.7 (-0.12%)
fbirds-native: 103.5 / 103.7 (-0.19%)
fbirds-polyfill: 303.9 / 295.3 (2.83%)
mandelbrot-native: 314.8 / 311.05 (1.19%)
corrections: 5148.1 / 5160.7 (-0.24%)
fasta: 8270.05 / 8250.85 (0.23%)
copy: 3351.25 / 3411.55 (-1.8%)
primes: 5085.7 / 5056.1 (0.58%)

Massive (2 runs each)
box2d-throughput: 30290.5 / 28625.5 (5.5%)
box2d-throughput-f32: 33498.5 / 32212.0 (3.84%)
box2d-variance: 10000.0 / 10000.0 (0.0%)
lua-binarytrees: 17199.5 / 15729.5 (8.55%)
lua-scimark: 18435.0 / 19997.5 (-8.48%)
main-thread-poppler-cold: 5933.0 / 6051.5 (-2.0%)
main-thread-poppler-warm: 5942.0 / 6086.0 (-2.42%)
main-thread-sqlite-cold: 10000.0 / 10000.0 (0.0%)
main-thread-sqlite-warm: 10000.0 / 10000.0 (0.0%)
massive-total: 7163.5 / 7199.0 (-0.5%)
poppler-cold-preparation: 690.5 / 692.0 (-0.22%)
poppler-throughput: 22511.5 / 22011.5 (2.22%)
poppler-variance: 10000.0 / 10000.0 (0.0%)
poppler-warm-preparation: 3171.0 / 3367.5 (-6.2%)
sqlite-cold-preparation: 181.5 / 181.5 (0.0%)
sqlite-throughput: 12785.5 / 12827.5 (-0.33%)
sqlite-warm-preparation: 3708.5 / 4095.5 (-10.44%)
Comment on attachment 8615632 [details] [diff] [review]
3.1.tests.patch

Always a good start.
Attachment #8615632 - Flags: review?(luke) → review+
Comment on attachment 8615649 [details] [diff] [review]
3.3. Opcodes

Review of attachment 8615649 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +2418,5 @@
>  /*****************************************************************************/
>  
>  namespace {
>  
> +enum class WasmType : uint8_t {

How about AsmType?  Later we could rename 'Type' to 'AsmJSType'.

@@ +2421,5 @@
>  
> +enum class WasmType : uint8_t {
> +    Int,
> +    Float,
> +    Double,

How about Int32/Float32/Float64?

@@ +2451,5 @@
> +    BreakLabel,
> +
> +    CallInt,
> +    CallInd,
> +    CallImp,

(I know this is my fault, but) how about CallInternal/CallIndirect/CallImport/GetLocal/GetGlobal/Literal?
Attachment #8615649 - Flags: review?(luke) → review+
Comment on attachment 8615650 [details] [diff] [review]
3.4. Function object

Review of attachment 8615650 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +2782,5 @@
> +
> +class Function
> +{
> +    typedef Vector<uint8_t, 4096> Bytecode;
> +    Bytecode bytecode_;

What is the lifetime of a Function?  Is it short-lived (as in, there is only 1 of them alive at a time)?  When this is done in parallel, is the expectation for there to be N alive (where N is roughly # threads)?

Another question: Function is stored in a Maybe<Function> in FunctionCompiler.  Since the Vector is stored by value, this means a lot of copying when the Vector length is < 4096.  Is there any way to create the Function as a plain object on the stack and pass it in by & first to the FunctionBuilder and then by const& to the FunctionCompiler?
(In reply to Luke Wagner [:luke] from comment #42)
> Comment on attachment 8615650 [details] [diff] [review]
> 3.4. Function object
> 
> Review of attachment 8615650 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> What is the lifetime of a Function?  Is it short-lived (as in, there is only
> 1 of them alive at a time)?  When this is done in parallel, is the
> expectation for there to be N alive (where N is roughly # threads)?
The Function is the piece of data given by the FunctionBuilder (parsing) to the FunctionCompiler (creating the MIR graph). While we don't do this in parallel, it's short-lived. However, when it's done in parallel, each FunctionBuilder will create one Function, and each FunctionCompiler will consume one. If building is faster than compiling, Functions may even be added to a queue until one FunctionCompiler thread gets available.

> Another question: Function is stored in a Maybe<Function> in
> FunctionComiler.  Since the Vector is stored by value, this means a lot of
> copying when the Vector length is < 4096.  Is there any way to create the
> Function as a plain object on the stack and pass it in by & first to the
> FunctionBuilder and then by const& to the FunctionCompiler?
Copying will only happen when the Vector length is < 1024, which is the current limit for inline storage in a Vector. No problem with using a plain object on the stack and passing it by reference to the other structures, though using the Maybe pattern seemed to pave the way for parallelization.
(In reply to Luke Wagner [:luke] from comment #41)
> Comment on attachment 8615649 [details] [diff] [review]
> 3.3. Opcodes
> @@ +2451,5 @@
> > +    BreakLabel,
> > +
> > +    CallInt,
> > +    CallInd,
> > +    CallImp,
> 
> (I know this is my fault, but) how about
> CallInternal/CallIndirect/CallImport/GetLocal/GetGlobal/Literal?

Done! I've also renamed Cond -> Conditional, for consistency.
As Function was ambiguous in the global scope (and thus couldn't be used as an argument type), I've chosen to rename it as AsmFunction (in parallel to AsmType). This also const-ifies all the read-only functions.
Attachment #8615650 - Attachment is obsolete: true
Attachment #8615650 - Flags: review?(luke)
Attachment #8616696 - Flags: review?(luke)
FB is now receiving its AsmFunction by reference.
Attachment #8615652 - Attachment is obsolete: true
Attachment #8615652 - Flags: review?(luke)
Attachment #8616697 - Flags: review?(luke)
... and FC by const reference
Attachment #8615662 - Attachment is obsolete: true
Attachment #8615662 - Flags: review?(luke)
Attachment #8616698 - Flags: review?(luke)
Posted patch 3.7. Encoding and Decoding (obsolete) — Splinter Review
Attachment #8615663 - Attachment is obsolete: true
Attachment #8615663 - Flags: review?(luke)
Attachment #8616699 - Flags: review?(luke)
Posted patch 3. Folded patch (obsolete) — Splinter Review
Attachment #8613700 - Attachment is obsolete: true
Attachment #8615631 - Attachment is obsolete: true
New results for Massive (with all threads, again), adding to the ones in comment #39.

There might be something wrong with the two lua benchmarks; I'll make sure to check that the results are the same tomorrow. However, from one benchmarking series to the other, I don't get similar results at all. In this series, lua_binary is slower with the patch applied and lua_scimark is faster with the patch, while in comment #39 it was the exact opposite!

Alon, can you weigh in: are the Massive lua benchmarks expected to have that much variance?

poppler-variance:
    Average: 10000.0 / 10000.0 (0.0%)
    Median: 10000.0 / 10000.0 (0.0%)
        
box2d-throughput-f32:
    Average: 33301.83 / 33883.33 (-1.75%)
    Median: 33179.0 / 34341.0 (-3.5%)
        
box2d-throughput:
    Average: 30217.5 / 30165.5 (0.17%)
    Median: 30555.0 / 30294.0 (0.85%)
        
poppler-throughput:
    Average: 22089.33 / 22110.17 (-0.09%)
    Median: 22354.0 / 22098.0 (1.15%)
        
poppler-warm-preparation:
    Average: 3239.83 / 3284.0 (-1.36%)
    Median: 3260.0 / 3317.0 (-1.75%)
        
sqlite-warm-preparation:
    Average: 4025.0 / 4000.0 (0.62%)
    Median: 4057.0 / 3984.0 (1.8%)
        
lua-scimark:
    Average: 19025.83 / 18843.33 (0.96%)
    Median: 19615.0 / 18685.0 (4.74%)
        
main-thread-sqlite-cold:
    Average: 10000.0 / 10000.0 (0.0%)
    Median: 10000.0 / 10000.0 (0.0%)
        
lua-binarytrees:
    Average: 15550.17 / 16576.17 (-6.6%)
    Median: 15973.0 / 16894.0 (-5.77%)
        
box2d-variance:
    Average: 10000.0 / 10000.0 (0.0%)
    Median: 10000.0 / 10000.0 (0.0%)
        
massive-total:
    Average: 7180.83 / 7227.83 (-0.65%)
    Median: 7184.0 / 7260.0 (-1.06%)
        
sqlite-cold-preparation:
    Average: 184.0 / 185.0 (-0.54%)
    Median: 184.0 / 185.0 (-0.54%)
        
main-thread-poppler-cold:
    Average: 6093.0 / 6073.83 (0.31%)
    Median: 6073.0 / 6148.0 (-1.23%)
        
main-thread-poppler-warm:
    Average: 5870.83 / 5946.17 (-1.28%)
    Median: 5898.0 / 6028.0 (-2.2%)
        
poppler-cold-preparation:
    Average: 694.67 / 698.83 (-0.6%)
    Median: 698.0 / 703.0 (-0.72%)
        
main-thread-sqlite-warm:
    Average: 10000.0 / 10000.0 (0.0%)
    Median: 10000.0 / 10000.0 (0.0%)
        
sqlite-throughput:
    Average: 12811.83 / 12844.33 (-0.25%)
    Median: 12982.0 / 12970.0 (0.09%)
Flags: needinfo?(azakai)
They are deterministic and run in a worker, so they should have fairly low variance. Still, I see around 1-2% of noise on them on my machine, so definitely they do have some variance. But the noise you report sounds significantly higher than I would expect. How many runs did you do to compute the mean and median in those? How much variance do you see with and without the patches?
Flags: needinfo?(azakai)
(In reply to Alon Zakai (:azakai) from comment #51)
> They are deterministic and run in a worker, so they should have fairly low
> variance. Still, I see around 1-2% of noise on them on my machine, so
> definitely they do have some variance. But the noise you report sounds
> significantly higher than I would expect. How many runs did you do to
> compute the mean and median in those? How much variance do you see with and
> without the patches?

This was run 6 times.

For what it's worth, I've run both lua benchmarks 50 times in shell mode, to see whether this reproduced, and the results are the following:

(legend is still before/after(relative difference))

binarytrees-runtime:
    Average: 4368.9 / 4305.33 (1.46%)
    Median: 4187.0 / 4164.0 (0.55%)
    Variance: 106144.46 / 88831.22 (16.31%)
    Standard deviation: 325.8 / 298.05 (8.52%)
        
scimark-runtime:
    Average: 10346.72 / 10242.7 (1.01%)
    Median: 10072.0 / 10140.0 (-0.68%)
    Variance: 393417.47 / 378799.08 (3.72%)
    Standard deviation: 627.23 / 615.47 (1.87%)
        
binarytrees-startup:
    Average: 119.3 / 118.45 (0.71%)
    Median: 119.0 / 118.0 (0.84%)
    Variance: 10.28 / 8.48 (17.51%)
    Standard deviation: 3.21 / 2.91 (9.35%)
        
scimark-startup:
    Average: 119.3 / 118.45 (0.71%)
    Median: 119.0 / 118.0 (0.84%)
    Variance: 10.28 / 8.48 (17.51%)
    Standard deviation: 3.21 / 2.91 (9.35%)

This suggests my machine really does have too much variance, but the median and average seem to confirm there is no big difference. Moreover, the binarytrees results are the same with/without the patch, which I assume confirms that the same code paths are taken. I'll check the compilation time numbers to be sure, but otherwise I think we're good to go.
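For reference, the summary statistics quoted above (average, median, variance, standard deviation) can be computed from raw per-run timings with a small script along these lines. This is just a sketch; the sample timings below are made up for illustration, not the actual benchmark data:

```javascript
// Sketch: compute average, median, population variance and standard
// deviation from a list of per-run timings (in ms).
function summarize(runs) {
  const n = runs.length;
  const average = runs.reduce((a, b) => a + b, 0) / n;
  const sorted = [...runs].sort((a, b) => a - b);
  const median = n % 2 ? sorted[(n - 1) / 2]
                       : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
  // Population variance: plain average of squared deviations.
  const variance = runs.reduce((a, b) => a + (b - average) ** 2, 0) / n;
  return { average, median, variance, stddev: Math.sqrt(variance) };
}

// Hypothetical timings, not the actual run data from this bug.
const stats = summarize([4187, 4164, 4300, 4900, 4100]);
```

The "relative difference" columns above are then just `(before - after) / before`, computed per statistic.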
(In reply to Luke Wagner [:luke] from comment #13)
> I'm not super concerned with single-core devices, but do you get the same
> results for 2 and 4 hardware threads (fixing 'numCpus' and using 'taskset')?

Methodology: 5 runs, take the median value, run with taskset + add the --threads-count argument to the js shell, measure compilation times on Poppler/Sqlite (longest benchmarks to compile).

8 cores:
- Poppler:
  - before: 370ms
  - after: 360ms (!)
- Sqlite:
  - before: 1732ms
  - after: 1753ms

4 cores:
- Poppler:
  - before: 362ms
  - after: 368ms
- Sqlite:
  - before: 1806ms
  - after: 1825ms

2 cores:
- Poppler:
  - before: 658ms
  - after: 657ms (!)
- Sqlite:
  - before: 2523ms
  - after: 2570ms

1 core:
- Poppler:
  - before: 959ms
  - after: 970ms
- Sqlite:
  - before: 3818ms
  - after: 3752ms (!)


So there seems to be a slowdown in most cases (if we ignore the cases marked with (!)), but it seems fairly low overall, in the range of noise, so things look good so far.
Diff: the Emit*Expr functions now contain plain switches that just redirect from an opcode to a specific Emit function, removing the need for braces around the switch cases (overall, 100 deletions).
Attachment #8616699 - Attachment is obsolete: true
Attachment #8616699 - Flags: review?(luke)
Attachment #8622869 - Flags: review?(luke)
Posted patch 3. Folded patch (obsolete) — Splinter Review
Attachment #8616700 - Attachment is obsolete: true
Benjamin, just one thing to note: Massive results are a score and not a time, right?
So higher is better, and in all your posts you seem to think the opposite.
(In reply to Guilherme Lima from comment #56)
> Benjamin, just one thing to note: Massive results are a score and not a time,
> right?
> So higher is better, and in all your posts you seem to think the opposite.

That's right; I was confused because asmjs-apps and asmjs-ubench work the opposite way. Also, it doesn't alter the results much: instead of small speedups we have small slowdowns (and conversely), but nothing really significant.
Comment on attachment 8615699 [details] [diff] [review]
3.2. Changes to ModuleCompiler

Review of attachment 8615699 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +1166,5 @@
>                  Type::Which type_;
> +                // The AsmJSModule splits variables indexes into two distinct
> +                // spaces (SIMD and non-SIMD) while the ModuleCompiler doesn't
> +                // distinguish them.
> +                uint32_t moduleIndex_;

I forgot about this comment and was confused later as to why modules and compilers had separate index spaces.  How about naming it 'scalarOrSimdIndex_' here and in associated names?  With that, we probably don't need a comment.

@@ +1661,5 @@
>          if (GlobalMap::Ptr p = globals_.lookup(name))
>              return p->value();
>          return nullptr;
>      }
> +    const Global* lookupGlobal(uint32_t index) const {

For symmetry with ModuleCompiler::function(), can you rename this to 'global()'?

@@ +1747,5 @@
>      }
>      bool addGlobalVarImport(PropertyName* varName, PropertyName* fieldName, AsmJSCoercion coercion,
>                              bool isConst)
>      {
>          uint32_t index;

Similarly, could you rename scalarOrSimdIndex?

@@ +1932,5 @@
>          }
>          if (!module_->addExit(ffiIndex, exitIndex))
>              return false;
> +        Signature sigCopy(moduleLifo_);
> +        if (!sigCopy.copy(exitDescriptor.sig()))

Instead of making a copy, how about: allocating the Signature in the moduleLifo (like we do with Global) and having both exits_ and exitSignatures_ hold a Signature*?
Attachment #8615699 - Flags: review?(luke) → review+
Comment on attachment 8616696 [details] [diff] [review]
3.4. Function object

Review of attachment 8616696 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +2792,5 @@
> +    explicit AsmFunction(ExclusiveContext* cx) : bytecode_(cx) {}
> +
> +  private:
> +    AsmFunction(const AsmFunction&) = delete;
> +    AsmFunction(AsmFunction&& other) = delete;

I didn't think move ctors were generated automatically and thus didn't need to be deleted.

@@ +2800,5 @@
> +    template<class T> size_t writePrimitive(T v) {
> +        size_t writeAt = bytecode_.length();
> +        if (!bytecode_.growBy(sizeof(T)))
> +            return -1;
> +        memcpy(&bytecode_[writeAt], &v, sizeof(T));

How about:
  bytecode_.append(reinterpret_cast<uint8_t*>(&v), sizeof(v))
Also, for future reference, there is a growByUninitialized().
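For illustration, the append-style primitive writer being discussed here (and the patch32 helper below) can be sketched in JavaScript with a growable byte array and a DataView. This is only a sketch of the concept; the actual code is C++ using a Vector of bytes, and all names here are made up:

```javascript
// Sketch of an append-only bytecode writer: primitives are serialized
// as raw little-endian bytes at the end of a growing buffer, and the
// offset they were written at is returned so the slot can be patched
// later (cf. patch32 in the patch under review).
class BytecodeWriter {
  constructor() { this.bytes = []; }
  writeU8(v)  { return this.write(v, 1); }
  writeU32(v) { return this.write(v, 4); }
  write(v, size) {
    const writeAt = this.bytes.length;
    const buf = new DataView(new ArrayBuffer(4));
    buf.setUint32(0, v >>> 0, /* littleEndian = */ true);
    for (let i = 0; i < size; i++)
      this.bytes.push(buf.getUint8(i));
    return writeAt;
  }
  // Overwrite a previously written 32-bit slot in place.
  patch32(pc, v) {
    const buf = new DataView(new ArrayBuffer(4));
    buf.setUint32(0, v >>> 0, true);
    for (let i = 0; i < 4; i++)
      this.bytes[pc + i] = buf.getUint8(i);
  }
}

const w = new BytecodeWriter();
w.writeU8(0x01);                 // some opcode
const slot = w.writeU32(0);      // placeholder, patched below
w.patch32(slot, 0xdeadbeef);
```

The read side then walks the same buffer in order, which is why symmetric write*/read* naming (adopted later in this bug) reads better than pack*/read*.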

@@ +2870,5 @@
> +
> +    template<class T>
> +    void patch32(size_t pc, T i) {
> +        static_assert(sizeof(T) == sizeof(uint32_t),
> +                      "patch32 must be used with 32-bits wide types");

Why not just take uint32_t then?
Attachment #8616696 - Flags: review?(luke) → review+
Comment on attachment 8616697 [details] [diff] [review]
3.5. Function Builder

Review of attachment 8616697 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +2880,5 @@
>  };
> +
> +// Encapsulates the building of a wasm function from an asm.js function
> +// source code, packing the asm.js code into the wasm binary form that can
> +// be decoded and compiled with a FunctionCompiler.

For now (here and other places), let's not say it's the wasm binary format (it's not yet) or explicitly mention 'wasm' and just say 'asm bytecode' to refer to what is being generated and say 'asm' for the shared wasm/asm.js path.  In theory, we could interpret the bytecode (particularly if we switch to a postorder serialization).
Attachment #8616697 - Flags: review?(luke) → review+
Attachment #8616698 - Flags: review?(luke) → review+
Comment on attachment 8622869 [details] [diff] [review]
3.7. Encoding and Decoding

Great work on this!  I found it really easy to follow the code (even though I can't say I audited every single line).

r+ with the below list of comments addressed (written on a plane, hence no splinter).  I'll be on PTO for 3 weeks so if one of my comments is bogus, just skip it.  Given the extensive changes, it might be good to put up a final combined patch and needinfo? gkw for a round of fuzzing before landing.  I'd also run (in a debug build to catch asserts) all the major public Unity/Epic/AutoDesk/etc demos.  rs=me for simple fixes.
 - Implement TODOs or file bugs mentioned in TODO comment.
 - Now or later, it seems like FunctionBuilder should be scoped such that it is destroyed before FunctionCompiler is created (and its bits handed off to AsmFunction).
 - Could you line up the columns in the switch in EmitStatement?
 - No curlies around some single-line then branches (unless that style changed recently?  I hear Gecko style is coming...)
 - pack*/read* seem asymmetric; how about write*/read* or pack*/unpack*?

Btw, I think AsmJSValidate.cpp is getting big enough that, in a future patch, we should split it up into Check and Emit files.  This would require a new header to share things between these two files.  Trampolines would also be nice to move into a new file.
Attachment #8622869 - Flags: review?(luke) → review+
Thanks for the reviews!


(In reply to Luke Wagner [:luke] from comment #59)
> Comment on attachment 8616696 [details] [diff] [review]
> I didn't think move ctors were generated automatically and thus didn't need
> to be deleted.
The C++ reference says that a move ctor is generated automatically if you haven't defined the dtor, the copy ctor/assignment operator, or the move assignment operator, which is the case here. Even if it weren't generated anyway, explicit is better than implicit, IMO. I don't know if there's a policy in our codebase requiring move ctors to be deleted when they're not user defined, but in any case, we can get back to this later (if it matters).

> 
> How about:
>   bytecode_.append(reinterpret_cast<uint8_t*>(&v), sizeof(v))
> Also, for future reference, there is a growByUninitialized().
Cleaner, thanks!

> 
> @@ +2870,5 @@
> > +
> > +    template<class T>
> > +    void patch32(size_t pc, T i) {
> > +        static_assert(sizeof(T) == sizeof(uint32_t),
> > +                      "patch32 must be used with 32-bits wide types");
> 
> Why not just take uint32_t then?
It is actually used with int32 and uint32 in the gigantic patch.
(In reply to Luke Wagner [:luke] from comment #61)
> Comment on attachment 8622869 [details] [diff] [review]
> 3.7. Encoding and Decoding
> 
> Great work on this!  I found it really easy to follow the code (even though
> I can't say I audited every single line).
Thank you for this review!

> 
> r+ with the below list of comments addressed (written on a plane, hence no
> splinter).  I'll be on PTO for 3 weeks so if one of my comments is bogus,
> just skip it.  Given the extensive changes, it might be good to put up a
> final combined patch and needinfo? gkw for a round of fuzzing before
> landing.  I'd also run (in a debug build to catch asserts) all the major
> public Unity/Epic/AutoDesk/etc demos.  rs=me for simple fixes.
Good idea, I'll check all of this and request fuzzing. Already found a "simple fix" on try: the GC hazard analysis wasn't pattern-matching the CheckCallArgs function anymore, as its signature changed, so I had to update the devtools/rootAnalysis/annotations.js file to the new signature.

>  - Implement TODOs or file bugs mentioned in TODO comment.
Opened a dependent bug (these TODOs all relate to debug-only, perf (as in, the linux CLI tool) annotations that I doubt are actually being used).

>  - Now or later, it seems like FunctionBuilder should be scoped such that it
> is destroyed before FunctionCompiler is created (and this bits handed off
> moved to AsmFunction).
Right, that will pave the way for the parallelization work. Done.

>  - Could you line up the columns in the switch in EmitStatement?
Done.

>  - No curlies around some single-line then branches (unless that style
> changed recently?  I hear Gecko style is coming...)
Haha, quite a challenge without knowing which lines are to be changed! Found 3 occurrences in the patch, but still can't wait for clang-format to be used by default...

>  - pack*/read* seem asymmetric; how about write*/read* or pack*/unpack*?
I've chosen write/read ("pack" was used to help distinguish between the AsmFunction and the FunctionBuilder APIs, but it was actually more confusing than helpful).

> 
> Btw, I think AsmJSValidate.cpp is getting big enough that, in a future
> patch, we should split it up into Check and Emit files.  This would require
> a new header to share things between these two files.  Trampolines would
> also be nice to move into a new file.
I'll do this in future patches (when you'll be back from PTO, I guess).
Posted patch 3. Folded patch (obsolete) — Splinter Review
gkw and decoder, can you please fuzz this patch? It changes asm.js features only, so the dedicated asm.js fuzzer and differential fuzzing could prove useful.

This patch applies cleanly on mozilla-inbound f30e1413ebd1.
Attachment #8622871 - Attachment is obsolete: true
Flags: needinfo?(gary)
Flags: needinfo?(choller)
Attachment #8628182 - Flags: review+
And for what it's worth, the folded patch on try is green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ff2f1c630ab
Posted file unpack-shell.js
So all the demos seem to work, except the packed ones in luke's repo (AngryBotsPacked and PlatformerGamePacked); for those two, an assertion is triggered in ~FunctionCompiler. The fun thing is that neither the original unpacked asm.js file nor the asmjsunpacker-worker.js file triggers the assert (might this mean there is a different behavior here?). I had to make a script, executable in the JS shell, that unpacks the packed version; the resulting unpacked script does trigger the assertion. For the record, here's the shell script. I'll investigate tomorrow.
Posted patch 3. Folded patch (obsolete) — Splinter Review
Found the fix for the assertion triggered in the packed demos: the same label could be used for different independent blocks within a function, but labels were not deleted from the FunctionBuilder's map (label -> unique id), so a labelled break or continue inside a second block labelled with the same name would point to the first block:

function f() {
  a: do { break a; /* points to the expected block */ } while (0);
  a: do { break a; /* pointed to the wrong block */ } while (0);
}

I've added the fix (after checking a label statement, remove its entry in the label map), and folded it in the patch as part of the rs=luke trivial fixes :)
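For reference, the expected semantics can be checked directly in plain JavaScript (the trace array here is made-up instrumentation for illustration):

```javascript
// Each labelled break must target its own enclosing block, even when
// the same label name is reused for independent blocks -- the case the
// fix above addresses in the FunctionBuilder's label -> unique-id map.
function f() {
  var trace = [];
  a: do { trace.push("first"); break a; } while (0);
  a: do { trace.push("second"); break a; } while (0);
  return trace;
}
```

Both blocks should execute exactly once, so `f()` must visit "first" and then "second"; with the bug, the second `break a` resolved to the first block's id in the emitted bytecode.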

Still requesting fuzzing against this patch, if that's possible :)
This one cleanly applies on m-i 250855:c0f955560a47.
Attachment #8628182 - Attachment is obsolete: true
Attachment #8628813 - Flags: review+
Annnd another small issue found thanks to the packed games demos (thanks luke!). I am pondering a solution to make sure that these small differences don't show up anymore...

gkw, decoder: this patch applies cleanly on mozilla-inbound 250855:c0f955560a47, any chance you can fuzz against it, please?
Attachment #8628813 - Attachment is obsolete: true
Attachment #8628955 - Flags: review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #68)
> any chance you can fuzz against it, please?

Yes, I've started fuzzing on one of my Mac machines.
Depends on: 1180059
I haven't found anything interesting yet after one weekend's worth of fuzzing.
Flags: needinfo?(gary)
Attachment #8628955 - Flags: feedback+
Thank you gary! Last try run before landing (unless there are other issues, of course)...
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=c295b86bc3a2
Another try run (JS and jit-tests only), as I've hit many oranges, and the GC jit-tests don't want to work even after a few retriggers...
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3b69a2a0df2
Keywords: leave-open
The Jit2 crash in comment 71 was unrelated, as comment 72 shows. Landing.
Flags: needinfo?(choller)
https://hg.mozilla.org/mozilla-central/rev/dbab48c2e593
https://hg.mozilla.org/mozilla-central/rev/cd949a8d064c
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla42