Closed Bug 1554094 Opened 6 years ago Closed 5 years ago

Cranelift: Generate "movswl %regD, %regL" for sign extension (and small misc isel improvements)

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

RESOLVED FIXED

People

(Reporter: jseward, Assigned: bbouvier)

Wasm-via-Cranelift on Intel generates 16-to-32-bit signed widening thusly

0xC3B5C31A2D6:  shll 0x10:I8, %edi
0xC3B5C31A2D9:  sarl 0x10:I8, %edi

when in the same situation Ion generates

0x2C78D0217EA5:  movswl %r8w, %r8d

and this is visible in some of the hottest blocks of our interim CL benchmark set.

I haven't yet looked at where the double-shift comes from. It may exist in the input Wasm, or it might be that CL doesn't have the reg->reg form of movswl, in which case the double shift is created during legalisation. I do observe that CL can create the mem->reg form of movswl, though.

Similarly

0x2A02DA2D275E:  shll 0x18:I8, %eax
0x2A02DA2D2761:  sarl 0x18:I8, %eax

instead of movsbl. Seen in the hottest block of wasm_lua_binarytrees.c.js.
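For concreteness, a small Rust sketch (demonstration only, not Cranelift code; the function names are mine) showing that the shift pair and the single sign-extending move compute the same value:

```rust
// Demonstration only: a shll/sarl pair by 16 and a single movswl both
// sign-extend the low 16 bits of a 32-bit register.
fn widen_via_shifts(x: i32) -> i32 {
    (x << 16) >> 16 // shll $16 then sarl $16 (arithmetic right shift)
}

fn widen_via_movswl(x: i32) -> i32 {
    x as i16 as i32 // what "movswl %regW, %regD" computes
}

fn main() {
    for x in [0, 1, -1, 0x7fff, 0x8000, 0xffff, 0x1234_5678] {
        assert_eq!(widen_via_shifts(x), widen_via_movswl(x));
    }
    println!("ok");
}
```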

Blocks: cranelift-x64-parity
No longer blocks: cranelift

On investigation, it looks as if the CL pipeline can handle sign-extends fine as a single instruction, all the way down to a reg-to-reg movsXY on x86. However, the wasm 1.0 spec only has i64.extend_s/i32, so 1.0-compliant producers have to synthesise the other widenings from a double shift, and indeed that's visible in fragments of the zlib.c benchmark:

(i32.shr_s
   (i32.shl
      (i32.and (get_local $3) (i32.const 255))
      (i32.const 24)
   )
   (i32.const 24)
)

and I'd furthermore guess that Ion rewrites that into a single instruction at some point, since those double-shifts don't appear in the Ion output. So we'd need something equivalent, early in the CL pipeline, maybe "preopt".
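As a sanity check (a sketch in Rust, not compiler code), the zlib fragment above really is equivalent to a single sign-extension of the low byte:

```rust
// The wasm fragment:
// (i32.shr_s (i32.shl (i32.and x (i32.const 255)) (i32.const 24)) (i32.const 24))
fn zlib_fragment(x: i32) -> i32 {
    ((x & 255) << 24) >> 24 // mask low byte, shift to top, arithmetic shift back
}

// What a single movsbl computes: sign-extend the low 8 bits.
fn movsbl(x: i32) -> i32 {
    x as i8 as i32
}

fn main() {
    for x in [0, 1, 127, 128, 255, -1, 0x1234_5680] {
        assert_eq!(zlib_fragment(x), movsbl(x));
    }
    println!("ok");
}
```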

> and I'd furthermore guess that Ion rewrites that into a single instruction at some point [..]

nbp confirms this: https://searchfox.org/mozilla-central/source/js/src/jit/MIR.cpp#2793

Priority: -- → P3
Summary: Wasm via Cranelift: generate "movswl %regD, %regL" → Cranelift: Generate "movswl %regD, %regL" for sign extension

Taking, since I've fixed this in one commit of this PR.

The rule seems to be that if there's a left-shift then an arithmetic right-shift by N bits, on a GPR that can contain M bits, and the shift immediate is M - N for both instructions, then the pair can be rewritten as a single sign-extending move. This is what the PR implements, and no tests fail with it, so I'm pretty confident.
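The matching condition can be sketched as follows (my own formulation of the rule, not the PR's actual code):

```rust
/// If an M-bit value is shifted left then arithmetic-shifted right by the
/// same immediate S, the pair is a sign extension from (M - S) bits.
/// Returns the source width of the equivalent sign-extending move, if any.
fn sext_source_width(m: u32, shl_amt: u32, sshr_amt: u32) -> Option<u32> {
    if shl_amt == sshr_amt && shl_amt > 0 && shl_amt < m {
        Some(m - shl_amt)
    } else {
        None
    }
}

fn main() {
    assert_eq!(sext_source_width(32, 16, 16), Some(16)); // movswl: 16 -> 32
    assert_eq!(sext_source_width(32, 24, 24), Some(8));  // movsbl: 8 -> 32
    assert_eq!(sext_source_width(32, 16, 24), None);     // amounts differ
    println!("ok");
}
```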

Other small instruction selection improvements in this PR:

  • use the immediate variant of adjust_sp_down whenever possible.
  • fold chains of binary_opcode_imm together. Say we have add_imm(add_imm(x, 42), 58); this can be folded into a single add_imm(x, 100).
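The add_imm fold in the second bullet might look like this (a schematic sketch with made-up node types, not the PR's code; wrapping arithmetic matches two's-complement hardware behaviour):

```rust
// Schematic IR: either an opaque value or add_imm(node, constant).
#[derive(Debug, PartialEq)]
enum Node {
    Value(&'static str),
    AddImm(Box<Node>, i32),
}

// Fold nested immediate adds: add_imm(add_imm(x, a), b) -> add_imm(x, a + b).
fn fold(n: Node) -> Node {
    match n {
        Node::AddImm(inner, b) => match fold(*inner) {
            Node::AddImm(x, a) => Node::AddImm(x, a.wrapping_add(b)),
            other => Node::AddImm(Box::new(other), b),
        },
        other => other,
    }
}

fn main() {
    let x = || Box::new(Node::Value("x"));
    let chain = Node::AddImm(Box::new(Node::AddImm(x(), 42)), 58);
    assert_eq!(fold(chain), Node::AddImm(x(), 100));
    println!("ok");
}
```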
Assignee: nobody → bbouvier
Status: NEW → ASSIGNED
Type: defect → enhancement
Summary: Cranelift: Generate "movswl %regD, %regL" for sign extension → Cranelift: Generate "movswl %regD, %regL" for sign extension (and small misc isel improvements)

Benchmark results don't seem to show much improvement; sometimes there are even slowdowns:

# mode=cranelift+cranelift, runs=5, problem size=4
# Lower score is better
box2d           5489    5330    1.03    [0.964, 1.007]  [0.997, 1.023]  [5290, 5530]    [5312, 5454]
bullet          7358    7300    1.008   [0.997, 1.037]  [0.99, 1.009]   [7338, 7632]    [7228, 7368]
conditionals    2030    2016    1.007   [0.997, 1.0]    [0.994, 1.003]  [2023, 2030]    [2003, 2023]
copy            3744    3724    1.005   [0.953, 1.002]  [0.999, 1.002]  [3568, 3750]    [3720, 3733]
corrections     4790    4965    0.965   [0.975, 1.01]   [0.997, 1.014]  [4671, 4838]    [4952, 5034]
fannkuch        2764    2729    1.013   [0.986, 1.0]    [0.996, 1.007]  [2724, 2765]    [2718, 2748]
fasta           4397    4485    0.98    [0.986, 1.04]   [0.966, 1.011]  [4335, 4575]    [4333, 4536]
fib             2806    2816    0.996   [0.999, 1.025]  [0.999, 1.032]  [2802, 2876]    [2814, 2906]
ifs             1031    1033    0.998   [0.997, 1.002]  [0.997, 1.003]  [1028, 1033]    [1030, 1036]
binarytrees     5206    5289    0.984   [0.985, 1.011]  [0.999, 1.003]  [5130, 5265]    [5284, 5304]
memops          8022    8236    0.974   [0.973, 1.001]  [0.999, 1.072]  [7806, 8028]    [8224, 8825]
primes          4983    5080    0.981   [0.997, 1.006]  [0.98, 1.021]   [4966, 5015]    [4977, 5185]
raybench        8192    8530    0.96    [0.99, 1.002]   [0.952, 1.031]  [8114, 8211]    [8119, 8792]
rust-fannkuch   35550   35458   1.003   [0.997, 1.002]  [0.995, 1.007]  [35446, 35608]  [35286, 35697]
skinning        4124    4109    1.004   [0.993, 1.009]  [0.998, 1.027]  [4096, 4162]    [4100, 4218]
zlib            7094    7051    1.006   [0.999, 1.007]  [0.993, 1.002]  [7085, 7146]    [7002, 7065]

(/me wondering if we should try to align loop headers on memory boundaries, to make results slightly more deterministic)

I'll try to use other Valgrind measurements, and check that the hot-blocks don't contain any of the patterns we'd like to avoid here.

Added a simple commit to fold x | 0 or x + 0 into x (both of these patterns show up a lot in Embenchen). Right now this is done with a copy instruction, which is unfortunate (it generates a move instruction); I'm tempted to add an id instruction for this purpose. I've added the simple folds I could think of; I'll look at the ones SpiderMonkey does to make sure I don't miss any.
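A quick soundness check for those identity folds (demonstration only, on concrete values rather than IR):

```rust
// x | 0 == x and x + 0 == x for every 32-bit value, so either pattern can
// be replaced by its non-constant operand outright (ideally without the
// extra copy/move mentioned above).
fn main() {
    for x in [i32::MIN, -1, 0, 1, 42, i32::MAX] {
        assert_eq!(x | 0, x);             // bor with 0 is the identity
        assert_eq!(x.wrapping_add(0), x); // add of 0 is the identity
    }
    println!("ok");
}
```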

I've run the benchmarks under Callgrind, to see the effect on the number of instructions / data reads / data writes. This table shows results for the benchmarks when run with load level 3:

                            Baseline                                           After optimization
                            inst count       data read       data write        inst count       data read       data write
fib.js                      2,113,411,409    821,466,450     469,356,770   :   2,113,411,405    821,466,446     469,356,770
raybench.js                 5,309,762,358    1,886,197,917   508,211,005   :   5,309,077,404    1,886,992,388   508,211,946
rust-fannkuch.js            2,274,344,633    946,585,543     278,509,109   :   2,274,344,696    946,585,600     278,509,109
wasm_box2d.js               6,160,563,357    2,271,481,867   515,276,017   :   6,116,848,070    2,271,478,607   515,269,396
wasm_bullet.js              9,902,744,891    3,784,539,475   765,786,581   :   9,576,466,079    3,784,859,012   765,786,599
wasm_conditionals.js        2,500,074,669    213,077,020     79,955,541    :   2,500,074,667    213,077,067     79,955,541
wasm_copy.js                5,502,935,020    438,306,820     310,901       :   5,502,935,048    438,306,901     310,901
wasm_corrections.js         4,516,149,387    217,554,053     311,168       :   4,446,149,381    217,554,126     311,168
wasm_fannkuch.js            2,543,080,545    846,805,569     162,748,921   :   2,543,073,903    846,806,813     162,748,921
wasm_fasta.js               5,519,295,263    825,276,824     144,053,166   :   5,519,612,013    825,910,246     144,369,834
wasm_ifs.js                 1,488,136,664    515,600,328     245,042,580   :   1,488,136,709    515,600,374     245,042,580
wasm_linpack_float.c.js     5,935,743,024    1,626,604,764   277,014,783   :   5,935,738,852    1,626,605,327   277,014,885
wasm_lua_binarytrees.c.js   5,767,590,432    2,301,008,525   846,898,221   :   5,683,362,574    2,324,225,936   867,183,689
wasm_lua_scimark.c.js       2,849,485,201    1,102,571,660   350,803,789   :   2,779,678,898    1,119,827,842   367,916,317
wasm_memops.js              13,424,643,915   2,097,944,988   419,739,659   :   13,424,643,920   2,097,945,060   419,739,659
wasm_primes.js              4,694,549,840    295,466,816     310,423       :   4,694,549,819    295,466,878     310,423
wasm_skinning.js            5,367,023,929    1,116,816,451   257,914,442   :   5,367,023,945    1,116,816,515   257,914,445
wasm_zlib.c.js              8,003,386,883    3,238,593,334   490,699,853   :   7,921,675,579    3,238,622,360   490,699,853

Or, in relative difference (values < 0.01% rounded to 0):

name            icount  d-read  d-writes
fib             0       0       0
raybench        0.02    0       0
rust-fannkuch   0       0       0
box2d           0.71    0       0
bullet          3.3     0       0
conditionals    0       0       0
copy            0       0       0
corrections     1.55    0       0
fannkuch        0       0       0
fasta           0       0       0
ifs             0       0       0
linpack_float.c 0       0       0
binarytrees.c   1.47    0       0
scimark.c       2.45    0       0
memops          0       0       0
primes          0       0       0
skinning        0       0       0
zlib.c          1.03    0       0
  • sometimes the number of executed instructions actually rises after the optimization. Not sure what could cause this.
  • the number of executed instructions may decrease while the number of data reads / writes increases. My theory is that the live ranges of some variables get extended, where before there were many smaller live ranges.
  • every time this happens, the number of writes increases much more than the number of reads, suggesting more spills that are never read back. It might just be another symptom of the weakness of our register allocator.
  • the increase in data reads / writes never exceeds 0.01% of total loads/stores, as the second table shows, so maybe there's nothing to worry about.
  • the increase in data traffic doesn't even correlate with the performance regression observed with the patch, since raybench / corrections (which regressed the most) don't show dramatic increases in instruction count.

This will probably prove more useful once register allocation improves.

This has landed with bug 1571464.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED