Cranelift: Generate "movswl %regD, %regL" for sign extension (and small misc isel improvements)
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
People
(Reporter: jseward, Assigned: bbouvier)
Wasm-via-Cranelift on Intel generates 16-to-32-bit signed widening thusly
0xC3B5C31A2D6: shll 0x10:I8, %edi
0xC3B5C31A2D9: sarl 0x10:I8, %edi
when in the same situation Ion generates
0x2C78D0217EA5: movswl %r8w, %r8d
and this is visible in some of the hottest blocks of our interim CL benchmark set.
I haven't yet looked at where the double-shift comes from. It may exist in the input Wasm, or it might be that CL doesn't have the reg->reg form of movswl, so the pair gets created during legalisation. I do observe that CL can create the mem->reg form of movswl, though.
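For reference, the two sequences really do compute the same thing. Here is a minimal standalone Rust check (illustration only, not Cranelift or SpiderMonkey code) that a shift-left / arithmetic-shift-right pair by 16 on a 32-bit value is exactly a 16-to-32-bit sign extension, i.e. what a single movswl produces:

```rust
// Illustration only: shll 16 / sarl 16 on a 32-bit value vs. a direct
// 16-to-32-bit sign extension (what `movswl` computes).
fn shift_pair_sext16(x: i32) -> i32 {
    (x << 16) >> 16
}

fn movswl_sext16(x: i32) -> i32 {
    x as i16 as i32
}

fn main() {
    for &x in &[0i32, 1, -1, 0x7fff, 0x8000, 0xffff, i32::MAX, i32::MIN] {
        assert_eq!(shift_pair_sext16(x), movswl_sext16(x));
    }
    println!("shift pair and sign extension agree on all samples");
}
```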
Reporter
Comment 1 • 6 years ago
Similarly
0x2A02DA2D275E: shll 0x18:I8, %eax
0x2A02DA2D2761: sarl 0x18:I8, %eax
instead of movsbl. Seen in the hottest block of wasm_lua_binarytrees.c.js.
Assignee
Updated • 6 years ago
Reporter
Comment 2 • 6 years ago
On investigation, it looks as if the CL pipeline can handle sign-extends as a single instruction fine, all the way down to reg-to-reg movsXY on x86. However, the wasm 1.0 spec only has i64.extend_s/i32. Hence 1.0-compliant producers have to synthesise other widenings from a double shift, and indeed that's visible in fragments of the zlib.c benchmark:
(i32.shr_s
(i32.shl
(i32.and (get_local $3) (i32.const 255))
(i32.const 24)
)
(i32.const 24)
)
and I'd furthermore guess that Ion rewrites that into a single instruction at some point, since those double-shifts don't appear in the Ion output. So we'd need something equivalent, early in the CL pipeline, maybe "preopt".
Reporter
Comment 3 • 6 years ago
> and I'd furthermore guess that Ion rewrites that into a single instruction at some point [..]
nbp confirms this: https://searchfox.org/mozilla-central/source/js/src/jit/MIR.cpp#2793
Updated • 6 years ago
Assignee
Comment 4 • 6 years ago
Taking, since I've fixed this in one commit of this PR.
The rule seems to be: given a left shift followed by a right shift on an M-bit GPR, if the shift immediate is M - N for both instructions, the pair can be rewritten as an N-to-M-bit sign-extending move (or a zero-extending move when the right shift is logical). This is what the PR implements, and no tests fail with it, so I'm pretty confident.
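As an illustration of that rule, here is a minimal sketch over a toy IR; the enum and field names are hypothetical, not Cranelift's actual data structures. The same shape with a logical right shift would produce the zero-extending variant.

```rust
// Toy IR sketch (hypothetical names): recognise ishl_imm followed by
// sshr_imm with the same immediate and turn the pair into a sign extension.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Inst {
    IshlImm { src: u32, imm: u8 },       // left shift by immediate
    SshrImm { src: u32, imm: u8 },       // arithmetic (signed) right shift by immediate
    Sextend { src: u32, from_bits: u8 }, // sign-extend the low `from_bits` bits
}

/// If `shr` arithmetic-right-shifts the result of `shl`, both by the same
/// immediate k, on a `reg_bits`-wide value, the pair is a sign extension
/// from `reg_bits - k` bits.
fn fold_shift_pair(shl: Inst, shl_result: u32, shr: Inst, reg_bits: u8) -> Option<Inst> {
    match (shl, shr) {
        (Inst::IshlImm { src, imm: n1 }, Inst::SshrImm { src: s2, imm: n2 })
            if s2 == shl_result && n1 == n2 && n1 < reg_bits =>
        {
            Some(Inst::Sextend { src, from_bits: reg_bits - n1 })
        }
        _ => None,
    }
}

fn main() {
    // v1 = ishl_imm v0, 24 ; v2 = sshr_imm v1, 24  =>  8-to-32-bit sign extension of v0
    let shl = Inst::IshlImm { src: 0, imm: 24 };
    let shr = Inst::SshrImm { src: 1, imm: 24 };
    assert_eq!(
        fold_shift_pair(shl, 1, shr, 32),
        Some(Inst::Sextend { src: 0, from_bits: 8 })
    );
}
```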
Other small instruction selection improvements in this PR:
- use an immediate variant of adjust_sp_down whenever possible.
- fold chains of binary_opcode_imm together: say we have add_imm(add_imm(x, 42), 58), this can be folded into a single add_imm(x, 100) (see the sketch after this list).
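A minimal sketch of that add_imm chain folding on a toy expression type (hypothetical, not the actual Cranelift preopt code); wrapping addition mirrors two's-complement wasm semantics:

```rust
// Toy sketch: collapse nested add_imm nodes into a single add_imm.
#[derive(Clone, PartialEq, Debug)]
enum Expr {
    Value(u32),              // an opaque SSA value
    AddImm(Box<Expr>, i32),  // add_imm(x, k)
}

/// add_imm(add_imm(x, a), b)  =>  add_imm(x, a + b), with wrapping addition.
fn fold_add_imm_chain(e: Expr) -> Expr {
    match e {
        Expr::AddImm(inner, b) => match fold_add_imm_chain(*inner) {
            Expr::AddImm(x, a) => Expr::AddImm(x, a.wrapping_add(b)),
            other => Expr::AddImm(Box::new(other), b),
        },
        other => other,
    }
}

fn main() {
    // add_imm(add_imm(x, 42), 58) folds to add_imm(x, 100).
    let chain = Expr::AddImm(Box::new(Expr::AddImm(Box::new(Expr::Value(0)), 42)), 58);
    assert_eq!(
        fold_add_imm_chain(chain),
        Expr::AddImm(Box::new(Expr::Value(0)), 100)
    );
}
```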
Assignee
Comment 5 • 6 years ago
Benchmark results don't seem to show a lot of improvement; sometimes there are even slowdowns:
# mode=cranelift+cranelift, runs=5, problem size=4
# Lower score is better
box2d 5489 5330 1.03 [0.964, 1.007] [0.997, 1.023] [5290, 5530] [5312, 5454]
bullet 7358 7300 1.008 [0.997, 1.037] [0.99, 1.009] [7338, 7632] [7228, 7368]
conditionals 2030 2016 1.007 [0.997, 1.0] [0.994, 1.003] [2023, 2030] [2003, 2023]
copy 3744 3724 1.005 [0.953, 1.002] [0.999, 1.002] [3568, 3750] [3720, 3733]
corrections 4790 4965 0.965 [0.975, 1.01] [0.997, 1.014] [4671, 4838] [4952, 5034]
fannkuch 2764 2729 1.013 [0.986, 1.0] [0.996, 1.007] [2724, 2765] [2718, 2748]
fasta 4397 4485 0.98 [0.986, 1.04] [0.966, 1.011] [4335, 4575] [4333, 4536]
fib 2806 2816 0.996 [0.999, 1.025] [0.999, 1.032] [2802, 2876] [2814, 2906]
ifs 1031 1033 0.998 [0.997, 1.002] [0.997, 1.003] [1028, 1033] [1030, 1036]
binarytrees 5206 5289 0.984 [0.985, 1.011] [0.999, 1.003] [5130, 5265] [5284, 5304]
memops 8022 8236 0.974 [0.973, 1.001] [0.999, 1.072] [7806, 8028] [8224, 8825]
primes 4983 5080 0.981 [0.997, 1.006] [0.98, 1.021] [4966, 5015] [4977, 5185]
raybench 8192 8530 0.96 [0.99, 1.002] [0.952, 1.031] [8114, 8211] [8119, 8792]
rust-fannkuch 35550 35458 1.003 [0.997, 1.002] [0.995, 1.007] [35446, 35608] [35286, 35697]
skinning 4124 4109 1.004 [0.993, 1.009] [0.998, 1.027] [4096, 4162] [4100, 4218]
zlib 7094 7051 1.006 [0.999, 1.007] [0.993, 1.002] [7085, 7146] [7002, 7065]
(/me wondering if we should try to align loop headers on memory boundaries, to make results slightly more deterministic)
I'll try to use other Valgrind measurements, and check that the hot blocks don't contain any of the patterns we'd like to avoid here.
Assignee
Comment 6 • 6 years ago
Added a simple commit to fold x | 0 or x + 0 into x (both of these patterns show up a lot in Embenchen). Right now this is done with a copy instruction, which is unfortunate (it generates a move instruction); I'm tempted to add an id instruction for this purpose. I've added some simple folds I could think of; I'll look at the ones SpiderMonkey does to make sure I don't miss any.
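For illustration, a small sketch of those identity folds on a toy expression type (hypothetical names, not the actual commit); folding at this level returns the operand itself, so ideally no copy/move instruction would be needed at all:

```rust
// Toy sketch: algebraic identities x | 0 => x and x + 0 => x.
#[derive(Clone, PartialEq, Debug)]
enum Expr {
    Value(u32),               // an opaque SSA value
    Const(i32),               // an integer constant
    Or(Box<Expr>, Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// Fold x | 0 and x + 0 into x; leave anything else untouched.
fn fold_identity(e: Expr) -> Expr {
    match e {
        Expr::Or(x, zero) | Expr::Add(x, zero) if *zero == Expr::Const(0) => *x,
        other => other,
    }
}

fn main() {
    let x = Expr::Value(7);
    let or0 = Expr::Or(Box::new(x.clone()), Box::new(Expr::Const(0)));
    let add0 = Expr::Add(Box::new(x.clone()), Box::new(Expr::Const(0)));
    assert_eq!(fold_identity(or0), x);
    assert_eq!(fold_identity(add0), x);
}
```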
Assignee
Comment 7 • 6 years ago
I've run the benchmarks under Callgrind, to see the effect on the number of instructions / data reads / data writes. This table shows results for the benchmarks when run with load level 3:
Baseline After optimization
inst count data read data write inst count data read data write
fib.js 2,113,411,409 821,466,450 469,356,770 : 2,113,411,405 821,466,446 469,356,770
raybench.js 5,309,762,358 1,886,197,917 508,211,005 : 5,309,077,404 1,886,992,388 508,211,946
rust-fannkuch.js 2,274,344,633 946,585,543 278,509,109 : 2,274,344,696 946,585,600 278,509,109
wasm_box2d.js 6,160,563,357 2,271,481,867 515,276,017 : 6,116,848,070 2,271,478,607 515,269,396
wasm_bullet.js 9,902,744,891 3,784,539,475 765,786,581 : 9,576,466,079 3,784,859,012 765,786,599
wasm_conditionals.js 2,500,074,669 213,077,020 79,955,541 : 2,500,074,667 213,077,067 79,955,541
wasm_copy.js 5,502,935,020 438,306,820 310,901 : 5,502,935,048 438,306,901 310,901
wasm_corrections.js 4,516,149,387 217,554,053 311,168 : 4,446,149,381 217,554,126 311,168
wasm_fannkuch.js 2,543,080,545 846,805,569 162,748,921 : 2,543,073,903 846,806,813 162,748,921
wasm_fasta.js 5,519,295,263 825,276,824 144,053,166 : 5,519,612,013 825,910,246 144,369,834
wasm_ifs.js 1,488,136,664 515,600,328 245,042,580 : 1,488,136,709 515,600,374 245,042,580
wasm_linpack_float.c.js 5,935,743,024 1,626,604,764 277,014,783 : 5,935,738,852 1,626,605,327 277,014,885
wasm_lua_binarytrees.c.js 5,767,590,432 2,301,008,525 846,898,221 : 5,683,362,574 2,324,225,936 867,183,689
wasm_lua_scimark.c.js 2,849,485,201 1,102,571,660 350,803,789 : 2,779,678,898 1,119,827,842 367,916,317
wasm_memops.js 13,424,643,915 2,097,944,988 419,739,659 : 13,424,643,920 2,097,945,060 419,739,659
wasm_primes.js 4,694,549,840 295,466,816 310,423 : 4,694,549,819 295,466,878 310,423
wasm_skinning.js 5,367,023,929 1,116,816,451 257,914,442 : 5,367,023,945 1,116,816,515 257,914,445
wasm_zlib.c.js 8,003,386,883 3,238,593,334 490,699,853 : 7,921,675,579 3,238,622,360 490,699,853
Or, in relative difference (values < 0.01% rounded to 0):
name icount d-read d-writes
fib 0 0 0
raybench 0.02 0 0
rust-fannkuch 0 0 0
box2d 0.71 0 0
bullet 3.3 0 0
conditionals 0 0 0
copy 0 0 0
corrections 1.55 0 0
fannkuch 0 0 0
fasta 0 0 0
ifs 0 0 0
linpack_float.c 0 0 0
binarytrees.c 1.47 0 0
scimark.c 2.45 0 0
memops 0 0 0
primes 0 0 0
skinning 0 0 0
zlib.c 1.03 0 0
- sometimes the number of executed instructions actually rises after the optimization. Not sure what could cause this.
- the number of executed instructions may decrease, while the number of data writes / reads may increase. My theory is that this is because the live range of some variables is extended, while there were many more smaller live ranges before.
- every time it happens, the number of writes seems to increase much more than the number of reads, suggesting more spills that are never read back. It might just be another symptom of the weakness of our register allocator.
- The increase in data writes / reads never exceeds 0.01% of loads/stores, as shown by the second table, so maybe there's nothing to worry about.
- The increase in data traffic can't even be correlated with the performance regressions observed with the patch, since raybench / corrections (which had the biggest losses) don't show dramatic increases in instruction count.
This optimization will probably show more benefit once register allocation improves.
Assignee
Comment 8 • 5 years ago
This has landed with bug 1571464.