Wasm baseline: check for unnecessary scratch usage on ARM

RESOLVED INVALID

Status

()

enhancement
P3
normal
RESOLVED INVALID
2 years ago
2 years ago

People

(Reporter: luke, Unassigned)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

I was recently noticing that some common immediate-taking MacroAssembler functions (push(), add32(), sub32(), ...) don't even attempt to fold the immediate into the Operand2 but instead simply use a scratch register and two instructions.  It seems like you only get this immediate folding if you use the ARM-specific functions (like ma_add) which CodeGeneratorARM does.

It might be fruitful for significantly reducing rabaldr code size on ARM to audit the hottest MacroAssembler functions to either (1) try to do better folding in the shared MacroAssembler functions (a win for everybody), (2) add #ifdef ARM paths that use the ARM-specific ma_/as_ functions.
Another "Cretonne will save us here" instance, no pressure on Jakob...

Option (1) is anyway most palatable.

Another bête noir: Redundant moves are not reliably optimized out by the assembler / macroassembler, not even on x86 as I remember it.  Rabaldr has started accreting some cruft as a result.
Priority: -- → P3
See Also: → 1331568
Luke, I'm sure it's hard to remember now but could you be more specific about what you saw and where?  The ARM32 renditions of add32(), sub32() and many others go via ma_add which in turn goes into ma_alu which optimizes many common cases with small immediates for this entire class of instructions.  Some simple test runs with Rabaldr shows the generated code to use immediates properly.  (Well, I tried i32.add, which uses masm.add32().)

It is true that add32 etc with immediates all claim the scratch register early on, but it goes unused if it is not needed.

(Speaking of code density, it would be nice to use store-multiple / load-multiple in prologue and epilogue to avoid the sequence of store and load instructions, but that requires the frame layout to change a little bit - we need to push FP before TLS.)
Flags: needinfo?(luke)
Ah, probably then I was confused by seeing the scratch register reservation and assuming there was a use.  I'll resolve invalid.

That's an interesting idea w.r.t store-/load-multiple.  One nice thing about having a sequence of pushes is it more closely mimics x86 which means ARM simulator single-step profiling gives more implicit coverage of x86.
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(luke)
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.