Open Bug 1316808 Opened 3 years ago Updated 4 months ago

Wasm baseline: Allocate some locals to registers (investigation)

Categories

(Core :: Javascript: WebAssembly, task, P3)

task

Tracking

()

ASSIGNED

People

(Reporter: lth, Assigned: lth)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(6 files)

Try to allocate some locals to registers on machines where there are enough registers.  

This is probably hard to do well in a one-pass compiler but it might be that just keeping register arguments and the first few locals in registers is a viable strategy; another (more general) strategy is caching locals in registers in straight-line code.  Such caching could also track constant values in registers, if that is deemed valuable.  A combination of techniques may be desirable: parameters and the first few locals could be cached on entry to the function but not statically assigned to registers throughout.

(On a large corpus of code it should be possible to compute, for every signature comprising the types of parameters and locals, and using a static weight for loops, a list in priority order of which parameters and locals that should be assigned to registers.  Or something like that.  Wasm makes this simple.  Static assignments are desirable because they are not flushed to memory by the pre-block sync() call.)

It's likely that we want to delay this until we've done more sync() optimization, notably, attemptint to reduce sync() activity in the register allocator and optimistically delaying sync() at block boundaries.  Only when those have been completed will we see the true potential for allocating locals to registers.
See Also: → 1316820
Priority: P5 → P3
Here's a stray thought about how to make good use of locals that are cached in registers.  (For this we want bug 1318654 to be finished probably, but it would work without it.)

Suppose we can cheaply maintain a data structure that tracks locals that are cached in registers and which registers they are in.  We now specialize (say) RegI32 and popI32(): let there be a new register type ValI32  that is a base class of RegI32, and let there be a new function popValI32() that returns a ValI32 instead of a RegI32.  Now change the signatures of low-level emitters that currently take a RegI32 but do not modify that register to instead take a ValI32.

Higher-level emitters that pass an argument to a ValI32 can call popValI32() instead of popI32().

Now when popValI32() wants to pop a local but that local is cached in a register, the register can be returned as-is: no load or copy is needed.  popI32(), in contrast, must make a copy or can - in some circumstances, when there's no conflict - choose to destroy the caching by returning the register.

The conflict alluded to in the previous paragraph occurs if we pop twice for eg (+ x x) where x is local, first as a ValI32 that returns the register of x and then as a RegI32, should the second pop, which will normally be a popI32(), choose not to make a copy but instead return the local.  Food for thought.

(We can track constant values in registers in the same way as locals, maybe.  Worth it?  Who knows.  Data will tell.)
Depends on: 1316806
I performed a simple corpus analysis by counting the number of integer and floating point params and locals for the functions of Tanks, AngryBots, and ZenGarden (will attach results).  For x64 and ARM64 these sums give an accurate picture of how many statically assigned registers we would need to cover all locals.

Somewhat as expected, but it's nice to have this confirmed, we cover the vast majority of functions with only a few statically allocated registers.  Here are the top entries for ZenGarden ("ints" are i32 and i64 params and locals; "floats" are f32 and f64 ditto; "count" is the number of functions having those characteristics):

ints: 1 floats: 0 count: 18547
ints: 2 floats: 0 count: 8240
ints: 1 floats: 1 count: 7228
ints: 2 floats: 1 count: 6519
ints: 0 floats: 2 count: 5075
ints: 2 floats: 2 count: 4003
ints: 3 floats: 1 count: 3846
ints: 3 floats: 0 count: 3699
ints: 4 floats: 0 count: 2915
ints: 3 floats: 2 count: 2893
ints: 0 floats: 1 count: 2679
ints: 1 floats: 2 count: 2494
ints: 2 floats: 3 count: 2196
ints: 1 floats: 3 count: 2171
ints: 3 floats: 4 count: 1992
ints: 2 floats: 4 count: 1601
ints: 3 floats: 3 count: 1536
ints: 0 floats: 7 count: 1482
ints: 3 floats: 5 count: 1248
ints: 0 floats: 8 count: 1236
ints: 2 floats: 5 count: 1107
ints: 4 floats: 1 count: 1096
ints: 1 floats: 4 count: 1059
ints: 2 floats: 6 count: 927
ints: 5 floats: 0 count: 849
ints: 3 floats: 7 count: 849
ints: 3 floats: 6 count: 843
ints: 0 floats: 9 count: 799
ints: 0 floats: 6 count: 740
ints: 4 floats: 3 count: 734
ints: 1 floats: 5 count: 648
ints: 0 floats: 10 count: 617
ints: 2 floats: 8 count: 589
ints: 3 floats: 8 count: 583
ints: 4 floats: 2 count: 546
ints: 2 floats: 7 count: 544
ints: 0 floats: 11 count: 486
ints: 0 floats: 5 count: 468
ints: 4 floats: 4 count: 422
ints: 5 floats: 1 count: 396
ints: 0 floats: 3 count: 390
ints: 1 floats: 6 count: 375
ints: 5 floats: 2 count: 356
ints: 4 floats: 5 count: 345
ints: 2 floats: 9 count: 323
ints: 1 floats: 8 count: 314
ints: 0 floats: 12 count: 314
ints: 1 floats: 7 count: 302
ints: 1 floats: 12 count: 292
ints: 0 floats: 4 count: 285
ints: 3 floats: 9 count: 278
ints: 6 floats: 0 count: 271

If statically allocated registers can be made to coincide with parameter registers (and why could they not?) then function prologues will be shorter, there will be less load and store traffic, and there will be fewer instructions emitted.  With the idea from comment 1 there will also be fewer register-register moves.

Statically allocated registers with lazy restore after calls (when we must spill) is a simple bitvector problem and need not slow down compilation much at all.
Attached file Tanks-histo.txt
Attached file AngryBots-histo.txt
Attached file ZenGarden-histo.txt
Attached file wasm-sig
Corpus analysis program (python)
Attached file Cumulative-histo.txt
Postprocssing of the previous text files.  This shows that for the trio of programs considered here, 8 statically assigned integer registers cover 99% of all functions, and 14 statically assigned float registers cover 95% of all functions (very long tail, the largest function has 661 float params + locals).

(Hello, ARM64.)

But even four integer registers cover 92% of all functions, and 8 floating point registers cover 88%.  This is excellent news for both ARM and x64.

(Output of

  cat AngryBots-histo.txt Tanks-histo.txt ZenGarden-histo.txt | ./cumulative-histo.py

where the program follows.)
Attached file cumulative-histo.py
I'm backburner-working on this so I guess I'll take it.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Component: JavaScript Engine: JIT → Javascript: Web Assembly
Type: defect → enhancement
Type: enhancement → task
You need to log in before you can comment on or make changes to this bug.