Closed Bug 948838 Opened 11 years ago Closed 8 years ago

Extra spill/reload with backtracking allocator


(Core :: JavaScript Engine: JIT, defect)

(Reporter: jandem, Assigned: bhackett1024)




(2 files, 4 obsolete files)

See bz's testcase in bug 946426 comment 0:

  (function(count) {
    var start = new Date;
    while (count-- != 0);
    var stop = new Date;
    print((stop - start) / 10000000 * 1e6);
  })(10000000);

On 32-bit I get this with LSRA:

[Codegen] movl       %eax, %ebp
[Codegen] subl       $0x1, %ebp
[Codegen] jo         ((705))
[Codegen] testl      %eax, %eax
[Codegen] je         ((713))
[Codegen] movl       %ebp, %eax
[Codegen] jmp        ((720))

And with --ion-regalloc=backtracking:

[Codegen] movl       %ebx, %eax
[Codegen] subl       $0x1, %eax
[Codegen] jo         ((711))
[Codegen] movl       %eax, 0x78(%esp)
[Codegen] testl      %ebx, %ebx
[Codegen] je         ((723))
[Codegen] movl       0x78(%esp), %ebx
[Codegen] jmp        ((732))

It would be nice if we could get rid of this store/load inside the loop, like with LSRA.

The problem is worse if I change |count-- != 0| to |--count != 0|:

With LSRA:

[Codegen] subl       $0x1, %eax
[Codegen] jo         ((703))
[Codegen] testl      %eax, %eax
[Codegen] jne        ((711))

And with --ion-regalloc=backtracking:

[Codegen] subl       $0x1, %ebx
[Codegen] jo         ((709))
[Codegen] movl       %ebx, 0x78(%esp)
[Codegen] testl      %ebx, %ebx
[Codegen] je         ((721))
[Codegen] movl       0x78(%esp), %ebx
[Codegen] jmp        ((730))
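The two loop forms differ in how many values are live at the loop branch, which is why the post-decrement version is harder on the allocator. A hedged sketch in plain JavaScript (illustrative only; the functions and counts are invented, not from the bug):

```javascript
// With `count-- != 0`, the branch tests the OLD value while the loop phi
// carries the NEW (decremented) value, so two values are live across the
// back-edge -- matching the two locations used in the listings above.
function postDec(count) {
  let iterations = 0;
  while (count-- != 0)   // old value tested, new value carried into the phi
    iterations++;
  return iterations;
}

// With `--count != 0`, the decremented value is both tested and carried,
// so only one value is live across the back-edge.
function preDec(count) {
  let iterations = 0;
  while (--count != 0)   // same value tested and carried
    iterations++;
  return iterations;
}
```

The extra live value in the post-decrement form is exactly what the allocator must place somewhere: LSRA keeps it in a second register, while the backtracking allocator spills it to 0x78(%esp).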
LiveIntervals don't currently contain the uses for phi nodes. Consequently, the heuristics see no reason to extend the lifetime of the result register of the sub.

Also, phi uses are LUse::ANY, which can be in memory but prefers a register, and the backtracking allocator often ignores this preference (isRegisterUse returns false for ANY in many cases).

That explains the load. The store comes from the fact that there are LUse::KEEPALIVE uses after the loop, and the backtracking allocator is spilling the value as soon as it can. This may be trickier to fix, though it's also less important.
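To make the use-policy point concrete, here is a hedged sketch of the kind of spill-weight computation being described (illustrative only; these are not SpiderMonkey's actual constants, and the specific weights are invented):

```javascript
// Invented per-use-policy weights: higher totals mean an interval is more
// worth keeping in a register.
const USE_WEIGHT = {
  KEEPALIVE: 0,    // value only needs to stay alive; memory is fine
  ANY:       500,  // register preferred, but memory is acceptable
  REGISTER:  1000, // must be in a register
};

function spillWeight(uses) {
  return uses.reduce((total, policy) => total + USE_WEIGHT[policy], 0);
}
```

Under a model like this, a value whose remaining uses are a phi (ANY) plus KEEPALIVE uses after the loop scores low, so the allocator is happy to spill it early, which matches the store/load seen in the loop body.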
Attached patch patch (obsolete) — Splinter Review
I think another problem here is that the presence of the call to new Date after the loop is affecting register allocation within the loop.  This shouldn't be happening: when allocating we should consider the structure of the hot code first and fill in the details for the cold code afterward.  trySplitAcrossHotcode originally did this, but bug 939824 changed its behavior somewhat (split points were still chosen based on the interval's register uses, which are not ideal here), and before that bug 826734 gave splitting across calls (as is done here) a higher priority than splitting across hot code.  The attached patch restores the earlier behavior and fixes this testcase, but for several Octane tests it increases the amount of spill code executed:

                   OLD    NEW
richards.js:       0.082  0.079
deltablue.js:      0.099  0.103
crypto.js:         0.272  0.29
raytrace.js:       0.095  0.097
earley-boyer.js:   0.087  0.114
splay.js:          0.063  0.063
navier-stokes.js:  0.286  0.288
pdfjs.js:          0.152  0.148
mandreel.js:       0.252  0.234
gbemu.js:          0.083  0.136
box2d.js:          0.082  0.083
zlib.js:           0.186  0.186
typescript.js:     0.095  0.103

So I'd like to not land this without a better understanding / fix for what's going on here.
Attached patch tweaks (obsolete) — Splinter Review
I think the main issue here is the disabling of spilling to argument slots, which happened in bug 906858.  Splitting at hot code boundaries presumably leads to more cold intervals without register uses, and when those are arguments we end up doing memory -> memory moves.  If I restore argument slot spilling and make the attached minor tweaks, I get the following fractions of spill code executed, which are generally flat or an improvement vs. the current state.

richards.js: 0.072
deltablue.js: 0.101
crypto.js: 0.269
raytrace.js: 0.098
earley-boyer.js: 0.085
splay.js: 0.055
navier-stokes.js: 0.287
pdfjs.js: 0.159
mandreel.js: 0.235
gbemu.js: 0.093
box2d.js: 0.083
zlib.js: 0.186
typescript.js: 0.097
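The memory -> memory moves mentioned above are a real cost on x86, which has no mov form with two memory operands: such a move has to be staged through a scratch register. A hedged sketch of that lowering (illustrative only; the "stack"/"reg" kinds and the scratch argument are invented, not the actual move resolver):

```javascript
// Lower one abstract move into concrete mov instructions.
function lowerMove(src, dst, scratch) {
  if (src.kind === "stack" && dst.kind === "stack") {
    // x86 mov cannot take two memory operands, so bounce through a register.
    return [
      { op: "mov", from: src, to: scratch },
      { op: "mov", from: scratch, to: dst },
    ];
  }
  // Register is involved on at least one side: a single mov suffices.
  return [{ op: "mov", from: src, to: dst }];
}
```

So spilling an argument to a fresh stack slot instead of reusing its argument slot turns what could be zero moves into a two-instruction shuffle plus a clobbered scratch register.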
I've been looking into this. I agree with bhackett's analysis; the splitAt mechanism introduced in bug 939824 currently tries to minimize ranges, but in testcases like the one here, this causes it to spill inside the loop. So far I've only seen this come up in non-asm.js code. I'm looking into improving it.
Blocks: 1063603
Attached patch patch (rebased) (obsolete) — Splinter Review
Rebased Brian's patch, no non-trivial changes.
Attachment #8395658 - Attachment is obsolete: true
Attached patch tweaks (rebased) (obsolete) — Splinter Review
Attachment #8395915 - Attachment is obsolete: true
Attached patch patchSplinter Review
The problem with landing this patch has been that it regressed some asm.js tests (bullet and lua_binarytrees on AWFY, at least).  It's not obvious why this is the case, and rather than dig around trying to understand it, I would rather break this impasse by just preserving the current behavior for asm.js code.

We really want to switch to the backtracking allocator for normal Ion compilation, and doing that while preserving both identical behavior and performance for asm.js code will be pretty annoying.  It would be better to make the necessary changes to fix Ion perf, include minor special casing like this where asm.js perf is negatively affected, and then try to remove these special cases once we're using the backtracking allocator exclusively.
Assignee: nobody → bhackett1024
Attachment #8526787 - Attachment is obsolete: true
Attachment #8548940 - Flags: review?(sunfish)
Attached patch tweaksSplinter Review
Tweaks patch.  This modifies the allocator a bit to avoid artificial contention between registers, and to reduce the weight given to ANY uses of a vreg.  On the asm.js AWFY tests on x86 this doesn't seem to move any of the apps much; on the ubench tests, copy slows down (10%) but several others speed up (1-10%).
Attachment #8526788 - Attachment is obsolete: true
Attachment #8549683 - Flags: review?(sunfish)
Comment on attachment 8549683 [details] [diff] [review]

Review of attachment 8549683 [details] [diff] [review]:

::: js/src/jit/BacktrackingAllocator.cpp
@@ +451,5 @@
> +
> +            i = (i + 1) % AnyRegister::Total;
> +        } while (i != registerStart);
> +
> +        registerStart = (registerStart + 1) % AnyRegister::Total;

I don't currently see how this helps. Using the same register for two disjoint LiveIntervals should mean that a third LiveInterval which intersects with both has more choices. Having the allocator cycle around the registers here seems like it would increase contention overall.

It's possible that this has a significant effect on asmjs-ubench benchmarks, because many of them use div and mod operators which have fixed inputs and outputs, and cycling around changes when we pick registers that later turn out to be needed by fixed intervals.
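The contention point can be shown with a tiny hedged model (invented intervals and allocation loop, not the real allocator): packing two disjoint intervals into the same register leaves the other register free for a later interval that overlaps both, while cycling the starting register forces that later interval to spill.

```javascript
// Half-open intervals [start, end); values are invented for illustration.
function overlaps(a, b) { return a.start < b.end && b.start < a.end; }

// Greedy single-pass allocation; roundRobin rotates the starting register
// after each allocation, first-fit always starts from register 0.
function allocate(intervals, numRegs, roundRobin) {
  const assigned = []; // { interval, reg } pairs
  let next = 0;        // round-robin starting register
  let spills = 0;
  for (const iv of intervals) {
    let reg = -1;
    for (let i = 0; i < numRegs; i++) {
      const r = roundRobin ? (next + i) % numRegs : i;
      if (!assigned.some(a => a.reg === r && overlaps(a.interval, iv))) {
        reg = r;
        break;
      }
    }
    if (reg === -1) spills++;       // no conflict-free register: spill
    else assigned.push({ interval: iv, reg });
    if (roundRobin) next = (next + 1) % numRegs;
  }
  return spills;
}

const intervals = [
  { start: 0, end: 2 },  // A
  { start: 3, end: 5 },  // B: disjoint from A
  { start: 1, end: 4 },  // C: overlaps both A and B
];
```

With two registers, first-fit reuses r0 for A and B, leaving r1 free for C (no spill); round-robin puts A in r0 and B in r1, so C conflicts with both and spills.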

@@ +1448,5 @@
>          LUse *use = iter->use;
>          switch (use->policy()) {
>            case LUse::ANY:
> +            usesTotal += 500;

I have no intuition about what number is right here, so whatever looks good in benchmarks is cool with me :-).
Attachment #8549683 - Flags: review?(sunfish)
Comment on attachment 8548940 [details] [diff] [review]

Review of attachment 8548940 [details] [diff] [review]:

Sounds good. There's more to learn here, but this patch looks good for now.
Attachment #8548940 - Flags: review?(sunfish) → review+
Appears to have regressed Octane-deltablue and Octane-earleyBoyer on the backtracking allocator.
Also looks like a 4% regression on x64 Octane-zlib.
I'm going to mark this as fixed. The big part has landed. Please open a new bug if you still want to land the tweaks.
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED