Closed Bug 1039458 Opened 10 years ago Closed 6 years ago

Unroll tight loops

Categories

(Core :: JavaScript Engine: JIT, defect, P5)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: till, Assigned: bhackett1024)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [performance] [shumway])

Attachments

(2 files, 2 obsolete files)

Even in cases where we optimize out all bounds checks, tight loops are still quite a bit faster when unrolled a few times.

The attached benchmark compares a nested loop copying values from one typed array into another with the same loop unrolled four times.

On my machine, the baseline version takes ~500ms and the unrolled one ~307ms, so slightly more than 60% of the original.

(Note: the original case motivating the bug is the need to copy rects from one image buffer into another. %TypedArray%.prototype.set doesn't take ranges, unfortunately, so it's not possible to use it to copy parts of lines from one buffer to another, leaving us with manually copying each pixel as the only option.)
I don't know if we want to unroll just tight loops specifically; it would be nice if we had profiling information to help us decide when and how to unroll loops, including whether to unroll multiple iterations or just the first iteration.  But it would definitely be good to have a loop unrolling pass to experiment with, even if we don't turn it on until we're doing PGO.
Assignee: nobody → bhackett1024
Blocks: shumway-m4
Keywords: perf
Attached patch WIP (obsolete) — Splinter Review
WIP.  This uses the existing symbolic range analysis stuff to try to unroll loops on which we can compute an upper bound on the number of iterations they will execute.  This means we don't have to keep unrolling the loop termination test and can use a single test in the unrolled loop, so we should see benefits from this unrolling even if no other optimizations can be applied to the loop body.  For now we just unroll loops with an initial test and single block for the body and backedge, but in the future this could be relaxed.  In the ideal, very-simple-body case like:

n = 10000;
a = new Int32Array(n);
function foo() {
    for (var i = 0; i < n; i++)
	a[i] = i;
}
for (var i = 0; i < 100000; i++)
    foo();

I get a speedup from 837ms to 423ms with a 10x unrolling.
How hard would it be to unroll to something like Duff's device, where we jump into the middle loop initially to take care of the remainder (i.e. total iterations mod unroll count)?
s/middle loop/middle of the loop
(In reply to Brian Hackett (:bhackett) from comment #2)
> I get a speedup from 837ms to 423ms with a 10x unrolling.

Nice! Does this also work for the am3 loop in Octane-crypto? The loop body is a single basic block..

(In reply to Shu-yu Guo [:shu] (PTO until Aug. 28) from comment #3)
> How hard would it be to unroll to something like Duff's device, where we
> jump into the middle loop initially to take care of the remainder (i.e.
> total iterations mod unroll count)?

I think that complicates the graph structure, as the loop header will no longer dominate the rest of the loop, IIUC.
(In reply to Jan de Mooij [:jandem] from comment #5)
> (In reply to Brian Hackett (:bhackett) from comment #2)
> > I get a speedup from 837ms to 423ms with a 10x unrolling.
> 
> Nice! Does this also work for the am3 loop in Octane-crypto? The loop body
> is a single basic block..
> 
> (In reply to Shu-yu Guo [:shu] (PTO until Aug. 28) from comment #3)
> > How hard would it be to unroll to something like Duff's device, where we
> > jump into the middle loop initially to take care of the remainder (i.e.
> > total iterations mod unroll count)?
> 
> I think that complicates the graph structure, as the loop header will no
> longer dominate the rest of the loop, IIUC.

That's right, it ends up being irreducible control flow. I guess the question is how many other optimizations does it mess up?
(In reply to Jan de Mooij [:jandem] from comment #5)
> (In reply to Brian Hackett (:bhackett) from comment #2)
> > I get a speedup from 837ms to 423ms with a 10x unrolling.
> 
> Nice! Does this also work for the am3 loop in Octane-crypto? The loop body
> is a single basic block..

It should, though I haven't tested it yet, and the clone() implementations in this patch for the various MIR nodes in the loop body probably aren't sufficient yet.  This patch will pretty much just do 'while' and 'for' loops with no inner control flow (break, continue, if, etc.).  I'd like to also handle 'do while' loops in this bug and in the future be able to deal with some kinds of inner control flow, like the tight loop in kraken-astar that iterates through an array and breaks out when it finds what it's looking for.  Allowing arbitrary control flow in the unrolled loop would have diminishing returns, though maybe we could unroll complicated hot loops if we had basic block hit counts and could figure out which path through the loop is generally taken.

> (In reply to Shu-yu Guo [:shu] (PTO until Aug. 28) from comment #3)
> > How hard would it be to unroll to something like Duff's device, where we
> > jump into the middle loop initially to take care of the remainder (i.e.
> > total iterations mod unroll count)?
> 
> I think that complicates the graph structure, as the loop header will no
> longer dominate the rest of the loop, IIUC.

Yeah, we should really keep all the loops reducible.  From a compilation perspective that's much easier to analyze; Duff's device looks like a "here's this crazy thing you can do in C" programmer thing.  Duff's device would be a more effective optimization than this unrolling strategy in cases where the loop bounds are generally small and not divisible by the unrolling count.  Then we will be executing many iterations in the original loop (which still exists, and appears after the unrolled loop to take care of any remaining iterations) whereas Duff's device will always be executing unrolled code.

If we wanted to do something similar to Duff's device we could do two unrollings.  Right now the unrolled loop handles the first (N - (N % UnrollCount)) iterations and the original loop handles the remaining (N % UnrollCount) iterations.  If we replaced the original loop with a switch-style dispatch into a second unrolled body, that code could execute the remaining (N % UnrollCount) iterations without requiring a loop.  FWIW this is similar to the code we would get after transforming Duff's device into a reducible loop.  Doing this would be too much for this bug but maybe something for the future.
Attached patch patch (obsolete) — Splinter Review
Add basic loop unrolling, disabled by default.  I haven't done much extensive benchmarking with this patch but on octane it seems to improve crypto by 7% and not move other tests much, and it has a net slowdown on kraken of 1% due to some regressions in the crypto tests.  I'd like to see what's going on with those and how often we're executing unrolled code etc. but I'd also like to just land this as is since it works (passes jit-tests) and can be fuzzed while it gets tuned.
Attachment #8461324 - Attachment is obsolete: true
Attachment #8461919 - Flags: review?(jdemooij)
Sorry for the delay. I'll review this patch tomorrow.
Comment on attachment 8461919 [details] [diff] [review]
patch

Review of attachment 8461919 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, some comments below.

* I tried the patch and it doesn't work for the tight loop in "process" in Kraken imaging-desaturate, I think we never end up in the loop unrolling code. Do you know why? That test is a poster child for loop unrolling, so it'd be good to know how hard it is to fix.

* Same for SS bitwise-and, but there the problem is likely that it runs in the global scope. It doesn't really matter if we don't handle that.

::: js/src/jit/Ion.cpp
@@ +1519,5 @@
> +
> +            for (size_t i = 0; i < r.loopIterationBounds.length(); i++) {
> +                if (!UnrollLoop(graph, r.loopIterationBounds[i]))
> +                    return false;
> +            }

Nit: I'd prefer a UnrollLoops that we can pass r.loopIterationBounds, so that this loop is inside LoopUnroller.cpp

::: js/src/jit/LoopUnroller.cpp
@@ +17,5 @@
> +{
> +    typedef HashMap<MDefinition *, MDefinition *,
> +                    PointerHasher<MDefinition *, 2>, SystemAllocPolicy> DefinitionMap;
> +
> +    LoopUnroller(MIRGraph &graph)

Nit: make this |explicit|

@@ +120,5 @@
> +
> +    JS_ASSERT(oldPreheader->numSuccessors() == 1);
> +
> +    // Only unroll loops with two blocks: an initial one ending with the
> +    // bound's test, and the body ending with the backedge.

We should probably also limit the number of instructions per block? SS math-partial-sums has a loop we may unroll but the body is pretty big.

@@ +281,5 @@
> +        // Clone the contents of the original loop into the unrolled loop body.
> +        for (size_t i = 0; i < ArrayLength(bodyBlocks); i++) {
> +            MBasicBlock *block = bodyBlocks[i];
> +
> +            if (i != 0 && block->entryResumePoint()) {

Why the i != 0? Should it be unrollIndex? Could use a comment.

@@ +369,5 @@
> +    // The MIR graph is valid, but now has several new blocks. Update or
> +    // recompute previous analysis information for the remaining optimization
> +    // passes.
> +    ClearDominatorTree(graph);
> +    if (!BuildDominatorTree(graph))

Does this have to happen after every loop, or can we unroll multiple loops and then do this only once?
Attachment #8461919 - Flags: review?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #10)
> * I tried the patch and it doesn't work for the tight loop in "process" in
> Kraken imaging-desaturate, I think we never end up in the loop unrolling
> code. Do you know why? That test is a poster child for loop unrolling, so
> it'd be good to know how hard it is to fix.

This loop's termination condition, 'while (p--)' isn't able to be handled by our symbolic range analysis.  If p is initially negative the loop will never terminate so we can't compute a bound on the number of iterations.  I think we should improve the range analysis to cover this case, by optimistically adding a 'p >= 0' bounds check to the preheader, provided that no bounds checks in the script have been invalidated before.  This should let the range analysis compute a loop bound and let us unroll the loop and consolidate/hoist bounds checks on the arrays accessed in the loop.

> * Same for SS bitwise-and, but there the problem is likely that it runs in
> the global scope. It doesn't really matter if we don't handle that.

Yeah, the range analysis needs the loop control variables to be local variables and not global properties.

> We should probably also limit the number of instructions per block? SS
> math-partial-sums has a loop we may unroll but the body is pretty big.

I think this can wait for later, since this patch doesn't try to tune anything (and loop unrolling is turned off).  I think a single limit on the number of instructions per block we unroll would be OK initially but in the long run would be pretty brittle/simplistic.  I'm hoping that eventually we'll be doing multiple levels of Ion compilation, with the more highly optimized one having profile information computed by the initial compilation, and this would give us more information about what loops / paths through loops are super hot and should be unrolled if possible.

> @@ +281,5 @@
> > +        // Clone the contents of the original loop into the unrolled loop body.
> > +        for (size_t i = 0; i < ArrayLength(bodyBlocks); i++) {
> > +            MBasicBlock *block = bodyBlocks[i];
> > +
> > +            if (i != 0 && block->entryResumePoint()) {
> 
> Why the i != 0? Should it be unrollIndex? Could use a comment.

Actually, we should be testing both |i| and |unrollIndex|.  But, thinking about this some more, I think this chunk of code is unnecessary.  I added it so that the resume points of instructions in the block would definitely match those in the original loop body, but the entry resume points are only needed because of the vagaries of control flow, and unrolling the loop eliminates that.  Resuming into the correct state should be taken care of by the resume points we duplicate and attach to each unrolled effectful instruction.

> @@ +369,5 @@
> > +    // The MIR graph is valid, but now has several new blocks. Update or
> > +    // recompute previous analysis information for the remaining optimization
> > +    // passes.
> > +    ClearDominatorTree(graph);
> > +    if (!BuildDominatorTree(graph))
> 
> Does this have to happen after every loop, or can we unroll multiple loops
> and then do this only once?

Yeah, we only need to do this once.  Fixed.
Attached patch updatedSplinter Review
Attachment #8461919 - Attachment is obsolete: true
Attachment #8464834 - Flags: review?(jdemooij)
Comment on attachment 8464834 [details] [diff] [review]
updated

Review of attachment 8464834 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay.

::: js/src/jit/MIR.h
@@ +131,5 @@
>        : producer_(nullptr), consumer_(nullptr)
>      { }
>  
> +    // MUses can only be copied when they are not in a use list.
> +    MUse(const MUse &other)

Make this constructor explicit. There should be very few places where we use it.

@@ +380,5 @@
>          trackedSite_()
>      { }
>  
> +    // Copying a definition leaves the list of uses and the block empty.
> +    MDefinition(const MDefinition &other)

Same here.

@@ +784,5 @@
>        : resumePoint_(nullptr)
>      { }
>  
> +    // Copying an instruction leaves the block and resume point as empty.
> +    MInstruction(const MInstruction &other)

Same here.

@@ +868,5 @@
>      }
> +
> +    MAryInstruction() { }
> +
> +    MAryInstruction(const MAryInstruction<Arity> &other)

Same here.
Attachment #8464834 - Flags: review?(jdemooij) → review+
We should probably add --ion-loop-unrolling=on to the fuzzing harness.
Flags: needinfo?(gary)
Flags: needinfo?(choller)
(In reply to Jan de Mooij [:jandem] from comment #16)
> We should probably add --ion-loop-unrolling=on to the fuzzing harness.

Added in fuzzing rev 1b2d6847c161. Thanks for the headsup!
Flags: needinfo?(gary)
Flags: needinfo?(choller)
Depends on: 1060454
Blocks: 862249
Loop unrolling is nice-to-have for Shumway, but does not block Shumway's M4 milestone.
No longer blocks: shumway-m4
Priority: -- → P5
The leave-open keyword is there and there is no activity for 6 months.
:bhackett, maybe it's time to close this bug?
Flags: needinfo?(bhackett1024)
We support unrolling loops but it is not enabled anywhere by default.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bhackett1024)
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: