Closed Bug 1067610 Opened 9 years ago Closed 9 years ago

Backtracking allocator should recover better when evicting part of a group

Categories

(Core :: JavaScript Engine: JIT, defect)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla41
Tracking Status: firefox41 --- fixed

People

(Reporter: sunfish, Assigned: bhackett1024)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

Bug 1058083 re-enabled the integer divide-by-constant optimization. While there was the expected win on asmjs-ubench-copy, there was a regression on asmjs-ubench-memops. It's due to an extra register reload inside the hot loop. Here are the relevant parts of the IONFLAGS=regalloc dump:

[140,143 Phi] [def v44<i>] [use v37:r?] [use v49:r?]
...
[150,151 AddI] [def v49<i>:tied(0)] [use v44:r?] [use v48:r?]
...
[158,159 DivOrModConstantI] [def v52<i>:rax] [temp v51<g>:rdx] [use v49:r]

v44 and v49 are grouped.

The v44 group gets assigned %rdx, not because it needs it, but because it's free at the time. Later, v51 comes up for allocation and it actually needs %rdx. We successfully evict v49 and give %rdx to v51. So far so good. However, we don't evict v44, and we don't split v49 out of the group. With v49 still in the group, and the group still allocated to %rdx, we resort to spilling v49.

[RegAlloc] Allocating v49[1] [152,160) [priority 8] [weight 0]
[RegAlloc]   Hint rdx, used by group allocation
[RegAlloc]   rdx collides with v51[0] [158,160) [weight 2000000]
[RegAlloc]   Spilling interval
[RegAlloc]     Allocating spill location stack:12

This puts a reload in the loop.
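For context, here is roughly what the divide-by-constant lowering does and why DivOrModConstantI above needs a temp pinned to rdx; this is an illustrative sketch for dividing by 3, not code from the benchmark or the engine:

#include <cstdint>

// The optimization replaces an integer division by a constant with a widening
// multiply plus a fixup. On x86 the widening multiply leaves its high half in
// edx/rdx, which is why the instruction reserves that register.
static int32_t divideBy3(int32_t x)
{
    const int64_t magic = 0x55555556;                   // ceil(2^32 / 3)
    int32_t hi = int32_t((int64_t(x) * magic) >> 32);   // high half ("rdx")
    return hi + (uint32_t(x) >> 31);                    // +1 fixup for negative x
}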

This is a specific instance of a pretty common problem. I've been contemplating the following possible solutions:

 - When evicting one register in a group, evict the whole group
 - When evicting a register in a group, split it out of the group
 - Teach LiveInterval to support multiple virtual registers at a time, and obviate the need for groups

The last one is kind of radical, but it would be cool because it would allow "grouped" registers to be split and evicted just like individual registers.
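As a rough illustration of the first option, here is a sketch using stand-in types (not the actual LiveInterval/VirtualRegisterGroup classes; re-queueing the evicted intervals for reallocation is omitted):

#include <vector>

struct Interval;
struct Group { std::vector<Interval*> members; };
struct Interval { Group* group = nullptr; bool hasRegister = false; };

// When any member of a group loses its register, evict every member so the
// group can later be reallocated (or split) as a unit, instead of leaving
// the remaining members pinned to an allocation one of them can no longer
// share.
inline void evictWithGroup(Interval* interval)
{
    if (!interval->group) {
        interval->hasRegister = false;
        return;
    }
    for (Interval* member : interval->group->members)
        member->hasRegister = false;
}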
Blocks: 826741
Don't know if you guys saw this one; perhaps another nice general backtracking improvement?
(In reply to Luke Wagner [:luke] from comment #1)
> Don't know if you guys saw this one; perhaps another nice general
> backtracking improvement?

Yeah, I'd like to look at this once we have backtracking on for Ion by default.
Attached patch patch (obsolete) — Splinter Review
This problem seems to be the main thing affecting backtracking performance on octane-richards.  Because of the inflexible handling of the preferred register for virtual register groups, we end up spilling some of the key vregs at each point they are used.  This doesn't increase the instruction count much (about 1% more with backtracking vs. LSRA), but perf on Linux shows many more stalled cycles with the backtracking version.  The attached patch removes the preferred registers for virtual register groups and, instead of allocating entire groups at once, allocates lists of live intervals at a time.  If there is more than one interval being allocated then they are all in the same group; the lists to allocate start out with all the initial intervals in a group and can shrink as allocation proceeds.  This gives the allocator a lot more flexibility when dealing with virtual register groups.  It fixes the richards regression for me and doesn't seem to regress other things, though I need to do more testing.
Assignee: sunfish → bhackett1024
Attachment #8574696 - Flags: review?(sunfish)
Blocks: 1134510
Also, I think once LSRA is removed we'll be able to do a major cleanup patch on how this stuff is organized.  I think that LiveInterval can be removed, and Range would become the fundamental unit of the register allocator, with the allocator handling sets of Ranges at a time, which might be associated with multiple vregs but should all have the same allocation.
Comment on attachment 8574696 [details] [diff] [review]
patch

Review of attachment 8574696 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this patch isn't as scary as it first looks. Still, this code is getting quite complex.

(In reply to Brian Hackett (:bhackett) from comment #4)
> Also, I think once LSRA is removed we'll be able to do a major cleanup patch
> on how this stuff is organized.  I think that LiveInterval can be removed,
> and Range would become the fundamental unit of the register allocator, with
> the allocator handling sets of Ranges at a time, which might be associated
> with multiple vregs but should all have the same allocation.

This will definitely be nice, and will hopefully help us eliminate a lot of the complexity here. My idea was to keep LiveInterval, and generalize it to represent these sets of Ranges, because LiveInterval is already a class holding a set of disjoint Ranges, and we already have code for splitting them and doing interference testing and so forth, much of which is code we can continue to use when we start tracking multiple virtual registers together. However, I'll probably be happy with any change which reduces the height of the LiveRange/LiveInterval/VirtualRegister/VirtualRegisterGroup stack :-).
Attachment #8574696 - Flags: review?(sunfish) → review+
Attached patch structural rewrite — Splinter Review
Cleanup patch as described in comment 4.  This is a substantial rewrite of the organizational data structures used by the backtracking allocator, but it is not supposed to actually change the allocator's behavior.  A few algorithms had to be reworked, like splitAt(), but they should still have the same effect.  The basic data structures are:

LiveRange: A continuous range of positions where a virtual register is live.
LiveBundle: A set of LiveRanges which do not overlap.
VirtualRegister: A set of all LiveRanges used for some LDefinition.

LiveBundle is used for performing allocations.  It has a very similar role to LiveInterval, but I renamed it because LiveInterval is a terrible name (it does not match the mathematical notion of an interval, so only sows confusion).

I also changed the structures so that bundles and registers contain singly linked lists of ranges, rather than vectors.  This is cleaner and more efficient and makes it clear that each range belongs to at most one bundle and one register.

Once this gets in I'll rewrite the first patch (which does change the allocator's behavior), and hopefully be able to remove VirtualRegisterGroup.
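A minimal sketch of these structures, using stand-ins for the real jit types (CodePosition, LAllocation, LDefinition); the actual classes in BacktrackingAllocator.h carry much more state than this:

#include <cstdint>

typedef uint32_t CodePosition;   // stand-in
struct LAllocation {};           // stand-in
struct LDefinition {};           // stand-in

// A continuous range of positions where a virtual register is live. Each
// range belongs to at most one bundle and exactly one register, linked
// through the two singly linked lists below.
struct LiveRange {
    uint32_t vreg;
    CodePosition from;
    CodePosition to;
    LiveRange* nextInBundle;
    LiveRange* nextInRegister;
};

// A set of non-overlapping ranges that receive a single allocation.
struct LiveBundle {
    LiveRange* firstRange;       // linked via nextInBundle
    LAllocation allocation;
};

// All of the ranges where a given LDefinition is live.
struct VirtualRegister {
    LDefinition* def;
    LiveRange* firstRange;       // linked via nextInRegister
};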
Attachment #8594367 - Flags: review?(sunfish)
Attached patch addendum — Splinter Review
This goes along with the last patch.  Move groups and related stuff currently have internal pointers to the LAllocations within live intervals.  LSRA depended on this for some reason I never understood, but the backtracking allocator doesn't, so this patch fixes things up to use LAllocation directly instead of the extra indirection.
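A small sketch of the indirection being removed; the struct and field names here are illustrative, not the actual move-group classes:

struct LAllocation {};           // stand-in for the real allocation type

// Before: a move resolves its endpoints through pointers into live-interval
// state, so the intervals must stay alive and stable while moves are handled.
struct MoveWithIndirection {
    LAllocation* from;
    LAllocation* to;
};

// After: the allocations are stored by value, so the move no longer depends
// on the intervals it was derived from.
struct MoveByValue {
    LAllocation from;
    LAllocation to;
};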
Attachment #8594368 - Flags: review?(sunfish)
Comment on attachment 8594367 [details] [diff] [review]
structural rewrite

Review of attachment 8594367 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, I'm excited about this patch! LiveRanges, LiveBundles, and VirtualRegisters look like a good organization, and should give us what we need to start cleaning up some of the things we run into with LiveIntervals today.

For my sanity: the code in buildLivenessInfo is mainly just being moved around, with no significant change to the meaning of liveness, right?

I expect this is just the beginning of a larger story. I've made a pass through the code and it looks good; I'm sure I'll have questions and more to say as I really get into it, but this looks like a great start!

::: js/src/jit/BacktrackingAllocator.cpp
@@ +74,5 @@
> +
> +void
> +LiveRange::distributeUses(LiveRange* other)
> +{
> +    MOZ_ASSERT(other->vreg() == vreg());

You can also assert(this != other) here, for sanity.

@@ +83,5 @@
> +        if (other->covers(use->pos)) {
> +            uses_.removeAndIncrement(iter);
> +            other->addUse(use);
> +        } else {
> +            iter++;

I find this iteration pattern, where one path adjusts iter by reference and another path increments it, hard to follow. This is a case where the post-increment idiom works fairly well:

for (...; iter; ) {
  UsePosition* use = *iter++;

  // safe to disconnect use from the list
}

and then you don't need removeAndIncrement, and can just remove. What do you think?
Attachment #8594367 - Flags: review?(sunfish) → review+
(In reply to Dan Gohman [:sunfish] from comment #8)
> @@ +83,5 @@
> > +        if (other->covers(use->pos)) {
> > +            uses_.removeAndIncrement(iter);
> > +            other->addUse(use);
> > +        } else {
> > +            iter++;
> 
> I find this iteration pattern, where one path adjusts iter by reference and
> another path increments it, hard to follow. This is a case where the
> post-increment idiom works fairly well:
> 
> for (...; iter; ) {
>   UsePosition* use = *iter++;
> 
>   // safe to disconnect use from the list
> }
> 
> and then you don't need removeAndIncrement, and can just remove. What do you
> think?

Unfortunately, if we increment iter and then remove the item before it, the iterator will become messed up --- the iterator includes a |prev| member, which will go out of sync if that element is itself removed from the list.  Normally the inline lists aren't designed to be changed at all while they are being iterated (see the modifyCount_ member), so I added the removeAndIncrement method to allow this sort of operation.
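A self-contained sketch of the hazard: a forward iterator that caches the previous node so it can unlink the current one (simplified from InlineForwardList; the names here are assumptions):

// Singly linked node, as in an inline forward list.
struct Node { Node* next = nullptr; };

// Iterator that remembers the node *before* the current one.
struct Iterator {
    Node* prev;
    Node* cur;
};

// Safe: unlink |cur| and advance in one step. |prev| is untouched and still
// points at the node immediately before the new |cur|.
inline Node* removeAndIncrement(Iterator& it)
{
    Node* removed = it.cur;
    it.cur = removed->next;
    it.prev->next = it.cur;
    removed->next = nullptr;
    return removed;
}

// With the *iter++ idiom, by contrast, the iterator advances first, so |prev|
// ends up referring to the node about to be unlinked; a later removal through
// this iterator would splice around the wrong element.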
(In reply to Brian Hackett (:bhackett) from comment #9)
> Unfortunately, if we increment iter and then remove the item before it, the
> iterator will become messed up --- the iterator includes a |prev| member,
> which will go out of sync if that element is itself removed from the list. 
> Normally the inline lists aren't designed to be changed at all while they
> are being iterated (see the modifyCount_ member), so I added the
> removeAndIncrement method to allow this sort of operation.

Ah, I see. Could you add some comments around the code that uses removeAndIncrement noting that this is what's going on?
Comment on attachment 8594368 [details] [diff] [review]
addendum

Review of attachment 8594368 [details] [diff] [review]:
-----------------------------------------------------------------

Great!

::: js/src/jit/shared/CodeGenerator-shared.h
@@ +234,5 @@
>          MOZ_ASSERT(offset % sizeof(Value) == 0);
>          return offset;
>      }
>  
> +    inline int32_t ToStackOffset(const LAllocation& a) const {

Can this take the LAllocation by value instead of by reference?
Attachment #8594368 - Flags: review?(sunfish) → review+
Keywords: leave-open
This caused several small regressions and improvements on kraken and octane.  I think the regressions outweigh the improvements, and since we're close to the next merge it's probably better to just back out and reland as-is after the merge.  If anyone wants to, go ahead and back out; otherwise I'll do it this weekend (provided the tree ever reopens).

I'd rather not go and figure out what's going on with the regressions immediately.  I need to rewrite the first patch in this bug, which will change how the backtracking allocator works and the code it generates anyway, and afterwards I can deal with any perf fallout from the two patches together.
Attached patch patch — Splinter Review
Redo the first patch on top of the structural rewrite patch.  This removes VirtualRegisterGroup and some associated cruft in VirtualRegister.  In its place it adds a new class, SpillSet, which keeps track of the places where split-off pieces of an initial bundle have been spilled.  I thought about merging this with the spill parent on a bundle, but I think it's better to keep them as separate but related concepts.  Doing it this way allows us to split a bundle without spilling it everywhere, while still being able to use a consistent stack slot when spilling the pieces (avoiding the dreaded memory->memory move).  Still, things are a lot cleaner after this patch:

 3 files changed, 610 insertions(+), 930 deletions(-)

Testing this patch plus the previous one on x86, my octane score improves by maybe 1% (richards and zlib improvements, it seems) and my kraken score is flat.  Still, we'll need to wait for this to get on awfy to be sure, so it would be nice to land this relatively early in the next cycle.
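A rough sketch of the SpillSet idea; the field and method names are assumptions, and the real class of course works with the allocator's own bundle and allocation types:

#include <vector>

struct LAllocation {};           // stand-in
struct LiveBundle {              // stand-in: only the piece the sketch needs
    LAllocation allocation;
};

// Shared by every bundle split off from the same initial bundle: once any
// piece has to be spilled, they all reuse the same stack slot, so moves
// between spilled pieces never have to go memory->memory.
class SpillSet {
    std::vector<LiveBundle*> spilledBundles_;
  public:
    void addSpilledBundle(LiveBundle* bundle) {
        spilledBundles_.push_back(bundle);
    }
    void setAllocation(const LAllocation& slot) {
        // Hand the single shared slot to every spilled piece at once.
        for (LiveBundle* bundle : spilledBundles_)
            bundle->allocation = slot;
    }
};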
Attachment #8574696 - Attachment is obsolete: true
Flags: needinfo?(bhackett1024)
Attachment #8603848 - Flags: review?(sunfish)
Regression URLs (sorry, I cannot merge them into one entry yet):
http://arewefastyet.com/regressions/#/regression/1476122
http://arewefastyet.com/regressions/#/regression/1476143
http://arewefastyet.com/regressions/#/regression/1476232
http://arewefastyet.com/regressions/#/regression/1476205

Confirmed fixed by:
http://arewefastyet.com/regressions/#/regression/1476932
http://arewefastyet.com/regressions/#/regression/1476953
http://arewefastyet.com/regressions/#/regression/1477097
http://arewefastyet.com/regressions/#/regression/1477123

So it seems all the changes were caused by this patch, except for kraken-darkroom, which was also partially regressed by another patch. Everything is sorted out and back to pre-Friday numbers.
Upon r+, feel free to land and I'll post the regression links with status for investigation.
Comment on attachment 8603848 [details] [diff] [review]
patch

Review of attachment 8603848 [details] [diff] [review]:
-----------------------------------------------------------------

Yay, no more VirtualRegisterGroup :-). I've now finished a basic readthrough of the code and it looks good. Doubtless this is just the beginning, of course.

::: js/src/jit/InlineList.h
@@ +86,5 @@
>          T* result = static_cast<T*>(this->next);
>          removeAfter(this, result);
>          return result;
>      }
> +    T* back() const {

Ordinarily, a const back() method would return a const reference (pointer in this case) to the element. Is there a const inconsistency somewhere that led to this method needing a const qualifier but returning non-const?
Attachment #8603848 - Flags: review?(sunfish) → review+
(In reply to Dan Gohman [:sunfish] from comment #17)
> Ordinarily, a const back() method would return a const reference (pointer in
> this case) to the element. Is there a const inconsistency somewhere that led
> to this method needing a const qualifier but returning non-const?

No, but I think the 'const' on these InlineForwardList methods is referring to the const-ness of the list itself, rather than the individual elements; having a non-const pointer to an element won't allow anyone to modify the structure of the list.  InlineList (the doubly linked version of InlineForwardList) already has a 'T* peekBack() const' method doing the same thing as this one.
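A tiny sketch of the distinction: the const applies to the list structure, not to the element it hands back (modeled loosely on InlineForwardList):

template <typename T>
class ForwardListSketch {
    T* tail_ = nullptr;
  public:
    // Does not modify the list's links, so it can be const, even though the
    // caller receives a mutable pointer to the element itself.
    T* back() const {
        return tail_;
    }
};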
Keywords: leave-open
A regression/improvement was reported on AWFY (by mayankleoboy1@gmail.com, thanks!):
- slave: Mac OS X 10.10 32-bit (Mac Pro, shell)
- mode: Ion

Regression(s)/Improvement(s):
- octane: Richards: 10.08% (improvement)
- octane: Crypto: -5.81% (regression)
- octane: Mandreel: -3.71% (regression)
- octane: Box2D: -2.97% (regression)
- octane: zlib: 7.86% (improvement)

Recorded range:
- http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=df6f82e4a2db&tochange=d9b6cc4ce4cc

More details: http://arewefastyet.com/regressions/#/regression/1592575

Is there something we can do about the crypto/mandreel/box2d regression?
https://hg.mozilla.org/mozilla-central/rev/261cadb83015
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
(In reply to Hannes Verschore [:h4writer] from comment #20)
> 
> Is there something we can do about the crypto/mandreel/box2d regression?

I'll look into these later this week.
Flags: needinfo?(bhackett1024)
There were a lot more regressions/improvements than just the Octane tests, but it's probably the same underlying cause.
Depends on: 1167677
Depends on: 1167815
AWFY showed regressions on almost all platforms it currently manages:
Win 8 shell: http://arewefastyet.com/regressions/#/regression/1596828
Win 8 browser: http://arewefastyet.com/regressions/#/regression/1596814
Win 8 PGO: http://arewefastyet.com/regressions/#/regression/1596895
Mac browser: http://arewefastyet.com/regressions/#/regression/1596844
Mac 64, shell: http://arewefastyet.com/regressions/#/regression/1594155 
Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594185
Mac 32, shell: http://arewefastyet.com/regressions/#/regression/1592575
Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594240

Too bad I cannot group the regressions yet.
But the affected benchmarks are:
- kraken
- dromaeo
- octane
- massive
- jetstream
- asmjs-apps
- assorted
- dart

Of course there are also improvements, but they don't offset the regressions.
E.g. http://arewefastyet.com/regressions/#/regression/1596844 is very bad.
(In reply to Hannes Verschore [:h4writer] from comment #24)
> Off course there are also improvements, but they don't offset the
> regressions.
> E.g. http://arewefastyet.com/regressions/#/regression/1596844 is very bad.

What am I supposed to make of this?  I looked at this link and at the Box2D regression (the largest reported one) but it's pretty clearly a bogus report.
(In reply to Brian Hackett (:bhackett) from comment #25)
> (In reply to Hannes Verschore [:h4writer] from comment #24)
> > Off course there are also improvements, but they don't offset the
> > regressions.
> > E.g. http://arewefastyet.com/regressions/#/regression/1596844 is very bad.
> 
> What am I supposed to make of this?  I looked at this link and at the Box2D
> regression (the largest reported one) but it's pretty clearly a bogus report.

You are totally correct. I marked the Box2D one as noise. The richards regression is also questionable; I'm not sure.
But I think the others are real.
(In reply to Hannes Verschore [:h4writer] from comment #24)
> AWFY showed regression on almost all platforms it currently manages:
> Win 8 shell: http://arewefastyet.com/regressions/#/regression/1596828
> Win 8 browser: http://arewefastyet.com/regressions/#/regression/1596814
> Win 8 PGO: http://arewefastyet.com/regressions/#/regression/1596895
> Mac browser: http://arewefastyet.com/regressions/#/regression/1596844
> Mac 64, shell: http://arewefastyet.com/regressions/#/regression/1594155 
> Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594185
> Mac 32, shell: http://arewefastyet.com/regressions/#/regression/1592575
> Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594240
> 
> Too bad I cannot group the regressions yet.
> But the affected benchmarks are:
> - kraken
> - dromaeo
> - octane
> - massive
> - jetstream
> - asmjs-apps
> - assorted
> - dart
> 
> Off course there are also improvements, but they don't offset the
> regressions.
> E.g. http://arewefastyet.com/regressions/#/regression/1596844 is very bad.

OK, I had time to look through these links and think that your characterization of the effects of this patch is pretty unfair.  Lots of tests regressed, and lots of tests improved, too.  This is, for better or for worse, what happens when the register allocator changes.

Bug 1167677 should help x86, though it might not, and in any case this patch seemed to be a net improvement on x86, improving octane and not having large effects elsewhere.

More things regressed on x64, and I'll try to understand what's going on with the asmjs-apps that did so, as well as a couple of kraken tests (gaussian-blur and darkroom) and misc tests (interpreter and arraymod).  In any case this patch had a neutral effect on the major benchmarks we track.
Blocks: 1170811
Depends on: 1170840
(In reply to Brian Hackett (:bhackett) from comment #27)
> (In reply to Hannes Verschore [:h4writer] from comment #24)
> > AWFY showed regression on almost all platforms it currently manages:
> > Win 8 shell: http://arewefastyet.com/regressions/#/regression/1596828
> > Win 8 browser: http://arewefastyet.com/regressions/#/regression/1596814
> > Win 8 PGO: http://arewefastyet.com/regressions/#/regression/1596895
> > Mac browser: http://arewefastyet.com/regressions/#/regression/1596844
> > Mac 64, shell: http://arewefastyet.com/regressions/#/regression/1594155 
> > Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594185
> > Mac 32, shell: http://arewefastyet.com/regressions/#/regression/1592575
> > Mac 64, noasmjs: http://arewefastyet.com/regressions/#/regression/1594240
> > 
> > Too bad I cannot group the regressions yet.
> > But the affected benchmarks are:
> > - kraken
> > - dromaeo
> > - octane
> > - massive
> > - jetstream
> > - asmjs-apps
> > - assorted
> > - dart
> > 
> > Off course there are also improvements, but they don't offset the
> > regressions.
> > E.g. http://arewefastyet.com/regressions/#/regression/1596844 is very bad.
> 
> OK, I had time to look through these links and think that your
> characterization of the effects of this patch is pretty unfair.  Lots of
> tests regressed, and lots of tests improved, too.  This is, for better or
> for worse, what happens when the register allocator changes.
> 
> Bug 1167677 should help x86, though it might not, and in any case this patch
> seemed to be a net improvement on x86, improving octane and not having large
> effects elsewhere.
> 
> More things regressed on x64, and I'll try to understand what's going on
> with the asmjs-apps that did so, as well as a couple of kraken tests
> (gaussian-blur and darkroom) and misc tests (interpreter and arraymod).  In
> any case this patch had a neutral effect on the major benchmarks we track.

Sounds logical. It is indeed normal that a register allocator change can have a large impact.
I agree that the listed benchmarks, like asmjs-apps, are the most important ones to check.

Though something very curious happened with the landing of bug 1169594 on asmjs-apps:
http://arewefastyet.com/#machine=29&view=single&suite=asmjs-apps&subtest=box2d-throughput&start=1431867600.05&end=1433347605.5
You can see the regression caused by this patch, and that bug 1169594 brought performance back on that benchmark, but only for non-asmjs. Regular asmjs is still bad. Is there some bad/strange interaction going on?
(In reply to Hannes Verschore [:h4writer] from comment #28)
> though something very curious happened with landing of bug 1169594 on
> asmjs-apps.
> http://arewefastyet.com/#machine=29&view=single&suite=asmjs-
> apps&subtest=box2d-throughput&start=1431867600.05&end=1433347605.5
> You see the regression caused by this patch and you see that bug 1169594
> brought the performance back on that benchmark, but only on non-asmjs.
> Normal asmjs is still bad. Is there some bad/strange interaction going on?

The patch in bug 1169594 touches AddKeepAliveInstructions, which is called only for non-asmjs code:
https://dxr.mozilla.org/mozilla-central/source/js/src/jit/Ion.cpp#1597-1602
(In reply to Hannes Verschore [:h4writer] from comment #28)
> Sounds logical. It is indeed normal that a register change can have a larger
> big impact.

I've marked these regressions as 'wontfix' in AWFY, given the nature of the change and the fact that this is now in beta.
Flags: needinfo?(bhackett1024)
Do you remember why you changed kTrustedScriptBuffer for ASan from 270k to 450k, and how the number was decided?

  https://dxr.mozilla.org/mozilla-central/rev/ff04d410e74b69acfab17ef7e73e7397602d5a68/js/xpconnect/src/XPCJSContext.cpp#3453

It was not there in the reviewed patch (attachment 8603848 [details] [diff] [review]).
Flags: needinfo?(bhackett1024)
(In reply to Ting-Yu Chou [:ting] from comment #31)
> Do you remember the reason why you set kTrustedScriptBuffer for ASan from
> 270k to 450k, and how was the number decided?
> 
>  
> https://dxr.mozilla.org/mozilla-central/rev/
> ff04d410e74b69acfab17ef7e73e7397602d5a68/js/xpconnect/src/XPCJSContext.
> cpp#3453
> 
> It was not there in the reviewed patch (attachment 8603848 [details] [diff] [review]
> [review]).

I don't remember why I made this change, but the backtracking allocator affects JIT stack frame size so the older, smaller limit was probably causing test failures in automation.
Flags: needinfo?(bhackett1024)