Open Bug 1941608 Opened 23 days ago Updated 23 days ago

Explore AvlTree improvements

Categories

(Core :: Performance, enhancement)

enhancement

Tracking

()

People

(Reporter: jlink, Assigned: jlink)

References

(Blocks 1 open bug)

Details

A few months ago, while looking at a profile from Denis (now lost to the sands of time) showing lots of Ion compilation time on Pixel 6, I noticed that main thread linking was spending an unusually large amount of time in JitCodeRange::compare(). This wasn't something that I had noticed before and it seemed worth taking a quick look at.

Indeed, initial haphazard performance testing (see here and here) indicates that the Pixel 6 strongly likes any improvements that I make to AvlTree or the comparison functions used with it. The A55 seems to like those same changes a bit but not nearly as much.

For other platforms the impact is less clear - maybe there is some improvement and maybe there is some regression. This comparison implies that perhaps all platforms are happy when all of the improvements are applied together. This is not entirely unexpected because it has been hypothesized that improving Ion compilation times might not always be a net positive for SP3 scores sometimes because it increases the load for main-thread linking. However, once the main thread linking is improved enough, further Ion compilation improvements would then be beneficial.

A little bit of clarification on the two comparisons that I linked to above:

  1. The first comparison shows the impact of both de-branching JitCodeRange::compare() and some de-branching/branching improvements in AvlTree::find_worker().
  2. The second comparison shows the impact of de-branching LiveRange/LiveRangePlus::compare() and does not also contain the changes from the previous report.

We know from profiling that the AvlTree, at least as used by the register
allocator, is bad news from a cache-miss perspective. This is particularly
noticeable when allocating large functions. I suspect it is also bad news from
a branch-prediction perspective, but I have no data to back that up.

One thing we could do is to rework the node storage so as to have a vector of
nodes, and have inter-node pointers be 32-bit indices in this array. Currently
they are pointers, so this would save 64 bits per node on 64 bit targets.

More work would be to use a B-Tree, which is what I believe Chris Fallin's RA2
uses. This would reduce the total number of nodes we need to visit.

Zooming out a bit, what the AvlTrees are used for is to determine whether there
is an intersection between the LiveRanges in a LiveBundle and the ranges for
which a register is currently committed. In both cases -- in the LiveBundle
and in the tree -- the ranges are non-overlapping and in order. So I wonder if
the check could be done by finding the start point range of a bundle in the
tree, and then "walking forwards" through the bundle and the tree together,
until either the bundle or tree has no more ranges, or an overlap is found.

Either way, for a convincing patch, it would be good to have cache miss and
mispredict numbers from h/w perf counters to verify the change. This might be
a bit less haphazard than (eg) "A55 seems to like <some change> / Pixel 6
definitely likes <change>". I am all set up to measure such things and am
happy to try out patches (in the shell) if that helps.

You need to log in before you can comment on or make changes to this bug.