Closed Bug 1772282 Opened 2 years ago Closed 2 years ago

Replace js/src/ds/SplayTree.h with an AvlTree.h and change all of SM's uses accordingly

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
103 Branch
Tracking Status
firefox103 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

SM contains a splay tree implementation in js/src/ds/SplayTree.h. This has
three uses:

(1) in the register allocator, to keep track of LiveRanges for each physical reg

(2) in the register allocator, to record ranges containing calls

(3) in js/src/ds/MemoryProtectionExceptionHandler.cpp, to record memory ranges
that are protected (?)

When compiling via Ion, (1) is a significant expense during register
allocation. (2) and (3) are believed to be low-traffic.

The splay implementation has the following disadvantages:

  • the workload for (1) is lookup-heavy, and per profiling, the splay tree is
    significantly slower than an AVL tree, especially for large functions. This
    may be due to the AVL tree being better balanced on average, or the AVL tree
    not requiring modification during lookup, or a combination of both.

  • the splay tree has no iteration facility. This leads to a couple of kludgey
    workarounds:

    • the class BacktrackingAllocator::PrintLiveRange for printing trees

    • the mutually-redundant data structures
      CallRangeList callRangesList;
      SplayTree<CallRange*, CallRange> callRanges;
      used by the RA to keep track of ranges containing calls.

  • SplayTree.h doesn't make clear which methods are "public interface" and
    which are internal implementation.

This bug has three patches:

(a) one that adds AvlTree.h, but does not use it
(b) one that changes all SplayTree uses to AvlTree
(c) one that removes SplayTree.h

Note that none of this changes the top level mfbt/SplayTree.h, nor changes any
uses of that. That is a separate thing which looks like it was once the same
as SM's copy, but the two copies diverged some time back.

(In reply to Julian Seward [:jseward] from comment #0)

(3) in js/src/ds/MemoryProtectionExceptionHandler.cpp, to record memory ranges
that are protected (?)

This was used with AsmJS, as a way to detect out-of-bounds, in the 4G virtual space addressable from 32 bits pointers. I am not sure if this is still used for WebAssembly.

This then got an extra use case for a debug mode of LifoAlloc, where LifoAlloc are temporarily marked as read-only when moved from the main thread to the helper thread which resume the compilation. The intent was to pin-point to the part of the code which is corrupting the content of LifoAlloc buffers.

Here are perf numbers for the upcoming patch set.

First, it is important to appreciate that patches produce different
allocations than the original. I spent some hours tracking down this bizarre
phenomenon. I had assumed that the allocator only cared about "does this
range overlap some other range in the tree?" -- that is, essentially, set
membership. But no, the allocator depends on the actual tree layout,
specifically which node is closest to the root when more than one node matches
a query, and that's different for the two tree implementations.

This behaviour manifests in BacktrackingAllocator::tryAllocateRegister, where
register-use trees are queried:

      if (!rAlias.allocations.contains(range, &existing)) {
        continue;
      }

This asks "does the tree contain a range that overlaps range?; if yes, return
it in existing". If more than one range in the tree overlaps range, which
we get in existing is arbitrary. The code goes on to decide whether it's OK
to evict the bundle containing existing based (in part) on existings spill
weight. This seems to me a bug in the logic in that if existing has a low
spill weight then it may choose to evict existings bundle, even though some
other range -- that we weren't given -- has a higher spill weight. Hence it
could incorrectly decide to evict a bundle that has a higher spill weight than
the bundle we are trying to allocate.

But I may misunderstand. I made multiple attempts to "fix" this, but failed.

With all that said, the patched allocator actually generates very slightly
better allocations, for the wasm embenchen suite: On x86_64, an avg 0.23%
reduction in insn counts, with no regressions, and on x86_32, an avg 0.18%
reduction in insn counts, with worst case regression a 0.06% increase. See
the attached file.

For allocation costs for an input with large-ish functions (a couple in the
1000-5000 LIR range), viz, wasm-bullet, the total cost of
BacktrackingAllocator::go are (M = million):

before:  insns=551.446M  L1cacheMisses=6.040M (simulated cache)
after:   insns=491.930M  L1cacheMisses=6.566M (simulated cache)

So a clear win on the insn count. The L1 misses increase is unfortunate. It
may be possible to do better by giving the tree a private allocator only for
tree nodes, so that they are not interleaved between other allocations made by
the RA. Also by removing the balance-indication (-1, 0, 1) field in each node
by stealing bits from one of the child pointers, although this would increase
the insn count.

Overall, times are the same or better:

Octane (higher is better)
x86_64  before  best of 20 = 30124
x86_64  after   best of 20 = 30168

x86_32  before  best of 20 = 26950
x86_32  after   best of 20 = 27122

JetStream2 (higher is better) x86_64, on 32-bit-deep VNC
https://browserbench.org/JetStream
before  111.402  114.602  113.268  110.251  113.487
after   112.217  117.732  117.091  111.815  116.316

wasm: Embenchen: COMPILE/RUN TIMES (lower is better) (best of 5) i5-1135G7
               --before--  ->  --after--
wasm_bullet.j  31 /  774   ->  28 /  777
wasm_lua_bina  19 / 2579   ->  18 / 2567
wasm_box2d.js  10 /  593   ->  10 /  590
wasm_zlib.c.j  11 /  899   ->  10 /  902
All other compile times < 10 ms, too low to be meaningful
Depends on: 1772123

This patch adds js/src/ds/AvlTree.h, which provides facilities similar to
js/src/ds/SplayTree.h, but with reduced lookup costs. The patch merely adds
the code and test cases but does not make any use of it.

To make it easy to change use points, the public interface is identical in
naming and semantics to that of SplayTree.h, to the extent it can be. See
comments on AvlTree::contains and AvlTree::maybeLookup. The memory allocation
strategy is also the same.

There are new facilities for iterating over a tree, either completely or from
an arbitrary start point.

AvlTree.h is structured so as to hide the implementation from users as much as
possible. Most of the file is an implementation class AvlTreeImpl, which
contains all the difficult stuff -- AVL insertion, deletion, rebalancing,
search and iteration. The end-user interface, plus documentation, is in a
small class AvlTree at the bottom of the file.

The patch also contains self-test code, testAvlTree.cpp. The public interface
class AvlTree is (intentionally) too restrictive for proper testing. So
this file defines its own testing interface class, AvlTreeTestIF. It then
uses that to test that insertion, deletion and iteration work correctly, and
that the tree remains balanced.

This changes all 3 of SM's uses of js/src/ds/SplayTree.h to use
js/src/ds/AvlTree.h. The new interface is almost identical to the old one, so
the changes are mostly trivial:

(0) js/src/jit/JitcodeMap.h: two comments referencing unknown "trees" have
been amended.

(1) js/src/ds/MemoryProtectionExceptionHandler.cpp: this uses a tree to record
memory ranges that are protected (?). The only change is of the type of the
tree.

(2) BacktrackingAllocator.h: a minor use, to record ranges containing calls
(BacktrackingAllocator::callRanges). Also just a change of type. It would
be possible to use the AVL trees to merge the partially-redundant fields
::callRanges and ::callRangesList, but that is beyond the scope of this
patch.

(3) BacktrackingAllocator.h: the main use: changing LiveRangeSet to use an
AvlTree. This is also just a renaming of the type.

(3, more) struct PrintLiveRange has been removed. It was a workaround for
the fact that the splay trees had no iteration facility. Its use, in
BacktrackingAllocator::dumpAllocations, has been replaced by an AVL iterator.

(3, more) Note that this change causes the allocator to produce different
allocations. This is because the allocator depends on the actual tree layout,
specifically which node is closest to the root when more than one node matches
a query, and that's different for the two tree implementations.

This behaviour manifests in BacktrackingAllocator::tryAllocateRegister, where
register-use trees are queried:

if (!rAlias.allocations.contains(range, &existing)) {
continue;
}

This asks "does the tree contain a range that overlaps range?; if yes,
return it in existing". If more than one range in the tree overlaps range,
which one is written to existing is arbitrary. The code goes on to decide
whether it's OK to evict the bundle containing existing based (in part) on
existings spill weight.

This could be seen as a bug in the logic in that if existing has a low spill
weight then it may choose to evict existings bundle, even though some other
range -- that wasn't returned -- has a higher spill weight. Hence it could
incorrectly decide to evict a bundle that has a higher spill weight than the
bundle for which allocation is attempted.

The above analysis may be a misinterpretation of the logic. Multiple attempts
to "fix" it were made, without success. In any case the resulting
allocations are marginally better. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1772282#c2

Depends on D148246

This patch simply removes js/src/ds/SplayTree.h and makes no other changes.

Depends on D148247

Blocks: 1772991

Could you also collect some data on memory usage, just to make sure it's reasonable? LifoAlloc knows how much memory it has allocated, so it should be possible to compare the amount of memory allocated during regalloc when compiling some Wasm module in the shell, with and without these changes. It doesn't have to be a detailed comparison, but it would be good to verify it's less than or similar to what we have now on an interesting workload.

Flags: needinfo?(jseward)

(In reply to Jan de Mooij [:jandem] from comment #6)

Could you also collect some data on memory usage, just to make sure it's
reasonable?

Sure. I have the following numbers, but I'm not sure if they are what you are
after. For each test, I ran the shell with --no-threads, and printed
alloc.lifoAlloc()->peakSizeOfExcludingThis() and
alloc.lifoAlloc()->computedSizeOfExcludingThis()
at the end of each function compilation. It seems from the numbers that the
same allocator is used for the whole compilation, and not reset in between.

Test cases are: wasm_bullet from Embenchen, which has some large-ish
functions, and the .wasm for the LibreOffice test case we have.

Results:
"comp" = computedSizeOfExcludingThis()
"peak" = peakSizeOfExcludingThis()

Unpatched (w/ splay trees)
wasm_bullet peak 4194304 comp 4194304
LibreOffice peak 103436560 comp 24117248

Patched (w/ AVL trees) [this bug]
wasm_bullet peak 4194304 comp 4194304
LibreOffice peak 103436560 comp 24117248

Patched (w AVL trees) + batch allocation [this bug + bug 1772991]
wasm_bullet peak 4194304 comp 4194304
LibreOffice peak 103436560 comp 24117248

So either I did something wrong or the numbers haven't changed. I wouldn't
find the latter surprising, considering that (1) the number of nodes in the
trees should be exactly the same, just differently balanced, and (2) the node
size is probably unchanged. For splay trees we have:

  struct Node {
    T item;
    Node* left;
    Node* right;
    Node* parent;

For AVL trees it is

  enum class Tag : uint8_t { /* 4 values */ }
  struct Node {
    Node* left;
    Node* right;
    Tag tag;
    T item;

In both cases we have an item T and left and right pointers. The splay nodes
have a parent pointer; the AVL nodes have a uint8_t tag which (I assume) is
unfortunately rounded up to sizeof(void*) :-( So they are the same size.

[Even worse is, the tag has only 2 live bits, so it could be got rid of
entirely by stealing bits from the left/right pointers, at some cost in
instruction count.]

Is that what you're looking for? Or are there more meaningful stats I can get
out of class LifoAlloc? I skimmed LifoAlloc's sources but didn't see
anything better.

Flags: needinfo?(jseward) → needinfo?(jdemooij)

We talked about this on Matrix earlier today, but yes that looks perfect and makes sense. Thanks!

Flags: needinfo?(jdemooij)
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/63cb0d0c4935
(part 1 of 3) Replace js/src/ds/SplayTree.h with an AvlTree.h and change all of SM's uses accordingly.  r=rhunt.
https://hg.mozilla.org/integration/autoland/rev/13287ab08d40
(part 2 of 3) Replace js/src/ds/SplayTree.h with an AvlTree.h and change all of SM's uses accordingly.  r=jandem.
https://hg.mozilla.org/integration/autoland/rev/cc5be1d8d929
(part 3 of 3) Replace js/src/ds/SplayTree.h with an AvlTree.h and change all of SM's uses accordingly.  r=rhunt.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: