Closed Bug 1532592 Opened 6 years ago Closed 6 years ago

BaseCompiler::stk_ is always heap allocated, no matter what

Categories

(Core :: JavaScript: WebAssembly, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: jseward, Assigned: jseward)

Details

Attachments

(2 files, 2 obsolete files)

A curious combination of circumstances causes the baseline compiler's operand stack to always be heap allocated, no matter how trivial the function to be compiled.

The stack is declared thusly

typedef Vector<Stk, 8, SystemAllocPolicy> StkVector;
..
StkVector stk_;

which seems reasonable. However, BaseCompiler::emitBody starts off with this

uint32_t overhead = 0;

and the main processing loop contains this

    if (overhead == 0) {
      // Check every 50 expressions -- a happy medium between
      // memory usage and checking overhead.
      overhead = 50;

      // Checking every 50 expressions should be safe, as the
      // baseline JIT does very little allocation per expression.
      CHECK(alloc_.ensureBallast());

      // The pushiest opcode is LOOP, which pushes two values
      // per instance.
      CHECK(stk_.reserve(stk_.length() + overhead * 2));
   }

The result is that, when we encounter the first opcode of the function to compile, stk_ is empty, but we wind up reserving 100 elements anyway. Since the inline size of |stk_| is 8 this always forces a heap allocation.

So the obvious fix is to specify an an initial inline size for stk_ that is modestly over 100 -- for example, 128. Right? Alas, no, because our Vector implementation limits inline size to just below 1024 bytes, which given the sizeof(Stk), is I think about 62 elements. See mfbt::Vector.h:297 ("static constexpr size_t kMaxInlineBytes = ...").

Per DHAT profiling, compiling the Tanks demo allocates 225.9MB in 94456 blocks, of which 69.0MB in 33688 blocks are these pointless resizes-into-the-heap.

Given the inline-size limit of 1024 bytes for our Vector implementation, I can't currently think how to work around this.

Nice find. Observe related bug 1316845 when pondering how to fix this.

Attached patch bug1532592-scheme1-2.diff (obsolete) — Splinter Review

For compiling the Tanks demo on an x86_64-linux build, gcc 8.2.1, -O2, this
change reduces the amount of heap allocation from 225,917,785 bytes in 94,456
blocks to 156,924,761 bytes in 60,767 blocks. The instruction count falls
from 2392 million to 2379 million.

Assignee: nobody → jseward

Another idea is to put the stk_ into CompiledCode so that, even though it's empty() at the end of each compilation, the backing malloc() bytes get reused across compilation tasks. This may actually provide a small improvement beyond the numbers in comment 2.

Also, in case you're generally interested in this, do you see any other hot Vectors that frequently exceed their inline allocation and could be similarly moved into CompileCode?

(In reply to Luke Wagner [:luke] from comment #3)

Another idea is to put the stk_ into CompiledCode so that, even though
it's empty() at the end of each compilation, the backing malloc() bytes
get reused across compilation tasks. This may actually provide a small
improvement beyond the numbers in comment 2.

Right. Yes. I tried (almost) exactly that too. It does indeed also mostly
remove the stack-to-heap copies. But the instruction count falls by only 2
million out of 2394 million, whereas with the comment 2 patch the drop is
by about 13 million.

Indeed, I worry that this approach would slow down the compiler in the worst
case. It necessarily involves changing the type of BaseCompiler::stk_ from
StkVector to StkVector&, and so every use of the stack will require an extra
memory reference.

Also, in case you're generally interested in this, do you see any other hot
Vectors that frequently exceed their inline allocation and could be
similarly moved into CompileCode?

I did other measurements too and there are some worthwhile wins but much smaller
than this one; this is the biggie. See bug 1532974.

(In reply to Julian Seward [:jseward] from comment #4)

Indeed, I worry that this approach would slow down the compiler in the worst
case. It necessarily involves changing the type of BaseCompiler::stk_ from
StkVector to StkVector&, and so every use of the stack will require an extra
memory reference.

I think you could keep the non-& StkVector by using .swap() from the CompiledCode to BaseCompiler (which is what all the other CompiledCode Vectors do as well).

In principle that Vector size limit could be bumped, but obviously we want to not go too far from page size, and there's a "did you really mean it" tradeoff to letting bigger inline sizes occur. So, yeah, I'm inclined to leave it as is like people here are thinking, but it's worth noting that someone, at least, would find value in it being higher. If more users manifest, we can always bump then.

(In reply to Jeff Walden [:Waldo] from comment #6)

Yeah, I actually didn't want to increase it either. Anyway, problem solved;
Luke's swappy suggestion in comment 5 works fine.

Attachment #9048828 - Attachment is obsolete: true

As it stands, the baseline compiler's operand stack (StkVector) has inline
capacity of 8, but is immediately resized to at least 100 elements at the
start of every function compilation by ballast-check logic in
BaseCompiler::emitBody(), hence causing a heap allocation and associated free.
This patch moves ownership of the stack from class BaseCompiler to
wasm::BaselineCompileFunctions(), so that it can be allocated and deallocated
just once per group of functions being compiled.

For one large test case (Tanks), this reduces total heap turnover during
compilation from 225,917,785 bytes in 94,456 blocks to 158,939,993 bytes in
61,752 blocks, a significant reduction in heap use. It also reduces the
compiler's dynamic instruction count by about 0.5%.

OOC, is there little additional benefit to storing 'stk' in CompiledCode, instead of on the stack, so that it could be reused between tasks?

CompiledCode seems to be the output of the compilation (for a single task?)
and parking the stack in there doesn't really feel right. What about putting
it into CompileTask instead? That would make it clearer that it is something
that has compile-thread lifetime.

Flags: needinfo?(luke)

Oops right, I mixed those two up.

Flags: needinfo?(luke)

Yeah, CompileTask works too and perhaps makes more sense; I'd recommend changing the {Ion,Baseline}CompileFunctions() to take a CompileTask& directly instead of adding an extra stk& parameter.

Luke, thanks for the feedback. However ..

On looking at this more, I realised this would mean lifting Stk, which is a
type entirely private to BaseCompiler, into WasmGenerator.h, so that the
vector type could be declared there. This strikes me as an undesirable
breaking of abstraction. Given that we get most of the win from merely moving
the stack into wasm::BaselineCompileFunctions(), I decided to stay with the
existing scheme.

For ZenGarden, without the patch, the stack is allocated in the heap around
111000 times. With the stack in wasm::BaselineCompileFunctions() it is
allocated 2452 times. In the theoretical best case in which one single
thread owns it, it would be allocated just once. So we get 97.8% of the win
with the stack in wasm::BaselineCompileFunctions(). For Tanks, the equivalent
numbers are 33000, 1157 and 96.5%.

As it stands, the baseline compiler's operand stack (StkVector) has inline
capacity of 8, but is immediately resized to at least 100 elements at the
start of every function compilation by ballast-check logic in
BaseCompiler::emitBody(), hence causing a heap allocation and associated free.
This patch moves ownership of the stack from class BaseCompiler to
wasm::BaselineCompileFunctions(), so that it can be allocated and deallocated
just once per group of functions being compiled.

For one large test case (Tanks), this reduces total heap turnover during
compilation from 225,917,785 bytes in 94,456 blocks to 158,939,993 bytes in
61,752 blocks, a significant reduction in heap use. It also reduces the
compiler's dynamic instruction count by about 0.5%.

Pushed by jseward@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f1c641a4fac8 BaseCompiler::stk_ is always heap allocated, no matter what. r=lhansen.

Good point; keeping encapsulation and getting 97.8% of the win is a reasonable compromise :)

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Attachment #9049245 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: