Closed Bug 1708381 Opened 1 year ago Closed 1 year ago

Tremendous slowdown for C64 emulator with emscripten 2.0.17 due to more aggressive inlining

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

x86_64
All
defect

Tracking

()

RESOLVED DUPLICATE of bug 1712078

People

(Reporter: lth, Assigned: jseward)

References

Details

Attachments

(6 files)

Original issue here: https://github.com/emscripten-core/emscripten/issues/13899

STR (for now - the content at "Part 2" may change to mask this bug at some point in the nearish future):

Part 1

Part 2

  • repeat part 1 (you can reuse the downloaded file) but go instead to http://vc64web.github.io/ for the emulator
  • observe that the frame rate sucks (i get less than 20fps on my very beefy desktop system)

This was originally reported on an ancient mac but repros on a current Linux system too.

This is almost certainly a problem with Ion. Disabling baselinejit does not make a difference; disabling optimizingjit brings the FPS up to at least 50.

This does not appear to be a problem on the M1 MacBook Pro, suggesting that it is a problem related to code generation or register allocation on x64.

It is also a problem on Windows 10 x64, which is not totally surprising, though Windows has a different ABI so there are differences to how register allocation is done.

Severity: -- → S3
OS: Unspecified → All
Priority: -- → P2

Running Ion-only: Profile with "bad" code shows massive jank and very long requestAnimationFame callback running times. Most of the time (72%) is spent in a single wasm function (labeled "612"). Profile with good code has none of those problems, it's mostly idle.

Running baseline-only, the profile for the "bad" code resembles the Ion profile for the "good" code, there's no jank, animation callbacks are 11-14ms.

Attached file vC64.wat.gz

Disassembly of the bad code (vC64.wasm), 23MB. Function 612 is the mother of all br_tables.

Hard to say how to approach this. We no longer have an alternative register allocator, and simply turning off the other optimization passes will only tell us something if we can assume the allocator is correct.

The tremendous slowdown suggests that something goes very wrong - that a register or local is being clobbered, leading to a long-running loop for example. It's not just an optimization that fails, or something locally suboptimal. (In principle, what goes wrong could have gone wrong elsewhere, and set a status variable to a bogus value, say, but since fiddling with the inlining limit makes the bug come and go, it's more likely that the problem is within the one function.)

Attached file hotblocks-baseline.txt

Hot blocks profile on x86_64, once the emulator is up and running and the big block cursor is flashing. For the Baseline compiler, when running the build that is slow on Ion.

Attached file hotblocks-ion-fast.txt

The same, but for Ion, when running the non-problematic ("fast") build.

Attached file hotblocks-ion-slow.txt

The same, but for Ion, when running the problematic ("slow") build.

(In reply to Lars T Hansen [:lth] from comment #5)

Disassembly of the bad code (vC64.wasm), 23MB. Function 612 is the mother of all br_tables.

Given the very long, almost featureless sequence of spill-slot operations shown in
comment #8, I think it plausible that the problem is somehow related to this br_table
or associated control flow.

I notice that there are 233 block opening marks that precede the br_table itself.
They appear to be nested, though:

        if  ;; label = @7
          local.get 2
          i32.const 10808
          i32.add
          local.set 4
          block  ;; label = @8
            block  ;; label = @9
              block  ;; label = @10
                block  ;; label = @11
                  block  ;; label = @12
                  ...
                  block  ;; label = @238
                    block  ;; label = @239
                      block  ;; label = @240
                        local.get 3
                        i32.const 1
                        i32.sub
                        br_table 234 (;@6;) 1 [...]

That there are a large number of branch targets is no surprise -- it's an interpreter
after all. But the fact that they appear to be nested is a bit strange. Are they
really nested, or is this just some artefact of the wasm bytecode format?

Is there some way to turn this fn 612 into something we can compile in the shell?

Here is the complete long load-add-store sequence. It is a single basic block
between 12000 and 14000 insns long (I didn't count exactly) and it constitutes
about 45% of the dynamic insn count once the C64 simulator is up and idling.

It is 42384 code bytes long. AFAIK all recent Intels have a 32KB I1 cache, so
has potential to cause a lot of icache misses, unless the prefetchers can
avoid that, which might contribute to the factor-of-8 slowdown.

The sequence allocates an 8528 byte block of stack, stores %edi at offset
8524, and then does

block[offset which is < 8524] = block[8524] + constant of around 30000
for offset starting at 8520 and going down to 52 at the end

Not sure what to make of it. Is the entry (begin of SB rank 0) a function
preamble, or not?

It looks as if all of the loads from the stack could have been avoided if %edi
hadn't been spilled to block[8524] right at the start.

Assigning to Mr Register Allocator.

Assignee: nobody → jseward
Status: NEW → ASSIGNED

This isn't a regalloc problem, fundamentally -- it's a MIR-transformation
problem. The huge block in comment 10 is visible in output of OptimizeMIR.

What I think happens is as follows. The offending function 612 is the core of
the C64's instruction emulator, and probably has the structure:

void fn612(char* wasmHeapBase, uint nextPC) {
   // At the start of linear memory is a struct that contains the machine
   // state, and is about 30KB in size
   while (keepGoing) {
      nextOpcode = .. *nextPC++ .. // whatever, not important
      switch (nextOpcode) {
         case 0: { update state for opcode 0; break }
         case 1: { update state for opcode 1; break }
         ..
         case N: { update state for opcode N; break } for N = several hundred
      }
   }
}

Each "update state for opcode X" pokes some bit of the 30KB state, mostly but
not completely at offsets which are different for each X. Hence each arm of
the switch contains a few MIR expressions of the form

  wasmHeapBase + const   for const in 0 .. 30K

The MIR optimiser sees them, notes they are all invariant in the loop, and
lifts them out to the top level, and removes duplicates (GVNs them). This
still leaves about 2100 different offsets from wasmHeapBase. So the loop is
now preceded by basically 2100 copies of this

  6:constant 0x74c8
  7:add (0:wasmparameter) (6:constant 0x74c8)

(where I interpret 0:wasmparameter to be the heap base) for different constant
values.

The effect on performance is catastrophic because

(1) the individual switch arms are typically 10-40 MIR insns long, so if the
loop interates fewer than several hundred times, the cost of this setup
block will dominate, and

(2) clearly the RA can't allocate all of these lifted-out expressions to
registers (not to mention, each one is individually pretty cold, so they
wouldn't qualify anyway), so they are all spilled. That means that, in
each switch arm, a linear memory access becomes a dependent load pair -- a
reload to get the lifted-out constant, followed by a load the get the data.
If the lifting hadn't occurred then the access would simply be
*(wasmHeapBase + imm_offset).

This theory is consistent with a change of inlining in which the switch
statement was previously in one function ("interpretOneInstruction", etc) and
that was called from a loop by another function.

If the above is correct, it shouldn't be difficult to hand-write a small wasm
input that can reproduce the problem.

Jan, it seems plausible to me that this has happened before with Ion.
I mean: apparently-reasonable GVN and LICM choices lead to an
unreasonable result. Do you know if that's the case, and if so do we
have any knowledge/background/insight into how to avoid this?

Flags: needinfo?(jdemooij)

Great analysis. It reminds me of bug 1641599, we saw the same issue there. For JS code we have a limit on the script size for Warp/Ion compilation, so these issues are less likely to affect us there.

We probably have to add some heuristic to turn off (or limit) LICM for large loops or loops with many branches?

Flags: needinfo?(jdemooij)

The branchiness seems key. Previously, the cost of a trip through the function assuming k iterations was k*m where the m was some average cost of an arm. After the optimization the cost is s*b+k*(m-s) where b is the number of branches and s is the cost of hoisted code per branch. Assuming k=10, b=256, m=10 and s=4, we go from 10*10=100 to 4*256+10*(10-4)=1060. This seems hard to incorporate (indeed, how to compute b?) and in this case we're looking to avoid disaster more than being optimal.

For wasm in particular, I would almost be inclined to argue that LICM out of any conditional block in a loop is not something we want to do, because the front end would be presumed to have moved what should be moved, and our MIR level is close enough to source to be a proxy for it.

I verified that this problem occurs also on arm64-Ion, kind-of in contradiction
to comment 2.


I see the attraction of threshold-based fixes per comments 14 and 15, but I'm
concerned it could lead to performance cliffs, the avoidance of which I had
understood to be an informal goal of wasm. Here's a different kind of
proposal.


A possible solution -- assuming the analysis is correct -- would be to disallow
LICM and GVN for expressions of the form wasmHeapBase + constant, justified
thusly:

  • We can compute such expressions "for free" as part of loads/stores on all
    targets if constant is smallish (2^12, on arm64?), and on arm64 even if we
    have to use a second insn to synthesise a larger constant into a register.

  • So there's no point in LICMing them, and allocating a second register to hold
    it (in the best case) or spilling (as here)

  • For GVN, we need to avoid the danger that GVNing wasmHeapBase + constant in
    two separate arms would lead to the value being hoisted to a block inside the
    loop but prior to the switch (viz, the immediate dominator of each arm).
    (But will GVN do that? It would generate partial redundancies unless the
    expression was used in all arms). Even just within each arm, GVNing of
    these expressions seems marginal to me.

This leads into dangerous territory, though. The above proposal might fix the
immediate problem. But it only works because the simulator state structure
being accessed is parked at the start of the linear memory -- I'd guess it's a
C++ global variable and that's what emscripten does with global variables.

Consider the situation if instead the structure lived in the emscripten C++
heap [in linear memory]. Then each access would be of the form

  *(wasmHeapBase + structBaseOffset + constant)

where structBaseOffset would also be a parameter to fn612.

If that sum is presented in MIR as

  wasmHeapBase + (structBaseOffset + constant)

then the above rules would not help: all of the 2100 different
structBaseOffset + constant terms would get lifted out, and we'd be just as
badly off as at present. If the sum was presented as

  (wasmHeapBase + structBaseOffset) + constant

then all the wasmHeapBase + structBaseOffset get lifted out as a single value
(good, call it structAddr), but that value is also obviously loop invariant,
and so all the structAddr + constant exprs also get lifted out, so we're
still no further forward.

Maybe the LICM-inhibition rule needs to apply to expressions of the form
E + const where E can contain wasmHeapBase plus any other stuff we
want to add to it, but not any constants. Then duplicated Es can get
LICM'd/GVN'd, no problem, and we just need ensure that the addition of the
constant is the last thing that happens, and that it can't be LICM'd/GVN'd.

I'm at least moderately sure this is all bog-standard
generate-good-array-and-structure-access-code C/C++/Fortran stuff, and we just
need to look in the right textbook to find the "standard" solution.

(In reply to Julian Seward [:jseward] from comment #16)

I verified that this problem occurs also on arm64-Ion, kind-of in contradiction
to comment 2.

Presumably a bad experiment on my part.

A possible solution -- assuming the analysis is correct -- would be to disallow
LICM and GVN for expressions of the form wasmHeapBase + constant, justified
thusly:

  • We can compute such expressions "for free" as part of loads/stores on all
    targets if constant is smallish (2^12, on arm64?), and on arm64 even if we
    have to use a second insn to synthesise a larger constant into a register.

Making this code good on arm64 is bug 1709863. The offset size is complicated, it is 9-bit signed for most practical purposes. For almost all other constant offsets we care about, a single move immediate will load it into a temp and will not increase register pressure.

  • So there's no point in LICMing them, and allocating a second register to hold
    it (in the best case) or spilling (as here)

I agree.

More followups later, I think/agree this is a bit tricky to get right with the kinds of heuristics you suggest.

This shows the problem. It appears to be a LICM problem. With the loop
removed, GVN does not common up the duplicate expressions
(i32.mul (local.get $$i) (i32.const 101)) in each arm, which is good.
But with the loop enabled all of them are lifted to the top level
(and not commoned up).

(Later: I meant: run with IONFLAGS=licm,gvn,dump-mir-expr)

And if the constants are different? Are these still hoisted?

Yes, they are still hoisted (with constants 101 .. 105 in the 5 arms).

Muchnick p402-403 Sec 13.2 discusses this problem. His proposed
fix is to only allow hoisting from a block in a loop if that block
dominates all exit blocks of the loop. Maybe that's a proxy for
"is executed every iteration of the loop", which sounds like what
we want. Requires further contemplation.

(In reply to Julian Seward [:jseward] from comment #20)

[..] only allow hoisting from a block in a loop if that block
dominates all exit blocks of the loop.

Ignore that; I misunderstood the Muchnick text. Muchnick's
discussion is purely about correctness, not profitability.

Depends on: 1712078

Now that we have a clearer idea of what went wrong here, I'm delegating the
required LICM fixes to child bug 1712078. When that's fixed we can revisit
this bug and either close it or hang further work off it, as necessary.

With the patch at https://bugzilla.mozilla.org/show_bug.cgi?id=1712078#c9
I get the following results:

Limited inlining, with patch 56 fps
Max inlining, with patch 56 fps
Max inlining, without patch 12 fps

In other words .. the patch fixes this problem.

This was looked into in some detail, and fixed by, bug 1712078.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1712078
You need to log in before you can comment on or make changes to this bug.