Closed Bug 1340235 Opened 7 years ago Closed 2 years ago

Virtual function calls are 2.15x slower in wasm compared to asm.js

Categories: Core :: JavaScript: WebAssembly, defect, P3

Tracking: RESOLVED FIXED

People: (Reporter: jujjyl, Unassigned)

References: (Depends on 1 open bug)

Attachments: (2 files)

  rec.zip, 155.47 KB, application/x-zip-compressed
  rec2.zip, 154.88 KB, application/zip
Attached file rec.zip
STR:

1. Enable wasm by setting javascript.options.wasm to true
2. Visit https://clbri.com/dump/rec/asm_rec.html 
3. Visit https://clbri.com/dump/rec/wasm_rec.html

Observed:

asm.js runs in Avg. 195.525750 msecs.
wasm runs in Avg. 420.779500 msecs.

Gecko Profiler from asm.js: https://perfht.ml/2kO5gG3
Gecko Profiler from wasm: https://perfht.ml/2kO5ywD

(these profiles don't show anything interestingly different; they just confirm that execution runs fully inside the asm.js/wasm generated code)

Lars mentioned that Raybench has lots of virtual function calls, so this can further explain the perf difference observed in bug 1340106.

Attached test cases for local execution.
Blocks: 1340106
The above was built with Emscripten's -O3.

Building with -O3 -s DISABLE_EXCEPTION_CATCHING=1 -fno-exceptions does not affect the timings, so this is not related to exception handling at least.
In asm.js, the virtual function calls look like this:

>  d6 = d6 + +FUNCTION_TABLE_dii[HEAP32[HEAP32[i3 >> 2] >> 2] & 3](i3, 0);

whereas in wasm, they look like

>  (call_indirect $0
>   (get_local $var$0)
>    (i32.const 0)
>    (i32.add
>      (i32.and
>        (i32.load
>          (i32.load
>            (get_local $var$0)
>          )
>        )
>        (i32.const 3)
>      )
>      (i32.const 8)
>    )
>  )

There is an extra "+ 8" there compared to asm.js; not sure what that is about, though one extra addition should not explain a 2x slowdown. Alon, what's the pattern here? Does that look like the expected Binaryen codegen for a virtual member function call?
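For reference, the kind of C++ source that produces such a call (a hypothetical reconstruction, not the benchmark's actual code) is a virtual method of signature double(int); in asm.js signature naming, "dii" is a double return with two int arguments (the object pointer and the explicit argument):

  // Hypothetical source shape for the call above. The "& 3" masks the loaded
  // function pointer into a 4-entry per-signature table.
  struct Sampler {
      virtual double sample(int axis) = 0;
  };

  double accumulate(Sampler* s, double acc) {
      // Lowers to roughly:
      //   acc + FUNCTION_TABLE_dii[HEAP32[HEAP32[s >> 2] >> 2] & 3](s, 0)
      // i.e. load vtable pointer, load slot 0, mask, indirect call.
      return acc + s->sample(0);
  }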
Flags: needinfo?(azakai)
I believe this is a known issue. asm.js function tables have a known type, while wasm has a single table and so the type must be checked at each call. cc'ing sunfish who explained this to me.

There is also that +8 there as you noticed, which as you said should be a minor issue here. It is caused by us folding all the asm.js tables into one wasm table. There is a plan to remove that, as well as the mask: just build with EMULATED_FUNCTION_POINTERS=1 (and make sure you are in wasm-only mode, with no asm.js fallback). Then the output could look like this:

      (call_indirect $FUNCSIG$iii
        (get_local $1)
        (i32.const 10)
        (get_local $3)
      )

I'd be curious to hear how much that helps, but I suspect little.

(Eventually though we do want to turn that on by default in wasm-only mode, as any small speedup is nice, and it does shrink code size. Only thing stopping that is finding time to do more testing.)
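To make the signature-check cost described above concrete, here is a simplified per-call model (a sketch only, with illustrative names; this is not SpiderMonkey's actual implementation):

  #include <cstdint>
  #include <cstdlib>

  // Sketch: wasm's single heterogeneous table forces a bounds check and a
  // signature check on every call_indirect, while an asm.js per-signature
  // table needs neither (the index is masked and every entry is known to
  // have the right type).
  struct TableEntry {
      uint32_t sigId;  // runtime signature id of the callee
      void*    code;   // entry point
  };

  [[noreturn]] static void trap() { std::abort(); }

  static double callIndirectDii(const TableEntry* table, uint32_t length,
                                uint32_t index, uint32_t expectedSig,
                                int self, int arg) {
      if (index >= length) trap();         // wasm bounds check; asm.js: index & mask
      const TableEntry& e = table[index];
      if (e.sigId != expectedSig) trap();  // wasm signature check; asm.js: none
      return reinterpret_cast<double (*)(int, int)>(e.code)(self, arg);
  }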
Flags: needinfo?(azakai)
Oh, that's interesting. With -s EMULATED_FUNCTION_POINTERS=1, the code does clean up to

>  (call_indirect $0
>    (get_local $var$0)
>    (i32.const 0)
>    (i32.load
>      (i32.load
>        (get_local $var$0)
>      )
>    )
>  )

and runtime changes from ~420 msecs to ~396 msecs, a 5.7% reduction in execution time, though still 2.03x slower than asm.js.
So the signature-checking overhead shouldn't be that much by itself; just an extra mov-imm in the caller and cmp-imm;jne in the callee.  There's also a length check (rather than the masking asm.js does) that currently loads the length from memory every time (this load could be hoisted), but that's probably not worth that much either.

Rather, I'm guessing that the table here is either imported, exported or has one of its elements initialized to be an import.  If any of these is the case (which I'm guessing is true here), then a call_indirect can change instance and thus we have to take a slower call path.  Is there a way to get Emscripten to not import/export/initialize-with-import its table?
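The slower call path being described looks roughly like this (a sketch with illustrative types and names, not the engine's actual code):

  // Illustrative only: when the table may contain functions from another
  // instance, each element pairs the code pointer with the callee's
  // per-instance data ("tls"), and the caller must swap it around the call.
  struct Instance {};

  struct Element {
      Instance* tls;   // owning instance of the callee
      void*     code;  // entry point
  };

  Instance* currentTls = nullptr;  // stands in for the dedicated tls register

  double callPossiblyCrossInstance(const Element& e, int self, int arg) {
      Instance* saved = currentTls;  // save the caller's tls
      currentTls = e.tls;            // install the callee's instance
      double result = reinterpret_cast<double (*)(int, int)>(e.code)(self, arg);
      currentTls = saved;            // restore on return
      return result;
  }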
Oh, that's the first I've heard of those issues; perhaps I misunderstood sunfish earlier. Are those expected to be the typical performance characteristics going forward here? If so we should investigate what to do; right now we do import the table (to simplify things) and do put imports in the table (to avoid creating wrappers etc.).
I don't know if other engines have an optimization that makes the non-external-functions-containing-table case faster than the external-functions-containing-table case.

On my machine, asm.js in FF is fastest (250ms), Chrome's asm.js and wasm are about the same (460ms / 430ms resp.), and FF's wasm is slowest (540ms).  :(

But this made me remember probably the biggest slowdown (and what makes FF wasm slower than Chrome wasm): we're reloading table-base and table-length from memory each time; I know Chrome is currently patching them in (and re-patching on resize).  So we have an extra dependent load.  For bug 1338217 we already want to be able to hoist these TLS values, so that they are only loaded once, for *memory* base/length, so we could use the same mechanism (just different TlsData offsets) here, probably for a big boost.

Lastly, Alon: one thing wasm can do with its heterogeneous design that asm.js can't--which could perhaps compensate for the cost of the sig check--is store the vtable entries inline in the table, so that the object's vtbl-ptr is just an index into the Table and all vfuns are a constant offset.  I know this is non-trivial, but it could remove an entire dependent load in every engine...
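The two addressing schemes, sketched over a flat heap (hypothetical helper functions; slot widths illustrative):

  #include <cstdint>

  // Today: the object's first word points at a vtable in linear memory whose
  // slots hold table indices, so each virtual call does two dependent loads
  // before the call_indirect.
  uint32_t vtableInMemory(const uint32_t* heap, uint32_t obj, uint32_t slot) {
      uint32_t vtbl = heap[obj / 4];   // load the object's vtable pointer
      return heap[vtbl / 4 + slot];    // load the function's table index
  }

  // Proposed: vtable rows live inline in the Table, the object's first word
  // is itself a Table index, and every virtual sits at a constant offset.
  uint32_t vtableInTable(const uint32_t* heap, uint32_t obj, uint32_t slot) {
      return heap[obj / 4] + slot;     // one dependent load fewer per call
  }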
Hmm, I don't know enough to tell. One issue is that it seems like it would require clang frontend changes; I'm not sure how feasible that is, technically and otherwise. Another issue is I don't know enough about the C++ ABI - it seems like putting the vtable in the table means that the functions can only be called, but not read. I.e., right now you can read from a vtable and compare that to another pointer, but vtable-in-table would disallow that. Oddly enough an old emscripten feature (inheriting from webidl-wrapped classes from JS) did depend on reading the vtable elements, but perhaps it was in violation of C++ ABI assumptions all along?
Looking at some LLVM IR, another issue is vtable nesting (there is probably a better name for it that I don't know):

@_ZTV6Parent = internal unnamed_addr constant [4 x i8*] [i8* null, i8* bitcast ({ i8*, i8*, i8* }* @_ZTI6Parent to i8*), i8* bitcast (i32 (%struct.Parent*)* @_ZN6Parent6implmeEv to i8*), i8* bitcast (i32 (%struct.Parent*)* @_ZN6Parent5getitEv to i8*)], align 4

@_ZTI6Parent = internal constant { i8*, i8*, i8* } { i8* bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTVN10__cxxabiv120__si_class_type_infoE, i32 0, i32 2) to i8*), i8* getelementptr inbounds ([8 x i8], [8 x i8]* @_ZTS6Parent, i32 0, i32 0), i8* bitcast ({ i8*, i8* }* @_ZTI4Pure to i8*) }

@_ZTS6Parent = internal constant [8 x i8] c"6Parent\00"

The main vtable is the first; it contains a pointer to the second, so it doesn't just contain functions, it also contains data pointers. The second (the RTTI structure) in turn contains pointers to non-functions as well, such as the third (the type-name string).

So anyhow, this seems like a non-trivial change to how clang works, but it could be an interesting investigation.
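For concreteness, a reconstructed source shape that produces IR of that form (assumed, not necessarily the exact test case):

  // Reconstruction: Pure has no bases (plain class_type_info, two words), and
  // Parent derives from it (si_class_type_info, three words, the last
  // pointing at _ZTI4Pure).
  struct Pure {
      virtual int implme() = 0;
  };

  struct Parent : Pure {
      int implme() override { return 1; }  // _ZN6Parent6implmeEv
      virtual int getit() { return 2; }    // _ZN6Parent5getitEv
  };
  // Parent's vtable is then [offset-to-top, &_ZTI6Parent, &implme, &getit]:
  // the RTTI pointer in slot 1 is what makes the table non-function-only.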
Component: JavaScript Engine → JavaScript Engine: JIT
(In reply to Alon Zakai (:azakai) from comment #9)
Yeah, I didn't imagine it'd be easy.  For the nesting you mention, is the pointer-in-vtbl used on any hot path?  B/c, if it's only for some rare case (like EH), one could have a special "get data" function in a vtable that you'd call to get a pointer into linear-memory with all the data.
Well, again this goes beyond my knowledge of C++ ABI details, but I think that information is used for e.g. dynamic_cast, which could be important for speed.
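For example (reusing the Pure/Parent sketch above), dynamic_cast reads the type_info pointer stored in the vtable, so that slot would have to remain readable as data:

  // Under the Itanium C++ ABI, dynamic_cast loads the RTTI pointer from the
  // object's vtable and walks the type_info graph; a vtable-in-table scheme
  // would need some way to keep that data reachable.
  int probe(Pure* p) {
      if (Parent* q = dynamic_cast<Parent*>(p))  // reads p's vtable RTTI slot
          return q->getit();
      return -1;
  }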
One other optimization idea here: instead of having two-word elements, wasm tables could store just the callee.  In the common case that a table contains only a single instance, the callee would be the target function.  In the dynamic case that a table contains functions from multiple instances, one stub per (function,instance) could be generated which would do: `mov tls, TlsReg`.  Thus, there would be no penalty for importing/exporting tables as long as the table ended up only containing one instance.  Incidentally, this change was probably going to be necessary when tables are shared to allow atomic word-sized updates of table elements.
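A sketch of that layout change (names hypothetical; the stub is shown as C++ for illustration, though in practice it would be a couple of generated instructions):

  // One-word elements: just a code pointer, so updates are word-sized and
  // atomic, and addressing fits an LEA with scale 8.
  using Element = void (*)();

  // For a (function, instance) pair from another instance, the element
  // points at a generated stub that installs the foreign tls and calls the
  // real target; same-instance entries point straight at the function.
  struct Instance {};
  Instance* tlsReg = nullptr;   // stands in for the dedicated tls register
  Instance  foreignInstance;    // baked into the stub when it is generated
  void realCallee() {}          // the actual target function

  void crossInstanceStub() {    // conceptually: mov tls, TlsReg; jmp callee
      tlsReg = &foreignInstance;
      realCallee();
  }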
Priority: -- → P3
And from bug 1428453, another tiny optimization: the null check could be replaced by using SIGSEGV. The challenge is that pc is null, but *sp can be sniffed and used instead to look up the trap.
Component: JavaScript Engine: JIT → JavaScript: WebAssembly
Depends on: 1639153
Priority: P3 → P2

Summarizing the issues from above (without knowing whether they have been resolved or not; edits coming):

  • emscripten compile with EMULATED_FUNCTION_POINTERS=1 to get better code
  • table bounds check (bug 1625891)
  • table base and table length reloaded from memory, not commoned by Ion; dependent load issue
    • not an issue because these (re)loads are invariably followed by a call, and it is faster to reload than to save+restore?
    • table base and length could be changed by the call?
    • far from clear that these loads are dependent a lot of the time, the tls is more or less assumed to be in a register already, would be good to drill down on this
  • call_indirect null check after fetching code pointer (bug 1709578)
  • call_indirect signature check
  • call_indirect pessimizes by assuming an instance / context change (bug 1639153)

I'm taking this for now in order to perform a new assessment and file necessary dependent bugs.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Summary: virtual function calls are 2.15x slower in wasm compared to asm.js → [meta] Virtual function calls are 2.15x slower in wasm compared to asm.js
Depends on: 1625891
See Also: → 1649109
Depends on: 1709578

Typical indirect call to unknown index now, lightly edited:

0000002B  44 8b e0                  mov %eax, %r12d        ; r12 = index arg
00000039  41 ba 23 00 00 00         mov $0x23, %r10d       ; r10 = signature
0000003F  49 8b 46 60               movq 0x60(%r14), %rax  ; rax = table bound
00000043  44 3b e0                  cmp %eax, %r12d        ; bounds check
00000046  0f 82 02 00 00 00         jb 0x000000000000004E
0000004C  0f 0b                     ud2
0000004E  49 8b 46 68               movq 0x68(%r14), %rax  ; rax = table base
00000052  41 c1 e4 04               shl $0x04, %r12d       ; r12 = scaled index
00000056  49 03 c4                  add %r12, %rax         ; rax = base+index
00000059  4c 89 74 24 08            movq %r14, 0x08(%rsp)  ; save tls
0000005E  4c 8b 70 08               movq 0x08(%rax), %r14  ; r14 = new tls
00000062  4c 89 34 24               movq %r14, (%rsp)      ; save new tls
00000066  45 85 f6                  test %r14d, %r14d      ; null check against the tls
00000069  0f 85 02 00 00 00         jnz 0x0000000000000071
0000006F  0f 0b                     ud2
00000071  4d 8b 3e                  movq (%r14), %r15      ; load destination
00000074  4d 8b 66 20               movq 0x20(%r14), %r12  ; chain 
00000078  49 8b 5e 18               movq 0x18(%r14), %rbx  ;   up
0000007C  49 89 9c 24 98 00 00 00   movq %rbx, 0x98(%r12)  ;     activation
00000084  48 8b 00                  movq (%rax), %rax      ; load code pointer
00000087  ff d0                     call %rax              ; call
00000089  4c 8b 74 24 08            movq 0x08(%rsp), %r14  ; load old tls
0000008E  4d 8b 3e                  movq (%r14), %r15      ;   heap ptr
00000091  4d 8b 56 20               movq 0x20(%r14), %r10  ; unchain
00000095  4d 8b 66 18               movq 0x18(%r14), %r12  ;   the
00000099  4d 89 a2 98 00 00 00      movq %r12, 0x98(%r10)  ;     activation

The work in bug 1639153 will deal with the elaborate context switch before and after the call, so that it is only performed if the call is actually cross-instance.

The computation of base+index might be able to use LEA, though this is a micro-issue at best.

The bounds check can be commoned using the normal BCE machinery, which will be useful if the function is called multiple times.

The null check can be elided, as can the signature check, if they are dominated by similar checks that have succeeded and the table has not been modified; but since the operation is a call, we cannot assume the table is unmodified without a lot of context, which we don't have.

In addition to hoisting table bounds checks, I think the load of table-base and table-length can be hoisted (more aggressively than the bounds check) using MWasmLoadTls (just like the memory-base and memory-length).

Note: LEA isn't possible on x64 atm b/c table elements are 16 bytes and the max scale factor is 8. However, iiuc, with bug 1639153, there isn't a need to store a tls* in the table, and thus LEA could be used. While a micro-optimization, this does save, in aggregate, 3 instructions.

Additionally, the null check could be removed unconditionally by having a (non-null) thunk used to represent null entries that safely traps when called. This isn't currently possible b/c the activation swap needs a valid callee tls, so it's also gated on bug 1639153.

Thus, with bug 1639153 and just the above optimizations, I think that brings the call_indirect sequence down to: (maybe) bounds-check, LEA table-base+index*8, call *ptr.
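Written out as straight-line C++ pseudocode (layout illustrative), that fast path would be:

  #include <cstdint>
  #include <cstdlib>

  [[noreturn]] static void trap() { std::abort(); }

  // The hoped-for sequence: a (possibly hoisted) bounds check, one scaled
  // address computation, one load, then the call. Null entries point at a
  // trapping thunk, so no explicit null check remains.
  static void callIndirectFast(void* const* tableBase, uint32_t tableLength,
                               uint32_t index) {
      if (index >= tableLength) trap();      // maybe commoned by BCE
      void* code = tableBase[index];         // LEA tableBase + index*8, load
      reinterpret_cast<void (*)()>(code)();  // call *ptr
  }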

(In reply to Luke Wagner [:luke] from comment #16)

> In addition to hoisting table bounds checks, I think the load of table-base and table-length can be hoisted (more aggressively than the bounds check) using MWasmLoadTls (just like the memory-base and memory-length).

It can, but the question is whether this matters or is even desirable. We're doing an indirect call, so if these values have been hoisted they're going to be spilled across the call and then restored again. Of course, if the tls has to be loaded first then we have some dependent loads that can be avoided, but I think we need more data on how often that is a problem.

> Note: LEA isn't possible on x64 atm b/c table elements are 16 bytes and the max scale factor is 8.

TIL...

> However, iiuc, with bug 1639153, there isn't a need to store a tls* in the table, and thus LEA could be used. While a micro-optimization, this does save, in aggregate, 3 instructions.
>
> Additionally, the null check could be removed unconditionally by having a (non-null) thunk used to represent null entries that safely traps when called. This isn't currently possible b/c the activation swap needs a valid callee tls, so it's also gated on bug 1639153.

Yeah, see discussion on comment 14 and also on bug 1709578. There are some open questions about how much that actually buys us but I think it's worth investigating in depth, with consideration of the future table representation.

> Thus, with bug 1639153 and just the above optimizations, I think that brings the call_indirect sequence down to: (maybe) bounds-check, LEA table-base+index*8, call *ptr.

And that would be lovely :-)

(In reply to Lars T Hansen [:lth] from comment #17)

> (In reply to Luke Wagner [:luke] from comment #16)
>
> > In addition to hoisting table bounds checks, I think the load of table-base and table-length can be hoisted (more aggressively than the bounds check) using MWasmLoadTls (just like the memory-base and memory-length).
>
> It can, but the question is whether this matters or is even desirable. We're doing an indirect call, so if these values have been hoisted they're going to be spilled across the call and then restored again. Of course, if the tls has to be loaded first then we have some dependent loads that can be avoided, but I think we need more data on how often that is a problem.

Ah, yes, good point.

Depends on: 1711412
Depends on: 1742930

Meta status has been transferred to bug 1742930.

Assignee: lhansen → nobody
Status: ASSIGNED → NEW
No longer depends on: 1625891, 1639153, 1709578, 1711412, 1742930
Summary: [meta] Virtual function calls are 2.15x slower in wasm compared to asm.js → Virtual function calls are 2.15x slower in wasm compared to asm.js
Depends on: 1742930
No longer blocks: 1742832
Summary: Virtual function calls are 2.15x slower in wasm compared to asm.js → [meta] Virtual function calls are 2.15x slower in wasm compared to asm.js
Keywords: meta
Summary: [meta] Virtual function calls are 2.15x slower in wasm compared to asm.js → Virtual function calls are 2.15x slower in wasm compared to asm.js

Status update. With Dmitry's patch + null check optimization (bug 1709578) and better code generation (bug 1743586), we're down to 380ms/iter for wasm and 287ms/iter for asm.js, i.e., 1.32x slower. For this, the wasm had to be recompiled since the existing wasm was pre-MVP, but the asm.js code is the original.

Looking at the wasm, it definitely exports the table so we're paying for an indirect stub for all the table entries; the dynamically-same-instance check will then hit and shunt us off on a fairly fast path. Once we do stage 2 of the indirect stub optimization we'll avoid the stubs, I think, since the table is not actually shared between multiple instances.

Attached file rec2.zip

Recompiled wasm + added a Makefile and a slight variation on the cpp to use with the JS shell.

Priority: P2 → P3
Blocks: 1742930
No longer blocks: 1340106
Severity: normal → --
No longer depends on: 1742930
See Also: → 1340106

What was the meaning of that edit? It is completely wrong.

No longer blocks: 1742930
Depends on: 1742930

I think we're done here. According to bug 1625891 comment 7, bounds checks account for as much as 20% of the remaining running time of the wasm program, and though we could get rid of most of the bounds checks in this particular case, that optimization would not pay off in realistic programs - the overhead is an artifact of the microbenchmark. To the extent there's more to discuss, let's take the discussion to the cited bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED