OdinMonkey: speed up asmjs=>asmjs FFI calls

RESOLVED WONTFIX

Status

()

Core
JavaScript Engine: JIT
P5
normal
RESOLVED WONTFIX
3 years ago
9 months ago

People

(Reporter: rfkelly, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 2 obsolete attachments)

(Reporter)

Description

3 years ago
Created attachment 8389091 [details]
asmffibench.js

Having an asmjs module call some other asmjs-compiled function as an FFI is currently very slow, as the call must traverse two generic slow paths:  it exits the calling module via the generic GenerateFFIInterpreterExit, and it enters the callee module via the generic CallAsmJS wrapper.

The attached microbenchmark makes an asmjs=>asmjs call in a tight loop, and the difference in performance is pretty stark:

    $ js ./asmffibench.js
    Best: 6443 iterations per second

    $ js --no-asmjs ./asmffibench.js 
    Best: 1000000 iterations per second

Some of this difference will be due to inlining of the function call in the non-asmjs case, but it would be nice to get things at least as fast as non-inlining Ion can do:

    $ js --no-asmjs --ion-inlining=off ./asmffibench.js 
    Best: 94339 iterations per second

This may have some overlap with Bug 962641 which looks to optimize Ion=>asmjs calls, but seems interesting enough in isolation to deserve an independent bug.
(Reporter)

Comment 1

3 years ago
Created attachment 8389099 [details] [diff] [review]
faster-asmjs-ffi.patch

Attaching my preliminary exploration of this area.

This patch adds a third exit path to each FFI invokation, designed specifically for asmjs=>asmjs calls.  It's not much, but it makes a difference already:

    $ js ./asmffibench.js 
    Best: 10593 iterations per second


The only clever thing the new exit does is avoid boxing the arguments into Values, instead passing them as an array of ints/doubles.  All the heavy lifting of the call is done by a helper function "InvokeDirectAsmJS", a lightly-specialised clone of CallAsmJS.  My short-term plan is to chip away at this function, moving more and more of it into the generated assembly, until it can be removed entirely.

Luke, is "InvokeDirectAsmJS" an accurate summary of what a compiled asmjs=>asmjs call will have to do in the general case?  Is there anything in there that is obviously unnecessary or suboptimal at this early stage?  In email you suggested that much of this call overhead might be unnecessary, so I want to sanity-check my understanding of how the current call-path works.

Some high-level thoughts/notes:

  * it seems better to check for asmjs=>asmjs FFI calls at module link time and make the necessary adjustments then, rather than copying the approach of the "TryEnableIon" check after each invokation.  But AFAICT the linker doesn't have access to the call signature of the FFI exit, so it can't validate that the FFI's signature matches the call.

  * naively, it seems profitable to overwrite the interpreter exit trampoline code with an asmjs=>asmjs-specific one if we find such during linking, rather than generating a third exit path for every single FFI.  I've no idea if that's actually sensible or doable in pratice though.
Attachment #8389099 - Flags: feedback?(luke)

Comment 2

3 years ago
Comment on attachment 8389099 [details] [diff] [review]
faster-asmjs-ffi.patch

Review of attachment 8389099 [details] [diff] [review]:
-----------------------------------------------------------------

Great way to start!

I think the target state is that your generated AsmJS stub just pushes an AsmJSActivation on the stack (in generated code; feel free to ignore the SPS part: it will soon be removed) and possibly copies stack args to the top of the stack if there are any.  In the future (after asm.js stack walking and changing Ion to use the same mechanism), I'd like to remove AsmJSActivation and just use JitActivations; that'd avoid the need to push any AsmJSActivation here and make the AsmJS-stub case basically almost trivial (maybe swapping out heap and global registers).

I wouldn't worry too much about the interp-exit stub size: for all of Citadel it is 18K.   Ion exits cost 67K which is still only .2% of total code size.  (Recall that exits are per-(name,signature), not per-callsite.)  On the other hand, if we were to generate Ion/asm stubs at runtime, instead of at compile-time, in addition to saving size, it'd allow us to specialize better to the callee w/o resorting to link-time patching.  For example, Ion could avoid dynamically loading the callee out of the ExitDatum as it does now.  Similarly, it'd make it easy to bake in the callee in the AsmJS stub.

Performing specialization at link-time seems nice although I don't think it's really necessary for good performance.  For the same reason as above, I think it'd be fine to save the Signature in AsmJSModule::Exit if it simplifies things.

::: js/src/jit/AsmJS.cpp
@@ +6325,5 @@
> +    AsmJSActivation activation(cx, calleeModule, calleeExportIndex);
> +
> +    // IMHO we *shouldn't* need a JitActivation since there's one from the existing asmjs call.
> +    // But putting this in fixes a segfault.  Need to understand what's going on here..
> +    JitActivation jitActivation(cx, /* firstFrameIsConstructing = */ false, /* active */ false);

I think what's happening here is that the JitActivation needs to be the newest activation on the stack so, if you don't push this here, then we end up with AsmJSActivation -> JitActivation -> AsmJSActivation.  For this patch, I think you could probably do a little hack and insert the AsmJSActivation between the caller's AsmJSActivation and JitActivation.  (In the future (after adding stackwalking support to asm.js and then switching Ion to use this), I'd like for asm.js to be able to push frames in a JitActivation (just like how Ion and Baseline share the same JitActivation) so that we could avoid all this.)
Attachment #8389099 - Flags: feedback?(luke) → feedback+
(Reporter)

Comment 3

3 years ago
Created attachment 8395371 [details] [diff] [review]
faster-asmjs-ffi.patch

Here's an updated patch based on the above feedback.  It removes the InvokeDirectAsmJS helper function and does everything in assembly, along the lines described in Comment 2.

It seems to work OK in some light manual testing, and brings performance ahead of the non-inlining ion case:

  $ ./js/src/js --no-asmjs --ion-inlining=off ~/asmffibench.js 
  Best: 94339 iterations per second

  $ ./js/src/js ~/asmffibench.js 
  Best: 121951 iterations per second

A couple of concerns with the current version:

  * Changes to mainThread.asmJSActivationStack_ are suppoded to be protected
    with JSRuntime::AutoLockForInterrupt, which this patch doesn't do.  It's
    not clear to me what the potential race-conditions are or how to achieve
    this locking from the jitted code.

  * It doesn't cleanup the extra activations in the error path, instead letting
    control jump back to the initial amsjs call and counting on the AsmJSActivation
    destructor to do the right thing.

I also need to add some dedicated tests and cleanup some smaller todos, but overall it seems to be coming together pretty nicely.
Attachment #8389099 - Attachment is obsolete: true

Comment 4

3 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #3)
Awesome work!

>   * Changes to mainThread.asmJSActivationStack_ are suppoded to be protected
>     with JSRuntime::AutoLockForInterrupt, which this patch doesn't do.  It's
>     not clear to me what the potential race-conditions are or how to achieve
>     this locking from the jitted code.

Ah, forgot about this earlier.  The race the lock is protecting against is RequestInterruptForAsmJSCode (which is intended to be called off the main thread).  The lock basically ensures that all the stores to the new AsmJSActivation finish before the new AsmJSActivation is pushed onto asmJSActivationStack_.  Since you're emitting the machine code, you know that the compiler won't reorder, so the only concern is that the CPU might reorder.  I'm not an expert on the different processor memory models, so we'll need to double check, but I think x86 might not need anything.  I don't know about ARM.

Comment 5

3 years ago
Oops, I forgot that the lock in ~AsmJSActivation is also necessary to keep the AsmJSActivation from being removed while RequestInterruptForAsmJSCode is working on it.  This use case is harder to work around without adding locking overhead.

One thought is that we could avoid this whole thing by
 - maintaining a linked list of all live AsmJSModules in a JSRuntime (guarded by the interrupt lock)
 - RequestInterruptForAsmJSCode calls protectCode() on *all* AsmJSActivations (also guarded by the interrupt lock)

With this change, your code as is would be fine and we could remove the lock in AsmJSActivation.  This seems less efficient than what we're doing now but (1) interrupts shouldn't occur in normal execution (only when a single frame has been running longer than a second), (2) we do the same thing for ion code.
(Reporter)

Comment 6

3 years ago
> One thought is that we could avoid this whole thing by
>  - maintaining a linked list of all live AsmJSModules in a JSRuntime (guarded by the interrupt lock)
>  - RequestInterruptForAsmJSCode calls protectCode() on *all* AsmJSActivations (also guarded by the interrupt lock)

Interesting proposal, I'll code it up and give it a try.  Do we risk some sort of DoS vector if e.g. a malicious script creates a large number of asmjs modules and we have to walk the whole list?  It appears that the Ion version of this can protect() entire chunks of memory rather than walking them one function at a time.
(Reporter)

Comment 7

3 years ago
Created attachment 8415071 [details] [diff] [review]
faster-asmjs-ffi.patch

For comparison, I did a version of the previous patch with a simple mfence-based check for the interrupt lock.  As expected this is a bit of a performance killer:

    $ ./js/src/js --no-asmjs --ion-inlining=off ~/asmffibench.js 
    Best: 92592 iterations per second
    
    $ ./js/src/js ~/asmffibench.js 
    Best: 32894 iterations per second

It will be interesting to compare with the approach you suggest above.
Attachment #8395371 - Attachment is obsolete: true

Comment 8

3 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Do we risk some sort of DoS vector if e.g. a malicious script creates a large number of
> asmjs modules and we have to walk the whole list?

The useful thing here is that the interrupt callback should should almost never be triggered unless code is running for more than a second.  I guess code could try to create millions of asm.js modules, but even that would take small number of seconds which is still better than a script simply running "while(true){}".  Ultimately, at least until e10s, a malicious site has tons of ways to DoS the user.
In wasm, we now have the ability to export a function table and reimport it in another module, and we optimize these calls when they're from wasm to wasm. While this doesn't solve the general case of calling into a wasm FFI, I think it's enough for the use case here, which is having a way to call fast into wasm modules. Should we close or revisit this bug?
Flags: needinfo?(rfkelly)
Flags: needinfo?(luke)
Priority: -- → P5

Comment 10

9 months ago
Yes, I think the function tables should handle this use case quite well (even more generally, since they are mutable).
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Flags: needinfo?(luke)
Resolution: --- → WONTFIX
(Reporter)

Comment 11

9 months ago
clearing ni?; this SGTM
Flags: needinfo?(rfkelly)
You need to log in before you can comment on or make changes to this bug.