Closed Bug 1435360 Opened 2 years ago Closed 2 years ago

Baldr: use branching for wasm interrupt callbacks

Categories

(Core :: JavaScript Engine: JIT, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox60 --- wontfix
firefox61 --- fixed

People

(Reporter: luke, Assigned: luke)

References

Details

Attachments

(5 files, 4 obsolete files)

The current async signal/SuspendThread technique for interrupt wasm is rather fragile and incomplete:
 - because general JS can run from the interrupt callback (easily, too, from the JS shell timeout() function), every interruptible wasm PC must be GC-safe (and probably isn't, and especially won't be with 'anyref')
 - it's possible for interrupts to be lost if the PC is in various places and not all uses of RequestInterruptCallback() retry periodically; pathological cases could spend most of their time in non-interruptible code ranges avoiding interruption altogether

I briefly considered using code-patching approaches (e.g., nop->ud2), but this has problems since wasm code is shared between threads.

So instead, I expect the overhead to be pretty low to put an 'interrupt' flag in TlsData and poll this on loop backedges.  For function prologues, we can use the same jitStackLimit trick as JS JIT code does, clearing the jitStackLimit during interrupt to reuse the existing stack-overflow check.

In either case, I think I'd unify interrupts with traps by saying that traps can conditionally resume.  This could be useful in the future if wasm has trap handlers.

This should remove a fair amount of hacks added for the async interrupt (in WasmFrameIter, the simulators, WasmSignalHandler.cpp, WasmStubs.cpp, etc).
Don't know what you had planned in terms of the flag but a standardish trick that I'm actually not sure we're using elsewhere is to, instead of reading a value and comparing it to zero, say, just accessing an address which points to a page that is either read-protected or not (normally not).  If we could statically link this address into the code instead of going via Tls we might have a one-instruction check.

(Perhaps this page could be the page directly preceding the Tls so that if the tls pointer is available we can read off of that, no patching required.)
Status: ASSIGNED → NEW
Priority: -- → P2
Another note-to-self: with this change, we can drop the requirement of wasm on the ability to asynchronously interrupt via pthread_kill() and thus drop the Android 4.3 requirement.  With this and ARM64, I think this will mean wasm is always enabled on everything we test on treeherder which could allow us to remove the "wasmIsSupported()" early exits which, in turn, would avoid the "wasm is disabled so tests silently pass" failure mode.
(In reply to Luke Wagner [:luke] from comment #2)
> With this and ARM64, I think this will mean
> wasm is always enabled on everything we test on treeherder which could allow
> us to remove the "wasmIsSupported()" early exits which, in turn, would avoid
> the "wasm is disabled so tests silently pass" failure mode.

It's possible that there's some implicit constraint in that statement I'm not seeing, so at the risk of being redundant: We still have configurations without a JIT (PPC64, Sparc, others) and probably configurations with a JIT that don't support wasm (PPC).  I don't know if we run treeherder tests on the "none" configuration but we ought to.
My assumption is that our JS shell tests only need to pass on tier-1 platforms.  Keeping 'none' working, even if it's not a tier-1, is a good point though.
Indeed bug 1442508 is about to enable full testing of no-jit builds.
Sweet, that makes the situation even simpler, thanks for pointing it out.
I have this working.  The slowdowns I see seem to be higher for numeric benchmarks and less for more macro workloads:

 box2d: 1.8% (slower)
 bullet: 3.5%
 lua-binarytrees: 4.5%
 fasta: 9%
 skinning: 10%

I tested and confirmed (1) the overhead is only from the loop interrupt check (the prologue check was free), (2) it's only from the actual codegen; the reservation of WasmTlsReg (on x64) wasn't measurable.

I'm disappointed by the 10% slowdown, but:
 1. There's not a good other way I can think of to fix the minor existing and major future (with reference types) GC hazards
 2. With finer-grained process separation where unrelated tabs don't share processes, we could turn off the interrupt check in Ion loops.  Apparently Chakra does this.  In my experience, trivial iloops (that don't go through a prologue) are really rare in normal web browsing.  If a developer has a bug and opens the debugger, then they'd get wasm baseline which would keep the interrupt checks.
Attached patch turn-off-async-interrupt (obsolete) — Splinter Review
Sweet sweet code removal and relocation of the remaining Ion bits to jit/AsyncInterrupt.(h,cpp)
Attachment #8957033 - Flags: review?(jdemooij)
Attached patch turn-off-async-interrupt (obsolete) — Splinter Review
Pretty simple patch.  The main interesting changes are:
 - traps become resumable (when CheckForInterrupt() says not to interrupt)
 - the Runtime tracks all live instances so it can set the 'interrupted' bit on all of them when requested
 - as with Ion, the existing prologue stack-overflow check is (ab)used to also check interrupt by changing the stackLimit when an interrupt is requested
Attachment #8957034 - Flags: review?(bbouvier)
These look like the same patch to me.
Comment on attachment 8957034 [details] [diff] [review]
turn-off-async-interrupt

Review of attachment 8957034 [details] [diff] [review]:
-----------------------------------------------------------------

Yep, same patch twice, with code motion and new features.
Attachment #8957034 - Flags: review?(bbouvier)
Attached patch non-resumable-interrupt (obsolete) — Splinter Review
Oops!
Attachment #8957034 - Attachment is obsolete: true
Attachment #8957214 - Flags: review?(bbouvier)
Attached patch rm-wasm-async-interrupt (obsolete) — Splinter Review
Now, passing try.
Attachment #8957033 - Attachment is obsolete: true
Attachment #8957033 - Flags: review?(jdemooij)
Attachment #8957400 - Flags: review?(jdemooij)
Oops, wrong patch.
Attachment #8957214 - Attachment is obsolete: true
Attachment #8957400 - Attachment is obsolete: true
Attachment #8957214 - Flags: review?(bbouvier)
Attachment #8957400 - Flags: review?(jdemooij)
Attachment #8957401 - Flags: review?(jdemooij)
With try fixes.
Attachment #8957402 - Flags: review?(bbouvier)
Comment on attachment 8957401 [details] [diff] [review]
rm-wasm-async-interrupt

Review of attachment 8957401 [details] [diff] [review]:
-----------------------------------------------------------------

This is great, much simpler. We should consider removing the Ion interrupt mechanism too; I'll try to get some benchmark numbers soon...

::: js/src/jit/Lowering.cpp
@@ +93,5 @@
>  static void
>  TryToUseImplicitInterruptCheck(MIRGraph& graph, MBasicBlock* backedge)
>  {
>      // Implicit interrupt checks require wasm signal handlers to be installed.
> +    if (!jit::HaveAsyncInterrupt() || JitOptions.ionInterruptWithoutSignals)

Nit: fix the comment
Attachment #8957401 - Flags: review?(jdemooij) → review+
Comment on attachment 8957402 [details] [diff] [review]
tls-wasm-interrupt

Review of attachment 8957402 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks.

::: js/src/jit/CodeGenerator.cpp
@@ +12694,5 @@
>  void
> +CodeGenerator::visitWasmInterruptCheck(LWasmInterruptCheck* lir)
> +{
> +    MOZ_ASSERT(gen->compilingWasm());
> +    const MWasmInterruptCheck* mir = lir->mir();

nit: can be inlined

::: js/src/wasm/WasmBuiltins.cpp
@@ +211,5 @@
>      return HandleThrow(cx, iter);
>  }
>  
> +static void*
> +ReportError(JSContext* cx, unsigned errorNumber)

Can you comment that it returns the sentinel value for resumePC?

@@ +230,5 @@
> +    activation->finishWasmTrap();
> +    return resumePC;
> +}
> +
> +static void*

Can you comment it returns either nullptr to throw or a resumepc?

@@ +231,5 @@
> +    return resumePC;
> +}
> +
> +static void*
> +OnTrap(Trap trap)

This new name breaks symmetry with all the others. I'd prefer either renaming all to OnX (like OnOutOfBounds, etc.) or to keep the previous name (with a preference to keep the previous name which was more explicit).

@@ +267,5 @@
> +        if (!CheckRecursionLimit(cx))
> +            return nullptr;
> +        if (activation->wasmExitFP()->tls->isInterrupted())
> +            return CheckInterrupt(cx, activation);
> +        return ReportError(cx, JSMSG_OVER_RECURSED);

When is this line reachable? Either we've hit the recursion limit and returned just after the call to CheckRecursionLimit, or an interrupt was triggered. Is it in case we've popped elements off the stack and CheckRecursionLimit "lies" about what happened?

::: js/src/wasm/WasmCompartment.h
@@ +73,5 @@
> +// CheckForInterrupt(), it should call RunningCodeInterrupted() to clear the
> +// interrupt request for all wasm Instances to avoid spurious trapping.
> +
> +void
> +RunningCodeInterrupted(JSContext* cx);

nit: This name isn't very talkative. When I saw the first usage, I first thought it was a RAII guard, that was incorrectly called. How about just ClearInstancesInterrupt? (or ClearAllInstancesInterrupt, or ClearCompartmentInterrupt, or anything else that indicates it clears flags?)
Attachment #8957402 - Flags: review?(bbouvier) → review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #17)
Thanks!

> > +static void*
> > +OnTrap(Trap trap)
> 
> This new name breaks symmetry with all the others. I'd prefer either
> renaming all to OnX (like OnOutOfBounds, etc.) or to keep the previous name
> (with a preference to keep the previous name which was more explicit).

Well, there is a symmetry here: "Report" means "unconditionally report an error" (such that the caller knows to jump to then jump to the throw stub) and that's why OnTrap() calls a function named ReportError().  However OnTrap() is unique in that it either reports or resumes, so I think that warrants breaking symmetry to call out the difference.  Also, WasmReport(OutOfBounds|UnalignedAccess) are going away with the final patch of bug 1428453 (once I finish this bug) so there will only be the one WasmReportInt64JSCall().

> > +        if (!CheckRecursionLimit(cx))
> > +            return nullptr;
> > +        if (activation->wasmExitFP()->tls->isInterrupted())
> > +            return CheckInterrupt(cx, activation);
> > +        return ReportError(cx, JSMSG_OVER_RECURSED);
> 
> When is this line reachable? Either we've hit the recursion limit and
> returned just after the call to CheckRecursionLimit, or an interrupt was
> triggered. Is it in case we've popped elements off the stack and
> CheckRecursionLimit "lies" about what happened?

This is to handle cases where C++ and JIT were checking subtly different limits (frame sizes and such).
(In reply to Luke Wagner [:luke] from comment #18)
> > > +OnTrap(Trap trap)

(In the interest of landing, I'll stick with OnTrap(), but if you feel strongly about the name, happy to revisit in followup patch.)
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a463d224c412
Baldr: remove wasm async interrupt support (r=jandem)
https://hg.mozilla.org/integration/mozilla-inbound/rev/8cdf945be534
Baldr: implement wasm interrupt in terms of TlsData branch and stack overflow check (r=bbouvier)
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e2a6bd47f697
Baldr: jit::EnsureAsyncInterrupt() shouldn't cause JSRuntime failure if there is no async interrupt support on CLOSED TREE (r=me)
In splitting out jit::EnsureAyncInterrupt(), I was mistakenly returning 'false' when old Android prohibited use of pthread_kill().  wasm::EnsureSignalHandlers() only returns false on catastrophic OOM, not when signal handling isn't available.  Since jit::EnsureAsyncSupport() doesn't do allocation, it can't fail.
Backout by aciure@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/55c87e7ea09d
Backed out 3 changesets for Android XPCShel failures a=backout on a CLOSED TREE
Previously, wasm would be disabled because the `ro.build.version.sdk > 19` check would fail when enabling signals.  Now there is no such dependency (b/c we're not using pthread_kill()), so wasm is enabled, allowing this xpcshell test (test_wasm_recompile.js) to actually run.  Apparently Android 4.3 is the *only* ARM we run xpcshell tests on, so I expect this test simply never passed.  From a quick look, it looks like the binary code comparison is failing (using the shell wasmExtractCode()).  Now that I think about it, code compilation isn't deterministic, so it's surprising this test passes at all...
Attached patch fix-idb-testSplinter Review
This changes all the IDB xpcshell tests to test behavior instead of machine code equality.  Seems to fix Android 4.3 X2 failure on try.
Attachment #8957773 - Flags: review?(bbouvier)
(In reply to Luke Wagner [:luke] from comment #25)
> Previously, wasm would be disabled because the `ro.build.version.sdk > 19`
> check would fail when enabling signals.  Now there is no such dependency
> (b/c we're not using pthread_kill()), so wasm is enabled, allowing this
> xpcshell test (test_wasm_recompile.js) to actually run.  Apparently Android
> 4.3 is the *only* ARM we run xpcshell tests on, so I expect this test simply
> never passed.  From a quick look, it looks like the binary code comparison
> is failing (using the shell wasmExtractCode()).  Now that I think about it,
> code compilation isn't deterministic, so it's surprising this test passes at
> all...

Luke, can you have a look at bug 1308780 to see if there are changes there as a fallout from changes made here?
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #18)
> Well, there is a symmetry here: "Report" means "unconditionally report an
> error" (such that the caller knows to jump to then jump to the throw stub)
> and that's why OnTrap() calls a function named ReportError().  However
> OnTrap() is unique in that it either reports or resumes, so I think that
> warrants breaking symmetry to call out the difference.  Also,
> WasmReport(OutOfBounds|UnalignedAccess) are going away with the final patch
> of bug 1428453 (once I finish this bug) so there will only be the one
> WasmReportInt64JSCall().

Makes sense to rename indeed because we don't necessarily trap. But `OnTrap` feels like a generic callback (type) name, not a specific function name. How about HandleTrap?
Comment on attachment 8957773 [details] [diff] [review]
fix-idb-test

Review of attachment 8957773 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8957773 - Flags: review?(bbouvier) → review+
(In reply to Lars T Hansen [:lth] from comment #27)
IIUC, the same issue exists, and the same fix would apply, just moved from wasm/WasmSignalHandlers.cpp to jit/AsyncInterrupt.cpp.
Flags: needinfo?(luke)
(In reply to Benjamin Bouvier [:bbouvier] from comment #28)
HandleTrap() is a good name and matches HandleThrow().
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d6183aa40da
Baldr: remove wasm async interrupt support (r=jandem)
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb7a4a96c333
Baldr: implement wasm interrupt in terms of TlsData branch and stack overflow check (r=bbouvier)
https://hg.mozilla.org/integration/mozilla-inbound/rev/c4bc47b31311
Improve wasm IDB serialization tests to verify modules deterministically (rs=bbouvier)
Backout by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/edcbf982b033
Backed out 3 changesets for web-platform reftests failures on marionette/content/reftest.js CLOSED TREE
So it looks like this patch is excellent at surfacing previously-dormant bugs.  In this SM-arm failure case, bug 1428453 regressed the testcase added by bug 1332493, but this was covered by the fact that the stack contents happened to be 0.  Apparently this patch changed them to be reliably non-zero.
Flags: needinfo?(luke)
Previously, there were two stack checks: one for the constant fr.initialSize() of the frame done by GenerateFunctionPrologue(), and one for the patched maxStackHeight done right after.  However, if the first stack check fails, the two DebugFrame fields are uninitialized, which is what bug 1332493 fixed/tested.

With this change, GenerateFunctionPrologue() goes back to its earlier behavior of not reserving the frame to allow Ion/Rabladr/Stubs to do the right thing.  Rabaldr can then reserve *just* the DebugFrame (not worrying about stack overflow from huge fr.initialSize()), initialize those fields, then do the stack check.

Patch is mostly code motion.  I don't think the fix is necessarily worth backporting because it's debug-only and only for a very specific case.
Attachment #8958325 - Flags: review?(lhansen)
Comment on attachment 8958325 [details] [diff] [review]
fix-baseline-stack-check

Review of attachment 8958325 [details] [diff] [review]:
-----------------------------------------------------------------

This looks OK but you'll need to adapt it to the ARM64 logic that just landed, so clearing review for now.
Attachment #8958325 - Flags: review?(lhansen)
Fortunately, quite easy to rebase.  Passes arm64 jit-tests.
Attachment #8958550 - Flags: review?(lhansen)
Comment on attachment 8958550 [details] [diff] [review]
fix-baseline-stack-check

Review of attachment 8958550 [details] [diff] [review]:
-----------------------------------------------------------------

OK, looks good.
Attachment #8958550 - Flags: review?(lhansen) → review+
So, because when it rains it pours, this patch also tickled a *third* bizarre latent problem.  As comment 4 mentioned, this patch causes Wr4 to reliably fail despite that test having little to do with JS.  I wasn't able to repro locally and printf() logs don't show any interrupts happening at the point of failure, so I have no idea what's actually failing.

However, I was able to bisect down to a minimal patch which merely adds a file and unused empty functions:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=44a417105be28b8421005795874f555a9b5df20e
Renaming jit/AsyncInterrupt.cpp to jit/ZsyncInterrupt.cpp succeeds:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4c152abcd81a4094ac36dc9cf651ed4bbb643f0
yet renaming to jit/Interrupt.cpp fails:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=16f2817abd8320ba80417cd35f5fd55103976aee
Folding jit/AsyncInterrupt.cpp into existing files succeeds:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=08fa91f15d5f7ac450cfe059c19b44d72a4d893d

So there seems to be some bizarre issue with file names here.  Obviously I'd file a separate bug to investigate, but Jan, are you ok with moving AsyncInterrupt.(cpp,h) into jit/JitCompartment.h and (because there is no JitCompartment.cpp), jit/Ion.cpp to resolve this bug?
Flags: needinfo?(jdemooij)
(In reply to Luke Wagner [:luke] from comment #40)
> Obviously I'd
> file a separate bug to investigate, but Jan, are you ok with moving
> AsyncInterrupt.(cpp,h) into jit/JitCompartment.h and (because there is no
> JitCompartment.cpp), jit/Ion.cpp to resolve this bug?

That seems okay to get this landed.

Does it only affect Linux32 opt or also other platforms?
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #41)
Only Linux32 opt (which includes Linux32 Styo Disabled Opt).
Full try run seems green (taking into account frequent orange of ccov tests on m-c):
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=e22449ddbc379dc9585f260c69f975f61f7feb3f
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc6098c0deeb
Baldr: do baseline stack-overflow check after initializing DebugFrame (again) (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a74ec55e8bad
Improve wasm IDB serialization tests to verify modules deterministically (rs=bbouvier)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d8b27e30ef91
Baldr: remove wasm async interrupt support (r=jandem)
https://hg.mozilla.org/integration/mozilla-inbound/rev/b832f0e5e302
Baldr: implement wasm interrupt in terms of TlsData branch and stack overflow check (r=bbouvier)
... and right as I was about to file the follow-up investigation bug for comment 40, someone else hit Wr4 on m-i and so now the test is disabled with bug 1445834 to investigate.
Depends on: 1446903
Depends on: 1446907
Duplicate of this bug: 1431790
You need to log in before you can comment on or make changes to this bug.