Closed Bug 1367871 Opened 5 years ago Closed 5 years ago

Assertion failure: funcIndex >= AsmJSFirstDefFuncIndex, at js/src/wasm/AsmJS.cpp:372


(Core :: JavaScript Engine, defect)

Not set



Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed


(Reporter: gkw, Assigned: bbouvier)



(Keywords: assertion, bugmon, testcase, Whiteboard: [jsbugmon:])


(3 files, 1 obsolete file)

The following testcase crashes on mozilla-central revision f81bcc23d37d (build with --enable-debug --enable-more-deterministic, run with --fuzzing-safe --no-threads --no-baseline --no-ion):

See attachment.


#0  0x0000000000d02000 in js::AsmJSMetadata::getFuncName (this=this@entry=0x7f9ed0618400, maybeBytecode=<optimized out>, funcIndex=funcIndex@entry=0, name=name@entry=0x7fff96b64fe0) at js/src/wasm/AsmJS.cpp:372
#1  0x0000000000d697e6 in js::wasm::Instance::getFuncName (name=0x7fff96b64fe0, funcIndex=0, this=<optimized out>) at js/src/wasm/WasmInstance.cpp:775
#2  js::wasm::Instance::getFuncAtom (this=<optimized out>, cx=0x7f9ed0672000, funcIndex=0) at js/src/wasm/WasmInstance.cpp:782
#3  0x0000000000d698c5 in js::wasm::FrameIterator::functionDisplayAtom (this=this@entry=0x7fff96b65720) at js/src/wasm/WasmFrameIterator.cpp:177
#4  0x0000000000bdf810 in js::FrameIter::functionDisplayAtom (this=this@entry=0x7fff96b65690) at js/src/vm/Stack.cpp:834
#5  0x0000000000bca1a0 in js::SavedStacks::insertFrames(JSContext*, js::FrameIter&, JS::MutableHandle<js::SavedFrame*>, mozilla::Variant<JS::AllFrames, JS::MaxFrames, JS::FirstSubsumedFrame>&&) (this=this@entry=0x7f9ed064e0f8, cx=cx@entry=0x7f9ed0672000, iter=..., frame=..., frame@entry=..., capture=capture@entry=<unknown type in /home/gkwubu/shell-cache/js-dbg-64-dm-linux-f81bcc23d37d/js-dbg-64-dm-linux-f81bcc23d37d, CU 0x4d7d1f7, DIE 0x4f98adc>) at js/src/vm/SavedStacks.cpp:1352

For detailed crash information, see attachment.

This is a reproducible/reliable testcase, but hard to reduce.
autoBisect shows this is probably related to the following changeset:

The first bad revision is:
user:        Yury Delendik
date:        Sat Jan 07 10:36:11 2017 -0600
summary:     Bug 1286948 - Extends AbstractFramePtr to reference wasm::DebugFrame. r=luke,shu

Yury/Benjamin, is bug 1286948 a likely regressor?

(Note that line 982 might be the interesting portion of the testcase, reproduced below)

s = "timeout(0.01);var code=wasmTextToBinary(`(module(func(export\"f1\")(loop$top br$top)))`);new WebAssembly.Instance(new WebAssembly.Module(code)).exports.f1()"
Blocks: 1286948
Flags: needinfo?(ydelendik)
Flags: needinfo?(bbouvier)
Whiteboard: [jsbugmon:update] → [jsbugmon:]
JSBugMon: Cannot process bug: Error: Failed to isolate test from comment
Looking. So far, I can only reproduce with --enable-deterministic mode, fwiw.
After some time spent in the debugger and rr, found how this issue could arise and made a smaller test case which reproduces also in non-deterministic mode:

function asmJS(glob, ffi) {
    "use asm";
    var out = ffi.out;
    function f() {
    return f;

var ffi = {};
ffi.out = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary('(module (func (export "f") (loop $top (br $top))))'))).exports.f;

asmJS(this, ffi)();

The stack frame iterator is using the activation->resumePC() (which is just runtime->wasmResumePC(), common to all the wasm::Activation) to recover the Code/CodeRange, but when we're in the asm.js activation, the value of resumePC is incorrect! (it's set to the wasm infinite loop's body) Thus we're getting the wasm Code/CodeRange values, and the function index for the first wasm function (=== 0) is less than AsmJSFirstFuncDef, hence this assertion.

Luke, this suggests that the resumePC should actually remain in the WasmActivation, right?
Flags: needinfo?(ydelendik)
Flags: needinfo?(luke)
Flags: needinfo?(bbouvier)
(duh, bugzilla chose to remove my needinfo on my behalf :/)
Flags: needinfo?(bbouvier)
Ah hah, I think the problem is when there are two WasmActivations on the stack for difference instances.  After bug 1360263, WasmActivation::interrupted() just forwards to JSRuntime::wasmResumePC() so both WasmActivations will think they are interrupted when only the inner activation actually is.  For the outer WasmActivation, JSRuntime::wasmResumePC will then be used to lookup the Code/CodeRange which will not match fp->instance->code.

I think the fix is to make WasmActivation::interrupted() only say 'true' for the innermost; once way would be to have it do a search from cx->activation() down to the first act where act->isWasm() and check whether act == this.

Also, it'd be good to assert fp->instance->code == code_ in FrameIterator's ctor.
Flags: needinfo?(luke)
This fix is not enough, but I am not too sure to understand why: in the jit-tests, wasm/timeout/2.js asserts (no code_ in a WasmFrameIterator ctor):

I wonder if this is that the gc() after the timeout has caused the ilooping Instance to be deregistered from the wasm compartment, causing the lookupCode at the compartment level to fail, because the resumePC is still in the ilooping Instance?

I think the interrupted() function should also check that the resumePC is actually in the activation's instance code. With a bit of refactoring, code_ in WasmFrameIterator can be inferred from the interrupted() call.
Flags: needinfo?(bbouvier)
Attached patch fix.patch (obsolete) — Splinter Review
Attempt at fix described above.
Assignee: nobody → bbouvier
Attachment #8872692 - Flags: review?(luke)
Comment on attachment 8872692 [details] [diff] [review]

Review of attachment 8872692 [details] [diff] [review]:

::: js/src/vm/Stack.cpp
@@ +1710,5 @@
> +            continue;
> +
> +        WasmActivation* act = iter->asWasm();
> +        if (act != this)
> +            continue;

Hmm, I think what you want to say here is just 'return false'.  Also, I wouldn't try to optimize the lookup of code by returning pcode.  Rather, I'd just keep this a simple boolean "is this wasm activation interrupted?" query.
Attachment #8872692 - Flags: review?(luke)
Attached patch fix.patchSplinter Review
Attachment #8872692 - Attachment is obsolete: true
Attachment #8872707 - Flags: review?(luke)
Comment on attachment 8872707 [details] [diff] [review]

Review of attachment 8872707 [details] [diff] [review]:

::: js/src/vm/Stack.cpp
@@ +1713,5 @@
> +        if (act != this)
> +            return false;
> +
> +        wasm::Frame* fp = act->exitFP();
> +        return fp && fp->instance()->code().containsFunctionPC(pc);

Can you change this to an assert?  Because the meaning of !interrupted() is that the WasmActivation exited "normally" and that's not the case if this test fails.
Attachment #8872707 - Flags: review?(luke) → review+
Pushed by
Loop over the wasm activations to find the one that's actually interrupted; r=luke
Probably more likely to be caused by bug 1360263. I've checked Aurora and the assertion doesn't trigger there.
Blocks: 1360263
No longer blocks: 1286948
Backout by
Backed out changeset 8f6a8013e62c for Stack.cpp asserts on a CLOSED TREE.
This caused and the patch got backed out (thanks RyanVM).

I've noticed that wasm::MaybeActiveActivation was doing the same as my manual loop with ActivationIterator (without the cx check), so I could replace the whole chunk with:

auto* act = wasm::MaybeActiveActivation(cx_);
if (act != this)
  return false;

MOZ_ASSERT(fp && fp->instance()->code().containsFunctionPC(pc));
return true;

Unfortunately, this now breaks wasm/timeout/2.js, because fp->instance() is nullptr when act == this... I have no idea why, especially since the two codes seem semantically equivalent.

Luke, any idea why?
Flags: needinfo?(luke)
Wait, the assert (at the origin of the backout) only triggered for windows builds?
That's strange.  So you're saying that fp is non-null, fp->tls is non-null but fp->tls->instance is null?  That's super-weird b/c TlsData::instance should be simply initialized to 'this' in the Instance ctor.  Maybe the nullness is elsehwere?

Also, could you confirm that the wasmResumePC is in the instance with the iloop and actually within that iloop?
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #19)
> That's strange.  So you're saying that fp is non-null, fp->tls is non-null
> but fp->tls->instance is null?

Correct, given I modify the code as explained in comment 17. Otherwise this doesn't show up in a linux build...

>  That's super-weird b/c TlsData::instance
> should be simply initialized to 'this' in the Instance ctor.  Maybe the
> nullness is elsehwere?

Will investigate a bit more.

> Also, could you confirm that the wasmResumePC is in the instance with the
> iloop and actually within that iloop?

Yes to all of this.
(In reply to Benjamin Bouvier [:bbouvier] from comment #20)
> Will investigate a bit more.

So, fun story: MaybeActiveActivation returns a WasmActivation iff it's at the top or below N JitActivations. The change suggested in comment 17 was causing a crash in wasm/timeout/2, because of a series of events.

wasm/timeout/2.js does the following: an instance calls into another instance (of another module), the callee being an infinite loop. There's a timeout function, which final callback does a gc() before continuing.

- in the final callback, a gc() call happens (we're in JS, so the top activaiton is an interpreter one).
- we trace wasm activations, using a wasm frame iterator
- the wasm frame iterator finds the first activation, calls interrupted()
- because of the change in comment 17, MaybeActiveActivation would return nullptr (top activation is an interpreter activation).
- so the innermost wasm frame would think it's not the interrupted one, and be popped in the wasm frame iterator
- since the inner frame has been skipped, the ilooping callee instance is not marked
- then it's gc'd
- after the final timeout callback, the JS interpreter tries to make a stack trace, using another wasm frame iterator
- now it can figure that the innermost wasm frame is the interrupted one (no more interpreter frame above)
- but its instance has been gc'd! hence the crash.

TL;DR: we should rename MaybeActiveActivation to MaybeActiveTopActivation, or MaybeActiveInnermostActivation, or something that refers to skipping the JitActivation, I don't know.

Now, back to the Windows issue.
Wow, good job detectiving that.  Agreed on renaming.
Aha! When scheduling an interrupt on windows, we don't switch thread but actually use the current thread [1] (watchdog thread, in the JS shell case), which never initializes TlsContext. The fix is as easy as mimicking the unrolled ActivationIterator loop, without the cx == TlsContext.get() assertion.

Pushed by
Loop over the wasm activations to find the one that's actually interrupted; r=luke
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.