Closed Bug 1334239 Opened 5 years ago Closed 5 years ago

WebAssembly application crashes at launch, corrupted call stack and no crash handler giving a report


(Core :: JavaScript Engine, defect)

Not set



Tracking Status
firefox52 --- fixed
firefox53 --- fixed
firefox54 --- fixed


(Reporter: jujjyl, Assigned: luke)



(4 files)


1. Download
2. Unzip, enable javascript.options.wasm;true pref and run stingray_webgl_release.html e.g. in python -m SimpleHTTPServer (file:// might work as well)

Observed: When the page loads, the content process will crash.

Attempted to attach VS2015 debugger to the application (had remote.autostart/e10s disabled), and the crash does also occur and VS2015 is able to catch the crash, but the callstack is completely bogus and VS2015 says that the local stack resides outside the application module, and there's very random looking code in those locations.
Doesn't crash in the shell, but causes a RuntimeError: index out of bounds.

Can repro the crash on a browser under linux x64, investigating.
Ah, I think I see the bug: we have a signature mismatch error happening when calling an imported JS function from a table (quite the corner case) and I've forgotten to call masm.wasmEmitTrapOutOfLineCode() from the GenerateImportFunction() stub.

Jukka/Alon: there does seem to be a dynamic wasm error here: we're calling the 81st imported function, _glClearColor, which is imported with signature (f64,f64,f64,f64)->void but called with signature (f32,f32,f32,f32)->().  (The caller has function index 33004.)  Btw, I see we import _emscripten_glClearColor *twice* with the signature (f32,f32,f32,f32)->void, but one of these imports is not used at all and the other is only called directly and not present in the (elem) section.
Flags: needinfo?(jujjyl)
Flags: needinfo?(azakai)
Hmm, what is the difference between

(f32,f32,f32,f32)->() versus (f32,f32,f32,f32)->void ? Is that relevant here?

In C side, glClearColor() should be

GL_APICALL void GL_APIENTRY glClearColor (GLfloat red, GLfloat green, GLfloat blue, GLfloat alpha);

where GL_APICALL and GL_APIENTRY expand to empty strings and GLfloat is float in C code. We expect to call the function _glClearColor(), and the function _emscripten_glClearColor() should not be called (but due to an inefficiency/bug, it is known to be present in the source code, 

The function _glClearColor() is an extern import from JS. I'm not sure how it picks up the import signature in current Binaryen? We do have the glClearColor__sig: 'viiii' fields in GL code, though I think that might only apply to the glGetProcAddress() machinery.
Flags: needinfo?(jujjyl)
Oh, also, I'm surprised what could be special about glClearColor(), because a number of GL API functions take in floats/doubles, and I'd expect all of those would be wrong if this one is wrong. Is that the case?
If I do a simple test on C side with the following:

> #include <emscripten/html5.h>
> extern "C" {
> // void glClearColor(double red, double green, double blue, double alpha); (*)
> void glClearColor(float red, float green, float blue, float alpha);
> void glClear(unsigned int mask);
> }
> int main()
> {
>   EmscriptenWebGLContextAttributes attrs;
>   emscripten_webgl_init_context_attributes(&attrs);
>   attrs.majorVersion = 1;
>   attrs.minorVersion = 0;
>   EMSCRIPTEN_WEBGL_CONTEXT_HANDLE context = emscripten_webgl_create_context(0, &attrs);
>   emscripten_webgl_make_context_current(context);
>   glClearColor(1.f, 1.f, 0.f, 1.f);
>   glClear(0x00004000/*GL_COLOR_BUFFER_BIT*/);
>   return 0;
> }

built with

> em++ a.cpp -o a.html -O3 -g2 -s BINARYEN=1

that does work properly. What gets generated as import and call are

>   (import "env" "_glClearColor" (func $import$19 (param f64 f64 f64 f64)))

>       (call $import$19
>         (f64.const 1)
>         (f64.const 1)
>         (f64.const 0)
>         (f64.const 1)
>       )

which occurs both in float and double case (toggling the line marked with (*) above), so it looks like at least the signature that the function is declared in C side does not affect this.
Hmm, when I use binaryen to wasm-dis that wasm, I see

 * we indeed import it once as f32 params, once as f64 params
 * the first (f32s) is never called. recent binaryen should dce it out, so if this was built with a recent version, we should investigate that
 * the second (f64s) is called from $legalfunc$_emscripten_glClearColor, which receives f32 inputs, promotes them to f64, then calls that second import (it "legalizes" the f32 type for the call out to JS, where we accept only the basic asm.js ffi types)
 * $legalfunc$_emscripten_glClearColor in turn is placed in the function table, and can be called from there. So C code would have a function pointer to a function with f32s - as that is how it is declared in C - which is $legalfunc$_emscripten_glClearColor, which then converts and calls out to the f64 import. So C code calls with f32s, the stub converts, and we call out to JS with f64s.

That all seems to be ok - I don't see a method imported one way and called another. Perhaps there is a difference between the disassembly of wasm-dis and spidermonkey's? How can I run spidermonkey's to compare?
Flags: needinfo?(azakai)
Jukka - in your last example, I think a difference from the original testcase is that glClearColor is only called by a function pointer there (maybe from glGetProcAddress or such?).
(In reply to Jukka Jylänki from comment #3)
> (f32,f32,f32,f32)->() versus (f32,f32,f32,f32)->void ? Is that relevant here?

Sorry, just I should've typed "void" in both cases.

(In reply to Alon Zakai (:azakai) from comment #6)

From what I'm seeing, _glClearColor is the 81st function import and it's imported with signature (f64,f64,f64,f64) and it's going into the (elem) section as element 600.
Attached patch word-sizeSplinter Review
Although this isn't the bug, I noticed that the sig-id comparison should be pointer-sized.
Assignee: nobody → luke
Attachment #8832664 - Flags: review?(bbouvier)
Oh, right, there is both glClearColor and emscripten_glClearColor.

For glClearColor, it is imported as f64 (because we detect it that way based on usage, due to asm.js ffis), and as such we place it directly in the table, with no stub. Then a function pointer call to it with f32s is wrong, so yes, there is a problem here. Not obvious to me how to fix it yet.
Stepping through the code, I also noticed that, when we updated to 0xD, the new binary constant values of ValType implicitly disabled the packed-word optimization for signature checks (b/c the ValTypes are no longer in the range 0-4 but some big constant).  This patch does what I should've done initially and explicitly maps the four types to their packed numeric value.  It'll be interesting to see how much this wins on indirect-call-heavy workloads like bullet.
Attachment #8832665 - Flags: review?(bbouvier)
And here's the actual, trivial fix (with test).
Attachment #8832666 - Flags: review?(bbouvier)
Alon: maybe you have the binaryen bug understood, but just in case it helps, here's the error stack (with names from the names section, I see) of the indirect call signature mismatch:

Thanks, yeah, already managed to reduce it to a tiny testcase:

Kind of surprising we didn't run into this before... but it does happen only on a js library function, which has a type not allowed in asm.js ffis (like f32), and which is called both as a function pointer and as a regular function. Only all those together hit the bug.
Comment on attachment 8832664 [details] [diff] [review]

Review of attachment 8832664 [details] [diff] [review]:

Good catch, thanks.

Does the none architecture needs to be modified too?
Attachment #8832664 - Flags: review?(bbouvier) → review+
Comment on attachment 8832665 [details] [diff] [review]

Review of attachment 8832665 [details] [diff] [review]:

Easier to maintain indeed, thanks.

::: js/src/wasm/WasmTypes.cpp
@@ +596,5 @@
> +      case ValType::F32x4:
> +      case ValType::B8x16:
> +      case ValType::B16x8:
> +      case ValType::B32x4:
> +        break;

nit: i find it always weird to have mixed \n decisions at the end of case lines. Can you make it consistent, please? (personal preference is \n for everyone)
Attachment #8832665 - Flags: review?(bbouvier) → review+
Comment on attachment 8832666 [details] [diff] [review]

Review of attachment 8832666 [details] [diff] [review]:

Attachment #8832666 - Flags: review?(bbouvier) → review+
When this gets fixed, will these kinds of invalid wasm modules fail to compile altogether? Or will they abort at runtime if they happen to perform an invalid call?
(In reply to Jukka Jylänki from comment #18)
> When this gets fixed, will these kinds of invalid wasm modules fail to
> compile altogether? Or will they abort at runtime if they happen to perform
> an invalid call?

The latter, a runtime exception will be thrown.
(In reply to Jukka Jylänki from comment #18)
Yeah, the fault here is essentially calling a C++ function with a wrong-typed func-ptr:

  void foo(float) {}
  typedef void (*PFD)(double);
  int main() {
    PFD pfd = (PFD)foo;

This is caught at runtime with a wasm signature check at the ptr-to-fun call.
(In reply to Benjamin Bouvier [:bbouvier] from comment #15)
> Does the none architecture needs to be modified too?

Seems like no because it's marked as PER_SHARED_ARCH, I guess; there's no branch32/branchPtr in none/MacroAssembler.h.
Pushed by
Baldr: use word-size comparison for signature pointer (r=bbouvier)
Baldr: re-enable SigIdDesc immediate optimization (r=bbouvier)
Baldr: emit trap out-of-line paths in function import wrappers (r=bbouvier)
Attached patch branch.patchSplinter Review
Approval Request Comment
[Feature/Bug causing the regression]: bug 1313180
[User impact if declined]: crash in corner case error behavior
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: verified manually
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: very low
[Why is the change risky/not risky?]: tiny patch that fixes an obvious bug
Attachment #8833197 - Flags: review+
Attachment #8833197 - Flags: approval-mozilla-beta?
Attachment #8833197 - Flags: approval-mozilla-aurora?
Comment on attachment 8833197 [details] [diff] [review]

Regression from 52, causes a startup crash in some cases, let's uplift the fix.
Attachment #8833197 - Flags: approval-mozilla-beta?
Attachment #8833197 - Flags: approval-mozilla-beta+
Attachment #8833197 - Flags: approval-mozilla-aurora?
Attachment #8833197 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.