Closed Bug 1334239 Opened 7 years ago Closed 7 years ago

WebAssembly application crashes at launch, corrupted call stack and no crash handler giving a report

Categories

(Core :: JavaScript Engine, defect)

x86_64
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: jujjyl, Assigned: luke)

Details

Attachments

(4 files)

STR:

1. Download https://autodesk.app.box.com/s/vmil3m4d4jq1dikpm1g6uz8hgwx00ce4
2. Unzip, enable javascript.options.wasm;true pref and run stingray_webgl_release.html e.g. in python -m SimpleHTTPServer (file:// might work as well)

Observed: When the page loads, the content process will crash.

Attempted to attach VS2015 debugger to the application (had remote.autostart/e10s disabled), and the crash does also occur and VS2015 is able to catch the crash, but the callstack is completely bogus and VS2015 says that the local stack resides outside the application module, and there's very random looking code in those locations.
Doesn't crash in the shell, but causes a RuntimeError: index out of bounds.

Can repro the crash on a browser under linux x64, investigating.
Ah, I think I see the bug: we have a signature mismatch error happening when calling an imported JS function from a table (quite the corner case) and I've forgotten to call masm.wasmEmitTrapOutOfLineCode() from the GenerateImportFunction() stub.

Jukka/Alon: there does seem to be a dynamic wasm error here: we're calling the 81st imported function, _glClearColor, which is imported with signature (f64,f64,f64,f64)->void but called with signature (f32,f32,f32,f32)->().  (The caller has function index 33004.)  Btw, I see we import _emscripten_glClearColor *twice* with the signature (f32,f32,f32,f32)->void, but one of these imports is not used at all and the other is only called directly and not present in the (elem) section.
Flags: needinfo?(jujjyl)
Flags: needinfo?(azakai)
Hmm, what is the difference between

(f32,f32,f32,f32)->() versus (f32,f32,f32,f32)->void ? Is that relevant here?

In C side, glClearColor() should be

GL_APICALL void GL_APIENTRY glClearColor (GLfloat red, GLfloat green, GLfloat blue, GLfloat alpha);

where GL_APICALL and GL_APIENTRY expand to empty strings and GLfloat is float in C code. We expect to call the function _glClearColor(), and the function _emscripten_glClearColor() should not be called (but due to an inefficiency/bug, it is known to be present in the source code, https://github.com/kripken/emscripten/issues/4908). 

The function _glClearColor() is an extern import from JS. I'm not sure how it picks up the import signature in current Binaryen? We do have the glClearColor__sig: 'viiii' fields in GL code, though I think that might only apply to the glGetProcAddress() machinery.
Flags: needinfo?(jujjyl)
Oh, also, I'm surprised what could be special about glClearColor(), because a number of GL API functions take in floats/doubles, and I'd expect all of those would be wrong if this one is wrong. Is that the case?
If I do a simple test on C side with the following:

> #include <emscripten/html5.h>
> 
> extern "C" {
> // void glClearColor(double red, double green, double blue, double alpha); (*)
> void glClearColor(float red, float green, float blue, float alpha);
> void glClear(unsigned int mask);
> }
> 
> int main()
> {
>   EmscriptenWebGLContextAttributes attrs;
>   emscripten_webgl_init_context_attributes(&attrs);
>   attrs.majorVersion = 1;
>   attrs.minorVersion = 0;
>   EMSCRIPTEN_WEBGL_CONTEXT_HANDLE context = emscripten_webgl_create_context(0, &attrs);
>   emscripten_webgl_make_context_current(context);
>   glClearColor(1.f, 1.f, 0.f, 1.f);
>   glClear(0x00004000/*GL_COLOR_BUFFER_BIT*/);
>   return 0;
> }

built with

> em++ a.cpp -o a.html -O3 -g2 -s BINARYEN=1

that does work properly. What gets generated as import and call are

>   (import "env" "_glClearColor" (func $import$19 (param f64 f64 f64 f64)))

> 
>       (call $import$19
>         (f64.const 1)
>         (f64.const 1)
>         (f64.const 0)
>         (f64.const 1)
>       )
>

which occurs both in float and double case (toggling the line marked with (*) above), so it looks like at least the signature that the function is declared in C side does not affect this.
Hmm, when I use binaryen to wasm-dis that wasm, I see

 * we indeed import it once as f32 params, once as f64 params
 * the first (f32s) is never called. recent binaryen should dce it out, so if this was built with a recent version, we should investigate that
 * the second (f64s) is called from $legalfunc$_emscripten_glClearColor, which receives f32 inputs, promotes them to f64, then calls that second import (it "legalizes" the f32 type for the call out to JS, where we accept only the basic asm.js ffi types)
 * $legalfunc$_emscripten_glClearColor in turn is placed in the function table, and can be called from there. So C code would have a function pointer to a function with f32s - as that is how it is declared in C - which is $legalfunc$_emscripten_glClearColor, which then converts and calls out to the f64 import. So C code calls with f32s, the stub converts, and we call out to JS with f64s.

That all seems to be ok - I don't see a method imported one way and called another. Perhaps there is a difference between the disassembly of wasm-dis and spidermonkey's? How can I run spidermonkey's to compare?
Flags: needinfo?(azakai)
Jukka - in your last example, I think a difference from the original testcase is that glClearColor is only called by a function pointer there (maybe from glGetProcAddress or such?).
(In reply to Jukka Jylänki from comment #3)
> (f32,f32,f32,f32)->() versus (f32,f32,f32,f32)->void ? Is that relevant here?

Sorry, just I should've typed "void" in both cases.

(In reply to Alon Zakai (:azakai) from comment #6)

From what I'm seeing, _glClearColor is the 81st function import and it's imported with signature (f64,f64,f64,f64) and it's going into the (elem) section as element 600.
Attached patch word-sizeSplinter Review
Although this isn't the bug, I noticed that the sig-id comparison should be pointer-sized.
Assignee: nobody → luke
Status: NEW → ASSIGNED
Attachment #8832664 - Flags: review?(bbouvier)
Oh, right, there is both glClearColor and emscripten_glClearColor.

For glClearColor, it is imported as f64 (because we detect it that way based on usage, due to asm.js ffis), and as such we place it directly in the table, with no stub. Then a function pointer call to it with f32s is wrong, so yes, there is a problem here. Not obvious to me how to fix it yet.
Stepping through the code, I also noticed that, when we updated to 0xD, the new binary constant values of ValType implicitly disabled the packed-word optimization for signature checks (b/c the ValTypes are no longer in the range 0-4 but some big constant).  This patch does what I should've done initially and explicitly maps the four types to their packed numeric value.  It'll be interesting to see how much this wins on indirect-call-heavy workloads like bullet.
Attachment #8832665 - Flags: review?(bbouvier)
And here's the actual, trivial fix (with test).
Attachment #8832666 - Flags: review?(bbouvier)
Alon: maybe you have the binaryen bug understood, but just in case it helps, here's the error stack (with names from the names section, I see) of the indirect call signature mismatch:

__ZN9Scaleform6Render2GL11GLImmediate12glClearColorEffff
__ZN9Scaleform6Render2GL23GraphicsDeviceImmediate12glClearColorEffff
__ZN9Scaleform6Render2GL3HAL19setRenderTargetImplEPNS0_16RenderTargetDataEjRKNS0_5ColorE
__ZN9Scaleform6Render3HAL16PushRenderTargetEPNS0_12RenderTargetEjRKNS_4RectIfEENS0_5ColorE
__ZN9Scaleform6Render3HAL11PushFiltersEPNS0_15FilterPrimitiveE
__ZThn8_N9Scaleform6Render15FilterPrimitive9EmitToHALERNS0_15RenderQueueItemERNS0_20RenderQueueProcessorE
__ZN9Scaleform6Render20RenderQueueProcessor23drawProcessedPrimitivesEv
__ZN9Scaleform6Render20RenderQueueProcessor12ProcessQueueENS1_16QueueProcessModeE
__ZN9Scaleform6Render20RenderQueueProcessor5FlushEv
__ZN9Scaleform6Render3HAL5FlushEv
__ZN9Scaleform6Render3HAL10EndDisplayEv
__ZN9Scaleform6Render13TreeCacheRoot4DrawEPNS0_3HALE
__ZN9Scaleform6Render3HAL4DrawEPNS0_8TreeRootE
__ZN22scaleformstudio_plugin9S2DThread6renderEP27RenderDevicePluginArgumentsRN9Scaleform6Render11ContextImpl13DisplayHandleINS4_8TreeRootEEE
__ZN22scaleformstudio_plugin9S2DSystem6renderEP27RenderDevicePluginArguments
__ZN22scaleformstudio_plugin9S2DPlugin6renderEP27RenderDevicePluginArguments
__ZZN8stingray18resource_generator26RenderDevicePluginModifier6renderEPNS_13RenderContextEyRNS_19ResourceGeneratorIOEEN3__08__invokeEPvS7_S7_S7_
__ZN8stingray15GLRenderContext8dispatchEjPKNS_13RenderContext7CommandE
__ZN8stingray14GLRenderDevice8dispatchEjPPNS_13RenderContextEj
__ZN8stingray11RenderWorld6renderERNS0_12RenderParamsE
__ZN8stingray15RenderInterface15process_messageEPNS0_13RenderMessageE
__ZN8stingray15RenderInterface12post_messageINS_14RenderWorldMsgEEEvjPKT_
__ZN8stingray15RenderInterface12render_worldERNS_5WorldERKNS_6CameraERKNS_8ViewportERKNS_18ShadingEnvironmentEj
__ZN8stingray10plugin_api11application12render_worldEP9CApiWorldPK10CApiCameraPK12CApiViewportPK22CApiShadingEnvironmentPK10CApiWindow
__ZZN12_GLOBAL__N_121load_world_managementERN8stingray14LuaEnvironmentEEN3__78__invokeEP9lua_State
_luaD_precall
_luaV_execute
_luaD_call
_f_call
dynCall_vii
invoke_vii
_luaD_rawrunprotected
_luaD_pcall
_lua_pcall
__ZN8stingray14LuaEnvironment4callEP9lua_StateNS_15HandleErrorTypeEii
__ZN8stingray14LuaEnvironment11call_globalEPKcNS_15HandleErrorTypeENS_8LuaStackE
__ZN12_GLOBAL__N_16MyGame6renderEv
__ZN8stingray11Application6updateEv
__Z6updatePv
dynCall_vi
dynCall
browserIterationFunc
runIter
Browser_mainLoop_runner
Thanks, yeah, already managed to reduce it to a tiny testcase:

https://github.com/kripken/emscripten/commit/16bc53481f7ccecc96a5c4cc7a0443300204c3ce

Kind of surprising we didn't run into this before... but it does happen only on a js library function, which has a type not allowed in asm.js ffis (like f32), and which is called both as a function pointer and as a regular function. Only all those together hit the bug.
Comment on attachment 8832664 [details] [diff] [review]
word-size

Review of attachment 8832664 [details] [diff] [review]:
-----------------------------------------------------------------

Good catch, thanks.

Does the none architecture needs to be modified too?
Attachment #8832664 - Flags: review?(bbouvier) → review+
Comment on attachment 8832665 [details] [diff] [review]
reoptimize-sig-id

Review of attachment 8832665 [details] [diff] [review]:
-----------------------------------------------------------------

Easier to maintain indeed, thanks.

::: js/src/wasm/WasmTypes.cpp
@@ +596,5 @@
> +      case ValType::F32x4:
> +      case ValType::B8x16:
> +      case ValType::B16x8:
> +      case ValType::B32x4:
> +        break;

nit: i find it always weird to have mixed \n decisions at the end of case lines. Can you make it consistent, please? (personal preference is \n for everyone)
Attachment #8832665 - Flags: review?(bbouvier) → review+
Comment on attachment 8832666 [details] [diff] [review]
fix-table-import-mismatch

Review of attachment 8832666 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.
Attachment #8832666 - Flags: review?(bbouvier) → review+
When this gets fixed, will these kinds of invalid wasm modules fail to compile altogether? Or will they abort at runtime if they happen to perform an invalid call?
(In reply to Jukka Jylänki from comment #18)
> When this gets fixed, will these kinds of invalid wasm modules fail to
> compile altogether? Or will they abort at runtime if they happen to perform
> an invalid call?

The latter, a runtime exception will be thrown.
(In reply to Jukka Jylänki from comment #18)
Yeah, the fault here is essentially calling a C++ function with a wrong-typed func-ptr:

  void foo(float) {}
  typedef void (*PFD)(double);
  int main() {
    PFD pfd = (PFD)foo;
    pfd(1.0);
  }

This is caught at runtime with a wasm signature check at the ptr-to-fun call.
(In reply to Benjamin Bouvier [:bbouvier] from comment #15)
> Does the none architecture needs to be modified too?

Seems like no because it's marked as PER_SHARED_ARCH, I guess; there's no branch32/branchPtr in none/MacroAssembler.h.
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d1891948fe9
Baldr: use word-size comparison for signature pointer (r=bbouvier)
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb57bb8234ad
Baldr: re-enable SigIdDesc immediate optimization (r=bbouvier)
https://hg.mozilla.org/integration/mozilla-inbound/rev/223ff75af17c
Baldr: emit trap out-of-line paths in function import wrappers (r=bbouvier)
Attached patch branch.patchSplinter Review
Approval Request Comment
[Feature/Bug causing the regression]: bug 1313180
[User impact if declined]: crash in corner case error behavior
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: verified manually
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: very low
[Why is the change risky/not risky?]: tiny patch that fixes an obvious bug
Attachment #8833197 - Flags: review+
Attachment #8833197 - Flags: approval-mozilla-beta?
Attachment #8833197 - Flags: approval-mozilla-aurora?
Comment on attachment 8833197 [details] [diff] [review]
branch.patch

Regression from 52, causes a startup crash in some cases, let's uplift the fix.
Attachment #8833197 - Flags: approval-mozilla-beta?
Attachment #8833197 - Flags: approval-mozilla-beta+
Attachment #8833197 - Flags: approval-mozilla-aurora?
Attachment #8833197 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: