Closed Bug 1488515 Opened 4 years ago Closed 4 years ago

[Rabaldr ARM] Loading WASM (Blazor) site fails with unresponsive script, causing browser to become unresponsive and debugging tools fail

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

ARM
Android
defect

Tracking

()

VERIFIED FIXED
mozilla65
Tracking Status
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- verified
firefox65 --- verified

People

(Reporter: noahcuroe, Assigned: lth)

References

Details

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36

Steps to reproduce:

Attempt to load the Blazor site https://testing.sdsetup.com on Firefox for Android (tested stable, nightly).


Actual results:

A script, mono.js, becomes unresponsive, causing the page to not finish loading. This leads to the entire browser becoming unresponsive (new sites and tabs are unresponsive unless the browser is restarted), and remote debugging tools disconnect (it is because remote debugging tools fail that I am reporting this here rather than elsewhere).


Expected results:

The site should have gotten past the loading page and loaded.

Firefox Focus and other mobile browsers (ex. Chrome, Samsung Internet, Opera) load the site without issues.
Hi, thanks for your report. Tested with Huawei MediaPad M2 (Android 5.1.1), Nokia 6 (Android 7.1.1) and Nexus 9 (Android 6.0.1) on all branches and I can confirm your issue. I'll make the status as New and let developers decide the next step.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Unspecified → Android
Hardware: Unspecified → ARM
We need to investigate this, to see if there's an issue with wasm compilation or something obvious.
I can confirm that this never stops loading in today's FF Nightly on my Samsung Galaxy Tab S2 (Android 7).  The browser seems fine after the tab in question is closed, although if I close the tab before I get the slow-script dialog other tabs are unresponsive for a while before they recover (presumably some script is looping and is eventually killed and we reestablish normalcy).
If I disable Ion, the site hangs (I don't even see the dialog).
If I disable Rabaldr, the site loads properly, and fairly quickly.
Summary: Loading WASM (Blazor) site fails with unresponsive script, causing browser to become unresponsive and debugging tools fail → [Rabaldr] Loading WASM (Blazor) site fails with unresponsive script, causing browser to become unresponsive and debugging tools fail
But this is not reproducible on Linux 64-bit desktop, so it could be:

- an ARM code generation problem
- a 32-bit code generation problem (will test next)

It's probably not an Android problem because then we'd have the same problem with Ion.
Not a problem on 32-bit linux either, so we should assume it's an ARM Rabaldr problem.  Fun times for everyone (esp me).
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Component: General → Javascript: Web Assembly
Priority: -- → P2
Product: Firefox for Android → Core
Summary: [Rabaldr] Loading WASM (Blazor) site fails with unresponsive script, causing browser to become unresponsive and debugging tools fail → [Rabaldr ARM] Loading WASM (Blazor) site fails with unresponsive script, causing browser to become unresponsive and debugging tools fail
Version: Firefox 63 → Trunk
Attached file mono.wasm
Factoid: there's a wasm file, "mono.wasm"; FF quickly finishes compilation of this file (in the browser, on Android, using streaming compilation), so if there's any hanging we hang while running the content.

One explanation I have for seeing the slow-script dialog with both compilers enabled but not with baseline enabled is that the transition to Ion code allows interrupts to be seen while they are not (always?) seen in baseline-compiled code.
Actually I get the slow-script dialog with Rabaldr too, and if I choose to stop the script the browser is responsive, and if I then reload the site in a new tab it repeats that pattern.  But if I close the tab while the script is running things seem to deteriorate some, tabs are not responsive until the script has timed out (I expect), and if I then reload the site the slow-script dialog does not appear as it should.

So in summary: interrupt handling in Rabaldr seems to be ok; the script is looping in some reasonably well-defined way; there may be a Fennec problem lurking additionally, so be careful how you test this.
Some call stacks after stopping the script, courtesy of WebIDE.  Except for the leaf function these are identical, and the four are pairwise equal; at least this provides a place to start.

For the @320 stack, the stop location is mono.wasm:160481:1
For the @421 stack, the stop location is mono.wasm:172346:1


Error: Script terminated by timeout at:
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[320]:0x272e1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2261]:0x96508
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4016]:0xff6b3
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2283]:0x97628
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4045]:0x1015a1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1645]:0x73a05
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1799]:0x7ac4a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4947]:0x12698b
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2815]:0xaf119
Module._mono_wasm_load_runtime@https://testing.sdsetup.com/_framework/wasm/mono.js:1:175479
ccall@https://testing.sdsetup.com/_framework/wasm/mono.js:1:5646
cwrap/<@https://testing.sdsetup.com/_framework/wasm/mono.js:1:6032
start/</window.Module</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33391
callRuntimeCallbacks@https://testing.sdsetup.com/_framework/wasm/mono.js:1:15850
postRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:16878
doRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178429
run@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178590
runCaller@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178035
removeRunDependency@https://testing.sdsetup.com/_framework/wasm/mono.js:1:18344
start/</window.Module</</</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33223

​
Error: Script terminated by timeout at:
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[412]:0x2a13a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2261]:0x96505
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4016]:0xff6b3
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2283]:0x97628
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4045]:0x1015a1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1645]:0x73a05
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1799]:0x7ac4a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4947]:0x12698b
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2815]:0xaf119
Module._mono_wasm_load_runtime@https://testing.sdsetup.com/_framework/wasm/mono.js:1:175479
ccall@https://testing.sdsetup.com/_framework/wasm/mono.js:1:5646
cwrap/<@https://testing.sdsetup.com/_framework/wasm/mono.js:1:6032
start/</window.Module</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33391
callRuntimeCallbacks@https://testing.sdsetup.com/_framework/wasm/mono.js:1:15850
postRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:16878
doRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178429
run@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178590
runCaller@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178035
removeRunDependency@https://testing.sdsetup.com/_framework/wasm/mono.js:1:18344
start/</window.Module</</</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33223


Error: Script terminated by timeout at:
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[320]:0x272e1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2261]:0x96508
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4016]:0xff6b3
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2283]:0x97628
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4045]:0x1015a1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1645]:0x73a05
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1799]:0x7ac4a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4947]:0x12698b
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2815]:0xaf119
Module._mono_wasm_load_runtime@https://testing.sdsetup.com/_framework/wasm/mono.js:1:175479
ccall@https://testing.sdsetup.com/_framework/wasm/mono.js:1:5646
cwrap/<@https://testing.sdsetup.com/_framework/wasm/mono.js:1:6032
start/</window.Module</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33391
callRuntimeCallbacks@https://testing.sdsetup.com/_framework/wasm/mono.js:1:15850
postRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:16878
doRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178429
run@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178590
runCaller@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178035
removeRunDependency@https://testing.sdsetup.com/_framework/wasm/mono.js:1:18344
start/</window.Module</</</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33223
​

Error: Script terminated by timeout at:
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[412]:0x2a13a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2261]:0x96505
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4016]:0xff6b3
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2283]:0x97628
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4045]:0x1015a1
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1645]:0x73a05
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[1799]:0x7ac4a
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[4947]:0x12698b
@https://testing.sdsetup.com/_framework/wasm/mono.wasm:wasm-function[2815]:0xaf119
Module._mono_wasm_load_runtime@https://testing.sdsetup.com/_framework/wasm/mono.js:1:175479
ccall@https://testing.sdsetup.com/_framework/wasm/mono.js:1:5646
cwrap/<@https://testing.sdsetup.com/_framework/wasm/mono.js:1:6032
start/</window.Module</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33391
callRuntimeCallbacks@https://testing.sdsetup.com/_framework/wasm/mono.js:1:15850
postRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:16878
doRun@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178429
run@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178590
runCaller@https://testing.sdsetup.com/_framework/wasm/mono.js:1:178035
removeRunDependency@https://testing.sdsetup.com/_framework/wasm/mono.js:1:18344
start/</window.Module</</</<@https://testing.sdsetup.com/_framework/blazor.webassembly.js:1:33223
> For the @320 stack, the stop location is mono.wasm:160481:1
> For the @421 stack, the stop location is mono.wasm:172346:1

Since WebIDE unhelpfully renders wasm offsets in hex, these are

0x272E1
0x2A13A

The function at 0x272E0 is this:

(func $func320 (param $var0 i32) (param $var1 i32) (result i32)
    get_local $var1
    if (result i32)
      get_local $var0
      get_local $var1
      i32.div_u
    else
      i32.const 0
    end
  )

The function at 0x2A139 is this:

(func $func412 (param $var0 f64) (result i32)
    get_local $var0
    get_local $var0
    f64.ne
    if (result i32)
      i32.const -2147483648
    else
      get_local $var0
      f64.const 2147483648
      f64.ge
      if (result i32)
        i32.const -2147483648
      else
        get_local $var0
        f64.const -2147483649
        f64.le
        if (result i32)
          i32.const -2147483648
        else
          get_local $var0
          i32.trunc_s/f64
        end
      end
    end
  )

Although this one is probably the one we want:

(func $func2261 (param $var0 i32) (result i32)
    (local $var1 i32) (local $var2 i32) (local $var3 i32) (local $var4 i32) 
    (local $var5 i32) (local $var6 f64) (local $var7 f64)
    get_local $var0
    i32.eqz
    set_local $var4
    i32.const 8
    set_local $var1
    loop $label0
      get_local $var3
      i32.const 4
      i32.ne
      if
        get_local $var4
        i32.eqz
        if
          get_local $var0
          get_local $var3
          i32.const 2
          i32.shl
          i32.add
          get_local $var1
          i32.store
        end
        get_local $var1
        i32.const 8
        i32.add
        set_local $var1
        get_local $var3
        i32.const 1
        i32.add
        set_local $var3
        br $label0
      end
    end $label0
    i32.const 633068
    i32.load
    tee_local $var3
    get_local $var3
    i32.const 8
    call $func232
    i32.const 31
    i32.add
    i32.const 5
    i32.shr_u
    i32.const 2
    i32.shl
    i32.const 39
    i32.add
    i32.const 1073741808
    i32.and
    i32.sub
    tee_local $var5
    f64.convert_u/i32
    set_local $var7
    f64.const 32
    set_local $var6
    i32.const 32
    set_local $var2
    i32.const 4
    set_local $var3
    loop $label2
      loop $label1
        get_local $var5
        get_local $var7
        get_local $var6
        f64.div
        f64.floor
        call $func412
        call $func320
        tee_local $var1
        i32.const -8
        i32.and
        i32.const 8000
        get_local $var1
        i32.const 8000
        i32.lt_u
        select
        tee_local $var1
        get_local $var2
        i32.eq
        if
          get_local $var6
          f64.const 1.2599210498948732
          f64.mul
          set_local $var6
          br $label1
        end
      end $label1
      get_local $var4
      i32.eqz
      if
        get_local $var0
        get_local $var3
        i32.const 2
        i32.shl
        i32.add
        get_local $var1
        i32.store
      end
      get_local $var3
      i32.const 1
      i32.add
      set_local $var3
      get_local $var6
      f64.const 1.2599210498948732
      f64.mul
      set_local $var6
      get_local $var1
      i32.const 8000
      i32.lt_s
      if
        get_local $var1
        set_local $var2
        br $label2
      end
    end $label2
    get_local $var3
  )

(WebIDE is pretty buggy and prone to blipping out at any time, which is why I'm recording all this here.)
All of that is more reasonably expressed like this:

(func $func320 (param $var0 i32) (param $var1 i32) (result i32)
    return $var1 ? (div_u $var0 $var1) : 0
)

(func $func412 (param $var0 f64) (result i32)
    if (isNaN($var0))
      return -2147483648
    if ($var0 >= 2147483648)
      return -2147483648
    if ($var0 <= -2147483649)
      return -2147483648
    return i32.trunc_s/f64($var0)
  )

with the inner loop that calls those two functions expressed like this:

again:
  $var1 := $func320( $func412( floor($var7 / $var6) ), $var5 );
  $var1 := unsigned($var1) <= 8000u ? ($var1 & -8) : 8000;
  if ($var1 == $var2) {
     $var6 := $var6 * 1.2599210498948732;
     goto again
  }
The only really suspicious code here is i32.trunc_s/f64, which is known to be hairy; of the others, floor(double) is a callout to shared code, and unsigned division and floating compare and divide should also be shared.
Comparing runs on x64 and ARM, it actually looks like the f64.div is incorrect; it divides 16096 by 32 in both cases; this should yield 503 but yields (at least as far as the debugger is concerned) 32 on ARM.  That result is consistent with the operation not being performed at all, or with the result register not being written, since when we pop operands off the stack the rhs '32' is in the register that will end up being the destination register.

(It is of course possible that this is some optimization gone berserk, too, so let's not blame the ARM assembler until we've ruled out everything else.)

It doesn't /have/ to be f64.div, it could also be f64.floor picking up the wrong register.  There's no way in WebIDE to tell because we can't see the wasm eval stack.  But one of these, or the combination, does something wrong.
Attached file gendiv.js
Definitely a combination of f64.div and f64.floor, the attached program generates a module embedded in HTML that reproduces the bug.  This should print 503 but prints 32 on the ARM device.  The f64.div by itself produces the correct result, 503.
Unsurprisingly, the error does not reproduce on the emulator.
(In reply to Lars T Hansen [:lth] from comment #16)
> Unsurprisingly, the error does not reproduce on the emulator.

Actually it does repro, if we use --no-wasm-ion, as we should.

There is at least one bug here, which is that both the baseline compiler and the builtin thunk for floor() try to rectify the arguments to deal with softfp ABI constraints (floats are passed in integer registers).  Arguably this is a bug in the baseline compiler, although the code has been the way it is now since the compiler landed, so...

It is somewhat frightening that we don't have test cases that expose this problem.

Anyway I don't know if this is the problem that causes the malfunction, but it is a good first thing to fix.
That definitely *is* the problem.  The value that we're flooring is in d1 in the baseline compiler; d0 will be the result register and already has the value 32.  The baseline compiler moves d1 -> [r1, r0] for the ABI fixup; the thunk layer then moves d0 -> [r1, r0], expecting the first double arg to be in d0; this is wrong of course, d0 == 32.  So now we're flooring 32, and 32 is the result, which is moved back into d0 first by the thunk layer and then again by the baseline compiler.
Tiny patch + test case.  I have verified that this fixes the problem on the device.
Attachment #9023530 - Flags: review?(bbouvier)
We'll want to uplift this to 64 once it has landed.
Comment on attachment 9023530 [details] [diff] [review]
bug1488515-use-hardfp-for-builtins.patch

Review of attachment 9023530 [details] [diff] [review]:
-----------------------------------------------------------------

Good find and great investigation! Yeah, the thunk makes it so we don't have to worry about hard vs soft FP: https://searchfox.org/mozilla-central/source/js/src/jit/MacroAssembler.cpp#3202-3206
Thanks for the patch!
Attachment #9023530 - Flags: review?(bbouvier) → review+
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9108997903c
avoid wasm hardfp -> softfp transition twice on Android. r=bbouvier
Comment on attachment 9023530 [details] [diff] [review]
bug1488515-use-hardfp-for-builtins.patch

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: Some WebAssembly content will inexplicably hang / abort with the slow-script dialog on Android.

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): This is a well understood point fix for what is clearly a miscommunication between two layers in the wasm engine.  The fix only affects ARM.

String changes made/needed:
Attachment #9023530 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/f9108997903c
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Comment on attachment 9023530 [details] [diff] [review]
bug1488515-use-hardfp-for-builtins.patch

wasm fix for arm, approved for 64.0b9
Attachment #9023530 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Verified as fixed in build 64.0b9 and Nightly 65.0a1 (2018-11-12).

Device: Sony Xperia Z5 (Android 6.0.1)

Thank you!
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.