Closed Bug 1725638 Opened 3 years ago Closed 3 years ago

[ARM64 M1] Frequent tab crashes with zoom video

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

ARM64
macOS
defect

Tracking

()

VERIFIED FIXED
93 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 92+ verified
firefox91 --- wontfix
firefox92 + verified
firefox93 + verified

People

(Reporter: lth, Assigned: jseward)

References

Details

(Keywords: sec-high, Whiteboard: [sec-survey][adv-main92+r][adv-esr91.1+r])

Attachments

(1 file)

Spun off from bug 1720650 with same steps to reproduce, basically, zoom in the browser will crash the tab with high frequency - sometimes within seconds, sometimes within minutes. For more information and a crash dump, see bug 1720650 comment 14. Mostly these crashes do not leave crash dumps, but the times when I see backtraces in the console they look like the one reported there.

(Point of note, it is not possible to build Firefox in debug-noopt mode on the M1, the build dies with a linker error because the distance between a jump and its target is too great. This makes debugging this problem more complex.)

Triage (will be updated):

One thing I've noticed is that this tends to crash quickly after I start talking... even if the mic is off. But without sound in the room it can stay up for a while. But for example, if the computer running zoom in the browser makes a noise in some way, the tab will often crash.

Julian makes the point that we don't know if this is ARM64 specific.

I played around with this a bit. Causing the crash is difficult; it happens at
seemingly random times after connection to a meeting. It has the feel of
a threading or at least timing-related bug. I did establish however that

  1. x86_64-linux is unaffected; at least, I can't get an x86_64 build to crash.

  2. On arm64-linux, it feels like it crashes more often in an assertion-enabled
    build, although that may be an illusion.

  3. On arm64-linux, it requires Ion to crash. That is,
    javascript.options.wasm_optimizingjit must be true in order to get a crash.

FWIW, a number of the times when I've run into this it's because an assertion has fired, so it would make sense if assertion-enabled builds feel more crashy. The one crash dump I have (https://crash-stats.mozilla.org/report/index/ddd562f0-993f-4291-af1b-528030210805) has such a crash, but stack traces printed in the console have effectively been equivalent to the one in the dump.

More initial triage results. To get it to crash requires all 3 of the
following:

  1. arm64 -- x86_64 will not crash
  2. ion -- javascript.options.wasm_optimizingjit=true, .wasm_baselinejit=false
  3. simd enabled -- javascript.options.wasm_simd must be true

It will then crash (segfault) after a few seconds or minutes in
js::wasm::GetNearestEffectiveTls. Adding debug printing shows there are two
different ways it can crash:

  1. fp points at unmapped memory. Hence fp->callerIsExitOrJitEntryFP()
    segfaults.

  2. fp points at something accessible, but for whatever reason, the assertion
    !LookupCode(fp->returnAddress()) fails, meaning that LookupCode
    returned non-null when it should have returned null.

This would explain why it fails in both debug and non-debug builds, but with
the debug builds more easily.

  1. is particularly interesting, because the failing fp values always have a
    particular form, eg:
fp = 0xa500a600a700a8
fp = 0xc700cc00cf00cc
fp = 0x10001200130016
fp = 0xb500ba00be00ba

This might lead one to wonder whether a saved FP value (presumably in the
previous frame) has been overwritten by a SIMD value.

It seems also to me that, once the tab has crashed, it will crash again much
more quickly if one clicks on "Restore this tab!".

Here's a bug that should be fixed regardless, though I doubt it matters for this problem:

diff --git a/js/src/jit/arm64/MoveEmitter-arm64.cpp b/js/src/jit/arm64/MoveEmitter-arm64.cpp
--- a/js/src/jit/arm64/MoveEmitter-arm64.cpp
+++ b/js/src/jit/arm64/MoveEmitter-arm64.cpp
@@ -134,7 +134,7 @@ void MoveEmitterARM64::emitSimd128Move(c
   }
 
   vixl::UseScratchRegisterScope temps(&masm.asVIXL());
-  const ARMFPRegister scratch = temps.AcquireD();
+  const ARMFPRegister scratch = temps.AcquireQ();
   masm.Ldr(scratch, toMemOperand(from));
   masm.Str(scratch, toMemOperand(to));
 }
Group: core-security
Keywords: sec-high

At least one important bug: MoveResolver (in MoveEmitterARM64::emit) misallocates space for SIMD spill slot and scribbles on the frame pointer (not the diff shown above, though that should be fixed too). Plausibly this can be used to trick carefully controlled code into forging the frame pointer, which can be used for arbitrary memory reads and/or to redirect the code pointer. This is going to need an uplift to 92 and ESR91.

Assignee: lhansen → jseward

Presumably (thinking out loud), the Architecture file on each platform ought to define the spill slot size, this should not be hidden down in the move emitter like it is now. This would sensibly be the max of sizeof(Registers::RegisterContent) and sizeof(FloatRegisters::RegisterContent).

(In reply to Lars T Hansen [:lth] from comment #6)

At least one important bug: MoveResolver (in MoveEmitterARM64::emit) misallocates
space for SIMD spill slot [..]

FTR, the smoking gun there, that we found, is

   ;; What's with q4/q3 here?  16-byte store/load but PSP only moves by 8
   0x00002fd60bc94404:  sub     x28, x28, #0x8
   0x00002fd60bc94408:  mov     sp, x28
   0x00002fd60bc9440c:  str     q4, [x28]
   0x00002fd60bc94410:  mov     v4.16b, v6.16b
   0x00002fd60bc94414:  mov     v6.16b, v2.16b
   0x00002fd60bc94418:  mov     v2.16b, v1.16b
   0x00002fd60bc9441c:  mov     v1.16b, v3.16b
   0x00002fd60bc94420:  ldr     q3, [x28]
   0x00002fd60bc94424:  add     x28, x28, #0x8
   0x00002fd60bc94428:  mov     sp, x28
Attached file Bug 1725638. r=lth. —

Attached file Bug 1725638 - Fix Zoom problems. r=lth.

This fixes two problems relating to translation of MoveGroups on AArch64:

  1. MoveEmitterARM64::emitSimd128Move: in the case where both source and
    target are in memory, we requested a D-class (64-bit) register when in fact
    a Q-class (128-bit) register should have been requested. It's unclear
    whether this in fact causes a problem -- it may be that the emission of the
    associated instruction ignores this. But if it doesn't, then we really do
    need to explicitly request a Q-class reg.

  2. MoveEmitterARM64::toMemOperand: this reserves sizeof(void*) (8) bytes
    of stack space for a move that needs to go via memory. This is obviously
    wrong for a 128-bit SIMD move, and will cause the bottom 8 bytes of the
    stack to be overwritten, per comments above. The patch changes that size
    to Simd128DataSize (16), which is correct and avoids the overwrite.

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: In order to understand the significance and exploitability of the flaw, an
    attacker would have to know quite a lot about register allocation in
    optimising compilers. Once they had got that, however, it would seem to us
    plausible to develop an exploit that (per comment 6) could overwrite the
    stored frame pointer, then do arbitrary reads and/or redirect execution
    elsewhere.

But only on aarch64 platforms, including aarch64-Android. Not on x86/x64.

  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: firefox90, firefox91, firefox92(=beta), esr91.
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?:
  • How likely is this patch to cause regressions; how much testing does it need?: It's very simple (a 2-liner); it affects only wasm on AArch64; it has been
    tested fairly extensively by hand. I think it is unlikely to cause
    regression.
Attachment #9237026 - Flags: sec-approval?

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

Beta/Release Uplift Approval Request

  • User impact if declined: Worst case -- (assuming skilled attacker etc): On AArch64 targets (MacOS,
    Android) -- remote code execution as a result of loading a malicious
    WebAssembly payload.

For definite -- crashing of tab for Zoom's in-browser (wasm-based) client,
making it unusable.

Potential -- crashing of tab for any application that uses wasm SIMD on
AArch64.

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: Start a Zoom meeting using some other device/client, and get hold of the
    meeting ID.

With an AArch64/MacOS build of Fx, check you have
javascript.options.wasm_optimizingjit = true.

Go to https://www.zoom.us/join and enter the meeting ID. But then don't click
on "Launch Meeting"; instead scroll down and click on "Join from Your
Browser". Once the browser connects to the meeting, the tab will crash at
some random point, but almost certainly within about 5 minutes of connecting.
A debug (assertions-enabled) build will crash more readily.

  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's very simple (a 2-liner); it affects only wasm on AArch64; it has been
    tested fairly extensively by hand. I think it is unlikely to cause
    regression.
  • String changes made/needed: none
Attachment #9237026 - Flags: approval-mozilla-beta?
Flags: qe-verify+

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Is sec:high
  • User impact if declined: Worst case -- (assuming skilled attacker etc): On AArch64 targets (MacOS,
    Android) -- remote code execution as a result of loading a malicious
    WebAssembly payload.

For definite -- crashing of tab for Zoom's in-browser (wasm-based) client,
making it unusable.

Potential -- crashing of tab for any application that uses wasm SIMD on
AArch64.

  • Fix Landed on Version: 93
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's very simple (a 2-liner); it affects only wasm on AArch64; it has been
    tested fairly extensively by hand. I think it is unlikely to cause
    regression.
  • String or UUID changes made by this patch: none
Attachment #9237026 - Flags: approval-mozilla-esr91?
Group: core-security → javascript-core-security

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

Approved to land.

[No security hat] I don't think the commit message is very good - Zoom (the video conferencing site) is something that triggers the issue but it's not the real problem. I assumed for a bit that this was a fullscreen/APZ-style zoom issue. I think even a blank commit message would be better; and it has the benefit of not implicating the reproducing site.

Attachment #9237026 - Flags: sec-approval? → sec-approval+

(In reply to Tom Ritter [:tjr] (Intermittently available) from comment #14)

I think even a blank commit message would be better; and it has the benefit of not implicating the reproducing site.

I agree.

Good point; will redo patch with a blank commit message.

Attachment #9237026 - Attachment description: Bug 1725638 - Fix Zoom problems. r=lth. → Bug 1725638. r=lth.

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

Approved for 92.0b8.

Attachment #9237026 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Group: javascript-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
QA Whiteboard: [qa-triaged]

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(jseward)
Whiteboard: [sec-survey]

Comment on attachment 9237026 [details]
Bug 1725638. r=lth.

Approved for 91.1esr.

Attachment #9237026 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

(In reply to Lars T Hansen [:lth] from comment #7)

Presumably (thinking out loud), the Architecture file on each platform ought to define the spill slot size, this should not be hidden down in the move emitter like it is now. This would sensibly be the max of sizeof(Registers::RegisterContent) and sizeof(FloatRegisters::RegisterContent).

Lars, did we fix this within the patch here?
It seems not. Can you please file a follow-up if appropriate.

Flags: needinfo?(lhansen)

(In reply to Frederik Braun [:freddy] from comment #23)

Can you please file a follow-up if appropriate.

Done, bug 1727698.

Flags: needinfo?(lhansen)

Hello,

I tried to reproduce this issue on my MacBook Air M1 2020 on Fx 93.0a1 (BuildID: 20210813092746) and Fx 92.0a1 (Build ID: 20210715094037) (the build from the original bug) and I cant reproduce this issue.

@Lars maybe you can take a look and confirm this issue is fixed on Firefox 92.0b8 and 93.0a1 and 91.1esr?

Flags: needinfo?(lhansen)

Hello,

With help from @jseward we managed the reproduce the bug on Fx 93.0a1 (BuildID: 20210813092746).

I can confirm that the issue is fixed on Fx 93.01 (Latest Nightly) and Fx 92.0b8. I couldn't test on 91.1esr treeherder build because mac OS says its broken.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(lhansen)
Whiteboard: [sec-survey] → [sec-survey][adv-main92+r]
Whiteboard: [sec-survey][adv-main92+r] → [sec-survey][adv-main92+r][adv-esr91.0.1+r]
Whiteboard: [sec-survey][adv-main92+r][adv-esr91.0.1+r] → [sec-survey][adv-main92+r][adv-esr91.1+r]
Flags: needinfo?(jseward)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: