Closed Bug 1720650 Opened 3 years ago Closed 3 years ago

[ARM64 M1] Zoom video is rendered incorrectly

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

ARM64
macOS
defect

Tracking

()

VERIFIED FIXED
93 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- verified
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 --- verified
firefox93 --- verified

People

(Reporter: andrei.purice, Assigned: lth)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

Attached image m11.jpg

Affected Versions:
90.0 - Release
91.0b2 - Beta
92.0a1 (2021-07-14) - Nightly

Affected Platforms:
ARM64 MacOS:

miniMac: MacOS Big Sur 11.4, Mac mini (M1, 2020), Chip: Apple M1

Steps to reproduce:

  1. Have 2 participants for a video conference.
  2. Go to zoom.us and enter a chat room.
  3. Notice the video rendering of the participants.

Expected Results:
Zoom video is rendered without issues.

Actual Result:
Zoom video is rendered incorrectly.

Severity suggestion: S3
Not reproducible on Chrome.

Not reproducible on Firefox 89.0.
I will provide a regression range asap.

I can confirm this behavior on an M1 MacBook Air. Noticed while watching https://mozilla.zoom.us/j/98837869149 in browser.

I get as far as:

 8:28.03 INFO: Narrowed nightly regression window from [2021-05-05, 2021-05-07] (2 days) to [2021-05-05, 2021-05-06] (1 days) (~0 steps left)
 8:28.03 INFO: Got as far as we can go bisecting nightlies...
 8:28.03 INFO: Last good revision: cee8c3405f2ed0e2a7170113463495f5a7f868ff (2021-05-05)
 8:28.03 INFO: First bad revision: c9980e971a31b2bd47783dc4a9a26fca4a4c57d6 (2021-05-06)

After it starts downloading taskcluster builds, the builds fails to run with error WARNING: Process exited with code -9 which I'm guessing is M1 system integrity protection or signing issues since M1 is pickier about those type of things.

I'll try to bisect locally in the background while I look at other stuff.

It is the patches that landed in Bug 1687629. Marking this as a regression, and will change component.

Component: WebRTC → Javascript: WebAssembly
Regressed by: 1687629
Has Regression Range: --- → yes

(Unable to verify atm) Is the same issue exists if javascript.options.wasm_baselinejit, javascript.options.wasm_optimizingjit, or javascript.options.wasm_simd switched to "false"?

The WebAssembly team is currently working on reproducing and investigating this issue. I will mark it as a P1 for now.

Priority: -- → P1

I can confirm that turning off javascript.options.wasm_simd in about:config corrects the video issue.

(We don't allow unassigned P1 bugs; assigning to Yury for now, we'll chat on Monday.)

Assignee: nobody → ydelendik
Status: NEW → ASSIGNED
Summary: [ARM64 M1]Zoom video is rendered incorrectly → [ARM64 M1] Zoom video is rendered incorrectly

This appears to be an Ion-only problem. Leaving SIMD enabled but disabling javascript.options.wasm_optimizingjit, the site works as it should in current Nightly.

I guess it's somewhat likely that this could be a register assignment problem, but there are two options for that. The baseline compiler generally reuses one of the inputs for the output, but ion does not. So the problem could be in masm, if it has a bug that assumes this reuse; or in ion lowering/codegen, where it incorrectly specifies register constraints. Obviously there could be other problems too.

(Digression.) Given that we have one working compiler and one non-working compiler it might be possible to do something rr-like here, if we don't find the problem by other means. Wasm execution of a given program is completely dependent on:

  • the initial contents of wasm memory and globals
  • the order of calls into wasm from the host and those calls' parameters
  • the side effects of calls out of wasm on wasm memory and globals
  • the return values of the latter calls (the order of those calls will be deterministic)

We can record and replay the initial state and all external effects, and so we may be able to first run the baseline compiler on an input and then rerun ion on the same input and attempt to discover where the execution diverges. I'm not saying this will be easy, and if there are multiple modules with multiple memories in play it might even be very awkward, but it seems like an option.

I can reproduce this on aarch64-linux, when connecting to a Zoom meeting
initiated on a different machine. I then get the following very strange
symptoms:

  • running the browser normally, it can connect to the meeting, asks me if I
    want to allow audio/video, etc, and the other end "knows" it has connected,
    but I can never see a picture -- it is all black. After some seconds, the
    tab reliably crashes. If someone can tell me how to get GDB to deal with
    multiple Firefox processes, I can try and get a stack trace.

  • running the browser on valgrind (both with and without instrumentation), the
    tab never crashes (uh?!). Also, I don't get a black screen -- instead the
    "bizarre greenness" typified by the picture on comment 0 appears.

Running on valgrind slows it down so much as to show another interesting
aspect. The initial picture transmitted when the connection starts
(keyframe?) is fine. The bizarre greenness gradually seeps into the picture
over perhaps the next 20 or so frames, gradually corrupting the picture. As
if small errors in the green-channel values are creeping in and gradually
causing saturation. It doesn't happen all at once. I guess this means the
inter-key-frame updates are wrongly handled.

Valgrind plus wasm simd disabled works fine.

So what's the difference, from a SIMD insn perspective, between the real CPU
and Valgrind's simulated CPU with no instrumentation? Not much. It may be
that V does not correctly observe SIMD floating point rounding, or loses
precision sometimes.

So maybe this is some kind of floating point arithmetic exactness problem? Or
maybe it is indeed a regalloc problem that is feeding junk floating point
values (NaNs?) into the computation.

I verified that x86_64-linux is not affected.

One possibly easy starting point would be to compare the instruction selection
rules for aarch64 SIMD of baseline and Ion, to check for obvious differences.
I guess that's basically a case of looking at the sequence of MacroAssembler
calls resulting from each wasm SIMD opcode.

Assignee: ydelendik → lhansen

I've encountered a crash in an opt-debug build: https://crash-stats.mozilla.org/report/index/ddd562f0-993f-4291-af1b-528030210805. This suggests that a frame pointer is tagged incorrectly, which suggests that it could be overwritten by junk. (But it could also be some other logic error.) This is worth tracking down on its own.

I suspect that Valgrind SIMD vs hardware SIMD is a red herring probably; the black screen is more likely a permissions problem with access to the camera. I had that at least once.

I can confirm that the picture looks fine initially and that it becomes broken gradually, in the following way. Sit perfectly still while connecting the two systems, and make sure you're out of view of one of the cameras. Both pictures on both computers will look fine. Now blink. The picture of you on the computer where you're not visible to the camera will turn to red mush instantly. Nothing happens to the picture of not-you on the computer where you are visible, at least not yet, though over time some green may seep in and then spread slowly. If you produce movement to the camera where you are initially not visible, the green spreads from the area around the movement until it fills the screen.

(M1 laptop on one end, x64 laptop on the other.)

Tested with Julian with an M1 on either end; basically same behavior. There is a picture at the right scale - we can both perceive vague shapes of the right size, and the other person waving and so on - but it is very attenuated and also very green, and then the occasional crash.

It is worth noting once again that this is Ion-only, so a problem with spilling / restoring data is a candidate along with register allocation / management.

See comment block in WasmGenerator.cpp in this patch for more
information.

ARM64 code generation for SIMD narrowing is incorrect in the case when rhs == dest: sqxtun+sqxtun2 must always be executed in that order because the former clears the upper part of the destination register while the latter leaves the low bits of the destination untouched (subtle). But when rhs == dest we do them in the other order to avoid a move. This is wrong.

I don't know if this is the only bug, but it is for sure a bug: bisection pointed to a function that used this operation and the generated code is clearly wrong.

Fixing the narrowing operation fixes the rendering problem. It does not fix the crash, though. Crashes are frequent enough that the service is basically unusable. But I'll probably open a new bug for it.

The ARM64 architecture uses two instructions to implement SIMD narrowing,
SQXTN and SQXTN2 (and their unsigned-saturation variants). These are not
quite symmetric: the former will clear the upper bits of the output register
and set the lower bits, while the latter will just set the upper bits.
To implement an across-the-vector narrowing, the former must be executed
before the latter.

For some operations, we execute them in the other order to save a move.
But this is incorrect, so this patch reintroduces the move when it
is required. This effectively makes the three-address operations the
same as the two-address operations and it's possible we want to
propagate the fix higher up in the stack to avoid the duplication of
code, but for now a local fix seems best.

Blocks: 1725638

We'll need uplifts to both beta and esr91.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2e3e1b631c62
Fix codegen problem in narrowing.  r=yury
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch

Comment on attachment 9236171 [details]
Bug 1720650 - Fix codegen problem in narrowing. r?yury

Beta/Release Uplift Approval Request

  • User impact if declined: Wasm code using wasm SIMD on arm64 will compute the wrong results; applications such as the Zoom in-browser client become unusable.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: See comment 0. It was easy to reproduce so we have tested this ourselves (two of us have tested the fix independently); QE attention needs are probably light.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Fairly obvious codegen bug with fairly obvious fix.
  • String changes made/needed:

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: A codegen bug in wasm SIMD on aarch64 will make certain wasm applications unusable; some of these applications are important.
  • User impact if declined: See beta uplift request.
  • Fix Landed on Version: FF93
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): See beta uplift request.
  • String or UUID changes made by this patch:
Attachment #9236171 - Flags: approval-mozilla-esr78?
Attachment #9236171 - Flags: approval-mozilla-beta?
Flags: qe-verify+

Comment on attachment 9236171 [details]
Bug 1720650 - Fix codegen problem in narrowing. r?yury

Approved for 92.0b4. We'll revisit the ESR91 request once this has had a bit more bake time.

Attachment #9236171 - Flags: approval-mozilla-esr91?
Attachment #9236171 - Flags: approval-mozilla-esr78?
Attachment #9236171 - Flags: approval-mozilla-beta?
Attachment #9236171 - Flags: approval-mozilla-beta+
QA Whiteboard: [qa-triaged]

Reproduced the issue on the affected Nightly 92.0a1 on M1 Macbook Air. Verified-fixed on the latest Firefox Nightly 93.0a1 (2021-08-15) and Firefox Beta 92.0b4, the video recording no longer displays graphical corruptions.
Waiting for this to bake long enough to be uplifted to ESR and verify further.

Comment on attachment 9236171 [details]
Bug 1720650 - Fix codegen problem in narrowing. r?yury

Approved for 91.1esr.

Attachment #9236171 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

Verified-fixed on the latest Firefox ESR 91.1.0esr on MacOS 11 M1 MacBook Air.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: