Closed Bug 1866396 Opened 8 months ago Closed 4 months ago

[Linux, aarch64] Content process crashes making the browser unusable.

Categories

(Core :: Security: Process Sandboxing, defect, P1)

Firefox 122
ARM64
Linux
defect

Tracking

()

RESOLVED FIXED
125 Branch
Tracking Status
firefox120 + wontfix
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix
firefox124 --- fixed
firefox125 --- fixed

People

(Reporter: pbone, Assigned: pbone)

References

Details

Attachments

(2 files, 1 obsolete file)

While investigating Bug 1866025 I ran into content process crashes on Linux on Apple Silicon. I don't have a backtrace yet.

Thread 1 "Web Content" received signal SIGSYS, Bad system call.
0x0000ffff7d37fc68 in clone3 () from /lib64/libc.so.6

#0 0x0000ffff7d37fc68 in clone3 () from /lib64/libc.so.6
#1 0x0000ffff7d37fb24 in __clone3_internal () from /lib64/libc.so.6
#2 0x0000ffff7d37fbc8 [PAC] in __clone_internal () from /lib64/libc.so.6
#3 0x0000ffff7d3101ac [PAC] in create_thread () from /lib64/libc.so.6
#4 0x0000ffff7d310ba0 [PAC] in pthread_create@GLIBC_2.17 () from /lib64/libc.so.6
#5 0x0000aaab9f24747c [PAC] in pthread_create (thread=thread@entry=0xffffef56fad8,
attr=attr@entry=0xffffef56fae0, start_routine=0xffff7d241f68 <_pt_root>, arg=arg@entry=0xffff7c5a1660)
at /mnt/dev/moz/unified/mozglue/interposers/pthread_create_interposer.cpp:99
#6 0x0000ffff7d23bd4c in _PR_CreateThread (type=PR_USER_THREAD,
start=0xffff6b909720 <nsThread::ThreadFunc(void*)>, arg=0xffff69b4c1a0, priority=<optimized out>,
scope=<optimized out>, state=PR_JOINABLE_THREAD, stackSize=262144, isGCAble=<optimized out>)
at /mnt/dev/moz/unified/nsprpub/pr/src/pthreads/ptthread.c:458
#7 0x0000ffff6b90b27c in nsThread::Init (this=this@entry=0xffff7c57d7c0, aName=...)
at /mnt/dev/moz/unified/xpcom/threads/nsThread.cpp:619
#8 0x0000ffff6b913524 in nsThreadManager::NewNamedThread (this=<optimized out>, aName=...,
aOptions=aOptions@entry=..., aResult=aResult@entry=0xffffef56fcd0)
at /mnt/dev/moz/unified/xpcom/threads/nsThreadManager.cpp:597
#9 0x0000ffff6b91b2c8 in NS_NewNamedThread (aName=..., aResult=aResult@entry=0xffff6a2867f0,
aInitialEvent=..., aOptions=aOptions@entry=...)
at /mnt/dev/moz/unified/xpcom/threads/nsThreadUtils.cpp:176
#10 0x0000ffff6bc367f0 in NS_NewNamedThread<15ul> (aName=..., aResult=0xffff6a2867f0,
aInitialEvent=<optimized out>, aOptions=...)
at /mnt/dev/moz/unified/obj-aarch64-unknown-linux-gnu/dist/include/nsThreadUtils.h:87
#11 0x0000ffff6fac937c in mozilla::ProcessHangMonitor::ProcessHangMonitor (this=<optimized out>)
at /mnt/dev/moz/unified/dom/ipc/ProcessHangMonitor.cpp:1189
#12 0x0000ffff6faca2b0 in mozilla::ProcessHangMonitor::GetOrCreate ()
at /mnt/dev/moz/unified/dom/ipc/ProcessHangMonitor.cpp:1215

If I disable sandboxing then Firefox works.

Component: Untriaged → Security: Process Sandboxing
Product: Firefox → Core
Version: unspecified → Firefox 122

The Arch Linux ARM package is built with --without-wasm-sandboxed-libraries. The line was added with the ominous commit message "extra/firefox: fix"

I don't see anything related to the sandbox in the Fedora firefox.spec but I haven't yet looked at the patches.

(In reply to Janne Grunau from comment #2)

The Arch Linux ARM package is built with --without-wasm-sandboxed-libraries. The line was added with the ominous commit message "extra/firefox: fix"

I don't see anything related to the sandbox in the Fedora firefox.spec but I haven't yet looked at the patches.

I think the wasm option is a different kind of sandboxing. I had to give that option also because I didn't have/couldn't easily make a working wasm32-wasi toolchain. It kept failing during compilation without --without-wasm-sandboxed-libraries.

[Tracking Requested - why for this release]:

I can reproduce this on 120-122, but I don't know whether it is a change in the OS or a regression in Firefox.

I believe this bug is INVALID:

"SIGSYS, Bad system call." is expected during the normal operation of a sandboxed process. Your gdb should not stop on it if it's properly configured, i.e. when started via ./mach run --debug or similar.

Please check your GDB configuration. You might find that this "crash" is purely GDB stopping in a signal and Firefox itself works correctly.

Starting gdb via mach will load build/.gdbinit which contains:

handle SIGSYS noprint nostop pass

See also https://firefox-source-docs.mozilla.org/contributing/debugging/debugging_firefox_with_gdb.html#i-keep-getting-a-sig32-or-sigsegv-in-js-jit-code-under-gdb-even-though-there-is-no-crash-when-gdb-is-not-attached-how-do-i-fix-it

Flags: needinfo?(pbone)

I can't reproduce this crash using ALARM's Firefox 120 PKGBUILD + my patch for Bug 1866025. The unpatched build works on devices with ~8GB RAM since that disables PHC (actual RAM size available to Linux is ~7.6GB).
Fedora's packaged Firefox with disabled jemalloc/PHC doesn't crash either.

Is there anything specific needed to reproduce this? My understanding was that this also crashed on startup.

:pbone, based on comment 6, is this invalid? We will track this bug as requested in the meantime.

Hi GCP, thanks for your info on Matrix last week. I was confused about the signal handler. However, I still think this is a real bug, or at least a problem that's not related to GDB, since the bug reproduces when I'm not using GDB.

STR:

  • Install Asahi Linux on an M1 Mac Mini.
  • Build Firefox from Mozilla's sources.
    ** This takes some messing around and at least requires Bug 1866400 and this mozconfig:
    ac_add_options --enable-optimize
    ac_add_options --enable-debug
    ac_add_options --with-libclang-path=/usr/lib64
    ac_add_options --without-wasm-sandboxed-libraries

This was necessary for Firefox to build

  • Run firefox with ./mach run. No debugger is involved.
  • The window opens but the content processes crash constantly.
  • Set security.sandbox.content.level to 0.
  • The content processes no longer crash.

With debugger

About the debugger. The above symptoms were without the debugger. Below is with a debugger. Because the content processes crash so early I started firefox with MOZ_DEBUG_CHILD_PROCESS=20 ./mach run so that new processes would announce their pid then sleep for 20 seconds while I attached GDB with gdb -p $pid. After attaching to the process I pressed c for continue to let the execution proceed since it was stopped as the debugger attached. The process ran, first waiting until the end of the sleep then crashing with:

Program terminated with signal SIGSYS, Bad system call.
The program no longer exists.

This message is why I was looking at the SIGSYS handler.

I was initially confused about the purpose of the signal and the signal handler involved in sandboxing. However, I still think this is a real bug because:

  • It crashes without a debugger attached
  • It crashes when sandboxing is enabled, but it's okay when sandboxing is disabled.

If I repeat the steps and attach GDB to a child process then:

  • handle SIGSYS stop
  • c
  • The program stops in clone3. If I continue again (or even single-step) then it crashes as above.
  • Something I don't know yet is why, when there are multiple calls to pthread_create() (which I can test using a breakpoint), not all of them generate SIGSYS or crash.

GDB config

I now have

add-auto-load-safe-path /home/paul/dev/moz
set debuginfod enabled on

in my .gdbinit and getting the symbols from the system libraries is automatic now.

Other builds

Also, the pre-compiled Asahi Linux version of Firefox works fine for me, so something is different in how it's patched, built, or configured that makes it work when the one I build crashes. The Linux aarch64 builds from Treeherder also crash for me in the same way. It might be helpful for someone else to try to reproduce this from their own build or the Treeherder Linux aarch64 builds.

Flags: needinfo?(pbone)

The problem is that clone3 passes all its arguments via a pointer to a struct, so we can't inspect them in a seccomp-bpf filter (e.g., to ensure that the call is only creating a new thread and not a new process). When clone3 was being developed this was pointed out as an issue, and my recollection from reading through the archived mailing list threads a while ago is that there was some speculation about how seccomp could be extended to handle it, but the design issues were nontrivial, so in the end none of that happened and clone3 support landed anyway.

Normally, glibc will fall back to regular clone, which seccomp can handle (and this may be one reason why the kernel devs weren't more concerned about it). But I seem to recall it has a configure option for minimum supported kernel version, and if this is used then it will strip out any fallbacks that “shouldn't” be needed on that kernel version, and the clone3 → clone fallback seems to be one of them; see __ASSUME_CLONE3 in the glibc source.
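
To make the fallback behaviour concrete, here is a minimal Python simulation. It is not glibc's code: fake_clone3, fake_clone, create_thread and KilledBySigsys are hypothetical names used only for illustration. It models three ways a sandbox might treat clone3 (allow it, fail it with ENOSYS, or trap it) and shows that only the ENOSYS case reaches the clone fallback.

```python
ENOSYS = 38  # Linux errno value for "function not implemented"


class KilledBySigsys(Exception):
    """Hypothetical stand-in for the process being killed by SIGSYS."""


def fake_clone3(policy):
    """Model clone3 under a given sandbox policy (illustrative only)."""
    if policy == "allow":
        return 1234            # pretend thread id
    if policy == "errno":
        return -ENOSYS         # the filter makes the syscall fail with ENOSYS
    if policy == "trap":
        raise KilledBySigsys("clone3 trapped by the filter")
    raise ValueError(policy)


def fake_clone():
    """Model the older clone syscall, whose flags a filter can inspect."""
    return 5678                # pretend thread id


def create_thread(policy):
    """Try clone3 first; fall back to clone only when clone3 reports ENOSYS."""
    result = fake_clone3(policy)
    if result == -ENOSYS:
        return ("clone", fake_clone())
    return ("clone3", result)
```

With the "errno" policy the caller never notices the failure, because the retry with clone succeeds. With "trap" the exception propagates, which stands in for the crash described in this bug.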

See also https://crbug.com/1213452 on this topic. In particular, Chromium also disallows clone3, so I'd expect Chrome to have the same problem.

It's potentially interesting that Asahi's own builds aren't affected when ours are, given the same kernel and libc, but it's also possible that they just patched the sandbox policy to allow clone3 unconditionally, which as mentioned I'm not comfortable with. I'd need to find out more about how their Firefox packages are built. It's also possible, but a little tedious, to find out if this is the case without inspecting the source by running with MOZ_SANDBOX_LOGGING=1 set and tracing through the disassembly of the seccomp filter.

One piece of information that could be useful: if you run /lib64/libc.so.6 as an executable with no arguments, it will print out a few pieces of information including the minimum supported kernel.

Also, I could've missed something but I don't think it's been mentioned yet: is this Asahi with the original Arch-based userland or the newer Fedora version?

And, if you can run the working browser with MOZ_SANDBOX_LOGGING=1 (and maybe also the broken browser for comparison) and attach the BPF disassembly from stderr, I can try to see if they're handling the syscall differently.

For example, this is the relevant part of how Debian's firefox package handles syscall 435:

  4) LOAD 0  // System call number
  7) if A >= 0x89; then JMP 8 else JMP 74
  8) if A >= 0x10d; then JMP 9 else JMP 42
  9) if A >= 0x12f; then JMP 10 else JMP 26
 10) if A >= 0x145; then JMP 11 else JMP 19
 11) if A >= 0x14f; then JMP 12 else JMP 16
 12) if A >= 0x1b4; then JMP 13 else JMP 15
 15) if A >= 0x1b3; then JMP 241 else JMP 175
241) RET 0x50026  // errno = 38
Flags: needinfo?(pbone)
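
For anyone reading these dumps, the RET value packs an action in the upper 16 bits and a data field in the lower 16 bits. The following sketch decodes it using the constants from the kernel's seccomp UAPI header; decode_seccomp_ret is a hypothetical helper name, not part of any Firefox or Chromium tooling.

```python
# Masks and action values as defined in <linux/seccomp.h>.
SECCOMP_RET_ACTION_FULL = 0xFFFF0000
SECCOMP_RET_DATA = 0x0000FFFF

ACTIONS = {
    0x80000000: "KILL_PROCESS",
    0x00000000: "KILL_THREAD",
    0x00030000: "TRAP",
    0x00050000: "ERRNO",
    0x7FC00000: "USER_NOTIF",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
    0x7FFF0000: "ALLOW",
}


def decode_seccomp_ret(value):
    """Split a filter return value into (action name, data)."""
    action = value & SECCOMP_RET_ACTION_FULL
    data = value & SECCOMP_RET_DATA
    return ACTIONS.get(action, hex(action)), data


# The Debian filter's result for syscall 435:
print(decode_seccomp_ret(0x50026))
```

For 0x50026 this gives the ERRNO action with data 38, which matches the "errno = 38" annotation in the dump (38 is ENOSYS on Linux).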

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #12)

One piece of information that could be useful: if you run /lib64/libc.so.6 as an executable with no arguments, it will print out a few pieces of information including the minimum supported kernel.

paul@calcium:~$ /lib64/libc.so.6 
GNU C Library (GNU libc) stable release version 2.38.
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 13.2.1 20231011 (Red Hat 13.2.1-4).
libc ABIs: UNIQUE ABSOLUTE
Minimum supported kernel: 3.7.0
For bug reporting instructions, please see:
<https://www.gnu.org/software/libc/bugs.html>.

Also, I could've missed something but I don't think it's been mentioned yet: is this Asahi with the original Arch-based userland or the newer Fedora version?

The Fedora version. I took a guess at what might be needed to reproduce the first bug I was investigating and picked Fedora.

And, if you can run the working browser with MOZ_SANDBOX_LOGGING=1 (and maybe also the broken browser for comparison) and attach the BPF disassembly from stderr, I can try to see if they're handling the syscall differently.

The Fedora/Asahi version (works; note this is Firefox 120):

  4) LOAD 0  // System call number
  5) if A >= 0x8c; then JMP 6 else JMP 59
  6) if A >= 0xd8; then JMP 7 else JMP 33
  7) if A >= 0x10d; then JMP 8 else JMP 21
  8) if A >= 0x124; then JMP 9 else JMP 15
  9) if A >= 0x1b3; then JMP 10 else JMP 13
 10) if A >= 0x1b7; then JMP 11 else JMP 12
 12) if A >= 0x1b4; then JMP 160 else JMP 232
232) RET 0x50026  // errno = 38

The Mozilla version (crashes), from mozilla-central CI:

  4) LOAD 0  // System call number
  5) if A >= 0x88; then JMP 6 else JMP 57
  6) if A >= 0xca; then JMP 7 else JMP 32
  7) if A >= 0xf3; then JMP 8 else JMP 20
  8) if A >= 0x118; then JMP 9 else JMP 15
  9) if A >= 0x123; then JMP 10 else JMP 13
 10) if A >= 0x125; then JMP 11 else JMP 12
 11) if A >= 0x126; then JMP 155 else JMP 146
155) JMP 415
415) RET 0x30001  // Trap #1
Flags: needinfo?(pbone)
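
The two excerpts can be walked by hand for syscall 435 (0x1b3, clone3). The sketch below transcribes only the instructions shown above into Python tables and follows the jumps; the tuple encoding and the run function are my own illustration, and any syscall number whose path leaves the excerpt will raise KeyError because those instructions were not posted.

```python
CLONE3 = 435  # 0x1b3

# index -> ("ge", threshold, if_true, if_false) | ("jmp", target) | ("ret", value)
FEDORA = {
    5: ("ge", 0x8C, 6, 59),
    6: ("ge", 0xD8, 7, 33),
    7: ("ge", 0x10D, 8, 21),
    8: ("ge", 0x124, 9, 15),
    9: ("ge", 0x1B3, 10, 13),
    10: ("ge", 0x1B7, 11, 12),
    12: ("ge", 0x1B4, 160, 232),
    232: ("ret", 0x50026),   # errno = 38
}

MOZILLA = {
    5: ("ge", 0x88, 6, 57),
    6: ("ge", 0xCA, 7, 32),
    7: ("ge", 0xF3, 8, 20),
    8: ("ge", 0x118, 9, 15),
    9: ("ge", 0x123, 10, 13),
    10: ("ge", 0x125, 11, 12),
    11: ("ge", 0x126, 155, 146),
    155: ("jmp", 415),
    415: ("ret", 0x30001),   # Trap #1
}


def run(program, syscall_nr, start=5):
    """Follow the comparison tree until a RET instruction is reached."""
    pc = start
    while True:
        op = program[pc]
        if op[0] == "ret":
            return op[1]
        if op[0] == "jmp":
            pc = op[1]
        else:
            _, threshold, if_true, if_false = op
            pc = if_true if syscall_nr >= threshold else if_false


print(hex(run(FEDORA, CLONE3)), hex(run(MOZILLA, CLONE3)))
```

In the Fedora tree, 435 is singled out between the 0x1b3 and 0x1b4 comparisons and ends at the errno return. In the Mozilla tree the last comparison is against 0x126 (294), so every number from 294 upward, including 435, ends at the trap.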

Here's some strace output. I see that the process sets up some signal handling and seccomp, does a few more system calls and then crashes on the very first clone3. Maybe it's useful because we can check clone3's args.

rt_sigaction(SIGSYS, NULL, {sa_handler=0xffff49f6730c, sa_mask=[], sa_flags=SA_NODEFER|SA_SIGINFO}, 8) = 0
rt_sigaction(SIGSYS, {sa_handler=0xffff49f9e924, sa_mask=[], sa_flags=SA_NODEFER|SA_SIGINFO}, NULL, 8) = 0
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, {len=460, filter=0xffff36271000}) = 0
write(6, "C", 1)                        = 1
read(6, "O", 1)                         = 1
close(6)                                = 0
close(36)                               = 0
mmap(NULL, 327680, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xffff358b0000
mprotect(0xffff358c0000, 262144, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0xffff358ff210, parent_tid=0xffff358ff210, exit_signal=0, stack=0xffff358b0000, stack_size=0x4ea00, tls=0xffff358ff880} => {parent_tid=[0]}, 88) = -460377360
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_errno=EPERM, si_call_addr=0xffff49a5fc68, si_syscall=__NR_clone3, si_arch=AUDIT_ARCH_AARCH64} ---
+++ killed by SIGSYS (core dumped) +++

I note that clone3's stack size isn't divisible by 16K. However, there are other calls to clone3 before seccomp gets set up that use the same stack size and succeed.
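
As a quick check of those arguments, the sketch below rebuilds the flags word from the names strace printed and tests whether it describes a thread rather than a new process. The flag values are the ones from the kernel's sched.h; flags_value and looks_like_thread_creation are hypothetical helpers, and the "thread" test (CLONE_VM, CLONE_SIGHAND and CLONE_THREAD all set) is my own simplification, not the sandbox policy's actual check.

```python
CLONE_FLAGS = {
    "CLONE_VM": 0x00000100,
    "CLONE_FS": 0x00000200,
    "CLONE_FILES": 0x00000400,
    "CLONE_SIGHAND": 0x00000800,
    "CLONE_THREAD": 0x00010000,
    "CLONE_SYSVSEM": 0x00040000,
    "CLONE_SETTLS": 0x00080000,
    "CLONE_PARENT_SETTID": 0x00100000,
    "CLONE_CHILD_CLEARTID": 0x00200000,
}

# Copied from the clone3 line in the strace output above.
STRACE_FLAGS = (
    "CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|"
    "CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID"
)


def flags_value(text):
    """OR together the named flags from a strace-style 'A|B|C' string."""
    value = 0
    for name in text.split("|"):
        value |= CLONE_FLAGS[name]
    return value


def looks_like_thread_creation(value):
    """Simplified test: shares memory and signal handlers and joins the thread group."""
    required = (
        CLONE_FLAGS["CLONE_VM"]
        | CLONE_FLAGS["CLONE_SIGHAND"]
        | CLONE_FLAGS["CLONE_THREAD"]
    )
    return (value & required) == required


STACK_SIZE = 0x4EA00        # stack_size from the strace line
SIXTEEN_K = 16 * 1024

print(hex(flags_value(STRACE_FLAGS)), STACK_SIZE % SIXTEEN_K)
```

The flags combine to 0x3d0f00, and the stack size leaves a remainder of 10752 when divided by 16384, which confirms it is not a multiple of 16K.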

Flags: needinfo?(jld)

(In reply to Paul Bone [:pbone] from comment #13)

Minimum supported kernel: 3.7.0

So this isn't what I thought it was, because we should have the clone3 fallback…

The Mozilla version (crashes), from mozilla-central CI:

  4) LOAD 0  // System call number
  5) if A >= 0x88; then JMP 6 else JMP 57
  6) if A >= 0xca; then JMP 7 else JMP 32
  7) if A >= 0xf3; then JMP 8 else JMP 20
  8) if A >= 0x118; then JMP 9 else JMP 15
  9) if A >= 0x123; then JMP 10 else JMP 13
 10) if A >= 0x125; then JMP 11 else JMP 12
 11) if A >= 0x126; then JMP 155 else JMP 146
155) JMP 415
415) RET 0x30001  // Trap #1

…but we're treating clone3 as an unexpected syscall. Normally this crashes on Nightly (with a crash report) and fails with ENOSYS otherwise, but it's possible that glibc is doing this with signals blocked, in which case the process would be immediately killed (on all release branches) with no crash report or logging.

In any case, right now this looks like our bug, not the distro's.

I have an arm64 Chromebook where I can reproduce part of this: the libc in the Linux environment (Debian 11, bullseye, currently oldstable) is too old to have the clone3 support code at all, but downloading a treeherder build and running it, I can see the case for clone3 missing from the logs. And that shouldn't happen, because the clone3 case in our EvaluateSyscall method should be unconditional.

I wonder if something weird is going on in the code that generates the syscall number decision tree or maybe where it iterates syscall numbers. I'll look into this some more….

The tip of getting the BPF instructions helped a lot. I think I got to the same place as you. I put a printf here https://searchfox.org/mozilla-central/source/security/sandbox/chromium/sandbox/linux/bpf_dsl/policy_compiler.cc#236 and it only counted to 295 before jumping to 2xxxxxxxxx then 4xxxxxxxxx. Skipping over clone3 and many others.

If you want to keep looking, that's cool. I'm going to have lunch and then I can do some other stuff and check in on this bug later.

(In reply to Paul Bone [:pbone] from comment #16)

The tip of getting the BPF instructions helped a lot. I think I got to the same place as you. I put a printf here https://searchfox.org/mozilla-central/source/security/sandbox/chromium/sandbox/linux/bpf_dsl/policy_compiler.cc#236 and it only counted to 295 before jumping to 2xxxxxxxxx then 4xxxxxxxxx. Skipping over clone3 and many others.

I just found the same thing. It's as if MAX_PUBLIC_SYSCALL were set to 294 instead of 1024, and… oh. I think I found the problem:

#elif defined(__aarch64__)

#include <asm-generic/unistd.h>
#define MIN_SYSCALL 0u
#define MAX_PUBLIC_SYSCALL __NR_syscalls
#define MAX_SYSCALL MAX_PUBLIC_SYSCALL
Flags: needinfo?(jld)
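
A rough model of why that constant matters, assuming (as the printf experiment suggests) that the policy is only evaluated for syscall numbers between MIN_SYSCALL and MAX_PUBLIC_SYSCALL and that everything outside that range goes to a generic invalid-syscall handler. This is a simplified sketch, not the real policy_compiler.cc logic, and classify is a hypothetical name.

```python
CLONE3 = 435  # syscall number of clone3


def classify(syscall_nr, max_public_syscall, min_syscall=0):
    """Say whether a range-based compiler would consult the policy for this number."""
    if min_syscall <= syscall_nr <= max_public_syscall:
        return "evaluate policy"
    return "invalid syscall handler"


for limit in (294, 1024):
    print(limit, classify(CLONE3, limit))
```

With a limit of 294, clone3 never reaches the policy's own clone3 case, which is consistent with that case being missing from the logged filter. With a limit of 1024 it would be evaluated normally.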

But in my /usr/include/asm-generic/unistd.h I have

#define __NR_syscalls 452

So I still don't know.

Try looking at ~/.mozbuild/sysroot-aarch64-linux-gnu/usr/include/asm-generic/unistd.h:

#define __NR_syscalls 294

Mozilla's official builds are built against a fairly old userland, for compatibility with older distros. This uses a sysroot downloaded as an artifact (see bug 1690930 and related), and is enabled by default for ./mach build. So this explains why both local builds and CI are affected, but distro builds aren't, since they usually (as far as I know) use their own headers and libraries.

This also means that if you build --without-sysroot it should work. But also we should fix that code.
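
To compare headers without reading them by eye, something like the following sketch could be used. It only parses text that is passed in; nr_syscalls and covers are hypothetical helper names, and treating __NR_syscalls as an exclusive upper bound (a count of syscalls) is my assumption about how the header defines it.

```python
import re


def nr_syscalls(header_text):
    """Extract the value of '#define __NR_syscalls N' from header text, or None."""
    match = re.search(
        r"^\s*#\s*define\s+__NR_syscalls\s+(\d+)", header_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def covers(header_text, syscall_nr):
    """True if the header's syscall count is large enough to include syscall_nr."""
    limit = nr_syscalls(header_text)
    return limit is not None and syscall_nr < limit


SYSROOT_HEADER = "#define __NR_syscalls 294\n"   # value from the sysroot header
DISTRO_HEADER = "#define __NR_syscalls 452\n"    # value from /usr/include

print(covers(SYSROOT_HEADER, 435), covers(DISTRO_HEADER, 435))
```

To run it against real files, read the contents of the two unistd.h paths mentioned above and pass them to covers along with 435.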

Incidentally, upstream Chromium has the same code, but there are no official builds of Chrome for aarch64 Linux, so as far as Chromium-based browsers go it's probably all distro builds which generally wouldn't be affected.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #19)

Try looking at ~/.mozbuild/sysroot-aarch64-linux-gnu/usr/include/asm-generic/unistd.h:

#define __NR_syscalls 294

Yep, confirmed.

Mozilla's official builds are built against a fairly old userland, for compatibility with older distros. This uses a sysroot downloaded as an artifact (see bug 1690930 and related), and is enabled by default for ./mach build. So this explains why both local builds and CI are affected, but distro builds aren't, since they usually (as far as I know) use their own headers and libraries.

So it is entirely down to how I built it, and it makes sense how the distro folks both avoided the problem and didn't remember solving it. Thanks for figuring that out. I also appreciate following along and learning how seccomp works.

IMHO we should fix this if we want to make Linux aarch64 official.

This also means that if you build --without-sysroot it should work. But also we should fix that code.

Checking now.

Severity: -- → S3
Priority: -- → P2

The bug is marked as tracked for firefox120 (release). However, the bug still has low severity.

:gcp, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(gpascutto)

I think this would've been an S1 but for these builds not being officially supported(?) (yet) and I'm not 100% clear if this crash happens on non-Apple-Silicon ARM.

Flags: needinfo?(gpascutto) → needinfo?(pbone)
Severity: S3 → S2
Priority: P2 → P1
Attachment #9381152 - Attachment description: Bug 1866396 - Hard code the number of system calls for Linux on aarch64 r=glandium → Bug 1866396 - Hard code the number of system calls for Linux on aarch64 r=jld
Attachment #9381152 - Attachment description: Bug 1866396 - Hard code the number of system calls for Linux on aarch64 r=jld → Bug 1866396 - Hard code the number of system calls for Linux on aarch64 r=gcp
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7e485c1acee5
Hard code the number of system calls for Linux on aarch64 r=gcp
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch

The patch landed in nightly and beta is affected.
:pbone, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox124 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(pbone)
Attachment #9390994 - Flags: approval-mozilla-release?
Attachment #9390995 - Flags: approval-mozilla-release?

Uplift Approval Request

  • Steps to reproduce for manual QE testing: Not required
  • User impact if declined: Linux users on aarch64 are unable to use Firefox
  • Is Android affected?: no
  • Fix verified in Nightly: yes
  • String changes made/needed: None
  • Explanation of risk level: The patch is simple and only sets the maximum syscall to a higher number
  • Needs manual QE test: no
  • Risk associated with taking this patch: Low risk
  • Code covered by automated testing: yes
Flags: needinfo?(pbone)
Attachment #9390995 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9390994 - Attachment is obsolete: true
Attachment #9390994 - Flags: approval-mozilla-release? → approval-mozilla-release-