[Linux, aarch64] Content process crashes making the browser unusable.
Categories
(Core :: Security: Process Sandboxing, defect, P1)
Tracking
()
People
(Reporter: pbone, Assigned: pbone)
References
Details
Attachments
(2 files, 1 obsolete file)
Both current attachments are 48-byte text/x-phabricator-request files; phab-bot set approval-mozilla-release+ on the second.
While investigating Bug 1866025 I ran into content process crashes on Linux on Apple Silicon. I don't have a backtrace yet.
Assignee | ||
Comment 1•1 year ago
|
||
Thread 1 "Web Content" received signal SIGSYS, Bad system call.
0x0000ffff7d37fc68 in clone3 () from /lib64/libc.so.6
#0 0x0000ffff7d37fc68 in clone3 () from /lib64/libc.so.6
#1 0x0000ffff7d37fb24 in __clone3_internal () from /lib64/libc.so.6
#2 0x0000ffff7d37fbc8 [PAC] in __clone_internal () from /lib64/libc.so.6
#3 0x0000ffff7d3101ac [PAC] in create_thread () from /lib64/libc.so.6
#4 0x0000ffff7d310ba0 [PAC] in pthread_create@GLIBC_2.17 () from /lib64/libc.so.6
#5 0x0000aaab9f24747c [PAC] in pthread_create (thread=thread@entry=0xffffef56fad8,
attr=attr@entry=0xffffef56fae0, start_routine=0xffff7d241f68 <_pt_root>, arg=arg@entry=0xffff7c5a1660)
at /mnt/dev/moz/unified/mozglue/interposers/pthread_create_interposer.cpp:99
#6 0x0000ffff7d23bd4c in _PR_CreateThread (type=PR_USER_THREAD,
start=0xffff6b909720 <nsThread::ThreadFunc(void*)>, arg=0xffff69b4c1a0, priority=<optimized out>,
scope=<optimized out>, state=PR_JOINABLE_THREAD, stackSize=262144, isGCAble=<optimized out>)
at /mnt/dev/moz/unified/nsprpub/pr/src/pthreads/ptthread.c:458
#7 0x0000ffff6b90b27c in nsThread::Init (this=this@entry=0xffff7c57d7c0, aName=...)
at /mnt/dev/moz/unified/xpcom/threads/nsThread.cpp:619
#8 0x0000ffff6b913524 in nsThreadManager::NewNamedThread (this=<optimized out>, aName=...,
aOptions=aOptions@entry=..., aResult=aResult@entry=0xffffef56fcd0)
at /mnt/dev/moz/unified/xpcom/threads/nsThreadManager.cpp:597
#9 0x0000ffff6b91b2c8 in NS_NewNamedThread (aName=..., aResult=aResult@entry=0xffff6a2867f0,
aInitialEvent=..., aOptions=aOptions@entry=...)
at /mnt/dev/moz/unified/xpcom/threads/nsThreadUtils.cpp:176
#10 0x0000ffff6bc367f0 in NS_NewNamedThread<15ul> (aName=..., aResult=0xffff6a2867f0,
aInitialEvent=<optimized out>, aOptions=...)
at /mnt/dev/moz/unified/obj-aarch64-unknown-linux-gnu/dist/include/nsThreadUtils.h:87
#11 0x0000ffff6fac937c in mozilla::ProcessHangMonitor::ProcessHangMonitor (this=<optimized out>)
at /mnt/dev/moz/unified/dom/ipc/ProcessHangMonitor.cpp:1189
#12 0x0000ffff6faca2b0 in mozilla::ProcessHangMonitor::GetOrCreate ()
at /mnt/dev/moz/unified/dom/ipc/ProcessHangMonitor.cpp:1215
If I disable sandboxing then Firefox works.
Comment 2•1 year ago
|
||
The Arch Linux ARM package is built with --without-wasm-sandboxed-libraries. The line was added with the ominous commit message "extra/firefox: fix".
I don't see anything related to the sandbox in the Fedora firefox.spec but I haven't yet looked at the patches.
Assignee | ||
Comment 3•1 year ago
|
||
(In reply to Janne Grunau from comment #2)
The Arch Linux ARM package is built with --without-wasm-sandboxed-libraries. The line was added with the ominous commit message "extra/firefox: fix". I don't see anything related to the sandbox in the Fedora firefox.spec but I haven't yet looked at the patches.
I think the wasm option is a different kind of sandboxing. I also had to pass that option because I didn't have, and couldn't easily make, a working wasm32-wasi toolchain. Compilation kept failing without --without-wasm-sandboxed-libraries.
Assignee | ||
Comment 4•1 year ago
|
||
[Tracking Requested - why for this release]:
I can reproduce this on 120-122. But I don't know if it is a change in the OS or a regression in Firefox?
Comment 6•1 year ago
•
|
||
I believe this bug is INVALID:
SIGSYS, Bad system call.
is expected during the normal operation of a sandboxed process. Your gdb should not stop on it if it's properly configured, i.e. via ./mach run --debug or similar.
Comment 7•1 year ago
•
|
||
Please check your GDB configuration. You might find that this "crash" is purely GDB stopping in a signal and Firefox itself works correctly.
Starting gdb via mach will load build/.gdbinit, which contains:
handle SIGSYS noprint nostop pass
Comment 8•1 year ago
|
||
I can't reproduce this crash using ALARM's Firefox 120 PKGBUILD + my patch for Bug 1866025. The unpatched build works on devices with ~8GB RAM since that disables PHC (actual RAM size available to Linux is ~7.6GB).
Fedora's packaged Firefox with disabled jemalloc/PHC doesn't crash either.
Is there anything specific needed to reproduce this? My understanding was that this also crashed on startup.
Comment 9•1 year ago
|
||
:pbone, based on comment 6, is this bug invalid? In the meantime we will track it as requested.
Assignee | ||
Comment 10•1 year ago
|
||
Hi GCP, Thanks for your info on Matrix last week. I was confused about the signal handler. However I still think this is a real bug, or at least a problem that's not related to GDB since the bug reproduces when I'm not using GDB.
STR:
- Install Asahi Linux on an M1 Mac Mini.
- Build Firefox from Mozilla's sources.
** This takes some messing around and at least requires Bug 1866400 and this mozconfig:
ac_add_options --enable-optimize
ac_add_options --enable-debug
ac_add_options --with-libclang-path=/usr/lib64
ac_add_options --without-wasm-sandboxed-libraries
This was necessary for Firefox to build.
- Run Firefox with ./mach run. No debugger is involved.
- The window opens but the content processes crash constantly.
- Set security.sandbox.content.level to 0.
- The content processes no longer crash.
With debugger
About the debugger. The above symptoms were without the debugger. Below is with a debugger. Because the content processes crash so early, I started Firefox with MOZ_DEBUG_CHILD_PROCESS=20 ./mach run so that new processes would announce their pid and then sleep for 20 seconds while I attached GDB with gdb -p $pid. After attaching to the process I pressed c (continue) to let execution proceed, since it was stopped when the debugger attached. The process ran, first waiting until the end of the sleep, then crashing with:
Program terminated with signal SIGSYS, Bad system call.
The program no longer exists.
This message is why I was looking at the SIGSYS handler.
I was initially confused about the purpose of the signal and the signal handler involved in sandboxing. However I still think this is a real bug because:
- It crashes without a debugger attached.
- It crashes when sandboxing is enabled, but it's okay when sandboxing is disabled.
If I repeat the steps and attach GDB to a child process, then:
- handle SIGSYS stop
- c
- The program stops in clone3. If I continue again (or even single-step) then it crashes as above.
- Something I don't know yet is why, when there are multiple calls to pthread_spawn() (which I can test using a breakpoint), not all of them generate SIGSYS or crash.
GDB config
I now have
add-auto-load-safe-path /home/paul/dev/moz
set debuginfod enabled on
in my .gdbinit, and getting the symbols from the system libraries is automatic now.
Other builds
Also, the pre-compiled Asahi Linux version of Firefox works fine for me. So something is different in how it's patched/built/configured that makes it work when the one I build crashes. The Linux aarch64 builds from Treeherder also crash for me in the same way. It might be helpful for someone else to try to reproduce this from their own build or from the Treeherder Linux aarch64 builds.
Comment 11•1 year ago
|
||
The problem is that clone3 passes all its arguments via a pointer to a struct, so we can't inspect them in a seccomp-bpf filter (e.g., to ensure that the call is only creating a new thread and not a new process). When clone3 was being developed this was pointed out as an issue, and my recollection from reading through the archived mailing list threads a while ago is that there was some speculation about how seccomp could be extended to handle it, but the design issues were nontrivial, so in the end none of that happened and clone3 support landed anyway.
Normally, glibc will fall back to regular clone, which seccomp can handle (and this may be one reason why the kernel devs weren't more concerned about it). But I seem to recall it has a configure option for minimum supported kernel version, and if this is used then it will strip out any fallbacks that “shouldn't” be needed on that kernel version, and clone3→clone seems to be one of them; see __ASSUME_CLONE3 in the glibc source.
See also https://crbug.com/1213452 on this topic. In particular, Chromium also disallows clone3, so I'd expect Chrome to have the same problem.
It's potentially interesting that Asahi's own builds aren't affected when ours are, given the same kernel and libc, but it's also possible that they just patched the sandbox policy to allow clone3 unconditionally, which as mentioned I'm not comfortable with. I'd need to find out more about how their Firefox packages are built. It's also possible, but a little tedious, to find out if this is the case without inspecting the source by running with MOZ_SANDBOX_LOGGING=1 set and tracing through the disassembly of the seccomp filter.
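To make the clone versus clone3 point concrete, here is a toy Python model (not real BPF, and not Firefox's actual policy) of what a seccomp filter gets to look at. The SeccompData class and the evaluate function are hypothetical names for illustration; the syscall numbers are the aarch64 ones from asm-generic/unistd.h and the flag values come from <linux/sched.h>. Returning ENOSYS for clone3 is an assumption chosen to mirror the idea of pushing glibc onto its clone fallback.

```python
from dataclasses import dataclass
from typing import Tuple

# aarch64 syscall numbers (asm-generic/unistd.h).
NR_CLONE = 220
NR_CLONE3 = 435

# Flag values from <linux/sched.h>; together they are the set glibc
# passes when creating a thread.
CLONE_VM = 0x00000100
CLONE_FS = 0x00000200
CLONE_FILES = 0x00000400
CLONE_SIGHAND = 0x00000800
CLONE_THREAD = 0x00010000
CLONE_SYSVSEM = 0x00040000
CLONE_SETTLS = 0x00080000
CLONE_PARENT_SETTID = 0x00100000
CLONE_CHILD_CLEARTID = 0x00200000

THREAD_FLAGS = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS |
                CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID)


@dataclass(frozen=True)
class SeccompData:
    """Hypothetical stand-in for struct seccomp_data: number plus raw registers."""
    nr: int
    args: Tuple[int, ...]


def evaluate(data: SeccompData) -> str:
    """Toy policy: allow thread-creating clone, refuse clone3 with ENOSYS."""
    if data.nr == NR_CLONE:
        # clone passes its flags by value in the first argument,
        # so the filter can compare them directly.
        return "ALLOW" if data.args[0] == THREAD_FLAGS else "TRAP"
    if data.nr == NR_CLONE3:
        # The first argument is only the address of struct clone_args.
        # A filter cannot dereference it, so the flags stay invisible.
        return "ERRNO(ENOSYS)"
    return "ALLOW"
```

With this model, a clone call carrying THREAD_FLAGS is allowed, a fork-like clone is trapped, and any clone3 call gets the same answer regardless of what its struct contains.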
Comment 12•1 year ago
|
||
One piece of information that could be useful: if you run /lib64/libc.so.6 as an executable with no arguments, it will print out a few pieces of information including the minimum supported kernel.
Also, I could've missed something but I don't think it's been mentioned yet: is this Asahi with the original Arch-based userland or the newer Fedora version?
And, if you can run the working browser with MOZ_SANDBOX_LOGGING=1 (and maybe also the broken browser for comparison) and attach the BPF disassembly from stderr, I can try to see if they're handling the syscall differently.
For example, this is the relevant part of how Debian's firefox package handles syscall 435:
4) LOAD 0 // System call number
7) if A >= 0x89; then JMP 8 else JMP 74
8) if A >= 0x10d; then JMP 9 else JMP 42
9) if A >= 0x12f; then JMP 10 else JMP 26
10) if A >= 0x145; then JMP 11 else JMP 19
11) if A >= 0x14f; then JMP 12 else JMP 16
12) if A >= 0x1b4; then JMP 13 else JMP 15
15) if A >= 0x1b3; then JMP 241 else JMP 175
241) RET 0x50026 // errno = 38
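The listing above can be followed by hand, but a small Python sketch makes the walk explicit. The branch table is transcribed from the excerpt; the trace function and the choice to start at instruction 7 (the excerpt skips instructions 5 and 6) are assumptions made for illustration only.

```python
# index -> (threshold, jump target if A >= threshold, jump target otherwise),
# transcribed from the Debian excerpt above.
DEBIAN_BRANCHES = {
    7: (0x89, 8, 74),
    8: (0x10d, 9, 42),
    9: (0x12f, 10, 26),
    10: (0x145, 11, 19),
    11: (0x14f, 12, 16),
    12: (0x1b4, 13, 15),
    15: (0x1b3, 241, 175),
}
# index -> raw filter return value.
DEBIAN_RETURNS = {241: 0x50026}


def trace(nr, branches, returns, start):
    """Follow the comparison tree for syscall number nr.

    Returns (list of visited instruction indices, raw return value).
    Raises KeyError if the walk leaves the part of the filter that
    the excerpt covers.
    """
    pc = start
    path = [pc]
    while pc not in returns:
        threshold, if_ge, if_lt = branches[pc]
        pc = if_ge if nr >= threshold else if_lt
        path.append(pc)
    return path, returns[pc]


if __name__ == "__main__":
    path, ret = trace(435, DEBIAN_BRANCHES, DEBIAN_RETURNS, start=7)
    print(path, hex(ret))
```

For syscall 435 the walk visits 7, 8, 9, 10, 11, 12, 15 and ends at instruction 241, the RET 0x50026 line.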
Assignee | ||
Comment 13•1 year ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #12)
One piece of information that could be useful: if you run /lib64/libc.so.6 as an executable with no arguments, it will print out a few pieces of information including the minimum supported kernel.
paul@calcium:~$ /lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.38.
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 13.2.1 20231011 (Red Hat 13.2.1-4).
libc ABIs: UNIQUE ABSOLUTE
Minimum supported kernel: 3.7.0
For bug reporting instructions, please see:
<https://www.gnu.org/software/libc/bugs.html>.
Also, I could've missed something but I don't think it's been mentioned yet: is this Asahi with the original Arch-based userland or the newer Fedora version?
The Fedora version, I took a guess of what might be needed to reproduce the first bug I was investigating and picked Fedora.
And, if you can run the working browser with MOZ_SANDBOX_LOGGING=1 (and maybe also the broken browser for comparison) and attach the BPF disassembly from stderr, I can try to see if they're handling the syscall differently.
The Fedora/Asahi version (works). Note this is Firefox 120:
4) LOAD 0 // System call number
5) if A >= 0x8c; then JMP 6 else JMP 59
6) if A >= 0xd8; then JMP 7 else JMP 33
7) if A >= 0x10d; then JMP 8 else JMP 21
8) if A >= 0x124; then JMP 9 else JMP 15
9) if A >= 0x1b3; then JMP 10 else JMP 13
10) if A >= 0x1b7; then JMP 11 else JMP 12
12) if A >= 0x1b4; then JMP 160 else JMP 232
232) RET 0x50026 // errno = 38
The Mozilla version (crashes), from mozilla-central CI:
4) LOAD 0 // System call number
5) if A >= 0x88; then JMP 6 else JMP 57
6) if A >= 0xca; then JMP 7 else JMP 32
7) if A >= 0xf3; then JMP 8 else JMP 20
8) if A >= 0x118; then JMP 9 else JMP 15
9) if A >= 0x123; then JMP 10 else JMP 13
10) if A >= 0x125; then JMP 11 else JMP 12
11) if A >= 0x126; then JMP 155 else JMP 146
155) JMP 415
415) RET 0x30001 // Trap #1
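The two listings end in different raw return values, 0x50026 and 0x30001. As a reading aid, this Python sketch splits such a value into its action and data halves using the masks and action constants from <linux/seccomp.h>; the decode function name is just an illustrative choice.

```python
# Masks and action values from <linux/seccomp.h>.
SECCOMP_RET_ACTION_FULL = 0xFFFF0000
SECCOMP_RET_DATA = 0x0000FFFF

ACTIONS = {
    0x80000000: "KILL_PROCESS",
    0x00000000: "KILL_THREAD",
    0x00030000: "TRAP",
    0x00050000: "ERRNO",
    0x7FC00000: "USER_NOTIF",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
    0x7FFF0000: "ALLOW",
}


def decode(ret):
    """Split a seccomp filter return value into (action name, 16-bit data)."""
    action = ACTIONS.get(ret & SECCOMP_RET_ACTION_FULL, "UNKNOWN")
    return action, ret & SECCOMP_RET_DATA


if __name__ == "__main__":
    for value in (0x50026, 0x30001):
        print(hex(value), decode(value))
```

So 0x50026 decodes to ERRNO with data 38 (the errno noted in the listing comments), while 0x30001 decodes to TRAP with data 1, which is the path that delivers SIGSYS.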
Assignee | ||
Comment 14•1 year ago
|
||
Here's some strace output. I see that the process sets up some signal handling and seccomp, does a few more system calls and then crashes on the very first clone3. Maybe it's useful because we can check clone3's args.
rt_sigaction(SIGSYS, NULL, {sa_handler=0xffff49f6730c, sa_mask=[], sa_flags=SA_NODEFER|SA_SIGINFO}, 8) = 0
rt_sigaction(SIGSYS, {sa_handler=0xffff49f9e924, sa_mask=[], sa_flags=SA_NODEFER|SA_SIGINFO}, NULL, 8) = 0
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, {len=460, filter=0xffff36271000}) = 0
write(6, "C", 1) = 1
read(6, "O", 1) = 1
close(6) = 0
close(36) = 0
mmap(NULL, 327680, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xffff358b0000
mprotect(0xffff358c0000, 262144, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0xffff358ff210, parent_tid=0xffff358ff210, exit_signal=0, stack=0xffff358b0000, stack_size=0x4ea00, tls=0xffff358ff880} => {parent_tid=[0]}, 88) = -460377360
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_errno=EPERM, si_call_addr=0xffff49a5fc68, si_syscall=__NR_clone3, si_arch=AUDIT_ARCH_AARCH64} ---
+++ killed by SIGSYS (core dumped) +++
I note that clone3's stack size isn't divisible by 16K. However there are other calls to clone3, made before seccomp gets set up, that use the same stack size and succeed.
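The divisibility observation is easy to check with the numbers from the strace output. This Python sketch only redoes the arithmetic; the variable names are illustrative, and it draws no conclusion about why glibc picks that stack size.

```python
PAGE_16K = 16 * 1024

# Values copied from the strace output above.
mmap_len = 327680          # length passed to mmap
mprotect_len = 262144      # length passed to mprotect
stack_base = 0xFFFF358B0000
stack_size = 0x4EA00       # stack_size field in the clone3 args


def remainder_16k(n):
    """Remainder of n when divided by a 16 KiB page."""
    return n % PAGE_16K


if __name__ == "__main__":
    print("mmap_len remainder:", remainder_16k(mmap_len))
    print("mprotect_len remainder:", remainder_16k(mprotect_len))
    print("stack_size remainder:", remainder_16k(stack_size))
    print("bytes between stack end and mapping end:", mmap_len - stack_size)
```

The mapping and the mprotect range are whole multiples of 16K, while stack_size leaves a remainder; the stack handed to clone3 ends 5632 bytes below the end of the mapping.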
Comment 15•1 year ago
|
||
(In reply to Paul Bone [:pbone] from comment #13)
Minimum supported kernel: 3.7.0
So this isn't what I thought it was, because we should have the clone3 fallback…
The Mozilla version (crashes), from mozilla-central CI:
4) LOAD 0 // System call number
5) if A >= 0x88; then JMP 6 else JMP 57
6) if A >= 0xca; then JMP 7 else JMP 32
7) if A >= 0xf3; then JMP 8 else JMP 20
8) if A >= 0x118; then JMP 9 else JMP 15
9) if A >= 0x123; then JMP 10 else JMP 13
10) if A >= 0x125; then JMP 11 else JMP 12
11) if A >= 0x126; then JMP 155 else JMP 146
155) JMP 415
415) RET 0x30001 // Trap #1
…but we're treating clone3 as an unexpected syscall. Normally this crashes on Nightly (with a crash report) and fails with ENOSYS otherwise, but it's possible that glibc is doing this with signals blocked, in which case the process would be immediately killed (on all release branches) with no crash report or logging.
In any case, right now this looks like our bug, not the distro's.
I have an arm64 Chromebook where I can reproduce part of this: the libc in the Linux environment (Debian 11, bullseye, currently oldstable) is too old to have the clone3 support code at all, but downloading a treeherder build and running it, I can see the case for clone3 missing from the logs. And that shouldn't happen, because the clone3 case in our EvaluateSyscall method should be unconditional.
I wonder if something weird is going on in the code that generates the syscall number decision tree or maybe where it iterates syscall numbers. I'll look into this some more….
Assignee | ||
Comment 16•1 year ago
|
||
The tip of getting the BPF instructions helped a lot. I think I got to the same place as you. I put a printf here https://searchfox.org/mozilla-central/source/security/sandbox/chromium/sandbox/linux/bpf_dsl/policy_compiler.cc#236 and it only counted to 295 before jumping to 2xxxxxxxxx then 4xxxxxxxxx. Skipping over clone3 and many others.
If you want to keep looking that's cool, I'm going to have lunch and then I can do some other stuff and check-in on this bug later.
Comment 17•1 year ago
|
||
(In reply to Paul Bone [:pbone] from comment #16)
The tip of getting the BPF instructions helped a lot. I think I got to the same place as you. I put a printf here https://searchfox.org/mozilla-central/source/security/sandbox/chromium/sandbox/linux/bpf_dsl/policy_compiler.cc#236 and it only counted to 295 before jumping to 2xxxxxxxxx then 4xxxxxxxxx. Skipping over clone3 and many others.
I just found the same thing. It's as if MAX_PUBLIC_SYSCALL were set to 294 instead of 1024, and… oh. I think I found the problem:
#elif defined(__aarch64__)
#include <asm-generic/unistd.h>
#define MIN_SYSCALL 0u
#define MAX_PUBLIC_SYSCALL __NR_syscalls
#define MAX_SYSCALL MAX_PUBLIC_SYSCALL
Assignee | ||
Comment 18•1 year ago
|
||
But in my /usr/include/asm-generic/unistd.h I have
#define __NR_syscalls 452
So I still don't know.
Comment 19•1 year ago
|
||
Try looking at ~/.mozbuild/sysroot-aarch64-linux-gnu/usr/include/asm-generic/unistd.h:
#define __NR_syscalls 294
Mozilla's official builds are built against a fairly old userland, for compatibility with older distros. This uses a sysroot downloaded as an artifact (see bug 1690930 and related), and is enabled by default for ./mach build. So this explains why both local builds and CI are affected, but distro builds aren't, since they usually (as far as I know) use their own headers and libraries.
This also means that if you build --without-sysroot it should work. But also we should fix that code.
Incidentally, upstream Chromium has the same code, but there are no official builds of Chrome for aarch64 Linux, so as far as Chromium-based browsers go it's probably all distro builds which generally wouldn't be affected.
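To illustrate why a too-small MAX_PUBLIC_SYSCALL hides the clone3 rule, here is a deliberately simplified Python model of a policy compiler that only evaluates syscall numbers up to a compile-time maximum and treats everything above it as invalid. The names policy, compile_ranges and lookup are hypothetical, the "everything else is allowed" policy is a toy, and the exact boundary handling in the real Chromium code may differ.

```python
import bisect

NR_CLONE3 = 435  # aarch64 syscall number for clone3


def policy(nr):
    """Toy per-syscall policy: clone3 gets ENOSYS, everything else is allowed."""
    return "ERRNO(ENOSYS)" if nr == NR_CLONE3 else "ALLOW"


def compile_ranges(max_public_syscall):
    """Build sorted (first_nr, action) ranges for 0..max_public_syscall.

    Numbers above the maximum are never passed to policy(); they all
    fall into one trailing "TRAP" range, as invalid syscalls would.
    """
    ranges = []
    for nr in range(max_public_syscall + 1):
        action = policy(nr)
        if not ranges or ranges[-1][1] != action:
            ranges.append((nr, action))
    ranges.append((max_public_syscall + 1, "TRAP"))
    return ranges


def lookup(ranges, nr):
    """Return the action of the last range whose first_nr is <= nr."""
    starts = [first for first, _ in ranges]
    return ranges[bisect.bisect_right(starts, nr) - 1][1]


if __name__ == "__main__":
    for maximum in (294, 452):
        print(maximum, lookup(compile_ranges(maximum), NR_CLONE3))
```

With a maximum of 294 (the sysroot header's value), syscall 435 lands in the trailing trap range and the clone3 rule is never consulted; with 452 (the value from the newer system header) the same lookup reaches the ENOSYS rule.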
Assignee | ||
Comment 20•1 year ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #19)
Try looking at ~/.mozbuild/sysroot-aarch64-linux-gnu/usr/include/asm-generic/unistd.h:
#define __NR_syscalls 294
Yep, confirmed.
Mozilla's official builds are built against a fairly old userland, for compatibility with older distros. This uses a sysroot downloaded as an artifact (see bug 1690930 and related), and is enabled by default for ./mach build. So this explains why both local builds and CI are affected, but distro builds aren't, since they usually (as far as I know) use their own headers and libraries.
So it comes down entirely to how I built it, which explains how the distro folks both avoided the problem and didn't remember solving it. Thanks for figuring that out. I also appreciate following along and learning how seccomp works.
IMHO we should fix this if we want to make Linux aarch64 official.
This also means that if you build --without-sysroot it should work. But also we should fix that code.
Checking now.
Comment 21•1 year ago
|
||
The bug is marked as tracked for firefox120 (release). However, the bug still has low severity.
:gcp, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 22•1 year ago
|
||
I think this would've been an S1 but for these builds not being officially supported(?) (yet) and I'm not 100% clear if this crash happens on non-Apple-Silicon ARM.
Assignee | ||
Comment 23•1 year ago
|
||
Comment 24•1 year ago
|
||
Comment 25•1 year ago
|
||
bugherder |
Comment 26•1 year ago
|
||
The patch landed in nightly and beta is affected.
:pbone, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set status-firefox124 to wontfix.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 27•1 year ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D202293
Assignee | ||
Comment 28•1 year ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D202293
Comment 29•1 year ago
|
||
Uplift Approval Request
- Steps to reproduce for manual QE testing: Not required
- User impact if declined: Linux users on aarch64 are unable to use Firefox
- Is Android affected?: no
- Fix verified in Nightly: yes
- String changes made/needed: None
- Explanation of risk level: The patch is simple and only sets the maximum syscall to a higher number
- Needs manual QE test: no
- Risk associated with taking this patch: Low risk
- Code covered by automated testing: yes
Comment 30•11 months ago
|
||
uplift |