Open Bug 1883915 Opened 8 months ago Updated 3 months ago

firefox tabs keep on crashing for no reason on gentoo/musl

Categories

(Core :: Widget: Gtk, defect, P5)

Firefox 123
x86_64
Linux
defect

UNCONFIRMED

People

(Reporter: glmlrqaoufmbirifie, Unassigned)

Details

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0

Steps to reproduce:

I compiled the latest version of Firefox on Gentoo (www-client/firefox-123.0.1::gentoo) against the musl/hardened/selinux profile. I created a fresh profile and also used Troubleshoot Mode to disable everything. When I open any web page, tabs crash at random, and reloading them makes them crash again immediately; only after quite a few reloads can I view the page. I can't find any error logs or crash reports, so it would be great if you could tell me where to look. Kind regards.

Actual results:

Tabs crash.

Expected results:

Tabs do not crash.

[Parent 32547, IPC I/O Parent] WARNING: process 491 exited on signal 4: file /var/tmp/portage/www-client/firefox-123.0.1/work/firefox-123.0.1/ipc/chromium/src/base/process_util_posix.cc:265

This is the only error that I have seen in the terminal.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Component: Widget: Gtk → Security: Process Sandboxing
OS: Unspecified → Linux
Hardware: Unspecified → x86_64

Not sure why this was moved to Sandboxing?

Signal 4 is SIGILL, which means you compiled Firefox in a way that makes it use instructions your CPU doesn't support. I think this bug is INVALID, unless I'm missing something?
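As a small illustration of how the signal number is read, the sketch below parses the warning line quoted earlier in this report and looks up the signal's symbolic name with Python's standard signal module. The function name parse_exit_warning is hypothetical, the log line is shortened to drop the file path, and the number-to-name mapping depends on the platform.

```python
import re
import signal

# Warning line from the report, shortened (the trailing "file ..." part is omitted).
WARNING = "[Parent 32547, IPC I/O Parent] WARNING: process 491 exited on signal 4"


def parse_exit_warning(line):
    """Return (pid, signal_name) from an 'exited on signal' warning, or None."""
    match = re.search(r"process (\d+) exited on signal (\d+)", line)
    if match is None:
        return None
    pid = int(match.group(1))
    signum = int(match.group(2))
    # signal.Signals maps a signal number to its symbolic name on this platform.
    return pid, signal.Signals(signum).name


if __name__ == "__main__":
    print(parse_exit_warning(WARNING))  # on Linux: (491, 'SIGILL')
```

The name only tells you which signal ended the process; finding the faulting instruction still requires a debugger.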

If you can attach gdb to it you should see the faulty instruction. You'll need to do the following in gdb first:

handle SIG38 noprint nostop pass
handle SIG64 noprint nostop pass
handle SIGSYS noprint nostop pass

Otherwise the debugger will stop on most system calls. After that, you should be able to run the process and see where it crashes.

Component: Security: Process Sandboxing → Widget: Gtk
Flags: needinfo?(glmlrqaoufmbirifie)

Hello Gian, sorry for the late response. I have tried using gdb, but when the tab crashes gdb doesn't stop at a point where I can run bt; it just carries on running. I am not sure what else I could do to find the exact reason why this is happening.

Flags: needinfo?(glmlrqaoufmbirifie)

Our current process architecture means that there will be different processes for different web pages. If there's a web page where you encounter this issue more frequently, try the following:

  • Open a tab with the web page most likely to crash
  • In a separate tab navigate to about:processes
  • Identify the PID of the process corresponding to the web page you loaded
  • Attach with gdb to that process, letting it continue uninterrupted once gdb is attached
  • Keep browsing until the page crashes

Keep in mind that if you browse to a web page with a different domain, the process will change, even within the same tab, and you'll have to reattach to the new one.
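The attach steps above, combined with the handle commands from the earlier comment, might be scripted roughly as follows. This is only a sketch: gdb_attach_argv is a hypothetical helper, the PID still has to be read from about:processes by hand, and it assumes a gdb that accepts -p and -ex on the command line.

```python
import shlex

# Signals the earlier comment says gdb should pass through silently.
PASS_THROUGH_SIGNALS = ("SIG38", "SIG64", "SIGSYS")


def gdb_attach_argv(pid):
    """Build a gdb command line that attaches to pid and resumes it."""
    argv = ["gdb", "-p", str(pid)]
    for sig in PASS_THROUGH_SIGNALS:
        # Each -ex argument is executed as a gdb command after attaching.
        argv += ["-ex", "handle {} noprint nostop pass".format(sig)]
    # Let the process run again; gdb stops on its own when the crash happens.
    argv += ["-ex", "continue"]
    return argv


if __name__ == "__main__":
    # 12345 is a placeholder PID; replace it with the one from about:processes.
    print(shlex.join(gdb_attach_argv(12345)))
```

The printed command can be pasted into a terminal. Once gdb stops at the crash, bt at the gdb prompt prints the backtrace.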

Again, sorry for the late reply; I didn't see a response to this. I can confirm this happens with firefox-128 as well.

I have followed the advice above, but when I attach to that process the page becomes unresponsive (i.e. I can't click on any elements and the page just freezes). However, lldb does give me output the moment I attach, and I have posted its log, although I am not sure if it is helpful.

Priority: -- → P5

(In reply to Anon from comment #6)

I have followed the advice above, but when I attach to that process the page becomes unresponsive (i.e. I can't click on any elements and the page just freezes). However, lldb does give me output the moment I attach, and I have posted its log, although I am not sure if it is helpful.

The process freezing is normal behavior; you have to tell lldb to resume the process execution after attaching:

  • lldb --attach-pid <pid>
  • wait for the process to be attached
  • in the lldb prompt execute: process continue
  • the process will resume at this point; keep using it. At the point of the crash the process will freeze again. Go back to the lldb window and you'll find that it displays the crashing thread and the signal that stopped it
  • in the lldb prompt execute: bt (this prints the stack trace at the point of the crash; save it and attach it here)
  • in the lldb prompt execute: quit (this leaves the debugging session)

I have used the above steps and obtained a bt of what is causing a crash on a particular tab. I am not sure whether this is the crash I was trying to track down or a different one. Either way, here it is.

Attached file crash.log

This appears to be a sandboxing issue. A child process is crashing because it's trying to load a font, and it's being killed by the SIGSYS signal we raise when a syscall that is not allowed within the sandbox gets called. It is bizarre that we're trying to load a font in a child process; under normal conditions that should never happen. However, yours is a peculiar setup using both musl and SELinux. SELinux in particular may alter the control flow significantly enough that we end up in an unexpected situation and try loading a font from within a child process instead of the parent process. Without more information it's hard to know why this happens. We'd need a trace of the whole execution of the code that loads the font in the crashing process in order to debug this. That being said, you might want to check if SELinux is blocking some syscalls right before the crash; that might give us a hint as to what is happening.

I would like to mention that I am no longer using SELinux, just a musl/clang profile. This is also an issue on the musl/gcc profile on Gentoo. So how would I be able to obtain a trace of the whole execution of the code? I think this issue is related to the musl libc, as most if not all of the systems that I and some of my colleagues have used show the exact same issue.

One possibility is the rr tool, which is packaged for Gentoo, but it generates enormous traces that are hard to share. However, Alpine Linux is building Firefox with musl and they're not encountering this particular issue. Did you change anything else in Firefox's configuration? Have you got any preferences that were manually modified in about:config? If you could attach the contents of the about:support page it would be very helpful.

Attached file about_support.txt

I have attached the information from about:support. Please let me know if there is anything else that you would like me to share. Unfortunately I don't have experience with a graphical Alpine amd64 system, so I am not sure what it's like on Alpine.

(In reply to Gabriele Svelto [:gsvelto] from comment #11)

This appears to be a sandboxing issue. A child process is crashing because it's trying to load a font, and it's being killed by the SIGSYS signal we raise when a syscall that is not allowed within the sandbox gets called. It is bizarre that we're trying to load a font in a child process; under normal conditions that should never happen. However, yours is a peculiar setup using both musl and SELinux. SELinux in particular may alter the control flow significantly enough that we end up in an unexpected situation and try loading a font from within a child process instead of the parent process. Without more information it's hard to know why this happens. We'd need a trace of the whole execution of the code that loads the font in the crashing process in order to debug this. That being said, you might want to check if SELinux is blocking some syscalls right before the crash; that might give us a hint as to what is happening.

Should we move this bug from Gtk → Security: Process Sandboxing?

(In reply to Anon from comment #15)

Should we move this bug from Gtk → Security: Process Sandboxing?

Not really, because that font shouldn't be loaded in a child process at all. The sandboxing code is doing what it's expected to do in this case; we have to figure out why the code that loads the fonts is getting confused.

I have been looking in the Alpine packages repository to see how they build Firefox. There are two patches related to the sandbox; the problem on Gentoo might be related to what these patches address.

https://git.alpinelinux.org/aports/tree/community/firefox/sandbox-fork.patch
https://git.alpinelinux.org/aports/tree/community/firefox/sandbox-sched_setscheduler.patch

I am thinking it might be the first one, although I could be entirely wrong.
