Closed Bug 1841377 Opened 1 year ago Closed 1 year ago

firefox 114 segfaults at startup (MUSL LIBC)

Categories

(Firefox Build System :: Toolchains, defect)

Firefox 114
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1841040

People

(Reporter: a.horodniceanu, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Steps to reproduce:

I tried to launch firefox 114 on my musl gentoo system. Firefox has been compiled with clang and no lto.

Actual results:

firefox crashes, most often before any window can open but sometimes it can get a little further before segfaulting in the same way. Using gdb I can get the following backtrace:

GNU gdb (Gentoo 13.2 vanilla) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-gentoo-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/firefox/firefox...
Reading symbols from /usr/lib/debug//usr/lib/firefox/firefox.debug...
(gdb) run
Starting program: /usr/lib/firefox/firefox
[New LWP 26617]
[LWP 26617 exited]
[Detaching after fork from child process 26618]
[New LWP 26646]
[Detaching after vfork from child process 26647]
[New LWP 26648]
[New LWP 26649]
[New LWP 26650]
[New LWP 26651]
[New LWP 26652]
[New LWP 26653]
[New LWP 26654]
[New LWP 26655]
[Detaching after fork from child process 26656]
[New LWP 26657]

Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
js::LifoAlloc::LifoAlloc (this=0x7fffea112190, defaultChunkSize=4096) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/ds/LifoAlloc.h:769
769     /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/ds/LifoAlloc.h: No such file or directory.
(gdb) bt
#0  js::LifoAlloc::LifoAlloc(unsigned long) (this=0x7fffea112190, defaultChunkSize=4096) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/ds/LifoAlloc.h:769
#1  js::InterpreterStack::InterpreterStack() (this=0x7fffea112190) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/vm/Stack.h:810
#2  js::ProtectedData<js::CheckMainThread<(js::AllowedHelperThread)0>, js::InterpreterStack>::ProtectedData<>(js::CheckMainThread<(js::AllowedHelperThread)0> const&) (this=0x7fffea112190, check=<optimized out>) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/threading/ProtectedData.h:83
#3  js::ProtectedDataNoCheckArgs<js::CheckMainThread<(js::AllowedHelperThread)0>, js::InterpreterStack>::ProtectedDataNoCheckArgs<>() (this=0x7fffea112190) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/threading/ProtectedData.h:176
#4  JSRuntime::JSRuntime(JSRuntime*) (this=0x7fffea112190, parentRuntime=0x0) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/vm/Runtime.cpp:87
#5  0x00007ffff3e84a23 in js_new<JSRuntime, JSRuntime*&>(JSRuntime*&) (args=<optimized out>) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox_build/dist/include/js/Utility.h:520
#6  js::NewContext(unsigned int, JSRuntime*) (maxBytes=33554432, parentRuntime=0x0) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/src/vm/JSContext.cpp:168
#7  0x00007ffff04f0751 in mozilla::CycleCollectedJSContext::Initialize(JSRuntime*, unsigned int) (this=0x7fffea129110, aParentRuntime=0x0, aMaxBytes=33554432) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/xpcom/base/CycleCollectedJSContext.cpp:129
#8  0x00007ffff0b5cd66 in XPCJSContext::Initialize() (this=this@entry=0x7fffea129110) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/xpconnect/src/XPCJSContext.cpp:1211
#9  0x00007ffff0b5d64e in XPCJSContext::NewXPCJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/xpconnect/src/XPCJSContext.cpp:1424
#10 0x00007ffff0b88b19 in nsXPConnect::InitJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/xpconnect/src/nsXPConnect.cpp:92
#11 xpc::InitializeJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/js/xpconnect/src/nsXPConnect.cpp:107
#12 0x00007ffff3cf2ed5 in XREMain::XRE_mainRun() (this=this@entry=0x7fffffffc600) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/toolkit/xre/nsAppRunner.cpp:5369
#13 0x00007ffff3cf3c8f in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=this@entry=0x7fffffffc600, argc=argc@entry=1, argv=argv@entry=0x7fffffffd858, aConfig=...) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/toolkit/xre/nsAppRunner.cpp:5864
#14 0x00007ffff3cf4030 in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=1, argv=0x7fffffffd858, aConfig=...) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/toolkit/xre/nsAppRunner.cpp:5920
#15 0x000055555557d622 in do_main(int, char**, char**) (argc=1, argv=0x7fffffffd858, envp=0x7fffffffd868) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/browser/app/nsBrowserApp.cpp:227
#16 main(int, char**, char**) (argc=<optimized out>, argv=<optimized out>, envp=0x7fffffffd868) at /var/tmp/notmpfs/portage/www-client/firefox-114.0/work/firefox-114.0/browser/app/nsBrowserApp.cpp:445

Expected results:

firefox shouldn't have crashed.

Similar bugs have been reported (https://bugzilla.mozilla.org/show_bug.cgi?id=1837690 https://bugzilla.mozilla.org/show_bug.cgi?id=1841001) but the segfault is happening in other places so I consider this a separate issue.

The Bugbug bot thinks this bug should belong to the 'Core::JavaScript Engine' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → JavaScript Engine
Product: Firefox → Core

I'm also running into this with GCC 13+bfd+llvm 16+rust 1.70.

Component: JavaScript Engine → Toolchains
Product: Core → Firefox Build System

(In reply to Horodniceanu Andrei from comment #1)

Similar bugs have been reported (https://bugzilla.mozilla.org/show_bug.cgi?id=1837690 https://bugzilla.mozilla.org/show_bug.cgi?id=1841001) but the segfault is happening in other places so I consider this a separate issue.

Thanks for all the investigation, and for hooking up gdb to get more information.

However, the fact that you got more than one signature would suggest that there is a different root cause behind all these crashes.
When a user reports many different failures at different locations, this frequently hints at memory corruption.

Have you run a memcheck recently?

Flags: needinfo?(a.horodniceanu)

Were you able to reproduce with firefox-115.0?

(In reply to Nicolas B. Pierron [:nbp] from comment #4)

Have you run a memcheck recently?

I let memtest86+ run for around 4 hours and it got 8 passes with no errors so I think that this is not a hardware issue.

(In reply to tt_1 from comment #5)

Were you able to reproduce with firefox-115.0?

Yes, the same issue still persists:

Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
js::LifoAlloc::LifoAlloc (this=0x7fffe5b6a110, defaultChunkSize=4096) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/ds/LifoAlloc.h:769
769     /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/ds/LifoAlloc.h: No such file or directory.
(gdb) bt
#0  js::LifoAlloc::LifoAlloc(unsigned long) (this=0x7fffe5b6a110, defaultChunkSize=4096) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/ds/LifoAlloc.h:769
#1  js::InterpreterStack::InterpreterStack() (this=0x7fffe5b6a110) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/vm/Stack.h:810
#2  js::ProtectedData<js::CheckMainThread<(js::AllowedHelperThread)0>, js::InterpreterStack>::ProtectedData<>(js::CheckMainThread<(js::AllowedHelperThread)0> const&) (this=0x7fffe5b6a110, check=<optimized out>)
    at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/threading/ProtectedData.h:83
#3  js::ProtectedDataNoCheckArgs<js::CheckMainThread<(js::AllowedHelperThread)0>, js::InterpreterStack>::ProtectedDataNoCheckArgs<>() (this=0x7fffe5b6a110)
    at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/threading/ProtectedData.h:176
#4  JSRuntime::JSRuntime(JSRuntime*) (this=0x7fffe5b6a110, parentRuntime=0x0) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/vm/Runtime.cpp:87
#5  0x00007ffff3f3b933 in js_new<JSRuntime, JSRuntime*&>(JSRuntime*&) (args=<optimized out>) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox_build/dist/include/js/Utility.h:520
#6  js::NewContext(unsigned int, JSRuntime*) (maxBytes=33554432, parentRuntime=0x0) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/vm/JSContext.cpp:168
#7  0x00007ffff0524101 in mozilla::CycleCollectedJSContext::Initialize(JSRuntime*, unsigned int) (this=0x7fffe5b7f0e0, aParentRuntime=0x0, aMaxBytes=33554432)
    at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/xpcom/base/CycleCollectedJSContext.cpp:129
#8  0x00007ffff0b9de56 in XPCJSContext::Initialize() (this=this@entry=0x7fffe5b7f0e0) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/xpconnect/src/XPCJSContext.cpp:1208
#9  0x00007ffff0b9e74e in XPCJSContext::NewXPCJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/xpconnect/src/XPCJSContext.cpp:1421
#10 0x00007ffff0bca049 in nsXPConnect::InitJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/xpconnect/src/nsXPConnect.cpp:92
#11 xpc::InitializeJSContext() () at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/xpconnect/src/nsXPConnect.cpp:107
#12 0x00007ffff3da345c in XREMain::XRE_mainRun() (this=this@entry=0x7fffffffc600) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/toolkit/xre/nsAppRunner.cpp:5364
#13 0x00007ffff3da436e in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=this@entry=0x7fffffffc600, argc=argc@entry=1, argv=argv@entry=0x7fffffffd858, aConfig=...)
    at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/toolkit/xre/nsAppRunner.cpp:5859
#14 0x00007ffff3da4710 in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=1, argv=0x7fffffffd858, aConfig=...) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/toolkit/xre/nsAppRunner.cpp:5915
#15 0x000055555557d772 in do_main(int, char**, char**) (argc=1, argv=0x7fffffffd858, envp=0x7fffffffd868) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/browser/app/nsBrowserApp.cpp:227
#16 main(int, char**, char**) (argc=<optimized out>, argv=<optimized out>, envp=0x7fffffffd868) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/browser/app/nsBrowserApp.cpp:445
Flags: needinfo?(a.horodniceanu)
Attached file gdb.txt
I was also able to get a backtrace when the browser managed to show a window before crashing. In some of my attempts it took a couple of seconds; in others it worked for a few minutes:
Attached file delayed-crash-gdb-log
I was also able to get a backtrace when the browser managed to show a window before crashing. In some of my attempts it took a couple of seconds; in others it worked for a few minutes.

If bugzilla is giving me an error about not being able to process my attachment maybe it shouldn't post a draft of my comment without any notification.

I do not see what would cause any failure in LifoAlloc, especially during the initialization.
The content of the LifoAlloc should all be nullptr-initialized, and no dereference should be executed in the initialization process, except setting the various fields of the LifoAlloc itself.

Can you dump the assembly of the function in which you are crashing?

I wonder if this could be a miscompilation issue, where fields are not properly initialized.
What CXX_FLAGS are you using on Gentoo? Maybe some of these flags are the root cause of this issue …

Attached file build.log
(In reply to Nicolas B. Pierron [:nbp] from comment #10)
> Can you dump the assembly of the function in which you are crashing?

Can you elaborate on this? I thought that the crash was in the LifoAlloc constructor but, for the life of me, I can't get gdb to dump its assembly; it instead chooses to dump the assembly of the JSRuntime class constructor.


(In reply to Nicolas B. Pierron [:nbp] from comment #10)
> I wonder if this could be a miscompilation issue, where fields are not properly initialized.
> What `CXX_FLAGS` are you using on Gentoo? Maybe some of these flags are the root cause of this issue …

Here's the beginning of the build.log, before I cancelled it. It contains the common flags used in the build, if there are other environment settings that have a relevant role in the build, let me know:

(In reply to Nicolas B. Pierron [:nbp] from comment #10)

I wonder if this could be a miscompilation issue, where fields are not properly initialized.
What CXX_FLAGS are you using on Gentoo? Maybe some of these flags are the root cause of this issue …

Another possibly dangerous thing that I have enabled is for clang to use libc++ instead of gcc's libstdc++, which the wiki mentions is heavily discouraged:

It is also possible to add the default-libcxx USE flag to use LLVM's C++ STL with clang, however this is heavily discouraged because libstdc++ and libc++ are not ABI compatible. i.e. A program built against libstdc++ will likely break when using a library built against libc++, and vice versa.

This shouldn't be an issue, however. To make sure of it, I removed gcc and rebuilt firefox and all its dependencies recursively with similar flags (CXXFLAGS="-O2 -pipe -march=native -fstack-protector-strong -D_FORTIFY_SOURCE=2") and no LTO because, as I understand it, that eliminates the possibility of this unsafe flag breaking firefox. The crash is the same: it fails in LifoAlloc.

The reason that the disassembly reports the JSRuntime constructor is that the interpreterStack_ is the first field of the JSRuntime, and it is implicitly initialized to zero using vector registers:

0x00007ffff3fde877 <+23>:	vxorps %xmm0,%xmm0,%xmm0
[…]
0x00007ffff3fde890 <+48>:	vmovaps %ymm0,(%rdi)

The failure seems to happen while dereferencing %rdi, to reset the memory to zero.
To reach the JSRuntime constructor, %rdi should be non-null, as js_new checks for bad allocation before calling the constructor.

Thus the problem is that js_malloc returns a non-null bad pointer that is later used to initialize the JSRuntime, InterpreterStack and LifoAlloc.

I am not sure if this might be related or not, but I notice that there is a patch applied:
0021-bmo-1754469-memory_mozalloc_throw.patch

I do not see any patches on bugzilla for Bug 1754469, so I wonder where this patch comes from.

Glandium, any idea if we have known compatibility issues with musl library and our allocator implementation?

Flags: needinfo?(mh+mozilla)

(In reply to Nicolas B. Pierron [:nbp] from comment #14)

I am not sure if this might be related or not, but I notice that there is a patch applied:
0021-bmo-1754469-memory_mozalloc_throw.patch

I do not see any patches on bugzilla for Bug 1754469, so I wonder where this patch comes from.

The ebuilds used to build firefox, which are glorified bash scripts, can be found here. The patches listed in the build log are from https://dev.gentoo.org/~juippis/mozilla/patchsets/${FIREFOX_PATCHSET}, each ebuild setting the FIREFOX_PATCHSET variable.

For reference, firefox-113.0.2.ebuild works fine but going to 114 or over causes crashes.

As per comment 14, the crash happens here.

=> 0x00007ffff3fde890 <+48>:	vmovaps %ymm0,(%rdi)

This seems to be AT&T syntax, so this would be about writing to %rdi.

Now, the observant would see that there's another write access to %rdi a few lines above:

   0x00007ffff3fde87b <+27>:	movq   $0x0,0x50(%rdi)

and that doesn't fail. So it's probably not a permission/wrong address issue.

The disassembly of the preceding instructions says %rdi is unchanged since entry into the function. And %rdi, in the x86_64 calling convention, is the first argument. So despite there being no register dump in the attached file, we can fortunately deduce its value from the backtrace, as being the this value in

js::LifoAlloc::LifoAlloc (this=0x7fffe59b1090, defaultChunkSize=4096) at /var/tmp/notmpfs/portage/www-client/firefox-115.0/work/firefox-115.0/js/src/ds/LifoAlloc.h:769

Going back to the failing instruction: it's vmovaps. It's Vectorized Move Aligned Packed Single Precision Floating-Point Value. It requires 32-byte alignment. 0x7fffe59b1090 is not 32-byte aligned. Presumably it comes out from several stack frames ago, in js_new, via js_malloc.

Now, since this is all compiled, it seems the compiler is making assumptions about alignment that are not guaranteed. std::alignment_of<JSRuntime>() is 8.

Flags: needinfo?(mh+mozilla)

Now, sizeof(JSRuntime) is 37816, which is massive, and with many allocators, that will pretty much guarantee something aligned much more than 32 bytes, hiding the problem...

FWIW clang 16 with -march=native uses vmovups (where the u stands for unaligned) instead of vmovaps...

In any case, this doesn't seem to be a bug on our side.

Status: UNCONFIRMED → RESOLVED
Closed: 1 year ago
Resolution: --- → INVALID

(In reply to Mike Hommey [:glandium] from comment #19)

In any case, this doesn't seem to be a bug on our side.

Indeed, the bug seems to be with clang's code generation. Thank you for explaining.

(In reply to Mike Hommey [:glandium] from comment #16)

Now, since this is all compiled, it seems the compiler is making assumptions about alignment that are not guaranteed. std::alignment_of<JSRuntime>() is 8.

The compiler does report alignof(JSRuntime) as being 64 but, as you said in your last comment, this issue doesn't have anything to do with firefox.

(In reply to Mike Hommey [:glandium] from comment #18)

FWIW clang 16 with -march=native uses vmovups (where the u stands for unaligned) instead of vmovaps...

I now understand that you meant to try to use clang 16 with -march=native but by removing -march=native from CXXFLAGS the crashing instruction becomes:

   0x00007ffff41d996d <+45>:	movaps %xmm0,(%rdi)
   0x00007ffff41d9970 <+48>:	movaps %xmm0,0x10(%rdi)
   0x00007ffff41d9974 <+52>:	movaps %xmm0,0x20(%rdi)

which happens to work since %rdi is 16-byte aligned. Thank you again for your help.

The compiler does report alignof(JSRuntime) as being 64

It returns 8 here. How does it return 64 for you? It would be good to know because if there's a legitimate reason it does, then there's a bug in the JS code, because it ultimately uses malloc, which won't guarantee such an alignment.

(In reply to Mike Hommey [:glandium] from comment #21)

It returns 8 here. How does it return 64 for you? It would be good to know because if there's a legitimate reason it does, then there's a bug in the JS code, because it ultimately uses malloc, which won't guarantee such an alignment.

(gdb) p alignof(JSRuntime)
$2 = 64
(gdb) p sizeof(JSRuntime)
$3 = 46272

I can not give an explanation for why these values would differ from yours.

Looking for what the compiler produces when compiling Runtime.cpp I get:

  1. firefox 114 compiled with clang 15: 30: c5 fc 29 07 vmovaps %ymm0,(%rdi)
  2. firefox 114 compiled with clang 16: 2d: c5 fc 29 07 vmovaps %ymm0,(%rdi)
  3. firefox 113 compiled with clang 15: 30: c5 fc 11 07 vmovups %ymm0,(%rdi)
  4. firefox 113 compiled with clang 16: 2d: c5 fc 11 07 vmovups %ymm0,(%rdi)

By 113 I mean firefox-113.0.2 and by 114 I mean firefox 114.0 since between those the bug started happening. Currently, packaged for gentoo there are firefox-115.0.2 and firefox-115.0.3 if there is a need for a newer version.

(In reply to Horodniceanu Andrei from comment #22)

(In reply to Mike Hommey [:glandium] from comment #21)

It returns 8 here. How does it return 64 for you? It would be good to know because if there's a legitimate reason it does, then there's a bug in the JS code, because it ultimately uses malloc, which won't guarantee such an alignment.

(gdb) p alignof(JSRuntime)
$2 = 64
(gdb) p sizeof(JSRuntime)
$3 = 46272

I can not give an explanation for why these values would differ from yours.

I do get these values on Firefox 114. So there was a bug, but it's fixed.

<a bisect later>

This was fixed in bug 1841040... which was the exact same issue.

Duplicate of bug: 1841040
Resolution: INVALID → DUPLICATE