Closed Bug 1839139 Opened 1 year ago Closed 1 year ago

Crash in [@ EnterBaseline] affecting users in the es-ar locale doing searches on Google

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

RESOLVED DUPLICATE of bug 1839669

People

(Reporter: gsvelto, Unassigned)

Details

(Keywords: crash, topcrash, topcrash-startup)

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/1590b3bf-fe4b-467f-950c-df6c80230616

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  ?  @0x00001cb414863436  
1  ?  @0x00001cb4148634ed  
2  libxul.so  EnterBaseline  js/src/jit/BaselineJIT.cpp:142
2  libxul.so  js::jit::EnterBaselineInterpreterAtBranch  js/src/jit/BaselineJIT.cpp:198
3  libxul.so  Interpret  js/src/vm/Interpreter.cpp:2225
4  libxul.so  js::RunScript  js/src/vm/Interpreter.cpp:431
4  libxul.so  js::InternalCallOrConstruct  js/src/vm/Interpreter.cpp:585
4  libxul.so  InternalCall  js/src/vm/Interpreter.cpp:620
4  libxul.so  js::Call  js/src/vm/Interpreter.cpp:652
5  libxul.so  js::fun_call  js/src/vm/JSFunction.cpp:956

I'm chucking this in the JIT component because the crash seems to be happening in JIT-compiled code. We've got a large spike on the release channel, primarily among Spanish-speaking users. Of all the crashes under these signatures, the top affected locale is es-ar, with 76.90% of the crashes.

Almost all of those crashes are on the release channel but spread over several versions. Additionally, many users seem to be crashing while doing searches on Google, so maybe this was triggered by a change on their side that unearthed a bug in Firefox.

There are even more crashes affecting users in the es-ar locale under a different signature, with a similar stack and still while doing searches on Google.

Crash Signature: [@ EnterBaseline] → [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>]
Summary: Crash in [@ EnterBaseline] → Crash in [@ EnterBaseline] affecting users in the es-ar locale doing searches on Google

Even more crashes. These are all on very old versions of Firefox, so this is definitely something triggered by a server-side change.

Crash Signature: [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>] → [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>] [@ <unknown in firefox-bin>] [@ base_alloc]

Ouch, even more, this is bad.

Crash Signature: [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>] [@ <unknown in firefox-bin>] [@ base_alloc] → [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>] [@ <unknown in firefox-bin>] [@ base_alloc] [@ js::jit::EnterBaselineInterpreterAtBranch]
Severity: -- → S2
Priority: -- → P1

The spike is coming from the Huayra distro (https://es.wikipedia.org/wiki/Huayra_GNU/Linux), an Argentinian distro for education. They had a major release, v6, last week, and it is probably starting to be deployed.

I sent an email to info [at] educar.gob.ar, which maintains that distro.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

> I sent an email to info [at] educar.gob.ar, which maintains that distro.

Thanks, they will get 2 emails then, I sent them one as well :)

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 10 content process crashes on release
  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release

For more information, please visit BugBot documentation.

Keywords: topcrash

Looking at the crash linked in comment 1, we're crashing in code that looks like this:

    1cb414863417:   mov    %rsp,%rbx         // rbx = rsp - (rdx * 8)
    1cb41486341a:   mov    %rdx,%rax
    1cb41486341d:   shl    $0x3,%rax
    1cb414863421:   sub    %rax,%rbx
    1cb414863424:   mov    %rsp,%rax         // rax = rsp - 0x800
    1cb414863427:   sub    $0x800,%rax
    1cb41486342d:   cmp    %rbx,%rax         // while rax >= rbx
    1cb414863430:   jb     0x1cb414863444
    1cb414863436:   movl   $0x0,(%rax)       //   *rax = 0
^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^
    1cb41486343c:   sub    $0x800,%rax       //   rax -= 0x800
    1cb414863442:   jmp    0x1cb41486342d


This appears to be some sort of stack-probing code. I've left out the preceding context, but it's right at the beginning of a function. rdx is being passed in from the caller. In this case, it's 19535. It looks like we're allocating room for that many 8-byte values. In 2048-byte steps, we walk the stack and touch each page.
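To make the loop above easier to follow, here is a minimal Python model of it. This is only a sketch of the disassembly as read in this comment, not SpiderMonkey's actual source; the function name `probe_addresses` and the example `rsp` value are made up for illustration.

```python
PROBE_STEP = 0x800   # 2048 bytes, the stride of the `sub $0x800,%rax` instructions
VALUE_SIZE = 8       # each stack value is 8 bytes (`shl $0x3,%rax`)


def probe_addresses(rsp, num_values):
    """Return, in order, the addresses the probe loop would write to.

    Hypothetical helper: it mirrors the disassembly, nothing more.
    """
    rbx = rsp - num_values * VALUE_SIZE   # lowest address of the new frame
    rax = rsp - PROBE_STEP                # first probe, one step below rsp
    touched = []
    while rax >= rbx:                     # `jb` leaves the loop once rax < rbx
        touched.append(rax)               # movl $0x0,(%rax)
        rax -= PROBE_STEP
    return touched


# rdx was 19535 in the crash examined here; the rsp value is an arbitrary example.
frame_bytes = 19535 * VALUE_SIZE
probes = probe_addresses(rsp=0x7FFF_0000_0000, num_values=19535)
print(frame_bytes, len(probes))  # 156280 76
```

With rdx = 19535 the frame is 156280 bytes, so the loop would need 76 probes to get through it. Any probe that lands on an unmapped address raises SIGSEGV / SEGV_MAPERR, which matches the reported crash reason.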

Oh, it's this code. That's used in EnterJIT, so we're apparently calling into jit code with ~20000 values on the stack and running out of space. Specifically, we're doing on-stack-replacement to tier up from the C++ interpreter to the baseline interpreter, which entails copying all the values that are currently on the interpreter's stack (arguments, local variables, intermediate results) from the heap onto the native stack.

I should look at more than one crash, but for now one hypothesis is that the distro changed the default stack size.

A couple other things I've noticed:

We're already doing a stack overflow check in EnterBaseline, so whatever's going wrong here is somehow circumventing that.

20000 is our default limit for max stack arguments.

Oh, one other thing about the crash I've looked at is that we crash when we still have several iterations left to go in the loop, so our check is way off. Maybe cx->nativeStackLimit is being set up wrong somehow?
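The following Python sketch illustrates how a stack check and a probe loop can disagree when the limit used by the check does not match the memory that is really mapped. Everything in it is an assumption for illustration: the function names, the addresses, and the idea that the assumed limit is too low are hypothetical, and nothing in this bug confirms that this is the actual cause.

```python
PROBE_STEP = 0x800  # same 2048-byte stride as the probe loop in the disassembly


def check_passes(sp, needed_bytes, assumed_limit):
    """Hypothetical overflow check: the stack grows down, so the new frame
    must stay at or above the limit the engine believes in."""
    return sp - needed_bytes >= assumed_limit


def first_faulting_probe(sp, needed_bytes, real_lowest_mapped):
    """Walk the frame in PROBE_STEP strides and return the first probed
    address below the memory that is really mapped, or None if all fit."""
    addr = sp - PROBE_STEP
    while addr >= sp - needed_bytes:
        if addr < real_lowest_mapped:
            return addr
        addr -= PROBE_STEP
    return None


# Made-up numbers: the check believes the stack extends down to 0x1000,
# but only 64 KiB below sp is actually usable.
sp = 0x10_0000
needed = 19535 * 8
ok = check_passes(sp, needed, assumed_limit=0x1000)
fault = first_faulting_probe(sp, needed, real_lowest_mapped=sp - 64 * 1024)
print(ok, hex(fault))
```

In this made-up scenario the check succeeds, yet the probe faults partway through the frame with many iterations still to go, which is the same shape as the observation above.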

Looking at another four EnterBaseline crashes with useragent-locale "es-ar", they're all crashing in the same code. I suggest asking the maintainer whether anything changed regarding stack limits.

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on beta
  • Top 10 content process crashes on release
  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release

For more information, please visit BugBot documentation.

The bug is marked as tracked for firefox114 (release), tracked for firefox115 (beta) and tracked for firefox116 (nightly). We have limited time to fix this, the soft freeze is in 9 days. However, the bug still isn't assigned.

:sdetar, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(sdetar)

Have we heard anything back yet from the emails sent by Pascal and Emilio to the maintainers of the distro? It seems like we won't be able to make much progress until we are able to talk to them.

Flags: needinfo?(sdetar)

(In reply to Steven DeTar [:sdetar] from comment #13)

> Have we heard anything back yet from the emails sent by Pascal and Emilio to the maintainers of the distro? It seems like we won't be able to make much progress until we are able to talk to them.

I haven't received any response.

Hi all! I'm working on Huayra GNU/Linux. We use the official build/binary downloaded from download.mozilla.org (SHA verified).
We just install it in /opt/firefox with very minimal customization (we only disable the updater).

Huayra 6 is based on Debian 11.x.

You can check our package for this purpose at:

https://github.com/HuayraLinux/firefox-installer

If there is anything we can help with, let us know.

Regards!

using kvm/qemu:

kvm -cdrom huayra-amd64-6.0.iso -m 2G

(In reply to Fernando Toledo from comment #15)

> Hi all! I'm working on Huayra GNU/Linux. We use the official build/binary downloaded from download.mozilla.org (SHA verified).
> We just install it in /opt/firefox with very minimal customization (we only disable the updater).
>
> Huayra 6 is based on Debian 11.x.
>
> You can check our package for this purpose at:
>
> https://github.com/HuayraLinux/firefox-installer
>
> If there is anything we can help with, let us know.
>
> Regards!

FYI: The Firefox version that was shipped in Huayra 6.0 is 114.0.1

Hi Fernando,

It looks like something is going wrong with the amount of stack memory that is available. Have you recently changed the system default stack size? I believe our default stack limit on Linux is 8MB.

If my math is correct, we're crashing when allocating a ~160KB stack frame.

Edit to add: this bit of code seems like it might be relevant. Did you lower the value of RLIMIT_STACK in Huayra 6.0?

Flags: needinfo?(ragnarok)
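For anyone who wants to check the numbers on an affected machine, here is a small Python sketch. It assumes the worst case is the 20000-value limit mentioned earlier in this bug at 8 bytes per value, and it only compares that frame size against the process's soft RLIMIT_STACK. The helper names are hypothetical, and the comparison ignores whatever stack is already in use.

```python
import resource

MAX_STACK_VALUES = 20000   # default limit mentioned earlier in this bug
VALUE_SIZE = 8             # bytes per stack value


def worst_case_frame_bytes(max_values=MAX_STACK_VALUES, value_size=VALUE_SIZE):
    """Size of a frame holding the maximum number of stack values."""
    return max_values * value_size


def fits_in_stack(frame_bytes, soft_limit):
    """Rough check only: ignores stack already consumed by other frames."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return frame_bytes <= soft_limit


if __name__ == "__main__":
    # getrlimit reports bytes, while `ulimit -s` reports kbytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    frame = worst_case_frame_bytes()
    shown = "unlimited" if soft == resource.RLIM_INFINITY else soft
    print("soft RLIMIT_STACK (bytes):", shown)
    print("worst-case frame (bytes):", frame)
    print("frame fits under soft limit:", fits_in_stack(frame, soft))
```

The worst-case frame works out to 160000 bytes, which is where the ~160KB figure comes from. An 8 MB limit (`ulimit -s` showing 8192) leaves plenty of headroom for a frame of that size on its own.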

(In reply to Iain Ireland [:iain] from comment #18)

> Hi Fernando,
>
> It looks like something is going wrong with the amount of stack memory that is available. Have you recently changed the system default stack size? I believe our default stack limit on Linux is 8MB.
>
> If my math is correct, we're crashing when allocating a ~160KB stack frame.
>
> Edit to add: this bit of code seems like it might be relevant. Did you lower the value of RLIMIT_STACK in Huayra 6.0?

Nope, we did not change these settings.

ragnarok@huayra:~/Descargas/isos$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30691
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 95
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 30691
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Flags: needinfo?(ragnarok)

I see that the crash report was from Huayra 5.0 (the previous, older release) and FF 112.0.1.
We shipped Firefox 90 at release time.
In any case, it is also possible to update to the latest version of FF in Huayra 5.

Were the crash reports all from the same version?

The crashes are coming from all versions, including 114.0.1, but all from Huayra 5.0. Did something change in that version of the distribution?

Flags: needinfo?(ragnarok)

(In reply to Gabriele Svelto [:gsvelto] from comment #21)

> The crashes are coming from all versions, including 114.0.1, but all from Huayra 5.0. Did something change in that version of the distribution?

No changes were made in our repo, but users can receive updates directly from the Debian repo.
I will run some more tests. Can someone reproduce the problem?
I still can't reproduce it.

Flags: needinfo?(ragnarok)

The crash in bug 1839669 comment 5 looks nearly identical to this one, and it seems to come from a non-Argentinian user.

See Also: → 1839669

They also seem to be able to reproduce with official binaries...

They all seem to be in Debian/Debian-based distros. I'm on Arch and can't repro...

Julien could repro this on a Debian 10 VM.

I could reproduce it in a qemu+kvm VM using Huayra 5 (Debian 10) and FF 90.0:

Go to google.com
Search for something and click on the "Images" results

Good, what does ulimit -a give you there?

Duping this. The other bug explains that this is a Google-side change that causes huge resource usage, but the question remains why only Debian users are actually hitting some system limit and crashing.

Status: NEW → RESOLVED
Closed: 1 year ago
Duplicate of bug: 1839669
Resolution: --- → DUPLICATE

The duplicate bug is already tracking all the necessary releases, so dropping the flags from this one.
