1839669 - Google Images search reproducibly causes tab crash

Reporter

Description

•

2 years ago

•

Using Firefox 115.0b7 that I compiled from sources on Debian GNU/Linux 10 (buster = oldoldstable), with a fresh new profile, no addons installed:

Any Google images search causes the tab to crash (this occurs reproducibly, 100% of the time).

E.g.: navigate to https://www.google.com/search?q=test&tbm=isch — page very briefly displays search results, and is then immediately replaced by “Gah. Your tab just crashed.”

stderr displays the following as the crash occurs:

[Parent 2163, IPC I/O Parent] WARNING: process 2825 exited on signal 11: file /huge/mozilla/ipc/chromium/src/base/process_util_posix.cc:264

Crash report is here: https://crash-stats.mozilla.org/report/index/733c5977-8443-4b2d-b54d-0b1250230621 (this is the crash corresponding to the aforequoted URL and diagnostic line)

I will try with other versions and various pre-built binaries and will update this bug-report accordingly. Please advise me on what else I can do to help diagnose this.

David A. Madore

Reporter

Comment 1

•

2 years ago

Attached file raw data from about:support for crash — Details

David A. Madore

Reporter

Comment 2

•

2 years ago

Tested with Firefox started in “troubleshooting” mode (-safe-mode), with the exact same result. Crash report for this test is: https://crash-stats.mozilla.org/report/index/7d6d1f88-8f9a-4845-a317-f5d860230621

David A. Madore

Reporter

Comment 3

•

2 years ago

Tested with prebuilt Firefox “Developer Edition” version 115.0b8 to rule out problem being caused by my custom build: crash still occurs. Crash report for this test is: https://crash-stats.mozilla.org/report/index/68b73906-0f06-4e0f-9325-2ba2f0230621

Again, this also occurs in “troubleshooting” mode (-safe-mode). Will now try the latest nightly.

David A. Madore

Reporter

Comment 4

•

2 years ago

Tested with prebuilt Firefox nightly 116.0a1 buildid 20230621040008 downloaded from https://www.mozilla.org/en-US/firefox/channel/desktop/ (and again with a completely new profile): crash still occurs in the same way. Crash report for this test is: https://crash-stats.mozilla.org/report/index/d224aa81-3ada-4206-ac64-ae85d0230621

The same crash also occurs in troubleshooting mode: https://crash-stats.mozilla.org/report/index/0a8c1d80-a580-4e86-bb84-dfaf50230621

Let me try with the current stable version as well, to be sure.

David A. Madore

Reporter

Comment 5

•

2 years ago

Tested with prebuilt Firefox stable 114.0.2 buildid 20230619081400 downloaded from https://www.mozilla.org/en-US/firefox/download/thanks/ (still with a completely new profile): crash still occurs in the same way. Crash report: https://crash-stats.mozilla.org/report/index/792f14ea-329c-4272-8c62-3b4510230621

Summary/conclusion: crash occurs in the same way for all Firefox versions I could try (stable=114, beta=115, nightly=116), whether prebuilt by Mozilla or built by myself, whether in normal or troubleshooting mode, whether with a fresh profile or my standard one; it is systematic and fully reproducible whenever I open any Google Images results page.

This is very curious because the bug started happening all of a sudden a few days/weeks ago. I did not upgrade anything on my system that might explain it. I am at a loss as to what else I might test. Somebody please advise.

David A. Madore

Reporter

Comment 6

•

2 years ago

The thread https://forums.linuxmint.com/viewtopic.php?p=2337495 suggests I'm not the only one who encounters this bug.

Also bug #1838999 may be the same as this one.

David A. Madore

Reporter

Comment 7

•

2 years ago

Crash dumps from builds with debugging info suggest that the segfault comes from EnterBaseline(JSContext*, EnterJitData&) called from js::jit::EnterBaselineInterpreterAtBranch(JSContext*, js::InterpreterFrame*, unsigned char*) called from js::Interpret(JSContext*, js::RunState&) — so maybe this bug should be assigned to the “JavaScript Engine: JIT” component? Not sure how to proceed with triage (also, not sure whether this is a smoking gun). Leaving in “General” until someone more knowledgeable decides otherwise.

David A. Madore

Reporter

Comment 8

•

2 years ago

PS: I also tried with an old version of Firefox (111.0.1) that I know I had previously used to visit Google Images without problem, and it also crashes.

Conclusion: the bug is not a regression in Firefox, it has existed for some time, but Google must have changed something in their JavaScript code which now triggers the crash (at least on some systems, including mine, but probably not for every Firefox user on Linux, because that would obviously have been noticed sooner).

David A. Madore

Reporter

Comment 9

•

2 years ago

Another test: on near-identical machines, the bug occurs on a Debian 10 (Buster) system and not on a Debian 11 (Bullseye). I don't know what to make of this.

François Bienvenu

Comment 10

•

2 years ago

I was going to file a new bug report because I am also experiencing this bug, on two different machines (both running Debian 10 but with different settings).

I can reproduce the bug with:

Firefox 91.5.0esr (64-bit)
Firefox 102.12.0esr (64-bit)

I found a few other recent mentions of this bug online — such as this one from 6 days ago: https://support.google.com/websearch/thread/221293972/google-images-constantly-crashing?hl=en

François Bienvenu

Comment 11

•

2 years ago

Also, regarding the following claim:

the bug is not a regression in Firefox, it has existed for some time, but Google must have changed something in their JavaScript code which now triggers the crash

For me the bug started a few days ago, without me having upgraded any software. So that goes in the same direction.

Quentin Godfroy

Comment 12

•

2 years ago

I can reproduce this bug on Firefox 102.10.0esr-1~~deb10u1 from Debian buster/updates, as well as 102.12.0esr-1~~deb10u1

Quentin Godfroy

Comment 13

•

2 years ago

crash report https://crash-stats.mozilla.org/report/index/88350b93-be0d-4ca1-b9c7-3ef0a0230622

Emilio Cobos Álvarez (:emilio)

Updated

•

2 years ago

Comment 14

•

2 years ago

another related report, not Huayra only:

https://support.google.com/websearch/thread/221293972/google-images-constantly-crashing?hl=en

Iain Ireland [:iain]

Comment 15

•

2 years ago

I've taken a look at each of the crash reports linked in these comments. They're all crashing in the same way as the crashes in bug 1839139. Our working hypothesis in that bug was that the problem was specific to the Huayra distro, but evidence in this bug seems to indicate that it affects many Debian-based distros. (Unless I'm missing something, though, nobody's reported a crash on Ubuntu yet?)

The story I can piece together so far:

Google recently made a change to its image search page. There is now a JS function that has nearly 20000 values on the stack (which could be arguments, local variables, intermediate results of an ongoing computation, or a mix of all three). This function also contains a loop. At some point while interpreting this function in C++, we hit the top of the loop, increment the warm up counter, and decide to tier up to use JIT code. To do so, we have to copy the values from the interpreter stack onto the native C++ stack. Before we copy the arguments, we touch each page of the new stack frame; this is only strictly necessary on Windows, but we do it on every platform for simplicity. At some point in this loop, we touch a page and segfault.

Before we do this, we've done a stack check and verified that the amount of additional stack memory we're allocating does not overflow the native stack limit we've set for ourselves. So there appears to be a disagreement between our self-imposed limit, and the limit from the OS. This is somehow distro dependent, but messily: for example, it affects Debian 10 but not Debian 11.

On Linux we expect the default stack limit to be 8MB. Huayra confirmed that this is still true for them.

I don't immediately know how to put these pieces together. If somebody who can reproduce this could capture a recording in rr, maybe that would help us get to the bottom of it.

Gian-Carlo Pascutto [:gcp]

Updated

•

2 years ago

Duplicate of this bug: 1839248

Gian-Carlo Pascutto [:gcp]

Updated

•

2 years ago

Duplicate of this bug: 1839139

Gian-Carlo Pascutto [:gcp]

Comment 18

•

2 years ago

There are a lot of distros derived from that Debian version, a bunch of dupe bugs, and some signatures with a fair amount of crash traffic (>4000/day) -> S2.

Severity: -- → S2

Component: General → JavaScript Engine

Priority: -- → P2

Product: Firefox → Core

Gian-Carlo Pascutto [:gcp]

Comment 19

•

2 years ago

I'm not sure about the best component, but given the analysis pointing to the stack exhaustion being triggered in the JS Engine, I guess that's as good a starting place as any.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 20

•

2 years ago

Adjusting summary: process_util_posix.cc:264 is the line where we log that the other process crashed; it's not related to the cause of the crash.

Summary: Google Images search reproducibly causes tab crash (segfault in process_util_posix.cc:264) → Google Images search reproducibly causes tab crash

Gian-Carlo Pascutto [:gcp]

Updated

•

2 years ago

Crash Signature: [@ EnterBaseline] [@ chunk_alloc | <unknown in libxul.so>] [@ <unknown in firefox-bin>] [@ base_alloc] [@ js::jit::EnterBaselineInterpreterAtBranch]

Teo V.

Comment 21

•

2 years ago

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #20)

Adjusting summary: process_util_posix.cc:264 is the line where we log that the other process crashed; it's not related to the cause of the crash.

Can someone provide a summary as to what software packages that source code (process_util_posix.cc) file may invoke from the environment, so we can eliminate or confirm the possibility of a buggy shared library in the Debian (and other) distro?

Donal Meehan [:dmeehan]

Updated

•

2 years ago

status-firefox114: --- → affected

status-firefox115: --- → affected

status-firefox116: --- → affected

status-firefox-esr102: --- → affected

tracking-firefox114: --- → +

tracking-firefox115: --- → +

tracking-firefox116: --- → +

Gian-Carlo Pascutto [:gcp]

Comment 22

•

2 years ago

Can someone provide a summary as to what software packages that source code (process_util_posix.cc) file may invoke from the environment

As the comment tried to explain, that source code is completely irrelevant to the problem. It just notices a Firefox process has crashed and reports it.

David A. Madore

Reporter

Comment 23

•

2 years ago

Concerning Ubuntu: I've been asking all around for people to try to reproduce this bug, someone told me they had it with Ubuntu 18.04 (no further details given), but I've been unable to reproduce with Ubuntu 18.04.6 LTS in a virtual box (neither with the distro-provided Firefox nor with a Mozilla-compiled nightly). I'll try to get more details from them as to exactly what version they had. (IF the difference is down to the kernel version, this could be a clue as to what is happening.)

Gabriele Svelto [:gsvelto]

Comment 24

•

2 years ago

FYI I've verified by looking at the memory map in the crash dumps that the stack is smaller than we expect. Specifically it's 152 KiB in size. This is the main thread stack, other threads' stacks appear larger but not much larger: 256 KiB each.

Gabriele Svelto [:gsvelto]

Comment 25

•

2 years ago

Interestingly the main process' main thread stack is similar in size even on my machine which is Gentoo.

jmgonk

Comment 26

•

2 years ago

I'm having this problem using 102.12.0esr - downloaded from Mozilla, not built from source - on Slackware, so it definitely isn't a Debian thing.

Teo V.

Comment 27

•

2 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #22)

Can someone provide a summary as to what software packages that source code (process_util_posix.cc) file may invoke from the environment

As the comment tried to explain, that source code is completely irrelevant to the problem. It just notices a Firefox process has crashed and reports it.

Ah, it's a monitoring class or function, not something which provides glue to the POSIX thread/process functions. I thought it was a glue sort of thing.

Anyways, for me, the tab begins to render the Google Images page (loads images etc.), then suddenly simply crashes mid-render. As in the dupe bug I filed, nothing out of the ordinary in the web console except a warning from google not to be fooled to type in stuff into the console with an odd looking file name. Possible ideas on my side, and what I've tried (cf. Bug 1839248 for details).

Resource exhaustion (however, that other tabs to other sites can open no prob speaks against that). Haven't checked whether this could be involved.
"Dark Matter" HTML in the Google Images page. Has anyone tried putting the HTML, CSS and JS of the page under a loupe? Haven't explored this as my XP in HTML, JS and CSS is way too low.
Aggressive Ad-blocker blocking (however, I've tried with mine both on and off, no dice).
Hardware acceleration fail (why it would appear in an image page (not a video page) is beyond me, and work no probs on Youtube). However, trying setting media.hardware-video-decoding.enabled to false in about:config only worked for a day or two, then back to crashing again (cf. Bug 1839248)
Tainted profile is eliminated, tried a fresh clean profile, and still crashing.

Gian-Carlo Pascutto [:gcp]

Comment 28

•

2 years ago

Possible ideas on my side, and what I've tried

FWIW the underlying cause was identified in comment 15: Google changed this page in a way that it uses some very weirdly written JavaScript code. Now there are two things that seem off:

a) There should be enough (stack) memory available to Firefox to deal with this, but there isn't, at least on the affected distros. It's not clear why.
b) In theory the effect of that would be to stop Firefox's JavaScript JIT from optimizing the code, making the page slow in Firefox (also not exactly desirable), but our check for this situation doesn't seem to fire and we crash instead.

So it looks like there's 2 bugs here in Firefox, and it's possible the "weird code" on that page that triggers them is itself a bug in Google's JS tooling.

Fernando Toledo

Comment 29

•

2 years ago

Another reproducible test: clean Debian 10.7 install and FF 102.12.0esr on a VM, using the debian-10.7.0-amd64-xfce-CD-1.iso file.

Steve Fink [:sfink] [:s:]

Updated

•

2 years ago

Flags: needinfo?(iireland)

Flags: needinfo?(gsvelto)

Iain Ireland [:iain]

Comment 33

•

2 years ago

•

Edited

Our minimum stack quota is 1MB on 64-bit Linux, so if comment 24 is correct and we're only given 152K, that could explain why our stack check isn't working. It does raise the question of why we're getting such a small stack.

Oh! If 152K is the size of the memory that's actually been mapped, then maybe this isn't surprising. 152K is within the margin of error for the ~160K I was estimating as the size of the stack frame we're trying to allocate. We expect the kernel to map new stack memory for us as we touch it, so crashing with 152K of successfully mapped stack memory is consistent with what we're calculating elsewhere: we try allocating and touching a bunch of stack memory, well within the amount of stack memory that rlimit says we can access, but crash before we've managed to touch all of it.

I think our best bet for making forward progress is getting an rr / pernosco recording of the crash from somebody who can reproduce it locally.

Flags: needinfo?(iireland)

Martin K

Updated

•

2 years ago

Duplicate of this bug: 1838999

Martin K

Comment 35

•

2 years ago

I also reported this issue as bug #1838999 but this one seems to have made better progress so I've closed mine as a duplicate of this one.

Martin K

Comment 36

•

2 years ago

If it helps, I run all instances of Firefox within a Linux cgroup, as otherwise it tends to grab too much RAM and cause other processes to fail:

$ cgset -r memory.soft_limit_in_bytes=3892578125   /user/$UID.user/firefox
$ cgset -r memory.limit_in_bytes=4180731904        /user/$UID.user/firefox
$ cgset -r memory.kmem.limit_in_bytes=522591488    /user/$UID.user/firefox
$ cgset -r memory.kmem.tcp.limit_in_bytes=33554432 /user/$UID.user/firefox
$ cgset -r cpu.shares=768                          /user/$UID.user/firefox

My system resources:

$ free
              total        used        free      shared  buff/cache   available
Mem:        8165492     2825644     1062512      416668     4277336     4561844
Swap:             0           0           0
$ ulimit -Ss
8192
$ ulimit -Hs
unlimited
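
The soft and hard stack limits printed by ulimit above can also be read from inside a process. A minimal sketch using Python's standard resource module (Unix only); the helper names format_limit and stack_limits are made up for illustration, and the formatting mirrors how ulimit -s reports sizes in units of 1024 bytes:

```python
import resource

def format_limit(value):
    """Render an rlimit value the way `ulimit -s` does (KiB, or 'unlimited')."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)

def stack_limits():
    """Return the (soft, hard) RLIMIT_STACK of the current process, formatted."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)

if __name__ == "__main__":
    soft, hard = stack_limits()
    print("soft:", soft, "hard:", hard)
```

Note that this limit is only a ceiling on how far the stack may grow; as comment 33 points out, the amount of stack actually mapped at any moment can be far smaller.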

Gabriele Svelto [:gsvelto]

Comment 37

•

2 years ago

This is what I'm seeing in one of the crashes:

7f57a0f8e000-7f57a0f8f000 rw-p 00000000 00:00 0
7fccf0cff000-7fccf0d70000 r--s 00000000 00:05 40511                      /memfd:mozilla-ipc (deleted)
7ffc64d76b18 <--- crash address
7ffc64d77000-7ffc64d9d000 rw-p 00000000 00:00 0                          [stack]
7ffc64d9d000-7ffc64d9f000 rw-p 00000000 00:00 0
7ffc64df1000-7ffc64df4000 r--p 00000000 00:00 0                          [vvar]
7ffc64df4000-7ffc64df6000 r-xp 00000000 00:00 0                          [vdso]

Note that the stack is definitely smaller than 8 MiB, the crash is only 1256 bytes above the top of the stack so well within the guard page but the stack hasn't been extended. There's a pretty large area above so the kernel would have more than enough space to grow the stack if it wanted to.
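
The two numbers in this comment (a 152 KiB stack and a crash 1256 bytes outside it) can be re-derived from the map excerpt. A small sketch, assuming the lines follow the usual /proc/<pid>/maps layout of a hexadecimal start-end range in the first field; parse_range is a hypothetical helper name:

```python
def parse_range(line):
    """Parse the 'start-end' field of a /proc/<pid>/maps style line."""
    start, end = line.split()[0].split("-")
    return int(start, 16), int(end, 16)

# Values copied from the excerpt above.
stack_line = "7ffc64d77000-7ffc64d9d000 rw-p 00000000 00:00 0                          [stack]"
crash_address = 0x7ffc64d76b18

stack_start, stack_end = parse_range(stack_line)
stack_kib = (stack_end - stack_start) // 1024
gap_below_stack = stack_start - crash_address

print(stack_kib)        # 152
print(gap_below_stack)  # 1256
```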

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Comment 38

•

2 years ago

For comparison I tried the following: open Google search in a content process, measure the stack size, it's 132 KiB. Run a search, the stack size grows to 188 KiB. I tried on both release and nightly with similar results so the stack starts small but grows as expected on my box.

Gian-Carlo Pascutto [:gcp]

Comment 39

•

2 years ago

I run all instances of Firefox within a Linux cgroup, as otherwise it tends to grab too much RAM and cause other processes to fail

Do you have a bug on file for this?

Gabriele Svelto [:gsvelto]

Comment 40

•

2 years ago

FYI I've had a look at the stacks of other threads than the main thread and everything seems in order. Threads spawned from a thread pool get 256 KiB (see here):

7f57932a6000-7f57932a7000 ---p 00000000 00:00 0
7f57932a7000-7f57932e7000 rw-p 00000000 00:00 0

Note how it's a single mapping and there's a user-visible guard page above it, so nothing unexpected here.

Gian-Carlo Pascutto [:gcp]

Comment 41

•

2 years ago

FWIW the 64k stack-growth-then-error logic in the kernel was changed in 2018:

commit 1d8ca3be86ebc6a38dad8236f45c7a9c61681e78
Author: Waiman Long <longman@redhat.com>
Date:   Tue Nov 6 15:12:29 2018 -0500

    x86/mm/fault: Allow stack access below %rsp

Gian-Carlo Pascutto [:gcp]

Comment 42

•

2 years ago

Sampling the crashes shows the newest crashing kernel is 4.19, which just predates that patch.

Gian-Carlo Pascutto [:gcp]

Comment 43

•

2 years ago

•

Edited

This is one of the few crash reports on a new kernel (5.19): https://crash-stats.mozilla.org/report/index/19ceeb97-1248-4b3d-a09b-fb1670230623

Would be nice to know if it's the same cause or a different issue.

Gabriele Svelto [:gsvelto]

Comment 44

•

2 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #43)

This is one of the few crash reports on a new kernel (5.19): https://crash-stats.mozilla.org/report/index/19ceeb97-1248-4b3d-a09b-fb1670230623

Would be nice to know if it's the same cause or a different issue.

That crash looks like a completely different issue: it's pushing a register below the stack (i.e. it's underflowing, not overflowing). The stack overflow crashes all hit mov dword [rax], 0x0 as the crashing instruction, with rax above the stack, so you can use that to tell them apart. Unfortunately some of these signatures are catch-alls for a bunch of different problems, many of which are bad hardware.

jfriesse

Comment 45

•

2 years ago

I'm running Void Linux with the Mozilla build of Firefox ESR and am also getting regular crashes on Google Images, so the problem is definitely not Debian-specific. I've also tried to run Firefox under rr (rr record firefox), but it looks like rr is doing something that makes Firefox NOT crash :( Any other ideas on what to try (FF debug build, some Firefox settings, a different way to call rr, ...) to reproduce the bug using rr?

Gian-Carlo Pascutto [:gcp]

Comment 46

•

2 years ago

•

Edited

152K is within the margin of error for the ~160K I was estimating as the size of the stack frame we're trying to allocate. We expect the kernel to map new stack memory for us as we touch it

Does the 160kb jump, combined with https://bugzilla.mozilla.org/show_bug.cgi?id=909094#c51 point 4, and there being fair indications as to what change in the kernel stopped this from happening (removal of the 64kb jump limit) explain the situation? We're doing a too large stack usage increase at once and on older kernels that's being disallowed? So doing some intermediate probes at 32kb distance might work around it?

Gabriele Svelto [:gsvelto]

Comment 47

•

2 years ago

•

Edited

So what's happening on older kernels is that as we probe the stack we're not updating the stack pointer. Older kernels forbid accesses to the stack that are farther away than 65536 + 256 bytes from the stack pointer, see this code.

So we can probably work around this but we need to (temporarily) bump the stack pointer to make sure the probes don't exceed that range. I don't know how feasible it is but it would avoid this issue altogether.
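
To make the mechanism concrete, here is a toy model of the rule described above. It is not kernel or SpiderMonkey code. The page size, the starting stack pointer, and the assumption that the stack is initially mapped only down to the stack pointer's page are illustrative choices; only the 65536 + 256 distance comes from this comment.

```python
PAGE = 4096                    # assumed page size
ALLOWANCE = 65536 + 256        # distance quoted in this comment

def write_allowed(addr, sp, stack_start):
    """Toy model of the older-kernel rule for writes near the stack."""
    if addr >= stack_start:
        return True                     # already inside the stack mapping
    return addr + ALLOWANCE >= sp       # growth only allowed close to sp

def probe_frame(frame_bytes, sp, move_sp):
    """Touch one address per page of a new frame below sp.

    Returns the faulting address, or None if every probe succeeds.
    """
    stack_start = sp - sp % PAGE        # assume stack is mapped down to sp's page
    current_sp = sp
    addr = sp
    while addr - PAGE >= sp - frame_bytes:
        addr -= PAGE
        if move_sp:
            current_sp = addr           # workaround: probe through the stack pointer
        if not write_allowed(addr, current_sp, stack_start):
            return addr
        stack_start = min(stack_start, addr - addr % PAGE)  # stack grew
    return None

SP = 0x7ffc64d9b000                     # arbitrary, made-up stack pointer
FRAME = 160 * 1024                      # roughly the ~160K frame from comment 33

fault = probe_frame(FRAME, SP, move_sp=False)
print(SP - fault)                               # 69632
print(probe_frame(FRAME, SP, move_sp=True))     # None
```

In this model the probes succeed for the first 16 pages (64 KiB) and fail on the 17th when the stack pointer stays put, while moving the stack pointer along with each probe never exceeds the allowed distance.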

Gian-Carlo Pascutto [:gcp]

Updated

•

2 years ago

Comment 41 is private: false

Comment 42 is private: false

Comment 46 is private: false

Gabriele Svelto [:gsvelto]

Updated

•

2 years ago

Comment 37 is private: false

Comment 38 is private: false

Comment 40 is private: false

Jan de Mooij [:jandem]

Assignee

Comment 48

•

2 years ago

•

Edited

(In reply to Gabriele Svelto [:gsvelto] from comment #47)

So we can probably work around this but we need to (temporarily) bump the stack pointer to make sure the probes don't exceed that range. I don't know how feasible it is but it would avoid this issue altogether.

It's possible to use the stack pointer register for this loop. I'll do a try push with and without that change so people can test it.

Try push: https://treeherder.mozilla.org/jobs?repo=try&revision=ed3f8350580d42b058546306511543f60ce16c1f

BugBot [:suhaib / :marco/ :calixte]

Comment 49

•

2 years ago

The bug is linked to a topcrash signature, which matches the following criteria:

Top 20 desktop browser crashes on release (startup)
Top 20 desktop browser crashes on beta
Top 10 desktop browser crashes on nightly
Top 10 content process crashes on beta
Top 10 content process crashes on release
Top 5 desktop browser crashes on Linux on beta
Top 5 desktop browser crashes on Linux on release

For more information, please visit BugBot documentation.

Keywords: topcrash, topcrash-startup

BugBot [:suhaib / :marco/ :calixte]

Comment 50

•

2 years ago

The bug is marked as tracked for firefox114 (release), tracked for firefox115 (beta) and tracked for firefox116 (nightly). We have limited time to fix this, the soft freeze is in 6 days. However, the bug still isn't assigned.

:sdetar, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(sdetar)

François Bienvenu

Comment 51

•

2 years ago

The bug has been fixed for me, without me updating any software on my computer or doing anything to Firefox.

I guess that means that Google has changed something.

Jan de Mooij [:jandem]

Assignee

Comment 52

•

2 years ago

Attached file Browser test — Details

A browser test that creates an OSR stack frame with a similar size as on Google Images.

jcristau confirmed this triggers the crash too.

Jan de Mooij [:jandem]

Assignee

Updated

•

2 years ago

Attachment #9340700 - Attachment mime type: application/octet-stream → text/html

Quentin Godfroy

Comment 53

•

2 years ago

(In reply to Jan de Mooij [:jandem] from comment #52)

A browser test that creates an OSR stack frame with a similar size as on Google Images.

I confirm it crashes for me too.

Gabriele Svelto [:gsvelto]

Comment 54

•

2 years ago

This should crash if run in automation too.

Teo V.

Comment 55

•

2 years ago

(In reply to Jan de Mooij [:jandem] from comment #52)

Created attachment 9340700 [details]
Browser test

A browser test that creates an OSR stack frame with a similar size as on Google Images.

jcristau confirmed this triggers the crash too.

Yep, can confirm tab crashes for me too. Wondered why it didn't render anything, then realized it was a bare <script> tag in the file 😆
I'm on 114.0.2 (64-bit), Debian 10 fully updated.

Jan de Mooij [:jandem]

Assignee

Comment 56

•

2 years ago

For people that can reproduce this, below are two Linux64 builds. One of them should crash and the other has the fix. Can you confirm this?

Julien Cristau [:jcristau]

Comment 57

•

2 years ago

Build 1: no crash
Build 2: crash

Quentin Godfroy

Comment 58

•

2 years ago

Same here, 1) the javascript test file says "ok" while 2) crashes

Jan de Mooij [:jandem]

Assignee

Comment 59

•

2 years ago

Great, thanks. That confirms the analysis and suggested fix from gsvelto in comment 47.

(In reply to Gabriele Svelto [:gsvelto] from comment #54)

This should crash if run in automation too.

Hm not for the JS shell jobs, but I see those use a Debian 11 image so are probably different from the browser tests that use the ubuntu1804 image afaict.

Jan de Mooij [:jandem]

Assignee

Comment 60

•

2 years ago

Attached file Bug 1839669 - Use stack pointer register for stack probes to fix crashes on older Linux kernels. r?iain! — Details

Google Images creates a huge stack frame with more than 19550 slots (more than 150 KB)
and then uses OSR to enter Baseline Interpreter code.

The stack probing we do there caused crashes because older kernels don't like it when
the distance between the address and RSP is more than about 64 KB.
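
As a quick sanity check of the sizes in this description, the arithmetic can be written out. This sketch assumes each slot holds one 8-byte value (an assumption, not stated in the patch description) and uses the 65536 + 256 distance from comment 47 as the "about 64 KB" limit:

```python
SLOTS = 19550                  # lower bound from the patch description
SLOT_BYTES = 8                 # assumption: one 64-bit value per slot
ALLOWANCE = 65536 + 256        # limit quoted in comment 47

frame_bytes = SLOTS * SLOT_BYTES
print(frame_bytes)                         # 156400
print(round(frame_bytes / 1024, 1))        # 152.7
print(frame_bytes > ALLOWANCE)             # True
```

Under these assumptions the frame is more than twice the distance older kernels tolerate between a touched address and RSP.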

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → jdemooij

Status: NEW → ASSIGNED

Steven DeTar [:sdetar]

Updated

•

2 years ago

Flags: needinfo?(sdetar)

Iain Ireland [:iain]

Updated

•

2 years ago

Duplicate of this bug: 1721020

0x80

Comment 62

•

2 years ago

(In reply to Jan de Mooij [:jandem] from comment #59)

Great, thanks. That confirms the analysis and suggested fix from gsvelto in comment 47.

Both 1 and 2 are crashing in here in Debian Buster, 4.19.0-17-amd64, libc 2.28-10+deb10u2.

DMESG for 1:

[11701.901604] Isolated Web Co[21339]: segfault at 7ffe75d1da68 ip 000026517e73b436 sp 00007ffe75d3fa68 error 6
[11701.901611] Code: 00 00 50 55 48 89 e5 48 83 ec 48 48 89 e3 48 89 d0 48 c1 e0 03 48 2b d8 48 89 e0 48 2d 00 08 00 00 48 3b c3 0f 82 0e 00 00 00 <c7> 00 00 00 00 00 48 2d 00 08 00 00 eb e9 48 89 e3 48 89 d6 c1 e6

and 2:
[11839.751296] Isolated Web Co[21731]: segfault at 7ffca25a28b8 ip 00002e99eb11a436 sp 00007ffca25c58b8 error 6
[11839.751303] Code: 00 00 50 55 48 89 e5 48 83 ec 48 48 89 e3 48 89 d0 48 c1 e0 03 48 2b d8 48 89 e0 48 2d 00 08 00 00 48 3b c3 0f 82 0e 00 00 00 <c7> 00 00 00 00 00 48 2d 00 08 00 00 eb e9 48 89 e3 48 89 d6 c1 e6
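
These dmesg lines can be checked against the earlier analysis. The sketch below (helper names are hypothetical) computes how far below sp the faulting address lies and extracts the instruction bytes at the <..> marker, which should match the mov dword [rax], 0x0 store mentioned in comment 44. The Code: string is a shortened copy of the one above.

```python
import re

SEGFAULT_RE = re.compile(r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+)")
PROBE_STORE = "c7 00 00 00 00 00"   # x86-64 encoding of: mov dword [rax], 0x0
ALLOWANCE = 65536 + 256             # distance quoted in comment 47

def sp_gap(segfault_line):
    """Return how many bytes below sp the faulting address lies."""
    m = SEGFAULT_RE.search(segfault_line)
    fault, _ip, sp = (int(g, 16) for g in m.groups())
    return sp - fault

def faulting_bytes(code_line, count=6):
    """Return `count` instruction bytes starting at the <..>-marked byte."""
    tokens = code_line.split("Code:")[1].split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return " ".join(t.strip("<>") for t in tokens[start:start + count])

line1 = "[11701.901604] Isolated Web Co[21339]: segfault at 7ffe75d1da68 ip 000026517e73b436 sp 00007ffe75d3fa68 error 6"
line2 = "[11839.751296] Isolated Web Co[21731]: segfault at 7ffca25a28b8 ip 00002e99eb11a436 sp 00007ffca25c58b8 error 6"
code = "[11701.901611] Code: 48 3b c3 0f 82 0e 00 00 00 <c7> 00 00 00 00 00 48 2d 00 08 00 00 eb e9"

print(sp_gap(line1))                         # 139264
print(sp_gap(line2))                         # 143360
print(faulting_bytes(code) == PROBE_STORE)   # True
```

Both faults are more than 65536 + 256 bytes below sp and both hit the probe store, which is consistent with the stack-probe explanation.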

Pulsebot

Comment 63

•

2 years ago

Pushed by jdemooij@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/304d01f5488b Use stack pointer register for stack probes to fix crashes on older Linux kernels. r=iain

Jan de Mooij [:jandem]

Assignee

Comment 64

•

2 years ago

(In reply to 0x80 from comment #62)

Both 1 and 2 are crashing in here in Debian Buster, 4.19.0-17-amd64, libc 2.28-10+deb10u2.

DMESG for 1:

Are you sure that was with the first build? I used a disassembler for the "Code:" output and it's the code without the fix. Maybe try closing all Firefox instances to ensure it's not reusing it.

0x80

Comment 65

•

2 years ago

(In reply to Jan de Mooij [:jandem] from comment #64)

(In reply to 0x80 from comment #62)

Both 1 and 2 are crashing in here in Debian Buster, 4.19.0-17-amd64, libc 2.28-10+deb10u2.

DMESG for 1:

Are you sure that was with the first build? I used a disassembler for the "Code:" output and it's the code without the fix. Maybe try closing all Firefox instances to ensure it's not reusing it.

You're right. Another instance was open and induced the crash on Nightly 116. It is now confirmed that the fix works! Sorry about that.

jmgonk

Comment 66

•

2 years ago

Build 1: no crash
Build 2: crash, dmesg "Isolated Web Co[11255]: segfault at 7ffe04226e68 ip 0000308ecf79b436 sp 00007ffe04242668 error 6"

Mike Hommey [:glandium]

Comment 67

•

2 years ago

(In reply to Jan de Mooij [:jandem] from comment #59)

Hm not for the JS shell jobs, but I see those use a Debian 11 image so are probably different from the browser tests that use the ubuntu1804 image afaict.

Of course, this all being run in docker means the distro in the image doesn't matter. What matters is the host kernel. uname -a on build workers says: Linux e2dd8429b993 5.4.0-1106-gcp #115~18.04.1-Ubuntu SMP Mon May 22 20:46:39 UTC 2023 x86_64 GNU/Linux
and on ubuntu 18.04 test workers:
Linux 1978ca2e4d43 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux (unlike what the kernel version says, it's not running on AWS)
Don't ask me why they're not using the same kernel.
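
A rough way to compare these workers against the crash data is to parse the kernel release strings. The sketch below uses comment 42's observation (newest crashing kernel seen was 4.19) as a heuristic cut-off. This is an assumption drawn from crash sampling, not the exact version where the kernel behaviour changed, and distro kernels may carry backports. The function names are made up for illustration.

```python
import re

NEWEST_CRASHING = (4, 19)   # from the sampling in comment 42; not an exact boundary

def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.4.0-1106-gcp'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))

def in_crashing_range(release):
    """True if the release is no newer than the newest kernel seen crashing."""
    return kernel_version(release) <= NEWEST_CRASHING

# Build worker, test worker, and the Debian Buster kernel from comment 62.
for release in ("5.4.0-1106-gcp", "4.4.0-1014-aws", "4.19.0-17-amd64"):
    print(release, in_crashing_range(release))
```

By this heuristic the build workers (5.4) fall outside the crashing range and the test workers (4.4) inside it, which would fit the JS shell jobs not crashing. On a live system, platform.release() returns a string in the same format.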

Mike Hommey [:glandium]

Comment 68

•

2 years ago

(Of course, js shell jobs run on build workers)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 69

•

2 years ago

(In reply to Mike Hommey [:glandium] from comment #67)

Don't ask me why they're not using the same kernel.

I'm not sure if they're still doing it, but in the past Ubuntu has had a complicated approach to kernel versions — they'd backport newer kernels from the short-term releases to older LTS releases to support new hardware, and point releases of the LTS branches would use the new kernel by default if the system was initially installed as that release but not if it was upgraded from an older point release (I assume so that working installs would have less chance of regressions, but a new install might need new hardware support). So you could have two systems that are the “same version” of Ubuntu, but on different major versions of the kernel.

But also, it looks like 18.04.0 shipped with kernel 4.15; 4.4 was the kernel for 16.04, so that's weird.

Cristina Horotan [:chorotan]

Comment 70

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/304d01f5488b

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox116: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 116 Branch

Donal Meehan [:dmeehan]

Comment 71

•

2 years ago

:jandem could you add an uplift request on this?
This week is 115 RC, is this safe to take into RC or does it need additional time?

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Assignee

Comment 72

•

2 years ago

Comment on attachment 9340720 [details]
Bug 1839669 - Use stack pointer register for stack probes to fix crashes on older Linux kernels. r?iain!

Beta/Release Uplift Approval Request

User impact if declined: Crashes on Google Images on older Linux kernels.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Pretty small and self-contained fix that has been tested in Nightly for a few days.
String changes made/needed:
Is Android affected?: No

Flags: needinfo?(jdemooij)

Attachment #9340720 - Flags: approval-mozilla-beta?

Jan de Mooij [:jandem]

Assignee

Updated

•

2 years ago

Attachment #9340720 - Flags: approval-mozilla-release?

Attachment #9340720 - Flags: approval-mozilla-esr115?

Attachment #9340720 - Flags: approval-mozilla-esr102?

Donal Meehan [:dmeehan]

Comment 73

•

2 years ago

Comment on attachment 9340720 [details]
Bug 1839669 - Use stack pointer register for stack probes to fix crashes on older Linux kernels. r?iain!

Approved for 115.0 RC1.

Rejecting release approval request, 115.0 is in RC week and there are no planned dot releases on 114.0.
Clearing the esr115 approval, uplifting to 115 will also ensure it's included in esr115.

Attachment #9340720 - Flags: approval-mozilla-release?

Attachment #9340720 - Flags: approval-mozilla-release-

Attachment #9340720 - Flags: approval-mozilla-esr115?

Attachment #9340720 - Flags: approval-mozilla-beta?

Attachment #9340720 - Flags: approval-mozilla-beta+

Donal Meehan [:dmeehan]

Comment 74

•

2 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/74ea28917b97

status-firefox115: affected → fixed

Flags: in-testsuite+

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox114: affected → wontfix

tracking-firefox-esr102: --- → 115+

Donal Meehan [:dmeehan]

Comment 75

•

2 years ago

Comment on attachment 9340720 [details]
Bug 1839669 - Use stack pointer register for stack probes to fix crashes on older Linux kernels. r?iain!

Approved for 102.13esr.

Attachment #9340720 - Flags: approval-mozilla-esr102? → approval-mozilla-esr102+

Donal Meehan [:dmeehan]

Comment 76

•

2 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-esr102/rev/68d9c3b95fd1

status-firefox-esr102: affected → fixed

Brindusa Tot, DTE

Updated

•

2 years ago

Flags: qe-verify+

Alexandru Trif, Desktop Test Engineering [:atrif]

Comment 77

•

2 years ago

•

Edited

Hello!
Reproduced the issue with Firefox 114.0.2 on Debian GNU/Linux 10 and Debian 9 LXDE session Virtual Machine. After performing an image search on Google the tab crashes.
I can no longer reproduce the tab crash on the same operating systems after opening the link from comment 0 or after searching for images inside Google Search with Firefox 115.0, 116.0a1 (2023-06-26), and 102.13esr (treeherder build from comment 75).

Status: RESOLVED → VERIFIED

Has STR: --- → yes

status-firefox115: fixed → verified

status-firefox116: fixed → verified

status-firefox-esr102: fixed → verified

Flags: qe-verify+

raw data from about:support for crash, 2 years ago, David A. Madore, 42.59 KB, application/json
Browser test, 2 years ago, Jan de Mooij [:jandem], 562 bytes, text/html
Bug 1839669 - Use stack pointer register for stack probes to fix crashes on older Linux kernels. r?iain!, 2 years ago, Jan de Mooij [:jandem], 48 bytes, text/x-phabricator-request, flags: dmeehan: approval-mozilla-beta+, dmeehan: approval-mozilla-release-, dmeehan: approval-mozilla-esr102+