Open Bug 1993100 Opened 3 months ago Updated 14 days ago

Crash in [@ guess_category_value] at startup in ESR 140.3

Tracking

(firefox-esr115 unaffected, firefox-esr140 affected, firefox144 wontfix, firefox145 wontfix, firefox146 wontfix, firefox147 fix-optional)

Status:

NEW

Tracking Flags:

Tracking

Status

firefox-esr115

---

unaffected

firefox-esr140

---

affected

firefox144

---

wontfix

firefox145

---

wontfix

firefox146

---

wontfix

firefox147

---

fix-optional

People

(Reporter: jesup, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, csectype-uaf, regression)

Crash Data

Randell Jesup [:jesup] (needinfo me)

Reporter

Description

•

3 months ago

e5e5 crashes that spiked on Linux starting at about Sept 15th or so. 10/day to ~200/day
Basically all startup crashes; yet another getenv() crash likely due to our calling setenv

See also bug 1989027 which this is basically a (secure) dup of, and bug 1977388, and bug 1981179 (https://treeherder.mozilla.org/logviewer?job_id=521025736&repo=autoland&lineNumber=2910)

The metabug for this is probably bug 1748366

Crash report: https://crash-stats.mozilla.org/report/index/0c91ba96-74b1-410a-bc48-5b1ae0251006

Reason:

SIGSEGV / SI_KERNEL

Top 10 frames:

0  libc.so.6  __GI_getenv  stdlib/stdlib/getenv.c:84
1  libc.so.6  guess_category_value  intl/intl/dcigettext.c:1573
1  libc.so.6  __dcigettext  intl/intl/dcigettext.c:647
2  libgio-2.0.so.0  g_socket_output_stream_class_init  gio/gsocketoutputstream.c:249
2  libgio-2.0.so.0  g_socket_output_stream_class_intern_init  gio/gsocketoutputstream.c:57
3  libgobject-2.0.so.0  type_class_init_Wm  gobject/gtype.c:2299
3  libgobject-2.0.so.0  g_type_class_ref  gobject/gtype.c:3014
4  libgobject-2.0.so.0  g_object_new_valist  gobject/gobject.c:2499
5  libgobject-2.0.so.0  g_object_new  gobject/gobject.c:2040
6  libgio-2.0.so.0  g_socket_connection_get_output_stream  gio/gsocketconnection.c:118

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 1

•

3 months ago

Looks like it's an ESR issue, starting Sept 19th

Daniel Veditz [:dveditz]

Comment 2

•

3 months ago

There was a small number of crashes here, but the bulk of these crashes now are all ESR 140.3.x, and they are 100% start-up crashes. We don't need to hide this.

Group: dom-core-security

Keywords: regression

OS: Unspecified → Linux

Summary: Crash in [@ guess_category_value] → Crash in [@ guess_category_value] at startup in ESR 140.3

Thinker Li [:sinker]

Comment 3

•

3 months ago

Should we land a patch to catch this setenv() pattern in mozglue?

Thinker Li [:sinker]

Updated

•

3 months ago

Severity: -- → S2

BugBot [:suhaib / :marco/ :calixte]

Comment 4

•

3 months ago

This bug has been marked as a regression. Setting status flag for Nightly to affected.

status-firefox145: --- → affected

Thinker Li [:sinker]

Updated

•

3 months ago

Severity: S2 → S3

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Comment 5

•

3 months ago

(In reply to Thinker Li [:sinker] from comment #3)

Should we land a patch to catch this setenv() pattern in mozglue?

We have env_interposer.cpp, which takes a lock around calls to functions like getenv and setenv, but it can't catch that internal call to __GI_getenv (it's a local symbol, and I can see in the disassembly of __dcigettext on my system that it's just called directly, not via symbol lookup).

In theory we could use those hooks to try to assert that the environment isn't changed after a certain point in startup, but I don't know how many failures there would be and how much work it would be to fix them (considering that there could be calls from third-party libraries as well as our own code, and there could be cases that don't fail on CI but do fail in actual use, etc.).

Another option is to interpose and wrap the various forms of gettext; that would be kind of hacky and wouldn't cover any of the other places where libc might read the environment, but it should take care of all of the crashes :jesup mentioned.

(Incidentally, if anyone is going to add things to env_interposer: getenv_secure should also be in there.)

Thinker Li [:sinker]

Comment 6

•

3 months ago

The crash might be happening because when setenv/putenv reallocates the environ buffer, __GI_getenv (which is statically linked in glibc) could still be reading from the old freed buffer. Our mozglue lock doesn't help here since __GI_getenv bypasses our wrapper.

I'm thinking we could use an RCU-like approach to fix this:

Implement our own setenv/putenv that never immediately frees old environ buffers
When we allocate a new buffer for environ, put the old buffer on a deferred-free list
Send signals to all threads and check if they're currently in __GI_getenv (using IP range checks in the signal handler)
Keep retrying until all threads have quiesced
Only then free the old buffers

One caveat: I checked the glibc source and found that it does have some internal static calls to __setenv (in hurd/hurdmsg.c and posix/wordexp.c) that would bypass our wrapper. But these are pretty niche - Hurd isn't used on Linux (for Hurd OS I think) and wordexp is only for shell expansion. So this should still work.

It's definitely adding complexity, but it should fix the race. And, should we fix this issue? is it worth?

Pascal Chevrel:pascalc

Updated

•

3 months ago

status-firefox144: --- → affected

status-firefox146: --- → affected

status-firefox-esr140: --- → affected

Donal Meehan [:dmeehan]

Updated

•

3 months ago

status-firefox144: affected → wontfix

Pascal Chevrel:pascalc

Comment 7

•

2 months ago

Is there something we could do on our side to mitigate the crashes?

Flags: needinfo?(rjesup)

Ryan VanderMeulen [:RyanVM]

Updated

•

2 months ago

status-firefox147: --- → affected

status-firefox-esr115: --- → unaffected

Donal Meehan [:dmeehan]

Updated

•

2 months ago

status-firefox145: affected → wontfix

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 8

•

2 months ago

Thinker's idea might work, though it's complex. Perhaps we could try just leaking old envs, and see how many times envs actually get modified during firefox runs. (it's possible external OS/drivers might play with them though). A perf trace might be able to check this without doing all the work up front as well.

Flags: needinfo?(thinker.li)

Flags: needinfo?(rjesup)

Flags: needinfo?(nika)

Flags: needinfo?(jstutte)

Flags: needinfo?(jld)

Gabriele Svelto [:gsvelto]

Comment 9

•

2 months ago

Note that this particular issue has been fixed in recent glibc version. We should at least try and ask distributions to uplift the (albeit complex) fix. As for the volume it appears elevated because the crashes are coming from the ESR channel. When crash reports come from the release channel they are throttled, so we retain only 10% of them. There's no throttling on the ESR channel, so a crash that happens 1k times per day on the release channel will only show 100 crashes per day on crash-stats, but you'll see all of the 1k crashes if it comes from ESR installations.

Donal Meehan [:dmeehan]

Updated

•

2 months ago

status-firefox146: affected → wontfix

Thinker Li [:sinker]

Comment 10

•

2 months ago

Just did more study on crash reports. According to crash reports, most of incidents (95+%) happen on Debian 12 (bookworm). And, their kernel are all built after latest update Debian 12.12 (Sept 6), I boldly guess they are mostly updated to Debian 12.12. And, our spike started roughly on Sep 18 or earlier, very close to Sep 6 the release date of Debian 12.12.

Flags: needinfo?(thinker.li)

Thinker Li [:sinker]

Comment 11

•

1 month ago

The bug report at Debian.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121748

Jens Stutte [:jstutte]

Comment 12

•

1 month ago

Note that this particular issue has been fixed in recent glibc version.

Is this hence blocked on third party and we should flag it like this, to begin with?

Flags: needinfo?(jstutte) → needinfo?(thinker.li)

Thinker Li [:sinker]

Comment 13

•

1 month ago

Thanks for reminding. I've moved this bug to Third Party Packaging.

Component: Widget: Gtk → Third Party Packaging

Flags: needinfo?(thinker.li)

Product: Core → Firefox Build System

Version: unspecified → Firefox 140

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Comment 14

•

1 month ago

There's also the External Software Affecting Firefox category; not sure which one makes more sense here.

For reference, this is the glibc change, and it's in version 2.41, released 2025-01-30. Apparently that caused regressions, resulting in a followup, which was backed out, and then re-attempted, and that seems to have stuck.

I started to write a comment a while back and I thought I posted it but apparently not; basically I was going to suggest leaking the arrays and using exponential growth (as independently suggested by jesup and implemented by glibc), but also mention that there may still be some possibility of bugs if a thread-naive getenv, using non-atomic loads, races with the hypothetical interposed setenv. (But that part doesn't matter if we're not going to attempt that kind of workaround on our side.)

Flags: needinfo?(jld)

Ryan VanderMeulen [:RyanVM]

Updated

•

1 month ago

status-firefox147: affected → fix-optional

Nika Layzell [:nika] (ni? for response)

Comment 15

•

14 days ago

Don't think I have anything to add here beyond what has already been mentioned. Seems like this is at least in part an upstream issue which would be very difficult for us to fully mitigate, and anything else we try is a bit of a half-measure.

Flags: needinfo?(nika)

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Crash in [@ guess_category_value] at startup in ESR 140.3

Categories

(Firefox Build System :: Third Party Packaging, defect)

Tracking

(firefox-esr115 unaffected, firefox-esr140 affected, firefox144 wontfix, firefox145 wontfix, firefox146 wontfix, firefox147 fix-optional)

People

(Reporter: jesup, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, csectype-uaf, regression)

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Updated

Comment 5

Comment 6

Updated

Updated

Comment 7

Updated

Updated

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Updated

Comment 15