Crash in [@ guess_category_value] at startup in ESR 140.3
Categories
(Firefox Build System :: Third Party Packaging, defect)
Tracking
(firefox-esr115 unaffected, firefox-esr140 affected, firefox144 wontfix, firefox145 wontfix, firefox146 wontfix, firefox147 fix-optional)
| Tracking | Status | |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr140 | --- | affected |
| firefox144 | --- | wontfix |
| firefox145 | --- | wontfix |
| firefox146 | --- | wontfix |
| firefox147 | --- | fix-optional |
People
(Reporter: jesup, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: crash, csectype-uaf, regression)
Crash Data
e5e5 crashes that spiked on Linux starting at about Sept 15th or so. 10/day to ~200/day
Basically all startup crashes; yet another getenv() crash likely due to our calling setenv
See also bug 1989027 which this is basically a (secure) dup of, and bug 1977388, and bug 1981179 (https://treeherder.mozilla.org/logviewer?job_id=521025736&repo=autoland&lineNumber=2910)
The metabug for this is probably bug 1748366
Crash report: https://crash-stats.mozilla.org/report/index/0c91ba96-74b1-410a-bc48-5b1ae0251006
Reason:
SIGSEGV / SI_KERNEL
Top 10 frames:
0 libc.so.6 __GI_getenv stdlib/stdlib/getenv.c:84
1 libc.so.6 guess_category_value intl/intl/dcigettext.c:1573
1 libc.so.6 __dcigettext intl/intl/dcigettext.c:647
2 libgio-2.0.so.0 g_socket_output_stream_class_init gio/gsocketoutputstream.c:249
2 libgio-2.0.so.0 g_socket_output_stream_class_intern_init gio/gsocketoutputstream.c:57
3 libgobject-2.0.so.0 type_class_init_Wm gobject/gtype.c:2299
3 libgobject-2.0.so.0 g_type_class_ref gobject/gtype.c:3014
4 libgobject-2.0.so.0 g_object_new_valist gobject/gobject.c:2499
5 libgobject-2.0.so.0 g_object_new gobject/gobject.c:2040
6 libgio-2.0.so.0 g_socket_connection_get_output_stream gio/gsocketconnection.c:118
| Reporter | ||
Comment 1•3 months ago
|
||
Looks like it's an ESR issue, starting Sept 19th
Comment 2•3 months ago
|
||
There was a small number of crashes here, but the bulk of these crashes now are all ESR 140.3.x, and they are 100% start-up crashes. We don't need to hide this.
Comment 3•3 months ago
|
||
Should we land a patch to catch this setenv() pattern in mozglue?
Updated•3 months ago
|
Comment 4•3 months ago
|
||
This bug has been marked as a regression. Setting status flag for Nightly to affected.
Updated•3 months ago
|
Comment 5•3 months ago
|
||
(In reply to Thinker Li [:sinker] from comment #3)
Should we land a patch to catch this setenv() pattern in mozglue?
We have env_interposer.cpp, which takes a lock around calls to functions like getenv and setenv, but it can't catch that internal call to __GI_getenv (it's a local symbol, and I can see in the disassembly of __dcigettext on my system that it's just called directly, not via symbol lookup).
In theory we could use those hooks to try to assert that the environment isn't changed after a certain point in startup, but I don't know how many failures there would be and how much work it would be to fix them (considering that there could be calls from third-party libraries as well as our own code, and there could be cases that don't fail on CI but do fail in actual use, etc.).
Another option is to interpose and wrap the various forms of gettext; that would be kind of hacky and wouldn't cover any of the other places where libc might read the environment, but it should take care of all of the crashes :jesup mentioned.
(Incidentally, if anyone is going to add things to env_interposer: getenv_secure should also be in there.)
Comment 6•3 months ago
|
||
The crash might be happening because when setenv/putenv reallocates the environ buffer, __GI_getenv (which is statically linked in glibc) could still be reading from the old freed buffer. Our mozglue lock doesn't help here since __GI_getenv bypasses our wrapper.
I'm thinking we could use an RCU-like approach to fix this:
- Implement our own setenv/putenv that never immediately frees old environ buffers
- When we allocate a new buffer for environ, put the old buffer on a deferred-free list
- Send signals to all threads and check if they're currently in __GI_getenv (using IP range checks in the signal handler)
- Keep retrying until all threads have quiesced
- Only then free the old buffers
One caveat: I checked the glibc source and found that it does have some internal static calls to __setenv (in hurd/hurdmsg.c and posix/wordexp.c) that would bypass our wrapper. But these are pretty niche - Hurd isn't used on Linux (for Hurd OS I think) and wordexp is only for shell expansion. So this should still work.
It's definitely adding complexity, but it should fix the race. And, should we fix this issue? is it worth?
Updated•3 months ago
|
Updated•3 months ago
|
Comment 7•2 months ago
|
||
Is there something we could do on our side to mitigate the crashes?
Updated•2 months ago
|
Updated•2 months ago
|
| Reporter | ||
Comment 8•2 months ago
|
||
Thinker's idea might work, though it's complex. Perhaps we could try just leaking old envs, and see how many times envs actually get modified during firefox runs. (it's possible external OS/drivers might play with them though). A perf trace might be able to check this without doing all the work up front as well.
Comment 9•2 months ago
|
||
Note that this particular issue has been fixed in recent glibc version. We should at least try and ask distributions to uplift the (albeit complex) fix. As for the volume it appears elevated because the crashes are coming from the ESR channel. When crash reports come from the release channel they are throttled, so we retain only 10% of them. There's no throttling on the ESR channel, so a crash that happens 1k times per day on the release channel will only show 100 crashes per day on crash-stats, but you'll see all of the 1k crashes if it comes from ESR installations.
Updated•2 months ago
|
Comment 10•2 months ago
|
||
Just did more study on crash reports. According to crash reports, most of incidents (95+%) happen on Debian 12 (bookworm). And, their kernel are all built after latest update Debian 12.12 (Sept 6), I boldly guess they are mostly updated to Debian 12.12. And, our spike started roughly on Sep 18 or earlier, very close to Sep 6 the release date of Debian 12.12.
Comment 11•1 month ago
|
||
The bug report at Debian.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121748
Comment 12•1 month ago
|
||
Note that this particular issue has been fixed in recent glibc version.
Is this hence blocked on third party and we should flag it like this, to begin with?
Comment 13•1 month ago
|
||
Thanks for reminding. I've moved this bug to Third Party Packaging.
Comment 14•1 month ago
|
||
There's also the External Software Affecting Firefox category; not sure which one makes more sense here.
For reference, this is the glibc change, and it's in version 2.41, released 2025-01-30. Apparently that caused regressions, resulting in a followup, which was backed out, and then re-attempted, and that seems to have stuck.
I started to write a comment a while back and I thought I posted it but apparently not; basically I was going to suggest leaking the arrays and using exponential growth (as independently suggested by jesup and implemented by glibc), but also mention that there may still be some possibility of bugs if a thread-naive getenv, using non-atomic loads, races with the hypothetical interposed setenv. (But that part doesn't matter if we're not going to attempt that kind of workaround on our side.)
Updated•1 month ago
|
Comment 15•14 days ago
|
||
Don't think I have anything to add here beyond what has already been mentioned. Seems like this is at least in part an upstream issue which would be very difficult for us to fully mitigate, and anything else we try is a bit of a half-measure.
Description
•