Crash in [@ __GI_getenv]
Categories
(Core :: Graphics, defect, P2)
Tracking
()
People
(Reporter: gsvelto, Assigned: gsvelto)
References
Details
(6 keywords, Whiteboard: [adv-main114+r][adv-esr102.12+r])
Crash Data
Attachments
(2 files, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr102+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr102+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/c701126b-4276-4f79-99a5-b54ba0220124
Reason: SIGSEGV / SI_KERNEL
Top 10 frames of crashing thread:
0 libc.so.6 getenv stdlib/stdlib/getenv.c:84
1 libgallium_dri.so debug_get_bool_option src/util/u_debug.c:164
2 libgallium_dri.so st_create_context src/mesa/state_tracker/st_context.c:1067
3 libgallium_dri.so st_api_create_context src/mesa/state_tracker/st_manager.c:944
4 libgallium_dri.so dri_create_context src/gallium/frontends/dri/dri_context.c:163
5 libgallium_dri.so driCreateContextAttribs src/mesa/drivers/dri/common/dri_util.c:480
6 libEGL_mesa.so.0 dri2_create_context src/egl/drivers/dri2/egl_dri2.c:1573
7 libEGL_mesa.so.0 eglCreateContext src/egl/main/eglapi.c:826
8 libxul.so mozilla::gl::GLContextEGL::CreateGLContext gfx/gl/GLContextProviderEGL.cpp:718
9 libxul.so mozilla::gl::GLContextEGLFactory::CreateImpl gfx/gl/GLContextProviderEGL.cpp:298
Filing this under graphics because the crash happens inside Mesa even though it's not strictly a Mesa bug (or a graphics one for the matter). This is an use-after-free crash caused by a race that happens with at least two signatures and probably more that would be harder to find. The core of the issue is the following: at startup mesa is reading and parsing several environment variables (such as MESA_DEBUG, MESA_VERBOSE and more). This happens off the main thread. However, at the same time we call setenv()
several times from the main thread in our code. With the right timing the following can happen:
- Mesa calls
getenv()
and obtains a pointer inside the environment string corresponding to the requested variable on thread A - Thread A doesn't yet have the chance to parse the string it got from the environment (maybe because the scheduler stops it)
- Thread B thread calls
setenv()
, the environment pointer is replaced and the memory it held freed (thread B is most likely to be the main thread) - Thread A resumes execution inside Mesa code, the pointer it got from
getenv()
is now pointing to freed memory - As soon as Mesa tries to parse its contents Firefox crashes
Note that this bug had already been reported in bug 1653960 but we didn't realize what was going on back then.
I can't think of an easy solution to this problem given the inherently non-reentrant nature of getenv()
. I can think of horrible ones though such as shimming getenv()
and making it leak the returned values (or at least a subset of them that we know would be accessed off the main thread).
Comment 1•2 years ago
|
||
Child can create a GLContext as much as it wants, but I don't know if it can trigger whatever is doing the setenv() as reliably. Could you brute-force this race? The crash happens in the parent process so it seems difficult to trigger and difficult to exploit.
Comment 2•2 years ago
|
||
Brian Smith filed a few similar-ish bugs about the use of getenv/setenv on multiple threads from Gecko code (bug 1748361 and bug 1748366).
Assignee | ||
Comment 3•2 years ago
|
||
(In reply to Daniel Veditz [:dveditz] from comment #1)
Child can create a GLContext as much as it wants, but I don't know if it can trigger whatever is doing the setenv() as reliably. Could you brute-force this race? The crash happens in the parent process so it seems difficult to trigger and difficult to exploit.
I can probably find a way to trigger it with rr chaos mode. We don't use setenv()
at runtime AFAIK, just at startup so I don't think this is exploitable, at least not in an easy way.
Assignee | ||
Comment 4•2 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #2)
Brian Smith filed a few similar-ish bugs about the use of getenv/setenv on multiple threads from Gecko code (bug 1748361 and bug 1748366).
I agree with his analysis that we should avoid using them when possible but in many cases the fix is non-trivial. Plus we don't control external code so we need a solution. I'm really tempted of interposing alternate getenv()
/setenv()
calls to reduce the raciness to avoid the problem at least when accessing different environment variables (it's impossible to make them race-prone on the same variable short of always leaking the old value when setting a new one). That said I'm not even 100% sure how that would interact with code that accesses and interacts with the environment pointer directly. As horrible as that is I vaguely remember we do it somewhere.
Comment 5•2 years ago
|
||
Leaking the returned values is gross but probably effective. Our only other recourse seems like "wait to call any lib that calls getenv() until after we've finished all setenv()s", which seems harder, and I'm guessing the number of setenv()s we do is bounded fairly low?
Updated•1 year ago
|
Comment 6•1 year ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 5 desktop browser crashes on Linux on release (startup)
:bhood, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 7•1 year ago
|
||
Jeff reduced this to S3. I will not second-guess that decision.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 9•1 year ago
|
||
Adding more signatures caused by getenv()
. Many changed in the past few months.
Comment 10•1 year ago
|
||
Note that we'd have to hook getenv(), since other things don't call PR_GetEnv. And we have a bunch of PR_SetEnv's, though they seem to get used at startup (or shutdown), but we could still crash (witness this bug).
Comment 11•1 year ago
|
||
The severity field for this bug is set to S3
. However, the following bug duplicate has higher severity:
- Bug 1794309: S2
:bhood, could you consider increasing the severity of this bug to S2
?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 12•1 year ago
|
||
Yes, the only reasonable way to deal with this would be to leak the values returned by getenv()
(at least once per value) however synchronization is a problem too. I think we'd have to hook setenv()
/putenv()
/unsetenv()
too so that we serialize access WRT to getenv()
calls.
Assignee | ||
Comment 13•1 year ago
|
||
We don't seem to have calls to clearenv()
but it might be worth hooking that one too in the improbable case that we'd ever use it.
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 14•1 year ago
|
||
I've inspected a lot of crashes under this signature and it seems that the vast majority of them isn't caused by accesses to strings that have been freed but rather due to accesses to a version of the environment array that was freed. This is because within glibc calls to putenv()
/setenv()
are synchronized with similar calls, but calls to getenv()
aren't. So if getenv()
is walking the environment while a call to putenv()
/setenv()
arrives then the environment will be pulled right under it.
In the end this means that synchronizing all calls that access the environment should be enough to get rid of this crash without having to leak the individual entries. On the long run we probably want to make sure that setenv()
/putenv()
are called only on the main thread and only before we start other threads. That'll certainly require some intrusive changes so I'm putting that off for now.
Assignee | ||
Comment 15•1 year ago
|
||
Assignee | ||
Comment 16•1 year ago
|
||
This add interposers for getenv(), putenv(), setenv(), unsetenv() and
clearenv(). All interposers use a single lock for synchronization while
internally using the libc-provided functions. This is done to prevent races
that typically happen in Firefox code when multiple threads call getenv()
while others are changing variables using setenv() and putenv().
Depends on D164470
![]() |
||
Comment 17•1 year ago
|
||
Landed:
https://hg.mozilla.org/integration/autoland/rev/1adb41e2eb9b475e4afcdb3b70c05f7027d27709
https://hg.mozilla.org/integration/autoland/rev/afbd8f7148043d1fde87fd73041972e790dc46d2
https://hg.mozilla.org/integration/autoland/rev/6b8d96806ae696957c055dd26653044f020b9adb
Backed out for causing spider-monkey bustages: https://hg.mozilla.org/integration/autoland/rev/94f635311a2140b896f56c4710064938909b5640
Push with failures
Failure log
[task 2022-12-20T08:18:29.490Z] Creating config.status
[task 2022-12-20T08:18:29.573Z] Reticulating splines...
[task 2022-12-20T08:18:29.659Z] 0:00.10 File already read. Skipping: /builds/worker/workspace/sm-package/mozjs-110.0a1.0/build/unix/moz.build
[task 2022-12-20T08:18:29.687Z] Traceback (most recent call last):
[task 2022-12-20T08:18:29.687Z] File "/builds/worker/workspace/sm-package/mozjs-110.0a1.0/python/mozbuild/mozbuild/frontend/sandbox.py", line 159, in exec_file
[task 2022-12-20T08:18:29.687Z] source = six.ensure_text(self._finder.get(path).read())
[task 2022-12-20T08:18:29.687Z] AttributeError: 'NoneType' object has no attribute 'read'
[task 2022-12-20T08:18:29.687Z] During handling of the above exception, another exception occurred:
[task 2022-12-20T08:18:29.688Z] Traceback (most recent call last):
[task 2022-12-20T08:18:29.688Z] File "/builds/worker/workspace/sm-package/mozjs-110.0a1.0/python/mozbuild/mozbuild/frontend/reader.py", line 1096, in read_mozbuild
[task 2022-12-20T08:18:29.688Z] for s in self._read_mozbuild(
[task 2022-12-20T08:18:29.688Z] File "/builds/worker/workspace/sm-package/mozjs-110.0a1.0/python/mozbuild/mozbuild/frontend/reader.py", line 1168, in _read_mozbuild
[task 2022-12-20T08:18:29.688Z] sandbox.exec_file(path)
[task 2022-12-20T08:18:29.688Z] File "/builds/worker/workspace/sm-package/mozjs-110.0a1.0/python/mozbuild/mozbuild/frontend/reader.py", line 239, in exec_file
[task 2022-12-20T08:18:29.688Z] Sandbox.exec_file(self, path)
[task 2022-12-20T08:18:29.688Z] File "/builds/worker/workspace/sm-package/mozjs-110.0a1.0/python/mozbuild/mozbuild/frontend/sandbox.py", line 161, in exec_file
[task 2022-12-20T08:18:29.688Z] raise SandboxLoadError(
[task 2022-12-20T08:18:29.688Z] mozbuild.frontend.sandbox.SandboxLoadError: ([], <traceback object at 0x7fed882952c0>)
Assignee | ||
Comment 18•1 year ago
|
||
I'm a bit stumped as I cannot repro locally. I'll rebase and see what happens, maybe the automatic rebase didn't work?
![]() |
||
Comment 19•11 months ago
|
||
Move the pthread_thread_create() interposer under mozglue and prepare for having a single place where we place interposer functions r=glandium
https://hg.mozilla.org/integration/autoland/rev/8b644e9165899a061c853e0959a314f1e3cccab3
https://hg.mozilla.org/mozilla-central/rev/8b644e916589
Updated•11 months ago
|
Comment 20•11 months ago
|
||
The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox109
towontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 21•11 months ago
|
||
We've had this bug for ages and the vacations are closing on us, I think this can ride the trains.
Comment 22•11 months ago
|
||
backout |
Backed out for being the likely cause of the Android startup crashes being tracked in bug 1807716 and various dupes.
https://hg.mozilla.org/mozilla-central/rev/54cd49044c3dda67d437dc75808b281f1c9084b1
Assignee | ||
Comment 23•10 months ago
|
||
Jan, which device did you use to reproduce bug 1807716? I tried the emulator and an old Galaxy S3 with both Android 6 & 7 but I cannot reproduce the crash.
Comment 24•10 months ago
|
||
Probably a Nexus S in the emulator (STR here combined with a geckoview build as done here).
Note that the changes in the startup sequence affects this and you might need to revert what's there currently.
Assignee | ||
Comment 25•9 months ago
|
||
This add interposers for getenv(), putenv(), setenv(), unsetenv() and
clearenv(). All interposers use a single lock for synchronization while
internally using the libc-provided functions. This is done to prevent races
that typically happen in Firefox code when multiple threads call getenv()
while others are changing variables using setenv() and putenv().
Depends on D164470
Updated•9 months ago
|
Comment 26•9 months ago
|
||
Just checking that this was tested with the STR for the previous crashes occurring on Android. If not, I have bandwidth today to check this before we land it.
Assignee | ||
Comment 27•9 months ago
|
||
(In reply to Zac McKenney [:zmckenney] from comment #26)
Just checking that this was tested with the STR for the previous crashes occurring on Android. If not, I have bandwidth today to check this before we land it.
Not yet. I tried reproducing the setup described in comment 24 which was basically GeckoView AArch64 / Nexus S / Android 24 and the emulator enters a bootloop which isn't very helpful. I won't land this before double-checking that it works.
Assignee | ||
Comment 28•9 months ago
|
||
I've managed to test the patches using the setup from comment 24 but using the i686 emulator and build as the AArch64 emulator would enter a boot loop on startup. With this setup I could launch GeckoView and intercept startup crashes in both the main process and child processes, proving that the interposers are in place and functioning as expected. If you think this is enough I can land it today.
Note that I tested the GeckoView example, not Fenix itself, I'm not sure if that'd make a difference.
Comment 29•9 months ago
|
||
The GeckoView example doesn't have Glean.
Then again bug 1807716 was crashing on pthread_create
and I don't think that's touched now in this patch?
Assignee | ||
Comment 30•9 months ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #29)
The GeckoView example doesn't have Glean.
Then again bug 1807716 was crashing onpthread_create
and I don't think that's touched now in this patch?
Yes it is, what this patch does is replace pthread_create()
with an interposer which sets up an alternate signal stack then calls the libc-provided version of pthread_create()
. The crash was happening because somehow we couldn't find the original pthread_create()
function in libc. I've tweaked my patch to use the custom Android linker which is used on older versions of Android and I believe that's what was needed to avoid bug 1807716, however since I never managed to reproduce that crash I cannot be 100% sure.
BTW in the meantime I also tested API 24 on 32-bit ARM and that also seems to work fine.
Assignee | ||
Comment 31•9 months ago
|
||
Actually this patch doesn't change the pthread_create()
interposer as that was introduced in an older patch, but it moves it in mozglue so that it's loaded in the main process too (before it was active only in the child processes) and adds the getenv()
(& friends) interposers.
Comment 32•9 months ago
|
||
:gsvelto I'll test this today to confirm it doesn't cause the same crashing as before in Fenix.
Assignee | ||
Comment 33•9 months ago
|
||
Thank you so much 🙏
Comment 34•9 months ago
•
|
||
:gsvelto sorry for the delay, I lost power all of yesterday and was not able to test. Today I tested and show a 100% crash at launch rate when these patches are applied. I don't know much about the Android linker you added but if I can be helpful please let me know!
edit: I ran a debug build so adding the additional information here:
2023-03-16 00:46:30.648 25012-25012 DEBUG pid-25012 A Abort message: 'Hit MOZ_CRASH(getenv() interposition failed but the interposer function is still being called, this won't work!) at /Users/mozilla/StudioProjects/gecko/mozglue/interposers/InterposerHelper.h:35
Here is the non-debug stack trace in Fenix:
Fatal signal 11 (SIGSEGV), code 1, fault addr 0x0 in tid 30228 (NimbusDbScope-t)
2023-03-15 16:26:35.474 30230-30230 DEBUG pid-30230 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A Build fingerprint: 'google/sdk_google_phone_arm64/generic_arm64:7.1.1/NYC/8695018:userdebug/test-keys'
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A Revision: '0'
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A ABI: 'arm64'
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A pid: 30210, tid: 30228, name: NimbusDbScope-t >>> org.mozilla.fenix.debug <<<
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x0 0000007839780ef4 x1 000000783964b9e6 x2 0000000000000001 x3 0000000000000021
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x4 0000000000000000 x5 0000000000000000 x6 0000007839780f54 x7 64736466fefefefe
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x8 0000000000000000 x9 0000007839780ec0 x10 0000000000000023 x11 0101010101010101
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x12 0000000000000010 x13 00000078396518c4 x14 00000078639f3000 x15 00000078639f3000
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x16 000000783977cb08 x17 000000786278c908 x18 00000000ffffffff x19 000000783964abca
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x20 000000783977e6a0 x21 000000783977f448 x22 0000000000000001 x23 0000000000000008
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x24 0000000000000000 x25 0000007863daeed9 x26 00000078396af3b0 x27 0000007863f41000
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A x28 0000000000000000 x29 0000007843f05d50 x30 00000078396b69a4
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A sp 0000007843f05d50 pc 00000078396b69dc pstate 0000000060000000
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A
backtrace:
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A #00 pc 00000000000249dc /data/app/org.mozilla.fenix.debug-1/lib/arm64/libmozglue.so (offset 0x9a000)
2023-03-15 16:26:35.475 30230-30230 DEBUG pid-30230 A #01 pc 00000000000249a0 /data/app/org.mozilla.fenix.debug-1/lib/arm64/libmozglue.so (offset 0x9a000)
2023-03-15 16:26:35.548 924-924 lowmemorykiller pid-924 E Error writing /proc/30210/oom_score_adj; errno=22
Comment 35•8 months ago
|
||
There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 36•8 months ago
|
||
This still needs more testing.
Updated•8 months ago
|
Updated•7 months ago
|
Comment 37•7 months ago
|
||
There only seem to be ~40 uses of PR_SetEnv in the tree, currently. Perhaps it's easier to move them? The one I think would be hard is in Logging .cpp, and there's the nsEnvironment.cpp one that sets all the incoming env var. I suspect many could be moved, if not most. SetEnv is exposed to JS, however, in nsEnvironment::Set().
Looking at the crashes, 90% seem to be coming from GL drivers, and almost all are startup crashes. Likely some other thread is tweaking the env while gl is searching it. Perhaps something can be done to mitigate the risk for that particular usecase? Moving something?
This would be hard to exploit since it's (usually) a startup crash.
Any chance of actually landing this? It's a pretty common crash; 30-40 crashes/day (and most are release, so roughly 10x that).
Assignee | ||
Comment 38•7 months ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #37)
Any chance of actually landing this? It's a pretty common crash; 30-40 crashes/day (and most are release, so roughly 10x that).
I really want to land it but I've still been unable to reproduce the crash seen in comment 34. The crashes are probably triggered by our calls to setenv()
but tend to happen in external libraries, and in particular in Mesa which makes very liberal use of environment variables.
Comment 39•7 months ago
|
||
Given that this is our Android/GeckoView topcrasher, it probably deserves an S2.
Assignee | ||
Comment 40•7 months ago
|
||
FYI I'm actively working on this.
Assignee | ||
Comment 41•7 months ago
|
||
My Android emulator helpfully decided that today was the right day to crash on startup :(
Assignee | ||
Comment 42•7 months ago
|
||
After much wrangling with my setup I managed to get the emulator working again and I'm finally reproducing the issue locally. Interestingly this only happens with Fenix but not with the GeckoView example app. Hopefully I'll be able to fix the issue and land this tomorrow.
Assignee | ||
Comment 43•7 months ago
|
||
Here's my analysis of what's going on when dealing with Fenix builds and what I can do about it: somehow in older versions of Android libc isn't yet loaded when the interposers from libmozglue are first executed. I don't know why that's the case but it only seems to happen on Android 7 and only with Fenix. The GeckoView example app seems to be unaffected.
The fix is rather simple though I'd like some feedback about its potential downsides: if dlsym()
fails when loading the real symbol instead of crashing we dlopen()
libc and then retry dlsym()
. The second call to dlsym()
always work in this case (which makes sense as we've guaranteed that libc is loaded).
Comment 44•7 months ago
|
||
Wait, that makes no sense. libmozglue has a dependency on libc. How would it load libmozglue without loading libc first? Can you get a dynamic loader log?
Assignee | ||
Comment 45•7 months ago
|
||
I'm trying because whatever I did works for pthread_create()
but not for getenv()
so there must be something else going on.
Assignee | ||
Comment 46•7 months ago
|
||
I've got the linker log, it's humongous and puzzling. libc.so gets loaded from a Java runtime library then unloaded and finally we crash... and the point where we crash is being executed from within Kotlin code. This is extremely confusing.
Assignee | ||
Comment 47•7 months ago
|
||
Ah, I see what's going on. We're crashing here within our own linker.
Assignee | ||
Comment 48•7 months ago
|
||
Here's what's happening, at least from what I could puzzle out from the logs and testing. We're on Android API 25 so __wrap_dlsym()
is essentially a no-op. That being said the sequence of events is the following:
- Android's Java runtime loads libc
- Stuff happens
- Android's Java runtime decides it's done with libc and
dlclose()
s it, but it doesn't get unloaded (presumably it's used somewhere else) - Android's Java runtime loads
libxul.so
, which requireslibmozglue.so
which requireslibc.so
... - ... but somehow even though
libc.so
has already been loaded whenlibxul.so
callsgetenv()
(or another interposer) it ends up inlibmozglue.so
And here's where things get interesting. Calling dlsym(RTLD_NEXT, "getenv")
from within the interposer returns NULL
. This makes sense because libc.so
was loaded before libmozglue.so
. However calling dlsym(RTLD_DEFAULT, "getenv")
returns libc's getenv()
, not libmozglue's. This is definitely counterintuitive... but it seems to work, at least for this particular setup. In more recent versions of Android the linking sequence follows the path we'd expect.
![]() |
||
Comment 49•7 months ago
|
||
Landed:
https://hg.mozilla.org/integration/autoland/rev/1701c20a5da39e243c52c6c8115787d414924464
https://hg.mozilla.org/integration/autoland/rev/4ecb79ef23ae74066a7e287191f29f5e79fa4728
Backed out for mochitest failures:
https://hg.mozilla.org/integration/autoland/rev/b6e26d8004e25d114ac44303df28890bea866779
Push with failures
Failure log
[task 2023-05-09T21:33:31.658Z] 21:33:31 INFO - MochitestServer : launching ['/builds/worker/workspace/build/tests/bin/xpcshell', '-g', '/builds/worker/workspace/build/application/firefox', '-f', '/builds/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmphoddfhey.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/builds/worker/workspace/build/tests/mochitest/server.js']
[task 2023-05-09T21:33:31.658Z] 21:33:31 INFO - runtests.py | Server pid: 1566
[task 2023-05-09T21:33:31.672Z] 21:33:31 INFO - runtests.py | Websocket server pid: 1569
[task 2023-05-09T21:33:31.678Z] 21:33:31 INFO - runtests.py | SSL tunnel pid: 1573
[task 2023-05-09T21:33:31.688Z] 21:33:31 INFO - error: XDG_RUNTIME_DIR not set in the environment.
[task 2023-05-09T21:33:31.782Z] 21:33:31 INFO - Setting pipeline to PAUSED ...
[task 2023-05-09T21:33:31.784Z] 21:33:31 INFO - Pipeline is PREROLLING ...
[task 2023-05-09T21:33:31.787Z] 21:33:31 INFO - Pipeline is PREROLLED ...
[task 2023-05-09T21:33:31.788Z] 21:33:31 INFO - Setting pipeline to PLAYING ...
[task 2023-05-09T21:33:31.789Z] 21:33:31 INFO - New clock: GstSystemClock
[task 2023-05-09T21:33:31.822Z] 21:33:31 INFO - Got EOS from element "pipeline0".
[task 2023-05-09T21:33:31.822Z] 21:33:31 INFO - Execution ended after 0:00:00.032809848
[task 2023-05-09T21:33:31.824Z] 21:33:31 INFO - Setting pipeline to PAUSED ...
[task 2023-05-09T21:33:31.825Z] 21:33:31 INFO - Setting pipeline to READY ...
[task 2023-05-09T21:33:31.826Z] 21:33:31 INFO - (gst-launch-1.0:1522): GStreamer-CRITICAL **: 21:33:31.825: gst_object_unref: assertion '((GObject *) object)->ref_count > 0' failed
[task 2023-05-09T21:33:31.827Z] 21:33:31 INFO - Setting pipeline to NULL ...
[task 2023-05-09T21:33:31.827Z] 21:33:31 INFO - Freeing pipeline ...
[task 2023-05-09T21:36:32.273Z] 21:36:32 ERROR - TEST-UNEXPECTED-FAIL | runtests.py | Timed out while waiting for server startup.
Assignee | ||
Comment 50•7 months ago
|
||
I've figured out what's going wrong on the runners: when starting xpcshell we call the interposer's getenv()
in malloc_init()
which triggers a memory allocation via dlsym()
which then re-enters getenv()
leading to a deadlock. This one is going to be harder to fix :-/
Assignee | ||
Comment 51•7 months ago
|
||
Try runs using a re-implemented getenv()
look promising:
Assignee | ||
Comment 52•7 months ago
|
||
Try runs are looking good, let's try again.
Assignee | ||
Comment 53•7 months ago
|
||
Note: there's a very long tail of crashes on Fenix & friends where the stack contains std::sys::unix::os::getenv()
. Crash signatures are all over the place but they must account for up to ~1k crashes per week.
Assignee | ||
Comment 54•7 months ago
|
||
I was backed out because of a missing include, meh. New try run:
https://treeherder.mozilla.org/jobs?repo=try&revision=759ab077e0f1990e5b17034281ce3b389b8c0360
![]() |
||
Comment 55•7 months ago
|
||
For the record:
Landed:
https://hg.mozilla.org/integration/autoland/rev/a53bc961d9587fdedf3a448f4f99df22082d67b1
https://hg.mozilla.org/integration/autoland/rev/621d691fcf431b5d9e4e03613c98534597a2562d
Backed out
https://hg.mozilla.org/integration/autoland/rev/b47e72d1ad37d7511822ddab0e96f43b8aa96311
Assignee | ||
Comment 56•7 months ago
|
||
Try is green for SpiderMonkey too, let's hope I don't burn the trees again.
(somehow no matter how many tasks I put in my try runs they're never enough to cover all bases)
![]() |
||
Comment 57•7 months ago
|
||
Move the pthread_thread_create() interposer under mozglue and prepare for having a single place where we place interposer functions r=glandium
https://hg.mozilla.org/integration/autoland/rev/1760c9f902bf4ba01f8a40d4acba3c6d5ad1a93d
https://hg.mozilla.org/mozilla-central/rev/1760c9f902bf
Add interposers for functions manipulating the environment to prevent crashes r=glandium
https://hg.mozilla.org/integration/autoland/rev/79dc5e93cef43556bcc4bae7f3046ba597a334e5
https://hg.mozilla.org/mozilla-central/rev/79dc5e93cef4
Updated•7 months ago
|
Comment 58•7 months ago
|
||
Zack, can you please keep an eye out for any crash spikes from this now that it's re-landed? I'm assuming we're going to want to let this bake for a bit on Nightly before we look at uplifting.
Comment 59•7 months ago
|
||
Tested on API level 27 (emulator Pixel 6 Pro API 27) with launch crashes occurring again. Built Gecko locally with debug and stacktrace shows it's an issue with the InterposerHelper that was added (line 58 triggers the MOZ_CRASH). Dropping the stacktrace below.
After backing out the 2 revisions libxul is no longer found when Nimbus tries to apply pending experiments so just to be aware before we back these out (error shown: Nimbus error: Nimbus Rust: applyPendingExperiments
). Stacktrace below.
Stacktrace for current nightly
2023-05-11 11:49:51.757 6269-6303 MOZ_CRASH org.mozilla.fenix.debug A Hit MOZ_CRASH(We could not obtain the real putenv(). Calling the symbol we got would make us enter an infinite loop so stop here instead.) at /Users/mozilla/StudioProjects/gecko/mozglue/interposers/InterposerHelper.h:58
2023-05-11 11:49:51.757 6269-6303 libc org.mozilla.fenix.debug A Fatal signal 11 (SIGSEGV), code 1, fault addr 0x0 in tid 6303 (Gecko), pid 6269 (lla.fenix.debug)
2023-05-11 11:49:51.764 6305-6305 cutils-trace pid-6305 E Error opening trace file: No such file or directory (2)
2023-05-11 11:49:51.780 6306-6306 DEBUG pid-6306 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2023-05-11 11:49:51.780 6306-6306 DEBUG pid-6306 A Build fingerprint: 'google/sdk_gphone_arm64/generic_arm64:8.1.0/OSM1.180201.044.D1/8352955:userdebug/dev-keys'
2023-05-11 11:49:51.780 6306-6306 DEBUG pid-6306 A Revision: '0'
2023-05-11 11:49:51.780 6306-6306 DEBUG pid-6306 A ABI: 'arm64'
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A pid: 6269, tid: 6303, name: Gecko >>> org.mozilla.fenix.debug <<<
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A Cause: null pointer dereference
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x0 00000000000000e8 x1 0000007cadcfced0 x2 0000000000000004 x3 0000000000000003
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x4 0000000040100401 x5 40404000a800a800 x6 0000000000000000 x7 7f7f7f7fffff7f7f
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x8 0000007cbb6592e8 x9 0000000000000000 x10 0000007cadcfd070 x11 00000000000000dd
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x12 0000000000000038 x13 ffffffffffffffff x14 ff00000000000000 x15 ffffffffffffffff
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x16 0000007d57322cd0 x17 0000007d57033e54 x18 0000007cadcfde38 x19 0000007cbb65931c
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x20 000000000000003a x21 0000007cbb657890 x22 0000007cadcfd8ec x23 0000007cc0bfa498
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x24 0000000000000004 x25 0000007cadcfe588 x26 0000007cd526cea0 x27 0000000000000001
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A x28 0000000000000002 x29 0000007cadcfd5e0 x30 0000007cbb567978
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A sp 0000007cadcfd5e0 pc 0000007cbb567988 pstate 0000000060000000
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A
backtrace:
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A #00 pc 000000000002c988 /data/app/org.mozilla.fenix.debug-r5lKHy6XYrpKNj1VfgxQzQ==/lib/arm64/libmozglue.so (offset 0xb1000)
2023-05-11 11:49:51.781 6306-6306 DEBUG pid-6306 A #01 pc 000000000002c974 /data/app/org.mozilla.fenix.debug-r5lKHy6XYrpKNj1VfgxQzQ==/lib/arm64/libmozglue.so (offset 0xb1000)
2023-05-11 11:49:51.975 1130-1130 /system/bin/tombstoned tombstoned E Tombstone written to: /data/tombstones/tombstone_29
Stacktrace for revision backout causing nimbus error
Nimbus error: Nimbus Rust: applyPendingExperiments
java.lang.UnsatisfiedLinkError: Unable to load library 'xul':
dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
Native library (android-aarch64/libxul.so) not found in resource path (.)
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:323)
at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:483)
at com.sun.jna.Library$Handler.<init>(Library.java:197)
at com.sun.jna.Native.load(Native.java:622)
at com.sun.jna.Native.load(Native.java:596)
at mozilla.telemetry.glean.internal._UniFFILib$Companion$INSTANCE$2.invoke(glean.kt:4870)
at mozilla.telemetry.glean.internal._UniFFILib$Companion$INSTANCE$2.invoke(glean.kt:258)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at mozilla.telemetry.glean.internal._UniFFILib$Companion.getINSTANCE$glean_release(glean.kt:258)
at mozilla.telemetry.glean.internal.TimingDistributionMetric.<init>(glean.kt:2757)
at mozilla.telemetry.glean.private.TimingDistributionMetricType.<init>(TimingDistributionMetricType.kt:19)
at org.mozilla.experiments.nimbus.GleanMetrics.NimbusHealth$applyPendingExperimentsTime$2.invoke(NimbusHealth.kt:46)
at org.mozilla.experiments.nimbus.GleanMetrics.NimbusHealth$applyPendingExperimentsTime$2.invoke(NimbusHealth.kt:45)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.mozilla.experiments.nimbus.GleanMetrics.NimbusHealth.applyPendingExperimentsTime(NimbusHealth.kt:45)
at org.mozilla.experiments.nimbus.Nimbus$applyPendingExperimentsOnThisThread$1.invoke(Nimbus.kt:252)
at org.mozilla.experiments.nimbus.Nimbus$applyPendingExperimentsOnThisThread$1.invoke(Nimbus.kt:250)
at org.mozilla.experiments.nimbus.Nimbus.withCatchAll(Nimbus.kt:181)
at org.mozilla.experiments.nimbus.Nimbus.applyPendingExperimentsOnThisThread$nimbus_release(Nimbus.kt:250)
at org.mozilla.experiments.nimbus.Nimbus$applyLocalExperiments$2$1.invokeSuspend(Nimbus.kt:282)
at org.mozilla.experiments.nimbus.Nimbus$applyLocalExperiments$2$1.invoke(Unknown Source:8)
at org.mozilla.experiments.nimbus.Nimbus$applyLocalExperiments$2$1.invoke(Unknown Source:4)
at kotlinx.coroutines.intrinsics.UndispatchedKt.startUndispatchedOrReturn(Undispatched.kt:89)
at kotlinx.coroutines.BuildersKt__Builders_commonKt.withContext(Builders.common.kt:169)
at kotlinx.coroutines.BuildersKt.withContext(Unknown Source:1)
at org.mozilla.experiments.nimbus.Nimbus$applyLocalExperiments$2.invokeSuspend(Nimbus.kt:279)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
at java.lang.Thread.run(Thread.java:764)
Suppressed: java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
at com.sun.jna.Native.open(Native Method)
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:211)
... 30 more
Suppressed: java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
at com.sun.jna.Native.open(Native Method)
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:224)
... 30 more
Suppressed: java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "getenv" referenced by "/data/app/org.mozilla.fenix.debug-SRZFYxsJNgCsY8_TXYQ_Tg==/lib/arm64/libxul.so"...
at java.lang.Runtime.loadLibrary0(Runtime.java:10
Comment 60•7 months ago
|
||
Updating about Nimbus crash after backout. I did a mach clobber
and another mach build
which fixed the issue so something was wrong with my local build after removing these revisions. These revisions should be safe to backout if the quick fix doesn't work today.
Assignee | ||
Comment 61•7 months ago
|
||
Bad news, Android 8 behaves in a way that is even more confusing than Android 7 and there's no quick way around the issue. We need to back this out again :(
Comment 62•7 months ago
|
||
I see in 1832526 that there was a patch that likely fixed this issue (retrying with a manual loading of libc). This morning, I'll verify the update in that ticket fixes this one across all API levels we support but we can hold off backing out until we know.
Comment 63•7 months ago
|
||
I tested API's 23-31 and all launched without crashing. API level 22 crashes at launch but that seems to be unrelated (I tried older builds prior to this landing and it still occurs). I think the change in 1832526 have resolved the crashing here.
Comment 64•7 months ago
|
||
I see a bunch now showing up under this signature: [@ _glNamedBufferAttachMemoryNV ]
bp-637ff856-b466-4954-90f6-4e9300230516
This is the 3rd most common crash signature on poison values in the last week on versions 113 and higher. The good news is that it looks basically like the crash in comment 0, and is not present in 115 so maybe the patch here fixed it.
Did something change with these getenv signatures? Like it is in the skip list now? Maybe not, as the number 2 crash is [@ getenv ]...
Comment 65•7 months ago
|
||
[@ os_get_option ] looks like another fairly common variant. As in my previous comment, this looks very similar to the stack in comment 0.
Assignee | ||
Comment 66•7 months ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #64)
Did something change with these getenv signatures? Like it is in the skip list now? Maybe not, as the number 2 crash is [@ getenv ]...
We started ignoring everything that starts with __GI
in bug 1790051, it was in September.
Assignee | ||
Comment 67•7 months ago
|
||
Ignoring __GI_getenv
wasn't a great idea apparently as I expected a getenv
frame to always appear after it - possibly inlined - but it doesn't. There's a bunch more signatures that are really getenv
crashes but don't look like it because they only have __GI_getenv
on the stack:
https://crash-stats.mozilla.org/search/?proto_signature=~__GI_getenv
I'll add a few signatures but not all of them because they're too much.
Comment 68•6 months ago
|
||
Think we're ready for Beta & ESR approval requests on this?
Assignee | ||
Comment 70•6 months ago
|
||
Comment on attachment 9307843 [details]
Bug 1752703 - Add interposers for functions manipulating the environment to prevent crashes r=glandium
Beta/Release Uplift Approval Request
- User impact if declined: Firefox might crash on startup on Linux and Android
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1832526
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): This patch adds some complex logic very early during startup. This logic has already been widely tested for another interposer but we've encountered some hard-to-debug issues on Android before we could stabilize it. It's impossible to rule out other issues even though the crashes that were first introduced by this code have been fixed.
- String changes made/needed: none
- Is Android affected?: Yes
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is sec-moderate but crash volume was remarkably high
- User impact if declined: Firefox might crash on startup on Linux and Android
- Fix Landed on Version: 115
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): This patch adds some complex logic very early during startup. This logic has already been widely tested for another interposer but we've encountered some hard-to-debug issues on Android before we could stabilize it. It's impossible to rule out other issues even though the crashes that were first introduced by this code have been fixed.
Assignee | ||
Updated•6 months ago
|
Comment 71•6 months ago
|
||
Comment on attachment 9307843 [details]
Bug 1752703 - Add interposers for functions manipulating the environment to prevent crashes r=glandium
Approved for 114 beta 8, thanks.
Updated•6 months ago
|
Comment 72•6 months ago
|
||
uplift |
Comment 73•6 months ago
|
||
Comment on attachment 9307842 [details]
Bug 1752703 - Move the pthread_thread_create() interposer under mozglue and prepare for having a single place where we place interposer functions r=glandium
Approved for 102.12esr.
Updated•6 months ago
|
Comment 74•6 months ago
|
||
uplift |
Updated•6 months ago
|
Updated•6 months ago
|
Updated•1 month ago
|
Description
•