Closed Bug 1706008 Opened 4 years ago Closed 4 years ago

[Wayland] dom/canvas/test/webgl-conf/generated/test_2_conformance2__glsl3__array-as-return-value.html crash with sandbox violation

Categories

(Core :: Security: Process Sandboxing, defect, P3)

Tracking

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: stransky, Assigned: gerard-majax)

Attachments

(1 file)

On Wayland, dom/canvas/test/webgl-conf/generated/test_2_conformance2__glsl3__array-as-return-value.html crashes with:

0:04.01 GECKO(16251) Sandbox: seccomp sandbox violation: pid 16420, tid 16420, syscall 157, args 23 32 0 139706498268232 0 1.

Running with MOZ_DISABLE_CONTENT_SANDBOX=1 prevents the crash.

Component: Widget: Gtk → Security: Process Sandboxing
Assignee: nobody → jld
Severity: -- → S4

prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE), if I'm reading that correctly?
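For reference, the numbers in the log line can be checked against the kernel headers. This is a minimal sketch, assuming a Linux build host (SYS_prctl is 157 on x86_64, which matches the log); is_capbset_read_probe is a hypothetical helper name, not something from the Firefox tree.

```c
#include <stdbool.h>
#include <linux/capability.h>  /* CAP_MAC_OVERRIDE (32) */
#include <sys/prctl.h>         /* PR_CAPBSET_READ (23) */
#include <sys/syscall.h>       /* SYS_prctl */

/* Hypothetical helper: returns true when a logged seccomp violation
 * (syscall number plus its first two arguments) corresponds to
 * prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE). */
static bool is_capbset_read_probe(long nr, unsigned long arg0,
                                  unsigned long arg1) {
    return nr == SYS_prctl &&
           arg0 == PR_CAPBSET_READ &&
           arg1 == CAP_MAC_OVERRIDE;
}
```

Calling is_capbset_read_probe(SYS_prctl, 23, 32) with the first two arguments from the log returns true, which agrees with the reading above.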

It comes from

nsSystemInfo::Init()

void* libpulse = dlopen("libpulse.so.0", RTLD_LAZY);

There's a backtrace:
#4 __GI___prctl (option=option@entry=23) at ../sysdeps/unix/sysv/linux/prctl.c:38
#5 0x00007f9243b1a1f7 in cap_get_bound (cap=cap@entry=32) at cap_proc.c:272
#6 0x00007f9243b197b6 in _initialize_libcap () at cap_alloc.c:20
#7 0x00007f927b5468de in call_init (l=<optimized out>, argc=argc@entry=15, argv=argv@entry=0x7ffec734bc48, env=env@entry=0x7f927ad24400) at dl-init.c:74
#8 0x00007f927b5469c8 in call_init (env=0x7f927ad24400, argv=0x7ffec734bc48, argc=15, l=<optimized out>) at dl-init.c:37
#9 _dl_init (main_map=0x7f925f1dd000, argc=15, argv=0x7ffec734bc48, env=0x7f927ad24400) at dl-init.c:121
#10 0x00007f927b0e02e5 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x7f927b54a350 <call_dl_init>, args=args@entry=0x7ffec7347300)
at dl-error-skeleton.c:182
#11 0x00007f927b54ae25 in dl_open_worker (a=a@entry=0x7ffec73474a0) at dl-open.c:783
#12 0x00007f927b0e0288 in __GI__dl_catch_exception (exception=exception@entry=0x7ffec7347480, operate=operate@entry=0x7f927b54aa40 <dl_open_worker>, args=args@entry=0x7ffec73474a0)
at dl-error-skeleton.c:208
#13 0x00007f927b54a65e in _dl_open
(file=0x7ffec7347480 "ht4\307\376\177", mode=-2147483647, caller_dlopen=0x7f9272945e53 <nsSystemInfo::Init()+1171>, nsid=-2, argc=15, argv=0x7ffec734bc48, env=0x7f927ad24400)
at dl-open.c:864
#14 0x00007f927b4bb39c in dlopen_doit (a=a@entry=0x7ffec73476d0) at dlopen.c:66
#15 0x00007f927b0e0288 in __GI__dl_catch_exception (exception=exception@entry=0x7ffec7347670, operate=operate@entry=0x7f927b4bb340 <dlopen_doit>, args=args@entry=0x7ffec73476d0)
at dl-error-skeleton.c:208
#16 0x00007f927b0e0353 in __GI__dl_catch_error
(objname=objname@entry=0x7f927ad34090, errstring=errstring@entry=0x7f927ad34098, mallocedp=mallocedp@entry=0x7f927ad34088, operate=operate@entry=0x7f927b4bb340 <dlopen_doit>, args=args@entry=0x7ffec73476d0) at dl-error-skeleton.c:227
#17 0x00007f927b4bbbd9 in _dlerror_run (operate=operate@entry=0x7f927b4bb340 <dlopen_doit>, args=args@entry=0x7ffec73476d0) at dlerror.c:170
#18 0x00007f927b4bb428 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#19 0x00007f9272945e53 in nsSystemInfo::Init() (this=<optimized out>) at /raid/src2/xpcom/base/nsSystemInfo.cpp:1009

The call is:
268 int cap_get_bound(cap_value_t cap)
269 {
270 int result;
271
272 result = prctl(PR_CAPBSET_READ, pr_arg(cap), pr_arg(0));
273 if (result < 0) {
274 errno = -result;
275 return -1;
276 }

but it comes from libc.

OS is Fedora 33.

So this is a mess. libcap2 started using PR_CAPBSET_READ at static initializer time to probe the kernel's capability set size, in version 2.30, released 2020-01-04. And we're loading libpulse in nsSystemInfo, even in sandboxed processes that can't use it, because of bug 1245745. (We really should pass down system info like that over IPC or in the environment instead of making every process recompute it, and I remember saying something similar on some other sandboxing bug recently….)
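To illustrate the kind of probing described here: the first capability probed in the backtrace is 32, which would be consistent with a search over the range 0..64 that starts at the midpoint. libcap's actual search code is not quoted in this bug, so the following is only a sketch of the idea. The names count_caps, kernel_probe, fake_probe and rejecting_probe, and the limit of 64, are assumptions made for the example.

```c
#include <sys/prctl.h>

/* Assumed upper bound on capability numbers for this sketch. */
#define SKETCH_CAP_LIMIT 64

/* Probe callback: returns >= 0 if capability `cap` exists, < 0 otherwise. */
typedef int (*cap_probe_fn)(int cap);

/* Real probe: asks the kernel via prctl(PR_CAPBSET_READ, cap). */
static int kernel_probe(int cap) {
    return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
}

/* Stand-in for a kernel with capabilities 0..40 (assumed for the demo). */
static int fake_probe(int cap) { return cap <= 40 ? 1 : -1; }

/* Stand-in for a sandbox that rejects every PR_CAPBSET_READ call. */
static int rejecting_probe(int cap) { (void)cap; return -1; }

/* Binary search for the number of capabilities the probe accepts.
 * With a limit of 64, the first value probed is 32. */
static int count_caps(cap_probe_fn probe) {
    int lo = 0, hi = SKETCH_CAP_LIMIT;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (probe(mid) >= 0)
            lo = mid + 1;  /* capability `mid` exists, look higher */
        else
            hi = mid;      /* probe rejected `mid`, look lower */
    }
    return lo;
}
```

With fake_probe the search settles on 41; with rejecting_probe it settles on 0, so a process that only ever sees errors from the probe simply concludes that the set is empty.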

Meaning that I don't know why this would have started happening only recently. Were we not using nsSystemInfo in content processes before this? Or was it previously always being used before sandbox startup, and now sometimes the first use is after sandbox startup?

I suppose it's easy enough to fix by always returning EINVAL — sandboxed processes shouldn't have access to actually do anything with capabilities, so it shouldn't matter what it thinks the set size is.
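As a rough illustration of "always returning EINVAL", here is a standalone raw seccomp-bpf sketch. It is not the Firefox patch (the sandbox policy code is not quoted in this bug), and install_capbset_einval_filter is a hypothetical name. It also simplifies in two ways that real policy code should not: there is no architecture check, and the first argument is read as a 32-bit word at its base offset, which assumes a little-endian target.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Install a filter on the calling thread that fails
 * prctl(PR_CAPBSET_READ, ...) with EINVAL and allows every other syscall.
 * Returns 0 on success, -1 if the filter could not be installed. */
static int install_capbset_einval_filter(void) {
    struct sock_filter insns[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Not prctl: jump to the final ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_prctl, 0, 3),
        /* Load the low 32 bits of the first argument (little-endian). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[0])),
        /* Not PR_CAPBSET_READ: jump to the final ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PR_CAPBSET_READ, 0, 1),
        /* prctl(PR_CAPBSET_READ, ...): fail with EINVAL instead of trapping. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EINVAL & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, (unsigned long)SECCOMP_MODE_FILTER, &prog,
              0UL, 0UL) != 0)
        return -1;
    return 0;
}
```

Once the filter is installed, prctl(PR_CAPBSET_READ, cap) returns -1 with errno set to EINVAL for any cap, so the library initializer sees an ordinary error rather than a fatal sandbox violation.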

Meaning that I don't know why this would have started happening only recently.

IIUC this only happens on Wayland, which doesn't have CI yet - Martin is working on bringing it up. Given that it's a WebGL test, I wonder if it also happens on X11/EGL though, or what exactly is different for Wayland.

Gently stealing your bug ;)

Assignee: jld → lissyx+mozillians

Martin, is there any specific environment / setup to perform to repro ?

I just did a build here and I can't repro by running ./mach test dom/canvas/test/webgl-conf/generated/test_2_conformance2__glsl3__array-as-return-value.html, whether on X11/EGL, under XWayland, or on pure Wayland.

But I'm running on some Ubuntu 21.04 setup (Gnome/Wayland).

Flags: needinfo?(stransky)

So, after setting up an up-to-date Fedora 33 VM, I can repro the crash there.

Flags: needinfo?(stransky)

Thanks for looking into this. Just to be sure: does this only reproduce on Wayland or also X11/EGL? Because in the latter case we should make this bug block bug 1695933, which will hopefully land soon.

Flags: needinfo?(lissyx+mozillians)

(In reply to Robert Mader [:rmader] from comment #9)

Thanks for looking into this. Just to be sure: does this only reproduce on Wayland or also X11/EGL? Because in the latter case we should make this bug block bug 1695933, which will hopefully land soon.

Only on Wayland so far. I have not tested on X11/EGL on Fedora yet, but I'll verify that soon.

Flags: needinfo?(lissyx+mozillians)

This might need cross-checking, but:

  • Xwayland installed and running on the F33 VM
  • GDB_BACKEND=xwayland MOZ_ENABLE_WAYLAND=0 ./mach run and checking Window Protocol shows xwayland
  • running the test with GDB_BACKEND=xwayland MOZ_ENABLE_WAYLAND=0 reproduces the issue

However, I'm unsure about whether XWayland protocol is a valid alternative in this case, or do I need pure X11?

Flags: needinfo?(robert.mader)

Good thing, Fedora 33 still has Xorg setup, so I booted a new session using "GNOME with Xorg":

  • XDG_SESSION_TYPE=x11
  • ./mach run shows properly Window Protocol: x11 in about:support
  • test still fails the same way: Signature:[@ libc.so.6 + 0x1023f1]

Thanks. EGL/X11 can be activated both in an X11 or Wayland session via MOZ_X11_EGL=1 (i.e. Window Protocol can be x11 or xwayland) - it's visible by Driver WSI Infos showing EGL_... instead of GLX_... extensions.

However, from what I understand now, the bug also happens on plain GLX - i.e. this does not only affect Wayland or EGL, but all configurations. Now it would be interesting if it's a Firefox regression or a mesa bug or so :/ Odd that it doesn't repro on Ubuntu.

Flags: needinfo?(robert.mader)

P.S.: AFAIK Fedora builds firefox with GCC, while Ubuntu IIUC has switched to Clang/LLVM like upstream has.

Martin, could this be the issue?

Flags: needinfo?(stransky)

(In reply to Robert Mader [:rmader] from comment #13)

Thanks. EGL/X11 can be activated both in an X11 or Wayland session via MOZ_X11_EGL=1 (i.e. Window Protocol can be x11 or xwayland) - it's visible by Driver WSI Infos showing EGL_... instead of GLX_... extensions.

However, from what I understand now, the bug also happens on plain GLX - i.e. this does not only affect Wayland or EGL, but all configurations. Now it would be interesting if it's a Firefox regression or a mesa bug or so :/ Odd that it doesn't repro on Ubuntu.

Indeed, running with MOZ_X11_EGL=1 also shows the issue, but since the previous tests reproduced it with GLX, I don't think it should block the EGL work?

P.S.: AFAIK Fedora builds firefox with GCC, while Ubuntu IIUC has switched to Clang/LLVM like upstream has.

Seems unlikely given the stacks @jld posted. More likely: Fedora has newer versions of one of the involved libraries.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #16)

P.S.: AFAIK Fedora builds firefox with GCC, while Ubuntu IIUC has switched to Clang/LLVM like upstream has.

Seems unlikely given the stacks @jld posted. More likely: Fedora has newer versions of one of the involved libraries.

2.44 on debian and ubuntu, 2.48 on fedora ... let's look ...

(In reply to Alexandre LISSY :gerard-majax from comment #15)

I don't think it should be blocking the egl work?

Indeed - neither should it block bug 1578640 then.

(In reply to Alexandre LISSY :gerard-majax from comment #17)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #16)

P.S.: AFAIK Fedora builds firefox with GCC, while Ubuntu IIUC has switched to Clang/LLVM like upstream has.

Seems unlikely given the stacks @jld posted. More likely: Fedora has newer versions of one of the involved libraries.

2.44 on debian and ubuntu, 2.48 on fedora ... let's look ...

(and fedora 32 was on 2.26)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #4)

Meaning that I don't know why this would have started happening only recently. Were we not using nsSystemInfo in content processes before this? Or was it previously always being used before sandbox startup, and now sometimes the first use is after sandbox startup?

So, there might be a bit of both:

  • LD_PRELOAD with libcap.so.2.48, test passes
  • changing symlink /usr/lib64/libcap.so.2 from libcap.so.2.48 to libcap.so.2.26 also passes tests, without any preload

(In reply to Alexandre LISSY :gerard-majax from comment #20)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #4)

Meaning that I don't know why this would have started happening only recently. Were we not using nsSystemInfo in content processes before this? Or was it previously always being used before sandbox startup, and now sometimes the first use is after sandbox startup?

So, there might be a bit of both:

  • LD_PRELOAD with libcap.so.2.48, test passes

I've also tried rebuilding with the debian patches, just in case, but to no success.

  • changing symlink /usr/lib64/libcap.so.2 from libcap.so.2.48 to libcap.so.2.26 also passes tests, without any preload

I think this is just consistent with https://git.kernel.org/pub/scm/libs/libcap/libcap.git/diff/libcap/cap_alloc.c?id=f1f62a748d7c67361e91e32d26abafbfb03eeee4 as you mentioned in comment 4.

I'm going to try and find a way to trace better between Fedora and Ubuntu, I'm really getting convinced the difference is the one you suspect: on ubuntu it is used before sandbox startup while it's after on fedora.

Tracing with MOZ_SANDBOX_LOGGING=1, I can confirm on Fedora 33:

  • libcap.so is being loaded during the test, by the content process
  • when using LD_PRELOAD there is no trace of loading libcap.so in the content process

Running with MOZ_SANDBOX_LOGGING=1 also on Ubuntu, I can confirm there is no trace of libcap.so being loaded by the content process.
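A complementary check, independent of the sandbox logging, is to look at what is already mapped into a process. This is a minimal sketch, assuming Linux with /proc mounted; is_library_mapped is a hypothetical helper, and in a sandboxed process the open itself may be refused, so it is only meaningful before sandbox startup or in an unsandboxed process.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if any mapping of the current process has `needle` in its
 * path, 0 if none does, and -1 if /proc/self/maps cannot be opened
 * (for example because a sandbox already blocks the open). */
static int is_library_mapped(const char *needle) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return -1;

    char line[4096];
    int found = 0;
    /* Each line describes one mapping; the file path, if any, comes last. */
    while (fgets(line, sizeof line, maps)) {
        if (strstr(line, needle)) {
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Calling is_library_mapped("libcap.so") just before the sandbox starts would show whether libcap's initializer has already run by that point.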

Martin, I'm still trying to verify my idea that this is related to the current Fedora setup of libcap and might involve PAM. In the meantime, I'm mostly confident that this is unrelated to Wayland, since I can repro under a GNOME/Xorg session on Fedora 33 in my VM. I'm continuing the investigation to verify my current hypothesis on the source of the issue and on why we see it only on Fedora and not on Debian, but in the meantime I think we can drop the wayland-tests blocker?

Yes, let's drop the wayland-tests.
Thanks.

No longer blocks: wayland-tests
Flags: needinfo?(stransky)

I ended up hacking directly:

diff --git a/toolkit/xre/nsAppRunner.cpp b/toolkit/xre/nsAppRunner.cpp
index 0d887d933f3d1..3afcf295eaf09 100644
--- a/toolkit/xre/nsAppRunner.cpp
+++ b/toolkit/xre/nsAppRunner.cpp
@@ -5416,6 +5416,9 @@ int XREMain::XRE_main(int argc, char* argv[], const BootstrapConfig& aConfig) {
   NS_SetCurrentThreadName("MainThread");
 #endif

+  PR_SetEnv("LD_DEBUG=libs,files");
+  PR_SetEnv("LD_DEBUG_OUTPUT=libcap/ld.log");
+
   AUTO_BASE_PROFILER_LABEL("XREMain::XRE_main (around Gecko Profiler)", OTHER);
   AUTO_PROFILER_INIT;
   AUTO_PROFILER_LABEL("XREMain::XRE_main", OTHER);

From there, I can confirm:

  • on Fedora, libpulse.so.0 from our nsSystemInfo::Init() loading pulls libcap.so.2 that triggers the violation
  • on Debian, no libcap.so.2 gets loaded, at all (no dep from libpulse.so.0)
  • on Ubuntu, libpulse.so.0 has a dependency on libcap.so.2 BUT libsystemd.so.0 also has one, and it gets loaded via a dependency chain that goes up to libmozgtk.so

Could you confirm on your side that the fix works for you? As far as I can tell, it unblocks what I was reproducing on Fedora 33 under Xorg and Wayland.

Flags: needinfo?(stransky)
Status: NEW → ASSIGNED

The fix works for me, Thanks!

Flags: needinfo?(stransky)
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/558a27bd30ed Block PR_CAPBSET_READ with EINVAL r=gcp
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
See Also: → 1901642