Closed Bug 1287971 Opened 8 years ago Closed 8 years ago

LeakSanitizer does not run in the content process

Categories

(Core :: Security: Process Sandboxing, defect)

Unspecified
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: mccr8, Assigned: jld)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink][sblc2])

Attachments

(1 file)

This came up while looking at bug 1287437. It looks like we're not actually running LeakSanitizer in ASan builds right now, in content processes. This is bad because we could have introduced new leaks in the mean time.

You can see this by running some mochitests, like
  ./mach mochitest -f plain js/xpconnect/
and then looking through the log for instances of "Suppressions used:". There should be one for the parent process and one for the child process. If I run normally, there's only one, but if I run with "MOZ_DISABLE_CONTENT_SANDBOX=t" then there are two (also with bug 1287437 backed out).
Whiteboard: [MemShrink]
I believe the sandbox is Nightly-only, so if there are leaks introduced they will show up when we merge to Aurora.
See Also: → 1287437
In local testing, ASan still seems to detect use-after-frees, so it seems like it is only LeakSanitizer that's not working.
(In reply to Andrew McCreight [:mccr8] from comment #1)
> I believe the sandbox is Nightly-only, so if there are leaks introduced they
> will show up when we merge to Aurora.

I consider this a relatively stupid comment.  The purpose of these things is to find things causing issues soon after check-in to aid in isolating which patch cased the issue.  Trying to figure it out over an accumulation of 6 weeks of patches is not really helpful.
(In reply to Bill Gianopoulos [:WG9s] from comment #3)
> I consider this a relatively stupid comment.
This comment is quite rude, and not appropriate for Bugzilla.

> The purpose of these things is
> to find things causing issues soon after check-in to aid in isolating which
> patch cased the issue.  Trying to figure it out over an accumulation of 6
> weeks of patches is not really helpful.

I'm well aware of that. I was mostly pointing this out in case RyanVM or Philor does a push of Nightly-as-Aurora and sees weird failures.
try push with sandboxing to see if anything has started failing in the mean time:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d
(In reply to Andrew McCreight [:mccr8] from comment #5)
> try push with sandboxing to see if anything has started failing in the mean
> time:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d

Well then I guess I am the stupid one.  I missed the point of your comment.
(In reply to Andrew McCreight [:mccr8] from comment #5)
> try push with sandboxing to see if anything has started failing in the mean
> time:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d

If you run a terminal window you should see related Sandbox errors in the terminal window.
I've handed off the uplift simulations to the remaining sheriffs :)
From bug 1287437 comment #10:
> One weird thing here is that if the sanitizer runtime managed to block
> SIGSYS, possibly by trying to block all signals (we have symbol
> interposition for sigprocmask and pthread_sigmask to force SIGSYS to stay
> unblocked for exactly this reason, but that wouldn't apply if it does the
> sigprocmask syscall directly) then the kernel will unblock the signal *and*
> reset its disposition before delivering it, which means the process will
> immediately exit — no log messages, no crash reporting, nothing.  So that
> could result in weird breakage that wouldn't show up as a test failure or
> even be obvious to a human reading the logs.
> 
> But a look at the compiler-rt source doesn't show anything that might be
> doing this besides TSan, which is known to be incompatible with sandboxing
> for various reasons (and will disable it: bug 1182565).

So I tried strace — after backing out bug 1286324 and manually symlinking OBJDIR/dist/bin/llvm-symbolizer to the matching llvm-symbolizer (although that might not be necessary for this bug?) — and this happened:

  [pid 79914] rt_sigprocmask(SIG_BLOCK, ~[ILL ABRT BUS FPE SEGV XCPU XFSZ], [], 8) = 0
  [pid 79914] --- SIGSYS {si_signo=SIGSYS, si_code=0x1, si_errno=EPERM, si_pid=5088774, si_uid=0, si_value={int=56, ptr=0xc000003e00000038}} ---
  [pid 79914] +++ killed by SIGSYS +++

Syscall 56 is clone(); si_syscall aliases si_value, it looks like.  So, where in the compiler-rt source might this be coming from?  Maybe lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc:

  static const int kUnblockedSignals[] = { SIGABRT, SIGILL, SIGFPE, SIGSEGV,
                                           SIGBUS, SIGXCPU, SIGXFSZ };

And, as predicted, it's using its own self-contained implementation of sigprocmask() which doesn't use libc at all.

The thing I failed to spot in the git-grep output was this, in lib/lsan/lsan_common.cc:

  StopTheWorld(DoLeakCheckCallback, &param);

Unsurprisingly in hindsight, LSan wants to stop all other threads (and signals) while rummaging through the heap, to avoid data races.  To do this, the StopTheWorld code forks a process that shares the parent's address space and a few other things, but in its own new thread group (and with its own signal disposition, etc.):

  uptr tracer_pid = internal_clone(
      TracerThread, tracer_stack.Bottom(),
      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_UNTRACED,
      &tracer_thread_argument, 0 /* parent_tidptr */, 0 /* newtls */, 0
      /* child_tidptr */);

That's the clone call where the problems start.  But if that succeeded, it would then go on to do a bunch of other unusual stuff in that process, in particular ptrace()ing all of its parent's threads, whose pids it determines by opening /proc/%d/task/, etc.  This… trying to make this somehow work with some vestige of sandboxing does not seem like a good idea.

Also, from the compiler-rt revision history, this code seems to be as old as LSan, so I don't know how this ever worked for GeckoMediaPlugin sandboxing.  Possibly the answer is that it didn't… except in gtests, because (pre-TaskCluster, anyway) those ran on old RedHat with no seccomp-bpf.

<hr />

tl;dr: Once bug 742434 landed, content processes were silently dying when they tried to do LSan reports, which effectively disabled LSan in content processes.  (This is when the process was already exiting, and the IPC code doesn't check the exit status, so nothing else broke.)  But bug 1286324 replaced the thing that got the process killed with a simple failure, resulting in the "LeakSanitizer has encountered a fatal error" message.  In either case, the only reasonable solution is disabling sandboxing if ASan (and therefore LSan) is used.

The fix is simple: extend the bug 1182565 fix to MOZ_ASAN (and maybe MOZ_MSAN?) as well as MOZ_TSAN.  It looks as though small changes like that are acceptable in old-configure.in (i.e., hopefully I won't be required to convert all sanitizer and sandboxing configury to Python just for this).
Assignee: nobody → jld
(In reply to Andrew McCreight [:mccr8] from comment #4)
> 
> I'm well aware of that. I was mostly pointing this out in case RyanVM or
> Philor does a push of Nightly-as-Aurora and sees weird failures.

i took this over Andrew and feel free to needinfo me in the future (just saw this bug here:)  will start such a run tomorrow morning my time.
Whiteboard: [MemShrink] → [MemShrink][sblc2]
Comment on attachment 8773813 [details] [diff] [review]
bug1287971-lsan-unbox-hg0.diff

Review of attachment 8773813 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8773813 - Flags: review?(julian.r.hector) → review+
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9818c6d55f97 — the intermittents seem to not be my fault.
Keywords: checkin-needed
OS: Unspecified → Linux
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d2a4af272e3
Disable sandboxing in Linux ASan builds. r=jhector
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/8d2a4af272e3
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
See Also: → 1436509
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: