LeakSanitizer does not run in the content process

RESOLVED FIXED in Firefox 50

Status

()

RESOLVED FIXED
2 years ago
8 months ago

People

(Reporter: mccr8, Assigned: jld)

Tracking

(Blocks: 1 bug)

Trunk
mozilla50
Unspecified
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 fixed)

Details

(Whiteboard: [MemShrink][sblc2])

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
This came up while looking at bug 1287437. It looks like we're not actually running LeakSanitizer in ASan builds right now, in content processes. This is bad because we could have introduced new leaks in the mean time.

You can see this by running some mochitests, like
  ./mach mochitest -f plain js/xpconnect/
and then looking through the log for instances of "Suppressions used:". There should be one for the parent process and one for the child process. If I run normally, there's only one, but if I run with "MOZ_DISABLE_CONTENT_SANDBOX=t" then there are two (also with bug 1287437 backed out).
(Reporter)

Updated

2 years ago
Whiteboard: [MemShrink]
(Reporter)

Comment 1

2 years ago
I believe the sandbox is Nightly-only, so if there are leaks introduced they will show up when we merge to Aurora.
(Reporter)

Updated

2 years ago
See Also: → bug 1287437
(Reporter)

Comment 2

2 years ago
In local testing, ASan still seems to detect use-after-frees, so it seems like it is only LeakSanitizer that's not working.
(In reply to Andrew McCreight [:mccr8] from comment #1)
> I believe the sandbox is Nightly-only, so if there are leaks introduced they
> will show up when we merge to Aurora.

I consider this a relatively stupid comment.  The purpose of these things is to find things causing issues soon after check-in to aid in isolating which patch cased the issue.  Trying to figure it out over an accumulation of 6 weeks of patches is not really helpful.
(Reporter)

Comment 4

2 years ago
(In reply to Bill Gianopoulos [:WG9s] from comment #3)
> I consider this a relatively stupid comment.
This comment is quite rude, and not appropriate for Bugzilla.

> The purpose of these things is
> to find things causing issues soon after check-in to aid in isolating which
> patch cased the issue.  Trying to figure it out over an accumulation of 6
> weeks of patches is not really helpful.

I'm well aware of that. I was mostly pointing this out in case RyanVM or Philor does a push of Nightly-as-Aurora and sees weird failures.
(Reporter)

Comment 5

2 years ago
try push with sandboxing to see if anything has started failing in the mean time:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d
(In reply to Andrew McCreight [:mccr8] from comment #5)
> try push with sandboxing to see if anything has started failing in the mean
> time:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d

Well then I guess I am the stupid one.  I missed the point of your comment.
(In reply to Andrew McCreight [:mccr8] from comment #5)
> try push with sandboxing to see if anything has started failing in the mean
> time:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e78ba769fa7d

If you run a terminal window you should see related Sandbox errors in the terminal window.
I've handed off the uplift simulations to the remaining sheriffs :)
(Assignee)

Comment 9

2 years ago
From bug 1287437 comment #10:
> One weird thing here is that if the sanitizer runtime managed to block
> SIGSYS, possibly by trying to block all signals (we have symbol
> interposition for sigprocmask and pthread_sigmask to force SIGSYS to stay
> unblocked for exactly this reason, but that wouldn't apply if it does the
> sigprocmask syscall directly) then the kernel will unblock the signal *and*
> reset its disposition before delivering it, which means the process will
> immediately exit — no log messages, no crash reporting, nothing.  So that
> could result in weird breakage that wouldn't show up as a test failure or
> even be obvious to a human reading the logs.
> 
> But a look at the compiler-rt source doesn't show anything that might be
> doing this besides TSan, which is known to be incompatible with sandboxing
> for various reasons (and will disable it: bug 1182565).

So I tried strace — after backing out bug 1286324 and manually symlinking OBJDIR/dist/bin/llvm-symbolizer to the matching llvm-symbolizer (although that might not be necessary for this bug?) — and this happened:

  [pid 79914] rt_sigprocmask(SIG_BLOCK, ~[ILL ABRT BUS FPE SEGV XCPU XFSZ], [], 8) = 0
  [pid 79914] --- SIGSYS {si_signo=SIGSYS, si_code=0x1, si_errno=EPERM, si_pid=5088774, si_uid=0, si_value={int=56, ptr=0xc000003e00000038}} ---
  [pid 79914] +++ killed by SIGSYS +++

Syscall 56 is clone(); si_syscall aliases si_value, it looks like.  So, where in the compiler-rt source might this be coming from?  Maybe lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc:

  static const int kUnblockedSignals[] = { SIGABRT, SIGILL, SIGFPE, SIGSEGV,
                                           SIGBUS, SIGXCPU, SIGXFSZ };

And, as predicted, it's using its own self-contained implementation of sigprocmask() which doesn't use libc at all.

The thing I failed to spot in the git-grep output was this, in lib/lsan/lsan_common.cc:

  StopTheWorld(DoLeakCheckCallback, &param);

Unsurprisingly in hindsight, LSan wants to stop all other threads (and signals) while rummaging through the heap, to avoid data races.  To do this, the StopTheWorld code forks a process that shares the parent's address space and a few other things, but in its own new thread group (and with its own signal disposition, etc.):

  uptr tracer_pid = internal_clone(
      TracerThread, tracer_stack.Bottom(),
      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_UNTRACED,
      &tracer_thread_argument, 0 /* parent_tidptr */, 0 /* newtls */, 0
      /* child_tidptr */);

That's the clone call where the problems start.  But if that succeeded, it would then go on to do a bunch of other unusual stuff in that process, in particular ptrace()ing all of its parent's threads, whose pids it determines by opening /proc/%d/task/, etc.  This… trying to make this somehow work with some vestige of sandboxing does not seem like a good idea.

Also, from the compiler-rt revision history, this code seems to be as old as LSan, so I don't know how this ever worked for GeckoMediaPlugin sandboxing.  Possibly the answer is that it didn't… except in gtests, because (pre-TaskCluster, anyway) those ran on old RedHat with no seccomp-bpf.

<hr />

tl;dr: Once bug 742434 landed, content processes were silently dying when they tried to do LSan reports, which effectively disabled LSan in content processes.  (This is when the process was already exiting, and the IPC code doesn't check the exit status, so nothing else broke.)  But bug 1286324 replaced the thing that got the process killed with a simple failure, resulting in the "LeakSanitizer has encountered a fatal error" message.  In either case, the only reasonable solution is disabling sandboxing if ASan (and therefore LSan) is used.

The fix is simple: extend the bug 1182565 fix to MOZ_ASAN (and maybe MOZ_MSAN?) as well as MOZ_TSAN.  It looks as though small changes like that are acceptable in old-configure.in (i.e., hopefully I won't be required to convert all sanitizer and sandboxing configury to Python just for this).
Assignee: nobody → jld
(In reply to Andrew McCreight [:mccr8] from comment #4)
> 
> I'm well aware of that. I was mostly pointing this out in case RyanVM or
> Philor does a push of Nightly-as-Aurora and sees weird failures.

i took this over Andrew and feel free to needinfo me in the future (just saw this bug here:)  will start such a run tomorrow morning my time.

Updated

2 years ago
Whiteboard: [MemShrink] → [MemShrink][sblc2]
(Assignee)

Comment 11

2 years ago
Created attachment 8773813 [details] [diff] [review]
bug1287971-lsan-unbox-hg0.diff

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9818c6d55f97
Attachment #8773813 - Flags: review?(julian.r.hector)
Comment on attachment 8773813 [details] [diff] [review]
bug1287971-lsan-unbox-hg0.diff

Review of attachment 8773813 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8773813 - Flags: review?(julian.r.hector) → review+
(Assignee)

Comment 13

2 years ago
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9818c6d55f97 — the intermittents seem to not be my fault.
Keywords: checkin-needed
OS: Unspecified → Linux

Comment 14

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d2a4af272e3
Disable sandboxing in Linux ASan builds. r=jhector
Keywords: checkin-needed

Comment 15

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8d2a4af272e3
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox50: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1287437
(Assignee)

Updated

8 months ago
See Also: → bug 1436509
You need to log in before you can comment on or make changes to this bug.