Closed Bug 1089816 Opened 7 years ago Closed 7 years ago

Get LeakSanitizer working on e10s

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(e10s+, firefox42 fixed)

RESOLVED FIXED
mozilla42
Tracking Status
e10s + ---
firefox42 --- fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

Attachments

(3 files, 3 obsolete files)

Our ASan builds, which run LSan, are opt builds, so they may call exit(0) in the child process, and not go through the regular shutdown.  This probably means we're not running LSan in the child process.

LSan mostly detects leaks of buffers of stuff, which can be particularly important for media code.  We probably don't want to bother with this until the regular leak checking is working, or we'll get a lot of failures.
tracking-e10s: --- → ?
Summary: Figure out if LSan is working on e10s → Figure out if LeakSanitizer is working on e10s
I did a hacky try run that removes the QuickExit() from ContentChild::ActorDestroy entirely, and it looks like we are in fact not running LSan on the child process right now.  That will be easy enough to fix, by changing the #ifndef DEBUG to be something that doesn't apply to ASAN builds either. The problem will just be fixing up stuff so that it is green.

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=7965a2cdeb12

Ideally, we'd also have some kind of check that all processes produce LSan reports, like we do for regular leak logs, but that's probably something for a separate bug.  I filed bug 1094988 for that, but I think it isn't too critical.
Summary: Figure out if LeakSanitizer is working on e10s → Get LeakSanitizer working on e10s
Attached file LSan output
This is the leak log I get when I run this:
  ./mach mochitest-plain --e10s --this-chunk=20 --total-chunks=100
A lot of those look like the same thing as bug 1116261. I suspect that we could clean a lot of this up by fixing bug 1090570.
(In reply to Bill McCloskey (:billm) from comment #3)
> A lot of those look like the same thing as bug 1116261.
Yeah, I figured that looking at LSan output would be easier than trying to instrument all of chromium IPC with annotations.

> I suspect that we could clean a lot of this up by fixing bug 1090570.

Ok, interesting.
Depends on: 1090570
Depends on: 1122045
No longer depends on: 1051230
Assignee: nobody → continuation
This is the minimum patch needed, and it works locally for short Mochitest runs. Unfortunately, it does not produce a LeakSanitizer log on try. Presumably we're silently exiting somewhere else. I guess I'll have to add a bunch of printfs and try to figure out where that is failing.
I haven't figured out anything here, but I should try disabling sandboxing and see if that helps.
Ok, so there's actually no sandboxing. The issue seems to be how ContentParent kills off the child process when it takes too long. It turns out this was an issue back in 2010 for XPCOM shutdown leak checking (bug 539295), so now that I know what is happening it is easy to fix.
We don't really need to fix the existing Chromium IPC leaks before this lands. I can just suppression list my way to glory.
No longer depends on: 1090570, 1122045
Attachment #8550291 - Attachment is obsolete: true
Depends on: 1187410
This is closer to being ready. I filed a number of bugs for the leaks this testing uncovers (bug 1187424, bug 1187518, bug 1187421 and bug 1187410, in addition to the previously filed bug 1090570 and bug 1122045). Mostly I am adding suppressions for these leaks.

Beyond making sure that my tweaked version of the suppression file still works, I need to fix the permaorange in browser/base/content/test/social/browser_social_multiworker.js that my patch causes, which is some kind of timeout. This is already failing intermittently, as bug 964669. The test runs some tests in a child process, then uses a wonky looking timeout mechanism to wait for it to finish, and hits this timeout if it doesn't finish fast enough. Presumably I'm making shutdown take a lot longer. My plan is to just bump up the timeout...
Attachment #8638239 - Attachment is obsolete: true
Depends on: 964669
There are a variety of ways that the parent and child process ensure that
the child process exits quickly in opt builds, but for AddressSanitizer
builds we want to let the child process to run to completion, so that we
can get a LeakSanitizer report.

This requires adding some addition LSan suppressions, because running
LSan in child processes detects some new leaks.

asan try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=124f60a5b19b
Attachment #8638831 - Attachment is obsolete: true
Attachment #8639109 - Flags: review?(wmccloskey)
In addition to the green try run, I also did a run with an intentional child process only leak to see if we actually are producing results a reasonable amount of the time (there are additional leaks because this did not contain all of the suppressions in the attached patch).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae1d658e6cb3
Attachment #8639109 - Flags: review?(wmccloskey) → review+
There was Windows bustage. I accidentally used "and" instead of "&&" in ContentChild. Apparently they mean the same thing in C++, so my ASan try runs should be valid, but Visual Studio doesn't support "and" so it produced a warning because there was some extra junk it didn't understand at the end of the line.
https://hg.mozilla.org/mozilla-central/rev/260370115e37
https://hg.mozilla.org/mozilla-central/rev/f504cbcaefa9
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Whiteboard: [MemShrink]
This patch adds an intentional leak to ContentChild, which is very useful when investigating whether e10s LSan is working or not.
You need to log in before you can comment on or make changes to this bug.