Open Bug 976414 (LSan) Opened 11 years ago Updated 11 months ago

[meta] Run LeakSanitizer

Categories

(Core :: Sanitizers, defect)

defect

Tracking

()

People

(Reporter: mccr8, Unassigned)

References

(Depends on 27 open bugs, )

Details

(Keywords: meta, Whiteboard: [MemShrink:meta])

Attachments

(2 files, 4 obsolete files)

Apparently ASAN has this LeakSanitizer thing, which I'm pretty sure we don't run.
  https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer

It is described as "experimental", which does not fill me with great hope, but it sounds like the overhead is entirely at the end of the run, so if it works we might be able to turn it on.  But anyways, just running it and seeing what happens might produce some useful result.
This is great!  Thanks for doing this.
Looks like Chrome runs it on their equivalent of TBPL:
  http://www.chromium.org/developers/testing/leaksanitizer
I can help with integrating this if needed. In bug 957865 I'm currently working on upgrading the Clang we use for ASan so we can also do TSan with it. We can probably also use it for LSan then.
This is going to give us a ton more leak testing coverage! Definitely worth trying.
Whiteboard: [MemShrink] → [MemShrink:P1]
I'll try to look into this a bit, but if somebody else is excited about working on it let me know! :)
Alias: LSAN
Assignee: nobody → continuation
As far as I can see the instructions for using LSAN are:
1) Build a normal ASAN build.
2) Run as you would ASAN, except also set ASAN_OPTIONS="detect_leaks=1".
So let's see what happens: https://tbpl.mozilla.org/?tree=Try&rev=9dbbdc04ca04
The mochitests all seemed to fail after about 4 minutes with this:

18:27:23     INFO -  INFO | runtests.py | ASan running in low-memory configuration
18:27:23     INFO -  ==2416==Could not attach to thread 2413 (errno 1).
18:27:23     INFO -  =================================================================
18:27:23     INFO -  ==2413==ERROR: LeakSanitizer: detected memory leaks
18:27:23     INFO -  Direct leak of 24 byte(s) in 1 object(s) allocated from:
18:27:23     INFO -      #0 0x44a235 in malloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:74
18:27:23     INFO -      #1 0x7ff6e3f44cab in PORT_Alloc_Util /builds/slave/try-l64-asan-00000000000000000/build/security/nss/lib/util/secport.c:86
18:27:23     INFO -      #2 0x7ff6e072f0f7 (+0x5550f7)
18:27:23     INFO -      #3 0x7ff6e072ee60 (+0x554e60)
18:27:23     INFO -      #4 0x7ff6e07321f3 (+0x5581f3)
18:27:23     INFO -  SUMMARY: LeakSanitizer: 24 byte(s) leaked in 1 allocation(s).
18:27:23  WARNING -  TEST-UNEXPECTED-FAIL | runtests.py | Certificate integration failed

I have no idea what that means. It looks like maybe the MochitestServer failed to launch?
The error is coming from <http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#837>

which means certutil failed to launch (most likely), see line 1405 same file.
Crashtests did seem to produce some kind of useful result:
  SUMMARY: LeakSanitizer: 11176265 byte(s) leaked in 59174 allocation(s).

CPP also produced a lot of things, but I assume we don't do any leak checking there right now, so maybe this is just bogus stuff that needs be cleaned up under whatever this is.  I see a lot of stacks with things like nsComponentManagerImpl::RegisterModule.

Note that these are going to be a little high, because I should really also set MOZ_CC_RUN_DURING_SHUTDOWN so we run the GC and CC at shutdown.
With shutdown CCs, hopefully crash tests will have less leaks:
  https://tbpl.mozilla.org/?tree=Try&rev=18b2e1d52c09
My wild guess is that the certutil thing runs, then leaks, so LSAN makes it return a failure, which the harness detects, then it bails out.  It would be nice if the stack was a little more symbolicated.  That line in PORT_Alloc is indeed an allocation.  But maybe as a hack I can have it not do LSAN for certutil...
(In reply to comment #11)
> My wild guess is that the certutil thing runs, then leaks, so LSAN makes it
> return a failure, which the harness detects, then it bails out.  It would be
> nice if the stack was a little more symbolicated.  That line in PORT_Alloc is
> indeed an allocation.  But maybe as a hack I can have it not do LSAN for
> certutil...

Yeah, actually I think at least on the beginning we should only set the env var for firefox, and not even the xpcshell which runs the HTTP server.  That's really what matters the most.
Yeah, I didn't intend to set it for the server xpcshell, but I may have.
Depends on: 979718
I went through the log for crash tests.  I didn't find anything particularly interesting, except that I'd say about 95% of the leaks it found are from WebRTC.  I'll probably have to add a suppression for anything with webrtc in the path.
(In reply to Andrew McCreight [:mccr8] from comment #14)
> I went through the log for crash tests.  I didn't find anything particularly
> interesting, except that I'd say about 95% of the leaks it found are from
> WebRTC.  I'll probably have to add a suppression for anything with webrtc in
> the path.

That doesn't sound very encouraging.  Can you file bugs for the leaks so we can fix them, or are there simple too many?
Given the number it's not clear that filing a bug for every reported leak is useful.  The best path forward is probably for someone with (or willing to acquire) a passing familiarity with webrtc to go through the list and figure out what to do with them.
Maybe Randell? :)
Attached file LSAN report from Crashtests —
Here's the LSAN report from crash tests, with the timing gunk removed and the paths shortened a bit for improved legibility.
Depends on: 979928
Depends on: 979993
> Here's the LSAN report from crash tests

My kingdom for some whitespace!
More seriously, this is good news! Because we can fix many of these.

mccr8, do you think you could go through them and put them in logical groups, in a form that's a little easier to read? Then you could maybe do one follow-up bug per group. Emailing dev-platform would also be a good idea, to get more eyes on the leaks.

And I don't understand the nsThread::ProcessNextEvent() leaks, but I'd like to. They're big, and seem bad, but maybe inlining is making it hard to see exactly what's going on?
Also, it might be worthwhile suppressing these right now so we can get LSAN running all the time, and then get the important ones fixed at a more leisurely pace.
(In reply to Nicholas Nethercote [:njn] from comment #20)
> mccr8, do you think you could go through them and put them in logical
> groups, in a form that's a little easier to read? Then you could maybe do
> one follow-up bug per group. Emailing dev-platform would also be a good
> idea, to get more eyes on the leaks.

It is still a little early to ask for broader participation.  We need to get suppression figured out, and if there's any way to deal with the inlining issue, so people can actually figure out what is happening.  I'd also like to get mochitests working first, so there's a little more meat to chew on.  I've been going through the logs and filing things that look actionable that I've seen so far.

> And I don't understand the nsThread::ProcessNextEvent() leaks, but I'd like
> to. They're big, and seem bad, but maybe inlining is making it hard to see
> exactly what's going on?

Yeah, inlining is a big problem for understanding these stacks.

(In reply to Nicholas Nethercote [:njn] from comment #21)
> Also, it might be worthwhile suppressing these right now so we can get LSAN
> running all the time, and then get the important ones fixed at a more
> leisurely pace.

Yes, once I figure out how to modify the test harness to do that.  You have to stick the path to the file in some environment variable, and I need to figure out what exactly that path should be, given a particular file in the tree.

There's also an issue of getting the test harness to produce slightly more useful errors on TBPL than a crash with a weird error code.
(In reply to Andrew McCreight [:mccr8] from comment #22)
>
> Yes, once I figure out how to modify the test harness to do that.  You have
> to stick the path to the file in some environment variable, and I need to
> figure out what exactly that path should be, given a particular file in the
> tree.

For TSan we basically need to solve the same problem. Passing the environment variable is easy (see where we pass ASAN_SYMBOLIZER or ASAN_OPTIONS for example). However, even if the file is part of the tree, it won't be available on the testers, since they don't have the source tree. I guess we would have to package the file to make it available on the testers.

As for the location, we already have build/sanitizers/ for such files, it currently contains the TSan suppression list.
Depends on: 980109
And build/valgrind/*.sup are the Valgrind suppression files, in case that's helpful.
Attached patch Mash in environment options for LSAN. WIP. (obsolete) — — Splinter Review
Attached patch Don't run LSAN for the cert DB filler. WIP. (obsolete) — — Splinter Review
The cert DB patch seemed to work locally for running Mochitests so let's see what we get.
https://tbpl.mozilla.org/?tree=Try&rev=8f7fc216e81a
Depends on: 980625
Attached file LSAN suppressions (obsolete) —
With this suppression list, and the patches in the blocking bugs, M1 only reports the following leaks:

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x440995 in malloc (/home/amccreight/ff-dbg-asan/dist/bin/plugin-container+0x440995)
    #1 0x7f023901ee08 in moz_xmalloc /home/amccreight/mc/memory/mozalloc/mozalloc.cpp:52
SUMMARY: LeakSanitizer: 24 byte(s) leaked in 1 allocation(s).

Indirect leak of 3297 byte(s) in 230 object(s) allocated from:
    #0 0x440f25 in malloc (/home/amccreight/ff-dbg-asan/dist/bin/firefox+0x440f25)
    #1 0x7fb4f75959b9 in __GI___strdup (/usr/lib/libc.so.6+0x819b9)
SUMMARY: LeakSanitizer: 3297 byte(s) leaked in 230 allocation(s).

Only one of the suppressions, NS_InvokeByIndex, is cheating.
I wrote a python script that extracts LSAN info from a local Mochitest run:
  https://github.com/amccreight/moz-lsan-tools
Depends on: 981023
> With this suppression list, and the patches in the blocking bugs, M1 only
> reports the following leaks:
> 
> Direct leak of 24 byte(s) in 1 object(s) allocated from:
>     #0 0x440995 in malloc
> (/home/amccreight/ff-dbg-asan/dist/bin/plugin-container+0x440995)
>     #1 0x7f023901ee08 in moz_xmalloc
> /home/amccreight/mc/memory/mozalloc/mozalloc.cpp:52
> SUMMARY: LeakSanitizer: 24 byte(s) leaked in 1 allocation(s).
> 
> Indirect leak of 3297 byte(s) in 230 object(s) allocated from:
>     #0 0x440f25 in malloc
> (/home/amccreight/ff-dbg-asan/dist/bin/firefox+0x440f25)
>     #1 0x7fb4f75959b9 in __GI___strdup (/usr/lib/libc.so.6+0x819b9)
> SUMMARY: LeakSanitizer: 3297 byte(s) leaked in 230 allocation(s).

Are those the full stacks? If so, that's not very helpful :(

Looking at the suppressions file, does a leak get suppressed if any of its stack entries match any of the lines in the file? If so, that's crude enough to make me uncomfortable -- Valgrind's suppressions can match against entire stack traces or sub-traces, so the likelihood of suppressing the wrong thing is much smaller.

Having said that, why is __GI___strdup in the second stack when it's also in the suppressions file?
(In reply to Nicholas Nethercote [:njn] from comment #31)
> Are those the full stacks? If so, that's not very helpful :(
Yes.  My guess is that these are calls from external libraries that aren't built with ASAN so there are no symbols.  Decoder may have some idea of how to extract more information.

> If so, that's crude enough to make me uncomfortable
Yes, it is not good.  They could be made a little better, but judging from Chrome's suppression file, there's nothing better than per-frame matching.
  http://src.chromium.org/viewvc/chrome/trunk/src/tools/lsan/suppressions.txt   
The documentation is extremely thin, though, so I'll have to read through the ASAN source code to see if there's anything better.  Maybe we can write some patches to make it better in LSAN.

> Having said that, why is __GI___strdup in the second stack when it's also in
> the suppressions file?

I added it to the suppressions file because it was showing up, but LSAN fails to actually suppress it.  Though you'll note in the file it is actually commented out. ;)  But it wasn't when I did the run.
Depends on: 981167
Depends on: 981195
try run with unlanded fixes for bug 979928, bug 980625, bug 981167 from various other people:
  https://tbpl.mozilla.org/?tree=Try&rev=a9a80c84b8b5
Thanks for all of the quick fixes!
Depends on: 981220
According to the design document ( https://code.google.com/p/address-sanitizer/wiki/LeakSanitizerDesignDocument ) LSAN works by doing some kind of conservative scanning of memory, rather than just tallying blocks that have been allocated but not freed: "LSan scans this memory for byte patterns that look like pointers to heap blocks (interior pointers are treated the same as pointers to the beginning of a block). Those blocks are considered reachable and their contents are also treated as live memory. In this manner we discover all blocks reachable from the root set. We mark such blocks by setting a flag in the block's metadata. We then iterate over all existing heap blocks and report unreachable blocks as leaks, together with stack traces of corresponding allocations. Instead of reporting each allocation individually, it might be more useful to aggregate them by the last k frames in the stack trace."
Depends on: 981911
Depends on: 981920
Depends on: 981931
New push with a couple of leaks fixed, frame pointers enabled (for better stacks), and using the patch I pushed to TBPL to make LSAN failures a little snazzier looking:
https://tbpl-dev.allizom.org/?tree=Try&rev=413a1ce105d7
In my local M1 run, this seemed to fix all leaks aside from the stuff I have suppressed.  From reading Chrome's documentation, installing and using debug versions of these libraries should give more insight into what is happening with those leaks.
Depends on: 981966
Depends on: 982105
Depends on: 982111
Depends on: 982235
Depends on: 982373
Depends on: 987217
Depends on: 987263
Depends on: 987385
Depends on: 987918
Depends on: 987925
Depends on: 988041
Whiteboard: [MemShrink:P1] → [MemShrink:meta]
Attached patch Enable LeakSanitizer on TBPL. WIP (obsolete) — — Splinter Review
Here's what I have so far. I still haven't gotten the suppressions file set up properly for CJR Cpp or X, but it seems to be working for Mochitests.

Mochitests are still orange because I haven't given up on figuring out how to get better stacks for these clipped stacks that call into system libraries.
Oops, I meant to post that in bug 988041.
Attachment #8386927 - Attachment is obsolete: true
Attachment #8386928 - Attachment is obsolete: true
Attachment #8396813 - Attachment is obsolete: true
Attachment #8387718 - Attachment is obsolete: true
(In reply to Andrew McCreight [:mccr8] from comment #32)
> (In reply to Nicholas Nethercote [:njn] from comment #31)
> > Are those the full stacks? If so, that's not very helpful :(
> Yes.  My guess is that these are calls from external libraries that aren't
> built with ASAN so there are no symbols.  Decoder may have some idea of how
> to extract more information.
Usually the last frame of the stack trace is the library which doesn't have frame pointers (not symbols). In that __GI_strdup stack trace it's libc.
You can try running with ASAN_OPTIONS=fast_unwind_on_malloc=0 to get complete stack traces. It'll help you see where those particular leaks are coming from. It's not a good idea to always run in this mode because the slowdown will be huge.

On an unrelated note, we have a rudimentary API in <sanitizer/lsan_interface.h> which you might find useful.

> > Having said that, why is __GI___strdup in the second stack when it's also in
> > the suppressions file?
> 
> I added it to the suppressions file because it was showing up, but LSAN
> fails to actually suppress it.
That's not supposed to happen. Can you follow up with a reproducer?
(In reply to Sergey Matveev from comment #38)
> Usually the last frame of the stack trace is the library which doesn't have
> frame pointers (not symbols). In that __GI_strdup stack trace it's libc.
> You can try running with ASAN_OPTIONS=fast_unwind_on_malloc=0 to get
> complete stack traces. It'll help you see where those particular leaks are
> coming from. It's not a good idea to always run in this mode because the
> slowdown will be huge.
Thanks, I'll try that.

> That's not supposed to happen. Can you follow up with a reproducer?
Sure.  I worked around it somehow, but I don't remember what I did.  Hopefully I'll get a chance to start looking at LSAN again in the next week or two.
Depends on: 1000548
Depends on: 1000609
Depends on: 1000613
Depends on: 1018435
Depends on: 1021187
Depends on: 1021302
Depends on: 1021328
Depends on: 1021350
Depends on: 1021854
Depends on: 1021932
Depends on: 1022010
Depends on: 1022027
Depends on: 1022042
Depends on: 1022954
Depends on: 1022991
I wrote a little blurb on LeakSanitizer on the MDN AddressSanitizer page.
I've also written a few little Python scripts for extracting LSan results from Mochitest logs, parsing a suppression file, and checking what the effect of the suppression file is on a given set of LSan results.
  https://github.com/amccreight/moz-lsan-tools
Depends on: 1023548
Depends on: 1023583
Depends on: 1023585
Depends on: 1024259
Depends on: 1026663
Alias: LSAN → LSan
Depends on: 1028382
Depends on: 1028403
Depends on: 1028456
Depends on: 1028483
Depends on: 1034602
Depends on: 1036562
Depends on: 1044213
Depends on: 1044206
Depends on: 1046165
Depends on: 1046538
Depends on: 1046539
Depends on: 1046837
Depends on: 1053978
Depends on: 1055154
Depends on: 1055910
Depends on: 1063011
Depends on: 1074310
Depends on: 1074317
Depends on: 1075134
Depends on: 1075144
Depends on: 1082665
Depends on: 1059771
Depends on: 1068634
Depends on: 1089816
Depends on: 1093654
Depends on: 1086172
Depends on: 1088694
Depends on: 1094988
Depends on: 1105586
Depends on: 1122045
Depends on: 1145980
Depends on: 1149719
Depends on: 1166041
Depends on: 1187410
Depends on: 1187421
Depends on: 1187424
Depends on: 1187518
Depends on: 1074416
Depends on: 1189430
Depends on: 1189568
Depends on: 1190471
Depends on: 1195777
Depends on: 1090570
Depends on: 1201096
Depends on: 1207161
Depends on: 1212987
Depends on: 1215443
Depends on: asan-clang3.7
Depends on: 1216354
Depends on: 1241773
Depends on: 1139323
Depends on: 1169849
Depends on: 1203028
Depends on: 1213826
Depends on: 1214416
Depends on: 1217218
Depends on: 1220653
Depends on: 1221012
Depends on: 1221910
Depends on: 1222354
Depends on: 1225704
Depends on: 1227713
Depends on: 1228240
Depends on: 1231277
Depends on: 1232555
Depends on: 1239927
Depends on: 1242300
Depends on: 1239504
Depends on: 1250995
Depends on: 1252642
Depends on: 1252647
Depends on: 1253440
Depends on: 1253854
Depends on: 1257231
Depends on: 1227347
Depends on: 1253299
Depends on: 1267650
Depends on: 1246801
Depends on: 1287437
Depends on: 1260289
Depends on: 1287877
Depends on: 1287971
Bug 1246801 might not need to block this — it's for LSan support in NSS's own build/testsuite/CI/etc.  (When building NSS as part of Firefox, the sanitizer build flags carry over without needing any special support from NSS.)  There might be some overlap as far as tuning the list of suppressions, but most of it would be independent, I'd think.
(In reply to Jed Davis [:jld] [⏰MDT; UTC-6] from comment #42)
> Bug 1246801 might not need to block this — it's for LSan support in NSS's
> own build/testsuite/CI/etc.  (When building NSS as part of Firefox, the
> sanitizer build flags carry over without needing any special support from
> NSS.)  There might be some overlap as far as tuning the list of
> suppressions, but most of it would be independent, I'd think.

The meta bug is just a combination of things related to running LSan on Firefox-y code.
Depends on: 1304156
Depends on: 1309420
Depends on: 1329445
Depends on: 1336170
Depends on: 1341354
Depends on: 1353858
Depends on: 1353877
Depends on: 1353893
Depends on: 1354232
Depends on: 1355834
Depends on: 1358128
Depends on: 1361476
Depends on: 1363602
Depends on: 1373340
Depends on: 1375341
Depends on: 1495659
Depends on: 1480799
Depends on: 1507350
Depends on: 1508500
Depends on: 1517574
Depends on: 1517667
Depends on: leak-fuzz
Depends on: 1279108
Depends on: 1530374
Depends on: 1523989
Depends on: 1544115
Type: defect → task
Keywords: meta
Summary: Try out LeakSanitizer → [meta] Try out LeakSanitizer
Depends on: 1587463
Depends on: 1602645
Type: task → defect
Summary: [meta] Try out LeakSanitizer → [meta] Run LeakSanitizer
Assignee: continuation → nobody
Component: General → Sanitizers
Depends on: 1639519
Depends on: 1641746
Depends on: 1715090
Depends on: 1715084
Depends on: 1714228
See Also: → 1765225
Severity: normal → S3
Depends on: 1568234, 1450476
No longer depends on: 1750062
No longer depends on: 1626420
No longer depends on: 1728821
No longer depends on: 1708028
No longer depends on: 1735085
No longer depends on: 1708828
No longer depends on: 1754191
No longer depends on: 1509373
No longer depends on: 1728820
No longer depends on: 1331152
No longer depends on: 1753183
No longer depends on: 1687248
No longer depends on: 1755129
No longer depends on: 1711640
No longer depends on: 1532189

I've removed all of the intermittent-failures blocking this bug. It looks like bugbot won't close intermittent-failures that haven't happened in a long time, so there were dozens of bugs that hadn't happened in years that were still open. Intermittent LSan failures are easy enough to find by searching for "Intermittent LeakSanitizer", and generally aren't very interesting bugs, so I think it is okay for them to not be blocking this meta bug.

Sorry for the noise. I guess I should have removed the blocking bugs manually in a single change to this bug rather than used the Bugzilla query mass change on the bugs that block this one.

See Also: → 1715080
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: