Closed Bug 717853 (NewDMD) Opened 12 years ago Closed 12 years ago

Integrate DMD into the browser

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla20
Tracking Status
firefox19 --- fixed
firefox20 --- fixed
b2g18 --- fixed

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(5 files, 26 obsolete files)

110.84 KB, patch
justin.lebar+bug
: review+
glandium
: review+
justin.lebar+bug
: approval-mozilla-aurora+
justin.lebar+bug
: approval-mozilla-b2g18+
4.26 KB, patch
justin.lebar+bug
: review+
justin.lebar+bug
: approval-mozilla-aurora+
justin.lebar+bug
: approval-mozilla-b2g18+
44 bytes, text/x-github-pull-request
dhylands
: review+
44 bytes, text/x-github-pull-request
dhylands
: review+
860.14 KB, application/x-bzip2
DMD (bug 676724) is great:  it finds dark matter and double-counted heap blocks.  

DMD sucks:  it's super slow and doesn't run on Windows.  I.e. it's a Valgrind tool.

The only Valgrind features of note that DMD uses are:

- malloc interception

- stack trace recording (including getting line numbers) and matching

As it happens, trace-malloc does both of those things.  So it should be possible to implement DMD entirely within Firefox.  It would still require building with the --enable-dmd flag, but that's *much* easier and more widely applicable than the current version.
Whiteboard: [MemShrink] → [MemShrink:P2]
Alias: NewDMD
Blocks: 736383
Attached patch draft patch (obsolete) — Splinter Review
dbaron, here's a preliminary version that's working well.  I'd like feedback mostly on the structure of it.  Currently it's very much piggy-backed onto trace-malloc.

- You configure with --enable-dmd and that forces --enable-trace-malloc on.  It also causes MOZ_DMD to be defined.

- Most of the code is in nsTraceMalloc.c, within |#ifdef MOZ_DMD|, which is nested within |#ifdef NS_TRACE_MALLOC|.  This #ifdef nesting also occurs in nsJSEnvironment.cpp.

- To run DMD, you actually have to specify --trace-malloc, and the usual trace-malloc logging occurs.

I did it this way because it was easy.  There's a lot of code sharing between trace-malloc and DMD -- all that malloc interception and stack walking stuff.  

Now, does this seem reasonable?  Arguably a better way to do it would be to factor out the common stuff that's shared by trace-malloc and DMD, and then build trace-malloc and DMD as layers on top of this shared core.  But that would be a bigger change.  For example, a lot of 

  #ifdef NS_TRACE_MALLOC

occurrences would become

  #if defined(NS_TRACE_MALLOC) || defined(MOZ_DMD)

I also don't know what this shared core would be called.
Attachment #606477 - Flags: feedback?(dbaron)
Blocks: DMD
No longer depends on: DMD
It might be useful (as a followup, or whatever) to see if you can get t-m to use jemalloc.  That way, you'll get proper slop numbers.
dbaron: feedback ping.
Blocks: 747647
Comment on attachment 606477 [details] [diff] [review]
draft patch

comment 1 sounds fine to me; I probably shouldn't delay this any longer for more detailed feedback
Attachment #606477 - Flags: feedback?(dbaron) → feedback+
(In reply to Nicholas Nethercote [:njn] from comment #1)
> I also don't know what this shared core would be called.

Drive-by bikeshedding: malloc-profiling?  memory-profiling?
I would be really interested in a description of what DMD does. Is this synonymous with "enumerate all heap blocks with some metadata such as allocation call stack" ?
(In reply to Benoit Jacob [:bjacob] from comment #6)
> I would be really interested in a description of what DMD does. Is this
> synonymous with "enumerate all heap blocks with some metadata such as
> allocation call stack" ?

Yes.  DMD needs each heap block to be annotated with:

- its requested size (which may be smaller than actual size)
- the callsite of its allocation point
- the callsite of its report point (i.e. where a memory reporter counted it);  NULL if it hasn't been reported
- a string containing the name of the reporter that reported it;  NULL if it hasn't been reported

When requested, DMD iterates over all heap blocks, and after the enumeration it prints aggregated information about all the reported and unreported blocks, in particular, their allocation points.

Per-heap-block metadata is a common basis for dynamic tools.  Massif is another example which uses it.

I read the thread on the jemalloc mailing list (http://www.canonware.com/pipermail/jemalloc-discuss/2012-October/000456.html) about your patch that modifies jemalloc to track extra per-block metadata.

I understand their reluctance to put this in the allocator, and instead do it in the application code -- it's invasive and not core functionality.

And I understand your wish to put it in the allocator -- to avoid possibly missing some allocations, and to avoid the performance cost of having to maintain an additional hash table mapping block pointers to metadata.

It's a tricky one.  For DMD, I'll continue using trace-malloc for now, because it's already working and I want to be able to land this soon-ish without requiring buy-in for a major change from the jemalloc folks.
OK, thanks for the explanation!

As I said in the jemalloc-discuss thread, I don't intend for my patch to be taken as-is, as it is indeed intrusive in its current form. However, since it only touches the public malloc(3) API functions, it is actually very self-contained and could easily be integrated as an optional feature.

Now I'm left in a place where I really don't know what to do next: I am changing my mind on average twice a day :)

I could use nsTraceMalloc, but it is really, really slow in its current form. Not sure why. A profile showed a diverse mix of things: mutex locking, hashtable operations, stack trace unwinding. By contrast, my patch currently does mutex locking too, and still runs at near-"native" speed. So mutex locking alone shouldn't explain it, unless there is something really wrong about PR locks.

It seems that nsTraceMalloc tries to annotate stacks with symbol names on every allocation -- is that correct? That could explain slowness and seems unnecessary as it should be enough to record return addresses and only read symbol names later, on demand. Right?
I'm getting this warning a few times with my patch when I run NewDMD immediately after starting up:

REPORT WARNING(telemetry): no such heap block 0xf78540
 Reported at:
  TelemetryMallocSizeOf (/home/njn/moz/mi8/toolkit/components/telemetry/Telemetry.cpp:200)
  base::Histogram::SizeOfIncludingThis(unsigned long (*)(void const*)) (/home/njn/moz/mi8/ipc/chromium/src/base/histogram.cc:416)
  GetTelemetryMemoryUsed (/home/njn/moz/mi8/toolkit/components/telemetry/Telemetry.cpp:235)
  GetAmount (/home/njn/moz/mi8/toolkit/components/telemetry/Telemetry.cpp:240)
  mozilla::DMDRunReporters() (/home/njn/moz/mi8/xpcom/base/nsMemoryReporterManager.cpp:1617)
  DMDCheckAndDump (/home/njn/moz/mi8/dom/base/nsJSEnvironment.cpp:2643)

I added debugging printfs to nsTraceMalloc.c and found that 0xf78540 was never seen in any of {Malloc,Calloc,Realloc,Free}Callback().  This suggests that telemetry starts up before trace-malloc, but that isn't true AFAICT.  Hmm.
trace-malloc is really slow, yes.  Running Firefox with --trace-malloc is roughly as slow as running it under Valgrind, which is pretty sad given how much extra stuff Valgrind does.

I'm not sure about the symbol name stuff.  backtrace() does call NS_StackWalk(), though, which doesn't look like a cheap function.
> By contrast, my patch currently
> does mutex locking too, and still runs at near-"native" speed. So mutex
> locking alone shouldn't explain it, unless there is something really wrong
> about PR locks.

So it must be the stack walking and/or symbol gathering that's the problem.
(In reply to Benoit Jacob [:bjacob] from comment #8)
> It seems that nsTraceMalloc tries to annotate stacks with symbol names on
> every allocation -- is that correct? That could explain slowness and seems
> unnecessary as it should be enough to record return addresses and only read
> symbol names later, on demand. Right?

Yes, exactly. AFAIK mapping addresses to symbols is much more expensive than walking the stack. From my experience with the profiler, walking the stack 1000 times per second is very cheap as long as you don't resolve the symbols. But with memory allocation it's quite likely that you'll need more than 1000/sec, so I'm not sure how well it can scale for that use case.
To add insult to injury, on Linux trace-malloc gives stack entries like this:

  moz_malloc[/home/njn/moz/mi8/dmd64/dist/bin/libmozalloc.so +0x109A]

and you have to run tools/rb/fix-linux-stack.pl afterwards to convert it to something nicer:

  moz_malloc (/home/njn/moz/mi8/memory/mozalloc/mozalloc.cpp:68)

I'm not sure about other platforms.
OSX is the same way, assuming this is the same stack dumping code that ref count logging uses. If we could really make this kind of stack recording faster, that might also speed up ref count logging.
Julian is working on an alternate stack walking implementation for SPS in bug 779291. I'm not sure if it'd meet the needs of this use case.
(In reply to Nicholas Nethercote [:njn] from comment #13)
> To add insult to injury, on Linux trace-malloc gives stack entries like this:
> 
>   moz_malloc[/home/njn/moz/mi8/dmd64/dist/bin/libmozalloc.so +0x109A]
> 
> and you have to run tools/rb/fix-linux-stack.pl afterwards to convert it to
> something nicer:
> 
>   moz_malloc (/home/njn/moz/mi8/memory/mozalloc/mozalloc.cpp:68)
> 
> I'm not sure about other platforms.

What's sadly funny about this is that we go to all the trouble of identifying the symbol name (demangled, even!), the library, and the offset, when all of this could be done in the offline scripts with a little additional smarts wrt load maps.  I imagine the same is true for OS X.  I don't know what impact changing the stack tracing code for trace-malloc and refcnt logging would have on Windows.
> I don't know what impact changing the stack tracing code for trace-malloc and
> refcnt logging would have on Windows.

That's the problem with modifying this kind of stuff;  getting it working on all the important platforms can be a huge pain.  trace-malloc uses a different malloc-interception technique on each of Windows, Mac, and Linux.

(That's something that is attractive about bjacob's "modify jemalloc" approach -- it should Just Work on all the jemalloc platforms...)
> This suggests that telemetry starts up before trace-malloc, but that isn't
> true AFAICT.

Actually, it is true.  Here's the tail of main():

#elif defined(XP_UNIX)
    XRE_TelemetryAccumulate(mozilla::Telemetry::EARLY_GLUESTARTUP_HARD_FAULTS,
                            int(initialRUsage.ru_majflt));
    struct rusage newRUsage;
    if (!getrusage(RUSAGE_SELF, &newRUsage)) {
      XRE_TelemetryAccumulate(mozilla::Telemetry::GLUESTARTUP_HARD_FAULTS,
                              int(newRUsage.ru_majflt - initialRUsage.ru_majflt)
);
    }
#endif
  }

  int result;
  {
    ScopedLogging log;
    result = do_main(argc, argv);
  }

  XPCOMGlueShutdown();
  return result;
}


Telemetry starts up in XRE_TelemetryAccumulate().  trace-malloc starts up within ScopedLogging's constructor.  How annoying.
(In reply to Nicholas Nethercote [:njn] from comment #11)
> > By contrast, my patch currently
> > does mutex locking too, and still runs at near-"native" speed. So mutex
> > locking alone shouldn't explain it, unless there is something really wrong
> > about PR locks.
> 
> So it must be the stack walking and/or symbol gathering that's the problem.

Data point: stack walking alone is very slow (even without symbol gathering). At least when using Glibc's backtrace(). I've let my jemalloc use it to get the top 8 return addresses from the stack on each allocation, and it makes the difference between "native speed" and "slow browser, not very responsive". On a profile, about 50% of time is spent walking stacks.
> I've let my jemalloc
> use it to get the top 8 return addresses from the stack on each allocation,
> and it makes the difference between "native speed" and "slow browser, not
> very responsive".

And 8 isn't that many -- for NewDMD I use 30 at the moment, and any less than about 20 (or 15 if we're pushing it) would start losing useful information :(
> From my experience with the profiler doing walking the stack 1000 times per second is very cheap as 
> long as you don't resolve the symbol.

> Data point: stack walking alone is very slow (even without symbol gathering). At least when using 
> Glibc's backtrace().

So...which is it, Benoit?  :)
You will note that there are two Benoits involved here. We shall discuss this tomorrow in the Toronto office, if we can overcome Benoit degeneracy pressure.
> You will note that there are two Benoits involved here.

I think that's what jlebar was hinting at with his smiley -- the fact that the two Benoits are saying conflicting things :)
Blocks: 799336
Blocks: 799796
Attached patch another draft patch (obsolete) — Splinter Review
New, in-progress patch.  Not really in a state worth looking at.
Attachment #606477 - Attachment is obsolete: true
Attached patch and another draft patch (obsolete) — Splinter Review
Basically a back-up of where I'm at.  Still barely usable.
Attachment #672687 - Attachment is obsolete: true
I'd like us to re-triage this (with an eye on making it a P1): cjones and company have seen apparently device-only spikes of heap-unclassified in the main process which we have no insight into at all.  If these are in fact device-only, I don't know how we're going to fix them without native DMD.
Whiteboard: [MemShrink:P2] → [MemShrink]
Running NewDMD on B2G devices would be challenging.  The first obstacle I can think of is that the stacks NewDMD produces look like this (on desktop Linux):

   0x7f28fefdeeff MallocCallback(void*, unsigned long, DMDThread*)[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x2BA9EFF]
   0x7f28fefdee54 malloc[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x2BA9E54]
         0x405464 malloc[dmd64/dist/bin/firefox +0x5464]
   0x7f28ffc54871 ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x381F871]
   0x7f28ffc579a0 ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x38229A0]
   0x7f28ffc9f6c8 ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x386A6C8]
   0x7f28ffc9f64b ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x386A64B]
   0x7f28ffe50fea ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x3A1BFEA]
   0x7f28ffc8f70b ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x385A70B]
   0x7f28ffc8f46f ???[/home/njn/moz/mi2/dmd64/dist/bin/libxul.so +0x385A46F]

You then have to run tools/rb/fix-linux-stack.pl to convert all the library + offset entries into a real stack entry, with a filename and line number.  fix-linux-stack.pl uses addr2line to do this.  It's really slow (e.g. takes minutes to process a full output file, which can easily be 10s of MBs of stacks), though it could probably be sped up with some caching.

Running fix-linux-stack.pl on a device sounds like a bad idea.  To run it on a desktop machine we'd need to transfer both the original output file and various files like libxul.so, and then munge the paths in the output file.  Or something.
And that's assuming that NS_StackWalk() even works on B2G.

---

I just talked to BenWa about the possibility of incorporating memory profiling into SPS.  The key idea is to change the unit of time of sampling -- instead of taking a sample (i.e. stack trace) every N milliseconds, you'd take a sample every N bytes of heap allocation.  For example, if N was 4 KiB, and you did four 1 KiB allocations, one of those would be sampled;  in contrast, a single 20 KiB allocation would span 4 or 5 samples.

This would require malloc interception, possibly via LD_PRELOAD, or possibly using something like bug 804303.  It would probably also require accurate stack unwinding on B2G devices -- I suspect the pseudostack labels might not give good coverage of the parts of the code that are causing these heap-unclassified spikes -- which is dependent on bug 779291.  Bug 809317 may also be relevant.

---

So I think we're in kind of an awkward situation where the various bits of infrastructure are not quite in place, not when looking at the short-term horizon that B2G needs.  Hmm.
I believe dhylands figured out efficient symbolification for the purposes of running SPS on the device.

ISTM that what you suggest in comment 27 doesn't have anything to do with a sampling profiler or SPS -- you could do the same kind of sampling in NewDMD.  Similarly, you could keep NewDMD's approach of watching every single allocation and still integrate with SPS.
> I believe dhylands figured out efficient symbolification for the purposes of running SPS 
> on the device.

(I don't know how he does it, but for example, you could pull libxul.so from the device and run addr2line on the host, presuming that you have a cross-compiled version, which I think we do.)
sewardj is working on symbolication for sps on arm.

(In reply to Nicholas Nethercote [:njn] from comment #27)
> So I think we're in kind of an awkward situation where the various bits of
> infrastructure are not quite in place, not when looking at the short-term
> horizon that B2G needs.  Hmm.

OTOH, what is it you want to track on the short-term horizon that b2g needs? I thought DMD was to identify heap-unclassified, are you after that or something more?
> OTOH, what is it you want to track on the short-term horizon that b2g needs? I thought 
> DMD was to identify heap-unclassified, are you after that or something more?

The thought is that we can't run the valgrind-based DMD on devices, particularly due to how slow it is.  But we want a DMD-like tool for devices.
> I thought DMD was to identify heap-unclassified, are you after that or
> something more?

Identifying heap-unclassified is the goal (comment 26).  DMD is ideal for that, but if the spikes are large then any memory profiler should provide insight.
(In reply to Justin Lebar [:jlebar] from comment #29)
> > I believe dhylands figured out efficient symbolification for the purposes of running SPS 
> > on the device.
> 
> (I don't know how he does it, but for example, you could pull libxul.so from
> the device and run addr2line on the host, presuming that you have a
> cross-compiled version, which I think we do.)

The libs stored on the phone have the debug information stripped. They have the exported symbols available, but won't have line number mapping or static function names, etc.

My symbolicate script assumes that you've done a build for what's running on the phone and looks for the non-stripped versions of the libraries in your build tree.
FYI, we have a script in-tree that can symbolicate these stacks using Breakpad symbols:
http://mxr.mozilla.org/mozilla-central/source/tools/rb/fix_stack_using_bpsyms.py

We use this for assertions on debug tests, where we only have stripped binaries. We always have Breakpad symbols available when we run tests, so it's easy enough to filter this output. (We could do it on a desktop machine and not a mobile device, as well.)
Attached patch new patch (obsolete) — Splinter Review
The code is awful, but this basically works.

Build it with --enable-dmd.

For B2G, run it normally.  Then, if you send it SIGRTMIN, instead of dumping memory reports it will run DMD checking, and dump output to $CWD/out-$PID.dmd.
Attachment #675764 - Attachment is obsolete: true
cjones, since you're the only one who can reproduce these mysterious main-process dark-matter explosions, it might be worthwhile for you to try running a build with this enabled.  I don't know how much extra memory it takes, but we could turn on the extra 256mb on the Unagi if it helps.
Whiteboard: [MemShrink] → [MemShrink:P1]
This doesn't compile in B2G.  I'll see if I can push this forward.

DMD.cpp
In file included from ../../dist/include/mozilla/Assertions.h:37,
                 from ../../dist/include/nsDebug.h:19,
                 from ../../dist/include/nsCharTraits.h:43,
                 from ../../dist/include/nsStringIterator.h:11,
                 from ../../dist/include/nsAString.h:16,
                 from ../../dist/include/nsSubstring.h:12,
                 from ../../dist/include/nsString.h:14,
                 from ../../dist/include/nsPrintfCString.h:11,
                 from ../../../../ff-git2/src/tools/dmd/DMD.cpp:45:
../../../system/core/include/android/log.h:88: warning: comma at end of enumerator list
../../../../ff-git2/src/tools/dmd/DMD.cpp:78: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:79: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:80: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:81: error: variable or field '__libc_free' declared void
../../../../ff-git2/src/tools/dmd/DMD.cpp:81: error: '__ptr_t' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:82: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:83: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp: In function 'DMDThread* FetchDMDThread()':
../../../../ff-git2/src/tools/dmd/DMD.cpp:286: error: '__libc_malloc' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp: At global scope:
../../../../ff-git2/src/tools/dmd/DMD.cpp:363: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:384: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp:405: error: '__ptr_t' does not name a type
../../../../ff-git2/src/tools/dmd/DMD.cpp: In function 'void* valloc(size_t)':
../../../../ff-git2/src/tools/dmd/DMD.cpp:434: error: '__libc_valloc' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:437: error: '__ptr_t' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:437: error: expected ';' before 'ptr'
../../../../ff-git2/src/tools/dmd/DMD.cpp:440: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:440: error: '__libc_valloc' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:443: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp: In function 'void* memalign(size_t, size_t)':
../../../../ff-git2/src/tools/dmd/DMD.cpp:455: error: '__libc_memalign' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:458: error: '__ptr_t' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:458: error: expected ';' before 'ptr'
../../../../ff-git2/src/tools/dmd/DMD.cpp:461: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:461: error: '__libc_memalign' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:464: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp: In function 'int posix_memalign(void**, size_t, size_t)':
../../../../ff-git2/src/tools/dmd/DMD.cpp:472: error: '__ptr_t' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:472: error: expected ';' before 'ptr'
../../../../ff-git2/src/tools/dmd/DMD.cpp:473: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:476: error: 'ptr' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp: At global scope:
../../../../ff-git2/src/tools/dmd/DMD.cpp:481: error: variable or field 'free' declared void
../../../../ff-git2/src/tools/dmd/DMD.cpp:481: error: '__ptr_t' was not declared in this scope
../../../../ff-git2/src/tools/dmd/DMD.cpp:129: warning: 'gIsTestMode' defined but not used
Depends on: 812070
Attached patch draft, v5 (obsolete) — Splinter Review
Updated version.
Attachment #680878 - Attachment is obsolete: true
This is hacky code which gets minimal integration with bug 804303 working.

Basically, do whatever you want inside malloc_replace_libxul.cpp.  You're in libxul; have fun.

You need to run Firefox with

  LD_PRELOAD=path/to/libtolibxul_malloc_replace.so LD_LIBRARY_PATH=path/to/libxul
Comment on attachment 682637 [details] [diff] [review]
Patch to integrate with bug 804303

Review of attachment 682637 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/replace/tolibxul/tolibxul_malloc_replace.cpp
@@ +13,5 @@
> +
> +extern "C"
> +void *replace_malloc(size_t size, malloc_impl_t malloc_impl)
> +{
> +  return libxul_replace_malloc(size, malloc_impl);

If all you're doing is that, you could just export the replace_* functions from libxul, and LD_PRELOAD it.
> This is hacky code which gets minimal integration with bug 804303 working.

Oh, and you need to build with --enable-malloc-replace, but you probably figured that out already.
> Oh, and you need to build with --enable-malloc-replace, but you probably figured that out already.

And apparently you also may need --enable-jemalloc.  It's supposed to give you an error, but bjacob didn't hit it at first.
NewDMD is currently built into its own library, libdmd.so.  But that's just because I copied what trace-malloc does, and it could be changed.
Maybe this is obvious, but it just occurred to me how incredibly cool it would be if we could provide contributors with a prebuilt binary which ran DMD.
I talked to bjacob about adding sample-based memory profiling to SPS.  It seems quite plausible and would give a similar sort of effect to a distributed version of DMD.
Probably BenWa rather than me ;-) But bring it on, I love getting credit for 50% of what he does.
> Probably BenWa rather than me ;-)

Argh, yes.
Yes, I'd love to integrate memory features into SPS. If someone has time to write an effective way to interpose the malloc function family and report the information to SPS, using something that could work in Nightly, that would help speed that along.
I had patches adding a few jemalloc stats to the SPS data; what more would need to be reported to SPS? I mean, if you're only taking samples once in a while, why would you need to interpose the malloc functions completely?
In a time profiler, the meaning of sampling is obvious -- every N ms you get the stack trace.  What's the equivalent for a memory profiler?  Taking a sample every N ms doesn't make much sense.

So, instead you change the unit of "time" from "milliseconds" to "heap bytes allocated".  (Calls to free() have no effect on this unit of time.)

Then you can sample every N bytes.  Imagine that N = 4096.  If you did four 1024 byte allocations, the fourth one would be sampled.  If you did a 16384 byte allocation, it would be hit by 3 or 4 samples.  If you set N = 1 then you have complete heap profiling.

I have this sampling working in the attached patch -- see MallocCallback().  But for it to work you have to intercept every call to malloc and realloc, so you can update the "heap bytes allocated" time unit.
> So, instead you change the unit of "time" from "milliseconds" to "heap bytes allocated".  (Calls to 
> free() have no effect on this unit of time.)

If we sampled at an interval that's a prime number of bytes (e.g. 4093) and if we started the counter at a random value on startup, this /might/ be OK.

I don't at all buy the argument that it's good for our malloc sampling algorithm to be analogous to the CPU sampling algorithm.  Our CPU sampling algorithm is sound only because of noise introduced by the CPU scheduler.  If it weren't for this noise, sampling the program counter once every N ms would be unsound for the same reason that sampling once every 4096 bytes allocated is unsound.  That is, suppose we're profiling

  while(1) {
    foo(); // takes 1ms
    bar(); // takes 3ms
  }

if there's no external randomness here and we sample every 4ms, we'll see foo() or bar() but never both.  I don't expect we run into this problem in practice because we can rely on noise introduced by the system.

But the order and size of mallocs in our program is much more predictable than the order and size of CPU interruptions we get from the kernel.  The main source of randomness in our mallocs is due to the way threads interleave, but because we're so heavily single-threaded, this should be a relatively small effect.

So I don't think emulating our CPU profiling algorithm is virtuous, and I still very much prefer the situation where we record malloc(x) with some probability p(x), if we can make that fast enough.  This approach is more flexible, in that p(x) can be an arbitrary function.  It's also a statistically sound method of sampling.
I think you raise a good point Justin. I filed bug 813214 for adding memory tools to the profiler not to hijack the discussion.
Here's DMD built with --enable-replace-malloc, built on top of glandium's two patches from bug 804303, and jlebar's patch here.

The whole thing was pretty easy and pleasant!  The only thing I have to complain about is that I didn't have access to the malloc_impl_t and free_impl_t functions outside of the replacements.  So I saved them to sMallocImpl and sFreeImpl the first time the replacements were called.
> draft, v5, using --enable-malloc-replace

I should add: that patch applies on top of the "draft, v5" patch.  At least, I think it does, but I may have made some small changes that mean it doesn't apply cleanly.
Eh, here's the full --enable-malloc-replace version in one patch.  It's 100 lines smaller than the old (based on trace-malloc) version, which is nice.

Note that it stores per-heap-block metadata in a table.  Eventually it would be nice to store that adjacent to the block itself, as bjacob has suggested can be done.
Attachment #682637 - Attachment is obsolete: true
Attachment #683448 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #55)
> Note that it stores per-heap-block metadata in a table.  Eventually it would
> be nice to store that adjacent to the block itself, as bjacob has suggested
> can be done.

Indeed and this is done in the patch that I just uploaded on bug 704240 comment 23. This patch implements a replace_malloc lib that allocates bigger blocks to store doubly linked list elements next to the actual payload. See that comment for more details and how to use it.
new draft version
Attachment #682332 - Attachment is obsolete: true
Attachment #683470 - Attachment is obsolete: true
I'm working on some changes to make this work for B2G; ping me when you get up and I'll send you my latest changes.
glandium and I just finished chatting about this.

When I try to run this on B2G (I'll attach a patch with some necessary minor changes in a moment), replace-malloc gets initialized, but the weak symbols replace_malloc etc. are null, so we never call into DMD's stack-walking routines.  But gdb has no difficulty resolving the symbol replace_malloc into DMD.cpp.  This indicates to me that for some reason the Bionic linker is not happy with the circular dependency we've set up between libxul and libmozglue.

The solution glandium and I came up with is the following:

 1. Create libdmd.so, which doesn't rely on libxul.
 2. When built with --enable-dmd, libxul depends on libdmd.so.
 3. libxul calls into functions in libdmd.so to tell DMD when it's reported a
    block, and to ask DMD to dump information on unreported blocks.

The main trick here is (1).  Right now, DMD.cpp depends on libxul for NS_StackWalk and pldhash.  (Maybe other things, too; this is all I see.)  glandium's solution is to rebuild the requisite files for libdmd.so.  For example, to get NS_StackWalk, he suggests doing:

  VPATH += $(topsrcdir)/xpcom/base
  CPPSRCS += nsStackWalk.cpp

  VPATH += $(topsrcdir)/nsprpub/lib/libc/src
  CSRCS += strcpy.c

Note that with this approach, the dynamic linker will require that libdmd be present when it tries to load libxul (if libxul was built with --enable-dmd).  It strikes me that if we're going to have this constraint, we might as well fold libdmd into libmozglue.

The main advantage of this is that we wouldn't have to add an extra LD_PRELOAD library to every invocation of Gecko.  (This is non-trivial to maintain in our B2G infrastructure, since there are many scripts which invoke the b2g binary -- b2g.sh, which lives on the device; the script which starts b2g inside the debugger; the emulator script, I presume -- and these aren't tightly hooked into the gecko build, so it's a bit tricky to get them to notice when I passed --enable-dmd to the Gecko configure.)
Updated version.  jlebar and I agreed to use this as a base patch for the two sets of development going on (my general clean-ups, and jlebar/glandium's library munging).
Attachment #685982 - Attachment is obsolete: true
> It strikes me that if we're going to have this constraint, we might as well fold libdmd into libmozglue.

This might not be possible because of a chicken-and-egg problem (libmozglue is built too early for plenty of things)
It still doesn't work, but at least it builds.

Note that I had to change how DMD is linked into libxul; before this change, DMD just wasn't getting linked in, as far as I could tell.  Maybe it worked for you because of differences in the toolchain, or maybe it worked for you because you had some stuff sitting around in your objdir.  (I found clobbering was sometimes necessary when I was working on this patch.)

Anywho, we're changing all this stuff imminently; you probably don't need to worry about the Makefile.in changes in this patch.  The other changes are probably going to stay.
This applies on top of draft v7 and b2g addendum 1.  We're still a ways away from getting it to link.

Oh, and you're going to /love/ how I got around the problems in nsTHashtable.

Maybe we should just change nsTHashtable instead of doing this gross thing.
Attached patch njn addendum 1 (obsolete) — Splinter Review
This patch applies on top of draft, v7.  It includes a bunch of my changes from
today:

- Output can be triggered by the signal mechanism.  It goes into a
  dmd-<BLAH>.txt.gz file, next to the memory-report-<BLAH>.json.gz file.

- Invocation is now controlled by the $DMD environment variable.  Set it to
  "test" for test mode, and anything else for normal mode.
  
  I thought $DMD would be inherited by the child processes but it appears not,
  so I'll need to work something else out there.

- Various other small improvements.

- Double-count warnings are temporarily disabled.

And then I incorporated all of the changes from jlebar's addendum 1, and most
of them from addendum 2.  Things changed/not incorporated from addendum 2:

- I want the exit(1) calls for test mode.  They're legitimate early exits.

- I omitted the dmdHashtable changes, because they hopefully won't be necessary

- I had trouble with gettid() on my Linux box.  It's just needed for the
  assertions, and really I just want to assert that the global lock is locked;
  I don't care which thread locked it.  So I removed the
  GetCurrentThreadId/ThreadIdType stuff and replaced it with simpler stuff.

- I didn't incorporate the tools/dmd/Makefile.in changes from addendum 2.
Comment on attachment 686422 [details] [diff] [review]
B2G addendum 2: Create libdmd, work towards making it link properly

Review of attachment 686422 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/dmd/DMD.cpp
@@ +177,5 @@
> +
> +namespace {
> +
> +// MutexBase implements the platform-specific parts of a mutex.
> +class MutexBase

I feel we should have such locking and synchronization classes in MFBT.
(In reply to Nicholas Nethercote [:njn] from comment #64)
>   I thought $DMD would be inherited by the child processes but it appears not,
>   so I'll need to work something else out there.

That doesn't seem right. Unless something is resetting the variable, or the child processes are not actual child processes, this shouldn't happen.
You may be interested in taking a quick look at https://bugzilla.mozilla.org/attachment.cgi?id=686550&action=diff .

In particular:

(In reply to Justin Lebar [:jlebar] from comment #63)
> Oh, and you're going to /love/ how I got around the problems in nsTHashtable.

Instead of working around problems with XPCOM containers and/or reimplementing them, it seems better to just write a STL allocator that uses the raw functions from the malloc_table_t directly, so it bypasses the instrumentation and the lock. You can find such a STL allocator in the above patch, attachment 686550 [details] [diff] [review]. Should be trivial to reuse, just slap in your own pointer to malloc_table_t. The good thing with that approach is that you do it once and for all and from there you can use all of the STL in your replace_malloc code without having to worry.
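A minimal sketch of that idea, using a hypothetical two-entry stand-in for malloc_table_t (the real struct has more members): the allocator routes all STL container allocations through the raw table, bypassing the instrumentation and the lock.

```cpp
#include <cstdlib>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for malloc_table_t.
struct malloc_table_t {
  void* (*malloc)(size_t);
  void  (*free)(void*);
};
static malloc_table_t gMallocTable = { std::malloc, std::free };

// STL allocator that forwards straight to the raw malloc table.
// (C++11-style minimal allocator; the 2012-era code would need the full
// C++03 allocator boilerplate on top of this.)
template <typename T>
struct RawTableAllocator {
  typedef T value_type;
  RawTableAllocator() {}
  template <typename U>
  RawTableAllocator(const RawTableAllocator<U>&) {}
  T* allocate(size_t aCount) {
    return static_cast<T*>(gMallocTable.malloc(aCount * sizeof(T)));
  }
  void deallocate(T* aPtr, size_t) { gMallocTable.free(aPtr); }
};
template <typename T, typename U>
bool operator==(const RawTableAllocator<T>&, const RawTableAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const RawTableAllocator<T>&, const RawTableAllocator<U>&) {
  return false;
}
```

With this, something like std::vector<int, RawTableAllocator<int>> never touches the instrumented malloc.
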

 (In reply to Mike Hommey [:glandium] from comment #65)
> Comment on attachment 686422 [details] [diff] [review]
> B2G addendum 2: Create libdmd, work towards making it link properly
> 
> Review of attachment 686422 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: tools/dmd/DMD.cpp
> @@ +177,5 @@
> > +
> > +namespace {
> > +
> > +// MutexBase implements the platform-specific parts of a mutex.
> > +class MutexBase
> 
> I feel we should have such locking and synchronization classes in MFBT.

In attachment 686550 [details] [diff] [review] there is a 5-lines-of-code self-contained Mutex class, using atomic fetch-and-add. I found this a nicer compromise than reimplementing something like mozilla/Mutex.h in my tool. The downside is that it busy-spins instead of passively waiting. It currently uses the GCC/Clang intrinsic for that, but it's trivial to add a code path using the MSVC intrinsic.
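In the spirit of what bjacob describes (though not the code from attachment 686550), a self-contained busy-spinning mutex built on the GCC/Clang __sync fetch-and-add intrinsic could look like this ticket-lock sketch:

```cpp
// Busy-spinning ticket mutex using the GCC/Clang fetch-and-add intrinsic.
// The downside, as noted, is that waiters spin instead of sleeping.
class SpinMutex {
  volatile unsigned mNext;     // next ticket to hand out
  volatile unsigned mServing;  // ticket currently allowed to enter
public:
  SpinMutex() : mNext(0), mServing(0) {}
  void Lock() {
    unsigned ticket = __sync_fetch_and_add(&mNext, 1);
    while (mServing != ticket) {}  // busy-spin until it's our turn
  }
  void Unlock() {
    // Full barrier: publishes our writes before admitting the next waiter.
    __sync_fetch_and_add(&mServing, 1);
  }
  bool IsLocked() const { return mServing != mNext; }
};
```
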
(In reply to Benoit Jacob [:bjacob] from comment #67)
> Instead of working around problems with XPCOM containers and/or
> reimplementing them, it seems better to just write a STL allocator that uses
> the raw functions from the malloc_table_t directly, so it bypasses the
> instrumentation and the lock. You can find such a STL allocator in the above
> patch, attachment 686550 [details] [diff] [review]. Should be trivial to
> reuse, just slap in your own pointer to malloc_table_t. The good thing with
> that approach is that you do it once and for all and from there you can use
> all of the STL in your replace_malloc code without having to worry.

Sadly std::map is not a hashtable. std::unordered_map is, but it's C++11.
(In reply to Nicholas Nethercote [:njn] from comment #60)
> Created attachment 686272 [details] [diff] [review]
> draft, v7, using --enable-replace-malloc
> 
> Updated version.  jlebar and I agreed to use this as a base patch for the
> two sets of development going on (my general clean-ups, and
> jlebar/glandium's library munging).

You may want to define MOZ_REPLACE_ONLY_MEMALIGN before including replace_malloc.h, and skip replace_posix_memalign, replace_aligned_alloc and replace_valloc.
In case it's not clear, we're trying hard to get newdmd done so we can use it on B2G.  Getting it to work on Windows would be nice, but it's not my priority at the moment.  Speaking for myself, I am totally OK landing this using pthreads, with no Windows fallback.

> In attachment 686550 [details] [diff] [review] there is a 5-lines-of-code self-contained
> Mutex class, using atomic fetch-and-add.

FWIW I'm not convinced this would be a good implementation for B2G, where we have just one core, and so we'll always busy-wait until the end of our timeslice.  We can use pthreads; I think it's probably worth doing so.

> - I want the exit(1) calls for test mode.  They're legitimate early exits.

The reason I (blindly) changed all the exit() calls is that I've seen deadlocks and other strange behavior in B2G when I call exit(), because exit() runs static destructors without first running xpcom-shutdown, and that's not a path we test.

But I think the exit(1) calls are probably OK, because we quit before we initialize most stuff.

> I had trouble with gettid() on my Linux box.  It's just needed for the
> assertions, and really I just want to assert that the global lock is locked,
> I don't care which thread locked it.  So I removed the
> GetCurrentThreadId/ThreadIdType stuff and replaced it with simpler stuff.

Yeah, gettid() is not exported by all libc's, so you have to invoke syscall() directly.  It's bizarre.

http://stackoverflow.com/questions/9565177/call-gettid-witin-glibc
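The syscall() workaround looks like this (Linux-only; on the main thread the result equals getpid()):

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// gettid() isn't exported by all libcs, so invoke the syscall directly.
static pid_t GetCurrentThreadId() {
  return (pid_t) syscall(SYS_gettid);
}
```
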
> <<<<<<< HEAD
> 
>   if (!aPtr || !gIsDMDRunning ||
>     (t = DMDThread::Fetch())->interceptsAreBlocked()) {
> =======
>   if (!aPtr || !gIsDMDRunning ||
>       (t = DMDThread::Fetch())->interceptsAreBlocked()) {
> >>>>>>> 82b94bf... Bug 717853 - Creating libdmd, and getting partway towards its linking.

I don't know if you intentionally left out this change, but we usually indent the second line of an if statement as in the second hunk, rather than as in the first.
(In reply to Justin Lebar [:jlebar] from comment #70)
> In case it's not clear, we're trying hard to get newdmd done so we can use
> it on B2G.  Getting it to work on Windows would be nice, but it's not my
> priority at the moment.  Speaking for myself, I am totally OK landing this
> using pthreads, with no Windows fallback.
> 
> > In attachment 686550 [details] [diff] [review] there is a 5-lines-of-code self-contained
> > Mutex class, using atomic fetch-and-add.
> 
> FWIW I'm not convinced this would be a good implementation for B2G, where we
> have just one core, and so we'll always busy-wait until the end of our
> timeslice.  We can use pthreads; I think it's probably worth doing so.

Yeah - that would waste a minimum of 5 milliseconds (on average) of CPU each time a thread had to wait, since we have a 10-millisecond timeslice. To put that in perspective, at 60 FPS, that's about 1/3 of a single frame-time.
It turns out that I can #include <unordered_map> and it builds with no problems on bionic.  I wonder if we can use that after all.
As I said on irc: unordered_map is available on gcc 4.4, clang and MSVC 2010. However, we're not necessarily building with -std=gnu++0x on gcc 4.4 or clang on linux, thanks to bugs in headers.

I'll add that we can probably add a configure test to error out with an explicit message when unordered_map is not supported and dmd enabled.
> >   I thought $DMD would be inherited by the child processes but it appears not,
> >   so I'll need to work something else out there.
> 
> That doesn't seem right.

You're right.  Turns out I was doing something else wrong.
This applies on top of njn addendum 1, and obsoletes the previous 2 B2G addenda.

We have (basically, or actually; not sure) no external object-file dependencies here.  Yay.

Unfortunately, eliminating the dependencies required using unordered_map and unordered_set, which are C++11-only.  My desktop clang is not happy about compiling this, even if I add CXXFLAGS += -std=gnu++0x to tools/dmd/Makefile.in.  (You need to add it after client.mk, otherwise it has no effect!)

I still think eliminating dependencies is probably the safest thing for us to do here, but it's not clear whether that's feasible.

I need to call it a night, though.
Attachment #686317 - Attachment is obsolete: true
Attachment #686422 - Attachment is obsolete: true
Attached patch njn addendum 2 (obsolete) — Splinter Review
This patch applies on top of "njn addendum 1".  Notable changes:

- Child processes are now handled properly;  previously DMD wasn't enabled for
  them.  This required lots of startup-related changes.  

- I realized that gMethods and gLibraries were being used to do nothing more
  than very expensive no-ops(!)  I've removed them.

Furthermore, I think the best way to reduce dependencies on libxul is not to
use the STL but to use js::HashMap, js::HashSet and js::Vector.  Good things about 
them:

- They are under our control. 
- They are defined entirely in headers so there are no linking troubles.
- They allow custom allocators.
- I'm pretty sure they're flexible enough to handle some of the strange
  key/value combinations we have.
- They're really quite nice to use, and the documentation (in
  js/public/HashTable.h) is good, with numerous examples.

Before I realized gMethods and gLibraries could be removed I converted them to 
js::HashTable, and it worked well.  The AllocPolicy (a.k.a. custom allocator)
that I used remains in this patch.
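For reference, a custom AllocPolicy with the shape js/public/HashTable.h expected at the time (malloc_/realloc_/free_/reportAllocOverflow) can be very small. This sketch just forwards to the system allocator; in DMD it would forward to the raw malloc table so the tool's own allocations aren't instrumented. The class name here is hypothetical:

```cpp
#include <cstdlib>
#include <cstddef>

// Sketch of a js::HashMap/js::Vector AllocPolicy (hypothetical name).
class RawAllocPolicy {
public:
  void* malloc_(size_t aBytes) { return std::malloc(aBytes); }
  void* realloc_(void* aPtr, size_t aOldBytes, size_t aBytes) {
    (void) aOldBytes;  // the old size isn't needed by the system realloc
    return std::realloc(aPtr, aBytes);
  }
  void free_(void* aPtr) { std::free(aPtr); }
  void reportAllocOverflow() const {}  // nothing useful to do in-process
};
```
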

----

jlebar, I'm happy to continue converting the hash tables to js::HashTable on 
Monday, if you want to take a day off from this stuff.  Or if you want to 
continue yourself, feel free, but I strongly recommend converting one hash 
table at a time and testing on desktop, rather than converting them all in one
big bang.  (Even gMethods/gLibraries, which were simple, took me a couple of 
goes to get right.)
Attachment #686940 - Attachment is patch: true
> - I realized that gMethods and gLibraries were being used to do nothing more
>  than very expensive no-ops(!)

Because NS_DescribeCodeAddress was already interning these strings?

> Or if you want to 
> continue yourself, feel free, but I strongly recommend converting one hash 
> table at a time and testing on desktop

Yeah, you're right.  But part of what I wanted to prove with the hashtable conversion was that when we got rid of the nsTHashtable / pldhash stuff, that we'd actually have covered all our dependencies.  So I guess the STL conversion was useful in that way.

I'll see if I can get the JS containers to work.
> Because NS_DescribeCodeAddress was already interning these strings?

I don't know what it does on the inside, but I was getting the strings that NS_DescribeCodeAddress copied into a temporary buffer, and then interning them, even though they were used only once and immediately afterwards.
So...I'm not sure why we didn't think of this earlier, but: Our effort to move the DMD code out of libxul was based on the premise that the Android dynamic linker doesn't like circular dependencies.

Well, libdmd depends on libmozglue, because it calls functions which malloc, such as strdup.  And then the weak replace_malloc symbols in libmozglue depend on libdmd.

This is exactly the same circular dependency we had when the DMD code lived inside libxul.  It also seems to work exactly as well, which is to say, not at all.

I'm going to build a shared library which /doesn't/ depend on libmozglue and verify that linker is able to resolve weak symbols which come from that library.  If so, I think some options are

 1. eliminate libmozglue dependency in libdmd.  This may be hard because libdmd depends on nsStackWalk, which we may not be able to modify so that it never malloc's.

 2. merge libdmd and libmozglue.

 3. fix the dynamic linker.
(In reply to Justin Lebar [:jlebar] from comment #80)
> So...I'm not sure why we didn't think of this earlier, but: Our effort to
> move the DMD code out of libxul was based on the premise that the Android
> dynamic linker doesn't like circular dependencies.
> 
> Well, libdmd depends on libmozglue, because it calls functions which malloc,
> such as strdup.  And then the weak replace_malloc symbols in libmozglue
> depend on libdmd.
> 
> This is exactly the same circular dependency we had when the DMD code lived
> inside libxul.  It also seems to work exactly as well, which is to say, not
> at all.
> 
> I'm going to build a shared library which /doesn't/ depend on libmozglue and
> verify that linker is able to resolve weak symbols which come from that
> library.  If so, I think some options are
> 
>  1. eliminate libmozglue dependency in libdmd.  This may be hard because
> libdmd depends on nsStackWalk, which we may not be able to modify so that it
> never malloc's.

If you add malloc, free, strdup and other such functions that just call the malloc function pointers you got in replace_init (except for strdup, where you'd need your own implementation), and make them visibility-hidden for good measure, you should be able to link NS_StackWalk into libdmd, and NS_StackWalk will then use these functions.

>  2. merge libdmd and libmozglue.
> 
>  3. fix the dynamic linker.

The lack of dynamic symbol resolution surely doesn't help here.
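glandium's hidden-wrapper suggestion could look roughly like the sketch below. In libdmd the wrappers would be exported under the real names (malloc, free, strdup) with hidden visibility so that NS_StackWalk resolves to them instead of libmozglue; they're renamed wrap_* here (and the malloc table is a two-entry stand-in) so the sketch can run as an ordinary program:

```cpp
#include <cstdlib>
#include <cstring>
#include <cstddef>

// Hypothetical stand-in for the malloc_table_t handed to replace_init();
// the real struct has many more members.
struct malloc_table_t {
  void* (*malloc)(size_t);
  void  (*free)(void*);
};
static const malloc_table_t* gMallocTable = nullptr;

extern "C" void demo_replace_init(const malloc_table_t* aTable) {
  gMallocTable = aTable;
}

// In libdmd: exported as malloc/free/strdup with visibility("hidden").
extern "C" void* wrap_malloc(size_t aSize) {
  return gMallocTable->malloc(aSize);
}

extern "C" void wrap_free(void* aPtr) {
  gMallocTable->free(aPtr);
}

// strdup needs its own implementation on top of the table's malloc.
extern "C" char* wrap_strdup(const char* aStr) {
  size_t len = strlen(aStr) + 1;
  char* copy = static_cast<char*>(gMallocTable->malloc(len));
  if (copy) {
    memcpy(copy, aStr, len);
  }
  return copy;
}
```
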
Attached file Debug trace of linking b2g (obsolete) —
(Sorry this is long.)

As glandium suggested, I rebuilt the linker with debug mode enabled.  (This was surprisingly easy; see bionic/linker/README.TXT, and find the relevant objdir in your B2G clone to nuke by searching for e.g. linker_environ.o.)

I made a libdmd.so which contains just replace_init and replace_malloc.  I also tweaked b2g.sh so that instead of exec'ing /system/bin/b2g, it simply runs it.  The results of running b2g.sh with linker debugging enabled are...interesting.

You can see in the attached trace that loading libdmd.so is the first thing we do.  So that's good.  But then later on, when we resolve replace_init ("SEARCH replace_init"), we don't look in libdmd.

It seems that the linker doesn't look inside a library until it first resolves a symbol from that library.  To illustrate, here's the first time libmozglue is mentioned in the log after the LD_PRELOAD bit at the very beginning:

> SEARCH open in liblog.so@0x400ea000 000766be 41
> SEARCH open in libstlport.so@0x4013e000 000766be 484
> SEARCH open in libstdc++.so@0x400d4000 000766be 21
> SEARCH open in libm.so@0x40019000 000766be 40
> SEARCH open in libc.so@0x400ee000 000766be 484
> FOUND open in libc.so (00018c05) 36
> si libmozglue.so sym open s->st_value = 0x00018c05, found in libc.so, base = 0x400ee000
> si libmozglue.so sym write s->st_value = 0x0000c588, found in libc.so, base = 0x400ee000
> RELO JMP_SLOT 400d17f8 <- 400fa588 write
> SEARCH close in libmozglue.so@0x400b8000 006a3695 231
> SEARCH close in liblog.so@0x400ea000 006a3695 25
> SEARCH close in libstlport.so@0x4013e000 006a3695 508
> SEARCH close in libstdc++.so@0x400d4000 006a3695 16
> SEARCH close in libm.so@0x40019000 006a3695 188

Notice how we don't search for open in libmozglue, but we do search for close in libmozglue.  I'm not sure what the "si libmozglue.so sym open" lines are saying, but it seems like it's something to the effect of "we found |open|, which libmozglue uses."

The same thing happens with the first mention of libdmd.so after the initial LD_PRELOAD stuff:

> SEARCH puts in libmozglue.so@0x400b8000 00077cb3 180
> SEARCH puts in liblog.so@0x400ea000 00077cb3 34
> SEARCH puts in libstlport.so@0x4013e000 00077cb3 950
> SEARCH puts in libstdc++.so@0x400d4000 00077cb3 18
> SEARCH puts in libm.so@0x40019000 00077cb3 145
> SEARCH puts in libc.so@0x400ee000 00077cb3 950
> FOUND puts in libc.so (0001b7b9) 128
> si libdmd.so sym puts s->st_value = 0x0001b7b9, found in libc.so, base = 0x400ee000
> RELO JMP_SLOT 400b7ac4 <- b000515d dladdr
> SEARCH strlen in libdmd.so@0x400b6000 07ab92be 4
> SEARCH strlen in libmozglue.so@0x400b8000 07ab92be 149
> SEARCH strlen in liblog.so@0x400ea000 07ab92be 45
> SEARCH strlen in libstlport.so@0x4013e000 07ab92be 469
> SEARCH strlen in libstdc++.so@0x400d4000 07ab92be 29

We don't look inside libdmd.so when resolving puts, but we do look inside it when resolving strlen (and when resolving all other symbols after this point, as far as I can tell).

So this gives an immediate cause for why we don't resolve the weak replace_init symbol; when we try to resolve replace_init, we're not far enough along in linking that we're looking inside libdmd:
 
> SEARCH replace_init in libstlport.so@0x4013e000 0e399774 499
> SEARCH replace_init in libstdc++.so@0x400d4000 0e399774 20
> SEARCH replace_init in libm.so@0x40019000 0e399774 92
> SEARCH replace_init in libc.so@0x400ee000 0e399774 499
> SEARCH replace_init in libdl.so@0x00000000 0e399774 0
> RELO GLOB_DAT 400d1770 <- 00000000 replace_init

(This trace appears before the one above, in the log.)  But that of course doesn't explain what's actually going wrong.

A clue that I don't know how to interpret is: When we process libdmd.so before pushing it to the device, the processed library is empty, according to nm.  (I'm not sure exactly what we do, but I believe we strip debugging information and run elfhack on the binary.)  In the traces above, I've been using the unstripped binary.  When I use the stripped, empty binary, libdmd.so barely shows up in the linker trace at all.

But why does processing libdmd.so erase everything, and is that relevant to the question at hand about weak symbols?  If it helps, here's what nm shows for the unprocessed binary:

> 000007b0 T NS_DescribeCodeAddress
> 00000754 T NS_FormatCodeAddressDetails
> 00000814 T NS_StackWalk
> 00000630 T PL_strcpy
> 000005e4 T PL_strncpy
> 00000604 T PL_strncpyz
> 000019d8 d _DYNAMIC
> 00001aac d _GLOBAL_OFFSET_TABLE_
> 00000650 t _GLOBAL__I_DMD.cpp
>          U _Unwind_Backtrace
>          U _Unwind_VRS_Get
> 00000708 t _Z14DemangleSymbolPKcPci
> 000006f8 t _Z28StackWalkInitCriticalAddressv
> 00001ad4 b _ZL12gMallocTable
> 0000084c t _ZL15unwind_callbackP15_Unwind_ContextPv
> 00001ad8 b _ZL16gCriticalAddress
> 00000710 t _ZN7mozilla21FramePointerStackWalkEPFvPvS0_S0_EjS0_PS0_S0_
> 00000678 T _ZN7mozilla3dmd4DumpEPFvPvPKczES1_
> 000006a8 T _ZN7mozilla3dmd4InitEv
> 00000698 T _ZN7mozilla3dmd6ReportEPKvPKc
> 00000660 T _ZN7mozilla3dmd7FpWriteEPvPKcz
> 00000688 T _ZN7mozilla3dmd8UnreportEPKv
>          U __aeabi_unwind_cpp_pr0
>          U __aeabi_unwind_cpp_pr1
> 00001ad4 A __bss_start
> 00001ad4 A _edata
> 00001ae0 A _end
>          U dladdr
>          U puts
> 000006dc T replace_init
> 000006b8 T replace_malloc
>          U snprintf
>          U strcpy
>          U strlen


Another thing I did was verify using a minimal testcase that I can in fact resolve weak symbols using a similar scheme to what we have here with libdmd.so, libmozglue.so, and the b2g binary.  It all worked as expected, so long as libdmd.so came first in the list of LD_PRELOAD'ed libraries.

In the absence of any good ideas here, I guess I'm going to look through the linker's source to see if I can understand what's going on here.
The problem is obvious: libdmd.so is not linked when libmozglue.so is being linked, because libmozglue.so is loaded as a dependency of libdmd.so (it is listed as a DT_NEEDED entry in libdmd.so), so the linker can't resolve symbols in libdmd.so from libmozglue.so. "Fixing" the linker to allow that to happen sounds dangerous.

You need to remove the dependency of libdmd on libmozglue.so. IOW, you need to empty MOZ_GLUE_LDFLAGS in your Makefile.in.
> You need to remove the dependency of libdmd on libmozglue.so. IOW, you need to empty 
> MOZ_GLUE_LDFLAGS in your Makefile.in.

And WRAP_LDFLAGS, apparently.  But wow, that worked!!
Attached patch draft, v8 (obsolete) — Splinter Review
New, complete patch.  This version uses JS containers throughout instead of Gecko ones, so there should be no remaining dependencies on libxul.
Attachment #686272 - Attachment is obsolete: true
Attachment #686459 - Attachment is obsolete: true
Attachment #686850 - Attachment is obsolete: true
Attachment #686940 - Attachment is obsolete: true
Attachment #687491 - Attachment is obsolete: true
>diff --git a/xpcom/glue/standalone/nsGlueLinkingDlopen.cpp b/xpcom/glue/standalone/nsGlueLinkingDlopen.cpp
>--- a/xpcom/glue/standalone/nsGlueLinkingDlopen.cpp
>+++ b/xpcom/glue/standalone/nsGlueLinkingDlopen.cpp
>@@ -35,17 +35,17 @@ NS_HIDDEN __typeof(dlsym) __wrap_dlsym;
> NS_HIDDEN __typeof(dlclose) __wrap_dlclose;
> }
> 
> #define dlopen __wrap_dlopen
> #define dlsym __wrap_dlsym
> #define dlclose __wrap_dlclose
> #endif
> 
>-#ifdef NS_TRACE_MALLOC
>+#if defined(NS_TRACE_MALLOC)
> extern "C" {
> NS_EXPORT_(__ptr_t) __libc_malloc(size_t);
> NS_EXPORT_(__ptr_t) __libc_calloc(size_t, size_t);
> NS_EXPORT_(__ptr_t) __libc_realloc(__ptr_t, size_t);
> NS_EXPORT_(void)    __libc_free(__ptr_t);
> NS_EXPORT_(__ptr_t) __libc_memalign(size_t, size_t);
> NS_EXPORT_(__ptr_t) __libc_valloc(size_t);
> }

This is just cosmetic, right?  I might move this cleanup (all the changes to
this file) to a separate patch (or just nix it altogether).
Trolling through [1], I noticed this:

> (1) Make sure your binary provides PT_GNU_EH_FRAME.  This is the quickest
>     path through the unwinder, since the table is pre-sorted by the linker.

It seems that if we /don't/ provide this, _Unwind_Backtrace has to malloc and then sort some list.  If that's the case, we probably want to try to do this, whatever it is.  Unless the next few comments rule this out.

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=24724#c13
Attached patch B2G patch atop draft v8 (obsolete) — Splinter Review
This is sufficient to get things building on B2G.  I'm testing it as we speak.

There are a bunch of XXX jlebar's in here which may function as a preliminary review.
Attached patch draft, v8, njn addendum 1 (obsolete) — Splinter Review
This patch applies on top of "draft, v8".

Major changes:

- Reporting of double-counted and mis-unreported blocks is working again.  It
  now is shown at the same time as reported/unreported blocks, and there's 
  similar duplicate detection (which can greatly reduce the amount of output).
  
- The $DMD env var is more capable.  In particular, you can specify the sample
  size with it (e.g. |DMD='sample-size=4096'|).
  
I haven't addressed jlebar's review comments yet, nor incorporated the 
necessary B2G changes from "B2G patch atop draft v8".  I'll do that tomorrow.
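As a rough idea of what pulling the sample size out of $DMD might involve (the real parser and its option names may differ -- this is a hypothetical sketch):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch: extract N from a $DMD string like "sample-size=4096",
// falling back to a default for strings like "test" or an unset variable.
static long ParseSampleSize(const char* aDMDEnv, long aDefault) {
  if (!aDMDEnv) {
    return aDefault;
  }
  const char* opt = std::strstr(aDMDEnv, "sample-size=");
  if (!opt) {
    return aDefault;
  }
  char* end;
  long n = std::strtol(opt + std::strlen("sample-size="), &end, 10);
  return n > 0 ? n : aDefault;
}
```
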
Improved get_about_memory.py and new fix_b2g_stack.py scripts are up on my git branch.  If DMD is active and you do get_about_memory.py, we now pull the DMD dumps and fix them up.

I want to make at least one other change to this, to make the stack-fixup go faster by caching addr2line results between runs of fix_b2g_stack.

https://github.com/jlebar/B2G/tree/malloc-profile
Attached patch draft, v8, njn addendum 2 (obsolete) — Splinter Review
Changes in this patch:

- It's now possible to call Dump() multiple times per browser session.

- I removed Unreport(), because it wasn't really useful -- it was always 
  called on a block just before the block was freed, in which case unreporting
  it was of little use.

- I've included the B2g build fixes and partially addressed the review comments
  from "B2G patch atop draft v8".
With the changes I just pushed, we can now process a B2G DMD report in ~3s after the first time we run it on a given set of libraries.  (The first run takes about 20s and writes a cache file to disk.)

I think it should be relatively easy to move this code into m-c and use it for fix-linux-stack.pl and friends, although I haven't looked at this, since my focus has been on B2G.
Attachment #687949 - Attachment is obsolete: true
Attached patch draft, v8.1 (obsolete) — Splinter Review
Rebased atop latest patches in bug 804303.  No interesting changes.
Attachment #687628 - Attachment is obsolete: true
Attached patch draft v8.1, njn addendum 2.1 (obsolete) — Splinter Review
Rebased atop latest changes from bug 804303.  (Addendum 1 didn't need any changes.)

If playing with patches isn't your idea of a fun evening, feel free to pull my
git branch (which, fair warning, I rebase occasionally):

https://github.com/jlebar/mozilla-central/tree/newdmd
Attachment #688622 - Attachment is obsolete: true
(In reply to Justin Lebar [:jlebar] from comment #94)
> If playing with patches isn't your idea of a fun evening,

There are other ways to spend an evening?
I did some profiling with Cachegrind.  It failed to annotate a lot of code,
though.  The only useful thing I got out of it was that the loop in
MallocCallback was expensive when gSampleSize==1, so I avoided the loop in that
case.

Having fixed that, I then tried something different.  I tweaked test mode so it
just allocates 1 million blocks -- with a variety of stack traces -- and then
tried removing certain operations, in order to estimate where the time is going.
  
For a call to malloc, the might-be-expensive operations (from replace_malloc
and MallocCallback) are as follows:
  
- DMDThread::Fetch -- DMD_GET_TLS_DATA (pthread_getspecific)
- gMallocTable->malloc
- AutoLockState
- AutoBlockIntercepts
- gMallocTable->malloc_usable_size
- StackTrace::Do -- get stack trace, lookup hash table, maybe dup and insert
- insert block into hash table 

Here are some measurements after I selectively disabled various things:

  base:                                                 5.4s
  base + gIsDMDRunning==false:                          0.1s (-5.3s)
  base + remove DMDThread::Fetch+AutoBlockIntercepts:   5.4s
  base + remove malloc_usable_size:                     5.4s 
  base + remove malloc + remove malloc_usable_size:     5.4s
  base + remove AutoLockState:                          5.4s
  base + remove StackTrace lookup/dup/insert:           5.2s (-0.2s)
  base + remove NS_StackWalk + remove StackTrace l/d/i: 0.6s (-4.8s)
  base + empty unwind_callback:                         5.2s (-0.2s)
  base + don't insert block into hash table:            5.0s (-0.4s)
  
So, in summary:
- Total is 5.4s
- NS_StackWalk is 4.6s
  - Of that, 0.2s is within unwind_callback
- Block insert is 0.4s
- StackTrace lookup/dup/insert is 0.2s
- Everything else is 0.2s

Stack unwinding dominates;  hash table insertions are minor;  everything else
is negligible.  This might explain why the Cachegrind annotations were bad --
it couldn't annotate the stack unwinding code.

The fact that we unwind entire stacks but only look at the top 24 frames is
now doubly annoying.  _Unwind_Backtrace seems to allow early termination of
unwinding, but NS_StackWalk doesn't expose that.
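The early-termination hook is simple at the _Unwind_Backtrace level: the walk stops as soon as the callback returns anything other than _URC_NO_REASON, which is exactly what NS_StackWalk would need to expose. A sketch (Itanium-ABI unwinder; the ARM EHABI enum values differ slightly):

```cpp
#include <unwind.h>

struct TraceData {
  int mFrames;
  int mMaxFrames;
};

static _Unwind_Reason_Code Callback(struct _Unwind_Context* aCtx, void* aArg) {
  TraceData* data = static_cast<TraceData*>(aArg);
  data->mFrames++;
  // Bail out once we've seen enough frames instead of unwinding the
  // entire stack.
  return data->mFrames >= data->mMaxFrames ? _URC_NORMAL_STOP
                                           : _URC_NO_REASON;
}

// Recurse a few times so there are guaranteed to be more than mMaxFrames
// frames on the stack when we start unwinding.
__attribute__((noinline)) int DeepThenWalk(int aDepth, TraceData* aData) {
  if (aDepth == 0) {
    _Unwind_Backtrace(Callback, aData);
    return aData->mFrames;
  }
  return DeepThenWalk(aDepth - 1, aData) + 1;  // +1 defeats tail calls
}
```
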

It's an interesting question if stack unwinding is as expensive on other
platforms.  I suspect so.
I also tried sampling.  All of the allocations in my test code were 144 bytes, so a sample size of 144 or less would be equivalent to a sample size of 1.

Times:
- sample-size =     1:  5.4s
- sample-size =   144:  5.4s
- sample-size =   288:  2.8s
- sample-size =   576:  1.4s
- sample-size =  1024:  0.9s
- sample-size =  4096:  0.3s
- sample-size = 16384:  0.2s

Just more confirmation of the observations in comment 96.
I just tried a patch that allowed an NS_StackWalk callback to bail early, and made DMD's callback bail beyond various lengths:

- length = 24:  5.4s
- length = 12:  5.3s
- length =  6:  3.4s
- length =  3:  2.7s
- length =  2:  2.5s
- length =  1:  2.2s
- length =  0:  1.9s

There's no drop until we get the length below 12 because in my artificial example all the stacks only have a length of 12.

And even with length 0 the time is still relatively high. That's partly because DMD passes a "skip" value of 2 to NS_StackWalk;  when I change that to 0 and length to 0, it drops to 1.4s.  Even in that case it's still getting one frame, because we have to get at least one frame in order for the callback to bail out.  Still, that's a long way from 0.2s, so maybe there's some high startup overhead for each call to _Unwind_Backtrace.

Anyway, for normal browser runs stack depths of 50 and higher are common, so if we implement this early bailout at depth 24 it'll help quite a bit.
> Anyway, for normal browser runs stack depths of 50 and higher are common, so
> if we implement this early bailout at depth 24 it'll help quite a bit.

I filed bug 818793 for this.
FYI, macos doesn't have memalign, so the test becomes even more complicated.  :(
(In reply to Justin Lebar [:jlebar] from comment #100)
> FYI, macos doesn't have memalign, so the test becomes even more complicated.
> :(

what are you trying to do?
> what are you trying to do?

At the bottom of DMD.cpp there's some test code which exercises DMD's main functions in various ways.  One of the things it does is call all the different allocation functions... except for aligned_alloc because it's C11-only, and posix_memalign because it's apparently not available on B2G, and now memalign because it's not on Mac.
FWIW on my mac the profile with DMD enabled is entirely dominated by dyld::findMappedRange(), which takes up more than 50% of CPU time.

The stack is

NS_StackWalk
_Unwind_Backtrace (libunwind)
libunwind::UnwindCursor::step()
libunwind::UnwindCursor::setInfoBasedOnIPRegister()
_dyld_find_unwind_sections (libdyld)
client_dyld_find_unwind_sections
dyld::findMappedRange

Various other libunwind stuff takes up another 10% or so.

I guess we can hope that on Android the equivalent code is faster.  :)
I commented the test pieces out instead of deleting them because I wasn't sure
if you wanted to ifdef them or something.
Note that this has an advantage even when we're sampling with a large
value (e.g. 4096 bytes): With this patch, we can measure the slop of
allocations larger than the sample size.

I haven't tested this other than by compiling it.
This patch is ready for full review.  r?ing glandium for the build system
stuff (especially the places where I just cargo-culted a MOZ_DMD occurrence
from a nearby NS_TRACE_MALLOC occurrence), and jlebar for everything else.

Things that still need to be resolved:

- NS_StackWalk failures on ARM.

- fix-linux-stack.pl and fix_macosx_stack.py need their regexps tweaked;  I
  need to check that what I've done doesn't break existing tools.

Other notable things since the last review:

- Now doing jlebar-style sampling, controlled by the |sample-below| parameter.
  Block groups and frame groups that involve sampling have their numbers
  prefixed with '~' to indicate inexactness.

- DMD's own allocations, both direct and indirect, are all infallible (with
  one tiny exception, described in |replace_malloc|), controlled by the
  |Infallible| class.

- Now correctly handling realloc failure in the hunspell reporter.

- Many fewer friend classes.

- Tons of smaller clean-ups.
Attachment #689623 - Flags: review?(mh+mozilla)
Attachment #689623 - Flags: review?(justin.lebar+bug)
Attachment #688106 - Attachment is obsolete: true
Attachment #689009 - Attachment is obsolete: true
Attachment #689011 - Attachment is obsolete: true
Attachment #689313 - Attachment is obsolete: true
Attachment #689314 - Attachment is obsolete: true
Comment on attachment 689623 [details] [diff] [review]
Add a native version of DMD.

Review of attachment 689623 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the configure.in things fixed, as well as PAGE_SIZE. The rest is more optional. You may or may not want to consider doing them.

::: configure.in
@@ +7006,5 @@
> +    MOZ_DMD= )
> +
> +if test "$MOZ_DMD"; then
> +    # Please, Mr. Linker Man, don't take away our symbol names
> +    MOZ_COMPONENTS_VERSION_SCRIPT_LDFLAGS=

You're effectively doing this twice in configure.in.

@@ +7010,5 @@
> +    MOZ_COMPONENTS_VERSION_SCRIPT_LDFLAGS=
> +    USE_ELF_DYNSTR_GC=
> +    AC_DEFINE(MOZ_DMD)
> +
> +    if test "${OS_TARGET}" = "Android"; then

You probably want that on arm, not android.

@@ +7019,5 @@
> +    MOZ_MEMORY=1                        # DMD enables jemalloc
> +    MOZ_REPLACE_MALLOC=1                # DMD enables replace-malloc
> +    MOZ_DMDV=                           # DMD disables DMDV
> +    if test "$NS_TRACE_MALLOC"; then    # trace-malloc disables DMD
> +        MOZ_DMD=

trace malloc won't be happy with jemalloc and replace-malloc being enabled.

::: toolkit/library/Makefile.in
@@ +394,5 @@
>  EXTRA_DSO_LDOPTS += $(MOZ_PIXMAN_LIBS)
>  endif
>  
> +ifdef MOZ_DMD
> +EXTRA_DSO_LDOPTS += $(call EXPAND_LIBNAME_PATH,dmd,$(DIST)/lib)

Bug 818922 will alleviate the need for this.

::: toolkit/toolkit-tiers.mk
@@ +15,5 @@
>  tier_platform_dirs = tools/trace-malloc/lib
>  endif
>  
> +ifdef MOZ_DMD
> +tier_platform_dirs += tools/dmd

Wouldn't it be better under memory/replace?

::: tools/dmd/DMD.cpp
@@ +25,5 @@
> +#include "nscore.h"
> +#include "nsStackWalk.h"
> +
> +#include "js/HashTable.h"
> +#include "js/Vector.h"

We should move these in mfbt.

@@ +36,5 @@
> +// replace_{posix_memalign,aligned_alloc,valloc}.  It requires defining
> +// PAGE_SIZE.  Nb: sysconf() is expensive, but it's only used for (the obsolete
> +// and rarely used) valloc.
> +#define MOZ_REPLACE_ONLY_MEMALIGN 1
> +#define PAGE_SIZE sysconf(_SC_PAGESIZE)

This will fail on windows. (valloc could probably be left out on windows, that's something i'll need to change)
Attachment #689623 - Flags: review?(mh+mozilla) → review+
> This will fail on windows.

FWIW Windows is explicitly not supported, and DMD will #error if not XP_UNIX.
(In reply to Justin Lebar [:jlebar] from comment #108)
> > This will fail on windows.
> 
> FWIW Windows is explicitly not supported, and DMD will #error if not XP_UNIX.

I was confused by the #ifdef XP_WIN a few lines above.
> - NS_StackWalk failures on ARM.

I looked into this.

tl;dr: _Unwind_Backtrace is returning an error code because eventually it traces back to a pc which belongs to the dynamic linker.  It's nothing to worry about.

When we call _Unwind_Backtrace from NS_StackWalk, it always returns 9, which corresponds to _URC_FAILURE [1].

The code we're running for _Unwind_Backtrace on ARM lives in gcc/config/arm/unwind-arm.c in the GCC tree.

We're returning _URC_FAILURE because of a failure in unwind-arm.c::get_eit_entry.  get_eit_entry will return _URC_END_OF_STACK when it hits the end of the stack, and this is also translated into _URC_FAILURE by _Unwind_Backtrace, but that's not what's causing us to fail.

So why is get_eit_entry returning an error?  To figure this out, I observed arguments that caused it to fail, then waited for us to call it with those arguments and stepped through the asm in gdb.

This resulted in a fun tour through libc and the dynamic linker.  We eventually end up in bionic/linker/linker.c:dl_unwind_find_exidx:

>/* For a given PC, find the .so that it belongs to.
> * Returns the base address of the .ARM.exidx section
> * for that .so, and the number of 8-byte entries
> * in that section (via *pcount).
> *
> * Intended to be called by libc's __gnu_Unwind_Find_exidx().
> *
> * This function is exposed via dlfcn.c and libdl.so.
> */
>_Unwind_Ptr dl_unwind_find_exidx(_Unwind_Ptr pc, int *pcount)
>{
>    soinfo *si;
>    unsigned addr = (unsigned)pc;
>
>    if ((addr < LINKER_BASE) || (addr >= LINKER_TOP)) {
>        for (si = solist; si != 0; si = si->next){
>            if ((addr >= si->base) && (addr < (si->base + si->size))) {
>                *pcount = si->ARM_exidx_count;
>                return (_Unwind_Ptr)(si->base + (unsigned long)si->ARM_exidx);
>            }
>        }
>    }
>   *pcount = 0;
>    return NULL;
>}

We return NULL here because the initial if statement fails.  This cascades back and causes get_eit_entry to fail.

If you squint a bit, you'll see that if the if statement fails, that means
pc is inside the dynamic linker itself!

Of course I don't know if this is the source of all of our _URC_FAILURE's.  But
it looks like if we don't hit linker code, we'll get a _URC_FAILURE when we hit
the end of the stack (because per above get_eit_entry returns
_URC_END_OF_STACK, which _Unwind_Backtrace translates into _URC_FAILURE).

I don't see any way to avoid that, because the only way to stop backtracing is
to return something other than _URC_NO_REASON from unwind_callback, but doing
so triggers _URC_FAILURE.

Given that the success case returns failure and the error case returns failure,
I think we should just ignore the error code here.  I guess we should make that
change just for ARM, since it seems to work on x86.
> > FWIW Windows is explicitly not supported, and DMD will #error if not XP_UNIX.
> 
> I was confused by the #ifdef XP_WIN a few lines above.

Left-overs from my having copied nsTraceMalloc.cpp for a lot of this stuff.  I can remove that code.

BTW, judging from configure.in XP_WIN and XP_WIN32 are equivalent?  DMD.cpp currently uses a mix of them...
(In reply to Justin Lebar [:jlebar] from comment #110)
> I guess we should make that change just for ARM, since it seems to work on x86.

Do you mean it works on x86 Android? Does it also fail on ARM glibc? To me, this looks like a bionic issue more than an architecture issue.
> Do you mean it works on x86 Android? Does it also fail on ARM glibc? To me, this looks 
> like a bionic issue more than an architecture issue.

It works on x86 Linux (in njn's testing; I haven't tested myself).

I haven't looked very closely, but x86 (Android/Linux, bionic or not) seems to call into libunwind, and who knows what that does.  My guess would be that it doesn't call into dl_unwind_find_exidx, even.  So you're probably right; it's probably a bionic issue.

OTOH I guess I don't see a ton of value in strictly checking the return value coming out of _Unwind_Backtrace.  The argument I have in mind is along the lines of Pascal's wager: We have one choice under our control, whether to drop stacks upon observing an error, and one choice not under our control, whether when we receive an error from _Unwind_Backtrace it indicates that the stack is bad somehow.

So consider the possibilities:

 * If errors reflect corrupt stacks, and
    1. we drop stacks on error, we get no stacks.
    2. we don't drop stacks on error, we get bad stacks, which may have some marginal use (they may not be entirely bad), but anyway are likely to be obviously corrupt.

 * If errors don't reflect corrupt stacks, and
    3. we drop stacks on error, we get no stacks.
    4. we don't drop stacks on error, we get useful stacks.

It seems to me that the best choice here is not to drop stacks on error, because corrupt stacks aren't much worse than no stacks (1 vs 2), while no stacks are a lot worse than useful stacks (3 vs 4).

But I don't feel strongly about this; if you think we should ifdef BIONIC, I'm fine with that too.
> It works on x86 Linux (in njn's testing; I haven't tested myself).

I've tested on x86-64 Linux, but not x86 Linux.

(Other than that, I'm not fussed and will do whatever you two decide is the best thing.)
This is necessary to get B2G to compile.

The change in gfxAndroidPlatform is just to fix a typo.  More interesting is the DMD.cpp change, where I had to get rid of the const * const in mDoubleReport.

Without that change, I was getting

> tools/dmd/DMD.cpp:612: error: member 'mozilla::dmd::BlockKey::<anonymous
> union>::<anonymous struct> mozilla::dmd::BlockKey::<anonymous
> union>::mDoubleReport' with copy assignment operator not allowed in union

when compiling for B2G.

As best I can guess, the compiler helpfully noticed that you had a * const pointer and created a copy-constructor for you (since the default no-args constructor can't initialize this field).  I couldn't figure out a way to delete the constructor.  Doing

  private:
    DoubleReportT(DoubleReportT& aOther) MOZ_DELETE;

did not work; the compiler complained that the union had a struct with a constructor.  (In fact, I couldn't even add a type name to the struct without hitting this copy-constructor error.)
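A minimal illustration of the problem and the workaround, using made-up types rather than DMD's real ones: a struct with a `* const` member gets an implicitly deleted copy-assignment operator, and GCC rejects such a struct inside a union. Dropping the const and assigning the fields in the enclosing class's constructor sidesteps it.

```cpp
#include <cassert>

struct StackTrace { int dummy; };  // stand-in for the real type

class BlockKey {
public:
  union {
    struct {
      // Logically |*const|, but declared non-const so the struct stays
      // trivially copy-assignable and is therefore allowed in a union.
      const StackTrace* mReportStackTrace1;
      const StackTrace* mReportStackTrace2;
    } mDoubleReport;
  };

  BlockKey(const StackTrace* aT1, const StackTrace* aT2) {
    // Can't use the initializer list for members of a union's struct,
    // so assign them here instead.
    mDoubleReport.mReportStackTrace1 = aT1;
    mDoubleReport.mReportStackTrace2 = aT2;
  }
};
```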
Attachment #689886 - Flags: review?(n.nethercote)
Comment on attachment 689886 [details] [diff] [review]
B2G follow-up to "Add a native version of DMD"

Review of attachment 689886 [details] [diff] [review]:
-----------------------------------------------------------------

r=me.  Thanks for fixing these.  I can fold these changes into my patch before I land it.

::: tools/dmd/DMD.cpp
@@ +602,5 @@
>      } mLive;                                //   immediately on allocation
>  
>      struct
>      {
> +      const StackTrace* mReportStackTrace1;   // const, never null

Add a comment like "These are really |*const|, but some compilers can't handle that."

@@ +655,5 @@
>        mAllocStackTrace(aAllocStackTrace)
>    {
>      // Use const_cast here because we can't initialize these in the initializer
>      // list, and structs within unions can't have constructors.
> +    mDoubleReport.mReportStackTrace1 = aReportStackTrace1;

The comment is now unnecessary.
Attachment #689886 - Flags: review?(n.nethercote) → review+
I'm not done reading through this patch, but I had some thoughts about ways to
optimize its memory usage that I wanted to get down.  Of course without memory
reporters for DMD, we can't say whether any of this would be worthwhile.

* Don't store requested size + slop size; you can use malloc_good_size to
  compute slop size.
  
* Don't include the BlockKey struct in the Block struct.

  AIUI, we only need the mLive fields for blocks allocated by hunspell-style
  reporters, which I'd expect to be a tiny fraction of allocations.  We could
  store that data in a separate hashtable.  (We'd of course need to create a
  space to hold this data for other blocks while we did a DMD dump.)

  I don't know how you'd handle double-reports because I haven't read that
  part of the code yet, but I expect there's a way to make that work; these
  too should be rare.

* Store stack traces more efficiently.

  Since it's common to have many different stack traces which share a suffix,
  it seems to me that we'd probably win by storing stack traces as linked
  lists instead of as arrays.  A single stack trace would take twice as much
  space (because we'd have two pointers per frame), but if on average one
  stack trace shares more than half of its trace with one other stack (or more
  than one third of its stack with two other stacks, and so on), we'd come out
  ahead.

  One trick with this would be memory management.  We'd probably have to
  refcount the stack frame nodes, which further increases their overhead.  To
  make up for this, maybe we'd have to store multiple IPs in one stack frame
  node, or something.

  Actually, we also need a way to look up these stack frames, which
  necessitates some sort of hashtable.  So we could occasionally iterate over
  that and GC our old stack frames, obviating the necessity for refcounts.

  Or maybe you construct this tree as follows (warning: extremely premature
  optimization ideas follow):

  - Let the root be a dummy node.  All stacks start at the root.

  - Each non-root node contains an IP as its data.

  - The tree is an N-ary tree like [1], except the *kids pointers (the
    vertical pointers in the drawing) are bi-directional.

    [1] https://blog.mozilla.org/nnethercote/2012/03/07/n-ary-trees-in-c/

  - This lets us do two important operations:

    1) Given a leaf node, generate a stack trace by iterating towards the
       root using parent pointers.

    2) Given a stack trace, start at the root and iterate over the child and
       sibling pointers until we find the first element of the trace (we create it
       if we can't find it).  Then iterate until we find the second element of the
       trace, and so on.

  - We actually don't need three pointers per node.  Two suffices, because we
    can use an xor list for the parent+child pointers.

    http://en.wikipedia.org/wiki/XOR_linked_list

  - Doing step (2) above is slow in general (interning a stack trace may
    require iterating over a lot of nodes in the tree).  But in practice,
    stacks aren't random, and we can re-order the tree so that every time we
    touch a node we place it at the front of the list.

  - GC'ing this data structure is a simple matter of starting at all the leaves
    pointed to by the live heap blocks and walking down to the root, marking as
    you go.  Then do one final pass iterating over all the nodes (which we can
    do if we start at the root) and free the unmarked nodes.
> Of course without memory reporters for DMD, we can't say whether any of
> this would be worthwhile.

By the time you finish reading the patch, you'll have seen ShowMemoryConsumption() :)  It's pretty simple, but gives the size of the live block table and stack table at the end of output.  On Linux64 each one can be several 10s of MiBs.

For blocks, bjacob has code to store that data immediately before the block in question, which avoids the hash table overhead.

For stack traces, in prior draft patches I stored them in trees.  This code was copied from trace-malloc (and was the inspiration for the blog post you mentioned).  I removed it because it was more complicated than the current arrangement, but I could resurrect that if necessary.

But I want to land this soon -- are you happy if this stuff is done as follow-ups?
> For blocks, bjacob has code to store that data immediately before the block
> in question, which avoids the hash table overhead.

Actually, that could get us into clownshoes territory... still, there is definitely scope for shrinking the block table, as you said.
(In reply to Nicholas Nethercote [:njn] from comment #119)
> > For blocks, bjacob has code to store that data immediately before the block
> > in question, which avoids the hash table overhead.
> 
> Actually, that could get us into clownshoes territory... still, there is
> definitely scope for shrinking the block table, as you said.

Indeed that is what my patch on bug 704240 does; is there something clownshoesy about it?
> But I want to land this soon -- are you happy if this stuff is done as follow-ups?

Yes, definitely.  A previous revision of that comment explicitly said "for future work", but it looks like I unintentionally revised that bit away.

> Indeed that is what my patch on bug 704240 does; is there something clownshoesy about 
> it?

Suppose gecko allocates a 4096-byte block.  If we add some metadata to the front of it, that will cause it to become an 8192-byte block.

jemalloc3 does a bit better here, because it has more size classes between 4096 and 8192.  But I'd guess that even with jemalloc3, the approach with the hashtable will use less memory than the approach of sticking metadata into each heap block, particularly since the hashtable approach gives us more control over how much overhead we add per block.
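To make the 4096-to-8192 jump concrete, here is a toy model of the large-allocation rounding involved. The real mozjemalloc size classes are more involved than this; the sketch only models the "large requests round up to the next power of two" behavior that produces the clownshoes effect.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: large allocations round up to the next power-of-two
// size class starting at 4096 bytes.
size_t RoundUpLarge(size_t aSize) {
  size_t cls = 4096;
  while (cls < aSize) {
    cls *= 2;
  }
  return cls;
}
```

Under this model, a 4096-byte request fits exactly, but adding even 16 bytes of per-block metadata pushes it into the 8192-byte class, wasting nearly 4 KiB.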
(In reply to Justin Lebar [:jlebar] from comment #121)
> Suppose gecko allocates a 4096-byte block.  If we add some metadata to the
> front of it, that will cause it to become an 8192-byte block.

That's true, so the latest version of my patch on bug 704240 prints some metrics to understand how big this problem is.

Here is what I got on Linux x86-64 after loading nytimes.com and youtube.com:

# Reminder: All numbers are in hexadecimal.
#
# (...snip...)
#
# Overhead summary:
#   Total size of payloads:                          48e973a
#   Overhead of instrumentation alone:               2060220
#   Total overhead of instrumentation and allocator: 3147476
#   Overhead of allocator without instrumentation:   8179be
#   Total overhead incurred by instrumentation:      292fab8

So in human-readable units:
 - The application's required memory usage (what it asked for) is 72.9 MiB
 - The overhead from instrumentation and from the memory allocator allocating bigger blocks than requested is 49.3 MiB. This is distributed as 32.4 MiB of instrumentation data and 16.9 MiB due to the allocator allocating bigger blocks than requested.
 - According to malloc_good_size, without my instrumentation, the overhead from the memory allocator allocating bigger blocks than requested, would be 8.1 MiB

So you are right in that my approach leads to a > 2x increase in the amount of memory wasted by the memory allocator allocating bigger blocks than requested: we go from 8.1 MiB without my patch to 16.9 MiB with my patch, an increase of 8.8 MiB.

However, that is relatively small overhead compared to the inherent overhead of instrumentation data (32.4 MiB, almost 4x more!) which is almost entirely useful data (stats that I didn't paste here, but that my tool dumps, show that the amount of memory wasted to honor alignment requirements is negligible).
Also, you may be interested in a new feature in the bug 704240 patch: it records type information for objects allocated on the heap. This works by taking advantage of the fact that there are places in the code where we know both the type of an object and its address. Currently I've just instrumented various places where we call AddRef(), which is enough to record type info for most objects that are refcounted and allocated on the heap. The main limitation is that we have to scan memory starting from the given pointer to find a marker indicating the start of the instrumented heap block, and we have to stop at page boundaries to avoid crashing on non-heap pointers, so in some cases we can't record types. It is still successful in most cases.
As you predicted, with Jemalloc 3, the overhead (due to the allocator allocating bigger blocks than requested) is smaller, but that is true in the same proportion with and without my patch:

                            |  mozjemalloc  |  jemalloc 3 |
----------------------------+---------------+-------------+
without my instrumentation  |  8.1 MiB      |  5.5 MiB    |
----------------------------+---------------+-------------+
with my instrumentation     |  16.9 MiB     |  10.7 MiB   |
----------------------------+---------------+-------------+
This is definitely interesting, but perhaps we should move further discussion into bug 704240.  Anyway, I didn't mean to suggest that you were doing something wrong with your approach there; your goals are quite different than ours here (in particular, you're not targeting a platform where the only devices have ~120 MB of RAM available to Gecko!).  At the point that you're running on a laptop with 8 GB of RAM, doubling the heap size isn't at all unreasonable.
Actually, I would quite like to be able to run my tool on B2G. There is B2G-specific code that I won't be able to test otherwise.

Sorry for hijacking the bug. Just a question and sorry if I missed it in an above comment: how low is the memory overhead of NewDMD?
IIRC starting up a 64-bit browser, DMD took 20 MiB of per-block data and 20 MiB of stacks.  (On 32-bit it will be roughly half that.)  But it hasn't been space-optimized at all, so hopefully that will reduce greatly.

In short, at some point I'll try your per-block storage approach and make some measurements and decide from there.
This looks pretty good to me.  I only see two things which look like functional
bugs:

 - Infallible::_realloc appears fallible, and
 - that memory-reporter-path pointers-equal thing, which I still am not
   convinced is correct.

The rest is mostly cosmetic stuff.  I don't mind if you pass over some of it, so long as you don't mind looking at patches to fix them after you check this in.

>diff --git a/dom/base/nsJSEnvironment.cpp b/dom/base/nsJSEnvironment.cpp
>+#ifdef MOZ_DMD
>+
>+#include <errno.h>
>+
>+namespace mozilla {
>+namespace dmd {
>+
>+// See https://wiki.mozilla.org/Performance/MemShrink/DMD for instructions on
>+// how to use DMD.

As future work: Now that DMD doesn't require any external programs, perhaps we
should move the docs into the tree.

>diff --git a/storage/src/mozStorageService.cpp b/storage/src/mozStorageService.cpp
>+// sqlite does its own memory accounting, and we use its numbers in our memory
>+// reporters.  But we don't want sqlite's heap blocks to show up in DMD's
>+// output as unreported, so we mark them as reported when they're allocated and
>+// mark them as unreported when they are freed.
>+//
>+// In other words, we are marking all sqlite heap blocks as reported even
>+// though we're not reporting them ourselves.  Instead we're trusting that
>+// sqlite is fully and correctly accounting for all of its heap blocks via its
>+// own memory accounting.

Do we have any evidence to suggest that this trust is warranted?  If not,
perhaps we should verify our assumption (or even better, place in automated
checks).  As part of a separate bug, of course.

>diff --git a/tools/dmd/DMD.h b/tools/dmd/DMD.h
>+// A useful |Writer| that can be used with Dump().
>+MOZ_EXPORT void
>+FpWrite(void* aFp, const char* aFmt, ...);

Could you be a bit more direct about how to use this?

>diff --git a/tools/dmd/DMD.cpp b/tools/dmd/DMD.cpp
>new file mode 100644
>--- /dev/null
>+++ b/tools/dmd/DMD.cpp

>@@ -0,0 +1,1971 @@
>+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
>+/* vim: set ts=2 et sw=2 tw=80: */
>+/* This Source Code Form is subject to the terms of the Mozilla Public
>+ * License, v. 2.0. If a copy of the MPL was not distributed with this
>+ * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>+
>+// njn: jlebar: NS_StackWalk regularly returns an error on ARM...

We fixed bug 819539, so you can remove this.

>+static void
>+StatusMsg(const char* aFmt, ...)
>+{
>+  fprintf(stderr, "DMD[%d] ", getpid());
>+  va_list ap;
>+  va_start(ap, aFmt);
>+  vfprintf(stderr, aFmt, ap);
>+  va_end(ap);
>+}

It would be nice if you printed this to logcat on Android; it's hard to get
stderr off the device.

  #include "android/log.h"
  __android_log_print(ANDROID_LOG_INFO, "DMD", printf-args);

(Or maybe make the magic string "Gecko-DMD" or something.) There's also

  __android_log_vprint(int prio, const char *tag,
                       const char *fmt, va_list ap);

The logcat output will include the pid, so you don't need to worry about that.

Nit: For the fprintf case, it would be nice if you could write this with one
call to printf, since we often have multiple threads (and processes) writing
stuff to the console, and we don't want this line to be broken up.

>+// This provides infallible allocations (they abort on OOM).  We use it for all
>+// of DMD's own allocations, which fall into the following three cases.
>+// - Direct allocations (the easy case).
>+// - Indirect allocations in js::{Vector,HashSet,HashMap} -- this class serves
>+//   as their AllocPolicy.
>+// - Other indirect allocations (e.g. NS_StackWalk) -- see the comments on
>+//   DMDThread::mBlockIntercepts and in replace_malloc for how these work.
>+//
>+class Infallible

This isn't a great name for a class, since it's an adjective.  InfallibleAlloc
would be more to my liking.  It's more verbose, but later on I have a way for
you to refer to it less-often, so maybe that's a wash.

>+  // This realloc_ is required for this to be a JS container AllocPolicy.
>+  static void* realloc_(void* aPtr, size_t aOldSize, size_t aNewSize)
>+  {
>+    return realloc_(aPtr, aNewSize);
>+  }

Isn't this fallible?

>+  static void* memalign_(size_t aAlignment, size_t aSize)
>+  {
>+    void* p = gMallocTable->memalign(aAlignment, aSize);
>+    ExitOnFailure(p);
>+    return p;
>+  }

>+// Commifies the number and prepends a '~' if requested.

Add a comment or assertion indicating how big you expect these buffers to be?
(I know there's an assertion at the end, but it's only checking that we didn't
happen to overflow our buffer, not that the buffer is large enough in general.)

>+static char*
>+Show(size_t n, char* buf, size_t buflen, bool addTilde = false)
>+{
>+  int nc = 0, i = 0, lasti = buflen - 2;
>+  buf[lasti + 1] = '\0';
>+  if (n == 0) {
>+    buf[lasti - i] = '0';
>+    i++;
>+  } else {
>+    while (n > 0) {
>+      if (((i - nc) % 3) == 0 && i != 0) {
>+        buf[lasti - i] = ',';
>+        i++;
>+        nc++;
>+      }
>+      buf[lasti - i] = (char)(n % 10) + '0';

Style is to use C++-style casts (although I'm not going to object to C-style
casts for malloc, personally).  This is static_cast or reinterpret_cast; I
think it doesn't make a difference here.

>+// This lock must be held while manipulating global state:  gStackTraceTable,
>+// gLiveBlockTable, etc.

This particular appositive colon feels awkward to me (I'd replace it with
parentheses, myself) but I can't find technical fault with it, so I guess it's
OK!

>+//---------------------------------------------------------------------------
>+// Thread-local storage and blocking of intercepts
>+//---------------------------------------------------------------------------
>+
>+#ifdef XP_WIN32
>+
>+#error "Windows not supported yet, sorry."
>+
>+#else
>+
>+#include <pthread.h>
>+
>+#define DMD_TLS_INDEX_TYPE               pthread_key_t
>+#define DMD_CREATE_TLS_INDEX(i_)         pthread_key_create(&(i_), nullptr)
>+#define DMD_DESTROY_TLS_INDEX(i_)        pthread_key_delete((i_))
>+#define DMD_GET_TLS_DATA(i_)             pthread_getspecific((i_))
>+#define DMD_SET_TLS_DATA(i_, v_)         pthread_setspecific((i_), (v_))
>+
>+#endif
>+
>+static DMD_TLS_INDEX_TYPE gTlsIndex;
>+
>+class DMDThread
>+{
>+  friend class Infallible; // required for allocation via Infallible::new_

I'd prefer making the constructor public (and you'll need to do so if you make
infallible new the default in this file; see below), but either way...

>+  // When true, this blocks intercepts, which allows any malloc interception
>+  // functions to themselves call malloc.  (Nb: for direct calls to malloc we

Nit: s/any malloc/malloc/c

>+  // can just use Infallible::{malloc_,new_}, but we sometimes indirectly call
>+  // vanilla malloc via functions like NS_StackWalk.)
>+  bool mBlockIntercepts;

>+/* static */ DMDThread*
>+DMDThread::Fetch()
>+{
>+  DMDThread* t = static_cast<DMDThread*>(DMD_GET_TLS_DATA(gTlsIndex));
>+
>+  if (MOZ_UNLIKELY(!t)) {
>+    // This memory is never freed, even if the thread dies.  It's a leak, but
>+    // only a tiny one.
>+    t = Infallible::new_<DMDThread>();
>+    DMD_SET_TLS_DATA(gTlsIndex, t);
>+  }

Maybe you should just define

  void* operator new(size_t)
  void operator delete(void*)

and make new infallible.  It's a lot cleaner than calling Infallible::new_
everywhere, and it's more similar to other code in Gecko, which has
infallible-by-default operator new.

(I tested and a global operator new declared after the headers compiles and
links for me; I didn't test that it's actually called.  I think you need to be
sure to define it in the global namespace or it won't override properly.)

>+// This must be called before running any code that might allocate.
>+class AutoBlockIntercepts

Nit: This isn't a function to be called.  "Placed on the stack?"

>+static Writer gWrite = nullptr;

It's not at all clear from the type name that Writer is a function pointer.
Can we call it something else, like WriterFn?

>+static void* gWriteState = nullptr;
>+
>+#define W(...) gWrite(gWriteState, __VA_ARGS__);
>+
>+#define WriteTitle(...)                                                       \
>+  W("------------------------------------------------------------------\n");  \
>+  W(__VA_ARGS__);                                                             \
>+  W("------------------------------------------------------------------\n\n");

I'm not entirely comfortable using a global variable for what is really a
function argument.  (It's two arguments, but you could shove them into a struct
and pass that.)

I guess there's an argument to be made that if you don't specify the name of
the Writer struct when calling W(), then W() should use a global variable.

But I'm worried about someone calling W() and not realizing that it can only be
called when gWriter and gWriteState are defined.  Then it will work sometimes
and crash other times.

Since we're into shoot-yourself-in-the-foot^W^W^W^W^W^W^W^W^Wfancy C++ stuff
here, I think you could do something like

  class Writer {
  public:
    void operator()(const char* aFmt, ...) const {
      // va_list handling here, forwarding to mWriterFn(mWriterState, ...)
    }

  private:
    WriterFn mWriterFn;
    void* mWriterState;
  };

then you could do

  void Foo(Writer& W) {
    W("blah %s", "blah");
  }

which is as terse as your current implementation (modulo the extra parameter to
Foo, of course), but doesn't use global state.

I guess the WriterFn would have to take a va_args parameter, but that doesn't
seem like a big minus to me.

Also, it looks like you always put a newline at the end of your W statements.
So maybe it's worth doing that inside the W function instead.

>+class StackTrace
>+{
>+  static const uint32_t MaxDepth = 24;
>+
>+  uint32_t mLength;             // The number of PCs.
>+  void* mPcs[MaxDepth];         // The PCs themselves.
>+
>+public:
>+  StackTrace() : mLength(0) {}
>+
>+  uint32_t Length() const { return mLength; }
>+  void* Pc(uint32_t i) const { MOZ_ASSERT(i < mLength); return mPcs[i]; }

>+  uint32_t Size() const { return mLength * sizeof(mPcs[0]); }
>+
>+  static const StackTrace* Do(DMDThread* aT);

Add a comment indicating who owns the pointer which is returned here?

Also, maybe Get() would be a better name?  Do() kind of implies to me that we
get back a new StackTrace, which is not the case.

>+  static bool match(const StackTrace* const& aA,
>+                    const StackTrace* const& aB)
>+  {
>+    return aA->mLength == aB->mLength &&
>+           memcmp(aA->mPcs, aB->mPcs, aA->Size()) == 0;

You have my blessing to use |a| and |b| here if you like; |aA| is a bit absurd.

>+  }
>+
>+private:
>+  static void StackWalkCallback(void* aPc, void* aSp, void* aClosure)
>+  {
>+    StackTrace* st = (StackTrace*) aClosure;
>+
>+    // Only fill to MaxDepth.

Add a pointer to bug 818793?

>+    if (st->mLength < MaxDepth) {
>+      st->mPcs[st->mLength] = aPc;
>+      st->mLength++;
>+    }
>+  }
>+
>+  static int QsortCmp(const void* aA, const void* aB)
>+  {
>+    void* a = *(void**)aA;
>+    void* b = *(void**)aB;
>+    if (a < b) return -1;
>+    if (a > b) return  1;
>+    return 0;
>+  }

Same here with |aA| and |aB|, if you like.

>+typedef js::HashSet<StackTrace*, StackTrace, Infallible> StackTraceTable;
>+static StackTraceTable* gStackTraceTable = nullptr;
>+
>+void
>+StackTrace::Print() const
>+{
>+  if (mLength == 0) {
>+    W("   (empty)\n");
>+    return;
>+  }
>+
>+  if (gMode == Test) {
>+    // Don't print anything because there's too much variation.
>+    W("   (stack omitted due to test mode)\n");
>+    return;
>+  }
>+
>+  for (uint32_t i = 0; i < mLength; i++) {
>+    nsCodeAddressDetails details;
>+    void* pc = mPcs[i];
>+    PcInfo(pc, &details);
>+    if (details.function[0]) {
>+      W("   %14p %s[%s +0x%X]\n", pc, details.function, details.library,
>+        details.loffset);
>+    }
>+  }
>+}

Dumping DMD output takes a long time (~30s on my phone to dump a few
processes).  Without profiling, I'd guess that the culprit is
NS_DescribeCodeAddress, called from here.

We shouldn't change it in this patch, but I wanted to put it down that we may
want to add a cache for that, which would live (say) for the entirety of a call
to Dump().

>+/* static */ const StackTrace*
>+StackTrace::Do(DMDThread* aT)

[snip]

>+  StackTraceTable::AddPtr p = gStackTraceTable->lookupForAdd(&tmp);
>+  if (!p) {
>+    StackTrace* stnew = Infallible::new_<StackTrace>(tmp);
>+    (void)gStackTraceTable->add(p, stnew);

Nit: unused <<, here and elsewhere?

>+//---------------------------------------------------------------------------
>+// Heap blocks
>+//---------------------------------------------------------------------------
>+
>+static const char* gUnreportedName = "unreported";
>+
>+// This is used by both |Block|s and |BlockGroups|.
>+class BlockKey
>+{
>+protected:
>+  enum Kind {
>+    Live,               // for all live blocks, reported or not
>+    DoubleReport        // for blocks that have been double-reported
>+  };
>+
>+  const Kind mKind;
>+
>+public:
>+  const StackTrace* const mAllocStackTrace; // never null
>+
>+protected:
>+  // Blocks can be reported in two ways.
>+  // - The most common is via a memory reporter traversal -- the block is
>+  //   reported when the reporter runs, causing DMD to mark it as reported, and
>+  //   DMD must clear the marking once it has finished its analysis.
>+  // - Less common are ones that are reported immediately on allocation.  DMD
>+  //   must *not* clear the markings of these blocks once it has finished its
>+  //   analysis.

This comment is in a confusing place, because it seems like it helps clarify
the union below, but now I think that it doesn't.  So what does it clarify?

>+  union
>+  {
>+    struct
>+    {
>+      const StackTrace* mReportStackTrace;  // nullptr if unreported
>+      const char*       mReporterName;      // gUnreportedName if unreported
>+      bool              mReportedOnAlloc;   // true if block was reported
>+    } mLive;                                //   immediately on allocation
>+
>+    struct
>+    {
>+      const StackTrace* const mReportStackTrace1;   // never null
>+      const StackTrace* const mReportStackTrace2;   // never null
>+      const char*       const mReporterName1;       // never gUnreportedName
>+      const char*       const mReporterName2;       // never gUnreportedName
>+    } mDoubleReport;
>+  };

>+  BlockKey(const StackTrace* aAllocStackTrace,
>+           const StackTrace* aReportStackTrace1,
>+           const StackTrace* aReportStackTrace2,
>+           const char* aReporterName1, const char* aReporterName2)

What are these arguments?  It's not clear to me, even considering the comment
at the top of the class.

>+  // Quasi-hash policy (used by BlockGroup's hash policy).
>+  //
>+  // Hash() and Match() both assume that identical reporter names have
>+  // identical pointers.  In practice this always happens because they are
>+  // static strings.  This is true even for multi-reporters.  (If it ever
>+  // became untrue, the worst that would happen is that some blocks that should
>+  // be in the same block group would end up in separate block groups.)

I'm still not convinced this is true.

Report() doesn't intern the memory-reporter string, so we're relying on the
fact that multi-reporters will do the interning for us.

But for example if I pull up about:compartments, I see a bunch of compartments
with "[2]" next to them.  (Open the same bug in two tabs, for instance.)  The
leaf reporters under those compartments do not share the same string pointers,
as far as I can tell.  (See nsWindowMemoryReporter.cpp::CollectWindowReports.)

Am I missing something here?

>+  static bool Match(const BlockKey& aA, const BlockKey& aB)
>+  {
>+    if (aA.mKind == Live && aB.mKind == Live) {
>+      return aA.mAllocStackTrace   == aB.mAllocStackTrace &&
>+             aA.ReportStackTrace() == aB.ReportStackTrace() &&
>+             aA.ReporterName()     == aB.ReporterName();
>+    }
>+
>+    if (aA.mKind == DoubleReport && aB.mKind == DoubleReport) {
>+      return aA.mAllocStackTrace    == aB.mAllocStackTrace &&
>+             aA.ReportStackTrace1() == aB.ReportStackTrace1() &&
>+             aA.ReportStackTrace2() == aB.ReportStackTrace2() &&
>+             aA.ReporterName1()     == aB.ReporterName1() &&
>+             aA.ReporterName2()     == aB.ReporterName2();
>+    }
>+
>+    MOZ_NOT_REACHED();  // Nb: aA.mKind should always equal aB.mKind.

Can we just make this return false?  Seems silly not to.

>+class BlockSize
>+{
>+  static const size_t kSlopBits = sizeof(size_t) * 8 - 1;  // 31 or 63
>+
>+public:
>+  size_t mReq;              // size requested
>+  size_t mSlop:kSlopBits;   // additional bytes allocated due to rounding up

Wow, I'm amazed this works.  That's cool.
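
For the record, a minimal sketch (names hypothetical, not the patch's class)
of why this packing works: the two bitfields share a single size_t word, so
the struct stays two words wide:

```cpp
#include <cstddef>

// Hypothetical stand-in for BlockSize, just to illustrate the layout:
// mSlop (31 or 63 bits) and mSampled (1 bit) pack into one size_t word,
// alongside the full-width mReq.
struct BlockSizeSketch {
  size_t mReq;
  size_t mSlop : sizeof(size_t) * 8 - 1;
  size_t mSampled : 1;
};

static_assert(sizeof(BlockSizeSketch) == 2 * sizeof(size_t),
              "bitfields share one word");
```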

>+  size_t mSampled:1;        // were one or more blocks contributing to this
>+                            //   BlockSize sampled?
>+  BlockSize()
>+    : mReq(0),
>+      mSlop(0),
>+      mSampled(false)
>+  {}
>+
>+  BlockSize(size_t aReq, size_t aSlop, bool aSampled)
>+    : mReq(aReq),
>+      mSlop(aSlop),
>+      mSampled(aSampled)
>+  {}
>+
>+  size_t Usable() const { return mReq + mSlop; }

>+  void AddTo(const BlockSize& aBlockSize)

Nit: This is a confusing name: "a |addTo| b" sounds like b += a (because
prepositions bind to the right in English), but here it means a += b.

Unfortunately I don't have a great alternative.  "Incorporate"?  Maybe just
plain "Add" would be okay.

>+  {
>+    mReq  += aBlockSize.mReq;
>+    mSlop += aBlockSize.mSlop;
>+    mSampled = mSampled || aBlockSize.mSampled;
>+  }

>+// A live heap block.
>+class Block : public BlockKey

I was initially confused by the fact that BlockKey is not used as the hash key
for Blocks (the pointer into the heap is used as the key).

Perhaps BlockKey should have a different name.

>+{
>+public:
>+  const BlockSize mBlockSize;

This is a nit that you don't have to change, but inheritance feels better than a public member for this sort of thing; Block::Usable() is a perfectly reasonable thing to call.

Of course, you want Block to inherit from |const BlockSize|; the const is important.  But you could just split up BlockSize into two classes, only one of which is mutable.

>+void*
>+replace_realloc(void* aOldPtr, size_t aSize)
>+{
>+  using namespace mozilla::dmd;
>+
>+  if (!gIsDMDRunning) {
>+    return gMallocTable->realloc(aOldPtr, aSize);
>+  }
>+
>+  DMDThread* t = DMDThread::Fetch();
>+  if (t->interceptsAreBlocked()) {
>+    return Infallible::realloc_(aOldPtr, aSize);
>+  }
>+
>+  // If |aOldPtr| is NULL, the call is equivalent to |malloc(aSize)| for all
>+  // values of |aSize|.

Nit: I don't think "for all values of |aSize|" adds anything here.

>+namespace mozilla {
>+namespace dmd {
>+
>+//---------------------------------------------------------------------------
>+// Block groups
>+//---------------------------------------------------------------------------
>+
>+// A group of one or more heap blocks with a common BlockKey.
>+class BlockGroup : public BlockKey
>+{
>+  friend class FrameGroup;      // FrameGroups are created from BlockGroups

AFAICT the only thing that's keeping us from getting rid of this friend is the
lack of a public getter for mNumBlocks.  Can we just do that instead?

>+  // The (inherited) BlockKey is used as the key in BlockGroupTable, and the
>+  // other members constitute the value, so it's ok for them to be |mutable|.
>+private:
>+  mutable uint32_t  mNumBlocks;     // number of blocks with this BlockKey
>+  mutable BlockSize mCombinedSize;  // combined size of those blocks
>+
>+public:
>+  explicit BlockGroup(const BlockKey& aKey)
>+    : BlockKey(aKey),
>+      mNumBlocks(0),
>+      mCombinedSize()
>+  {}
>+
>+  const BlockSize& CombinedSize() const { return mCombinedSize; }
>+
>+  // The |const| qualifier is something of a lie, but is necessary so this type
>+  // can be used in js::HashSet, and it fits with the |mutable| fields above.
>+  void UpdateFromBlock(const Block& aB) const

I don't think "UpdateFromBlock" makes much sense.  Maybe just "Add", if we
change it to BlockSize::Add()?

>+  {
>+    mNumBlocks++;
>+    mCombinedSize.AddTo(aB.mBlockSize);
>+  }
>+
>+  void Print(int, int, const char*, const char*, size_t, size_t, size_t) const;

It's customary to put the variable names here.  It's also customary not to use
|int| and |long| unless you really mean it.

>+  static int QsortCmp(const void* aA, const void* aB)
>+  {
>+    BlockGroup* a = *(BlockGroup**)aA;
>+    BlockGroup* b = *(BlockGroup**)aB;

C++-style casts are preferred.

>+    return BlockSize::Cmp(a->mCombinedSize, b->mCombinedSize);
>+  }
>+
>+  static const char* const kName;   // for PrintSortedGroups
>+
>+  // Hash policy
>+
>+  typedef BlockKey Lookup;
>+
>+  static uint32_t hash(const BlockKey& aKey)
>+  {
>+    return BlockKey::Hash(aKey);
>+  }
>+
>+  static bool match(const BlockGroup& aBg, const BlockKey& aKey)
>+  {
>+    return BlockKey::Match(aBg, aKey);
>+  }
>+};
>+
>+const char* const BlockGroup::kName = "block";
>+
>+typedef js::HashSet<BlockGroup, BlockGroup, Infallible> BlockGroupTable;
>+BlockGroupTable* gDoubleReportBlockGroupTable = nullptr;
>+
>+void
>+BlockGroup::Print(int m, int n, const char* aStr, const char* astr,
>+                  size_t aCategoryUsableSize, size_t aCumulativeUsableSize,
>+                  size_t aTotalUsableSize) const

I can't say I'm wild about having both aStr and astr.  Would you also mind
adding a comment somewhere explaining what |m| and |n| are?

>+  if (mKind == BlockKey::Live) {
>+    if (IsReported()) {
>+      W("\n Reported by '%s' at\n", ReporterName());
>+      ReportStackTrace()->Print();
>+    }
>+
>+  } else if (mKind == BlockKey::DoubleReport) {
>+    W("\n Previously reported by '%s' at\n", ReporterName1());
>+    ReportStackTrace1()->Print();
>+
>+    W("\n Now reported by '%s' at\n", ReporterName2());
>+    ReportStackTrace2()->Print();

Aha, I finally know what stack trace 1 and 2 are for!  I'd appreciate a comment
earlier.

>+//---------------------------------------------------------------------------
>+// Stack frame groups
>+//---------------------------------------------------------------------------
>+
>+// A group of one or more stack frames (from heap block allocation stack
>+// traces) with a common PC.
>+class FrameGroup

The nomenclature and comments are confusing here.

We say a FrameGroup is "a group of one or more stack frames with a common PC."  But it seems to me that it's actually an aggregation of one or more /blocks/ whose stacks all share a common PC.

Of course, if we describe it that way, it's not clear why this isn't called a "block group", since it's a group of blocks.  So that suggests BlockGroup may not be the right name, either.

Maybe something like BlocksAggregatedByStack and BlocksAggregatedByFrame would be better?  That would make explicit the parallels between these two classes.

>+{
>+  // mPc is used as the key in FrameGroupTable, and the other members
>+  // constitute the value, so it's ok for them to be |mutable|.
>+  const void* const mPc;
>+  mutable size_t    mNumBlocks;
>+  mutable size_t    mNumBlockGroups;
>+  mutable BlockSize mCombinedSize;
>+
>+public:
>+  explicit FrameGroup(const void* aPc)
>+    : mPc(aPc),
>+      mNumBlocks(0),
>+      mNumBlockGroups(0),
>+      mCombinedSize()
>+  {}
>+
>+  const BlockSize& CombinedSize() const { return mCombinedSize; }
>+
>+  // The |const| qualifier is something of a lie, but is necessary so this type
>+  // can be used in js::HashSet, and it fits with the |mutable| fields above.
>+  void UpdateFromBlockGroup(const BlockGroup& aBg) const

Again, maybe just Add()?  It probably makes sense to match whatever we rename
AddTo() to.

>+  {
>+    mNumBlocks += aBg.mNumBlocks;
>+    mNumBlockGroups++;
>+    mCombinedSize.AddTo(aBg.mCombinedSize);
>+  }

>+void
>+FrameGroup::Print(int m, int n, const char* aStr, const char* astr,
>+                  size_t aCategoryUsableSize, size_t aCumulativeUsableSize,
>+                  size_t aTotalUsableSize) const
>+{

Would you mind squelching the unused aCumulativeUsableSize warning here?

>+//---------------------------------------------------------------------------
>+// DMD start-up
>+//---------------------------------------------------------------------------
>+
>+static void DMDTest(FILE* fp);
>+static void DMDStress();
>+
>+// Given an option name like "foo", succeed if the argument has the form
>+// "foo=blah" and return the pointer to "blah".
>+static const char*
>+OptionValueIfMatch(const char* aArg, const char* aOptionName)
>+{
>+  size_t optionLen = strlen(aOptionName);
>+  if (strncmp(aArg, aOptionName, optionLen) == 0 && aArg[optionLen] == '=') {

I don't think this is any safer than a plain strcmp.

>+    return aArg + optionLen + 1;    // +1 for '='

Note that this might return a pointer to '\0'.  Which is fine in and of itself,
but see below.

>+  }
>+  return nullptr;
>+}
>+
>+// Extracts a long value for an option from an argument.  It must be within the

s/long/|long|/c  Otherwise this is tricky to parse.

>+// range |aMin..aMax|.

Is this range inclusive or exclusive?

>+static bool
>+OptionLong(const char* aArg, const char* aOptionName, long aMin, long aMax,
>+           long* aN)
>+{
>+  if (const char* optionValue = OptionValueIfMatch(aArg, aOptionName)) {
>+    char* endPtr;
>+    *aN = strtol(optionValue, &endPtr, /* base */ 10);
>+    if (!*endPtr && aMin <= *aN && *aN <= aMax && *aN != LONG_MAX) {

If we're checking for under/overflow, we need to add a LONG_MIN check.

We also need to check here or in OptionValueIfMatch that *optionValue != '\0'.
From man strtol: 

   if *str is not `\0' but **endptr is `\0' on return, the entire string was
   valid.

(If we pass optionValue = "\0", strtol returns 0 and *endPtr == 0, which I
don't think is what you want.)
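
A tightened sketch of the check I have in mind (hedged -- the name and shape
differ from the patch's OptionLong()): reject an empty value, trailing junk,
and both strtol overflow sentinels before doing the range test:

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper, not the patch's OptionLong(): returns true only if
// |aValue| is a non-empty, fully-consumed, in-range decimal integer.
static bool ParseLongOption(const char* aValue, long aMin, long aMax,
                            long* aN)
{
  if (*aValue == '\0') {
    return false;               // e.g. "sample-below=" with no value
  }
  char* endPtr;
  errno = 0;
  long n = strtol(aValue, &endPtr, /* base */ 10);
  if (*endPtr != '\0') {
    return false;               // trailing junk, e.g. "123abc"
  }
  if (errno == ERANGE || n == LONG_MIN || n == LONG_MAX) {
    return false;               // under/overflow
  }
  if (n < aMin || n > aMax) {
    return false;               // out of the inclusive range
  }
  *aN = n;
  return true;
}
```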

>+      return true;
>+    }
>+  }
>+  return false;
>+}
>+
>+static const size_t gMaxSampleBelowSize = 100 * 1000 * 1000;

Add a comment indicating units?

>+// WARNING: this function runs *very* early -- before all static initializers
>+// have run.  For this reason, non-scalar globals such as gStateLock and
>+// gStackTraceTable are allocated dynamically (so we can guarantee their
>+// construction in this function) rather than statically.
>+static void
>+Init(const malloc_table_t* aMallocTable)
>+{
>+  MOZ_ASSERT(!gIsDMDRunning);
>+
>+  gMallocTable = aMallocTable;
>+
>+  // Set defaults of things that can be affected by the $DMD env var.
>+  gMode = Normal;
>+  gSampleBelowSize = 1;
>+
>+  // DMD is controlled by the |DMD| environment variable.
>+  // - If it's unset, DMD doesn't run.

Personally, I'd check if it's unset /or empty/, but that's up to you.

>+  // - Otherwise, the contents dictate DMD's behaviour.
>+
>+  char* e = getenv("DMD");
>+  if (!e) {
>+    return;
>+  }
>+
>+  if (strcmp(e, "1") != 0) {
>+    while (true) {
>+      // Find the first char after the arg, and temporarily change it to '\0'
>+      // to isolate the arg.
>+      char* p = e;
>+      while (*p != ',' && *p != '\0') {
>+        p++;
>+      }

I think accepting comma- or space-separated args would be nice; you know some
bozo is going to try space-separated args and be confused when it doesn't
work, so why not just make it work?
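
Something like this sketch (hypothetical, mirroring the in-place tokenization
in the quoted loop) would do it -- just treat ' ' as a separator too:

```cpp
// Hypothetical variant of the option-splitting loop: isolates each option
// in place, accepting either ',' or ' ' as the separator, and returns how
// many options were seen.  The real loop would parse each isolated option
// where the count is bumped, and would restore the separator afterwards.
static int SplitOptions(char* e)
{
  if (*e == '\0') {
    return 0;
  }
  int count = 0;
  while (true) {
    char* p = e;
    while (*p != ',' && *p != ' ' && *p != '\0') {
      p++;
    }
    bool isEnd = (*p == '\0');
    *p = '\0';
    count++;                    // handle the isolated option |e| here
    if (isEnd) {
      break;
    }
    e = p + 1;                  // +1 to get past the separator
  }
  return count;
}
```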

>+      bool isEnd = *p == '\0';
>+      *p = '\0';
>+
>+      // Handle arg
>+      long myLong;
>+      if (OptionLong(e, "sample-below", 1, gMaxSampleBelowSize, &myLong)) {
>+        gSampleBelowSize = myLong;
>+
>+      } else if (strcmp(e, "mode=normal") == 0) {
>+        gMode = Normal;
>+      } else if (strcmp(e, "mode=test")   == 0) {
>+        gMode = Test;
>+      } else if (strcmp(e, "mode=stress") == 0) {
>+        gMode = Stress;
>+
>+      } else {
>+        BadArg(e);
>+      }

The extra two lines of whitespace here are confusing to me; it looks like the
if statement and then the last else if aren't bracketed or something.

I don't like cuddle else much myself either, but I think we have to learn to
embrace it.

>+      // Stop if at the end, else revert the temporary isolation.
>+      if (isEnd) {
>+        break;
>+      }
>+      *p = ',';
>+      e = p+1;      // +1 to get past the comma
>+    }
>+  }

Can we print a status message indicating what we've learned from parsing?

>+  // Finished parsing $DMD.
>+
>+  gStateLock = Infallible::new_<Mutex>();
>+
>+  gSmallBlockActualSizeCounter = 0;
>+
>+  FILE* testFp;
>+
>+  if (gMode == Test) {
>+    // fopen() allocates.  So do this before setting gIsDMDRunning so those
>+    // allocations don't show up in our results.
>+    const char* filename = "test.dmd";
>+    testFp = fopen(filename, "w");
>+    if (!testFp) {
>+      StatusMsg("can't create test file %s: %s\n", filename, strerror(errno));
>+      exit(0);

Maybe don't exit with 0 so we know there was a test failure?

>+
>+//---------------------------------------------------------------------------
>+// DMD reporting and unreporting
>+//---------------------------------------------------------------------------
>+
>+static void
>+UpdateBlockGroupTableFromBlock(BlockGroupTable& aTable, const BlockKey& aKey,
>+                               const Block& aBlock)

AddBlockToBlockGroupFromTable?  Even AddBlockToBlockGroup() would work for me.

>+static void
>+PrintSortedBlockAndFrameGroups(const char* aStr, const char* astr,
>+                               const BlockGroupTable& aBlockGroupTable,
>+                               size_t aCategoryUsableSize,
>+                               size_t aTotalUsableSize)

If we rename BlockGroup and/or FrameGroup, this needs to change too.

>+// This is only needed because of the |const void*| vs |void*| arg mismatch.
>+static size_t
>+MallocSizeOf(const void* aPtr)
>+{
>+  return gMallocTable->malloc_usable_size((void*)aPtr);

Style is to use const_cast here.

>+}
>+
>+static void
>+ShowMemoryConsumption()
>+{
>+  // Memory consumption is non-deterministic, so don't show it in test mode.
>+  if (gMode == Test) {
>+    return;
>+  }
>+
>+  size_t sizeOfStackTraceTable =
>+    gStackTraceTable->sizeOfIncludingThis(MallocSizeOf);
>+  for (StackTraceTable::Range r = gStackTraceTable->all();
>+       !r.empty();
>+       r.popFront()) {
>+    StackTrace* const& st = r.front();
>+    sizeOfStackTraceTable += MallocSizeOf(st);
>+  }
>+  W("Stack trace table: %s of %s entries used, taking up %s bytes\n",
>+    Show(gStackTraceTable->count(),    gBuf1, kBufLen),
>+    Show(gStackTraceTable->capacity(), gBuf2, kBufLen),
>+    Show(sizeOfStackTraceTable, gBuf3, kBufLen));
>+
>+  W("Live block table:  %s of %s entries used, taking up %s bytes\n",
>+    Show(gLiveBlockTable->count(),    gBuf1, kBufLen),
>+    Show(gLiveBlockTable->capacity(), gBuf2, kBufLen),
>+    Show(gLiveBlockTable->sizeOfIncludingThis(MallocSizeOf), gBuf3, kBufLen));
>+
>+  W("\n");
>+}

As follow-up work, we can turn this into real memory reporters.  (The memory
reporter would have to live in libxul, but it could call into libdmd to get
these numbers.)  This would be helpful because right now all of DMD's memory is
going to show up as heap-unclassified, and you have to subtract it off
yourself.

>+static void
>+ClearState()
>+{
>+  // Unreport all blocks, except those that were reported on allocation,
>+  // because they need to keep their reported marking.
>+  for (BlockTable::Range r = gLiveBlockTable->all(); !r.empty(); r.popFront()) {
>+    r.front().value.UnreportIfNotReportedOnAlloc();
>+  }
>+
>+  // Clear errors.
>+  gDoubleReportBlockGroupTable->finish();
>+  gDoubleReportBlockGroupTable->init();
>+}
>+
>+MOZ_EXPORT void
>+Dump(Writer aWrite, void* aWriteState)
>+{
>+  gWrite = aWrite ? aWrite : StderrWrite;
>+  gWriteState = aWriteState;
>+
>+  if (!gIsDMDRunning) {
>+    const char* msg = "cannot check;  $DMD was undefined at startup\n";
>+    StatusMsg("%s", msg);
>+    W("%s", msg);
>+    return;
>+  }
>+
>+  AutoBlockIntercepts block(DMDThread::Fetch());
>+  AutoLockState lock;
>+
>+  static int dumpCount = 1;
>+  StatusMsg("Dump %d {\n", dumpCount++);
>+
>+  StatusMsg("  gathering live block groups...\n");
>+
>+  BlockGroupTable unreportedBlockGroupTable;
>+  (void)unreportedBlockGroupTable.init(2048);
>+  size_t unreportedUsableSize = 0;
>+
>+  BlockGroupTable reportedBlockGroupTable;
>+  (void)reportedBlockGroupTable.init(2048);
>+  size_t reportedUsableSize = 0;
>+
>+  bool anyBlocksSampled = false;
>+
>+  for (BlockTable::Range r = gLiveBlockTable->all(); !r.empty(); r.popFront()) {

Just to check: This isn't O(n^2), right?  popFront() is a constant-time
operation on a HashTable::Range?

>+//---------------------------------------------------------------------------
>+// Testing
>+//---------------------------------------------------------------------------
>+
>+// This function checks that heap blocks that have the same stack trace but
>+// different (or no) reporters get aggregated separately.
>+void foo()
>+{
>+   int i;
>+   char* a[6];
>+   for (i = 0; i < 6; i++) {
>+      a[i] = (char*) malloc(128 - 16*i);
>+   }
>+
>+   for (i = 0; i <= 1; i++)
>+      Report(a[i], "a01");              // reported

Nit: I'd prefer explicit declaration of |int i| in the loop; otherwise, this
looks like C.

>+   Report(a[2], "a23");                 // reported
>+   Report(a[3], "a23");                 // reported
>+   // a[4], a[5] unreported
>+}

>+MOZ_EXPORT void
>+FpWrite(void* aWriteState, const char* aFmt, ...)
>+{
>+  FILE* fp = (FILE*)aWriteState;
>+  va_list ap;
>+  va_start(ap, aFmt);
>+  vfprintf(fp, aFmt, ap);
>+  va_end(ap);
>+}

If this is MOZ_EXPORT'ed and in the header file, should it be under "Testing"?
But I think YAGNI (particularly because other libraries can't rely on libdmd
being present).

>+  // These together constitute exactly one sample.  sampleCounter = 0.
>+  for (int i = 0; i < 16; i++) {
>+    s = (char*) malloc(8);
>+    UseItOrLoseIt(s);
>+  }
>+
>+  // These fall 8 bytes short of a full sample.  sampleCounter = 120.
>+  for (int i = 0; i < 15; i++) {
>+    s = (char*) malloc(8);
>+    UseItOrLoseIt(s);
>+  }

Do you want to assert the sample-counter directly here and below?

>+//---------------------------------------------------------------------------
>+// Stress testing microbenchmark
>+//---------------------------------------------------------------------------
>+
>+static void stress5()

[snip]

>+static void
>+DMDStress()
>+{
>+  stress1(); stress1(); stress1(); stress1(); stress1();
>+  stress1(); stress1(); stress1(); stress1(); stress1();
>+}

Should these stress-test functions be MOZ_NEVER_INLINE?

>diff --git a/tools/rb/fix-linux-stack.pl b/tools/rb/fix-linux-stack.pl
>--- a/tools/rb/fix-linux-stack.pl
>+++ b/tools/rb/fix-linux-stack.pl
>@@ -218,17 +218,19 @@ sub get_file_info($) {
> }
> 
> # Ignore SIGPIPE as a workaround for addr2line crashes in some situations.
> $SIG{PIPE} = 'IGNORE';
> 
> select STDOUT; $| = 1; # make STDOUT unbuffered
> while (<>) {
>     my $line = $_;
>-    if ($line =~ /^([ \|0-9-]*)(.*) ?\[([^ ]*) \+(0x[0-9A-F]{1,8})\](.*)$/) {
>+#   if ($line =~ /^([ \|0-9-]*)(.*) ?\[([^ ]*) \+(0x[0-9A-F]{1,8})\](.*)$/) {
>+# njn
>+    if ($line =~ /^(.* )([^ ]+)\[([^ ]*) \+(0x[0-9A-F]{1,8})\](.*)$/) {

Let's get this reviewed by Someone Who Knows Something; this script is used in
automation, I believe, so it's important.

>diff --git a/tools/rb/fix_macosx_stack.py b/tools/rb/fix_macosx_stack.py
>--- a/tools/rb/fix_macosx_stack.py
>+++ b/tools/rb/fix_macosx_stack.py

>@@ -122,19 +123,16 @@ def fixSymbols(line):
>             #   symbol (in foo.dylib) (file:line)
>             symresult = atos_sym_re.match(info)
>             if symresult is not None:
>                 # Print the first two forms as-is, and transform the third
>                 (symbol, library, fileline) = symresult.groups()
>                 symbol = cxxfilt(symbol)
>                 info = "%s (%s, in %s)" % (symbol, fileline, library)
> 
>-             # throw away the bad symbol, but keep balance tree structure
>-            before = balance_tree_re.match(before).groups()[0]

What is this change for?

We should probably have jesse and/or ted look at these changes.

>diff --git a/xpcom/base/nsMemoryInfoDumper.cpp b/xpcom/base/nsMemoryInfoDumper.cpp
>--- a/xpcom/base/nsMemoryInfoDumper.cpp
>+++ b/xpcom/base/nsMemoryInfoDumper.cpp
>+static void
>+MakeFilename(const char *aPrefix, const nsAString &aIdentifier,
>+             const char *aSuffix, nsACString &aResult)
>+{

Nit: It's customary to Truncate() out-strings at the beginning of functions
like this.

>+  aResult.Append(aPrefix);
>+  MOZ_ASSERT(!aIdentifier.IsEmpty());
>+  aResult.AppendLiteral("-");
>+  aResult.Append(NS_ConvertUTF16toUTF8(aIdentifier));
>+  aResult.AppendLiteral("-");
>+  aResult.AppendInt(getpid());
>+  aResult.Append(aSuffix);

I don't know if you prefer

    aResult = nsPrintfCString("%s-%s-%d.%s",
                              aPrefix,
                              NS_ConvertUTF16toUTF8(aIdentifier).get(),
                              getpid(), aSuffix);

but that's an option, too.

> /* static */ nsresult
> nsMemoryInfoDumper::DumpMemoryReportsToFileImpl(
>   const nsAString& aIdentifier)
> {

Maybe add a MOZ_ASSERT(!aIdentifier.IsEmpty()), since that assertion is
currently tucked away in MakeFilename?

>+  nsresult rv;

Nit: Style is to declare rv as close to its first use as possible.  That
usually means doing |nsresult rv = <fn call>|.  (This is in the style guide I
derided on IRC, although it's there as a general rule, not specific to rv.)

>+#ifdef MOZ_DMD
>+  // Open a new file named something like
>+  //
>+  //   dmd-<identifier>-<pid>.txt.gz
>+  //
>+  // in NS_OS_TEMP_DIR for writing, and dump DMD output to it.  This must occur
>+  // after the memory reporters have been run (above), but before the
>+  // memory-reports file has been renamed (so the scripts can detect the DMD

s/the scripts/scripts/c

>+  // Dump DMD output to the file.
>+
>+  static const size_t dumpBufSize = 4096;
>+  char dumpBuf[dumpBufSize];
>+  DMDDumpState state(dumpBuf, dumpBufSize, dmdWriter);

Nit: Put the buffer inside DMDDumpState?

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp

>-#ifdef MOZ_DMDV
>+namespace dmd {

Is it confusing to have methods used for both DMD and DMDV in namespace mozilla::dmd?

It almost seems like we should have

  mozilla::dmd  / MOZ_DMD   (shared stuff)
  mozilla::dmdv / MOZ_DMDV  (valgrind-specific stuff)
  mozilla::dmdn / MOZ_DMDN  (native DMD)

(nit on the names as you wish).

What do you think?  Maybe it doesn't matter if we're to get rid of DMDV altogether soon.  I don't know if there remains a use-case for it.
> I don't mind if you pass over some of it, so long as you don't mind looking at patches to 
> fix them after you check this in.

(To be clear, I was volunteering to write such patches.)
Attachment #689623 - Flags: review?(justin.lebar+bug) → review+
Attached patch B2G follow-up v2Splinter Review
Attachment #689886 - Attachment is obsolete: true
Attachment #690193 - Flags: review+
> Chris Jones [:cjones] [:warhammer] 2012-11-14 04:18:46 PST
> Keywords: qawanted

I'm removing qawanted, because I don't think we need QA to ensure that the patch here functions correctly.

I would /love/ assistance from QA in investigating dark-matter (i.e., high heap-unclassified) bugs, but that shouldn't happen in this bug.
Keywords: qawanted
> > +    if test "${OS_TARGET}" = "Android"; then
> 
> You probably want that on arm, not android.

So, like this?

      if test "${CPU_ARCH}" = "arm"; then


> > +ifdef MOZ_DMD
> > +tier_platform_dirs += tools/dmd
> 
> Wouldn't it be better under memory/replace?

Hmm, not sure.  Choosing its location in the tree based on a particular mechanism that it uses seems strange.  I'm inclined to leave it in tools/.
Blocks: 819817
Blocks: 819819
Depends on: 819833
Thanks for the detailed review!  I've addressed all the things you raised, except where described below.  I have a few questions in my responses.


> As future work: Now that DMD doesn't require any external programs, perhaps
> we
> should move the docs into the tree.

In the README?


> Instead we're trusting that sqlite is fully and correctly accounting for all
> of its heap blocks via its own memory accounting.
> 
> Do we have any evidence to suggest that this trust is warranted?  If not,
> perhaps we should verify our assumption (or even better, place in automated
> checks).  As part of a separate bug, of course.

I filed bug 819901.


> >+class Infallible
> 
> This isn't a great name for a class, since it's an adjective. 

True.  But when it's used in context -- e.g. |Infallible::malloc_| -- it works better than, say, |InfallibleAlloc::malloc_|.  And because it only contains static methods it's always used in that way.  I'll leave it unchanged.


> >+  // This realloc_ is required for this to be a JS container AllocPolicy.
> >+  static void* realloc_(void* aPtr, size_t aOldSize, size_t aNewSize)
> >+  {
> >+    return realloc_(aPtr, aNewSize);
> >+  }
> 
> Isn't this fallible?

No.  It calls the other realloc_, (note the trailing '_') which is infallible.  I'll change it to |Infallible::realloc_| to make that clearer.


> >+class DMDThread
> >+{
> >+  friend class Infallible; // required for allocation via Infallible::new_
> 
> I'd prefer making the constructor public

But making it private communicates that Fetch() is the only way to get a DMDThread, which is useful knowledge.  I'll leave it.


> Maybe you should just define
> 
>   void* operator new(size_t)
>   void operator delete(void*)
> 
> and make new infallible.  It's a lot cleaner than calling Infallible::new_
> everywhere, and it's more similar to other code in Gecko, which has
> infallible-by-default operator new.

I dislike the fact that |new| is infallible in Gecko but malloc isn't;  I have trouble remembering which one is infallible.  I distinctly like the fact that malloc and |new| look the same in this file.  I'm not sure what you mean by "cleaner";  your suggestion is more concise but less clear IMHO.  So I'll leave this unchanged.


> But I'm worried about someone calling W() and not realizing that it can only
> be
> called when gWriter and gWriteState are defined.  Then it will work sometimes
> and crash other times.

It's a fair point, and I was thinking about this anyway -- I had such a crash recently because I was calling W() in Dump() in the !gIsDMDRunning case before setting gWriter.
I've done it, but I just used a Write() method instead of |operator()| and kept the W macro for the terseness.
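A hypothetical sketch of that arrangement (class and variable names are illustrative, not the patch's exact code): the W macro expands to a call on a variable named |gWriter|, so using W() in a scope where no gWriter exists is a compile error rather than a runtime crash, which is the point of the change.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

class Writer
{
  std::string mOut;
public:
  void Write(const char* aFmt, ...)
  {
    char buf[512];
    va_list ap;
    va_start(ap, aFmt);
    vsnprintf(buf, sizeof(buf), aFmt, ap);
    va_end(ap);
    mOut += buf;  // accumulate formatted output
  }
  const std::string& Output() const { return mOut; }
};

// Terse wrapper; only compiles where a |gWriter| is in scope.
#define W(...) gWriter.Write(__VA_ARGS__)
```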


> Also, it looks like you always put a newline at the end of your W statements.
> So maybe it's worth doing that inside the W function instead.

I don't want to paint myself into a corner and one day need to add a WriteNoNewline function, so I'll leave it unchanged :)


> >+  static const StackTrace* Do(DMDThread* aT);
> 
> Also, maybe Get() would be a better name?  Do() kind of implies to me that we
> get back a new StackTrace, which is not the case.

I avoided Get because the style guide said that should be used for fallible functions.
I changed it to "Obtain".  /me shrugs.


> You have my blessing to use |a| and |b| here if you like; |aA| is a bit absurd.

Eh, I don't mind it.  Unchanged.


> Dumping DMD output takes a long time (~30s on my phone to dump a few
> processes).  Without profiling, I'd guess that the culprit is
> NS_DescribeCodeAddress, called from here.

I've noticed that on my Linux box it's pretty fast -- e.g. a couple of seconds for the entire dump.  But on my Mac it's much slower.  So that's a plausible suggestion, and I filed bug 819817 for it.


> Nit: unused <<, here and elsewhere?

I avoided that because it's just one more dependency on Gecko (and it's not just unused.h, there's also unused.cpp) and |(void)| is equivalent.


> >+protected:
> >+  // Blocks can be reported in two ways.
> >+  // - The most common is via a memory reporter traversal -- the block is
> >+  //   reported when the reporter runs, causing DMD to mark it as reported, and
> >+  //   DMD must clear the marking once it has finished its analysis.
> >+  // - Less common are ones that are reported immediately on allocation.  DMD
> >+  //   must *not* clear the markings of these blocks once it has finished its
> >+  //   analysis.
> 
> This comment is in a confusing place, because it seems like it helps clarify
> the union below, but now I think that it doesn't.  So what does it clarify?

It clarifies the |mReportedOnAlloc| field.  I moved the comment closer to that field and mentioned it by name.  

BTW, the whole BlockKey/Block/BlockGroup structure is confusing and sub-optimal.  I'll do a refactoring after landing this which should make it clearer (hopefully eliminating the union) and more space-efficient.


> >+  BlockKey(const StackTrace* aAllocStackTrace,
> >+           const StackTrace* aReportStackTrace1,
> >+           const StackTrace* aReportStackTrace2,
> >+           const char* aReporterName1, const char* aReporterName2)
> 
> What are these arguments?  It's not clear to me, even considering the comment
> at the top of the class.

This constructor is used for DoubleReport Blocks.  I've added comments to both constructors.  (Again, the refactoring will help here.)


> >+  // Hash() and Match() both assume that identical reporter names have
> >+  // identical pointers.  In practice this always happens because they are
> >+  // static strings.  This is true even for multi-reporters.
> 
> I'm still not convinced this is true.
> 
> Am I missing something here?

Yes:  understanding which string is the reporter name.  It's not the report path.  It's the name given in the NS_MEMORY_REPORTER_MALLOC_SIZEOF_FUN() macro.  I'll add a comment.


> >+  static bool Match(const BlockKey& aA, const BlockKey& aB)
> >+  {
> >+    if (aA.mKind == Live && aB.mKind == Live) {
> >+      return aA.mAllocStackTrace   == aB.mAllocStackTrace &&
> >+             aA.ReportStackTrace() == aB.ReportStackTrace() &&
> >+             aA.ReporterName()     == aB.ReporterName();
> >+    }
> >+
> >+    if (aA.mKind == DoubleReport && aB.mKind == DoubleReport) {
> >+      return aA.mAllocStackTrace    == aB.mAllocStackTrace &&
> >+             aA.ReportStackTrace1() == aB.ReportStackTrace1() &&
> >+             aA.ReportStackTrace2() == aB.ReportStackTrace2() &&
> >+             aA.ReporterName1()     == aB.ReporterName1() &&
> >+             aA.ReporterName2()     == aB.ReporterName2();
> >+    }
> >+
> >+    MOZ_NOT_REACHED();  // Nb: aA.mKind should always equal aB.mKind.
> 
> Can we just make this return false?  Seems silly not to.

If |aA.mKind != aB.mKind|, something's gone badly wrong, and we shouldn't continue even in an opt build.  (BTW, it's a bit unclear to me if MOZ_NOT_REACHED() is as strong as MOZ_CRASH() -- is it guaranteed to terminate execution?)


> Perhaps BlockKey should have a different name.

Yeah, I'll address this when I do the abovementioned refactoring.


> >+{
> >+public:
> >+  const BlockSize mBlockSize;
> 
> This is a nit that you don't have to change, but inheritance feels better
> than a public member for this sort of thing; Block::Usable() is a perfectly
> reasonable thing to call.

I concluded otherwise, though not by a large margin :)  But the refactoring will again address this.


> AFAICT the only thing that's keeping us from getting rid of this friend is
> the
> lack of a public getter for mNumBlocks.  Can we just do that instead?

There's an argument to be made (I suspect Stroustrup would make it) that a friend is better because it exposes mNumBlocks to a single other class, rather than everyone.  But, again, the refactoring should fix this.


> >+void
> >+BlockGroup::Print(int m, int n, const char* aStr, const char* astr,
> >+                  size_t aCategoryUsableSize, size_t aCumulativeUsableSize,
> >+                  size_t aTotalUsableSize) const
> 
> I can't say I'm wild about having both aStr and astr. 

Consider yourself lucky!  In an earlier version I also had ASTR :P


> >+class FrameGroup
> 
> The nomenclature and comments are confusing here.
> 
> Maybe something like BlocksAggregatedByStack and BlocksAggregatedByFrame
> would be better?  That would make explicit the parallels between these two
> classes.

Those names are unwieldy, but you are right.  But I'll defer addressing this to my abovementioned refactoring.


> >+void
> >+FrameGroup::Print(int m, int n, const char* aStr, const char* astr,
> >+                  size_t aCategoryUsableSize, size_t aCumulativeUsableSize,
> >+                  size_t aTotalUsableSize) const
> >+{
> 
> Would you mind squelching the unused aCumulativeUsableSize warning here?

I don't get that warning.  Furthermore, -Wunused-parameter should only be enabled if -Wextra is enabled, and it shouldn't be.  Which compiler do you see it with?


> >+// range |aMin..aMax|.
> 
> Is this range inclusive or exclusive?

Inclusive.  Is |a..b| ever exclusive?  Nonetheless, I mentioned that in the comment.


> >+      return true;
> >+    }
> >+  }
> >+  return false;
> >+}
> >+
> >+static const size_t gMaxSampleBelowSize = 100 * 1000 * 1000;
> 
> Add a comment indicating units?

|Size| implies bytes all throughout this file, and other parts of Mozilla code (certainly the JS engine).  Is it not obvious?  I added a comment anyway.


> I think being comma- or space-separarated would be nice; you know some bozo is
> going to try space-separated args and be confused when it doesn't work, so why
> not just make it work?

Then that bozo will get an informative error message :P  Besides, spaces in env vars seems like a bad idea.

Hmm, what do you think about changing commas to colons, as in $PATH?


> Can we print a status message indicating what we've learned from parsing?

Do you want the status of all options controllable by the command line, or just the ones that we've explicitly set?


> Just to check: This isn't O(n^2), right?  popFront() is a constant-time
> operation on a HashTable::Range?

popFront() is almost constant-time -- it has to skip over empty entries in the table.  Iteration over the whole table like this is O(n).
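A hedged toy model (not the real js::HashTable, just an illustration of the complexity argument): popFront() skips empty slots, but each slot in the underlying table is stepped over at most once across the whole traversal, so the skipping is amortized constant time per element and the full iteration is O(n) in the table's capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Slot { bool live; int value; };

class Range
{
  const std::vector<Slot>& mTable;
  size_t mIdx;
  // Advance past empty slots; each slot is visited once per traversal.
  void SkipEmpty() { while (mIdx < mTable.size() && !mTable[mIdx].live) mIdx++; }
public:
  explicit Range(const std::vector<Slot>& aTable) : mTable(aTable), mIdx(0)
  { SkipEmpty(); }
  bool empty() const { return mIdx >= mTable.size(); }
  int front() const { return mTable[mIdx].value; }
  void popFront() { mIdx++; SkipEmpty(); }
};
```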


> >+MOZ_EXPORT void
> >+FpWrite(void* aWriteState, const char* aFmt, ...)
> 
> If this is MOZ_EXPORT'ed and in the header file, should it be under
> "Testing"?
> But I think YAGNI (particularly because other libraries can't rely on libdmd
> being present).

I moved it near the top, with the other writing functions.  It's used by dom/base/nsJSEnvironment.cpp, for its DMD-only code.


> >+#   if ($line =~ /^([ \|0-9-]*)(.*) ?\[([^ ]*) \+(0x[0-9A-F]{1,8})\](.*)$/) {
> >+# njn
> >+    if ($line =~ /^(.* )([^ ]+)\[([^ ]*) \+(0x[0-9A-F]{1,8})\](.*)$/) {
> 
> Let's get this reviewed by Someone Who Knows Something; this script is used
> in
> automation, I believe, so it's important.

I've spun these changes off into bug 819933.


> Is it confusing to have methods used for both DMD and DMDV in namespace
> mozilla::dmd?
>
> Maybe it doesn't matter if we're to get rid of DMDV altogether soon.

I filed bug 819819 about removing it.
Blocks: 819839
>> As future work: Now that DMD doesn't require any external programs, perhaps
>> we should move the docs into the tree.
>
> In the README?

Sure, so long as there are pointers from the source to the readme, that sounds good to me.

> when it's used in context -- e.g. |Infallible::malloc_| -- it works better than, say,
> |InfallibleAlloc::malloc_|.

FWIW the context I found it confusing in was

  HashTable<a, b, Infallible>,

where I thought

  HashTable<a, b, InfallibleAlloc>

would be clearer.  Not a big deal, though.

> But making it private communicates that Fetch() is the only way to get a DMDThread, which is useful 
> knowledge.  I'll leave it.

If you want to make the DMDThread constructor private, then perhaps we can make
it actually private and provide a factory method, instead of abusing friend?

> I dislike the fact that |new| is infallible in Gecko but malloc isn't;  I
> have trouble remembering which one is infallible.

Be that as it may, there's also an argument in favor of hewing to convention,
so you don't have to remember a separate convention for this particular file
(which is that I'm not supposed to call |new| at all).

In particular, a call to |new| is likely a bug, as things stand.  That's an
easy bug to write and miss in review.

> I distinctly like the fact that malloc and |new| look the same in this file.

I'm not sure what you mean.

> I've done it, but I just used a Write() method instead of |operator()| and
> kept the W macro for the terseness.

So to be clear, we just rely on all methods having an arg with the same name?
That's fine with me, since we'll get a compile error if things go wrong.

I did think the operator W() thing was clever, though.  Maybe that's a good
reason not to do it.  :)

> I avoided Get because the style guide said that should be used for fallible
> functions.  I changed it to "Obtain".  /me shrugs.

That convention is used in practice to distinguish GetFoo() from Foo(), so I
don't think it applies here.  I guess Obtain() is OK, if you like it, although
if we both agree Get() is better, I think we should use that.


Most of what's below is arguing for the sake of arguing.  Apologies in
advance; it's hard for me to drop things sometimes.

> If |aA.mKind != aB.mKind|, something's gone badly wrong, and we shouldn't
> continue even in an opt build.

But that's just an artifact of how we've currently written the code, where
these two types of objects are stored in different hashtables.  It doesn't make
much sense to me to assert this in the classes themselves.  See the fragility
argument below wrt friend.

> (BTW, it's a bit unclear to me if MOZ_NOT_REACHED() is as strong as
> MOZ_CRASH() -- is it guaranteed to terminate execution?)

According to the comments, MOZ_NOT_REACHED() triggers undefined behavior.  I've
had multiple reviewers ask me to use MOZ_NOT_REACHED() instead of MOZ_CRASH,
which worries me that either the comments are wrong, or we're using it wrong.
I think it's the latter.

> There's an argument to be made (I suspect Stroustrup would make it) that a
> friend is better because it exposes mNumBlocks to a single other class,
> rather than everyone.

If we're going to take design tips from language designers, I'm much more
interested in what Guido has to say than Bjarne.  :)

(And I can make a pretty good guess as to what Guido would say, since Python
doesn't have private or friend!)

In all seriousness, the usage of friend like you're suggesting leads to
patterns where the compiler enforces that our code uses a class exactly as it
was originally written to be used.  So when we want to change how we use the
class, we have to change the class (not only by adding friends or by making
methods public, but often by adding getter methods which should have been there
to begin with, since we don't want everyone and their cousin touching our
member variables directly).  This kind of fragility is counter to the whole
point of having a class in the first place.

If we had many different classes all vying to get their hands on this class and
we wanted to expose to one particular class a mutator method that's dangerous
to use in general, that's perhaps a justifiable use of friend.  But here, this
data isn't sensitive at all (particularly if it was protected by a getter); it
just happens not to be used by anyone else.

To frame the argument another way, if this data shouldn't be public, why have
public access at all when you can just list which classes can access which
members and functions?  Isn't that safer?

> Consider yourself lucky!  In an earlier version I also had ASTR :P

I was hoping you'd consider renaming one, or adding a comment.  :)

> I don't get that warning.

I searched through build logs for "warning:.*unused" (case-insensitive).  We
get unused variable warnings at least on MacOS debug and Linux 32 debug builds.
I didn't check other ones.

http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1355077072/mozilla-central-macosx64-bm34-build1-build268.txt.gz

http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1355077072/mozilla-central-linux64-debug-bm49-build1-build31.txt.gz

> Is |a..b| ever exclusive?

(You're going to love this...)  Actually, yes.  In git, "a..b" (this exact
syntax) is exclusive of a and inclusive of b.

> Then that bozo will get an informative error message :P

Not if I'm running this on B2G.  Then I don't see stderr, since B2G runs as a
daemon, and I have to go looking for logcat, if this warning even prints there.
And logcat is really noisy, so I probably have to filter by "DMD", which means
I have to know that DMD is going to output something there with that prefix...

It's not a big deal, but I just thought we could easily save someone 15 minutes
of fighting against this without compromising our integrity too much.

> Besides, spaces in env vars seems like a bad idea.

FWIW LD_PRELOAD, the env variable you have to set in order to even get these
error messages, is space-separated.

(To concede your point, spaces in /that/ env var is actually a /very/ bad idea,
because it means you can't LD_PRELOAD a lib whose path contains spaces!  But
that's not really an issue here.  :)

> Hmm, what do you think about changing commas to colons, as in $PATH?

I feel like "foo:bar=1" is more ambiguous than "foo,bar=1", but either way is
fine with me.

> Do you want the status of all options controllable by the command line, or
> just the ones that we've explicitly set?

Either way is fine with me.  I'm mostly concerned with confirmation of "DMD
interpreted my args as I expected."
> If you want to make the DMDThread constructor private, then perhaps we can make
> it actually private and provide a factory method, instead of abusing friend?

Hm, on second thought, that's exactly what you're doing, isn't it?  I can't call Infallible::new_(DMDThread), because the constructor is private.

I still prefer making |new| work here just like it does everywhere else in Gecko, but this DMDThread thing is itself OK.
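For the record, a hedged sketch of that factory arrangement (the real Fetch() consults thread-local storage; this standalone version just hands back a singleton): a private constructor plus a static Fetch() makes Fetch() the only way to obtain a DMDThread, with no friend declaration needed.

```cpp
#include <cassert>

class DMDThread
{
  DMDThread() {}  // private: callers must go through Fetch()
public:
  static DMDThread* Fetch()
  {
    static DMDThread sThread;  // stand-in for the per-thread TLS lookup
    return &sThread;
  }
};
```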
>   HashTable<a, b, InfallibleAlloc>
> 
> would be clearer.  Not a big deal, though.

Alright, I've changed it to InfallibleAllocPolicy.


> In particular, a call to |new| is likely a bug, as things stand.  That's an
> easy bug to write and miss in review.

Yes.  And calling vanilla malloc is also a bug.

 
> > I distinctly like the fact that malloc and |new| look the same in this file.
> 
> I'm not sure what you mean.

I write |InfallibleAllocPolicy::malloc_| and |InfallibleAllocPolicy::new_|, rather than |InfallibleAllocPolicy::malloc_| and |new|.

Neither option is ideal, but my personal preference is to treat |malloc| and |new| uniformly rather than to close this loophole for |new| but not |malloc|.


> > I've done it, but I just used a Write() method instead of |operator()| and
> > kept the W macro for the terseness.
> 
> So to be clear, we just rely on all methods having an arg with the same name?
> That's fine with me, since we'll get a compile error if things go wrong.

Yep.
(In reply to Nicholas Nethercote [:njn] from comment #132)
> > > +    if test "${OS_TARGET}" = "Android"; then
> > 
> > You probably want that on arm, not android.
> 
> So, like this?
> 
>       if test "${CPU_ARCH}" = "arm"; then

Yes.
(In reply to Nicholas Nethercote [:njn] from comment #132)
> > > +ifdef MOZ_DMD
> > > +tier_platform_dirs += tools/dmd
> > 
> > Wouldn't it be better under memory/replace?
> 
> Hmm, not sure.  Choosing its location in the tree based on a particular
> mechanism that it uses seems strange.  I'm inclined to leave it in tools/.

Arguably, it is a memory-related tool, and it would be better to group memory-related tools under memory/ ; memory/replace has the advantage of being built at the right time for anything that needs to #include libxul headers.
> my personal preference is to treat |malloc| and |new| uniformly rather than to close this 
> loophole for |new| but not |malloc|.

But to be clear, calling malloc() and not checking its result anywhere in Gecko is a bug, while calling |new| in Gecko and not checking its result is not.  So we're adopting a different convention here from the rest of Gecko.
>>+  // DMD is controlled by the |DMD| environment variable.
>>+  // - If it's unset, DMD doesn't run.
>
> Personally, I'd check if it's unset /or empty/, but that's up to you.

Actually, I think we definitely should make this change.  I think DMD="" or DMD="0" should disable DMD.

In B2G's shell script to run B2G under GDB, I need to do something like

  if [ b2g is enabled ]; then
    dmd=1
  fi

  adb shell DMD=$dmd [long command to run b2g]

But if $dmd is empty, then "DMD=" sets the DMD environment variable to the empty string, which causes DMD to become unhappy and quit.

Note that I can't simply export DMD in the if statement, because I need to pass the env variable to adb shell.
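A sketch of the proposed behavior (function name hypothetical; the real check lives in DMD's startup code): treat an unset, empty, or "0" DMD environment variable as "disabled", so a script can always write DMD=$dmd even when $dmd is empty.

```cpp
#include <cassert>
#include <cstring>

static bool DmdIsEnabled(const char* aDmdEnv)
{
  return aDmdEnv != nullptr &&        // unset -> disabled
         *aDmdEnv != '\0' &&          // ""    -> disabled
         strcmp(aDmdEnv, "0") != 0;   // "0"   -> disabled
}
```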
Lots of good stuff in here, particularly in the main process.  I'll file bugs.
> Actually, I think we definitely should make this change.  I think DMD="" or
> DMD="0" should disable DMD.

I'd already done this for "", I'll do it for "0" as well.
> calling malloc() and not checking its result anywhere in
> gecko is a bug, while calling new in gecko and not checking its result is
> not.  So we're adopting a different convention here from the rest of Gecko.

Yep.  I'm comfortable with that because this code is quite unlike Gecko code -- it uses very little code from Gecko, it uses JS containers, it has its own Mutex implementation, etc.
Attachment #690496 - Flags: review?(dhylands) → review+
Comment on attachment 690495 [details] [review]
B2G pull request

I'd like to see a comment added to the flash.sh script to provide some context as to why you'd even want to call delete_extra_gecko_files_on_device

If you were to look at flash.sh in isolation, it isn't obvious why this is being done.

r=me with that comment added.
Attachment #690495 - Flags: review?(dhylands) → review+
> >+  size_t optionLen = strlen(aOptionName);
> >+  if (strncmp(aArg, aOptionName, optionLen) == 0 && aArg[optionLen] == '=') {
>
> I don't think this is any safer than a plain strcmp.

Heh, I changed it to a plain strcmp yesterday and today I was wondering why option parsing didn't work.  It's deliberately looking for a partial match.  E.g. |aArg| is "sample-size=4099" and |aOptionName| is "sample-size".
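In standalone form, the partial-match logic looks roughly like this (function name hypothetical; the prefix comparison is the one quoted above).  A plain strcmp("sample-size=4099", "sample-size") would reject the match; strncmp compares only the option-name prefix, and the explicit '=' check stops a longer option name like "sample-sizes" from matching by accident:

```cpp
#include <cassert>
#include <cstring>

static bool MatchesOption(const char* aArg, const char* aOptionName,
                          const char** aValueOut)
{
  size_t optionLen = strlen(aOptionName);
  // Match the option-name prefix, then require '=' immediately after it.
  if (strncmp(aArg, aOptionName, optionLen) == 0 && aArg[optionLen] == '=') {
    *aValueOut = aArg + optionLen + 1;  // the text after '=', e.g. "4099"
    return true;
  }
  return false;
}
```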
> It's deliberately looking for a partial match.

Ah, duh.

> I'd like to see a comment added to the flash.sh script to provide some context as to why 
> you'd even want to call delete_extra_gecko_files_on_device

I'll add it to the PR.  Thanks for the reviews!
I just merged the pull requests.  We need to update the wiki (at least for B2G; I don't know if you've already updated it for desktop) and write about this on the b2g mailing list.
No longer depends on: 820242
Btw, I believe this patch changed the filenames for the about:memory .json.gz dump files. The files now come out as e.g. memory-report-Start-3126..json.gz (note that there are two . characters before the json.gz).
> Btw, I believe this patch changed the filenames for the about:memory
> .json.gz dump files.

So it did.  Here's a follow-up fix:

https://hg.mozilla.org/integration/mozilla-inbound/rev/e594a31444ee

Thanks!
https://hg.mozilla.org/mozilla-central/rev/b29ecfdf4255
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Comment on attachment 689623 [details] [diff] [review]
Add a native version of DMD.

>+    MOZ_MEMORY=1                        # DMD enables jemalloc
>+    MOZ_REPLACE_MALLOC=1                # DMD enables replace-malloc
>+    MOZ_DMDV=                           # DMD disables DMDV
>+    if test "$NS_TRACE_MALLOC"; then    # trace-malloc disables DMD
>+        MOZ_DMD=
>+    fi
>+fi
>+AC_SUBST(MOZ_DMD)
>+
>+dnl ========================================================
> dnl = Enable jemalloc
> dnl ========================================================
> MOZ_ARG_ENABLE_BOOL(jemalloc,
> [  --enable-jemalloc       Replace memory allocator with jemalloc],
>     MOZ_MEMORY=1,
>     MOZ_MEMORY=)
> 
> if test "$NS_TRACE_MALLOC"; then
>     MOZ_MEMORY=
> fi
Should you test NS_TRACE_MALLOC before turning on MOZ_REPLACE_MALLOC?
njn, are you ready to land this on branches?
> njn, are you ready to land this on branches?

Er, sure, if someone tells me I have approval and on which branches to do it.  And then there's the question of follow-up patches, of which there are quite a few (in other bugs)...
a=me.  This is all code that's off by default.
Comment on attachment 689623 [details] [diff] [review]
Add a native version of DMD.

[Triage Comment]
Just to make this official...

This is all NPOTB.
Attachment #689623 - Flags: approval-mozilla-b2g18+
Attachment #689623 - Flags: approval-mozilla-aurora+
Attachment #690193 - Flags: approval-mozilla-b2g18+
Attachment #690193 - Flags: approval-mozilla-aurora+
Nick, does b2g18 just need bug 800703 first?
> Nick, does b2g18 just need bug 800703 first?

It'll definitely need that.  That's all NPOTB changes, though.
OK, tomorrow I'll try pushing that first. Thanks!
> https://hg.mozilla.org/releases/mozilla-b2g18/rev/6706110a23be
> https://hg.mozilla.org/releases/mozilla-b2g18/rev/e97002d7a566

Thanks!

jlebar, I guess it makes sense to backport all subsequent DMD improvements to aurora/b2g18?  Below is a list of the DMD changes that have landed on m-i since the above two.  jlebar, can you a+ them?  RyanVM, can you backport them?


changeset:   115714:6d2586f85a81
summary:     Bug 820401 - Default DMD to sample-below=4093. r=njn

changeset:   115734:753c92f59577
summary:     Bug 819833 (part 1) - DMD: Print PCs at the end of lines.  r=jlebar.

changeset:   115751:3cbe7a6b5395
summary:     Bug 819772 - Add a memory reporter for DMD's data.  r=jlebar.

changeset:   115801:f17233a2e6ab
summary:     Bug 820540 - Build more things with -funwind-tables when DMD is enabled on ARM. r=glandium

changeset:   115846:92bc31268fd6
summary:     Bug 820682 - DMD: Tweak stress mode.  r=jlebar.

changeset:   115984:f5fbf4fc5df2
summary:     Bug 820652 (part 1) - DMD: Split BlockKey in two, and fix the ensuing fall-out.  r=jlebar.

changeset:   115985:a77821dd8903
summary:     Bug 820652 (part 2) - DMD: Embriefen the hash policy of LiveBlockGroup and DoubleReportBlockGroup.  r=jlebar.

changeset:   115986:f54a6f692591
summary:     Bug 820652 (part 3) - DMD: Distinguish BlockSize and GroupSize.  r=jlebar.

changeset:   115987:2c2785739e25
summary:     Bug 820652 (part 4) - DMD: Inline BlockSize into LiveBlock.  r=jlebar.

changeset:   115988:5ac16858d004
summary:     Bug 820652 (part 5) - DMD: Store the block address in LiveBlock.  r=jlebar.

changeset:   115989:9f305729255b
summary:     Bug 820652 (part 6) - DMD: Don't use LiveBlockKey in LiveBlock.  r=jlebar.

changeset:   115990:8c77de31ddc4
summary:     Bug 820652 (part 7) - DMD: Fix bug in strdup_.  r=jlebar.

changeset:   115999:9585e38ed39d
summary:     Bug 821358 - Port the trace-malloc Windows TLS macros to DMD; r=njn

changeset:   116119:6fbb674858da
summary:     Bug 820875 - Reimplement DMD's MutexBase using critical sections; r=bbondy

changeset:   116201:e5fddab2f19e
summary:     Bug 819817 - DMD: cache calls to NS_DescribeCodeAddress for faster dumping.  r=jlebar.

changeset:   116202:e082861f545f
summary:     Bug 821577 - DMD: Fix hang at start-up on Mac.  r=jlebar.
(In reply to Nicholas Nethercote [:njn] from comment #166)
> RyanVM, can you backport them?

If they get approval, my bug queries will find them.
Attachment mime type: text/plain text/plain → text/x-github-pull-request text/x-github-pull-request