Closed Bug 1123237 Opened 5 years ago Closed 4 years ago

Implementing a sampling based memory profiler

Categories

(Core :: Gecko Profiler, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox43 --- fixed

People

(Reporter: laszio.bugzilla, Assigned: kanru)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(15 files, 25 obsolete files)

96.79 KB, image/png
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
Details
40 bytes, text/x-review-board-request
BenWa
: review+
Details
40 bytes, text/x-review-board-request
Details
13.74 KB, patch
terrence
: review+
Details | Diff | Splinter Review
37.08 KB, patch
terrence
: review+
Details | Diff | Splinter Review
950 bytes, patch
froydnj
: review+
Details | Diff | Splinter Review
37.52 KB, patch
cyu
: review+
Details | Diff | Splinter Review
3.07 KB, patch
terrence
: review+
Details | Diff | Splinter Review
197.70 KB, patch
Details | Diff | Splinter Review
The following introductions are mostly copied from https://wiki.mozilla.org/MemoryProfiler

Traditional memory analysis tools look at the data. They build the referring-to, namely retaining graphs. This is helpful in exploring the relationships among objects and identifying memory leaks. However, the information exposed in this way is usually a snapshot and the history of how memory is used is missing. Moreover, they usually don't keep tracks of how objects are allocated. Developers have to infer how an object might be allocated and freed by himself/herself.

Another common approach is to plot the allocations or the memory usages along the timeline. This gives some intuition of how and when a bulk of objects are allocated. However, it still relies on developers' wisdom, and sometimes chances, to dig out the problematical code snippet.

This memory profiler is designed to solve the inconveniences and limitations. The goal is to identify the memory eager codes directly. For example, it can show which functions have highest peaks: 

    SelfHWM    TotalHWM  Name
    6225920    17563648  p/this.start (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8)
    6029312     6029312  s (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:11)
    4784128     4849664  nsDisplayBoxShadowOuter::Paint
    1966080     1966080  IPDL::PHttpChannel::RecvOnTransportAndData
    1769472     6684672  a (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:11)
    1572864    18153472  u (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8)
    1572864     1572864  nsJSContext::GarbageCollectNow
    1507328     1507328  PresShell::DoReflow
    1507328    13500416  p/<.run/x/< (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8)

The memory profiler samples allocation events. An allocation event includes a type (what and at where is going to be allocated), a size, a timestamp and the corresponding stack trace. Free events are also tracked. For managed languages, namely languages relying on garbage collection, a free event is generated when an object is reclaimed by the garbage collector.

These sampled events can be used to approximate the full history of allocations afterwards. That means we can get various memory profiles of a program in different perspectives by post-processing the history in different ways.
Attachment #8551149 - Flags: feedback?(n.nethercote)
Attachment #8551149 - Flags: feedback?(mh+mozilla)
CountAlloc is an implementation of ReplaceAlloc. The profiler use it to register hooks to monitor how many bytes are allocated.
Attached patch part 2. hooks in js engine (obsolete) — Splinter Review
This defines some interfaces to the hook and adds some helpers to be used in later patches.
Attachment #8551156 - Flags: feedback?(bgirard)
Hi Terrence, would you mind to review the idea of the sampler in the garbage collector? This patch inserts a few hooks into the memory allocator and garbage collector to profile the memory usage.
Attachment #8551158 - Flags: feedback?(terrence)
Attachment #8551159 - Flags: feedback?(terrence)
Hi BenWa, Would you mind to review the idea here? The memory profiler samples allocation, free and gc events like a CPU-time profiler. It mimics the native allocator and also the allocation and collection on the gcheap.
Attachment #8551165 - Flags: feedback?(bgirard)
Attached patch part 9. Interface to Add-Ons (obsolete) — Splinter Review
Hi Jim, as discussed, this is a temporary solution to enable the profiler before the API of debugger is completed. Would you mind to review the ideas here? Thank you.
Attachment #8551170 - Flags: feedback?(jimb)
Hi Nicholas, this implements some of the ideas you mentioned (mostly about the Allocation Tab) in
https://bugzilla.mozilla.org/show_bug.cgi?id=960671#c33

May I know your opinions to this profiler?
Whiteboard: MemShrink
Comment on attachment 8551158 [details] [diff] [review]
part 3. Monitoring allocation and gc events in nursery and tenured heaps

Hmm. Note that the JITs can also allocate from the nursery and the current free list: MacroAssembler::nurseryAllocate [1] and MacroAssembler::freeListAllocate [2].

[1] https://hg.mozilla.org/mozilla-central/file/ea8cce9f6630/js/src/jit/MacroAssembler.cpp#l650
[2] https://hg.mozilla.org/mozilla-central/file/ea8cce9f6630/js/src/jit/MacroAssembler.cpp#l681
Comment on attachment 8551156 [details] [diff] [review]
part 2. hooks in js engine

Review of attachment 8551156 [details] [diff] [review]:
-----------------------------------------------------------------

Chiming in since this seems to be a fairly JS::GC specific series.

::: js/public/MemoryProfiler.h
@@ +15,5 @@
> +  public:
> +    virtual ~NativeProfiler() {};
> +    virtual void sampleNative(void *addr, int32_t size) = 0;
> +    virtual void removeNative(void *addr) = 0;
> +    virtual void reset() = 0;

Making all of these calls virtual is going to add a ton of overhead. It looks like there is only one consumer so maybe we could just move most of the code inline here?

@@ +22,5 @@
> +class GCHeapProfiler {
> +  public:
> +    virtual ~GCHeapProfiler() {};
> +    virtual void sampleTenured(void *addr, int32_t size) = 0;
> +    virtual void sampleNursery(void *addr, int32_t size) = 0;

Please don't use void* for GC things. Instead include js/HeapAPI.h and use GCCellPtr here, and below where you take GC references.

::: js/src/vm/MemoryProfiler.cpp
@@ +9,5 @@
> +
> +mozilla::Atomic<int> MemProfiler::sGlobalSwitch;
> +NativeProfiler *MemProfiler::sNativeProfiler;
> +
> +GCHeapProfiler *MemProfiler::GetGCHeapProfiler(void *addr)

Type should be on the line above:

GCHeapProfiler *
MemProfiler::GetGCHeapProfiler(GCCellPtr cell)
Comment on attachment 8551158 [details] [diff] [review]
part 3. Monitoring allocation and gc events in nursery and tenured heaps

Review of attachment 8551158 [details] [diff] [review]:
-----------------------------------------------------------------

As Emanuel already mentioned, this misses both of the JIT allocators. Rather than adding new instructions to the allocation paths, you'll want to discard the jit code and ensure that when we recompile we do the allocation out-of-line. I'd add another condition to the TraceEnabled check in checkAllocatorState: https://dxr.mozilla.org/mozilla-central/source/js/src/jit/MacroAssembler.cpp#620.

::: js/src/gc/Heap.h
@@ +1280,5 @@
>  bool
>  TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const
>  {
>      AssertValidColor(this, color);
> +    MemProfiler::MarkTenured(reinterpret_cast<void *>(address()));

A virtual call here is going to be extremely noticable on GC benchmarks. What exactly are you trying to do with this count, because it's not going to work as is.
Attachment #8551158 - Flags: feedback?(terrence) → feedback-
Comment on attachment 8551159 [details] [diff] [review]
part 4. monitoring allocations and frees for ArrayBuffer

Review of attachment 8551159 [details] [diff] [review]:
-----------------------------------------------------------------

js/src/gc/Memory.cpp doesn't really have anything to do with ArrayBufferObjects; please split the Memory changes into a separate patch.

::: js/src/gc/Memory.cpp
@@ +711,5 @@
>          return nullptr;
>  
>      buf = (uint8_t *) MapMemoryAt(buf, pa_size, PROT_READ | PROT_WRITE,
>                                    MAP_PRIVATE | MAP_FIXED, fd, pa_start);
> +    MemProfiler::SampleNative(buf, pa_size);

You appear to be building the code on all platforms: please add appropriate sampling for all the other platforms as well.
Attachment #8551159 - Flags: feedback?(terrence)
Attachment #8551159 - Flags: feedback?(sphink)
Attachment #8551159 - Flags: feedback-
Comment on attachment 8551149 [details] [diff] [review]
part 1. CountAlloc in ReplaceAlloc

Review of attachment 8551149 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure if you want feedback from me about the code in this particular patch or the tool in general. I'll give you the latter.

I built the browser and add-on from the GitHub repos, and created a new profile. When I opened the devtools panel and clicked "start" I found that it frequently crashed almost immediately. Loading the PDF at https://github.com/mozilla/pdf.js/issues/2541 reliably caused the crash.

When I was able to get it to work without crashing I couldn't get any useful output. I clicked on the "start" button, did some browsing, then clicked on the "stop" button... and nothing happened. The rectangles within the Memory Profiler panel remained blank and there was no indication of activity. The "reset" button also appeared to have no effect. https://wiki.mozilla.org/MemoryProfiler doesn't have anything about how to use the tool, so I can't tell if I'm doing something wrong or not.
Attachment #8551149 - Flags: feedback?(n.nethercote)
Oh, one question: you say "The memory profiler samples allocation events". How does it do the sampling? Do you sample when a certain amount of time has passed, or when a certain number of bytes have been allocated. Hopefully it's the latter; that's what DMD does.

I ask this because time-based profiling doesn't work well in a memory profiler because you're likely to sample lots of small allocations (which are common) and miss large ones (which are uncommon) which gives an inaccurate view of what is happening.
Whiteboard: MemShrink → [MemShrink]
(In reply to Nicholas Nethercote [:njn] from comment #13)
> I'm not sure if you want feedback from me about the code in this particular
> patch or the tool in general. I'll give you the latter.

Thank you. This is what I'm looking for.
 
> I built the browser and add-on from the GitHub repos, and created a new
> profile. When I opened the devtools panel and clicked "start" I found that
> it frequently crashed almost immediately. Loading the PDF at
> https://github.com/mozilla/pdf.js/issues/2541 reliably caused the crash.

I apologize for the crashes. There was a huge refactoring recently and some crashing bugs were introduced. While I'm fixing the bugs and refining the codes according to Terrence's and Emanuel's comments, I tried the pdf mentioned in the above and made a screenshot for your reference.
http://carto.metro.free.fr/documents/CartoMetroParis.v3.6.pdf

> Oh, one question: you say "The memory profiler samples allocation events".
> How does it do the sampling? Do you sample when a certain amount of time has
> passed, or when a certain number of bytes have been allocated. Hopefully
> it's the latter; that's what DMD does.

Yes, the sampler samples along the the number of bytes instead of time..
> There was a huge refactoring recently

Please NEEDINFO me once you've got the code in a good state again. Thanks.
Comment on attachment 8551165 [details] [diff] [review]
part 5. tracking the memory events

Review of attachment 8551165 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/MemoryProfiler.cpp
@@ +70,5 @@
> +
> +  return trace;
> +}
> +
> +class AutoLock {

We have MutexAutoLock in mozilla/Mutex.h for this.

@@ +451,5 @@
> +  void *profiler = PR_GetThreadPrivate(gActiveProfiler);
> +  if (profiler == &nobody) {
> +    gNativeProfiler->sampleNative(addr, size);
> +  } else if (!profiler) {
> +    PRThread *curr = PR_GetCurrentThread();

PR_GetCurrentThread can end up using the memory allocator. Using it in this function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so that is dangerous as well.

@@ +454,5 @@
> +  } else if (!profiler) {
> +    PRThread *curr = PR_GetCurrentThread();
> +    if (tlsOwner == curr)
> +      return;
> +    PR_Lock(tlsLock);

Please note somewhere that CARegister needs to happen after _initOnce, otherwise you crash here, if not before.
Comment on attachment 8551149 [details] [diff] [review]
part 1. CountAlloc in ReplaceAlloc

Review of attachment 8551149 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/build/replace_malloc_bridge.h
@@ +86,5 @@
>     * instance is valid until the process dies.
>     * This method was added in version 2 of the bridge. */
>    virtual void InitDebugFd(mozilla::DebugFdRegistry&) {}
>  
> +  virtual void CARegister(void (*_alloc)(void *, int32_t),

Seems to me _alloc and _free should have a different name. CARegister is also not very clear as to what it is. Please also add comments about what the function is for, and that it is added in version 3 of the bridge.

::: memory/replace/countalloc/CountAlloc.cpp
@@ +39,5 @@
> +{
> +  virtual void CARegister(void (*_alloc)(void *, int32_t),
> +                          void (*_free)(void *)) override {
> +    __hook_alloc = _alloc;
> +    __hook_free = _free;

You probably want a memory fence of some sort here. Or to use atomics.

@@ +96,5 @@
> +{
> +  void* new_ptr = sFuncs->realloc(aPtr, aSize);
> +  if (new_ptr || !aSize) {
> +    FreeHook(aPtr);
> +    if (!aSize) {

if (aSize)?
Attachment #8551149 - Flags: feedback?(mh+mozilla)
Comment on attachment 8551156 [details] [diff] [review]
part 2. hooks in js engine

Review of attachment 8551156 [details] [diff] [review]:
-----------------------------------------------------------------

I can't review js/* code. I'll defer it to Terrence.
Attachment #8551156 - Flags: feedback?(bgirard)
Comment on attachment 8551165 [details] [diff] [review]
part 5. tracking the memory events

Review of attachment 8551165 [details] [diff] [review]:
-----------------------------------------------------------------

Just a high level check, I'll finish tomorrow

::: tools/profiler/MemoryProfiler.cpp
@@ +52,5 @@
> +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler;
> +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime;
> +
> +// Get the backtrace in PseudoStack.
> +static std::vector<const char *> SPSGetStacktrace()

This is an implementation detail of the profiler that doesn't belong here. GeckoProfiler.h is the public interface that should be used. The method you're looking for is 'profiler_get_backtrace'. If you only want the pseudostack you should introduce a flag to profiler_get_backtrace to collect just the pseudostack. Otherwise the method should a stack containing all the information available.

::: tools/profiler/moz.build
@@ +8,5 @@
>      FAIL_ON_WARNINGS = True
>  
>      XPIDL_MODULE = 'profiler'
>      XPIDL_SOURCES += [
> +        'nsIMemoryProfiler.idl',

As far as I can tell this doesn't integrate with the profiler at all. The only thing it does is use the pseudostack which is mean to be provided by a public API.

If the memory profiler doesn't integrate with the profiler it should live outside of the profiler, or at the very least in a separate subdir. This will prevent the project from developing cross dependencies.

::: tools/profiler/nsIMemoryProfiler.idl
@@ +4,5 @@
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "nsISupports.idl"
> +
> +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler)

If this is means as documentation you should write a better comment for this. If it's just a personal note you should delete it.

This class could use documentation.

can you get someone else to review .idl? I'm not qualified to do that.
Comment on attachment 8551165 [details] [diff] [review]
part 5. tracking the memory events

Review of attachment 8551165 [details] [diff] [review]:
-----------------------------------------------------------------

Just a high level check, I'll finish tomorrow

::: tools/profiler/MemoryProfiler.cpp
@@ +40,5 @@
> +
> +// gActiveProfiler is the key to a TLS variable which tracks active profiler on
> +// each thread. nobody is a sentinal indicating there's no profile active.
> +static unsigned int gActiveProfiler;
> +static int nobody;

Please use mfbt/ThreadLocal.h. It deals with the nobody case better, has better typing and the API is better.

@@ +52,5 @@
> +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler;
> +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime;
> +
> +// Get the backtrace in PseudoStack.
> +static std::vector<const char *> SPSGetStacktrace()

This is an implementation detail of the profiler that doesn't belong here. GeckoProfiler.h is the public interface that should be used. The method you're looking for is 'profiler_get_backtrace'. If you only want the pseudostack you should introduce a flag to profiler_get_backtrace to collect just the pseudostack. Otherwise the method should a stack containing all the information available.

@@ +399,5 @@
> +
> +  virtual void sampleNative(void *addr, int32_t size) {
> +    AutoMPLock lock(lock_, this);
> +    cumulated += size;
> +    if (cumulated >= mSampleSize) {

This will give a bias, see my write up on this in my next comment.

@@ +614,5 @@
> +  return std::make_tuple(names.serialize(), traces.serialize(), move(events));
> +}
> +
> +NS_IMETHODIMP
> +MemoryProfiler::GetResults(JSContext *cx, JS::MutableHandle<JS::Value> aResult)

If you want to dump the results when the JS engine isn't alive you should have a look at the JSStreamWriter that the profiler uses. It looks like since we've introduced JSONWriter.h to mfbt which looks similar.

::: tools/profiler/moz.build
@@ +8,5 @@
>      FAIL_ON_WARNINGS = True
>  
>      XPIDL_MODULE = 'profiler'
>      XPIDL_SOURCES += [
> +        'nsIMemoryProfiler.idl',

As far as I can tell this doesn't integrate with the profiler at all. The only thing it does is use the pseudostack which is mean to be provided by a public API.

If the memory profiler doesn't integrate with the profiler it should live outside of the profiler, or at the very least in a separate subdir. This will prevent the project from developing cross dependencies.

::: tools/profiler/nsIMemoryProfiler.idl
@@ +4,5 @@
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "nsISupports.idl"
> +
> +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler)

If this is means as documentation you should write a better comment for this. If it's just a personal note you should delete it.

This class could use documentation.

can you get someone else to review .idl? I'm not qualified to do that.
Attachment #8551165 - Flags: feedback?(bgirard) → feedback-
So we've had a lengthy discussion in Toronto (myself, jrmuizel, mstange) about how to sample memory allocations to build a representative heap snapshot to show an *evenly distributed sample* of your live heap at a given time. I'll try to summarize a lengthy discussion.

(In reply to Nicholas Nethercote [:njn] from comment #14)
> Oh, one question: you say "The memory profiler samples allocation events".
> How does it do the sampling? Do you sample when a certain amount of time has
> passed, or when a certain number of bytes have been allocated. Hopefully
> it's the latter; that's what DMD does.

I'm assuming here you mean that you keep a stateful count of how many bytes have been allocated since the last sample. Otherwise you would always miss small allocations which is wrong. When the count is above a threshold we 'sample' the entire allocations weighted by the allocation size. With this method you wont get an evenly distributed probability that a given 'live' byte on the heap has been sampled.

Let's explore this from the point of view of a particular byte:

1) A byte is more likely to be sampled if it's part of a large allocation. This is because when an allocation crosses the threshold we record all the bytes within it. So a byte in a one byte allocation gets 1/sampleThreshold. A byte in an allocation larger then your threshold will always get logged.
2) Some allocations can be reliably missed if say a loop allocates code in a particular pattern that is a multiple of the allocation threshold.
3) Blocks tend to have a different lifetime and I'm going that there's a correlation between allocation size and life time. Assuming that this is true then when looking at the heap at a particular time you wont have an even sample of the blocks that are currently live. This is related to 1).

The way to fix it is to roll a random number for each byte allocated (in practice you can do something equivalent but faster) regardless of the allocation size of the number of bytes that have been allocated previously. If roll is below sampleRatio then you grab a stack and charge it against that byte. This fixes all the problems above. Each byte has an equal chance of being sampled at any point in time and the probability of sampling a byte is independent of the nearby allocations patterns or how big the allocation that it's inside.

Instead of generating a random roll for each allocated byte we can just generate a random binomial random variate. If that's >= 1 then we capture a stack assign the value with that as a weight.

Our suggestion should address problem 1,2,3 and let you build a memory profiler that can show an unbiased view of live bytes at any given point in time.
DMD's sampling is controlled by AllocCallback() within memory/replace/dmd/DMD.cpp.

The default sampling threshold is 4093 bytes, which is prime, and the documentation recommends that if you choose a different threshold that you also choose a prime. This mitigates problem (2).

Also, when a small (i.e. < 4093 B) allocation cause the small allocations counter to cross the 4093 B threshold, that allocation gets treated as though it is 4093 B in size. I.e. it compensates for the the fact that some small blocks are ignored by treating the non-ignored small blocks as if they were bigger.

Such blocks are also marked in a way that indicates sampling was involved and in the output any measurements that involved sampled blocks are preceded with '~' to indicate they aren't exact. I think this mitigates (1), though I'm not sure if I understood what you wrote exactly.

As for (3), I'm skeptical that there is a correlation between allocation size and lifetime.
(In reply to Nicholas Nethercote [:njn] from comment #23)
> DMD's sampling is controlled by AllocCallback() within
> memory/replace/dmd/DMD.cpp.
> 
> The default sampling threshold is 4093 bytes, which is prime, and the
> documentation recommends that if you choose a different threshold that you
> also choose a prime. This mitigates problem (2).

If your code happens to be allocating a multiple of your prime, say 2 * 4093 then it doesn't rigorously address my concern. But in practice it's probably good enough.

> 
> Also, when a small (i.e. < 4093 B) allocation cause the small allocations
> counter to cross the 4093 B threshold, that allocation gets treated as
> though it is 4093 B in size. I.e. it compensates for the the fact that some
> small blocks are ignored by treating the non-ignored small blocks as if they
> were bigger.
> 
> Such blocks are also marked in a way that indicates sampling was involved
> and in the output any measurements that involved sampled blocks are preceded
> with '~' to indicate they aren't exact. I think this mitigates (1), though
> I'm not sure if I understood what you wrote exactly.
> 
> As for (3), I'm skeptical that there is a correlation between allocation
> size and lifetime.

Essentially you're saying that it's good enough/mitigated. I don't disagree with this in practice. FWIW the CPU profiler suffers from the same problem but time sampling is a lot more noisy.

I think my approach is more sound but generating binomial random variate quickly is a PITA.
> > The default sampling threshold is 4093 bytes, which is prime, and the
> > documentation recommends that if you choose a different threshold that you
> > also choose a prime. This mitigates problem (2).
> 
> If your code happens to be allocating a multiple of your prime, say 2 * 4093
> then it doesn't rigorously address my concern. But in practice it's probably
> good enough.

jemalloc has a finite number of size classes and any requests for odd or prime sizes will be rounded up to a power-of-two or something power-of-two-ish (such as a multiple of 16 or 256 or 1024) E.g. 4093 rounds up to 4096. And DMD uses the rounded-up size in the sampling calculations.
Thank you for reviewing this Terrence! I'm fixing the problems according to your comments and will update the patches soon. Just a few questions in below and pardon me if I misunderstood what you mean:

(In reply to Terrence Cole [:terrence] from comment #10)
> ::: js/public/MemoryProfiler.h
> @@ +15,5 @@
> > +  public:
> > +    virtual ~NativeProfiler() {};
> > +    virtual void sampleNative(void *addr, int32_t size) = 0;
> > +    virtual void removeNative(void *addr) = 0;
> > +    virtual void reset() = 0;
> 
> Making all of these calls virtual is going to add a ton of overhead. It
> looks like there is only one consumer so maybe we could just move most of
> the code inline here?

Because codes in js/src don't know the codes or interfaces in tools/profiler, the approach here is to have hooks installable from outside. I tried tables with function pointers but they seemed to be not much faster than virtual functions which looked a little bit prettier IMHO.

> @@ +22,5 @@
> > +class GCHeapProfiler {
> > +  public:
> > +    virtual ~GCHeapProfiler() {};
> > +    virtual void sampleTenured(void *addr, int32_t size) = 0;
> > +    virtual void sampleNursery(void *addr, int32_t size) = 0;
> 
> Please don't use void* for GC things. Instead include js/HeapAPI.h and use
> GCCellPtr here, and below where you take GC references.

The addresses here are just plain numbers and the only operation on them is equality check. Would GCCellPtr be an overkill?
(In reply to Terrence Cole [:terrence] from comment #11)
> ::: js/src/gc/Heap.h
> @@ +1280,5 @@
> >  bool
> >  TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const
> >  {
> >      AssertValidColor(this, color);
> > +    MemProfiler::MarkTenured(reinterpret_cast<void *>(address()));
> 
> A virtual call here is going to be extremely noticable on GC benchmarks.
> What exactly are you trying to do with this count, because it's not going to
> work as is.

MemProfiler::MarkTenured is an inline static function. When the profiler is not enabled, the overheads are a memory load and a conditional jump. I was worried if that can still be noticeable but Octane seems to be satisfied with that. I'll try more benchmarks before requesting a review.

When the profiler is enabled, the situation is similar to that in comment 26. A function table is not much faster.
Thank you Mike.

(In reply to Mike Hommey [:glandium] from comment #17)
> @@ +451,5 @@
> > +  void *profiler = PR_GetThreadPrivate(gActiveProfiler);
> > +  if (profiler == &nobody) {
> > +    gNativeProfiler->sampleNative(addr, size);
> > +  } else if (!profiler) {
> > +    PRThread *curr = PR_GetCurrentThread();
> 
> PR_GetCurrentThread can end up using the memory allocator. Using it in this
> function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so
> that is dangerous as well.

I think this is the source of crashes Nicholas encountered (sorry...). I gave up relying on tricky flags and TLS to avoid recurrence when using STL containers. Instead, the approach in the next patch will be supplying a customized allocator to containers.
(In reply to Benoit Girard (:BenWa) from comment #24)
> I think my approach is more sound but generating binomial random variate
> quickly is a PITA.

I'm not sure how to use a binomial random variable to speed up the process but Nick Fitzgerald showed me that a random variable which conforms to geometric distribution can do so and can be generated with a uniform random variable and two log10():

https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/SavedStacks.cpp#l784

So we only have to roll a dice per 1/p allocations. If the sampling rate p is low enough, the solution should pay off.

But yeah, I concur with Nicholas that a prime should be practically good enough.
(In reply to Nicholas Nethercote [:njn] from comment #23)
> The default sampling threshold is 4093 bytes, which is prime, and the
> documentation recommends that if you choose a different threshold that you
> also choose a prime. This mitigates problem (2).

It mitigates it somewhat, but this can still end up heavily biased.

Say you're using a threshold of 4093 bytes, and you're only allocating structs that happen to be 4096 bytes.

  struct {
       char *a; // = malloc(2048)
       char *b; // = malloc(2048)
  }

If you allocate an infinite number of structs, your sampling is perfect. But the first 682 samples always fall within 'b'. In fact, if you look at which of a or b each sample falls within, it will be 'b' 682 times, then 'a' 683 times, repeating (well, 683 and 682 swap for each repeat, but whatever.)

So say during startup you allocate 500 of these, and you happen to land near the beginning of one of these runs. You'll end up estimating 2MB of stack 'a' and 0 bytes for stack 'b' (or vice versa). And then you never allocate with these stacks anymore, so you can't bury the bias with later samples.

How likely is this in practice? No clue. It's totally possible that it doesn't matter. But fitzgen's approach eliminates the problem completely, since it is exactly equivalent to rolling dice for every byte.

> Also, when a small (i.e. < 4093 B) allocation cause the small allocations
> counter to cross the 4093 B threshold, that allocation gets treated as
> though it is 4093 B in size. I.e. it compensates for the the fact that some
> small blocks are ignored by treating the non-ignored small blocks as if they
> were bigger.

This is the right thing to do, but I would explain it differently. You're sampling along the bytes-allocated dimension with probability p = 1/4093. Thus every sample represents 1/p = 4093 
bytes.

The implication, though, is that you are *not* changing the size for allocations larger than 4093 bytes. That also introduces a bias. A large allocation of size K should be treated as if it got sampled *approximately* K*p times, and each sample stands for 1/p bytes, so for large K using K*p/p=K for the estimate is fine. But for small K>4093, you really ought to be figuring out exactly how many times your every-4093-byte counter would fire, and multiplying that number by 4093.

On the other hand, it's nicer to have an exact size for larger allocations. And if you switch to fitzgen's method, then you don't need to do the math to figure out how many times your sampler would hit for K bytes. (If you really want to, I'm sure you can pose it as a question to fitzgen and jimb and they won't be able to stop themselves from computing it.) Still, it's good to be aware that you likely have bias for the sizes slightly larger than your 1/p (4093) value.

> Such blocks are also marked in a way that indicates sampling was involved
> and in the output any measurements that involved sampled blocks are preceded
> with '~' to indicate they aren't exact. I think this mitigates (1), though
> I'm not sure if I understood what you wrote exactly.

Yeah, you're doing the right thing here. It's exactly the same as BenWa proposed in (1) when he wrote 1/sampleThreshold.

> As for (3), I'm skeptical that there is a correlation between allocation
> size and lifetime.

I find that correlation believable. Specifically, I would guess that larger allocations tend to live longer. But I don't see why it's a problem.

(In reply to Benoit Girard (:BenWa) from comment #22)
> 3) Blocks tend to have a different lifetime and I'm going that there's a
> correlation between allocation size and life time. Assuming that this is
> true then when looking at the heap at a particular time you wont have an
> even sample of the blocks that are currently live. This is related to 1).

I don't get this. Why do relative lifetimes matter? As long as you're weighting everything by 1/p, I don't see why the relative distribution of chunk sizes matters. Well, except for the bias introduced by using exact allocation sizes.

Then again, I haven't read the patch and it's been too long since I read the DMD code. Are frees being sampled too? How does that even work? I really ought to look, but I've spent too much time on the comment already. :-|

> Instead of generating a random roll for each allocated byte we can just
> generate a random binomial random variate. If that's >= 1 then we capture a
> stack assign the value with that as a weight.

I sort of understand what you're saying, though I don't understand the exact terminology. This still requires sampling from a probability distribution on every allocation, so fitzgen's approach is still better afaict. (Well, really fitzgen has a single event at a time, instead of a count of some number of them as you do, but it's a straightforward extension.)

> Our suggestion should address problem 1,2,3 and let you build a memory
> profiler that can show an unbiased view of live bytes at any given point in
> time.

Right. But njn's is doing the same thing if you take p=1/4093, and so his estimates are no more biased than yours if you ignore harmonics. So I still don't get your (3). (He's flipping a coin for every byte that comes up heads with p=1/4093, it's just that his algorithm for flipping the coin is "return false 4092 times, then true, then start over." I would like to gamble with him. Or equivalently, he's sampling your binomial distribution by just reading out the mode.)

To be specific, pseudocode for the unbiased form of what I am suggesting is:

  remaining = fitzgen_sample(p)
  alloc(K):
    k = K
    while (k > remaining):
      sample(this stack, weight=1/p)
      k -= remaining
      remaining = fitzgen_sample(p)
    remaining -= k

but of course that's slow and awful for large K, and we probably don't need to worry about the "near 1/p bias", so it'd probably be

  remaining = fitzgen_sample(p)
  alloc(K):
    if (K >= 1/p):
      sample(this stack, weight=K)
    else:
      k = K
      while (k >= remaining):
        sample(this stack, weight=1/p)
        k -= remaining
        remaining = fitzgen_sample(p)
      remaining -= k

You can eliminate the loop here by throwing more math at it, but the loop is rarely going to run very many times. This also still has the same bias as njn's method, just without the harmonics. With some more math you could eliminate the bias, and the whole thing would reduce to

  remaining = fitzgen_sample(p)
  alloc(K):
    if (K < remaining):
      remaining -= K
    else:
      hits, remaining = f(K, remaining)
      sample(this stack, weight=hits * 1/p)

where f() figures out how many times the sampler would hit within the next K bytes, and also returns the number of bytes left after the final sample. This does lose the exact counts for larger allocations, though.

Math is involved, which means I got something wrong in the above, but I think the idea should be about right.
Thanks for the detailed analysis, sfink. I will have to re-read it on Monday. When I first implemented sampling in DMD I tried doing something like you suggested and I had a long argument with jlebar about it and he convinced me that DMD's current code was ok. Unfortunately I don't remember the explanation now.

> This does lose the exact counts for larger allocations, though.

For what it's worth, this would hurt. DMD shows the size of all the non-sampled allocations and it's useful data that I look at quite often.
Hi everyone.  njn asked me to weigh in here.

I agree with sfink's assessment of the "4093 algorithm".  I'm not convinced that the harmonic bias is particularly important in the context of Firefox: I can't recall a time when we missed something important due to sampling bias.  But of course maybe we never found it precisely because it was hidden by sampling bias.

I can't figure out what fitzgen_sample is, so I can't comment on that suggestion.  I can, however, point you to what Google does in tcmalloc.  There's an OSS version, which is all I can comment on; the relevant code is in sampler.h [1].  I think the sampling algorithm it uses is similar to the last algorithm from comment 30, although it lacks the "hits" variable.  There's a big comment at the top of the file that hopefully explains what's going on, but I studied this code pretty extensively about a year ago, so I can try to help if you need clarification.

The tcmalloc algorithm is statistically sound inasmuch as I was able to analyze it; it certainly lacks the harmonic bias of the 4093 algorithm.  It's also quite elegant; I swooned a bit when I first read it.  :)  The only caveat I'd offer is that DMD samples at a much higher rate than tcmalloc's default; DMD samples 4k allocs with probability 1, while tcmalloc samples a 4k allocation with probability < 1%.  tcmalloc's algorithm has a higher per-sample overhead than the 4093 algorithm, and this may be significant at higher sampling frequencies; I don't know.

There's a concern here about getting the exact size of large allocations that I want to allay, but I'm not sure   exactly what it's about.  With either tcmalloc sampling or the 4093 algorithm, there's uncertainty about the *number* of allocations of size N from a given callsite, but no uncertainty about each sampled allocation's size.  tcmalloc represents this in its output as e.g. "~3mb of size-48 allocs from function foo()".  But anyway if you want a hard cutoff above which all allocs are sampled, I don't think that would be a problem to implement with the tcmalloc approach.

HTH.  I'm not reading bugmail (I still get a ton), but feel free to ping me via e-mail if you have any other questions.

[1] https://code.google.com/p/gperftools/source/browse/src/sampler.h
(In reply to Steve Fink [:sfink, :s:] from comment #30)
> Then again, I haven't read the patch and it's been too long since I read the
> DMD code. Are frees being sampled too? How does that even work? I really
> ought to look, but I've spent too much time on the comment already. :-|

Frees are not sampled unfortunately. Each free results in a look up in the address table to see if there's a corresponding sample.
(In reply to Justin Lebar (not reading bugmail) from comment #32)
> I can't figure out what fitzgen_sample is, so I can't comment on that
> suggestion.  I can, however, point you to what Google does in tcmalloc. 

fitzgen_sample is sfink's made-up name for the algorithm cited in comment 29:

https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/SavedStacks.cpp#l784

which is indeed - miracle of miracles! - the same thing tcmalloc is doing: generating random numbers with a geometric distribution to get the equivalent of a Bernoulli trial on every allocation.

However, the way tcmalloc does a trail for every *byte*, not every *allocation*, seems like a fine idea: it seems very wrong to skip gigantic allocations with the same probability as small ones.

The fact that each byte's probability of being marked is completely independent of the other bytes' means that we don't even need any funny re-synchronization of the skip count after a large allocation; we just choose a new skip count like we would for anything else, because the process looks the same no matter where you start in the stream.
Comment on attachment 8551170 [details] [diff] [review]
part 9. Interface to Add-Ons

Review of attachment 8551170 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good.

Do you have any tests for it? See the examples of tests for other actors in toolkit/devtools/server/tests/mochitest. It wouldn't be necessary to actually test nsIMemoryProfiler as well; the tests could set a global flag indicating that the actor should return fake data.
Attachment #8551170 - Flags: feedback?(jimb)
(In reply to Jim Blandy :jimb from comment #34)
> (In reply to Justin Lebar (not reading bugmail) from comment #32)
> > I can't figure out what fitzgen_sample is, so I can't comment on that
> > suggestion.  I can, however, point you to what Google does in tcmalloc. 
> 
> fitzgen_sample is sfink's made-up name for the algorithm cited in comment 29:
> 
> https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/
> SavedStacks.cpp#l784
> 
> which is indeed - miracle of miracles! - the same thing tcmalloc is doing:
> generating random numbers with a geometric distribution to get the
> equivalent of a Bernoulli trial on every allocation.
> 
> However, the way tcmalloc does a trail for every *byte*, not every
> *allocation*, seems like a fine idea: it seems very wrong to skip gigantic
> allocations with the same probability as small ones.

Gah! It does?! I should've read more closely. I just assumed it was skipping N bytes, not N allocations. Skipping allocations is, er, not what you want. My pseudocode was assuming each byte has the same probability of being sampled.

> The fact that each byte's probability of being marked is completely
> independent of the other bytes' means that we don't even need any funny
> re-synchronization of the skip count after a large allocation; we just
> choose a new skip count like we would for anything else, because the process
> looks the same no matter where you start in the stream.

Well, it depends on how you're looking at it. We're looking at a memoryless sampling process, so it is true that the skip count computation is exactly the same no matter how recent your last sample was.

But to implement this, that's not what you need to compute. Say you have a state where your skip count is 5 and you're allocating 20 bytes. That's equivalent to a skip count of zero and an allocation of 15 bytes (if we ignore the off-by-one stuff for the specific byte being sampled.) If p=1/5 (per byte), then you want to know how many samples you're going to take within those 15 bytes (it'll be about 3), and also how much is *left* of your skip count after the last byte allocated. The distribution for this leftover skip count is not the same as the distribution for the skip count in general. It is smaller on average. (If you were doing every-4093rd-byte sampling, it would be equal to 4093 minus the gap at the end of the previous allocation.)

Er... except see below. If you use the actual sizes for large allocations, then you have already accounted for the difference. You don't even need to compute a new skip count.

(In reply to Justin Lebar (not reading bugmail) from comment #32)
> I can't figure out what fitzgen_sample is, so I can't comment on that

Sorry, as jimb said, fitzgen_sample is just sampling from a geometric distribution. (And it sounds like I should have called it jimb_sample anyway, if I didn't want to use the official name.)

> suggestion.  I can, however, point you to what Google does in tcmalloc. 
> There's an OSS version, which is all I can comment on; the relevant code is
> in sampler.h [1].  I think the sampling algorithm it uses is similar to the
> last algorithm from comment 30, although it lacks the "hits" variable. 

Lacking "hits" seems to be a rather large flaw to me, but it depends on what you're using it for. "hits" is a weight. If you're implementing a profiler, weights are important. Or in simple terms, if you care about the difference between a 1GB and a 2GB allocation, then you need weights. (Well, unless you want to implement those as a whole crapload of separate samples, but that would make your sampling take time linear in the size of the allocations, which seems bad.)

If your p is low enough, then it also doesn't really make much of a difference -- if you would only take at most one sample in any sanely-sized allocation 99.999% of the time, then weights are irrelevant in practice since they'll always be 1.

Oh! And if you're using actual sizes for larger allocations, you mostly don't need weights either.

> There's a big comment at the top of the file that hopefully explains what's
> going on, but I studied this code pretty extensively about a year ago, so I
> can try to help if you need clarification.
> 
> The tcmalloc algorithm is statistically sound inasmuch as I was able to
> analyze it; it certainly lacks the harmonic bias of the 4093 algorithm. 
> It's also quite elegant; I swooned a bit when I first read it.  :)  The only

Yay for independent evolution. :-) jimb came up with the same thing.

> caveat I'd offer is that DMD samples at a much higher rate than tcmalloc's
> default; DMD samples 4k allocs with probability 1, while tcmalloc samples a
> 4k allocation with probability < 1%.  tcmalloc's algorithm has a higher
> per-sample overhead than the 4093 algorithm, and this may be significant at
> higher sampling frequencies; I don't know.

Yes, I agree. But for the use we're considering here, we're taking a friggin' stack trace. Assuming the last algorithm, that's one sample drawn from a distribution per stack trace, and I'd be very pleasantly surprised if taking a stack trace is fast compared to doing a little bit of math.

> There's a concern here about getting the exact size of large allocations
> that I want to allay, but I'm not sure   exactly what it's about.  With

It's complicated, and I had a much larger writeup here, but it turns out that using the actual size for large allocations is what you want and avoids some subtle problems. Maybe I'll do a blog post, where I explain the painful detail why.

> either tcmalloc sampling or the 4093 algorithm, there's uncertainty about
> the *number* of allocations of size N from a given callsite, but no
> uncertainty about each sampled allocation's size.  tcmalloc represents this
> in its output as e.g. "~3mb of size-48 allocs from function foo()".  But
> anyway if you want a hard cutoff above which all allocs are sampled, I don't
> think that would be a problem to implement with the tcmalloc approach.

Yes, it is straightforward to implement, though the exact meaning of the results in that case is complicated. I would need to look at the sampler.h callers to be sure of what its numbers mean.

> [1] https://code.google.com/p/gperftools/source/browse/src/sampler.h

Thanks for the link!

Ok, so to be precise, I'm talking about:

  remaining = sample_geometric(p);
  alloc(K):
    if (K > 1/p):
      sample(this stack, weight=K)
    else:
      if (K < remaining):
        remaining -= K
      else:
        sample(this stack, weight=1/p)
        remaining = sample_geometric(p)

where sample_geometric(p) is ceil(ln(1-rand) / ln(1-p)) and rand is a sample from a uniform distribution over [0,1].

Note that allocations are "sampled" based on two totally different processes: anything of alloc size over 1/p is recorded unconditionally and weighted by the actual size of the allocation. Smaller allocations are sampled with an independent probability of p per byte, and samples are weighted by 1/p.

This procedure is fast, simple, and if I am not mistaken, wrong. The correct algorithm would be:

  remaining = sample_geometric(p);
  alloc(K):
    if (K > 1/p):
      sample(this stack, weight=K)
    else:
      if (K < remaining):
        remaining -= K
      else:
        hits = 0
        while (remaining <= K):
          hits++
          K -= remaining
          remaining = sample_geometric(p)
        sample(this stack, weight=hits*1/p)

because with the first procedure a 100-byte allocation with p=1/4093 would have an exactly 0% chance of being sampled twice, which is wrong if each byte is tested independently. But this is sadly more complicated, for little real benefit. Perhaps you could bump up the weight based on the leftover bytes to avoid the while() loop? Say, by (K-remaining)*p? I've no idea whether that is actually sound, but once again I've spent too much time thinking about this. :-( I think it reintroduces a teeny tiny bit of the harmonic problem, but nowhere near enough to worry about.

Ok, my final answer (for now):

  remaining = sample_geometric(p);
  alloc(K):
    if (K > 1/p):
      sample(this stack, weight=K)
    else:
      if (K < remaining):
        remaining -= K
      else:
        sample(this stack, weight=(1/p + K - remaining))
        remaining = sample_geometric(p)

Not that anyone was necessarily waiting for me to stop talking to myself...
Whiteboard: [MemShrink] → [MemShrink:P2]
Comment on attachment 8551170 [details] [diff] [review]
part 9. Interface to Add-Ons

Review of attachment 8551170 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good.

Do you have any tests for it? See the examples of tests for other actors in toolkit/devtools/server/tests/mochitest. It wouldn't be necessary to actually test nsIMemoryProfiler as well; the tests could set a global flag indicating that the actor should return fake data.
Attachment #8551170 - Flags: feedback+
[Sorry for the re-post of the review. I blame Splinter.]

I don't see the value of having a special case for K > 1/p. If p is 4096, you've got a 99.9% chance of sampling a 32k allocation.

I also don't think the 'hits' count is actually interesting, since it's strictly a function of the block's size. The likelihood of each byte being marked is independent, so the hit count is simply the number of marked bytes the block contains, which is a value with a binomial distribution:

http://en.wikipedia.org/wiki/Binomial_distribution#Generating_binomial_random_variates

So I suppose you could fake up a hit count if you wanted to, but ... why not just report the size? If your display doesn't adequately put big allocations front and center, that's a separate problem.

I'd say the algorithm simply:

  remaining = sample_geometric(p);
  alloc(K):
    remaining -= K
    if remaining < 0:
      sample(this stack)
      remaining = sample_geometric(p)
More strongly: if I'm not missing something, the expected value of the hits count is simply Kp. Not very interesting.
(In reply to Jim Blandy :jimb from comment #39)
> More strongly: if I'm not missing something, the expected value of the hits
> count is simply Kp. Not very interesting.

Yes, that is the expected value.

> I don't see the value of having a special case for K > 1/p. If p is 4096,
> you've got a 99.9% chance of sampling a 32k allocation.

Well, for one, I'm not sure what p is going to be, since it depends on how long it takes to grab a stack trace. I could see it being 1/1M.

But really, I introduced the special case to (1) preserve the property that counts above some threshold are exact, and (2) to avoid the "leftover" calculation since I still insist that you can't just generate a new sample and store it in 'remaining'. That's the right distribution of the wrong variable.

I don't know how important (1) is in practice, but it seems nice for testing at least.

> I also don't think the 'hits' count is actually interesting, since it's
> strictly a function of the block's size. The likelihood of each byte being
> marked is independent, so the hit count is simply the number of marked bytes
> the block contains, which is a value with a binomial distribution:
> 
> http://en.wikipedia.org/wiki/
> Binomial_distribution#Generating_binomial_random_variates

Ah, that was the thing that I never got around to figuring out. So sample_binomial(size,p)*1/p could be used as the weight. And then, yes, you can just grab a fresh sample_geometric(p) for 'remaining'. That resolves my (2).

> So I suppose you could fake up a hit count if you wanted to, but ... why not
> just report the size? If your display doesn't adequately put big allocations
> front and center, that's a separate problem.

If you just report the size without doing a fancy computation for 'remaining', I believe you're biasing things.

> I'd say the algorithm simply:
> 
>   remaining = sample_geometric(p);
>   alloc(K):
>     remaining -= K
>     if remaining < 0:
>       sample(this stack)
>       remaining = sample_geometric(p)

Huh? No weights at all? Then if you do 100MB of tiny allocations followed by one 2GB allocation, the 2GB allocation will be completely lost in the noise.

Or are you implicitly using the full size, so that should say "sample(this stack, weight=K)" in place of "sample(this stack)"? But then you're not estimating the total sizes correctly. If you have a bunch of 4 byte allocations mixed with the same number of 8 byte allocations, then you'll take twice as many samples of 8 byte allocations (which is good) *and* weight each twice as much, which means you'll think you're using up 4x the space with the 8-byters when in fact you're only using up 2x the space. They should be sampled with the same weight.

I do believe the binomial sampling allows a simpler solution than I proposed, but not quite that simple.
(In reply to Benoit Girard (:BenWa) from comment #21)
> you're looking for is 'profiler_get_backtrace'. If you only want the

profiler_get_backtrace returns a ProfilerBacktrace of which the only interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& b). This can be slow in my use cases because I have to parse the streams back into lists of function names and attributes. Would you mind if I add a public member function to it to return the list directly?

By the way, is it possible to compact the stacktraces on the fly when profiling, and decode it in the front-end? For example, although the number of function names and stacktraces is somewhat proportional to the number of samples, they basically form a small set and most of them are the same. In this way, the buffer can be utilized much more efficiently and the profiler can run over a much longer period without considering memory usage.

In my experience, running Octane 2.0 with ~1 million samples only result in a few thousands (1000 ~ 3000) of distinct function names. The stacktraces can be encoded in only 10000 ~ 20000 indexes. In contrast, there will be (avg. stack depth) * 1 million function names without compaction.
Flags: needinfo?(bgirard)
(In reply to Ting-Yuan Huang from comment #41)
> (In reply to Benoit Girard (:BenWa) from comment #21)
> > you're looking for is 'profiler_get_backtrace'. If you only want the
> 
> profiler_get_backtrace returns a ProfilerBacktrace of which the only
> interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter&
> b). This can be slow in my use cases because I have to parse the streams
> back into lists of function names and attributes. Would you mind if I add a
> public member function to it to return the list directly?

Yes please, you should add something to GeckoProfiler.h that takes the ProfilerBacktrace object and returns a list of struct POD that contains the data you need.

(In reply to Ting-Yuan Huang from comment #41)
> 
> By the way, is it possible to compact the stacktraces on the fly when
> profiling, and decode it in the front-end? For example, although the number
> of function names and stacktraces is somewhat proportional to the number of
> samples, they basically form a small set and most of them are the same. In
> this way, the buffer can be utilized much more efficiently and the profiler
> can run over a much longer period without considering memory usage.
> 
> In my experience, running Octane 2.0 with ~1 million samples only result in
> a few thousands (1000 ~ 3000) of distinct function names. The stacktraces
> can be encoded in only 10000 ~ 20000 indexes. In contrast, there will be
> (avg. stack depth) * 1 million function names without compaction.

We've discussed this. The best thing would be to use lz4 compression. But so far it's better to just increase the buffer size or decrease the sampling interval.
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #42)
> Yes please, you should add something to GeckoProfiler.h that takes the
> ProfilerBacktrace object and returns a list of struct POD that contains the
> data you need.

Got it. Thanks!

> We've discussed this. The best thing would be to use lz4 compression. But so
> far it's better to just increase the buffer size or decrease the sampling
> interval.

I tried again with Octane 2.0 and got only 276114 samples because my debug codes introduced some slowdowns. Nevertheless, there are 1582 unique names and the total size is 149.67KB. If not compacted, the size would be 703.58MB. The compression ratio is 4788. Removing the debug codes will get more samples and have an even better compression ratio. I doubt that lz4 can beat this within a reasonable amount of time :-)
(In reply to Benoit Girard (:BenWa) from comment #42)
> > profiler_get_backtrace returns a ProfilerBacktrace of which the only
> > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter&
> > b). This can be slow in my use cases because I have to parse the streams
> > back into lists of function names and attributes. Would you mind if I add a
> > public member function to it to return the list directly?
> 
> Yes please, you should add something to GeckoProfiler.h that takes the
> ProfilerBacktrace object and returns a list of struct POD that contains the
> data you need.

The most tricky part when implementing this is that profiler_get_backtrace() allocates memory. Calling an allocator in a memory profiler results in an infinite recursion.

There are two approaches I tried to solve the problem when dealing with STL containers:

1. Completely remove explicit/implicit memory allocations. STL containers are supplied with a customized allocator. I feel that a huge change of SPS that removes all allocations is unlikely acceptable. Removing all (explicit and implicit) allocations in profiler_get_backtrace() can be a huge change and make subsequent developments painful; Any patch which introduces memory allocations breaks the memory profiler. The best way I can think of is to factor out the stack unwinding part of SPS.

2. Use a per-thread flag to prevent recurring into the memory profiler itself. This looks easier and is transparent to profiler. However, an allocator-free version TLS must be implemented. That means a number of PR_XXX such as PR_GetCurrentThread() and PR_SetThreadPrivate() must be implemented again on without dynamic memory allocation on all platforms.

Either approach requires substantial efforts and deserves its own bug. Would you mind if we first land this with a quick hack (go through the PseudoStack either in memory profiler or in a new API of profiler) and revisit the problem later?
Flags: needinfo?(bgirard)
(In reply to Ting-Yuan Huang from comment #43)
> (In reply to Benoit Girard (:BenWa) from comment #42)
> > Yes please, you should add something to GeckoProfiler.h that takes the
> > ProfilerBacktrace object and returns a list of struct POD that contains the
> > data you need.
> 
> Got it. Thanks!
> 
> > We've discussed this. The best thing would be to use lz4 compression. But so
> > far it's better to just increase the buffer size or decrease the sampling
> > interval.
> 
> I tried again with Octane 2.0 and got only 276114 samples because my debug
> codes introduced some slowdowns. Nevertheless, there are 1582 unique names
> and the total size is 149.67KB. If not compacted, the size would be
> 703.58MB. The compression ratio is 4788. Removing the debug codes will get
> more samples and have an even better compression ratio. I doubt that lz4 can
> beat this within a reasonable amount of time :-)

What are you comparing it to? What has a 4788 times compression ratio?
Flags: needinfo?(bgirard)
(In reply to Ting-Yuan Huang from comment #44)
> (In reply to Benoit Girard (:BenWa) from comment #42)
> > > profiler_get_backtrace returns a ProfilerBacktrace of which the only
> > > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter&
> > > b). This can be slow in my use cases because I have to parse the streams
> > > back into lists of function names and attributes. Would you mind if I add a
> > > public member function to it to return the list directly?
> > 
> > Yes please, you should add something to GeckoProfiler.h that takes the
> > ProfilerBacktrace object and returns a list of struct POD that contains the
> > data you need.
> 
> The most tricky part when implementing this is that profiler_get_backtrace()
> allocates memory. Calling an allocator in a memory profiler results in an
> infinite recursion.

Your implementation of SPSGetStacktrace already uses std::vector which will allocate memory.

> The best way I can think of is to
> factor out the stack unwinding part of SPS.

It's already optional under the 'stackwalk' feature. You might want to add a flag to profiler_get_backtrace to not use the stackwalk feature.

> 
> Either approach requires substantial efforts and deserves its own bug. Would
> you mind if we first land this with a quick hack (go through the PseudoStack
> either in memory profiler or in a new API of profiler) and revisit the
> problem later?

I'm suggesting that you don't break the profiler API and expose implementation details. I'm not asking for a functional change. Expose the information you want through a good API instead of exposing the internal details. If you have SPSGetStacktrace working with the condition you need, you should be able to have SPSGetStacktrace be profiler_get_backtrace and have the allocation behavior that you need.
(In reply to Benoit Girard (:BenWa) from comment #45)
> What are you comparing it to? What has a 4788 times compression ratio?

(total size of names of all backtraces) / (total size of those 1582 unique/atom names)

Does that make sense?
(In reply to Benoit Girard (:BenWa) from comment #46)
> Your implementation of SPSGetStacktrace already uses std::vector which will
> allocate memory.

Yeah, that is the source of crashes njn encountered. It will be eliminated in the incoming patch.

> It's already optional under the 'stackwalk' feature. You might want to add a
> flag to profiler_get_backtrace to not use the stackwalk feature.

I'm sorry that I should have not use word "stack unwinding". I meant to factor the part getting backtraces (equivalently unwinding and merging both JS and native stack) out of TableTicker which creates SyncProfile or so.

> I'm suggesting that you don't break the profiler API and expose
> implementation details. I'm not asking for a functional change. Expose the
> information you want through a good API instead of exposing the internal
> details. If you have SPSGetStacktrace working with the condition you need,
> you should be able to have SPSGetStacktrace be profiler_get_backtrace and
> have the allocation behavior that you need.

Glad to know that I misunderstood. Now it looks much easier to me :-)
(In reply to Ting-Yuan Huang from comment #47)
> (In reply to Benoit Girard (:BenWa) from comment #45)
> > What are you comparing it to? What has a 4788 times compression ratio?
> 
> (total size of names of all backtraces) / (total size of those 1582
> unique/atom names)
> 
> Does that make sense?

You can't eliminate redundant samples for free, you have to encode them somehow. That encoding is exactly what LZ4 provides. But if you can write an encoder that is better than LZ4, is fast, can work in place without allocating memory and isn't complex and difficult to maintain then that works
(In reply to Benoit Girard (:BenWa) from comment #49)
> You can't eliminate redundant samples for free, you have to encode them
> somehow. That encoding is exactly what LZ4 provides. But if you can write an
> encoder that is better than LZ4, is fast, can work in place without
> allocating memory and isn't complex and difficult to maintain then that works

Let's talk about this in a different bug. Kannan also expressed interest in improving the packing of the profiler buffer by using a separate symbolication table.
(Both Kannan and I were extremely surprised by the lz4 suggestion since we don't understand how that would work with a circular buffer; the compressed stream can refer to anything earlier in the stream, even to parts that have already been discarded from the buffer.)
Comment on attachment 8551159 [details] [diff] [review]
part 4. monitoring allocations and frees for ArrayBuffer

Review of attachment 8551159 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/ArrayBufferObject.cpp
@@ +298,5 @@
>              ReleaseAsmJSMappedData(data);
>              js_ReportOutOfMemory(cx);
>              return false;
>          }
> +        MemProfiler::SampleNative(diffStart, diffLength);

I'm fine with the general approach of instrumenting right at the mmap calls.
Attachment #8551159 - Flags: feedback?(sphink) → feedback+
(In reply to Mike Hommey [:glandium] from comment #17)
> Comment on attachment 8551165 [details] [diff] [review]
> part 5. tracking the memory events
> 
> Review of attachment 8551165 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: tools/profiler/MemoryProfiler.cpp
> @@ +70,5 @@
> > +
> > +  return trace;
> > +}
> > +
> > +class AutoLock {
> 
> We have MutexAutoLock in mozilla/Mutex.h for this.

mozilla::Mutex seems to call allocator in debug builds. Would you mind to give one more chance to the simpler PRLock and a reinvented wheel (AutoMPLock)?

> @@ +451,5 @@
> > +  void *profiler = PR_GetThreadPrivate(gActiveProfiler);
> > +  if (profiler == &nobody) {
> > +    gNativeProfiler->sampleNative(addr, size);
> > +  } else if (!profiler) {
> > +    PRThread *curr = PR_GetCurrentThread();
> 
> PR_GetCurrentThread can end up using the memory allocator. Using it in this
> function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so
> that is dangerous as well.
>
> @@ +454,5 @@
> > +  } else if (!profiler) {
> > +    PRThread *curr = PR_GetCurrentThread();
> > +    if (tlsOwner == curr)
> > +      return;
> > +    PR_Lock(tlsLock);
> 
> Please note somewhere that CARegister needs to happen after _initOnce,
> otherwise you crash here, if not before.

These are avoided by using a customized allocator which is not hooked.

(In reply to Mike Hommey [:glandium] from comment #18)
> Comment on attachment 8551149 [details] [diff] [review]
> part 1. CountAlloc in ReplaceAlloc
> 
> Review of attachment 8551149 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: memory/build/replace_malloc_bridge.h
> @@ +86,5 @@
> >     * instance is valid until the process dies.
> >     * This method was added in version 2 of the bridge. */
> >    virtual void InitDebugFd(mozilla::DebugFdRegistry&) {}
> >  
> > +  virtual void CARegister(void (*_alloc)(void *, int32_t),
> 
> Seems to me _alloc and _free should have a different name. CARegister is
> also not very clear as to what it is. Please also add comments about what
> the function is for, and that it is added in version 3 of the bridge.

Done. Comments are added and they renamed to CountAllocRegisterHook, sAllocHook, etc.

> ::: memory/replace/countalloc/CountAlloc.cpp
> @@ +39,5 @@
> > +{
> > +  virtual void CARegister(void (*_alloc)(void *, int32_t),
> > +                          void (*_free)(void *)) override {
> > +    __hook_alloc = _alloc;
> > +    __hook_free = _free;
> 
> You probably want a memory fence of some sort here. Or to use atomics.

For multithreaded environment, would it be caller's responsibility to ensure the order? For compiler optimizations or out-of-order-execution in a single thread, I think it should be OK when not inlined, right?

> @@ +96,5 @@
> > +{
> > +  void* new_ptr = sFuncs->realloc(aPtr, aSize);
> > +  if (new_ptr || !aSize) {
> > +    FreeHook(aPtr);
> > +    if (!aSize) {
> 
> if (aSize)?

Yes! Thanks you.
Attachment #8551149 - Attachment is obsolete: true
Attachment #8571128 - Flags: review?(mh+mozilla)
Attached patch part 2. hooks in js engine (obsolete) — Splinter Review
(In reply to Terrence Cole [:terrence] from comment #10)
> Comment on attachment 8551156 [details] [diff] [review]
> part 2. hooks in js engine
> 
> Review of attachment 8551156 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Chiming in since this seems to be a fairly JS::GC specific series.
> 
> ::: js/public/MemoryProfiler.h
> @@ +15,5 @@
> > +  public:
> > +    virtual ~NativeProfiler() {};
> > +    virtual void sampleNative(void *addr, int32_t size) = 0;
> > +    virtual void removeNative(void *addr) = 0;
> > +    virtual void reset() = 0;
> 
> Making all of these calls virtual is going to add a ton of overhead. It
> looks like there is only one consumer so maybe we could just move most of
> the code inline here?

Because codes in js/src don't know the codes or interfaces in tools/profiler, the approach here is to have hooks installable from outside. I tried tables with function pointers but they seemed to be not much faster than virtual functions which looked a little bit prettier IMHO.

Anyway, when the memory profiler is not enabled, these virtual functions won't be called.

> @@ +22,5 @@
> > +class GCHeapProfiler {
> > +  public:
> > +    virtual ~GCHeapProfiler() {};
> > +    virtual void sampleTenured(void *addr, int32_t size) = 0;
> > +    virtual void sampleNursery(void *addr, int32_t size) = 0;
> 
> Please don't use void* for GC things. Instead include js/HeapAPI.h and use
> GCCellPtr here, and below where you take GC references.

The addresses here are just plain numbers and the only operation on them is equality check. Would GCCellPtr be an overkill?

> ::: js/src/vm/MemoryProfiler.cpp
> @@ +9,5 @@
> > +
> > +mozilla::Atomic<int> MemProfiler::sGlobalSwitch;
> > +NativeProfiler *MemProfiler::sNativeProfiler;
> > +
> > +GCHeapProfiler *MemProfiler::GetGCHeapProfiler(void *addr)
> 
> Type should be on the line above:
> 
> GCHeapProfiler *
> MemProfiler::GetGCHeapProfiler(GCCellPtr cell)

Done.
Attachment #8551156 - Attachment is obsolete: true
Attachment #8571133 - Flags: review?(terrence)
Comment on attachment 8551158 [details] [diff] [review]
part 3. Monitoring allocation and gc events in nursery and tenured heaps

(In reply to Terrence Cole [:terrence] from comment #11)
> Comment on attachment 8551158 [details] [diff] [review]
> part 3. Monitoring allocation and gc events in nursery and tenured heaps
> 
> Review of attachment 8551158 [details] [diff] [review]:
> -----------------------------------------------------------------
> ::: js/src/gc/Heap.h
> @@ +1280,5 @@
> >  bool
> >  TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const
> >  {
> >      AssertValidColor(this, color);
> > +    MemProfiler::MarkTenured(reinterpret_cast<void *>(address()));
> 
> A virtual call here is going to be extremely noticable on GC benchmarks.
> What exactly are you trying to do with this count, because it's not going to
> work as is.

MemProfiler::MarkTenured is an inline static function. When the profiler is not enabled, the overheads are a memory load and a conditional jump.

When the profiler is enabled, the situation is similar to that in comment 26. A function table is not much faster.
Attachment #8551158 - Flags: review?(terrence)
(In reply to Terrence Cole [:terrence] from comment #12)
> Comment on attachment 8551159 [details] [diff] [review]
> part 4. monitoring allocations and frees for ArrayBuffer
> 
> Review of attachment 8551159 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> js/src/gc/Memory.cpp doesn't really have anything to do with
> ArrayBufferObjects; please split the Memory changes into a separate patch.
> 
> ::: js/src/gc/Memory.cpp
> @@ +711,5 @@
> >          return nullptr;
> >  
> >      buf = (uint8_t *) MapMemoryAt(buf, pa_size, PROT_READ | PROT_WRITE,
> >                                    MAP_PRIVATE | MAP_FIXED, fd, pa_start);
> > +    MemProfiler::SampleNative(buf, pa_size);
> 
> You appear to be building the code on all platforms: please add appropriate
> sampling for all the other platforms as well.

I've pulled the hooks from Memory.cpp into ArrayBufferObject.cpp.
Attachment #8551159 - Attachment is obsolete: true
Attachment #8571143 - Flags: review?(terrence)
Attachment #8571143 - Flags: review?(sphink)
(In reply to Terrence Cole [:terrence] from comment #11)
> Comment on attachment 8551158 [details] [diff] [review]
> part 3. Monitoring allocation and gc events in nursery and tenured heaps
> 
> Review of attachment 8551158 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As Emanuel already mentioned, this misses both of the JIT allocators. Rather
> than adding new instructions to the allocation paths, you'll want to discard
> the jit code and ensure that when we recompile we do the allocation
> out-of-line. I'd add another condition to the TraceEnabled check in
> checkAllocatorState:
> https://dxr.mozilla.org/mozilla-central/source/js/src/jit/MacroAssembler.
> cpp#620.

Thank you. The patch is basically:

-    if (js::gc::TraceEnabled())
+    if (js::gc::TraceEnabled() || MemProfiler::enabled())
Attachment #8571147 - Flags: review?(terrence)
(In reply to Benoit Girard (On Leave Until June 1st!) from comment #42)
> (In reply to Ting-Yuan Huang from comment #41)
> > (In reply to Benoit Girard (:BenWa) from comment #21)
> > > you're looking for is 'profiler_get_backtrace'. If you only want the
> > 
> > profiler_get_backtrace returns a ProfilerBacktrace of which the only
> > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter&
> > b). This can be slow in my use cases because I have to parse the streams
> > back into lists of function names and attributes. Would you mind if I add a
> > public member function to it to return the list directly?
> 
> Yes please, you should add something to GeckoProfiler.h that takes the
> ProfilerBacktrace object and returns a list of struct POD that contains the
> data you need.

Thanks, I've added a new API to get backtrace without allocating memory in profiler:

void profiler_get_backtrace_noalloc(char *output, size_t outputSize);
Attachment #8571176 - Flags: review?(bgirard)
Attached patch part 7. xpcom interface (obsolete) — Splinter Review
> ::: tools/profiler/nsIMemoryProfiler.idl
> @@ +4,5 @@
> > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> > +
> > +#include "nsISupports.idl"
> > +
> > +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler)
> 
> If this is means as documentation you should write a better comment for
> this. If it's just a personal note you should delete it.
> 
> This class could use documentation.
> 
> can you get someone else to review .idl? I'm not qualified to do that.

Hi Smaug, would you mind to review this? Thanks.
Attachment #8551165 - Attachment is obsolete: true
Attachment #8571199 - Flags: review?(bugs)
(In reply to Benoit Girard (On Leave Until June 1st!) from comment #21)
> Comment on attachment 8551165 [details] [diff] [review]
> part 5. tracking the memory events
> 
> Review of attachment 8551165 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Just a high level check, I'll finish tomorrow
> 
> ::: tools/profiler/MemoryProfiler.cpp
> @@ +40,5 @@
> > +
> > +// gActiveProfiler is the key to a TLS variable which tracks active profiler on
> > +// each thread. nobody is a sentinal indicating there's no profile active.
> > +static unsigned int gActiveProfiler;
> > +static int nobody;
> 
> Please use mfbt/ThreadLocal.h. It deals with the nobody case better, has
> better typing and the API is better.

A customized allocator to STL containers is used instead of TLS tricks, which is too complicated and error prone.

> @@ +52,5 @@
> > +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler;
> > +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime;
> > +
> > +// Get the backtrace in PseudoStack.
> > +static std::vector<const char *> SPSGetStacktrace()
> 
> This is an implementation detail of the profiler that doesn't belong here.
> GeckoProfiler.h is the public interface that should be used. The method
> you're looking for is 'profiler_get_backtrace'. If you only want the
> pseudostack you should introduce a flag to profiler_get_backtrace to collect
> just the pseudostack. Otherwise the method should a stack containing all the
> information available.

Done in part 6.

> @@ +399,5 @@
> > +
> > +  virtual void sampleNative(void *addr, int32_t size) {
> > +    AutoMPLock lock(lock_, this);
> > +    cumulated += size;
> > +    if (cumulated >= mSampleSize) {
> 
> This will give a bias, see my write up on this in my next comment.

Done. Re-implemented using a geometric random number generator. It is not in c++11 style because there's no such things in stlport.

> ::: tools/profiler/moz.build
> @@ +8,5 @@
> >      FAIL_ON_WARNINGS = True
> >  
> >      XPIDL_MODULE = 'profiler'
> >      XPIDL_SOURCES += [
> > +        'nsIMemoryProfiler.idl',
> 
> As far as I can tell this doesn't integrate with the profiler at all. The
> only thing it does is use the pseudostack which is mean to be provided by a
> public API.
> 
> If the memory profiler doesn't integrate with the profiler it should live
> outside of the profiler, or at the very least in a separate subdir. This
> will prevent the project from developing cross dependencies.

Done. The memory profiler is now factored out to tools/memory-profiler.
Attachment #8571200 - Flags: review?(bgirard)
Comment on attachment 8551170 [details] [diff] [review]
part 9. Interface to Add-Ons

(under testing. will file a review request later.)
Attachment #8551170 - Attachment description: part 6. Interface to Add-Ons → part 9. Interface to Add-Ons
Comment on attachment 8571199 [details] [diff] [review]
part 7. xpcom interface

>+++ b/tools/memory-profiler/MemoryProfiler.cpp
>@@ -0,0 +1,46 @@
>+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
>+/* This Source Code Form is subject to the terms of the Mozilla Public
>+ * License, v. 2.0. If a copy of the MPL was not distributed with this file,
>+ * You can obtain one at http://mozilla.org/MPL/2.0/. */
>+
>+#include "MemoryProfiler.h"
>+#include "nsIDOMClassInfo.h"
>+#include "nsIGlobalObject.h"
>+#include "js/TypeDecls.h"
>+#include "xpcprivate.h"
It is not clear to me why you need to include all these headers.
Would be nice to avoid nsIDOMClassInfo.h, js/TypeDecls.h and  xpcprivate.h
>+MemoryProfiler::GetResults(JSContext *cx, JS::MutableHandle<JS::Value> aResult)
JSContext* aCx

>+++ b/tools/memory-profiler/nsMemoryProfilerFactory.cpp
>@@ -0,0 +1,30 @@
>+/* -*- Mode: C++; tab-width: 20; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
2 spaces, not 4

>+static const mozilla::Module::CIDEntry kMemoryProfilerCIDs[] = {
>+    { &kMEMORY_PROFILER_CID, false, nullptr, MemoryProfilerConstructor },
>+    { nullptr }
>+};
>+
>+static const mozilla::Module::ContractIDEntry kMemoryProfilerContracts[] = {
>+    { MEMORY_PROFILER_CONTRACT_ID, &kMEMORY_PROFILER_CID },
>+    { nullptr }
>+};
>+
>+static const mozilla::Module kMemoryProfilerModule = {
>+    mozilla::Module::kVersion,
>+    kMemoryProfilerCIDs,
>+    kMemoryProfilerContracts
>+};
So use 2 spaces for indentation
Attachment #8571199 - Flags: review?(bugs) → review+
Comment on attachment 8551158 [details] [diff] [review]
part 3. Monitoring allocation and gc events in nursery and tenured heaps

(In reply to Ting-Yuan Huang from comment #54)
> Comment on attachment 8551158 [details] [diff] [review]
> part 3. Monitoring allocation and gc events in nursery and tenured heaps
> 
> (In reply to Terrence Cole [:terrence] from comment #11)
> > Comment on attachment 8551158 [details] [diff] [review]
> > part 3. Monitoring allocation and gc events in nursery and tenured heaps
> > 
> > Review of attachment 8551158 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > ::: js/src/gc/Heap.h
> > @@ +1280,5 @@
> > >  bool
> > >  TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const
> > >  {
> > >      AssertValidColor(this, color);
> > > +    MemProfiler::MarkTenured(reinterpret_cast<void *>(address()));
> > 
> > A virtual call here is going to be extremely noticable on GC benchmarks.
> > What exactly are you trying to do with this count, because it's not going to
> > work as is.
> 
> MemProfiler::MarkTenured is an inline static function. When the profiler is
> not enabled, the overheads are a memory load and a conditional jump.

You've still doubled the number of loads here and added a jump where there wasn't one before -- this would totally be visible in the allocation path, so I'd expect it to be visible in the GC path as well. How does this affect your scores on Octane, in particular Splay? Moreover, why do you need to know the number of times that a mark bit was set, rather than just the number of mark bits that have been set? Would it be possible to post-process the mark bitmaps at the end of each slice to get similar information?
 
> When the profiler is enabled, the situation is similar to that in comment
> 26. A function table is not much faster.

Still, this is going to (potentially significantly) distort the proportion of time spent in gc marking. Whether that's important depends what you're doing with the numbers, but since you are sampling, I'd think that distortion would actually be pretty important.
Attachment #8551158 - Flags: review?(terrence)
Comment on attachment 8571143 [details] [diff] [review]
part 4. monitoring allocations and frees for ArrayBuffer

Review of attachment 8571143 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine on a quick skim
Attachment #8571143 - Flags: review?(sphink) → review+
(In reply to Terrence Cole [:terrence] from comment #62)
> Comment on attachment 8551158 [details] [diff] [review]
> part 3. Monitoring allocation and gc events in nursery and tenured heaps
> 
> (In reply to Ting-Yuan Huang from comment #54)
> > Comment on attachment 8551158 [details] [diff] [review]
> > part 3. Monitoring allocation and gc events in nursery and tenured heaps
> > 
> > (In reply to Terrence Cole [:terrence] from comment #11)
> > > Comment on attachment 8551158 [details] [diff] [review]
> > > part 3. Monitoring allocation and gc events in nursery and tenured heaps
> > > 
> > > Review of attachment 8551158 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > ::: js/src/gc/Heap.h
> > > @@ +1280,5 @@
> > > >  bool
> > > >  TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const
> > > >  {
> > > >      AssertValidColor(this, color);
> > > > +    MemProfiler::MarkTenured(reinterpret_cast<void *>(address()));
> > > 
> > > A virtual call here is going to be extremely noticable on GC benchmarks.
> > > What exactly are you trying to do with this count, because it's not going to
> > > work as is.
> > 
> > MemProfiler::MarkTenured is an inline static function. When the profiler is
> > not enabled, the overheads are a memory load and a conditional jump.
> 
> You've still doubled the number of loads here and added a jump where there
> wasn't one before -- this would totally be visible in the allocation path,
> so I'd expect it to be visible in the GC path as well. How does this affect
> your scores on Octane, in particular Splay? Moreover, why do you need to

I only tested 6 times for each and Splay seems to be 3% ~ 5% slower with my patch.

> know the number of times that a mark bit was set, rather than just the
> number of mark bits that have been set? Would it be possible to post-process
> the mark bitmaps at the end of each slice to get similar information?
>  

Yes, tracking how many times each bit is marked is an overkill. The memory profiler maintains a table recording a subset of what's allocated. It resembles memory management so it has to mark and sweep, too.

When I'm looking for a place to do the post-process that you mentioned, it seems to me that every objects, whether it has a finalizer or not, are poisoned (in other words, freed) in Arena::finalize(). By tracing where isMarked() is used, it also looks like a good place to do the post-process. Could I just move MemProfiler::MarkTenured here and maybe duplicate the loop and check if the memory profiler is enabled outside the loop (loop unswitching)? 

> > When the profiler is enabled, the situation is similar to that in comment
> > 26. A function table is not much faster.
> 
> Still, this is going to (potentially significantly) distort the proportion
> of time spent in gc marking. Whether that's important depends what you're
> doing with the numbers, but since you are sampling, I'd think that
> distortion would actually be pretty important.

Yes, it will distort the speed/time behavior but not (at least not directly) memory footprint.
(In reply to Ting-Yuan Huang from comment #64)
> > > MemProfiler::MarkTenured is an inline static function. When the profiler is
> > > not enabled, the overheads are a memory load and a conditional jump.
> > 
> > You've still doubled the number of loads here and added a jump where there
> > wasn't one before -- this would totally be visible in the allocation path,
> > so I'd expect it to be visible in the GC path as well. How does this affect
> > your scores on Octane, in particular Splay? Moreover, why do you need to
> 
> I only tested 6 times for each and Splay seems to be 3% ~ 5% slower with my
> patch.

That massive of a regression would get backed out very quickly.

> > know the number of times that a mark bit was set, rather than just the
> > number of mark bits that have been set? Would it be possible to post-process
> > the mark bitmaps at the end of each slice to get similar information?
> >  
> 
> Yes, tracking how many times each bit is marked is an overkill. The memory
> profiler maintains a table recording a subset of what's allocated. It
> resembles memory management so it has to mark and sweep, too.
> 
> When I'm looking for a place to do the post-process that you mentioned, it
> seems to me that every objects, whether it has a finalizer or not, are
> poisoned (in other words, freed) in Arena::finalize(). By tracing where
> isMarked() is used, it also looks like a good place to do the post-process.
> Could I just move MemProfiler::MarkTenured here and maybe duplicate the loop
> and check if the memory profiler is enabled outside the loop (loop
> unswitching)? 

That sounds fine.

> > > When the profiler is enabled, the situation is similar to that in comment
> > > 26. A function table is not much faster.
> > 
> > Still, this is going to (potentially significantly) distort the proportion
> > of time spent in gc marking. Whether that's important depends what you're
> > doing with the numbers, but since you are sampling, I'd think that
> > distortion would actually be pretty important.
> 
> Yes, it will distort the speed/time behavior but not (at least not directly)
> memory footprint.

My experience with transitioning to generational GC is that this sort of "minor" distortion can actually have some extremely surprising and widespread performance effects. I guess it will depend on what you're trying to get out of it.
(In reply to Terrence Cole [:terrence] from comment #65)
> That massive of a regression would get backed out very quickly.

Thanks. After moving MemProfiler::MarkTenured into the finalizer, the slowdown becomes unnoticeable.

> My experience with transitioning to generational GC is that this sort of
> "minor" distortion can actually have some extremely surprising and
> widespread performance effects. I guess it will depend on what you're trying
> to get out of it.

One solution might be to embed callbacks in the finalizers of sampled objects. I'm still tracing the codes and can't be sure but it looks like that not every object has a finalizer?

If the above is not possible, the best improvement I can think of is to fine tune the implementation tracking the allocations. 

Similar tools such as hprof (for Java) suffer from performance degradations, too. In our case Splay is 4 times slower when the profiler is on. Perhaps we can warn the users about the implications in documentations and let them interpret what the profiles really mean.
Attachment #8551158 - Attachment is obsolete: true
Attachment #8572481 - Flags: review?(terrence)
Attachment #8571176 - Flags: review?(bgirard) → review?(mstange)
Attachment #8571200 - Flags: review?(bgirard) → review?(mstange)
No longer blocks: 1129249
Comment on attachment 8571176 [details] [diff] [review]
part 6. A new API to get backtrace without allocating memory in profiler

Review of attachment 8571176 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform.cpp
@@ +1038,5 @@
> +  char *bound = output + outputSize - 2;
> +  output[0] = output[1] = '\0';
> +  PseudoStack *pseudoStack = tlsPseudoStack.get();
> +  if (!pseudoStack)
> +    return;

{} please
Attachment #8571176 - Flags: review?(mstange) → review+
Comment on attachment 8571200 [details] [diff] [review]
part 8. tracking the memory events

Review of attachment 8571200 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to wait with this review until the patches it depends on have been reviewed.
Comment on attachment 8571133 [details] [diff] [review]
part 2. hooks in js engine

Review of attachment 8571133 [details] [diff] [review]:
-----------------------------------------------------------------

I guess Jim isn't going to be able to get a similar memory profiler in your hands as quickly as he hoped. Let's move forward here under the condition that we will reimplement this on top of his primitives when they become available.

I'd like to see another copy of this with the changes applied.

::: js/public/MemoryProfiler.h
@@ +1,5 @@
> +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*-
> + * vim: set ts=8 sts=4 et sw=4 tw=99:
> + * This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

Instead of adding a new file to js/public/, where it will be exposed as part of our public facing interfaces, please move all of the code in MemoryProfiler.h to jsfriendapi.h.

@@ +10,5 @@
> +#include "mozilla/Atomics.h"
> +
> +struct JSRuntime;
> +
> +class NativeProfiler {

{ on newline.

@@ +13,5 @@
> +
> +class NativeProfiler {
> +  public:
> +    virtual ~NativeProfiler() {};
> +    virtual void sampleNative(void *addr, uint32_t size) = 0;

SpiderMonkey style now has the * next to the type. There is a script in Bug 1144366 to convert patches automatically.

@@ +18,5 @@
> +    virtual void removeNative(void *addr) = 0;
> +    virtual void reset() = 0;
> +};
> +
> +class GCHeapProfiler {

{ on newline.

@@ +31,5 @@
> +    virtual void moveNurseryToTenured(void *addrOld, void *addrNew) = 0;
> +    virtual void reset() = 0;
> +};
> +
> +class MemProfiler {

{ on newline.

::: js/src/vm/MemoryProfiler.cpp
@@ +1,5 @@
> +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*-
> + * vim: set ts=8 sts=4 et sw=4 tw=99:
> + * This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

MemoryProfiler.cpp should be js/src/gc/.

@@ +30,5 @@
> +    return &runtime->mMemProfiler;
> +}
> +
> +void
> +MemProfiler::start(GCHeapProfiler *aGCHeapProfiler) {

{ on newline.

@@ +37,5 @@
> +    sActiveProfilerCount++;
> +}
> +
> +void
> +MemProfiler::stop() {

{ on newline.

::: js/src/vm/Runtime.h
@@ +1423,5 @@
>       * function to assess the size of malloc'd blocks of memory.
>       */
>      mozilla::MallocSizeOf debuggerMallocSizeOf;
> +
> +    MemProfiler mMemProfiler;

The MemoryProfiler instance belongs on the GCRuntime structure in js/src/gc/GCRuntime.h.
Attachment #8571133 - Flags: review?(terrence)
Comment on attachment 8572481 [details] [diff] [review]
part 3. Monitoring allocation and gc events in nursery and tenured heaps

Review of attachment 8572481 [details] [diff] [review]:
-----------------------------------------------------------------

Please realize that despite having negligable performance impact on the platform you tested on, you may still be backed out for regressing octane on other platforms, including small regressions on octane's sub-tests that look like noise over the course of a handful of runs.

::: js/src/gc/Heap.h
@@ +497,5 @@
>              return nullptr;
>          }
>          head.checkSpan(thingSize);
>          JS_EXTRA_POISON(reinterpret_cast<void *>(thing), JS_ALLOCATED_TENURED_PATTERN, thingSize);
> +        MemProfiler::SampleTenured(reinterpret_cast<void *>(thing), thingSize);

<void*>

::: js/src/gc/Nursery.cpp
@@ +207,5 @@
>      void *thing = (void *)position();
>      position_ = position() + size;
>  
>      JS_EXTRA_POISON(thing, JS_ALLOCATED_NURSERY_PATTERN, size);
> +    MemProfiler::SampleNursery(reinterpret_cast<void *>(thing), size);

<void*>

::: js/src/jsgc.cpp
@@ +483,5 @@
>      size_t nmarked = 0;
>  
> +    if (MOZ_UNLIKELY(MemProfiler::enabled())) {
> +        for (ArenaCellIterUnderFinalize i(&aheader); !i.done(); i.next()) {
> +            T *t = i.get<T>();

T* t

@@ +485,5 @@
> +    if (MOZ_UNLIKELY(MemProfiler::enabled())) {
> +        for (ArenaCellIterUnderFinalize i(&aheader); !i.done(); i.next()) {
> +            T *t = i.get<T>();
> +            if (t->asTenured().isMarked()) {
> +                MemProfiler::MarkTenured(reinterpret_cast<void *>(t));

<void*>
Attachment #8572481 - Flags: review?(terrence) → review+
Attachment #8571143 - Flags: review?(terrence) → review+
Attachment #8571147 - Flags: review?(terrence) → review+
Is this waiting for my review now? Do the patches still apply?
(In reply to Markus Stange [:mstange] from comment #71)
> Is this waiting for my review now? Do the patches still apply?

Yes, please go ahead. All other dependent patches have been reviewed. The patches need some small rebases but are still in good shape. I'll publish a rebased patchset shortly.
Assignee: nobody → kchen
Attachment #8571128 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8571128 - Flags: review?(mh+mozilla)
Attachment #8603370 - Flags: review?(mh+mozilla)
I have to move some code back to header files because they break terribly the unified build.
Attachment #8571133 - Attachment is obsolete: true
Attachment #8603374 - Flags: review?(terrence)
rebase
Attachment #8572481 - Attachment is obsolete: true
Attachment #8603375 - Flags: review+
rebase
Attachment #8571143 - Attachment is obsolete: true
Attachment #8603376 - Flags: review+
rebase
Attachment #8571147 - Attachment is obsolete: true
Attachment #8603377 - Flags: review+
rebase
Attachment #8571176 - Attachment is obsolete: true
Attachment #8603379 - Flags: review+
rebase
Attachment #8571199 - Attachment is obsolete: true
Attachment #8603380 - Flags: review+
rebase
Attachment #8571200 - Attachment is obsolete: true
Attachment #8571200 - Flags: review?(mstange)
Attachment #8603383 - Flags: review?(mstange)
TODO: Need to add some tests
Attachment #8551170 - Attachment is obsolete: true
Attachment #8603384 - Flags: feedback+
Comment on attachment 8603380 [details] [diff] [review]
Part 7. XPCOM interface for memory profiler

Review of attachment 8603380 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/toolkit.mozbuild
@@ +125,5 @@
>      DIRS += ['/tools/jprof']
>  
>  DIRS += [
>      '/tools/profiler',
> +    '/tools/memory-profiler',

Any reason this is not in /tools/profiler/memory?
Comment on attachment 8603383 [details] [diff] [review]
Part 8. Tracking the memory events

Review of attachment 8603383 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/memory-profiler/MemoryProfiler.cpp
@@ +12,3 @@
>  #include "nsIDOMClassInfo.h"
>  #include "nsIGlobalObject.h"
> +#include "replace_malloc_bridge.h"

This will fail to build without replace-malloc, which builds on aurora, beta, and release are.
(In reply to Mike Hommey [:glandium] from comment #82)
> Comment on attachment 8603380 [details] [diff] [review]
> Part 7. XPCOM interface for memory profiler
> 
> Review of attachment 8603380 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/toolkit.mozbuild
> @@ +125,5 @@
> >      DIRS += ['/tools/jprof']
> >  
> >  DIRS += [
> >      '/tools/profiler',
> > +    '/tools/memory-profiler',
> 
> Any reason this is not in /tools/profiler/memory?

See comment 20. We chose to use a new directory instead of a subdir. Either is fine with me.
Comment on attachment 8603374 [details] [diff] [review]
Part 2. MemoryProfiler hooks in js engine

Review of attachment 8603374 [details] [diff] [review]:
-----------------------------------------------------------------

Bug 1163643 should have fixed the unified bustage. Please rebase over that commit first. If further changes are required, please split them into a separate patch under this one. I'd like another look at this without the unified fixes.

::: js/src/gc/MemoryProfiler.cpp
@@ +8,5 @@
> +
> +#include "vm/Runtime.h"
> +
> +mozilla::Atomic<int> MemProfiler::sActiveProfilerCount;
> +NativeProfiler *MemProfiler::sNativeProfiler;

Spidermonkey style, as with the rest of the tree now, has the * with the type. You can use the script in bug 1144366 to fix the patch automatically.
Attachment #8603374 - Flags: review?(terrence)
Comment on attachment 8603370 [details] [diff] [review]
Part 1. CountAlloc in ReplaceAlloc

Sorry about the delay. I needed some time to think about it, because I possibly wanted to kill 2 birds with 1 stone. And I just did, in bug 1168719. You can replace your uses of CountAllocRegisterHook with RegisterHook. You need to give it a code name as first argument, a nullptr as second argument, and a malloc_hook_table_t pointer as third argument. That table may contain pointers for hooks for all kinds of malloc methods, but you can, in your case, just use hooks for malloc and free. So fill malloc_hook and free_hook with functions with the following respective signatures:
- void* (*malloc_hook)(void*, size_t)
- void (*free_hook)(void*)

(your malloc_hook needs to return the pointer it's given)
Attachment #8603370 - Flags: review?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #86)
> Comment on attachment 8603370 [details] [diff] [review]
> Part 1. CountAlloc in ReplaceAlloc
> 
> Sorry about the delay. I needed some time to think about it, because I
> possibly wanted to kill 2 birds with 1 stone. And I just did, in bug
> 1168719. You can replace your uses of CountAllocRegisterHook with
> RegisterHook. You need to give it a code name as first argument, a nullptr
> as second argument, and a malloc_hook_table_t pointer as third argument.
> That table may contain pointers for hooks for all kinds of malloc methods,
> but you can, in your case, just use hooks for malloc and free. So fill
> malloc_hook and free_hook with functions with the following respective
> signatures:
> - void* (*malloc_hook)(void*, size_t)
> - void (*free_hook)(void*)
> 
> (your malloc_hook needs to return the pointer it's given)

It doesn't look like we could unregister the hook? I need that to enable/disable the profiler at runtime. I could workaround that by set malloc_hook to a noop function but it will introduce unnecessary slowdown.
Flags: needinfo?(mh+mozilla)
(In reply to Kan-Ru Chen [:kanru] from comment #87)
> It doesn't look like we could unregister the hook? I need that to
> enable/disable the profiler at runtime. I could workaround that by set
> malloc_hook to a noop function but it will introduce unnecessary slowdown.

New version of the patch allows to unregister by using a null table pointer. Can you test it before I land it?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #88)
> (In reply to Kan-Ru Chen [:kanru] from comment #87)
> > It doesn't look like we could unregister the hook? I need that to
> > enable/disable the profiler at runtime. I could workaround that by set
> > malloc_hook to a noop function but it will introduce unnecessary slowdown.
> 
> New version of the patch allows to unregister by using a null table pointer.
> Can you test it before I land it?

Thanks, it works!
Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r?terrence
Attachment #8615220 - Flags: review?(terrence)
Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink
Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
Bug 1123237 - Part 8. Tracking the memory events. r?mstange
Attachment #8615226 - Flags: review?(mstange)
Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r?jimb
Attachment #8615227 - Flags: review?(jimb)
Attachment #8603370 - Attachment is obsolete: true
Attachment #8603374 - Attachment is obsolete: true
Attachment #8603375 - Attachment is obsolete: true
Attachment #8603376 - Attachment is obsolete: true
Attachment #8603377 - Attachment is obsolete: true
Attachment #8603379 - Attachment is obsolete: true
Attachment #8603380 - Attachment is obsolete: true
Attachment #8603383 - Attachment is obsolete: true
Attachment #8603383 - Flags: review?(mstange)
Attachment #8603384 - Attachment is obsolete: true
Comment on attachment 8615227 [details]
MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r?jimb

https://reviewboard.mozilla.org/r/10213/#review9059

::: toolkit/devtools/server/tests/mochitest/memprof-helpers.js:56
(Diff revision 1)
> +function waitForTime(ms) {

Is this function used?
Attachment #8615227 - Flags: review?(jimb) → review+
Attachment #8615226 - Flags: review?(mstange) → review?(bgirard)
Attachment #8615226 - Flags: review?(bgirard)
Comment on attachment 8615226 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?mstange

https://reviewboard.mozilla.org/r/10211/#review9219

::: tools/memory-profiler/MemoryProfiler.cpp:40
(Diff revision 1)
> +class uncensored_allocator {

Seperate file?

::: tools/memory-profiler/MemoryProfiler.cpp:52
(Diff revision 1)
> +  uncensored_allocator(const uncensored_allocator&) noexcept { }

explit constructor. Are these 3 constructors really needed? At a quick glance it doesn't look like they are. If they are really needed that's fine.

::: tools/memory-profiler/MemoryProfiler.cpp:140
(Diff revision 1)
> +struct TrieNode {

This should probably be moved to it's own file. You're mixing data structures and memory profiler logic here.

::: tools/memory-profiler/MemoryProfiler.cpp:133
(Diff revision 1)
> +  for (const char* p = output; *p; p += strlen(p) + 1)

Put the braces around single line for/if:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#Control_Structures

Here and the rest of the patch.

::: tools/memory-profiler/MemoryProfiler.cpp:123
(Diff revision 1)
> +static u_unordered_map<JSRuntime*, GCHeapProfilerImpl*>* gRuntimeToGCHeapProfiler;

Might be worth introducing a struct if you plan on mapping more things from JSRuntime.

::: tools/memory-profiler/MemoryProfiler.cpp:258
(Diff revision 1)
> +  static bool initialized = false;

This is not thread safe. Assert which thread should be calling this or don't use 'static' in this context.

::: tools/memory-profiler/MemoryProfiler.cpp:266
(Diff revision 1)
> +class GCHeapProfilerImpl : public GCHeapProfiler {

I'm not familiar with our GC. You should have someone who is review this.

::: tools/memory-profiler/MemoryProfiler.cpp:306
(Diff revision 1)
> +      AllocEvent ai(mTraceTable.insert(trace), nSamples * mSampleSize, PR_Now());

In my performance data I found it super useful to include the time of the time. The UI has to do something useful with it. Just something to consider. A lot of queries you want to make are tend to be based on time. But perhaps this only applies for performance data.

::: tools/memory-profiler/MemoryProfiler.cpp:326
(Diff revision 1)
> +    PR_DestroyLock(lock_);

This is racy. Can you fix this or put a big comment as to what is going on and why this race is acceptable?

Are you meaning to do a sanity check here? If so it should be in a (runtime perhaps) assert style check instead.

::: tools/memory-profiler/MemoryProfiler.cpp:455
(Diff revision 1)
> +    PR_DestroyLock(lock_);

Same. Can you turn this into an assert? You shouldn't have the lock held here. You should contact and stop the consumers before destroying the object.

::: tools/memory-profiler/MemoryProfiler.cpp:581
(Diff revision 1)
> +// Cautious: this is not thread safe upon first time entry.

Maybe you should restrict that this is first called on the Gecko Main thread? Then you can add an assert for the main thread in here.

::: tools/memory-profiler/MemoryProfiler.cpp:603
(Diff revision 1)
> +    if (gRefCnt == 0) {

gRefCnt could use a better name. It's the name of profiled runtime. Maybe gProfileRuntimeCount.

::: tools/memory-profiler/MemoryProfiler.cpp:652
(Diff revision 1)
> +  if (gRefCnt == 0) {

I'm not sure what this reset does? Shouldn't it decrement gRefCnt and maybe run DisableMallocHook like Stop?

::: tools/memory-profiler/MemoryProfiler.cpp:718
(Diff revision 1)
> +  // Getting results when the profiler is running is not allowed.

That's a pretty strict restriction IMO. The profiler will set a 'mask' bit and not gather data when running.
There's also a SplayTree implementation in MFBT
Attachment #8615220 - Flags: review?(terrence)
Comment on attachment 8615220 [details]
MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink

https://reviewboard.mozilla.org/r/10199/#review9903

r=me

Sorry for the delay; I haven't used ReviewBoard before.
Blocks: 1059139
https://reviewboard.mozilla.org/r/10211/#review9219

> I'm not sure what this reset does? Shouldn't it decrement gRefCnt and maybe run DisableMallocHook like Stop?

Reset is expected to be called after Stop. The profiler could be stopped and resumed (start) again. When the profiler is stopped, caller could query the results and/or reset the profiler.

> That's a pretty strict restriction IMO. The profiler will set a 'mask' bit and not gather data when running.

I'm not sure. That's how the system was designed. Maybe we could loose the restriction?

Maybe we could stop all the profilers while getting the results and resume them after that. Anyway I would like to fix this later if time permits or someone want to take it.
Comment on attachment 8615218 [details]
MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence
Attachment #8615218 - Attachment description: MozReview Request: fix mochitest → MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence
Comment on attachment 8615219 [details]
MozReview Request: Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
Attachment #8615219 - Attachment description: MozReview Request: Bug 1168719 - Generic replace-malloc library → MozReview Request: Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
Attachment #8615220 - Attachment description: MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r?terrence → MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink
Comment on attachment 8615220 [details]
MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink

Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink
Comment on attachment 8615221 [details]
MozReview Request: Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence

Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
Attachment #8615221 - Attachment description: MozReview Request: Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence → MozReview Request: Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
Comment on attachment 8615222 [details]
MozReview Request: Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange

Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
Attachment #8615222 - Attachment description: MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink → MozReview Request: Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
Comment on attachment 8615223 [details]
MozReview Request: Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug

Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
Attachment #8615223 - Attachment description: MozReview Request: Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence → MozReview Request: Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
Comment on attachment 8615224 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence
Attachment #8615224 - Attachment description: MozReview Request: Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange → MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence
Attachment #8615224 - Flags: review?(terrence)
Attachment #8615224 - Flags: review?(bgirard)
Comment on attachment 8615225 [details]
MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb

Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb
Attachment #8615225 - Attachment description: MozReview Request: Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug → MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb
Attachment #8615226 - Attachment is obsolete: true
Attachment #8615227 - Attachment is obsolete: true
https://reviewboard.mozilla.org/r/10193/#review13977

terrence, could you review the parts related to JS GCHeap? Especially GCHeapProfilerImpl, thanks.
Attachment #8615224 - Flags: review?(terrence)
Attachment #8615224 - Flags: review?(bgirard)
Comment on attachment 8615218 [details]
MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615219 [details]
MozReview Request: Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615220 [details]
MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink

Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615221 [details]
MozReview Request: Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence

Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615222 [details]
MozReview Request: Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange

Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615223 [details]
MozReview Request: Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug

Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615224 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Attachment #8615224 - Flags: review?(terrence)
Attachment #8615224 - Flags: review?(bgirard)
Comment on attachment 8615225 [details]
MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb

Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615224 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

https://reviewboard.mozilla.org/r/10207/#review14187

r+ with the following changes.

::: tools/memory-profiler/CompactTraceTable.h:53
(Diff revision 3)
> +template<typename T>

This class needs a comment, the name on its own isn't clear.

::: tools/memory-profiler/GCHeapProfilerImpl.cpp:73
(Diff revision 3)
> +  }

Is it safe to assert that the new mTenuredEntriesFG is empty?

::: tools/memory-profiler/GCHeapProfilerImpl.cpp:79
(Diff revision 3)
> +  // Profiler can be enabled during marking.

Is this comment trying to explain why |AutoMPLock lock(mLock);| isn't called here? If so I don't really understand why this would be safe.

::: tools/memory-profiler/MemoryProfiler.h:27
(Diff revision 3)
> +class MemoryProfiler final : public nsIMemoryProfiler

This feature is really complicated and could use a nice block comment documenting it. Either something hanging off MemoryProfiler, top fo this header or nsIMemoryProfiler.idl.

::: tools/memory-profiler/MemoryProfiler.h:40
(Diff revision 3)
> +  int64_t mTimestamp;

ideally this should use TimeStamp unless allocation is a problem.

::: tools/memory-profiler/MemoryProfiler.h:55
(Diff revision 3)
> +  int mEventIdx : 31;

signed 31 bits feels a bit low for a long running system watching memory events. Most likely enough but feels like its getting closed. You should maybe consider catching the overflow case.

::: tools/memory-profiler/MemoryProfiler.h:94
(Diff revision 3)
> + * want to allocate memories.

memory*

::: tools/memory-profiler/MemoryProfiler.cpp:47
(Diff revision 3)
> +static int gProfileRuntimeCount;

I don't see you initializing this?

::: tools/memory-profiler/MemoryProfiler.cpp:54
(Diff revision 3)
> +  : mSampleSize(65536)

I'd imagine this is something you're going to want to expose in a follow-up since it gives you a memory logger instead of a sampler.

::: tools/memory-profiler/MemoryProfiler.cpp:85
(Diff revision 3)
> + * The sampler generates a random variable which conforms to a geometric

Great comment. IMO this should go in the header instead.

::: tools/memory-profiler/UncensoredAllocator.h:98
(Diff revision 3)
> +using u_unordered_map = std::unordered_map;

Be careful, for the gecko profiler we noticed that unordered_map performed much worse than the mozilla hash map.

Actually I don't think stlport supports unordered_map unless I'm misremebering.
Attachment #8615224 - Flags: review?(bgirard) → review+
https://reviewboard.mozilla.org/r/10207/#review14187

> signed 31 bits feels a bit low for a long running system watching memory events. Most likely enough but feels like its getting closed. You should maybe consider catching the overflow case.

I changed it to unsigned 31 bits, should be enough.

> Be careful, for the gecko profiler we noticed that unordered_map performed much worse than the mozilla hash map.
> 
> Actually I don't think stlport supports unordered_map unless I'm misremebering.

stlport has unordered_map under the tr1 namespace. Ideally I also like to use nsTHashtable but it doesn't support custom alloctor. I'm not sure if we want to add that support to nsTHashtable.
https://reviewboard.mozilla.org/r/10193/#review14275

::: js/src/gc/Nursery.cpp:554
(Diff revision 3)
> +    MemProfiler::SweepNursery(rt);

Please move this to Nursery::sweep.

::: js/src/jsgc.cpp:496
(Diff revision 3)
> +            if (t->asTenured().isMarked()) {
> +                MemProfiler::MarkTenured(reinterpret_cast<void*>(t));
> +            }

Style nit: No {} around a single-line block.

::: js/src/jsgc.cpp:5508
(Diff revision 3)
> +    MemProfiler::SweepTenured(rt);

I'm not sure what the SweepTenured event is supposed to indicate, but this isn't near either sweeping start or sweeping end. It also doesn't account for the fact that we do most sweeping off-main-thread.
https://reviewboard.mozilla.org/r/10193/#review14275

> I'm not sure what the SweepTenured event is supposed to indicate, but this isn't near either sweeping start or sweeping end. It also doesn't account for the fact that we do most sweeping off-main-thread.

It should indicate sweeping end. Where should I put this if not here? The profiler methods are thread-safe.
Attachment #8615218 - Flags: review?(terrence)
Comment on attachment 8615218 [details]
MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615224 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Comment on attachment 8615218 [details]
MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Attachment #8615218 - Flags: review?(terrence)
Comment on attachment 8615224 [details]
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence

Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Attachment #8615224 - Flags: review?(terrence)
Attachment #8649816 - Flags: review?(terrence) → review+
Comment on attachment 8649817 [details] [diff] [review]
0007-Bug-1123237-Part-8.-Tracking-the-memory-events.-r-Be.patch

Review of attachment 8649817 [details] [diff] [review]:
-----------------------------------------------------------------

This code isn't really in my module, so I can't comment on the style. The mechanism seems fine though.
Attachment #8649817 - Flags: review?(terrence) → review+
This has caused build bustage.
I want to use the SwapElements method on derived classes.
Attachment #8652835 - Flags: review?(nfroyd)
I gave up on using unordered_map and other STL data structures. With this new patch I use a ThreadLocal variable to control whether native profiler is enabled or not. I also removed other STL usage.
Attachment #8652837 - Flags: review?(bgirard)
Comment on attachment 8652835 [details] [diff] [review]
0001-Bug-1123237-Part-10.-Expose-SwapElements-from-nsBase.patch

Review of attachment 8652835 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/glue/nsBaseHashtable.h
@@ +299,5 @@
>  
> +  /**
> +   * Swap the elements in this hashtable with the elements in aOther.
> +   */
> +  void SwapElements(nsBaseHashtable<KeyClass, DataType, UserDataType>& aOther)

Please make this:

void SwapElements(nsBaseHashtable& aOther)

for clarity's sake.
Attachment #8652835 - Flags: review?(nfroyd) → review+
Comment on attachment 8652837 [details] [diff] [review]
0002-Bug-1123237-Part-11.-Don-t-use-STL-in-memory-profile.patch

Review of attachment 8652837 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, do you mind finder a different reviewer? I'm not that familiar with our containers so I don't think my review would really provide much added value.
Comment on attachment 8652837 [details] [diff] [review]
0002-Bug-1123237-Part-11.-Don-t-use-STL-in-memory-profile.patch

Cervantes, could you review this? Basically I switched from std::vector to nsTArray and unordered_map to nsDataHashtable.
Attachment #8652837 - Flags: review?(bgirard) → review?(cyu)
Comment on attachment 8652837 [details] [diff] [review]
0002-Bug-1123237-Part-11.-Don-t-use-STL-in-memory-profile.patch

Review of attachment 8652837 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the usage of Mozilla container classes.

::: tools/memory-profiler/MemoryProfiler.h
@@ +83,5 @@
>  
> +  AllocEntry()
> +    : mEventIdx(0)
> +    , mMarked(false)
> +  {}

nit: comment why we need a default ctor here.
Attachment #8652837 - Flags: review?(cyu) → review+
And probably also intermittent unhandled shutdown crashes where all we get is "ERROR - Return code: -11" on OS X 10.6, given https://treeherder.mozilla.org/logviewer.html#?job_id=13737336&repo=mozilla-inbound and https://treeherder.mozilla.org/logviewer.html#?job_id=13738109&repo=mozilla-inbound
Backed out changeset 9c26b3b787f8 (bug 1123237)
Backed out changeset 1fcec0dc93d5 (bug 1123237)
Backed out changeset 390033ceebb6 (bug 1123237)
Backed out changeset e8a1845876d6 (bug 1123237)
Backed out changeset 714ec40715fa (bug 1123237)
Backed out changeset 5ed952e011c3 (bug 1123237)
Backed out changeset c785df6c0cdf (bug 1123237)
Backed out changeset d69e2d195a24 (bug 1123237)
Backed out changeset 1f328807da1d (bug 1123237)
Backed out changeset a1546857dce9 (bug 1123237)





Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>


--HG--
extra : commitid : 62SJ4Omh8xE
extra : rebase_source : ab565b3681e6ac4ef62770be7ea7853cf9fbdaa2



This warning is typically triggered by code which implements
some kind of assertion macro, and it's probably not a good
indicator of bugs anyway, so there is no good reason to keep
it on.





--HG--
extra : rebase_source : c3389a6400aea7f5d557b6a6e7445e7824341dfa

--HG--
extra : rebase_source : 1c6eda0d1aa19102897977cada1013791c8ceda8

--HG--
extra : rebase_source : f94d19ed514d64e55d736b7eef11adcd0b1a65f9

--HG--
extra : rebase_source : c52dbb168bc88b90e7cfe8301d4a74a1f90087ce

--HG--
extra : rebase_source : c63999c3dd9546ed5a3c8973cdec2822bfa33854


























This fixes uninitialised reads.

--HG--
extra : rebase_source : c7f50d9ea7a56d845ccb2f324e3e73fbd2e83003


This is done for performance reasons, as the event regions can be complex.

--HG--
extra : rebase_source : 864ff7cf6f7286271c8f32b50b2166feefecc011
extra : source : 0de9f9d0f8c1dd915e0b04225450ad09552a139e

--HG--
extra : rebase_source : c3b5d44c04b41dc4133e9f3f50a0394c964ac673

--HG--
extra : rebase_source : c32b1cd483bdbfe09760fcb7c0a36cecf3b9940c

--HG--
extra : rebase_source : c779897338dc4b73dca98517acb8d0eef7e0d7a5

--HG--
extra : rebase_source : 59ab475354a7d82dae761f55224da3402be59769

--HG--
extra : rebase_source : a99d677c91f86033f0bbe76c26ba22673522f14c

--HG--
extra : source : e9159912a5b372b90ddd42e19bcf1c7801c5932f
extra : amend_source : 02218c1f824ea56eec53ad73eb72a9d983690aec


















--HG--
extra : rebase_source : d1823d3f1eaf2632fc61e0a444cce2d144d8951f
Attachment #8657518 - Flags: review?(terrence) → review+
Depends on: 1205255
Depends on: 1206485
Depends on: 1216970
You need to log in before you can comment on or make changes to this bug.