1123237 - Implementing a sampling based memory profiler

Reporter

Description

•

11 years ago

The following introductions are mostly copied from https://wiki.mozilla.org/MemoryProfiler Traditional memory analysis tools look at the data. They build the referring-to, namely retaining graphs. This is helpful in exploring the relationships among objects and identifying memory leaks. However, the information exposed in this way is usually a snapshot and the history of how memory is used is missing. Moreover, they usually don't keep tracks of how objects are allocated. Developers have to infer how an object might be allocated and freed by himself/herself. Another common approach is to plot the allocations or the memory usages along the timeline. This gives some intuition of how and when a bulk of objects are allocated. However, it still relies on developers' wisdom, and sometimes chances, to dig out the problematical code snippet. This memory profiler is designed to solve the inconveniences and limitations. The goal is to identify the memory eager codes directly. For example, it can show which functions have highest peaks: SelfHWM TotalHWM Name 6225920 17563648 p/this.start (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8) 6029312 6029312 s (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:11) 4784128 4849664 nsDisplayBoxShadowOuter::Paint 1966080 1966080 IPDL::PHttpChannel::RecvOnTransportAndData 1769472 6684672 a (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:11) 1572864 18153472 u (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8) 1572864 1572864 nsJSContext::GarbageCollectNow 1507328 1507328 PresShell::DoReflow 1507328 13500416 p/<.run/x/< (https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1412710964149:8) The memory profiler samples allocation events. An allocation event includes a type (what and at where is going to be allocated), a size, a timestamp and the corresponding stack trace. Free events are also tracked. For managed languages, namely languages relying on garbage collection, a free event is generated when an object is reclaimed by the garbage collector. These sampled events can be used to approximate the full history of allocations afterwards. That means we can get various memory profiles of a program in different perspectives by post-processing the history in different ways.

Ting-Yuan Huang

Reporter

Comment 1

•

11 years ago

Attached patch part 1. CountAlloc in ReplaceAlloc (obsolete) — Details — Splinter Review

Attachment #8551149 - Flags: feedback?(n.nethercote)

Attachment #8551149 - Flags: feedback?(mh+mozilla)

Ting-Yuan Huang

Reporter

Comment 2

•

11 years ago

CountAlloc is an implementation of ReplaceAlloc. The profiler use it to register hooks to monitor how many bytes are allocated.

Ting-Yuan Huang

Reporter

Comment 3

•

11 years ago

Attached patch part 2. hooks in js engine (obsolete) — Details — Splinter Review

This defines some interfaces to the hook and adds some helpers to be used in later patches.

Attachment #8551156 - Flags: feedback?(bgirard)

Ting-Yuan Huang

Reporter

Comment 4

•

11 years ago

Attached patch part 3. Monitoring allocation and gc events in nursery and tenured heaps (obsolete) — Details — Splinter Review

Hi Terrence, would you mind to review the idea of the sampler in the garbage collector? This patch inserts a few hooks into the memory allocator and garbage collector to profile the memory usage.

Attachment #8551158 - Flags: feedback?(terrence)

Ting-Yuan Huang

Reporter

Comment 5

•

11 years ago

Attached patch part 4. monitoring allocations and frees for ArrayBuffer (obsolete) — Details — Splinter Review

Attachment #8551159 - Flags: feedback?(terrence)

Ting-Yuan Huang

Reporter

Comment 6

•

11 years ago

Attached patch part 5. tracking the memory events (obsolete) — Details — Splinter Review

Hi BenWa, Would you mind to review the idea here? The memory profiler samples allocation, free and gc events like a CPU-time profiler. It mimics the native allocator and also the allocation and collection on the gcheap.

Attachment #8551165 - Flags: feedback?(bgirard)

Ting-Yuan Huang

Reporter

Comment 7

•

11 years ago

Attached patch part 9. Interface to Add-Ons (obsolete) — Details — Splinter Review

Hi Jim, as discussed, this is a temporary solution to enable the profiler before the API of debugger is completed. Would you mind to review the ideas here? Thank you.

Attachment #8551170 - Flags: feedback?(jimb)

Ting-Yuan Huang

Reporter

Comment 8

•

11 years ago

Hi Nicholas, this implements some of the ideas you mentioned (mostly about the Allocation Tab) in https://bugzilla.mozilla.org/show_bug.cgi?id=960671#c33 May I know your opinions to this profiler?

Ben Kelly [:bkelly, not reviewing]

Updated

•

11 years ago

Whiteboard: MemShrink

Emanuel Hoogeveen [:ehoogeveen]

Comment 9

•

11 years ago

Comment on attachment 8551158 [details] [diff] [review] part 3. Monitoring allocation and gc events in nursery and tenured heaps Hmm. Note that the JITs can also allocate from the nursery and the current free list: MacroAssembler::nurseryAllocate [1] and MacroAssembler::freeListAllocate [2]. [1] https://hg.mozilla.org/mozilla-central/file/ea8cce9f6630/js/src/jit/MacroAssembler.cpp#l650 [2] https://hg.mozilla.org/mozilla-central/file/ea8cce9f6630/js/src/jit/MacroAssembler.cpp#l681

Terrence Cole [:terrence]

Comment 10

•

11 years ago

Comment on attachment 8551156 [details] [diff] [review] part 2. hooks in js engine Review of attachment 8551156 [details] [diff] [review]: ----------------------------------------------------------------- Chiming in since this seems to be a fairly JS::GC specific series. ::: js/public/MemoryProfiler.h @@ +15,5 @@ > + public: > + virtual ~NativeProfiler() {}; > + virtual void sampleNative(void *addr, int32_t size) = 0; > + virtual void removeNative(void *addr) = 0; > + virtual void reset() = 0; Making all of these calls virtual is going to add a ton of overhead. It looks like there is only one consumer so maybe we could just move most of the code inline here? @@ +22,5 @@ > +class GCHeapProfiler { > + public: > + virtual ~GCHeapProfiler() {}; > + virtual void sampleTenured(void *addr, int32_t size) = 0; > + virtual void sampleNursery(void *addr, int32_t size) = 0; Please don't use void* for GC things. Instead include js/HeapAPI.h and use GCCellPtr here, and below where you take GC references. ::: js/src/vm/MemoryProfiler.cpp @@ +9,5 @@ > + > +mozilla::Atomic<int> MemProfiler::sGlobalSwitch; > +NativeProfiler *MemProfiler::sNativeProfiler; > + > +GCHeapProfiler *MemProfiler::GetGCHeapProfiler(void *addr) Type should be on the line above: GCHeapProfiler * MemProfiler::GetGCHeapProfiler(GCCellPtr cell)

Terrence Cole [:terrence]

Comment 11

•

11 years ago

Comment on attachment 8551158 [details] [diff] [review] part 3. Monitoring allocation and gc events in nursery and tenured heaps Review of attachment 8551158 [details] [diff] [review]: ----------------------------------------------------------------- As Emanuel already mentioned, this misses both of the JIT allocators. Rather than adding new instructions to the allocation paths, you'll want to discard the jit code and ensure that when we recompile we do the allocation out-of-line. I'd add another condition to the TraceEnabled check in checkAllocatorState: https://dxr.mozilla.org/mozilla-central/source/js/src/jit/MacroAssembler.cpp#620. ::: js/src/gc/Heap.h @@ +1280,5 @@ > bool > TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const > { > AssertValidColor(this, color); > + MemProfiler::MarkTenured(reinterpret_cast<void *>(address())); A virtual call here is going to be extremely noticable on GC benchmarks. What exactly are you trying to do with this count, because it's not going to work as is.

Attachment #8551158 - Flags: feedback?(terrence) → feedback-

Terrence Cole [:terrence]

Comment 12

•

11 years ago

Comment on attachment 8551159 [details] [diff] [review] part 4. monitoring allocations and frees for ArrayBuffer Review of attachment 8551159 [details] [diff] [review]: ----------------------------------------------------------------- js/src/gc/Memory.cpp doesn't really have anything to do with ArrayBufferObjects; please split the Memory changes into a separate patch. ::: js/src/gc/Memory.cpp @@ +711,5 @@ > return nullptr; > > buf = (uint8_t *) MapMemoryAt(buf, pa_size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_FIXED, fd, pa_start); > + MemProfiler::SampleNative(buf, pa_size); You appear to be building the code on all platforms: please add appropriate sampling for all the other platforms as well.

Attachment #8551159 - Flags: feedback?(terrence)

Attachment #8551159 - Flags: feedback?(sphink)

Attachment #8551159 - Flags: feedback-

Nicholas Nethercote [inactive]

Comment 13

•

11 years ago

Comment on attachment 8551149 [details] [diff] [review] part 1. CountAlloc in ReplaceAlloc Review of attachment 8551149 [details] [diff] [review]: ----------------------------------------------------------------- I'm not sure if you want feedback from me about the code in this particular patch or the tool in general. I'll give you the latter. I built the browser and add-on from the GitHub repos, and created a new profile. When I opened the devtools panel and clicked "start" I found that it frequently crashed almost immediately. Loading the PDF at https://github.com/mozilla/pdf.js/issues/2541 reliably caused the crash. When I was able to get it to work without crashing I couldn't get any useful output. I clicked on the "start" button, did some browsing, then clicked on the "stop" button... and nothing happened. The rectangles within the Memory Profiler panel remained blank and there was no indication of activity. The "reset" button also appeared to have no effect. https://wiki.mozilla.org/MemoryProfiler doesn't have anything about how to use the tool, so I can't tell if I'm doing something wrong or not.

Attachment #8551149 - Flags: feedback?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 14

•

11 years ago

Oh, one question: you say "The memory profiler samples allocation events". How does it do the sampling? Do you sample when a certain amount of time has passed, or when a certain number of bytes have been allocated. Hopefully it's the latter; that's what DMD does. I ask this because time-based profiling doesn't work well in a memory profiler because you're likely to sample lots of small allocations (which are common) and miss large ones (which are uncommon) which gives an inaccurate view of what is happening.

Whiteboard: MemShrink → [MemShrink]

Ting-Yuan Huang

Reporter

Comment 15

•

11 years ago

Attached image 2015-01-26-164237_1246x1060_scrot.png — Details

(In reply to Nicholas Nethercote [:njn] from comment #13) > I'm not sure if you want feedback from me about the code in this particular > patch or the tool in general. I'll give you the latter. Thank you. This is what I'm looking for. > I built the browser and add-on from the GitHub repos, and created a new > profile. When I opened the devtools panel and clicked "start" I found that > it frequently crashed almost immediately. Loading the PDF at > https://github.com/mozilla/pdf.js/issues/2541 reliably caused the crash. I apologize for the crashes. There was a huge refactoring recently and some crashing bugs were introduced. While I'm fixing the bugs and refining the codes according to Terrence's and Emanuel's comments, I tried the pdf mentioned in the above and made a screenshot for your reference. http://carto.metro.free.fr/documents/CartoMetroParis.v3.6.pdf > Oh, one question: you say "The memory profiler samples allocation events". > How does it do the sampling? Do you sample when a certain amount of time has > passed, or when a certain number of bytes have been allocated. Hopefully > it's the latter; that's what DMD does. Yes, the sampler samples along the the number of bytes instead of time..

Nicholas Nethercote [inactive]

Comment 16

•

11 years ago

> There was a huge refactoring recently Please NEEDINFO me once you've got the code in a good state again. Thanks.

Mike Hommey [:glandium]

Comment 17

•

11 years ago

Comment on attachment 8551165 [details] [diff] [review] part 5. tracking the memory events Review of attachment 8551165 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/MemoryProfiler.cpp @@ +70,5 @@ > + > + return trace; > +} > + > +class AutoLock { We have MutexAutoLock in mozilla/Mutex.h for this. @@ +451,5 @@ > + void *profiler = PR_GetThreadPrivate(gActiveProfiler); > + if (profiler == &nobody) { > + gNativeProfiler->sampleNative(addr, size); > + } else if (!profiler) { > + PRThread *curr = PR_GetCurrentThread(); PR_GetCurrentThread can end up using the memory allocator. Using it in this function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so that is dangerous as well. @@ +454,5 @@ > + } else if (!profiler) { > + PRThread *curr = PR_GetCurrentThread(); > + if (tlsOwner == curr) > + return; > + PR_Lock(tlsLock); Please note somewhere that CARegister needs to happen after _initOnce, otherwise you crash here, if not before.

Mike Hommey [:glandium]

Comment 18

•

11 years ago

Comment on attachment 8551149 [details] [diff] [review] part 1. CountAlloc in ReplaceAlloc Review of attachment 8551149 [details] [diff] [review]: ----------------------------------------------------------------- ::: memory/build/replace_malloc_bridge.h @@ +86,5 @@ > * instance is valid until the process dies. > * This method was added in version 2 of the bridge. */ > virtual void InitDebugFd(mozilla::DebugFdRegistry&) {} > > + virtual void CARegister(void (*_alloc)(void *, int32_t), Seems to me _alloc and _free should have a different name. CARegister is also not very clear as to what it is. Please also add comments about what the function is for, and that it is added in version 3 of the bridge. ::: memory/replace/countalloc/CountAlloc.cpp @@ +39,5 @@ > +{ > + virtual void CARegister(void (*_alloc)(void *, int32_t), > + void (*_free)(void *)) override { > + __hook_alloc = _alloc; > + __hook_free = _free; You probably want a memory fence of some sort here. Or to use atomics. @@ +96,5 @@ > +{ > + void* new_ptr = sFuncs->realloc(aPtr, aSize); > + if (new_ptr || !aSize) { > + FreeHook(aPtr); > + if (!aSize) { if (aSize)?

Attachment #8551149 - Flags: feedback?(mh+mozilla)

Benoit Girard (:BenWa)

Comment 19

•

11 years ago

Comment on attachment 8551156 [details] [diff] [review] part 2. hooks in js engine Review of attachment 8551156 [details] [diff] [review]: ----------------------------------------------------------------- I can't review js/* code. I'll defer it to Terrence.

Attachment #8551156 - Flags: feedback?(bgirard)

Benoit Girard (:BenWa)

Comment 20

•

11 years ago

Comment on attachment 8551165 [details] [diff] [review] part 5. tracking the memory events Review of attachment 8551165 [details] [diff] [review]: ----------------------------------------------------------------- Just a high level check, I'll finish tomorrow ::: tools/profiler/MemoryProfiler.cpp @@ +52,5 @@ > +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler; > +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime; > + > +// Get the backtrace in PseudoStack. > +static std::vector<const char *> SPSGetStacktrace() This is an implementation detail of the profiler that doesn't belong here. GeckoProfiler.h is the public interface that should be used. The method you're looking for is 'profiler_get_backtrace'. If you only want the pseudostack you should introduce a flag to profiler_get_backtrace to collect just the pseudostack. Otherwise the method should a stack containing all the information available. ::: tools/profiler/moz.build @@ +8,5 @@ > FAIL_ON_WARNINGS = True > > XPIDL_MODULE = 'profiler' > XPIDL_SOURCES += [ > + 'nsIMemoryProfiler.idl', As far as I can tell this doesn't integrate with the profiler at all. The only thing it does is use the pseudostack which is mean to be provided by a public API. If the memory profiler doesn't integrate with the profiler it should live outside of the profiler, or at the very least in a separate subdir. This will prevent the project from developing cross dependencies. ::: tools/profiler/nsIMemoryProfiler.idl @@ +4,5 @@ > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > + > +#include "nsISupports.idl" > + > +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler) If this is means as documentation you should write a better comment for this. If it's just a personal note you should delete it. This class could use documentation. can you get someone else to review .idl? I'm not qualified to do that.

Benoit Girard (:BenWa)

Comment 21

•

11 years ago

Comment on attachment 8551165 [details] [diff] [review] part 5. tracking the memory events Review of attachment 8551165 [details] [diff] [review]: ----------------------------------------------------------------- Just a high level check, I'll finish tomorrow ::: tools/profiler/MemoryProfiler.cpp @@ +40,5 @@ > + > +// gActiveProfiler is the key to a TLS variable which tracks active profiler on > +// each thread. nobody is a sentinal indicating there's no profile active. > +static unsigned int gActiveProfiler; > +static int nobody; Please use mfbt/ThreadLocal.h. It deals with the nobody case better, has better typing and the API is better. @@ +52,5 @@ > +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler; > +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime; > + > +// Get the backtrace in PseudoStack. > +static std::vector<const char *> SPSGetStacktrace() This is an implementation detail of the profiler that doesn't belong here. GeckoProfiler.h is the public interface that should be used. The method you're looking for is 'profiler_get_backtrace'. If you only want the pseudostack you should introduce a flag to profiler_get_backtrace to collect just the pseudostack. Otherwise the method should a stack containing all the information available. @@ +399,5 @@ > + > + virtual void sampleNative(void *addr, int32_t size) { > + AutoMPLock lock(lock_, this); > + cumulated += size; > + if (cumulated >= mSampleSize) { This will give a bias, see my write up on this in my next comment. @@ +614,5 @@ > + return std::make_tuple(names.serialize(), traces.serialize(), move(events)); > +} > + > +NS_IMETHODIMP > +MemoryProfiler::GetResults(JSContext *cx, JS::MutableHandle<JS::Value> aResult) If you want to dump the results when the JS engine isn't alive you should have a look at the JSStreamWriter that the profiler uses. It looks like since we've introduced JSONWriter.h to mfbt which looks similar. ::: tools/profiler/moz.build @@ +8,5 @@ > FAIL_ON_WARNINGS = True > > XPIDL_MODULE = 'profiler' > XPIDL_SOURCES += [ > + 'nsIMemoryProfiler.idl', As far as I can tell this doesn't integrate with the profiler at all. The only thing it does is use the pseudostack which is mean to be provided by a public API. If the memory profiler doesn't integrate with the profiler it should live outside of the profiler, or at the very least in a separate subdir. This will prevent the project from developing cross dependencies. ::: tools/profiler/nsIMemoryProfiler.idl @@ +4,5 @@ > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > + > +#include "nsISupports.idl" > + > +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler) If this is means as documentation you should write a better comment for this. If it's just a personal note you should delete it. This class could use documentation. can you get someone else to review .idl? I'm not qualified to do that.

Attachment #8551165 - Flags: feedback?(bgirard) → feedback-

Benoit Girard (:BenWa)

Comment 22

•

11 years ago

So we've had a lengthy discussion in Toronto (myself, jrmuizel, mstange) about how to sample memory allocations to build a representative heap snapshot to show an *evenly distributed sample* of your live heap at a given time. I'll try to summarize a lengthy discussion. (In reply to Nicholas Nethercote [:njn] from comment #14) > Oh, one question: you say "The memory profiler samples allocation events". > How does it do the sampling? Do you sample when a certain amount of time has > passed, or when a certain number of bytes have been allocated. Hopefully > it's the latter; that's what DMD does. I'm assuming here you mean that you keep a stateful count of how many bytes have been allocated since the last sample. Otherwise you would always miss small allocations which is wrong. When the count is above a threshold we 'sample' the entire allocations weighted by the allocation size. With this method you wont get an evenly distributed probability that a given 'live' byte on the heap has been sampled. Let's explore this from the point of view of a particular byte: 1) A byte is more likely to be sampled if it's part of a large allocation. This is because when an allocation crosses the threshold we record all the bytes within it. So a byte in a one byte allocation gets 1/sampleThreshold. A byte in an allocation larger then your threshold will always get logged. 2) Some allocations can be reliably missed if say a loop allocates code in a particular pattern that is a multiple of the allocation threshold. 3) Blocks tend to have a different lifetime and I'm going that there's a correlation between allocation size and life time. Assuming that this is true then when looking at the heap at a particular time you wont have an even sample of the blocks that are currently live. This is related to 1). The way to fix it is to roll a random number for each byte allocated (in practice you can do something equivalent but faster) regardless of the allocation size of the number of bytes that have been allocated previously. If roll is below sampleRatio then you grab a stack and charge it against that byte. This fixes all the problems above. Each byte has an equal chance of being sampled at any point in time and the probability of sampling a byte is independent of the nearby allocations patterns or how big the allocation that it's inside. Instead of generating a random roll for each allocated byte we can just generate a random binomial random variate. If that's >= 1 then we capture a stack assign the value with that as a weight. Our suggestion should address problem 1,2,3 and let you build a memory profiler that can show an unbiased view of live bytes at any given point in time.

Nicholas Nethercote [inactive]

Comment 23

•

11 years ago

DMD's sampling is controlled by AllocCallback() within memory/replace/dmd/DMD.cpp. The default sampling threshold is 4093 bytes, which is prime, and the documentation recommends that if you choose a different threshold that you also choose a prime. This mitigates problem (2). Also, when a small (i.e. < 4093 B) allocation cause the small allocations counter to cross the 4093 B threshold, that allocation gets treated as though it is 4093 B in size. I.e. it compensates for the the fact that some small blocks are ignored by treating the non-ignored small blocks as if they were bigger. Such blocks are also marked in a way that indicates sampling was involved and in the output any measurements that involved sampled blocks are preceded with '~' to indicate they aren't exact. I think this mitigates (1), though I'm not sure if I understood what you wrote exactly. As for (3), I'm skeptical that there is a correlation between allocation size and lifetime.

Benoit Girard (:BenWa)

Comment 24

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #23) > DMD's sampling is controlled by AllocCallback() within > memory/replace/dmd/DMD.cpp. > > The default sampling threshold is 4093 bytes, which is prime, and the > documentation recommends that if you choose a different threshold that you > also choose a prime. This mitigates problem (2). If your code happens to be allocating a multiple of your prime, say 2 * 4093 then it doesn't rigorously address my concern. But in practice it's probably good enough. > > Also, when a small (i.e. < 4093 B) allocation cause the small allocations > counter to cross the 4093 B threshold, that allocation gets treated as > though it is 4093 B in size. I.e. it compensates for the the fact that some > small blocks are ignored by treating the non-ignored small blocks as if they > were bigger. > > Such blocks are also marked in a way that indicates sampling was involved > and in the output any measurements that involved sampled blocks are preceded > with '~' to indicate they aren't exact. I think this mitigates (1), though > I'm not sure if I understood what you wrote exactly. > > As for (3), I'm skeptical that there is a correlation between allocation > size and lifetime. Essentially you're saying that it's good enough/mitigated. I don't disagree with this in practice. FWIW the CPU profiler suffers from the same problem but time sampling is a lot more noisy. I think my approach is more sound but generating binomial random variate quickly is a PITA.

Nicholas Nethercote [inactive]

Comment 25

•

11 years ago

> > The default sampling threshold is 4093 bytes, which is prime, and the > > documentation recommends that if you choose a different threshold that you > > also choose a prime. This mitigates problem (2). > > If your code happens to be allocating a multiple of your prime, say 2 * 4093 > then it doesn't rigorously address my concern. But in practice it's probably > good enough. jemalloc has a finite number of size classes and any requests for odd or prime sizes will be rounded up to a power-of-two or something power-of-two-ish (such as a multiple of 16 or 256 or 1024) E.g. 4093 rounds up to 4096. And DMD uses the rounded-up size in the sampling calculations.

Ting-Yuan Huang

Reporter

Comment 26

•

11 years ago

Thank you for reviewing this Terrence! I'm fixing the problems according to your comments and will update the patches soon. Just a few questions in below and pardon me if I misunderstood what you mean: (In reply to Terrence Cole [:terrence] from comment #10) > ::: js/public/MemoryProfiler.h > @@ +15,5 @@ > > + public: > > + virtual ~NativeProfiler() {}; > > + virtual void sampleNative(void *addr, int32_t size) = 0; > > + virtual void removeNative(void *addr) = 0; > > + virtual void reset() = 0; > > Making all of these calls virtual is going to add a ton of overhead. It > looks like there is only one consumer so maybe we could just move most of > the code inline here? Because codes in js/src don't know the codes or interfaces in tools/profiler, the approach here is to have hooks installable from outside. I tried tables with function pointers but they seemed to be not much faster than virtual functions which looked a little bit prettier IMHO. > @@ +22,5 @@ > > +class GCHeapProfiler { > > + public: > > + virtual ~GCHeapProfiler() {}; > > + virtual void sampleTenured(void *addr, int32_t size) = 0; > > + virtual void sampleNursery(void *addr, int32_t size) = 0; > > Please don't use void* for GC things. Instead include js/HeapAPI.h and use > GCCellPtr here, and below where you take GC references. The addresses here are just plain numbers and the only operation on them is equality check. Would GCCellPtr be an overkill?

Ting-Yuan Huang

Reporter

Comment 27

•

11 years ago

(In reply to Terrence Cole [:terrence] from comment #11) > ::: js/src/gc/Heap.h > @@ +1280,5 @@ > > bool > > TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const > > { > > AssertValidColor(this, color); > > + MemProfiler::MarkTenured(reinterpret_cast<void *>(address())); > > A virtual call here is going to be extremely noticable on GC benchmarks. > What exactly are you trying to do with this count, because it's not going to > work as is. MemProfiler::MarkTenured is an inline static function. When the profiler is not enabled, the overheads are a memory load and a conditional jump. I was worried if that can still be noticeable but Octane seems to be satisfied with that. I'll try more benchmarks before requesting a review. When the profiler is enabled, the situation is similar to that in comment 26. A function table is not much faster.

Ting-Yuan Huang

Reporter

Comment 28

•

11 years ago

Thank you Mike. (In reply to Mike Hommey [:glandium] from comment #17) > @@ +451,5 @@ > > + void *profiler = PR_GetThreadPrivate(gActiveProfiler); > > + if (profiler == &nobody) { > > + gNativeProfiler->sampleNative(addr, size); > > + } else if (!profiler) { > > + PRThread *curr = PR_GetCurrentThread(); > > PR_GetCurrentThread can end up using the memory allocator. Using it in this > function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so > that is dangerous as well. I think this is the source of crashes Nicholas encountered (sorry...). I gave up relying on tricky flags and TLS to avoid recurrence when using STL containers. Instead, the approach in the next patch will be supplying a customized allocator to containers.

Ting-Yuan Huang

Reporter

Comment 29

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #24) > I think my approach is more sound but generating binomial random variate > quickly is a PITA. I'm not sure how to use a binomial random variable to speed up the process but Nick Fitzgerald showed me that a random variable which conforms to geometric distribution can do so and can be generated with a uniform random variable and two log10(): https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/SavedStacks.cpp#l784 So we only have to roll a dice per 1/p allocations. If the sampling rate p is low enough, the solution should pay off. But yeah, I concur with Nicholas that a prime should be practically good enough.

Steve Fink [:sfink] [:s:]

Comment 30

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #23) > The default sampling threshold is 4093 bytes, which is prime, and the > documentation recommends that if you choose a different threshold that you > also choose a prime. This mitigates problem (2). It mitigates it somewhat, but this can still end up heavily biased. Say you're using a threshold of 4093 bytes, and you're only allocating structs that happen to be 4096 bytes. struct { char *a; // = malloc(2048) char *b; // = malloc(2048) } If you allocate an infinite number of structs, your sampling is perfect. But the first 682 samples always fall within 'b'. In fact, if you look at which of a or b each sample falls within, it will be 'b' 682 times, then 'a' 683 times, repeating (well, 683 and 682 swap for each repeat, but whatever.) So say during startup you allocate 500 of these, and you happen to land near the beginning of one of these runs. You'll end up estimating 2MB of stack 'a' and 0 bytes for stack 'b' (or vice versa). And then you never allocate with these stacks anymore, so you can't bury the bias with later samples. How likely is this in practice? No clue. It's totally possible that it doesn't matter. But fitzgen's approach eliminates the problem completely, since it is exactly equivalent to rolling dice for every byte. > Also, when a small (i.e. < 4093 B) allocation cause the small allocations > counter to cross the 4093 B threshold, that allocation gets treated as > though it is 4093 B in size. I.e. it compensates for the the fact that some > small blocks are ignored by treating the non-ignored small blocks as if they > were bigger. This is the right thing to do, but I would explain it differently. You're sampling along the bytes-allocated dimension with probability p = 1/4093. Thus every sample represents 1/p = 4093 bytes. The implication, though, is that you are *not* changing the size for allocations larger than 4093 bytes. That also introduces a bias. A large allocation of size K should be treated as if it got sampled *approximately* K*p times, and each sample stands for 1/p bytes, so for large K using K*p/p=K for the estimate is fine. But for small K>4093, you really ought to be figuring out exactly how many times your every-4093-byte counter would fire, and multiplying that number by 4093. On the other hand, it's nicer to have an exact size for larger allocations. And if you switch to fitzgen's method, then you don't need to do the math to figure out how many times your sampler would hit for K bytes. (If you really want to, I'm sure you can pose it as a question to fitzgen and jimb and they won't be able to stop themselves from computing it.) Still, it's good to be aware that you likely have bias for the sizes slightly larger than your 1/p (4093) value. > Such blocks are also marked in a way that indicates sampling was involved > and in the output any measurements that involved sampled blocks are preceded > with '~' to indicate they aren't exact. I think this mitigates (1), though > I'm not sure if I understood what you wrote exactly. Yeah, you're doing the right thing here. It's exactly the same as BenWa proposed in (1) when he wrote 1/sampleThreshold. > As for (3), I'm skeptical that there is a correlation between allocation > size and lifetime. I find that correlation believable. Specifically, I would guess that larger allocations tend to live longer. But I don't see why it's a problem. (In reply to Benoit Girard (:BenWa) from comment #22) > 3) Blocks tend to have a different lifetime and I'm going that there's a > correlation between allocation size and life time. Assuming that this is > true then when looking at the heap at a particular time you wont have an > even sample of the blocks that are currently live. This is related to 1). I don't get this. Why do relative lifetimes matter? As long as you're weighting everything by 1/p, I don't see why the relative distribution of chunk sizes matters. Well, except for the bias introduced by using exact allocation sizes. Then again, I haven't read the patch and it's been too long since I read the DMD code. Are frees being sampled too? How does that even work? I really ought to look, but I've spent too much time on the comment already. :-| > Instead of generating a random roll for each allocated byte we can just > generate a random binomial random variate. If that's >= 1 then we capture a > stack assign the value with that as a weight. I sort of understand what you're saying, though I don't understand the exact terminology. This still requires sampling from a probability distribution on every allocation, so fitzgen's approach is still better afaict. (Well, really fitzgen has a single event at a time, instead of a count of some number of them as you do, but it's a straightforward extension.) > Our suggestion should address problem 1,2,3 and let you build a memory > profiler that can show an unbiased view of live bytes at any given point in > time. Right. But njn's is doing the same thing if you take p=1/4093, and so his estimates are no more biased than yours if you ignore harmonics. So I still don't get your (3). (He's flipping a coin for every byte that comes up heads with p=1/4093, it's just that his algorithm for flipping the coin is "return false 4092 times, then true, then start over." I would like to gamble with him. Or equivalently, he's sampling your binomial distribution by just reading out the mode.) To be specific, pseudocode for the unbiased form of what I am suggesting is: remaining = fitzgen_sample(p) alloc(K): k = K while (k > remaining): sample(this stack, weight=1/p) k -= remaining remaining = fitzgen_sample(p) remaining -= k but of course that's slow and awful for large K, and we probably don't need to worry about the "near 1/p bias", so it'd probably be remaining = fitzgen_sample(p) alloc(K): if (K >= 1/p): sample(this stack, weight=K) else: k = K while (k >= remaining): sample(this stack, weight=1/p) k -= remaining remaining = fitzgen_sample(p) remaining -= k You can eliminate the loop here by throwing more math at it, but the loop is rarely going to run very many times. This also still has the same bias as njn's method, just without the harmonics. With some more math you could eliminate the bias, and the whole thing would reduce to remaining = fitzgen_sample(p) alloc(K): if (K < remaining): remaining -= K else: hits, remaining = f(K, remaining) sample(this stack, weight=hits * 1/p) where f() figures out how many times the sampler would hit within the next K bytes, and also returns the number of bytes left after the final sample. This does lose the exact counts for larger allocations, though. Math is involved, which means I got something wrong in the above, but I think the idea should be about right.

Nicholas Nethercote [inactive]

Comment 31

•

11 years ago

Thanks for the detailed analysis, sfink. I will have to re-read it on Monday. When I first implemented sampling in DMD I tried doing something like you suggested and I had a long argument with jlebar about it and he convinced me that DMD's current code was ok. Unfortunately I don't remember the explanation now. > This does lose the exact counts for larger allocations, though. For what it's worth, this would hurt. DMD shows the size of all the non-sampled allocations and it's useful data that I look at quite often.

Justin Lebar (not reading bugmail)

Comment 32

•

11 years ago

Hi everyone. njn asked me to weigh in here. I agree with sfink's assessment of the "4093 algorithm". I'm not convinced that the harmonic bias is particularly important in the context of Firefox: I can't recall a time when we missed something important due to sampling bias. But of course maybe we never found it precisely because it was hidden by sampling bias. I can't figure out what fitzgen_sample is, so I can't comment on that suggestion. I can, however, point you to what Google does in tcmalloc. There's an OSS version, which is all I can comment on; the relevant code is in sampler.h [1]. I think the sampling algorithm it uses is similar to the last algorithm from comment 30, although it lacks the "hits" variable. There's a big comment at the top of the file that hopefully explains what's going on, but I studied this code pretty extensively about a year ago, so I can try to help if you need clarification. The tcmalloc algorithm is statistically sound inasmuch as I was able to analyze it; it certainly lacks the harmonic bias of the 4093 algorithm. It's also quite elegant; I swooned a bit when I first read it. :) The only caveat I'd offer is that DMD samples at a much higher rate than tcmalloc's default; DMD samples 4k allocs with probability 1, while tcmalloc samples a 4k allocation with probability < 1%. tcmalloc's algorithm has a higher per-sample overhead than the 4093 algorithm, and this may be significant at higher sampling frequencies; I don't know. There's a concern here about getting the exact size of large allocations that I want to allay, but I'm not sure exactly what it's about. With either tcmalloc sampling or the 4093 algorithm, there's uncertainty about the *number* of allocations of size N from a given callsite, but no uncertainty about each sampled allocation's size. tcmalloc represents this in its output as e.g. "~3mb of size-48 allocs from function foo()". But anyway if you want a hard cutoff above which all allocs are sampled, I don't think that would be a problem to implement with the tcmalloc approach. HTH. I'm not reading bugmail (I still get a ton), but feel free to ping me via e-mail if you have any other questions. [1] https://code.google.com/p/gperftools/source/browse/src/sampler.h

Ting-Yuan Huang

Reporter

Comment 33

•

11 years ago

(In reply to Steve Fink [:sfink, :s:] from comment #30) > Then again, I haven't read the patch and it's been too long since I read the > DMD code. Are frees being sampled too? How does that even work? I really > ought to look, but I've spent too much time on the comment already. :-| Frees are not sampled unfortunately. Each free results in a look up in the address table to see if there's a corresponding sample.

Jim Blandy :jimb

Comment 34

•

11 years ago

(In reply to Justin Lebar (not reading bugmail) from comment #32) > I can't figure out what fitzgen_sample is, so I can't comment on that > suggestion. I can, however, point you to what Google does in tcmalloc. fitzgen_sample is sfink's made-up name for the algorithm cited in comment 29: https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/SavedStacks.cpp#l784 which is indeed - miracle of miracles! - the same thing tcmalloc is doing: generating random numbers with a geometric distribution to get the equivalent of a Bernoulli trial on every allocation. However, the way tcmalloc does a trail for every *byte*, not every *allocation*, seems like a fine idea: it seems very wrong to skip gigantic allocations with the same probability as small ones. The fact that each byte's probability of being marked is completely independent of the other bytes' means that we don't even need any funny re-synchronization of the skip count after a large allocation; we just choose a new skip count like we would for anything else, because the process looks the same no matter where you start in the stream.

Jim Blandy :jimb

Comment 35

•

11 years ago

Comment on attachment 8551170 [details] [diff] [review] part 9. Interface to Add-Ons Review of attachment 8551170 [details] [diff] [review]: ----------------------------------------------------------------- This looks good. Do you have any tests for it? See the examples of tests for other actors in toolkit/devtools/server/tests/mochitest. It wouldn't be necessary to actually test nsIMemoryProfiler as well; the tests could set a global flag indicating that the actor should return fake data.

Attachment #8551170 - Flags: feedback?(jimb)

Steve Fink [:sfink] [:s:]

Comment 36

•

11 years ago

(In reply to Jim Blandy :jimb from comment #34) > (In reply to Justin Lebar (not reading bugmail) from comment #32) > > I can't figure out what fitzgen_sample is, so I can't comment on that > > suggestion. I can, however, point you to what Google does in tcmalloc. > > fitzgen_sample is sfink's made-up name for the algorithm cited in comment 29: > > https://hg.mozilla.org/mozilla-central/file/29b05d283b00/js/src/vm/ > SavedStacks.cpp#l784 > > which is indeed - miracle of miracles! - the same thing tcmalloc is doing: > generating random numbers with a geometric distribution to get the > equivalent of a Bernoulli trial on every allocation. > > However, the way tcmalloc does a trail for every *byte*, not every > *allocation*, seems like a fine idea: it seems very wrong to skip gigantic > allocations with the same probability as small ones. Gah! It does?! I should've read more closely. I just assumed it was skipping N bytes, not N allocations. Skipping allocations is, er, not what you want. My pseudocode was assuming each byte has the same probability of being sampled. > The fact that each byte's probability of being marked is completely > independent of the other bytes' means that we don't even need any funny > re-synchronization of the skip count after a large allocation; we just > choose a new skip count like we would for anything else, because the process > looks the same no matter where you start in the stream. Well, it depends on how you're looking at it. We're looking at a memoryless sampling process, so it is true that the skip count computation is exactly the same no matter how recent your last sample was. But to implement this, that's not what you need to compute. Say you have a state where your skip count is 5 and you're allocating 20 bytes. That's equivalent to a skip count of zero and an allocation of 15 bytes (if we ignore the off-by-one stuff for the specific byte being sampled.) If p=1/5 (per byte), then you want to know how many samples you're going to take within those 15 bytes (it'll be about 3), and also how much is *left* of your skip count after the last byte allocated. The distribution for this leftover skip count is not the same as the distribution for the skip count in general. It is smaller on average. (If you were doing every-4093rd-byte sampling, it would be equal to 4093 minus the gap at the end of the previous allocation.) Er... except see below. If you use the actual sizes for large allocations, then you have already accounted for the difference. You don't even need to compute a new skip count. (In reply to Justin Lebar (not reading bugmail) from comment #32) > I can't figure out what fitzgen_sample is, so I can't comment on that Sorry, as jimb said, fitzgen_sample is just sampling from a geometric distribution. (And it sounds like I should have called it jimb_sample anyway, if I didn't want to use the official name.) > suggestion. I can, however, point you to what Google does in tcmalloc. > There's an OSS version, which is all I can comment on; the relevant code is > in sampler.h [1]. I think the sampling algorithm it uses is similar to the > last algorithm from comment 30, although it lacks the "hits" variable. Lacking "hits" seems to be a rather large flaw to me, but it depends on what you're using it for. "hits" is a weight. If you're implementing a profiler, weights are important. Or in simple terms, if you care about the difference between a 1GB and a 2GB allocation, then you need weights. (Well, unless you want to implement those as a whole crapload of separate samples, but that would make your sampling take time linear in the size of the allocations, which seems bad.) If your p is low enough, then it also doesn't really make much of a difference -- if you would only take at most one sample in any sanely-sized allocation 99.999% of the time, then weights are irrelevant in practice since they'll always be 1. Oh! And if you're using actual sizes for larger allocations, you mostly don't need weights either. > There's a big comment at the top of the file that hopefully explains what's > going on, but I studied this code pretty extensively about a year ago, so I > can try to help if you need clarification. > > The tcmalloc algorithm is statistically sound inasmuch as I was able to > analyze it; it certainly lacks the harmonic bias of the 4093 algorithm. > It's also quite elegant; I swooned a bit when I first read it. :) The only Yay for independent evolution. :-) jimb came up with the same thing. > caveat I'd offer is that DMD samples at a much higher rate than tcmalloc's > default; DMD samples 4k allocs with probability 1, while tcmalloc samples a > 4k allocation with probability < 1%. tcmalloc's algorithm has a higher > per-sample overhead than the 4093 algorithm, and this may be significant at > higher sampling frequencies; I don't know. Yes, I agree. But for the use we're considering here, we're taking a friggin' stack trace. Assuming the last algorithm, that's one sample drawn from a distribution per stack trace, and I'd be very pleasantly surprised if taking a stack trace is fast compared to doing a little bit of math. > There's a concern here about getting the exact size of large allocations > that I want to allay, but I'm not sure exactly what it's about. With It's complicated, and I had a much larger writeup here, but it turns out that using the actual size for large allocations is what you want and avoids some subtle problems. Maybe I'll do a blog post, where I explain the painful detail why. > either tcmalloc sampling or the 4093 algorithm, there's uncertainty about > the *number* of allocations of size N from a given callsite, but no > uncertainty about each sampled allocation's size. tcmalloc represents this > in its output as e.g. "~3mb of size-48 allocs from function foo()". But > anyway if you want a hard cutoff above which all allocs are sampled, I don't > think that would be a problem to implement with the tcmalloc approach. Yes, it is straightforward to implement, though the exact meaning of the results in that case is complicated. I would need to look at the sampler.h callers to be sure of what its numbers mean. > [1] https://code.google.com/p/gperftools/source/browse/src/sampler.h Thanks for the link! Ok, so to be precise, I'm talking about: remaining = sample_geometric(p); alloc(K): if (K > 1/p): sample(this stack, weight=K) else: if (K < remaining): remaining -= K else: sample(this stack, weight=1/p) remaining = sample_geometric(p) where sample_geometric(p) is ceil(ln(1-rand) / ln(1-p)) and rand is a sample from a uniform distribution over [0,1]. Note that allocations are "sampled" based on two totally different processes: anything of alloc size over 1/p is recorded unconditionally and weighted by the actual size of the allocation. Smaller allocations are sampled with an independent probability of p per byte, and samples are weighted by 1/p. This procedure is fast, simple, and if I am not mistaken, wrong. The correct algorithm would be: remaining = sample_geometric(p); alloc(K): if (K > 1/p): sample(this stack, weight=K) else: if (K < remaining): remaining -= K else: hits = 0 while (remaining <= K): hits++ K -= remaining remaining = sample_geometric(p) sample(this stack, weight=hits*1/p) because with the first procedure a 100-byte allocation with p=1/4093 would have an exactly 0% chance of being sampled twice, which is wrong if each byte is tested independently. But this is sadly more complicated, for little real benefit. Perhaps you could bump up the weight based on the leftover bytes to avoid the while() loop? Say, by (K-remaining)*p? I've no idea whether that is actually sound, but once again I've spent too much time thinking about this. :-( I think it reintroduces a teeny tiny bit of the harmonic problem, but nowhere near enough to worry about. Ok, my final answer (for now): remaining = sample_geometric(p); alloc(K): if (K > 1/p): sample(this stack, weight=K) else: if (K < remaining): remaining -= K else: sample(this stack, weight=(1/p + K - remaining)) remaining = sample_geometric(p) Not that anyone was necessarily waiting for me to stop talking to myself...

Nicholas Nethercote [inactive]

Updated

•

11 years ago

Whiteboard: [MemShrink] → [MemShrink:P2]

Jim Blandy :jimb

Comment 37

•

11 years ago

Comment on attachment 8551170 [details] [diff] [review] part 9. Interface to Add-Ons Review of attachment 8551170 [details] [diff] [review]: ----------------------------------------------------------------- This looks good. Do you have any tests for it? See the examples of tests for other actors in toolkit/devtools/server/tests/mochitest. It wouldn't be necessary to actually test nsIMemoryProfiler as well; the tests could set a global flag indicating that the actor should return fake data.

Attachment #8551170 - Flags: feedback+

Jim Blandy :jimb

Comment 38

•

11 years ago

[Sorry for the re-post of the review. I blame Splinter.] I don't see the value of having a special case for K > 1/p. If p is 4096, you've got a 99.9% chance of sampling a 32k allocation. I also don't think the 'hits' count is actually interesting, since it's strictly a function of the block's size. The likelihood of each byte being marked is independent, so the hit count is simply the number of marked bytes the block contains, which is a value with a binomial distribution: http://en.wikipedia.org/wiki/Binomial_distribution#Generating_binomial_random_variates So I suppose you could fake up a hit count if you wanted to, but ... why not just report the size? If your display doesn't adequately put big allocations front and center, that's a separate problem. I'd say the algorithm simply: remaining = sample_geometric(p); alloc(K): remaining -= K if remaining < 0: sample(this stack) remaining = sample_geometric(p)

Jim Blandy :jimb

Comment 39

•

11 years ago

More strongly: if I'm not missing something, the expected value of the hits count is simply Kp. Not very interesting.

Steve Fink [:sfink] [:s:]

Comment 40

•

11 years ago

(In reply to Jim Blandy :jimb from comment #39) > More strongly: if I'm not missing something, the expected value of the hits > count is simply Kp. Not very interesting. Yes, that is the expected value. > I don't see the value of having a special case for K > 1/p. If p is 4096, > you've got a 99.9% chance of sampling a 32k allocation. Well, for one, I'm not sure what p is going to be, since it depends on how long it takes to grab a stack trace. I could see it being 1/1M. But really, I introduced the special case to (1) preserve the property that counts above some threshold are exact, and (2) to avoid the "leftover" calculation since I still insist that you can't just generate a new sample and store it in 'remaining'. That's the right distribution of the wrong variable. I don't know how important (1) is in practice, but it seems nice for testing at least. > I also don't think the 'hits' count is actually interesting, since it's > strictly a function of the block's size. The likelihood of each byte being > marked is independent, so the hit count is simply the number of marked bytes > the block contains, which is a value with a binomial distribution: > > http://en.wikipedia.org/wiki/ > Binomial_distribution#Generating_binomial_random_variates Ah, that was the thing that I never got around to figuring out. So sample_binomial(size,p)*1/p could be used as the weight. And then, yes, you can just grab a fresh sample_geometric(p) for 'remaining'. That resolves my (2). > So I suppose you could fake up a hit count if you wanted to, but ... why not > just report the size? If your display doesn't adequately put big allocations > front and center, that's a separate problem. If you just report the size without doing a fancy computation for 'remaining', I believe you're biasing things. > I'd say the algorithm simply: > > remaining = sample_geometric(p); > alloc(K): > remaining -= K > if remaining < 0: > sample(this stack) > remaining = sample_geometric(p) Huh? No weights at all? Then if you do 100MB of tiny allocations followed by one 2GB allocation, the 2GB allocation will be completely lost in the noise. Or are you implicitly using the full size, so that should say "sample(this stack, weight=K)" in place of "sample(this stack)"? But then you're not estimating the total sizes correctly. If you have a bunch of 4 byte allocations mixed with the same number of 8 byte allocations, then you'll take twice as many samples of 8 byte allocations (which is good) *and* weight each twice as much, which means you'll think you're using up 4x the space with the 8-byters when in fact you're only using up 2x the space. They should be sampled with the same weight. I do believe the binomial sampling allows a simpler solution than I proposed, but not quite that simple.

Ting-Yuan Huang

Reporter

Comment 41

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #21) > you're looking for is 'profiler_get_backtrace'. If you only want the profiler_get_backtrace returns a ProfilerBacktrace of which the only interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& b). This can be slow in my use cases because I have to parse the streams back into lists of function names and attributes. Would you mind if I add a public member function to it to return the list directly? By the way, is it possible to compact the stacktraces on the fly when profiling, and decode it in the front-end? For example, although the number of function names and stacktraces is somewhat proportional to the number of samples, they basically form a small set and most of them are the same. In this way, the buffer can be utilized much more efficiently and the profiler can run over a much longer period without considering memory usage. In my experience, running Octane 2.0 with ~1 million samples only result in a few thousands (1000 ~ 3000) of distinct function names. The stacktraces can be encoded in only 10000 ~ 20000 indexes. In contrast, there will be (avg. stack depth) * 1 million function names without compaction.

Flags: needinfo?(bgirard)

Benoit Girard (:BenWa)

Comment 42

•

11 years ago

(In reply to Ting-Yuan Huang from comment #41) > (In reply to Benoit Girard (:BenWa) from comment #21) > > you're looking for is 'profiler_get_backtrace'. If you only want the > > profiler_get_backtrace returns a ProfilerBacktrace of which the only > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& > b). This can be slow in my use cases because I have to parse the streams > back into lists of function names and attributes. Would you mind if I add a > public member function to it to return the list directly? Yes please, you should add something to GeckoProfiler.h that takes the ProfilerBacktrace object and returns a list of struct POD that contains the data you need. (In reply to Ting-Yuan Huang from comment #41) > > By the way, is it possible to compact the stacktraces on the fly when > profiling, and decode it in the front-end? For example, although the number > of function names and stacktraces is somewhat proportional to the number of > samples, they basically form a small set and most of them are the same. In > this way, the buffer can be utilized much more efficiently and the profiler > can run over a much longer period without considering memory usage. > > In my experience, running Octane 2.0 with ~1 million samples only result in > a few thousands (1000 ~ 3000) of distinct function names. The stacktraces > can be encoded in only 10000 ~ 20000 indexes. In contrast, there will be > (avg. stack depth) * 1 million function names without compaction. We've discussed this. The best thing would be to use lz4 compression. But so far it's better to just increase the buffer size or decrease the sampling interval.

Flags: needinfo?(bgirard)

Ting-Yuan Huang

Reporter

Comment 43

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #42) > Yes please, you should add something to GeckoProfiler.h that takes the > ProfilerBacktrace object and returns a list of struct POD that contains the > data you need. Got it. Thanks! > We've discussed this. The best thing would be to use lz4 compression. But so > far it's better to just increase the buffer size or decrease the sampling > interval. I tried again with Octane 2.0 and got only 276114 samples because my debug codes introduced some slowdowns. Nevertheless, there are 1582 unique names and the total size is 149.67KB. If not compacted, the size would be 703.58MB. The compression ratio is 4788. Removing the debug codes will get more samples and have an even better compression ratio. I doubt that lz4 can beat this within a reasonable amount of time :-)

Ting-Yuan Huang

Reporter

Comment 44

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #42) > > profiler_get_backtrace returns a ProfilerBacktrace of which the only > > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& > > b). This can be slow in my use cases because I have to parse the streams > > back into lists of function names and attributes. Would you mind if I add a > > public member function to it to return the list directly? > > Yes please, you should add something to GeckoProfiler.h that takes the > ProfilerBacktrace object and returns a list of struct POD that contains the > data you need. The most tricky part when implementing this is that profiler_get_backtrace() allocates memory. Calling an allocator in a memory profiler results in an infinite recursion. There are two approaches I tried to solve the problem when dealing with STL containers: 1. Completely remove explicit/implicit memory allocations. STL containers are supplied with a customized allocator. I feel that a huge change of SPS that removes all allocations is unlikely acceptable. Removing all (explicit and implicit) allocations in profiler_get_backtrace() can be a huge change and make subsequent developments painful; Any patch which introduces memory allocations breaks the memory profiler. The best way I can think of is to factor out the stack unwinding part of SPS. 2. Use a per-thread flag to prevent recurring into the memory profiler itself. This looks easier and is transparent to profiler. However, an allocator-free version TLS must be implemented. That means a number of PR_XXX such as PR_GetCurrentThread() and PR_SetThreadPrivate() must be implemented again on without dynamic memory allocation on all platforms. Either approach requires substantial efforts and deserves its own bug. Would you mind if we first land this with a quick hack (go through the PseudoStack either in memory profiler or in a new API of profiler) and revisit the problem later?

Flags: needinfo?(bgirard)

Benoit Girard (:BenWa)

Comment 45

•

11 years ago

(In reply to Ting-Yuan Huang from comment #43) > (In reply to Benoit Girard (:BenWa) from comment #42) > > Yes please, you should add something to GeckoProfiler.h that takes the > > ProfilerBacktrace object and returns a list of struct POD that contains the > > data you need. > > Got it. Thanks! > > > We've discussed this. The best thing would be to use lz4 compression. But so > > far it's better to just increase the buffer size or decrease the sampling > > interval. > > I tried again with Octane 2.0 and got only 276114 samples because my debug > codes introduced some slowdowns. Nevertheless, there are 1582 unique names > and the total size is 149.67KB. If not compacted, the size would be > 703.58MB. The compression ratio is 4788. Removing the debug codes will get > more samples and have an even better compression ratio. I doubt that lz4 can > beat this within a reasonable amount of time :-) What are you comparing it to? What has a 4788 times compression ratio?

Flags: needinfo?(bgirard)

Benoit Girard (:BenWa)

Comment 46

•

11 years ago

(In reply to Ting-Yuan Huang from comment #44) > (In reply to Benoit Girard (:BenWa) from comment #42) > > > profiler_get_backtrace returns a ProfilerBacktrace of which the only > > > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& > > > b). This can be slow in my use cases because I have to parse the streams > > > back into lists of function names and attributes. Would you mind if I add a > > > public member function to it to return the list directly? > > > > Yes please, you should add something to GeckoProfiler.h that takes the > > ProfilerBacktrace object and returns a list of struct POD that contains the > > data you need. > > The most tricky part when implementing this is that profiler_get_backtrace() > allocates memory. Calling an allocator in a memory profiler results in an > infinite recursion. Your implementation of SPSGetStacktrace already uses std::vector which will allocate memory. > The best way I can think of is to > factor out the stack unwinding part of SPS. It's already optional under the 'stackwalk' feature. You might want to add a flag to profiler_get_backtrace to not use the stackwalk feature. > > Either approach requires substantial efforts and deserves its own bug. Would > you mind if we first land this with a quick hack (go through the PseudoStack > either in memory profiler or in a new API of profiler) and revisit the > problem later? I'm suggesting that you don't break the profiler API and expose implementation details. I'm not asking for a functional change. Expose the information you want through a good API instead of exposing the internal details. If you have SPSGetStacktrace working with the condition you need, you should be able to have SPSGetStacktrace be profiler_get_backtrace and have the allocation behavior that you need.

Ting-Yuan Huang

Reporter

Comment 47

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #45) > What are you comparing it to? What has a 4788 times compression ratio? (total size of names of all backtraces) / (total size of those 1582 unique/atom names) Does that make sense?

Ting-Yuan Huang

Reporter

Comment 48

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #46) > Your implementation of SPSGetStacktrace already uses std::vector which will > allocate memory. Yeah, that is the source of crashes njn encountered. It will be eliminated in the incoming patch. > It's already optional under the 'stackwalk' feature. You might want to add a > flag to profiler_get_backtrace to not use the stackwalk feature. I'm sorry that I should have not use word "stack unwinding". I meant to factor the part getting backtraces (equivalently unwinding and merging both JS and native stack) out of TableTicker which creates SyncProfile or so. > I'm suggesting that you don't break the profiler API and expose > implementation details. I'm not asking for a functional change. Expose the > information you want through a good API instead of exposing the internal > details. If you have SPSGetStacktrace working with the condition you need, > you should be able to have SPSGetStacktrace be profiler_get_backtrace and > have the allocation behavior that you need. Glad to know that I misunderstood. Now it looks much easier to me :-)

Benoit Girard (:BenWa)

Comment 49

•

11 years ago

(In reply to Ting-Yuan Huang from comment #47) > (In reply to Benoit Girard (:BenWa) from comment #45) > > What are you comparing it to? What has a 4788 times compression ratio? > > (total size of names of all backtraces) / (total size of those 1582 > unique/atom names) > > Does that make sense? You can't eliminate redundant samples for free, you have to encode them somehow. That encoding is exactly what LZ4 provides. But if you can write an encoder that is better than LZ4, is fast, can work in place without allocating memory and isn't complex and difficult to maintain then that works

Markus Stange [:mstange]

Comment 50

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #49) > You can't eliminate redundant samples for free, you have to encode them > somehow. That encoding is exactly what LZ4 provides. But if you can write an > encoder that is better than LZ4, is fast, can work in place without > allocating memory and isn't complex and difficult to maintain then that works Let's talk about this in a different bug. Kannan also expressed interest in improving the packing of the profiler buffer by using a separate symbolication table. (Both Kannan and I were extremely surprised by the lz4 suggestion since we don't understand how that would work with a circular buffer; the compressed stream can refer to anything earlier in the stream, even to parts that have already been discarded from the buffer.)

Steve Fink [:sfink] [:s:]

Comment 51

•

11 years ago

Comment on attachment 8551159 [details] [diff] [review] part 4. monitoring allocations and frees for ArrayBuffer Review of attachment 8551159 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/ArrayBufferObject.cpp @@ +298,5 @@ > ReleaseAsmJSMappedData(data); > js_ReportOutOfMemory(cx); > return false; > } > + MemProfiler::SampleNative(diffStart, diffLength); I'm fine with the general approach of instrumenting right at the mmap calls.

Attachment #8551159 - Flags: feedback?(sphink) → feedback+

Ting-Yuan Huang

Reporter

Comment 52

•

11 years ago

Attached patch part 1. CountAlloc in ReplaceAlloc (obsolete) — Details — Splinter Review

(In reply to Mike Hommey [:glandium] from comment #17) > Comment on attachment 8551165 [details] [diff] [review] > part 5. tracking the memory events > > Review of attachment 8551165 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: tools/profiler/MemoryProfiler.cpp > @@ +70,5 @@ > > + > > + return trace; > > +} > > + > > +class AutoLock { > > We have MutexAutoLock in mozilla/Mutex.h for this. mozilla::Mutex seems to call allocator in debug builds. Would you mind to give one more chance to the simpler PRLock and a reinvented wheel (AutoMPLock)? > @@ +451,5 @@ > > + void *profiler = PR_GetThreadPrivate(gActiveProfiler); > > + if (profiler == &nobody) { > > + gNativeProfiler->sampleNative(addr, size); > > + } else if (!profiler) { > > + PRThread *curr = PR_GetCurrentThread(); > > PR_GetCurrentThread can end up using the memory allocator. Using it in this > function is dangerous. PR_GetThreadPrivate calls PR_GetCurrentThread, so > that is dangerous as well. > > @@ +454,5 @@ > > + } else if (!profiler) { > > + PRThread *curr = PR_GetCurrentThread(); > > + if (tlsOwner == curr) > > + return; > > + PR_Lock(tlsLock); > > Please note somewhere that CARegister needs to happen after _initOnce, > otherwise you crash here, if not before. These are avoided by using a customized allocator which is not hooked. (In reply to Mike Hommey [:glandium] from comment #18) > Comment on attachment 8551149 [details] [diff] [review] > part 1. CountAlloc in ReplaceAlloc > > Review of attachment 8551149 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: memory/build/replace_malloc_bridge.h > @@ +86,5 @@ > > * instance is valid until the process dies. > > * This method was added in version 2 of the bridge. */ > > virtual void InitDebugFd(mozilla::DebugFdRegistry&) {} > > > > + virtual void CARegister(void (*_alloc)(void *, int32_t), > > Seems to me _alloc and _free should have a different name. CARegister is > also not very clear as to what it is. Please also add comments about what > the function is for, and that it is added in version 3 of the bridge. Done. Comments are added and they renamed to CountAllocRegisterHook, sAllocHook, etc. > ::: memory/replace/countalloc/CountAlloc.cpp > @@ +39,5 @@ > > +{ > > + virtual void CARegister(void (*_alloc)(void *, int32_t), > > + void (*_free)(void *)) override { > > + __hook_alloc = _alloc; > > + __hook_free = _free; > > You probably want a memory fence of some sort here. Or to use atomics. For multithreaded environment, would it be caller's responsibility to ensure the order? For compiler optimizations or out-of-order-execution in a single thread, I think it should be OK when not inlined, right? > @@ +96,5 @@ > > +{ > > + void* new_ptr = sFuncs->realloc(aPtr, aSize); > > + if (new_ptr || !aSize) { > > + FreeHook(aPtr); > > + if (!aSize) { > > if (aSize)? Yes! Thanks you.

Attachment #8551149 - Attachment is obsolete: true

Attachment #8571128 - Flags: review?(mh+mozilla)

Ting-Yuan Huang

Reporter

Comment 53

•

11 years ago

Attached patch part 2. hooks in js engine (obsolete) — Details — Splinter Review

(In reply to Terrence Cole [:terrence] from comment #10) > Comment on attachment 8551156 [details] [diff] [review] > part 2. hooks in js engine > > Review of attachment 8551156 [details] [diff] [review]: > ----------------------------------------------------------------- > > Chiming in since this seems to be a fairly JS::GC specific series. > > ::: js/public/MemoryProfiler.h > @@ +15,5 @@ > > + public: > > + virtual ~NativeProfiler() {}; > > + virtual void sampleNative(void *addr, int32_t size) = 0; > > + virtual void removeNative(void *addr) = 0; > > + virtual void reset() = 0; > > Making all of these calls virtual is going to add a ton of overhead. It > looks like there is only one consumer so maybe we could just move most of > the code inline here? Because codes in js/src don't know the codes or interfaces in tools/profiler, the approach here is to have hooks installable from outside. I tried tables with function pointers but they seemed to be not much faster than virtual functions which looked a little bit prettier IMHO. Anyway, when the memory profiler is not enabled, these virtual functions won't be called. > @@ +22,5 @@ > > +class GCHeapProfiler { > > + public: > > + virtual ~GCHeapProfiler() {}; > > + virtual void sampleTenured(void *addr, int32_t size) = 0; > > + virtual void sampleNursery(void *addr, int32_t size) = 0; > > Please don't use void* for GC things. Instead include js/HeapAPI.h and use > GCCellPtr here, and below where you take GC references. The addresses here are just plain numbers and the only operation on them is equality check. Would GCCellPtr be an overkill? > ::: js/src/vm/MemoryProfiler.cpp > @@ +9,5 @@ > > + > > +mozilla::Atomic<int> MemProfiler::sGlobalSwitch; > > +NativeProfiler *MemProfiler::sNativeProfiler; > > + > > +GCHeapProfiler *MemProfiler::GetGCHeapProfiler(void *addr) > > Type should be on the line above: > > GCHeapProfiler * > MemProfiler::GetGCHeapProfiler(GCCellPtr cell) Done.

Attachment #8551156 - Attachment is obsolete: true

Attachment #8571133 - Flags: review?(terrence)

Ting-Yuan Huang

Reporter

Comment 54

•

11 years ago

Comment on attachment 8551158 [details] [diff] [review] part 3. Monitoring allocation and gc events in nursery and tenured heaps (In reply to Terrence Cole [:terrence] from comment #11) > Comment on attachment 8551158 [details] [diff] [review] > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > Review of attachment 8551158 [details] [diff] [review]: > ----------------------------------------------------------------- > ::: js/src/gc/Heap.h > @@ +1280,5 @@ > > bool > > TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const > > { > > AssertValidColor(this, color); > > + MemProfiler::MarkTenured(reinterpret_cast<void *>(address())); > > A virtual call here is going to be extremely noticable on GC benchmarks. > What exactly are you trying to do with this count, because it's not going to > work as is. MemProfiler::MarkTenured is an inline static function. When the profiler is not enabled, the overheads are a memory load and a conditional jump. When the profiler is enabled, the situation is similar to that in comment 26. A function table is not much faster.

Attachment #8551158 - Flags: review?(terrence)

Ting-Yuan Huang

Reporter

Comment 55

•

11 years ago

Attached patch part 4. monitoring allocations and frees for ArrayBuffer (obsolete) — Details — Splinter Review

(In reply to Terrence Cole [:terrence] from comment #12) > Comment on attachment 8551159 [details] [diff] [review] > part 4. monitoring allocations and frees for ArrayBuffer > > Review of attachment 8551159 [details] [diff] [review]: > ----------------------------------------------------------------- > > js/src/gc/Memory.cpp doesn't really have anything to do with > ArrayBufferObjects; please split the Memory changes into a separate patch. > > ::: js/src/gc/Memory.cpp > @@ +711,5 @@ > > return nullptr; > > > > buf = (uint8_t *) MapMemoryAt(buf, pa_size, PROT_READ | PROT_WRITE, > > MAP_PRIVATE | MAP_FIXED, fd, pa_start); > > + MemProfiler::SampleNative(buf, pa_size); > > You appear to be building the code on all platforms: please add appropriate > sampling for all the other platforms as well. I've pulled the hooks from Memory.cpp into ArrayBufferObject.cpp.

Attachment #8551159 - Attachment is obsolete: true

Attachment #8571143 - Flags: review?(terrence)

Attachment #8571143 - Flags: review?(sphink)

Ting-Yuan Huang

Reporter

Comment 56

•

11 years ago

Attached patch part 5. Don't emit inline allocation when memory profiler enabled. (obsolete) — Details — Splinter Review

(In reply to Terrence Cole [:terrence] from comment #11) > Comment on attachment 8551158 [details] [diff] [review] > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > Review of attachment 8551158 [details] [diff] [review]: > ----------------------------------------------------------------- > > As Emanuel already mentioned, this misses both of the JIT allocators. Rather > than adding new instructions to the allocation paths, you'll want to discard > the jit code and ensure that when we recompile we do the allocation > out-of-line. I'd add another condition to the TraceEnabled check in > checkAllocatorState: > https://dxr.mozilla.org/mozilla-central/source/js/src/jit/MacroAssembler. > cpp#620. Thank you. The patch is basically: - if (js::gc::TraceEnabled()) + if (js::gc::TraceEnabled() || MemProfiler::enabled())

Attachment #8571147 - Flags: review?(terrence)

Ting-Yuan Huang

Reporter

Comment 57

•

11 years ago

Attached patch part 6. A new API to get backtrace without allocating memory in profiler (obsolete) — Details — Splinter Review

(In reply to Benoit Girard (On Leave Until June 1st!) from comment #42) > (In reply to Ting-Yuan Huang from comment #41) > > (In reply to Benoit Girard (:BenWa) from comment #21) > > > you're looking for is 'profiler_get_backtrace'. If you only want the > > > > profiler_get_backtrace returns a ProfilerBacktrace of which the only > > interface a caller can get the stacktrace is StreamJSObject(JSStreamWriter& > > b). This can be slow in my use cases because I have to parse the streams > > back into lists of function names and attributes. Would you mind if I add a > > public member function to it to return the list directly? > > Yes please, you should add something to GeckoProfiler.h that takes the > ProfilerBacktrace object and returns a list of struct POD that contains the > data you need. Thanks, I've added a new API to get backtrace without allocating memory in profiler: void profiler_get_backtrace_noalloc(char *output, size_t outputSize);

Attachment #8571176 - Flags: review?(bgirard)

Ting-Yuan Huang

Reporter

Comment 58

•

11 years ago

Attached patch part 7. xpcom interface (obsolete) — Details — Splinter Review

> ::: tools/profiler/nsIMemoryProfiler.idl > @@ +4,5 @@ > > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > > + > > +#include "nsISupports.idl" > > + > > +// profiler = Cc["@mozilla.org/tools/memory-profiler;1"].getService(Ci.nsIMemoryProfiler) > > If this is means as documentation you should write a better comment for > this. If it's just a personal note you should delete it. > > This class could use documentation. > > can you get someone else to review .idl? I'm not qualified to do that. Hi Smaug, would you mind to review this? Thanks.

Attachment #8551165 - Attachment is obsolete: true

Attachment #8571199 - Flags: review?(bugs)

Ting-Yuan Huang

Reporter

Comment 59

•

11 years ago

Attached patch part 8. tracking the memory events (obsolete) — Details — Splinter Review

(In reply to Benoit Girard (On Leave Until June 1st!) from comment #21) > Comment on attachment 8551165 [details] [diff] [review] > part 5. tracking the memory events > > Review of attachment 8551165 [details] [diff] [review]: > ----------------------------------------------------------------- > > Just a high level check, I'll finish tomorrow > > ::: tools/profiler/MemoryProfiler.cpp > @@ +40,5 @@ > > + > > +// gActiveProfiler is the key to a TLS variable which tracks active profiler on > > +// each thread. nobody is a sentinal indicating there's no profile active. > > +static unsigned int gActiveProfiler; > > +static int nobody; > > Please use mfbt/ThreadLocal.h. It deals with the nobody case better, has > better typing and the API is better. A customized allocator to STL containers is used instead of TLS tricks, which is too complicated and error prone. > @@ +52,5 @@ > > +static std::unordered_map<JSRuntime *, GCHeapProfilerImpl *> gRuntimeToGCHeapProfiler; > > +static std::unordered_map<JSRuntime *, bool> gEnabledOnRuntime; > > + > > +// Get the backtrace in PseudoStack. > > +static std::vector<const char *> SPSGetStacktrace() > > This is an implementation detail of the profiler that doesn't belong here. > GeckoProfiler.h is the public interface that should be used. The method > you're looking for is 'profiler_get_backtrace'. If you only want the > pseudostack you should introduce a flag to profiler_get_backtrace to collect > just the pseudostack. Otherwise the method should a stack containing all the > information available. Done in part 6. > @@ +399,5 @@ > > + > > + virtual void sampleNative(void *addr, int32_t size) { > > + AutoMPLock lock(lock_, this); > > + cumulated += size; > > + if (cumulated >= mSampleSize) { > > This will give a bias, see my write up on this in my next comment. Done. Re-implemented using a geometric random number generator. It is not in c++11 style because there's no such things in stlport. > ::: tools/profiler/moz.build > @@ +8,5 @@ > > FAIL_ON_WARNINGS = True > > > > XPIDL_MODULE = 'profiler' > > XPIDL_SOURCES += [ > > + 'nsIMemoryProfiler.idl', > > As far as I can tell this doesn't integrate with the profiler at all. The > only thing it does is use the pseudostack which is mean to be provided by a > public API. > > If the memory profiler doesn't integrate with the profiler it should live > outside of the profiler, or at the very least in a separate subdir. This > will prevent the project from developing cross dependencies. Done. The memory profiler is now factored out to tools/memory-profiler.

Attachment #8571200 - Flags: review?(bgirard)

Ting-Yuan Huang

Reporter

Comment 60

•

11 years ago

Comment on attachment 8551170 [details] [diff] [review] part 9. Interface to Add-Ons (under testing. will file a review request later.)

Attachment #8551170 - Attachment description: part 6. Interface to Add-Ons → part 9. Interface to Add-Ons

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 61

•

11 years ago

reviewed

Comment on attachment 8571199 [details] [diff] [review] part 7. xpcom interface >+++ b/tools/memory-profiler/MemoryProfiler.cpp >@@ -0,0 +1,46 @@ >+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ >+/* This Source Code Form is subject to the terms of the Mozilla Public >+ * License, v. 2.0. If a copy of the MPL was not distributed with this file, >+ * You can obtain one at http://mozilla.org/MPL/2.0/. */ >+ >+#include "MemoryProfiler.h" >+#include "nsIDOMClassInfo.h" >+#include "nsIGlobalObject.h" >+#include "js/TypeDecls.h" >+#include "xpcprivate.h" It is not clear to me why you need to include all these headers. Would be nice to avoid nsIDOMClassInfo.h, js/TypeDecls.h and xpcprivate.h >+MemoryProfiler::GetResults(JSContext *cx, JS::MutableHandle<JS::Value> aResult) JSContext* aCx >+++ b/tools/memory-profiler/nsMemoryProfilerFactory.cpp >@@ -0,0 +1,30 @@ >+/* -*- Mode: C++; tab-width: 20; indent-tabs-mode: nil; c-basic-offset: 4 -*- */ 2 spaces, not 4 >+static const mozilla::Module::CIDEntry kMemoryProfilerCIDs[] = { >+ { &kMEMORY_PROFILER_CID, false, nullptr, MemoryProfilerConstructor }, >+ { nullptr } >+}; >+ >+static const mozilla::Module::ContractIDEntry kMemoryProfilerContracts[] = { >+ { MEMORY_PROFILER_CONTRACT_ID, &kMEMORY_PROFILER_CID }, >+ { nullptr } >+}; >+ >+static const mozilla::Module kMemoryProfilerModule = { >+ mozilla::Module::kVersion, >+ kMemoryProfilerCIDs, >+ kMemoryProfilerContracts >+}; So use 2 spaces for indentation

Attachment #8571199 - Flags: review?(bugs) → review+

Terrence Cole [:terrence]

Comment 62

•

11 years ago

Comment on attachment 8551158 [details] [diff] [review] part 3. Monitoring allocation and gc events in nursery and tenured heaps (In reply to Ting-Yuan Huang from comment #54) > Comment on attachment 8551158 [details] [diff] [review] > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > (In reply to Terrence Cole [:terrence] from comment #11) > > Comment on attachment 8551158 [details] [diff] [review] > > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > > > Review of attachment 8551158 [details] [diff] [review]: > > ----------------------------------------------------------------- > > ::: js/src/gc/Heap.h > > @@ +1280,5 @@ > > > bool > > > TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const > > > { > > > AssertValidColor(this, color); > > > + MemProfiler::MarkTenured(reinterpret_cast<void *>(address())); > > > > A virtual call here is going to be extremely noticable on GC benchmarks. > > What exactly are you trying to do with this count, because it's not going to > > work as is. > > MemProfiler::MarkTenured is an inline static function. When the profiler is > not enabled, the overheads are a memory load and a conditional jump. You've still doubled the number of loads here and added a jump where there wasn't one before -- this would totally be visible in the allocation path, so I'd expect it to be visible in the GC path as well. How does this affect your scores on Octane, in particular Splay? Moreover, why do you need to know the number of times that a mark bit was set, rather than just the number of mark bits that have been set? Would it be possible to post-process the mark bitmaps at the end of each slice to get similar information? > When the profiler is enabled, the situation is similar to that in comment > 26. A function table is not much faster. Still, this is going to (potentially significantly) distort the proportion of time spent in gc marking. Whether that's important depends what you're doing with the numbers, but since you are sampling, I'd think that distortion would actually be pretty important.

Attachment #8551158 - Flags: review?(terrence)

Steve Fink [:sfink] [:s:]

Comment 63

•

11 years ago

reviewed

Comment on attachment 8571143 [details] [diff] [review] part 4. monitoring allocations and frees for ArrayBuffer Review of attachment 8571143 [details] [diff] [review]: ----------------------------------------------------------------- Looks fine on a quick skim

Attachment #8571143 - Flags: review?(sphink) → review+

Ting-Yuan Huang

Reporter

Comment 64

•

11 years ago

(In reply to Terrence Cole [:terrence] from comment #62) > Comment on attachment 8551158 [details] [diff] [review] > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > (In reply to Ting-Yuan Huang from comment #54) > > Comment on attachment 8551158 [details] [diff] [review] > > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > > > (In reply to Terrence Cole [:terrence] from comment #11) > > > Comment on attachment 8551158 [details] [diff] [review] > > > part 3. Monitoring allocation and gc events in nursery and tenured heaps > > > > > > Review of attachment 8551158 [details] [diff] [review]: > > > ----------------------------------------------------------------- > > > ::: js/src/gc/Heap.h > > > @@ +1280,5 @@ > > > > bool > > > > TenuredCell::markIfUnmarked(uint32_t color /* = BLACK */) const > > > > { > > > > AssertValidColor(this, color); > > > > + MemProfiler::MarkTenured(reinterpret_cast<void *>(address())); > > > > > > A virtual call here is going to be extremely noticable on GC benchmarks. > > > What exactly are you trying to do with this count, because it's not going to > > > work as is. > > > > MemProfiler::MarkTenured is an inline static function. When the profiler is > > not enabled, the overheads are a memory load and a conditional jump. > > You've still doubled the number of loads here and added a jump where there > wasn't one before -- this would totally be visible in the allocation path, > so I'd expect it to be visible in the GC path as well. How does this affect > your scores on Octane, in particular Splay? Moreover, why do you need to I only tested 6 times for each and Splay seems to be 3% ~ 5% slower with my patch. > know the number of times that a mark bit was set, rather than just the > number of mark bits that have been set? Would it be possible to post-process > the mark bitmaps at the end of each slice to get similar information? > Yes, tracking how many times each bit is marked is an overkill. The memory profiler maintains a table recording a subset of what's allocated. It resembles memory management so it has to mark and sweep, too. When I'm looking for a place to do the post-process that you mentioned, it seems to me that every objects, whether it has a finalizer or not, are poisoned (in other words, freed) in Arena::finalize(). By tracing where isMarked() is used, it also looks like a good place to do the post-process. Could I just move MemProfiler::MarkTenured here and maybe duplicate the loop and check if the memory profiler is enabled outside the loop (loop unswitching)? > > When the profiler is enabled, the situation is similar to that in comment > > 26. A function table is not much faster. > > Still, this is going to (potentially significantly) distort the proportion > of time spent in gc marking. Whether that's important depends what you're > doing with the numbers, but since you are sampling, I'd think that > distortion would actually be pretty important. Yes, it will distort the speed/time behavior but not (at least not directly) memory footprint.

Terrence Cole [:terrence]

Comment 65

•

11 years ago

(In reply to Ting-Yuan Huang from comment #64) > > > MemProfiler::MarkTenured is an inline static function. When the profiler is > > > not enabled, the overheads are a memory load and a conditional jump. > > > > You've still doubled the number of loads here and added a jump where there > > wasn't one before -- this would totally be visible in the allocation path, > > so I'd expect it to be visible in the GC path as well. How does this affect > > your scores on Octane, in particular Splay? Moreover, why do you need to > > I only tested 6 times for each and Splay seems to be 3% ~ 5% slower with my > patch. That massive of a regression would get backed out very quickly. > > know the number of times that a mark bit was set, rather than just the > > number of mark bits that have been set? Would it be possible to post-process > > the mark bitmaps at the end of each slice to get similar information? > > > > Yes, tracking how many times each bit is marked is an overkill. The memory > profiler maintains a table recording a subset of what's allocated. It > resembles memory management so it has to mark and sweep, too. > > When I'm looking for a place to do the post-process that you mentioned, it > seems to me that every objects, whether it has a finalizer or not, are > poisoned (in other words, freed) in Arena::finalize(). By tracing where > isMarked() is used, it also looks like a good place to do the post-process. > Could I just move MemProfiler::MarkTenured here and maybe duplicate the loop > and check if the memory profiler is enabled outside the loop (loop > unswitching)? That sounds fine. > > > When the profiler is enabled, the situation is similar to that in comment > > > 26. A function table is not much faster. > > > > Still, this is going to (potentially significantly) distort the proportion > > of time spent in gc marking. Whether that's important depends what you're > > doing with the numbers, but since you are sampling, I'd think that > > distortion would actually be pretty important. > > Yes, it will distort the speed/time behavior but not (at least not directly) > memory footprint. My experience with transitioning to generational GC is that this sort of "minor" distortion can actually have some extremely surprising and widespread performance effects. I guess it will depend on what you're trying to get out of it.

Ting-Yuan Huang

Reporter

Comment 66

•

11 years ago

Attached patch part 3. Monitoring allocation and gc events in nursery and tenured heaps (obsolete) — Details — Splinter Review

(In reply to Terrence Cole [:terrence] from comment #65) > That massive of a regression would get backed out very quickly. Thanks. After moving MemProfiler::MarkTenured into the finalizer, the slowdown becomes unnoticeable. > My experience with transitioning to generational GC is that this sort of > "minor" distortion can actually have some extremely surprising and > widespread performance effects. I guess it will depend on what you're trying > to get out of it. One solution might be to embed callbacks in the finalizers of sampled objects. I'm still tracing the codes and can't be sure but it looks like that not every object has a finalizer? If the above is not possible, the best improvement I can think of is to fine tune the implementation tracking the allocations. Similar tools such as hprof (for Java) suffer from performance degradations, too. In our case Splay is 4 times slower when the profiler is on. Perhaps we can warn the users about the implications in documentations and let them interpret what the profiles really mean.

Attachment #8551158 - Attachment is obsolete: true

Attachment #8572481 - Flags: review?(terrence)

Chiajung Hung [:chiajung]

Updated

•

11 years ago

Blocks: 1129249

Benoit Girard (:BenWa)

Updated

•

11 years ago

Attachment #8571176 - Flags: review?(bgirard) → review?(mstange)

Benoit Girard (:BenWa)

Updated

•

11 years ago

Attachment #8571200 - Flags: review?(bgirard) → review?(mstange)

Chiajung Hung [:chiajung]

Updated

•

11 years ago

No longer blocks: 1129249

Markus Stange [:mstange]

Comment 67

•

11 years ago

reviewed

Comment on attachment 8571176 [details] [diff] [review] part 6. A new API to get backtrace without allocating memory in profiler Review of attachment 8571176 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/platform.cpp @@ +1038,5 @@ > + char *bound = output + outputSize - 2; > + output[0] = output[1] = '\0'; > + PseudoStack *pseudoStack = tlsPseudoStack.get(); > + if (!pseudoStack) > + return; {} please

Attachment #8571176 - Flags: review?(mstange) → review+

Markus Stange [:mstange]

Comment 68

•

11 years ago

Comment on attachment 8571200 [details] [diff] [review] part 8. tracking the memory events Review of attachment 8571200 [details] [diff] [review]: ----------------------------------------------------------------- I'm going to wait with this review until the patches it depends on have been reviewed.

Terrence Cole [:terrence]

Comment 69

•

11 years ago

Comment on attachment 8571133 [details] [diff] [review] part 2. hooks in js engine Review of attachment 8571133 [details] [diff] [review]: ----------------------------------------------------------------- I guess Jim isn't going to be able to get a similar memory profiler in your hands as quickly as he hoped. Let's move forward here under the condition that we will reimplement this on top of his primitives when they become available. I'd like to see another copy of this with the changes applied. ::: js/public/MemoryProfiler.h @@ +1,5 @@ > +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- > + * vim: set ts=8 sts=4 et sw=4 tw=99: > + * This Source Code Form is subject to the terms of the Mozilla Public > + * License, v. 2.0. If a copy of the MPL was not distributed with this > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ Instead of adding a new file to js/public/, where it will be exposed as part of our public facing interfaces, please move all of the code in MemoryProfiler.h to jsfriendapi.h. @@ +10,5 @@ > +#include "mozilla/Atomics.h" > + > +struct JSRuntime; > + > +class NativeProfiler { { on newline. @@ +13,5 @@ > + > +class NativeProfiler { > + public: > + virtual ~NativeProfiler() {}; > + virtual void sampleNative(void *addr, uint32_t size) = 0; SpiderMonkey style now has the * next to the type. There is a script in Bug 1144366 to convert patches automatically. @@ +18,5 @@ > + virtual void removeNative(void *addr) = 0; > + virtual void reset() = 0; > +}; > + > +class GCHeapProfiler { { on newline. @@ +31,5 @@ > + virtual void moveNurseryToTenured(void *addrOld, void *addrNew) = 0; > + virtual void reset() = 0; > +}; > + > +class MemProfiler { { on newline. ::: js/src/vm/MemoryProfiler.cpp @@ +1,5 @@ > +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- > + * vim: set ts=8 sts=4 et sw=4 tw=99: > + * This Source Code Form is subject to the terms of the Mozilla Public > + * License, v. 2.0. If a copy of the MPL was not distributed with this > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ MemoryProfiler.cpp should be js/src/gc/. @@ +30,5 @@ > + return &runtime->mMemProfiler; > +} > + > +void > +MemProfiler::start(GCHeapProfiler *aGCHeapProfiler) { { on newline. @@ +37,5 @@ > + sActiveProfilerCount++; > +} > + > +void > +MemProfiler::stop() { { on newline. ::: js/src/vm/Runtime.h @@ +1423,5 @@ > * function to assess the size of malloc'd blocks of memory. > */ > mozilla::MallocSizeOf debuggerMallocSizeOf; > + > + MemProfiler mMemProfiler; The MemoryProfiler instance belongs on the GCRuntime structure in js/src/gc/GCRuntime.h.

Attachment #8571133 - Flags: review?(terrence)

Terrence Cole [:terrence]

Comment 70

•

11 years ago

reviewed

Comment on attachment 8572481 [details] [diff] [review] part 3. Monitoring allocation and gc events in nursery and tenured heaps Review of attachment 8572481 [details] [diff] [review]: ----------------------------------------------------------------- Please realize that despite having negligable performance impact on the platform you tested on, you may still be backed out for regressing octane on other platforms, including small regressions on octane's sub-tests that look like noise over the course of a handful of runs. ::: js/src/gc/Heap.h @@ +497,5 @@ > return nullptr; > } > head.checkSpan(thingSize); > JS_EXTRA_POISON(reinterpret_cast<void *>(thing), JS_ALLOCATED_TENURED_PATTERN, thingSize); > + MemProfiler::SampleTenured(reinterpret_cast<void *>(thing), thingSize); <void*> ::: js/src/gc/Nursery.cpp @@ +207,5 @@ > void *thing = (void *)position(); > position_ = position() + size; > > JS_EXTRA_POISON(thing, JS_ALLOCATED_NURSERY_PATTERN, size); > + MemProfiler::SampleNursery(reinterpret_cast<void *>(thing), size); <void*> ::: js/src/jsgc.cpp @@ +483,5 @@ > size_t nmarked = 0; > > + if (MOZ_UNLIKELY(MemProfiler::enabled())) { > + for (ArenaCellIterUnderFinalize i(&aheader); !i.done(); i.next()) { > + T *t = i.get<T>(); T* t @@ +485,5 @@ > + if (MOZ_UNLIKELY(MemProfiler::enabled())) { > + for (ArenaCellIterUnderFinalize i(&aheader); !i.done(); i.next()) { > + T *t = i.get<T>(); > + if (t->asTenured().isMarked()) { > + MemProfiler::MarkTenured(reinterpret_cast<void *>(t)); <void*>

Attachment #8572481 - Flags: review?(terrence) → review+

Terrence Cole [:terrence]

Updated

•

11 years ago

Attachment #8571143 - Flags: review?(terrence) → review+

Terrence Cole [:terrence]

Updated

•

11 years ago

Attachment #8571147 - Flags: review?(terrence) → review+

Markus Stange [:mstange]

Comment 71

•

10 years ago

Is this waiting for my review now? Do the patches still apply?

part 1. CountAlloc in ReplaceAlloc 11 years ago Ting-Yuan Huang 7.60 KB, patch		Details \| Diff \| Splinter Review
part 2. hooks in js engine 11 years ago Ting-Yuan Huang 8.69 KB, patch		Details \| Diff \| Splinter Review
part 3. Monitoring allocation and gc events in nursery and tenured heaps 11 years ago Ting-Yuan Huang 6.04 KB, patch	terrence : feedback-	Details \| Diff \| Splinter Review
part 4. monitoring allocations and frees for ArrayBuffer 11 years ago Ting-Yuan Huang 4.90 KB, patch	terrence : feedback- sfink : feedback+	Details \| Diff \| Splinter Review
part 5. tracking the memory events 11 years ago Ting-Yuan Huang 24.76 KB, patch	BenWa : feedback-	Details \| Diff \| Splinter Review
part 9. Interface to Add-Ons 11 years ago Ting-Yuan Huang 3.64 KB, patch	jimb : feedback+	Details \| Diff \| Splinter Review
2015-01-26-164237_1246x1060_scrot.png 11 years ago Ting-Yuan Huang 96.79 KB, image/png		Details
part 1. CountAlloc in ReplaceAlloc 11 years ago Ting-Yuan Huang 9.08 KB, patch		Details \| Diff \| Splinter Review
part 2. hooks in js engine 11 years ago Ting-Yuan Huang 9.81 KB, patch		Details \| Diff \| Splinter Review
part 4. monitoring allocations and frees for ArrayBuffer 11 years ago Ting-Yuan Huang 5.19 KB, patch	terrence : review+ sfink : review+	Details \| Diff \| Splinter Review
part 5. Don't emit inline allocation when memory profiler enabled. 11 years ago Ting-Yuan Huang 1.71 KB, patch	terrence : review+	Details \| Diff \| Splinter Review
part 6. A new API to get backtrace without allocating memory in profiler 11 years ago Ting-Yuan Huang 4.06 KB, patch	mstange : review+	Details \| Diff \| Splinter Review
part 7. xpcom interface 11 years ago Ting-Yuan Huang 6.20 KB, patch	smaug : review+	Details \| Diff \| Splinter Review
part 8. tracking the memory events 11 years ago Ting-Yuan Huang 21.13 KB, patch		Details \| Diff \| Splinter Review
part 3. Monitoring allocation and gc events in nursery and tenured heaps 11 years ago Ting-Yuan Huang 6.63 KB, patch	terrence : review+	Details \| Diff \| Splinter Review
Part 1. CountAlloc in ReplaceAlloc 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 9.18 KB, patch		Details \| Diff \| Splinter Review
Part 2. MemoryProfiler hooks in js engine 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 19.27 KB, patch		Details \| Diff \| Splinter Review
Part 3. Monitoring allocation and gc events in nursery and tenured heaps 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 6.79 KB, patch	kanru : review+	Details \| Diff \| Splinter Review
Part 4. Monitoring allocations and frees for ArrayBuffer 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 4.74 KB, patch	kanru : review+	Details \| Diff \| Splinter Review
Part 5. Don't emit inline allocation when memory profiler enabled. 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 1.56 KB, patch	kanru : review+	Details \| Diff \| Splinter Review
Part 6. A new API to get backtrace without allocating memory in profiler 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 4.16 KB, patch	kanru : review+	Details \| Diff \| Splinter Review
Part 7. XPCOM interface for memory profiler 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 6.20 KB, patch	kanru : review+	Details \| Diff \| Splinter Review
Part 8. Tracking the memory events 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 21.21 KB, patch		Details \| Diff \| Splinter Review
Part 9. Memory profiler interface to Add-Ons 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 3.67 KB, patch	kanru : feedback+	Details \| Diff \| Splinter Review
MozReview Request: Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sphink 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?BenWa,terrence 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request	BenWa : review+	Details
MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 8. Tracking the memory events. r?mstange 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request		Details
MozReview Request: Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r?jimb 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 40 bytes, text/x-review-board-request	jimb : review+	Details
0001-Bug-1123237-Part-2.-MemoryProfiler-hooks-in-js-engin.patch 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 13.74 KB, patch	terrence : review+	Details \| Diff \| Splinter Review
0007-Bug-1123237-Part-8.-Tracking-the-memory-events.-r-Be.patch 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 37.08 KB, patch	terrence : review+	Details \| Diff \| Splinter Review
0001-Bug-1123237-Part-10.-Expose-SwapElements-from-nsBase.patch 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 950 bytes, patch	froydnj : review+	Details \| Diff \| Splinter Review
0002-Bug-1123237-Part-11.-Don-t-use-STL-in-memory-profile.patch 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 37.52 KB, patch	cyu : review+	Details \| Diff \| Splinter Review
Part 12. Mark memory profiler as nogc. 10 years ago Kan-Ru Chen [:kanru] (UTC+9) 3.07 KB, patch	terrence : review+	Details \| Diff \| Splinter Review
Merge m-i to m-c, a=merge 10 years ago Johannes Pfrang [:johnp] 197.70 KB, patch		Details \| Diff \| Splinter Review