Closed Bug 964096 Opened 11 years ago Closed 11 years ago

capture heap usage in SPS profiler

Categories

(Core :: Gecko Profiler, defect, P4)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED DUPLICATE of bug 988332

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

(Keywords: perf, Whiteboard: [c=memory p=3 s=2014.05.09 u=][MemShrink:P2])

Attachments

(5 files, 3 obsolete files)

In addition to capturing stack frames it might also be useful for SPS to sample the current total amount of heap allocated. This should be relatively inexpensive since its stored in a jemalloc counter. For example, see the attached screenshot showing the results of a proof-of-concept. Here we can see that there is elevated memory usage while RIL is dispatching an event using an ES6 for-of iterator. This might be a dupe of bug 747605, although at first glance that one appears more about capturing exceptional conditions as markers.
This patch calls jemalloc_stats while the profiler is active with the "memory" feature enabled. Note, this will cause Firefox on Mac OS X to spin. Based on the comments in jemalloc I'm going to guess that this might be due to hitting priority inversion or something with the spinlocks used on that platform. This patch works on platforms like b2g that use pthread mutexes.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Attachment #8365716 - Flags: feedback?(n.nethercote)
Attachment #8365716 - Flags: feedback?(bgirard)
This is a very hacky modification to cleopatra to show the memory values. Right now it just replaces the normal histogram information. I think ideally we would show this in a parallel histogram below where the stack height is shown. Benoit, any suggestions on how best to achieve this?
Attachment #8365717 - Flags: feedback?(bgirard)
See Also: → 747605
It's a good idea, but heap-allocated probably isn't the best thing to measure. See http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsIMemoryReporter.idl#253 for various global statistics that we can already easily get. I'd suggest "residentFast", or maybe "explicit" but it takes longer to measure so might be too expensive. Or you could get multiple. BTW, the devtools folks are working on similar things, such as bug 923275, though that might take a while longer and so this bug may well be fixed more quickly.
Comment on attachment 8365716 [details] [diff] [review] Sample heap allocation total while profiling. Review of attachment 8365716 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/TableTicker.cpp @@ +537,5 @@ > bool recordSample = true; > #if defined(XP_WIN) > bool powerSample = false; > #endif > + size_t allocatedHeap = 0; |heapAllocated| is the name used elsewhere for this measurement. @@ +565,5 @@ > + jemalloc_stats_t stats; > + jemalloc_stats(&stats); > + allocatedHeap = stats.allocated; > + } > +#endif You should use nsMemoryReporterManager::GetHeapAllocated() (or GetResidentFast()) here. ::: tools/profiler/TableTicker.h @@ +209,5 @@ > bool mProfileThreads; > bool mUnwinderThread; > bool mProfileJava; > bool mProfilePower; > + bool mMemory; Please use the more specific name from nsIMemoryReporter.idl. In this case it's "heap-allocated".
Attachment #8365716 - Flags: feedback?(n.nethercote)
> > + bool mMemory; > > Please use the more specific name from nsIMemoryReporter.idl. In this case > it's "heap-allocated". Oh, and please do likewise in the UI itself.
Here's an updated gecko patch based on Nicholas's feedback. Patch compiles, but has not been tested on device yet. It now pulls rss and heap-allocated using separate features; "rss" and "heap-allocated". I could see these both being useful under different circumstances. I have not updated the cleopatra patch to read the additional profile data yet.
Attachment #8365716 - Attachment is obsolete: true
Attachment #8365716 - Flags: feedback?(bgirard)
Attachment #8365754 - Flags: feedback?(bgirard)
See Also: → 923275
Attachment #8365756 - Flags: feedback?(bgirard)
Attachment #8365754 - Attachment is obsolete: true
Attachment #8365754 - Flags: feedback?(bgirard)
Comment on attachment 8365756 [details] [diff] [review] Sample heap allocation total while profiling. (v3) Review of attachment 8365756 [details] [diff] [review]: ----------------------------------------------------------------- Patch looks good. I leave it up to :njn to review if this is what we want to measure or not. ::: tools/profiler/ProfileEntry.cpp @@ +410,5 @@ > b.DefineProperty(sample, "time", entry.mTagFloat); > } > } > break; > + case 'z': Please document these tags in GeckoProfiler.h ::: tools/profiler/TableTicker.cpp @@ +554,5 @@ > powerSample = true; > } > #endif > > +#if !defined(XP_DARWIN) No need to ifdef here. The feature should always be disabled on mac and never reported. @@ +557,5 @@ > > +#if !defined(XP_DARWIN) > + // XXX: This spin-locks on Mac OS X. > + if (mRss && mMemoryReporterManager) { > + mMemoryReporterManager->GetResidentFast(&rss); Can you *promise* that everything that happens here is async signal safe for non mac? (See 'man signal') ::: tools/profiler/TableTicker.h @@ +71,5 @@ > mUnwinderThread = hasFeature(aFeatures, aFeatureCount, "unwinder") || sps_version2(); > mAddLeafAddresses = hasFeature(aFeatures, aFeatureCount, "leaf"); > mPrivacyMode = hasFeature(aFeatures, aFeatureCount, "privacy"); > mAddMainThreadIO = hasFeature(aFeatures, aFeatureCount, "mainthreadio"); > + mRss = hasFeature(aFeatures, aFeatureCount, "rss"); When adding a new feature you can list them in get_features. Make sure it's not listed on OSX.
Attachment #8365756 - Flags: feedback?(bgirard) → feedback+
Comment on attachment 8365756 [details] [diff] [review] Sample heap allocation total while profiling. (v3) Review of attachment 8365756 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/ProfileEntry.cpp @@ +413,5 @@ > break; > + case 'z': > + { > + if (sample) { > + b.DefineProperty(sample, "rss", entry.mTagFloat); Please call it "resident" for consistency with about:memory. ::: tools/profiler/TableTicker.cpp @@ +532,5 @@ > bool recordSample = true; > #if defined(XP_WIN) > bool powerSample = false; > #endif > + int64_t rss = 0; Ditto. @@ +557,5 @@ > > +#if !defined(XP_DARWIN) > + // XXX: This spin-locks on Mac OS X. > + if (mRss && mMemoryReporterManager) { > + mMemoryReporterManager->GetResidentFast(&rss); Benoit's question about async signal safety is a good one. I think you're ok on Linux, because it just reads /proc/<pid>/statm, which involves open/read/close, all of which are async signal safe (according to |man sigaction|). I'm not sure about the other OSes. Whatever you conclude, additional comments explaining the safety would be good.
Attachment #8365756 - Flags: review+
Comment on attachment 8365756 [details] [diff] [review] Sample heap allocation total while profiling. (v3) Review of attachment 8365756 [details] [diff] [review]: ----------------------------------------------------------------- I'm getting errors compiling on arch linux with the latest git version of firefox and this patch. 9:31.60 /home/v/work/gecko-dev/xpcom/base/nsMemoryReporterManager.h:34:7: error: use of undeclared identifier 'do_GetService' 9:31.60 do_GetService("@mozilla.org/memory-reporter-manager;1"); 9:31.60 ^ 9:31.60 In file included from /home/v/work/gecko-dev/tools/profiler/platform-linux.cc:73: 9:31.60 In file included from ../../dist/include/nsThreadUtils.h:18: 9:31.60 ../../dist/include/nsCOMPtr.h:557:15: error: no matching constructor for initialization of 'nsCOMPtr_base' 9:31.60 : NSCAP_CTOR_BASE(aRawPtr) 9:31.60 ^ ~~~~~~~ 9:31.60 ../../dist/include/nsCOMPtr.h:479:30: note: expanded from macro 'NSCAP_CTOR_BASE' 9:31.60 #define NSCAP_CTOR_BASE(x) nsCOMPtr_base(x) 9:31.60 ^ 9:31.60 /home/v/work/gecko-dev/xpcom/base/nsMemoryReporterManager.h:187:9: note: in instantiation of member function 'nsCOMPtr<nsITimer>::nsCOMPtr' requested here 9:31.60 mTimer(aTimer), 9:31.60 ^ 9:31.60 ../../dist/include/nsCOMPtr.h:407:1: note: candidate constructor (the implicit copy constructor) not viable: cannot convert argument of incomplete type 'nsITimer *' to 'const nsCOMPtr_base' 9:31.60 nsCOMPtr_base 9:31.61 ^ 9:31.61 ../../dist/include/nsCOMPtr.h:423:7: note: candidate constructor not viable: cannot convert argument of incomplete type 'nsITimer *' to 'nsISupports *' 9:31.61 nsCOMPtr_base( nsISupports* rawPtr = 0 ) 9:31.61 ^ 9:31.61 2 errors generated.
Here's the profile I used to generate the original screenshot.
Updated patch to fix compilation errors. I had to add a couple #include's to nsMemoryReporterManager.h as it appears to have been relying on certain types already being defined. Carrying flags forward. I have not investigated signal safety yet.
Attachment #8366382 - Flags: review+
Attachment #8366382 - Flags: feedback+
Attachment #8365756 - Attachment is obsolete: true
Comment on attachment 8366382 [details] [diff] [review] Sample heap allocation total while profiling. (v4) Review of attachment 8366382 [details] [diff] [review]: ----------------------------------------------------------------- ::: xpcom/base/nsMemoryReporterManager.h @@ +4,5 @@ > * License, v. 2.0. If a copy of the MPL was not distributed with this > * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > > #include "nsIMemoryReporter.h" > +#include "nsITimer.h" Why do you need to include nsITimer.h? I only see |nsITimer*| and |nsCOMPtr<nsITimer>|, so isn't the forward declaration enough?
(In reply to Nicholas Nethercote [:njn] from comment #13) > Why do you need to include nsITimer.h? I only see |nsITimer*| and > |nsCOMPtr<nsITimer>|, so isn't the forward declaration enough? No, it needs the full type in order to satisfy |nsCOMPtr_base()| which takes an nsISupports. See the error in comment 10.
Comment on attachment 8366382 [details] [diff] [review] Sample heap allocation total while profiling. (v4) Review of attachment 8366382 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/TableTicker.cpp @@ +564,5 @@ > + if (mHeapAllocated && mMemoryReporterManager) { > + mMemoryReporterManager->GetHeapAllocated(&heapAllocated); > + } > +#endif > + We should sample these things only on one thread. It's simplest to do it on the main thread. You can use `NS_IsMainThread()` to find out if this is the main thread.
I talked to Benoit and we concluded that it's a good idea to separate memory and power data from any particular thread in the output (JSON). That way we'll have a much easier time on the frontend when we display the results and it just makes more sense because it's not thread specific data.
I'll make the necessary changes for that.
Priority: -- → P4
Whiteboard: [c=memory p= u= s=][MemShrink] → [c=memory p= s= u=1.3][MemShrink]
Whiteboard: [c=memory p= s= u=1.3][MemShrink] → [c=memory p=3 s= u=1.3][MemShrink]
Whiteboard: [c=memory p=3 s= u=1.3][MemShrink] → [c=memory p=3 s= u=][MemShrink]
Blocks: 928475
This patch is incomplete. It shows one way to separate the power and memory data in the output format of the profiler. We still store the power and memory info on the main thread's buffer, but at least it's logically separated when we build the JSON.
It looks like getting the memory information causes calls to Malloc from what Viktor was showing me. We need this fixed before we can take this fix.
Flags: needinfo?(bkelly)
Comment on attachment 8366382 [details] [diff] [review] Sample heap allocation total while profiling. (v4) Review of attachment 8366382 [details] [diff] [review]: ----------------------------------------------------------------- I can't continue working/testing the output format change until this patch is fixed. ::: tools/profiler/TableTicker.cpp @@ +557,5 @@ > > +#if !defined(XP_DARWIN) > + // XXX: This spin-locks on Mac OS X. > + if (mRss && mMemoryReporterManager) { > + mMemoryReporterManager->GetResidentFast(&rss); I'm getting a deadlock on Linux coming from here. There seems to be a malloc called somewhere inside this function. I'm also having problems on Windows, but I'm not sure if they are related.
Hmm, ok. I'll look into this. I thought I verified at some point RSS was just a syscall. Unfortunately I have a couple of blockers I need to address before revisiting the profiler enhancement. Sorry for the delay.
Hmm, this is what the code does when reading resident size on linux: http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#43 Its not directly allocating any memory. I wonder if fopen() allocates some resources behind the scenes. Are you have any problems with the heap measurement? Personally I would be ok just going with the heap-allocated values to start as I think its still somewhat useful. Resident size shows better total footprint of the process of course, but it can be effected by things like paging, swapping, etc that are not directly tied to the current stack frame. Nicholas, would you be ok if we got something initially going without just heap-allocated and table resident measurements for now?
Flags: needinfo?(bkelly) → needinfo?(n.nethercote)
> fopen() is not in the list of async-signal-safe functions It should be easy to rewrite the relevant function to use open() instead of fopen(). > Are you have any problems with the heap measurement? Personally I would be > ok just going with the heap-allocated values to start as I think its still > somewhat useful. Resident size shows better total footprint of the process > of course, but it can be effected by things like paging, swapping, etc that > are not directly tied to the current stack frame. The main problem with "heap-allocated" is that it doesn't include the JavaScript GC heap, which is a big and important part of our memory consumption. I certainly won't veto the use of "heap-allocated", but it only shows part of the picture. If I had to choose a single memory consumption measurement, I would always choose "resident".
Flags: needinfo?(n.nethercote)
You can move non async signal safe stuff to happen in the signal sending thread where we can do anything and pass the information when signaling. We do this for timestamps. Of course you want to make sure you can quickly gather the data you want otherwise you're going to turn 1ms sampling into 1ms+your delay cascading each sample to be off. This is really bad if your function will randomly lock as well. Right now the signal sender is re-implemented on each platform but it's fairly trivial to merge it and make it platform neutral, ideally we would do that first.
Whiteboard: [c=memory p=3 s= u=][MemShrink] → [c=memory p=3 s= u=][MemShrink:P2]
Comment on attachment 8365717 [details] [review] Pull request at https://github.com/bgirard/cleopatra/pull/24 feedback given on github
Attachment #8365717 - Flags: feedback?(bgirard)
Depends on: 991412
No longer blocks: 928475
See Also: → 928475
See Also: → 988332
Bug 988332 looks good to me!
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Whiteboard: [c=memory p=3 s= u=][MemShrink:P2] → [c=memory p=3 s=2014.05.09 u=][MemShrink:P2]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: