Closed Bug 1436179 Opened 2 years ago Closed 2 years ago

Gecko profiler uses about 1MB of memory per process

Categories

(Core :: Gecko Profiler, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: bzbarsky, Assigned: emilio)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(5 files)

Looks to me like XRE_InitChildProcess unconditionally calls AutoProfilerInit::AutoProfilerInit which eventually lands in CorePS::Create.

When I'm poking at this via about:memory, sInstance->mLiveThreads.size() is 29.  Each ThreadInfo is about 38KB (1KB sizeof(this) and the rest mRacyInfo).  That's a bit over a megabyte all told.

I don't know whether we can make ThreadInfo (and in particular RacyThreadInfo) smaller, but why do we have this stuff allocated at all?  This is a clean profile, it has no profiler extension installed, it has never done any profiling.
Whiteboard: [MemShrink]
Yikes, that's a bit much.

Most of this is probably the thread's PseudoStack: We reserve space for 1024 pseudo stack frames, and each pseudo stack frame (ProfileEntry) is 32 bytes big. This is where we store PROFILER_LABELs and JS interpreter functions that are currently on the stack in the thread. We do this even when the profiler isn't running so that we're not missing things that are already on the stack when the profiler is started.

However, most threads don't run any JS, and of those, most probably don't ever have more than 5 pseudo stack frames on the stack. So reserving space for 1024 entries is overkill.

This accounts for 32KB, which is quite a bit less than the 38KB for ThreadInfo that you report. I'd be curious to find out where the rest is going.
The 38KB for ThreadInfo breaks down like so:

1KiB -- sizeof(ThreadInfo).  Well, it's 712 bytes, but that's the 1KiB jemalloc bucket.
36KiB -- sizeof(RacyThreadInfo).  The actual size is 32808 bytes, or 40 bytes over 32KiB.
         That 40 bytes comes from its various non-"entries" members: stackPointer,
         mPendingMarkers, mThreadId, mSleep.  So it doesn't fit in the 32KiB jemalloc
         bucket, and the next bucket size is 36KiB.

That's 37888 bytes.  Plus a bit for the thread name means we're typically at about 379xx bytes, which I abbreviated as "38KB" above.  So our missing 6KB are 1KiB size of ThreadInfo itself, 4KiB jemalloc overhead because our size is _just_ over a bucket boundary, and ~1KB because a KiB is bigger than a KB.  ;)

An easy win would be to drop the reserved number of stack frames to 1022 and win 4KiB on that allocation.

Also, the size of the ThreadInfo is almost entirely the UniqueStacks bit (560 bytes there).  Each of the JSON writers is 120 bytes, UniqueJSONStrings is 168 bytes, the hashtables are 40-48 bytes each (40 for ours, 48 for std::map), that sort of thing.  If we dealt with the RacyThreadInfo bits, allocating the UniqueStacks on the heap instead of inline might be the next win.  Especially if we can shrink it under 512 bytes...
Oh, and you can see the jemalloc bucket sizes as of today at https://searchfox.org/mozilla-central/rev/84cea84b12145d752e50ddca6be5462c38510e35/memory/build/mozjemalloc.cpp#59-88 -- we're into "Large" territory here, where the bucket sizes are an integral number of 4KiB pages.
Ah, looks like we both found out the same things over the last few minutes... I'm pasting my comment here anyway, but it's not adding anything new:
> Ok, so most of the unaccounted size comes from the allocator's size bucketing.
> sizeof(ThreadInfo) is 712, aMallocSizeOf(threadInfo) is 1024
> sizeof(RacyInfo) is 32808, aMallocSizeOf(racyInfo) is 36864
>
> sizeof(ThreadInfo) is so big mostly because of mUniqueStacks: sizeof(mUniqueStacks)
> is 560, and that's mostly because it contains three JSONWriters: mFrameTableWriter,
> mStackTableWriter and mUniqueStrings->mStringTableWriter, each of which are 120 bytes
> large. It also contains a few std::maps, which are 48 bytes large each.
> 
> mUniqueStacks is only needed in very rare cases; we should move it into a separate
> heap allocation by using a UniquePtr instead of a Maybe for it.
So I looked into what threads we have here.  They are:

1) GeckoMain
2) Chrome_ChildThread (chromium-ipc)
3) JS Watchdog
4) Socket Thread
5) Hang Monitor
6) Timer Thread
7-21) ImgDecoder
22) ImageIO
23) Stream transport
24) ImageBridgeChild
25) VideoChild
26) ProcessHangMon (not to be confused with Hang Monitor)
27) ProfilerChild
28) DOM File
29) HTML5 Parser

Some of these (hang monitors, JS watchdog) probably never have any labels and definitely do not have any JS.

And yes, that's 15 image decoder threads.  I have 16 logical processors on this machine, hence 15 image decoder threads.  That seems a bit excessive.  Certainly in a world where we also have multiple processes.  I filed bug 1436247 on that.
My suggestion would be to (wherever possible) lazily-allocate these on first enabling of the profiler in that session.

This might complicate the CorePS::Exists() usage, which isn't locked, though making sInstance an atomic will help (or perhaps solve it) and locking can be used in Create.  Or instantiate CorePS (since I don't think it's huge, though I haven't checked), and just make the ThreadInfo's lazy.

Interestingly... I don't see the "new CorePS();" (or platform.ccp:238) anywhere in the asan allocs (and I used "100" %) *or* the DMD dump of live allocs.  Odd.

mstange - How feasible is some form of lazy-allocate?
Flags: needinfo?(mstange)
See comment 1:
> We do this even when the profiler isn't running so that we're not missing things that are already on the stack when the profiler is started.

But we can grow the stack lazily, I think:

Before:
ProfileEntry entries[1024];

After:
Atomic<ProfileEntry*> entries; // read on sampler thread and mutated on this thread

// Read and mutated only on this thread:
UniquePtr<ProfileEntry[]> entryStorage;
size_t entryStorageCapacity;

And then we can allocate+memcpy whenever we run out of capacity. I think this is worth trying out.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #7)
> After:
> Atomic<ProfileEntry*> entries; // read on sampler thread and mutated on this
> thread

> And then we can allocate+memcpy whenever we run out of capacity. I think
> this is worth trying out.

I think this can work, because of the implicit lock the sampler applies to the thread being sampled.
We know that the "current" thread is suspended while the sampler is reading entries.  So:

(yeah, I've left off some niceties here)
   if (not_enough_space) {
     size_t new_size = 2 * old_size;
     ProfileEntry *temp = new ProfileEntry[new_size]
     memcpy(temp, entries, sizeof(entries[0]*old_size);
     // if entries is read here, sampler thread will read all data pointed to by it
     entries = temp;
     // if entries is read here, sampler thread will read all data from new buffer
     entryStorage = temp; 
     // old entries is now freed.  IF we didn't have the guarantee that the sampler thread would complete it's
     // read before letting execution continue, then if it started before entries = temp, it might still be walking
     // the array, and might cause a UAF at this point.
     entryStorageCapacity = new_size; 
   }
   Push new entry
Thanks for spelling it out so clearly! That's exactly the idea.
Depends on: 1436924
Priority: -- → P1
Whiteboard: [MemShrink] → [MemShrink:P1]
So, mUniqueStacks has been dealt with in bug 1436924 - That leaves the largest usage -- mRacyInfo -- untouched.   What can we do to reduce that, at least when the profiler hasn't been used in a session?
Flags: needinfo?(mstange)
I'll give that remaining bit a shot.
Assignee: nobody → emilio
Comment on attachment 8959982 [details]
Bug 1436179: Remove redundant entryStorage member.

https://reviewboard.mozilla.org/r/228774/#review234764

::: js/public/ProfilingStack.h:321
(Diff revision 1)
>      PseudoStack()
>        : stackPointer(0)
>      {}
>  
>      ~PseudoStack() {
> +        delete[] entries;

You're right. I think I only suggested having a separate storage field so that this manual call to delete wouldn't be needed.

Does anything initialize entries or could we be deleting a garbage pointer here? (Does mozilla::Atomic zero-initialize its contents?)
Comment on attachment 8959982 [details]
Bug 1436179: Remove redundant entryStorage member.

https://reviewboard.mozilla.org/r/228774/#review234764

> You're right. I think I only suggested having a separate storage field so that this manual call to delete wouldn't be needed.
> 
> Does anything initialize entries or could we be deleting a garbage pointer here? (Does mozilla::Atomic zero-initialize its contents?)

Yes, it does, afaict. But I can zero-initialize it just to be sure.
So, first two commits work, but they use system malloc, so SpiderMonkey's CI gets very angry at me.

Third commit is an attempt to fix this, which is _almost_ green. I had to add an inline buffer because turns out that we try to add profiler markers to the main process early on, before the JS allocator is initialized. That makes it green on Linux and such, and probably needs a slightly bigger inline buffer on Windows, but nothing terrible.

There _is_ something terrible though, which is that we don't initialize the Spidermonkey allocator on plugin processes, which means that we crash whenever we have more than four markers on the stack in a plugin process, like:

  https://treeherder.mozilla.org/logviewer.html#?job_id=169031626&repo=try

So I see two options... I can detect whether the SpiderMonkey allocator is initialized and then just do nothing. We'll lose profiling inside plugin processes though. Not sure how important is it.

Now, the other option is "just use system malloc", but I'm not aware of any kind of whitelist for this lint of course...

Steve, do you know if it's acceptable to add the profiler code to the exceptions to use system malloc? I guess it'd be a matter of adding an exception in:

  https://searchfox.org/mozilla-central/rev/877c99c523a054419ec964d4dfb3f0cadac9d497/config/check_vanilla_allocations.py#143
Flags: needinfo?(sphink)
(via IRC) Yes, I think that's fine. We want JS-related stuff to use the JS allocators, for OOM testing, heap partitioning for security, and sometimes for memory pressure estimation, but none of those seem to apply here.
Flags: needinfo?(sphink)
Blocks: 1440336
Is anything blocking landing this?
Flags: needinfo?(emilio)
It's waiting for my review. I'm going to do it this week. I wanted to test the impact on code size locally first and I haven't had a chance to do that yet.
Flags: needinfo?(mstange)
Flags: needinfo?(emilio)
Comment on attachment 8959981 [details]
Bug 1436179: Lazily grow the ProfileEntryStorage.

https://reviewboard.mozilla.org/r/228772/#review242004
Attachment #8959981 - Flags: review?(mstange) → review+
Comment on attachment 8959982 [details]
Bug 1436179: Remove redundant entryStorage member.

https://reviewboard.mozilla.org/r/228774/#review242008
Attachment #8959982 - Flags: review?(mstange) → review+
Comment on attachment 8960373 [details]
Bug 1436179: Move allocation out of line.

https://reviewboard.mozilla.org/r/229156/#review242010
Attachment #8960373 - Flags: review?(mstange) → review+
Comment on attachment 8960482 [details]
Bug 1436179: Remove inline storage, which is now not needed.

https://reviewboard.mozilla.org/r/229232/#review242016

::: js/src/perf/ProfilingStack.cpp:40
(Diff revision 2)
> -    // It's important that `entries` / `entryCapacity` / `stackPointer` remain in sync here all
> -    // the time.
> +    // It's important that `entries` / `entryCapacity` / `stackPointer` remain sane here at all
> +    // times.

How about 'consistent' instead of 'sane'?
Attachment #8960482 - Flags: review?(mstange) → review+
These patches look great to me. Thanks for doing this, and sorry for the slow review!
This work does increase the binary size a little, but I think I can get some wins there.
Comment on attachment 8960373 [details]
Bug 1436179: Move allocation out of line.

https://reviewboard.mozilla.org/r/229156/#review242022

::: js/public/ProfilingStack.h:324
(Diff revision 2)
> -        });
> +        if (MOZ_LIKELY(entryCapacity > oldStackPointer) || MOZ_LIKELY(ensureCapacitySlow()))
> +            entries[oldStackPointer].initCppFrame(label, dynamicString, sp, line, kind, category);

Please add braces.

::: js/public/ProfilingStack.h:342
(Diff revision 2)
> -        });
> +        if (MOZ_LIKELY(entryCapacity > oldStackPointer) || MOZ_LIKELY(ensureCapacitySlow()))
> +            entries[oldStackPointer].initJsFrame(label, dynamicString, script, pc);

Here too.
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc008b8e4c3c
Lazily grow the ProfileEntryStorage. r=mstange
js/src/perf is a bit of a weird location for the ProfilingStack.cpp file. The rest of the profiler code lives in js/src/vm (see GeckoProfiler.cpp) - js/src/perf is for some profiling stuff for developers but not code that's actively exposed.

Do you mind moving it? In general you should probably have asked for review from a SpiderMonkey peer.
(In reply to Jan de Mooij [:jandem] from comment #36)
> js/src/perf is a bit of a weird location for the ProfilingStack.cpp file.
> The rest of the profiler code lives in js/src/vm (see GeckoProfiler.cpp) -
> js/src/perf is for some profiling stuff for developers but not code that's
> actively exposed.
> 
> Do you mind moving it? In general you should probably have asked for review
> from a SpiderMonkey peer.

Ah, my apologies, will back out, move, and request you for review.
Attachment #8967666 - Flags: review?(jdemooij)
Backout by ecoal95@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/40e3ad41beaa
Back out changeset cc008b8e4c3c for landing without JS peer review. r=me
Comment on attachment 8967666 [details] [diff] [review]
Same patch, in js/vm.

Review of attachment 8967666 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry I didn't mean this had to be backed out :) Thanks for moving the file; I didn't look closely at the actual code changes because mstange reviewed them.

::: js/src/vm/ProfilingStack.cpp
@@ +23,5 @@
> +
> +    delete[] entries;
> +}
> +
> +bool PseudoStack::ensureCapacitySlow()

Nit: \n after bool
Attachment #8967666 - Flags: review?(jdemooij) → review+
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/599270aa0f9b
Lazily grow the ProfileEntryStorage. r=mstange,jandem
https://hg.mozilla.org/mozilla-central/rev/599270aa0f9b
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1454327
No longer depends on: 1454327
You need to log in before you can comment on or make changes to this bug.