Going to use this to track various specific bits. We have a lot of heap-unclassified (almost 40% of the heap) in a vanilla content process with nothing really loaded. Also, I haven't even looked at the non-heap-allocated overhead.
Summary: Reduce content process memory overhead → [meta] Reduce content process memory overhead
I have a dump of memory allocated by content processes using ASAN's __sanitizer_print_memory_profile(); it's a little confused because both content processes interleave their dumps (I'll need to find a way to separate them, or only dump 1). The two processes have allocated (still live) ~17MB and ~21.5MB; for reference the two processes show a size in System Monitor (on Fedora) of ~24.5 and 28MB when I looked in a different run with the same profile. Probably the 17MB is the 'warm' process that hasn't loaded any content, and the other is showing a blank page. Comparing the two will also be interesting.

The dump is large (since I told it to dump the allocation stacks of *all* allocations; total was ~77K for the small, and 100K for the larger). I'll upload the raw files, but the highlights:

We're spending a lot on alignment(?)
We have a LOT of power-of-2-sized buffers -- IIRC jemalloc isn't efficient on powers-of-two (not unusual) -- Glandium?
The profiler is allocating a bunch of memory up front in case it needs it when turned on (I presume)
Lots of HashTables - many probably far from filled, and some are static once created
Prefs.... (njn is working on this!)
fontconfig is a PIG!!!!
Telemetry is a non-0 %-age
Quite a bit (scattered) of script data/source/etc

~8% is in posix_memalign from slab_allocator_alloc_chunk() in gslice.c (2500+ allocations)
~3% (~850K) in ~25 allocations from ThreadInfo::ThreadInfo in tools/profiler/core/ThreadInfo.cpp, allocated when the threads are registered with the profiler, or 32808 bytes per thread. That's a lot to spend for the profiler when I haven't installed it in that profile, let alone used it. Lazy allocation, perhaps?
~3% in many allocations from g_realloc() (no further backtrace)
~3% (650K) in 21 allocations from performXDR<> called from js::XDRScript<>
~2% in 5 allocations (of ~88K each) from js::DuplicateString(), called from js::ScriptSource::setSourceCopy()
~1% (262144) in 1 allocation from PLDHashTable::ChangeTable from Preferences(!) (SetLatePreferences)
~1% (262144) in 16 allocations from DoInterfaceDescriptor(XPTArena...), called a ways above from DoRegisterXPT()
~1% in ~8000 allocations from FcCharSetFindLeafCreate() (fontconfig)
~1% in ~7700 allocations from FcValueListCreate/
~1% in 621 allocations from JSScript::createScriptData() (from XDRScript<>)
~0.5% in 2 allocations from __strtof_l()
~0.5% in FcPatternObjectInsertElt()
~0.5% (131072) from js::detail::HashTable<>::changeTableSize()
~0.5% (131072) in 2 allocations from PLDHashTable::Add() (an XPTInterfaceInfoManager table)
~0.5% (131072) in 1 allocation from PLDHashTable::Add() from GetAtomHashEntry() when in RegisterStaticAtoms
~0.5% (131072) in PLDHashTable::Add() called from TelemetryHistogram::InitializeGlobalState()
(perhaps a couple more 131072 or 262144 allocations)
~0.5% (122K) in XPT_DoCString() from XPTInterfaceInfoManager::RegisterBuffer()
~0.4% (113K) in 95 allocations from _dl_new_object()
~0.4% (111K) from FcCharSetPutLeaf
~0.4% in 17xx allocations from PLDHashTable::Add for strings from TelemetryHistogram::InitializeGlobalState()
~0.4% (102K) in 50 allocations from nsPersistentProperties::SetStringProperty()
~0.4% (101K) in many allocations from FcValueSave()
~0.4% (98K) in 3 allocations from ThreadInfo::ThreadInfo()
~0.4% (98304) in 6 allocations from xptiInterfaceEntry::Create()
~0.4% (98304) in 2 allocations from PLDHashTable::Add()
98K in 12 allocs from FcConfigAllocExpr
81920 (10*8192!) in 10 allocations from DuplicateString<char, 8192ul, 1ul> from Pref::Pref()
Bunch more allocs from ThreadInfo::ThreadInfo() (profiler)
65536 in 1 alloc from HashTable<>::createTable() from AtomizeAndCopyChars<>
65536 in 1 allocation from nsAtomFriend::RegisterStaticAtoms()
65536 in 2 allocations from gfxFcPlatformFontList::AddPatternToFontList()/InitFontListForPlatform()
65536 in 2 allocs from js::LifoAlloc::newChunkWithCapacity()
65536 in 1 alloc from nsComponentManagerImpl::RegisterCIDEntryLocked()
65520 in 2 allocs from nsPurpleBuffer::Put()
60K in 65 allocs from ft_mem_qalloc() (freetype)
60K in ~2500 allocs from nsAtomFriend::RegisterStaticAtoms()
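To make the "lazy allocation, perhaps?" idea concrete, here is a minimal sketch of what deferring the per-thread reservation could look like. It is not the profiler's actual code: LazyThreadInfo and its members are hypothetical names, and the only figures taken from the comment above are the 32808 bytes per thread and the ~25 registered threads.

```python
class LazyThreadInfo:
    """Hypothetical stand-in for a per-thread profiler record."""

    BUFFER_BYTES = 32808  # per-thread figure quoted in the comment above

    def __init__(self, name):
        self.name = name
        self._buffer = None  # nothing is reserved at registration time

    @property
    def buffer(self):
        # Allocate on first access only, e.g. when profiling is turned on.
        if self._buffer is None:
            self._buffer = bytearray(self.BUFFER_BYTES)
        return self._buffer

    def allocated_bytes(self):
        return 0 if self._buffer is None else len(self._buffer)


# ~25 threads registered, as in the dump above.
threads = [LazyThreadInfo(f"thread-{i}") for i in range(25)]

eager_cost = 25 * LazyThreadInfo.BUFFER_BYTES
lazy_cost_before_use = sum(t.allocated_bytes() for t in threads)

threads[0].buffer  # first access triggers the allocation for one thread
lazy_cost_after_one_use = sum(t.allocated_bytes() for t in threads)

print(eager_cost, lazy_cost_before_use, lazy_cost_after_one_use)
```

With eager allocation the 25 threads cost 25 * 32808 = 820200 bytes up front, which is in line with the ~850K reported; in the lazy variant nothing is spent until a thread's buffer is first touched.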
> We have a LOT of power-of-2-sized buffers -- IIRC jemalloc isn't efficient on
> powers-of-two (not unusual) -- Glandium?

No, powers-of-two are the best case, along with everything that's exactly matching a class size, or is a multiple of the page size for larger sizes.
> No, powers-of-two are the best case, along with everything that's exactly
> matching a class size, or is a multiple of the page size for larger sizes.

Good. (IIRC at one point it was better to be power-of-2-minus-n, though perhaps I'm thinking of some other system/allocator.)
You might be thinking about things like nsTAutoArray, which have an embedded header, so a better size for it is jemalloc_class_size - header_size.
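A rough sketch of the arithmetic behind that: if the header lives in the same allocation as the elements, a power-of-two payload spills into the next size class. The model below is deliberately simplified and rests on two assumptions that are not from this bug: size classes are modelled as plain powers of two (the real jemalloc class table is finer-grained), and the embedded header is taken to be 8 bytes. All function names are illustrative.

```python
HEADER_BYTES = 8  # assumed size of the embedded array header


def size_class(request):
    """Round up to the next power of two (simplified model, not the real class table)."""
    size = 1
    while size < request:
        size *= 2
    return size


def slop(elements, element_bytes):
    """Bytes wasted when header plus elements are rounded up to a size class."""
    request = HEADER_BYTES + elements * element_bytes
    return size_class(request) - request


def best_capacity(class_bytes, element_bytes):
    """Largest element count that still fits in class_bytes alongside the header."""
    return (class_bytes - HEADER_BYTES) // element_bytes


print(slop(64, 8))            # 64 elements + header spills into the next class
print(slop(63, 8))            # 63 elements + header fills a 512-byte class exactly
print(best_capacity(512, 8))  # capacity chosen as class_size - header_size
```

Under these assumptions, 64 eight-byte elements waste 504 bytes while 63 waste none, which is the "power-of-2-minus-n" effect for containers with an embedded header.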
(In reply to Randell Jesup [:jesup] from comment #1)
> I have a dump of memory allocated by content processes using ASAN's
> __sanitizer_print_memory_profile()

You'll want to be careful with that -- I'm pretty sure ASAN will be using its own allocator instead of jemalloc, so it's not a representative run.

You can use DMD for vanilla heap profiling. It works with jemalloc so will give representative results. See the docs about "live mode" at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD.
So, DMD results (I was using ASAN before): similar of course, since I don't care much about a few bytes; the biggest difference would be in slop and alignment, I imagine.

Comments above are still valid; we can now see that gtk is using a moderate amount, and no surprise the fontconfig stuff is called from InitFontList. Lots in total in js::ScriptSource, various things involving Atoms, and XPTInterfaceInfoManager::RegisterBuffer() is a big hotspot. LifoAlloc then comes in with a lot of different little allocations (probably not surprising).

6% (811K) in ThreadInfo::ThreadInfo (note: another couple % below with different stacks)
4.5% in js::ScriptSource::performXDR<>
3.6% (491K) from XPTInterfaceInfoManager::RegisterBuffer() (a few % more below)
2.3% in js::ScriptSource::setSourceCopy()
2% in js::SharedScriptData::new_ from JSScript::createScriptData()
1.9% (262144) in PLDHashTable::ChangeTable() from SetLatePreferences()
1.7% in js::SharedScriptData::new_ from JSScript::fullyInitFromEmitter
1.6% in FcPatternObjectAddWithBinding() (from InitFontList())
1.5% in gtk_css_selector_tree_builder_build from (near top) gtk_settings_get_for_display()
1.4% from js::ScriptSource::setSourceCopy()
1.2% in glibc _dl_new_object()
1% (~30% cumulative) in nsPersistentProperties::SetStringProperty() from nsStringBundle::LoadProperties()
1% from XPTInterfaceInfoManager::RegisterBuffer() (again, slightly different stack)
1% in PLDHashTable::Add() from nsAtomFriend::RegisterStaticAtoms()
1% (131072, 1 alloc) in gtk_css_provider_load_internal() from gtk_settings_get_for_display
1% (ditto) in PLDHashTable::Add() from TelemetryHistogram::InitializeGlobalState()
1% (ditto) in js::AtomizeChars from the frontend::GeneralParser<>
1% (ditto) in js::AtomizeChars from js::XDRAtom<>/js::XDRScript<>
1% in FcPatternObjectInsertElt from InitFontList()
0.85% in PLDHashTable::Add from XPTInterfaceInfoManager::RegisterBuffer() (different stack)
0.8% in gtk_css_ruleset_add() from gtk_settings_get_for_display
0.8% in js::SharedScriptData::new_() from JSScript::fullyInitFromEmitter()
0.8% DuplicateString() from pref_SetPref()
0.7% in FcCharSetPutLeaf from InitFontList
0.7% in js::LifoAlloc::newChunkWithCapacity() from js::frontend::PerHandlerParser
0.6% from js::SharedScriptData::new_ from JSScript::createScriptData (different stack)
0.6% from nsAtomFriend::RegisterStaticAtoms()
0.5% from FcPatternObjectAddWithBinding
0.5% from ThreadInfo::ThreadInfo (different stack)
0.5% from XPTInterfaceInfoManager::RegisterBuffer() (different stack)
0.5% from call_init() (dl_init.c) in glibc
0.5% (45% cumulative) in js::AtomizeChars (different stack)
<bunch of 65536 byte allocs from Component Manager, HashTables for StaticAtoms, JSScript::shareScriptData()>
<several 61440-byte totals (15 allocs) from LifoAlloc, and a bunch in the 50K region with 13 allocs from LifoAlloc>
<4 ~36K alloc stacks from ThreadInfo::ThreadInfo -- different callers - HangMonitor, WatchdogMain, BackgroundHangManager -- I wonder if there's some duplication here that could be eliminated>
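Since the same subsystems show up under several different stacks, it can help to roll the percentages up per subsystem. The sketch below does that for a subset of the DMD figures quoted above; the grouping into subsystems is my own rough assumption, and stacks that are only described qualitatively (the 65536-byte and LifoAlloc groups) are left out, so the totals are lower bounds.

```python
from collections import defaultdict

# (subsystem, percent of heap) pairs copied from the DMD list above.
# The subsystem labels are an illustrative grouping, not DMD output.
entries = [
    ("profiler", 6.0), ("profiler", 0.5),                      # ThreadInfo::ThreadInfo
    ("js scripts", 4.5), ("js scripts", 2.3), ("js scripts", 2.0),
    ("js scripts", 1.7), ("js scripts", 1.4), ("js scripts", 0.8),
    ("js scripts", 0.6),                                       # ScriptSource / SharedScriptData
    ("xpt", 3.6), ("xpt", 1.0), ("xpt", 0.85), ("xpt", 0.5),   # RegisterBuffer stacks
    ("fontconfig", 1.6), ("fontconfig", 1.0), ("fontconfig", 0.7),
    ("fontconfig", 0.5),                                       # Fc* from InitFontList
    ("gtk", 1.5), ("gtk", 1.0), ("gtk", 0.8),                  # gtk_settings_get_for_display
]

totals = defaultdict(float)
for subsystem, percent in entries:
    totals[subsystem] += percent

# Largest consumers first.
for subsystem, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{subsystem:12s} {percent:5.2f}%")
```

On these numbers script source/data is the largest single group, followed by the profiler and the XPT registration stacks.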
Bug 1436179 tracks the ThreadInfo/profiler bits.
Raw data. Note lsan4_xaa is the first 100K lines (which goes down to ~1K total allocation/stack); the tail is LONG, and xaa is only about 1/15th of the full file. Also note that lsan has a mix of two content processes: one that is displaying a blank page, and one that hasn't been used yet.
https://app.box.com/folder/46288716831
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Assignee: rjesup → nobody
Status: ASSIGNED → NEW
cc'ing felipe who might want to be in the loop on this.
Depends on: 1497746
Depends on: 1514869
Depends on: 1430810