Closed Bug 1375281 Opened 7 years ago Closed 7 years ago

nsWindowMemoryReporter hurts performance badly

Categories

(Core :: DOM: Core & HTML, enhancement)

enhancement
Not set
major

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox56 --- fixed

People

(Reporter: MatsPalmgren_bugz, Assigned: cyu)

References

Details

(Keywords: perf)

Attachments

(3 files, 3 obsolete files)

https://perf-html.io/public/830842928218cd9298c70db69582695fc21e8b07/calltree/?invertCallstack&thread=5

I'm looking at the second entry, "je_jemalloc_stats", and under that
the first entry "nsMemoryReporterManager::GetHeapAllocated".
Expanding that thread leads to "nsObserverService::NotifyObservers cycle-collector-begin".

At first I thought "oh well this is only because we're collecting perf data",
but it seems this observer is always registered, also in release builds:
http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/dom/base/nsWindowMemoryReporter.cpp#116
http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/layout/build/nsLayoutStatics.cpp#272

Please tell me I'm wrong :-)
Wait, no. That's something else. I think this is used to collect memory reports which are attached to crash reports.
This is for telemetry [1], the probe is limited to once per minute [2]. Given your profile was for one minute I'm assuming this was called *once*.

We see two paths [3] that it's called in:

779ms - GetHeapAllocated
  1ms - GetHeapOverheadFraction

So we know generally it's pretty fast (1ms), but we saw a crazy slow call (779ms). My best guess is this is some sort of lock contention [4]. I see some allocation related functions in the profile but nothing to that degree!

[1] http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/toolkit/components/telemetry/TelemetrySession.jsm#1973-1979
[2] http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/toolkit/components/telemetry/TelemetrySession.jsm#83
[3] http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/toolkit/components/telemetry/TelemetrySession.jsm#1167-1168
[4] http://searchfox.org/mozilla-central/rev/ae94cfb36d804727dfb38f2750bfcaab4621df6e/memory/mozjemalloc/mozjemalloc.cpp#5086,5093,5099,5117
Eric can you take another look to determine if there is any part of this that can be optimized. Ehsan noted that a significant amount of time is spent in a binary search (250+ms) and that may be addressable more readily than lock contention. If there isn't anything to do here after a brief investigation let me know and we can re-prioritize.
Flags: needinfo?(erahm)
Whiteboard: [qf] → [qf:p1]
I don't think nsEffectiveTLDService overlaps with the stuff comment 0 is talking about.

I don't know if the stack is right, but the "je_jemalloc_stats" path seems to spin the event loop and then spending all its time in this:

content_process_main(mozilla::Bootstrap *,int,char * * const)
Startup::XRE_InitChildProcess
XRE_InitChildProcess(int,char * * const,XREChildData const *)
MessageLoop::Run()
MessageLoop::RunHandler()
mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate *)
XRE_RunAppShell()
nsAppShell::Run()
nsBaseAppShell::Run()
MessageLoop::Run()
MessageLoop::RunHandler()
mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate *)
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *)
nsThread::ProcessNextEvent(bool,bool *)
nsTimerEvent::Run()
Timer::Fire
nsTimerImpl::Fire(int)
CollectorRunner::TimedOut(nsITimer *,void *)
CollectorRunner::Run()
CCRunnerFired
nsJSContext::RunCycleCollectorSlice
nsCycleCollector::collectSlice
nsJSContext::RunCycleCollectorSlice(mozilla::TimeStamp)
nsCycleCollector_collectSlice(js::SliceBudget &,bool)
nsCycleCollector::Collect(ccType,js::SliceBudget &,nsICycleCollectorListener *,bool)
nsCycleCollector::BeginCollection(ccType,nsICycleCollectorListener *)
XPCJSRuntime::BeginCycleCollectionCallback()
nsObserverService::NotifyObservers cycle-collector-begin
nsObserverService::NotifyObservers(nsISupports *,char const *,char16_t const *)
nsObserverList::NotifyObservers(nsISupports *,char const *,char16_t const *)
SharedStub
PrepareAndDispatch
nsXPCWrappedJS::CallMethod(unsigned short,XPTMethodDescriptor const *,nsXPTCMiniVariant *)
nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS *,unsigned short,XPTMethodDescriptor const *,nsXPTCMiniVariant *)
JS_CallFunctionValue(JSContext *,JS::Handle<JSObject *>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>)
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct)
js::RunScript
observe
gatherMemory
js::RunScript(JSContext *,js::RunState &)
Interpret
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct)
js::RunScript
b
js::RunScript(JSContext *,js::RunState &)
Interpret
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct)
js::RunScript
h
js::RunScript(JSContext *,js::RunState &)
Interpret
js::GetProperty(JSContext *,JS::Handle<JSObject *>,JS::Handle<JSObject *>,JS::Handle<jsid>,JS::MutableHandle<JS::Value>)
js::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,js::AnyInvokeArgs const &,JS::MutableHandle<JS::Value>)
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct)
XPC_WN_GetterSetter(JSContext *,unsigned int,JS::Value *)
XPCWrappedNative::CallMethod(XPCCallContext &,XPCWrappedNative::CallMode)
XPTC__InvokebyIndex
nsMemoryReporterManager::GetHeapAllocated(__int64 *)
je_jemalloc_stats
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #4)
> It seems like
> https://searchfox.org/mozilla-central/rev/
> ae94cfb36d804727dfb38f2750bfcaab4621df6e/netwerk/dns/nsEffectiveTLDService.
> cpp#51 is taking up all of the time here which is doing a binary search?

Apologies, wrong bug!
Oh, I was confused in comment 6 because I didn't notice the stacks were inverted.

Eric, could we make this use idle dispatch?  Does the exact timing matter?
We just need to run this code occasionally, like once a minute or so. Idle dispatch would be perfect.

We could probably run the jemalloc part of it off the main thread though I don't know if that is worth the complexity.
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #7)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #4)
> > It seems like
> > https://searchfox.org/mozilla-central/rev/
> > ae94cfb36d804727dfb38f2750bfcaab4621df6e/netwerk/dns/nsEffectiveTLDService.
> > cpp#51 is taking up all of the time here which is doing a binary search?
> 
> Apologies, wrong bug!

Argh, sorry, right bug actually.  See the call stack under nsWindowMemoryReporter::GhostWindowsReporter::DistinguishedAmount(): http://bit.ly/2t0WBYb
Mats, do you remember what you were doing in the browser when you took this profile?  How many tabs did you have open?
Flags: needinfo?(mats)
Converting to idle dispatch is the easiest option. Note that the stacks being pointed out are in different processes. I think this has basically morphed into bug 1340134, we might just want to dup to that and file some follow ups.

At this point I think we have a few issues:

#1 - TelemetrySession.jsm's gatherMemory is rather slow (to the point it can impact CC timing), we see various portions showing up in profiles > 60s. Moving to idle dispatch would at least help prevent jank / messing with CC timing.

#2 - je_malloc_stats can be super slow, my guess is lock contention particularly when handing memory back to the OS. I can look into whether that would really be the case and see if we can drop locks more eagerly

#3 - GhostWindowsReporter is taking more time than we'd expect, I can look into improving the perf of that. It's possible our algorithm can be tweaked.

#4 - We can look into reducing the frequency of memory gathering in TelemetrySession and consider dropping some expensive metrics
Flags: needinfo?(erahm)
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #3)
> Given your profile was for one minute I'm assuming this was called *once*.

Fair enough, but why are we observing, and ignoring, every single CC-start during
that minute then?  Observers are expensive too... even more so when the callback
is in JS.  And CC-start can be fairly frequent.

Wouldn't it be less expensive to use a setTimeout or something here instead?

(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #11)
> Mats, do you remember what you were doing in the browser when you took this
> profile?  How many tabs did you have open?

It's not my profile so I have no idea.  It's for some Facebook use case, IIRC.
I can probably dig it up if it's important, but I don't think it is.
Flags: needinfo?(mats)
Depends on: 1340134
(In reply to Mats Palmgren (:mats) from comment #13)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #3)
> > Given your profile was for one minute I'm assuming this was called *once*.
> 
> Fair enough, but why are we observing, and ignoring, every single CC-start
> during
> that minute then?  Observers are expensive too... even more so when the
> callback
> is in JS.  And CC-start can be fairly frequent.
> 
> Wouldn't it be less expensive to use a setTimeout or something here instead?

We want make the memory measurement around the time of cycle collection, I'm not sure just setTimeout would work. We could do something like:
- Register observer
- Get notified
- Remove observer, do our thing, setTimeout, re-register after timeout

So really it comes down to if registering and removing an observer roughly every minute is better than always being registered.
Depends on: 1376038
Yeah, that seems worth it to me.
We can make an async version of nsMemoryReporterManager::GetHeapAllocated() that returns a promise and offload je_malloc_stats() from the main thread. I'll work on this.
Assignee: nobody → cyu
All of this looks good, but I think we might want to expand things a bit. |heapOverheadFraction| [1,2] also hits |jemalloc_stats|. I think either we should add an async version for that or combine the measurement with |heapAllocated| (I'd prefer this).

[1] http://searchfox.org/mozilla-central/rev/17ebac68112bd635b458e831670c1e506ebc17ad/toolkit/components/telemetry/TelemetrySession.jsm#1167
[2] http://searchfox.org/mozilla-central/rev/17ebac68112bd635b458e831670c1e506ebc17ad/xpcom/base/nsMemoryReporterManager.cpp#2378
Comment on attachment 8882325 [details] [diff] [review]
Part 1: Add getHeapAllocatedAsync() to nsIMemoryReporter.idl.

Review of attachment 8882325 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, although we may want to change things a bit. r+ for now.
Attachment #8882325 - Flags: review?(erahm) → review+
Comment on attachment 8882326 [details] [diff] [review]
Part 2: Implementation of nsMemoryReporterManager::GetHeapAllocatedAsync().

Review of attachment 8882326 [details] [diff] [review]:
-----------------------------------------------------------------

This logic seems good, but I'd like to get another reviewer. Nathan can you take a look (also I'm not super confident on the JS parts if you'd like to double-check that).

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +2434,5 @@
> +      Unused << NS_DispatchToMainThread(resolveOrRejectRunnable);
> +    });
> +
> +  promise.forget(aPromise);
> +  return threadPool->Dispatch(getHeapAllocatedRunnable, NS_DISPATCH_NORMAL);

I'm not sure if there's a better way, but I'm a little concerned that we spin up a |SharedThreadPool| and immediately destroy it.

So lock, lookup in the global table, insert, unlock, do stuff, lock, remove from global table, unlock, send event to main thread, spin event loop on main thread until the worker thread dies. I'm going to cc froydnj to see if he has a better option.
Attachment #8882326 - Flags: review?(nfroyd)
Attachment #8882326 - Flags: review?(erahm)
Attachment #8882326 - Flags: review+
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #22)
> ::: xpcom/base/nsMemoryReporterManager.cpp
> @@ +2434,5 @@
> > +      Unused << NS_DispatchToMainThread(resolveOrRejectRunnable);
> > +    });
> > +
> > +  promise.forget(aPromise);
> > +  return threadPool->Dispatch(getHeapAllocatedRunnable, NS_DISPATCH_NORMAL);
> 
> I'm not sure if there's a better way, but I'm a little concerned that we
> spin up a |SharedThreadPool| and immediately destroy it.
> 
> So lock, lookup in the global table, insert, unlock, do stuff, lock, remove
> from global table, unlock, send event to main thread, spin event loop on
> main thread until the worker thread dies. I'm going to cc froydnj to see if
> he has a better option.

Oh, right. We should keep a reference of the SharedThreadPool instance we get to keep it alive and ClearOnShutdown() the ref.
Comment on attachment 8882326 [details] [diff] [review]
Part 2: Implementation of nsMemoryReporterManager::GetHeapAllocatedAsync().

Review of attachment 8882326 [details] [diff] [review]:
-----------------------------------------------------------------

I agree with Eric here: SharedThreadPool indicates that its user should hold onto the created pool, so these pools should have a long lifespan.  Spinning up a pool for a single dispatch does not make much sense to me.

We could start a single thread here, or we could just grab a reference to the stream transport service or something like that to dispatch our event to.  I guess we could also hold a persistent ref to the pool and clear it on Shutdown if you wanted to go that route--thought if we're going go that route, I think spinning up a single thread would be better.

Do we have to use promises here?  I understand it's somewhat convenient, but we could just do:

[scriptable, function]
interface nsIHeapAllocatedCallback : nsISupports
{
  void consumeValue(long long aBytesAllocated);
};

and pass a nsIHeapAllocatedCallback into this function: the JS wouldn't have to change, but we'd have more confidence in the C++ implementation being correct.
Attachment #8882326 - Flags: review?(nfroyd)
Don't we have some kind of lazy idle thread thing that dies when not in use?

Or just use our existing pool in StreamTransportService.
Attachment #8882329 - Flags: review?(gfritzsche) → review?(chutten)
Comment on attachment 8882329 [details] [diff] [review]
Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm.

Cancel the review request for now since I'll change the idl interface per review request.
Attachment #8882329 - Flags: review?(chutten)
Comment on attachment 8882329 [details] [diff] [review]
Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm.

Review of attachment 8882329 [details] [diff] [review]:
-----------------------------------------------------------------

Do you expect this to interact well with the changes pending in bug 1340134?
Attachment #8882329 - Flags: feedback+
(In reply to Chris H-C :chutten from comment #27)
> Comment on attachment 8882329 [details] [diff] [review]
> Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm.
> 
> Review of attachment 8882329 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Do you expect this to interact well with the changes pending in bug 1340134?

I think it should be fine, in bug 1340134 we avoid slowing down CC by punting |gatherMemory| to an idle task. That still means we'll have a long running task on main thread; this patch addresses taking one of the slower pieces and making it asynchronous.
Carryover r+ from :erahm. Changed from using promise to using callback.
Attachment #8882325 - Attachment is obsolete: true
Attachment #8882662 - Flags: review+
Carryover r+ from :erahm and request additional review from :froydnj.
Attachment #8882326 - Attachment is obsolete: true
Attachment #8882663 - Flags: review?(nfroyd)
Changed to use a callback instead of a promise to get the result.
Attachment #8882329 - Attachment is obsolete: true
Attachment #8882665 - Flags: review?(chutten)
Attachment #8882665 - Attachment description: 1375281_p3.patch → Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm.
Comment on attachment 8882665 [details] [diff] [review]
Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm.

Review of attachment 8882665 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8882665 - Flags: review?(chutten) → review+
Comment on attachment 8882663 [details] [diff] [review]
Part 2: Implementation of nsMemoryReporterManager::GetHeapAllocatedAsync().

Review of attachment 8882663 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8882663 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/800cae1681846f50fa6dc51baf218d17c1422975
Bug 1375281 - Part 1: Add getHeapAllocatedAsync() to nsIMemoryReporter.idl. r=erahm

https://hg.mozilla.org/integration/mozilla-inbound/rev/5f92b3b576253070f7c8df5410a9f3f58b2028c3
Bug 1375281 - Part 2: Implementation of nsMemoryReporterManager::GetHeapAllocatedAsync(). r=erahm,froydnj

https://hg.mozilla.org/integration/mozilla-inbound/rev/f433b516c2d83f8fa93915a8860a9e4e338fb1ea
Bug 1375281 - Part 3: Get "MEMORY_HEAP_ALLOCATED" asynchronously in TelemetrySession.jsm. r=chutten
Component: DOM → DOM: Core & HTML
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: