Reduce overhead of profiler::AllocCallback and FreeCallback
Categories
(Core :: Gecko Profiler, task, P2)
People
(Reporter: mozbugz, Unassigned)
References
(Blocks 1 open bug)
Spawned from bug 1745591 comment 3:
I did some local instrumentation and profiling of this: https://share.firefox.dev/3t3ddM4
It seems like running the profiler slows things down by more than 2x. The time for a single call to new_ct_font_with_variations goes from around 3-4ms to 9-12ms (with mozilla::profiler::AllocCallback being mostly to blame).
This is quite a big jump!
Zooming in on zones of activity in WRWorker threads, one third of samples are in atomic_fetch_add, inside ProfilerCounterTotal::Add, called by profiler::AllocCallback and profiler::FreeCallback.
So even though profiler counters are using relaxed atomics, they can still have a visible overhead when many threads perform lots of operations around the same time.
I can think of two possible ways to help:
- In these (hopefully rare) cases where the profiler memory interception functions have a big impact, there could be a way to prevent them from running, either via an environment variable for our skilled users or via a friendlier option in about:profiling.
- And/or: remove the contentious shared atomics by using atomic operations on thread-specific counters accessed through thread-local storage (TLS). The periodic sampling part would have to read all of these. This should make individual operations faster thanks to minimal contention: only the owning thread would perform the addition, and the sampler thread would read it from time to time. There is a cost to TLS accesses, which should be measured on all platforms.
Comment 1 • Reporter • 3 years ago
Looking again at the profile of WRWorker activity, a further 22% of samples are in an atomic load in mozilla::profiler::ThreadIntercept::ThreadIntercept (used to prevent re-entering interception routines), so I'm not sure we could avoid them with option 2 alone.
And there's another 10% in atomic_fetch_sub inside PHC's MaybePageAlloc, so it's not just the profiler adding overhead!
Comment 2 • Reporter • 3 years ago
One more bit of information: looking at the JSON data (in the JS console, it's in the profile variable), I can see that the number of memory operations is in the low hundreds per sample (every ~1ms) most of the time, but during these busy multi-threaded periods the number of memory operations climbed to around 10,000 per sample!
This adds to the evidence that inter-thread atomic contention is much more visible in these cases.