Open Bug 1757350 Opened 3 years ago Updated 3 years ago

Reduce overhead of profiler::AllocCallback and FreeCallback

Categories: Core :: Gecko Profiler, task, P2
People: Reporter: mozbugz, Unassigned
References: Blocks 1 open bug

Details

Spawned from bug 1745591 comment 3:

I did some local instrumentation and profiling of this: https://share.firefox.dev/3t3ddM4
It seems like running the profiler slows things down by more than 2x. The time for a single call to new_ct_font_with_variations goes from around 3-4ms to 9-12ms (with mozilla::profiler::AllocCallback being mostly to blame).

This is quite a big jump!
Zooming in on zones of activity in WRWorker threads, one third of samples are in atomic_fetch_add, inside ProfilerCounterTotal::Add called by profiler::AllocCallback and profiler::FreeCallback.
So even though profiler counters use relaxed atomics, they can still have a visible overhead when many threads perform lots of operations around the same time.
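For illustration, here is a simplified sketch of the shared-counter pattern in play (not the actual Gecko code, names are made up): every allocation and deallocation on every thread performs a relaxed fetch_add on the same atomic, so even without fences the cache line holding the counter has to bounce between cores under heavy multi-threaded allocation.

  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  // Simplified stand-in for a shared profiler counter: every thread does a
  // relaxed fetch_add on the same atomic, so the cache line holding `count`
  // must be owned exclusively by one core at a time.
  struct SharedCounterSketch {
    std::atomic<std::int64_t> count{0};

    void Add(std::int64_t aDelta) {
      // Relaxed ordering avoids fences, but this is still a read-modify-write
      // on shared memory, which is what shows up as atomic_fetch_add samples.
      count.fetch_add(aDelta, std::memory_order_relaxed);
    }
  };

  SharedCounterSketch gMallocCounterSketch;

  // Hypothetical hook shapes, roughly what the allocation callbacks do:
  void OnAllocSketch(std::size_t aSize) { gMallocCounterSketch.Add(std::int64_t(aSize)); }
  void OnFreeSketch(std::size_t aSize)  { gMallocCounterSketch.Add(-std::int64_t(aSize)); }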

I can think of two possible ways to help:

  1. In these (hopefully rare) cases where the profiler's memory interception functions have a big impact, there could be a way to prevent them from running: either an environment variable for our skilled users, or a friendlier option in about:profiling.
    AND/OR
  2. Remove the contentious shared atomics by using atomic operations on thread-specific numbers accessed through thread-local storage (TLS). (A rough sketch follows this list.)
    The periodic sampling part would have to read all of these.
    This should make individual operations faster, thanks to the minimal contention: only the owning thread would perform the addition, and the sampler thread would read it from time to time. There is a cost to perform TLS accesses, to be measured on all platforms.

Looking again at the profile of WRWorker activity, a further 22% of samples are in an atomic load in mozilla::profiler::ThreadIntercept::ThreadIntercept (which prevents re-entering the interception routines); I'm not sure we could avoid those with option 2 alone.
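For context, a hedged sketch of what such a re-entry guard typically amounts to (simplified names, not the actual Gecko implementation): a global flag saying whether the memory hooks are active is loaded on every intercepted malloc/free, and a per-thread "busy" flag stops the hooks from recording the allocations they make themselves.

  #include <atomic>

  std::atomic<bool> gMemoryHooksEnabled{false};  // toggled at profiler start/stop
  thread_local bool tInsideHook = false;         // set while a hook is running

  class ThreadInterceptSketch {
   public:
    ThreadInterceptSketch() {
      // This load happens on every intercepted malloc/free while the hooks
      // are installed; it is where the ~22% of samples mentioned above land.
      mActive =
          gMemoryHooksEnabled.load(std::memory_order_relaxed) && !tInsideHook;
      if (mActive) {
        tInsideHook = true;  // block recursive interception from this thread
      }
    }
    ~ThreadInterceptSketch() {
      if (mActive) {
        tInsideHook = false;
      }
    }
    explicit operator bool() const { return mActive; }

   private:
    bool mActive = false;
  };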

And there's another 10% in atomic_fetch_sub inside PHC's MaybePageAlloc, so it's not just the profiler adding some overhead!

One more bit of information: Looking at the json data (in the js console, it's in the profile variable), I can see that the number of memory operations is in the low hundreds per sample (every ~1ms) most of the time, but during these busy multi-threaded periods it climbs to around 10,000 per sample!
This adds to the evidence that inter-thread atomic contention is much more visible in these cases.
