Measure the performance effect of malloc poisoning on free, in mozjemalloc

RESOLVED FIXED

Status

enhancement
2 years ago
3 months ago

People

(Reporter: jseward, Unassigned)

tl;dr: it costs about 2.5% performance on Intel/Skylake, and more than that
in terms of instructions (+8.7%) and memory traffic (+15.1%) per free.

------

mozjemalloc poisons (overwrites) all memory on free, as a necessary security
measure.  This bug doesn't propose to change anything, but it serves as a
place to record measurements about the performance effects of the poisoning.
We allocate and free heap fast -- easily 100MB/sec -- and the memset calls
associated with poisoning show up frequently in profiles.

Measurements were done using the Speedometer benchmark,
http://browserbench.org/Speedometer.  The profiled build is m-c from source,
release-ish config (-O2, nondebug), on Fedora 25 x86_64.

I particularly wanted to measure this because poisoning breaks the
assumption that it is performance neutral to allocate oversize blocks and
then not use all the allocated space.  Also, poisoning has potential to
increase MESI protocol misses in situations where memory is freed by a
different thread than the one that allocated/used it.
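
To make the first point concrete, here is a toy Python model.  It is not
mozjemalloc code: POISON_BYTE, free_with_poison and bytes_poisoned are made-up
names, and the poison value is an assumption for illustration.  It only shows
that the bytes written at free time track the allocated size, not the number
of bytes the caller actually used:

```python
# Toy model of poison-on-free; not taken from mozjemalloc.
POISON_BYTE = 0xE5  # assumed value, for illustration only


def free_with_poison(block: bytearray) -> int:
    """Overwrite the whole block and return the number of bytes written."""
    block[:] = bytes([POISON_BYTE]) * len(block)
    return len(block)


def bytes_poisoned(allocated: int, used: int) -> int:
    """Allocate `allocated` bytes, touch only `used` of them, then free."""
    block = bytearray(allocated)
    block[:used] = b"\x01" * used  # the caller only writes `used` bytes
    return free_with_poison(block)


exact = bytes_poisoned(allocated=64, used=64)
oversize = bytes_poisoned(allocated=4096, used=64)
print(exact, oversize)
```

In this model both callers use 64 bytes, but the oversize one pays for 4096
bytes of writes when the block is freed.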

------

Actual performance

Detecting any difference was quite tricky.  Speedometer gives the impression
of having short periods where it falls idle, which causes the CPU clock
rates to fall, confusing the measurements.  In the end I locked the clock
rate, ran on a quiet machine, and disabled e10s.

Because the performance of the memset calls in question is possibly strongly
dependent on CPU implementation details (cache sizes, hardware prefetch
abilities, memory/cache bandwidth), I wanted to test it on processors at
both ends of the capability spectrum.  What I had to hand was:

(top end) A Xeon Skylake:
   Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz (4 cores, 8 threads)
   with clock rate fixed to 2.40 GHz
   
(bottom end) An Atom (Silvermont):
   Intel N 3050 (base 1.60 GHz, max 2.16 GHz) (2 cores, 2 threads)
   with clock rate fixed to 2.00 GHz

The Speedometer runs are noisy.  The results are:

   baseline, Skylake:  66.7 +/- 1.5, 65.3 +/- 1.6  --> average 66.0
   nopoison, Skylake:  67.0 +/- 1.2, 68.5 +/- 1.2  --> average 67.75

   baseline, Silvermont:  13.6 +/- 0.15
   nopoison, Silvermont:  13.8 +/- 0.24

The Skylake results are as I expected -- about a 2.5% perf loss.  The
Silvermont results surprised me.  I am not sure why it appears almost
unaffected.  I did notice that the Silvermont's clock rate wouldn't always
stay fixed, despite my best efforts.  That probably doesn't help get
reliable numbers.
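
As a rough sanity check, the sketch below recomputes the relative cost from
the scores listed above.  It assumes a higher Speedometer score is better and
takes a plain mean of the runs; the helper name poisoning_cost_percent is made
up for this example.

```python
from statistics import mean

# Speedometer scores copied from the results above.
skylake = {"baseline": [66.7, 65.3], "nopoison": [67.0, 68.5]}
silvermont = {"baseline": [13.6], "nopoison": [13.8]}


def poisoning_cost_percent(scores):
    """Percentage by which the mean score drops when poisoning is enabled."""
    base = mean(scores["baseline"])
    nopoison = mean(scores["nopoison"])
    return 100.0 * (nopoison - base) / nopoison


print(f"Skylake:    {poisoning_cost_percent(skylake):.1f}%")
print(f"Silvermont: {poisoning_cost_percent(silvermont):.1f}%")
```

This gives roughly 2.6% for the Skylake and 1.4% for the Silvermont.  The
Silvermont difference of 0.2 is smaller than the quoted +/- 0.24, so it is
within the noise of the measurement.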

------

Measurements with Callgrind

Callgrind is a great tool for getting insights into microarchitectural
effects, but it's very slow.  I decided to profile a steady-state section of
the Speedometer run -- steps 45 through 52 inclusive -- and then normalise
the results by the number of frees done.  This is on the assumption that
heap turnover is a fairly reliable indicator of the amount of useful work
done, since disabling poisoning won't change the number of
allocations/frees.
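
A small sketch of why the normalisation matters, using the instruction and
free counts from the results table in this comment.  The helper name per_free
is made up for this example.

```python
# Counts taken from the Callgrind results table in this comment.
FREES = {"baseline": 1_132_636, "nopoison": 1_225_786}
INSNS = {"baseline": 13224e6, "nopoison": 13163e6}


def per_free(count, frees):
    """Normalise an event count by the number of frees in the same run."""
    return count / frees


# Ratio of the raw totals, ignoring how much work each run did.
raw_ratio = INSNS["baseline"] / INSNS["nopoison"]

# Ratio after dividing each total by that run's number of frees.
normalised_ratio = (per_free(INSNS["baseline"], FREES["baseline"])
                    / per_free(INSNS["nopoison"], FREES["nopoison"]))

print(f"raw insns:      {100 * (raw_ratio - 1):+.1f}%")
print(f"insns per free: {100 * (normalised_ratio - 1):+.1f}%")
```

The raw totals differ by only about 0.5%, because the two profiled sections
did different numbers of frees.  Per free, the difference is about 8.7%.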

I asked Callgrind to simulate a cache hierarchy similar to that in the
Skylake.  Although that's not terribly realistic (it ignores multi-core
effects and prefetching), it does give some baseline measurement of
cache-friendliness.  Results are:

  what               baseline       nopoison
              
  #frees            1,132,636      1,225,786

  Insns                13224M         13163M
  DataReads             3520M          3660M
  DataWrites            2773M          2256M
  D1ReadMisses         130.3M         138.9M
  D1WriteMisses         37.6M          29.0M

  Insns/free            11675          10738
  DataReads/free         3108           2985
  DataWrites/free        2448           1840
  (DR+DW)/free           5556           4825

  D1ReadMisses/free     115.0          113.3
  D1WriteMisses/free     33.2           23.7
  (D1RM+D1WM)/free      148.2          137.0

What you can take from this is:

* poisoning adds 8.7% insns per unit useful work

* and increases memory traffic by 15.1%, mostly writes (as expected)

* and increases D1 misses per free by around 8%, unsurprisingly
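
For anyone wanting to recheck the three percentages, they can be rederived
from the raw counts in the table.  This is just arithmetic on the published
numbers; the name overhead_percent is made up for this example.

```python
# Raw counts from the table, as (baseline, nopoison).
FREES = (1_132_636, 1_225_786)
COUNTS_MILLIONS = {
    "Insns": (13224, 13163),
    "DataReads": (3520, 3660),
    "DataWrites": (2773, 2256),
    "D1ReadMisses": (130.3, 138.9),
    "D1WriteMisses": (37.6, 29.0),
}


def overhead_percent(metrics):
    """Extra events per free with poisoning on, as a percentage.

    The "millions" unit cancels out because only the ratio is used.
    """
    base = sum(COUNTS_MILLIONS[m][0] for m in metrics) / FREES[0]
    nopoison = sum(COUNTS_MILLIONS[m][1] for m in metrics) / FREES[1]
    return 100.0 * (base / nopoison - 1.0)


insns = overhead_percent(["Insns"])
traffic = overhead_percent(["DataReads", "DataWrites"])
d1_misses = overhead_percent(["D1ReadMisses", "D1WriteMisses"])

print(f"insns per free:          +{insns:.1f}%")
print(f"memory accesses per free: +{traffic:.1f}%")
print(f"D1 misses per free:       +{d1_misses:.1f}%")
```

This prints 8.7%, 15.1% and 8.2% respectively, matching the figures quoted
above.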

I am a bit suspicious that poisoning also increases the number of memory
reads.  I'm not sure why.  Maybe the frame setup/removal overheads of
calling memset?  Unlikely.  Maybe the two runs aren't exactly identical
in workload?

Given the increased instruction count, I am surprised that the Skylake only
took a 2.5% performance hit, and even more surprised -- to the extent of
disbelief -- that the Silvermont seems almost unaffected.

Thanks for measuring this! I agree the relatively minor performance hit is surprising...

The measurements are here, so I guess we can close this bug.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED