Closed Bug 1387001 Opened 6 years ago Closed 4 years ago
Measure the performance effect of malloc poisoning on free, in mozjemalloc
(Core :: Memory Allocator, enhancement)
(Reporter: jseward, Unassigned)
tl;dr: it costs about 2.5% performance on Intel/Skylake, and more than that in terms of number of instructions and memory traffic.

------

mozjemalloc poisons (overwrites) all memory on free, as a necessary security measure. This bug doesn't propose to change anything, but it serves as a place to record measurements about the performance effects of the poisoning.

We allocate and free heap fast -- easily 100MB/sec -- and the memset calls associated with poisoning show up frequently in profiles.

Measurements were done using the Speedometer benchmark, http://browserbench.org/Speedometer. The profiled build is m-c from source, release-ish config (-O2, nondebug), on Fedora 25 x86_64.

I particularly wanted to measure this because poisoning breaks the assumption that it is performance neutral to allocate oversize blocks and then not use all the allocated space. Also, poisoning has potential to increase MESI protocol misses in situations where memory is freed by a different thread than the one that allocated/used it.

------ Actual performance

Detecting any difference was quite tricky. Speedometer gives the impression of having short periods where it falls idle, which causes the CPU clock rates to fall, confusing the measurements. In the end I locked the clock rate and ran on a quiet machine. And with e10s disabled.

Because the performance of the memset calls in question is possibly strongly dependent on CPU implementation details (cache sizes, hardware prefetch abilities, memory/cache bandwidth), I wanted to test it on processors at both ends of the capability spectrum. What I had to hand is

(top end) A Xeon Skylake:
  Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz (4 cores, 8 threads)
  with clock rate fixed to 2.40 GHz

(bottom end) An Atom (Silvermont):
  Intel N 3050 (base 1.60 GHz, max 2.16 GHz) (2 cores, 2 threads)
  with clock rate fixed to 2.00 GHz

The Speedometer runs are noisy.
The results are:

  baseline, Skylake:     66.7 +/- 1.5,  65.3 +/- 1.6  --> average 66.0
  nopoison, Skylake:     67.0 +/- 1.2,  68.5 +/- 1.2  --> average 67.75

  baseline, Silvermont:  13.6 +/- 0.15
  nopoison, Silvermont:  13.8 +/- 0.24

The Skylake results are as I expected -- about a 2.5% perf loss. The Silvermont results surprised me. I am not sure why it appears almost unaffected. I did notice that the Silvermont's clock rate wouldn't always stay fixed, despite my best efforts. That probably doesn't help get reliable numbers.

------ Measurements with Callgrind

Callgrind is a great tool for getting insights into microarchitectural effects, but it's very slow. I decided to profile a steady-state section of the Speedometer run -- steps 45 through 52 inclusive -- and then normalise the results by the number of frees done. This is on the assumption that heap turnover is a fairly reliable indicator of the amount of useful work done, since disabling poisoning won't change the number of allocations/frees.

I asked Callgrind to simulate a cache hierarchy similar to that in the Skylake. Although that's not terribly realistic (it ignores multi-core effects and prefetching), it does give some baseline measurement of cache-friendliness. Results are:

  what                   baseline     nopoison

  #frees                1,132,636    1,225,786

  Insns                    13224M       13163M
  DataReads                 3520M        3660M
  DataWrites                2773M        2256M
  D1ReadMisses             130.3M       138.9M
  D1WriteMisses             37.6M        29.0M

  Insns/free                11675        10738
  DataReads/free             3108         2985
  DataWrites/free            2448         1840
  (DR+DW)/free               5556         4825
  D1ReadMisses/free         115.0        113.3
  D1WriteMisses/free         33.2         23.7
  (D1RM+D1WM)/free          148.2        137.0

What you can take from this is:

* poisoning adds 8.7% insns per unit useful work

* and increases memory traffic by 15.1%, mostly writes (as expected)

* and increases the number of D1 misses per free by around 8%, unsurprisingly

I am a bit suspicious that poisoning also increases the number of memory reads per free. I'm not sure why. Maybe the frame setup/removal overheads of calling memset? Unlikely.
Maybe the two runs aren't exactly identical in workload?

Given the increased instruction count, I am surprised that the Skylake only took a 2.5% performance hit, and even more surprised -- to the extent of disbelief -- that the Silvermont seems almost unaffected.
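
As a cross-check on the arithmetic above, here is a minimal Python sketch that re-derives the per-free Callgrind figures and the headline percentages from the raw totals reported in this comment. The dictionary keys and the helper names (per_free, overhead_pct) are made up for illustration; they are not Callgrind event names or part of any tool. The raw totals are only given to the nearest million, so the last digit of a per-free value can differ slightly from the table.

```python
M = 1_000_000

# Raw totals copied from the Callgrind table in this comment.
# Key names are illustrative labels, not Callgrind event names.
RUNS = {
    "baseline": {
        "frees": 1_132_636,
        "insns": 13224 * M,
        "reads": 3520 * M,
        "writes": 2773 * M,
        "d1_read_misses": 130.3 * M,
        "d1_write_misses": 37.6 * M,
    },
    "nopoison": {
        "frees": 1_225_786,
        "insns": 13163 * M,
        "reads": 3660 * M,
        "writes": 2256 * M,
        "d1_read_misses": 138.9 * M,
        "d1_write_misses": 29.0 * M,
    },
}


def per_free(run):
    """Normalise every counter by the number of frees in that run."""
    frees = run["frees"]
    return {name: count / frees for name, count in run.items() if name != "frees"}


def overhead_pct(with_poison, without_poison):
    """Percentage by which the poisoned figure exceeds the unpoisoned one."""
    return 100.0 * (with_poison / without_poison - 1.0)


base = per_free(RUNS["baseline"])
nop = per_free(RUNS["nopoison"])

insns_overhead = overhead_pct(base["insns"], nop["insns"])
traffic_overhead = overhead_pct(
    base["reads"] + base["writes"], nop["reads"] + nop["writes"]
)
miss_overhead = overhead_pct(
    base["d1_read_misses"] + base["d1_write_misses"],
    nop["d1_read_misses"] + nop["d1_write_misses"],
)

# Speedometer scores on Skylake (higher is better), two runs per configuration.
skylake_baseline = (66.7 + 65.3) / 2
skylake_nopoison = (67.0 + 68.5) / 2
# Score lost by enabling poisoning, relative to the unpoisoned build.
skylake_loss = 100.0 * (1.0 - skylake_baseline / skylake_nopoison)

print(f"insns/free overhead:      {insns_overhead:.1f}%")
print(f"(DR+DW)/free overhead:    {traffic_overhead:.1f}%")
print(f"D1 misses/free overhead:  {miss_overhead:.1f}%")
print(f"Skylake Speedometer loss: {skylake_loss:.1f}%")
```

Normalising by frees matters here because the two profiled sections did not perform the same number of frees, so the raw totals are not directly comparable.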
6 years ago
Thanks for measuring this! I agree the relatively minor performance hit is surprising...
4 years ago
The measurements are here; I guess we can close this bug.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED