Closed Bug 1386572 Opened 7 years ago Closed 7 years ago

The Baseline JIT totals the Level-1 data cache every time it runs

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: jseward, Assigned: jandem)

Details

Attachments

(2 files)

I'm not 100% sure I've joined the dots here correctly.  Nevertheless...

I suspect this is an integration-level problem that doesn't show up when
testing SM standalone, assuming it doesn't use mozjemalloc when standalone.

I noticed this in profiles when running http://browserbench.org/Speedometer/.
What I think happens is:

* jit::BaselineCompile (js/src/jit/BaselineJIT.cpp:277) is called a lot.

* At each call, it creates a new temp allocator which is used only for
  the duration of this call (this is the bit I'm unclear about)

    LifoAlloc alloc(TempAllocator::PreferredLifoChunkSize);
    TempAllocator* temp = alloc.new_<TempAllocator>(&alloc);

* The temp allocator provides a heap-allocated buffer of size 32KB
  (TempAllocator::PreferredLifoChunkSize)

* On average (for the Speedometer workload) the baseline JIT only uses
  the first 10% of the buffer.

* After the call, the buffer is freed.  This is the bad part.  mozjemalloc
  poisons the entire buffer for security reasons, hence writing 32KB of
  contiguous data.

Most processors don't even have a 32KB D1; 16KB is more typical.
Hence the poisoning is going to trash D1 entirely and some part of the L2
behind it.  Anything else running on the same core (most importantly, us!)
will then take a bunch of D1 misses soon afterwards.

The underlying problem is that mozjemalloc's policy of poisoning-on-free (a
necessary evil, I understand) changes the total cost of renting a block of
size N from O(1) to O(N), hence negating the implicit assumption that it's
performance neutral to allocate blocks that are too large and use only a
part of the space.
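
To make that cost argument concrete, here is a minimal illustrative sketch (not mozjemalloc's actual code; the poison pattern and mechanism here are assumptions) of why poison-on-free makes the cost of freeing a block proportional to its size rather than constant:

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // Illustrative only: a free() that poisons the whole allocation must touch
    // every byte, so its cost scales with the block size N, not with the
    // fraction of the block the caller actually used.
    static void poisoning_free(void* ptr, std::size_t size) {
        // Overwrite the entire block so stale pointers into it are easier to
        // spot and harder to exploit.  This write is O(size) and drags all
        // `size` bytes through the data caches, even if only a few KB were used.
        std::memset(ptr, 0xE5, size);
        std::free(ptr);
    }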

Two questions:

(1) is my analysis correct?

(2) if yes, can we either make baseline's initial chunk size a lot smaller,
    or (even better) allocate it on the stack?
Here are the relevant DHAT reports resulting from running 24 out of 420
iterations of Speedometer.  To interpret the first one:

* This allocation stack only ever had one block live at once

* but it created and destroyed 4740 blocks in total, all of size 32768

* Each block stayed alive an average of 89921 insns.  Maybe that's the
  average cost of a call to the baseline compiler?

* On average, only 10% of the block was read and 9% was written (no, don't
  ask me), which corresponds to about 3.2KB of actually used area

The second report is similar but not so extreme -- the blocks stay alive
257K insns and are about 30% used.  Still only one alive at once, though.


-------------------- 68 of 1000 --------------------
max-live:    32,768 in 1 blocks
tot-alloc:   155,320,320 in 4,740 blocks (avg size 32768.00)
deaths:      4,740, at avg age 89,921 (0.00% of prog lifetime)
acc-ratios:  0.10 rd, 0.09 wr  (16,254,674 b-read, 14,003,959 b-written)
   at 0x4C2BF9D: malloc (/home/sewardj/VgTRUNK/progress/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
   by 0x1447B4E0: js_malloc (gcc-O2-nondebug-jemalloc/dist/include/js/Utility.h:230)
   by 0x1447B4E0: js::detail::BumpChunk::new_(unsigned long) (js/src/ds/LifoAlloc.cpp:23)
   by 0x1447B719: js::LifoAlloc::getOrCreateChunk(unsigned long) (js/src/ds/LifoAlloc.cpp:106)
   by 0x1417A78B: allocImpl (js/src/ds/LifoAlloc.h:225)
   by 0x1417A78B: alloc (js/src/ds/LifoAlloc.h:285)
   by 0x1417A78B: new_<js::jit::TempAllocator, js::LifoAlloc*> (js/src/ds/LifoAlloc.h:454)
   by 0x1417A78B: js::jit::BaselineCompile(JSContext*, JSScript*, bool) (js/src/jit/BaselineJIT.cpp:286)
   by 0x1417AF0E: js::jit::CanEnterBaselineMethod(JSContext*, js::RunState&) (js/src/jit/BaselineJIT.cpp:409)
   by 0x140F67FE: js::RunScript(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:395)
   by 0x140F6E5F: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (js/src/vm/Interpreter.cpp:487)

-------------------- 400 of 1000 --------------------
max-live:    32,768 in 1 blocks
tot-alloc:   35,979,264 in 1,098 blocks (avg size 32768.00)
deaths:      1,098, at avg age 257,425 (0.00% of prog lifetime)
acc-ratios:  0.32 rd, 0.26 wr  (11,714,292 b-read, 9,580,331 b-written)
   at 0x4C2BF9D: malloc (/home/sewardj/VgTRUNK/progress/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
   by 0x1447B4E0: js_malloc (gcc-O2-nondebug-jemalloc/dist/include/js/Utility.h:230)
   by 0x1447B4E0: js::detail::BumpChunk::new_(unsigned long) (js/src/ds/LifoAlloc.cpp:23)
   by 0x1447B719: js::LifoAlloc::getOrCreateChunk(unsigned long) (js/src/ds/LifoAlloc.cpp:106)
   by 0x1417A78B: allocImpl (js/src/ds/LifoAlloc.h:225)
   by 0x1417A78B: alloc (js/src/ds/LifoAlloc.h:285)
   by 0x1417A78B: new_<js::jit::TempAllocator, js::LifoAlloc*> (js/src/ds/LifoAlloc.h:454)
   by 0x1417A78B: js::jit::BaselineCompile(JSContext*, JSScript*, bool) (js/src/jit/BaselineJIT.cpp:286)
   by 0x1417AE81: js::jit::CanEnterBaselineAtBranch(JSContext*, js::InterpreterFrame*, bool) (js/src/jit/BaselineJIT.cpp:380)
   by 0x140F40BD: Interpret(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:2011)
   by 0x140F67B4: js::RunScript(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:409)
(In reply to Julian Seward [:jseward] from comment #0)
> I suspect this is an integration-level problem that doesn't show up when
> testing SM standalone, assuming it doesn't use mozjemalloc when standalone.

The JS shell should use mozjemalloc these days.

> * At each call, it creates a new temp allocator which is used only for
>   the duration of this call (this is the bit I'm unclear about)

Correct.

> * The temp allocator provides a heap-allocated buffer of size 32KB
>   (TempAllocator::PreferredLifoChunkSize)

Note that :njn changed this from a smaller size to 32KB in bug 1085740...

> (1) is my analysis correct?

I think so.

> (2) if yes, can we either make baseline's initial chunk size a lot smaller,
>     or (even better) allocate it on the stack?

Some thoughts:

(1) There are not a lot of places where Baseline uses TempAllocator, so *maybe* we could get away with removing it completely and replacing the handful of uses with a plain Vector or malloc or something. I can look into this.

(2) Ion also uses TempAllocator so it would have the same problem, and that one is more memory-hungry. Note that we free it off-thread in that case (bug 1377238).

(3) I wonder if it would make sense to reuse a single LifoAlloc/TempAllocator, and free it only on GC. That's less safe of course (no poisoning after each compilation) but it would eliminate most of the malloc/free cost.

(4) Have we done benchmarking with jemalloc poison-on-free turned off? Might be interesting to see how that affects Speedometer.
Oh and thanks for filing this, jseward! This is very interesting.
Hm I think we could use LifoAllocScope and cx->tempLifoAlloc. Let me try something.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attached patch Patch (Splinter Review)
jseward, in case it's easy for you to try this patch, would you mind checking whether this helps?

We now use cx->tempLifoAlloc(), which has a chunk size of 4 KB.
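
For readers following along, a rough sketch of the shape of this change at the BaselineCompile call site (simplified; the exact names and signatures here are assumptions and may differ from the actual patch):

    // Instead of building a fresh 32 KB LifoAlloc per compilation, borrow the
    // context's long-lived allocator and rewind it when the scope ends.
    MethodStatus
    BaselineCompile(JSContext* cx, JSScript* script, bool forceDebugInstrumentation)
    {
        // Records the current position of cx->tempLifoAlloc() and restores it
        // on destruction, so per-compilation allocations are reclaimed without
        // a malloc/free (and without poisoning a 32 KB chunk) on every call.
        LifoAllocScope allocScope(&cx->tempLifoAlloc());

        TempAllocator* temp = cx->tempLifoAlloc().new_<TempAllocator>(&cx->tempLifoAlloc());
        if (!temp)
            return Method_Error;

        // ... the rest of the compilation uses *temp as before ...
        return Method_Compiled;
    }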
Attachment #8892841 - Flags: feedback?(jseward)
(In reply to Jan de Mooij [:jandem] from comment #2)

> (2) Ion also uses TempAllocator so it would have the same problem, and that
> one is more memory-hungry. Note that we free it off-thread in that case (bug
> 1377238).

Yes, I see other uses of a 32KB buffer that seem to involve Ion.  But in those
cases, the buffer stays alive much longer, and all parts of it are used, so
the relative overhead of poisoning it afterwards is much smaller.

But freeing (viz, poisoning) it off-thread is in itself a performance
problem.  Imagine a buffer that has been used by one thread.  It now resides
in the D1/L2 caches of that thread's core.  If a different thread on a
different core frees it, the freeing thread will cause the underlying MESI
protocol to move the entire block contents to the freeing thread core's L2
cache, simply to overwrite it.  So we get a bunch of pointless inter-core L2
misses (data movement).

<somewhat OT>From other profiling work on Stylo I did in Q2, I believe that
cross-thread freeing and poisoning costs are a limiting factor in the
parallel speedup we can achieve from Stylo.  I have been profiling cross-
core cache misses, and SM is a big source of them.</somewhat OT>


> (3) I wonder if it would make sense to reuse a single
> LifoAlloc/TempAllocator, and free it only on GC. That's less safe of course
> (no poisoning after each compilation) but it would eliminate most of the
> malloc/free cost.

Is it possible to change the LifoAllocator so as to be a bit like
mozilla::Vector, that is: have a small (eg 4KB) initial buffer that is stack
allocated, and only move to the heap if more is needed?
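
A hypothetical sketch of that idea (not SpiderMonkey's LifoAlloc; all names here are invented for illustration): a bump allocator with a small inline buffer that can live on the stack, spilling to the heap only when a compilation actually needs more space.

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    template <std::size_t InlineBytes = 4096>
    class SmallBumpAlloc {
        alignas(std::max_align_t) std::uint8_t inline_[InlineBytes];  // common case: no heap traffic
        std::size_t used_ = 0;
        std::vector<void*> overflow_;   // heap blocks for the uncommon large case

      public:
        void* alloc(std::size_t n) {
            // Round up so returned pointers stay suitably aligned.
            n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
            if (used_ + n <= InlineBytes) {        // fast path: bump within the inline buffer
                void* p = inline_ + used_;
                used_ += n;
                return p;
            }
            void* p = std::malloc(n);              // slow path: spill to the heap
            if (p)
                overflow_.push_back(p);
            return p;
        }

        ~SmallBumpAlloc() {
            for (void* p : overflow_)
                std::free(p);                      // only overflow blocks pay free()+poisoning
        }
    };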


> (4) Have we done benchmarking with jemalloc poison-on-free turned off? Might
> be interesting to see how that affects Speedometer.

Good question.  I wondered that too.  I am doing that now.
(In reply to Julian Seward [:jseward] from comment #6)
> Is it possible to change the LifoAllocator so as to be a bit like
> mozilla::Vector, that is: have a small (eg 4KB) initial buffer that is stack
> allocated, and only move to the heap if more is needed?

Note that IIUC my patch has more or less the same effect: it uses a LifoAllocScope, which sets a "mark" in the constructor and restores the allocator to that mark in the destructor. Unused blocks are freed on GC.
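
Roughly, the pattern being described is the following (a generic sketch, not the real LifoAllocScope in js/src/ds/LifoAlloc.h; the mark()/release() API shown is an assumption):

    // RAII scope: remember the allocator's current bump position on entry and
    // rewind to it on exit, so the memory is reused rather than freed.
    template <typename Alloc>
    class ScopedMark {
        Alloc& alloc_;
        typename Alloc::Mark mark_;

      public:
        explicit ScopedMark(Alloc& alloc)
          : alloc_(alloc), mark_(alloc.mark()) {}   // record the current position

        ~ScopedMark() {
            // Allocations made inside the scope become reusable space; unused
            // chunks go back to malloc (and get poisoned) only later, e.g. when
            // a GC trims the allocator.
            alloc_.release(mark_);
        }
    };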

> > (4) Have we done benchmarking with jemalloc poison-on-free turned off? Might
> > be interesting to see how that affects Speedometer.
> 
> Good question.  I wondered that too.  I am doing that now.

Thanks!
> Note that :njn changed this from a smaller size to 32KB in bug 1085740...

IIRC, prior to that bug we'd always allocate a smaller chunk (e.g. 4 KiB) plus a 32 KiB chunk. And bug 1085740 changed it to just allocate a 32 KiB chunk. So it should have reduced the number of bytes being poisoned.
(In reply to Jan de Mooij [:jandem] from comment #2)
> (4) Have we done benchmarking with jemalloc poison-on-free turned off?
> Might be interesting to see how that affects Speedometer.

See bug 1387001.
(In reply to Jan de Mooij [:jandem] from comment #5)
> Created attachment 8892841 [details] [diff] [review]
> Patch
> jseward, in case it's easy for you to try this patch, would you mind
> checking whether this helps?

Just that alone reduces Firefox's allocation rate in the first 17.8 billion
instructions of Speedometer from one byte allocated per 12.84 instructions to
one byte allocated per 14.49 instructions.  That is, it reduces the total heap
allocation in this section from 1385MB to "only" 1227MB (cough cough).  FWIW,
17.8 billion insns is easily less than 10 cpu seconds.

So +1 from me!

Per comments in bug 1387001, whether that has a detectable performance
benefit is a much harder question.  But I think we should land this anyway.
Attachment #8892841 - Flags: review?(bhackett1024)
Attachment #8892841 - Flags: feedback?(jseward)
Attachment #8892841 - Flags: feedback+
(In reply to Julian Seward [:jseward] from comment #10)
> Just that alone reduces Firefox's allocation rate in the first 17.8 billion
> instructions of Speedometer from one byte allocated per 12.84 instructions to
> one byte allocated per 14.49 instructions.  That is, it reduces the total heap
> allocation in this section from 1385MB to "only" 1227MB (cough cough).  FWIW,
> 17.8 billion insns is easily less than 10 cpu seconds.

Nice, thanks!
Attachment #8892841 - Flags: review?(bhackett1024) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/52ca726cf896
Use cx->tempLifoAlloc instead of a custom LifoAlloc for Baseline compilation. r=bhackett
https://hg.mozilla.org/mozilla-central/rev/52ca726cf896
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Attached file DMD output
I did a DMD --cumulative run before and after this patch landed. It fixed the Baseline cases but that's a minority of cases. It shows up in other places (esp. IonBuilder and BacktrackingAllocator) a lot more.

I've attached the DMD output showing the top 8 records, which all involve this data structure.
jandem, could this same trick be applied to those other parts of SpiderMonkey?
Flags: needinfo?(jdemooij)
This could also be an issue for the WebAssembly compilations; we need to investigate.  Those compilations batch many functions together to reduce per-function overhead (nominally synchronization overhead, but that could work for TempAllocator too), so it might not be a problem, but we should look just to have looked.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> jandem, could this same trick be applied to those other parts of
> SpiderMonkey?

Not easily. (1) cx->tempLifoAlloc can't be used for Ion due to off-thread Ion compilation, (2) Ion needs the 32 KB chunk size (the issue you fixed in bug 1085740).

Ion compilation is very wasteful though. #1 in your DMD output is allocating slots vectors for MBasicBlock - there's no need to keep these around for all blocks; we should be able to recycle them (I think I added this recycling for asm.js/wasm a long time ago).

We should spend some time figuring out which parts of Ion allocate the most memory. It might also be a good idea to reuse Ion LifoAllocs. I'm interested in fixing this, but I'm not sure it's worth it for 57 - I have a lot on my plate atm.
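
A hypothetical sketch of the slots-vector recycling idea mentioned above (invented names, not Ion's actual data structures): keep a small pool of slots vectors and hand one back as soon as a block no longer needs it, instead of holding one per basic block for the whole compilation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class SlotsVectorPool {
        std::vector<std::vector<std::uint32_t>> free_;   // vectors whose storage can be reused

      public:
        std::vector<std::uint32_t> acquire(std::size_t nslots) {
            std::vector<std::uint32_t> v;
            if (!free_.empty()) {
                v = std::move(free_.back());             // reuse an old vector's capacity
                free_.pop_back();
            }
            v.assign(nslots, 0);
            return v;
        }

        void recycle(std::vector<std::uint32_t>&& v) {
            free_.push_back(std::move(v));               // keep the storage for the next block
        }
    };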
Flags: needinfo?(jdemooij)
I filed bug 1388002 for Ion.
(In reply to Nicholas Nethercote [:njn] from comment #14)
> I did a DMD --cumulative run before and after this patch landed. It fixed
> the Baseline cases but that's a minority of cases.

For clarity .. reducing Ion et al allocation is worth doing, but that is a
slightly different issue to what I was concerned about here, namely:
overallocating in Baseline and then dragging the unused areas up the memory
hierarchy in free() merely so as to be able to poison it.

My impression from DHAT numbers is that, while Ion et al do allocate a lot
of memory, they also use all (or most) of it.  So the relative cost of
poisoning is less than for Baseline.  Partly because that poisoning cost is
amortized over more useful work, and partly because that work will have
dragged the relevant bits of address space up the memory hierarchy anyway,
so poisoning it will be cheaper.
https://hg.mozilla.org/projects/date/rev/52ca726cf89698336d11277976060779ad7cf9fd
Bug 1386572 - Use cx->tempLifoAlloc instead of a custom LifoAlloc for Baseline compilation. r=bhackett