Closed Bug 1386572 Opened 7 years ago Closed 7 years ago

The Baseline JIT totals the Level-1 data cache every time it runs

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: jseward, Assigned: jandem)

Details

Attachments

(2 files)

I'm not 100% sure I've joined the dots here correctly.  Nevertheless...

I suspect this is an integration-level problem that doesn't show up when
testing SM standalone, assuming it doesn't use mozjemalloc when standalone.

I noticed this in profiles when running http://browserbench.org/Speedometer/.
What I think happens is:

* jit::BaselineCompile (js/src/jit/BaselineJIT.cpp:277) is called a lot.

* At each call, it creates a new temp allocator which is used only for
  the duration of this call (this is the bit I'm unclear about)

    LifoAlloc alloc(TempAllocator::PreferredLifoChunkSize);
    TempAllocator* temp = alloc.new_<TempAllocator>(&alloc);

* The temp allocator provides a heap-allocated buffer of size 32KB
  (TempAllocator::PreferredLifoChunkSize)

* On average (for the Speedometer workload) the baseline JIT only uses
  the first 10% of the buffer.

* After the call, the buffer is freed.  This is the bad part.  mozjemalloc
  poisons the entire buffer for security reasons, hence writing 32KB of
  contiguous data.

Most processors don't even have a 32KB D1; 16KB is more typical.
Hence the poisoning is going to trash D1 entirely and some part of the L2
behind it.  Anything else running on the same core (most importantly, us!)
will then take a bunch of D1 misses soon afterwards.

The underlying problem is that mozjemalloc's policy of poisoning-on-free (a
necessary evil, I understand) changes the total cost of renting a block of
size N from O(1) to O(N), hence negating the implicit assumption that it's
performance neutral to allocate blocks that are too large and use only a
part of the space.
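
To make that cost argument concrete, here is a minimal illustrative sketch (not mozjemalloc's actual code; the poison pattern and mechanism here are assumptions) of why poison-on-free makes the cost of freeing a block proportional to its size rather than constant:

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // Illustrative only: a free() that poisons the whole allocation must touch
    // every byte, so its cost scales with the block size N, not with the
    // fraction of the block the caller actually used.
    static void poisoning_free(void* ptr, std::size_t size) {
        // Overwrite the entire block so stale pointers into it are easier to
        // spot and harder to exploit.  This write is O(size) and drags all
        // `size` bytes through the data caches, even if only a few KB were used.
        std::memset(ptr, 0xE5, size);
        std::free(ptr);
    }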

Two questions:

(1) is my analysis correct?

(2) if yes, can we either make baseline's initial chunk size a lot smaller,
    or (even better) allocate it on the stack?
Here are the relevant DHAT reports resulting from running 24 out of 420
iterations of Speedometer.  To interpret the first one:

* This allocation stack only ever had one block live at once

* but it created and destroyed 4740 blocks in total, all of size 32768

* Each block stayed alive an average of 89921 insns.  Maybe that's the
  average cost of a call to the baseline compiler?

* On average, only 10% of the block was read and 9% was written (no, don't
  ask me), which corresponds to about 3.2KB of actually used area

The second report is similar but not so extreme -- the blocks stay alive
257K insns and are about 30% used.  Still only one alive at once, though.


-------------------- 68 of 1000 --------------------
max-live:    32,768 in 1 blocks
tot-alloc:   155,320,320 in 4,740 blocks (avg size 32768.00)
deaths:      4,740, at avg age 89,921 (0.00% of prog lifetime)
acc-ratios:  0.10 rd, 0.09 wr  (16,254,674 b-read, 14,003,959 b-written)
   at 0x4C2BF9D: malloc (/home/sewardj/VgTRUNK/progress/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
   by 0x1447B4E0: js_malloc (gcc-O2-nondebug-jemalloc/dist/include/js/Utility.h:230)
   by 0x1447B4E0: js::detail::BumpChunk::new_(unsigned long) (js/src/ds/LifoAlloc.cpp:23)
   by 0x1447B719: js::LifoAlloc::getOrCreateChunk(unsigned long) (js/src/ds/LifoAlloc.cpp:106)
   by 0x1417A78B: allocImpl (js/src/ds/LifoAlloc.h:225)
   by 0x1417A78B: alloc (js/src/ds/LifoAlloc.h:285)
   by 0x1417A78B: new_<js::jit::TempAllocator, js::LifoAlloc*> (js/src/ds/LifoAlloc.h:454)
   by 0x1417A78B: js::jit::BaselineCompile(JSContext*, JSScript*, bool) (js/src/jit/BaselineJIT.cpp:286)
   by 0x1417AF0E: js::jit::CanEnterBaselineMethod(JSContext*, js::RunState&) (js/src/jit/BaselineJIT.cpp:409)
   by 0x140F67FE: js::RunScript(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:395)
   by 0x140F6E5F: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (js/src/vm/Interpreter.cpp:487)

-------------------- 400 of 1000 --------------------
max-live:    32,768 in 1 blocks
tot-alloc:   35,979,264 in 1,098 blocks (avg size 32768.00)
deaths:      1,098, at avg age 257,425 (0.00% of prog lifetime)
acc-ratios:  0.32 rd, 0.26 wr  (11,714,292 b-read, 9,580,331 b-written)
   at 0x4C2BF9D: malloc (/home/sewardj/VgTRUNK/progress/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
   by 0x1447B4E0: js_malloc (gcc-O2-nondebug-jemalloc/dist/include/js/Utility.h:230)
   by 0x1447B4E0: js::detail::BumpChunk::new_(unsigned long) (js/src/ds/LifoAlloc.cpp:23)
   by 0x1447B719: js::LifoAlloc::getOrCreateChunk(unsigned long) (js/src/ds/LifoAlloc.cpp:106)
   by 0x1417A78B: allocImpl (js/src/ds/LifoAlloc.h:225)
   by 0x1417A78B: alloc (js/src/ds/LifoAlloc.h:285)
   by 0x1417A78B: new_<js::jit::TempAllocator, js::LifoAlloc*> (js/src/ds/LifoAlloc.h:454)
   by 0x1417A78B: js::jit::BaselineCompile(JSContext*, JSScript*, bool) (js/src/jit/BaselineJIT.cpp:286)
   by 0x1417AE81: js::jit::CanEnterBaselineAtBranch(JSContext*, js::InterpreterFrame*, bool) (js/src/jit/BaselineJIT.cpp:380)
   by 0x140F40BD: Interpret(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:2011)
   by 0x140F67B4: js::RunScript(JSContext*, js::RunState&) (js/src/vm/Interpreter.cpp:409)
(In reply to Julian Seward [:jseward] from comment #0)
> I suspect this is an integration-level problem that doesn't show up when
> testing SM standalone, assuming it doesn't use mozjemalloc when standalone.

The JS shell should use mozjemalloc these days.

> * At each call, it creates a new temp allocator which is used only for
>   the duration of this call (this is the bit I'm unclear about)

Correct.

> * The temp allocator provides a heap-allocated buffer of size 32KB
>   (TempAllocator::PreferredLifoChunkSize)

Note that :njn changed this from a smaller size to 32KB in bug 1085740...

> (1) is my analysis correct?

I think so.

> (2) if yes, can we either make baseline's initial chunk size a lot smaller,
>     or (even better) allocate it on the stack?

Some thoughts:

(1) There are not a lot of places where Baseline uses TempAllocator, so *maybe* we could get away with removing it completely and replacing the handful of uses with a plain Vector or malloc or something. I can look into this.

(2) Ion also uses TempAllocator so it would have the same problem, and that one is more memory-hungry. Note that we free it off-thread in that case (bug 1377238).

(3) I wonder if it would make sense to reuse a single LifoAlloc/TempAllocator, and free it only on GC. That's less safe of course (no poisoning after each compilation) but it would eliminate most of the malloc/free cost.

(4) Have we done benchmarking with jemalloc poison-on-free turned off? Might be interesting to see how that affects Speedometer.
Oh and thanks for filing this, jseward! This is very interesting.
Hm I think we could use LifoAllocScope and cx->tempLifoAlloc. Let me try something.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attached patch Patch (Splinter Review)
jseward, in case it's easy for you to try this patch, would you mind checking whether this helps?

We now use cx->tempLifoAlloc(), which has a chunk size of 4 KB.
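
For readers following along, a rough sketch of the shape of this change at the BaselineCompile call site (simplified; the exact names and signatures here are assumptions and may differ from the actual patch):

    // Instead of building a fresh 32 KB LifoAlloc per compilation, borrow the
    // context's long-lived allocator and rewind it when the scope ends.
    MethodStatus
    BaselineCompile(JSContext* cx, JSScript* script, bool forceDebugInstrumentation)
    {
        // Records the current position of cx->tempLifoAlloc() and restores it
        // on destruction, so per-compilation allocations are reclaimed without
        // a malloc/free (and without poisoning a 32 KB chunk) on every call.
        LifoAllocScope allocScope(&cx->tempLifoAlloc());

        TempAllocator* temp = cx->tempLifoAlloc().new_<TempAllocator>(&cx->tempLifoAlloc());
        if (!temp)
            return Method_Error;

        // ... the rest of the compilation uses *temp as before ...
        return Method_Compiled;
    }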
Attachment #8892841 - Flags: feedback?(jseward)
(In reply to Jan de Mooij [:jandem] from comment #2)

> (2) Ion also uses TempAllocator so it would have the same problem, and that
> one is more memory-hungry. Note that we free it off-thread in that case (bug
> 1377238).

Yes, I see other uses of a 32KB buffer that seem to involve Ion.  But in those
cases, the buffer stays alive much longer, and all parts of it are used, so
the relative overhead of poisoning it afterwards is much smaller.

But freeing (viz, poisoning) it off-thread is in itself a performance
problem.  Imagine a buffer that has been used by one thread.  It now resides
in the D1/L2 caches of that thread's core.  If a different thread on a
different core frees it, the freeing thread will cause the underlying MESI
protocol to move the entire block contents to the freeing thread core's L2
cache, simply to overwrite it.  So we get a bunch of pointless inter-core L2
misses (data movement).

<somewhat OT>From other profiling work on Stylo I did in Q2, I believe that
cross-thread freeing and poisoning costs are a limiting factor in the
parallel speedup we can achieve from Stylo.  I have been profiling cross-
core cache misses, and SM is a big source of them.</somewhat OT>


> (3) I wonder if it would make sense to reuse a single
> LifoAlloc/TempAllocator, and free it only on GC. That's less safe of course
> (no poisoning after each compilation) but it would eliminate most of the
> malloc/free cost.

Is it possible to change the LifoAllocator so as to be a bit like
mozilla::Vector, that is: have a small (eg 4KB) initial buffer that is stack
allocated, and only move to the heap if more is needed?
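
A hypothetical sketch of that idea (not SpiderMonkey's LifoAlloc; all names here are invented for illustration): a bump allocator with a small inline buffer that can live on the stack, spilling to the heap only when a compilation actually needs more space.

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    template <std::size_t InlineBytes = 4096>
    class SmallBumpAlloc {
        alignas(std::max_align_t) std::uint8_t inline_[InlineBytes];  // common case: no heap traffic
        std::size_t used_ = 0;
        std::vector<void*> overflow_;   // heap blocks for the uncommon large case

      public:
        void* alloc(std::size_t n) {
            // Round up so returned pointers stay suitably aligned.
            n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
            if (used_ + n <= InlineBytes) {        // fast path: bump within the inline buffer
                void* p = inline_ + used_;
                used_ += n;
                return p;
            }
            void* p = std::malloc(n);              // slow path: spill to the heap
            if (p)
                overflow_.push_back(p);
            return p;
        }

        ~SmallBumpAlloc() {
            for (void* p : overflow_)
                std::free(p);                      // only overflow blocks pay free()+poisoning
        }
    };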


> (4) Have we done benchmarking with jemalloc poison-on-free turned off? Might
> be interesting to see how that affects Speedometer.

Good question.  I wondered that too.  I am doing that now.
(In reply to Julian Seward [:jseward] from comment #6)
> Is it possible to change the LifoAllocator so as to be a bit like
> mozilla::Vector, that is: have a small (eg 4KB) initial buffer that is stack
> allocated, and only move to the heap if more is needed?

Note that IIUC my patch has more or less the same effect: it uses a LifoAllocScope, which sets a "mark" in the constructor and restores the allocator to that mark in the destructor. Unused blocks are freed on GC.
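
Roughly, the pattern being described is the following (a generic sketch, not the real LifoAllocScope in js/src/ds/LifoAlloc.h; the mark()/release() API shown is an assumption):

    // RAII scope: remember the allocator's current bump position on entry and
    // rewind to it on exit, so the memory is reused rather than freed.
    template <typename Alloc>
    class ScopedMark {
        Alloc& alloc_;
        typename Alloc::Mark mark_;

      public:
        explicit ScopedMark(Alloc& alloc)
          : alloc_(alloc), mark_(alloc.mark()) {}   // record the current position

        ~ScopedMark() {
            // Allocations made inside the scope become reusable space; unused
            // chunks go back to malloc (and get poisoned) only later, e.g. when
            // a GC trims the allocator.
            alloc_.release(mark_);
        }
    };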

> > (4) Have we done benchmarking with jemalloc poison-on-free turned off? Might
> > be interesting to see how that affects Speedometer.
> 
> Good question.  I wondered that too.  I am doing that now.

Thanks!
> Note that :njn changed this from a smaller size to 32KB in bug 1085740...

IIRC, prior to that bug we'd always allocate a smaller chunk (e.g. 4 KiB) plus a 32 KiB chunk. And bug 1085740 changed it to just allocate a 32 KiB chunk. So it should have reduced the number of bytes being poisoned.
(In reply to Jan de Mooij [:jandem] from comment #2)
> (4) Have we done benchmarking with jemalloc poison-on-free turned off?
> Might be interesting to see how that affects Speedometer.

See bug 1387001.
(In reply to Jan de Mooij [:jandem] from comment #5)
> Created attachment 8892841 [details] [diff] [review]
> Patch
> jseward, in case it's easy for you to try this patch, would you mind
> checking whether this helps?

Just that alone reduces Firefox's allocation rate in the first 17.8 billion
instructions of Speedometer from one byte allocated per 12.84 instructions to
one byte allocated per 14.49 instructions.  That is, it reduces the total heap
allocation in this section from 1385MB to "only" 1227MB (cough cough).  FWIW,
17.8 billion insns is easily less than 10 cpu seconds.

So +1 from me!

Per comments in bug 1387001, whether that has a detectable performance
benefit is a much harder question.  But I think we should land this anyway.
Attachment #8892841 - Flags: review?(bhackett1024)
Attachment #8892841 - Flags: feedback?(jseward)
Attachment #8892841 - Flags: feedback+
(In reply to Julian Seward [:jseward] from comment #10)
> Just that alone reduces Firefox's allocation rate in the first 17.8 billion
> instructions of Speedometer from one byte allocated per 12.84 instructions to
> one byte allocated per 14.49 instructions.  That is, it reduces the total heap
> allocation in this section from 1385MB to "only" 1227MB (cough cough).  FWIW,
> 17.8 billion insns is easily less than 10 cpu seconds.

Nice, thanks!
Attachment #8892841 - Flags: review?(bhackett1024) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/52ca726cf896
Use cx->tempLifoAlloc instead of a custom LifoAlloc for Baseline compilation. r=bhackett
https://hg.mozilla.org/mozilla-central/rev/52ca726cf896
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Attached file DMD output
I did a DMD --cumulative run before and after this patch landed. It fixed the Baseline cases but that's a minority of cases. It shows up in other places (esp. IonBuilder and BacktrackingAllocator) a lot more.

I've attached the DMD output showing the top 8 records, which all involve this data structure.
jandem, could this same trick be applied to those other parts of SpiderMonkey?
Flags: needinfo?(jdemooij)
This could also be an issue for the WebAssembly compilations; we need to investigate.  Those compilations batch many functions together to reduce per-function overhead (nominally synchronization overhead, but that could work for TempAllocator too), so it might not be a problem, but we should look just to have looked.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> jandem, could this same trick be applied to those other parts of
> SpiderMonkey?

Not easily. (1) cx->tempLifoAlloc can't be used for Ion due to off-thread Ion compilation, (2) Ion needs the 32 KB chunk size (the issue you fixed in bug 1085740).

Ion compilation is very wasteful though. #1 in your DMD output is allocating slots vectors for MBasicBlock - there's no need to keep these around for all blocks; we should be able to recycle them (I think I added this recycling for asm.js/wasm a long time ago).

We should spend some time figuring out which parts of Ion allocate the most memory. It might also be a good idea to reuse Ion LifoAllocs. I'm interested in fixing this, but I'm not sure it's worth it for 57 - I have a lot on my plate atm.
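
A hypothetical sketch of the slots-vector recycling idea mentioned above (invented names, not Ion's actual data structures): keep a small pool of slots vectors and hand one back as soon as a block no longer needs it, instead of holding one per basic block for the whole compilation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class SlotsVectorPool {
        std::vector<std::vector<std::uint32_t>> free_;   // vectors whose storage can be reused

      public:
        std::vector<std::uint32_t> acquire(std::size_t nslots) {
            std::vector<std::uint32_t> v;
            if (!free_.empty()) {
                v = std::move(free_.back());             // reuse an old vector's capacity
                free_.pop_back();
            }
            v.assign(nslots, 0);
            return v;
        }

        void recycle(std::vector<std::uint32_t>&& v) {
            free_.push_back(std::move(v));               // keep the storage for the next block
        }
    };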
Flags: needinfo?(jdemooij)
I filed bug 1388002 for Ion.
(In reply to Nicholas Nethercote [:njn] from comment #14)
> I did a DMD --cumulative run before and after this patch landed. It fixed
> the Baseline cases but that's a minority of cases.

For clarity .. reducing Ion et al allocation is worth doing, but that is a
slightly different issue to what I was concerned about here, namely:
overallocating in Baseline and then dragging the unused areas up the memory
hierarchy in free() merely so as to be able to poison it.

My impression from DHAT numbers is that, while Ion et al do allocate a lot
of memory, they also use all (or most) of it.  So the relative cost of
poisoning is less than for Baseline.  Partly because that poisoning cost is
amortized over more useful work, and partly because that work will have
dragged the relevant bits of address space up the memory hierarchy anyway,
so poisoning it will be cheaper.
https://hg.mozilla.org/projects/date/rev/52ca726cf89698336d11277976060779ad7cf9fd
Bug 1386572 - Use cx->tempLifoAlloc instead of a custom LifoAlloc for Baseline compilation. r=bhackett