Bug 1276908 - Make our JIT post barriers more efficient
Opened 6 years ago
Closed 6 years ago
Categories: Core :: JavaScript: GC, defect
Tracking
RESOLVED FIXED
Target Milestone: mozilla50

Tracking | Status
---|---
firefox50 | fixed
People
(Reporter: jonco, Assigned: jonco)
References
(Blocks 1 open bug)
Attachments
(5 files, 1 obsolete file)
889 bytes, patch | Details | Diff | Splinter Review
3.46 KB, patch | terrence: review+, terrence: feedback+ | Details | Diff | Splinter Review
5.17 KB, patch | terrence: review+, jandem: review+ | Details | Diff | Splinter Review
2.84 KB, patch | terrence: review+ | Details | Diff | Splinter Review
25.93 KB, patch | terrence: review+ | Details | Diff | Splinter Review
I wrote a test program that creates a large array of objects and noticed that it spent ~25% of its time calling from JIT code into the VM to run post barriers. This bug is to investigate improving this situation.
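For context, a minimal sketch of what a generational post write barrier has to do; the types and names below are illustrative assumptions for this discussion, not SpiderMonkey's actual classes:

```cpp
#include <unordered_set>

// Illustrative stand-ins only; not the real SpiderMonkey GC classes.
struct Cell { bool inNursery; };

struct StoreBuffer {
    // Remembered set of tenured cells that may point into the nursery.
    std::unordered_set<Cell*> wholeCells;
    void putWholeCell(Cell* cell) { wholeCells.insert(cell); }
};

// When a tenured cell is written a pointer to a nursery cell, the edge must
// be remembered so the next minor GC can find and update it without scanning
// the whole tenured heap. This is the kind of call the JIT was making into
// the VM for every barriered write.
void PostWriteBarrier(StoreBuffer& sb, Cell* owner, Cell* value) {
    if (value && value->inNursery && !owner->inNursery)
        sb.putWholeCell(owner);
}
```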
Comment 1•6 years ago
How much work would it be to move the is-this-item-the-last-item-we-put-into-the-store-buffer check inline into the jit code? The reason that 1-item cache works so well is because it is the common case, so if we could avoid a VM call in the common case, that seems pretty good. This would be much easier with the whole cell buffer than the slots buffer (which is used for array elements, unfortunately).

I think we also talked about checking the ObjectElements::Flags::IN_WHOLE_CELL_BUFFER bit inline in the jit code, too. We could also investigate eliding post-barriers in the JIT when we can prove that we already put the edge in the store buffer (again, much easier with the whole cell buffer than with the slots buffer).

Can you share how you were creating the large array, and which post barrier (PostWriteBarrier or PostWriteElementBarrier) we end up running?
Blocks: GC.60fps
Assignee
Comment 2•6 years ago
Here are the benchmarks I'm using. This is about getting rid of calls to PostWriteBarrier, although I have ideas for PostWriteElementBarrier too.

The plan is to change the whole cell store buffer representation from a HashSet to an optional bit vector associated with each arena. This is simple enough to manipulate that we can do it from jitcode without the overhead of calling into the VM. Maybe we just want to check whether a cell is already in the buffer, or maybe we go the whole way and add it too; I'm not sure.
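To make the plan concrete, a minimal sketch of the per-arena bit vector idea follows; the constants and names (ArenaSize, CellSize, ArenaCellSet) are assumptions for illustration, not the eventual patch:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative constants; the real values come from the GC's arena layout.
constexpr size_t ArenaSize = 4096;
constexpr size_t CellSize = 16;
constexpr size_t ArenaCellCount = ArenaSize / CellSize;

// One bit per cell in an arena, recording membership in the whole cell
// store buffer. Allocated lazily, only for arenas whose cells have been
// put in the buffer.
struct ArenaCellSet {
    uint32_t bits[ArenaCellCount / 32] = {};

    void putCell(size_t cellIndex) {
        bits[cellIndex / 32] |= uint32_t(1) << (cellIndex % 32);
    }
    bool hasCell(size_t cellIndex) const {
        return bits[cellIndex / 32] & (uint32_t(1) << (cellIndex % 32));
    }
};
```

An index-and-OR like this is cheap enough that JIT code could plausibly emit it inline instead of calling into the VM.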
Assignee
Comment 3•6 years ago
The plan is to associate an optional bit vector with each arena that tells us whether the cells in that arena are in the whole cell store buffer. To do this we need to add a pointer to the arena header. Here are the stats about the arena layout for the different thing kinds on 32 and 64 bit builds.

64-bit:

                        Pre:                    Post:
                        Size  Offset  Count     Size  Offset  Count
FUNCTION:                64     64      63       64     64      63
FUNCTION_EXTENDED:       80     96      50       80     96      50
OBJECT0:                 32     32     127       32     64     126
OBJECT0_BACKGROUND:      32     32     127       32     64     126
OBJECT2:                 48     64      84       48     64      84
OBJECT2_BACKGROUND:      48     64      84       48     64      84
OBJECT4:                 64     64      63       64     64      63
OBJECT4_BACKGROUND:      64     64      63       64     64      63
OBJECT8:                 96     64      42       96     64      42
OBJECT8_BACKGROUND:      96     64      42       96     64      42
OBJECT12:               128    128      31      128    128      31
OBJECT12_BACKGROUND:    128    128      31      128    128      31
OBJECT16:               160     96      25      160     96      25
OBJECT16_BACKGROUND:    160     96      25      160     96      25
SCRIPT:                 208    144      19      208    144      19
LAZY_SCRIPT:             64     64      63       64     64      63
SHAPE:                   40     56     101       40     56     101
ACCESSOR_SHAPE:          56     64      72       56     64      72
BASE_SHAPE:              40     56     101       40     56     101
OBJECT_GROUP:            48     64      84       48     64      84
FAT_INLINE_STRING:       32     32     127       32     64     126
STRING:                  24     40     169       24     40     169
EXTERNAL_STRING:         24     40     169       24     40     169
SYMBOL:                  24     40     169       24     40     169
JITCODE:                 48     64      84       48     64      84

32-bit:

                        Pre:                    Post:
                        Size  Offset  Count     Size  Offset  Count
FUNCTION:                32     32     127       32     32     127
FUNCTION_EXTENDED:       48     16      85       48     64      84
OBJECT0:                 16     16     255       16     32     254
OBJECT0_BACKGROUND:      16     16     255       16     32     254
OBJECT2:                 32     32     127       32     32     127
OBJECT2_BACKGROUND:      32     32     127       32     32     127
OBJECT4:                 48     16      85       48     64      84
OBJECT4_BACKGROUND:      48     16      85       48     64      84
OBJECT8:                 80     16      51       80     96      50
OBJECT8_BACKGROUND:      80     16      51       80     96      50
OBJECT12:               112     64      36      112     64      36
OBJECT12_BACKGROUND:    112     64      36      112     64      36
OBJECT16:               144     64      28      144     64      28
OBJECT16_BACKGROUND:    144     64      28      144     64      28
SCRIPT:                 144     64      28      144     64      28
LAZY_SCRIPT:             48     16      85       48     64      84
SHAPE:                   24     16     170       24     40     169
ACCESSOR_SHAPE:          32     32     127       32     32     127
BASE_SHAPE:              24     16     170       24     40     169
OBJECT_GROUP:            24     16     170       24     40     169
FAT_INLINE_STRING:       32     32     127       32     32     127
STRING:                  16     16     255       16     32     254
EXTERNAL_STRING:         16     16     255       16     32     254
SYMBOL:                  16     16     255       16     32     254
JITCODE:                 40     16     102       40     56     101

In summary, on 64 bit builds we get one less OBJECT0 (+ BACKGROUND) and FAT_INLINE_STRING per arena (126 rather than 127). All other kinds are unaffected. On 32 bit builds the effect is greater, with the following kinds having one less cell per arena: FUNCTION_EXTENDED, OBJECT0 (+ BACKGROUND), OBJECT4, OBJECT8, LAZY_SCRIPT, SHAPE, BASE_SHAPE, OBJECT_GROUP, STRING, EXTERNAL_STRING, JITCODE.
Attachment #8762096 -
Flags: feedback?(terrence)
Assignee
Comment 4•6 years ago
Reimplement the whole cell store buffer using an optional bitvector associated with each arena. These are allocated from the nursery.
Assignee
Comment 5•6 years ago
For writes to objects known at compile time we can compile a faster store buffer check than calling into the VM to do this.
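Roughly, the fast path the generated code can take for a compile-time-constant object looks like the check below; the names are assumptions, and the real patch emits this as MacroAssembler instructions rather than C++:

```cpp
#include <cstddef>
#include <cstdint>

// For a constant object, the arena, the word offset into its cell set and
// the bit mask can all be computed when the code is compiled, so the
// emitted code reduces to a single bit test plus a fallback VM call.
inline bool cellAlreadyInWholeCellBuffer(const uint32_t* cellSetBits,
                                         size_t wordOffset, uint32_t mask) {
    // Fast path: the bit is already set, so the cell is in the buffer and
    // the post barrier can be skipped entirely.
    return (cellSetBits[wordOffset] & mask) != 0;
}
```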
Assignee
Comment 6•6 years ago
Benchmark results for the owb.js microbenchmark:

Initial:
1.29 real 1.28 user 0.01 sys
1.30 real 1.28 user 0.01 sys
1.28 real 1.26 user 0.01 sys

With reimplemented whole cell store buffer:
0.86 real 0.85 user 0.01 sys
0.87 real 0.86 user 0.01 sys
0.86 real 0.84 user 0.01 sys

With constant object optimisation:
0.62 real 0.61 user 0.01 sys
0.63 real 0.61 user 0.01 sys
0.62 real 0.60 user 0.01 sys
Assignee
Comment 7•6 years ago
I also spent a while optimising the non-constant object case so we could check this in JIT code without calling into the VM, and made it work on x86 (yay for the BTS instruction), arm and arm64. It wasn't any faster.
Updated•6 years ago
Attachment #8762096 -
Flags: review+
Attachment #8762096 -
Flags: feedback?(terrence)
Attachment #8762096 -
Flags: feedback+
Comment 8•6 years ago
Comment on attachment 8762098 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer

Review of attachment 8762098 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see a copy using BitArray first.

::: js/src/gc/Heap.h
@@ +337,5 @@
>  * wastes space but allows to avoid expensive devisions by thing's size when
>  * accessing the bitmap. In addition this allows to use some bits for colored
>  * marking during the cycle GC.
>  */
> +const size_t ArenaCellCount = ArenaSize / CellSize;

Maybe also static_assert that ArenaSize % CellSize is 0.

::: js/src/gc/StoreBuffer-inl.h
@@ +50,5 @@
> +    uint32_t mask = getMask(cellIndex);
> +    uint32_t word = bits[wordIndex];
> +    if ((word & mask) == 0)
> +        bits[wordIndex] = word | mask;
> +}

All of this code seems to be identical to the code in ds/BitArray.h. Please re-implement |bits| using a BitArray.

::: js/src/gc/StoreBuffer.cpp
@@ +137,5 @@
> +        oomUnsafe.crash("Failed to allocate WholeCellSet");
> +        return nullptr;
> +    }
> +
> +    if (nursery.approxFreeSpace() < 64 * 1024)

Please make 64 * 1024 a constant of some sort in the header so that it's apparent that it is a knob we can fiddle with.
Attachment #8762098 -
Flags: feedback+
Comment 9•6 years ago
Ah! I guess we can't naively use BitArray because we need to have a fixed word size for the JIT integration! I think it would be worth the effort to give BitArray a template parameter of WordType instead of having a second copy with a fixed size type though. Or even to make BitArray just use uint32_t, for that matter.
Comment 10•6 years ago
Comment on attachment 8762100 [details] [diff] [review] bitvector-3-whole-cell-store-buffer-jit-constant Review of attachment 8762100 [details] [diff] [review]: ----------------------------------------------------------------- This is nice. Obviously don't land this if it's not actually faster, but you have my r+ to do so if it is.
Attachment #8762100 -
Flags: review+
Comment 11•6 years ago
(In reply to Jon Coppeard (:jonco) from comment #6)
> Benchmark results for the owb.js microbenchmark:

Great results!
Assignee
Comment 12•6 years ago
Make BitArray use 32 bit words everywhere so it's easier to access its storage from the JITs.
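A simplified sketch of the shape this gives BitArray; the real class lives in js/src/ds/BitArray.h and has more methods, so this is only the storage and index/mask logic, written under that assumption:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

template <size_t nbits>
class BitArray {
  public:
    // Fixed 32-bit word type so JIT code can address the storage with a
    // known element width on every platform.
    using WordT = uint32_t;

    static const size_t bitsPerElement = sizeof(WordT) * 8;
    static const size_t numSlots = (nbits + bitsPerElement - 1) / bitsPerElement;

    void clear() { std::memset(map, 0, sizeof(map)); }

    bool get(size_t offset) const {
        size_t index;
        WordT mask;
        getIndexAndMask(offset, &index, &mask);
        return map[index] & mask;
    }

    void set(size_t offset) {
        size_t index;
        WordT mask;
        getIndexAndMask(offset, &index, &mask);
        map[index] |= mask;
    }

    // Exposed so other code (e.g. the store buffer) can compute the same
    // word index and bit mask that generated JIT code will use.
    static void getIndexAndMask(size_t offset, size_t* indexp, WordT* maskp) {
        *indexp = offset / bitsPerElement;
        *maskp = WordT(1) << (offset % bitsPerElement);
    }

  private:
    WordT map[numSlots];
};
```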
Attachment #8762546 -
Flags: review?(terrence)
Assignee
Comment 13•6 years ago
Updated patch using BitArray and with review comments applied.
Attachment #8762098 -
Attachment is obsolete: true
Attachment #8762547 -
Flags: review?(terrence)
Comment 14•6 years ago
Comment on attachment 8762546 [details] [diff] [review]
bitvector-0-bit-array-changes

Review of attachment 8762546 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ds/BitArray.h
@@ +18,5 @@
>  template <size_t nbits>
>  class BitArray
>  {
>    private:
> +    using WordT = uint32_t;

Maybe add a comment on this to the effect that the fixed word size is to make it easier to code against in the JIT?
Attachment #8762546 -
Flags: review?(terrence) → review+
Comment 15•6 years ago
Comment on attachment 8762547 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer v2

Review of attachment 8762547 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Nursery.h
@@ +217,5 @@
>      MOZ_ALWAYS_INLINE uintptr_t heapEnd() const {
>          return heapEnd_;
>      }
>
> +    // Free space remaining, now counting chunk trailers.

s/now/not/

::: js/src/gc/StoreBuffer-inl.h
@@ +23,5 @@
>
> +inline /* static */ void
> +ArenaCellSet::getWordIndexAndMask(size_t cellIndex, size_t* wordp, uint32_t* maskp)
> +{
> +    BitArray<ArenaCellCount>::getIndexAndMask(cellIndex, wordp, maskp);

Much nicer!
Attachment #8762547 -
Flags: review?(terrence) → review+
Assignee
Comment 16•6 years ago
Comment on attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

Requesting additional review for the JIT changes.
Attachment #8762100 -
Flags: review?(jdemooij)
Comment 17•6 years ago
Comment on attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

Review of attachment 8762100 [details] [diff] [review]:
-----------------------------------------------------------------

Cool!

::: js/src/jit/CodeGenerator.cpp
@@ +3444,5 @@
>
> +    masm.branchTest32(Assembler::NonZero, Address(cells, offset), Imm32(mask), exit);
> +
> +    // Check whether this is the sentinel set and if so call the VM to allocate
> +    // a one for this arena.

Nit: s/a one/one/ or something similar
Attachment #8762100 -
Flags: review?(jdemooij) → review+
Comment 18•6 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a5167669bd2
Make BitArray use 32 bit words to make it easier to access from JIT code r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/b1c44ce827f7
Add an extra field to the arena header r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/85911372f276
Reimplement whole cell store buffer using a bit vector associated with the arena r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/b8b6dd03d8fd
Optimise post barriers in the JIT for constant objects r=terrence r=jandem
Comment 19•6 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/79328d1a536d
Fix implicit conversion constructor error r=me on a CLOSED TREE
Comment 20•6 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/6a5167669bd2
https://hg.mozilla.org/mozilla-central/rev/b1c44ce827f7
https://hg.mozilla.org/mozilla-central/rev/85911372f276
https://hg.mozilla.org/mozilla-central/rev/b8b6dd03d8fd
https://hg.mozilla.org/mozilla-central/rev/79328d1a536d
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment 21•6 years ago
Nice gains: https://treeherder.allizom.org/perf.html#/alerts?id=2726

@Brian: misc 0.7 typedobj-splay-standard linux32 opt shell has regressed. I think it is a fluke and not related to what the benchmark is actually testing, and the linux64 opt shell has improved with the same revision. Can you confirm that I can ignore that regression?
Flags: needinfo?(bhackett1024)
Comment 23•6 years ago
(In reply to Terrence Cole [:terrence] from comment #22)
> Hannes, did you mean to ni Jon instead of Brian?

No, not at all. The testcase is one that Brian added; that's why I NI'ed him.
Flags: needinfo?(hv1989)
Comment 24•6 years ago
(In reply to Hannes Verschore [:h4writer] from comment #21)
> Nice gains:
> https://treeherder.allizom.org/perf.html#/alerts?id=2726
>
> @Brian: misc 0.7 typedobj-splay-standard linux32 opt shell has regressed. I
> think it is a fluke and not related to what the benchmark is actually
> testing. And the linux64 opt shell has improved with the same revision. Can
> you confirm I can ignore that regression.

This should be fine to ignore; I don't know why linux32 would regress and linux64 would improve.
Flags: needinfo?(bhackett1024)