Closed Bug 1329019 Opened 3 years ago Closed 2 years ago

Wasm: Too many vector grows and copies in asmMergeWith

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: dmajor, Assigned: luke)

References

Details

Attachments

(8 files, 3 obsolete files)

In an xperf profile I noticed a lot of time spent on grows and copies in Assembler::asmMergeWith and friends.

Here's a hacky patch that reserves an initial 40 megs in the masm buffer, just to demonstrate that there's room for improvement.

(This is on top of the patches for bug 1329018 in my queue)
Threads:       1   2   3   4   5   6   7   8
baseline:      621 441 347 305 292 283 275 268
patch:         604 436 327 292 269 262 252 244

The wins are not as prominent as when I last measured; maybe the tree has generally gotten better in the meantime. So we don't necessarily need to fix this urgently. Luke mentioned that for Cretonne there may be a re-working of the buffer handling in this area. That could be a good opportunity to reduce some of the layers of copying.
Priority: -- → P3
Taking.  I have some patches that pass tests, but they turned off batching so a major regression which I need to work on next.
Assignee: nobody → luke
This patch just cleans up some stuff I noticed on the real patch: some parts of LinkData::InternalLink are now unnecessary and we can also reuse Bind() and remove the custom wasm paths that were doing the same thing.
Attachment #8906162 - Flags: review?(lhansen)
Attached patch rm-asmMergeWith (WIP) (obsolete) — Splinter Review
Here's a WIP that passes wasm tests (fails some debug tests for probably simple reasons).  Overall, it feels like a major simplification and cleanup, and it's also a lot more clear to me now how this will plug into cretonne.

This patch provides a 25% reduction in overall baseline compile time when running with `taskset 0x3` and --cpu-count=2 by reducing the time to do parallel function compilation by 34%.

Unfortunately, the simplistic approach of doing all the linking at the end loses parallelism because the linking/copying is now one big sequential operation at the end; while asmMergeWith was being done in parallel. Thus, as you add more hyperthreads/cores, the speedup gets dominated by the sequential linking.

So next step is to go back to doing the linking incrementally (as CompileTasks return) which shouldn't be too bad.  But, progress.
(In reply to Luke Wagner [:luke] from comment #3)

> This patch provides a 25% reduction in overall baseline compile time when
> running with `taskset 0x3` and --cpu-count=2 by reducing the time to do
> parallel function compilation by 34%.

Oh, I *want* this!
Comment on attachment 8906162 [details] [diff] [review]
simplify-internal-link

Review of attachment 8906162 [details] [diff] [review]:
-----------------------------------------------------------------

Nice cleanup.
Attachment #8906162 - Flags: review?(lhansen) → review+
Attached patch rm-asmMergeWith (WIP 2) (obsolete) — Splinter Review
Updated patch WIP, almost done, just need to fill in ARM and there is now a dependency on the size estimation in bug 1380033.

So the patch now shows a pretty big speedup on all --cpu-count (on Tanks) with baseline:
 =1  638ms -> 547ms  (-14%)
 =2  430ms -> 314ms  (-27%)
 =4  287ms -> 196ms  (-31%)
 =8  260ms -> 153ms  (-41%)
Ion also gets better, but less pronounced.

Another fun metric is comparing the ion / baseline ratio change (on Tanks):
 =1  11.2x -> 12.6x
 =2  10.0x -> 12.2x
 =4   6.7x ->  8.3x
 =8   4.5x ->  7.7x
so the patch makes baseline look relatively better.
Attachment #8906164 - Attachment is obsolete: true
Attached patch fix-vector-resize-to-fit (obsolete) — Splinter Review
Later patches hit two omissions in podResizeToFit() (I expect wasm is the only client of this function): (1) realloc don't accept a target length of 0, (2) mReserved was not being updated on shrink.
Attachment #8908191 - Flags: review?(nfroyd)
This patch adds an overload to replaceRawBuffer() that allows providing a vector with length != capacity.
Attachment #8908193 - Flags: review?(nfroyd)
Attached patch fix-hashSplinter Review
This patch only computes the hash when debugging is enabled.  This was taking like 7% of total compilation time.
Attachment #8908194 - Flags: review?(ydelendik)
This patch splits CallSiteAndTargetVector into separate CallSiteVector and CallSiteTargetVector.  This allows CallSiteVector to go straight into MetadataTier (during parallel compilation) while keeping the CallSiteTargetVector local to the ModuleGenerator (b/c it isn't needed after linking).  Otherwise, the two were being split in the final sequential phase.  (This matters more in conjunction with the next patch.)
Attachment #8908196 - Flags: review?(lhansen)
Comment on attachment 8908191 [details] [diff] [review]
fix-vector-resize-to-fit

Review of attachment 8908191 [details] [diff] [review]:
-----------------------------------------------------------------

Could we add a standalone test for this, possibly one test for each of the problematic cases?

I am unsure about the statement "realloc doesn't accept a target length of 0".  Would it be more accurate to say that "realloc frees the target buffer if the length is 0, which the code was not expecting"?

Clearing review for my confusion below and tests.

::: mfbt/Vector.h
@@ +249,5 @@
> +      aV.mTail.mCapacity = aV.kInlineCapacity;
> +#ifdef DEBUG
> +      aV.mTail.mReserved = 0;
> +#endif
> +      return;

This hunk seems correct.

@@ +261,5 @@
>      aV.mTail.mCapacity = aV.mLength;
> +#ifdef DEBUG
> +    aV.mTail.mReserved = aV.mTail.mReserved < aV.mLength
> +                         ? aV.mTail.mReserved
> +                         : aV.mLength;

Hm, so the doc comment for mBegin says:

   * Pointer to the buffer, be it inline or heap-allocated. Only [mBegin,
   * mBegin + mLength) hold valid constructed T objects. The range [mBegin +
   * mLength, mBegin + mCapacity) holds uninitialized memory. The range
   * [mBegin + mLength, mBegin + mReserved) also holds uninitialized memory
   * previously allocated by a call to reserve().

so mReserved should be >= mLength, and this assignment does not respect that invariant in the first case, right?  (The invariant is explicitly spelled out in the documentation comment for the reserved() method.)

It's not clear to me that mReserved is everywhere updated to respect this invariant, so I am certainly willing to admit my misunderstanding of this code.
Attachment #8908191 - Flags: review?(nfroyd)
Comment on attachment 8908193 [details] [diff] [review]
enhance-replace-raw-buffer

Review of attachment 8908193 [details] [diff] [review]:
-----------------------------------------------------------------

Again, tests would be great here...but I see we don't have any tests for replaceRawBuffer in the first place, so maybe we can just choose to keep punting.

r=me; the comment below is just me wondering out loud what the right thing to do is.  Interested in your thoughts here.

::: mfbt/Vector.h
@@ +779,5 @@
>     * N.B. This call assumes that there are no uninitialized elements in the
> +   *      passed range [aP, aP + aLength). The range [aP + aLength, aP +
> +   *      aCapacity) must be allocated uninitialized memory.
> +   */
> +  void replaceRawBuffer(T* aP, size_t aLength, size_t aCapacity);

Do we think an overload is sufficient here, or should this be called something like replaceOversizedRawBuffer to clue the reader in to what's going on?
Attachment #8908193 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #12)
> Do we think an overload is sufficient here, or should this be called
> something like replaceOversizedRawBuffer to clue the reader in to what's
> going on?

That's a good question.  Honestly, I was surprised replaceRawBuffer() *didn't* take a capacity, so it seems pretty natural to me to take 3 args (viewing the other overload as sugar), but maybe that's just b/c I wrote Vector all those years ago and think in terms of [begin, length, capacity] :)
(In reply to Nathan Froyd [:froydnj] from comment #11)
> I am unsure about the statement "realloc doesn't accept a target length of
> 0".  Would it be more accurate to say that "realloc frees the target buffer
> if the length is 0, which the code was not expecting"?

I was just too-briefly restating this assert that I hit:
  https://hg.mozilla.org/mozilla-central/file/593158cd4910/js/public/Utility.h#l253

> so mReserved should be >= mLength, and this assignment does not respect that
> invariant in the first case, right?  (The invariant is explicitly spelled
> out in the documentation comment for the reserved() method.)
> 
> It's not clear to me that mReserved is everywhere updated to respect this
> invariant, so I am certainly willing to admit my misunderstanding of this
> code.

Thanks!  I hadn't noticed that invariant.  I see we do update mReserved when mLength is changed, e.g., by growBy() so that certainly does appear to be the intention.  That makes that line simpler :)
Attachment #8908191 - Attachment is obsolete: true
Attachment #8908333 - Flags: review?(nfroyd)
Comment on attachment 8908333 [details] [diff] [review]
fix-vector-resize-to-fit

Review of attachment 8908333 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you!
Attachment #8908333 - Flags: review?(nfroyd) → review+
Comment on attachment 8908194 [details] [diff] [review]
fix-hash

Review of attachment 8908194 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8908194 - Flags: review?(ydelendik) → review+
Attached patch rm-asmMergeWithSplinter Review
Ok, I ended up doing the simplistic thing for ARM which leaves some possible performance gains for the future (ARM's masm's segmented-chunk buffer cannot be so-trivially swap()ed with a Vector<uint8_t> as x86-shared's can).

I'd suggest reading WasmGenerator.h to see the new types first: the basic idea is that a CompileTask now contains N FuncCompileInput's which are compiled into a single CompiledCode struct which contains everything that used to go in a MacroAssembler.  A key optimization is that CompiledCode swaps its empty-but-reserved Vectors in *and* out of the temporary MacroAssembler.

I'll wait on bug 1380033 to land so I can replace the TODO.

Unfortunately, I wasn't able to make func-index == code-range-index in this patch b/c the FarJumpIslands inserted by patchCallSites() get in the way.  However, I realized I could have two VectorCodeRange (1 for functions, 1 for everything else) in some future patch.
Attachment #8907935 - Attachment is obsolete: true
Attachment #8908761 - Flags: review?(lhansen)
Comment on attachment 8908196 [details] [diff] [review]
split-callsite-and-target

Review of attachment 8908196 [details] [diff] [review]:
-----------------------------------------------------------------

Seems reasonable enough.
Attachment #8908196 - Flags: review?(lhansen) → review+
Blocks: 1401827
Comment on attachment 8908761 [details] [diff] [review]
rm-asmMergeWith

Review of attachment 8908761 [details] [diff] [review]:
-----------------------------------------------------------------

OK. This patch is massive so the review is a lot more "no obvious bugs" than "obviously no bugs".

::: js/src/wasm/WasmGenerator.h
@@ +201,5 @@
>      const CodeRange& funcCodeRange(uint32_t funcIndex) const;
> +    const CodeRange& trapCodeRange(Trap trap) const;
> +    const CodeRange& debugTrapCodeRange() const;
> +
> +    bool linkCallSites();

The removals of MOZ_MUST_USE here and below (many instances) seems unmotivated.

::: js/src/wasm/WasmTypes.cpp
@@ +677,5 @@
>      funcBeginToNormalEntry_(0),
>      funcBeginToTierEntry_(0),
>      kind_(kind)
>  {
> +    MOZ_ASSERT_IF(!oom::IsThreadSimulatingOOM(), begin_ <= end_);

The guard on the assertion (also many cases below) is a little strange.  If this can fail during simulated OOM (plausible) then why not during real OOM, and if so, are the effects benign?
Attachment #8908761 - Flags: review?(lhansen) → review+
(In reply to Lars T Hansen [:lth] from comment #20)
Thanks!  Sorry it was so big.  (Some of the intermediate iterations (that were slowdowns) were even more drastic.)

> > +    bool linkCallSites();
> 
> The removals of MOZ_MUST_USE here and below (many instances) seems
> unmotivated.

I'm certainly open to reverting, but my thinking here was that MOZ_MUST_USE is most useful as a tool in common utilities to ensure return values are checked and, within the private confines of ModuleGenerator, it wasn't providing much value while at the same time making all the signatures look gross and shouty and shifted over 13 columns.  WDYT?

> > +    MOZ_ASSERT_IF(!oom::IsThreadSimulatingOOM(), begin_ <= end_);
> 
> The guard on the assertion (also many cases below) is a little strange.  If
> this can fail during simulated OOM (plausible) then why not during real OOM,
> and if so, are the effects benign?

This just avoided a bunch of otherwise-unnecessary "if (masm.oom()) return false" checks right before codeRanges.emplaceBack() calls (that would hit the assert on OOM but were benign).  But I don't have a strong opinion here either; probably it is cleaner to keep asserts tight and avoid confusion for future readers, so I'll revert.
(In reply to Luke Wagner [:luke] from comment #21)
> (In reply to Lars T Hansen [:lth] from comment #20)
> Thanks!  Sorry it was so big.  (Some of the intermediate iterations (that
> were slowdowns) were even more drastic.)
> 
> > > +    bool linkCallSites();
> > 
> > The removals of MOZ_MUST_USE here and below (many instances) seems
> > unmotivated.
> 
> I'm certainly open to reverting, but my thinking here was that MOZ_MUST_USE
> is most useful as a tool in common utilities to ensure return values are
> checked and, within the private confines of ModuleGenerator, it wasn't
> providing much value while at the same time making all the signatures look
> gross and shouty and shifted over 13 columns.  WDYT?

Your call.   Myself, I like to use 'MOZ_MUST_USE bool' everywhere as a marker for 'must check the result'; maybe eventually we'll move to something Result-based here.

> > > +    MOZ_ASSERT_IF(!oom::IsThreadSimulatingOOM(), begin_ <= end_);
> > 
> > The guard on the assertion (also many cases below) is a little strange.  If
> > this can fail during simulated OOM (plausible) then why not during real OOM,
> > and if so, are the effects benign?
> 
> This just avoided a bunch of otherwise-unnecessary "if (masm.oom()) return
> false" checks right before codeRanges.emplaceBack() calls (that would hit
> the assert on OOM but were benign).  But I don't have a strong opinion here
> either; probably it is cleaner to keep asserts tight and avoid confusion for
> future readers, so I'll revert.

Hm, sounds related to the hack that I introduced earlier that defaulted the values of some of these structures so that these asserts would pass.  I did that to avoid OOM issues too.
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7b360a7ded8d
Baldr: simplify InternalLink code (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/55b6ce9eb26d
Vector::podResizeToFit should update reserved (r=froydnj)
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd601c9bcd00
Allow specifying capacity to Vector::replaceRawBuffer (r=froydnj)
https://hg.mozilla.org/integration/mozilla-inbound/rev/1f23975f1d1f
Baldr: Only compute bytecode hash in debug mode (r=yury)
https://hg.mozilla.org/integration/mozilla-inbound/rev/30f8a995d4c1
Baldr: Split CallSiteAndTarget into CallSite and CallSiteTarget (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ff0f49d6e25
Baldr: don't transport MacroAssemblers between helper and ModuleGenerator threads (r=lth)
hi, mips build failed after this patch, I have created a patch, can you help to review this? thank you.
Attachment #8913179 - Flags: review?(luke)
Comment on attachment 8913179 [details] [diff] [review]
0001-MIPS-Fix-build-error.patch

Review of attachment 8913179 [details] [diff] [review]:
-----------------------------------------------------------------

Oh right, sorry for the churn.
Attachment #8913179 - Flags: review?(luke) → review+
thank you. 

Can somebody help to checkin this patch? I can not modify this to add a checkin needed flag.
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/238ad8ca8df0
MIPS: Add some missing functions to fix build errors. r=luke
Depends on: 1406889
hi,
this patch:

https://hg.mozilla.org/mozilla-central/rev/6ff0f49d6e25
use

+AssemblerMIPSShared::appendRawCode(const uint8_t* code, size_t numBytes)
replace
-AssemblerMIPSShared::asmMergeWith(const AssemblerMIPSShared& other)

and it remove a function call:
addLongJump(BufferOffset(size() + off));

so when append a new wasm code, it would not handle the long jumps?

so after that if move the codes by call executableCopy() will failed on mips platform? because we can not update the new target address?
You need to log in before you can comment on or make changes to this bug.