Wasm baseline: AssemblerBufferWithConstantPools::allocEntry is very slow

RESOLVED FIXED in Firefox 54

Status

Status: RESOLVED FIXED
Product: Core
Component: JavaScript Engine: JIT
Priority: P1
Severity: normal

People

(Reporter: lth, Assigned: lth)

Tracking

(Blocks: 1 bug)

Version: unspecified
Target Milestone: mozilla54
Hardware: ARM
OS: Linux

Firefox Tracking Flags: firefox54 fixed

Details

Attachments

(2 attachments, 2 obsolete attachments)

(Assignee)

Description

a year ago
Created attachment 8829896 [details] [diff] [review]
putInt-optimization.patch

Profiling suggested that nearly 15% of the running time of the baseline compiler was spent in AssemblerBufferWithConstantPools::allocEntry and ABWCP::insertEntryForwards (which is called by allocEntry).  This is based on compiling AngryBots with --no-threads, and the profile includes the time to read the file from disk, so the fraction spent in these functions is actually higher.

The problem seems to be that especially allocEntry is very general, takes a ton of parameters, and uses many registers, so there's a lot of work on both sides of the call boundary.  insertEntryForwards is similarly general.

In practice, most of the calls are on behalf of putInt() putting an ALU instruction.  This is a particularly simple case, and putInt can special-case the call to allocEntry by quickly checking some preconditions and then just depositing the data.  Doing so reduces the compilation time from 2035ms to 1870ms, or about 7%.
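The fast-path idea above can be sketched as follows. This is a greatly simplified, hypothetical model, not SpiderMonkey's actual buffer: the names (SimpleAssemblerBuffer, poolPending, allocEntrySlow) and the slice layout are illustrative assumptions, and the real preconditions involve constant-pool and branch state that is elided here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the fast path: putInt() checks a few cheap
// preconditions and, when they hold, deposits the 32-bit instruction
// directly instead of calling the general allocEntry() machinery.
struct SimpleAssemblerBuffer {
    static const size_t SliceSize = 1024;  // bytes per slice (assumed)
    std::vector<uint8_t> slice;
    size_t bytelength = 0;
    bool poolPending = false;              // stands in for pool/branch state
    size_t slowPathCalls = 0;              // counts fallbacks, for illustration

    SimpleAssemblerBuffer() : slice(SliceSize) {}

    // The general, expensive path (allocEntry in the real buffer).
    size_t allocEntrySlow(uint32_t inst) {
        ++slowPathCalls;
        if (bytelength + 4 > slice.size())
            slice.resize(slice.size() + SliceSize);  // "allocate a new slice"
        size_t off = bytelength;
        std::memcpy(slice.data() + off, &inst, 4);
        bytelength += 4;
        return off;
    }

    // Fast path: no pool work pending and room in the current slice.
    size_t putInt(uint32_t inst) {
        if (!poolPending && bytelength + 4 <= slice.size()) {
            size_t off = bytelength;
            std::memcpy(slice.data() + off, &inst, 4);
            bytelength += 4;
            return off;
        }
        return allocEntrySlow(inst);
    }
};
```

The point of the sketch is that the common case costs only one branch and one store, while the rarely-taken general path keeps all of its flexibility.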

The attached patch will need to be vetted further, but this seems worth pursuing.  Even multi-threaded compilation speeds up, though not by that amount.

(It's possible to observe a speedup with Ion as well, for single-threaded compilation.)
(Assignee)

Comment 1

a year ago
Created attachment 8829897 [details]
arm-buffer-timings.txt

Raw benchmark results, for posterity.
(Assignee)

Comment 2

a year ago
Oh, there's gold in them thar hills!

By further streamlining (a little) the path to just put an instruction when there's space for it, which is the common case, the single-thread compilation time drops to 1730ms, for a total reduction (just from this modest change) of 15%.  Even then, AssemblerBuffer and assembler functions dominate the profile, and it would not be surprising if there are other fast paths that should get some attention.

(I'll need to put this aside for a couple weeks but I'll attach the WIP patch.)
(Assignee)

Comment 3

a year ago
Created attachment 8830229 [details] [diff] [review]
bug1333447-fast-paths.patch
Attachment #8829896 - Attachment is obsolete: true
I am moving this to P1. It looks important enough and we have a rough patch. Feel free to disagree with this promotion ;).
Priority: P3 → P1
(Assignee)

Comment 5

a year ago
I'll claim it, then, since I probably care the most about it and have hardware that's useful for working on it.  I'll be sure to test on some other hardware too.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
(In reply to Lars T Hansen [:lth] from comment #2)
> Oh, there's gold in them thar hills!

I do recall stepping through this path on occasion and thinking "surely, we can't be doing all this work on *every* instruction...".
(Assignee)

Comment 7

a year ago
Created attachment 8841675 [details] [diff] [review]
bug1333447-fast-paths-v2.patch

This adds a single fast path through the assembler for when we want to just put an instruction, which is nearly always.

This speeds up the compilation of AngryBots on my ARM dev board (compiling with --wasm-always-baseline --no-threads, which highlights assembly cost the most) by 17%, reducing the time from 2010ms to 1725ms and removing most assembler work from the top of the profile.  One hopes other compilation jobs will benefit as well.

The one part of the change on which I would especially like an opinion is whether the new fast-path guard in putInt() in IonAssemblerBufferWithConstantPools.h is correct.  This was lifted out of allocEntry() and then simplified, as the comment indicates.  The effect of the simplification is to delay certain error reporting, which should be fine, but I want to make sure it doesn't compromise correctness in the process.
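The "delayed error reporting" pattern can be illustrated with a toy sketch. Everything here is assumed for illustration (GuardedBuffer, fastPathOk, putSlow are not the real names, and the real slow path allocates a new slice rather than failing): the guard does only cheap comparisons, and failures that the general path would report immediately instead surface on the next slow-path call.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a simplified fast-path guard with no error
// reporting, backed by a slow path where the delayed errors surface.
struct GuardedBuffer {
    static const uint32_t SliceSize = 64;
    uint32_t used = 0;
    bool oom_ = false;

    // Simplified guard lifted from the general path: cheap checks only.
    bool fastPathOk(uint32_t nbytes) const {
        return !oom_ && used + nbytes <= SliceSize;
    }

    bool put(uint32_t nbytes) {
        if (fastPathOk(nbytes)) {
            used += nbytes;          // deposit; no error reporting here
            return true;
        }
        return putSlow(nbytes);
    }

    bool putSlow(uint32_t nbytes) {
        if (oom_)
            return false;            // the delayed error surfaces here
        if (used + nbytes > SliceSize) {
            oom_ = true;             // the real buffer would try to grow and
            return false;            // set oom only on allocation failure
        }
        used += nbytes;
        return true;
    }
};
```

The correctness argument is that every failing condition eventually routes through the slow path, so nothing is lost by not reporting it eagerly.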
Attachment #8830229 - Attachment is obsolete: true
Attachment #8841675 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8841675 [details] [diff] [review]
bug1333447-fast-paths-v2.patch

Review of attachment 8841675 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/Assembler-arm.cpp
@@ -1682,2 @@
>      spew(m_buffer.getInstOrNull(offs));
> -#endif

Can we move these function calls to the header as well?  I do not really like the idea of having the same code written twice, as there is a high risk that we will only check one copy and not the other.

If not, please add a comment reminding the next person who changes this code to update both versions of this function.

::: js/src/jit/shared/IonAssemblerBuffer.h
@@ +139,5 @@
> +    MOZ_ALWAYS_INLINE
> +    void putU32Aligned(uint32_t value) {
> +        MOZ_ASSERT(bytelength_ + 4 <= SliceSize);
> +        MOZ_ASSERT((bytelength_ & 3) == 0);
> +        *reinterpret_cast<uint32_t*>(&instructions[bytelength_]) = value;

nit: Either assert the alignment of &instructions[0] in addition to the previous checks, or assert the alignment of &instructions[bytelength_].
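The second option amounts to asserting the store address itself, which also covers a misaligned base pointer. A minimal sketch of such a check (the helper name is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the pointer is 4-byte aligned, by converting it to an
// integer and testing the low two bits; this checks the actual store
// address rather than just the offset into the buffer.
inline bool isAligned4(const void* p) {
    return (reinterpret_cast<uintptr_t>(p) & 3) == 0;
}
```

In the patch's terms this would be something like asserting isAligned4(&instructions[bytelength_]) instead of only (bytelength_ & 3) == 0.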

@@ +375,5 @@
>      // bounds of the buffer. Use |getInstOrNull()| if |off| may be unassigned.
>      Inst* getInst(BufferOffset off) {
>          const int offset = off.getOffset();
> +        // This function is hot, do not make the next line a RELEASE_ASSERT.
> +        MOZ_ASSERT(off.assigned() && offset >= 0 && (unsigned)offset < size());

existing-nit: s/(unsigned)offset/unsigned(offset)/

::: js/src/jit/shared/IonAssemblerBufferWithConstantPools.h
@@ -832,3 @@
>      BufferOffset allocEntry(size_t numInst, unsigned numPoolEntries,
> -                            uint8_t* inst, uint8_t* data, PoolEntry* pe = nullptr,
> -                            bool markAsBranch = false)

note: the markAsBranch no-op removal confused me while doing this review and would have been better in a separate patch.  Next time maybe ;)

@@ +891,5 @@
> +    // which is set to zero when nopFill_ is set.
> +    //
> +    // We assume that we don't have to check this->oom() if there is space to
> +    // insert a plain instruction; there will always come a later time when it will be
> +    // checked anyway.

This indeed sounds correct.

Looking at the places which can produce these OOMs highlights only cases where we either allocate a new slice or a pool entry.  The only case not covered here is when we register a new branch so that we can later add it to the pool; we would indeed push an instruction, but the flag would still be set, thus preventing us from committing the code to any executable page.
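The safety argument above rests on the oom flag being sticky and consulted before the code is used. A toy sketch of that invariant, with all names assumed for illustration rather than taken from the real buffer:

```cpp
#include <cassert>

// Sketch of the invariant: the fast path never checks oom(), but every
// failing allocation sets a sticky flag, and the flag is consulted once
// at commit time, so code produced after a failure is never used.
struct OomTrackingBuffer {
    bool oom_ = false;
    int instructions_ = 0;

    void putInstFast(int) { ++instructions_; }  // no oom() check here
    void failAllocation() { oom_ = true; }      // e.g. slice or pool entry
    bool oom() const { return oom_; }

    // Commit-time check: the sticky flag blocks the buffer even though
    // fast-path puts kept "succeeding" after the failure.
    bool commit() const { return !oom(); }
};
```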
Attachment #8841675 - Flags: review?(nicolas.b.pierron) → review+

Comment 10

a year ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c807c80d954a
ARM assembler: fast paths for putting simple instructions. r=nbp
(Assignee)

Comment 11

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcd5db7ac471cad3370ab332e3425c3cbac9bf15
Bug 1333447 - ARM assembler: fast paths for putting simple instructions. r=nbp

Comment 12

a year ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb9d82f04cf6
Backed out changeset c807c80d954a wrong patch landed

Comment 13

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/bcd5db7ac471
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54