Wasm baseline: AssemblerBufferWithConstantPools::allocEntry is very slow

RESOLVED FIXED in Firefox 54

Status

Status: RESOLVED FIXED
Product: Core
Component: JavaScript Engine: JIT
Priority: P1
Severity: normal

People

(Reporter: lth, Assigned: lth)

Tracking

(Blocks: 1 bug)

Version: unspecified
Target Milestone: mozilla54
Hardware: ARM
OS: Linux

Firefox Tracking Flags: firefox54 fixed

Details

Attachments

(2 attachments, 2 obsolete attachments)

(Assignee)

Description

a year ago
Created attachment 8829896 [details] [diff] [review]
putInt-optimization.patch

Profiling suggested that nearly 15% of the running time of the baseline compiler was spent in AssemblerBufferWithConstantPools::allocEntry and ABWCP::insertEntryForwards (which is called by allocEntry).  This is based on compiling AngryBots with --no-threads, and the profile includes the time to read the file from disk, so the fraction spent in these functions is actually higher.

The problem seems to be that especially allocEntry is very general, takes a ton of parameters, and uses many registers, so there's a lot of work on both sides of the call boundary.  insertEntryForwards is similarly general.

In practice, most of the calls are on behalf of putInt() putting an ALU instruction.  This is a particularly simple case, and putInt can special-case the call to allocEntry by quickly checking some preconditions and then just depositing the data.  Doing so reduces the compilation time from 2035ms to 1870ms, or about 7%.
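The fast-path idea above can be sketched as follows. This is a greatly simplified, hypothetical model, not SpiderMonkey's actual buffer: the names (SimpleAssemblerBuffer, poolPending, allocEntrySlow) and the slice layout are illustrative assumptions, and the real preconditions involve constant-pool and branch state that is elided here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the fast path: putInt() checks a few cheap
// preconditions and, when they hold, deposits the 32-bit instruction
// directly instead of calling the general allocEntry() machinery.
struct SimpleAssemblerBuffer {
    static const size_t SliceSize = 1024;  // bytes per slice (assumed)
    std::vector<uint8_t> slice;
    size_t bytelength = 0;
    bool poolPending = false;              // stands in for pool/branch state
    size_t slowPathCalls = 0;              // counts fallbacks, for illustration

    SimpleAssemblerBuffer() : slice(SliceSize) {}

    // The general, expensive path (allocEntry in the real buffer).
    size_t allocEntrySlow(uint32_t inst) {
        ++slowPathCalls;
        if (bytelength + 4 > slice.size())
            slice.resize(slice.size() + SliceSize);  // "allocate a new slice"
        size_t off = bytelength;
        std::memcpy(slice.data() + off, &inst, 4);
        bytelength += 4;
        return off;
    }

    // Fast path: no pool work pending and room in the current slice.
    size_t putInt(uint32_t inst) {
        if (!poolPending && bytelength + 4 <= slice.size()) {
            size_t off = bytelength;
            std::memcpy(slice.data() + off, &inst, 4);
            bytelength += 4;
            return off;
        }
        return allocEntrySlow(inst);
    }
};
```

The point of the sketch is that the common case costs only one branch and one store, while the rarely-taken general path keeps all of its flexibility.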

The attached patch will need to be vetted further, but this seems worth pursuing.  Even multi-threaded compilation speeds up, though not by that amount.

(It's possible to observe a speedup with Ion as well, for single-threaded compilation.)
(Assignee)

Comment 1

a year ago
Created attachment 8829897 [details]
arm-buffer-timings.txt

Raw benchmark results, for posterity.
(Assignee)

Comment 2

a year ago
Oh, there's gold in them thar hills!

By further streamlining (a little) the path to just put an instruction when there's space for it, which is the common case, the single-thread compilation time drops to 1730ms, for a total reduction (just from this modest change) of 15%.  Even then, AssemblerBuffer and assembler functions dominate the profile, and it would not be surprising if there are other fast paths that should get some attention.

(I'll need to put this aside for a couple weeks but I'll attach the WIP patch.)
(Assignee)

Comment 3

a year ago
Created attachment 8830229 [details] [diff] [review]
bug1333447-fast-paths.patch
Attachment #8829896 - Attachment is obsolete: true
I am moving this to P1. It looks important enough and we have a rough patch. Feel free to disagree with this promotion ;).
Priority: P3 → P1
(Assignee)

Comment 5

a year ago
I'll claim it, then, since I probably care the most about it and have hardware that's useful for working on it.  I'll be sure to test on some other hardware too.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
(In reply to Lars T Hansen [:lth] from comment #2)
> Oh, there's gold in them thar hills!

I do recall stepping through this path on occasion and thinking "surely, we can't be doing all this work on *every* instruction...".
(Assignee)

Comment 7

a year ago
Created attachment 8841675 [details] [diff] [review]
bug1333447-fast-paths-v2.patch

This adds a single fast path through the assembler for when we want to just put an instruction, which is nearly always.

This speeds up the compilation of AngryBots on my ARM dev board (compiling with --wasm-always-baseline --no-threads, which highlights assembly cost the most) by 17%, reducing the time from 2010ms to 1725ms and removing most assembler work from the top of the profile.  One hopes other compilation jobs will benefit as well.

The one part of the change on which I would especially like an opinion is whether the new fast-path guard in putInt() in IonAssemblerBufferWithConstantPools.h is correct.  This was lifted out of allocEntry() and then simplified, as the comment indicates.  The effect of the simplification is to delay certain error reporting, which should be fine, but I want to make sure it doesn't compromise correctness in the process.
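The "delayed error reporting" pattern can be illustrated with a toy sketch. Everything here is assumed for illustration (GuardedBuffer, fastPathOk, putSlow are not the real names, and the real slow path allocates a new slice rather than failing): the guard does only cheap comparisons, and failures that the general path would report immediately instead surface on the next slow-path call.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a simplified fast-path guard with no error
// reporting, backed by a slow path where the delayed errors surface.
struct GuardedBuffer {
    static const uint32_t SliceSize = 64;
    uint32_t used = 0;
    bool oom_ = false;

    // Simplified guard lifted from the general path: cheap checks only.
    bool fastPathOk(uint32_t nbytes) const {
        return !oom_ && used + nbytes <= SliceSize;
    }

    bool put(uint32_t nbytes) {
        if (fastPathOk(nbytes)) {
            used += nbytes;          // deposit; no error reporting here
            return true;
        }
        return putSlow(nbytes);
    }

    bool putSlow(uint32_t nbytes) {
        if (oom_)
            return false;            // the delayed error surfaces here
        if (used + nbytes > SliceSize) {
            oom_ = true;             // the real buffer would try to grow and
            return false;            // set oom only on allocation failure
        }
        used += nbytes;
        return true;
    }
};
```

The correctness argument is that every failing condition eventually routes through the slow path, so nothing is lost by not reporting it eagerly.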
Attachment #8830229 - Attachment is obsolete: true
Attachment #8841675 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8841675 [details] [diff] [review]
bug1333447-fast-paths-v2.patch

Review of attachment 8841675 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/Assembler-arm.cpp
@@ -1682,2 @@
>      spew(m_buffer.getInstOrNull(offs));
> -#endif

Can we move these function calls to the header as well?  I do not really like the idea of having the same code written twice, as there is a high risk that we will only check one copy and not the other.

If not, please add a comment reminding the next person who changes this code to update both versions of this function.

::: js/src/jit/shared/IonAssemblerBuffer.h
@@ +139,5 @@
> +    MOZ_ALWAYS_INLINE
> +    void putU32Aligned(uint32_t value) {
> +        MOZ_ASSERT(bytelength_ + 4 <= SliceSize);
> +        MOZ_ASSERT((bytelength_ & 3) == 0);
> +        *reinterpret_cast<uint32_t*>(&instructions[bytelength_]) = value;

nit: Either assert the alignment of &instructions[0] in addition to the previous checks, or assert the alignment of &instructions[bytelength_].
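The second option amounts to asserting the store address itself, which also covers a misaligned base pointer. A minimal sketch of such a check (the helper name is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the pointer is 4-byte aligned, by converting it to an
// integer and testing the low two bits; this checks the actual store
// address rather than just the offset into the buffer.
inline bool isAligned4(const void* p) {
    return (reinterpret_cast<uintptr_t>(p) & 3) == 0;
}
```

In the patch's terms this would be something like asserting isAligned4(&instructions[bytelength_]) instead of only (bytelength_ & 3) == 0.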

@@ +375,5 @@
>      // bounds of the buffer. Use |getInstOrNull()| if |off| may be unassigned.
>      Inst* getInst(BufferOffset off) {
>          const int offset = off.getOffset();
> +        // This function is hot, do not make the next line a RELEASE_ASSERT.
> +        MOZ_ASSERT(off.assigned() && offset >= 0 && (unsigned)offset < size());

existing-nit: s/(unsigned)offset/unsigned(offset)/

::: js/src/jit/shared/IonAssemblerBufferWithConstantPools.h
@@ -832,3 @@
>      BufferOffset allocEntry(size_t numInst, unsigned numPoolEntries,
> -                            uint8_t* inst, uint8_t* data, PoolEntry* pe = nullptr,
> -                            bool markAsBranch = false)

note: the markAsBranch no-op removal confused me while doing this review and would have been better in a separate patch.  Next time maybe ;)

@@ +891,5 @@
> +    // which is set to zero when nopFill_ is set.
> +    //
> +    // We assume that we don't have to check this->oom() if there is space to
> +    // insert a plain instruction; there will always come a later time when it will be
> +    // checked anyway.

This indeed sounds correct.

Looking at the places which can produce these OOMs highlights only cases where we either allocate a new slice or a pool entry.  The only case not covered here is when we register a new branch so that we can later add it to the pool; we would indeed push an instruction, but the flag would still be set, thus preventing us from committing the code to any executable page.
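The safety argument above rests on the oom flag being sticky and consulted before the code is used. A toy sketch of that invariant, with all names assumed for illustration rather than taken from the real buffer:

```cpp
#include <cassert>

// Sketch of the invariant: the fast path never checks oom(), but every
// failing allocation sets a sticky flag, and the flag is consulted once
// at commit time, so code produced after a failure is never used.
struct OomTrackingBuffer {
    bool oom_ = false;
    int instructions_ = 0;

    void putInstFast(int) { ++instructions_; }  // no oom() check here
    void failAllocation() { oom_ = true; }      // e.g. slice or pool entry
    bool oom() const { return oom_; }

    // Commit-time check: the sticky flag blocks the buffer even though
    // fast-path puts kept "succeeding" after the failure.
    bool commit() const { return !oom(); }
};
```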
Attachment #8841675 - Flags: review?(nicolas.b.pierron) → review+

Comment 10

a year ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c807c80d954a
ARM assembler: fast paths for putting simple instructions. r=nbp
(Assignee)

Comment 11

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcd5db7ac471cad3370ab332e3425c3cbac9bf15
Bug 1333447 - ARM assembler: fast paths for putting simple instructions. r=nbp

Comment 12

a year ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb9d82f04cf6
Backed out changeset c807c80d954a wrong patch landed

Comment 13

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/bcd5db7ac471
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54