IonMonkey: Use a double constant pool on x86/x64, fold loads into operands

RESOLVED FIXED in mozilla25

Status

()

defect
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: luke, Unassigned)

Tracking

(Blocks 1 bug)

unspecified
mozilla25
x86_64
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

Posted patch use rip-relative loads (WIP) (obsolete) — Splinter Review
As a first step, we could load doubles into xmm registers with a single instruction instead of what we have now
   pcmpeqw  %xmm1,%xmm1
   psllq  $0x36,%xmm1
   psrlq  $0x2,%xmm1
or
   movabsq  $0x3ff0000000000000, %r11
   movq  %r11, %xmm1
I have an old, simple WIP patch (attached) that does this for x64 using rip-relative loads, but we could do the same thing with normal loads on x86.

Better than that, we could fold this constant-double-load into the instructions that consume them (like we do with integer immediates with emitAtUses).  This involves converting a bunch of double arith functions to useRegisterOrConstant().

One subtlety, though, is that, unlike with integer literals, several instructions using the same double constant, we probably want to load the constant into a register so it can be reused.  Ideally, we'd integrate this with regalloc so that we don't unnecessarily extend live ranges.  Jan, how hard would it be to get the regalloc to do this?  This seems related to what we do with STACK_SLOTs.
Summary: IonMonkey: Use a constant pool on x86/x64, fold loads into operands → IonMonkey: Use a double constant pool on x86/x64, fold loads into operands
Depends on: 885175
This takes the NonassertingLabel part of the original patch, updates it, and replaces the HeapLabel class, which had been introduced in the meantime to serve the same purpose.
Attachment #768943 - Flags: review?(luke)
This patch updates the original patch to implement rip-relative constant-pool loads for double constants.

It also makes a few minor cleanups for x86 for consistency with x64.
Attachment #754034 - Attachment is obsolete: true
Attachment #768945 - Flags: review?(luke)
Attachment #768943 - Flags: review?(luke) → review+
Comment on attachment 768945 [details] [diff] [review]
the main patch

Review of attachment 768945 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but probalby sstangl or jan or hannes should review.

::: js/src/ion/x64/MacroAssembler-x64.cpp
@@ +27,5 @@
> +            return;
> +    }
> +    size_t doubleIndex;
> +    DoubleMap::AddPtr p = doubleMap_.lookupForAdd(d);
> +    if (p) {

You can write if (DoubleMap::AddPtr p = ...) (and p will be in scope for the 'else').
Attachment #768945 - Flags: review?(luke) → review?(sstangl)
Comment on attachment 768945 [details] [diff] [review]
the main patch

Review of attachment 768945 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Could we measure effect on benchmarks before landing?

::: js/src/ion/x64/MacroAssembler-x64.cpp
@@ +52,5 @@
> +
> +void
> +MacroAssemblerX64::finish()
> +{
> +    if (!doubles_.empty())

It would be useful here to JS_STATIC_ASSERT that CodeAlignment >= sizeof(double).

::: js/src/ion/x64/MacroAssembler-x64.h
@@ +49,5 @@
> +        double value;
> +        NonAssertingLabel uses;
> +        Double(double value) : value(value) {}
> +    };
> +    Vector<Double, 0, SystemAllocPolicy> doubles_;

SystemAllocPolicy allocates into the heap, while IonAllocPolicy allocates into a (logically) thread-local temporary buffer that persists until the end of compilation. Can we use IonAllocPolicy here and below?

::: js/src/ion/x86/MacroAssembler-x86.cpp
@@ +17,5 @@
>  
>  void
>  MacroAssemblerX86::loadConstantDouble(double d, const FloatRegister &dest)
>  {
> +    if (maybeInlineDouble(mozilla::BitwiseCast<uint64_t>(d), dest))

It may be nicer for maybeInlineDouble to take a double, performing the BitwiseCast internally.
Attachment #768945 - Flags: review?(sstangl) → review+
I don't see any change on the octane benchmark, but I do see speedups on microbenchmarks like the one in bug 876057.
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/91851a7f561a for rather a lot of asm.js test failures.
It appears IonAllocPolicy's heap doesn't persist until the end of compilation; asm.js frees it after each function. Is it ok to switch back to SystemAllocPolicy for the constant pool, or would you like me to look for a different approach?
(In reply to Dan Gohman from comment #8)
> It appears IonAllocPolicy's heap doesn't persist until the end of
> compilation; asm.js frees it after each function. Is it ok to switch back to
> SystemAllocPolicy for the constant pool, or would you like me to look for a
> different approach?

AsmJS doesn't destruct the MacroAssembler after each function; instead, it uses resetForNewCodeGenerator(). I'd prefer to keep the memory in the temporary buffers if it's not a hassle.

Would it be reasonable to make constant pool emission an explicit step during linking? Then we could include constant pool emission and label binding as a step in CodeGenerator::generateAsmJS(), used via GenerateAsmJSCode().
AsmJS does push a LifoAllocScope around each function compilation though, which looks like it causes all the IonAllocPolicy memory to be reused.

How about ContextAllocPolicy?
(In reply to Dan Gohman from comment #10)
> AsmJS does push a LifoAllocScope around each function compilation though,
> which looks like it causes all the IonAllocPolicy memory to be reused.

Each function would have to have its own constant pool.

> How about ContextAllocPolicy?

SystemAllocPolicy would be OK. A number of other compilation objects use it for storage, and it's not the end of the world.
D'oh!  Sorry I didn't see this coming; I think my original patch predates the single-macroassembler change to Odin.

Could the constant double pool be moved into the CodeGenerator?
Luke: The CodeGenerator uses SystemAllocPolicy too :).

I discussed with sstangl on IRC and he agreed with SystemAllocPolicy, at least for now. Of course, anyone with a better idea is free to refactor it.

https://hg.mozilla.org/integration/mozilla-inbound/rev/65b5396b4401
https://hg.mozilla.org/integration/mozilla-inbound/rev/ccb0796f0869
There was a net regression in kraken on AWFY, but the details are ambiguous. For example, kraken-desaturate got 8.5% better on 64-bit MacOS, but got 5% worse on 32-bit MacOS, while kraken-fft got 3% worse on 64-bit MacOS but 1% better on 32-bit MacOS.

I examined the assembly code for kraken-desaturate and kraken-stanford-crypto-aes and confirmed that the patch implements exactly the intended changes. Constant pools are the textbook way to do floating-point constants on x86, and they are significantly faster than what ion was doing before in some cases, so I'm assuming the slowdowns aren't a significant problem.

These kraken loops are fairly small and really want their floating-point constants hoisted in any case (bug 881390).
kraken-desaturate:
64bit: 8.5% better, 32bit: 5% worse

I think this is pretty normal, since the pool is only implemented in 64bit and you removed some inline code for loadConstant. So I can understand this would be a 32bit regression.

kraken-fft:
64bit: 3% worse, 32bit: 1% better

This is indeed strange.
https://hg.mozilla.org/mozilla-central/rev/65b5396b4401
https://hg.mozilla.org/mozilla-central/rev/ccb0796f0869
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
You need to log in before you can comment on or make changes to this bug.