Closed Bug 867001 Opened 7 years ago Closed Last year

IonMonkey: Investigate aligning loop headers

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED INACTIVE

People

(Reporter: bbouvier, Assigned: sunfish)

References

Details

Attachments

(5 files, 8 obsolete files)

When generating code in IonMonkey, align loop headers on memory natural boundaries (i.e. every 16 bits) could enhance loop performance, according to [1]. This optimization is also made by GCC, when using -O2 optimizations or higher.

For this purpose, assembly no-ops could be used. For x86 and x64, Intel's developer manual ([2]) provides noops of sizes included between 1 and 9. For sizes greater or equal to 10, series of noops could be used. However, it seems that GCC aligns on multiples of 8 instead of multiples of 16 if the space needed is higher than 8.

Several benchmarkings should be made to prove that there is a real performance enhancement using the loop alignment. I can think about three experiments:
- one without any loop header optimizations, using the code as it is today.
- one with alignments on multiples of 8, a la GCC.
- one with alignments on multiples of 16.

I will post the results of the investigation on x64 here as soon as I've finished the benchmarking.

[1] http://www.cs.virginia.edu/kim/courses/cs6620/lec/cs6620-21-intel.pdf
[2] http://download.intel.com/products/processor/manual/325462.pdf
Note that Bug 605010 (several years old) contains a patch that implements multibyte NOPs. We didn't take the patch because we had no use case for it, but this bug could be that use case.
(In reply to Benjamin Bouvier [:bbouvier] from comment #0)
> I will post the results of the investigation on x64 here as soon as I've
> finished the benchmarking.

Is there a reason you test on x64 instead of x86? If there is no specific reason, you should probably test first on x86, since that's our official supported platform. 

For linux (release build)
$ sudo apt-get install gcc-multilib ; I think this is the only dependency for cross-builing a 32bit program on x64 platform, could be more.
$ cd mozilla-inbound

// this is needed for threadsafe build
$ cd nsprpub
$ mkdir build-32
$ cd build-32
$ CC='gcc -m32' CXX='g++ -m32' AR=ar ../configure --disable-debug --enable-optimize --target=i686-pc-linux-gnu 

// building js shell
$ cd mozilla-inbound/js/src
$ autoconf2.13
$ mkdir build-32
$ cd build-32
$ CC='gcc -m32' CXX='g++ -m32' AR=ar ../configure --disable-debug --enable-optimize --target=i686-pc-linux-gnu --enable-threadsafe --with-nspr-cflags="-I/PATH_TO/mozilla-inbound/nsprpub/dist/include/nspr" --with-nspr-libs="/PATH_TO/mozilla-inbound/nsprpub/dist/lib/libnspr4.a /PATH_TO/mozilla-inbound/nsprpub/dist/lib/libplc4.a /PATH_TO/mozilla-inbound/nsprpub/dist/lib/libplds4.a"

If you are on a mac, the configure lines change slightly. In that case you should ask luke or just ask in #ionmonkey.
Oh I made a small mistake in my last configure line
/PATH_TO/mozilla-inbound/nsprpub/...
should be
/PATH_TO/mozilla-inbound/nsprpub/build-32/...
(In reply to Hannes Verschore [:h4writer] from comment #2)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #0)
> > I will post the results of the investigation on x64 here as soon as I've
> > finished the benchmarking.
> 
> Is there a reason you test on x64 instead of x86? If there is no specific
> reason, you should probably test first on x86, since that's our official
> supported platform. 
> 

There is indeed no reason, except my computer is running a x64 architecture linux kernel.
Thank you for the cross compilation commands!
This is a first version of the patch, for alignment of loop headers on 16bits boundaries.

I'm not plenty satisfied with the switch approach, it looks dirty.
Using initializer list of multibytes sequences would have been more proper (instead of the switch, it would have been a direct access to the array of sequences), but the compiler doesn't seem to appreciate initializer lists inside initializer list, unless I made a syntax error.

Also, as sstangl suggested, the blocks where code is allocated (in IonLinker.h:Linker:newCode) were aligned on 8 bits boundaries. I put up this alignment to 16.

This has been tested with asm generation and IonMonkey generation and loops header were correctly aligned on 16 bits boundaries (checked with gdb).

Microbenchmarks and Octane benchmark showed a relative 0.5% difference between code with alignment and code without alignment (the one with alignment being better). The 32-bits version (thanks h4writer and sstangl!) was used and 21 runs were executed. This enhancement doesn't look significant. I will try on other benchmarks though and also make a version with 8 bytes boundaries like GCC does.
Attachment #744005 - Flags: feedback?(luke)
I'd also be interested to see measurements for the asm.js benchmarks in bug 840284 attachment 718174 [details].  No harness for these, just run, e.g., 'time js zlib.js'.

You might also want to turn off dynamic CPU throttling:
  echo performance > /sys/devices/cpu/cpuX/cpufreq/scaling_governor
for all X in /sys/devices/cpu.
Comment on attachment 744005 [details] [diff] [review]
WIP - Align on 16 bits boundaries

Review of attachment 744005 [details] [diff] [review]:
-----------------------------------------------------------------

You are off to a great start, just a few nits:

::: js/src/assembler/assembler/X86Assembler.h
@@ +387,5 @@
> +                nop_six();
> +                nop_seven();
> +                break;
> +            case 15:
> +                nop_one(); // no break: 15 == 1 + 14

For simplicity, I'd just duplicate the case for 15 and avoid the fallthrough.

@@ +2766,5 @@
> +        if (neededPadding > 0) {
> +            if (neededPadding < 16) {
> +                // for instance, before a loop header
> +                insert_nop (neededPadding);
> +            } else {

I'd rather not overload the meaning of masm.align.  It's nice that all the inter-function padding is hlt since this effectively asserts that we don't accidentally execute them.

How about renaming masm.align to masm.haltingAlign (same behavior as before) and then adding masm.nopAlign(i) (JS_ASSERT(i<16)).

::: js/src/ion/CodeGenerator.cpp
@@ +2210,5 @@
>  
> +#if WTF_CPU_X86_64 or WTF_CPU_X86
> +        if (current->mir()->isLoopHeader()) {
> +            masm.align(16);
> +        }

SM style is that single-statement then/else bodies are not braced.  (You might want to go over https://wiki.mozilla.org/JavaScript:SpiderMonkey:Coding_Style a few times to save yourself annoying nits like these :)
Attachment #744005 - Flags: feedback?(luke) → feedback+
As masm.align() was used in some other places, I had to directly replace the call to align by a call to haltingAlign in these cases, so as to preserve the halting comportment (in case of failure). Moreover, the Assemlber-x86-shared's align() method proxies haltingAlign method, allowing to keep the previous expected behaviour.

Benchmarks are coming next. Modifying CodeAlignment to 8 (instead of 16) allows to directly have methods and loop headers to be aligned on 8 bits boundaries.
Attachment #744005 - Attachment is obsolete: true
Attachment #744367 - Flags: feedback?(luke)
Versions mean the following:
- the prefix wip means with the applied patch, while the prefix base means without the applied patch
- the suffix 16b (resp 8b) means 16 (resp 8) bits boundaries.
Octane benchmarks run 11 times for each version.

This benchmark is quite pessimistic for this optimization. It seems that all versions without the patch are faster.
Comment on attachment 744367 [details] [diff] [review]
WIP - Align on 16 bits boundaries - v2

Review of attachment 744367 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/CodeGenerator.cpp
@@ +2350,5 @@
>          LInstructionIterator iter = current->begin();
>  
> +#if WTF_CPU_X86_64 or WTF_CPU_X86
> +        if (current->mir()->isLoopHeader())
> +            masm.nopAlign(16);

I think we only care about the alignment if we can save a fetch line on small loops.  If the number of bytes needed for the loop is 33 (16 * k + 1), we have no way to save any fetch-line (if there is only one back-edge).  If this loop is 34 (16 * k + 2) bytes long, then padding with *only one* nop might save a fetch-line, otherwise this is unnecessary.

It sounds to me that there is no good eager solution to this problem.  Unless we can predict the code size ahead of time.
Comment on attachment 744367 [details] [diff] [review]
WIP - Align on 16 bits boundaries - v2

I was hoping that a simple patch like you've written would give easy gains, but it seems like not.  Thanks for doing this experiment!

nbp: Indeed, there are a lot more sophisticated code alignment optimization strategies suggested by Intel, each with examples of significant speedups.  They do seem to require a lot more sophistication than we have in our codegen, though.  In particular, the optimizations seem to require the sizes of generated code for different blocks so that the blocks can be padded or rearranged in memory.
Attachment #744367 - Flags: feedback?(luke)
Duplicate of this bug: 596421
Attached patch loop-align.patch (obsolete) — Splinter Review
Attached is a refresh of the existing patch to trunk.
Attached patch loop-align.patch (obsolete) — Splinter Review
Refreshed to fix a silly bug I introduced in the previous refresh.
Attachment #8422191 - Attachment is obsolete: true
Attached patch loop-align.patch (obsolete) — Splinter Review
Another refresh.
Attachment #8422194 - Attachment is obsolete: true
Attached patch loop-align.patch (obsolete) — Splinter Review
This refresh fixes the #if test so that it actually emits the alignment.
Attachment #8483896 - Attachment is obsolete: true
Attached patch loop-align.patch (obsolete) — Splinter Review
Rebased.
Attachment #8483900 - Attachment is obsolete: true
(In reply to Dan Gohman [:sunfish] from comment #20)
> Created attachment 8499159 [details] [diff] [review]
> loop-align.patch

Even if the actual loop alignment does not land soon, the assembler nop support looks useful, and perhaps this patch could be usefully split?
Flags: needinfo?(sunfish)
Do you have a use for them in mind? I'd be happy to split the patch, but I hesitate to land assembler functionality without having a use.
Flags: needinfo?(sunfish)
(In reply to Dan Gohman [:sunfish] from comment #22)
> Do you have a use for them in mind? I'd be happy to split the patch, but I
> hesitate to land assembler functionality without having a use.

Nothing urgent, but for example ensureOsiSpace() inserts multiple nops, see: js/src/jit/shared/CodeGenerator-shared.cpp
https://codereview.chromium.org/811713002/
Issue 811713002: [turbofan] Always align loop headers at 16-byte boundaries.
Attached patch loop-align.patch (obsolete) — Splinter Review
This patch checks in the loop alignment support code, but not the code which actually inserts alignment at loop headers. This will make it easier to experiment with enabling it.
Attachment #8499159 - Attachment is obsolete: true
Attachment #8541800 - Flags: review?(benj)
Keywords: leave-open
Comment on attachment 8541800 [details] [diff] [review]
loop-align.patch

Review of attachment 8541800 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, but are there already consequences to aligning code headers on 16B rather than 8? See comment in Assembler-x86.h

::: js/src/jit/arm/Assembler-arm.cpp
@@ +1365,5 @@
>  
>  void
> +Assembler::haltingAlign(int alignment)
> +{
> +    // TODO: Implement a proper halting align.

Please reference this bug's number here, or open a follow up

::: js/src/jit/mips/Assembler-mips.cpp
@@ +541,5 @@
>  
>  BufferOffset
> +Assembler::haltingAlign(int alignment)
> +{
> +    // TODO: Implement a proper halting align.

ditto

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +406,5 @@
> +    void insert_nop (int size)
> +    {
> +        MOZ_ASSERT(size < 16);
> +        switch (size) {
> +            case 1:

uber-style-nit: two spaces only before "case", then two spaces only before the first instruction in the case body.

switch (something) {
  case 1:
    f();
    break;
  case 2:
    g();
    break;
  //etc.

@@ +449,5 @@
> +            case 13:
> +                nop_six();
> +                nop_seven();
> +                break;
> +            case 15:

why is case 15 before case 14? (guess it's a fallout from my code where i'd nop_one() and just fall through the next case :) feel free to reorder)

@@ +458,5 @@
> +            case 14:
> +                nop_seven();
> +                nop_seven();
> +                break;
> +            default:

this should MOZ_CRASH() as we asserted that size < 16 above

::: js/src/jit/x64/Assembler-x64.h
@@ +184,5 @@
>  
>  static MOZ_CONSTEXPR_VAR Register PreBarrierReg = rdx;
>  
>  static const uint32_t ABIStackAlignment = 16;
> +static const uint32_t CodeAlignment = 16;

See comment in Assembler-x86.h

::: js/src/jit/x86/Assembler-x86.h
@@ +114,5 @@
>  static const uint32_t ABIStackAlignment = 16;
>  #else
>  static const uint32_t ABIStackAlignment = 4;
>  #endif
> +static const uint32_t CodeAlignment = 16;

Did you mean to check in this change? If so, how does it affect performance (benchmarks) / memory usage?
Attachment #8541800 - Flags: review?(benj)
Assignee: benj → sunfish
Comment on attachment 8541800 [details] [diff] [review]
loop-align.patch

Review of attachment 8541800 [details] [diff] [review]:
-----------------------------------------------------------------

Considering another patch has landed which changed code alignment to 16 already, i'm fine with this patch, so r+ with nits fixed.
Attachment #8541800 - Flags: review+
The loop alignment infrastructure is now checked in. This patch contains the change to actually make use of it.
Attachment #744367 - Attachment is obsolete: true
Attachment #8541800 - Attachment is obsolete: true
The leave-open keyword is there and there is no activity for 6 months.
:sunfish, maybe it's time to close this bug?
Flags: needinfo?(sunfish)
Yeah, I'm not actively working on this.
Status: NEW → RESOLVED
Closed: Last year
Flags: needinfo?(sunfish)
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.