Closed Bug 867001 Opened 11 years ago Closed 6 years ago

IonMonkey: Investigate aligning loop headers

Tracking

()

Status:

RESOLVED INACTIVE

People

(Reporter: bbouvier, Assigned: sunfish)

References

Details

Attachments

(5 files, 8 obsolete files)

WIP - Align on 16 bits boundaries 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 9.30 KB, patch	luke : feedback+	Details \| Diff \| Splinter Review
WIP - Align on 16 bits boundaries - v2 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 12.90 KB, patch		Details \| Diff \| Splinter Review
SunSpider benchmarks results for 8 bits alignment 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 4.30 KB, text/plain		Details
SunSpider benchmarks results for 16 bits alignment 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 3.75 KB, text/plain		Details
asm.js benchmarks (11 runs) for all versions 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 2.28 KB, text/plain		Details
Octane benchmarks (11 runs) for all versions 11 years ago Benjamin Bouvier [:bbouvier] (inactive) 236 bytes, text/plain		Details
loop-align.patch 10 years ago Dan Gohman [:sunfish] 13.71 KB, patch		Details \| Diff \| Splinter Review
loop-align.patch 10 years ago Dan Gohman [:sunfish] 13.75 KB, patch		Details \| Diff \| Splinter Review
loop-align.patch 10 years ago Dan Gohman [:sunfish] 15.11 KB, patch		Details \| Diff \| Splinter Review
loop-align.patch 10 years ago Dan Gohman [:sunfish] 15.13 KB, patch		Details \| Diff \| Splinter Review
loop-align.patch 10 years ago Dan Gohman [:sunfish] 15.14 KB, patch		Details \| Diff \| Splinter Review
loop-align.patch 10 years ago Dan Gohman [:sunfish] 26.61 KB, patch	bbouvier : review+	Details \| Diff \| Splinter Review
use-loop-alignment.patch 9 years ago Dan Gohman [:sunfish] 1.17 KB, patch		Details \| Diff \| Splinter Review

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Description

•

11 years ago

When generating code in IonMonkey, align loop headers on memory natural boundaries (i.e. every 16 bits) could enhance loop performance, according to [1]. This optimization is also made by GCC, when using -O2 optimizations or higher.

For this purpose, assembly no-ops could be used. For x86 and x64, Intel's developer manual ([2]) provides noops of sizes included between 1 and 9. For sizes greater or equal to 10, series of noops could be used. However, it seems that GCC aligns on multiples of 8 instead of multiples of 16 if the space needed is higher than 8.

Several benchmarkings should be made to prove that there is a real performance enhancement using the loop alignment. I can think about three experiments:
- one without any loop header optimizations, using the code as it is today.
- one with alignments on multiples of 8, a la GCC.
- one with alignments on multiples of 16.

I will post the results of the investigation on x64 here as soon as I've finished the benchmarking.

[1] http://www.cs.virginia.edu/kim/courses/cs6620/lec/cs6620-21-intel.pdf
[2] http://download.intel.com/products/processor/manual/325462.pdf

Sean Stangl [:sstangl]

Comment 1

•

11 years ago

Note that Bug 605010 (several years old) contains a patch that implements multibyte NOPs. We didn't take the patch because we had no use case for it, but this bug could be that use case.

Hannes Verschore [:h4writer]

Comment 2

•

11 years ago

(In reply to Benjamin Bouvier [:bbouvier] from comment #0)
> I will post the results of the investigation on x64 here as soon as I've
> finished the benchmarking.

Is there a reason you test on x64 instead of x86? If there is no specific reason, you should probably test first on x86, since that's our official supported platform. 

For linux (release build)
$ sudo apt-get install gcc-multilib ; I think this is the only dependency for cross-builing a 32bit program on x64 platform, could be more.
$ cd mozilla-inbound

// this is needed for threadsafe build
$ cd nsprpub
$ mkdir build-32
$ cd build-32
$ CC='gcc -m32' CXX='g++ -m32' AR=ar ../configure --disable-debug --enable-optimize --target=i686-pc-linux-gnu 

// building js shell
$ cd mozilla-inbound/js/src
$ autoconf2.13
$ mkdir build-32
$ cd build-32
$ CC='gcc -m32' CXX='g++ -m32' AR=ar ../configure --disable-debug --enable-optimize --target=i686-pc-linux-gnu --enable-threadsafe --with-nspr-cflags="-I/PATH_TO/mozilla-inbound/nsprpub/dist/include/nspr" --with-nspr-libs="/PATH_TO/mozilla-inbound/nsprpub/dist/lib/libnspr4.a /PATH_TO/mozilla-inbound/nsprpub/dist/lib/libplc4.a /PATH_TO/mozilla-inbound/nsprpub/dist/lib/libplds4.a"

If you are on a mac, the configure lines change slightly. In that case you should ask luke or just ask in #ionmonkey.

Hannes Verschore [:h4writer]

Comment 3

•

11 years ago

Oh I made a small mistake in my last configure line
/PATH_TO/mozilla-inbound/nsprpub/...
should be
/PATH_TO/mozilla-inbound/nsprpub/build-32/...

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 4

•

11 years ago

(In reply to Hannes Verschore [:h4writer] from comment #2)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #0)
> > I will post the results of the investigation on x64 here as soon as I've
> > finished the benchmarking.
> 
> Is there a reason you test on x64 instead of x86? If there is no specific
> reason, you should probably test first on x86, since that's our official
> supported platform. 
> 

There is indeed no reason, except my computer is running a x64 architecture linux kernel.
Thank you for the cross compilation commands!

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 5

•

11 years ago

Attached patch WIP - Align on 16 bits boundaries (obsolete) — Details — Splinter Review

This is a first version of the patch, for alignment of loop headers on 16bits boundaries.

I'm not plenty satisfied with the switch approach, it looks dirty.
Using initializer list of multibytes sequences would have been more proper (instead of the switch, it would have been a direct access to the array of sequences), but the compiler doesn't seem to appreciate initializer lists inside initializer list, unless I made a syntax error.

Also, as sstangl suggested, the blocks where code is allocated (in IonLinker.h:Linker:newCode) were aligned on 8 bits boundaries. I put up this alignment to 16.

This has been tested with asm generation and IonMonkey generation and loops header were correctly aligned on 16 bits boundaries (checked with gdb).

Microbenchmarks and Octane benchmark showed a relative 0.5% difference between code with alignment and code without alignment (the one with alignment being better). The 32-bits version (thanks h4writer and sstangl!) was used and 21 runs were executed. This enhancement doesn't look significant. I will try on other benchmarks though and also make a version with 8 bytes boundaries like GCC does.

Attachment #744005 - Flags: feedback?(luke)

Luke Wagner [:luke]

Comment 6

•

11 years ago

I'd also be interested to see measurements for the asm.js benchmarks in bug 840284 attachment 718174 [details].  No harness for these, just run, e.g., 'time js zlib.js'.

You might also want to turn off dynamic CPU throttling:
  echo performance > /sys/devices/cpu/cpuX/cpufreq/scaling_governor
for all X in /sys/devices/cpu.

Luke Wagner [:luke]

Comment 7

•

11 years ago

Comment on attachment 744005 [details] [diff] [review]
WIP - Align on 16 bits boundaries

Review of attachment 744005 [details] [diff] [review]:
-----------------------------------------------------------------

You are off to a great start, just a few nits:

::: js/src/assembler/assembler/X86Assembler.h
@@ +387,5 @@
> +                nop_six();
> +                nop_seven();
> +                break;
> +            case 15:
> +                nop_one(); // no break: 15 == 1 + 14

For simplicity, I'd just duplicate the case for 15 and avoid the fallthrough.

@@ +2766,5 @@
> +        if (neededPadding > 0) {
> +            if (neededPadding < 16) {
> +                // for instance, before a loop header
> +                insert_nop (neededPadding);
> +            } else {

I'd rather not overload the meaning of masm.align.  It's nice that all the inter-function padding is hlt since this effectively asserts that we don't accidentally execute them.

How about renaming masm.align to masm.haltingAlign (same behavior as before) and then adding masm.nopAlign(i) (JS_ASSERT(i<16)).

::: js/src/ion/CodeGenerator.cpp
@@ +2210,5 @@
>  
> +#if WTF_CPU_X86_64 or WTF_CPU_X86
> +        if (current->mir()->isLoopHeader()) {
> +            masm.align(16);
> +        }

SM style is that single-statement then/else bodies are not braced.  (You might want to go over https://wiki.mozilla.org/JavaScript:SpiderMonkey:Coding_Style a few times to save yourself annoying nits like these :)

Attachment #744005 - Flags: feedback?(luke) → feedback+

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 8

•

11 years ago

Attached patch WIP - Align on 16 bits boundaries - v2 (obsolete) — Details — Splinter Review

As masm.align() was used in some other places, I had to directly replace the call to align by a call to haltingAlign in these cases, so as to preserve the halting comportment (in case of failure). Moreover, the Assemlber-x86-shared's align() method proxies haltingAlign method, allowing to keep the previous expected behaviour.

Benchmarks are coming next. Modifying CodeAlignment to 8 (instead of 16) allows to directly have methods and loop headers to be aligned on 8 bits boundaries.

Attachment #744005 - Attachment is obsolete: true

Attachment #744367 - Flags: feedback?(luke)

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 9

•

11 years ago

Attached file SunSpider benchmarks results for 8 bits alignment — Details

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 10

•

11 years ago

Attached file SunSpider benchmarks results for 16 bits alignment — Details

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 11

•

11 years ago

Attached file asm.js benchmarks (11 runs) for all versions — Details

Versions mean the following:
- the prefix wip means with the applied patch, while the prefix base means without the applied patch
- the suffix 16b (resp 8b) means 16 (resp 8) bits boundaries.

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 12

•

11 years ago

Attached file Octane benchmarks (11 runs) for all versions — Details

Octane benchmarks run 11 times for each version.

This benchmark is quite pessimistic for this optimization. It seems that all versions without the patch are faster.

Nicolas B. Pierron [:nbp]

Comment 13

•

11 years ago

Comment on attachment 744367 [details] [diff] [review]
WIP - Align on 16 bits boundaries - v2

Review of attachment 744367 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/CodeGenerator.cpp
@@ +2350,5 @@
>          LInstructionIterator iter = current->begin();
>  
> +#if WTF_CPU_X86_64 or WTF_CPU_X86
> +        if (current->mir()->isLoopHeader())
> +            masm.nopAlign(16);

I think we only care about the alignment if we can save a fetch line on small loops.  If the number of bytes needed for the loop is 33 (16 * k + 1), we have no way to save any fetch-line (if there is only one back-edge).  If this loop is 34 (16 * k + 2) bytes long, then padding with *only one* nop might save a fetch-line, otherwise this is unnecessary.

It sounds to me that there is no good eager solution to this problem.  Unless we can predict the code size ahead of time.

Luke Wagner [:luke]

Comment 14

•

11 years ago

Comment on attachment 744367 [details] [diff] [review]
WIP - Align on 16 bits boundaries - v2

I was hoping that a simple patch like you've written would give easy gains, but it seems like not.  Thanks for doing this experiment!

nbp: Indeed, there are a lot more sophisticated code alignment optimization strategies suggested by Intel, each with examples of significant speedups.  They do seem to require a lot more sophistication than we have in our codegen, though.  In particular, the optimizations seem to require the sizes of generated code for different blocks so that the blocks can be padded or rearranged in memory.

Attachment #744367 - Flags: feedback?(luke)

Dan Gohman [:sunfish]

Assignee

Comment 16

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

Attached is a refresh of the existing patch to trunk.

Dan Gohman [:sunfish]

Assignee

Comment 17

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

Refreshed to fix a silly bug I introduced in the previous refresh.

Attachment #8422191 - Attachment is obsolete: true

Dan Gohman [:sunfish]

Assignee

Comment 18

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

Another refresh.

Attachment #8422194 - Attachment is obsolete: true

Dan Gohman [:sunfish]

Assignee

Comment 19

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

This refresh fixes the #if test so that it actually emits the alignment.

Attachment #8483896 - Attachment is obsolete: true

Dan Gohman [:sunfish]

Assignee

Comment 20

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

Rebased.

Attachment #8483900 - Attachment is obsolete: true

Douglas Crosher [:dougc]

Comment 21

•

10 years ago

(In reply to Dan Gohman [:sunfish] from comment #20)
> Created attachment 8499159 [details] [diff] [review]
> loop-align.patch

Even if the actual loop alignment does not land soon, the assembler nop support looks useful, and perhaps this patch could be usefully split?

Flags: needinfo?(sunfish)

Dan Gohman [:sunfish]

Assignee

Comment 22

•

10 years ago

Do you have a use for them in mind? I'd be happy to split the patch, but I hesitate to land assembler functionality without having a use.

Flags: needinfo?(sunfish)

Douglas Crosher [:dougc]

Comment 23

•

10 years ago

(In reply to Dan Gohman [:sunfish] from comment #22)
> Do you have a use for them in mind? I'd be happy to split the patch, but I
> hesitate to land assembler functionality without having a use.

Nothing urgent, but for example ensureOsiSpace() inserts multiple nops, see: js/src/jit/shared/CodeGenerator-shared.cpp

Douglas Crosher [:dougc]

Comment 24

•

10 years ago

https://codereview.chromium.org/811713002/
Issue 811713002: [turbofan] Always align loop headers at 16-byte boundaries.

Dan Gohman [:sunfish]

Assignee

Comment 25

•

10 years ago

Attached patch loop-align.patch (obsolete) — Details — Splinter Review

This patch checks in the loop alignment support code, but not the code which actually inserts alignment at loop headers. This will make it easier to experiment with enabling it.

Attachment #8499159 - Attachment is obsolete: true

Attachment #8541800 - Flags: review?(benj)

Dan Gohman [:sunfish]

Assignee

Updated

•

10 years ago

Keywords: leave-open

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 26

•

9 years ago

Comment on attachment 8541800 [details] [diff] [review]
loop-align.patch

Review of attachment 8541800 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, but are there already consequences to aligning code headers on 16B rather than 8? See comment in Assembler-x86.h

::: js/src/jit/arm/Assembler-arm.cpp
@@ +1365,5 @@
>  
>  void
> +Assembler::haltingAlign(int alignment)
> +{
> +    // TODO: Implement a proper halting align.

Please reference this bug's number here, or open a follow up

::: js/src/jit/mips/Assembler-mips.cpp
@@ +541,5 @@
>  
>  BufferOffset
> +Assembler::haltingAlign(int alignment)
> +{
> +    // TODO: Implement a proper halting align.

ditto

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +406,5 @@
> +    void insert_nop (int size)
> +    {
> +        MOZ_ASSERT(size < 16);
> +        switch (size) {
> +            case 1:

uber-style-nit: two spaces only before "case", then two spaces only before the first instruction in the case body.

switch (something) {
  case 1:
    f();
    break;
  case 2:
    g();
    break;
  //etc.

@@ +449,5 @@
> +            case 13:
> +                nop_six();
> +                nop_seven();
> +                break;
> +            case 15:

why is case 15 before case 14? (guess it's a fallout from my code where i'd nop_one() and just fall through the next case :) feel free to reorder)

@@ +458,5 @@
> +            case 14:
> +                nop_seven();
> +                nop_seven();
> +                break;
> +            default:

this should MOZ_CRASH() as we asserted that size < 16 above

::: js/src/jit/x64/Assembler-x64.h
@@ +184,5 @@
>  
>  static MOZ_CONSTEXPR_VAR Register PreBarrierReg = rdx;
>  
>  static const uint32_t ABIStackAlignment = 16;
> +static const uint32_t CodeAlignment = 16;

See comment in Assembler-x86.h

::: js/src/jit/x86/Assembler-x86.h
@@ +114,5 @@
>  static const uint32_t ABIStackAlignment = 16;
>  #else
>  static const uint32_t ABIStackAlignment = 4;
>  #endif
> +static const uint32_t CodeAlignment = 16;

Did you mean to check in this change? If so, how does it affect performance (benchmarks) / memory usage?

Attachment #8541800 - Flags: review?(benj)

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Updated

•

9 years ago

Assignee: benj → sunfish

Benjamin Bouvier [:bbouvier] (inactive)

Reporter

Comment 27

•

9 years ago

Comment on attachment 8541800 [details] [diff] [review]
loop-align.patch

Review of attachment 8541800 [details] [diff] [review]:
-----------------------------------------------------------------

Considering another patch has landed which changed code alignment to 16 already, i'm fine with this patch, so r+ with nits fixed.

Attachment #8541800 - Flags: review+

Dan Gohman [:sunfish]

Assignee

Comment 28

•

9 years ago

Attached patch use-loop-alignment.patch — Details — Splinter Review

The loop alignment infrastructure is now checked in. This patch contains the change to actually make use of it.

Attachment #744367 - Attachment is obsolete: true

Attachment #8541800 - Attachment is obsolete: true

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 29

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/23c991c5434a

BugBot [:suhaib / :marco/ :calixte]

Comment 30

•

6 years ago

The leave-open keyword is there and there is no activity for 6 months.
:sunfish, maybe it's time to close this bug?

Flags: needinfo?(sunfish)

Dan Gohman [:sunfish]

Assignee

Comment 31

•

6 years ago

Yeah, I'm not actively working on this.

Status: NEW → RESOLVED

Closed: 6 years ago

Flags: needinfo?(sunfish)

Resolution: --- → INACTIVE

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

Keywords: leave-open

You need to log in before you can comment on or make changes to this bug.