IonMonkey and type inference for PowerPC (OS X ABI)

NEW
Assigned to

Status

()

Core
JavaScript Engine
6 years ago
4 years ago

People

(Reporter: spectre, Assigned: spectre)

Tracking

Trunk
PowerPC
Mac OS X
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 3 obsolete attachments)

(Assignee)

Description

6 years ago
Created attachment 601168 [details] [diff] [review]
Part 1 (new files): Macro assembler, assembler, OS X trampoline

In TenFourFox we are now shipping a completed JavaScript methodjit for PowerPC. This passes all tests with and without type inference, and is approximately 4x faster than the interpreter, depending on the benchmark. This now completely replaces our previous tracejit and is already in use by our users as the default. Shouts-out to Ben Stuhl who did major initial work getting the embryonic macroassembler to compile and then turned around and significantly replumbed the branching code and regular expression parsing to yield exceptional performance, and Dave Kilbridge for our software square root routine (read on).

Historically PowerPC has had two major ABIs, viz., PowerOpen (AIX, MacOS, Mac OS X) and SysV (AmigaOS 4, Linux, *BSD). These ABIs differ in some significant respects. This code is for Mac OS X and is compliant with Apple's modified PowerOpen ABI. It should work with little or no adjustment on AIX (I defer to Andrew Paprocki), but it will not work without some changes on SysV ABI.

Some explanations:
- There are a couple tweaks to friend classes to work around a gcc 4.0.1 bug, which we specifically support for 10.4 (hopefully this isn't a big deal).
- We use JS_CPU_PPC_OSX to indicate the combination of JS_CPU_PPC and 10.4/10.5. Any changes for AIX probably should use JS_CPU_PPC_AIX, and SysV ABI changes should be _PPC_SYSV.
- As a result, the attached trampoline will probably ONLY work on OS X/ppc.
- There are separate code paths for "little POWER" (G3 and G4; also probably e500, PPC 4xx, PA6T, Gekko, Broadway) and "big POWER" (G5; also POWER4/5/6/7; probably Cell PPE and Xenon as well). "Big POWER" is selected with -D_PPC970_. In "little POWER" mode, Ben's branch work (ab)uses the constant pool to pull the far call branch stanzas out of line and save i-cache. This is a losing strategy on "big POWER," where their increased sequential nature and much larger i-caches benefit more from keeping everything inline. Also, in "big POWER," there is additional code to keep G5 dispatch groups properly aligned and to use its native square root; in "little POWER," we instead inline Dave's fast software square root using frsqrte and Newton's method (this is ABI dependent because we spill into the red zone) to save a stub call.

I'm not sure if you want the changes to JS in the tree because it is for OS X/ppc, so this first patch does NOT include:
- Changes to methodjit for our veneer, for MonoIC calls (to add a faux argument space) and for repatching.
- Changes to the Makefile.
- Changes to YarrJIT. We have both the regular changes and Ben also cooked up a 8% win by having YARR use an extra register to track position. This should also benefit register-rich big RISC systems like SPARC and MIPS, and might also be useful for ARM. I can file a separate bug for this last part.

If you do, I will attach as a Part 2 for review; if you don't, I'll just attach them for "educational purposes." I do think these would be useful to check into the tree for downstream consumers.

Thanks to Brian Hackett, Chris Leary, and Dave Mandelin for initial help working through some of the asserts our code triggered.

Dave, if you prefer dvander or someone else to review, please advise.
(Assignee)

Updated

6 years ago
Attachment #601168 - Flags: review?(dmandelin)
(Assignee)

Updated

6 years ago
Attachment #601168 - Attachment is patch: true
(Assignee)

Comment 1

6 years ago
Two things I forgot to mention:

- 4x on SunSpider; v8-v6 goes from ~200 to ~2000 (10x). Haven't run Kraken on it recently.

- The GPR<->FPR moves and zeroDouble make assumptions about the linkage area that there is a double-size block of space nominally reserved for the compiler at r1(16), so those should be marked ABI dependent. I'll do that with any requested changes.
Comment on attachment 601168 [details] [diff] [review]
Part 1 (new files): Macro assembler, assembler, OS X trampoline

Review of attachment 601168 [details] [diff] [review]:
-----------------------------------------------------------------

Wow, great work! The speedups are right inline with what we got for interpreter vs. JM+TI on x86--maybe even a bit better on v8--I don't recall the exact numbers.
Attachment #601168 - Flags: review?(dmandelin) → review+
(In reply to Cameron Kaiser from comment #0)
> I'm not sure if you want the changes to JS in the tree because it is for OS
> X/ppc, so this first patch does NOT include:
> - Changes to methodjit for our veneer, for MonoIC calls (to add a faux
> argument space) and for repatching.
> - Changes to the Makefile.

Those changes don't sound too extensive, so I'd like to take them.

> - Changes to YarrJIT. We have both the regular changes and Ben also cooked
> up a 8% win by having YARR use an extra register to track position. This
> should also benefit register-rich big RISC systems like SPARC and MIPS, and
> might also be useful for ARM. I can file a separate bug for this last part.

I try to keep our version of Yarr as close as possible to upstream [*] in order to make it easy to backport their bugfixes to our tree. If you'd like to get the Yarr stuff in our tree, could you file a bug with WebKit first, and try to get them to take it?

[*] We don't refresh Yarr from upstream very frequently, so it's more accurate to say "close as possible to *some* upstream rev".
(Assignee)

Comment 4

6 years ago
Okay, I'll keep the position register code out and file it upstream. We do have some otherwise vanilla changes to YarrJIT just to get PPC regex JIT to work, though; I'll put that in a separate patch just in case. I'm merging our changes against mozilla-central and will post Part 2 presently (thanks for the quick r+ on part 1).

I'll also post a "for checkin" Part 1 with the tweak to correctly flag the GPR<->FPR moves as ABI-dependent.
(Assignee)

Updated

6 years ago
Assignee: general → spectre
(Assignee)

Comment 5

6 years ago
I'm going to retract part 1 because I have a terrible performance regression between our ESR build and mozilla-central. I did a partial bisect and have this narrowed down to 10->11. The numbers I'm getting on v8-v6 look like this:

# 11 interpreter only baseline
../../../obj-ff-dbg/dist/TenFourFox.app/Contents/MacOS/js -f run.js
Richards: 127
DeltaBlue: 124
Crypto: 118
RayTrace: 196
EarleyBoyer: 254
RegExp: 247
Splay: 614
----
Score (version 6): 203

# score in 10.0.3pre
/Applications/TenFourFoxG5.app/Contents/MacOS/js -m -n -f run.js
Richards: 3096
DeltaBlue: 2521
Crypto: 3114
RayTrace: 1222
EarleyBoyer: 2924
RegExp: 505
Splay: 3037
----
Score (version 6): 2012

# score in 11 opt G5 build test
../../../obj-ff-dbg/dist/TenFourFox.app/Contents/MacOS/js -m -n -f run.js
Richards: 820
DeltaBlue: 1082
Crypto: 125
RayTrace: 1140
EarleyBoyer: 1492
RegExp: 361
Splay: 1661
----
Score (version 6): 732

With the exception of RayTrace, virtually all the benchmarks suffer badly. Crypto doesn't show any improvement at all. I'm going to bisect further, but do these sound like a particular change or set of changes that may have affected runtime?
(Assignee)

Comment 6

6 years ago
(Just thinking out loud: could the write barriers for incremental GC cause changes like this?)
(Assignee)

Comment 7

6 years ago
Kindly disregard the spam, it turned out that a build system change between 10 and 11 made our custom settings in the mozconfig not stick (G5 does very badly if not tuned properly). However, I will have a revised part 1 anyway.
(Assignee)

Comment 8

6 years ago
Created attachment 602548 [details] [diff] [review]
Part 1 (new files): Macroassembler, assembler, trampoline (v2)

Revised part 1. The only difference is that I added a standard define for a free zone in the linkage area and made it ABI dependent, and fixed up lea(BaseIndex, RegisterID). I don't know if you need to re-review it.
Attachment #601168 - Attachment is obsolete: true
Attachment #602548 - Flags: review?(dmandelin)
(Assignee)

Comment 9

6 years ago
Created attachment 602549 [details] [diff] [review]
Part 2: JS build system changes (except YarrJIT)

This patch contains all the needed patches to methodjit/, assembler/jit/, and configure.in. YarrJIT will be under separate cover.
Attachment #602549 - Flags: review?(dmandelin)
Attachment #602548 - Flags: review?(dmandelin) → review+
Comment on attachment 602549 [details] [diff] [review]
Part 2: JS build system changes (except YarrJIT)

Review of attachment 602549 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/methodjit/MachineRegs.h
@@ +372,5 @@
>  #elif defined(JS_CPU_MIPS)
>          return 4;
> +#elif defined(JS_CPU_PPC_OSX)
> +        // Yup, we're awesome.
> +        return 8;

:-)
Attachment #602549 - Flags: review?(dmandelin) → review+
(Assignee)

Comment 11

6 years ago
Created attachment 602556 [details] [diff] [review]
Part 3: YarrJIT and Makefile.in

This patch I'm not 100% sure on -- please double check my changes to Makefile.in, since I'm not sure I wrote it in the most efficient manner (our Makefile obviously is hardcoded to our own configurations).

The changes to YarrJIT, besides the prologue and epilogue, are to keep stack pointer-relative addressing out of the linkage area (as written, it stomps all over it, wrecking the PPC frame). I'll get the position tracking register changes accepted upstream with these changes also, but this will get it working here.
Attachment #602556 - Flags: review?(dmandelin)
(Assignee)

Comment 12

6 years ago
(and thanks very much for the fast r+ on 1/2 :)
(Assignee)

Comment 13

6 years ago
Created attachment 602571 [details] [diff] [review]
Part 3: YarrJIT.cpp and Makefile.in (v2)

This looks better.
Attachment #602556 - Attachment is obsolete: true
Attachment #602571 - Flags: review?(dmandelin)
Attachment #602556 - Flags: review?(dmandelin)
(Assignee)

Comment 14

6 years ago
Created attachment 602616 [details] [diff] [review]
Part 3: YarrJIT.cpp, Makefile.in, MacroAssembler.h (v3)

It would be even better with the MacroAssembler.h changes.
Attachment #602571 - Attachment is obsolete: true
Attachment #602616 - Flags: review?(dmandelin)
Attachment #602571 - Flags: review?(dmandelin)
Comment on attachment 602616 [details] [diff] [review]
Part 3: YarrJIT.cpp, Makefile.in, MacroAssembler.h (v3)

Review of attachment 602616 [details] [diff] [review]:
-----------------------------------------------------------------

The Makefile change looks OK. I like that the changes to Yarr are non-instrusive, so I'm not worried about taking them into our tree even if they don't make it upstream. But can you create some constants for the magic numbers just for style points? You also might want to add some comments to explain a bit more about what problems you're solving with the PPC-specific code, but I guess you're probably the major maintainer of that code, so it's up to you.
Attachment #602616 - Flags: review?(dmandelin)
(Assignee)

Comment 16

6 years ago
I can do that. I'll work the magic constants into PPCAssembler.h, since that is probably the logical place.
(Assignee)

Comment 17

6 years ago
I'm also going to have a revised part 1 that streamlines G3/G4 square root and improves V8 performance by 2% on G5/big POWER by rescheduling the Veneer. I'll rebase everything against the tree after bug 691898.
(Assignee)

Updated

5 years ago
Depends on: 751845
(Assignee)

Updated

5 years ago
Depends on: 771320
(Assignee)

Comment 18

5 years ago
We're significantly revising our code to prepare for IonMonkey, particularly with regards to strict ABI compliance so that the stack is in a sane state at all times. I'm still having some trouble avoiding tricks to get typed arrays to work, so this is not ready yet. In any case, the current set of patches are now all obsolete.
(In reply to Cameron Kaiser from comment #18)
> We're significantly revising our code to prepare for IonMonkey, particularly
> with regards to strict ABI compliance so that the stack is in a sane state
> at all times. I'm still having some trouble avoiding tricks to get typed
> arrays to work, so this is not ready yet. In any case, the current set of
> patches are now all obsolete.

Cool! As you probably have already found out, IonMonkey uses a new MacroAssembler interface. It is possible to build this on top of either the existing MacroAssembler, or by writing an entirely new one on top of PPCAssembler (this is what we did for x86/x64). The biggest changes are around branching (our "Label" is the inverse of the old "Jump") and type safety of Register IDs.

Our long-term plan is to replace JM with a new baseline compiler (bug 805241), and that is using the same MacroAssembler interface as IonMonkey. With any luck that'll mostly work for you without significant changes.
(Assignee)

Comment 20

5 years ago
> It is possible to build this on top of either the existing MacroAssembler,

That would certainly be the easiest way to bootstrap it, but how? Is the ARM version built that way?

We had the advantage of being able to crib off the SPARC JaegerMonkey when writing that, but I think we're the trailblazers here.
Updating the description to reflect that this bug has morphed from being about JM+TI about IM+TI.
Summary: JaegerMonkey and type inference for PowerPC (OS X ABI) → IonMonkey and type inference for PowerPC (OS X ABI)
You need to log in before you can comment on or make changes to this bug.