Closed Bug 598839 Opened 14 years ago Closed 14 years ago

JM: Faster x64 Syncing

Categories: Core :: JavaScript Engine, defect
Platform: x86_64 All
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Reporter: sstangl; Assignee: Unassigned
Whiteboard: fixed-in-tracemonkey
Attachments: 1 file, 4 obsolete files

From bug 598491 we have storeValueFromComponents(), which emits more efficient whole-Value storing code on x64. Once bug 591836 settles we can use this function to optimize x64 syncing, which has a large effect on performance, as previously measured in bug 587444.
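For intuition, a standalone sketch of the whole-Value store idea (simplified 32/32 boxing and a hypothetical helper name, not the real PUNBOX64 jsval encoding): on x64 the type tag and payload can be combined in one register and written with a single 64-bit store, where x86 needs two 32-bit stores.

// Sketch only: simplified boxing, not SpiderMonkey's actual layout.
#include <cstdint>
#include <cstring>

static void
storeValueFromComponentsSketch(uint32_t typeTag, uint32_t payload, void *slot)
{
    uint64_t boxed = (uint64_t(typeTag) << 32) | payload; // combine in a register
    std::memcpy(slot, &boxed, sizeof boxed);              // one 64-bit store
}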
=============================================================================

** TOTAL **:           1.040x as fast    581.6ms +/- 0.3%   559.3ms +/- 0.4%     

=============================================================================

  3d:                  1.068x as fast    116.7ms +/- 1.0%   109.3ms +/- 1.4%     
    cube:              1.121x as fast     35.7ms +/- 1.5%    31.8ms +/- 0.6%     
    morph:             -                  44.9ms +/- 1.3%    44.3ms +/- 1.6%
    raytrace:          1.086x as fast     36.1ms +/- 1.2%    33.3ms +/- 2.6%     

  access:              1.062x as fast     79.8ms +/- 0.9%    75.1ms +/- 0.9%     
    binary-trees:      -                  11.3ms +/- 1.4%    11.3ms +/- 2.1%
    fannkuch:          1.042x as fast     31.7ms +/- 1.0%    30.5ms +/- 1.6%     
    nbody:             1.127x as fast     23.2ms +/- 1.1%    20.6ms +/- 0.9%     
    nsieve:            1.059x as fast     13.6ms +/- 1.2%    12.8ms +/- 1.1%     

  bitops:              1.079x as fast     39.9ms +/- 0.5%    36.9ms +/- 0.6%     
    3bit-bits-in-byte: 1.085x as fast      4.0ms +/- 1.4%     3.7ms +/- 3.9%     
    bits-in-byte:      1.106x as fast     14.7ms +/- 0.9%    13.3ms +/- 1.1%     
    bitwise-and:       ??                  7.7ms +/- 1.7%     7.8ms +/- 1.7%
    nsieve-bits:       1.107x as fast     13.4ms +/- 1.0%    12.1ms +/- 0.8%     

  controlflow:         1.052x as fast      6.2ms +/- 2.0%     5.9ms +/- 2.2%     
    recursive:         1.052x as fast      6.2ms +/- 2.0%     5.9ms +/- 2.2%     

  crypto:              1.029x as fast     41.4ms +/- 1.0%    40.2ms +/- 0.7%     
    aes:               -                  22.2ms +/- 1.0%    21.9ms +/- 1.0%
    md5:               1.048x as fast     11.6ms +/- 1.5%    11.1ms +/- 0.6%     
    sha1:              1.051x as fast      7.6ms +/- 2.2%     7.3ms +/- 1.9%     

  date:                -                  61.5ms +/- 0.7%    60.9ms +/- 0.8%
    format-tofte:      -                  26.0ms +/- 1.3%    25.7ms +/- 1.0%
    format-xparb:      -                  35.5ms +/- 0.7%    35.2ms +/- 1.0%

  math:                1.062x as fast     56.7ms +/- 0.4%    53.4ms +/- 1.0%     
    cordic:            1.132x as fast     20.5ms +/- 0.7%    18.1ms +/- 0.9%     
    partial-sums:      -                  23.6ms +/- 0.8%    23.3ms +/- 1.8%
    spectral-norm:     1.056x as fast     12.5ms +/- 1.1%    11.9ms +/- 1.2%     

  regexp:              -                  22.4ms +/- 0.6%    22.4ms +/- 0.7%
    dna:               -                  22.4ms +/- 0.6%    22.4ms +/- 0.7%

  string:              1.012x as fast    157.1ms +/- 0.4%   155.2ms +/- 0.4%     
    base64:            ??                 14.3ms +/- 1.0%    14.5ms +/- 1.7%
    fasta:             1.032x as fast     25.8ms +/- 0.5%    25.0ms +/- 0.6%     
    tagcloud:          1.011x as fast     42.3ms +/- 0.5%    41.9ms +/- 0.5%     
    unpack-code:       -                  57.6ms +/- 0.6%    57.3ms +/- 0.6%
    validate-input:    1.028x as fast     17.0ms +/- 0.8%    16.6ms +/- 1.1%
This patch introduces a function, ensureFeSynced(), that knows how to efficiently sync a full Value on x64; on x86 its syncing is equivalent to what exists now. The patch then integrates this function throughout the syncing paths, so that x64 sees the performance improvement posted above.

Most of the code is shared between x86 and x64.
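To illustrate that sharing, here is a hypothetical standalone sketch (stand-in types, simplified layout, not the patch's code) in which only the store itself differs per box format:

// Stand-in types; hypothetical illustration only.
#include <cstdint>
#include <cstring>

struct SlotSketch { uint32_t payload, typeTag; }; // stand-in for a stack slot

static void
ensureFeSyncedSketch(uint32_t typeTag, uint32_t payload, SlotSketch *slot)
{
#if defined JS_PUNBOX64
    // x64: both halves of the Value in a single 64-bit store.
    uint64_t boxed = (uint64_t(typeTag) << 32) | payload;
    std::memcpy(slot, &boxed, sizeof boxed);
#else // JS_NUNBOX32
    // x86: type and data are separate 32-bit words; two stores.
    slot->payload = payload;
    slot->typeTag = typeTag;
#endif
}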
Attachment #479822 - Flags: review?(dvander)
Same as previous patch, except without duplicate syncing in syncAndKill().
Attachment #479822 - Attachment is obsolete: true
Attachment #480000 - Flags: review?(dvander)
Attachment #479822 - Flags: review?(dvander)
Blocks: 601066
Depends on: 601109
No longer depends on: 601109
Same as previous patch, but fixes overly ambitious syncing on x86 in sync() and syncAndKill() during the register-syncing phases.

It is difficult to determine whether x86 performance is affected by this patch -- the results are always within the noise.
Attachment #480000 - Attachment is obsolete: true
Attachment #480430 - Flags: review?(dvander)
Attachment #480000 - Flags: review?(dvander)
Comment on attachment 480430 [details] [diff] [review]
Restore x86 perf in ::sync() and ::syncAndKill().

This patch looks sound, but given the delicacy of this area and the chance to clean up some things, comments below:

> inline void
>+FrameState::ensureFeSynced(const FrameEntry *fe, Address to, Assembler &masm) const

This looks like it's designed to sync a FrameEntry to a slot on the frame, as opposed to FrameState::storeTo, which syncs to an arbitrary address. From looking at the rest of this patch, ensureFeSynced is only passed an fe/to mismatch in one place: storeLocal. I think we can factor this better. Remove the |to| parameter and always expect the actual |fe|, not the backing store. Or perhaps better, keep the address and assert that it's addressOf(fe).
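A sketch of that second suggestion, with stand-in types (the assert form here is hypothetical; the real code lives in FrameState):

// Stand-in types; hypothetical illustration only.
#include <cassert>

struct AddressSketch { int base, offset; };
struct FrameEntrySketch { int slot; };

struct FrameStateSketch {
    AddressSketch addressOf(const FrameEntrySketch *fe) const {
        return AddressSketch{0, fe->slot * 8};
    }
    void ensureFeSynced(const FrameEntrySketch *fe, AddressSketch to) const {
        // Keep the address parameter, but pin it to the FE's own slot.
        assert(addressOf(fe).base == to.base && addressOf(fe).offset == to.offset);
        // ... emit the sync ...
    }
};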

>+{
>+    const FrameEntry *orig = fe;
>+    if (fe->isCopy())
>+        fe = fe->copyOf();

Keeping in line with canonical naming, |fe| should stay as-is and the backing store assigned to "backing".

>+
>+    JS_ASSERT_IF(!orig->type.synced(), fe->type.inRegister() || fe->type.isConstant());
>+    JS_ASSERT_IF(!orig->data.synced(), fe->data.inRegister() || fe->data.isConstant());

These asserts are tautologies, so you can remove them.

>+    /* If syncing to a different address, always perform sync. */
>+    if (addressOf(orig).base != to.base || addressOf(orig).offset != to.offset) {

This block should go away. (If for no other reason, we should shy away from address snooping.)

> inline void
>@@ -507,18 +543,25 @@ FrameState::syncData(const FrameEntry *fe, Address to, Assembler &masm) const
>                  !fe->data.synced());
>     JS_ASSERT(fe->data.inRegister() || fe->data.isConstant());
> 
>-    if (fe->data.isConstant()) {
>+#if defined JS_PUNBOX64

Good changes, but we need a follow-up bug to split up FrameState. This whole function looks like it wants to be box format-specific.

`sync() const` is almost totally box-specific too.

>-            if ((!fe->type.synced() && !backing->type.inRegister()) ||
>-                (!fe->data.synced() && !backing->data.inRegister())) {
>+            /* Fall back to a slower sync algorithm if load required. */
>+            if ((!fe->type.synced() && backing->type.inMemory()) ||
>+                (!fe->data.synced() && backing->data.inMemory())) {

Nice - good catch, this could avoid some fallback into syncFancy().

>+        /* Sync now, unless synced above. */

Change this comment to something explaining which phase of the algorithm we're in. Maybe: "Any register-backed FE parts have been synced up-front. If a part still needs syncing, it is either a copy or a constant."

>+#if defined JS_PUNBOX64
>+        if (!fe->type.inRegister() && !fe->data.inRegister())
>+            ensureFeSynced(fe, addressOf(fe), masm);

Comment in here: "If either part of a PUNBOX FE was register-backed, then the entire value was already synced."

>+#elif defined JS_NUNBOX32
>+        if (!fe->data.synced() && !fe->data.inRegister()) {
>+            syncData(backing, addressOf(fe), masm);
>+            if (fe->isConstant())
>+                continue;

Not your code, but please add a :FIXME: bug 601657.

> 
>@@ -530,6 +527,27 @@ FrameState::syncAndKill(Registers kill, Uses uses, Uses ignore)
> 
>         JS_ASSERT(fe->isTracked());
> 
>+#if defined JS_PUNBOX64
>+        /* Sync entire FE to prevent loads. */
>+        ensureFeSynced(fe, addressOf(fe), masm);
>+
>+        if (!fe->data.synced())
>+            fe->data.sync();
>+        if (!fe->type.synced())
>+            fe->type.sync();

It can wait for a follow-up patch, but I expect we'll want two versions of ensureFeSynced: a const that takes a masm, and a non-const that doesn't. In the latter case, marking of data/type as synced would be done for the caller.

Out of curiosity, is there any reason not to use ensureFeSynced() here on x86 as well?
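For concreteness, a sketch of the two proposed variants (stand-in types; signatures assumed, not from the patch):

// Hypothetical: the const variant emits code only; the non-const variant
// also marks the parts synced on the caller's behalf.
struct AssemblerSketch {};
struct FrameEntrySketch { bool dataSynced, typeSynced; };

struct FrameStateSketch {
    AssemblerSketch masm;
    void ensureFeSynced(const FrameEntrySketch *fe, AssemblerSketch &m) const {
        // ... emit stores for any unsynced part of fe ...
    }
    void syncFe(FrameEntrySketch *fe) {
        ensureFeSynced(fe, masm);
        fe->dataSynced = fe->typeSynced = true; // done for the caller
    }
};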


>+#if defined JS_PUNBOX64
>+        /* Attempt to sync both simultaneously. */
>+        if (!fe->isCopy() || !(backing->data.inMemory() || backing->type.inMemory())) {
>+            ensureFeSynced(backing, address, masm);
>+            if (!fe->type.synced())
>+                fe->type.sync();
>+            if (!fe->data.synced())
>+                fe->data.sync();
>+        }
>+#endif

Another case where a mutable ensureFeSynced() would help.

>+        /* If not possible, do a piecewise sync. */
>         if (!fe->data.synced()) {
>-            if (backing != fe && backing->data.inMemory())
>+            if (fe->isCopy() && backing->data.inMemory())
>                 tempRegForData(backing);

fe != backing test is faster.

>             syncData(backing, address, masm);
>             fe->data.sync();
>             if (fe->isConstant() && !fe->type.synced()) {
>                 fe->type.sync();

Bleh, another :FIXME: bug 601657 :)

Actually, I'm semi-convinced this entire path (save for marking the killed registers) could be squeezed into a mutable ensureFeSynced. Even better if you made *that* function call into an ensureDataSynced() and ensureTypeSynced(), because...

>+            if (!backing->data.inMemory() && !backing->type.inMemory()) {
>+                ensureFeSynced(backing, addressOf(local), masm);
>+            } else {
>+                if (!local->data.synced()) {
>+                    if (backing->data.inMemory())
>+                        tempRegForData(backing);
>+                    syncData(backing, addressOf(local), masm);
>+                }
>+                if (!local->type.synced()) {
>+                    if (backing->type.inMemory())
>+                        tempRegForType(backing);
>+                    syncType(backing, addressOf(local), masm);
>+                }

All of this could go away. So major bonus points if you go that route, otherwise, we can leave it for a follow-up.
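A standalone sketch of that consolidation (stand-in types; hypothetical, not the patch's code):

// The piecewise block above collapses into a mutable syncFe() built from
// per-part 'ensure' helpers.
struct FrameEntrySketch { bool dataSynced, typeSynced; };

struct FrameStateSketch {
    void ensureDataSynced(const FrameEntrySketch *fe) const { /* emit data store */ }
    void ensureTypeSynced(const FrameEntrySketch *fe) const { /* emit type store */ }
    void syncFe(FrameEntrySketch *fe) {
        if (!fe->dataSynced) { ensureDataSynced(fe); fe->dataSynced = true; }
        if (!fe->typeSynced) { ensureTypeSynced(fe); fe->typeSynced = true; }
    }
};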

>+            /* TODO: It is unclear why this call is here. */
>             if (closed)
>                 forgetEntry(local);

Closed variables cannot be optimized, so their known types must be forgotten. Though I think, as we discussed on IRC, this one is stray and the work is already done by resetSynced() later. Either remove the TODO or the block ;)

BTW: Please make sure to test on --jitflags=m,md on both x86/x64, since with debug mode FS hits very, very different code paths.
Attachment #480430 - Flags: review?(dvander)
Attached patch Faster x64 Syncing, Second Take (obsolete) — Splinter Review
This is the same as the above patch, plus dvander's fixes, plus rebasing on top of bug 601657. From the previous patch, the following code is new:

- The functions formerly called sync{Type,Data} are now called ensure{Type,Data}Synced.

The 'ensure' family of functions is const. The new 'sync' family of functions (sync{Type,Data,Fe}) may allocate registers and will set the synced flags of the relevant FE appropriately; both families are sketched below.

- No functions in the 'ensure' family take an Address parameter, as it was never actually necessary. (We could also get rid of the masm parameter, but then we would have to eliminate it from the entirety of FrameState, which can be done in a separate patch.)

Passes trace-tests on x86,x64 with --jitflags=m,md.
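For reference, the assumed shape of the two families in this patch (stand-in declarations; parameter lists follow the quoted diff where it is shown):

// 'ensure' family: const, emits code only. 'sync' family: may allocate
// registers and sets the FE's synced flags.
struct AssemblerSketch;
struct FrameEntrySketch;

struct FrameStateApiSketch {
    void ensureTypeSynced(const FrameEntrySketch *fe, AssemblerSketch &masm) const;
    void ensureDataSynced(const FrameEntrySketch *fe, AssemblerSketch &masm) const;
    void ensureFeSynced(const FrameEntrySketch *fe, AssemblerSketch &masm) const;
    void syncType(FrameEntrySketch *fe, AssemblerSketch &masm);
    void syncData(FrameEntrySketch *fe, AssemblerSketch &masm);
    void syncFe(FrameEntrySketch *fe, AssemblerSketch &masm);
};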
Attachment #480430 - Attachment is obsolete: true
Attachment #481151 - Flags: review?(dvander)
For encouragement, the newest x64 SS-0.9.1 results, which are slightly improved (a result of better code in syncAndKill(), now that syncFe() exists):

=============================================================================

** TOTAL **:           1.056x as fast    581.8ms +/- 0.5%   550.7ms +/- 0.4% 

=============================================================================

  3d:                  1.065x as fast    116.9ms +/- 1.2%   109.7ms +/- 1.4%
    cube:              1.099x as fast     35.5ms +/- 1.5%    32.3ms +/- 1.5%
    morph:             -                  44.8ms +/- 1.2%    44.5ms +/- 1.6%
    raytrace:          1.113x as fast     36.7ms +/- 1.7%    32.9ms +/- 2.3%

  access:              1.088x as fast     79.7ms +/- 0.7%    73.2ms +/- 0.8%
    binary-trees:      1.035x as fast     10.1ms +/- 1.6%     9.7ms +/- 2.1%
    fannkuch:          1.076x as fast     32.3ms +/- 0.8%    30.1ms +/- 1.3%
    nbody:             1.142x as fast     23.2ms +/- 1.1%    20.3ms +/- 0.9%
    nsieve:            1.073x as fast     14.1ms +/- 1.5%    13.1ms +/- 0.7%

  bitops:              1.139x as fast     40.2ms +/- 1.4%    35.3ms +/- 0.6%
    3bit-bits-in-byte: 1.155x as fast      3.9ms +/- 2.8%     3.4ms +/- 4.1%
    bits-in-byte:      1.24x as fast      15.1ms +/- 1.4%    12.2ms +/- 0.9%
    bitwise-and:       -                   7.8ms +/- 2.4%     7.7ms +/- 1.7% 
    nsieve-bits:       1.118x as fast     13.5ms +/- 1.5%    12.0ms +/- 0.5%

  controlflow:         1.173x as fast      6.2ms +/- 2.4%     5.3ms +/- 2.5%
    recursive:         1.173x as fast      6.2ms +/- 2.4%     5.3ms +/- 2.5%

  crypto:              1.111x as fast     42.6ms +/- 1.0%    38.3ms +/- 0.6%
    aes:               1.099x as fast     23.2ms +/- 1.2%    21.1ms +/- 1.0%
    md5:               1.130x as fast     11.7ms +/- 1.4%    10.3ms +/- 1.3%
    sha1:              1.119x as fast      7.7ms +/- 2.0%     6.9ms +/- 1.4%

  date:                1.049x as fast     64.6ms +/- 0.8%    61.6ms +/- 0.5%
    format-tofte:      1.047x as fast     27.0ms +/- 1.2%    25.8ms +/- 0.6%
    format-xparb:      1.050x as fast     37.6ms +/- 0.7%    35.8ms +/- 0.6%

  math:                1.098x as fast     56.7ms +/- 0.5%    51.6ms +/- 0.8%
    cordic:            1.171x as fast     20.1ms +/- 0.5%    17.2ms +/- 0.6%
    partial-sums:      1.032x as fast     24.0ms +/- 1.2%    23.3ms +/- 1.5%
    spectral-norm:     1.121x as fast     12.6ms +/- 1.1%    11.2ms +/- 1.1%

  regexp:              -                  22.3ms +/- 0.6%    22.2ms +/- 0.5% 
    dna:               -                  22.3ms +/- 0.6%    22.2ms +/- 0.5% 

  string:              ??                152.5ms +/- 0.4%   153.3ms +/- 0.6%
    base64:            1.047x as fast     13.8ms +/- 0.9%    13.2ms +/- 0.8%
    fasta:             1.039x as fast     26.0ms +/- 0.6%    25.0ms +/- 1.3%
    tagcloud:          1.008x as fast     41.1ms +/- 0.4%    40.8ms +/- 0.4%
    unpack-code:       *1.075x as slow*   54.7ms +/- 0.9%    58.8ms +/- 1.0%
    validate-input:    1.089x as fast     16.9ms +/- 1.2%    15.5ms +/- 0.9%
Comment on attachment 481151 [details] [diff] [review]
Faster x64 Syncing, Second Take

>+FrameState::ensureFeSynced(const FrameEntry *fe, Assembler &masm) const
...
>+    /* If we can, sync the type and data in one go. */
>+    if (!fe->data.synced() && !fe->type.synced()) {

It seems like it's always more advantageous to avoid a load. Should this be special cased to x64, for something like:

  if (!dataSynced || !typeSynced && (constant || ((typeReg || typeConst) && dataReg)))

?

>+FrameState::ensureTypeSynced(const FrameEntry *fe, Assembler &masm) const
...
>+#if defined JS_PUNBOX64
>+    /* Attempt to store the entire Value, to prevent a load. */
>+    if (backing->data.inRegister()) {
>+        RegisterID dreg = backing->data.reg();
>+        if (backing->isConstant())
>+            masm.storeValue(backing->getValue(), to);

This backing->isConstant() case is unreachable. Should be joisted like ensureDataSynced()?

>+    if (JS_UNLIKELY(needTypeReg && needDataReg)) {
>+        /* Use ValueReg to do a whole-Value mem-to-mem move. */
>+        JS_ASSERT(backing != fe);

Explain this assert - maybe something like "Registers are only needed for memory-to-memory moves, which can only occur for a copy backed by memory."

>+inline void
>+FrameState::syncType(FrameEntry *fe, Assembler &masm)

No need to take in a masm, I think. Helps distinguish it from the const case too.

>+inline void
>+FrameState::syncData(FrameEntry *fe, Assembler &masm)

Ditto.

>+    /*
>+     * Likewise, storeLocal() may have set this FE, with a known type,
>+     * to be a copy of another FE, which has an unknown type.
>+     * Just forget the type, since the backing is used in all cases.
>+     */
>+    if (fe->isCopy()) {
>+        fe->type.invalidate();
>+        return;

I think this is safe, but we shouldn't do it because LOCALINC is wrong in the first place. The storeLocal can leave an illegal copy in the case of an eval or closed slot. We should just pop and re-push a copy. It looks like callers of storeLocal have to be audited for this case, because !popGuaranteed w/ closed variables must not remain copies.

Let's fix LOCALINC in this patch. I'll file a follow-up for the others.

>-    if (closed)
>+    if (closed) {
>+        if (!local->isCopy())
>+            forgetEntry(local);
>         local->resetSynced();

Existing code, but stick a comment in here, like "If the FE can have registers, make sure to free them before resetting."
Attachment #481151 - Flags: review?(dvander)
(In reply to comment #8)
> Should be joisted like ensureDataSynced()?

Oh, let's hoist instead of joist.
(In reply to comment #8)
> It seems like it's always more advantageous to avoid a load. Should this be
> special cased to x64, for something like:
> 
>   if (!dataSynced || !typeSynced && (constant || ((typeReg || typeConst) &&
> dataReg)))
> 
> ?

If (!dataSynced && !typeSynced), then we emit sync code with no load. If (dataSynced && !typeSynced), then only ensureTypeSynced() will have any output, and ensureTypeSynced() contains x64-specific code that avoids the load if possible. Likewise for (!dataSynced && typeSynced). So the condition as it exists is OK -- it always avoids a load if possible.

This is implicit and annoying to read: we can equivalently change the code to be:

> if (!fe->data.synced() && !fe->type.synced())
>     ...
> else if (!fe->data.synced())
>     ensureDataSynced(fe, masm);
> else if (!fe->type.synced())
>     ensureTypeSynced(fe, masm);

But the entire point of the 'ensure' family is so that we don't have to do such checks. And if we don't do that, we can special-case the 'both' logic to be x64-specific, since it has no effect on x86 except to add bloated, redundant code.

> 
> >+FrameState::ensureTypeSynced(const FrameEntry *fe, Assembler &masm) const
> ...
> >+#if defined JS_PUNBOX64
> >+    /* Attempt to store the entire Value, to prevent a load. */
> >+    if (backing->data.inRegister()) {
> >+        RegisterID dreg = backing->data.reg();
> >+        if (backing->isConstant())
> >+            masm.storeValue(backing->getValue(), to);
> 
> This backing->isConstant() case is unreachable. Should be hoisted like
> ensureDataSynced()?

Oops! Good catch.

As discussed on IRC, we'll leave jsop_localinc() changes for a separate patch. Version three forthcoming.
Attachment #481151 - Attachment is obsolete: true
Attachment #481420 - Flags: review?(dvander)
Comment on attachment 481420 [details] [diff] [review]
Faster x64 Syncing, Third Take

Awesome patch, good to see a lot of the FS redundancy go away.
Attachment #481420 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/08970767d83d
Whiteboard: fixed-in-tracemonkey
You put your right hand in,

http://hg.mozilla.org/tracemonkey/rev/ae702233627c
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/ae702233627c
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED