nanojit: account for eight-byte alignment of stack in StackFilter

RESOLVED FIXED

Status

RESOLVED FIXED
9 years ago
5 years ago

People

(Reporter: dvander, Assigned: njn)

Tracking

unspecified
x86
Linux

Firefox Tracking Flags

(status2.0 wanted)

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)

Attachments

(7 attachments)

Created attachment 431228 [details]
shell version

http://www.adequatelygood.com/2010/3/Performance-of-vs-

An initial look seemed like js_StringToNumber was a bottleneck, but someone taking a look at this should not take that as fact.
Created attachment 431229 [details]
version with just bad cases
Created attachment 431270 [details]
Version that shows the speedup of == over === +

I bet we optimize things away here somehow in the == case.
Created attachment 431272 [details]
Showing things more clearly; should make it easier to create logs and such
I'll bet that in the == cases we dead-code-eliminate the js_StringToNumber() call and we just fail to do that with the === case.
That seems to be what happens, indeed.  The patch from bug 541288 says that the reason is that this:

  0x81a2ec  stfi sp[8] = js_StringToNumber1

is a statement.  Of course there's a statement like that in the == case as well.  Might be worth looking into what the difference between those cases is in terms of what StackFilter does.  Sort of echoes of bug 536458, but perhaps not quite (my hacks for that bug seem to not work at first glance).
All percentages that follow are percentages of total testcase time.

In the slow cases, 93% of the time is under js_StringToNumber.  5% is self time; the rest is under js_strtod.  There 18% is self time; the rest is under JS_strtod.  

The time under JS_strtod is 20% self time, 25% PR_Lock, and 22% PR_Unlock.  The locking is all self times in those functions.  I blame LOCK_DTOA/UNLOCK/DTOA.  Is there a reason those are prlocks and not thinlocks?

One other interesting note:  In Safari, these testcases a single-character string (like "0" or "1") seems to be about 2x faster than for a multi-character string (like "00" or "12").  Increasing the length more only affects the performance weakly.  I bet they're using a lookup table or something for single-char strings.  I wonder how often that's hit in practice, outside of benchmarks...
One other note: I tried switching from global-locking around dtoa stuff to using its built-in MULTIPLE_THREADS support, and while that speeds up this testcase, and doesn't affect |x + ""| for an integer x that I can tell, it does slow down |x + ""| for a floating-point x (12.35 in my case) significantly.
For what it's worth, Safari does in fact special-case single-char ASCII strings in their string-to-number code:

http://trac.webkit.org/browser/trunk/JavaScriptCore/runtime/UString.cpp?rev=54843#L365
Which seems to be part of the fix for https://bugs.webkit.org/show_bug.cgi?id=20333 (aka "tune strings to sunspider").

We may want a separate bug on getting some measurements on this single-char string business...
The locking is covered by bug 509857, methinks.

I filed bug 551109 on avoiding jsdtoa altogether for small-integer strings, and bug 551110 on possibly replicating the jsc 1-char special-casing.

That leaves the original issue this bug was filed on, which is the discrepancy between a==b and a===+b.
Depends on: 509857, 551109, 551110

Comment 11

9 years ago
Lets please nuke the stupid dtoa locking. We can use thread-local storage if necessary (JSThreadData).
(In reply to comment #11)
> Lets please nuke the stupid dtoa locking. We can use thread-local storage if
> necessary (JSThreadData).

Hello, please to be reading dependency bug 509857.

/be
(Assignee)

Comment 13

9 years ago
AFAICT, the '==' vs '===' difference is a red herring.  It's the '+' that's
important.

Here's the relevant code for the '==' case.  Instructions that end up being
removed are indented.

00024:   9  trace
00025:  10  getlocal 0
00028:  10  getlocal 1
00031:  10  eq
    ld2 = ld sp[-16]
        $stack4 = i2f ld2
        sti sp[0] = ld2
        $stack5 = ld sp[-8]
        sti sp[8] = $stack5
        js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 )

00032:  10  pop
00033:   9  bindname "i"
00036:   9  dup
00037:   9  getxprop "i"
    About to try emitting guard code for SideExit=0x9cb54d4 exitType=BRANCH
        feq1 = feq $stack4, js_StringToNumber1
        sti sp[0] = feq1
        obj = int -143450112
    sti sp[0] = obj
    sti sp[8] = obj
        globalObj = int -143450112
        map = ld globalObj[0]
        globalObj = int -143450112
    guard_global = eq globalObj, globalObj
    xf2: xf guard_global -> pc=0x9cad04d imacpc=(nil) sp+16 rp+0 (GuardID=002)


Likewise for the '=== +' case:

00024:  17  trace
00025:  18  getlocal 0
00028:  18  getlocal 1
00031:  18  pos
    ld2 = ld sp[-16]
        $stack4 = i2f ld2
        sti sp[0] = ld2
    $stack5 = ld sp[-8]
        sti sp[8] = $stack5
    js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 )

00032:  18  stricteq
00033:  18  pop
00034:  17  bindname "i"
00037:  17  dup
00038:  17  getxprop "i"
    About to try emitting guard code for SideExit=0x9cb584c exitType=BRANCH
    stfi sp[8] = js_StringToNumber1                # !!!
        feq1 = feq $stack4, js_StringToNumber1
        sti sp[0] = feq1
        obj = int -143450112
    sti sp[0] = obj
    sti sp[8] = obj
        globalObj = int -143450112
        map = ld globalObj[0]
        globalObj = int -143450112
    guard_global = eq globalObj, globalObj
    xf2: xf guard_global -> pc=0x9cad926 imacpc=(nil) sp+16 rp+0 (GuardID=002)


The difference is the instruction marked '!!!'.  It occurs because the
'pos' requires the result to be pushed onto the stack.  As we saw with bug
536458 the StackFilter cannot remove this 'stfi' because only half of it is
subsequently overwritten.  In this case that keeps the js_StringToNumber
call alive.

Here's my suggestion:  I could add a boolean flag to StackFilter:
'eightByteAligned'.  When it is set, then cases like this could be
optimised away.  Preliminary investigations are promising, eg. crypto-sha1 is 5% faster... in general lots of stack stores are removed, but also a few functions calls (mostly js_UnboxDouble, js_String_p_charCodeAt, math_sqrt_tn).

I would also add some assertions to check that all the sp
offsets are 8-aligned.  BTW, I assume that the RPstack is *not* 8-aligned
but word-aligned?
Assignee: general → nobody
Status: NEW → ASSIGNED
Component: JavaScript Engine → Nanojit
QA Contact: general → nanojit
Summary: TM: odd performance in comparison benchmark → nanojit: account for eight-byte alignment of stack in StackFilter
(Assignee)

Comment 14

9 years ago
I removed all the dependencies which were about speeding up js_StringToNumber;  this bug is really about StackFilter.
No longer depends on: 509857, 551109, 551110
(Assignee)

Updated

9 years ago
Duplicate of this bug: 536458
(Assignee)

Comment 16

9 years ago
Created attachment 431810 [details] [diff] [review]
NJ patch

The first thing to note is that TR does not use StackFilter at all.  So it
seems reasonable to bake some TM-specific assumptions into it.

With that in mind, this patch:

- Changes StackFilter so that it doesn't perform filtering on the RP stack.
  Turns out that this hardly achieves anything -- eg. only 3 of the
  SunSpider tests have their generated code changed by this.  And they only
  have a handful more stores.  But every test benefits slightly from
  StackFilter becoming slightly faster.

- Changes StackFilter to assume that the stack has 8-byte entries, and so in
  this sequence:

    stfi sp[0]
    ...
    sti sp[0]

  it will eliminate the first store even though the second doesn't clobber
  the high four bytes -- we know they're dead anyway.  An assertion
  checks that the sp offsets are always factors of eight.

  This change allows StackFilter to become faster yet, because we only have
  to track half as many entries (because of the 8-byte granularity instead
  of 4-byte) and we can treat both 4-byte and 8-byte stores the same.

- I inlined and removed ignoreStore() now that it's only called from one
  site.

There are some nice improvements, particularly in the following cases where
the improved StackFilter causes function calls to be removed:

                        speedup     instructions executed reduction
                        -------     -------------------------------
- crypto-md5:           4--5%       3.8%
- crypto-sha1:          3--5%       4.0%
- bitops-nsieve-bits:   8--11%      9.0%
- string-base64:        6--7%       4.3%
- v8-crypto:            3--6%       4.0%

The speedups are visible on both i386 and X64.  The instruction reductions
are measured on i386 only, but other platforms should be similar.

Some others may have smaller benefits, but those are hard to distinguish
from the usual noise.

The program in attachement 4 now gives similar numbers for both the '=='
loop and the '=== +' loop.

Not bad for a patch that ends up simplifying StackFilter!
Assignee: nobody → nnethercote
Attachment #431810 - Flags: review?(gal)
(Assignee)

Comment 17

9 years ago
Created attachment 431811 [details] [diff] [review]
Simple TM patch
Attachment #431811 - Flags: review?(gal)
(Assignee)

Comment 18

9 years ago
Created attachment 431812 [details] [diff] [review]
Simple TR patch
Attachment #431812 - Flags: review?(edwsmith)

Comment 19

9 years ago
Comment on attachment 431812 [details] [diff] [review]
Simple TR patch

Nice change.  I've been eyeing StackFilter, because the deadvars_kill() pass in CodegenLIR.cpp is essentially a reformulated version of StackFilter.  Your changes are probably compatible with TR, too, if we were to start using it.
Attachment #431812 - Flags: review?(edwsmith) → review+

Comment 20

9 years ago
Comment on attachment 431810 [details] [diff] [review]
NJ patch

I am a bit sad that rp stack filtering is gone, but I guess we guard so much that there is almost no way to inline a method call without emitting at least one guard.
Attachment #431810 - Flags: review?(gal) → review+

Updated

9 years ago
Attachment #431811 - Flags: review?(gal) → review+
(Assignee)

Comment 21

9 years ago
(In reply to comment #20)
> (From update of attachment 431810 [details] [diff] [review])
> I am a bit sad that rp stack filtering is gone, but I guess we guard so much
> that there is almost no way to inline a method call without emitting at least
> one guard.

The RP stack filtering really did very little.  In particular, one thing I didn't mention above is that all stores to the RP stack are just stores of constants (FrameInfo pointers).  So if you eliminate an RP stack store you only eliminate that store.  Whereas with the SP stack the stores can be of complicated expressions which may include function calls, so eliminating an SP stack store can have an outsized benefit.

RP stack filtering can be added back in later if things change so that the run-time benefits outweigh the compile-time costs.
(Assignee)

Comment 23

9 years ago
http://hg.mozilla.org/tracemonkey/rev/0c65023bff0f
http://hg.mozilla.org/tracemonkey/rev/adbdac887629
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey
(Assignee)

Comment 24

9 years ago
I wonder if this is worth nominating for 1.9.3?  It's a small patch, IMO low-risk, and gives 5--10% speedups for several SunSpider and V8 tests.
status2.0: --- → ?

Updated

9 years ago
status2.0: ? → wanted

Comment 25

9 years ago
http://hg.mozilla.org/tamarin-redux/rev/2bbb335c34fe
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin

Comment 26

9 years ago
http://hg.mozilla.org/mozilla-central/rev/0c65023bff0f
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Component: Nanojit → Nanojit
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.