Closed Bug 551039 Opened 14 years ago Closed 14 years ago

nanojit: account for eight-byte alignment of stack in StackFilter

Tracking

(status2.0 wanted)

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

status2.0

---

wanted

People

(Reporter: dvander, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)

Attachments

(7 files)

shell version 14 years ago David Anderson [:dvander] - inactive, e-mail if emergency 4.38 KB, application/javascript		Details
version with just bad cases 14 years ago David Anderson [:dvander] - inactive, e-mail if emergency 1.36 KB, application/javascript		Details
Version that shows the speedup of == over === + 14 years ago Boris Zbarsky [:bzbarsky] 1.34 KB, text/plain		Details
Showing things more clearly; should make it easier to create logs and such 14 years ago Boris Zbarsky [:bzbarsky] 405 bytes, text/plain		Details
NJ patch 14 years ago Nicholas Nethercote [inactive] 6.78 KB, patch	gal : review+	Details \| Diff \| Splinter Review
Simple TM patch 14 years ago Nicholas Nethercote [inactive] 655 bytes, patch	gal : review+	Details \| Diff \| Splinter Review
Simple TR patch 14 years ago Nicholas Nethercote [inactive] 598 bytes, patch	edwsmith : review+	Details \| Diff \| Splinter Review

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Description

•

14 years ago

Attached file shell version — Details

http://www.adequatelygood.com/2010/3/Performance-of-vs-

An initial look seemed like js_StringToNumber was a bottleneck, but someone taking a look at this should not take that as fact.

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Comment 1

•

14 years ago

Attached file version with just bad cases — Details

Boris Zbarsky [:bzbarsky]

Comment 2

•

14 years ago

Attached file Version that shows the speedup of == over === + — Details

I bet we optimize things away here somehow in the == case.

Boris Zbarsky [:bzbarsky]

Comment 3

•

14 years ago

Attached file Showing things more clearly; should make it easier to create logs and such — Details

Boris Zbarsky [:bzbarsky]

Comment 4

•

14 years ago

I'll bet that in the == cases we dead-code-eliminate the js_StringToNumber() call and we just fail to do that with the === case.

Boris Zbarsky [:bzbarsky]

Comment 5

•

14 years ago

That seems to be what happens, indeed.  The patch from bug 541288 says that the reason is that this:

  0x81a2ec  stfi sp[8] = js_StringToNumber1

is a statement.  Of course there's a statement like that in the == case as well.  Might be worth looking into what the difference between those cases is in terms of what StackFilter does.  Sort of echoes of bug 536458, but perhaps not quite (my hacks for that bug seem to not work at first glance).

Boris Zbarsky [:bzbarsky]

Comment 6

•

14 years ago

All percentages that follow are percentages of total testcase time.

In the slow cases, 93% of the time is under js_StringToNumber.  5% is self time; the rest is under js_strtod.  There 18% is self time; the rest is under JS_strtod.  

The time under JS_strtod is 20% self time, 25% PR_Lock, and 22% PR_Unlock.  The locking is all self times in those functions.  I blame LOCK_DTOA/UNLOCK/DTOA.  Is there a reason those are prlocks and not thinlocks?

One other interesting note:  In Safari, these testcases a single-character string (like "0" or "1") seems to be about 2x faster than for a multi-character string (like "00" or "12").  Increasing the length more only affects the performance weakly.  I bet they're using a lookup table or something for single-char strings.  I wonder how often that's hit in practice, outside of benchmarks...

Boris Zbarsky [:bzbarsky]

Comment 7

•

14 years ago

One other note: I tried switching from global-locking around dtoa stuff to using its built-in MULTIPLE_THREADS support, and while that speeds up this testcase, and doesn't affect |x + ""| for an integer x that I can tell, it does slow down |x + ""| for a floating-point x (12.35 in my case) significantly.

Boris Zbarsky [:bzbarsky]

Comment 8

•

14 years ago

For what it's worth, Safari does in fact special-case single-char ASCII strings in their string-to-number code:

http://trac.webkit.org/browser/trunk/JavaScriptCore/runtime/UString.cpp?rev=54843#L365

Boris Zbarsky [:bzbarsky]

Comment 9

•

14 years ago

Which seems to be part of the fix for https://bugs.webkit.org/show_bug.cgi?id=20333 (aka "tune strings to sunspider").

We may want a separate bug on getting some measurements on this single-char string business...

Boris Zbarsky [:bzbarsky]

Comment 10

•

14 years ago

The locking is covered by bug 509857, methinks.

I filed bug 551109 on avoiding jsdtoa altogether for small-integer strings, and bug 551110 on possibly replicating the jsc 1-char special-casing.

That leaves the original issue this bug was filed on, which is the discrepancy between a==b and a===+b.

Depends on: 509857, 551109, 551110

Andreas Gal :gal

Comment 11

•

14 years ago

Lets please nuke the stupid dtoa locking. We can use thread-local storage if necessary (JSThreadData).

Brendan Eich [:brendan]

Comment 12

•

14 years ago

(In reply to comment #11)
> Lets please nuke the stupid dtoa locking. We can use thread-local storage if
> necessary (JSThreadData).

Hello, please to be reading dependency bug 509857.

/be

Nicholas Nethercote [inactive]

Assignee

Comment 13

•

14 years ago

AFAICT, the '==' vs '===' difference is a red herring.  It's the '+' that's
important.

Here's the relevant code for the '==' case.  Instructions that end up being
removed are indented.

00024:   9  trace
00025:  10  getlocal 0
00028:  10  getlocal 1
00031:  10  eq
    ld2 = ld sp[-16]
        $stack4 = i2f ld2
        sti sp[0] = ld2
        $stack5 = ld sp[-8]
        sti sp[8] = $stack5
        js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 )

00032:  10  pop
00033:   9  bindname "i"
00036:   9  dup
00037:   9  getxprop "i"
    About to try emitting guard code for SideExit=0x9cb54d4 exitType=BRANCH
        feq1 = feq $stack4, js_StringToNumber1
        sti sp[0] = feq1
        obj = int -143450112
    sti sp[0] = obj
    sti sp[8] = obj
        globalObj = int -143450112
        map = ld globalObj[0]
        globalObj = int -143450112
    guard_global = eq globalObj, globalObj
    xf2: xf guard_global -> pc=0x9cad04d imacpc=(nil) sp+16 rp+0 (GuardID=002)


Likewise for the '=== +' case:

00024:  17  trace
00025:  18  getlocal 0
00028:  18  getlocal 1
00031:  18  pos
    ld2 = ld sp[-16]
        $stack4 = i2f ld2
        sti sp[0] = ld2
    $stack5 = ld sp[-8]
        sti sp[8] = $stack5
    js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 )

00032:  18  stricteq
00033:  18  pop
00034:  17  bindname "i"
00037:  17  dup
00038:  17  getxprop "i"
    About to try emitting guard code for SideExit=0x9cb584c exitType=BRANCH
    stfi sp[8] = js_StringToNumber1                # !!!
        feq1 = feq $stack4, js_StringToNumber1
        sti sp[0] = feq1
        obj = int -143450112
    sti sp[0] = obj
    sti sp[8] = obj
        globalObj = int -143450112
        map = ld globalObj[0]
        globalObj = int -143450112
    guard_global = eq globalObj, globalObj
    xf2: xf guard_global -> pc=0x9cad926 imacpc=(nil) sp+16 rp+0 (GuardID=002)


The difference is the instruction marked '!!!'.  It occurs because the
'pos' requires the result to be pushed onto the stack.  As we saw with bug
536458 the StackFilter cannot remove this 'stfi' because only half of it is
subsequently overwritten.  In this case that keeps the js_StringToNumber
call alive.

Here's my suggestion:  I could add a boolean flag to StackFilter:
'eightByteAligned'.  When it is set, then cases like this could be
optimised away.  Preliminary investigations are promising, eg. crypto-sha1 is 5% faster... in general lots of stack stores are removed, but also a few functions calls (mostly js_UnboxDouble, js_String_p_charCodeAt, math_sqrt_tn).

I would also add some assertions to check that all the sp
offsets are 8-aligned.  BTW, I assume that the RPstack is *not* 8-aligned
but word-aligned?

Assignee: general → nobody

Status: NEW → ASSIGNED

Component: JavaScript Engine → Nanojit

QA Contact: general → nanojit

Summary: TM: odd performance in comparison benchmark → nanojit: account for eight-byte alignment of stack in StackFilter

Nicholas Nethercote [inactive]

Assignee

Comment 14

•

14 years ago

I removed all the dependencies which were about speeding up js_StringToNumber;  this bug is really about StackFilter.

No longer depends on: 509857, 551109, 551110

Nicholas Nethercote [inactive]

Assignee

Comment 16

•

14 years ago

Attached patch NJ patch — Details — Splinter Review

The first thing to note is that TR does not use StackFilter at all.  So it
seems reasonable to bake some TM-specific assumptions into it.

With that in mind, this patch:

- Changes StackFilter so that it doesn't perform filtering on the RP stack.
  Turns out that this hardly achieves anything -- eg. only 3 of the
  SunSpider tests have their generated code changed by this.  And they only
  have a handful more stores.  But every test benefits slightly from
  StackFilter becoming slightly faster.

- Changes StackFilter to assume that the stack has 8-byte entries, and so in
  this sequence:

    stfi sp[0]
    ...
    sti sp[0]

  it will eliminate the first store even though the second doesn't clobber
  the high four bytes -- we know they're dead anyway.  An assertion
  checks that the sp offsets are always factors of eight.

  This change allows StackFilter to become faster yet, because we only have
  to track half as many entries (because of the 8-byte granularity instead
  of 4-byte) and we can treat both 4-byte and 8-byte stores the same.

- I inlined and removed ignoreStore() now that it's only called from one
  site.

There are some nice improvements, particularly in the following cases where
the improved StackFilter causes function calls to be removed:

                        speedup     instructions executed reduction
                        -------     -------------------------------
- crypto-md5:           4--5%       3.8%
- crypto-sha1:          3--5%       4.0%
- bitops-nsieve-bits:   8--11%      9.0%
- string-base64:        6--7%       4.3%
- v8-crypto:            3--6%       4.0%

The speedups are visible on both i386 and X64.  The instruction reductions
are measured on i386 only, but other platforms should be similar.

Some others may have smaller benefits, but those are hard to distinguish
from the usual noise.

The program in attachement 4 now gives similar numbers for both the '=='
loop and the '=== +' loop.

Not bad for a patch that ends up simplifying StackFilter!

Assignee: nobody → nnethercote

Attachment #431810 - Flags: review?(gal)

Nicholas Nethercote [inactive]

Assignee

Comment 17

•

14 years ago

Attached patch Simple TM patch — Details — Splinter Review

Attachment #431811 - Flags: review?(gal)

Nicholas Nethercote [inactive]

Assignee

Comment 18

•

14 years ago

Attached patch Simple TR patch — Details — Splinter Review

Attachment #431812 - Flags: review?(edwsmith)

Edwin Smith

Comment 19

•

14 years ago

Comment on attachment 431812 [details] [diff] [review]
Simple TR patch

Nice change.  I've been eyeing StackFilter, because the deadvars_kill() pass in CodegenLIR.cpp is essentially a reformulated version of StackFilter.  Your changes are probably compatible with TR, too, if we were to start using it.

Attachment #431812 - Flags: review?(edwsmith) → review+

Andreas Gal :gal

Comment 20

•

14 years ago

Comment on attachment 431810 [details] [diff] [review]
NJ patch

I am a bit sad that rp stack filtering is gone, but I guess we guard so much that there is almost no way to inline a method call without emitting at least one guard.

Attachment #431810 - Flags: review?(gal) → review+

Andreas Gal :gal

Updated

•

14 years ago

Attachment #431811 - Flags: review?(gal) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 21

•

14 years ago

(In reply to comment #20)
> (From update of attachment 431810 [details] [diff] [review])
> I am a bit sad that rp stack filtering is gone, but I guess we guard so much
> that there is almost no way to inline a method call without emitting at least
> one guard.

The RP stack filtering really did very little.  In particular, one thing I didn't mention above is that all stores to the RP stack are just stores of constants (FrameInfo pointers).  So if you eliminate an RP stack store you only eliminate that store.  Whereas with the SP stack the stores can be of complicated expressions which may include function calls, so eliminating an SP stack store can have an outsized benefit.

RP stack filtering can be added back in later if things change so that the run-time benefits outweigh the compile-time costs.

Nicholas Nethercote [inactive]

Assignee

Comment 22

•

14 years ago

http://hg.mozilla.org/projects/nanojit-central/rev/73a17dbefb34

Whiteboard: fixed-in-nanojit

Nicholas Nethercote [inactive]

Assignee

Comment 23

•

14 years ago

http://hg.mozilla.org/tracemonkey/rev/0c65023bff0f
http://hg.mozilla.org/tracemonkey/rev/adbdac887629

Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey

Nicholas Nethercote [inactive]

Assignee

Comment 24

•

14 years ago

I wonder if this is worth nominating for 1.9.3?  It's a small patch, IMO low-risk, and gives 5--10% speedups for several SunSpider and V8 tests.

Brendan Eich [:brendan]

Updated

•

14 years ago

status2.0: --- → ?

Robert Sayre

Updated

•

14 years ago

status2.0: ? → wanted

Edwin Smith

Comment 25

•

14 years ago

http://hg.mozilla.org/tamarin-redux/rev/2bbb335c34fe

Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin

Robert Sayre

Comment 26

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/0c65023bff0f

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: Core → Core Graveyard

You need to log in before you can comment on or make changes to this bug.