Closed
Bug 551039
Opened 14 years ago
Closed 14 years ago
nanojit: account for eight-byte alignment of stack in StackFilter
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(status2.0 wanted)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
status2.0 | --- | wanted |
People
(Reporter: dvander, Assigned: n.nethercote)
References
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)
Attachments
(7 files)
4.38 KB,
application/javascript
|
Details | |
1.36 KB,
application/javascript
|
Details | |
1.34 KB,
text/plain
|
Details | |
405 bytes,
text/plain
|
Details | |
6.78 KB,
patch
|
gal
:
review+
|
Details | Diff | Splinter Review |
655 bytes,
patch
|
gal
:
review+
|
Details | Diff | Splinter Review |
598 bytes,
patch
|
edwsmith
:
review+
|
Details | Diff | Splinter Review |
http://www.adequatelygood.com/2010/3/Performance-of-vs- An initial look seemed like js_StringToNumber was a bottleneck, but someone taking a look at this should not take that as fact.
Reporter | ||
Comment 1•14 years ago
|
||
Comment 2•14 years ago
|
||
I bet we optimize things away here somehow in the == case.
Comment 3•14 years ago
|
||
Comment 4•14 years ago
|
||
I'll bet that in the == cases we dead-code-eliminate the js_StringToNumber() call and we just fail to do that with the === case.
Comment 5•14 years ago
|
||
That seems to be what happens, indeed. The patch from bug 541288 says that the reason is that this: 0x81a2ec stfi sp[8] = js_StringToNumber1 is a statement. Of course there's a statement like that in the == case as well. Might be worth looking into what the difference between those cases is in terms of what StackFilter does. Sort of echoes of bug 536458, but perhaps not quite (my hacks for that bug seem to not work at first glance).
Comment 6•14 years ago
|
||
All percentages that follow are percentages of total testcase time. In the slow cases, 93% of the time is under js_StringToNumber. 5% is self time; the rest is under js_strtod. There 18% is self time; the rest is under JS_strtod. The time under JS_strtod is 20% self time, 25% PR_Lock, and 22% PR_Unlock. The locking is all self times in those functions. I blame LOCK_DTOA/UNLOCK/DTOA. Is there a reason those are prlocks and not thinlocks? One other interesting note: In Safari, these testcases a single-character string (like "0" or "1") seems to be about 2x faster than for a multi-character string (like "00" or "12"). Increasing the length more only affects the performance weakly. I bet they're using a lookup table or something for single-char strings. I wonder how often that's hit in practice, outside of benchmarks...
Comment 7•14 years ago
|
||
One other note: I tried switching from global-locking around dtoa stuff to using its built-in MULTIPLE_THREADS support, and while that speeds up this testcase, and doesn't affect |x + ""| for an integer x that I can tell, it does slow down |x + ""| for a floating-point x (12.35 in my case) significantly.
Comment 8•14 years ago
|
||
For what it's worth, Safari does in fact special-case single-char ASCII strings in their string-to-number code: http://trac.webkit.org/browser/trunk/JavaScriptCore/runtime/UString.cpp?rev=54843#L365
Comment 9•14 years ago
|
||
Which seems to be part of the fix for https://bugs.webkit.org/show_bug.cgi?id=20333 (aka "tune strings to sunspider"). We may want a separate bug on getting some measurements on this single-char string business...
Comment 10•14 years ago
|
||
The locking is covered by bug 509857, methinks. I filed bug 551109 on avoiding jsdtoa altogether for small-integer strings, and bug 551110 on possibly replicating the jsc 1-char special-casing. That leaves the original issue this bug was filed on, which is the discrepancy between a==b and a===+b.
Comment 11•14 years ago
|
||
Lets please nuke the stupid dtoa locking. We can use thread-local storage if necessary (JSThreadData).
Comment 12•14 years ago
|
||
(In reply to comment #11) > Lets please nuke the stupid dtoa locking. We can use thread-local storage if > necessary (JSThreadData). Hello, please to be reading dependency bug 509857. /be
Assignee | ||
Comment 13•14 years ago
|
||
AFAICT, the '==' vs '===' difference is a red herring. It's the '+' that's important. Here's the relevant code for the '==' case. Instructions that end up being removed are indented. 00024: 9 trace 00025: 10 getlocal 0 00028: 10 getlocal 1 00031: 10 eq ld2 = ld sp[-16] $stack4 = i2f ld2 sti sp[0] = ld2 $stack5 = ld sp[-8] sti sp[8] = $stack5 js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 ) 00032: 10 pop 00033: 9 bindname "i" 00036: 9 dup 00037: 9 getxprop "i" About to try emitting guard code for SideExit=0x9cb54d4 exitType=BRANCH feq1 = feq $stack4, js_StringToNumber1 sti sp[0] = feq1 obj = int -143450112 sti sp[0] = obj sti sp[8] = obj globalObj = int -143450112 map = ld globalObj[0] globalObj = int -143450112 guard_global = eq globalObj, globalObj xf2: xf guard_global -> pc=0x9cad04d imacpc=(nil) sp+16 rp+0 (GuardID=002) Likewise for the '=== +' case: 00024: 17 trace 00025: 18 getlocal 0 00028: 18 getlocal 1 00031: 18 pos ld2 = ld sp[-16] $stack4 = i2f ld2 sti sp[0] = ld2 $stack5 = ld sp[-8] sti sp[8] = $stack5 js_StringToNumber1 = fcall #js_StringToNumber ( cx $stack5 ) 00032: 18 stricteq 00033: 18 pop 00034: 17 bindname "i" 00037: 17 dup 00038: 17 getxprop "i" About to try emitting guard code for SideExit=0x9cb584c exitType=BRANCH stfi sp[8] = js_StringToNumber1 # !!! feq1 = feq $stack4, js_StringToNumber1 sti sp[0] = feq1 obj = int -143450112 sti sp[0] = obj sti sp[8] = obj globalObj = int -143450112 map = ld globalObj[0] globalObj = int -143450112 guard_global = eq globalObj, globalObj xf2: xf guard_global -> pc=0x9cad926 imacpc=(nil) sp+16 rp+0 (GuardID=002) The difference is the instruction marked '!!!'. It occurs because the 'pos' requires the result to be pushed onto the stack. As we saw with bug 536458 the StackFilter cannot remove this 'stfi' because only half of it is subsequently overwritten. In this case that keeps the js_StringToNumber call alive. Here's my suggestion: I could add a boolean flag to StackFilter: 'eightByteAligned'. When it is set, then cases like this could be optimised away. Preliminary investigations are promising, eg. crypto-sha1 is 5% faster... in general lots of stack stores are removed, but also a few functions calls (mostly js_UnboxDouble, js_String_p_charCodeAt, math_sqrt_tn). I would also add some assertions to check that all the sp offsets are 8-aligned. BTW, I assume that the RPstack is *not* 8-aligned but word-aligned?
Assignee: general → nobody
Status: NEW → ASSIGNED
Component: JavaScript Engine → Nanojit
QA Contact: general → nanojit
Summary: TM: odd performance in comparison benchmark → nanojit: account for eight-byte alignment of stack in StackFilter
Assignee | ||
Comment 14•14 years ago
|
||
I removed all the dependencies which were about speeding up js_StringToNumber; this bug is really about StackFilter.
Assignee | ||
Comment 16•14 years ago
|
||
The first thing to note is that TR does not use StackFilter at all. So it seems reasonable to bake some TM-specific assumptions into it. With that in mind, this patch: - Changes StackFilter so that it doesn't perform filtering on the RP stack. Turns out that this hardly achieves anything -- eg. only 3 of the SunSpider tests have their generated code changed by this. And they only have a handful more stores. But every test benefits slightly from StackFilter becoming slightly faster. - Changes StackFilter to assume that the stack has 8-byte entries, and so in this sequence: stfi sp[0] ... sti sp[0] it will eliminate the first store even though the second doesn't clobber the high four bytes -- we know they're dead anyway. An assertion checks that the sp offsets are always factors of eight. This change allows StackFilter to become faster yet, because we only have to track half as many entries (because of the 8-byte granularity instead of 4-byte) and we can treat both 4-byte and 8-byte stores the same. - I inlined and removed ignoreStore() now that it's only called from one site. There are some nice improvements, particularly in the following cases where the improved StackFilter causes function calls to be removed: speedup instructions executed reduction ------- ------------------------------- - crypto-md5: 4--5% 3.8% - crypto-sha1: 3--5% 4.0% - bitops-nsieve-bits: 8--11% 9.0% - string-base64: 6--7% 4.3% - v8-crypto: 3--6% 4.0% The speedups are visible on both i386 and X64. The instruction reductions are measured on i386 only, but other platforms should be similar. Some others may have smaller benefits, but those are hard to distinguish from the usual noise. The program in attachement 4 now gives similar numbers for both the '==' loop and the '=== +' loop. Not bad for a patch that ends up simplifying StackFilter!
Assignee: nobody → nnethercote
Attachment #431810 -
Flags: review?(gal)
Assignee | ||
Comment 17•14 years ago
|
||
Attachment #431811 -
Flags: review?(gal)
Assignee | ||
Comment 18•14 years ago
|
||
Attachment #431812 -
Flags: review?(edwsmith)
Comment 19•14 years ago
|
||
Comment on attachment 431812 [details] [diff] [review] Simple TR patch Nice change. I've been eyeing StackFilter, because the deadvars_kill() pass in CodegenLIR.cpp is essentially a reformulated version of StackFilter. Your changes are probably compatible with TR, too, if we were to start using it.
Attachment #431812 -
Flags: review?(edwsmith) → review+
Comment 20•14 years ago
|
||
Comment on attachment 431810 [details] [diff] [review] NJ patch I am a bit sad that rp stack filtering is gone, but I guess we guard so much that there is almost no way to inline a method call without emitting at least one guard.
Attachment #431810 -
Flags: review?(gal) → review+
Updated•14 years ago
|
Attachment #431811 -
Flags: review?(gal) → review+
Assignee | ||
Comment 21•14 years ago
|
||
(In reply to comment #20) > (From update of attachment 431810 [details] [diff] [review]) > I am a bit sad that rp stack filtering is gone, but I guess we guard so much > that there is almost no way to inline a method call without emitting at least > one guard. The RP stack filtering really did very little. In particular, one thing I didn't mention above is that all stores to the RP stack are just stores of constants (FrameInfo pointers). So if you eliminate an RP stack store you only eliminate that store. Whereas with the SP stack the stores can be of complicated expressions which may include function calls, so eliminating an SP stack store can have an outsized benefit. RP stack filtering can be added back in later if things change so that the run-time benefits outweigh the compile-time costs.
Assignee | ||
Comment 22•14 years ago
|
||
http://hg.mozilla.org/projects/nanojit-central/rev/73a17dbefb34
Whiteboard: fixed-in-nanojit
Assignee | ||
Comment 23•14 years ago
|
||
http://hg.mozilla.org/tracemonkey/rev/0c65023bff0f http://hg.mozilla.org/tracemonkey/rev/adbdac887629
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey
Assignee | ||
Comment 24•14 years ago
|
||
I wonder if this is worth nominating for 1.9.3? It's a small patch, IMO low-risk, and gives 5--10% speedups for several SunSpider and V8 tests.
Updated•14 years ago
|
Updated•14 years ago
|
Comment 25•14 years ago
|
||
http://hg.mozilla.org/tamarin-redux/rev/2bbb335c34fe
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Comment 26•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/0c65023bff0f
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•