Closed
Bug 552783
Opened 15 years ago
Closed 15 years ago
JM: PIC for GETPROP
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: dmandelin, Assigned: dmandelin)
References
Details
Attachments
(1 file, 4 obsolete files)
55.93 KB,
patch
|
Details | Diff | Splinter Review |
Implement a PIC in JägerMonkey for JSOP_GETPROP, i.e., JS |o.x|. The design will follow Self/WebKit/V8, summarized as follows:
Initially, generate asm like this:
guard that the base value is a JSObject, jump to slow otherwise
guard that shape == INVALID_SHAPE, jump to slow otherwise
read from slot INVALID_SLOT (in fslots) to eax
store_result:
store eax to stack
...
slow:
call PIC_GetProp_Slow, which returns the value
store eax to stack
jmp store_result
PIC_GetProp_Slow will analyze the property access (the base object type, proto depth, shape, etc) and see if it can generate a fast stub for this kind. If so, it generates the fast stub and patches the taken jump to slow to jump to the new stub instead. The new stub always has a guard and jumps to the original slow case if it fails.
Note: We will only start generating fast code the second time the the PIC slow case is called.
As a special case, if the PIC hasn't been patched yet, and the property is found on the base object itself (not a proto) without a getter, we patch the shape number and slot number in the initially generated fast case.
We will do fast paths for these cases:
property is string length
property is array length
property found on self
property found on proto chain
property found on self/proto chain with getter
Assignee | ||
Comment 1•15 years ago
|
||
This does PIC (actually, MIC) stuff for programs like this:
var o = { a: 55 };
var z;
for (var i = 0; i < 10000000; ++i) {
z = o.a;
}
Times for this test:
system time taken (ms)
interpreter 834
JM w/o PIC (w/ prop cache) 655 1.3x vs. interp
JM w/ PIC 285 2.9x vs. interp, 2.3x vs. w/o PIC
Comment 2•15 years ago
|
||
um, awesome.
Assignee | ||
Comment 3•15 years ago
|
||
mrbkap asked about TM on IRC, which I forgot to measure before. TM takes 146 ms. On an empty loop of the same size (i.e., this benchmark with the prop access cut out), JM takes 185 ms and TM 55 ms. Thus, the time for property access itself (also including the global store) is 100 ms for JM and 90 ms for TM. This makes sense; this particular MIC usage generates pretty much exactly the same code as TM.
Assignee | ||
Comment 4•15 years ago
|
||
This one generates up to 8 stubs for self properties without getters. It passes trace-tests.
Attachment #433434 -
Attachment is obsolete: true
Assignee | ||
Comment 5•15 years ago
|
||
Attachment #434044 -
Attachment is obsolete: true
Assignee | ||
Comment 6•15 years ago
|
||
Attachment #434047 -
Attachment is obsolete: true
Assignee | ||
Comment 7•15 years ago
|
||
Attachment #435038 -
Attachment is obsolete: true
Assignee | ||
Comment 8•15 years ago
|
||
I landed WIP 5 to the JM repo. Perf is a wash right now, at least on SS. That is because the PIC currently doesn't handle array lengths or prototypes, so you go through the slower PIC slow path (that tries to update the PIC, and doesn't use the property cache). I will be adding those next. I just wanted it to get landed to reduce the chance of conflicts going forward.
Assignee | ||
Comment 9•15 years ago
|
||
I landed the PIC for prototypes. It gets a 1.23x speedup on access-nbody, but a 1.36x slowdown on math-partial-sums. Either there is a bug, or I am running too much of a slow path for globals. I will look into that next.
Assignee | ||
Comment 10•15 years ago
|
||
I disabled the PIC for JSOP_GETXPROP, and now it works better. I'm not sure that's the right thing to do, though--it might disable it for too many things. Anyway, the point is that it shouldn't be trying to do things for globals. I'll have to keep working on it.
Assignee | ||
Comment 11•15 years ago
|
||
It was crashing on running the V8 benchmarks in SS. The problem was that the PIC stub format can vary according to whether the shape is available in a register on entry to the stub, but the code that patches the jump-to-slow out of the stub wasn't accounting for that. I fixed that and pushed it.
It's now turned on in JM x86, but it's not done yet--I need to handle getthisprop, grandprotos and such, and do a bunch of tuning and perf testing.
Assignee | ||
Comment 12•15 years ago
|
||
Here are the current results for SunSpider, JM w/o PIC vs. JM with PIC:
buildmonkey-right:SunSpider dmandelin$ ./sunspider-compare-results --shell=/Users/dmandelin/sources/jaegermonkey/js/src/opt/shell/js /tmp/ssj /tmp/ssp
TEST COMPARISON FROM TO DETAILS
=============================================================================
** TOTAL **: 1.035x as fast 1477.4ms +/- 0.2% 1427.6ms +/- 0.1% significant
=============================================================================
3d: 1.062x as fast 198.3ms +/- 0.3% 186.7ms +/- 0.2% significant
cube: 1.064x as fast 60.2ms +/- 0.5% 56.6ms +/- 0.7% significant
morph: 1.010x as fast 72.9ms +/- 0.3% 72.2ms +/- 0.4% significant
raytrace: 1.126x as fast 65.2ms +/- 0.5% 57.9ms +/- 0.4% significant
access: 1.118x as fast 244.5ms +/- 0.2% 218.6ms +/- 0.2% significant
binary-trees: 1.080x as fast 35.0ms +/- 1.0% 32.4ms +/- 1.1% significant
fannkuch: - 79.4ms +/- 0.5% 79.4ms +/- 0.5%
nbody: 1.29x as fast 103.4ms +/- 0.4% 80.0ms +/- 0.0% significant
nsieve: ?? 26.7ms +/- 1.3% 26.8ms +/- 1.1% not conclusive: might be *1.004x as slow*
bitops: 1.009x as fast 146.3ms +/- 0.5% 145.0ms +/- 0.3% significant
3bit-bits-in-byte: 1.037x as fast 16.7ms +/- 2.1% 16.1ms +/- 1.4% significant
bits-in-byte: - 29.0ms +/- 0.0% 29.0ms +/- 0.0%
bitwise-and: 1.019x as fast 54.1ms +/- 0.7% 53.1ms +/- 1.0% significant
nsieve-bits: ?? 46.5ms +/- 0.8% 46.8ms +/- 0.6% not conclusive: might be *1.006x as slow*
controlflow: *1.008x as slow* 24.0ms +/- 0.0% 24.2ms +/- 1.2% significant
recursive: *1.008x as slow* 24.0ms +/- 0.0% 24.2ms +/- 1.2% significant
crypto: ?? 96.8ms +/- 0.5% 97.1ms +/- 0.4% not conclusive: might be *1.003x as slow*
aes: *1.011x as slow* 36.9ms +/- 0.6% 37.3ms +/- 0.9% significant
md5: ?? 29.4ms +/- 1.3% 29.6ms +/- 1.2% not conclusive: might be *1.007x as slow*
sha1: - 30.5ms +/- 1.2% 30.2ms +/- 1.0%
date: - 106.4ms +/- 1.1% 106.0ms +/- 0.6%
format-tofte: - 45.8ms +/- 0.7% 45.6ms +/- 0.8%
format-xparb: - 60.6ms +/- 1.6% 60.4ms +/- 0.6%
math: 1.023x as fast 174.7ms +/- 0.5% 170.8ms +/- 0.4% significant
cordic: *1.002x as slow* 59.0ms +/- 0.0% 59.1ms +/- 0.4% significant
partial-sums: - 77.8ms +/- 0.8% 77.2ms +/- 0.7%
spectral-norm: 1.099x as fast 37.9ms +/- 0.6% 34.5ms +/- 1.1% significant
regexp: ?? 170.8ms +/- 0.2% 171.1ms +/- 0.1% not conclusive: might be *1.002x as slow*
dna: ?? 170.8ms +/- 0.2% 171.1ms +/- 0.1% not conclusive: might be *1.002x as slow*
string: 1.024x as fast 315.6ms +/- 0.3% 308.1ms +/- 0.2% significant
base64: - 28.9ms +/- 0.8% 28.6ms +/- 1.3%
fasta: 1.031x as fast 79.5ms +/- 0.5% 77.1ms +/- 0.3% significant
tagcloud: 1.047x as fast 81.8ms +/- 0.4% 78.1ms +/- 0.3% significant
unpack-code: ?? 86.1ms +/- 0.3% 86.4ms +/- 0.4% not conclusive: might be *1.003x as slow*
validate-input: 1.037x as fast 39.3ms +/- 0.9% 37.9ms +/- 0.6% significant
Assignee | ||
Comment 13•15 years ago
|
||
Analysis of results in the previous comment: Most of the benchmarks don't do enough |getprop| or |length| calls for a PIC to help. Keep in mind also that the baseline for that comparison runs with the property cache, so what we are really measuring is the benefit of jitting property cache operations.
* Benchmarks that benefit primarily from PIC for |getprop|
getprops run time saved time saved/getprop
3d-cube 26616 3.6 ms 135 ns = 298 cycles
3d-raytrace 522707 7.3 14 31
access-binary-trees 125409 2.6 18 39
access-nbody 1040659 23.4 23 50
(Almost all getprops were run on PIC fast paths for all of the above.)
* Benchmarks that benefit primarily from PIC for |length|.
lengths run time saved time saved/length
math-spectral-norm 129760 3.4 ms 26 ns = 58 cycles
string-fasta 87014 2.4 28 61
string-validate-input 30000 1.4 47 103
(Almost all getprops were run on PIC fast paths for all of the above.)
* Benchmarks that benefit from PIC for |length| and |getprop|.
ops run time saved time saved/op
string-tagcloud 136452 3.7 ms 27 ns = 60 cycles
(Almost all ops were run on PIC fast paths for all of the above.)
From the above, we can conclude that running with the PIC saves roughly 15-30 ns per |getprop| or |length| operation. Thus, we shouldn't expect to see a 1ms speedup unless there are at least 30k-60k |getprop| or |length| ops.
* Benchmarks that had many ops run on fast paths but didn't speed up.
crypto-md5 16092 length ops
crypto-sha1 10246 length ops
date-format-tofte 229500 length ops
The crypto benchmarks don't run enough ops for the benefit to really show up, so that makes sense. I don't know what's going on with tofte yet, I need to look into that.
* Benchmarks that had many ops run on slow paths even with the PIC.
date-format-xparb 19999 length ops
math-partial-sums 253952 getxprop ops
string-validate-input 12000 getprop ops
math-partial-sums is yet another buggy benchmark, that thinks that
var a1 = a2 = ... = a9 = 0.0;
inside a function declares 9 local variables. Instead, it uses mostly global variables. |x += e| for global or free variable |x| is implemented with |getxprop| in SM. I'm not running these with the PIC; I think we should solve free variables separately.
I'm not exactly sure why the other two miss in the PIC, and I will check into it. I doubt it matters for SS, as the number of ops is too small to yield a noticeable speedup, but it might matter for web code.
Another key thing to look at next is the amount of time taken to build, monitor, and patch PICs.
Assignee | ||
Comment 14•15 years ago
|
||
SunSpider PIC cost metrics for getprop+length:
static number of pics: 240
dynamic pic mgmt calls: 505
dynamic pic patchings: 259
time for all pic mgmt calls: 201 us = 0.4% of time saved by PICs
This doesn't count the time taken to set up the PIC in the first place, or to free the PIC memory when we are done. But those are probably much less than the time taken to patch a PIC, and almost certainly at no more than the time taken to patch, so I doubt it is more than 1% of the time saved, which is pretty much insignificant--it will be a long time before we care to spend nontrivial developer time to get a 0.5ms speedup on SunSpider.
But there is a defect in the current pic mgmt functions: they are O(N) where N is the number of pics in the script. That will have to be fixed before this works is right for v8-v4 (or the web): they have functions with 99 pics.
Comment 15•15 years ago
|
||
Comment on attachment 435046 [details] [diff] [review]
WIP 5 (rebased)
> enum { SHAPELESS = 0xffffffff };
>
>+ // For all objects |obj|, |obj->shape->map != INVALID_SHAPE|. This is positive
>+ // so that it must be a 32-bit value in x86 ASM.
>+ enum { INVALID_SHAPE = 0x8fffffff };
Prevailing style favors one enum with two enumerators.
>@@ -632,25 +635,23 @@ Compiler::Compile()
> END_CASE(JSOP_CALLELEM)
>
> BEGIN_CASE(JSOP_SETELEM)
> stubCall(jsl_SetElem);
> END_CASE(JSOP_SETELEM)
>
> BEGIN_CASE(JSOP_GETPROP)
> BEGIN_CASE(JSOP_GETXPROP)
>- {
>- uint32 index = fullAtomIndex(codePos);
>- masm.store32(Imm32(index), FrameAddress(ARG_OFFSET(NAMEOPS, index)));
>- stubCall(jsl_GetProp);
>- }
>+ if (!jsop_getprop(fullAtomIndex(codePos)))
>+ return NULL;
> END_CASE(JSOP_GETPROP)
>
> BEGIN_CASE(JSOP_LENGTH)
>- stubCall(jsl_Length);
>+ if (!jsop_length())
>+ return NULL;
> END_CASE(JSOP_LENGTH)
>
> BEGIN_CASE(JSOP_CALLPROP)
> {
> uint32 index = fullAtomIndex(codePos);
> masm.store32(Imm32(index), FrameAddress(ARG_OFFSET(NAMEOPS, index)));
> stubCall(jsl_CallProp);
> }
The added lines seem overindented by two spaces?
Sorry for the drive-by nits, keep patching!
/be
Assignee | ||
Comment 16•15 years ago
|
||
I'll pick up the nits. Here are the current results for the SunSpider v8-v4 suite:
# w/o PIC
buildmonkey-right:SunSpider dmandelin$ tail -n 12 /tmp/vvj
--------------------------------------------
Total: 14960.0ms +/- 1.2%
--------------------------------------------
v8: 14960.0ms +/- 1.2%
crypto: 1473.7ms +/- 0.2%
deltablue: 3611.7ms +/- 1.4%
earley-boyer: 1559.3ms +/- 0.2%
raytrace: 1474.0ms +/- 1.2%
regexp: 1093.0ms +/- 0.0%
richards: 4797.3ms +/- 3.2%
splay: 951.0ms +/- 0.7%
# w/ PIC
buildmonkey-right:SunSpider dmandelin$ tail -n 12 /tmp/vvp
--------------------------------------------
Total: 12740.3ms +/- 4.7%
--------------------------------------------
v8: 12740.3ms +/- 4.7%
crypto: 1436.0ms +/- 0.3%
deltablue: 2846.0ms +/- 1.1%
earley-boyer: 1383.3ms +/- 0.5%
raytrace: 1344.0ms +/- 1.1%
regexp: 1099.3ms +/- 0.1%
richards: 3761.3ms +/- 15.9%
splay: 870.3ms +/- 0.9%
That's 1.17x. PIC patching and generation time is 7ms here, about 0.3% of the time saved by the PICs. So that's still not very significant, but of course I still must get rid of the unnecessary O(N).
The PICs are actually handling everything in the v8 suite that generates a significant number of getprop/length ops, except for line 31 in v8-raytrace. That line reads a method-valued property (to use with |apply|). I think the getter relates to the read barrier for the function cloning avoidance optimization, but I'm not sure. Anyway, I do plan to do getters, but if this isn't a "real" getter, then presumably I need a special case for that as well.
Assignee | ||
Comment 17•15 years ago
|
||
I fixed up the possible O(N) blowup finding PICs by using binary search over a cutoff. The time taken to find PICs seems to be insignificant in the benchmarks (even with linear search), so this is just insurance.
Basic tradeoffs: the time taken to patch one PIC is seems to be about 800 ns. We then save about 20 ns per usage (vs. calling a stub that uses the prop cache), so it takes 40 calls to reach the breakeven point. Currently, we spend a total of 0.2ms patching PICs on SunSpider (out of a run time of >1000ms) and 5ms patching PICs on v8-v4 (out of a run time of >12,000ms) so we basically don't care for now.
Assignee | ||
Comment 18•15 years ago
|
||
I just pushed a patch to purge PICs when a shape-regenerating GC happens. It's kind of unfortunate that we have to do this, because it runs rarely so it will be hard to test. At least it's fairly simple.
At this point, I consider the PIC getprop to be "basically" done, in that it is correct assuming I haven't missed anything, and picks up almost all the perf wins for benchmarks. I will file followon bugs on all the other things that it needs to be truly done.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•