Closed Bug 552783 Opened 15 years ago Closed 15 years ago

JM: PIC for GETPROP

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: dmandelin)

References

Details

Attachments

(1 file, 4 obsolete files)

Implement a PIC in JägerMonkey for JSOP_GETPROP, i.e., JS |o.x|. The design will follow Self/WebKit/V8, summarized as follows: Initially, generate asm like this: guard that the base value is a JSObject, jump to slow otherwise guard that shape == INVALID_SHAPE, jump to slow otherwise read from slot INVALID_SLOT (in fslots) to eax store_result: store eax to stack ... slow: call PIC_GetProp_Slow, which returns the value store eax to stack jmp store_result PIC_GetProp_Slow will analyze the property access (the base object type, proto depth, shape, etc) and see if it can generate a fast stub for this kind. If so, it generates the fast stub and patches the taken jump to slow to jump to the new stub instead. The new stub always has a guard and jumps to the original slow case if it fails. Note: We will only start generating fast code the second time the the PIC slow case is called. As a special case, if the PIC hasn't been patched yet, and the property is found on the base object itself (not a proto) without a getter, we patch the shape number and slot number in the initially generated fast case. We will do fast paths for these cases: property is string length property is array length property found on self property found on proto chain property found on self/proto chain with getter
Attached patch WIP (obsolete) — Splinter Review
This does PIC (actually, MIC) stuff for programs like this: var o = { a: 55 }; var z; for (var i = 0; i < 10000000; ++i) { z = o.a; } Times for this test: system time taken (ms) interpreter 834 JM w/o PIC (w/ prop cache) 655 1.3x vs. interp JM w/ PIC 285 2.9x vs. interp, 2.3x vs. w/o PIC
um, awesome.
mrbkap asked about TM on IRC, which I forgot to measure before. TM takes 146 ms. On an empty loop of the same size (i.e., this benchmark with the prop access cut out), JM takes 185 ms and TM 55 ms. Thus, the time for property access itself (also including the global store) is 100 ms for JM and 90 ms for TM. This makes sense; this particular MIC usage generates pretty much exactly the same code as TM.
Attached patch WIP 2 (obsolete) — Splinter Review
This one generates up to 8 stubs for self properties without getters. It passes trace-tests.
Attachment #433434 - Attachment is obsolete: true
Attachment #434044 - Attachment is obsolete: true
Attachment #434047 - Attachment is obsolete: true
Attached patch WIP 5 (rebased)Splinter Review
Attachment #435038 - Attachment is obsolete: true
I landed WIP 5 to the JM repo. Perf is a wash right now, at least on SS. That is because the PIC currently doesn't handle array lengths or prototypes, so you go through the slower PIC slow path (that tries to update the PIC, and doesn't use the property cache). I will be adding those next. I just wanted it to get landed to reduce the chance of conflicts going forward.
I landed the PIC for prototypes. It gets a 1.23x speedup on access-nbody, but a 1.36x slowdown on math-partial-sums. Either there is a bug, or I am running too much of a slow path for globals. I will look into that next.
I disabled the PIC for JSOP_GETXPROP, and now it works better. I'm not sure that's the right thing to do, though--it might disable it for too many things. Anyway, the point is that it shouldn't be trying to do things for globals. I'll have to keep working on it.
It was crashing on running the V8 benchmarks in SS. The problem was that the PIC stub format can vary according to whether the shape is available in a register on entry to the stub, but the code that patches the jump-to-slow out of the stub wasn't accounting for that. I fixed that and pushed it. It's now turned on in JM x86, but it's not done yet--I need to handle getthisprop, grandprotos and such, and do a bunch of tuning and perf testing.
Here are the current results for SunSpider, JM w/o PIC vs. JM with PIC: buildmonkey-right:SunSpider dmandelin$ ./sunspider-compare-results --shell=/Users/dmandelin/sources/jaegermonkey/js/src/opt/shell/js /tmp/ssj /tmp/ssp TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: 1.035x as fast 1477.4ms +/- 0.2% 1427.6ms +/- 0.1% significant ============================================================================= 3d: 1.062x as fast 198.3ms +/- 0.3% 186.7ms +/- 0.2% significant cube: 1.064x as fast 60.2ms +/- 0.5% 56.6ms +/- 0.7% significant morph: 1.010x as fast 72.9ms +/- 0.3% 72.2ms +/- 0.4% significant raytrace: 1.126x as fast 65.2ms +/- 0.5% 57.9ms +/- 0.4% significant access: 1.118x as fast 244.5ms +/- 0.2% 218.6ms +/- 0.2% significant binary-trees: 1.080x as fast 35.0ms +/- 1.0% 32.4ms +/- 1.1% significant fannkuch: - 79.4ms +/- 0.5% 79.4ms +/- 0.5% nbody: 1.29x as fast 103.4ms +/- 0.4% 80.0ms +/- 0.0% significant nsieve: ?? 26.7ms +/- 1.3% 26.8ms +/- 1.1% not conclusive: might be *1.004x as slow* bitops: 1.009x as fast 146.3ms +/- 0.5% 145.0ms +/- 0.3% significant 3bit-bits-in-byte: 1.037x as fast 16.7ms +/- 2.1% 16.1ms +/- 1.4% significant bits-in-byte: - 29.0ms +/- 0.0% 29.0ms +/- 0.0% bitwise-and: 1.019x as fast 54.1ms +/- 0.7% 53.1ms +/- 1.0% significant nsieve-bits: ?? 46.5ms +/- 0.8% 46.8ms +/- 0.6% not conclusive: might be *1.006x as slow* controlflow: *1.008x as slow* 24.0ms +/- 0.0% 24.2ms +/- 1.2% significant recursive: *1.008x as slow* 24.0ms +/- 0.0% 24.2ms +/- 1.2% significant crypto: ?? 96.8ms +/- 0.5% 97.1ms +/- 0.4% not conclusive: might be *1.003x as slow* aes: *1.011x as slow* 36.9ms +/- 0.6% 37.3ms +/- 0.9% significant md5: ?? 29.4ms +/- 1.3% 29.6ms +/- 1.2% not conclusive: might be *1.007x as slow* sha1: - 30.5ms +/- 1.2% 30.2ms +/- 1.0% date: - 106.4ms +/- 1.1% 106.0ms +/- 0.6% format-tofte: - 45.8ms +/- 0.7% 45.6ms +/- 0.8% format-xparb: - 60.6ms +/- 1.6% 60.4ms +/- 0.6% math: 1.023x as fast 174.7ms +/- 0.5% 170.8ms +/- 0.4% significant cordic: *1.002x as slow* 59.0ms +/- 0.0% 59.1ms +/- 0.4% significant partial-sums: - 77.8ms +/- 0.8% 77.2ms +/- 0.7% spectral-norm: 1.099x as fast 37.9ms +/- 0.6% 34.5ms +/- 1.1% significant regexp: ?? 170.8ms +/- 0.2% 171.1ms +/- 0.1% not conclusive: might be *1.002x as slow* dna: ?? 170.8ms +/- 0.2% 171.1ms +/- 0.1% not conclusive: might be *1.002x as slow* string: 1.024x as fast 315.6ms +/- 0.3% 308.1ms +/- 0.2% significant base64: - 28.9ms +/- 0.8% 28.6ms +/- 1.3% fasta: 1.031x as fast 79.5ms +/- 0.5% 77.1ms +/- 0.3% significant tagcloud: 1.047x as fast 81.8ms +/- 0.4% 78.1ms +/- 0.3% significant unpack-code: ?? 86.1ms +/- 0.3% 86.4ms +/- 0.4% not conclusive: might be *1.003x as slow* validate-input: 1.037x as fast 39.3ms +/- 0.9% 37.9ms +/- 0.6% significant
Analysis of results in the previous comment: Most of the benchmarks don't do enough |getprop| or |length| calls for a PIC to help. Keep in mind also that the baseline for that comparison runs with the property cache, so what we are really measuring is the benefit of jitting property cache operations. * Benchmarks that benefit primarily from PIC for |getprop| getprops run time saved time saved/getprop 3d-cube 26616 3.6 ms 135 ns = 298 cycles 3d-raytrace 522707 7.3 14 31 access-binary-trees 125409 2.6 18 39 access-nbody 1040659 23.4 23 50 (Almost all getprops were run on PIC fast paths for all of the above.) * Benchmarks that benefit primarily from PIC for |length|. lengths run time saved time saved/length math-spectral-norm 129760 3.4 ms 26 ns = 58 cycles string-fasta 87014 2.4 28 61 string-validate-input 30000 1.4 47 103 (Almost all getprops were run on PIC fast paths for all of the above.) * Benchmarks that benefit from PIC for |length| and |getprop|. ops run time saved time saved/op string-tagcloud 136452 3.7 ms 27 ns = 60 cycles (Almost all ops were run on PIC fast paths for all of the above.) From the above, we can conclude that running with the PIC saves roughly 15-30 ns per |getprop| or |length| operation. Thus, we shouldn't expect to see a 1ms speedup unless there are at least 30k-60k |getprop| or |length| ops. * Benchmarks that had many ops run on fast paths but didn't speed up. crypto-md5 16092 length ops crypto-sha1 10246 length ops date-format-tofte 229500 length ops The crypto benchmarks don't run enough ops for the benefit to really show up, so that makes sense. I don't know what's going on with tofte yet, I need to look into that. * Benchmarks that had many ops run on slow paths even with the PIC. date-format-xparb 19999 length ops math-partial-sums 253952 getxprop ops string-validate-input 12000 getprop ops math-partial-sums is yet another buggy benchmark, that thinks that var a1 = a2 = ... = a9 = 0.0; inside a function declares 9 local variables. Instead, it uses mostly global variables. |x += e| for global or free variable |x| is implemented with |getxprop| in SM. I'm not running these with the PIC; I think we should solve free variables separately. I'm not exactly sure why the other two miss in the PIC, and I will check into it. I doubt it matters for SS, as the number of ops is too small to yield a noticeable speedup, but it might matter for web code. Another key thing to look at next is the amount of time taken to build, monitor, and patch PICs.
SunSpider PIC cost metrics for getprop+length: static number of pics: 240 dynamic pic mgmt calls: 505 dynamic pic patchings: 259 time for all pic mgmt calls: 201 us = 0.4% of time saved by PICs This doesn't count the time taken to set up the PIC in the first place, or to free the PIC memory when we are done. But those are probably much less than the time taken to patch a PIC, and almost certainly at no more than the time taken to patch, so I doubt it is more than 1% of the time saved, which is pretty much insignificant--it will be a long time before we care to spend nontrivial developer time to get a 0.5ms speedup on SunSpider. But there is a defect in the current pic mgmt functions: they are O(N) where N is the number of pics in the script. That will have to be fixed before this works is right for v8-v4 (or the web): they have functions with 99 pics.
Comment on attachment 435046 [details] [diff] [review] WIP 5 (rebased) > enum { SHAPELESS = 0xffffffff }; > >+ // For all objects |obj|, |obj->shape->map != INVALID_SHAPE|. This is positive >+ // so that it must be a 32-bit value in x86 ASM. >+ enum { INVALID_SHAPE = 0x8fffffff }; Prevailing style favors one enum with two enumerators. >@@ -632,25 +635,23 @@ Compiler::Compile() > END_CASE(JSOP_CALLELEM) > > BEGIN_CASE(JSOP_SETELEM) > stubCall(jsl_SetElem); > END_CASE(JSOP_SETELEM) > > BEGIN_CASE(JSOP_GETPROP) > BEGIN_CASE(JSOP_GETXPROP) >- { >- uint32 index = fullAtomIndex(codePos); >- masm.store32(Imm32(index), FrameAddress(ARG_OFFSET(NAMEOPS, index))); >- stubCall(jsl_GetProp); >- } >+ if (!jsop_getprop(fullAtomIndex(codePos))) >+ return NULL; > END_CASE(JSOP_GETPROP) > > BEGIN_CASE(JSOP_LENGTH) >- stubCall(jsl_Length); >+ if (!jsop_length()) >+ return NULL; > END_CASE(JSOP_LENGTH) > > BEGIN_CASE(JSOP_CALLPROP) > { > uint32 index = fullAtomIndex(codePos); > masm.store32(Imm32(index), FrameAddress(ARG_OFFSET(NAMEOPS, index))); > stubCall(jsl_CallProp); > } The added lines seem overindented by two spaces? Sorry for the drive-by nits, keep patching! /be
I'll pick up the nits. Here are the current results for the SunSpider v8-v4 suite: # w/o PIC buildmonkey-right:SunSpider dmandelin$ tail -n 12 /tmp/vvj -------------------------------------------- Total: 14960.0ms +/- 1.2% -------------------------------------------- v8: 14960.0ms +/- 1.2% crypto: 1473.7ms +/- 0.2% deltablue: 3611.7ms +/- 1.4% earley-boyer: 1559.3ms +/- 0.2% raytrace: 1474.0ms +/- 1.2% regexp: 1093.0ms +/- 0.0% richards: 4797.3ms +/- 3.2% splay: 951.0ms +/- 0.7% # w/ PIC buildmonkey-right:SunSpider dmandelin$ tail -n 12 /tmp/vvp -------------------------------------------- Total: 12740.3ms +/- 4.7% -------------------------------------------- v8: 12740.3ms +/- 4.7% crypto: 1436.0ms +/- 0.3% deltablue: 2846.0ms +/- 1.1% earley-boyer: 1383.3ms +/- 0.5% raytrace: 1344.0ms +/- 1.1% regexp: 1099.3ms +/- 0.1% richards: 3761.3ms +/- 15.9% splay: 870.3ms +/- 0.9% That's 1.17x. PIC patching and generation time is 7ms here, about 0.3% of the time saved by the PICs. So that's still not very significant, but of course I still must get rid of the unnecessary O(N). The PICs are actually handling everything in the v8 suite that generates a significant number of getprop/length ops, except for line 31 in v8-raytrace. That line reads a method-valued property (to use with |apply|). I think the getter relates to the read barrier for the function cloning avoidance optimization, but I'm not sure. Anyway, I do plan to do getters, but if this isn't a "real" getter, then presumably I need a special case for that as well.
I fixed up the possible O(N) blowup finding PICs by using binary search over a cutoff. The time taken to find PICs seems to be insignificant in the benchmarks (even with linear search), so this is just insurance. Basic tradeoffs: the time taken to patch one PIC is seems to be about 800 ns. We then save about 20 ns per usage (vs. calling a stub that uses the prop cache), so it takes 40 calls to reach the breakeven point. Currently, we spend a total of 0.2ms patching PICs on SunSpider (out of a run time of >1000ms) and 5ms patching PICs on v8-v4 (out of a run time of >12,000ms) so we basically don't care for now.
I just pushed a patch to purge PICs when a shape-regenerating GC happens. It's kind of unfortunate that we have to do this, because it runs rarely so it will be hard to test. At least it's fairly simple. At this point, I consider the PIC getprop to be "basically" done, in that it is correct assuming I haven't missed anything, and picks up almost all the perf wins for benchmarks. I will file followon bugs on all the other things that it needs to be truly done.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: