Closed Bug 551885 Opened 13 years ago Closed 12 years ago
TM: avoid stupid stack loads/stores at the end of fragments
When I run access-fannkuch.js with TMFLAGS=profile, I learn that the hottest block, 000046, which accounts for 33% of block entries, ends with this delightful code: ld16 = ld sp[-152] $stack2 = i2f ld16 sti sp[-152] = ld16 ld17 = ld sp[-128] $stack5 = i2f ld17 sti sp[-128] = ld17 ld18 = ld sp[-88] $stack10 = i2f ld18 sti sp[-88] = ld18 ld19 = ld sp[-80] $stack11 = i2f ld19 sti sp[-80] = ld19 sti sp[-72] = addxov1 ld20 = ld sp[-64] $stack13 = i2f ld20 sti sp[-64] = ld20 ld21 = ld sp[-48] $stack15 = i2f ld21 sti sp[-48] = ld21 sti sp[-32] = ld15 ld22 = ld sp[-8] $stack20 = i2f ld22 sti sp[-8] = ld22 j -> label1 At the end of the pipeline it looks like this: ld16 = ld sp[-152] sti sp[-152] = ld16 ld17 = ld sp[-128] sti sp[-128] = ld17 ld18 = ld sp[-88] sti sp[-88] = ld18 ld19 = ld sp[-80] sti sp[-80] = ld19 sti sp[-72] = addxov1 ld20 = ld sp[-64] sti sp[-64] = ld20 ld21 = ld sp[-48] sti sp[-48] = ld21 sti sp[-32] = ld15 ld22 = ld sp[-8] sti sp[-8] = ld22 j -> label1 The next few hottest blocks in fannkuch.js end with very similar sequences. It'd be nice to avoid these redundant loads and stores. If I had to guess I'd blame ImportBoxedStackSlotVisitor, but I don't really know. Some of those stack slots aren't even touched by the rest of the block.
Assignee: nnethercote → general
Status: ASSIGNED → NEW
ImportBoxedStackSlotVisitor shouldn't generate all that (I hope not!) What line(s) of code did this correspond to in fannkuch? Are we importing these slots in one huge batch (like at a tree call site)?
(In reply to comment #1) > ImportBoxedStackSlotVisitor shouldn't generate all that (I hope not!) I could be totally wrong about blaming it, I don't understand that stuff much at all. > What line(s) of code did this correspond to in fannkuch? The five hottest blocks all have code like that. They correspond to lines 34, 27, 32, 50, 46. I suspect just about every loop will look the same. > Are we importing these > slots in one huge batch (like at a tree call site)? No idea about that. If you want to see the code, run something like this: TMFLAGS=minimal,recorder,fragprofile debug32/js -j $SS/access-fannkuch.js The fragprofile stuff is at the bottom, use the FragID markers to navigate.
Holy crap! I was going over old bugs today and decided to look at this one again. The culprit is SlotMap::adjustType(). Turns out we load from a stack/global slot, do a d2i, and then immediately store back into it. It's got something to do with type stability that I don't fully understand. But if the loaded value itself comes from an i2d/ui2d, that cancels out the d2i and so we end up storing back the same value. The attached patch simply looks for this case and avoids the store. When this happens the load becomes dead and it's removed by Nanojit's DCE. These stupid instructions turn out to be really common, almost every progrma in SS/V8 is affected. On 32-bit Mac I see a 2.3% speedup for all of SunSpider, and 4.5% if you exclude the regexp and string tests. V8 gets about 3%. Instruction count improvements for all of SS are as follows: total 2,356,164,230 2,317,524,894 1.017x better on-trace 616,577,164 579,917,401 1.063x better Not bad for a 10 line patch :)
Assignee: general → nnethercote
Status: NEW → ASSIGNED
Attachment #453014 - Flags: review?(gal)
Full Cachegrind results: ------------------------------------------------------------------ instructions executed (nb: "on-trace" may overestimate slightly) ------------------------------------------------------------------ 3d-cube: total 127,562,453 125,316,011 1.018x better on-trace 25,355,526 23,185,247 1.094x better 3d-morph: total 66,700,088 63,283,207 1.054x better on-trace 25,516,652 22,144,236 1.152x better 3d-raytrace: total 142,218,300 141,482,814 1.005x better on-trace 25,694,660 25,535,120 1.006x better access-binary-trees: total 78,298,453 78,064,529 1.003x better on-trace 18,934,574 18,709,337 1.012x better access-fannkuch: total 192,025,263 180,772,877 1.062x better on-trace 115,868,184 105,072,844 1.103x better access-nbody: total 84,699,885 84,402,173 1.004x better on-trace 31,013,260 30,755,784 1.008x better access-nsieve: total 61,157,732 58,316,554 1.049x better on-trace 29,270,213 26,450,687 1.107x better bitops-3bit-bits-in-byte: total 16,183,512 15,795,755 1.025x better on-trace 3,372,932 2,988,877 1.128x better bitops-bits-in-byte: total 47,594,297 44,090,318 1.079x better on-trace 34,580,446 31,085,902 1.112x better bitops-bitwise-and: total 24,551,044 23,349,502 1.051x better on-trace 12,016,854 10,816,828 1.111x better bitops-nsieve-bits: total 90,488,512 84,978,832 1.065x better on-trace 47,729,371 42,249,979 1.130x better controlflow-recursive: total 44,322,686 43,464,923 1.020x better on-trace 28,337,181 27,505,221 1.030x better crypto-md5: total 49,130,970 48,918,426 1.004x better on-trace 5,808,596 5,637,685 1.030x better crypto-sha1: total 34,054,482 33,187,715 1.026x better on-trace 7,853,087 7,251,962 1.083x better date-format-tofte: total 135,816,270 135,603,460 1.002x better on-trace 11,562,104 11,351,070 1.019x better date-format-xparb: total 105,342,104 105,244,850 1.001x better on-trace 11,628,622 11,608,362 1.002x better math-cordic: total 55,781,955 55,424,599 1.006x better on-trace 30,688,747 30,338,640 1.012x better math-partial-sums: total 38,832,084 38,547,468 1.007x better on-trace 6,560,110 6,288,303 1.043x better math-spectral-norm: total 30,343,887 29,802,219 1.018x better on-trace 11,769,512 11,276,109 1.044x better regexp-dna: total 115,080,787 115,080,743 -- on-trace 75,041,535 75,041,535 -- string-base64: total 54,989,436 54,851,994 1.003x better on-trace 8,277,642 8,154,541 1.015x better string-fasta: total 132,064,626 129,052,269 1.023x better on-trace 29,493,488 26,545,259 1.111x better string-tagcloud: total 272,222,837 272,132,591 -- on-trace 6,673,511 6,592,119 1.012x better string-unpack-code: total 273,432,506 273,242,758 1.001x better on-trace 5,680,936 5,614,638 1.012x better string-validate-input: total 83,270,061 83,118,307 1.002x better on-trace 7,849,421 7,717,116 1.017x better ------- all: total 2,356,164,230 2,317,524,894 1.017x better on-trace 616,577,164 579,917,401 1.063x better
Summary: TM: stupid stack loads/stores for fannkuch.js → TM: avoid stupid stack loads/stores at the end of fragments
Comment on attachment 453014 [details] [diff] [review] patch njn++. Nice find.
Attachment #453014 - Flags: review?(gal) → review+
I love that the biggest speedup (1.27x) is for bitwise-and :)
It's worth pointing out that this doesn't just remove the obvious load/store pairs like those in comment 0. There are lots of cases where just a store is removed. For example, here's the code for bitwise-and.js; the lines marked with '*' are those that are removed: ebx = parami 0 ebx esi = parami 1 esi edi = parami 2 edi state = parami 0 ecx label1: sp = ldi.o state cx = ldi.o state eos = ldi.o state ldi3 = ldi.csro cx immi3 = immi 0 eqi1 = eqi ldi3, immi3/*0*/ xf2: xf eqi1 -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=001) ldi2 = ldi.o eos ldi1 = ldi.o eos andi1 = andi ldi2, ldi1 sti.o eos = andi1 immi2 = immi 1 addxovi1 = addxovi ldi1, immi2/*1*/ -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=002) sti.o eos = addxovi1 sti.s sp = addxovi1 immi1 = immi 600000 sti.s sp = immi1/*600000*/ lti1 = lti addxovi1, immi1/*600000*/ xf1: xf lti1 -> pc=...... imacpc=(nil) sp+16 rp+0 (GuardID=003) * sti.o eos = andi1 * sti.o eos = addxovi1 j -> label1 livei state x1: x -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=004) They just duplicate stores from earlier in the block.
njn+∞! TM wins still available is good news. We need 'em, along with JM. /be
Awesome find. The only way TypeCheck_Promote is returned is if the slot has a trivial conversion, i.e. isPromoteInt would return true. This patch works because any time we sink a promoteInt back, we d2i() it first. Therefore, do we need the d2i() call at all? Removing the whole |if| branch passes trace-tests for me.
This patch takes advantage of dvander's observation. It's better because it catches some cases the first patch didn't -- when the instruction is an integral immediate double. One thing I don't like is that it can result in different LIR being generated for debug and opt builds, because of the debug-only get() call. (The final LIR will be the same, however, because the extra LIR instructions generated in debug builds are dead.) The debug-only call to get() can also change other state, eg. that of the oracle. I could just remove the assertion for the TypeCheck_Promote case, but I don't want to do that. Can anyone see how to retain it without generating LIR and changing state?
Attachment #453259 - Flags: review?(dvander)
Comment on attachment 453259 [details] [diff] [review] alternative patch urgh, yeah, that's kind of nasty. Maybe getFromTracker, which won't cause an import? It can return NULL.
Attachment #453259 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/03876a6eba5a After discussion with dvander on IRC, the pushed version ended up calling getFromTracker() and not doing the check when it returns NULL.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.