Closed Bug 551885 Opened 14 years ago Closed 14 years ago

TM: avoid stupid stack loads/stores at the end of fragments

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files)

When I run access-fannkuch.js with TMFLAGS=profile, I learn that the hottest block, 000046, which accounts for 33% of block entries, ends with this delightful code:

    ld16 = ld sp[-152]
    $stack2 = i2f ld16
    sti sp[-152] = ld16
    ld17 = ld sp[-128]
    $stack5 = i2f ld17
    sti sp[-128] = ld17
    ld18 = ld sp[-88]
    $stack10 = i2f ld18
    sti sp[-88] = ld18
    ld19 = ld sp[-80]
    $stack11 = i2f ld19
    sti sp[-80] = ld19
    sti sp[-72] = addxov1
    ld20 = ld sp[-64]
    $stack13 = i2f ld20
    sti sp[-64] = ld20
    ld21 = ld sp[-48]
    $stack15 = i2f ld21
    sti sp[-48] = ld21
    sti sp[-32] = ld15
    ld22 = ld sp[-8]
    $stack20 = i2f ld22
    sti sp[-8] = ld22
    j -> label1

At the end of the pipeline it looks like this:

    ld16 = ld sp[-152]
    sti sp[-152] = ld16        
    ld17 = ld sp[-128]         
    sti sp[-128] = ld17        
    ld18 = ld sp[-88] 
    sti sp[-88] = ld18
    ld19 = ld sp[-80]          
    sti sp[-80] = ld19         
    sti sp[-72] = addxov1      
    ld20 = ld sp[-64]          
    sti sp[-64] = ld20
    ld21 = ld sp[-48]
    sti sp[-48] = ld21
    sti sp[-32] = ld15
    ld22 = ld sp[-8]
    sti sp[-8] = ld22
    j -> label1

The next few hottest blocks in fannkuch.js end with very similar sequences.

It'd be nice to avoid these redundant loads and stores.  If I had to guess I'd blame ImportBoxedStackSlotVisitor, but I don't really know.  Some of those stack slots aren't even touched by the rest of the block.
Assignee: nnethercote → general
Status: ASSIGNED → NEW
ImportBoxedStackSlotVisitor shouldn't generate all that (I hope not!)

What line(s) of code did this correspond to in fannkuch? Are we importing these slots in one huge batch (like at a tree call site)?
(In reply to comment #1)
> ImportBoxedStackSlotVisitor shouldn't generate all that (I hope not!)

I could be totally wrong about blaming it, I don't understand that stuff much at all.

> What line(s) of code did this correspond to in fannkuch?

The five hottest blocks all have code like that.  They correspond to lines 34, 27, 32, 50, 46.  I suspect just about every loop will look the same.

> Are we importing these
> slots in one huge batch (like at a tree call site)?

No idea about that.  If you want to see the code, run something like this:

  TMFLAGS=minimal,recorder,fragprofile debug32/js -j $SS/access-fannkuch.js

The fragprofile stuff is at the bottom, use the FragID markers to navigate.
Attached patch patchSplinter Review
Holy crap!  I was going over old bugs today and decided to look at this one again.  The culprit is SlotMap::adjustType().  Turns out we load from a stack/global slot, do a d2i, and then immediately store back into it.  It's got something to do with type stability that I don't fully understand.

But if the loaded value itself comes from an i2d/ui2d, that cancels out the d2i and so we end up storing back the same value.

The attached patch simply looks for this case and avoids the store.  When this happens the load becomes dead and it's removed by Nanojit's DCE.

These stupid instructions turn out to be really common, almost every progrma in SS/V8 is affected.  On 32-bit Mac I see a 2.3% speedup for all of SunSpider, and 4.5% if you exclude the regexp and string tests.  V8 gets about 3%.  Instruction count improvements for all of SS are as follows:

  total       2,356,164,230   2,317,524,894     1.017x better
  on-trace      616,577,164     579,917,401     1.063x better

Not bad for a 10 line patch :)
Assignee: general → nnethercote
Status: NEW → ASSIGNED
Attachment #453014 - Flags: review?(gal)
Full Cachegrind results:

------------------------------------------------------------------
instructions executed (nb: "on-trace" may overestimate slightly)
------------------------------------------------------------------
3d-cube:
  total         127,562,453     125,316,011     1.018x better
  on-trace       25,355,526      23,185,247     1.094x better
3d-morph:
  total          66,700,088      63,283,207     1.054x better
  on-trace       25,516,652      22,144,236     1.152x better
3d-raytrace:
  total         142,218,300     141,482,814     1.005x better
  on-trace       25,694,660      25,535,120     1.006x better
access-binary-trees:
  total          78,298,453      78,064,529     1.003x better
  on-trace       18,934,574      18,709,337     1.012x better
access-fannkuch:
  total         192,025,263     180,772,877     1.062x better
  on-trace      115,868,184     105,072,844     1.103x better
access-nbody:
  total          84,699,885      84,402,173     1.004x better
  on-trace       31,013,260      30,755,784     1.008x better
access-nsieve:
  total          61,157,732      58,316,554     1.049x better
  on-trace       29,270,213      26,450,687     1.107x better
bitops-3bit-bits-in-byte:
  total          16,183,512      15,795,755     1.025x better
  on-trace        3,372,932       2,988,877     1.128x better
bitops-bits-in-byte:
  total          47,594,297      44,090,318     1.079x better
  on-trace       34,580,446      31,085,902     1.112x better
bitops-bitwise-and:
  total          24,551,044      23,349,502     1.051x better
  on-trace       12,016,854      10,816,828     1.111x better
bitops-nsieve-bits:
  total          90,488,512      84,978,832     1.065x better
  on-trace       47,729,371      42,249,979     1.130x better
controlflow-recursive:
  total          44,322,686      43,464,923     1.020x better
  on-trace       28,337,181      27,505,221     1.030x better
crypto-md5:
  total          49,130,970      48,918,426     1.004x better
  on-trace        5,808,596       5,637,685     1.030x better
crypto-sha1:
  total          34,054,482      33,187,715     1.026x better
  on-trace        7,853,087       7,251,962     1.083x better
date-format-tofte:
  total         135,816,270     135,603,460     1.002x better
  on-trace       11,562,104      11,351,070     1.019x better
date-format-xparb:
  total         105,342,104     105,244,850     1.001x better
  on-trace       11,628,622      11,608,362     1.002x better
math-cordic:
  total          55,781,955      55,424,599     1.006x better
  on-trace       30,688,747      30,338,640     1.012x better
math-partial-sums:
  total          38,832,084      38,547,468     1.007x better
  on-trace        6,560,110       6,288,303     1.043x better
math-spectral-norm:
  total          30,343,887      29,802,219     1.018x better
  on-trace       11,769,512      11,276,109     1.044x better
regexp-dna:
  total         115,080,787     115,080,743     -- 
  on-trace       75,041,535      75,041,535     -- 
string-base64:
  total          54,989,436      54,851,994     1.003x better
  on-trace        8,277,642       8,154,541     1.015x better
string-fasta:
  total         132,064,626     129,052,269     1.023x better
  on-trace       29,493,488      26,545,259     1.111x better
string-tagcloud:
  total         272,222,837     272,132,591     -- 
  on-trace        6,673,511       6,592,119     1.012x better
string-unpack-code:
  total         273,432,506     273,242,758     1.001x better
  on-trace        5,680,936       5,614,638     1.012x better
string-validate-input:
  total          83,270,061      83,118,307     1.002x better
  on-trace        7,849,421       7,717,116     1.017x better
-------
all:
  total       2,356,164,230   2,317,524,894     1.017x better
  on-trace      616,577,164     579,917,401     1.063x better
Summary: TM: stupid stack loads/stores for fannkuch.js → TM: avoid stupid stack loads/stores at the end of fragments
Comment on attachment 453014 [details] [diff] [review]
patch

njn++. Nice find.
Attachment #453014 - Flags: review?(gal) → review+
Attached file SS results
I love that the biggest speedup (1.27x) is for bitwise-and :)
It's worth pointing out that this doesn't just remove the obvious load/store pairs like those in comment 0.  There are lots of cases where just a store is removed.  For example, here's the code for bitwise-and.js;  the lines marked with '*' are those that are removed:

      ebx = parami 0 ebx
      esi = parami 1 esi
      edi = parami 2 edi
      state = parami 0 ecx
      label1:
      sp = ldi.o state[8]
      cx = ldi.o state[0]
      eos = ldi.o state[12]
      ldi3 = ldi.csro cx[0]
      immi3 = immi 0
      eqi1 = eqi ldi3, immi3/*0*/
      xf2: xf eqi1 -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=001)
      ldi2 = ldi.o eos[1376]
      ldi1 = ldi.o eos[1368]
      andi1 = andi ldi2, ldi1
      sti.o eos[1376] = andi1
      immi2 = immi 1
      addxovi1 = addxovi ldi1, immi2/*1*/ -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=002)
      sti.o eos[1368] = addxovi1
      sti.s sp[0] = addxovi1
      immi1 = immi 600000
      sti.s sp[8] = immi1/*600000*/
      lti1 = lti addxovi1, immi1/*600000*/
      xf1: xf lti1 -> pc=...... imacpc=(nil) sp+16 rp+0 (GuardID=003)
*     sti.o eos[1376] = andi1
*     sti.o eos[1368] = addxovi1
      j -> label1
      livei state
      x1: x  -> pc=...... imacpc=(nil) sp+0 rp+0 (GuardID=004)

They just duplicate stores from earlier in the block.
Blocks: 536630
njn+∞!

TM wins still available is good news. We need 'em, along with JM.

/be
Awesome find.

The only way TypeCheck_Promote is returned is if the slot has a trivial conversion, i.e. isPromoteInt would return true. This patch works because any time we sink a promoteInt back, we d2i() it first. Therefore, do we need the d2i() call at all? Removing the whole |if| branch passes trace-tests for me.
This patch takes advantage of dvander's observation.  It's better because it catches some cases the first patch didn't -- when the instruction is an integral immediate double.

One thing I don't like is that it can result in different LIR being generated for debug and opt builds, because of the debug-only get() call.  (The final LIR will be the same, however, because the extra LIR instructions generated in debug builds are dead.)  The debug-only call to get() can also change other state, eg. that of the oracle.

I could just remove the assertion for the TypeCheck_Promote case, but I don't want to do that.  Can anyone see how to retain it without generating LIR and changing state?
Attachment #453259 - Flags: review?(dvander)
Comment on attachment 453259 [details] [diff] [review]
alternative patch

urgh, yeah, that's kind of nasty. Maybe getFromTracker, which won't cause an import?  It can return NULL.
Attachment #453259 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/03876a6eba5a

After discussion with dvander on IRC, the pushed version ended up calling getFromTracker() and not doing the check when it returns NULL.
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/03876a6eba5a
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.