Bug 784808 (Closed) -- Opened 12 years ago -- Closed 12 years ago

Remove isInline GC size class check

Categories: Core :: JavaScript Engine (defect)
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla18
Reporter: terrence
Assignee: terrence
Attachments: 2 files

Attached patch v0 (Splinter Review)
We depend on a size class check in JSString::isShort because a JSShortString can have its chars() point into the middle of its inline buffer.  This is to support Int32ToString.  Considering that this path already has to do a ton of division and an allocation, just copying the <10 chars to the front of the buffer is provably trivial.  Getting rid of the size class check allows us to inline the isInline check, paving the way for a fast ensureNotInline and removal of getAllocKind as a requirement for string placement in arenas.
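
The payoff is that isInline can then be a single pointer comparison against the inline buffer, along these lines (a sketch of the shape, not the patch verbatim):

  bool
  JSString::isInline() const
  {
      /* Sketch, not the patch's code: with chars() always at the front of
       * the inline buffer, no GC size-class lookup is needed, just a
       * pointer comparison. */
      return d.u1.chars == d.inlineStorage;
  }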

These benchmarks are Bill's modified v8: they run for a fixed number of iterations, rather than a fixed time.  I ran them on Bill's stably clocked "shookfoil" machine.  Each benchmark is run six times sequentially from a shell script.  A post-processing step computes the mean and standard deviation for each test from the before and after sets, then reports each test's time difference along with the significance threshold.  Any run within half a standard deviation of significance is flagged with a star (*).

Runs |Clean1| and |Clean2| are against a clean mozilla-inbound tree with tip at 0e73660050a4. |Patched-Loop| uses a loop to copy and |Patched-Memmove| uses a memmove to copy.

Clean1 -> Patched-Loop:
      crypto -3.333333ms +/- 20.814076ms user 
   deltablue 16.666667ms +/- 43.655134ms user 
earley-boyer 15.000000ms +/- 49.403583ms user 
    raytrace 5.000000ms +/- 31.898951ms user 
      regexp 10.000000ms +/- 38.124375ms user 
    richards 5.000000ms +/- 14.995899ms user 
       splay -3.333333ms +/- 39.510034ms user 

Clean2 -> Patched-Memmove:
      crypto -18.333333ms +/- 25.416270ms user 
   deltablue 31.666667ms +/- 45.063557ms user 
earley-boyer -26.666667ms +/- 39.472899ms user 
    raytrace 15.000000ms +/- 38.026889ms user 
      regexp -13.333333ms +/- 34.914110ms user 
    richards 3.333333ms +/- 13.004952ms user 
       splay 1.666667ms +/- 33.553946ms user 

Patched-Loop -> Patched-Memmove:
      crypto -6.666667ms +/- 26.053510ms user 
   deltablue 11.666667ms +/- 24.696532ms user 
earley-boyer -20.000000ms +/- 57.363729ms user 
    raytrace 11.666667ms +/- 33.784738ms user 
      regexp -11.666667ms +/- 47.636283ms user 
    richards 1.666667ms +/- 12.691704ms user 
       splay 15.000000ms +/- 33.553946ms user

Clean1 -> Clean2:
      crypto 8.333333ms +/- 20.176837ms user 
   deltablue -3.333333ms +/- 64.022159ms user 
earley-boyer 21.666667ms +/- 31.512753ms user 
    raytrace 1.666667ms +/- 36.141103ms user 
      regexp 11.666667ms +/- 25.402203ms user 
    richards 3.333333ms +/- 15.309146ms user 
       splay 10.000000ms +/- 39.510034ms user 

I also ran SunSpider on shookfoil.  The results look similar -- 1.007x was stable across all runs:
** TOTAL **: *1.007x as slow*  2206.2ms +/- 0.1%   2221.9ms +/- 0.1%  significant

Is this too much of a slowdown to take?
Attachment #654364 - Flags: review?(luke)
Comment on attachment 654364 [details] [diff] [review]
v0

Review of attachment 654364 [details] [diff] [review]:
-----------------------------------------------------------------

Nice solution to the set of constraints.

> just copying the <10 chars to the front of the buffer is provably trivial

Uh, it's provably O(1).  I don't know what "trivial" means in this context.  Perhaps "measurably insignificant", except that your measurements do show a change.  On that subject, what sunspider are you running that takes 2205ms?  Is this the v8 one that runs SS in a loop?  Do you get the same regression every time you run?  What about normal SS?  For the v8 tests, could you summarize; the test splat is hard to understand: did anything get slower?

Lastly, what difference do you get for the micro-bench:

  for (var i = 0; i < 10000000; ++i)
    ""+i;

::: js/src/jsnum.cpp
@@ +549,5 @@
>          *--start = '-';
>  
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;

Since the end result is copied anyway, there is no need for inlineStorageBeforeInit; you can just use a local array (of size UINT32_CHAR_BUFFER_LENGTH) and then write:
  jschar *dst = str->init(end - start);
  PodCopy(dst, start.get(), end - start);
(this will allow inlineStorageBeforeInit to be totally removed.)

@@ +1313,5 @@
>      RangedPtr<jschar> start = BackfillIndexInCharBuffer(index, end);
>  
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;

ditto

::: js/src/vm/String.h
@@ +407,5 @@
>  
>      static inline js::ThingRootKind rootKind() { return js::THING_ROOT_STRING; }
>  
>  #ifdef DEBUG
> +    bool isShortAllocKind() const;

can you keep this named isShort()?
(In reply to Luke Wagner [:luke] from comment #1)
> Nice solution to the set of constraints.
> 
> > just copying the <10 chars to the front of the buffer is provably trivial
> 
> Uh, it's provably O(1).  I don't know what "trivial" means in this context. 
> Perhaps "measurably insignificant", except that your measurements do show a
> change.

Captain Malapropism strikes again!  s/provably/probably/  Sorry about the confusion.

> Lastly, what difference do you get for the micro-bench:
> 
>   for (var i = 0; i < 10000000; ++i)
>     ""+i;

Before: 0.50 sec
After:  0.59 sec (Down from 0.61 sec for the attached patch with some tweaking.)
Attached patch v1 (Splinter Review)
(In reply to Luke Wagner [:luke] from comment #1)

I hacked on this some more today to see if I could get the time down any by making Int32ToString more efficient.  What I learned is that moving even a handful of bytes around in memory is significantly more expensive than taking a few local branches... usually.

The strategy that seems to work best is to write the number where it needs to be on the first pass (as before).  Unlike before, we need to find where to put the number in the buffer first.  We do this by precomputing the base-10 digit count for the number, as follows (see the sketch after the list):
  1) Loop to find the topmost bit of the number.
  2) Branch into a switch table for the closest number of digits for that power-of-two range of numbers.
  3) Adjust from the base by adding the result of a comparison against the largest base-10 number of that size.

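A sketch of the counting helper (hypothetical names, with a multiply-shift approximation standing in for the switch table; the patch's actual code differs in detail):

  /* Hypothetical sketch, not the patch's code: count base-10 digits of u.
   * Assumes u > 0; the sign and zero cases are handled elsewhere. */
  static unsigned
  CountBase10Digits(uint32_t u)
  {
      /* 1) Loop to find the topmost set bit. */
      unsigned log2 = 0;
      for (uint32_t t = u; t >>= 1; )
          log2++;

      /* 2) log10(2) ~= 1233/4096, so this approximates the digit count
       *    for the power-of-two range; it plays the switch table's role. */
      static const uint32_t powersOf10[] = {
          1, 10, 100, 1000, 10000, 100000, 1000000,
          10000000, 100000000, 1000000000
      };
      unsigned t10 = ((log2 + 1) * 1233) >> 12;

      /* 3) Adjust by comparing against the smallest base-10 number with
       *    t10 + 1 digits (i.e. one past the largest with t10 digits). */
      return t10 + 1 - (u < powersOf10[t10]);
  }
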
This got us most of the way back to parity, until I pulled.  The clean runs got slower by 10-20ms, but the new code got 30-40ms slower, losing most of my gains.  G++ appears to have started generating different assembly after rebasing.  Need to look into this more.

> Nice solution to the set of constraints.

Bill's idea, he deserves all the credit for this one.
 
> On that subject, what sunspider are you running that takes 2205ms? 
> Is this the v8 one that runs SS in a loop?  Do you get the same regression
> every time you run?  What about normal SS?

Good catch!  It's the normal SS, but I had the js args in the wrong spot, so it was running in the interpreter.  I am running sunspider now by passing arguments |--args "-m -n" --runs 100|.  The js binary is cross-compiled to x86 using gcc-4.5.3 with -m32 and --target=i686-pc-linux and run on Bill's shookfoil machine.
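
The full command line is roughly this (harness path elided):

  ./sunspider --shell=<path-to-js> --args="-m -n" --runs=100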

I ran sunspider 4 times and got the following results:

TEST          COMPARISON            FROM                 TO             DETAILS
=============================================================================

** TOTAL **: 1.002x as fast   215.8ms +/- 0.0%  215.4ms +/- 0.0% significant
** TOTAL **: -                215.9ms +/- 0.0%  215.8ms +/- 0.0%
** TOTAL **: *1.000x as slow* 215.8ms +/- 0.0%  215.9ms +/- 0.0% significant
** TOTAL **: 1.001x as fast   215.9ms +/- 0.0%  215.6ms +/- 0.0% significant

=============================================================================

The detailed results from the first run look like this:

  3d:                  1.003x as fast     35.1ms +/- 0.0%    34.9ms +/- 0.0%     significant
    cube:              1.007x as fast     13.2ms +/- 0.0%    13.1ms +/- 0.0%     significant
    morph:             *1.004x as slow*    6.9ms +/- 0.0%     6.9ms +/- 0.1%     significant
    raytrace:          1.003x as fast     15.0ms +/- 0.0%    14.9ms +/- 0.0%     significant

  access:              1.003x as fast     18.3ms +/- 0.0%    18.2ms +/- 0.0%     significant
    binary-trees:      *1.006x as slow*    2.5ms +/- 0.1%     2.5ms +/- 0.1%     significant
    fannkuch:          -                   7.8ms +/- 0.1%     7.8ms +/- 0.1% 
    nbody:             1.006x as fast      4.0ms +/- 0.0%     4.0ms +/- 0.1%     significant
    nsieve:            1.012x as fast      4.0ms +/- 0.1%     3.9ms +/- 0.1%     significant

  bitops:              -                  12.2ms +/- 0.0%    12.2ms +/- 0.0% 
    3bit-bits-in-byte: 1.003x as fast      0.8ms +/- 0.2%     0.8ms +/- 0.2%     significant
    bits-in-byte:      -                   5.1ms +/- 0.0%     5.1ms +/- 0.1% 
    bitwise-and:       -                   3.1ms +/- 0.1%     3.1ms +/- 0.1% 
    nsieve-bits:       -                   3.2ms +/- 0.1%     3.2ms +/- 0.1% 

  controlflow:         1.004x as fast      2.3ms +/- 0.1%     2.3ms +/- 0.1%     significant
    recursive:         1.004x as fast      2.3ms +/- 0.1%     2.3ms +/- 0.1%     significant

  crypto:              1.002x as fast     21.7ms +/- 0.0%    21.6ms +/- 0.0%     significant
    aes:               *1.002x as slow*   11.3ms +/- 0.1%    11.3ms +/- 0.0%     significant
    md5:               1.009x as fast      7.2ms +/- 0.1%     7.1ms +/- 0.1%     significant
    sha1:              -                   3.2ms +/- 0.0%     3.2ms +/- 0.0% 

  date:                *1.006x as slow*   31.1ms +/- 0.1%    31.3ms +/- 0.1%     significant
    format-tofte:      *1.017x as slow*   18.9ms +/- 0.1%    19.2ms +/- 0.1%     significant
    format-xparb:      1.013x as fast     12.2ms +/- 0.1%    12.1ms +/- 0.2%     significant

  math:                1.003x as fast     17.4ms +/- 0.0%    17.3ms +/- 0.0%     significant
    cordic:            -                   3.7ms +/- 0.2%     3.7ms +/- 0.1% 
    partial-sums:      1.004x as fast     11.0ms +/- 0.0%    11.0ms +/- 0.0%     significant
    spectral-norm:     1.003x as fast      2.7ms +/- 0.0%     2.7ms +/- 0.0%     significant

  regexp:              -                  12.8ms +/- 0.1%    12.8ms +/- 0.2% 
    dna:               -                  12.8ms +/- 0.1%    12.8ms +/- 0.2% 

  string:              1.003x as fast     64.9ms +/- 0.0%    64.7ms +/- 0.0%     significant
    base64:            1.004x as fast      5.0ms +/- 0.0%     4.9ms +/- 0.0%     significant
    fasta:             1.007x as fast      6.5ms +/- 0.1%     6.5ms +/- 0.1%     significant
    tagcloud:          1.009x as fast     20.8ms +/- 0.1%    20.6ms +/- 0.1%     significant
    unpack-code:       *1.004x as slow*   24.4ms +/- 0.0%    24.5ms +/- 0.1%     significant
    validate-input:    1.008x as fast      8.2ms +/- 0.0%     8.1ms +/- 0.0%     significant

I wish there were a way to make the harness print in a narrower format.

> For the v8 tests, could you
> summarize; the test splat is hard to understand: did anything get slower?

Yeah, I guess it is.

Each individual test gets run 6 times.  I use a custom format string with /usr/bin/time to print out the user time of each run.  I have a separate post-processing script that computes the mean and sigma for each test.  The output is the difference between the means of the before and after runs, +/- the sum of the standard deviations from the before and after runs.  There is probably something more helpful I could do than sum these.  Is there a correct way to combine standard deviations from uncorrelated distributions?
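(Presumably sqrt(s1^2 + s2^2), since variances of independent distributions add; a plain sum overstates the noise.)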

      crypto -8.333333ms +/- 20.176837ms user 
   deltablue 13.333333ms +/- 37.005621ms user 
earley-boyer 41.666667ms +/- 38.817360ms user *
    raytrace 5.000000ms +/- 27.652559ms user 
      regexp 5.000000ms +/- 35.911655ms user 
    richards -5.000000ms +/- 22.508736ms user 
       splay 13.333333ms +/- 35.839579ms user 

So this currently looks pretty regressed, with earley-boyer particularly hard hit.

> Lastly, what difference do you get for the micro-bench:
> 
>   for (var i = 0; i < 10000000; ++i)
>     ""+i;

terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-before -m -n test.js; done;
0.53user 0.02system 0:00.54elapsed 100%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...

Pulling out the user times gives us centiseconds:
-> 53 52 53 54 54 53 53 = mean 53.142857142857146

terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-after -m -n test.js; done;
0.59user 0.04system 0:00.62elapsed 101%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...

And with the patch we get:
-> 59 58 60 61 58 60 57 = mean 59.0

Which is *much* larger than the same code was giving before I pulled today.  This morning I had this down to < 54.

I'm not entirely sure how I should proceed here.  I think first I'm going to see if I can convince perf to give me useful data.  If you have other ideas, I'd appreciate hearing them.
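
(For perf, probably the usual dance:

  perf record -g ./js-after -m -n test.js
  perf report

though getting useful call stacks out of it may take some fiddling.)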
Attachment #654364 - Attachment is obsolete: true
Attachment #654364 - Flags: review?(luke)
Attachment #654855 - Flags: feedback?(luke)
(In reply to Terrence Cole [:terrence] from comment #3)
Thanks for the thorough reply.  So SS seems fine.  Could you run v8 from the v8 harness (js/src/v8/run.js) and tell me if there is any significant difference?

The micro-benchmark is a bit fake since it is converting a lot of large numbers and I suspect it is more common to convert 1 and 2 digit numbers (which are StaticStrings anyhow).  I assume there is less overhead if you convert, say "999"?

I don't see a better general scheme, so, assuming positive answers to the above two questions the perf seems fine.
(In reply to Luke Wagner [:luke] from comment #4)
> (In reply to Terrence Cole [:terrence] from comment #3)
> Thanks for the thorough reply.  So SS seems fine.  Could you run v8 from the
> v8 harness (js/src/v8/run.js) and tell me if there is any significant
> difference?

I'm not sure how much change counts as significant.  Here is the raw output.

terrence@cloud:~/v8-orig$ ../js-before -m -n run.js
Richards: 8388
DeltaBlue: 9045
Crypto: 15292
RayTrace: 5624
EarleyBoyer: 9307
RegExp: 1649
Splay: 9135
NavierStokes: 14366
----
Score (version 7): 7759
terrence@cloud:~/v8-orig$ ../js-after -m -n run.js
Richards: 8363
DeltaBlue: 9038
Crypto: 15194
RayTrace: 5528
EarleyBoyer: 9353
RegExp: 1639
Splay: 9241
NavierStokes: 14323
----
Score (version 7): 7740

> The micro-benchmark is a bit fake since it is converting a lot of large
> numbers and I suspect it is more common to convert 1 and 2 digit numbers
> (which are StaticStrings anyhow).  I assume there is less overhead if you
> convert, say "999"?

Well, both before and after seem to run much faster when converting the same number repeatedly, even after disabling the dtoa cache.  (Aside: the dtoa cache saves about 10ms on the previous test.)

This test:
=========================================
for (var i = 0; i < 1000000000; ++i)
    ""+999;
=========================================
Has results:
  Before: 2.3 seconds
  After:  2.3 seconds

I thought I would try again on a larger number to see the time increase:
=========================================
for (var i = 0; i < 1000000000; ++i)
    ""+1000000000;
=========================================
Has results:
  Before: 2.3 seconds
  After:  2.3 seconds

Okay, that didn't work.  So maybe JM is doing something clever here.  Let's try again with a shorter loop count in the interpreter:

=========================================
for (var i = 0; i < 10000000; ++i)
    ""+1000000000;
=========================================
Has results:
  Before: 2.31 seconds
  After:  2.28 seconds

I did indeed have to check this twice -- JM appears to run almost exactly 100x faster than the interpreter on this test.

I'm pretty sure there is a second cache after the dtoa cache that is catching this.  Do you know where it is so I can test this?
Comment on attachment 654855 [details] [diff] [review]
v1

Ultimately, I suspect the perf for the original patch was fine without all this digit-counting trickery.
Attachment #654855 - Flags: feedback?(luke)
Comment on attachment 654364 [details] [diff] [review]
v0

r+ with nits in comment 1 addressed.
Attachment #654364 - Attachment is obsolete: false
Attachment #654364 - Flags: review+
> Getting rid of the size class check allows
> us to inline the isInline check, paving the way for a fast ensureNotInline
> and removal of getAllocKind as a requirement for string placement in arenas.

Can you expand on that last part?  Would it allow vanilla strings and short strings to share arenas?  If so, that'd be good -- anything that reduces the number of arena kinds reduces GC heap fragmentation.
(In reply to Nicholas Nethercote [:njn] from comment #8)
> > Getting rid of the size class check allows
> > us to inline the isInline check, paving the way for a fast ensureNotInline
> > and removal of getAllocKind as a requirement for string placement in arenas.
> 
> Can you expand on that last part?  Would it allow vanilla strings and short
> strings to share arenas?  If so, that'd be good -- anything that reduces the
> number of arena kinds reduces GC heap fragmentation.

That is one of the long-term goals.  Ideally we'd like to bump-allocate and use a compacting GC to de-fragment, rather than our current free-list scheme.  The main thing the arenas give us right now is access to the compartment pointer.  We have plans for how we could get this in a different way, but the AllocKind is harder.

I'm not sure how immediate our plans in this direction are: as you well know, we have our hands full getting our rooting story to a happy ending right now :-).
https://hg.mozilla.org/mozilla-central/rev/539643ca53ec
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18