Bug 784808 (Closed)
Opened 13 years ago
Closed 13 years ago
Remove isInline GC size class check
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
Target Milestone: mozilla18
People: (Reporter: terrence, Assigned: terrence)
Attachments (2 files):
  8.95 KB, patch (luke: review+)
  11.68 KB, patch
We depend on a size class check in JSString::isShort because JSShortStrings can have their chars() point into the middle of their inline buffers. This is to support Int32ToString. Considering that this path already has to do a ton of division and an allocation, just copying the <10 chars to the front of the buffer is provably trivial. Getting rid of the size class check allows us to inline the isInline check, paving the way for a fast ensureNotInline and removal of getAllocKind as a requirement for string placement in arenas.
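For illustration, the shape of the check after the change might look like this (stand-in names and layout, not the real JSString):
=========================================
// Illustrative stand-in type; the real JSString layout differs.
struct StringLike {
    const char16_t *chars;       // points at inlineStorage or a heap buffer
    char16_t inlineStorage[12];  // inline chars for short strings
};

// Once chars always points at the front of the inline buffer, inline-ness
// is a cheap, inlinable pointer comparison, with no trip out to the arena
// to read the GC AllocKind.
inline bool IsInline(const StringLike &s) {
    return s.chars == s.inlineStorage;
}
=========================================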
These benchmarks are Bill's modified v8: they run for a fixed number of iterations, rather than a fixed time. I ran them on Bill's stably clocked "shookfoil" machine. Each benchmark is run six times sequentially from a shell script. A post-processing step computes the mean and standard deviation for each test from the before and after sets, then reports the difference between the means along with the significance threshold. Any result within half a standard deviation of significance is flagged with a star (*).
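A minimal sketch of that post-processing (my reconstruction; the actual script may differ):
=========================================
#include <cmath>
#include <vector>

// Mean and (population) standard deviation of one test's runs.
struct Stats { double mean, sigma; };

static Stats ComputeStats(const std::vector<double> &runs) {
    double sum = 0.0;
    for (double r : runs)
        sum += r;
    double mean = sum / runs.size();
    double var = 0.0;
    for (double r : runs)
        var += (r - mean) * (r - mean);
    return Stats{mean, std::sqrt(var / runs.size())};
}
=========================================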
Runs |Clean1| and |Clean2| are against a clean mozilla-inbound tree with tip at 0e73660050a4. |Patched-Loop| uses a loop to copy and |Patched-Memmove| uses a memmove to copy.
Clean1 -> Patched-Loop:
crypto -3.333333ms +/- 20.814076ms user
deltablue 16.666667ms +/- 43.655134ms user
earley-boyer 15.000000ms +/- 49.403583ms user
raytrace 5.000000ms +/- 31.898951ms user
regexp 10.000000ms +/- 38.124375ms user
richards 5.000000ms +/- 14.995899ms user
splay -3.333333ms +/- 39.510034ms user
Clean2 -> Patched-Memmove:
crypto -18.333333ms +/- 25.416270ms user
deltablue 31.666667ms +/- 45.063557ms user
earley-boyer -26.666667ms +/- 39.472899ms user
raytrace 15.000000ms +/- 38.026889ms user
regexp -13.333333ms +/- 34.914110ms user
richards 3.333333ms +/- 13.004952ms user
splay 1.666667ms +/- 33.553946ms user
Patched-Loop -> Patched-Memmove
crypto -6.666667ms +/- 26.053510ms user
deltablue 11.666667ms +/- 24.696532ms user
earley-boyer -20.000000ms +/- 57.363729ms user
raytrace 11.666667ms +/- 33.784738ms user
regexp -11.666667ms +/- 47.636283ms user
richards 1.666667ms +/- 12.691704ms user
splay 15.000000ms +/- 33.553946ms user
Clean1 -> Clean2:
crypto 8.333333ms +/- 20.176837ms user
deltablue -3.333333ms +/- 64.022159ms user
earley-boyer 21.666667ms +/- 31.512753ms user
raytrace 1.666667ms +/- 36.141103ms user
regexp 11.666667ms +/- 25.402203ms user
richards 3.333333ms +/- 15.309146ms user
splay 10.000000ms +/- 39.510034ms user
I also ran SunSpider on shookfoil. The results look similar -- 1.007x was stable across all runs:
** TOTAL **: *1.007x as slow* 2206.2ms +/- 0.1% 2221.9ms +/- 0.1% significant
Is this too much of a slowdown to take?
Attachment #654364 - Flags: review?(luke)
Comment 1 • 13 years ago (Luke Wagner [:luke])
Comment on attachment 654364 [details] [diff] [review]
v0
Review of attachment 654364 [details] [diff] [review]:
-----------------------------------------------------------------
Nice solution to the set of constraints.
> just copying the <10 chars to the front of the buffer is provably trivial
Uh, it's provably O(1). I don't know what "trivial" means in this context. Perhaps "measurably insignificant", except that your measurements do show a change. On that subject, what sunspider are you running that takes 2205ms? Is this the v8 one that runs SS in a loop? Do you get the same regression every time you run? What about normal SS? For the v8 tests, could you summarize? The test splat is hard to understand: did anything get slower?
Lastly, what difference do you get for the micro-bench:
for (var i = 0; i < 10000000; ++i)
""+i;
::: js/src/jsnum.cpp
@@ +549,5 @@
>      *--start = '-';
>
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;
Since the end result is copied anyway, there is no need for inlineStorageBeforeInit; you can just use a local array (of size UINT32_CHAR_BUFFER_LENGTH) and then write:
jschar *dst = str->init(end - start);
PodCopy(dst, start, end - start);
(this will allow inlineStorageBeforeInit to be totally removed.)
@@ +1313,5 @@
>      RangedPtr<jschar> start = BackfillIndexInCharBuffer(index, end);
>
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;
ditto
::: js/src/vm/String.h
@@ +407,5 @@
>
> static inline js::ThingRootKind rootKind() { return js::THING_ROOT_STRING; }
>
> #ifdef DEBUG
> + bool isShortAllocKind() const;
can you keep this named isShort()?
Comment 2 • 13 years ago (Terrence Cole [:terrence], Assignee)
(In reply to Luke Wagner [:luke] from comment #1)
> Nice solution to the set of constraints.
>
> > just copying the <10 chars to the front of the buffer is provably trivial
>
> Uh, it's provably O(1). I don't know what "trivial" means in this context.
> Perhaps "measurably insignificant", except that your measurements do show a
> change.
Captain Malapropism strikes again! s/provably/probably/ Sorry about the confusion.
> Lastly, what difference do you get for the micro-bench:
>
> for (var i = 0; i < 10000000; ++i)
> ""+i;
Before: 0.50 sec
After: 0.59 sec (Down from 0.61 sec for the attached patch with some tweaking.)
Comment 3 • 13 years ago (Terrence Cole [:terrence], Assignee)
(In reply to Luke Wagner [:luke] from comment #1)
I hacked on this some more today to see if I could get the time down any by making Int32ToString more efficient. What I learned is that moving even a handful of bytes around in memory is significantly more expensive than taking a few local branches... usually.
The strategy that seems to work best is to write the number where it needs to be on the first pass (as before). Unlike before, we need to find where to put the number in the buffer first. We do this by precomputing the base-10 digit count for the number as follows (a sketch appears after the list):
1) Loop to find the topmost bit of the number.
2) Branch into a switch table for the closest number of digits for that power-of-two range of numbers.
3) Adjust from the base by adding the result of a comparison with the largest base-10 number of that size.
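A sketch of the idea (hedged: names and constants are illustrative, and step 2 here uses the common multiply-by-1233 approximation where the patch uses a switch table):
=========================================
#include <stdint.h>

static const uint32_t PowersOf10[] = {
    1, 10, 100, 1000, 10000, 100000,
    1000000, 10000000, 100000000, 1000000000
};

static unsigned Base10DigitCount(uint32_t n) {
    if (n == 0)
        return 1;

    // Step 1: loop to find the topmost set bit.
    unsigned log2 = 0;
    for (uint32_t v = n; v >>= 1; )
        ++log2;

    // Step 2: map the power-of-two range to an approximate digit count;
    // (log2 + 1) * 1233 >> 12 approximates (log2 + 1) * log10(2).
    unsigned t = ((log2 + 1) * 1233) >> 12;

    // Step 3: adjust by comparing against the smallest (t+1)-digit number.
    return t + 1 - (n < PowersOf10[t] ? 1 : 0);
}
=========================================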
This got us most of the way back to parity, until I pulled. The clean runs got slower by 10-20ms, but the new code got 30-40ms slower, losing most of my gains. G++ appears to have started generating different assembly after rebasing. Need to look into this more.
> Nice solution to the set of constraints.
This was Bill's idea; he deserves all the credit for this one.
> On that subject, what sunspider are you running that takes 2205ms?
> Is this the v8 one that runs SS in a loop? Do you get the same regression
> every time you run? What about normal SS?
Good catch! It's the normal SS, but I had the js args in the wrong spot, so it was running in the interpreter. I am running SunSpider now by passing the arguments |--args "-m -n" --runs 100|. The js binary is cross-compiled to x86 using gcc-4.5.3 with -m32 and --target=i686-pc-linux and run on Bill's shookfoil machine.
I ran SunSpider 4 times and got the following results:
TEST COMPARISON FROM TO DETAILS
=============================================================================
** TOTAL **: 1.002x as fast 215.8ms +/- 0.0% 215.4ms +/- 0.0% significant
** TOTAL **: - 215.9ms +/- 0.0% 215.8ms +/- 0.0%
** TOTAL **: *1.000x as slow* 215.8ms +/- 0.0% 215.9ms +/- 0.0% significant
** TOTAL **: 1.001x as fast 215.9ms +/- 0.0% 215.6ms +/- 0.0% significant
=============================================================================
The detailed results from the first run look like this:
3d: 1.003x as fast 35.1ms +/- 0.0% 34.9ms +/- 0.0% significant
cube: 1.007x as fast 13.2ms +/- 0.0% 13.1ms +/- 0.0% significant
morph: *1.004x as slow* 6.9ms +/- 0.0% 6.9ms +/- 0.1% significant
raytrace: 1.003x as fast 15.0ms +/- 0.0% 14.9ms +/- 0.0% significant
access: 1.003x as fast 18.3ms +/- 0.0% 18.2ms +/- 0.0% significant
binary-trees: *1.006x as slow* 2.5ms +/- 0.1% 2.5ms +/- 0.1% significant
fannkuch: - 7.8ms +/- 0.1% 7.8ms +/- 0.1%
nbody: 1.006x as fast 4.0ms +/- 0.0% 4.0ms +/- 0.1% significant
nsieve: 1.012x as fast 4.0ms +/- 0.1% 3.9ms +/- 0.1% significant
bitops: - 12.2ms +/- 0.0% 12.2ms +/- 0.0%
3bit-bits-in-byte: 1.003x as fast 0.8ms +/- 0.2% 0.8ms +/- 0.2% significant
bits-in-byte: - 5.1ms +/- 0.0% 5.1ms +/- 0.1%
bitwise-and: - 3.1ms +/- 0.1% 3.1ms +/- 0.1%
nsieve-bits: - 3.2ms +/- 0.1% 3.2ms +/- 0.1%
controlflow: 1.004x as fast 2.3ms +/- 0.1% 2.3ms +/- 0.1% significant
recursive: 1.004x as fast 2.3ms +/- 0.1% 2.3ms +/- 0.1% significant
crypto: 1.002x as fast 21.7ms +/- 0.0% 21.6ms +/- 0.0% significant
aes: *1.002x as slow* 11.3ms +/- 0.1% 11.3ms +/- 0.0% significant
md5: 1.009x as fast 7.2ms +/- 0.1% 7.1ms +/- 0.1% significant
sha1: - 3.2ms +/- 0.0% 3.2ms +/- 0.0%
date: *1.006x as slow* 31.1ms +/- 0.1% 31.3ms +/- 0.1% significant
format-tofte: *1.017x as slow* 18.9ms +/- 0.1% 19.2ms +/- 0.1% significant
format-xparb: 1.013x as fast 12.2ms +/- 0.1% 12.1ms +/- 0.2% significant
math: 1.003x as fast 17.4ms +/- 0.0% 17.3ms +/- 0.0% significant
cordic: - 3.7ms +/- 0.2% 3.7ms +/- 0.1%
partial-sums: 1.004x as fast 11.0ms +/- 0.0% 11.0ms +/- 0.0% significant
spectral-norm: 1.003x as fast 2.7ms +/- 0.0% 2.7ms +/- 0.0% significant
regexp: - 12.8ms +/- 0.1% 12.8ms +/- 0.2%
dna: - 12.8ms +/- 0.1% 12.8ms +/- 0.2%
string: 1.003x as fast 64.9ms +/- 0.0% 64.7ms +/- 0.0% significant
base64: 1.004x as fast 5.0ms +/- 0.0% 4.9ms +/- 0.0% significant
fasta: 1.007x as fast 6.5ms +/- 0.1% 6.5ms +/- 0.1% significant
tagcloud: 1.009x as fast 20.8ms +/- 0.1% 20.6ms +/- 0.1% significant
unpack-code: *1.004x as slow* 24.4ms +/- 0.0% 24.5ms +/- 0.1% significant
validate-input: 1.008x as fast 8.2ms +/- 0.0% 8.1ms +/- 0.0% significant
I wish there were a way to make the harness print in a narrower format.
> For the v8 tests, could you
> summarize; the test splat is hard to understand: did anything get slower?
Yeah, I guess it is.
Each individual test gets run 6 times. I pass a custom format string to /usr/bin/time to print the user time of each run. A separate post-processing script computes the mean and sigma for each test. The output is the difference between the means of the before and after runs, +/- the sum of the standard deviations from the before and after runs. There is probably something more helpful I could do than sum these. Is there a correct way to combine standard deviations from uncorrelated distributions?
crypto -8.333333ms +/- 20.176837ms user
deltablue 13.333333ms +/- 37.005621ms user
earley-boyer 41.666667ms +/- 38.817360ms user *
raytrace 5.000000ms +/- 27.652559ms user
regexp 5.000000ms +/- 35.911655ms user
richards -5.000000ms +/- 22.508736ms user
splay 13.333333ms +/- 35.839579ms user
So this currently looks pretty regressed, with earley-boyer particularly hard hit.
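(For what it's worth, on the question above: for uncorrelated distributions the variances add, so the standard combination is in quadrature, which is never larger than the plain sum used here. A minimal sketch:)
=========================================
#include <cmath>

// Combine two independent standard deviations in quadrature.
double CombinedSigma(double sigmaBefore, double sigmaAfter) {
    return std::sqrt(sigmaBefore * sigmaBefore + sigmaAfter * sigmaAfter);
}
=========================================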
> Lastly, what difference do you get for the micro-bench:
>
> for (var i = 0; i < 10000000; ++i)
> ""+i;
terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-before -m -n test.js; done;
0.53user 0.02system 0:00.54elapsed 100%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...
Pulling out the user times gives us centiseconds:
-> 53 52 53 54 54 53 53 = mean 53.142857142857146
terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-after -m -n test.js; done;
0.59user 0.04system 0:00.62elapsed 101%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...
And with the patch we get:
-> 59 58 60 61 58 60 57 = mean 59.0
Which is *much* larger than the same code was giving before I pulled today. This morning I had this down to < 54.
I'm not entirely sure how I should proceed here. I think first I'm going to see if I can convince perf to give me useful data. If you have other ideas, I'd appreciate hearing them.
Attachment #654364 - Attachment is obsolete: true
Attachment #654364 - Flags: review?(luke)
Attachment #654855 - Flags: feedback?(luke)
Comment 4 • 13 years ago (Luke Wagner [:luke])
(In reply to Terrence Cole [:terrence] from comment #3)
Thanks for the thorough reply. So SS seems fine. Could you run v8 from the v8 harness (js/src/v8/run.js) and tell me if there is any significant difference?
The micro-benchmark is a bit fake since it is converting a lot of large numbers and I suspect it is more common to convert 1 and 2 digit numbers (which are StaticStrings anyhow). I assume there is less overhead if you convert, say "999"?
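A hedged sketch of the StaticStrings idea (illustrative only; not the engine's actual class or limits): strings for small integers are precomputed once, so converting them is a table lookup rather than an allocation.
=========================================
#include <string>

static std::string gSmallIntStrings[100];

static void InitSmallIntStrings() {
    for (int i = 0; i < 100; ++i)
        gSmallIntStrings[i] = std::to_string(i);
}

// Returns the cached string, or nullptr if i is out of the cached range.
static const std::string *LookupSmallInt(int i) {
    return (0 <= i && i < 100) ? &gSmallIntStrings[i] : nullptr;
}
=========================================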
I don't see a better general scheme, so, assuming positive answers to the above two questions, the perf seems fine.
Comment 5 • 13 years ago (Terrence Cole [:terrence], Assignee)
(In reply to Luke Wagner [:luke] from comment #4)
> (In reply to Terrence Cole [:terrence] from comment #3)
> Thanks for the thorough reply. So SS seems fine. Could you run v8 from the
> v8 harness (js/src/v8/run.js) and tell me if there is any significant
> difference?
I'm not sure how much change is significant. Here is the raw output.
terrence@cloud:~/v8-orig$ ../js-before -m -n run.js
Richards: 8388
DeltaBlue: 9045
Crypto: 15292
RayTrace: 5624
EarleyBoyer: 9307
RegExp: 1649
Splay: 9135
NavierStokes: 14366
----
Score (version 7): 7759
terrence@cloud:~/v8-orig$ ../js-after -m -n run.js
Richards: 8363
DeltaBlue: 9038
Crypto: 15194
RayTrace: 5528
EarleyBoyer: 9353
RegExp: 1639
Splay: 9241
NavierStokes: 14323
----
Score (version 7): 7740
> The micro-benchmark is a bit fake since it is converting a lot of large
> numbers and I suspect it is more common to convert 1 and 2 digit numbers
> (which are StaticStrings anyhow). I assume there is less overhead if you
> convert, say "999"?
Well, both before and after seem to run much faster when converting the same number repeatedly, even after disabling the dtoa cache. (Aside: the dtoa cache saves about 10ms on the previous test.)
This test:
=========================================
for (var i = 0; i < 1000000000; ++i)
""+999;
=========================================
Has results:
Before: 2.3 seconds
After: 2.3 seconds
I thought I would try again on a larger number to see the time increase:
=========================================
for (var i = 0; i < 1000000000; ++i)
""+1000000000;
=========================================
Has results:
Before: 2.3 seconds
After: 2.3 seconds
Okay, that didn't work. So maybe JM is doing something clever here. Let's try again with a shorter loop count in the interpreter:
=========================================
for (var i = 0; i < 10000000; ++i)
""+1000000000;
=========================================
Has results:
Before: 2.31 seconds
After: 2.28 seconds
I did indeed have to check this twice -- JM appears to run almost exactly 100x faster than the interpreter on this test.
I'm pretty sure there is a second cache after the dtoa cache that is catching this. Do you know where it is so I can test this?
Comment 6 • 13 years ago (Luke Wagner [:luke])
Comment on attachment 654855 [details] [diff] [review]
v1
Ultimately, I suspect the perf for the original patch was fine without all this digit-counting trickery.
Attachment #654855 - Flags: feedback?(luke)
Comment 7 • 13 years ago (Luke Wagner [:luke])
Attachment #654364 - Attachment is obsolete: false
Attachment #654364 - Flags: review+
Comment 8 • 13 years ago (Nicholas Nethercote [:njn])
> Getting rid of the size class check allows
> us to inline the isInline check, paving the way for a fast ensureNotInline
> and removal of getAllocKind as a requirement for string placement in arenas.
Can you expand on that last part? Would it allow vanilla strings and short strings to share arenas? If so, that'd be good -- anything that reduces the number of arena kinds reduces GC heap fragmentation.
Comment 9 • 13 years ago (Terrence Cole [:terrence], Assignee)
(In reply to Nicholas Nethercote [:njn] from comment #8)
> > Getting rid of the size class check allows
> > us to inline the isInline check, paving the way for a fast ensureNotInline
> > and removal of getAllocKind as a requirement for string placement in arenas.
>
> Can you expand on that last part? Would it allow vanilla strings and short
> strings to share arenas? If so, that'd be good -- anything that reduces the
> number of arena kinds reduces GC heap fragmentation.
That is one of the long-term goals. Ideally we'd like to bump-allocate and use a compacting GC to defragment, rather than our current free-list scheme. The main thing the arenas give us right now is access to the compartment pointer. We have plans for how we could get this in a different way, but the AllocKind is harder.
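A hedged sketch of bump allocation (illustrative; not the engine's allocator): allocation is a bounds check plus a pointer increment, and a compacting GC later squeezes out the dead space instead of threading free lists through the arena.
=========================================
#include <cstddef>

struct BumpArena {
    char *cursor;  // next free byte
    char *limit;   // one past the end of the arena

    void *alloc(size_t bytes) {
        if (size_t(limit - cursor) < bytes)
            return nullptr;  // arena full; caller grabs a fresh arena
        void *p = cursor;
        cursor += bytes;
        return p;
    }
};
=========================================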
I'm not sure how immediate our plans in this direction are: as you well know, we have our hands full getting our rooting story to a happy ending right now :-).
Comment 10 • 13 years ago (Terrence Cole [:terrence], Assignee)
Green try run at:
https://tbpl.mozilla.org/?tree=Try&rev=b9dec21ebfda
Pushed at:
https://hg.mozilla.org/integration/mozilla-inbound/rev/539643ca53ec
Comment 11 • 13 years ago
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18