Closed
Bug 784808
Opened 12 years ago
Closed 12 years ago
Remove isInline GC size class check
Categories
(Core :: JavaScript Engine, defect)
Tracking
RESOLVED
FIXED
mozilla18
People
(Reporter: terrence, Assigned: terrence)
References
Details
Attachments
(2 files)
8.95 KB, patch
luke: review+
Details | Diff | Splinter Review
11.68 KB, patch
Details | Diff | Splinter Review
We depend on a size class check in JSString::isShort because JSShortStrings can have their chars() point into the middle of their inline buffers. This is to support Int32ToString. Considering that this path already has to do a ton of division and an allocation, just copying the <10 chars to the front of the buffer is provably trivial. Getting rid of the size class check allows us to inline the isInline check, paving the way for a fast ensureNotInline and removal of getAllocKind as a requirement for string placement in arenas.

These benchmarks are Bill's modified v8: they run for a fixed number of iterations, rather than for a fixed time. I ran them on Bill's stably clocked "shookfoil" machine. Each benchmark is run six times sequentially from a shell script. A post-processing step computes the mean and standard deviation for each test from the before and after sets, then reports the time difference between the before and after runs and the significance threshold. Any run within half a standard deviation of significance is flagged with a star (*).

Runs |Clean1| and |Clean2| are against a clean mozilla-inbound tree with tip at 0e73660050a4. |Patched-Loop| uses a loop to copy and |Patched-Memmove| uses a memmove to copy.
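The inlining payoff described above can be illustrated with a toy sketch (a hypothetical simplified layout, not the real JSString): once chars() is guaranteed to always point at the *start* of the inline buffer, isInline reduces to a single pointer comparison with no GC size-class lookup.

```cpp
#include <cassert>

// Hypothetical simplified string layout (not the real JSString). When chars
// may only point at the start of inlineStorage -- never into its middle --
// inline-ness is a single pointer comparison, cheap enough to inline.
struct TinyString {
    const char *chars;
    char inlineStorage[12];

    bool isInline() const { return chars == inlineStorage; }
};
```

The old scheme had to consult the GC arena's size class instead, because a chars() pointer into the middle of the buffer made the direct comparison unreliable.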
Clean1 -> Patched-Loop:
crypto         -3.333333ms +/- 20.814076ms user
deltablue      16.666667ms +/- 43.655134ms user
earley-boyer   15.000000ms +/- 49.403583ms user
raytrace        5.000000ms +/- 31.898951ms user
regexp         10.000000ms +/- 38.124375ms user
richards        5.000000ms +/- 14.995899ms user
splay          -3.333333ms +/- 39.510034ms user

Clean2 -> Patched-Memmove:
crypto        -18.333333ms +/- 25.416270ms user
deltablue      31.666667ms +/- 45.063557ms user
earley-boyer  -26.666667ms +/- 39.472899ms user
raytrace       15.000000ms +/- 38.026889ms user
regexp        -13.333333ms +/- 34.914110ms user
richards        3.333333ms +/- 13.004952ms user
splay           1.666667ms +/- 33.553946ms user

Patched-Loop -> Patched-Memmove:
crypto         -6.666667ms +/- 26.053510ms user
deltablue      11.666667ms +/- 24.696532ms user
earley-boyer  -20.000000ms +/- 57.363729ms user
raytrace       11.666667ms +/- 33.784738ms user
regexp        -11.666667ms +/- 47.636283ms user
richards        1.666667ms +/- 12.691704ms user
splay          15.000000ms +/- 33.553946ms user

Clean1 -> Clean2:
crypto          8.333333ms +/- 20.176837ms user
deltablue      -3.333333ms +/- 64.022159ms user
earley-boyer   21.666667ms +/- 31.512753ms user
raytrace        1.666667ms +/- 36.141103ms user
regexp         11.666667ms +/- 25.402203ms user
richards        3.333333ms +/- 15.309146ms user
splay          10.000000ms +/- 39.510034ms user

I also ran SunSpider on shookfoil. The results look similar -- 1.007x was stable across all runs:

** TOTAL **: *1.007x as slow*  2206.2ms +/- 0.1%  2221.9ms +/- 0.1%  significant

Is this too much of a slowdown to take?
Attachment #654364 -
Flags: review?(luke)
Comment 1•12 years ago
Comment on attachment 654364 [details] [diff] [review]
v0

Review of attachment 654364 [details] [diff] [review]:
-----------------------------------------------------------------

Nice solution to the set of constraints.

> just copying the <10 chars to the front of the buffer is provably trivial

Uh, it's provably O(1). I don't know what "trivial" means in this context. Perhaps "measurably insignificant", except that your measurements do show a change.

On that subject, what sunspider are you running that takes 2205ms? Is this the v8 one that runs SS in a loop? Do you get the same regression every time you run? What about normal SS? For the v8 tests, could you summarize; the test splat is hard to understand: did anything get slower?

Lastly, what difference do you get for the micro-bench:

for (var i = 0; i < 10000000; ++i)
    ""+i;

::: js/src/jsnum.cpp
@@ +549,5 @@
> *--start = '-';
>
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;

Since the end result is copied anyway, there is no need for inlineStorageBeforeInit; you can just use a local array (of size UINT32_CHAR_BUFFER_LENGTH) and then write:

  jschar *dst = str->init(end - start);
  PodCopy(dst, start);

(this will allow inlineStorageBeforeInit to be totally removed.)

@@ +1313,5 @@
> RangedPtr<jschar> start = BackfillIndexInCharBuffer(index, end);
>
> +    str->init(end - start);
> +    while (start != end)
> +        *storage++ = *start++;

ditto

::: js/src/vm/String.h
@@ +407,5 @@
>
>     static inline js::ThingRootKind rootKind() { return js::THING_ROOT_STRING; }
>
> #ifdef DEBUG
> +    bool isShortAllocKind() const;

can you keep this named isShort()?
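The reviewer's suggested shape -- backfill a local scratch array from the end, then do a single copy into the buffer returned by init() -- might look roughly like this standalone sketch. ShortString and InitFromUint are hypothetical stand-ins for JSShortString and the jsnum.cpp code, and note that the real js::PodCopy also takes an element count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const size_t UINT32_CHAR_BUFFER_LENGTH = 11;  // enough for 10 digits plus slack

// Stand-in for js::PodCopy: a typed copy of |n| elements.
template <typename T>
static void PodCopy(T *dst, const T *src, size_t n)
{
    memcpy(dst, src, n * sizeof(T));
}

// Hypothetical stand-in for JSShortString: init() records the length and
// returns a pointer to the start of the inline storage.
struct ShortString {
    size_t length;
    char storage[UINT32_CHAR_BUFFER_LENGTH];
    char *init(size_t len) { length = len; return storage; }
};

// Backfill digits into a local array from the end, then copy once into the
// string's inline buffer -- no inlineStorageBeforeInit needed.
static void InitFromUint(ShortString *str, unsigned i)
{
    char buf[UINT32_CHAR_BUFFER_LENGTH];
    char *end = buf + UINT32_CHAR_BUFFER_LENGTH;
    char *start = end;
    do {
        *--start = char('0' + i % 10);
        i /= 10;
    } while (i != 0);

    char *dst = str->init(size_t(end - start));
    PodCopy(dst, start, size_t(end - start));
}
```

The point of the local array is that the string's chars() then always starts at the front of its inline storage, which is what lets the size-class check go away.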
Assignee
Comment 2•12 years ago
(In reply to Luke Wagner [:luke] from comment #1)
> Nice solution to the set of constraints.
>
> > just copying the <10 chars to the front of the buffer is provably trivial
>
> Uh, it's provably O(1). I don't know what "trivial" means in this context.
> Perhaps "measurably insignificant", except that your measurements do show a
> change.

Captain Malapropism strifes again! s/provable/probably/ Sorry about the confusion.

> Lastly, what difference do you get for the micro-bench:
>
> for (var i = 0; i < 10000000; ++i)
>     ""+i;

Before: 0.50 sec
After:  0.59 sec

(Down from 0.61 sec for the attached patch with some tweaking.)
Assignee
Comment 3•12 years ago
(In reply to Luke Wagner [:luke] from comment #1)

I hacked on this some more today to see if I could get the time down any by making Int32ToString more efficient. What I learned is that moving even a handful of bytes around in memory is significantly more expensive than taking a few local branches... usually.

The strategy that seems to work best is to write the number where it needs to be on the first pass (as before). Unlike before, we need to find where to put the number in the buffer first. We do this by precomputing the base-10 digit count for the number as follows:

1) Loop to find the topmost bit of the number.
2) Branch into a switch table for the closest number of digits for that power-of-two range of numbers.
3) Adjust from the base by adding the result of a comparison with the largest base-10 number of that size.

This got us most of the way back to parity, until I pulled. The clean runs got slower by 10-20ms, but the new code got 30-40ms slower, losing most of my gains. G++ appears to have started generating different assembly after rebasing. I need to look into this more.

> Nice solution to the set of constraints.

Bill's idea; he deserves all the credit for this one.

> On that subject, what sunspider are you running that takes 2205ms?
> Is this the v8 one that runs SS in a loop? Do you get the same regression
> every time you run? What about normal SS?

Good catch! It's the normal SS, but I had the js args in the wrong spot, so it was running in the interpreter. I am running sunspider now by passing the arguments |--args "-m -n" --runs 100|. The js binary is cross-compiled to x86 using gcc-4.5.3 with -m32 and --target=i686-pc-linux and run on Bill's shookfoil machine.
I ran sunspider 4 times and got the following results:

TEST          COMPARISON        FROM               TO                 DETAILS
=============================================================================
** TOTAL **:  1.002x as fast    215.8ms +/- 0.0%   215.4ms +/- 0.0%   significant
** TOTAL **:  -                 215.9ms +/- 0.0%   215.8ms +/- 0.0%
** TOTAL **:  *1.000x as slow*  215.8ms +/- 0.0%   215.9ms +/- 0.0%   significant
** TOTAL **:  1.001x as fast    215.9ms +/- 0.0%   215.6ms +/- 0.0%   significant
=============================================================================

The detailed results from the first run look like this:

3d:                  1.003x as fast     35.1ms +/- 0.0%   34.9ms +/- 0.0%   significant
  cube:              1.007x as fast     13.2ms +/- 0.0%   13.1ms +/- 0.0%   significant
  morph:             *1.004x as slow*    6.9ms +/- 0.0%    6.9ms +/- 0.1%   significant
  raytrace:          1.003x as fast     15.0ms +/- 0.0%   14.9ms +/- 0.0%   significant

access:              1.003x as fast     18.3ms +/- 0.0%   18.2ms +/- 0.0%   significant
  binary-trees:      *1.006x as slow*    2.5ms +/- 0.1%    2.5ms +/- 0.1%   significant
  fannkuch:          -                   7.8ms +/- 0.1%    7.8ms +/- 0.1%
  nbody:             1.006x as fast      4.0ms +/- 0.0%    4.0ms +/- 0.1%   significant
  nsieve:            1.012x as fast      4.0ms +/- 0.1%    3.9ms +/- 0.1%   significant

bitops:              -                  12.2ms +/- 0.0%   12.2ms +/- 0.0%
  3bit-bits-in-byte: 1.003x as fast      0.8ms +/- 0.2%    0.8ms +/- 0.2%   significant
  bits-in-byte:      -                   5.1ms +/- 0.0%    5.1ms +/- 0.1%
  bitwise-and:       -                   3.1ms +/- 0.1%    3.1ms +/- 0.1%
  nsieve-bits:       -                   3.2ms +/- 0.1%    3.2ms +/- 0.1%

controlflow:         1.004x as fast      2.3ms +/- 0.1%    2.3ms +/- 0.1%   significant
  recursive:         1.004x as fast      2.3ms +/- 0.1%    2.3ms +/- 0.1%   significant

crypto:              1.002x as fast     21.7ms +/- 0.0%   21.6ms +/- 0.0%   significant
  aes:               *1.002x as slow*   11.3ms +/- 0.1%   11.3ms +/- 0.0%   significant
  md5:               1.009x as fast      7.2ms +/- 0.1%    7.1ms +/- 0.1%   significant
  sha1:              -                   3.2ms +/- 0.0%    3.2ms +/- 0.0%

date:                *1.006x as slow*   31.1ms +/- 0.1%   31.3ms +/- 0.1%   significant
  format-tofte:      *1.017x as slow*   18.9ms +/- 0.1%   19.2ms +/- 0.1%   significant
  format-xparb:      1.013x as fast     12.2ms +/- 0.1%   12.1ms +/- 0.2%   significant

math:                1.003x as fast     17.4ms +/- 0.0%   17.3ms +/- 0.0%   significant
  cordic:            -                   3.7ms +/- 0.2%    3.7ms +/- 0.1%
  partial-sums:      1.004x as fast     11.0ms +/- 0.0%   11.0ms +/- 0.0%   significant
  spectral-norm:     1.003x as fast      2.7ms +/- 0.0%    2.7ms +/- 0.0%   significant

regexp:              -                  12.8ms +/- 0.1%   12.8ms +/- 0.2%
  dna:               -                  12.8ms +/- 0.1%   12.8ms +/- 0.2%

string:              1.003x as fast     64.9ms +/- 0.0%   64.7ms +/- 0.0%   significant
  base64:            1.004x as fast      5.0ms +/- 0.0%    4.9ms +/- 0.0%   significant
  fasta:             1.007x as fast      6.5ms +/- 0.1%    6.5ms +/- 0.1%   significant
  tagcloud:          1.009x as fast     20.8ms +/- 0.1%   20.6ms +/- 0.1%   significant
  unpack-code:       *1.004x as slow*   24.4ms +/- 0.0%   24.5ms +/- 0.1%   significant
  validate-input:    1.008x as fast      8.2ms +/- 0.0%    8.1ms +/- 0.0%   significant

I wish there were a way to make the harness print in a narrower format.

> For the v8 tests, could you
> summarize; the test splat is hard to understand: did anything get slower?

Yeah, I guess it is. Each individual test gets run 6 times. I use a custom format string to /usr/bin/time to print out the user time of each run. I have a separate post-processing script that computes the mean and sigma for each test. The output is the difference between the means of the before and after runs +/- the sum of the standard deviations from the before and after runs. There is probably something more helpful I could do than sum these. Is there a correct way to combine standard deviations from uncorrelated distributions?

crypto         -8.333333ms +/- 20.176837ms user
deltablue      13.333333ms +/- 37.005621ms user
earley-boyer   41.666667ms +/- 38.817360ms user *
raytrace        5.000000ms +/- 27.652559ms user
regexp          5.000000ms +/- 35.911655ms user
richards       -5.000000ms +/- 22.508736ms user
splay          13.333333ms +/- 35.839579ms user

So this currently looks pretty regressed, with earley-boyer particularly hard hit.
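On the question of combining standard deviations from uncorrelated distributions: for independent samples, variances add, so a common choice is the root of the sum of squares rather than the plain sum used by the post-processing script. A hedged sketch (whether this is the right error bar for these particular mean differences is a statistics judgment call):

```cpp
#include <cassert>
#include <cmath>

// For uncorrelated X and Y, Var(X - Y) = Var(X) + Var(Y), so the sigma of a
// difference of means combines in quadrature. Summing the two sigmas, as the
// script above does, overstates the noise.
static double CombinedSigma(double sigmaBefore, double sigmaAfter)
{
    return sqrt(sigmaBefore * sigmaBefore + sigmaAfter * sigmaAfter);
}
```

For example, two runs with sigmas of 3ms and 4ms combine to 5ms, not 7ms, which would tighten the significance thresholds in the tables above.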
> Lastly, what difference do you get for the micro-bench:
>
> for (var i = 0; i < 10000000; ++i)
>     ""+i;

terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-before -m -n test.js; done;
0.53user 0.02system 0:00.54elapsed 100%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...

Pulling out the user times gives us deciseconds:
-> 53 52 53 54 54 53 53 = mean 53.142857142857146

terrence@cloud:~$ for i in 0 1 2 3 4 5 6; do /usr/bin/time ./js-after -m -n test.js; done;
0.59user 0.04system 0:00.62elapsed 101%CPU (0avgtext+0avgdata 392704maxresident)k
0inputs+0outputs (0major+24626minor)pagefaults 0swaps
... eliding repetitions ...

And with the patch we get:
-> 59 58 60 61 58 60 57 = mean 59.0

Which is *much* larger than the same code was giving before I pulled today. This morning I had this down to < 54. I'm not entirely sure how I should proceed here. I think first I'm going to see if I can convince perf to give me useful data. If you have other ideas, I'd appreciate hearing them.
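The three-step digit count sketched in comment 3 can be written out as a standalone approximation. This is not the patch's actual code: step 2's switch table is replaced here by an equivalent multiply-and-shift approximation of bits * log10(2), and the power-of-ten table name is invented for illustration.

```cpp
#include <cassert>
#include <stdint.h>

// Step 1: loop to find the topmost set bit (the bit length) of the number.
static int BitLength(uint32_t n)
{
    int bits = 0;
    while (n) {
        ++bits;
        n >>= 1;
    }
    return bits;
}

// Smallest numbers with k+1 digits, indexed by k, for step 3's adjustment.
static const uint32_t powersOfTen[] = {
    1, 10, 100, 1000, 10000, 100000, 1000000,
    10000000, 100000000, 1000000000
};

// Step 2: map the bit length to a base digit count (1233/4096 ~= log10(2)).
// Step 3: adjust by comparing against the smallest number with one more digit.
static int Base10DigitCount(uint32_t n)
{
    if (n == 0)
        return 1;
    int base = (BitLength(n) * 1233) >> 12;
    return base + (n >= powersOfTen[base] ? 1 : 0);
}
```

This gives the exact digit count with one loop, one multiply-shift, one table load, and one compare, which matches the comment's point that a few local branches beat moving bytes around.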
Attachment #654364 -
Attachment is obsolete: true
Attachment #654364 -
Flags: review?(luke)
Attachment #654855 -
Flags: feedback?(luke)
Comment 4•12 years ago
(In reply to Terrence Cole [:terrence] from comment #3)

Thanks for the thorough reply. So SS seems fine. Could you run v8 from the v8 harness (js/src/v8/run.js) and tell me if there is any significant difference?

The micro-benchmark is a bit fake since it is converting a lot of large numbers and I suspect it is more common to convert 1- and 2-digit numbers (which are StaticStrings anyhow). I assume there is less overhead if you convert, say, "999"?

I don't see a better general scheme, so, assuming positive answers to the above two questions, the perf seems fine.
Assignee
Comment 5•12 years ago
(In reply to Luke Wagner [:luke] from comment #4)
> (In reply to Terrence Cole [:terrence] from comment #3)
> Thanks for the thorough reply. So SS seems fine. Could you run v8 from the
> v8 harness (js/src/v8/run.js) and tell me if there is any significant
> difference?

I'm not sure how much change is significant. Here is the raw output.

terrence@cloud:~/v8-orig$ ../js-before -m -n run.js
Richards: 8388
DeltaBlue: 9045
Crypto: 15292
RayTrace: 5624
EarleyBoyer: 9307
RegExp: 1649
Splay: 9135
NavierStokes: 14366
----
Score (version 7): 7759

terrence@cloud:~/v8-orig$ ../js-after -m -n run.js
Richards: 8363
DeltaBlue: 9038
Crypto: 15194
RayTrace: 5528
EarleyBoyer: 9353
RegExp: 1639
Splay: 9241
NavierStokes: 14323
----
Score (version 7): 7740

> The micro-benchmark is a bit fake since it is converting a lot of large
> numbers and I suspect it is more common to convert 1 and 2 digit numbers
> (which are StaticStrings anyhow). I assume there is less overhead if you
> convert, say "999"?

Well, both before and after seem to run much faster when converting the same number repeatedly, even after disabling the dtoa cache. (Aside: the dtoa cache saves about 10ms on the previous test.)

This test:
=========================================
for (var i = 0; i < 1000000000; ++i)
    ""+999;
=========================================
Has results:
Before: 2.3 seconds
After: 2.3 seconds

I thought I would try again on a larger number to see the time increase:
=========================================
for (var i = 0; i < 1000000000; ++i)
    ""+1000000000;
=========================================
Has results:
Before: 2.3 seconds
After: 2.3 seconds

Okay, that didn't work. So maybe JM is doing something clever here. Let's try again with a shorter loop count in the interpreter:
=========================================
for (var i = 0; i < 10000000; ++i)
    ""+1000000000;
=========================================
Has results:
Before: 2.31 seconds
After: 2.28 seconds

I did indeed have to check this twice -- JM appears to run almost exactly 100x faster than the interpreter on this test. I'm pretty sure there is a second cache after the dtoa cache that is catching this. Do you know where it is so I can test this?
Comment 6•12 years ago
Comment on attachment 654855 [details] [diff] [review]
v1

Ultimately, I suspect the perf for the original patch was fine without all this digit-counting trickery.
Attachment #654855 -
Flags: feedback?(luke)
Comment 7•12 years ago
Comment on attachment 654364 [details] [diff] [review]
v0

r+ with nits in comment 1 addressed.
Attachment #654364 -
Attachment is obsolete: false
Attachment #654364 -
Flags: review+
Comment 8•12 years ago
> Getting rid of the size class check allows
> us to inline the isInline check, paving the way for a fast ensureNotInline
> and removal of getAllocKind as a requirement for string placement in arenas.
Can you expand on that last part? Would it allow vanilla strings and short strings to share arenas? If so, that'd be good -- anything that reduces the number of arena kinds reduces GC heap fragmentation.
Assignee
Comment 9•12 years ago
(In reply to Nicholas Nethercote [:njn] from comment #8)
> > Getting rid of the size class check allows
> > us to inline the isInline check, paving the way for a fast ensureNotInline
> > and removal of getAllocKind as a requirement for string placement in arenas.
>
> Can you expand on that last part? Would it allow vanilla strings and short
> strings to share arenas? If so, that'd be good -- anything that reduces the
> number of arena kinds reduces GC heap fragmentation.

That is one of the long-term goals. Ideally we'd like to bump allocate and use a compacting GC to de-fragment, rather than our current free-list scheme. The main thing the arenas give us right now is access to the compartment pointer. We have plans for how we could get this in a different way, but the AllocKind is harder. I'm not sure how immediate our plans in this direction are: as you well know, we have our hands full getting our rooting story to a happy ending right now :-).
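The bump-allocation scheme mentioned in comment 9, in its simplest form, looks like the toy sketch below. This is an illustration only: a real GC allocator also deals with cell headers, per-arena metadata, and triggering a collection when the limit is hit.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Toy bump allocator: each allocation is an aligned pointer increment plus a
// limit check. A compacting GC would later slide live objects together and
// reset |cur|, de-fragmenting the heap without any free lists.
class BumpArena {
    uint8_t *cur;
    uint8_t *limit;
    uint8_t buffer[4096];

  public:
    BumpArena() : cur(buffer), limit(buffer + sizeof(buffer)) {}

    void *allocate(size_t bytes) {
        bytes = (bytes + 7) & ~size_t(7);  // round up to 8-byte alignment
        if (bytes > size_t(limit - cur))
            return NULL;  // real code would trigger a GC here
        void *result = cur;
        cur += bytes;
        return result;
    }

    size_t used() const { return size_t(cur - buffer); }
};
```

With this scheme, allocation speed no longer depends on per-AllocKind free lists, which is why shrinking the set of arena kinds is a prerequisite.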
Assignee
Comment 10•12 years ago
Green try run at: https://tbpl.mozilla.org/?tree=Try&rev=b9dec21ebfda

Pushed at: https://hg.mozilla.org/integration/mozilla-inbound/rev/539643ca53ec
Comment 11•12 years ago
https://hg.mozilla.org/mozilla-central/rev/539643ca53ec
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18