Created attachment 438651 [details]
In parsing large payloads (like GMail in the attached), the fixed-size arena blocks are probably too small for the number of parse nodes we create.
They're allocated out of the context's temp pool: we take a watermark, allocate whatever fixed-size blocks we need (keeping a free-list of parse nodes for recycling), then free back to the watermark when the parser is destroyed.
As noted in bug 421435 (stagnant for two years now), the blocks in the temp pool are fixed size. I have to read up on arena allocation theory, but it seems to make sense to have some kind of vector-like allocation strategy for the parse node arena, doubling the size of each new pool.
I should probably try to collect some stats on the productions that are creating parse nodes as well.
Forgot to note that in the GMail parse under Shark, 10% of the time was getting eaten up by vm_fault -- only 85% of the time was spent in userspace.
One of the problems of a doubling vector is that parse nodes (potentially) get reallocated. I have argued for using the VM more before. Allocate a really large block (32MB) and use it. If you run out of memory, throw it away, allocate 64MB, parse again, and so on. For the vast majority of cases 32MB should do just fine. As long as you don't touch most of it, we pay the same price for a small mmap as we would for a large one. Unmap when done. As long as the underlying OS doesn't commit memory eagerly, this should work in most places. Windows CE might be an issue, but it looks like we don't really ship there anyway. Let's run some statistics. How much space do we need anyway? 32MB is probably aiming way too high.
With 4K arena allocation size in the temp pool we see the following
butt-kicking speedups on Linux/64:
$ python compare_bench.py ~/tm/results/parsemark_tm_4Kpool_2010_04_12.txt
tetris_combined.js: faster: 4.19ms < baseline 5.26ms (+20.34%)
gravity_combined.js: faster: 24.67ms < baseline 39.08ms (+36.87%)
ga.js: faster: 3.45ms < baseline 3.67ms ( +5.99%)
twitter_combined.js: faster: 20.30ms < baseline 27.81ms (+27.00%)
jquery.min.js: faster: 5.23ms < baseline 6.84ms (+23.54%)
zimbra_combined.js: faster: 127.30ms < baseline 192.73ms (+33.95%)
jsgb_combined.js: faster: 9.79ms < baseline 13.42ms (+27.05%)
280slides_combined.js: faster: 3.63ms < baseline 4.06ms (+10.59%)
v8_combined.js: faster: 22.43ms < baseline 27.83ms (+19.40%)
dojo.js: faster: 8.27ms < baseline 9.74ms (+15.09%)
gmail_combined.js: faster: 97.48ms < baseline 146.14ms (+33.30%)
effectgames_galaxy.js: faster: 16.14ms < baseline 23.94ms (+32.58%)
ball_pool_combined.js: faster: 16.13ms < baseline 20.48ms (+21.24%)
MochiKit.js: faster: 16.54ms < baseline 20.94ms (+21.01%)
pipio_combined.js: faster: 82.18ms < baseline 127.90ms (+35.75%)
sunspider_combined.js: faster: 13.28ms < baseline 14.84ms (+10.51%)
Average speedup: 23.39%
Will have numbers for OS X 64b soon.
Chris, can you file dependent bugs on any other remaining uses of the arena code? I have seen similar badness elsewhere (JS stacks, but Luke might have fixed all of that by now).
OS X Snow Leopard 64b numbers:
gravity_combined.js: faster: 39.26ms < baseline 41.37ms ( +5.11%)
processing-0.8.min.js: faster: 18.22ms < baseline 18.40ms ( +0.96%)
twitter_combined.js: faster: 29.63ms < baseline 30.40ms ( +2.52%)
gmail_combined.js: faster: 155.83ms < baseline 175.48ms (+11.20%)
v8_combined.js: faster: 33.26ms < baseline 33.89ms ( +1.86%)
zimbra_combined.js: faster: 204.31ms < baseline 234.79ms (+12.98%)
effectgames_galaxy.js: faster: 24.98ms < baseline 25.74ms ( +2.95%)
ball_pool_combined.js: faster: 25.25ms < baseline 25.79ms ( +2.09%)
MochiKit.js: faster: 24.96ms < baseline 25.39ms ( +1.70%)
pipio_combined.js: faster: 125.87ms < baseline 142.85ms (+11.89%)
Average speedup: 5.33%
Luke is not using arenas for stacks, he's using the VM mapping thing to avoid address instability.
Let's use bigger chunks! Make our hay while the sun doth shine, to quote G&S. No need to do more if profiling doesn't support it.
How does 16K compare -- I'm not kidding.
(In reply to comment #2)
> One of the problems of a doubling vector is that parse nodes get reallocated
Right. I was thinking about this, and I'm not sure why non-contiguous arena chunks would be much worse than a contiguous allocation space if the base pool size were sufficiently large. You always add a new chunk of the total size of the current space; i.e. 1 -> [1, 1] -> [1, 1, 2] -> [1, 1, 2, 4], etc. We're using a recycled-node-list anyway, so we'd only ever be doling out of the last allocated chunk.
> If you run out of memory, throw it away, allocate 64MB, parse again, and so
> on. For the vast majority of cases 32MB should do just fine.
That seems like an awful behavior for scripts that require 32MB + sizeof(ParseNode).
> As long as you don't touch most of it, we pay the same price for a small mmap
> as we would for a large one. Unmap when done. As long as the underlying OS
> doesn't commit memory eagerly, this should work in most places.
I like it, but I'm not a portability expert. Do we do this elsewhere?
Comment 2 is off the mark, especially on OOM handling -- we just need bigger chunks. Measure up to a large size, gnuplot the results, and attach the image. Internal fragmentation will not be as bad as you say; we recycle in the non-terminal chunks.
> > As long as you don't touch most of it, we pay the same price for a small
> > mmap as we would for a large one. Unmap when done. As long as the
> > underlying OS doesn't commit memory eagerly, this should work in most places.
> I like it, but I'm not a portability expert. Do we do this elsewhere?
It's the year 2010 of our Lord. If your OS doesn't have an MMU and virtual memory, go get an OS for grownups. I am pretty sure it's true for every platform we run on.
The issue is OOM handling, not VM -- although check with the Mobile folks before grabbing great heaps of VM (page table mappings take up whole pages and TLB entries for running processes, on some OSes).
But really, a doubling vector with stable virtual addressing is overkill here. Leary will do the science and vindicate my claim, I'm sure.
Commencing with the science! http://imgs.xkcd.com/comics/science_montage.png
Created attachment 438832 [details]
Parse time versus pool size.
Small scripts are relatively unaffected, as one would expect. Looks like 4KiB may be a sweet spot. 16KiB is better, but maybe not by enough. (It's hard to tell the whole system effects by just looking at parsing in isolation.)
Created attachment 438840 [details]
Perf increase versus pool size.
Busier plot, but perhaps easier to see the important trends.
I ripped the lazy parts out of V8 and forced it to eagerly generate an AST, and we fare much better against it now (we win one, lose a bunch):
$ python parsemark.py /home/cdleary/v8/v8-stable/v8 /home/cdleary/tm/parse-tests/ -q -b /home/cdleary/tm/results/pool_size/tm_4K_pool_size.json
gravity_combined.js: faster: 20.93ms < baseline 22.37ms ( +6.46%)
ga.js: faster: 2.25ms < baseline 2.33ms ( +3.55%)
zimbra_combined.js: faster: 93.95ms < baseline 115.92ms (+18.95%)
v8_combined.js: faster: 17.71ms < baseline 18.43ms ( +3.92%)
dojo.js: faster: 6.17ms < baseline 6.57ms ( +6.15%)
gmail_combined.js: faster: 62.81ms < baseline 91.00ms (+30.97%)
effectgames_galaxy.js: faster: 12.89ms < baseline 13.80ms ( +6.63%)
MochiKit.js: faster: 12.85ms < baseline 14.10ms ( +8.90%)
pipio_combined.js: faster: 66.95ms < baseline 73.58ms ( +9.01%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 12.60ms (-64.67%)
Average speedup: 2.99%
As opposed to before:
$ python compare_bench.py /home/cdleary/tm/results/parsemark_v8_eager_2010_04_13.json /home/cdleary/tm/results/parsemark_tm_2010_04_12.txt
tetris_combined.js: faster: 3.69ms < baseline 5.26ms (+29.85%)
gravity_combined.js: faster: 20.93ms < baseline 39.08ms (+46.44%)
ga.js: faster: 2.25ms < baseline 3.67ms (+38.69%)
twitter_combined.js: faster: 18.81ms < baseline 27.81ms (+32.36%)
jquery.min.js: faster: 4.30ms < baseline 6.84ms (+37.13%)
zimbra_combined.js: faster: 93.95ms < baseline 192.73ms (+51.25%)
jsgb_combined.js: faster: 10.22ms < baseline 13.42ms (+23.85%)
280slides_combined.js: faster: 3.23ms < baseline 4.06ms (+20.44%)
v8_combined.js: faster: 17.71ms < baseline 27.83ms (+36.36%)
dojo.js: faster: 6.17ms < baseline 9.74ms (+36.65%)
processing_twitch.js: faster: 2.25ms < baseline 2.79ms (+19.35%)
gmail_combined.js: faster: 62.81ms < baseline 146.14ms (+57.02%)
effectgames_galaxy.js: faster: 12.89ms < baseline 23.94ms (+46.16%)
ball_pool_combined.js: faster: 14.29ms < baseline 20.48ms (+30.22%)
yui-min.js: faster: 1.24ms < baseline 1.79ms (+30.73%)
MochiKit.js: faster: 12.85ms < baseline 20.94ms (+38.63%)
pipio_combined.js: faster: 66.95ms < baseline 127.90ms (+47.65%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 15.12ms (-37.24%)
Average speedup: 32.53%
If we think it's worth the time, I'll instrument to get some production statistics and see if the big wins/losses stand out somehow.
(In reply to comment #14)
> Created an attachment (id=438840) [details]
> Perf increase versus pool size.
Eggsalad. Now we need to see the same plot on ARM.
Created attachment 439562 [details]
ARM parse time versus pool size.
This set of data is a lot noisier -- big error bars.
Created attachment 439563 [details]
ARM perf increase versus pool size.
Created attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
Okay, I'm going with 4K - ARENA_HEADER_SIZE for now, as it demonstrates a parsing speedup and doesn't have a significant effect on sunspider scores on x64 or ARM.
On my Linux x64, I found the following relationship on parsemark (from slow to fast, <~ meaning approximately equal, but strictly less than):
1K-40 <~ 1K < 4K <~ 4K - HEADER_SIZE <~ 16K <~ 16K - HEADER_SIZE
On OS X i386:
16K - HEADER_SIZE < 1K < 1K - HEADER_SIZE < 4K <~ 4K - HEADER_SIZE
That being said, I'm still looking at exactly what size mallocs we perform as a result of arena chunks being pulled, which VM pages those end up corresponding to, and checking for thrash with a little printf-based observation.
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
I claim we should take this trivial patch to win now. But also file a follow-up to see if we can observe consistent wins when multiplying a constant by pointer size.
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
that sure is an increase.
What'd be really nice is if we could use sysconf's PAGE_SIZE value. From a cursory analysis I see that the doubling approach reduces the 5761 malloc/frees out of the temp pool to ~13. Note that we'd get screwed on the larger arenas, where chunk utilization is low, if the system's vmem doesn't support a mapped-but-not-wired state.
imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how many programs like imacros really exist and require performance, but might as well fix it.)
We also have a few oddball arena frees when the arenas are < 50% utilized, some as low as 1%. Still investigating those.
(In reply to comment #22)
> What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a
> cursory analysis I see that we can reduce 5761 malloc/frees out of the temp
> pool into ~13 with the doubling approach.
Can we hang onto the last arena for a while (GC-based time caching, see gcEmptyArenaPoolLifespan)?
> imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how
> many programs like imacros really exist and require performance, but might as
> well fix it.)
Definitely -- on file?
(In reply to comment #23)
> Can we hang onto the last arena for a while (GC-based time caching, see
Yeah, I've been calling this the "second" arena, since the fakey one before it is called "first" and is a member structure of the pool. So long as API clients take marks and release to them instead of calling |pool.free()|, the GC should reap the "second" arena when it reaches a ripe old age.
> > imacros.js also shows we thrash the regexp pool
> on file?
new work to new bugs.