Created attachment 438651 [details]
In parsing large payloads (like GMail in the attached), the fixed-size arena blocks are probably too small for the number of parse nodes we create.
They're allocated out of the context's temp pool: we take a watermark, allocate whatever fixed-size blocks we need (keeping a free-list of parse nodes for recycling), then free back to the watermark when the parser is destroyed.
As noted in bug 421435 (stagnant for two years now), the blocks in the temp pool are fixed size. I have to read up on arena allocation theory, but it seems to make sense to have some kind of vector-like allocation strategy for the parse node arena, doubling the size of each new pool.
I should probably try to collect some stats on the productions that are creating parse nodes as well.
Forgot to note that in the GMail parse under Shark, 10% of the time was getting eaten up by vm_fault -- only 85% of the time was spent in userspace.
One of the problems of a doubling vector is that parse nodes (potentially) get reallocated. I have argued for using the VM more before. Allocate a really large block (32MB) and use it. If you run out of memory, throw it away, allocate 64MB, parse again, and so on. For the vast majority of cases 32MB should do just fine. As long as you don't touch most of it, we pay the same price for a small mmap as we would for a large one. Unmap when done. As long as the underlying OS doesn't commit memory eagerly, this should work in most places. Windows CE might be an issue, but it looks like we don't really ship there anyway. Let's run some statistics. How much space do we need anyway? 32MB is probably aiming way too high.
With 4K arena allocation size in the temp pool we see the following
butt-kicking speedups on Linux/64:
$ python compare_bench.py ~/tm/results/parsemark_tm_4Kpool_2010_04_12.txt
tetris_combined.js: faster: 4.19ms < baseline 5.26ms (+20.34%)
gravity_combined.js: faster: 24.67ms < baseline 39.08ms (+36.87%)
ga.js: faster: 3.45ms < baseline 3.67ms ( +5.99%)
twitter_combined.js: faster: 20.30ms < baseline 27.81ms (+27.00%)
jquery.min.js: faster: 5.23ms < baseline 6.84ms (+23.54%)
zimbra_combined.js: faster: 127.30ms < baseline 192.73ms (+33.95%)
jsgb_combined.js: faster: 9.79ms < baseline 13.42ms (+27.05%)
280slides_combined.js: faster: 3.63ms < baseline 4.06ms (+10.59%)
v8_combined.js: faster: 22.43ms < baseline 27.83ms (+19.40%)
dojo.js: faster: 8.27ms < baseline 9.74ms (+15.09%)
gmail_combined.js: faster: 97.48ms < baseline 146.14ms (+33.30%)
effectgames_galaxy.js: faster: 16.14ms < baseline 23.94ms (+32.58%)
ball_pool_combined.js: faster: 16.13ms < baseline 20.48ms (+21.24%)
MochiKit.js: faster: 16.54ms < baseline 20.94ms (+21.01%)
pipio_combined.js: faster: 82.18ms < baseline 127.90ms (+35.75%)
sunspider_combined.js: faster: 13.28ms < baseline 14.84ms (+10.51%)
Average speedup: 23.39%
Will have numbers for OS X 64b soon.
Chris, can you file dependent bugs on any other remaining uses of the arena code? I have seen similar badness elsewhere (JS stacks, but Luke might have fixed all of that by now).
OS X Snow Leopard 64b numbers:
gravity_combined.js: faster: 39.26ms < baseline 41.37ms ( +5.11%)
processing-0.8.min.js: faster: 18.22ms < baseline 18.40ms ( +0.96%)
twitter_combined.js: faster: 29.63ms < baseline 30.40ms ( +2.52%)
gmail_combined.js: faster: 155.83ms < baseline 175.48ms (+11.20%)
v8_combined.js: faster: 33.26ms < baseline 33.89ms ( +1.86%)
zimbra_combined.js: faster: 204.31ms < baseline 234.79ms (+12.98%)
effectgames_galaxy.js: faster: 24.98ms < baseline 25.74ms ( +2.95%)
ball_pool_combined.js: faster: 25.25ms < baseline 25.79ms ( +2.09%)
MochiKit.js: faster: 24.96ms < baseline 25.39ms ( +1.70%)
pipio_combined.js: faster: 125.87ms < baseline 142.85ms (+11.89%)
Average speedup: 5.33%
Luke is not using arenas for stacks, he's using the VM mapping thing to avoid address instability.
Let's use bigger chunks! Make our hay while the sun doth shine, to quote G&S. No need to do more if profiling doesn't support it.
How does 16K compare -- I'm not kidding.
(In reply to comment #2)
> One of the problems of a doubling vector is that parse nodes get reallocated
Right. I was thinking about this, and I'm not sure why non-contiguous arena chunks would be much worse than a contiguous allocation space if the base pool size were sufficiently large. You always add a new chunk of the total size of the current space; i.e. 1 -> [1, 1] -> [1, 1, 2] -> [1, 1, 2, 4], etc. We're using a recycled-node-list anyway, so we'd only ever be doling out of the last allocated chunk.
> If you run out of memory, throw it away, allocate 64MB, parse again, and so
> on. For the vast majority of cases 32MB should do just fine.
That seems like an awful behavior for scripts that require 32MB + sizeof(ParseNode).
> As long as you don't touch most of it, we pay the same price for a small mmap
> as we would for a large one. Unmap when done. As long as the underlying OS
> doesn't commit memory eagerly, this should work in most places.
I like it, but I'm not a portability expert. Do we do this elsewhere?
Comment 2 is off the mark, especially on OOM handling -- we just need bigger chunks. Measure up to a large size, gnuplot the results, and attach the image. Internal fragmentation will not be as bad as you say; we recycle in the non-terminal chunks.
> > As long as you don't touch most of it, we pay the same price for a small
> > mmap as we would for a large one. Unmap when done. As long as the
> > underlying OS doesn't commit memory eagerly, this should work in most places.
> I like it, but I'm not a portability expert. Do we do this elsewhere?
It's the year 2010 of our Lord. If your OS doesn't have an MMU and virtual memory, go get an OS for grownups. I am pretty sure it's true for every platform we run on.
The issue is OOM handling, not VM -- although check with the Mobile folks before grabbing great heaps of VM (page table mappings take up whole pages and TLB entries for running processes, on some OSes).
But really, a doubling vector with stable virtual addressing is overkill here. Leary will do the science and vindicate my claim, I'm sure.
Commencing with the science! http://imgs.xkcd.com/comics/science_montage.png
Created attachment 438832 [details]
Parse time versus pool size.
Small scripts are relatively unaffected, as one would expect. Looks like 4KiB may be a sweet spot. 16KiB is better, but maybe not by enough. (It's hard to tell the whole system effects by just looking at parsing in isolation.)
Created attachment 438840 [details]
Perf increase versus pool size.
Busier plot, but perhaps easier to see the important trends.
I ripped the lazy parts out of V8 and forced it to eagerly generate an AST, and we fare much better against it now (we win one, lose a bunch):
$ python parsemark.py /home/cdleary/v8/v8-stable/v8 /home/cdleary/tm/parse-tests/ -q -b /home/cdleary/tm/results/pool_size/tm_4K_pool_size.json
gravity_combined.js: faster: 20.93ms < baseline 22.37ms ( +6.46%)
ga.js: faster: 2.25ms < baseline 2.33ms ( +3.55%)
zimbra_combined.js: faster: 93.95ms < baseline 115.92ms (+18.95%)
v8_combined.js: faster: 17.71ms < baseline 18.43ms ( +3.92%)
dojo.js: faster: 6.17ms < baseline 6.57ms ( +6.15%)
gmail_combined.js: faster: 62.81ms < baseline 91.00ms (+30.97%)
effectgames_galaxy.js: faster: 12.89ms < baseline 13.80ms ( +6.63%)
MochiKit.js: faster: 12.85ms < baseline 14.10ms ( +8.90%)
pipio_combined.js: faster: 66.95ms < baseline 73.58ms ( +9.01%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 12.60ms (-64.67%)
Average speedup: 2.99%
As opposed to before:
$ python compare_bench.py /home/cdleary/tm/results/parsemark_v8_eager_2010_04_13.json /home/cdleary/tm/results/parsemark_tm_2010_04_12.txt
tetris_combined.js: faster: 3.69ms < baseline 5.26ms (+29.85%)
gravity_combined.js: faster: 20.93ms < baseline 39.08ms (+46.44%)
ga.js: faster: 2.25ms < baseline 3.67ms (+38.69%)
twitter_combined.js: faster: 18.81ms < baseline 27.81ms (+32.36%)
jquery.min.js: faster: 4.30ms < baseline 6.84ms (+37.13%)
zimbra_combined.js: faster: 93.95ms < baseline 192.73ms (+51.25%)
jsgb_combined.js: faster: 10.22ms < baseline 13.42ms (+23.85%)
280slides_combined.js: faster: 3.23ms < baseline 4.06ms (+20.44%)
v8_combined.js: faster: 17.71ms < baseline 27.83ms (+36.36%)
dojo.js: faster: 6.17ms < baseline 9.74ms (+36.65%)
processing_twitch.js: faster: 2.25ms < baseline 2.79ms (+19.35%)
gmail_combined.js: faster: 62.81ms < baseline 146.14ms (+57.02%)
effectgames_galaxy.js: faster: 12.89ms < baseline 23.94ms (+46.16%)
ball_pool_combined.js: faster: 14.29ms < baseline 20.48ms (+30.22%)
yui-min.js: faster: 1.24ms < baseline 1.79ms (+30.73%)
MochiKit.js: faster: 12.85ms < baseline 20.94ms (+38.63%)
pipio_combined.js: faster: 66.95ms < baseline 127.90ms (+47.65%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 15.12ms (-37.24%)
Average speedup: 32.53%
If we think it's worth the time, I'll instrument to get some production statistics and see if the big wins/losses stand out somehow.
(In reply to comment #14)
> Created an attachment (id=438840) [details]
> Perf increase versus pool size.
Eggsalad. Now we need to see the same plot on ARM.
Created attachment 439562 [details]
ARM parse time versus pool size.
This set of data is a lot noisier -- big error bars.
Created attachment 439563 [details]
ARM perf increase versus pool size.
Created attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
Okay, I'm going with 4K - ARENA_HEADER_SIZE for now, as it demonstrates a parsing speedup and doesn't have a significant effect on sunspider scores on x64 or ARM.
On my Linux x64, I found the following relationship on parsemark (from slow to fast, <~ meaning approximately equal, but strictly less than):
1K-40 <~ 1K < 4K <~ 4K - HEADER_SIZE <~ 16K <~ 16K - HEADER_SIZE
On OS X i386:
16K - HEADER_SIZE < 1K < 1K - HEADER_SIZE < 4K <~ 4K - HEADER_SIZE
That being said, I'm still looking at exactly what size mallocs we perform as a result of arena chunks being pulled, which VM pages those end up corresponding to, and checking for thrash with a little printf-based observation.
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
I claim we should take this trivial patch to win now. But also file a follow-up to see if we can observe consistent wins when multiplying a constant by pointer size.
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
that sure is an increase.
What'd be really nice is if we could use sysconf's PAGE_SIZE value. From a cursory analysis I see that the doubling approach reduces the 5761 malloc/frees out of the temp pool to ~13. Note that we'd get screwed on the larger arenas, where chunk utilization is low, if the system's vmem doesn't support a mapped-but-not-wired state.
imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how many programs like imacros really exist and require performance, but might as well fix it.)
We also have a few oddball arena frees when the arenas are < 50% utilized, some as low as 1%. Still investigating those.
(In reply to comment #22)
> What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a
> cursory analysis I see that we can reduce 5761 malloc/frees out of the temp
> pool into ~13 with the doubling approach.
Can we hang onto the last arena for a while (GC-based time caching, see gcEmptyArenaPoolLifespan)?
> imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how
> many programs like imacros really exist and require performance, but might as
> well fix it.)
Definitely -- on file?
(In reply to comment #23)
> Can we hang onto the last arena for a while (GC-based time caching, see
Yeah, I've been calling this the "second" arena, since the fakey one before it is called "first" and is a member structure of the pool. So long as API clients take marks and release to them instead of calling |pool.free()|, the GC should reap the "second" arena when it reaches a ripe old age.
> > imacros.js also shows we thrash the regexp pool
> on file?
new work to new bugs.