Bug 558971 - Parser arena allocation overhead is too high
Status: RESOLVED FIXED (fixed-in-tracemonkey)
Product: Core
Component: JavaScript Engine
Platform: All / All
Assigned To: Chris Leary [:cdleary] (not checking bugmail)
QA Contact: Jason Orendorff [:jorendorff]
Reported: 2010-04-12 19:51 PDT by Chris Leary [:cdleary] (not checking bugmail)
Modified: 2010-04-24 15:49 PDT

Attachments:
- Massif output. (738.39 KB, text/plain), 2010-04-12 19:51 PDT, Chris Leary [:cdleary]
- Parse time versus pool size. (255.81 KB, image/png), 2010-04-13 13:24 PDT, Chris Leary [:cdleary]
- Perf increase versus pool size. (304.06 KB, image/png), 2010-04-13 14:00 PDT, Chris Leary [:cdleary]
- ARM parse time versus pool size. (121.13 KB, image/png), 2010-04-16 10:14 PDT, Chris Leary [:cdleary]
- ARM perf increase versus pool size. (196.02 KB, image/png), 2010-04-16 10:15 PDT, Chris Leary [:cdleary]
- Temp pool chunk size increase. (2.29 KB, patch), 2010-04-22 15:42 PDT, Chris Leary [:cdleary] (sayrer: review+, shaver: review+)

Description Chris Leary [:cdleary] (not checking bugmail) 2010-04-12 19:51:22 PDT
Created attachment 438651 [details]
Massif output.

In parsing large payloads (like GMail in the attached), the fixed-size arena blocks are probably too small for the number of parse nodes we create.

They're allocated out of the context's temp pool: we take a watermark, allocate out whatever fixed size blocks we need (keeping a free-list of parsenodes for recycling), then free back to the watermark when the parser is destroyed.

As noted in bug 421435 (which has stagnated for two years), the blocks in the temp pool are fixed size. I have to read up on arena allocation theory, but it seems to make sense to use some kind of vector-like allocation strategy for the parse node arena, doubling the size of each newly allocated pool.
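A minimal sketch of the vector-like doubling strategy described above (the class and all names are hypothetical illustrations, not SpiderMonkey's actual JSArenaPool API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical doubling arena: each newly malloc'd chunk doubles the
// previous chunk size, so allocating N total bytes takes only O(log N)
// mallocs, and existing parse nodes never move (chunks stay put).
class DoublingArena {
public:
    explicit DoublingArena(size_t firstChunkSize = 4096)
        : nextChunkSize_(firstChunkSize), cursor_(nullptr), remaining_(0) {}

    ~DoublingArena() {
        for (char* chunk : chunks_)
            delete[] chunk;
    }

    void* allocate(size_t bytes) {
        bytes = (bytes + 7) & ~size_t(7);          // 8-byte alignment
        if (bytes > remaining_) {
            while (nextChunkSize_ < bytes)
                nextChunkSize_ *= 2;
            chunks_.push_back(new char[nextChunkSize_]);
            cursor_ = chunks_.back();
            remaining_ = nextChunkSize_;
            nextChunkSize_ *= 2;                   // double for next time
        }
        void* node = cursor_;
        cursor_ += bytes;
        remaining_ -= bytes;
        return node;
    }

    size_t chunkCount() const { return chunks_.size(); }

private:
    std::vector<char*> chunks_;   // chunk addresses remain stable
    size_t nextChunkSize_;
    char* cursor_;
    size_t remaining_;
};
```

Since the parser keeps a free list of recycled nodes anyway, fresh bump allocation only ever happens out of the last chunk; earlier chunks are served by the recycle list.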

I should probably try to collect some stats on the productions that are creating parse nodes as well.
Comment 1 Chris Leary [:cdleary] (not checking bugmail) 2010-04-12 19:57:46 PDT
Forgot to note that in the GMail parse under Shark, 10% of the time was getting eaten up by vm_fault -- only 85% of the time was spent in userspace.
Comment 2 Andreas Gal :gal 2010-04-12 20:17:01 PDT
One of the problems of a doubling vector is that parse nodes (potentially) get reallocated. I have argued for using the VM more before. Allocate a really large block (32MB) and use it. If you run OOM, throw it away, allocate 64MB, parse again, and so on. For the vast majority of cases 32MB should do just fine. As long as you don't touch most of it, we pay the same price for a small mmap as we would for a large one. Unmap when done. As long as the underlying OS doesn't commit memory eagerly, this should work in most places. Windows CE might be an issue, but it looks like we don't really ship there anyway. Let's run some statistics. How much space do we need anyway? 32MB is probably aiming way too high.
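A rough POSIX-only sketch of the reserve-and-retry scheme described in this comment (names are illustrative, not SpiderMonkey APIs; real code would also need a Windows VirtualAlloc path and the OOM-retry loop in the caller):

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Reserve a large virtual range up front. Most OSes commit pages lazily on
// first touch, so an untouched 32MB reservation costs about the same as a
// small one. If the parse outgrows it, the caller unmaps, doubles the
// reservation, and re-parses from scratch.
struct VmArena {
    char*  base = nullptr;
    size_t size = 0;
    size_t used = 0;

    bool reserve(size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return false;
        base = static_cast<char*>(p);
        size = bytes;
        used = 0;
        return true;
    }

    // Bump-allocate; nullptr means "out of reservation", signalling the
    // caller to release(), reserve(2 * size), and retry the parse.
    void* allocate(size_t bytes) {
        bytes = (bytes + 7) & ~size_t(7);
        if (used + bytes > size)
            return nullptr;
        void* p = base + used;
        used += bytes;
        return p;
    }

    void release() {
        if (base)
            munmap(base, size);
        base = nullptr;
        size = used = 0;
    }
};
```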
Comment 3 Chris Leary [:cdleary] (not checking bugmail) 2010-04-12 20:19:06 PDT
With 4K arena allocation size in the temp pool we see the following
butt-kicking speedups on Linux/64:

$ python compare_bench.py ~/tm/results/parsemark_tm_4Kpool_2010_04_12.txt
~/tm/results/parsemark_tm_2010_04_12.txt 
   tetris_combined.js: faster:   4.19ms < baseline   5.26ms (+20.34%)
  gravity_combined.js: faster:  24.67ms < baseline  39.08ms (+36.87%)
                ga.js: faster:   3.45ms < baseline   3.67ms ( +5.99%)
  twitter_combined.js: faster:  20.30ms < baseline  27.81ms (+27.00%)
        jquery.min.js: faster:   5.23ms < baseline   6.84ms (+23.54%)
   zimbra_combined.js: faster: 127.30ms < baseline 192.73ms (+33.95%)
     jsgb_combined.js: faster:   9.79ms < baseline  13.42ms (+27.05%)
280slides_combined.js: faster:   3.63ms < baseline   4.06ms (+10.59%)
       v8_combined.js: faster:  22.43ms < baseline  27.83ms (+19.40%)
              dojo.js: faster:   8.27ms < baseline   9.74ms (+15.09%)
 processing_twitch.js: Meh.
    gmail_combined.js: faster:  97.48ms < baseline 146.14ms (+33.30%)
effectgames_galaxy.js: faster:  16.14ms < baseline  23.94ms (+32.58%)
ball_pool_combined.js: faster:  16.13ms < baseline  20.48ms (+21.24%)
           yui-min.js: Meh.
          MochiKit.js: faster:  16.54ms < baseline  20.94ms (+21.01%)
    pipio_combined.js: faster:  82.18ms < baseline 127.90ms (+35.75%)
sunspider_combined.js: faster:  13.28ms < baseline  14.84ms (+10.51%)

Average speedup: 23.39%

Will have numbers for OS X 64b soon.
Comment 4 Andreas Gal :gal 2010-04-12 20:27:03 PDT
Chris, can you file dependent bugs on any other remaining uses of the arena code? I have seen similar badness elsewhere (JS stacks, but Luke might have fixed all of that by now).
Comment 5 Chris Leary [:cdleary] (not checking bugmail) 2010-04-12 20:55:49 PDT
OS X Snow Leopard 64b numbers:

           gravity_combined.js: faster:  39.26ms < baseline  41.37ms ( +5.11%)
         processing-0.8.min.js: faster:  18.22ms < baseline  18.40ms ( +0.96%)
                         ga.js: Meh.
           twitter_combined.js: faster:  29.63ms < baseline  30.40ms ( +2.52%)
                 jquery.min.js: Meh.
             gmail_combined.js: faster: 155.83ms < baseline 175.48ms (+11.20%)
            tetris_combined.js: Meh.
         280slides_combined.js: Meh.
                v8_combined.js: faster:  33.26ms < baseline  33.89ms ( +1.86%)
                       dojo.js: Meh.
          processing_twitch.js: Meh.
              jsgb_combined.js: Meh.
            zimbra_combined.js: faster: 204.31ms < baseline 234.79ms (+12.98%)
         effectgames_galaxy.js: faster:  24.98ms < baseline  25.74ms ( +2.95%)
         ball_pool_combined.js: faster:  25.25ms < baseline  25.79ms ( +2.09%)
                    yui-min.js: Meh.
                   MochiKit.js: faster:  24.96ms < baseline  25.39ms ( +1.70%)
             pipio_combined.js: faster: 125.87ms < baseline 142.85ms (+11.89%)
         sunspider_combined.js: Meh.

Average speedup: 5.33%
Comment 6 Brendan Eich [:brendan] 2010-04-12 23:10:18 PDT
Luke is not using arenas for stacks, he's using the VM mapping thing to avoid address instability.

Let's use bigger chunks! Make our hay while the sun doth shine, to quote G&S. No need to do more if profiling doesn't support it.

/be
Comment 7 Brendan Eich [:brendan] 2010-04-12 23:10:37 PDT
How does 16K compare -- I'm not kidding.

/be
Comment 8 Chris Leary [:cdleary] (not checking bugmail) 2010-04-13 10:23:01 PDT
(In reply to comment #2)
> One of the problems of a doubling vector is that parse nodes get reallocated
> (potentially).

Right. I was thinking about this, and I'm not sure why non-contiguous arena chunks would be much worse than a contiguous allocation space if the base pool size were sufficiently large. You always add a new chunk of the total size of the current space; i.e. 1 -> [1, 1] -> [1, 1, 2] -> [1, 1, 2, 4], etc. We're using a recycled-node-list anyway, so we'd only ever be doling out of the last allocated chunk.

> If you run oom, throw it away, allocate 64MB,
> parse again and so on. For the vast majority of case 32MB should do just fine.

That seems like an awful behavior for scripts that require 32MB + sizeof(ParseNode).

> As long you don't touch most of it, we pay the same price for a small mmap as
> we would for a large one. Unmap when done. As long the underlying OS doesn't
> commit memory eagerly, this should work in most places.

I like it, but I'm not a portability expert. Do we do this elsewhere?
Comment 9 Brendan Eich [:brendan] 2010-04-13 10:30:54 PDT
Comment 2 is off the mark, especially on OOM handling -- we just need bigger chunks. Measure up to a large size, gnuplot the results, and attach the image. Internal fragmentation will not be bad, as you say; we recycle in the non-terminal chunks.

/be
Comment 10 Andreas Gal :gal 2010-04-13 10:56:42 PDT
> > As long you don't touch most of it, we pay the same price for a small mmap as
> > we would for a large one. Unmap when done. As long the underlying OS doesn't
> > commit memory eagerly, this should work in most places.
> 
> I like it, but I'm not a portability expert. Do we do this elsewhere?

It's the year 2010 of the Lord. If your OS doesn't have an MMU and virtual memory, go get an OS for grownups. I am pretty sure it's true for every platform we run on.
Comment 11 Brendan Eich [:brendan] 2010-04-13 11:11:08 PDT
The issue is OOM handling, not VM -- although check with the Mobile folks before grabbing great heaps of VM (page table mappings take up whole pages and TLB entries for running processes, on some OSes).

But really, a doubling vector with stable virtual addressing is overkill here. Leary will do the science and vindicate my claim, I'm sure.

/be
Comment 12 Chris Leary [:cdleary] (not checking bugmail) 2010-04-13 11:20:59 PDT
Commencing with the science! http://imgs.xkcd.com/comics/science_montage.png
Comment 13 Chris Leary [:cdleary] (not checking bugmail) 2010-04-13 13:24:58 PDT
Created attachment 438832 [details]
Parse time versus pool size.

Small scripts are relatively unaffected, as one would expect. Looks like 4KiB may be a sweet spot. 16KiB is better, but maybe not by enough. (It's hard to tell the whole system effects by just looking at parsing in isolation.)
Comment 14 Chris Leary [:cdleary] (not checking bugmail) 2010-04-13 14:00:59 PDT
Created attachment 438840 [details]
Perf increase versus pool size.

Busier plot, but perhaps easier to see the important trends.
Comment 15 Chris Leary [:cdleary] (not checking bugmail) 2010-04-13 18:16:47 PDT
I ripped the lazy parts out of V8 and forced it to eagerly generate an AST, and we fare much better now (win one, lose a bunch):

$ python parsemark.py /home/cdleary/v8/v8-stable/v8 /home/cdleary/tm/parse-tests/ -q -b /home/cdleary/tm/results/pool_size/tm_4K_pool_size.json

   tetris_combined.js: Meh.
  gravity_combined.js: faster:  20.93ms < baseline  22.37ms ( +6.46%)
                ga.js: faster:   2.25ms < baseline   2.33ms ( +3.55%)
  twitter_combined.js: Meh.
        jquery.min.js: Meh.
   zimbra_combined.js: faster:  93.95ms < baseline 115.92ms (+18.95%)
     jsgb_combined.js: Meh.
280slides_combined.js: Meh.
       v8_combined.js: faster:  17.71ms < baseline  18.43ms ( +3.92%)
              dojo.js: faster:   6.17ms < baseline   6.57ms ( +6.15%)
 processing_twitch.js: Meh.
    gmail_combined.js: faster:  62.81ms < baseline  91.00ms (+30.97%)
effectgames_galaxy.js: faster:  12.89ms < baseline  13.80ms ( +6.63%)
ball_pool_combined.js: Meh.
           yui-min.js: Meh.
          MochiKit.js: faster:  12.85ms < baseline  14.10ms ( +8.90%)
    pipio_combined.js: faster:  66.95ms < baseline  73.58ms ( +9.01%)
sunspider_combined.js: SLOWER:  20.75ms > baseline  12.60ms (-64.67%) 

Average speedup: 2.99%

As opposed to before:

$ python compare_bench.py /home/cdleary/tm/results/parsemark_v8_eager_2010_04_13.json /home/cdleary/tm/results/parsemark_tm_2010_04_12.txt 
   tetris_combined.js: faster:   3.69ms < baseline   5.26ms (+29.85%)
  gravity_combined.js: faster:  20.93ms < baseline  39.08ms (+46.44%)
                ga.js: faster:   2.25ms < baseline   3.67ms (+38.69%)
  twitter_combined.js: faster:  18.81ms < baseline  27.81ms (+32.36%)
        jquery.min.js: faster:   4.30ms < baseline   6.84ms (+37.13%)
   zimbra_combined.js: faster:  93.95ms < baseline 192.73ms (+51.25%)
     jsgb_combined.js: faster:  10.22ms < baseline  13.42ms (+23.85%)
280slides_combined.js: faster:   3.23ms < baseline   4.06ms (+20.44%)
       v8_combined.js: faster:  17.71ms < baseline  27.83ms (+36.36%)
              dojo.js: faster:   6.17ms < baseline   9.74ms (+36.65%)
 processing_twitch.js: faster:   2.25ms < baseline   2.79ms (+19.35%)
    gmail_combined.js: faster:  62.81ms < baseline 146.14ms (+57.02%)
effectgames_galaxy.js: faster:  12.89ms < baseline  23.94ms (+46.16%)
ball_pool_combined.js: faster:  14.29ms < baseline  20.48ms (+30.22%)
           yui-min.js: faster:   1.24ms < baseline   1.79ms (+30.73%)
          MochiKit.js: faster:  12.85ms < baseline  20.94ms (+38.63%)
    pipio_combined.js: faster:  66.95ms < baseline 127.90ms (+47.65%)
sunspider_combined.js: SLOWER:  20.75ms > baseline  15.12ms (-37.24%) 

Average speedup: 32.53%

If we think it's worth the time I'll instrument to get some production statistics and see if the big win/losses stand out somehow.
Comment 16 Robert Sayre 2010-04-15 14:23:43 PDT
(In reply to comment #14)
> Created an attachment (id=438840) [details]
> Perf increase versus pool size.

Eggsalad. Now we need to see the same plot on ARM.
Comment 17 Chris Leary [:cdleary] (not checking bugmail) 2010-04-16 10:14:37 PDT
Created attachment 439562 [details]
ARM parse time versus pool size.

This data set is a lot noisier -- big error bars.
Comment 18 Chris Leary [:cdleary] (not checking bugmail) 2010-04-16 10:15:01 PDT
Created attachment 439563 [details]
ARM perf increase versus pool size.
Comment 19 Chris Leary [:cdleary] (not checking bugmail) 2010-04-22 15:42:41 PDT
Created attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.

Okay, I'm going with 4K - ARENA_HEADER_SIZE for now, as it demonstrates a parsing speedup and doesn't have a significant effect on sunspider scores on x64 or ARM.

On my Linux x64, I found the following relationship on parsemark (from slow to fast, <~ meaning approximately equal, but strictly less than):

1K-40 <~ 1K < 4K <~ 4K - HEADER_SIZE <~ 16K <~ 16K - HEADER_SIZE

On OS X i386:

16K - HEADER_SIZE < 1K < 1K - HEADER_SIZE < 4K <~ 4K - HEADER_SIZE

That being said, I'm still looking at exactly what size mallocs we're performing as a result of arena chunks being pulled, seeing which VM pages that ends up corresponding to, and checking for thrash with a little bit of printf-based observation.
Comment 20 Robert Sayre 2010-04-22 17:14:12 PDT
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.

I claim we should take this trivial patch to win now. But also file a follow-up to see if we can observe consistent wins when multiplying a constant by pointer size.
Comment 21 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-04-22 17:14:43 PDT
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.

that sure is an increase.
Comment 22 Chris Leary [:cdleary] (not checking bugmail) 2010-04-22 18:35:56 PDT
What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a cursory analysis I see that the doubling approach can reduce 5761 malloc/frees out of the temp pool to ~13. Note that we'd get screwed on the larger arenas, where chunk utilization is low, if the system's vmem doesn't support a mapped-but-not-wired state.
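A sketch of sizing chunks from sysconf's page size, as suggested above. The header constant here is a hypothetical stand-in for the real arena header size, and the function name is invented for illustration:

```cpp
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical arena-header size; the real constant lives in the arena code.
const size_t ARENA_HEADER_SIZE = 40;

// Pick a chunk payload size so that payload + header fits exactly in one
// system page, avoiding malloc requests that spill a few header bytes onto
// an extra page.
size_t pickChunkSize() {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;  // conservative fallback if sysconf fails
    return static_cast<size_t>(page) - ARENA_HEADER_SIZE;
}
```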

imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how many programs like imacros really exist and require performance, but might as well fix it.)

We also have a few oddball arena frees when the arenas are < 50% utilized, some as low as 1%. Still investigating those.
Comment 23 Brendan Eich [:brendan] 2010-04-22 18:44:50 PDT
(In reply to comment #22)
> What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a
> cursory analysis I see that we can reduce 5761 malloc/frees out of the temp
> pool into ~13 with the doubling approach.

Can we hang onto the last arena for a while (GC-based time caching, see gcEmptyArenaPoolLifespan)?
 
> imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how
> many programs like imacros really exist and require performance, but might as
> well fix it.)

Definitely -- on file?

/be
Comment 24 Chris Leary [:cdleary] (not checking bugmail) 2010-04-22 22:46:11 PDT
(In reply to comment #23)
> Can we hang onto the last arena for a while (GC-based time caching, see
> gcEmptyArenaPoolLifespan).

Yeah, I've been calling this the "second" arena, since the fakey one before it is called "first" and is a member structure of the pool. So long as API clients take marks and release to them instead of calling |pool.free()|, the GC should reap the "second" arena when it reaches a ripe old age.
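The mark/release discipline this relies on, sketched over a single fixed buffer with hypothetical names (real arena pools chain multiple arenas; this only illustrates the watermark idea from the description):

```cpp
#include <cassert>
#include <cstddef>

// Minimal mark/release pool: clients take a mark, allocate freely, then
// release back to the mark instead of freeing the whole pool, so the pool
// can keep its cached arena alive for later reuse (or GC reaping).
class MarkPool {
public:
    typedef size_t Mark;

    Mark mark() const { return used_; }

    void* allocate(size_t bytes) {
        bytes = (bytes + 7) & ~size_t(7);   // 8-byte alignment
        if (used_ + bytes > sizeof(buf_))
            return nullptr;                  // out of space in this sketch
        void* p = buf_ + used_;
        used_ += bytes;
        return p;
    }

    // Roll back everything allocated since `m`; the memory stays mapped
    // and is immediately reusable by subsequent allocations.
    void release(Mark m) {
        assert(m <= used_);
        used_ = m;
    }

private:
    char buf_[4096];
    size_t used_ = 0;
};
```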

> > imacros.js also shows we thrash the regexp pool
> on file?

Bug 561286.
Comment 25 Robert Sayre 2010-04-23 05:47:07 PDT
http://hg.mozilla.org/tracemonkey/rev/55da1c595209

new work to new bugs.
