Closed Bug 408921 Opened 17 years ago Closed 13 years ago

Arena sizes should be gross not net

Categories

(Core :: JavaScript Engine, defect)


Tracking

RESOLVED DUPLICATE of bug 675150

People

(Reporter: pavlov, Assigned: moz)

References

Details

(Keywords: memory-footprint, perf)

Attachments

(2 files, 2 obsolete files)

"Net" sizing means the arena header overhead pushes us into the next size class used by malloc: at least with certain allocators, a 1K net arena becomes 1040 bytes gross (due to the header) and can end up actually taking 2K!

I suggest that JSArena get changed to subtract the header from the size rather than add to it.
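
For illustration, a rough sketch of the arithmetic (not the actual jsarena code). The 16-byte header matches the 1040-byte figure above and is roughly sizeof(JSArena) on a 32-bit build; the pure power-of-two size classes are a simplification of how a jemalloc-style allocator treats requests in this range.

    /*
     * Rough sketch only -- not the actual jsarena code.  The 16-byte header
     * matches the 1040-byte figure above; the power-of-two size classes are
     * a simplification of how jemalloc treats requests in this range.
     */
    #include <stdio.h>
    #include <stddef.h>

    /* Round a request up to the next power-of-two size class. */
    static size_t size_class(size_t n)
    {
        size_t c = 16;
        while (c < n)
            c <<= 1;
        return c;
    }

    int main(void)
    {
        size_t net = 1024;              /* what the caller asked for */
        size_t header = 16;             /* assumed arena header size */
        size_t gross = net + header;    /* what malloc actually sees: 1040 */

        printf("net %zu -> gross %zu -> allocator hands back %zu\n",
               net, gross, size_class(gross));   /* 1024 -> 1040 -> 2048 */
        return 0;
    }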
Keywords: footprint, perf
Um yea - we should fix this..
Flags: blocking1.9+
Priority: -- → P1
Separate bug for PLArena?
Assignee: general → crowder
Priority: P1 → P2
So...  I was hoping for an easy one- or two-liner here, but it's not.  The issue is that a lot of code that uses arenas sets up the pools with an arenasize that exactly matches the anticipated alloc/grow calls, or nearly matches powers of two.  In other words, we have a lot of code like this:

    JS_INIT_ARENA_POOL(&cx->tempPool, "temp", 1024, sizeof(jsdouble),
                       &cx->scriptStackQuota);
    ...
    JS_ARENA_ALLOCATE_CAST(buf1, jschar *, &cx->tempPool, 512);
    JS_ARENA_ALLOCATE_CAST(buf2, jschar *, &cx->tempPool, 512);
    ...
    JS_FinishArenaPool(&cx->tempPool);

This means that even if the arena code were smarter about not over-allocating to include headers, it would likely still end up allocating more than a perfect 1024-byte chunk.

I wonder if a better approach to this is to move the arena header chunks elsewhere, but then I worry about memory-locality.  Thoughts?
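
To make the just-so-sizing concern in comment 4 concrete, here is a tiny illustration; the 16-byte header is an assumption, and the numbers are purely for demonstration.

    /*
     * Illustration of the just-so-sizing concern above, with an assumed
     * 16-byte header.  If the caller's 1024 becomes the gross size, only
     * 1008 bytes are usable, the second 512-byte JS_ARENA_ALLOCATE_CAST no
     * longer fits, and the pool has to malloc a second arena.
     */
    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t gross = 1024, header = 16;   /* header size is an assumption */
        size_t usable = gross - header;     /* 1008 usable bytes */

        printf("usable=%zu, both 512-byte allocs fit: %d\n",
               usable, (size_t)(512 + 512) <= usable);   /* fit: 0 */
        return 0;
    }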
I've run into almost exactly this problem before in another large code base that I've worked with.

Moving the arena header elsewhere provided several wins.  First, the problem gross vs. net disappears.  Second, the arena header stops polluting the cache line on which the first bytes of allocated data reside - a big deal when the arena is small.  Third, the arena pool code gets simpler and easier to understand.  Fourth, the traversal of the arena freelist will be *much* faster, and won't pollute caches and TLBs when it happens.  Fifth, the arena headers are easier to see in debugger sessions and core dumps because they're all in one place.  Data locality is improved, overall.

I am in favour of moving the arena header elsewhere, as Brian suggests.  Memory locality will be much better, not worse.  The code will be cleaner.  The product will be faster.
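
A minimal sketch of what "header elsewhere" could look like; the type, field, and function names are hypothetical, not the real JSArena layout, and error handling is reduced to the bare minimum.

    /*
     * Hypothetical out-of-line arena header, for illustration only.  The
     * bookkeeping lives in its own small allocation, so the data block
     * handed to malloc can be exactly the requested (power-of-two) gross
     * size.
     */
    #include <stdlib.h>

    typedef struct OutOfLineArena {
        struct OutOfLineArena *next;   /* pool chain */
        char *base;                    /* start of the data block */
        char *avail;                   /* first free byte */
        char *limit;                   /* one past the end of the data block */
    } OutOfLineArena;

    static OutOfLineArena *new_arena(size_t gross_size)
    {
        OutOfLineArena *a = malloc(sizeof *a);    /* header allocated separately */
        if (!a)
            return NULL;
        a->base = a->avail = malloc(gross_size);  /* payload is exactly gross_size */
        if (!a->base) {
            free(a);
            return NULL;
        }
        a->limit = a->base + gross_size;
        a->next = NULL;
        return a;
    }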

(In reply to comment #5)
> I've run into almost exactly this problem before in another large code base
> that I've worked with.
> 
> Moving the arena header elsewhere provided several wins.  First, the problem
> gross vs. net disappears.  Second, the arena header stops polluting the cache
> line on which the first bytes of allocated data reside - a big deal when the
> arena is small.  Third, the arena pool code gets simpler and easier to
> understand.  Fourth, the traversal of the arena freelist will be *much* faster,
> and won't pollute caches and TLBs when it happens.  Fifth, the arena headers
> are easier to see in debugger sessions and core dumps because they're all in
> one place.  Data locality is improved, overall.
> 
> I am in favour of moving the arena header elsewhere, as Brian suggests.  Memory
> locality will be much better, not worse.  The code will be cleaner.  The
> product will be faster.
> 

So you want to take this bug then? Let's try it and run some tests...
(In reply to comment #6)
> So you want to take this bug then? Let's try it and run some tests...

Yes, I'm willing to take it.  I'm not too familiar with all of the various tests that already exist, though.  Are there tests that already exist that I should try, to get a picture of the 'before'?

(In reply to comment #7)
> (In reply to comment #6)
> > So you want to take this bug then? Let's try it and run some tests...
> 
> Yes, I'm willing to take it.  I'm not too familiar with all of the various
> tests that already exist, though.  Are there tests that already exist that
> I should try, to get a picture of the 'before'?
> 

http://wiki.mozilla.org/Buildbot/Talos/Machines#Graph_Links

Particularly

http://graphs.mozilla.org/#spst=range&spstart=1196881975&spend=1198106723&bpst=cursor&bpstart=1196881975&bpend=1198106723&m1tid=53222&m1bl=0&m1avg=0&m2tid=53258&m2bl=0&m2avg=0&m3tid=53240&m3bl=0&m3avg=0&m4tid=53280&m4bl=0&m4avg=0

Shows memory usage while loading about 4k web pages.

Mh shows heap size during startup:

http://build-graphs.mozilla.org/graph/query.cgi?testname=trace_malloc_maxheap&units=bytes&tbox=fxdbug-linux-tbox.build.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2008:01:24:08:44:03,22226949

Described here:

http://wiki.mozilla.org/Performance:Leak_Tools#tbox
I am working on this bug, but have run into problems.  My first step has been analysis.  I have arranged to log the malloc()s and free()s being called by the jsarena code.  Unfortunately, there seems to be some double-free()ing happening, which makes analysis difficult.  I am attempting to determine if the double freeing is a real bug or incompleteness in my logging code.


there aren't double frees in that code.  You're missing some mallocs or there is some other bug in your code.
Stuart, you're right:  I was handling realloc()s improperly.  Fixed.

My first result is interesting: the jsarena code is responsible for 14249 calls to malloc/realloc just from launching and then quitting the browser.  Sunspider's regex-dna test results in 32288 calls to malloc (and 13 to realloc).

The start/stop test only ever consumes about 300K, at its worst.  The regex-dna test consumes 530K, at its worst.  Does that seem right?

[The real numbers are as follows.
start/stop: requested 279250 bytes in arenas, 302080 bytes consumed including overhead
regex-dna: requested 506657 bytes in arenas, 530432 bytes consumed including overhead]

These are as the code is now, without my fix.
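
For reference, the overhead fraction implied by those figures; this is just arithmetic over comment 12, not a new measurement.

    /* Arithmetic over the figures in comment 12 -- nothing new measured here. */
    #include <stdio.h>

    int main(void)
    {
        long req_ss = 279250, gross_ss = 302080;   /* start/stop */
        long req_re = 506657, gross_re = 530432;   /* regex-dna */

        printf("start/stop overhead: %.1f%%\n",
               100.0 * (gross_ss - req_ss) / gross_ss);   /* about 7.6% */
        printf("regex-dna overhead:  %.1f%%\n",
               100.0 * (gross_re - req_re) / gross_re);   /* about 4.5% */
        return 0;
    }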
Depends on: 415300
(In reply to comment #5)
> Second, the arena header stops polluting the cache
> line on which the first bytes of allocated data reside

I don't agree. A direct-mapped data cache holding lines containing the interpreter's stack today has

[JSArena, JSInlineFrame, jsval vars[nvars], jsbytecode *pcs[depth], jsval spbase[depth], ...]

A hot method call that spans an arena boundary will look like this:

[... callee-operands, operands-that-satisfy-nargs, JSArena, JSInlineFrame, vars, pcs, operands]

If there are not enough arguments to satisfy nargs, and no room to push the undefined value to supply the missing args, then the picture looks like the first one.

If the callee and caller(s) are hot, we do not want arena overhead to collide with either caller-operands or caller. But the probability of collision is strictly greater if the JSArena metadata is moved out of line: if JSArena takes one cache line, and there are 4K lines, then the odds of colliding with four jsvals of hot caller operands and overwriting them in the cache are 1 in 4K.

The odds could be worse in practice, since the metadata may be adjacent to other JSArena metadata that covers the entire cache.

You can have collisions the other way too, with the callee's frame or operands competing for the same cache line as the JSArena.

Since arena pools are for LIFO allocation, and LIFO allocation wraps around the cache recycling oldest lines first, cache effectiveness seems strictly better if metadata is allocated in a header to the net allocated payload.

> - a big deal when the arena is small.

The arenas are generally not small. The pool is a stack, so you could use it for LIFO allocation even with small arenas, but then you are duplicating malloc, with redundant overhead, and high overhead compared to the case where the arena size is large (1024 bytes or more for small 32-to-64-byte structs, not fixed size).

Fixed size allocators can do even better, so if there are any places that only ever allocate a fixed-size struct, we might want to optimize. SpiderMonkey used arenas as a blunt tool, independent of size class, but generally for mixed or variable-size LIFO allocations.

So first, how about measuring the net arena sizes and the distribution of allocated sizes in them? It's hard to reason about this bug in the absence of quantification.

/be
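
The collision math in comment 13, spelled out as a sketch: the 64-byte line size is an assumption, the 4K line count comes from the comment itself, and the function names are purely illustrative.

    /*
     * Sketch of the direct-mapped collision argument above.  The 64-byte
     * line size is assumed; the 4K line count is from comment 13.  Two
     * addresses compete for a cache line exactly when they map to the same
     * set index.
     */
    #include <stdint.h>

    #define LINE_SIZE 64u      /* assumed */
    #define NUM_LINES 4096u    /* "4K lines" */

    static unsigned cache_set(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
    }

    /* Out-of-line JSArena metadata evicts a given hot line of caller
       operands only when both map to the same set: probability 1/4096 for a
       random placement, as the comment says. */
    static int collides(uintptr_t header_addr, uintptr_t operand_addr)
    {
        return cache_set(header_addr) == cache_set(operand_addr);
    }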
(In reply to comment #13)
> A direct-mapped data cache holding lines containing the
> interpreter's stack today has
> 
> [JSArena, JSInlineFrame, jsval vars[nvars], jsbytecode *pcs[depth], jsval
> spbase[depth], ...]

Oops, forgot argv at the front:

[jsval argv[JS_MAX(nargs, argc)], JSInlineFrame, ...]

> 
> A hot method call that spans an arena boundary will look like this:
> 
> [... callee-operands, operands-that-satisfy-nargs, JSArena, JSInlineFrame,
> vars, pcs, operands]
> 
> If there are not enough arguments to satisfy nargs, and no room to push the
> undefined value to supply the missing args, then the picture looks like the
> first one.

... like the amended first one.

/be

(In reply to comment #13)
> (In reply to comment #5)
> > Second, the arena header stops polluting the cache
> > line on which the first bytes of allocated data reside
> I don't agree. A direct-mapped data cache holding lines containing the
> interpreter's stack today has
> ...
Thanks for taking the time to write your argument.  It helped me better understand how js code uses arenas.  Is the execution stack the only user of arenas in js code?

> Fixed size allocators can do even better, so if there are any places that only
> ever allocate a fixed-size struct, we might want to optimize. SpiderMonkey used
> arenas as a blunt tool, independent of size class, but generally for mixed or
> variable-size LIFO allocations.

Yes, fixed size allocators always have the best performance.  I'm not sure how this would work, though; perhaps it is worthy of offline discussion.  Or, we could open a low priority bug on it.  (How can we use fixed size allocators when we have to put differently sized items on the same stack?)

> So first, how about measuring the net arena sizes and the distribution of
> allocated sizes in them? It's hard to reason about this bug in the absence of
> quantification.

I have some of this data now.  I continue to collect and analyze it.  Doing so was complicated recently when I discovered that the memory in the arenas does not necessarily all come from malloc() and free() code in jsarena.c.  This is because SendToGenerator() pushes memory from JSGenerators onto the arena pool.  I'll post results here when I have them.  Meanwhile, can anyone comment on my Comment #12, where I ask if the numbers seem reasonable?
Crowder could comment on the regexp numbers better than I could, but they seem in the ballpark. How many discrete allocations, with what size-class distribution? JS_ARENAMETER results could be helpful.

/be
Brian's comment 4 was concerned about just-so sizing, but I don't see it. Anyway, this approach allows us to easily respin builds with JS_ARENAMETER defined for different JS_ARENA_GROSS_SIZE_THRESHOLD values. Suggest trying 1024 -- cursory testing shows no problems, but I didn't turn on JS_ARENAMETER to see its effects.

/be
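
The attached patch isn't quoted in this bug, so the following is only a guess at its shape: above the threshold the requested arenasize is treated as gross and the header is carved out of it, below the threshold the old net behaviour is kept. The helper name and header size are illustrative.

    /*
     * Guess at the shape of the JS_ARENA_GROSS_SIZE_THRESHOLD change -- the
     * real change is in the attachment, not quoted here.  The helper name
     * and header size are illustrative.
     */
    #include <stddef.h>

    #define JS_ARENA_GROSS_SIZE_THRESHOLD 1024

    static size_t net_payload_size(size_t requested, size_t header_size)
    {
        if (requested >= JS_ARENA_GROSS_SIZE_THRESHOLD && requested > header_size) {
            /* Gross sizing: malloc(requested), keep requested - header usable,
               so the request stays on a malloc size-class boundary. */
            return requested - header_size;
        }
        /* Net sizing: malloc(requested + header), as before. */
        return requested;
    }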
Attachment #302460 - Flags: review?(crowder)
Attachment #302460 - Attachment is obsolete: true
Attachment #302465 - Flags: review?(crowder)
Attachment #302460 - Flags: review?(crowder)
Comment on attachment 302465 [details] [diff] [review]
with sane lower bound on JS_ARENA_GROSS_SIZE_THRESHOLD

I ran into weird trouble with a patch almost exactly like this when I first played with this bug.  I can't find the patch now to see what was different, but this seems fine, runs in the browser in opt and dbg, and seems to have decent characteristics in sunspider with respect to arena/allocation sizes.  For real frag/memory perf numbers, I'd defer to stuart.
Attachment #302465 - Flags: review?(crowder) → review+
A log of a SunSpider run on the shell.
This looks like a bad set of stats:

stack allocation statistics:
              number of arenas: 1
         number of allocations: 45300
        number of malloc calls: 1267
       number of deallocations: 0
  number of allocation growths: 0
    number of in-place growths: 0
 number of realloc'ing growths: 0
number of released allocations: 715956
       number of fast releases: 714690
         total bytes allocated: 933260
          mean allocation size: 20.6018
            standard deviation: 26.4648
       maximum allocation size: 212
Bad how?
bad in that the stack size (8k currently) is probably too small.  this patch shouldn't have any bad effects on this specific arena given that the max allocation size is 212 bytes.
Stuart: too large, right? This shows likely underutilization of that 1 8k arena. It's good that we have time-based caching of that arena -- without that the 1267 malloc calls would be much higher (crowder please confirm, it's not clear that 1267 is still unaccountably high). But we do not have a meter for the high watermark within that arena. This would be good to know; it could help us tune the stack size down from 8k.

/be
Actually, I had agreed with Stuart:  too small.  That's 1200-something mallocs for just one run of one sunspider test.
Oh, I see -- would be good to know the "high watermark" distribution. You are probably seeing an outlier or two -- a highly recursive benchmark that uses tons of stack. This is *not* something to tune for.

My suspicion about 8k being too big is that most of the time, stacks are shallow and most of the 8k is unused. If this is true, then going to 16k just wastes even more space in the typical case (times 300 open tabs).

You need to measure more here to know what to do. Absolute malloc count minimization during SunSpider is not the goal.

/be
(In reply to comment #24)
> Stuart: too large, right? This shows likely underutilization of that 1 8k
> arena.

Of course I was wrong here -- narenas is 1 at the time the stats were dumped, but there's no instrumentation telling the high watermark in arenas or bytes, or the distribution of high watermarks. But JSBasicStats could be used to find both max and mean/sigma/histogram.

/be
I think we should land Brendan's patch here; worry about other stats in another bug.
Assignee: crowder → brendan
Woops, taking this back; needs more testing and Brendan has plenty on his plate.
Assignee: brendan → crowder
(In reply to comment #17)
> Brian's comment 4 was concerned about just-so sizing, but I don't see it.

Sample case: cx->tempPool is initialized to have an arenasize of 1001 (gross not net, using your patch), then in js_InitTokenStream, we allocate a 1024-byte jschar* buffer, which ends up being calculated in JS_ArenaAllocate to be a 1051-byte gross allocation.  Doesn't this yield the same original problem Stuart was concerned about with jemalloc?  A 1056-byte allocate that demands 2K total?
(In reply to comment #30)
> (In reply to comment #17)
> > Brian's comment 4 was concerned about just-so sizing, but I don't see it.
> 
> Sample case: cx->tempPool is initialized to have an arenasize of 1001 (gross
> not net, using your patch), then in js_InitTokenStream, we allocate a 1024-byte
> jschar* buffer, which ends up being calculated in JS_ArenaAllocate to be a
> 1051-byte gross allocation.

Glad you found this. Sample case, or only case? Keep looking.

The threshold and the line buffer size can both be tuned, and they have to be tuned together. Suggest keeping the 1024 threshold for gross not net, but tuning down the line buffer (JS_LINE_LIMIT) to 250 and commenting the relation.

> Doesn't this yield the same original problem
> Stuart was concerned about with jemalloc?  A 1056-byte allocate that demands 2K
> total?

Is that a rhetorical question?

We all know this is to be avoided, the question is how. You have to tune the workload of allocated sizes and the threshold. Sounds like work, as sayrer likes to say :-P.

/be
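
Back-of-the-envelope check of that tuning suggestion, assuming a default JS_LINE_LIMIT of 256 and that the 1024-byte buffer from comment 30 scales linearly with it; both are assumptions, as is the 16-byte header.

    /*
     * Back-of-the-envelope check of the JS_LINE_LIMIT suggestion above.
     * Assumes a default JS_LINE_LIMIT of 256, that the 1024-byte buffer from
     * comment 30 scales linearly with it, and a 16-byte header -- all
     * assumptions, not measurements.
     */
    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t threshold = 1024;               /* JS_ARENA_GROSS_SIZE_THRESHOLD */
        size_t buf_default = 1024, limit_default = 256;
        size_t tuned_buf = 250 * (buf_default / limit_default);   /* 1000 bytes */
        size_t header = 16;

        printf("tuned buffer + header = %zu (threshold %zu)\n",
               tuned_buf + header, threshold); /* 1016 <= 1024: stays in class */
        return 0;
    }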
The other problem I am running into: with your patch and nothing else except
#define JS_ARENA_GROSS_SIZE_THRESHOLD 1024
I get three new test-case failures in the JS suite, with a DBG build:

ecma_3/Statements/regress-324650.js
js1_5/Array/regress-108440.js
js1_5/Regress/regress-80981.js

The first, at least, is an assertion in JS_ArenaFreeAllocation, on or about line 419 (GET_HEADER(pool, b) == &a->next).  Interestingly, b->avail - b->base is 1008 (gross-size - sizeof(JSArena), maybe??).  Am I missing the addition of a mask or something, somewhere?
(In reply to comment #31)
> Glad you found this. Sample case, or only case? Keep looking.

Definitely not the only case; I think the examples in js/src number a dozen or so.  It would be nice to have a general fix that works nicely for PLArenas, too, without touching a lot of code.  This is why I originally speculated on the idea of separating out header data to be kept elsewhere.

I don't mind doing the work of tracking down pieces of code doing power-of-2-liable-to-abuse-arenas allocations, but if there is a general win to be had with a lighter touch, I'd like to have it.  It looks like for JS there may not be one without decoupling the few places that use JSArenas directly (unless I am mistaken).
(In reply to comment #31)
> The threshold and the line buffer size can both be tuned, and they have to be
> tuned together. Suggest keeping the 1024 threshold for gross not net, but
> tuning down the line buffer (JS_LINE_LIMIT) to 250 and commenting the relation.

Right, so every allocation site that uses arenas will have to balance its allocation sizes against the threshold as ((threshold - hdr_size) - n * m) < m, where n is the number of internal allocs desired per malloc and m is the average size of the chunks being allocated (ideally, with (threshold - hdr_size) - n * m as close to 0 as possible).

This seems like a bit of a burden to be placed upon clients of the arena code (again, I'm fine with doing the work), but a solution involving allocating headers elsewhere seems cleaner.
Please list the dozen or so cases; we have to proceed from the concrete to the abstract here. After all, arenas can already be oversized, and that may already be happening due to mismatched arenasize vs. workload. Add an assertion.

The JS_ArenaFreeAllocation assertbotch sounds like bug 239721. I really should have fixed that long ago. My not-so-secret shame!

/be
(In reply to comment #34)
> This seems like a bit of a burden to be placed upon clients of the arena code
> (again, I'm fine with doing the work), but a solution involving allocating
> headers elsewhere seems cleaner.

This is not making sense.

Headers elsewhere, besides having worse cache effects, does not solve anything. You still can have overflowing workloads (we may have them today, silently compensated for by the oversized arena code added to support GROW and JS_ArenaFreeAllocation).

If we can't match workload to arenasize, then we shouldn't be using arenas at all (plausible for the oversized cases, they strained the original design). If we can match workload to arenasize, then we can handle deducting header size in the matching when above a reasonable threshold. Below that threshold we don't care about malloc size class roundup.

/be
Not clear why this bug is a focus - should we spend time elsewhere?
brendan's current patch will save us memory
schrep:  This malloc round-up issue Brendan and I have been talking about is a big one, and apparently worse with jemalloc.  We're potentially wasting great swaths of memory by over-allocating by a header.  The space over-allocated spills into the next "chunk" of memory, and so jemalloc simply rounds up, leaving the remainder of that memory unused (or at least badly fragmented).  Stuart has suggested that this problem is more serious with jemalloc than with previous allocators (if I remember rightly) -and- is a problem for both JS_Arena code and PL_Arena code.  I had prioritized a number of other bugs higher, and also tried to give Robin some guidance on this bug; but I haven't seen activity from him, and I've finished those bugs, so I came back to this.  Would be thrilled to move on to something else, though.  ;)
As noted previously, Brendan's current patch seems to cause a bug (or three).
(In reply to comment #39)
> and PL_Arena code.  I had prioritized a number of other bugs higher, and also
> tried to give Robin some guidance on this bug; but I haven't seen activity from
> him, and I've finished those bugs, so I came back to this.  Would be thrilled
> to move on to something else, though.  ;)
> 

Ok - thanks for the explanation.  We have a plan to close this one out?
The current plan is to figure out why/how Brendan's patch creates or exposes bugs.  I'm working on it.  I had hit a similar issue with a patch I tried previously, too, but didn't focus on it for very long before being directed to other stuff.
fwiw, I'm not sure things are way worse with jemalloc.  It is just that we know exactly what it does, so we know things are bad.  With other allocators we can only speculate.
Comment on attachment 302465 [details] [diff] [review]
with sane lower bound on JS_ARENA_GROSS_SIZE_THRESHOLD

This is not the patch you're looking for.
Attachment #302465 - Flags: review+ → review-
Depends on: 417995
The other testcase regressed by the patch in comment #18 is js1_5/Array/regress-108440.js; investigating.
With this version of the patch applied, we no longer fail any additional tests from the suite.  That said, I'm not -fully- sure I understand the wallpaper I've applied in jsarena.h.

Something bad is happening in the in-place growth part of this code: it causes us to grow in place to a point where the arena -thinks- it is an oversized one, but it doesn't have a valid back-pointer.  This band-aid fixes it, but I am not sure whether the true issue is further upstream.  Perhaps this patch will inspire either Brendan or Igor.
Attachment #302465 - Attachment is obsolete: true
Comment on attachment 303888 [details] [diff] [review]
for consideration

This patch still needs the patch from bug 417995, btw.  The wallpaper in jsarena.h does not resolve the ArenaFreeAllocation bug.
Also, I still need to hunt down alloc/grow callers that are liable to waste space, as previously noted.  I just want to get past the crashing introduced by the original idea.
Depends on: 415967
Moving to wanted-next per triage w/ brendan.
Flags: tracking1.9+ → wanted-next+
Priority: P2 → --
(In reply to comment #27)
> but there's no instrumentation telling the high watermark in arenas or bytes,
> or the distribution of high watermarks. But JSBasicStats could be used to find
> both max and mean/sigma/histogram.

I have collected stats that include a histogram and high watermark measurements.  I instrumented the jsarena code to log calls to malloc, realloc, and free.  It is those sizes which are of interest here, because we are trying to avoid wasting space via overhead that is consumed by malloc (and more so by jemalloc).  I post-processed the log with an external program and found the histogram of allocations at the point when the allocation total (4,486,315 bytes) was at a high water mark.  This was from a log of a SunSpider run.  Here is the histogram:
(275,1),(303,0),(551,0),(751,0),(1040,0),(1047,0),(1155,0),(2075,0),(2079,0),(4123,0),(4127,0),(8211,46),(8219,0),(8223,0),(12307,1),(16027,0),(16411,0),(16415,0),(32027,0),(32795,0),(32799,0),(64027,0),(65563,0),(65567,0),(128027,0),(131103,0),(256027,0),(262175,0),(512027,0),(524319,0),(1024027,0),(2048027,0),(4096027,1)

An entry (x,y) indicates that there were y allocations of size x.  If y is zero, then that means that there was once an allocation of size x, but not at the time of the high water mark.

I will soon collect a histogram of all allocations, not merely the one at the high water mark.  I'll post here, then.
(In reply to comment #50) 
> I will soon collect a histogram of all allocations, not merely the one at the
> high water mark.  I'll post here, then.

(275,9914)
(291,4)
(303,1)
(551,1)
(751,1)
(1040,57684)
(1047,58051)
(1155,2)
(1431,1)
(1439,1)
(2075,37)
(2079,40)
(4123,17)
(4127,25)
(8211,7174)
(8219,10)
(8223,20)
(12307,619182)
(16027,5)
(16411,10)
(16415,20)
(32027,5)
(32795,5)
(32799,15)
(64027,5)
(65563,5)
(65567,15)
(128027,5)
(131103,15)
(256027,5)
(262175,10)
(512027,5)
(524319,5)
(1024027,5)
(2048027,5)
(4096027,5)

Note that for many sizes, there are 5 (or a multiple of 5) occurrences.  This was for a 5-iteration run of Sunspider.
Please see comment #4 in bug 420476 for one possible 2-line solution to this bug.  It requires patches for 415967 and 420476.  The comment shows the "fixed" histogram of memory allocations.
No longer depends on: 415300
I think that the solution for this bug is to depend on bug 420476, and then to manually hunt down the allocations of the sort Brian mentions in comment #4, of which I think there are no more than 10.

To recap: the point of this bug is to avoid over-allocating memory, which we are currently doing in the way that Stuart describes in comment #1.  Bug 415967 also achieves the same goal, but my current patch for that produces a mild slowdown on SunSpider.  I can fix the slowdown, but not without a series of code changes which are nontrivial and which will require many reviews.

Shall I take this bug and proceed as I describe above?
No longer depends on: 415967
(In reply to comment #53)
> Shall I take this bug and proceed as I describe above?

Yes, please take.
Assignee: crowder → moz
Depends on: 420476
Blocked on bug 420476.
Bug 675150 will fix this in a straightforward way.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE