Closed Bug 675136 Opened 9 years ago Closed 9 years ago

Quantify how much of about:memory's "heap-unclassified" number is due to jemalloc rounding up request sizes

Categories: Core :: Memory Allocator (defect)
Platform: x86_64 Linux
Status: RESOLVED WONTFIX
People: njn (Reporter); Unassigned
References: Blocks 1 open bug
Whiteboard: [MemShrink:P2]
Attachments: 2 files

about:memory's "heap-unclassified" number is computed from jemalloc's "allocated" number, which includes all the wasted space caused by jemalloc rounding up.  Bug 675132 showed that this can be a non-trivial amount in some cases.  It would be good to quantify exactly how much rounding up occurs.  It would be even better if we could get that number into about:memory, though that would require permanent changes to our copy of jemalloc.
Summary: Quantify how much jemalloc rounding up requests contributes to "heap-unclassified" in about:memory → Quantify how much of about:memory's "heap-unclassified" number is due to jemalloc rounding up request sizes
Ding ding ding!  I think we have a winner.

The attached patch instruments all the places I could find in jemalloc.c where the 'allocated' stat is updated.  I tracked the requested amount and the actual amount.  The patch spews lots of output, here's the 2nd half of the final line after starting the browser, loading gmail, and shutting down:

  583380244 -> 707906498 (124526254)

In other words, we (cumulatively) requested 583MB of allocations, and jemalloc allocated 708MB of memory, resulting in 124MB of waste.  That's 17.6%.  If we assume that the cumulative ratio matches the live ratio after loading finishes, that's almost half of the "heap-unclassified" value of 40% that I saw for this scenario.

(And note that I applied this instrumentation patch on top of the patch in bug 675132 which fixed the over-allocation problem in JSRope::flatten.)
Attached is a list showing the frequency of every allocation size that needed rounding up.  Here are the top 30 entries, which account for almost 75% of all rounded-up allocations.

( 1) 205920 (22.7%, 22.7%): small:     24 ->     32 (     8)
( 2)  66162 ( 7.3%, 29.9%): small:     72 ->     80 (     8)
( 3)  61772 ( 6.8%, 36.7%): small:     40 ->     48 (     8)
( 4)  54386 ( 6.0%, 42.7%): small:   1056 ->   2048 (   992)
( 5)  48501 ( 5.3%, 48.0%): small:     18 ->     32 (    14)
( 6)  47668 ( 5.2%, 53.3%): small:     15 ->     16 (     1)
( 7)  24938 ( 2.7%, 56.0%): large:   4095 ->   4096 (     1)
( 8)  24278 ( 2.7%, 58.7%): small:     56 ->     64 (     8)
( 9)  13064 ( 1.4%, 60.1%): small:    104 ->    112 (     8)
(10)  12852 ( 1.4%, 61.6%): small:    136 ->    144 (     8)
(11)   8970 ( 1.0%, 62.5%): small:     14 ->     16 (     2)
(12)   7969 ( 0.9%, 63.4%): small:    295 ->    304 (     9)
(13)   7789 ( 0.9%, 64.3%): small:     12 ->     16 (     4)
(14)   7517 ( 0.8%, 65.1%): small:    152 ->    160 (     8)
(15)   7445 ( 0.8%, 65.9%): small:    632 ->   1024 (   392)
(16)   6970 ( 0.8%, 66.7%): small:    120 ->    128 (     8)
(17)   6930 ( 0.8%, 67.4%): small:     36 ->     48 (    12)
(18)   6188 ( 0.7%, 68.1%): small:     44 ->     48 (     4)
(19)   5967 ( 0.7%, 68.8%): small:    168 ->    176 (     8)
(20)   5921 ( 0.7%, 69.4%): small:     13 ->     16 (     3)
(21)   5890 ( 0.6%, 70.1%): small:     88 ->     96 (     8)
(22)   5664 ( 0.6%, 70.7%): small:    600 ->   1024 (   424)
(23)   5496 ( 0.6%, 71.3%): small:     19 ->     32 (    13)
(24)   5073 ( 0.6%, 71.9%): small:      5 ->      8 (     3)
(25)   5024 ( 0.6%, 72.4%): small:     10 ->     16 (     6)
(26)   4344 ( 0.5%, 72.9%): small:     20 ->     32 (    12)
(27)   4280 ( 0.5%, 73.4%): small:    280 ->    288 (     8)
(28)   4233 ( 0.5%, 73.8%): small:   1032 ->   2048 (  1016)
(29)   3621 ( 0.4%, 74.2%): small:      6 ->      8 (     2)
(30)   3358 ( 0.4%, 74.6%): small:     46 ->     48 (     2)


(4) is the most interesting, accounting for 54MB of (cumulative) waste.  I'll try to work out what code is responsible.
> (4) is the most interesting, accounting for 54MB of (cumulative) waste. 
> I'll try to work out what code is responsible.

It's in the JS engine, due to some silliness in JSArenaPools.  I filed bug 675150 for it.
Very cool!  Are these numbers from a 64-bit system?  The cycle collector is guilty of this, at 8192+a little bit, so I think I can probably compute where in the list it is.

Also I noticed this error in the file (though it is long past the point where it matters): WARNING: unhandled variable 18 (<unknown variable>) in NPN_GetValue()
Some kind of "round up block size" function or macro as part of the memory reporter implementation would be nice.  I don't know how other memory reporters work, but the CC one tracks the number of each kind of block that is allocated and freed, then computes |numBlock * sizeof(myBlockType)|.  With such a function this could become |numBlock * ROUND_UP_BLOCK_SIZE(sizeof(myBlockType))|, which would make accounting for this much easier for memory reporter writers, and would also make it possible to adjust the rounding in one central place depending on the specifics of the memory allocator.
(In reply to comment #5)
> Some kind of "round up block size" function or macro as part of the mem
> reporter implementation would be nice.

The definition of this function depends on which allocator is being used, right?  Wouldn't using malloc_usable_size be simpler?
Sorry, I split this off from this bug but didn't provide a link back here.  See bug 675226 comment 5.
So, there are two suggested ways to account for the rounding up space.  

- Comment 5 (and bug 675226) is suggesting that every memory reporter be responsible for computing the rounding up for its own heap blocks.

- But I was imagining modifying jemalloc to report the sum of all round-ups, resulting in a single "heap-round-up" (or whatever we call it) number in about:memory.

I like the latter approach better because you only have to implement it once.  It also captures the rounding up for heap blocks that aren't covered by any memory reporters, and so will ultimately lead to a lower "heap-unclassified" number.
Yes, that makes sense.  It would still be nice to have a way to ferret out cases where we are wasting a lot of memory on jemalloc rounding, but computing the rounding in a memory reporter doesn't directly help with that anyway.
(In reply to comment #8)
> - But I was imagining modifying jemalloc to report the sum of all round-ups,
> resulting in a single "heap-round-up" (or whatever we call it) number in
> about:memory.

How would you do this?  In particular, how would you make realloc work with this?
I didn't mean to suggest that the round_me_up function (bug 675226) would be useful for memory reporters.  I just think it would be helpful for functions which know statically that they want at least X bytes, but are willing to use whatever they get. (*)

It would be cool to have heap-round-up reported through jemalloc, although that might be hard, since you'd have to make jemalloc store the requested or wasted size of each malloc.  (When you free, you have to decrease the allocated count and the wasted count.)

Who knows, maybe the new version does this for us.  :)

(*) I'm not convinced this is better than using malloc_usable_size.  But one advantage is that such allocations wouldn't incorrectly contribute to heap-round-up.
(In reply to comment #9)
> Yes, that makes sense.  It would still be nice to have a way to ferret out
> cases where we are wasting a lot of memory on jemalloc rounding

See comment 2 and 3 :)  (The numbers are for 64-bit, BTW.)

Also, note that the "waste" caused by the rounding doesn't necessarily have as big an effect as you might think.  In the JSArena case I think the allocations are very short-lived, so the fact that we're briefly allocating 2KB many times instead of 1KB may not have that big an effect;  e.g. it won't affect the peak.


> How would you do this?  In particular, how would you make realloc work with
> this?

Good question.  For "huge" allocations it's not hard as there's a struct recording info for each one.  For "small" and "large" allocations it's more difficult.
(In reply to comment #1)
> 
> In other words, we (cumulatively) requested 583MB of allocations, and
> jemalloc allocated 708MB of memory, resulting in 124MB of waste.  That's
> 17.6%.  If we assume that the cumulative ratio matches the live ratio after
> loading finishes, that's almost half of the "heap-unclassified" value of 40%
> that I saw for this scenario.

Hmm, that assumption may be flawed.  I think we allocate a lot of JSArena blocks (which currently suffer the round-up problem) but they get freed very quickly -- i.e. there aren't that many JSArenas live at any one time.
As a quick hack, I tried using malloc_usable_size to measure js/<compartment>/scripts, instead of the current approach.  For my gmail compartment the size increased from 7,578,026 B to 9,124,368 B, an increase of 1.20x.  Some of the other js reporters could do the same thing:  mjit-data, property-tables, object-slots, string-chars, tjit-data.

That might reduce heap-unclassified by ~5% (e.g. from 40% down to 35%).  The only question is whether we want to report both the useful space and the slop space, or just the total of the two.  I.e. would reporting both be useful enough to warrant bloating the output?

Also, is malloc_usable_size available on all platforms?
(In reply to comment #14)
> For my gmail
> compartment the size increased from 7,578,026 B to 9,124,368 B, an increase
> of 1.20x.

On smaller compartments I saw the ratio get as high as 1.6x.
(In reply to comment #14)
> The
> only question is do we want to report both the useful space and the slop
> space, or just the total of the two?  I.e. would reporting both be useful
> enough to warrant bloating the output?

I think users only care which parts cause how much memory to be allocated in the end, i.e. the total.  It may make sense to make both values available to devs who want to reduce the overhead, though; I'm not sure what good possibilities you have there.  But as you know, devs usually only work on improvements when they can measure them...
FWIW, I wrote a test to see how jemalloc rounds up across a range of 2^n+1 sizes:

2^ 0+1 (         2) -->          2
2^ 1+1 (         3) -->          4
2^ 2+1 (         5) -->          8
2^ 3+1 (         9) -->         16
2^ 4+1 (        17) -->         32
2^ 5+1 (        33) -->         48
2^ 6+1 (        65) -->         80
2^ 7+1 (       129) -->        144
2^ 8+1 (       257) -->        272
2^ 9+1 (       513) -->       1024
2^10+1 (      1025) -->       2048
2^11+1 (      2049) -->       4096
2^12+1 (      4097) -->       8192
2^13+1 (      8193) -->      12288
2^14+1 (     16385) -->      20480
2^15+1 (     32769) -->      36864
2^16+1 (     65537) -->      69632
2^17+1 (    131073) -->     135168
2^18+1 (    262145) -->     266240
2^19+1 (    524289) -->     528384
2^20+1 (   1048577) -->    2097152
2^21+1 (   2097153) -->    3145728
2^22+1 (   4194305) -->    5242880
2^23+1 (   8388609) -->    9437184

So we see a variety of things, but there are clearly buckets along the way that are multiples of 16, 4096, and 1MB.
Whiteboard: [MemShrink] → [MemShrink:P2]
Actually achieving what this bug's title suggests (quantifying how much memory allocated by jemalloc is slop due to round-ups) is tricky if we want to do it for the memory usage at any one time (e.g. have a "heap-slop" bucket in about:memory), as opposed to the cumulative amount that I measured in comment 1 and comment 2.

More precisely, it wouldn't be hard to modify jemalloc to record this (if only in a patch for diagnostic purposes) for "large" and "huge" allocations, because they have per-allocation metadata.  But "small" allocations don't have their own metadata.
I'm going to close this bug.  The instrumentation patch has identified the four "clownshoes" bugs (bug 675150, bug 676457, bug 675132, bug 676189), which was useful, and I'll revisit it once those bugs have been fixed to see if there are any remaining clownshoes.

But I don't see a good way to directly identify the total amount of "heap slop" without very invasive changes to jemalloc.  I think the way forward is to instead use malloc_usable_size in all the appropriate reporters (e.g. bug 676732).  That won't allow us to see exactly what fraction of the heap is due to slop, but it has two advantages:  (1) the blame for the excess memory goes to the code responsible, and (2) it's implementable in a reasonable way.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX