Bug 685438 - Avoid wasted space in nsTArray_base due to jemalloc rounding up
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P2][clownshoes]
Product: Core
Classification: Components
Component: XPCOM
Version: unspecified
Platform/OS: All / All
Priority/Severity: -- / normal
Target Milestone: mozilla9
Assigned To: Justin Lebar (not reading bugmail)
QA Contact: Nathan Froyd [:froydnj]
Reported: 2011-09-08 00:15 PDT by Nicholas Nethercote [:njn]
Modified: 2011-09-22 09:42 PDT
CC: 11 users


Attachments
Patch v1 (24.30 KB, patch) - 2011-09-19 09:53 PDT, Justin Lebar (not reading bugmail) - no flags
Part 2. v1 (6.94 KB, patch) - 2011-09-19 09:55 PDT, Justin Lebar (not reading bugmail) - no flags
Part 1. v2 (24.53 KB, patch) - 2011-09-20 12:52 PDT, Justin Lebar (not reading bugmail) - no flags
Part 2 (fix up code). v2 (16.24 KB, patch) - 2011-09-20 12:52 PDT, Justin Lebar (not reading bugmail) - no flags
Patch v3 (2.95 KB, patch) - 2011-09-20 17:37 PDT, Justin Lebar (not reading bugmail) - roc: review+

Description Nicholas Nethercote [:njn] 2011-09-08 00:15:15 PDT
nsTArray_base<Alloc>::EnsureCapacity has this code:

  Header *header;
  if (UsesAutoArrayBuffer()) {
    // Malloc() and copy
    header = static_cast<Header*>
             (Alloc::Malloc(sizeof(Header) + capacity * elemSize));
    if (!header)
      return PR_FALSE;

    memcpy(header, mHdr, sizeof(Header) + Length() * elemSize);
  } else {
    // Realloc() existing data
    size_type size = sizeof(Header) + capacity * elemSize;
    header = static_cast<Header*>(Alloc::Realloc(mHdr, size));
    if (!header)
      return PR_FALSE;
  }

AFAICT |capacity| is always a power-of-two, which means we're allocating slightly more than a power-of-two, which jemalloc rounds up to the next size class.  E.g. 1024 -> 1032 -> 2048.  I'm seeing this quite a lot relating to CSS stuff (e.g. below nsCSSRuleProcessor::CascadeSheet).

Is it possible to store the header separately?  It's late in the day and nsTArray is intimidating to a newcomer, so I don't yet know the answer to that question.

Oh, I found this:

 // nsTArray_base stores elements into the space allocated beyond
 // sizeof(*this).  This is done to minimize the size of the nsTArray
 // object when it is empty.
 struct NS_COM_GLUE nsTArrayHeader
 {   
   static nsTArrayHeader sEmptyHdr;
   
   PRUint32 mLength;
   PRUint32 mCapacity : 31;
   PRUint32 mIsAutoArray : 1;
 };

That may minimize zero-length arrays, but it really hurts with larger arrays.

This bug is probably highly related to bug 682735;  I freely admit to not having followed the details and current situation with that bug, but it would be nice if the above behaviour could be taken into consideration.
Comment 1 Nicholas Nethercote [:njn] 2011-09-08 00:18:41 PDT
BTW, after opening Gmail and TechCrunch I see that there is almost 8MB of *cumulative* waste from this problem.  If a lot of that cumulative waste is also from long-lived allocations (as CSS stuff tends to be), this could be a big win for memory usage.
Comment 2 Boris Zbarsky [:bz] (still a bit busy) 2011-09-08 07:15:08 PDT
> AFAICT |capacity| is always a power-of-two

It's a power of two times the original capacity passed to the constructor or the first EnsureCapacity call, right?

In practice, that will also be a power of 2, of course.

> e.g. below nsCSSRuleProcessor::CascadeSheet

Where below there?  There are some temporary arrays there that can get long, and some long-lived ones that shouldn't get long (and that we're about to make into auto arrays)... would be good to know which ones you're particularly seeing.  That said, it may be worth waiting for the existing bugs on the CSS arrays to be sorted out before measuring that.
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2011-09-08 07:16:22 PDT
Er, just read comment 1.

Does "cumulative" mean "total rounding-up of all allocations including ones that have already been deallocated since then" in this case?
Comment 4 Justin Lebar (not reading bugmail) 2011-09-08 07:30:42 PDT
> Does "cumulative" mean "total rounding-up of all allocations including ones that have already been 
> deallocated since then" in this case?

I think so.

I'd looked at this, but then I decided it wasn't as big a problem as all the malloc'ing, because the largest I could get the sum of malloc_usable_size over live nsTArrays' heap allocations to be was 4MB out of 200MB heap-allocated, and at most half of that is waste.  But I do think this is worth fixing, either as part of bug 682735 or separately.

> Is it possible to store the header separately?  It's late in the day and nsTArray is 
> intimidating to a newcomer, so I don't yet know the answer to that question.

One could.  It's done this way because it's one malloc for header + data rather than two; it also increases the locality of accesses.

I think the solution is probably just to change how TArray chooses its capacity.
Comment 5 Justin Lebar (not reading bugmail) 2011-09-08 12:39:20 PDT
> But I do think this is worth fixing, 

If only because, as you pointed out in another bug, this causes unnecessary churn in the allocator.
Comment 6 Nicholas Nethercote [:njn] 2011-09-08 18:26:56 PDT
(In reply to Boris Zbarsky (:bz) from comment #2)
> > AFAICT |capacity| is always a power-of-two
> 
> It's a power of two times the original capacity passed to the constructor or
> the first EnsureCapacity call, right?
> 
> In practice, that will also be a power of 2, of course.

Yes.  But capacity*elemSize is not necessarily a power-of-two.  I've found that's a big remaining source of slop bytes, in various array, vector and hash tables that we have.  It's a bit tricky to fix, though (well, arrays/vectors are easier than hash tables) and beyond the scope of this bug, because in the case I'm seeing capacity*elemSize *is* a power-of-two.


> > e.g. below nsCSSRuleProcessor::CascadeSheet
> 
> Where below there?

I'm seeing lots of stack traces like this:

 track (jemalloc.c:3780) 
 realloc (jemalloc.c:3839)
 moz_xrealloc (mozalloc.cpp:135)
 nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:89)
 CascadeRuleEnumFunc(mozilla::css::Rule*, void*) (nsTArray.h:811)
 nsVoidArray::EnumerateForwards(int (*)(void*, void*), void*) (nsVoidArray.cpp:724)
 nsCSSRuleProcessor::CascadeSheet(nsCSSStyleSheet*, CascadeEnumData*) (nsCOMArray.h:65)
 nsCSSRuleProcessor::RefreshRuleCascade(nsPresContext*) (nsCSSRuleProcessor.cpp:3034)
 nsCSSRuleProcessor::GetRuleCascade(nsPresContext*) (nsCSSRuleProcessor.cpp:2992)
 nsCSSRuleProcessor::RulesMatching(AnonBoxRuleProcessorData*) (nsCSSRuleProcessor.cpp:2261)
 int EnumRulesMatching<AnonBoxRuleProcessorData>(nsIStyleRuleProcessor*, void*) (nsStyleSet.cpp:475)

The line numbers are a bit screwy because of all the inlining, but you can work out what's going on if you squint a bit.


> Does "cumulative" mean "total rounding-up of all allocations including ones
> that have already been deallocated since then" in this case?

Sorry, I wasn't very clear.  For every heap allocation, I print out a line showing the requested size and the actual size.  But I don't do that for deallocations (because the requested size has been lost).  So, to answer your question:  yes.


(In reply to Justin Lebar [:jlebar] from comment #5)
> > But I do think this is worth fixing, 
> 
> If only because, as you pointed out in another bug, this causes unnecessary
> churn in the allocator.

I realized that DMD's output could tell me how much of this waste is long-lived.  For a (64-bit) run where I opened 5 Gmail tabs, we're wasting a shade over 1MB.  And it's spread out in lots of 1KB and 2KB chunks, which is the worst-case scenario (because we have lots of under-utilized pages).  Definitely worth fixing, IMHO.

(It's also worth noting that the waste only gets bad when the arrays are at least 512B in size;  below that jemalloc's quantum size is 16 bytes so any waste from rounding is proportionally much less.)
Comment 7 Boris Zbarsky [:bz] (still a bit busy) 2011-09-08 19:16:34 PDT
> I'm seeing lots of stack traces like this:

OK, so I'm assuming this is the array in the RuleByWeightEntry.  That's one of the ones I looked at in bug 681755 and you can see the log for it at https://bug681755.bugzilla.mozilla.org/attachment.cgi?id=555536 (first line is the total number of arrays seen there, then each line has a count followed by a length, sorted by length).  Bug 681755 comment 8 mentions that a small TArray would be good here, but there's a _very_ long tail...  Each entry in this array is two words, which on a 64-bit system is 16 bytes, so once we're over 32 entries we start running into jemalloc's non-quantized allocations.  That log shows plenty of stuff over 32 entries.  ;)  Not unexpected, given what's in these arrays (all CSS rule+selector pairs with a given weight).

The one piece of good news is that this is a transient data structure; it goes away before RefreshRuleCascade returns.

The other good news is that we don't actually _need_ an array here.  We just need some sort of order-preserving (in the sense that we can add stuff to it and then treat it as a FIFO) storage that we can add a bunch of things to, then go through them in order.  We add once, then go through once.  Ideally both would be reasonably fast, of course.  So if there is a better data structure to use here, or any other change that would improve something useful, it's totally doable.  Again, I'm not sure how much of a concern things are here given that this is a transient data structure.
Comment 8 Justin Lebar (not reading bugmail) 2011-09-18 11:19:53 PDT
I have patches for this which I'll post soon.
Comment 9 Justin Lebar (not reading bugmail) 2011-09-19 09:53:31 PDT
Created attachment 560940 [details] [diff] [review]
Patch v1
Comment 10 Justin Lebar (not reading bugmail) 2011-09-19 09:55:29 PDT
Created attachment 560942 [details] [diff] [review]
Part 2. v1
Comment 11 Justin Lebar (not reading bugmail) 2011-09-19 11:40:04 PDT
Hm.  This is causing build errors on Android because

 * The patch requires nsTArray<E> to know sizeof(E) so we can statically know the denominator of a division in the TArray code when we resize the array to a power of 2.  (sizeof(E) is usually a power of 2, so easy to divide by, but if we don't template on sizeof(E), we'll end up doing a much slower division op.)

 * FontListEntry is an IPDL-defined type (defined in PContent.ipdl)
 * gfxAndroidPlatform.h creates nsTArray<FontListEntry>
 * Therefore gfxAndroidPlatform.h needs to include PContent.h; forward-declaring class FontListEntry doesn't cut it.
 * If you include PContent.h, you have to do so before prtypes.h.
 * Now every file which includes gfxPlatform.h, which includes gfxAndroidPlatform.h, needs to include gfxPlatform.h as the first include.

I'll fix up the headers, but I hope we don't mind silly ordering requirements like this.
Comment 12 Justin Lebar (not reading bugmail) 2011-09-19 11:55:09 PDT
Actually, this isn't so bad; gfxPlatform.cpp includes gfxAndroidPlatform.h, but gfxPlatform.h doesn't.  So the weird ordering only applies to files which include gfxAndroidPlatform.h.
Comment 13 Justin Lebar (not reading bugmail) 2011-09-20 12:52:14 PDT
Created attachment 561267 [details] [diff] [review]
Part 1. v2
Comment 14 Justin Lebar (not reading bugmail) 2011-09-20 12:52:57 PDT
Created attachment 561268 [details] [diff] [review]
Part 2 (fix up code). v2
Comment 15 Boris Zbarsky [:bz] (still a bit busy) 2011-09-20 13:04:45 PDT
I'm really not a great reviewer for this, esp. if you want it fast.  Can you pick on someone who's actually read the tarray guts carefully before?
Comment 16 Justin Lebar (not reading bugmail) 2011-09-20 14:49:06 PDT
Comment on attachment 561267 [details] [diff] [review]
Part 1. v2

Roc, do you want to do the honors?
Comment 17 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-09-20 15:34:44 PDT
Are we making ElemSize a template parameter of nsTArray_base solely to avoid an integer division when allocating? That seems unnecessary.
Comment 18 Justin Lebar (not reading bugmail) 2011-09-20 15:40:56 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> Are we making ElemSize a template parameter of nsTArray_base solely to avoid
> an integer division when allocating? That seems unnecessary.

I was worried about ARM's software division.  I figured its cost might be comparable to a lock-free malloc path, but I have absolutely no data to back that up.
Comment 19 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-09-20 15:49:12 PDT
I had no idea integer division required software on ARM. There are lots of places we do integer division...

Apparently there's code to do division in less than 45 cycles:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0378c/Chdifaha.html
and apparently the worst-case for the default algorithm is less than 100 cycles. I doubt that dominates nsTArray allocation. Let's just go with the simple code.
Comment 20 David Baron :dbaron: ⌚️UTC-10 2011-09-20 16:18:34 PDT
Isn't the whole point of nsTArray_base to be the code that can be compiled just once, so that the templates don't bloat code size?
Comment 21 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-09-20 16:25:33 PDT
Yes, although you could argue that making the size the template parameter has less code-size impact than using the full type, since lots of types have the same size.
Comment 22 Justin Lebar (not reading bugmail) 2011-09-20 16:45:09 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #21)
> Yes, although you could argue that making the size the template parameter
> has less code-size impact than using the full type, since lots of types have
> the same size.

And indeed that's what I saw in codesighs; I should have mentioned that here.

I'll rework the patch to use a true division.
Comment 23 Justin Lebar (not reading bugmail) 2011-09-20 17:37:10 PDT
Created attachment 561356 [details] [diff] [review]
Patch v3
Comment 24 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-09-20 18:17:33 PDT
We don't need part 2 anymore, I assume.
Comment 25 Justin Lebar (not reading bugmail) 2011-09-21 07:38:37 PDT
Hm, comment didn't go through last night, it appears.

> We don't need part 2 anymore, I assume.

Correct.

Inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/069e927983a9

This doesn't appear to have moved any of our benchmarks, memory or CPU.  That's not surprising, given that the amount of wasted space is usually small, both as a fraction of a single array's size and as a fraction of the sum of all arrays' sizes.  I doubt there's a large effect on fragmentation, but even if there were, our tests don't test for that.
Comment 27 neil@parkwaycc.co.uk 2011-09-21 12:55:20 PDT
Comment on attachment 561356 [details] [diff] [review]
Patch v3

>+  const PRUint32 pageSizeBits = 12;
>+  const PRUint32 pageSize = 1 << pageSizeBits;
>+
>+  PRUint32 minBytes = capacity * elemSize + sizeof(Header);
>+  PRUint32 bytesToAlloc;
>+  if (minBytes >= pageSize) {
>+    // Round up to the next multiple of pageSize.
>+    bytesToAlloc = pageSize * ((minBytes + pageSize - 1) / pageSize);
>+  }
>+  else {
>+    // Round up to the next power of two.  See
>+    // http://graphics.stanford.edu/~seander/bithacks.html
>+    bytesToAlloc = minBytes - 1;
>+    bytesToAlloc |= bytesToAlloc >> 1;
>+    bytesToAlloc |= bytesToAlloc >> 2;
>+    bytesToAlloc |= bytesToAlloc >> 4;
>+    bytesToAlloc |= bytesToAlloc >> 8;
>+    bytesToAlloc |= bytesToAlloc >> 16;
I wonder whether the compiler is clever enough to notice that bytesToAlloc >> 16 is always zero because bytesToAlloc < pageSize.
Comment 28 Justin Lebar (not reading bugmail) 2011-09-21 13:06:24 PDT
If minBytes == 0, |minBytes - 1| will underflow, and then the >> 16 matters.  And you can't prove that minBytes > 0, because |capacity * elemSize + sizeof(Header)| might overflow to 0.

If I take out the minBytes - 1, then gcc optimizes the extra shift away.
Comment 29 Emanuel Hoogeveen [:ehoogeveen] 2011-09-22 03:08:28 PDT
What does it do if you special-case minBytes == 0 to return 1? The branch itself should be very predictable, given that it should never evaluate as true (though it's technically correct, as the nearest integer power-of-two for 0 is 1).

By the way, I assume there's a bytesToAlloc++; at the end of that sequence? Otherwise it'll always return 2^n - 1.
Comment 30 Justin Lebar (not reading bugmail) 2011-09-22 06:03:08 PDT
To be clear, I'm not particularly interested in saving the one extra shift instruction.  One would need to go after the division first; see comment 18.
Comment 31 Justin Lebar (not reading bugmail) 2011-09-22 06:49:01 PDT
This may have caused TP5 private bytes to move on Linux x86 and Linux x64.  Strangely enough, the x86 and x64 plots switched places.

If graphserver [1] is to be believed, this happened immediately after a NPOTB push [2]!

I'm investigating...

[1] http://graphs-new.mozilla.org/graph.html#tests=[[90,63,1],[90,63,12],[90,63,15],[90,63,13],[90,63,14]]&sel=none&displayrange=7&datatype=running
[2] http://hg.mozilla.org/integration/mozilla-inbound/rev/c7233b484b95
Comment 32 Justin Lebar (not reading bugmail) 2011-09-22 07:09:51 PDT
Apparently this range straddles the Flash upgrade releng did yesterday.

I'll push to try with this patch backed out, but I suspect it's OK.
Comment 33 Emanuel Hoogeveen [:ehoogeveen] 2011-09-22 08:13:01 PDT
(In reply to Justin Lebar [:jlebar] from comment #30)
> To be clear, I'm not particularly interested in saving the one left-shift
> extra instruction.  One would need to go after the division first; see
> comment 18.

I guess you could avoid the division by doing something like

bytesToAlloc = ((minBytes - 1) & ~(pageSize - 1)) + pageSize;
Comment 34 Justin Lebar (not reading bugmail) 2011-09-22 09:42:52 PDT
The division by pageSize is optimized to that code.  It's the division by elemSize that the compiler can't optimize.
