Closed Bug 685438 Opened 13 years ago Closed 13 years ago

Avoid wasted space in nsTArray_base due to jemalloc rounding up

Categories: Core :: XPCOM, defect
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla9

Reporter: n.nethercote
Assignee: justin.lebar+bug
Whiteboard: [MemShrink:P2][clownshoes]
Attachments: 1 file, 4 obsolete files

nsTArray_base<Alloc>::EnsureCapacity has this code:

  Header *header;
  if (UsesAutoArrayBuffer()) {
    // Malloc() and copy
    header = static_cast<Header*>
             (Alloc::Malloc(sizeof(Header) + capacity * elemSize));
    if (!header)
      return PR_FALSE;

    memcpy(header, mHdr, sizeof(Header) + Length() * elemSize);
  } else {
    // Realloc() existing data
    size_type size = sizeof(Header) + capacity * elemSize;
    header = static_cast<Header*>(Alloc::Realloc(mHdr, size));
    if (!header)
      return PR_FALSE;
  }

AFAICT |capacity| is always a power-of-two, which means we're allocating slightly more than a power-of-two, which jemalloc rounds up to the next class size.  E.g. 1024 -> 1032 -> 2048.  I'm seeing this quite a lot relating to CSS stuff (e.g. below nsCSSRuleProcessor::CascadeSheet).

Is it possible to store the header separately?  It's late in the day and nsTArray is intimidating to a newcomer, so I don't yet know the answer to that question.

Oh, I found this:

 // nsTArray_base stores elements into the space allocated beyond
 // sizeof(*this).  This is done to minimize the size of the nsTArray
 // object when it is empty.
 struct NS_COM_GLUE nsTArrayHeader
 {   
   static nsTArrayHeader sEmptyHdr;
   
   PRUint32 mLength;
   PRUint32 mCapacity : 31;
   PRUint32 mIsAutoArray : 1;
 };

That may minimize the size of zero-length arrays, but it really hurts with larger arrays.

This bug is probably highly related to bug 682735;  I freely admit to not having followed the details and current situation with that bug, but it would be nice if the above behaviour could be taken into consideration.
BTW, after opening Gmail and TechCrunch I see that there is almost 8MB of *cumulative* waste from this problem.  If a lot of that cumulative waste is also from long-lived allocations (as CSS stuff tends to be), this could be a big win for memory usage.
Whiteboard: [MemShrink][clownshoes]
> AFAICT |capacity| is always a power-of-two

It's a power of two times the original capacity passed to the constructor or the first EnsureCapacity call, right?

In practice, that will also be a power of 2, of course.

> e.g. below nsCSSRuleProcessor::CascadeSheet

Where below there?  There are some temporary arrays there that can get long, and some long-lived ones that shouldn't get long (and that we're about to make into auto arrays)... would be good to know which ones you're particularly seeing.  That said, it may be worth waiting for the existing bugs on the CSS arrays to be sorted out before measuring that.
Er, just read comment 1.

Does "cumulative" mean "total rounding-up of all allocations including ones that have already been deallocated since then" in this case?
> Does "cumulative" mean "total rounding-up of all allocations including ones that have already been 
> deallocated since then" in this case?

I think so.

I'd looked at this, but then I decided it wasn't as big a problem as all the malloc'ing, because the largest I could get the sum of malloc_usable_size over live nsTArrays' heap allocations was 4MB out of 200MB heap-allocated, and at most half of that is waste.  But I do think this is worth fixing, either as part of bug 682735 or separately.

> Is it possible to store the header separately?  It's late in the day and nsTArray is 
> intimidating to a newcomer, so I don't yet know the answer to that question.

One could.  It's done this way because it's one malloc for header + data rather than two; it also increases the locality of accesses.

I think the solution is probably just to change how TArray chooses its capacity.
> But I do think this is worth fixing, 

If only because, as you pointed out in another bug, this causes unnecessary churn in the allocator.
(In reply to Boris Zbarsky (:bz) from comment #2)
> > AFAICT |capacity| is always a power-of-two
> 
> It's a power of two times the original capacity passed to the constructor or
> the first EnsureCapacity call, right?
> 
> In practice, that will also be a power of 2, of course.

Yes.  But capacity*elemSize is not necessarily a power-of-two.  I've found that's a big remaining source of slop bytes in the various array, vector, and hash table types we have.  It's a bit tricky to fix, though (well, arrays/vectors are easier than hash tables), and beyond the scope of this bug, because in the case I'm seeing capacity*elemSize *is* a power-of-two.


> > e.g. below nsCSSRuleProcessor::CascadeSheet
> 
> Where below there?

I'm seeing lots of stack traces like this:

 track (jemalloc.c:3780) 
 realloc (jemalloc.c:3839)
 moz_xrealloc (mozalloc.cpp:135)
 nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:89)
 CascadeRuleEnumFunc(mozilla::css::Rule*, void*) (nsTArray.h:811)
 nsVoidArray::EnumerateForwards(int (*)(void*, void*), void*) (nsVoidArray.cpp:724)
 nsCSSRuleProcessor::CascadeSheet(nsCSSStyleSheet*, CascadeEnumData*) (nsCOMArray.h:65)
 nsCSSRuleProcessor::RefreshRuleCascade(nsPresContext*) (nsCSSRuleProcessor.cpp:3034)
 nsCSSRuleProcessor::GetRuleCascade(nsPresContext*) (nsCSSRuleProcessor.cpp:2992)
 nsCSSRuleProcessor::RulesMatching(AnonBoxRuleProcessorData*) (nsCSSRuleProcessor.cpp:2261)
 int EnumRulesMatching<AnonBoxRuleProcessorData>(nsIStyleRuleProcessor*, void*) (nsStyleSet.cpp:475)

The line numbers are a bit screwy because of all the inlining, but you can work out what's going on if you squint a bit.


> Does "cumulative" mean "total rounding-up of all allocations including ones
> that have already been deallocated since then" in this case?

Sorry, I wasn't very clear.  For every heap allocation, I print out a line showing the requested size and the actual size.  But I don't do that for deallocations (because the requested size has been lost).  So, to answer your question:  yes.


(In reply to Justin Lebar [:jlebar] from comment #5)
> > But I do think this is worth fixing, 
> 
> If only because, as you pointed out in another bug, this causes unnecessary
> churn in the allocator.

I realized that DMD's output could tell me how much of this waste is long-lived.  For a (64-bit) run where I opened 5 Gmail tabs, we're wasting a shade over 1MB.  And it's spread out in lots of 1KB and 2KB chunks, which is the worst-case scenario (because we have lots of under-utilized pages).  Definitely worth fixing, IMHO.

(It's also worth noting that the waste only gets bad when the arrays are at least 512B in size;  below that jemalloc's quantum size is 16 bytes so any waste from rounding is proportionally much less.)
> I'm seeing lots of stack traces like this:

OK, so I'm assuming this is the array in the RuleByWeightEntry.  That's one of the ones I looked at in bug 681755; you can see the log for it at https://bug681755.bugzilla.mozilla.org/attachment.cgi?id=555536 (the first line is the total number of arrays seen there; each subsequent line has a count followed by a length, sorted by length).  Bug 681755 comment 8 mentions that a small TArray would be good here, but there's a _very_ long tail...  Each entry in this array is two words, which on a 64-bit system is 16 bytes, so once we're over 32 entries we start running into jemalloc's non-quantized allocations.  That log shows plenty of stuff over 32 entries.  ;)  Not unexpected, given what's in these arrays (all CSS rule+selector pairs with a given weight).

The one piece of good news is that this is a transient data structure; it goes away before RefreshRuleCascade returns.

The other good news is that we don't actually _need_ an array here.  We just need some sort of order-preserving (in the sense that we can add stuff to it and then treat it as a FIFO) storage that we can add a bunch of things to, then go through them in order.  We add once, then go through once.  Ideally both would be reasonably fast, of course.  So if there is a better data structure to use here, or any other change that would improve something useful, it's totally doable.  Again, I'm not sure how much of a concern things are here given that this is a transient data structure.
I have patches for this which I'll post soon.
Assignee: nobody → justin.lebar+bug
OS: Linux → All
Hardware: x86_64 → All
Attached patch Patch v1 (obsolete) — Splinter Review
Attached patch Part 2. v1 (obsolete) — Splinter Review
Hm.  This is causing build errors on Android because

 * The patch requires nsTArray<E> to know sizeof(E) so we can statically know the denominator of a division in the TArray code when we resize the array to a power of 2.  (sizeof(E) is usually a power of 2, so it's easy to divide by; but if we don't template on sizeof(E), we'll end up doing a much slower division op.)

 * FontListEntry is an IPDL-defined type (defined in PContent.ipdl)
 * gfxAndroidPlatform.h creates nsTArray<FontListEntry>
 * Therefore gfxAndroidPlatform.h needs to include PContent.h; forward-declaring class FontListEntry doesn't cut it.
 * If you include PContent.h, you have to do so before prtypes.h.
 * Now every file which includes gfxPlatform.h, which includes gfxAndroidPlatform.h, needs to include gfxPlatform.h as the first include.

I'll fix up the headers, but I hope we don't mind silly ordering requirements like this.
Actually, this isn't so bad; gfxPlatform.cpp includes gfxAndroidPlatform.h, but gfxPlatform.h doesn't.  So the weird ordering only applies to files which include gfxAndroidPlatform.h.
Attached patch Part 1. v2 (obsolete) — Splinter Review
Attachment #561267 - Flags: review?(bzbarsky)
Attachment #560940 - Attachment is obsolete: true
Attached patch Part 2 (fix up code). v2 (obsolete) — Splinter Review
Attachment #561268 - Flags: review?(bzbarsky)
Attachment #560942 - Attachment is obsolete: true
I'm really not a great reviewer for this, esp. if you want it fast.  Can you pick on someone who's actually read the tarray guts carefully before?
Whiteboard: [MemShrink][clownshoes] → [MemShrink:P2][clownshoes]
Comment on attachment 561267 [details] [diff] [review]
Part 1. v2

Roc, do you want to do the honors?
Attachment #561267 - Flags: review?(bzbarsky) → review?(roc)
Attachment #561268 - Flags: review?(bzbarsky) → review?(roc)
Are we making ElemSize a template parameter of nsTArray_base solely to avoid an integer division when allocating? That seems unnecessary.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> Are we making ElemSize a template parameter of nsTArray_base solely to avoid
> an integer division when allocating? That seems unnecessary.

I was worried about ARM's software division.  I figured its cost might be comparable to a lock-free malloc path, but I have absolutely no data to back that up.
I had no idea integer division required software on ARM. There are lots of places we do integer division...

Apparently there's code to do division in less than 45 cycles:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0378c/Chdifaha.html
and apparently the worst-case for the default algorithm is less than 100 cycles. I doubt that dominates nsTArray allocation. Let's just go with the simple code.
Isn't the whole point of nsTArray_base to be the code that can be compiled just once, so that the templates don't bloat code size?
Yes, although you could argue that making the size the template parameter has less code-size impact than using the full type, since lots of types have the same size.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #21)
> Yes, although you could argue that making the size the template parameter
> has less code-size impact than using the full type, since lots of types have
> the same size.

And indeed that's what I saw in codesighs; I should have mentioned that here.

I'll rework the patch to use a true division.
Attached patch Patch v3Splinter Review
Attachment #561356 - Flags: review?(roc)
We don't need part 2 anymore, I assume.
Attachment #561267 - Attachment is obsolete: true
Attachment #561267 - Flags: review?(roc)
Attachment #561268 - Attachment is obsolete: true
Attachment #561268 - Flags: review?(roc)
Target Milestone: --- → mozilla9
Hm, comment didn't go through last night, it appears.

> We don't need part 2 anymore, I assume.

Correct.

Inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/069e927983a9

This doesn't appear to have moved any of our benchmarks, memory or CPU.  That's not surprising, given that the amount of wasted space is usually small, both as a fraction of a single array's size and as a fraction of the sum of all arrays' sizes.  I doubt there's a large effect on fragmentation, but even if there were, our tests don't test for that.
Whiteboard: [MemShrink:P2][clownshoes] → [MemShrink:P2][clownshoes][inbound]
https://hg.mozilla.org/mozilla-central/rev/069e927983a9
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink:P2][clownshoes][inbound] → [MemShrink:P2][clownshoes]
Comment on attachment 561356 [details] [diff] [review]
Patch v3

>+  const PRUint32 pageSizeBytes = 12;
>+  const PRUint32 pageSize = 1 << pageSizeBytes;
>+
>+  PRUint32 minBytes = capacity * elemSize + sizeof(Header);
>+  PRUint32 bytesToAlloc;
>+  if (minBytes >= pageSize) {
>+    // Round up to the next multiple of pageSize.
>+    bytesToAlloc = pageSize * ((minBytes + pageSize - 1) / pageSize);
>+  }
>+  else {
>+    // Round up to the next power of two.  See
>+    // http://graphics.stanford.edu/~seander/bithacks.html
>+    bytesToAlloc = minBytes - 1;
>+    bytesToAlloc |= bytesToAlloc >> 1;
>+    bytesToAlloc |= bytesToAlloc >> 2;
>+    bytesToAlloc |= bytesToAlloc >> 4;
>+    bytesToAlloc |= bytesToAlloc >> 8;
>+    bytesToAlloc |= bytesToAlloc >> 16;
I wonder whether the compiler is clever enough to notice that bytesToAlloc >> 16 is always zero because bytesToAlloc < pageSize.
If minBytes == 0, |minBytes - 1| will underflow, and then the >> 16 matters.  And you can't prove that minBytes > 0, because |capacity * elemSize + sizeof(Header)| might overflow to 0.

If I take out the minBytes - 1, then gcc optimizes the extra shift away.
What does it do if you special-case minBytes == 0 to return 1? The branch itself should be very predictable, given that it should never evaluate as true (though it's technically correct, as the nearest integer power-of-two for 0 is 1).

By the way, I assume there's a bytesToAlloc++; at the end of that sequence? Otherwise it'll always return 2^n - 1.
To be clear, I'm not particularly interested in saving the one left-shift extra instruction.  One would need to go after the division first; see comment 18.
This may have caused TP5 private bytes to move on Linux x86 and Linux x64.  Strangely enough, the x86 and x64 plots switched places.

If graphserver [1] is to be believed, this happened immediately after a NPOTB push [2]!

I'm investigating...

[1] http://graphs-new.mozilla.org/graph.html#tests=[[90,63,1],[90,63,12],[90,63,15],[90,63,13],[90,63,14]]&sel=none&displayrange=7&datatype=running
[2] http://hg.mozilla.org/integration/mozilla-inbound/rev/c7233b484b95
Apparently this range straddles the Flash upgrade releng did yesterday.

I'll push to try with this patch backed out, but I suspect it's OK.
(In reply to Justin Lebar [:jlebar] from comment #30)
> To be clear, I'm not particularly interested in saving the one left-shift
> extra instruction.  One would need to go after the division first; see
> comment 18.

I guess you could avoid the division by doing something like

bytesToAlloc = ((minBytes - 1) & ~(pageSize - 1)) + pageSize;
The division by pageSize is optimized to that code.  It's the division by elemSize that the compiler can't optimize.