Closed Bug 677571 Opened 13 years ago Closed 13 years ago

Investigate nsTArray's growth behavior.

Categories

(Core :: XPCOM, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED

People

(Reporter: khuey, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [MemShrink:P2][clownshoes])

Attachments

(6 files, 2 obsolete files)

In particular, it looks like it does the "I want to allocate a power of two but I'm dumb and end up just a bit over that" thing.

It doesn't seem to try to allocate a power of 2.  Just twice the size it already has.  And in particular, the very first allocation (the mHdr == EmptyHdr() case in EnsureCapacity) just uses the provided capacity.  So we'd start doubling from there.

It's definitely worth instrumenting to see what actual runtime behavior looks like.

Note that XPCOM strings have basically the same code (but possibly different usage patterns, so may want a separate bug).
> capacity = NS_MAX<size_type>(capacity, mHdr->mCapacity * 2U);

Note that mCapacity is the number of elements, not their size.

Still seems worthwhile to do something intelligent here.  Maybe bug 675226 would help.  :)
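As a concrete sketch of the growth policy described above (hypothetical names and a hypothetical header size, not the actual nsTArray source): the new capacity is the larger of the requested capacity and twice the current one, and the very first allocation just uses the requested capacity, so doubling proceeds from whatever that happened to be.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical constant mirroring the behavior described in this bug:
// a small heap header (length + capacity) precedes the elements.
const size_t kHeaderSize = 8;

// New element capacity: max(requested, 2 * current).  The first allocation
// (current == 0) just uses the requested capacity directly.
size_t NextCapacity(size_t current, size_t requested) {
  return std::max(requested, current * 2);
}

// Bytes actually requested from malloc: header plus elements.  Because the
// header is tacked on, doubling the element capacity gives requests that
// are always "a bit over" a round number.
size_t AllocationSize(size_t capacity, size_t elemSize) {
  return kHeaderSize + capacity * elemSize;
}
```

For example, appending 4-byte elements one at a time starting from capacity 1 yields malloc requests of 12, 16, 24, 40, 72, ... bytes, which the allocator then rounds up to its size classes, wasting the difference as slop.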
I'll fix this.
Assignee: nobody → justin.lebar+bug
Whiteboard: [MemShrink] → [MemShrink][clownshoes]
Whiteboard: [MemShrink][clownshoes] → [MemShrink:P2][clownshoes]
I instrumented nsTArray to observe the maximum number of elements an array contained when it died.  Hopefully most arrays are destroyed when the browser shuts down nicely.

I started up the browser, visited Gmail, and then closed the browser.  227,000 heap-based (as opposed to auto-based) nsTArrays were destroyed.  Here's the breakdown of their maximum lengths:

0: 52%
1: 18%
2: 10%
3: 2.4%
4: 1.4%
5: 0.7%
6 and greater: 2.5%

This indicates that, at least for small sizes, doubling starting from 1 makes sense, since the vast majority of lists are going to be very small.
Gavin pointed out that the numbers above don't add up.  The bug was in <1>, where I was off by a factor of 2.  It should be:

0: 52%
1: 31%
2: 10%
3: 2.4%
4: 1.4%
5: 0.7%
6 and greater: 2.5%
(In reply to comment #6)
> Gavin pointed out that the numbers above don't add up.  The bug was in <1>,
> where I was off by a factor of 2.  It should be:
> 
> 0: 52%
> 1: 31%

So, for arrays of size 0, we use sizeof(Hdr*) bytes.  For arrays of size 1, we use sizeof(Hdr*)+sizeof(Hdr)+sizeof(T) bytes.  Would it make sense for us to tag the lower bit of mHdr for arrays of size 1, and just embed the object itself in mHdr, rather than appending it to the header?
> So, for arrays of size 0, we use sizeof(Hdr*) bytes.  For arrays of size 1, we 
> use sizeof(Hdr*)+sizeof(Hdr)+sizeof(T) bytes.  Would it make sense for us to 
> tag the lower bit of mHdr for arrays of size 1, and just embed the object 
> itself in mHdr, rather than appending it to the header?

Maybe!

sizeof(Hdr) == 12, which jemalloc rounds up to 16.  So if you want to store just one 4-byte object, you get it for free.  If you want to store 8 bytes of data (12 + 8 = 20 bytes altogether), jemalloc rounds up to 32.

The trick would be to determine if this actually helps.  It's one malloc() either way, so I don't presume it'd be faster.  I'd need to measure how many of those one-element nsTArrays stick around, and for how long...
To extend on Ehsan's analysis from comment 7:

Suppose we used as our flags the bottom bits in the pointer to the header.  On a 32-bit machine, we have 2 bits, so we can have 4 states.  That means we can store 3 elements before we need a proper header.

Suppose we're storing 4-byte elements in the array.  Let's calculate how much memory we'd use for a variety of array lengths (ignoring the header pointer, since it's constant everywhere):

  sizeof(hdr) = 12 bytes

  slop() is the function described at the top of jemalloc.c [1] describing how jemalloc rounds up allocation requests.

== Currently ==
1:      slop(hdr + 4)          = 16 bytes
2 to 5: slop(hdr + [8  to 20]) = 32 bytes
6 to 9: slop(hdr + [24 to 36]) = 48 bytes

== Storing up to 3 elements instead of header ==
1:      slop(4)                =  4 bytes
2:      slop(8)                =  8 bytes
3:      slop(12)               = 16 bytes
4 and above are same as current.

So that would be a big space win for more than 40% of our lists.

IIRC, on a 64-bit system we have 16 bits in the middle of a pointer, plus the two at the end, to play with.  That's enough to store an 8-bit length and an 8-bit  capacity, with two bits to spare!  :)

[1] http://mxr.mozilla.org/mozilla-central/source/memory/jemalloc/jemalloc.c#47
> IIRC, on a 64-bit system we have 16 bits in the middle of a pointer, plus
> the two at the end, to play with.  That's enough to store an 8-bit length
> and an 8-bit  capacity, with two bits to spare!  :)

I'm really uncomfortable with using anything other than the 2 or 3 bottom bits for extra info -- that's totally standard, using bits from the middle is highly non-standard and IMO dangerous.
This sort of usage pattern (80+% of arrays with 0/1 elements) was the reason for nsAutoVoidArray:

// A zero-based array with a bit of automatic internal storage
(it includes space in the base object for an array of up to 8 items)

and nsSmallVoidArray:

//  nsSmallVoidArray is not a general-purpose replacement for
//  ns(Auto)VoidArray because there is (some) extra CPU overhead for arrays
//  larger than 1 element, though not a lot.  It is appropriate for
//  space-sensitive uses where sizes of 0 or 1 are moderately common or
//  more, and where we're NOT storing arbitrary integers or arbitrary
//  pointers.

Note that these were not just to save space, they were really for performance to reduce the number of calls to malloc().  (nsSmallVoidArray was intended to save space too)

"Those that do not know history..." ;-)

Now, these depended on using the "right" one for the usage pattern.  I suspect one could make nsTArray be like nsAutoVoidArray or even nsSmallVoidArray, and create a variant that's tuned for larger arrays to use in the occasional usage where large arrays are common.
(In reply to Nicholas Nethercote [:njn] from comment #10)
> I'm really uncomfortable with using anything other than the 2 or 3 bottom
> bits for extra info -- that's totally standard, using bits from the middle
> is highly non-standard and IMO dangerous.

Don't fatvals rely on the fact that 64-bit pointers live in a 48-bit address space?

In any case, we should do first the 90% solution, which is to get something working on win32.  :)
> Don't fatvals rely on the fact that 64-bit pointers live in a 48-bit address
> space?

Yes, and they have to jump through hoops on some architectures to make sure that JS allocations come from parts of the address space such that this is true, because it is NOT enforced by the architecture itself.

I have to wonder what fraction of those 0-or-1-element arrays could just be auto arrays....
Blocks: 678019
(In reply to Boris Zbarsky (:bz) from comment #13)
> I have to wonder what fraction of those 0-or-1-element arrays could just be
> auto arrays....

This is now bug 678019.
No longer blocks: 678019
Blocks: 678019
No longer blocks: 678019
>   sizeof(hdr) = 12 bytes

...actually, it's 8 bytes.  So to recalculate:

== Currently ==
1 to 2: slop(hdr + [4 to 8])   = 16 bytes
3 to 6: slop(hdr + [12 to 24]) = 32 bytes

== Storing up to 3 elements instead of header ==
1:      slop(4)                =  4 bytes
2:      slop(8)                =  8 bytes
3:      slop(12)               = 16 bytes
4 and above are same as current.
Note that for 0-element arrays we shouldn't be allocating a header at all! We should just be pointing at nsTArrayHeader::sEmptyHdr. Is that not the case?

nsSmallVoidArray is also neat since it lets us store a single pointer without allocating a header. It should be possible to pull off the same trick in nsTArray with enough template machinery.

However nsSmallVoidArray comes at a small performance overhead, so we might want to make it opt-in (for example by using a separate class like nsSmallTArray). Though it would be interesting to measure the performance cost of enabling it by default on all nsTArrays which store pointers.
> Note that for 0-element arrays we shouldn't be allocating a header at all! We 
> should just be pointing at nsTArrayHeader::sEmptyHdr. Is that not the case?

We don't allocate a header for 0-element arrays, but rather point to sEmptyHeader.
> nsSmallVoidArray is also neat since it lets us store a single pointer without 
> allocating a header. It's probably doable to do the same tricks on nsTArray 
> with enough template tricks.

Yes, I think we might want to try this.  But we should try to estimate whether it'd be a win, either in terms of memory or performance.  Most TArrays appear to die very quickly, but I don't have any numbers on that.

I'll instrument to see how much memory we're using for TArrays of various sizes.
In a release build, after loading nytimes.com, slashdot.org, and gmail.com, we have in existence about 30,000 heap nsTArrays.  Assuming all the elements are 8 bytes long, that's 16 bytes on the heap per array (8 bytes for the header, 8 for the element, and no jemalloc slop), or about 0.5 MB.

In total, I have 4 MB in heap-based TArrays under the same workload, representing 2% of heap-allocated memory and 1.6% of RSS.  (The 4 MB is counted by malloc_usable_size, so it includes clownshoes, and overestimates slightly because I added an extra field to the header.)

It doesn't seem that optimizing the singletons or other TArrays would help memory usage much.  It might improve performance, because we'd be calling malloc() less frequently.  I have to imagine that malloc'ing 16 bytes is very fast, though.

The one thing I think we should do is use malloc_usable_size to figure out how big our array actually is, or otherwise make sure that the array we allocate always fits nicely into a malloc bucket.  It should be easy enough.  But I'm not going to expect a big improvement in speed or memory usage.
I'm not shocked that the small-TArray issue has minimal impact on mem use directly.  The current change likely has a bigger impact on perf - even in a binning allocator like jemalloc there are still semaphores to acquire, bookkeeping, and occasional large perf hits (page alloc, etc).

SmallTArrays might save a little more.  

The usable-size thing is a fairly obvious win and costs little.  Also, anywhere it can be pushed to compile-time instead of run-time it would help minimize the downside.
I was under the impression that jemalloc let you malloc without acquiring a lock (by using separate arenas for different threads), but that doesn't appear to be the case, at least not for the version of jemalloc we currently have in the tree.

So to be done:

 * Estimate how much getting rid of the malloc call for a single-element array would improve performance.  My guess is: not much, because I don't recall malloc plus free ever being more than 5% in a profile, and this can't be more than 10% of all malloc/free.

 * Estimate how much bigger TArrays are than they need to be, because we resize too aggressively at the high end.  (This was actually the original intent of this bug.)  If we could shave off a few hundred KB from these allocations
...then that, plus the savings on the 1-element allocations, might be small but worthwhile in terms of memory used.
I just did a test to determine how the number of calls to malloc() compares with the number of 1-element nsTArrays created.

I opened nytimes.com and gmail.com.  This resulted in 1,681,000 malloc() calls, and 312,000 calls to nsTArray::IncrementLength where the resulting length was 1.  If I did this right, this suggests that about 18% of calls to malloc are due to single-element TArrays!
Cut it off around 0.8% to keep size down

Stack trace point is in EnsureCapacity, right after "if (mHdr == EmptyHeader()) {"

Easy to extend to other trace points
Attached patch Patch that generated the trace (obsolete) — Splinter Review
Note: the trace I uploaded was 30 or 40 tabs of bugzilla, mdc, intranet, other random things -- loading and then quitting.
Layout/style stuff ranks pretty high here.
This is basically a log of who uses TArrays, right?  Most users just insert one element at a time.

In any case, I think the right thing to do is to modify TArray so it's effectively TAutoArray<1> (although actually using TAutoArray<1> would waste more space than I'd like) and then, if it's necessary, find callsites which usually have *larger* array lengths and annotate them.

Also, I realized last night that my instrumentation was counting both TArrays of size 1 and TAutoArrays of size 1.  I'll redo it, but I don't expect the numbers to change a lot.
As mentioned in IRC, this is a list of who uses TArrays *and* puts something in them, with data on how many times they do that.  Not included is info on how many are created of each type but never used.  That could be generated at the same time by adding a StackTrace() call at creation...
This has a second probe at nsTArray() in nsTArray.h to catch nsTArray creations.  Note that this means you have to read the profile a little more carefully.
Please note that these jprof patches are by no means ready for primetime; I will probably build a more configurable/usable stacktrace-enabled jprof out of this.
Attachment #555387 - Attachment is obsolete: true
(In reply to Justin Lebar [:jlebar] from comment #24)
> I just did a test to determine how the number of calls to malloc() compares
> with the number of 1-element nsTArrays created.

Per comment 28, my previous test was counting nsTAutoArrays in addition to nsTArrays.  I fixed the instrumentation so it now specifically looks at when we call malloc() in EnsureCapacity with aCapacity==1.

The new number is that 9% of malloc()s are due to capacity-1 TArrays.  Still totally worth fixing.
There's actually a reason that the nsTArray<RuleValue> is not an auto array: it's a member of a hashtable entry, and those can get moved in memory.  And nsAutoTArray is not safe to move.

I could probably work around it with a MoveEntry hook if I had to, I suppose....  Would that work?

Similar for the TArray<nsCSSSelector*>.

It might be interesting to look at the actual distribution of lengths for those arrays.  Certainly they're never 0-length; if we're adding the hashtable entry at all, we're going to put stuff in the array.  So some sort of auto-array there makes sense.
Started an existing but almost-empty profile.  Loaded nytimes.com, amazon.com, slashdot.org, mozilla.org, quit.

Didn't seem to "hang up" generating tons of tarray hits in ::Stop the way the larger profile does.  Still has ~1000 tarrays created due to ::Stop (and 9000 more due to other CSS stuff causing a GetRuleCascade()).

This may be more representative; looks like there's a bug tickled in the larger profile by the overhead of saving all the stacks, causing ::Stop to get called a lot and force a rulecascade.
I filed bug 681755 on the rulehash arrays.
Depends on: 681755
In a test loading 10 tabs, I had 586 nsDocLoader::OnStopRequests, but it showed up 12,000 times in the profile.  That means that on average ~23 nsTArrays/nsAutoTArrays were created, or had their initial storage allocated, for each OnStopRequest: 9K GetUserFontSetInternal, 3K doStopDocumentLoad (eventually going to FlushPendingNotifications), 2,700 calls to self, and 800 FlushPendingNotifications.

RefreshRuleCascade (9K) leads to a bunch: ~1 directly, bunch via AppendRule and AddSelector, mostly ending up in RuleHash (this is bz's bug 681755)

The test as currently run is a little confusing if you're not careful, since it traces multiple related events: it counts all nsTArray_base creations (nsTArray, nsAutoTArray, etc.), and it counts when an nsTArray goes from 0 to 1 elements (and allocates).  This is stated above, and you can see it if you trace the trees in the jprof output, but it is easy to forget.  It's also easy to forget that this is a stack analyzer, so except for the actual trigger it's not recording the number of events but the number of times a routine was on the stack when we captured an event.
> 9K GetUserFontSetInternal

Is that 9K _in_ or _under_ this function?  Yesterday I thought you were talking about _in_...  In which case this number still makes absolutely no sense.  If you put a printf at that spot in the code, does it get hit that many times?

If you mean _under_, on the other hand, then that includes RefreshRuleCascade and the number makes sense.  Sounds like this is what's going on.
9k in-and-under.  Those are largely if not completely the RuleHash/etc stuff: 

> RefreshRuleCascade (9K) leads to a bunch: ~1 directly, bunch via AppendRule and AddSelector, mostly ending up in RuleHash (this is bz's bug 681755)

In some cases where the TArray rarely gets added to, it may be cheaper to use a TArray pointer lazily than to instantiate a tarray always.  It does require some care to make sure all uses of the pointer instantiate it if needed, and trades some code size for speed.  This behavior could be made even safer with judicious use of macros.  (A class would likely eat your win since TArray isn't expensive to init or in base size.)

if (!mTArrayPtr)
  mTArrayPtr = new nsTArray<blah>;  // infallible malloc assumed
mTArrayPtr->AddWhatever();
Blocks: 682735
I've filed bug 682735 on improving nsAutoTArray and making nsTArray default to nsAutoTArray<1>.  I don't think we need any more investigation at this time, since the results of the investigation will be invalidated by large changes to how nsTArray works.  Marking this fixed, since the investigation is done.
I think you forgot to mark it as fixed :)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 681755
No longer depends on: 681755
Blocks: 686597
(In reply to Justin Lebar [:jlebar] from comment #19)
> In a release build, after loading nytimes.com, slashdot.org, and gmail.com,
> we have in existence about 30,000 heap nsTArrays.  Assuming all the elements
> are 8 bytes long, that's 16 bytes on the heap (8 bytes for the header, 8 for
> the element, and no jemalloc slop), or about .5mb.

That's interesting data as it shows that sizeof(mHdr) matters.

I would like to know whether you also have data on the absolute number (not percentage) of empty nsTArrays.

What I'm really wondering is how much memory is being saved by doing the mHdr boxing to minimize the size of empty arrays.
> What I'm really wondering is how much memory is being saved by doing the mHdr boxing to minimize the 
> size of empty arrays.

You can't necessarily measure that just by counting the empty TArrays in existence at one point in time.  Performance issues with calling malloc() more often aside, all these extra short-lived allocations could fragment the heap.

Anyway, what's the potential benefit of getting rid of sEmptyHdr?  It'd simplify the TArray code, but only a bit...
(In reply to Justin Lebar [:jlebar] from comment #42)
> You can't necessarily measure that just by counting the empty TArrays in
> existence at one point in time.  Performance issues with calling malloc()
> more often aside, all these extra short-lived allocations could fragment the
> heap.

mHdr is not necessary to avoid dynamic memory allocation on empty arrays. An array class directly storing a mElements pointer could simply set it to NULL to indicate that the array is empty, just as nsTArray currently sets mHdr to &sEmptyHdr whenever the array is empty.

> 
> Anyway, what's the potential benefit of getting rid of sEmptyHdr?  It'd
> simplify the TArray code, but only a bit...

Aside from simplifying nsTArray, getting rid of mHdr would:
 - make it possible to access mLength, mCapacity and mIsAutoArray without having to first dereference mHdr
 - make all nsTArray element accessors (ElementAt...) faster, as they wouldn't have to do that addition anymore
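A minimal sketch of the layout being proposed here (hypothetical code, not Mozilla source): length and capacity stored in the object itself, with a null element pointer standing in for the empty state.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical inline-header array: length and capacity live in the object
// itself, and an empty array is just mElements == nullptr -- no shared
// sEmptyHdr sentinel needed.  The trade-off, as noted below, is that
// sizeof(InlineHeaderArray) is three words, not sizeof(void*).
template <typename T>
class InlineHeaderArray {
 public:
  InlineHeaderArray() : mElements(nullptr), mLength(0), mCapacity(0) {}
  ~InlineHeaderArray() { free(mElements); }

  size_t Length() const { return mLength; }          // no mHdr dereference
  T& operator[](size_t i) { return mElements[i]; }   // no header offset

  void AppendElement(const T& aElem) {
    if (mLength == mCapacity) {
      mCapacity = mCapacity ? mCapacity * 2 : 1;
      mElements = static_cast<T*>(realloc(mElements, mCapacity * sizeof(T)));
    }
    mElements[mLength++] = aElem;
  }

 private:
  T* mElements;  // nullptr when empty
  size_t mLength;
  size_t mCapacity;
};
```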
> mHdr is not necessary to avoid dynamic memory allocation on empty arrays.

Oh, okay.  I misunderstood what you were suggesting.

There exists code in our tree which assumes sizeof(nsTArray) == sizeof(void*).  You'd have to change all of this before you could store mLength and mCapacity inside the nsTArray.
(In reply to Justin Lebar [:jlebar] from comment #44)
> > mHdr is not necessary to avoid dynamic memory allocation on empty arrays.
> 
> Oh, okay.  I misunderstood what you were suggesting.
> 
> There exists code in our tree which assumes sizeof(nsTArray) ==
> sizeof(void*).  You'd have to change all of this before you could store
> mLength and mCapacity inside the nsTArray.

Interesting, that explains more of the trouble I've had trying to do some instrumentation.

But if it turned out that we don't have many empty nsTArrays, it would still be worth changing that.
> Aside from simplifying nsTArray, getting rid of mHdr would:
>  - make it possible to access mLength, mCapacity and mIsAutoArray without
> having to first dereference mHdr
>  - make all nsTArray coefficient accessors (ElementAt...) faster as they
> wouldn't have to do that addition anymore

I have to imagine that the elements of arrays are accessed more often than the length or capacity.  So the first point doesn't seem like a large potential win.

I think x86 has an addressing mode which makes the second point basically free -- you can do |mov eax, [ebx + ecx*4 + 8]|, right?  But maybe it's not actually free, and maybe it makes a difference on ARM.  This would be relatively easy to test, since you could just change the mHdr pointer to mElems, pointing right after the header.
Making the length a member variable would probably mean that the compiler can optimize this

  nsTArray<PRUint32> arr;
  for (PRUint32 i = 0; i < arr.Length(); i++) {
     ...
  }

just as well as this

  nsTArray<PRUint32> arr;
  PRUint32 len = arr.Length();
  for (PRUint32 i = 0; i < len; i++) {
    ...
  }

since in the first case, the array's length is a member variable, so effectively a stack variable.  Again, I'm not sure this matters...
Here's a patch counting the maximum (peak) number of empty nsTArrays that we reach during Firefox's lifetime.

Some results:

empty session:                                9181
nytimes.com:                                 16056
arstechnica.com:                             19162
nytimes.com + arstechnica.com:               25120
youtube.com:                                 19403
nytimes.com + arstechnica.com + youtube.com: 35350

So the conclusion is that the peak number of empty nsTArrays is, in practice, very roughly given by this formula:

    9000 * (1 + max_number_of_nondefault_tabs)

Of course this can vary depending on what the page does. This is just out of a sample of 3 mainstream pages.

Anyway, 9000 empty nsTArrays per tab.

The mHdr boxing saves us 8 bytes per empty nsTArray, so it saves ~70 KB per tab (again, just an order of magnitude).
If you want to take this further, can we do it in a separate bug, please?
Comment on attachment 575762 [details] [diff] [review]
count the maximum number of empty nsTArrays

Filed bug 704174.
Attachment #575762 - Attachment is obsolete: true