Bug 677571 - Investigate nsTArray's growth behavior
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P2][clownshoes]
Product: Core
Classification: Components
Component: XPCOM
Assigned To: Justin Lebar
Blocks: 681755, 682735, 686597, 704174
Reported: 2011-08-09 09:36 PDT by Kyle Huey [:khuey]
Modified: 2011-11-21 09:10 PST

Attachments:
- Instrumentation (6.09 KB, patch), 2011-08-09 15:25 PDT, Justin Lebar
- Instrumentation (12.51 KB, patch), 2011-08-10 21:16 PDT, Justin Lebar
- partial log of callers to grow tarrays from empty to non-empty in EnsureCapacity (273.41 KB, text/html), 2011-08-24 06:33 PDT, Randell Jesup [:jesup]
- Patch that generated the trace (13.51 KB, patch), 2011-08-24 06:35 PDT, Randell Jesup [:jesup]
- profile including calls to nsTArray() (1.06 MB, text/html), 2011-08-24 08:48 PDT, Randell Jesup [:jesup]
- Updated patch with trace point in nsTArray() (14.57 KB, patch), 2011-08-24 08:52 PDT, Randell Jesup [:jesup]
- Results with a small profile loading ~5 sites (1.25 MB, text/html), 2011-08-24 12:48 PDT, Randell Jesup [:jesup]
- count the maximum number of empty nsTArrays (3.31 KB, patch), 2011-11-20 12:02 PST, Benoit Jacob [:bjacob]

Description Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2011-08-09 09:36:55 PDT
In particular, it looks like it does the "I want to allocate a power of two but I'm dumb and end up just a bit over that" thing.
Comment 1 Boris Zbarsky [:bz] 2011-08-09 10:18:04 PDT
It doesn't seem to try to allocate a power of 2.  Just twice the size it already has.  And in particular, the very first allocation (the mHdr == EmptyHdr() case in EnsureCapacity) just uses the provided capacity.  So we'd start doubling from there.

It's definitely worth instrumenting to see what actual runtime behavior looks like.

Note that XPCOM strings have basically the same code (but possibly different usage patterns, so may want a separate bug).
Comment 2 Justin Lebar (not reading bugmail) 2011-08-09 10:22:21 PDT
> capacity = NS_MAX<size_type>(capacity, mHdr->mCapacity * 2U);

Note that mCapacity is the number of elements, not their size.

Still seems worthwhile to do something intelligent here.  Maybe bug 675226 would help.  :)
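The growth rule being discussed can be modeled like this (a simplified sketch of the behavior bz describes, not the actual nsTArray source; GrowCapacity is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simplified model of the EnsureCapacity growth rule described above:
// the very first allocation (the mHdr == EmptyHdr() case) uses the
// requested capacity as-is; subsequent growth allocates at least twice
// the current capacity, i.e.
//   capacity = NS_MAX<size_type>(capacity, mHdr->mCapacity * 2U);
// Note that mCapacity counts elements, not bytes.
size_t GrowCapacity(size_t aCurrentCapacity, size_t aRequestedCapacity) {
  if (aCurrentCapacity == 0) {
    return aRequestedCapacity;  // first allocation: no doubling
  }
  return std::max(aRequestedCapacity, aCurrentCapacity * 2);
}
```

Appending one element at a time from empty yields capacities 1, 2, 4, 8, ..., which is why doubling from 1 is the interesting case for the size distributions below.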
Comment 3 Justin Lebar (not reading bugmail) 2011-08-09 11:04:15 PDT
I'll fix this.
Comment 4 Justin Lebar (not reading bugmail) 2011-08-09 15:20:31 PDT
I instrumented nsTArray to observe the maximum number of elements an array contained when it died.  Hopefully most arrays are destroyed when the browser shuts down nicely.

I started up the browser, visited Gmail, and then closed the browser.  227000 heap-based (as opposed to auto-based) nsTArrays were destroyed.  Here's the breakdown of their maximum lengths:

0: 52%
1: 18%
2: 10%
3: 2.4%
4: 1.4%
5: 0.7%
6 and greater: 2.5%

This indicates that, at least for small sizes, doubling starting from 1 makes sense, since the vast majority of lists are going to be very small.
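The instrumentation described here could be sketched roughly as follows (a hypothetical reconstruction with made-up names; the real code is the "Instrumentation" attachment on this bug):

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical sketch of the instrumentation: remember the maximum
// length each array reaches, and bump a global histogram bucket when
// the array is destroyed. (Illustrative names only; the real patch is
// the "Instrumentation" attachment on this bug.)
std::map<size_t, size_t> gMaxLengthHistogram;

class InstrumentedArray {
 public:
  ~InstrumentedArray() { ++gMaxLengthHistogram[mMaxLength]; }

  void SetLength(size_t aLength) {
    mLength = aLength;
    if (aLength > mMaxLength) {
      mMaxLength = aLength;
    }
  }

 private:
  size_t mLength = 0;
  size_t mMaxLength = 0;
};
```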
Comment 5 Justin Lebar (not reading bugmail) 2011-08-09 15:25:56 PDT
Created attachment 551909 [details] [diff] [review]
Instrumentation
Comment 6 Justin Lebar (not reading bugmail) 2011-08-09 15:44:11 PDT
Gavin pointed out that the numbers above don't add up.  The bug was in <1>, where I was off by a factor of 2.  It should be:

0: 52%
1: 31%
2: 10%
3: 2.4%
4: 1.4%
5: 0.7%
6 and greater: 2.5%
Comment 7 :Ehsan Akhgari (busy, don't ask for review please) 2011-08-09 15:57:01 PDT
(In reply to comment #6)
> Gavin pointed out that the numbers above don't add up.  The bug was in <1>,
> where I was off by a factor of 2.  It should be:
> 
> 0: 52%
> 1: 31%

So, for arrays of size 0, we use sizeof(Hdr*) bytes.  For arrays of size 1, we use sizeof(Hdr*)+sizeof(Hdr)+sizeof(T) bytes.  Would it make sense for us to tag the lower bit of mHdr for arrays of size 1, and just embed the object itself in mHdr, rather than appending it to the header?
Comment 8 Justin Lebar (not reading bugmail) 2011-08-09 16:14:11 PDT
> So, for arrays of size 0, we use sizeof(Hdr*) bytes.  For arrays of size 1, we 
> use sizeof(Hdr*)+sizeof(Hdr)+sizeof(T) bytes.  Would it make sense for us to 
> tag the lower bit of mHdr for arrays of size 1, and just embed the object 
> itself in mHdr, rather than appending it to the header?

Maybe!

sizeof(Hdr) == 12, which jemalloc rounds up to 16.  So if you want to store just one 4-byte object, you get it for free.  If you want to store 8 bytes of data (12 + 8 = 20 bytes altogether), jemalloc rounds up to 32.

The trick would be to determine if this actually helps.  It's one malloc() either way, so I don't presume it'd be faster.  I'd need to measure how many of those one-element nsTArrays stick around, and for how long...
Comment 9 Justin Lebar (not reading bugmail) 2011-08-09 17:51:24 PDT
To extend on Ehsan's analysis from comment 7:

Suppose we used as our flags the bottom bits in the pointer to the header.  On a 32-bit machine, we have 2 bits, so we can have 4 states.  That means we can store 3 elements before we need a proper header.

Suppose we're storing 4-byte elements in the array.  Let's calculate how much memory we'd use for a variety of array lengths (ignoring the header pointer, since it's constant everywhere):

  sizeof(hdr) = 12 bytes

  slop() is the function described at the top of jemalloc.c [1] describing how jemalloc rounds up allocation requests.

== Currently ==
1:      slop(hdr + 4)          = 16 bytes
2 to 5: slop(hdr + [8  to 20]) = 32 bytes
6 to 9: slop(hdr + [24 to 36]) = 48 bytes

== Storing up to 3 elements instead of header ==
1:      slop(4)                =  4 bytes
2:      slop(8)                =  8 bytes
3:      slop(12)               = 16 bytes
4 and above are same as current.

So that would be a big space win for more than 40% of our lists.

IIRC, on a 64-bit system we have 16 bits in the middle of a pointer, plus the two at the end, to play with.  That's enough to store an 8-bit length and an 8-bit  capacity, with two bits to spare!  :)

[1] http://mxr.mozilla.org/mozilla-central/source/memory/jemalloc/jemalloc.c#47
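A toy sketch of the bottom-bit tagging idea (this assumes header allocations are at least 4-byte aligned, so the low two bits of the pointer are normally zero; illustrative only, nsTArray does not actually do this):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of the tagging scheme from comments 7 and 9: heap headers
// are at least 4-byte aligned, so the two low bits of mHdr are normally
// zero. Using them to encode an inline length of 1-3 would let such
// arrays point directly at a bare element buffer with no header at all;
// a tag of 0 means "ordinary header pointer". (Illustrative only.)
struct TaggedHeaderPtr {
  uintptr_t mBits = 0;

  bool HasInlineLength() const { return (mBits & 0x3) != 0; }
  size_t InlineLength() const { return mBits & 0x3; }

  void SetInlineLength(size_t aLength) {
    assert(aLength >= 1 && aLength <= 3);
    mBits = (mBits & ~uintptr_t(0x3)) | aLength;
  }
};
```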
Comment 10 Nicholas Nethercote [:njn] 2011-08-09 18:13:55 PDT
> IIRC, on a 64-bit system we have 16 bits in the middle of a pointer, plus
> the two at the end, to play with.  That's enough to store an 8-bit length
> and an 8-bit  capacity, with two bits to spare!  :)

I'm really uncomfortable with using anything other than the 2 or 3 bottom bits for extra info -- that's totally standard, using bits from the middle is highly non-standard and IMO dangerous.
Comment 11 Randell Jesup [:jesup] 2011-08-09 20:31:47 PDT
This sort of usage pattern (80+% of 0/1 elements) was the reason for 
nsAutoVoidArray:
// A zero-based array with a bit of automatic internal storage
(it includes in the base object space for an array of up to 8 items)

and nsSmallVoidArray:

//  nsSmallVoidArray is not a general-purpose replacement for
//  ns(Auto)VoidArray because there is (some) extra CPU overhead for arrays
//  larger than 1 element, though not a lot.  It is appropriate for
//  space-sensitive uses where sizes of 0 or 1 are moderately common or
//  more, and where we're NOT storing arbitrary integers or arbitrary
//  pointers.

Note that these were not just to save space, they were really for performance to reduce the number of calls to malloc().  (nsSmallVoidArray was intended to save space too)

"Those that do not know history..." ;-)

Now, these depended on using the "right" one for the usage pattern.  I suspect one could make nsTArray be like nsAutoVoidArray or even nsSmallVoidArray, and create a variant that's tuned for larger arrays to use in the occasional usage where large arrays are common.
Comment 12 Justin Lebar (not reading bugmail) 2011-08-09 20:52:18 PDT
(In reply to Nicholas Nethercote [:njn] from comment #10)
> I'm really uncomfortable with using anything other than the 2 or 3 bottom
> bits for extra info -- that's totally standard, using bits from the middle
> is highly non-standard and IMO dangerous.

Don't fatvals rely on the fact that 64-bit pointers live in a 48-bit address space?

In any case, we should do first the 90% solution, which is to get something working on win32.  :)
Comment 13 Boris Zbarsky [:bz] 2011-08-10 10:35:14 PDT
> Don't fatvals rely on the fact that 64-bit pointers live in a 48-bit address
> space?

Yes, and they have to jump through hoops on some architectures to make sure that JS allocations come from parts of the address space such that this is true, because it is NOT enforced by the architecture itself.

I have to wonder what fraction of those 0-or-1-element arrays could just be auto arrays....
Comment 14 Justin Lebar (not reading bugmail) 2011-08-10 13:49:10 PDT
(In reply to Boris Zbarsky (:bz) from comment #13)
> I have to wonder what fraction of those 0-or-1-element arrays could just be
> auto arrays....

This is now bug 678019.
Comment 15 Justin Lebar (not reading bugmail) 2011-08-10 16:55:19 PDT
>   sizeof(hdr) = 12 bytes

...actually, it's 8 bytes.  So to recalculate:

== Currently ==
1 to 2: slop(hdr + [4 to 8])   = 16 bytes
3 to 6: slop(hdr + [12 to 24]) = 32 bytes

== Storing up to 3 elements instead of header ==
1:      slop(4)                =  4 bytes
2:      slop(8)                =  8 bytes
3:      slop(12)               = 16 bytes
4 and above are same as current.
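For reference, the slop() function used in these figures can be approximated like so (inferred from the numbers in this bug and valid for small sizes only; the real size classes live in jemalloc.c):

```cpp
#include <cassert>
#include <cstddef>

// Approximation of jemalloc's small-request rounding ("slop") as used
// in the calculations above: tiny requests round up to a power of two,
// and requests above 8 bytes round up to a multiple of the 16-byte
// quantum. (Inferred from the figures in this bug; an approximation
// for small sizes only -- see jemalloc.c for the real size classes.)
size_t Slop(size_t aSize) {
  if (aSize <= 2) return 2;
  if (aSize <= 4) return 4;
  if (aSize <= 8) return 8;
  return (aSize + 15) & ~size_t(15);  // round up to multiple of 16
}
```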
Comment 16 Jonas Sicking (:sicking) 2011-08-10 17:07:54 PDT
Note that for 0-element arrays we shouldn't be allocating a header at all! We should just be pointing at nsTArrayHeader::sEmptyHdr. Is that not the case?

nsSmallVoidArray is also neat since it lets us store a single pointer without allocating a header. It's probably doable to do the same tricks on nsTArray with enough template tricks.

However nsSmallVoidArray comes at a small performance overhead, so we might want to make it opt-in (for example by using a separate class like nsSmallTArray). Though it would be interesting to measure the performance cost of enabling it by default on all nsTArrays which store pointers.
Comment 17 Justin Lebar (not reading bugmail) 2011-08-10 17:14:18 PDT
> Note that for 0-element arrays we shouldn't be allocating a header at all! We 
> should just be pointing at nsTArrayHeader::sEmptyHdr. Is that not the case?

We don't allocate a header for 0-element arrays, but rather point to sEmptyHeader.
Comment 18 Justin Lebar (not reading bugmail) 2011-08-10 17:24:49 PDT
> nsSmallVoidArray is also neat since it lets us store a single pointer without 
> allocating a header. It's probably doable to do the same tricks on nsTArray 
> with enough template tricks.

Yes, I think we might want to try this.  But we should try to estimate whether it'd be a win, either in terms of memory or performance.  Most TArrays appear to die very quickly, but I don't have any numbers on that.

I'll instrument to see how much memory we're using for TArrays of various sizes.
Comment 19 Justin Lebar (not reading bugmail) 2011-08-10 21:15:41 PDT
In a release build, after loading nytimes.com, slashdot.org, and gmail.com, we have in existence about 30,000 heap nsTArrays.  Assuming all the elements are 8 bytes long, that's 16 bytes on the heap (8 bytes for the header, 8 for the element, and no jemalloc slop), or about .5mb.

In total, I have 4mb in heap-based TArrays under the same workload, representing 2% of heap-allocated and 1.6% of RSS.  (The 4mb is counted by malloc_usable_size, so it includes clownshoes, and overestimates slightly because I added an extra field to the header.)

It doesn't seem that optimizing the singletons or other TArrays would help memory usage much.  It might improve performance, because we'd be calling malloc() less-frequently.  I have to imagine that malloc'ing 16 bytes is very fast, though.

The one thing I think we should do is use malloc_usable_size to figure out how big our array actually is, or otherwise make sure that the array we allocate always fits nicely into a malloc bucket.  It should be easy enough.  But I'm not going to expect a big improvement in speed or memory usage.
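The malloc_usable_size idea could look roughly like this (malloc_usable_size is a real allocator API on glibc/jemalloc; the surrounding code, header layout, and names are a hypothetical sketch, not the eventual nsTArray change):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size (glibc/jemalloc)

// Sketch of "make the array fill its malloc bucket": after allocating
// the header plus elements, ask the allocator how many bytes we really
// got and record that as the capacity, so the slop can absorb later
// appends for free. (Header layout and names are illustrative.)
struct Header {
  size_t mLength;
  size_t mCapacity;  // in elements, like nsTArray's mCapacity
};

Header* AllocateWithFullBucket(size_t aElemCount, size_t aElemSize) {
  size_t requested = sizeof(Header) + aElemCount * aElemSize;
  Header* hdr = static_cast<Header*>(malloc(requested));
  if (!hdr) {
    return nullptr;
  }
  // malloc may have rounded the request up to a size class; claim it all.
  size_t usable = malloc_usable_size(hdr);
  hdr->mLength = 0;
  hdr->mCapacity = (usable - sizeof(Header)) / aElemSize;
  return hdr;
}
```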
Comment 20 Justin Lebar (not reading bugmail) 2011-08-10 21:16:24 PDT
Created attachment 552304 [details] [diff] [review]
Instrumentation
Comment 21 Randell Jesup [:jesup] 2011-08-10 21:52:31 PDT
I'm not shocked that the small-TArray issue has minimal impact on mem use directly.  The current change likely has a bigger impact on perf - even in a binning allocator like jemalloc there are still semaphores to acquire, bookkeeping, and occasional large perf hits (page alloc, etc).

SmallTArrays might save a little more.  

The usable-size thing is a fairly obvious win and costs little.  Also, anywhere it can be pushed to compile-time instead of run-time it would help minimize the downside.
Comment 22 Justin Lebar (not reading bugmail) 2011-08-10 23:16:56 PDT
I was under the impression that jemalloc let you malloc without acquiring a lock (by using separate arenas for different threads), but that doesn't appear to be the case, at least not for the version of jemalloc we currently have in the tree.

So to be done:

 * Estimate how much getting rid of the malloc call for a single-element array would improve performance.  My guess is: not much, because I don't recall malloc plus free ever being more than 5% in a profile, and this can't be more than 10% of all malloc/free.

 * Estimate how much bigger TArrays are than they need to be, because we resize too aggressively at the high end.  (This was actually the original intent of this bug.)  If we could shave off a few hundred KB from these allocations
Comment 23 Justin Lebar (not reading bugmail) 2011-08-10 23:17:45 PDT
...then that, plus the savings on the 1-element allocations, might be small but worthwhile in terms of memory used.
Comment 24 Justin Lebar (not reading bugmail) 2011-08-23 13:02:05 PDT
I just did a test to determine how the number of calls to malloc() compares with the number of 1-element nsTArrays created.

I opened nytimes.com and gmail.com.  This resulted in 1,681,000 malloc() calls, and 312,000 calls to nsTArray::IncrementLength where the resulting length was 1.  If I did this right, this suggests that about 18% of calls to malloc are due to single-element TArrays!
Comment 25 Randell Jesup [:jesup] 2011-08-24 06:33:32 PDT
Created attachment 555385 [details]
partial log of callers to grow tarrays from empty to non-empty in EnsureCapacity

Cut it off around 0.8% to keep size down

Stack trace point is in EnsureCapacity, right after "if (mHdr == EmptyHeader()) {"

Easy to extend to other trace points
Comment 26 Randell Jesup [:jesup] 2011-08-24 06:35:41 PDT
Created attachment 555387 [details] [diff] [review]
Patch that generated the trace

Note: the trace I uploaded was 30 or 40 tabs of bugzilla, mdc, intranet, other random things -- loading and then quitting.
Comment 27 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2011-08-24 06:41:23 PDT
Layout/style stuff ranks pretty high here.
Comment 28 Justin Lebar (not reading bugmail) 2011-08-24 06:54:45 PDT
This is basically a log of who uses TArrays, right?  Most users just insert one element at a time.

In any case, I think the right thing to do is to modify TArray so it's effectively TAutoArray<1> (although actually using TAutoArray<1> would waste more space than I'd like) and then, if it's necessary, find callsites which usually have *larger* array lengths and annotate them.

Also, I realized last night that my instrumentation was counting both TArrays of size 1 and TAutoArrays of size 1.  I'll redo it, but I don't expect the numbers to change a lot.
Comment 29 Randell Jesup [:jesup] 2011-08-24 07:59:45 PDT
As mentioned in IRC, this is a list of who uses TArrays *and* puts something in them, with data on how many times they do that.  Not included is info on how many are created of each type but never used.  That could be generated at the same time by adding a StackTrace() call at creation...
Comment 30 Randell Jesup [:jesup] 2011-08-24 08:48:21 PDT
Created attachment 555421 [details]
profile including calls to nsTArray()

This has a second probe at nsTArray() in nsTArray.h to catch nsTArray creations.  Note that this means you have to read the profile a little more carefully.
Comment 31 Randell Jesup [:jesup] 2011-08-24 08:52:33 PDT
Created attachment 555424 [details] [diff] [review]
Updated patch with trace point in nsTArray()

Please note that these jprof patches are by no means ready for primetime; I will probably build a more configurable/usable stacktrace-enabled jprof out of this.
Comment 32 Justin Lebar (not reading bugmail) 2011-08-24 10:59:06 PDT
(In reply to Justin Lebar [:jlebar] from comment #24)
> I just did a test to determine how the number of calls to malloc() compares
> with the number of 1-element nsTArrays created.

Per comment 28, my previous test was counting nsTAutoArrays in addition to nsTArrays.  I fixed the instrumentation so it now specifically looks at when we call malloc() in EnsureCapacity with aCapacity==1.

The new number is that 9% of malloc()s are due to capacity-1 TArrays.  Still totally worth fixing.
Comment 33 Boris Zbarsky [:bz] 2011-08-24 12:01:09 PDT
There's actually a reason that the nsTArray<RuleValue> is not an auto array: it's a member of a hashtable entry, and those can get moved in memory.  And nsAutoTArray is not safe to move.

I could probably work around it with a MoveEntry hook if I had to, I suppose....  Would that work?

Similar for the TArray<nsCSSSelector*>.

It might be interesting to look at the actual distribution of lengths for those arrays.  Certainly they're never 0-length; if we're adding the hashtable entry at all, we're going to put stuff in the array.  So some sort of auto-array there makes sense.
Comment 34 Randell Jesup [:jesup] 2011-08-24 12:48:24 PDT
Created attachment 555498 [details]
Results with a small profile loading ~5 sites

started existing but almost-empty profile.  Loaded nytimes.com, amazon.com, slashdot.org, mozilla.org, quit.

Didn't seem to "hang up" generating tons of tarray hits in ::Stop the way the larger profile does.  Still has ~1000 tarrays created due to ::Stop (and 9000 more due to other CSS stuff causing a GetRuleCascade()).

This may be more representative; looks like there's a bug tickled in the larger profile by the overhead of saving all the stacks, causing ::Stop to get called a lot and force a rulecascade.
Comment 35 Boris Zbarsky [:bz] 2011-08-24 13:56:13 PDT
I filed bug 681755 on the rulehash arrays.
Comment 36 Randell Jesup [:jesup] 2011-08-25 01:00:13 PDT
In a test loading 10 tabs, I had 586 nsDocLoader::OnStopRequests, but it showed up 12,000 times in the profile, which means that on average ~23 nsTArrays/nsAutoTArrays were created or had their initial storage allocated for each OnStopRequest: 9K GetUserFontSetInternal, 3K doStopDocumentLoad (eventually going to FlushPendingNotifications), 2700 calls to self, 800 FlushPendingNotifications.

RefreshRuleCascade (9K) leads to a bunch: ~1 directly, bunch via AppendRule and AddSelector, mostly ending up in RuleHash (this is bz's bug 681755)

The test as currently run is a little confusing if you're not careful, since it's tracing multiple related events: it's counting all nsTArray_base creations (nsTArray, nsAutoTArray, etc) and it's counting when an nsTArray goes from 0->1 (and allocates).  This is stated above and you can see it if you trace the trees in the jprof output, but is easy to forget.  It's also easy to forget it's a stack analyzer, so except for the actual trigger it's not recording the number of events but number of times a routine was in the stack when we captured an event.
Comment 37 Boris Zbarsky [:bz] 2011-08-25 08:10:31 PDT
> 9K GetUserFontSetInternal

Is that 9K _in_ or _under_ this function?  Yesterday I thought you were talking about _in_...  In which case this number still makes absolutely no sense.  If you put a printf at that spot in the code, does it get hit that many times?

If you mean _under_, on the other hand, then that includes RefreshRuleCascade and the number makes sense.  Sounds like this is what's going on.
Comment 38 Randell Jesup [:jesup] 2011-08-25 09:10:13 PDT
9k in-and-under.  Those are largely if not completely the RuleHash/etc stuff: 

> RefreshRuleCascade (9K) leads to a bunch: ~1 directly, bunch via AppendRule and AddSelector, mostly ending up in RuleHash (this is bz's bug 681755)

In some cases where the TArray rarely gets added to, it may be cheaper to use a TArray pointer lazily than to instantiate a tarray always.  It does require some care to make sure all uses of the pointer instantiate it if needed, and trades some code size for speed.  This behavior could be made even safer with judicious use of macros.  (A class would likely eat your win since TArray isn't expensive to init or in base size.)

if (!mTArrayPtr)
   mTArrayPtr = new nsTArray<blah>; // infallible malloc assumed
mTArrayPtr->AddWhatever();
Comment 39 Justin Lebar (not reading bugmail) 2011-08-28 13:50:15 PDT
I've filed bug 682735 on improving nsAutoTArray and making nsTArray default to nsAutoTArray<1>.  I don't think we need any more investigation at this time, since the results of the investigation will be invalidated by large changes to how nsTArray works.  Marking this fixed, since the investigation is done.
Comment 40 Nicholas Nethercote [:njn] 2011-08-28 17:53:02 PDT
I think you forgot to mark it as fixed :)
Comment 41 Benoit Jacob [:bjacob] (mostly away) 2011-11-18 15:38:16 PST
(In reply to Justin Lebar [:jlebar] from comment #19)
> In a release build, after loading nytimes.com, slashdot.org, and gmail.com,
> we have in existence about 30,000 heap nsTArrays.  Assuming all the elements
> are 8 bytes long, that's 16 bytes on the heap (8 bytes for the header, 8 for
> the element, and no jemalloc slop), or about .5mb.

That's interesting data as it shows that sizeof(mHdr) matters.

I would like to know if you also have data about the absolute number (not percentage) of empty nsTArray's?

What I'm really wondering is how much memory is being saved by doing the mHdr boxing to minimize the size of empty arrays.
Comment 42 Justin Lebar (not reading bugmail) 2011-11-18 15:44:15 PST
> What I'm really wondering is how much memory is being saved by doing the mHdr boxing to minimize the 
> size of empty arrays.

You can't necessarily measure that just by counting the empty TArrays in existence at one point in time.  Performance issues with calling malloc() more often aside, all these extra short-lived allocations could fragment the heap.

Anyway, what's the potential benefit of getting rid of sEmptyHdr?  It'd simplify the TArray code, but only a bit...
Comment 43 Benoit Jacob [:bjacob] (mostly away) 2011-11-18 16:54:20 PST
(In reply to Justin Lebar [:jlebar] from comment #42)
> You can't necessarily measure that just by counting the empty TArrays in
> existence at one point in time.  Performance issues with calling malloc()
> more often aside, all these extra short-lived allocations could fragment the
> heap.

mHdr is not necessary to avoid dynamic memory allocation on empty arrays. An array class directly storing a mElements pointer could simply set it to NULL to indicate that the array is empty. Just like currently nsTArray sets mHdr to &sEmptyHdr whenever the array is empty.

> 
> Anyway, what's the potential benefit of getting rid of sEmptyHdr?  It'd
> simplify the TArray code, but only a bit...

Aside from simplifying nsTArray, getting rid of mHdr would:
 - make it possible to access mLength, mCapacity and mIsAutoArray without having to first dereference mHdr
 - make all nsTArray coefficient accessors (ElementAt...) faster as they wouldn't have to do that addition anymore
Comment 44 Justin Lebar (not reading bugmail) 2011-11-18 18:46:14 PST
> mHdr is not necessary to avoid dynamic memory allocation on empty arrays.

Oh, okay.  I misunderstood what you were suggesting.

There exists code in our tree which assumes sizeof(nsTArray) == sizeof(void*).  You'd have to change all of this before you could store mLength and mCapacity inside the nsTArray.
Comment 45 Benoit Jacob [:bjacob] (mostly away) 2011-11-20 10:13:58 PST
(In reply to Justin Lebar [:jlebar] from comment #44)
> > mHdr is not necessary to avoid dynamic memory allocation on empty arrays.
> 
> Oh, okay.  I misunderstood what you were suggesting.
> 
> There exists code in our tree which assumes sizeof(nsTArray) ==
> sizeof(void*).  You'd have to change all of this before you could store
> mLength and mCapacity inside the nsTArray.

Interesting, that explains more of the trouble I've had trying to do some instrumentation.

But if it turned out that we don't have many empty nsTArrays, it would still be worth changing that.
Comment 46 Justin Lebar (not reading bugmail) 2011-11-20 10:27:41 PST
> Aside from simplifying nsTArray, getting rid of mHdr would:
>  - make it possible to access mLength, mCapacity and mIsAutoArray without
> having to first dereference mHdr
>  - make all nsTArray coefficient accessors (ElementAt...) faster as they
> wouldn't have to do that addition anymore

I have to imagine that the elements of arrays are accessed more often than the length or capacity.  So the first point doesn't seem like a large potential win.

I think x86 has an addressing mode which makes the second point basically free -- you can do |mov eax, [ebx + ecx*4 + 8]|, right?  But maybe it's not actually free, and maybe it makes a difference on ARM.  This would be relatively easy to test, since you could just change the mHdr pointer to mElems, pointing right after the header.
Comment 47 Justin Lebar (not reading bugmail) 2011-11-20 11:24:59 PST
Making the length a member variable would probably mean that the compiler can optimize this

  nsTArray<PRUint32> arr;
  for (PRUint32 i = 0; i < arr.Length(); i++) {
     ...
  }

just as well as this

  nsTArray<PRUint32> arr;
  PRUint32 len = arr.Length();
  for (PRUint32 i = 0; i < len; i++) {
    ...
  }

since in the first case, the array's length is a member variable, so effectively a stack variable.  Again, I'm not sure this matters...
Comment 48 Benoit Jacob [:bjacob] (mostly away) 2011-11-20 12:02:12 PST
Created attachment 575762 [details] [diff] [review]
count the maximum number of empty nsTArrays

Here's a patch counting the maximum (peak) number of empty nsTArrays that we reach during Firefox's lifetime.

Some results:

empty session:                                9181
nytimes.com:                                 16056
arstechnica.com:                             19162
nytimes.com + arstechnica.com:               25120
youtube.com:                                 19403
nytimes.com + arstechnica.com + youtube.com: 35350

So the conclusion is that the peak number of empty nsTArrays is, in practice, very roughly given by this formula:

    9000 * (1 + max_number_of_nondefault_tabs)

Of course this can vary depending on what the page does. This is just out of a sample of 3 mainstream pages.

Anyway, 9000 empty nsTArrays per tab.

The mHdr boxing saves us 8 bytes per empty nsTArray, so the conclusion is that mHdr boxing saves ~ 70k per tab. (again, just an order of magnitude).
Comment 49 Justin Lebar (not reading bugmail) 2011-11-20 18:25:48 PST
If you want to take this further, can we do it in a separate bug, please?
Comment 50 Benoit Jacob [:bjacob] (mostly away) 2011-11-21 09:10:08 PST
Comment on attachment 575762 [details] [diff] [review]
count the maximum number of empty nsTArrays

Filed bug 704174.
