Closed Bug 682735 Opened 13 years ago Closed 12 years ago

Make nsAutoTArray memmovable and reduce its memory usage

Categories

(Core :: XPCOM, defect)

Priority: Not set
Severity: normal

Tracking


RESOLVED WONTFIX

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Keywords: perf, Whiteboard: [MemShrink:P2])

Attachments

(5 files, 2 obsolete files)

+++ This bug was created as a clone of bug 677571 +++

In bug 677571, we discovered that malloc calls for single-element nsTArrays account for ~10% of all mallocs in the browser.

I have a plan to fix this, by effectively making nsTArray default to nsAutoTArray<1>.

My goals are:

 * Make nsAutoTArray memmove-able.  It's not movable right now because it contains a pointer to self.  This precludes hashtable<nsAutoTArray> and nsTArray<nsAutoTArray>.

 * Make nsAutoTArray<n> use as little space as possible.  Currently nsAutoTArray stores a pointer to its header, then a 64-bit aligned header (size 64 bits, for size and capacity), and then its data.

 * Keep nsTArray fast.  I'm willing to accept a few extra branches here and there, but not virtual methods or that sort of thing.
My plan is to allow us to write different storage engines to back the nsTArray interface.

We'd still have the malloc'ed header-based storage engine.  Auto TArrays would be a separate engine which uses the header-based engine when it runs out of space.

There'd be at least two kinds of Auto TArrays.  If you're storing pointers (or pointer-sized structures which always have the bottom two bits off, such as nsCOMPtrs), then

  sizeof(nsAutoTArray<E*, n>) == sizeof(E*) * n.

That is, there's no overhead to the auto array.

How would this work?  We store the capacity in the template parameter, and we store the size by using the bottom bit of the array's elements.  If the bottom bit is 1, then the element is empty, and if it's 0, then it contains a pointer.  If we need to expand the auto array into malloc'ed memory, then we store a pointer to the malloc'ed header in the auto array's first element, and set the bottom two bits to 1.
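As a rough illustration of the tagging scheme described above, the slot encoding might look like the following sketch (all names here are invented for illustration, not actual nsTArray code; it assumes stored pointers are at least 2-byte aligned, so a real pointer always has its bottom bit clear):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical slot encoding for the scheme above:
//   bottom bit 0            -> slot holds a real (aligned) pointer
//   bottom bits 01          -> slot is empty
//   bottom bits 11 (slot 0) -> array has spilled to a malloc'ed header;
//                              slot 0 holds the tagged header pointer
constexpr uintptr_t kEmpty = 1;

inline bool IsEmptySlot(uintptr_t slot) {
  return (slot & 1) != 0 && (slot & 3) != 3;
}

inline bool IsHeapMode(uintptr_t slot0) {
  return (slot0 & 3) == 3;
}

// With no separate length member, length is recovered by counting
// non-empty slots from the front (only valid while in inline mode).
inline size_t InlineLength(const uintptr_t* slots, size_t capacity) {
  size_t n = 0;
  while (n < capacity && !IsEmptySlot(slots[n])) {
    ++n;
  }
  return n;
}
```

This is just a sketch of the invariants; the real implementation would also have to strip the tag when handing out the heap header pointer.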

If the auto array isn't storing pointer-like things, then we have to remember its length in a separate member.


In order for this to work without requiring any virtual methods, the compiler needs to be able to statically determine exactly what kind of TArray it's dealing with at any given point in time.  Right now you can pass an AutoTArray to a function which expects a TArray.  But that wouldn't work under this scheme, because TArray and TAutoArray run different code.

I spent a while looking at these cases, and there aren't a lot.  In most places where TArrays are passed around, the function only takes one kind of TArray.  In that case, we can just typedef the concrete type and be done.

In places where this is not possible or desirable, I think that with enough template magic, we may be able to create a thin wrapper around the TArrays.  You'd wrap the TArray before you passed it to the method, (and maybe the wrapping would be implicit).

There'd be a performance penalty for this level of indirection, of course, but it should be acceptable for most cases.
> My plan is to allow us to write different storage engines to back the nsTArray interface.

How do I intend to do this without using virtual functions?

The plan is to write a "mixin" class (that's what it'd be called in Python, anyway; I don't know if C++ has another name for this), say nsTArrayMixin.  To explain by way of example, suppose we have three storage engines:

  nsHeaderTArrayBase<E>,
  nsAutoTArrayBase<E>, and
  nsAutoTArrayBase<E*>.

For each of these, I'd declare a corresponding class:

  nsHeaderTArray<E> : public nsHeaderTArrayBase<E>,
                      public nsTArrayMixin<E, nsHeaderTArrayBase<E> >

  nsAutoTArray<E> : public nsAutoTArrayBase<E>,
                    public nsTArrayMixin<E, nsAutoTArrayBase<E> >

The concrete classes don't themselves contain any code except constructors.  The [*]Base classes define a few methods (Length, Capacity, Elements, EnsureCapacity, etc.), but most of the current TArray code lives in nsTArrayMixin.  The mixin class defines all its methods in terms of the methods on the TArrayBase class.  For instance, it might define:

  E& nsTArrayMixin<E, Base>::ElementAt(index_type i) {
    return static_cast<Base*>(this)->Elements()[i];
  }

(This is similar to what Ehsan did to get SafeElementAt(i) working for nsTArray<E*>.)

I intend to avoid ugliness like the static casts above in the mixin class, instead favoring some ugliness in the storage engines.
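The mixin idea above can be sketched as standard CRTP.  (The description above splits the storage engine and the concrete class into separate bases; for brevity this sketch folds them together by making the storage engine itself the mixin's template parameter.  All names are illustrative, not the real nsTArray code.)

```cpp
#include <cassert>
#include <cstddef>

// The mixin implements shared array logic purely in terms of the
// methods that the storage engine provides (Elements(), Length()).
template <typename E, typename Base>
class TArrayMixin {
 public:
  E& ElementAt(size_t i) {
    // Downcast to the concrete storage engine; this is safe because
    // Base is always the class that inherits this mixin.
    return static_cast<Base*>(this)->Elements()[i];
  }
  E& LastElement() {
    Base* self = static_cast<Base*>(this);
    return self->Elements()[self->Length() - 1];
  }
};

// A toy fixed-capacity storage engine playing the role of an
// nsAutoTArrayBase.  It defines only the primitive operations;
// everything else comes from the mixin.
template <typename E, size_t N>
class AutoStorage : public TArrayMixin<E, AutoStorage<E, N>> {
 public:
  E* Elements() { return mBuf; }
  size_t Length() const { return mLen; }
  void AppendElement(const E& e) { mBuf[mLen++] = e; }

 private:
  E mBuf[N];
  size_t mLen = 0;
};
```

Since the cast is resolved at compile time, calls through the mixin inline down to direct member access, with no virtual dispatch.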

I'm working through the implementation, but it'll probably take me a week or so to get it done.  Stay tuned...
js/src/jsvector.h might be interesting reading.
> In order for this to work without requiring any virtual methods, the compiler needs to 
> be able to statically determine exactly what kind of TArray it's dealing with at any 
> given point in time.  Right now you can pass a AutoTArray to a function which expects a 
> TArray.  But that wouldn't work under this scheme, because TArray and TAutoArray run 
> different code.

I'm less positive about this structure, now that I've implemented most of it.  The downsides are worse than I'd thought -- there's more code to change than I'd previously seen -- and the upsides are smaller than I'd thought -- we can get away with packing length + capacity as 16 bits each, negating much of the benefit of storing the capacity of nsAutoTArray<N> in the template parameter.

The main downside of moving away from storing the auto array's capacity in the template parameter is that it's harder (impossible?) to achieve

  sizeof(nsAutoTArray<E*, N>) == N * sizeof(E*).

Because the elements of the TArray can be accessed directly through Elements(), pointers have to be stored in the array with their bottom two bits off.  But that means there's no way to mark the end of the array!

I think we can specialize nsAutoTArray<E*, 1> so that it doesn't use any more space than a single pointer.  That's still pretty good...
It might be easiest to go with the simple solution of, for nsTArray<T*>, storing the value in the mHeader pointer when there is just one element in the array.
(In reply to Jonas Sicking (:sicking) from comment #5)
> It might be easiest to go with the simple solution of for nsTArray<T*>,
> store the value in the mHeader pointer if there is just one element in the
> array.

...and just leave nsAutoTArray<T*, N> for N >= 2 the same as nsAutoTArray<T, N>.  I'm going to try it and see!
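A minimal sketch of that single-pointer idea, with invented names and an assumed tag encoding (the real patch may well differ): the header word either carries an "empty" tag, a tagged inline element, or an untagged pointer to a heap header.  It assumes alignof(T) > 1 so the bottom bit of a real pointer is always 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the mHdr word itself holds the single element.
//   mHdr == kEmptyTag (bits 11)  -> array is empty
//   bottom bit 1, != kEmptyTag   -> inline element, stored as (ptr | 1)
//   bottom bit 0                 -> mHdr points to a malloc'ed header
template <typename T>
class OnePtrArray {
 public:
  OnePtrArray() : mHdr(kEmptyTag) {}

  bool IsEmpty() const { return mHdr == kEmptyTag; }
  bool IsInline() const { return (mHdr & 1) != 0 && mHdr != kEmptyTag; }

  void SetSingle(T* p) {
    // Requires alignof(T) > 1, so the bottom bit of p is free for the tag.
    mHdr = reinterpret_cast<uintptr_t>(p) | 1;
  }
  T* GetSingle() const {
    return reinterpret_cast<T*>(mHdr & ~uintptr_t(1));
  }

 private:
  static constexpr uintptr_t kEmptyTag = 3;
  uintptr_t mHdr;
};
```

With this layout, sizeof(OnePtrArray<T>) == sizeof(void*), and growth past one element would fall back to the ordinary malloc'ed-header path.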
Marking for MemShrink triage.  It's hard to say how much this will reduce fragmentation until we do it, but I'm hopeful it'll be noticeable.
Keywords: perf
Whiteboard: [MemShrink]
And it's likely to be a worthwhile perf win anyway (fewer trips through malloc & free).
Assignee: nobody → justin.lebar+bug
Whiteboard: [MemShrink] → [MemShrink:P2]
A big problem with this right off the bat is that a lot of code uses nsTArray assuming it's the same size as void*, stuffing nsTArrays away in various |void* data| fields and so forth.  What's the plan for that?
> What's the plan for that?

Fix them, I guess.  Or if that's too hard, make the relevant code only accept nsTArray<E, 0>'s, which should still be the same size as void*.

To follow up re our conversation on IRC, the motivation behind creating nsATArray and friends follows from:

 - we want the new nsTArray to default to, effectively, the old nsAutoTArray<1>, and
 - we want to allow the new nsTArray to be the old nsAutoTArray<0>, where desirable.

The only way I can think of to get this to work is to introduce nsATArray, which has no internal storage.  If we allowed methods to continue to accept nsTArray& as a parameter, then all arrays passed to those methods would need to inherit from nsTArray and would therefore be at least as large as nsTArray.  So these arrays couldn't have zero elements of internal storage.
Are the 1-element arrays you're trying to fix clustered or pretty widely dispersed?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #11)
> Are the 1-element arrays you're trying to fix clustered or pretty widely
> dispersed?

That's a tricky thing to measure.  Jesup did some profiling in bug 677571 (attachment 555385 [details]) which indicates that the CSS engine is largely to blame.  But his profile doesn't distinguish between an array that starts out as empty and grows to hold two elements, and an array which grows to hold only one element.  The distribution of arrays which grow to hold only one element should be more evenly-distributed than the profile he posted.

To the implied point, I was hoping to avoid creating Yet Another Specialized Array Class in this bug.  If the current state of things is any indication, people don't and won't use The Right Class in every circumstance.  I'd rather bite some pain now and create a fast(er) default.
For what it's worth, the CSS situation is in fact well-understood, and as long as we have the right tools (i.e. array classes) it's easy to apply them there.
(In reply to Justin Lebar [:jlebar] from comment #12)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #11)
> > Are the 1-element arrays you're trying to fix clustered or pretty widely
> > dispersed?
> 
> That's a tricky thing to measure.  Jesup did some profiling in bug 677571
> (attachment 555385 [details]) which indicates that the CSS engine is largely
> to blame.  But his profile doesn't distinguish between an array that starts
> out as empty and grows to hold two elements, and an array which grows to
> hold only one element.  The distribution of arrays which grow to hold only
> one element should be more evenly-distributed than the profile he posted.

Looking at the data there, I'd expect the top (and also #3) hit (the rule hash, converted from linked lists to arrays in bug 576136, yielding a speed win) to have a decent number of 1-element arrays, but I wouldn't expect 1-element arrays to be a majority.  I haven't checked, though.  That hash is implementing https://developer.mozilla.org/En/Writing_Efficient_CSS#How_the_style_system_breaks_up_rules

Hit #5 in that list, nsStyleDisplay::nsStyleDisplay seems suspicious to me, though, since those arrays are *already* nsAutoTArray<..., 1>, and should have 1 element almost all the time (i.e., all the time except when transitions/animations are used *and* there's more than one transition/animation applied to a given element)
(In reply to Justin Lebar [:jlebar] from comment #12)
> To the implied point, I was hoping to avoid creating Yet Another Specialized
> Array Class in this bug.  If the current state of things is any indication,
> people don't and won't use The Right Class in every circumstance.  I'd
> rather bite some pain now and create a fast(er) default.

If we can fix 90%+ of the unnecessary heap allocations by using nsAutoTArray<1> (or a typedef thereto, say CSS[blah]Array) in CSS code, that seems preferable (at least initially) to rewriting much of our array code.

The proposed change would make our array stuff diverge further from the STL and further from js::Vector (which has similar goals), and would be less friendly to developers, so I think it deserves detailed discussion.  There are also potential compat issues for other Gecko projects and binary extensions (though the latter aren't a concern).  I wouldn't want to block a big perf win on that.
I really like the way you can currently pass an nsAutoTArray to an nsTArray&; I don't want to lose that.

It seems to me we could specialize nsTArray<T*> for sizeof(T) > 1 to store a single-element array in the header pointer without allocation. Ditto for other pointer-ish instances like nsTArray<nsRefPtr<T>> and nsTArray<nsCOMPtr<T>>.

If we do that, and manually nsAutoTArray-ify in the top few remaining hot places, say, do we need to do anything else?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> It seems to me we could specialize nsTArray<T*> for sizeof(T) > 1 to store a
> single-element array in the header pointer without allocation. Ditto for
> other pointer-ish instances like nsTArray<nsRefPtr<T>> and
> nsTArray<nsCOMPtr<T>>.

I mean alignof(T) > 1.
David, for the rulehash stuff bug 681755 has lots of data on distribution, etc.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> It seems to me we could specialize nsTArray<T*> for sizeof(T) > 1 to store a
> single-element array in the header pointer without allocation. Ditto for
> other pointer-ish instances like nsTArray<nsRefPtr<T>> and
> nsTArray<nsCOMPtr<T>>.

And of course we could do this for any nsTArray<T> where sizeof(T) < sizeof(void*), too, sometimes storing even more than one element in the header pointer area, if there is significant usage of 1-element arrays with such small sizes.
We can't use the existing nsAutoTArray in the CSS code, because nsAutoTArray isn't memmove()'able, and at least some of the arrays are stored in hashtables.  But we certainly could write something new.

> The proposed change would make our array stuff diverge further from the STL, further from 
> js::Vector which has similar goals

I don't understand what you mean.  Is your concern with API divergence, or implementation divergence?  The TArray API already bears almost no resemblance to stl::vector or js::Vector.  The only API change I'm proposing is that you have to use nsATArray<E>&/* instead of nsTArray<E>&/*.

Or are you concerned about implementation divergence?  I'm not sure why this matters.

(More generally, the sentiment that we'd like to avoid divergence so we can later modify our code to use the One Array To Rule Them All seems at odds with the sentiment that we shouldn't modify our code now to use the Shiny New TArray.)

> , and would be less friendly to developers

The API resembles what we already do for strings, which I don't find particularly unfriendly.  But maybe you do?  There's of course a cost to developers in switching all their in-progress patches and whatnot.  But it's a one-character change that'll be detected by the compiler.

> it deserves detailed discussion.

Definitely.  But -- and I'm not suggesting this has been your position -- I don't want the starting point to be "this is a library that lots of people use, so we can't change it."


An option which would obviate the need for nsATArray would be:

  - Leave nsTArray as is -- if it has more than zero elements, it needs to malloc().
  - Make nsAutoTArray memmove()'able (and maybe also make nsAutoTArray<E*, 1> stuff the one pointer into the header pointer, so it uses no extra space).
  - Change the most-frequently used TArrays to nsAutoTArray<1> (or a typedef).

I'm not particularly in favor of this; it seems to me that this isn't what we'd choose if we weren't worried about externalities stemming from the fact that TArray is widely used.

(To be clear, I'm not even sure that globally setting TArrays to AutoArray<1> is a good idea.  But it's an easy thing to experiment with under the nsATArray approach.)

I also can't guarantee that this won't require changes to existing code.  The situation with InfallibleTArray<E> versus nsTArray<E, nsTArrayInfallibleAllocator> versus nsTArray<E, nsTArrayDefaultAllocator> (these all run the same code, but are different types) is pretty unfortunate, and code has to be rejiggered every time I touch anything.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> I really like the way you can currently pass an nsAutoTArray to an
> nsTArray&; I don't want to lose that.

Is it really so much to ask that you type nsATArray&?

There would be no performance difference between nsATArray& and [nsConcreteArrayType]& (just as there's no perf difference now when you pass a nsAutoTArray as a nsTArray&).
Is there any reason to keep the current craziness with regards to allocator type?  At this point I think it's safe to assume that people are relying on nsTArray being infallible, so there's no reason for either InfallibleTArray<E> or nsTArray<E, nsTArrayInfallibleAllocator> to exist.

As for the 'A' types, specifically for nsAString: I hate it.  In part this is leftovers from when nsAString was slower than nsString in that it had to do some checking and then cast to nsString.  I'm not sure if that code is still around.  However, the simple fact that we have so many string classes is confusing to people.

Justin: I still haven't heard a reason why we can't make nsTArray<T*> store the first element in the header pointer. And as roc has pointed out this is doable for more types than pointers.

I'm somewhat worried that making nsTArray bigger than one word is going to have downsides in that we'll use more memory anywhere where we use a nsTArray which costs in memory churn.
(In reply to Justin Lebar [:jlebar] from comment #21)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> > I really like the way you can currently pass an nsAutoTArray to an
> > nsTArray&; I don't want to lose that.
> 
> Is it really so much to ask that you type nsATArray&?
> 
> There would be no performance difference between nsATArray& and
> [nsConcreteArrayType]& (just as there's no perf difference now when you pass
> a nsAutoTArray as a nsTArray&).

It's an extra class to deal with.

> The API resembles what we already do for strings, which I don't find particularly
> unfriendly.  But maybe you do?

"Too many string classes" is a common complaint, with some justification. Of course there are more string classes than the array classes you're proposing here.
I'm also wary of making nsTArray<T> bigger than one word. sizeof(nsTArray) == sizeof(T*) is an ingrained assumption in my thinking, although I'm not sure what code it matters for.

>   - Make nsAutoTArray memmove()'able (and maybe also make nsAutoTArray<E*, 1>
> stuff the one pointer into the header pointer, so it uses no extra space).

This seems like something worth doing if we can do it cheaply, no matter what else we do, because non-memmove()ability hurts us in various places.  I guess we'd use a special header pointer value that means "auto", with conditional tests on that?
In case it's not clear, the principal advantage of nsATArray is that it lets us globally typedef nsTArray to nsAutoTArray<N>.  If we don't have nsATArray, then nsTArray must always store 0 elements (or 1 element if the stored type is a pointer) inline.

The two main arguments against the "profile and change hot paths to nsAutoTArray<1>" approach are:

 - We won't necessarily find all the hot paths.  For instance, there appears to be a hot TArray path on awesomebar use.  But if we'd profiled using TP5, we wouldn't see this path.

 - We won't necessarily catch new code which uses nsTArray when it should be using nsAutoTArray<1>.  Indeed, almost any array which ever has any elements has, at one point, one element, so (experimentation might confirm that) any code which uses TArrays which are not usually empty should use nsAutoTArray<1>.  But then why isn't that the default?
(In reply to Jonas Sicking (:sicking) from comment #22)
> Is there any reason to keep the current crazyness with regards to allocator
> type? At this point I think it's safe to assume that people are relying on
> nsTArray being infallible and so there's no reason for neither
> InfallibleTArray<E> nor nsTArray<E, nsTArrayInfallibleAllocator> to exist.

Well, if you have nsTArrayFallibleAllocator, you kind of need nsTArrayInfallibleAllocator...  In my current patch, I changed the inheritance model so we have:

nsATArray<E, Alloc>
InfallibleATArray<E> : nsATArray<E, nsTArrayInfallibleAllocator>
FallibleATArray<E> : nsATArray<E, nsTArrayFallibleAllocator>

nsTArray<E, N, Alloc> : either nsATArray<E, Alloc>, InfallibleATArray<E>, or FallibleATArray<E>, depending on alloc

InfallibleTArray<E, N> : nsTArray<E, N, nsTArrayInfallibleAllocator>
FallibleTArray<E, N> : nsTArray<E, N, nsTArrayFallibleAllocator>

I got rid of nsTArrayDefaultAllocator.

This lets you cast from Infallible[A]TArray<E>* to ns[A]TArray<E>*, which you can't currently do, since nsTArray<E> is actually nsTArray<E, nsTArrayDefaultAllocator>.

These [In]fallibleTArray classes are supposed to be a convenience.  If they're aesthetically unpleasing as they increase in number, we could get rid of them.

> Justin: I still haven't heard a reason why we can't make nsTArray<T*> store
> the first element in the header pointer. And as roc has pointed out this is
> doable for more types than pointers.

You definitely can; I intend to do this!  But of course it doesn't work for all non-pointer types...

> I'm somewhat worried that making nsTArray bigger than one word is going to
> have downsides in that we'll use more memory anywhere where we use a
> nsTArray which costs in memory churn.

The problem is, it's very difficult for us to experiment now.  If we had nsATArray, this would be a simple question to answer.

I think we may be talking past one another because people are making cosmetic arguments, and I feel like they're minor in comparison to the technical win here.  ATArray gives us flexibility to experiment in a way which we don't currently have.
> We can't use the existing nsAutoTArray in the CSS code

We can, with a bit of work.  We'd just need to use a moveentry callback that's not the default one that just calls memmove.
OK, leaving aside the "cosmetic" issues (which I think is a pejorative term; "ergonomic" might be better)...

Let's assume we make nsAutoTArray safe for memmove(), and eliminate allocation for single-element arrays of T* (alignof(T) > 1) and sizeof(T) < sizeof(void*). I don't think anyone disagrees with those.

The remaining issue is that currently nsTArray is optimal for the zero-element case. You want to make it default to being optimal for the 1-element case at the cost of a space penalty for the zero-element case. I'm not convinced.

(In reply to Justin Lebar [:jlebar] from comment #25)
>  - We won't necessarily find all the hot paths.  For instance, there appears
> to be a hot TArray path on awesomebar use.  But if we'd profiled using TP5,
> we wouldn't see this path.
> 
>  - We won't necessarily catch new code which uses nsTArray when it should be
> using nsAutoTArray<1>.  Indeed, almost any array which ever has any elements
> has, at one point, one element, so (experimentation might confirm that) any
> code which uses TArrays which are not usually empty should use
> nsAutoTArray<1>.  But then why isn't that the default?

It cuts both ways. If we default to nsAutoTArray<1>, then for arrays that are usually empty, we'll be wasting space. So, we'll need to try to find those cases and fix those manually. So, do you think those cases are less common, or easier to find?
> The remaining issue is that currently nsTArray is optimal for the zero-element case. You want to 
> make it default to being optimal for the 1-element case at the cost of a space penalty for the 
> zero-element case. I'm not convinced.

I think this relates to my response to Jonas in comment 22, which may have gotten lost as we all midaired each other a few hours ago:

>> (Jonas wrote)
>> I'm somewhat worried that making nsTArray bigger than one word is going to
>> have downsides in that we'll use more memory anywhere where we use a
>> nsTArray which costs in memory churn.
>
> (Justin wrote)
> The problem is, it's very difficult for us to experiment now.  If we had nsATArray, this would be a 
> simple question to answer.

I'm also not convinced that the 1-element default would be an improvement.  I think it might be, given that 1-element arrays are more common than all >= 2-element arrays (bug 677571 comment 6) and that 10% of malloc's in the whole browser are due to single-element arrays (bug 677571 comment 32).  But it's hard to say, and you probably know better than me!  :)

I'd been saying that it's very hard to do this experiment to determine whether the default should be size 1 without nsATArray, but thinking more about it, I don't think that's true.  We could hack in a member to the end of nsTArray and make nsAutoTArray aware of it; I don't think that would be too hard.

But suppose we did this and found that the new nsTArray (equivalent to the old nsAutoTArray<1>) is the right one for most cases.  Then we'd be unable to convert the cases we want back to nsAutoTArray<0>, because anything we called "nsTArray" would have an extra member in it.

I already have much of FF compiling with nsATArray, so if it's not a nonstarter, my preference would be to go ahead and try to get something which runs, and then experiment with setting the default TArray size to 1.  If it turns out that this is a total flop, or if we really don't like the ergonomics of nsATArray, I'd be happy to switch back to nsTArray.
It doesn't matter what you do in an experimental build.

If you get results that show a 1-element default is a good idea, then I guess something like nsATArray might be necessary. I think I'd hate that though. I might end up arguing that even if a 1-element default would be a good idea, nsATArray is too ugly. But if 1-element default is not a good idea, we don't have to have that argument :-).

Hmm, can we actually make nsAutoTArray safe for memmove() without adding a conditional branch to every ElementAt()? I don't see how :-(.
> Hmm, can we actually make nsAutoTArray safe for memmove() without adding a conditional branch to 
> every ElementAt()? I don't see how :-(.

Stuffing elements into the mHdr pointer is also going to result in more branches.

You can get around this where it matters by calling Elements() once and iterating over that.
(In reply to Justin Lebar [:jlebar] from comment #31)
> You can get around this where it matters by calling Elements() once and
> iterating over that.

Yeah, but that's more code and fragile.

I wonder how to trade off the costs and benefits...
> I wonder how to trade off the costs and benefits...

I guess it's hard to weigh, because there are a lot of "might"s.  The compiler might be able to hoist the branch out of the loop.  The branch predictor might be able to figure it out.  The list might not be long enough for it to matter.

I suspect the extra branches will only matter in the few places where we iterate over long arrays, if they matter anywhere, and there we can just use Elements()...
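The hoisting trick above can be illustrated with std::vector standing in for a TArray whose per-element accessor hides a branch (here, the bounds check in at() plays that role; this is just an analogy, not nsTArray code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-access branch: every at() call re-checks bounds inside the loop.
long SumBranchy(const std::vector<long>& a) {
  long sum = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    sum += a.at(i);  // a branch on every element access
  }
  return sum;
}

// Hoisted: call the accessor ("Elements()") once, then iterate over the
// raw pointer, keeping any mode/validity branch out of the hot loop.
long SumHoisted(const std::vector<long>& a) {
  const long* elems = a.data();
  size_t len = a.size();
  long sum = 0;
  for (size_t i = 0; i < len; ++i) {
    sum += elems[i];  // plain pointer arithmetic in the loop body
  }
  return sum;
}
```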
Attached patch WIP v1 (obsolete) — Splinter Review
This makes nsAutoTArray memmove'able.  It starts up, but crashes as I try to
browse around.  Just wanted to post a WIP in case anyone would like to take an
early look at the ideas here.
Attached patch WIP v2Splinter Review
This patch makes nsAutoTArray memmovable.  It also reduces its memory usage by

 - aligning the auto array to E's natural alignment, rather than to 64 bits, and

 - packing the auto array's length and capacity into mHdr.  (When we're using
   the auto buffer, its length and capacity are stored in mHdr.  When we're not
   using the auto buffer, we store its capacity into the beginning of the auto
   buffer.  This means the auto buffer must always have size at least 16 bytes.)
   
Before the patch, we had

  sizeof(nsAutoTArray<E, N>) == max(sizeof(void*), alignof(PRUint64)) +
                                2 * sizeof(PRUint32) + N * sizeof(E)

                             == 16 + N * sizeof(E),
but with the patch, we have

  sizeof(nsAutoTArray<E, N>) == max(sizeof(void*), alignof(E)) + N * sizeof(E).

In the common case that alignof(E) <= sizeof(void*), we save 8 bytes on 64-bit
systems and 12 bytes on 32-bit systems.  The cost is more branches and
extremely unsafe things in nsTArray.  :)

The browser seems to run fine on my machine with this patch, but of course more
testing is necessary.  I also need to make sure that this patch doesn't
increase code size too much -- nsTArray_base used to depend just on <Alloc>,
but now it depends on <E, Alloc>, since it depends on the auto array's
alignment.  But it really only needs to depend on <alignof(E), Alloc>.

We may still want to specialize nsTArray<E*> so it packs the pointer into mHdr,
but since the overhead of changing nsTArray<E*> to nsAutoTArray<E*, 1> is just
sizeof(void*), this might not be necessary.  We might get a better idea of
whether this is necessary as we go around and change nsTArray<E> to
nsAutoTArray<E, 1>.

So I'm going to morph this bug into "Make nsAutoTArray memmovable and reduce
its memory usage".  Then we can see whether we need a specialized nsTArray<E*>.
Attachment #557723 - Attachment is obsolete: true
Summary: Rewrite nsTArray to make one-element lists not require heap memory → Make nsAutoTArray memmovable and reduce its memory usage
> This means the auto buffer must always have size at least 16 bytes.

er, 16 bits, 2 bytes.
Blocks: 681755
No longer depends on: 681755
I've been replacing a lot of nsTArray's with nsAutoTArray's lately, but my process has been somewhat haphazard, and judging by my results, I don't think I've done a good job tuning the sizes of the auto arrays.

I think it shouldn't be too hard to write one-off instrumentation which tells me how big I should make the most-commonly-allocated TArrays:

* When a TArray is created, dump its address and print a stack trace.
* When a TArray is destroyed, dump its address and its maximum capacity.
* Aggregate these and figure out how big the most-commonly-allocated TArrays become.

We could even be clever and include a timestamp (or instruction count) on the creation and destruction.  This would tell us how long the TArray is alive for -- shorter-lived TArrays can be turned into longer AutoTArrays.

[1] http://blog.mozilla.com/nnethercote/2011/01/11/using-valgrind-to-get-stack-traces/
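The aggregation step of that plan might be sketched like this (a toy stand-in, not the actual instrumentation patch; the "orphaned destructor" handling matches the caveat about memmoved arrays discussed below):

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Record each array's address at construction and its final capacity at
// destruction, then aggregate a histogram of final capacities.
struct TArrayStats {
  std::map<const void*, bool> live;         // arrays currently alive
  std::map<size_t, size_t> finalCapCounts;  // final capacity -> count

  void OnConstruct(const void* addr) { live[addr] = true; }

  void OnDestroy(const void* addr, size_t finalCapacity) {
    // A destructor at an address we never saw constructed means the
    // array was memmoved; such "orphaned" events can't be matched up,
    // so they're left out of the histogram.
    if (live.erase(addr)) {
      ++finalCapCounts[finalCapacity];
    }
  }
};
```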
I instrumented nsTArray's constructor to print |this| and the destructor to print |this| and the array's final capacity.

I modified |shrinkCapacity| to do nothing, so the array's final capacity is its max capacity (modulo methods like |swapElements|), and I changed |ensureCapacity| not to reserve any extra memory, so the max capacity is the max length.  So in the results where it says "final capacity", we can read "max length".

Attached are the results of parsing the output generated from going to slashdot, nytimes, and gmail.

The analysis doesn't handle TArrays which are memmoved well, since it needs to see a constructor and destructor at the same address in order to attach a final capacity to an object.  The sections on "orphaned constructors/destructors" and "live objects" tell us about the tarrays which are memmoved.

If you scroll down, you'll see "100 functions which allocate the most TArrays with heap storage", which is where the meat is.
That data is awesome.

So in order...

1)  Arrays in a hashtable in the framelayerbuilder.  Mostly of size 1, some size 2.  These are presumably not auto arrays because of the memmove restriction, since they live in hashtable entries.  I _think_ these are short-lived, but roc can confirm.

2)  Property table arrays.  These actually use a union to store either an nsTArray of capacity 4 (passed to ctor) or a single entry.  The entries are sizeof(void*)*2 in size.  These are longish-lived.  I wonder how many cases we had of one entry, which would not even trigger an array allocation in this case...

3)  Almost certainly short-lived arrays in Places code, but looks to me like they should be auto-1 all the way.

4)  AtomSelector table.  Covered by bug 681755: long-lived, need to be memmovable or need a moveentry hook on the table, should be auto-1 or auto-2.

5)  More Places stuff.  Probably same as (3), but auto-4?

6)  CSS declaration mOrder.  Already uses an auto array of size 8.  This is long-lived, and I think making it bigger is probably diminishing returns....

7)  Rulehash table.  Same as (4).

8)  Hit testing frame list.  Short-lived on-stack array.  Could be auto-1.

9)  This one makes no sense.  That array is declared as an nsAutoTArray<nsAnimation, 1>.

10) Same as (9), but for nsTransition.  Also declared auto-1 already.

11) Tag table.  Same as (4).

12) This bidi paragraph data is on-stack, but which one of its 3 member arrays is this item for?

13) Again, the line data is on stack, but has several array members...

14) Rule cascade stuff is (4), I think.

15) Frame layer builder stuff again; final capacity 0 but on heap???

16) The atom array for attr values always has at least one entry, as far as I can see.  So it should be an auto-1 array at least.... though your numbers suggest auto-2 might be better.

17) More rule cascade stuff covered by (4).

Going to stop there.
> 15) Frame layer builder stuff again; final capacity 0 but on heap???

I think this is just a bug in the printf in my script.  The script doesn't know whether something of size 0 is actually on the heap or not, but elsewhere, it assumes not.  The printf should reflect that assumption.

I talked to bz on IRC, and I'm going to try to fill in the gaps caused by memmove by sticking a unique ID into the TArray header.  Stay tuned...
(From update of attachment 557909 [details] [diff] [review])
I thought the point of nsTArray_base was to provide functions that didn't depend on the type of the array, which avoided duplication between arrays.
(In reply to Justin Lebar [:jlebar] from comment #38)
> The analysis doesn't handle TArrays which are memmoved well, since it needs
> to see a constructor and destructor at the same address in order to attach a
> final capacity to an object.

Same buffer address?  If so, you could allocate a global atomic counter to uniquely identify TArrays.
(In reply to neil@parkwaycc.co.uk from comment #41)
> (From update of attachment 557909 [details] [diff] [review])
> I thought the point of nsTArray_base was to provide functions that didn't
> depend on the type of the array, which avoided duplication between arrays.

(From comment 35)
>> I also need to make sure that this patch doesn't
>> increase code size too much -- nsTArray_base used to depend just on <Alloc>,
>> but now it depends on <E, Alloc>, since it depends on the auto array's
>> alignment.  But it really only needs to depend on <alignof(E), Alloc>.

I probably should have said <sizeof(E), alignof(E), Alloc>.

(In reply to comment 42)
> Same buffer address?  If so, you could allocate a global atomic counter to uniquely identify 
> TArrays.

Yep; this is what I meant in comment 40:

> I talked to bz on IRC, and I'm going to try to fill in the gaps caused by memmove by sticking a 
> unique ID into the TArray header.  Stay tuned...

I hadn't done this before because some code doesn't compile if sizeof(nsTArray) != sizeof(void*).  But bz pointed out that we can just store this information in the header.  (It means we won't be able to account for TArrays of size 0, but since those don't allocate, we don't care about them.)
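
A minimal sketch of that scheme (the names `Header`, `AllocHeader`, and `sNextArrayId` are made up for illustration, not the actual nsTArray code): each heap-allocated header is stamped with an ID from a global atomic counter, so the instrumentation can key its records on the ID rather than the array's address, which survives memmove.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch; not the real nsTArray header layout.
static std::atomic<uint64_t> sNextArrayId{1};

struct Header {
  uint32_t mLength;
  uint32_t mCapacity;
  uint64_t mId;  // unique ID, stable even if the array object is memmoved
};

// Allocate a header and stamp it with a fresh ID.  The instrumentation
// then matches constructor/destructor events by mId instead of address.
Header* AllocHeader(uint32_t capacity) {
  Header* h = static_cast<Header*>(malloc(sizeof(Header)));
  h->mLength = 0;
  h->mCapacity = capacity;
  h->mId = sNextArrayId.fetch_add(1, std::memory_order_relaxed);
  return h;
}
```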
New instrumentation results, generated by assigning each header a unique ID.  There's now no hanky-panky around memmove; each TArray is counted exactly once.
Attachment #558879 - Attachment is obsolete: true
Looking at the new results...

(In reply to Boris Zbarsky (:bz) from comment #39)
> That data is awesome.

Yes it is!!!!

But I would like to see it sorted by the integral of "number of heap allocated arrays" over time as well. That's what matters for optimizing space. I guess that's the same as sorting by the sum of the lifetimes of the heap-allocated arrays. Ordering by total allocations is what we want for optimizing for time, I guess.

> 1)  Arrays in a hashtable in the framelayerbuilder.  Mostly of size 1, some
> size 2.  These are presumably not auto arrays because of the memmove
> restriction, since they live in hashtable entries.  I _think_ these are
> short-lived, but roc can confirm.

There's one of those for every visible display item, as long as it's visible. So lifetime can vary but we probably won't have a huge number at any one time.

> 8)  Hit testing frame list.  Short-lived on-stack array.  Could be auto-1.

Oh yeah. Could be auto-N trivially. Let's do it.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #45)
> > 8)  Hit testing frame list.  Short-lived on-stack array.  Could be auto-1.
> 
> Oh yeah. Could be auto-N trivially. Let's do it.

Filed bug 682735 with patch.
> But I would like to see it sorted by the integral of "number of heap allocated arrays" over time as 
> well. That's what matters for optimizing space. I guess that's the same as sorting by the sum of the 
> lifetimes of the heap-allocated arrays.

I'm not sure what you mean.  By "number of heap allocated arrays over time", do you mean that a callsite's sort key should be calculated as

  sum over each chunk of memory m allocated by a TArray created at the callsite:
    size(m) * (time-of-free(m) - time-of-creation(m))

?  (I hope that's clear.)
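
In code, the metric I have in mind would look something like this (the record type and field names are invented for illustration; the real log format differs):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record parsed from the instrumentation log.
struct Allocation {
  std::string callsite;
  uint64_t size;       // bytes of heap storage
  uint64_t timeAlloc;  // timestamp of malloc
  uint64_t timeFree;   // timestamp of free
};

// Space-weighted sort key: for each callsite, sum size * lifetime over
// every chunk allocated there.  Sorting callsites by this key orders
// them by their time-integrated heap footprint.
std::map<std::string, uint64_t>
SpaceWeightedKeys(const std::vector<Allocation>& log) {
  std::map<std::string, uint64_t> keys;
  for (const Allocation& a : log) {
    keys[a.callsite] += a.size * (a.timeFree - a.timeAlloc);
  }
  return keys;
}
```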

If this is what you mean, shouldn't Massif or your favorite heap profiler do a better job of telling you about things like this?  In any case, in bug 677571 comment 19, I could only get nsTArray's heap usage up to about 2% of a 200 MB heap, so I'm not sure how much room there is for improvement here.

Things like this would be relatively easy to compute, so I'm happy to generate the stats if they might be useful.

> Ordering by total allocations is what we want for optimizing for time, I guess.

I'm also hoping that by reducing the number of malloc's, we'll reduce heap fragmentation.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #46)
> Filed bug 682735 with patch.

Should be bug 685404.
It seems like we could make some huge wins here by not changing nsTArray code at all and instead making some layout code use nsAutoTArray. We'd just need to do some fixups after the mem-move.

It'd be interesting to know which other changes are worth doing after that is done.
Another advantage of the changes I'm working on is that the overhead of nsAutoTArray shrinks from 16 bytes to max(4, alignof(E)).  That may make a difference when we start sticking nsAutoTArray<1>'s into hashtables.
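
On a typical 64-bit build the current overhead falls out of a mock of the layout (the real class differs in detail; this is just an illustration): a header pointer, a 64-bit-aligned header, and then the buffer.

```cpp
#include <cstdint>

// Hypothetical mock of the current nsAutoTArray<E, N> layout: a pointer
// to the header, the inline header itself (length + capacity), and the
// inline element buffer.  On LP64 the first two fields cost 16 bytes.
template <typename E, unsigned N>
struct CurrentLayout {
  void* mHdr;           // points at the inline header below
  uint32_t mLength;     // inline header: length
  uint32_t mCapacity;   // inline header: capacity
  E mAutoBuf[N];        // inline storage
};
```

The proposed layout drops the header pointer and shrinks the header, so the per-object cost approaches max(4, alignof(E)).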
Last night, bz noticed:

         nsTArray<nsAnimation, nsTArrayDefaultAllocator>::nsTArray(nsTArray<nsAnimation, nsTArrayDefaultAllocator> const&) (nsTArray.h:424)
         nsStyleDisplay::nsStyleDisplay(nsStyleDisplay const&) (nsTArray.h:1228)
           2541 (100%) on heap with final capacity    1

and similarly:

         nsTArray<nsTransition, nsTArrayDefaultAllocator>::nsTArray(nsTArray<nsTransition, nsTArrayDefaultAllocator> const&) (nsTArray.h:424)
         nsStyleDisplay::nsStyleDisplay(nsStyleDisplay const&) (nsTArray.h:1228)
           2536 (100%) on heap with final capacity    1

Why are these reported to be on the heap, since they're nsAutoTArray<1>'s?

I think the reporting may be right here.  nsStyleDisplay's copy constructor does

  nsStyleDisplay::nsStyleDisplay(const nsStyleDisplay& aSource)
    : mAnimations(aSource.mAnimations),
      mTransitions(aSource.mTransitions)
  { }

There's no nsAutoTArray(nsAutoTArray&) constructor, so we're calling the implicitly-defined copy constructor.  It's copy constructors all the way down; this guy calls his parent's copy constructor, and so on.  But the only constructor where mHdr is set to the auto buffer is the no-args constructor nsAutoArrayBase::nsAutoArrayBase(), which is not going to be called.
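
The pitfall reduces to a few lines (a toy reproduction, not the nsTArray code): a class whose member points into its own inline buffer can't rely on the implicitly-defined copy constructor, which copies that pointer bit-for-bit.

```cpp
#include <cstring>

// Toy auto-buffer: mHdr should point at the object's own inline storage.
struct BadAutoBuf {
  char* mHdr;
  char mInline[8];
  BadAutoBuf() : mHdr(mInline) {}
  // No user-defined copy constructor: the compiler-generated one copies
  // mHdr verbatim, leaving it pointing into the *source* object.
};

struct GoodAutoBuf {
  char* mHdr;
  char mInline[8];
  GoodAutoBuf() : mHdr(mInline) {}
  // Explicit copy constructor: re-point mHdr at our own buffer and copy
  // the inline contents instead of the pointer.
  GoodAutoBuf(const GoodAutoBuf& other) : mHdr(mInline) {
    std::memcpy(mInline, other.mInline, sizeof(mInline));
  }
};
```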
Ah.  That seems like something worth fixing!
This increases codesize on Linux x86-32 from 366e5 (bytes?) to 378e5, a 1.03x increase.  (Tinderbox isn't cooperating to give me results for other platforms.)

I'll see what I can do to reduce the codesize difference.
Roc and I talked about this for a bit tonight.  He's concerned about the
current approach, which adds a branch into every call to operator[].  This
means that, if you don't use |elements()|, the performance of TArray isn't
the same as that of a raw array.

As I see it, the options on the table are:

  1. Do nothing to TArray.  Do whatever we need to do to use nsAutoTArray in
  a few places in the style system where we're churning the heap badly.

     Advantages: Easy, has inertia.
     Disadvantages: AutoTArray isn't memmoveable, and AutoTArray has more
		    overhead than other approaches.

  2. Approach in attached patches.  AutoTArray inherits from TArray.  TArray
  therefore has to branch on every access -- am I using the auto buffer, or
  am I on the heap?

     Advantages: AutoTArray is memmoveable and has less overhead than it
       does now.

     Disadvantage: Branch on every access, unless you use |elements()|.

  3. Earlier approach.  AutoTArray does not inherit from TArray.  If you know
  statically that you have a heap-based TArray, you don't have to branch to
  check whether it's an auto-array.  If you're an auto array, you still
  branch.  (I imagine this won't be a problem, because you wouldn't declare
  an auto array if your array is large.)

  If you don't know statically whether or not you're an auto array, you have
  to use nsATArray&, which can wrap either an auto or a vanilla TArray.
  Using one of these costs branches, of course.

    Advantages: No branches for heap-based TArray, if you know statically
      that's what you have.

    Disadvantages: More TArray classes, more invasive change (have to change
      pretty much every method which is passed an nsAutoTArray to take an
      nsATArray).  Breaks users of our APIs?
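
To make option 2's cost concrete, here is a sketch of the per-access branch with made-up names (the real patch differs): the array checks a tag to decide whether its elements live in the inline buffer or behind a heap header.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of option 2: every element access first asks
// "am I using the inline buffer or heap storage?"
struct SketchArray {
  uintptr_t mHdr;   // low bit set => inline storage
  int mInline[2];
  int* mHeap;

  bool UsesAutoBuffer() const { return mHdr & 1; }

  // The branch roc is worried about: one extra conditional per access,
  // unless callers hoist it by caching the result of Elements().
  int* Elements() { return UsesAutoBuffer() ? mInline : mHeap; }
  int& operator[](size_t i) { return Elements()[i]; }
};
```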

I'm not sure how we should decide which of these options to pursue.  We could
measure the effect on performance (2), but even if we saw no differences, I'm
not sure that would satisfy roc -- certainly one could construct a sequence
of instructions in which the extra branches matter.

There are tradeoffs to be made here; it seems that we can't both improve
AutoTArray, incur zero runtime overhead, and maintain the code's current
ergonomics.  Choose two out of three.

What do you all think?
I think "no extra overhead" is part of the ergonomics --- it really simplifies coding when you know there's no extra overhead.
My concern is that we'll start seeing a lot of people switch from using readable direct access of .length and operator[] to less readable use of elements() and cached .length access.

It's really nice that right now, the simple code is also (basically) the fastest.

Do we have data showing that the memory overhead of nsAutoTArray matters (one of the cons brought up in 1)? I guess it would matter more if layout started using it?
(In reply to Jonas Sicking (:sicking) from comment #59)
> My concern istht we'll start seeing a lot of people switch from using
> readable direct access of .length and operator[], to less readable use of
> elements() and cached .length access.

You should probably cache .length accesses now, if you care about having a fast loop.  Reading .length requires dereferencing the array's header pointer, and the C++ compiler has to assume, in most cases, that almost any pointer may alias the header pointer.  Therefore if you dereference and write to a pointer inside the loop (or call a function, or do much of anything), the compiler will probably have to read the length again.
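
The aliasing point in code form (a generic illustration using std::vector, not tied to nsTArray): because writes through other pointers might alias the length field, the compiler reloads the length each iteration unless the caller hoists it into a local.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive loop: the length load is conceptually re-evaluated every
// iteration, because the store to out[i] might alias the length field.
uint32_t SumNaive(const std::vector<uint32_t>& arr, uint32_t* out) {
  uint32_t total = 0;
  for (size_t i = 0; i < arr.size(); ++i) {
    out[i] = arr[i];
    total += arr[i];
  }
  return total;
}

// Cached length: hoist the load once; the loop bound is now a local
// that the compiler knows the stores cannot modify.
uint32_t SumCached(const std::vector<uint32_t>& arr, uint32_t* out) {
  const size_t len = arr.size();
  uint32_t total = 0;
  for (size_t i = 0; i < len; ++i) {
    out[i] = arr[i];
    total += arr[i];
  }
  return total;
}
```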

I still feel like the vast majority of loops over arrays are too short for the extra branching to matter.  But finding the loops where it does matter would be tricky.

> It's really nice that right now, the simple code is also (basically) the
> fastest.

Only if you cache .length.  So I guess writing fast code is non-obvious in either model.

> Do we have data showing that the memory overhead of nsAutoTArray matters
> (one of the cons brought up in 1)? I guess it would matter more if layout
> started using it?

Just like with the extra branches, I'm sure we could come up with a scenario where it matters.
I ran TP (I'm not sure if it's TP4 or TP5; it's whatever is in standalone Talos v2.1) under Cachegrind with and without the TArray patch, to see whether the extra branches were material.

It appears that any change in the number of mispredictions is within the noise of the benchmark.  I ran the patched version twice, and the unpatched version once:

* Patched, run 1
  Branches:       3,409,120,025  (3,177,929,336 cond +   231,190,689 ind)
  Mispredicts:      248,122,506  (  203,573,942 cond +    44,548,564 ind)

* Patched, run 2
  Branches:       3,349,989,015  (3,123,279,007 cond +   226,710,008 ind)
  Mispredicts:      242,303,267  (  199,704,337 cond +    42,598,930 ind)

* Unpatched
  Branches:       3,339,921,232  (3,110,172,023 cond +   229,749,209 ind)
  Mispredicts:      242,905,179  (  198,723,437 cond +    44,181,742 ind)

Julian tells me that the branch predictor in Cachegrind is intentionally simplistic, and the branch predictors in modern ARM chips and desktop chips will outperform it.

I'll post the full cachegrind results in a moment.

There's obviously still a cost to the extra branches.  I think the question now is whether the value of smaller AutoTArrays, perhaps as measured by cache effects, outweighs the branches' cost.
(In reply to Justin Lebar [:jlebar] from comment #62)
> Cachegrind results from running TP(4? 5?)

Those will give you some idea of the cache effects on the machine you ran
it on (a beefy looking laptop iirc), since by default cachegrind uses the
cache configuration parameters of the host.  This will tend to understate
any advantage you might get from having a smaller memory footprint, since
such machines have generous caches.  If you wanted to get some numbers
for a processor more typical of current phones/tablets, you can specify a
cache config on the command line, eg

--I1=32768,4,64 --D1=32768,4,64 --LL=524288,8,128

which asks for a 32k 4-way associative D1 and I1 with 64 byte lines, and a
512k 8-way set associative L2 with 128 byte lines.  That might be more
representative of (eg) a Cortex-A8, although I'm guessing at some of the
parameters here.
(In reply to Boris Zbarsky from comment #39)
> 8)  Hit testing frame list.  Short-lived on-stack array.  Could be auto-1.
Can we make it a compile error to allocate nsTArray on the stack?
> --I1=32768,4,64 --D1=32768,4,64 --LL=524288,8,128

Close!  The associativity is right, and the sizes are valid.  But the A8 has 64-byte lines in both L1 and L2 according to the unfortunately unlinkable [1].

> Can we make it a compile error to allocate nsTArray on the stack?

We could probably add a runtime assertion.  I'm not sure how you could explicitly forbid it syntactically, since the declaration

struct foo {
  nsTArray<bar> array;
};

may result in the array being allocated on the heap or stack.

[1] http://infocenter.arm.com/help/index.jsp
Filed bug 687545 on raising a warning if TArrays are allocated on the stack.
(In reply to neil@parkwaycc.co.uk from comment #64)
> (In reply to Boris Zbarsky from comment #39)
> > 8)  Hit testing frame list.  Short-lived on-stack array.  Could be auto-1.
> Can we make it a compile error to allocate nsTArray on the stack?

We could use our static-analysis tools to create an annotation like NS_HEAP_CLASS, but I don't know why you want to forbid allocating nsTArray on the stack.  If the array is lexically scoped and the programmer knows ahead of time it will likely grow larger than a sane amount of stack to allocate to it, then a stack nsTArray could make sense.  In general, having a "this-nsTArray-should-be-nsAutoTArray<N>" dynamic check like what Justin wrote seems like a better approach.
No longer blocks: 681755
I don't think we're going to go any further with this bug, specifically making TArray memmoveable.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.