Closed Bug 1048044 Opened 10 years ago Closed 10 years ago

nsTArray does not use exponential array growth on large arrays (>1024 bytes)

Categories

(Core :: XPCOM, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla36

People

(Reporter: defenastrator, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0 (Beta/Release)
Build ID: 20140730030201

Steps to reproduce:

While researching bug 997428, I found that array expansion is not exponential for large arrays, based on xpcom/glue/nsTArray-inl.h lines 146-167 of nsTArray_base::EnsureCapacity() (which begins on line 113).

I got here by tracing the execution path starting from nsTArray::AppendElement().




Actual results:

AppendElement execution time is amortized O(n) for large arrays (>page size)


Expected results:

AppendElement execution time should be amortized O(1) for all array sizes.

Exponential array growth is most critical for large arrays, where copy overhead is high.

I also recommend changing the growth factor to either 2.3 or 1.5: 2.3 is commonly cited as the best growth constant for speed, and 1.5 as the best for saving space.
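A minimal sketch of the complexity argument (illustrative helpers, not nsTArray code): with a fixed increment k, reaching n elements takes roughly n/k reallocations that each copy up to n elements, while with a geometric factor the total copying across all reallocations stays O(n), so each append is amortized O(1).

  // Sketch only: contrasts the two growth policies' amortized cost.
  #include <cstddef>

  // Fixed increment k: ~n/k reallocations, each copying up to n elements,
  // so total copying is O(n^2 / k) and an append is amortized O(n).
  size_t NextCapacityLinear(size_t aCapacityBytes, size_t aIncrementBytes)
  {
    return aCapacityBytes + aIncrementBytes;
  }

  // Geometric factor f: the copies across all reallocations sum to about
  // n / (f - 1) elements, so an append is amortized O(1).
  size_t NextCapacityGeometric(size_t aCapacityBytes, double aFactor)
  {
    size_t next = static_cast<size_t>(aCapacityBytes * aFactor);
    return next > aCapacityBytes ? next : aCapacityBytes + 1;
  }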
See Also: → 997428
This is a delicate area; we go to great lengths to avoid allocation waste. If you make any changes here, please include njn or someone else from MemShrink on the review.
Whiteboard: [MemShrink]
I think just asking njn for advice on the allocation strategy wouldn't be a bad idea here in the first place :)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(n.nethercote)
Looks like there's already a bug about slop in nsTArray (bug 1017418), which moved this logic around (in addition to other changes), but was backed out. So there's a tradeoff between performance and memory usage here, but the current behavior is bad for both.
The current logic, I should say. The behavior causing the performance issue may be desirable from a memory usage point of view.
We may be able to play games with remapping pages and using mmap(anon) to get lower memory usage in some cases. If a region is mmapped, pages that have not been touched will be "allocated" but not actually backed until written to. This could work to our advantage, allowing us to over-allocate without any real slop beyond the address space we earmark.

We cannot, however, get these gains using standard malloc techniques.
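For illustration only, and assuming a Linux-style overcommitting kernel (this is not part of any proposed patch), a sketch of the behaviour being described: an anonymous mapping reserves address space eagerly, but physical pages are committed only when first written.

  // Sketch only: lazy commit of an anonymous mapping on Linux.
  #include <sys/mman.h>
  #include <string.h>

  int main()
  {
    const size_t reserveBytes = 8 * 1024 * 1024;   // 8 MiB of address space
    void* buf = mmap(nullptr, reserveBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
      return 1;
    }
    // RSS is essentially unchanged at this point.
    memset(buf, 0, 4096);          // touching one page commits one page
    munmap(buf, reserveBytes);
    return 0;
  }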
On Windows we only ship 32-bit builds, and our primary source of OOMs is address space, not actual memory usage.
Well then, the allocation strategy still needs to be chosen carefully. Even so, remapping memory pages can sharply decrease the time spent in reallocation, by a factor of around 200.
JS arrays use a doubling strategy up to 1 million elements, and then grow by 1.125x beyond that.

Playing games with mmap shouldn't be necessary. Large heap allocations have the same behaviour.

The backing out of bug 1017418 was unfortunate. It was, AFAICT, caused by some code somewhere assuming that SetCapacity(n) would result in an array with a capacity of exactly |n|, which isn't actually guaranteed by the API (though it happens with the current implementation). Changing to an exponential growth strategy will require solving this problem too.

I can take this bug if you're willing to wait a few days. I've done this kind of thing numerous times before, most recently in bug 1039965.
Flags: needinfo?(n.nethercote)
Whiteboard: [MemShrink]
Sorry for the long response time. I've got the patch; I'm writing it right now.
Just beware that the patch in bug 1039965 causes lots of failures on B2G ICS Emulator Debug tests, and I don't know why, and any patch you write for this bug is likely to have the same problem.
Do you have a notion of which allocator nsTArray depends on, and whether adding a port for moz_malloc_size_of to the allocator will cause a problem?
> Do you have a notion of which allocator nsTArray depends on, and whether
> adding a port for moz_malloc_size_of to the allocator will cause a problem?

Sorry, I don't understand the question. Which sense of "allocator" are you using?
The template<class Alloc...> interface through which malloc, realloc, and free are accessed. To give more complete knowledge of the allocations received, I added an AllocationSize function that interfaces to moz_malloc_size_of, which reports the actual allocation size from jemalloc and so gives a better idea of the space actually allocated.
Attached patch first patch attempt (obsolete) — Splinter Review
This is a first attempt at a patch. Unfortunately my compile environment is all kinds of screwed up at the moment, so I can't test it. It uses the new interface.
Sounds like the existing SizeOf{In,Ex}cludingThis functions will do what you want.
Looks good. Is there any documentation on what the argument is for and what I should pass it?
See mfbt/MemoryReporting.h.
SizeOf{In,Ex}cludingThis does not solve the inherent problem that I cannot be sure all instantiations of nsTArray use allocations taken directly from the system allocator (jemalloc). It appears the code is written with managed allocations in mind (usesAutoArray), which would break moz_malloc_size_of because of additional metadata before the pointer handed to nsTArray, or other situations. To correctly implement this patch with minimal slop, I need to know exactly the usable size the allocator gave me. It is impossible to calculate that size myself because of the chaotic nature of allocators and their inconsistent overhead.
I guess I could patch it for now, and we could file a new bug to fix the allocator size-class slop later.
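For illustration, a sketch of the usable-size idea being described, using glibc's malloc_usable_size() as a stand-in for a jemalloc-backed query; the names here are hypothetical and not part of any patch in this bug.

  // Sketch only: adopt the allocator's real block size as the capacity so
  // rounding slop becomes usable space.
  #include <cstdlib>
  #include <malloc.h>

  struct SizedAllocation
  {
    void* mPtr;
    size_t mUsableBytes;   // what the allocator really handed back, slop included
  };

  SizedAllocation AllocAtLeast(size_t aBytes)
  {
    void* p = malloc(aBytes);
    size_t usable = p ? malloc_usable_size(p) : 0;   // query the real block size
    return SizedAllocation{ p, usable };
  }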
> SizeOf{In,Ex}cludingThis does not solve the inherent problem that I cannot
> be sure all instantiations of nsTArray use allocations taken directly from
> the system allocator (jemalloc). It appears the code is written with managed
> allocations in mind (usesAutoArray), which would break moz_malloc_size_of
> because of additional metadata before the pointer handed to nsTArray, or
> other situations. To correctly implement this patch with minimal slop, I
> need to know exactly the usable size the allocator gave me. It is impossible
> to calculate that size myself because of the chaotic nature of allocators
> and their inconsistent overhead.

As I understand it, the existing code measures nsTArray accurately, including slop. What do you mean by "the pointer handed to nsTArray" and the "other situations"? As far as I know, nsTArray is entirely responsible for the allocation of its own element storage.
The current code does not account for over-allocation by the allocator; it assumes the allocation returned is exactly the requested size, which, given the granularity of larger allocations, wastes quite a bit of space. Additionally, it doesn't consider allocator overhead, so it can create quite a bit of slop: a minimum of 4 KB for mid-sized allocations.
Bug 1017418, which has been mentioned several times, is an attempt at addressing slop.

At this point I don't know if you have any remaining questions. If you do, please try to write them clearly, and use punctuation appropriately. I've had trouble understanding most of the comments you've written so far, which makes it difficult for me to help you.
I'm also seeing terrible heap churn due to this. Consider expanding an nsTArray from 4 KiB to 256 KiB. With the current approach we cumulatively allocate (4 + 8 + 12 + ... + 252 + 256) KiB, which is just over 8 MiB. With a doubling strategy we allocate (4 + 8 + 16 + 32 + ... + 256) KiB, which is just under 0.5 MiB. Do the same up to 1 MiB and the current approach will do 16x worse. It's quite possible this is contributing to our jemalloc fragmentation problems.

So I'm going to take this. I'll need to work out what's going wrong with the ICS emulator tests for bug 1017418 first, though.
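As a throwaway check of the cumulative totals above (a sketch, not project code):

  // Sketch only: reproduces the cumulative-allocation comparison above.
  #include <cstddef>
  #include <cstdio>

  int main()
  {
    const size_t KiB = 1024;
    size_t pageGrowth = 0;     // grow 4 KiB at a time, 4 KiB -> 256 KiB
    for (size_t cap = 4 * KiB; cap <= 256 * KiB; cap += 4 * KiB) {
      pageGrowth += cap;
    }
    size_t doubling = 0;       // double each time, 4 KiB -> 256 KiB
    for (size_t cap = 4 * KiB; cap <= 256 * KiB; cap *= 2) {
      doubling += cap;
    }
    printf("page-at-a-time: %zu KiB, doubling: %zu KiB\n",
           pageGrowth / KiB, doubling / KiB);   // 8320 KiB vs. 508 KiB
    return 0;
  }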
Assignee: nobody → n.nethercote
Depends on: 1017418
When growing an existing nsTArray, I'm using a doubling strategy up to 8 MiB,
and then growing by a factor of at least 1.125 each time after that.

I've left unchanged the allocate-a-new-buffer-when-one-doesn't-exist case, in
order to avoid the mysterious ICS emulator crashes that have prevented bug
1017418 from landing. In that case, the existing code just allocates the exact
amount requested. One could argue that this is a good thing, because this is
ideal behaviour for arrays that don't grow any further. But I don't have any
data about how often that occurs.
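For reference, a condensed sketch of that growth policy, with illustrative names and constants (the real change is in nsTArray_base::EnsureCapacity()):

  // Sketch only: doubling below the threshold, >=1.125x growth above it,
  // rounded up to whole pages.
  #include <algorithm>
  #include <cstddef>

  size_t ChooseAllocSize(size_t aCurrBytes, size_t aReqBytes)
  {
    const size_t slowGrowthThreshold = 8 * 1024 * 1024;   // 8 MiB
    const size_t pageSize = 4096;

    size_t bytesToAlloc;
    if (aReqBytes >= slowGrowthThreshold) {
      size_t minNewSize = aCurrBytes + (aCurrBytes >> 3);  // grow by at least 1.125x
      bytesToAlloc = std::max(aReqBytes, minNewSize);
      // Round up to the next multiple of pageSize.
      bytesToAlloc = pageSize * ((bytesToAlloc + pageSize - 1) / pageSize);
    } else {
      // Doubling: round the request up to the next power of two.
      bytesToAlloc = 8;
      while (bytesToAlloc < aReqBytes) {
        bytesToAlloc *= 2;
      }
    }
    return bytesToAlloc;
  }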

Anyway, this is a clear win for the resizing case, which is where all the heap
churn comes from. Here is a DMD report showing *cumulative* allocations, rather
than *live* allocations -- I have a special local DMD build for this which is
not yet in the repository -- prior to this patch:

> Cumulative {
>   54 blocks in heap block record 31 of 3,324
>   6,082,560 bytes (6,082,560 requested / 0 slop)
>   Individual block sizes: 221,184; 217,088; 212,992; 208,896; 204,800; 200,704; 196,608; 192,512; 188,416; 184,320; 180,224; 176,128; 172,032; 167,936; 163,840; 159,744; 155,648; 151,552; 147,456; 143,360; 139,264; 135,168; 131,072; 126,976; 122,880; 118,784; 114,688; 110,592; 106,496; 102,400; 98,304; 94,208; 90,112; 86,016; 81,920; 77,824; 73,728; 69,632; 65,536; 61,440; 57,344; 53,248; 49,152; 45,056; 40,960; 36,864; 32,768; 28,672; 24,576; 20,480; 16,384; 12,288; 8,192; 4,096
>   0.48% of the heap (66.57% cumulative)
>   Allocated at {
>     #01: nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned long, unsigned long) (/home/njn/moz/mi5/co64dmd/toolkit/crashreporter/../../dist/include/nsTArray-inl.h:187)
>     #02: mozilla::dom::MemoryReport* nsTArray_Impl<mozilla::dom::MemoryReport, nsTArrayInfallibleAllocator>::AppendElement<mozilla::dom::MemoryReport&>(mozilla::dom::MemoryReport&) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../dist/include/nsTArray.h:1333)
>     #03: mozilla::dom::MemoryReportCallback::Callback(nsACString_internal const&, nsACString_internal const&, int, int, long, nsACString_internal const&, nsISupports*) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../../dom/ipc/ContentChild.cpp:797)
>   }
> }

The growth of a single array from 4 KiB up to 216 KiB is obvious.

And after the patch is applied:

> Cumulative {
>   ~8 blocks in heap block record 222 of 3,457
>   ~524,285 bytes (~524,285 requested / ~0 slop)
>   0.04% of the heap (88.80% cumulative)
>   Individual block sizes: 262,144; 131,072; 65,536; 32,768; 16,384; 8,192; 4,096; ~4,093
>   Allocated at {
>     #01: nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned long, unsigned long) (/home/njn/moz/mi5/co64dmd/toolkit/crashreporter/../../dist/include/nsTArray-inl.h:199)
>     #02: mozilla::dom::MemoryReport* nsTArray_Impl<mozilla::dom::MemoryReport, nsTArrayInfallibleAllocator>::AppendElement<mozilla::dom::MemoryReport&>(mozilla::dom::MemoryReport&) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../dist/include/nsTArray.h:1346)
>     #03: mozilla::dom::MemoryReportCallback::Callback(nsACString_internal const&, nsACString_internal const&, int, int, long, nsACString_internal const&, nsISupports*) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../../dom/ipc/ContentChild.cpp:797)
>   }
> }

Much better. The url-classifier in particular will benefit from this change
because it has lots of large (200 KiB+) arrays grown like this.
Attachment #8510149 - Flags: review?(nfroyd)
No longer depends on: 1017418
> When growing an existing nsTArray, I'm using a doubling strategy up to 8 MiB,
> and then increasing by at least 1.125 each time after that.

32-bit Windows is going to be pretty unhappy about that. Once you approach the 1 MiB mark, it will be difficult to find a sufficient free region and/or they will be taking away nicely-sized blocks from future allocations. You'll want to watch crash-stats closely and have some backup plans ready.
The JS engine has used this exact strategy for allocating object slots for years.
(In reply to Nicholas Nethercote [:njn] from comment #28)
> The JS engine has used this exact strategy for allocating object slots for
> years.

One thinks that the number of slots in JS objects and the typical size of nsTArrays might not be comparable.  Though I guess judging by comment 9, "object slots" can be properly read as "array elements"?
jemalloc rounds all huge allocations up to the nearest MiB, so it would be nice if you could take advantage of the full allocated size. All huge allocations are also MiB aligned, so I don't see that causing any problems. It's more a tradeoff between moving the data around over and over again if you grow by fixed increments (say an MiB each time), or failing to allocate because you grew by too much. Perhaps you could have a fallback path where if growing by 1.125 times fails, you try again with the exact amount rounded up to the nearest MiB.
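A sketch of that fallback idea (realloc() here stands in for whatever allocator hook nsTArray actually uses; nothing like this is claimed to have landed):

  // Sketch only: if the 1.125x-grown size cannot be allocated, retry with the
  // exact request rounded up to jemalloc's 1 MiB huge-allocation granularity.
  #include <cstdlib>

  void* GrowHugeBuffer(void* aPtr, size_t aReqBytes, size_t aGrownBytes)
  {
    const size_t MiB = 1024 * 1024;
    if (void* p = realloc(aPtr, aGrownBytes)) {   // try the grown size first
      return p;
    }
    size_t fallbackBytes = MiB * ((aReqBytes + MiB - 1) / MiB);
    return realloc(aPtr, fallbackBytes);
  }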
> One thinks that the number of slots in JS objects and the typical size of
> nsTArrays might not be comparable.

JS object slots are 8 bytes each, and the slower growth starts at 1 million slots, i.e. 8 MiB. So the behaviour is the same. Indeed, that's why I chose 8 MiB as the threshold for the slower growth in this patch.

Or maybe you're talking about typical sizes. In my experience they're both generally power-law-ish: many small buffers, a few large ones.

> jemalloc rounds all huge allocations up to the nearest MiB, so it would be
> nice if you could take advantage of the full allocated size.

mozjemalloc is set up so that huge allocations round up to the nearest MiB in terms of virtual address space, but round up to the nearest page size in terms of committed space. So the page-sized rounding (which was already present) is there to optimize for committed space. But I am happy to change it to MiB rounding to optimize for virtual address space.

Finally, every other class with a growable buffer that I can think of in Mozilla code uses a doubling strategy: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and XDRBuffer. Oh and pdf.js uses it in various places with typed arrays (which unlike vanilla JS arrays are fixed length so you have to handle growth yourself). It's not a radical technique.
Attachment #8470650 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #31)
> mozjemalloc is set up so that huge allocations round up to the nearest MiB
> in terms of virtual address space, but round up to the nearest page size in
> terms of committed space. So the page-sized rounding (which was already
> present) is there to optimize for committed space. But I am happy to change
> it to MiB rounding to optimize for virtual address space.

Right. The OS still shouldn't commit physical pages until you actually touch them, so as long as nsTArray doesn't do anything silly like explicitly nulling the allocated region, optimizing for virtual address space should be optimal.
Comment on attachment 8510149 [details] [diff] [review]
Use exponential growth when growing an nsTArray

Review of attachment 8510149 [details] [diff] [review]:
-----------------------------------------------------------------

Part of me wants to limit the slowGrowthThreshold to 1 MB or thereabouts, but let's keep the 8 MB threshold, given JS behavior, and see if we encounter problems.

::: xpcom/glue/nsTArray-inl.h
@@ +152,4 @@
>    size_t bytesToAlloc;
> +  if (reqSize >= slowGrowthThreshold) {
> +    size_t currSize = sizeof(Header) + Capacity() * aElemSize;
> +    size_t minNewSize = currSize + (currSize >> 3); // multiple by 1.125

Nit: "multiply by"

@@ +158,3 @@
>      // Round up to the next multiple of pageSize.
> +    const size_t pageSize = 4096;
> +    bytesToAlloc = pageSize * ((bytesToAlloc + pageSize - 1) / pageSize);

I am slightly impressed that this gets compiled into the expected:

  (bytesToAlloc + pageSize - 1) & ~(pageSize - 1)

(at least on GCC).
Attachment #8510149 - Flags: review?(nfroyd) → review+
I did five AWSY runs with and without this patch applied:

- https://areweslimyet.com/?series=njn-1048044-a#evenspacing
- https://areweslimyet.com/?series=njn-1048044-b#evenspacing
- https://areweslimyet.com/?series=njn-1048044-c#evenspacing
- https://areweslimyet.com/?series=njn-1048044-d#evenspacing
- https://areweslimyet.com/?series=njn-1048044-e#evenspacing

99e317210668 is the "no change" revision, and 8409f65b9769 is the revision with
the nsTArray modifications. Note that the latter is on the *left* in the 
graphs.

The RSS results are suggestive that the patch makes an improvement, but the
noise level is high enough that it's not conclusive:

> RSS: Fresh start (light blue)
> - old: 170.91, 166.50, 170.20, 163.92, 181.40; avg=170.59
> - new: 161.38, 176.97, 162.01, 166.57, 161.54; avg=165.69
>
> RSS: After TP5 (pink)
> - old: 398.34, 399.64, 410.50, 405.28, 414.55; avg=405.66
> - new: 397.00, 405.06, 393.71, 396.21, 398.25; avg=398.05
>
> RSS: After TP5, tabs closed [+30s, forced GC] (purple)
> - old: 264.80, 234.57, 236.35, 261.69, 250.74; avg=249.63
> - new: 255.31, 240.29, 228.78, 255.25, 233.36; avg=242.60

But if you drill down to the heap-overhead numbers, the picture is much
clearer:

> RSS: After TP5, tabs closed [+30s, forced GC] (purple)
> - old: 74.43, 74.43, 74.43, 74.43, 74.43
> - new: 70.13, 70.13, 70.13, 70.13, 70.13

A 4.30 MiB improvement, yay! 2.59 MiB of the improvement is in "bin-unused",
1.66 MiB is in "page-cache", and 0.05 MiB is in "bookkeeping".
https://hg.mozilla.org/mozilla-central/rev/ac0d7bf37abc
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Whiteboard: [MemShrink]