Closed Bug 1048044 Opened 10 years ago Closed 10 years ago

nsTArray does not use exponential array growth on large arrays (>1024 bytes)

Categories

(Core :: XPCOM, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla36

People

(Reporter: defenastrator, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0 (Beta/Release)
Build ID: 20140730030201

Steps to reproduce:

While researching bug 997428, I found that array expansion is not exponential for large arrays, based on xpcom/glue/nsTArray-inl.h lines 146-167 of nsTArray_base::EnsureCapacity() (which begins on line 113).

I got here by tracing the execution path starting from nsTArray::AppendElement().




Actual results:

AppendElement execution time is amortized O(n) for large arrays (>page size)


Expected results:

AppendElement execution time should be amortized O(1) for all array sizes.

Exponential array growth is most critical for large arrays, where copy overhead is high.

I also recommend changing the growth factor to either 2.3 or 1.5: 2.3 is commonly cited as the best growth constant for speed, and 1.5 as the best for saving space.
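A minimal sketch of the complexity argument (illustrative helpers, not nsTArray code): with a fixed increment k, reaching n elements takes roughly n/k reallocations that each copy up to n elements, while with a geometric factor the total copying across all reallocations stays O(n), so each append is amortized O(1).

  // Sketch only: contrasts the two growth policies' amortized cost.
  #include <cstddef>

  // Fixed increment k: ~n/k reallocations, each copying up to n elements,
  // so total copying is O(n^2 / k) and an append is amortized O(n).
  size_t NextCapacityLinear(size_t aCapacityBytes, size_t aIncrementBytes)
  {
    return aCapacityBytes + aIncrementBytes;
  }

  // Geometric factor f: the copies across all reallocations sum to about
  // n / (f - 1) elements, so an append is amortized O(1).
  size_t NextCapacityGeometric(size_t aCapacityBytes, double aFactor)
  {
    size_t next = static_cast<size_t>(aCapacityBytes * aFactor);
    return next > aCapacityBytes ? next : aCapacityBytes + 1;
  }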
See Also: → 997428
This is a delicate area; we go to great lengths to avoid allocation waste. If you make any changes here, please include njn or someone else from MemShrink on the review.
Whiteboard: [MemShrink]
I think just asking njn for advice on the allocation strategy wouldn't be a bad idea here in the first place :)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(n.nethercote)
Looks like there's already a bug about slop in nsTArray (bug 1017418), which moved this logic around (in addition to other changes), but was backed out. So there's a tradeoff between performance and memory usage here, but the current behavior is bad for both.
The current logic, I should say. The behavior causing the performance issue may be desirable from a memory usage point of view.
We may be able to play games with remapping pages and using mmap(anon) to get lower memory usage in some cases. If a region is mmapped, pages that have not been touched will be "allocated" but not actually backed until written to. This could work to our advantage, allowing us to over-allocate without any real slop beyond the address space we earmark.

We cannot, however, get these gains using standard malloc techniques.
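For illustration only, and assuming a Linux-style overcommitting kernel (this is not part of any proposed patch), a sketch of the behaviour being described: an anonymous mapping reserves address space eagerly, but physical pages are committed only when first written.

  // Sketch only: lazy commit of an anonymous mapping on Linux.
  #include <sys/mman.h>
  #include <string.h>

  int main()
  {
    const size_t reserveBytes = 8 * 1024 * 1024;   // 8 MiB of address space
    void* buf = mmap(nullptr, reserveBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
      return 1;
    }
    // RSS is essentially unchanged at this point.
    memset(buf, 0, 4096);          // touching one page commits one page
    munmap(buf, reserveBytes);
    return 0;
  }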
On Windows we only ship 32-bit builds, and our primary source of OOMs is address space, not actual memory usage.
Well then, the allocation strategy still needs to be chosen carefully. Even so, remapping memory pages can sharply decrease the time spent in reallocation, by a factor of around 200.
JS arrays use a doubling strategy up to 1 million elements, and then grow by 1.125x beyond that.

Playing games with mmap shouldn't be necessary. Large heap allocations have the same behaviour.

The backing out of bug 1017418 was unfortunate. It was, AFAICT, caused by some code somewhere assuming that SetCapacity(n) would result in an array with a capacity of exactly |n|, which isn't actually guaranteed by the API (though it happens with the current implementation). Changing to an exponential growth strategy will require solving this problem too.

I can take this bug if you're willing to wait a few days. I've done this kind of thing numerous times before, most recently in bug 1039965.
Flags: needinfo?(n.nethercote)
Whiteboard: [MemShrink]
Sorry for the long response time. I've got the patch; I'm writing it right now.
Just beware that the patch in bug 1039965 causes lots of failures on B2G ICS Emulator Debug tests, and I don't know why, and any patch you write for this bug is likely to have the same problem.
Do you have a notion of which allocator nsTArray depends on, and whether adding a port for moz_malloc_size_of to the allocator will cause a problem?
> Do you have a notion of which allocator nsTArray depends on, and whether
> adding a port for moz_malloc_size_of to the allocator will cause a problem?

Sorry, I don't understand the question. Which sense of "allocator" are you using?
The template<class Alloc...> interface through which malloc, realloc, and free are accessed. To give more complete knowledge of the allocations received, I added an AllocationSize function that interfaces to moz_malloc_size_of, which reports the actual allocation size from jemalloc and so gives a better idea of the space actually allocated.
Attached patch first patch attempt (obsolete) — Splinter Review
This is a first attempt at a patch. Unfortunately my compile environment is all kinds of screwed up at the moment, so I can't test it. It uses the new interface.
Sounds like the existing SizeOf{In,Ex}cludingThis functions will do what you want.
Looks good. Is there any documentation on what the argument is for and what I should pass it?
See mfbt/MemoryReporting.h.
SizeOf{In,Ex}cludingThis does not solve the inherent problem that I cannot be sure all instantiations of nsTArray use allocations taken directly from the system allocator (jemalloc). It appears the code is written with managed allocations in mind (usesAutoArray), which would break moz_malloc_size_of because of additional metadata before the pointer handed to nsTArray, or other situations. To correctly implement this patch with minimal slop, I need to know exactly the usable size the allocator gave me. It is impossible to calculate that size myself because of the chaotic nature of allocators and their inconsistent overhead.
I guess I could patch it for now, and we could file a new bug to fix the allocator size-class slop later.
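For illustration, a sketch of the usable-size idea being described, using glibc's malloc_usable_size() as a stand-in for a jemalloc-backed query; the names here are hypothetical and not part of any patch in this bug.

  // Sketch only: adopt the allocator's real block size as the capacity so
  // rounding slop becomes usable space.
  #include <cstdlib>
  #include <malloc.h>

  struct SizedAllocation
  {
    void* mPtr;
    size_t mUsableBytes;   // what the allocator really handed back, slop included
  };

  SizedAllocation AllocAtLeast(size_t aBytes)
  {
    void* p = malloc(aBytes);
    size_t usable = p ? malloc_usable_size(p) : 0;   // query the real block size
    return SizedAllocation{ p, usable };
  }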
> SizeOf{In,Ex}cludingThis does not solve the inherent problem that I cannot
> be sure all instantiations of nsTArray use allocations taken directly from
> the system allocator (jemalloc). It appears the code is written with managed
> allocations in mind (usesAutoArray), which would break moz_malloc_size_of
> because of additional metadata before the pointer handed to nsTArray, or
> other situations. To correctly implement this patch with minimal slop, I
> need to know exactly the usable size the allocator gave me. It is impossible
> to calculate that size myself because of the chaotic nature of allocators
> and their inconsistent overhead.

As I understand it, the existing code measures nsTArray accurately, including slop. What do you mean by "the pointer handed to nsTArray" and the "other situations"? As far as I know, nsTArray is entirely responsible for the allocation of its own element storage.
The current code does not account for over-allocation by the allocator; it assumes the allocation returned is exactly the requested size, which, given the granularity of larger allocations, wastes quite a bit of space. Additionally, it doesn't consider allocator overhead, so it can create quite a bit of slop: a minimum of 4 KB for mid-sized allocations.
Bug 1017418, which has been mentioned several times, is an attempt at addressing slop.

At this point I don't know if you have any remaining questions. If you do, please try to write them clearly, and use punctuation appropriately. I've had trouble understanding most of the comments you've written so far, which makes it difficult for me to help you.
I'm also seeing terrible heap churn due to this. Consider expanding an nsTArray from 4 KiB to 256 KiB. With the current approach we cumulatively allocate (4 + 8 + 12 + ... + 252 + 256) KiB, which is just over 8 MiB. With a doubling strategy we allocate (4 + 8 + 16 + 32 + ... + 256) KiB, which is just under 0.5 MiB. Do the same up to 1 MiB and the current approach will do 16x worse. It's quite possible this is contributing to our jemalloc fragmentation problems.

So I'm going to take this. I'll need to work out what's going wrong with the ICS emulator tests for bug 1017418 first, though.
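As a throwaway check of the cumulative totals above (a sketch, not project code):

  // Sketch only: reproduces the cumulative-allocation comparison above.
  #include <cstddef>
  #include <cstdio>

  int main()
  {
    const size_t KiB = 1024;
    size_t pageGrowth = 0;     // grow 4 KiB at a time, 4 KiB -> 256 KiB
    for (size_t cap = 4 * KiB; cap <= 256 * KiB; cap += 4 * KiB) {
      pageGrowth += cap;
    }
    size_t doubling = 0;       // double each time, 4 KiB -> 256 KiB
    for (size_t cap = 4 * KiB; cap <= 256 * KiB; cap *= 2) {
      doubling += cap;
    }
    printf("page-at-a-time: %zu KiB, doubling: %zu KiB\n",
           pageGrowth / KiB, doubling / KiB);   // 8320 KiB vs. 508 KiB
    return 0;
  }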
Assignee: nobody → n.nethercote
Depends on: 1017418
When growing an existing nsTArray, I'm using a doubling strategy up to 8 MiB,
and then growing by a factor of at least 1.125 each time after that.

I've left unchanged the allocate-a-new-buffer-when-one-doesn't-exist case, in
order to avoid the mysterious ICS emulator crashes that have prevented bug
1017418 from landing. In that case, the existing code just allocates the exact
amount requested. One could argue that this is a good thing, because this is
ideal behaviour for arrays that don't grow any further. But I don't have any
data about how often that occurs.
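For reference, a condensed sketch of that growth policy, with illustrative names and constants (the real change is in nsTArray_base::EnsureCapacity()):

  // Sketch only: doubling below the threshold, >=1.125x growth above it,
  // rounded up to whole pages.
  #include <algorithm>
  #include <cstddef>

  size_t ChooseAllocSize(size_t aCurrBytes, size_t aReqBytes)
  {
    const size_t slowGrowthThreshold = 8 * 1024 * 1024;   // 8 MiB
    const size_t pageSize = 4096;

    size_t bytesToAlloc;
    if (aReqBytes >= slowGrowthThreshold) {
      size_t minNewSize = aCurrBytes + (aCurrBytes >> 3);  // grow by at least 1.125x
      bytesToAlloc = std::max(aReqBytes, minNewSize);
      // Round up to the next multiple of pageSize.
      bytesToAlloc = pageSize * ((bytesToAlloc + pageSize - 1) / pageSize);
    } else {
      // Doubling: round the request up to the next power of two.
      bytesToAlloc = 8;
      while (bytesToAlloc < aReqBytes) {
        bytesToAlloc *= 2;
      }
    }
    return bytesToAlloc;
  }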

Anyway, this is a clear win for the resizing case, which is where all the heap
churn comes from. Here is a DMD report showing *cumulative* allocations, rather
than *live* allocations -- I have a special local DMD build for this which is
not yet in the repository -- prior to this patch:

> Cumulative {
>   54 blocks in heap block record 31 of 3,324
>   6,082,560 bytes (6,082,560 requested / 0 slop)
>   Individual block sizes: 221,184; 217,088; 212,992; 208,896; 204,800; 200,704; 196,608; 192,512; 188,416; 184,320; 180,224; 176,128; 172,032; 167,936; 163,840; 159,744; 155,648; 151,552; 147,456; 143,360; 139,264; 135,168; 131,072; 126,976; 122,880; 118,784; 114,688; 110,592; 106,496; 102,400; 98,304; 94,208; 90,112; 86,016; 81,920; 77,824; 73,728; 69,632; 65,536; 61,440; 57,344; 53,248; 49,152; 45,056; 40,960; 36,864; 32,768; 28,672; 24,576; 20,480; 16,384; 12,288; 8,192; 4,096
>   0.48% of the heap (66.57% cumulative)
>   Allocated at {
>     #01: nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned long, unsigned long) (/home/njn/moz/mi5/co64dmd/toolkit/crashreporter/../../dist/include/nsTArray-inl.h:187)
>     #02: mozilla::dom::MemoryReport* nsTArray_Impl<mozilla::dom::MemoryReport, nsTArrayInfallibleAllocator>::AppendElement<mozilla::dom::MemoryReport&>(mozilla::dom::MemoryReport&) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../dist/include/nsTArray.h:1333)
>     #03: mozilla::dom::MemoryReportCallback::Callback(nsACString_internal const&, nsACString_internal const&, int, int, long, nsACString_internal const&, nsISupports*) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../../dom/ipc/ContentChild.cpp:797)
>   }
> }

The growth of a single array from 4 KiB up to 216 KiB is obvious.

And after the patch is applied:

> Cumulative {
>   ~8 blocks in heap block record 222 of 3,457
>   ~524,285 bytes (~524,285 requested / ~0 slop)
>   0.04% of the heap (88.80% cumulative)
>   Individual block sizes: 262,144; 131,072; 65,536; 32,768; 16,384; 8,192; 4,096; ~4,093
>   Allocated at {
>     #01: nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned long, unsigned long) (/home/njn/moz/mi5/co64dmd/toolkit/crashreporter/../../dist/include/nsTArray-inl.h:199)
>     #02: mozilla::dom::MemoryReport* nsTArray_Impl<mozilla::dom::MemoryReport, nsTArrayInfallibleAllocator>::AppendElement<mozilla::dom::MemoryReport&>(mozilla::dom::MemoryReport&) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../dist/include/nsTArray.h:1346)
>     #03: mozilla::dom::MemoryReportCallback::Callback(nsACString_internal const&, nsACString_internal const&, int, int, long, nsACString_internal const&, nsISupports*) (/home/njn/moz/mi5/co64dmd/dom/ipc/../../../dom/ipc/ContentChild.cpp:797)
>   }
> }

Much better. The url-classifier in particular will benefit from this change
because it has lots of large (200 KiB+) arrays grown like this.
Attachment #8510149 - Flags: review?(nfroyd)
No longer depends on: 1017418
> When growing an existing nsTArray, I'm using a doubling strategy up to 8 MiB,
> and then increasing by at least 1.125 each time after that.

32-bit Windows is going to be pretty unhappy about that. Once you approach the 1 MiB mark, it will be difficult to find a sufficient free region and/or they will be taking away nicely-sized blocks from future allocations. You'll want to watch crash-stats closely and have some backup plans ready.
The JS engine has used this exact strategy for allocating object slots for years.
(In reply to Nicholas Nethercote [:njn] from comment #28)
> The JS engine has used this exact strategy for allocating object slots for
> years.

One thinks that the number of slots in JS objects and the typical size of nsTArrays might not be comparable.  Though I guess judging by comment 9, "object slots" can be properly read as "array elements"?
jemalloc rounds all huge allocations up to the nearest MiB, so it would be nice if you could take advantage of the full allocated size. All huge allocations are also MiB aligned, so I don't see that causing any problems. It's more a tradeoff between moving the data around over and over again if you grow by fixed increments (say an MiB each time), or failing to allocate because you grew by too much. Perhaps you could have a fallback path where if growing by 1.125 times fails, you try again with the exact amount rounded up to the nearest MiB.
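A sketch of that fallback idea (realloc() here stands in for whatever allocator hook nsTArray actually uses; nothing like this is claimed to have landed):

  // Sketch only: if the 1.125x-grown size cannot be allocated, retry with the
  // exact request rounded up to jemalloc's 1 MiB huge-allocation granularity.
  #include <cstdlib>

  void* GrowHugeBuffer(void* aPtr, size_t aReqBytes, size_t aGrownBytes)
  {
    const size_t MiB = 1024 * 1024;
    if (void* p = realloc(aPtr, aGrownBytes)) {   // try the grown size first
      return p;
    }
    size_t fallbackBytes = MiB * ((aReqBytes + MiB - 1) / MiB);
    return realloc(aPtr, fallbackBytes);
  }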
> One thinks that the number of slots in JS objects and the typical size of
> nsTArrays might not be comparable.

JS object slots are 8 bytes each, and the slower growth starts at 1 million slots, i.e. 8 MiB. So the behaviour is the same. Indeed, that's why I chose 8 MiB as the threshold for the slower growth in this patch.

Or maybe you're talking about typical sizes. In my experience they're both generally power-law-ish: many small buffers, a few large ones.

> jemalloc rounds all huge allocations up to the nearest MiB, so it would be
> nice if you could take advantage of the full allocated size.

mozjemalloc is set up so that huge allocations round up to the nearest MiB in terms of virtual address space, but round up to the nearest page size in terms of committed space. So the page-sized rounding (which was already present) is there to optimize for committed space. But I am happy to change it to MiB rounding to optimize for virtual address space.

Finally, every other class with a growable buffer that I can think of in Mozilla code uses a doubling strategy: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and XDRBuffer. Oh and pdf.js uses it in various places with typed arrays (which unlike vanilla JS arrays are fixed length so you have to handle growth yourself). It's not a radical technique.
Attachment #8470650 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #31)
> mozjemalloc is set up so that huge allocations round up to the nearest MiB
> in terms of virtual address space, but round up to the nearest page size in
> terms of committed space. So the page-sized rounding (which was already
> present) is there to optimize for committed space. But I am happy to change
> it to MiB rounding to optimize for virtual address space.

Right. The OS still shouldn't commit physical pages until you actually touch them, so as long as nsTArray doesn't do anything silly like explicitly nulling the allocated region, optimizing for virtual address space should be optimal.
Comment on attachment 8510149 [details] [diff] [review]
Use exponential growth when growing an nsTArray

Review of attachment 8510149 [details] [diff] [review]:
-----------------------------------------------------------------

Part of me wants to limit the slowGrowthThreshold to 1 MB or thereabouts, but let's keep the 8 MB threshold, given JS behavior, and see if we encounter problems.

::: xpcom/glue/nsTArray-inl.h
@@ +152,4 @@
>    size_t bytesToAlloc;
> +  if (reqSize >= slowGrowthThreshold) {
> +    size_t currSize = sizeof(Header) + Capacity() * aElemSize;
> +    size_t minNewSize = currSize + (currSize >> 3); // multiple by 1.125

Nit: "multiply by"

@@ +158,3 @@
>      // Round up to the next multiple of pageSize.
> +    const size_t pageSize = 4096;
> +    bytesToAlloc = pageSize * ((bytesToAlloc + pageSize - 1) / pageSize);

I am slightly impressed that this gets compiled into the expected:

  (bytesToAlloc + pageSize - 1) & ~(pageSize - 1)

(at least on GCC).
Attachment #8510149 - Flags: review?(nfroyd) → review+
I did five AWSY runs with and without this patch applied:

- https://areweslimyet.com/?series=njn-1048044-a#evenspacing
- https://areweslimyet.com/?series=njn-1048044-b#evenspacing
- https://areweslimyet.com/?series=njn-1048044-c#evenspacing
- https://areweslimyet.com/?series=njn-1048044-d#evenspacing
- https://areweslimyet.com/?series=njn-1048044-e#evenspacing

99e317210668 is the "no change" revision, and 8409f65b9769 is the revision with
the nsTArray modifications. Note that the latter is on the *left* in the 
graphs.

The RSS results are suggestive that the patch makes an improvement, but the
noise level is high enough that it's not conclusive:

> RSS: Fresh start (light blue)
> - old: 170.91, 166.50, 170.20, 163.92, 181.40; avg=170.59
> - new: 161.38, 176.97, 162.01, 166.57, 161.54; avg=165.69
>
> RSS: After TP5 (pink)
> - old: 398.34, 399.64, 410.50, 405.28, 414.55; avg=405.66
> - new: 397.00, 405.06, 393.71, 396.21, 398.25; avg=398.05
>
> RSS: After TP5, tabs closed [+30s, forced GC] (purple)
> - old: 264.80, 234.57, 236.35, 261.69, 250.74; avg=249.63
> - new: 255.31, 240.29, 228.78, 255.25, 233.36; avg=242.60

But if you drill down to the heap-overhead numbers, the picture is much
clearer:

> RSS: After TP5, tabs closed [+30s, forced GC] (purple)
> - old: 74.43, 74.43, 74.43, 74.43, 74.43
> - new: 70.13, 70.13, 70.13, 70.13, 70.13

A 4.30 MiB improvement, yay! 2.59 MiB of the improvement is in "bin-unused",
1.66 MiB is in "page-cache", and 0.05 MiB is in "bookkeeping".
https://hg.mozilla.org/mozilla-central/rev/ac0d7bf37abc
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Whiteboard: [MemShrink]