Bug 685783 - Take advantage of slop bytes in js::Vector
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P2]
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: x86_64 Linux
Importance: -- normal
Target Milestone: mozilla21
Assigned To: Nicholas Nethercote [:njn]
Reported: 2011-09-08 21:31 PDT by Nicholas Nethercote [:njn]
Modified: 2013-02-15 12:10 PST


Attachments

patch to adjust newcap, v0 (3.17 KB, patch)
2011-09-08 23:22 PDT, Nicholas Nethercote [:njn]

alternative patch that modifies calculateNewCapacity, v0 (1.46 KB, patch)
2011-09-09 00:11 PDT, Nicholas Nethercote [:njn]
luke: feedback+

proper patch (1.51 KB, patch)
2011-10-18 18:11 PDT, Nicholas Nethercote [:njn]
luke: review+

new patch (2.20 KB, patch)
2011-10-20 17:19 PDT, Nicholas Nethercote [:njn]
luke: review-

Avoid slop in js::Vector when the element size is not a power of two. (2.22 KB, patch)
2013-02-07 20:10 PST, Nicholas Nethercote [:njn]

Avoid slop in js::Vector when the element size is not a power of two. (10.91 KB, patch)
2013-02-11 19:50 PST, Nicholas Nethercote [:njn]
luke: review+

Description Nicholas Nethercote [:njn] 2011-09-08 21:31:25 PDT
js::Vector uses a doubling growth strategy, which is a good idea in general.  In particular, when the element size is a power-of-two, you end up with zero slop bytes from jemalloc.

But when the element size is not a power-of-two, you see some waste.  Here is a summary of the waste from all js::Vectors when loading Gmail and TechCrunch:

22315170 counts:
( 1) 4546048 (20.4%, 20.4%):    4864 ->   8192 (  3328) 
( 2) 4446208 (19.9%, 40.3%):    2432 ->   4096 (  1664) 
( 3) 2840448 (12.7%, 53.0%):    1216 ->   2048 (   832) 
( 4) 2441216 (10.9%, 64.0%):    3072 ->   4096 (  1024) 
( 5) 2244608 (10.1%, 74.0%):    6144 ->   8192 (  2048) 
( 6) 1679872 ( 7.5%, 81.6%):    1536 ->   2048 (   512) 
( 7) 1487360 ( 6.7%, 88.2%):    9728 ->  12288 (  2560) 
( 8) 454912 ( 2.0%, 90.3%):     768 ->   1024 (   256) 
( 9) 358400 ( 1.6%, 91.9%):    3584 ->   4096 (   512) 
(10) 317184 ( 1.4%, 93.3%):    1792 ->   2048 (   256) 
(11) 311296 ( 1.4%, 94.7%):    7168 ->   8192 (  1024) 
(12) 297216 ( 1.3%, 96.0%):     640 ->   1024 (   384) 
(13) 198912 ( 0.9%, 96.9%):    1280 ->   2048 (   768) 
(14) 159136 ( 0.7%, 97.6%):      24 ->     32 (     8) 
(15) 118272 ( 0.5%, 98.1%):    2560 ->   4096 (  1536) 
(16) 103424 ( 0.5%, 98.6%):   19456 ->  20480 (  1024) 
(17)  96256 ( 0.4%, 99.0%):   38912 ->  40960 (  2048) 
(18)  67584 ( 0.3%, 99.3%):   14336 ->  16384 (  2048) 
(19)  64512 ( 0.3%, 99.6%):    2304 ->   4096 (  1792) 
(20)  36864 ( 0.2%, 99.8%):    5120 ->   8192 (  3072) 
(21)  36736 ( 0.2%,100.0%):    1152 ->   2048 (   896) 
(22)   3584 ( 0.0%,100.0%):    4608 ->   8192 (  3584) 
(23)   3072 ( 0.0%,100.0%):    9216 ->  12288 (  3072) 
(24)   2048 ( 0.0%,100.0%):   10240 ->  12288 (  2048) 
(25)      2 ( 0.0%,100.0%):       1 ->      2 (     1) 

How to read this:  

- From the first line:  22,315,170 slop bytes were allocated in total.  (But note that they weren't all live at the same time.)

- From the second line:  the case responsible for the biggest fraction (20.4%) of the waste was allocations of 4864 bytes.  These get rounded up to 8192, wasting 3328 bytes.  The total waste of all such allocations was 4,546,048 (which is 1366 * 3328).

- The remaining lines are like the second line.

I used Valgrind's stack tracing ability to determine that the 4864-sized cases were all from the YarrGenerator::m_ops vector, which has element type YarrOp.  sizeof(YarrOp) = 152 (on 64-bit), and 4864 / 152 = 32.  And it turns out that the 2nd and 3rd most common cases are also from m_ops -- notice that 1216 is half of 2432, which is half of 4864.

----

So, what to do about this?  If we can take advantage of moz_malloc_usable_size, we can utilize a lot of the wasted space.  For example, in the 4864 case we can round up to 8192 and squeeze in 53 YarrOp elements (because 53 * 152 = 8056).

Now, using moz_malloc_usable_size in the JS engine is difficult; see bug 676732 for details.  In that bug I ended up working around the difficulties by passing in a pointer to moz_malloc_usable_size when necessary from xpconnect, but that won't work here.  Instead we could create a new JSAPI function to register a JSUsableSize function at start-up (thanks to Igor for that idea).  If no such function were registered, we'd fall back to the current behaviour.
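
To make this concrete, here is a minimal sketch of usable-size-aware growth (the names are hypothetical stand-ins; JSUsableSizeFun plays the role of the registered JSUsableSize function described above, and the real js::Vector code is more involved):

    #include <cstdlib>

    typedef size_t (*JSUsableSizeFun)(void *p);

    // Grow buf to hold at least newCap elements, then claim any slop bytes
    // the allocator handed back.  If no usable-size function was registered,
    // keep the requested capacity -- the current behaviour.
    template <typename T>
    bool growTo(T *&buf, size_t &capacity, size_t newCap, JSUsableSizeFun usableSize)
    {
        T *p = static_cast<T *>(realloc(buf, newCap * sizeof(T)));
        if (!p)
            return false;
        if (usableSize) {
            // e.g. a 32-element request with sizeof(T) == 152 asks for 4864
            // bytes; jemalloc returns an 8192-byte block, so we claim
            // 8192 / 152 = 53 elements instead of 32.
            newCap = usableSize(p) / sizeof(T);
        }
        buf = p;
        capacity = newCap;
        return true;
    }

(This sketch assumes trivially-movable elements, since realloc moves bytes blindly; the real Vector handles non-POD element types separately.)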

(Note that js::Vector is not the only case where "number of elements is a power-of-two but the element size is not a power-of-two" is a significant cause of slop bytes.  I've also seen it in nsTArray, which is very similar to js::Vector.  And also in jshashtable.cpp and pldhash.cpp, but the malloc_usable_size trick won't work there AFAICT because those hashtables must have a power-of-two number of elements.  I've only filed this bug so far, however.)
Comment 1 Nicholas Nethercote [:njn] 2011-09-08 21:41:12 PDT
BTW, m_ops is defined like this:

    Vector<YarrOp, 128> m_ops;

This asks for the first 128 elements to be stored inline.  But at 152 bytes per element, this is outrageously optimistic -- js::Vector allows a maximum of only 1024 bytes of inline capacity, which gives 6 YarrOp elements.  

That explains why the 1216, 2432, and 4864 byte cases are common -- they correspond to 8, 16 and 32 elements respectively -- but there is no 608 byte case.  Oh, and the 9728, 19456 and 38912 byte cases are all from m_ops too, and correspond to 64, 128 and 256 elements.
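
As a small illustration of that clamp (constants made up here to match the numbers above; 152 is sizeof(YarrOp) on 64-bit):

    const size_t maxInlineBytes = 1024;   // js::Vector's inline-storage cap
    const size_t requestedElems = 128;
    const size_t elemSize = 152;

    // 128 * 152 = 19456 bytes blows the 1024-byte budget, so the effective
    // inline capacity collapses to 1024 / 152 = 6 elements.
    const size_t inlineCapacity =
        requestedElems * elemSize <= maxInlineBytes
        ? requestedElems
        : maxInlineBytes / elemSize;

giving inlineCapacity == 6, matching the 6 YarrOp elements mentioned above.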
Comment 2 Nicholas Nethercote [:njn] 2011-09-08 23:22:07 PDT
Created attachment 559390 [details] [diff] [review]
patch to adjust newcap, v0

The obvious way to tackle this is in two patches.  The first would add the malloc_usable_size JSAPI setter; the second would use it.  But the first patch is boring, so I did the second one first.  (I hardcoded a call to malloc_usable_size() in it, which works on Linux, but won't work in general.)

Here's some numbers:

14592180 counts:
( 1) 2618960 (17.9%, 17.9%): vec<152>:     16 ->     26 ( 1520)
( 2) 2419536 (16.6%, 34.5%): vec<152>:     32 ->     53 ( 3192)
( 3) 2371960 (16.3%, 50.8%): vec<152>:      8 ->     13 (  760)
( 4) 1762992 (12.1%, 62.9%): vec< 48>:     64 ->     85 ( 1008)
( 5) 1538208 (10.5%, 73.4%): vec< 48>:    128 ->    170 ( 2016)
( 6) 1397760 ( 9.6%, 83.0%): vec< 48>:     32 ->     42 (  480)
( 7) 389120 ( 2.7%, 85.7%): vec<152>:     64 ->     80 ( 2432)
( 8) 331440 ( 2.3%, 87.9%): vec< 48>:     16 ->     21 (  240)
( 9) 257600 ( 1.8%, 89.7%): vec<112>:     32 ->     36 (  448)
(10) 249760 ( 1.7%, 91.4%): vec<112>:     16 ->     18 (  224)
(11) 224784 ( 1.5%, 92.9%): vec<112>:     64 ->     73 ( 1008)
(12) 207480 ( 1.4%, 94.4%): vec< 76>:     16 ->     26 (  760)
(13) 148160 ( 1.0%, 95.4%): vec< 80>:      8 ->     12 (  320)
(14) 123120 ( 0.8%, 96.2%): vec< 80>:     16 ->     25 (  720)
(15) 100700 ( 0.7%, 96.9%): vec< 20>:     32 ->     51 (  380)
(16)  89520 ( 0.6%, 97.5%): vec< 24>:     32 ->     42 (  240)
(17)  54264 ( 0.4%, 97.9%): vec< 76>:     32 ->     53 ( 1596)
(18)  51072 ( 0.3%, 98.2%): vec<152>:    128 ->    134 (  912)
(19)  47376 ( 0.3%, 98.6%): vec< 24>:    128 ->    170 ( 1008)
(20)  36288 ( 0.2%, 98.8%): vec< 24>:     64 ->     85 (  504)
(21)  36288 ( 0.2%, 99.1%): vec< 72>:     16 ->     28 (  864)
(22)  36288 ( 0.2%, 99.3%): vec<112>:    128 ->    146 ( 2016)
(23)  27360 ( 0.2%, 99.5%): vec< 80>:     32 ->     51 ( 1520)
(24)  16200 ( 0.1%, 99.6%): vec< 40>:     16 ->     25 (  360)
(25)  12160 ( 0.1%, 99.7%): vec< 80>:     64 ->    102 ( 3040)
(26)   9880 ( 0.1%, 99.8%): vec<152>:    256 ->    269 ( 1976)
(27)   9804 ( 0.1%, 99.8%): vec< 76>:     64 ->    107 ( 3268)
(28)   7524 ( 0.1%, 99.9%): vec< 76>:    128 ->    161 ( 2508)
(29)   6120 ( 0.0%, 99.9%): vec< 24>:    256 ->    341 ( 2040)
(30)   3528 ( 0.0%,100.0%): vec< 72>:     64 ->    113 ( 3528)
(31)   3456 ( 0.0%,100.0%): vec< 72>:     32 ->     56 ( 1728)
(32)   1976 ( 0.0%,100.0%): vec< 76>:    256 ->    269 (  988)
(33)    960 ( 0.0%,100.0%): vec< 96>:     32 ->     42 (  960)
(34)    480 ( 0.0%,100.0%): vec< 96>:     16 ->     21 (  480)

Similar to before:

- First line:  14,592,180 bytes of waste was avoided.

- Second line:  The most common case involved a vector with element size of 152 (our friend m_ops).  The capacity would have been 16, but thanks to malloc_usable_size it becomes 26.  That avoids a total of 10 * 152 = 1520 bytes of waste each occurrence.  And there were 2,618,960 / 1520 = 1723 occurrences.

A few more data points.  Without the patch, we get these frequencies of the vec<152> cases:

(19)   3284 ( 1.3%, 81.8%): vec<152>:      8 ->      8 (     0)
(22)   2633 ( 1.1%, 85.3%): vec<152>:     16 ->     16 (     0)
(31)   1505 ( 0.6%, 92.0%): vec<152>:     32 ->     32 (     0)
(42)    728 ( 0.3%, 96.3%): vec<152>:     64 ->     64 (     0)
(57)    254 ( 0.1%, 98.8%): vec<152>:    128 ->    128 (     0)
(59)    204 ( 0.1%, 99.0%): vec<152>:    256 ->    256 (     0)

With the patch, we get these:

(20)   3128 ( 1.4%, 83.1%): vec<152>:      8 ->     13 (   760)
(28)   1730 ( 0.7%, 90.3%): vec<152>:     16 ->     26 (  1520)
(40)    765 ( 0.3%, 96.0%): vec<152>:     32 ->     53 (  3192)
(59)    162 ( 0.1%, 99.2%): vec<152>:     64 ->     80 (  2432)
(71)     56 ( 0.0%, 99.7%): vec<152>:    128 ->    134 (   912)
(101)     5 ( 0.0%,100.0%): vec<152>:    256 ->    269 (  1976)

The 2nd column is the key one here -- we end up with many fewer of the larger size vectors, e.g. five 256-element (well, 269-element) ones with the patch vs. 204 without the patch.

----

Luke, I've attached a rough patch -- what do you think?  Note that it still has some debugging printfs in it, the ones I used to generate the data above.

There's one thing I don't like about this patch.  I've been finding a lot of allocation waste through the very simple process of printing out, for every allocation done by jemalloc, the requested size, the actual size, and the difference between the two.  (That's how I found this issue and half a dozen others.)

This patch will interfere with that process, because that logging code doesn't know that malloc_usable_size is being used after-the-fact to take advantage of the slop bytes.  I'll have to manually filter them out.  One possible way to avoid this is to instead use the je_malloc_usable_size_in_advance function that bug 678977 will introduce, to increase the allocation request size in advance.  But that's jemalloc-specific and a bit ugly to boot.  Hmm.
Comment 3 Luke Wagner [:luke] 2011-09-08 23:49:52 PDT
I still need to look at the patch, but what about avoiding the whole usable_size route and just having Vector use the heuristic "when it is time to grow the vector, take the current capacity (in # elems), double it, convert that to bytes, round that up to a power of 2 and then further inflate the new capacity to use as much of the (predicted) slop as possible"?  That is, just bake in the assumption that a power-of-2 malloc size is desirable.  Is that a valid assumption?  Perhaps it breaks down for really big allocs?
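
A minimal sketch of that heuristic (self-contained; the real calculateNewCapacity also needs overflow checks, and RoundUpPow2 is written out here as a plain loop):

    #include <cstddef>

    // Round n up to the next power of two (n >= 1).
    static size_t RoundUpPow2(size_t n)
    {
        size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    // Double the capacity, convert to bytes, round the byte count up to a
    // power of two, then take every whole element that fits in that
    // (predicted) allocation.
    template <typename T>
    size_t newCapacityFor(size_t curCapacity)
    {
        size_t bytes = RoundUpPow2(curCapacity * 2 * sizeof(T));
        return bytes / sizeof(T);
    }

For example, with sizeof(T) == 152 and a current capacity of 32, this requests RoundUpPow2(9728) = 16384 bytes and yields a new capacity of 107 rather than 64.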
Comment 4 Nicholas Nethercote [:njn] 2011-09-09 00:08:45 PDT
jemalloc's size classes are as follows:

   // Small/Tiny: 2, 4, 8 (bytes)
   // Small/Quantum-spaced: 16, 32, 48, ..., 480, 496, 512 (bytes)
   // Small/Sub-page: 1, 2 (KiB)
   // Large: 4, 8, 12, ..., 1012, 1016, 1020 (KiB)
   // Huge: 1, 2, 3, ... (MiB)

So in the 512..8192 bytes range it's all powers-of-two, but not elsewhere.

(In reply to Luke Wagner [:luke] from comment #3)
> take the current capacity (in # elems), double it, convert
> that to bytes, round that up to a power of 2 and then further inflate the
> new capacity to use as much of the (predicted) slop as possible"?

Hmm, if you choose an ok size for your first heap allocation, this should work.  And since the maximum inline capacity is 1024 bytes, the biggest possible first heap allocation is 2048 bytes, so maybe this would work out.

Another possibility might be to change calculateNewCapacity so that it gets the number of bytes to a power-of-two instead of the capacity.
Comment 5 Nicholas Nethercote [:njn] 2011-09-09 00:11:25 PDT
Created attachment 559392 [details] [diff] [review]
alternative patch that modifies calculateNewCapacity, v0

It's 5:11pm on Friday here and I go on vacation on Monday, so here's a totally half-assed, untested patch that demonstrates the calculateNewCapacity alternative.  Assuming it works, it'll end up being simpler than the malloc_usable_size approach and won't screw up my slop finding process.
Comment 6 Igor Bukanov 2011-09-09 02:17:13 PDT
(In reply to Nicholas Nethercote [:njn] (on vacation Sep 12--Oct 17) from comment #0)
> Now, using moz_malloc_usable_size in the JS engine is difficult, see bug
> 676732 for details.

Hm, I think instead of moz_malloc_usable_size we can get simpler code in Vector if we had a function that rounds the allocation size up to the nearest malloc size class, like moz_malloc_roundup_allocation_size. To make it fast we can provide in jemalloc.h an inline implementation of such a function, together with a suitable data structure that depends on the runtime parameters that jemalloc uses. xpconnect can initialize the structure and pass a pointer to it to the JS engine. Then Vector can do the following:

#include "../../memory/jenalloc.h"

size_t newAllocationSize = ...;
if (mallocRoundUpPtr)
    newAllocationSize = moz_malloc_roundup_allocation_size(newAllocationSize);

where moz_malloc_roundup_allocation_size is a fast inline function in jemalloc.h.  There would be little overhead in using that compared with an indirect function call.

> jemalloc's size classes are as follows:
> 
>    // Small/Tiny: 2, 4, 8 (bytes)
>    // Small/Quantum-spaced: 16, 32, 48, ..., 480, 496, 512 (bytes)
>    // Small/Sub-page: 1, 2 (KiB)
>    // Large: 4, 8, 12, ..., 1012, 1016, 1020 (KiB)
>    // Huge: 1, 2, 3, ... (MiB)

Note that this is with the default values of jemalloc's parameters. They can all be changed at runtime using an environment variable.
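
For concreteness, here is a purely illustrative roundup that hard-codes the default size classes listed above (a real implementation would, per the previous paragraph, read the parameters from a structure initialized at startup):

    #include <cstddef>

    static inline size_t roundup_allocation_size(size_t n)
    {
        if (n <= 8) {                           // Small/Tiny: 2, 4, 8
            size_t c = 2;
            while (c < n)
                c <<= 1;
            return c;
        }
        if (n <= 512)                           // Quantum-spaced: 16, 32, ..., 512
            return (n + 15) & ~size_t(15);
        if (n <= 2048)                          // Sub-page: 1 KiB, 2 KiB
            return n <= 1024 ? 1024 : 2048;
        if (n <= 1020 * 1024)                   // Large: 4 KiB steps up to 1020 KiB
            return (n + 4095) & ~size_t(4095);
        const size_t MiB = size_t(1) << 20;     // Huge: 1 MiB steps
        return (n + MiB - 1) & ~(MiB - 1);
    }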
Comment 7 Luke Wagner [:luke] 2011-09-15 16:42:45 PDT
Comment on attachment 559392 [details] [diff] [review]
alternative patch that modifies calculateNewCapacity, v0

Yep, like that.
Comment 8 Nicholas Nethercote [:njn] 2011-10-18 18:11:49 PDT
Created attachment 567947 [details] [diff] [review]
proper patch

Here it is, done properly.  Simple.
Comment 9 Luke Wagner [:luke] 2011-10-18 20:06:10 PDT
Comment on attachment 567947 [details] [diff] [review]
proper patch

Indeed.
Comment 10 Justin Lebar (not reading bugmail) 2011-10-18 20:11:33 PDT
> +     * rounded up by the allocator.  (Asking 2^N elements isn't as good,

Drive by nit: Asking *for* 2^N elements...
Comment 11 Nicholas Nethercote [:njn] 2011-10-19 22:56:04 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/56ec5e954858
Comment 12 Marco Bonardo [::mak] 2011-10-20 16:04:40 PDT
Hm, I had actually merged this to central but somehow forgot to mark it. Btw, I see that it has now been backed out from inbound.
Comment 13 Nicholas Nethercote [:njn] 2011-10-20 16:25:18 PDT
Backed out due to a Dromaeo regression:

https://hg.mozilla.org/integration/mozilla-inbound/rev/6f0c66b2ce76

I think this is because it can cause some vectors to grow more slowly, resulting in more calls to malloc.

For example, consider a vector with sizeof(T) == 12 and a length of 16 (and thus a size of 192 bytes).  If we need to increase its length by one, the old code would double the capacity to 32 (384 bytes).  The new code would only increase it to 21 (256 bytes).
Comment 14 Nicholas Nethercote [:njn] 2011-10-20 17:19:01 PDT
Created attachment 568565 [details] [diff] [review]
new patch

This patch avoids the slow growth problem.  For the example in comment 13, it would increase the capacity from 16 to 42 (42 = 512 / 12).  More details are in the patch's comments.

The fact that the overflow check becomes stricter means that a Vector's maximum size is less than it was previously.  Does this seem like a problem?
Comment 15 Luke Wagner [:luke] 2011-10-20 18:45:34 PDT
Hmm, does try server confirm your hypothesis?  I am doubtful and here's why:

Yes, the sizeof(T) = 12, capacity = 16 case does less-than-double, but capacity won't ever be 16 when sizeof(T) = 12; the progression will be: 1 2 5 10 21 42 85 170 341... and that's not just the special case of sizeof(T) = 12; one can prove the capacity (in units of elements) always doubles or doubles+1... (I think :).

My guess is that the extra imul and idiv on the critical newCap path is costing.  While reviewing this I thought "surely calculateNewCapacity is not being called enough to matter; after all, shouldn't that happen only a few times per Vector?".  This assumption would be false if you have code like:

    Vector<int> v;
    v.append(1);  // calculateNewCapacity 0->1
    v.append(1);  // calculateNewCapacity 1->2
    v.append(1);  // calculateNewCapacity 2->4

that gets called repeatedly.  (Basically, code that should have specified a bigger inline capacity or called reserve().)

Since your new patch makes the critical path even slower, the Talos results should tell us if my guess is correct (if the new patch is slower than the original).

Is there any hacker's delight trickery to compute the same result with less expensive ops?  None spring to mind, but I'm not a bit master.  Alternatively, you could see which subtest got worse, JS_NEVER_INLINE calculateNewCapacity, profile it, and the callstack should point a finger at who is calling calculateNewCapacity so darn much.  Hey, if it's hot enough to regress, you could get an actual speedup by setting the inline/reserve capacity correctly!
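
The progression claimed above is easy to reproduce with a standalone simulation of the grow-by-one case (a sketch, not the actual Vector code):

    #include <cstdio>
    #include <cstddef>

    static size_t roundUpPow2(size_t n)
    {
        size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    int main()
    {
        // Repeated append() with sizeof(T) == 12: each growth rounds the
        // byte size for length+1 elements up to a power of two and keeps
        // every element that fits.
        const size_t elemSize = 12;
        size_t cap = 1;
        for (int i = 0; i < 9; i++) {
            printf("%zu ", cap);   // prints: 1 2 5 10 21 42 85 170 341
            cap = roundUpPow2((cap + 1) * elemSize) / elemSize;
        }
        putchar('\n');
        return 0;
    }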
Comment 16 Nicholas Nethercote [:njn] 2011-10-20 19:04:25 PDT
(In reply to Luke Wagner [:luke] from comment #15)
> Hmm, does try server confirm your hypothesis?  I am doubtful and here's why:
> 
> Yes, the sizeof(T) = 12, capacity = 16 case does less-than-double, but
> capacity won't ever be 16 when sizeof(T) = 12; the progression will be: 1 2
> 5 10 21 42 85 170 341...

The inline capacity can be 16, i.e. |Vector<T, 16>| where sizeof(T) is 12.  It's a real example that I took from some debugging printfs.

> My guess is that the extra imul and idiv on the critical newCap path is
> costing.  While reviewing this I thought "surely calculateNewCapacity is not
> being called enough to matter; after all, shouldn't that happen only a few
> times per Vector?".  This assumption would be false if you have code like:
> 
>     Vector<int> v;
>     v.append(1);  // calculateNewCapacity 0->1
>     v.append(1);  // calculateNewCapacity 1->2
>     v.append(1);  // calculateNewCapacity 2->4
> 
> that gets called repeatedly.  (Basically, code that should have specified a
> bigger inline capacity or called reserve().)

But the rate of calls to calculateNewCapacity in even that pathological case will drop off exponentially.

I measured SunSpider with Cachegrind; the instruction count changes were negligible.  The impact of imul and idiv will be under-represented by instruction counts... but there are only 13,827 calls to this function in all of SunSpider.
 
As for fast bit-twiddling tricks... they're unlikely when you don't know sizeof(T) in advance.

Now I'm wondering if going back to the malloc_usable_size approach is a better idea.
Comment 17 Luke Wagner [:luke] 2011-10-20 21:12:49 PDT
(In reply to Nicholas Nethercote [:njn] from comment #16)
> The inline capacity can be 16, i.e. |Vector<T, 16>| where sizeof(T) is 12. 
> It's a real example that I took from some debugging printfs.

Sigh.  Yes, this can happen once when transitioning from in-line to out-of-line storage.

> > My guess is that the extra imul and idiv on the critical newCap path is
> > costing.  While reviewing this I thought "surely calculateNewCapacity is not
> > being called enough to matter; after all, shouldn't that happen only a few
> > times per Vector?".  This assumption would be false if you have code like:
> > 
> >     Vector<int> v;
> >     v.append(1);  // calculateNewCapacity 0->1
> >     v.append(1);  // calculateNewCapacity 1->2
> >     v.append(1);  // calculateNewCapacity 2->4
> > 
> > that gets called repeatedly.  (Basically, code that should have specified a
> > bigger inline capacity or called reserve().)
> 
> But the rate of calls to calculateNewCapacity in even that pathological case
> will drop off exponentially.

No, my point was if those were the *only* append() calls on that vector before it was destroyed.  Rinse, repeat: now you have a linear number of calls to calculateNewCapacity, not logarithmic.
 
> I measured Sunspider with Cachegrind, the instruction count changes were
> negligible.  The impact of imul and idiv will be under-represented by
> instruction counts... but there's only 13,827 calls to this function in all
> of SunSpider.

Was it Dromaeo SunSpider that regressed?
  
> As for fast bit-twiddling tricks... they're unlikely when you don't know
> sizeof(T) in advance.

I actually had an idea along this line, but I'm not going to do anything until you push your patch to try and see if it either fixes the regression or makes it worse.

> Now I'm wondering if going back to the malloc_usable_size approach is a
> better idea.

Let's see what try says about the patch you posted.
Comment 18 Nicholas Nethercote [:njn] 2011-10-20 21:25:55 PDT
> Was it Dromaeo SunSpider that regressed?

No, it was Dromaeo DOM.
Comment 19 Guilherme Lima 2011-10-21 06:24:36 PDT
Kraken-astar and Kraken-fft regressed as well with TI disabled. Do you guys worry about this setting (TI-off)?
Comment 20 Nicholas Nethercote [:njn] 2013-02-07 20:10:34 PST
Created attachment 711671 [details] [diff] [review]
Avoid slop in js::Vector when the element size is not a power of two.

This bug has lain dormant for over a year, so here's a reminder of what
happened...

I landed a patch that made js::Vector resize itself in a way that avoided slop
bytes for vectors with elements whose size is not a power-of-two.  It
supposedly regressed Dromaeo DOM and it wasn't backed out.  Luke suggested that
it was the multiplication and division that I added that was the cause, and
wanted me to do a Talos run.

Back to the present... I just did a Talos run
(http://perf.snarkfest.net/compare-talos/index.html?oldRevs=f530dc7e68ec&newRev=1581a91bbdfb&submit=true) and the results were noisy rubbish as usual.

Anyway, attached is a patch which reduces the number of multiply/divides by
more than half.  (It avoids the overflow checks in those cases, too.)
Comment 21 Luke Wagner [:luke] 2013-02-07 21:41:59 PST
Ah, this old bug :)  I was looking at Vector again (it's been a while) in the context of "optimizing the common case of calculateNewCapacity", and I noticed a few things:

What makes calculateNewCapacity have to do so much work is the dynamic 'inc' value; if we know inc is 1, we don't need to do integer division: we can inductively assume that there are < sizeof(T) slop bytes for the current allocation so, when we double the storage, we can gain at most one extra element from the slop bytes.  This lets us write:

  if (length & tl::MulOverflowMask<2*sizeof(T)>::result)
     return false;
  size_t newCap = length * 2;
  size_t bytes = newCap * sizeof(T);
  newCap += size_t(RoundUpPow2(bytes) - bytes > sizeof(T));

which saves an integer division and comparison.

Furthermore, 'inc = 1' would, I'd think (but please confirm with measurement), be like 99% of calls (coming from append()).  This suggests having an 'inc == 1' branch in calculateNewCapacity.  We'd hope this branch was free (inc being a constant after inlining), but growStorageBy is JS_NEVER_INLINE, so I don't think it will be.  So, rather than doing that, I think it would be a good idea to write a new "growStorageByOne" path that is called by append.  What's great is that this opens a second optimization which generalizes your length = 0 special case: when we do the first inline-storage-to-heap allocation (length = 0 being a subcase of this) AND we know the increment is 1, we can statically compute newCap (see tl::RoundUpPow2<N>)!
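
A sketch of that statically-computed capacity (the struct name is made up; tl::RoundUpPow2<N> is the compile-time rounding helper mentioned above, assumed to expose the usual ::result member like tl::MulOverflowMask does):

    // For the inline-to-heap transition with inc == 1, both InlineCapacity
    // and sizeof(T) are compile-time constants, so the whole computation
    // folds to a constant and needs no runtime multiply or divide.
    template <typename T, size_t InlineCapacity>
    struct FirstHeapCapacity
    {
        static const size_t requestedBytes = 2 * InlineCapacity * sizeof(T);
        static const size_t allocBytes = tl::RoundUpPow2<requestedBytes>::result;
        static const size_t result = allocBytes / sizeof(T);
    };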

With these changes, I'd bet we'd pretty well mitigate any negative performance of the div on the non-append() case and maybe even get a speedup.
Comment 22 Nicholas Nethercote [:njn] 2013-02-07 22:03:36 PST
> Furthermore, 'inc = 1' would, I'd think (but please confirm with
> measurement) be like 99% of calls (coming from append()).

inc=1 does account for 98--99% of calls.  However, not all of them come from append().  Measuring just the calls that come from append(), it drops to 88% for Sunspider (though it's still 97% for V8bench).

Still, I reckon all the optimizations combined will suffice.  I'll do it on Monday.
Comment 23 Luke Wagner [:luke] 2013-02-08 08:24:27 PST
Not sure if it's worth it, but, once you have a growStorageByOne(), we could consider adding an 'incr == 1' test in growStorageBy() that calls growStorageByOne().
Comment 24 Nicholas Nethercote [:njn] 2013-02-10 16:23:35 PST
> What makes calculateNewCapacity have to do so much work is the dynamic
> 'inc' value; if we know inc is 1, we don't need to do integer division: we
> can inductively assume that there are < sizeof(T) slop bytes for the current
> allocation so, when we double the storage, we can gain at most one extra
> element from the slop bytes.  This lets us write:
> 
>   if (length & tl::MulOverflowMask<2*sizeof(T)>::result)
>      return false;
>   size_t newCap = length * 2;
>   size_t bytes = newCap * sizeof(T);
>   newCap += size_t(RoundUpPow2(bytes) - bytes > sizeof(T));
> 
> which saves an integer division and comparison.

That's not true if we have inline storage and are converting to heap storage.  For example, if sizeof(T) is 12 and someone awkwardly did Vector<T, 50>, we'd have 600 bytes of inline storage.  We'd double that to 1200, round up to 2048, which should leave us with 2048/12 = 170 elements.  Your calculation would give us 101 elements.


> I think it would be a good idea to write a new "growStorageByOne" path that is
> called by append.

If we do that we end up having two versions of each of growStorageBy, convertToHeapStorage, growHeapStorageBy, and calculateNewCapacity.  Given how many times this template is instantiated, I'm worried about code bloat.

My most recent patch has this comment:

     * Note that sizeof(T) is usually a power-of-two, in which case the
     * multiply and divide will just be shifts.

(I've checked the code generated by clang -- in a debug build -- and this is true.)

So there are three cases:  "fast", which is the length==0, incr==1 case;  "medium", which is when sizeof(T) is a power-of-two;  and "slow", which is when sizeof(T) is not a power-of-two.

Here are some frequencies.

SUNSPIDER
42356 counts:
( 1)  23214 (54.8%, 54.8%): fast
( 2)  17714 (41.8%, 96.6%): medium
( 3)   1428 ( 3.4%,100.0%): slow

V8BENCH
107631 counts:
( 1)  63270 (58.8%, 58.8%): fast
( 2)  39415 (36.6%, 95.4%): medium
( 3)   4946 ( 4.6%,100.0%): slow

SOME GENERAL BROWSING
451575 counts:
( 1) 291693 (64.6%, 64.6%): fast
( 2) 129229 (28.6%, 93.2%): medium
( 3)  30653 ( 6.8%,100.0%): slow

Also, the "slow" case is actually not as bad as it first seems, because it's multiplication and division by a constant.  For example, when sizeof(T) is 24, here's the multiplication, which is just a shift and a scaled add:

  5010ef:       48 8b 75 d0             mov    -0x30(%rbp),%rsi
  5010f3:       48 c1 e6 03             shl    $0x3,%rsi
  5010f7:       48 8d 34 76             lea    (%rsi,%rsi,2),%rsi

and here's the division:

  50110e:       48 be ab aa aa aa aa    movabs $0xaaaaaaaaaaaaaaab,%rsi
  501115:       aa aa aa
  501118:       48 89 45 a8             mov    %rax,-0x58(%rbp)
  50111c:       48 f7 e6                mul    %rsi
  50111f:       48 c1 ea 04             shr    $0x4,%rdx

I think this is a multiply-by-reciprocal?
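
That reading checks out: the magic constant 0xAAAAAAAAAAAAAAAB is ceil(2^68 / 24), and the `shr $0x4` supplies the last 4 bits of a 68-bit shift (the first 64 come from taking the high half of the 128-bit product in %rdx).  A quick standalone spot-check, using the gcc/clang 128-bit integer extension:

    #include <cassert>
    #include <cstdint>

    static uint64_t div24(uint64_t x)
    {
        const uint64_t M = 0xAAAAAAAAAAAAAAABull;  // ceil(2^68 / 24)
        unsigned __int128 p = (unsigned __int128)x * M;
        return (uint64_t)(p >> 68);                // == x / 24 for all 64-bit x
    }

    int main()
    {
        for (uint64_t x = 0; x < 1000000; x++)
            assert(div24(x) == x / 24);
        return 0;
    }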

Do you think the "slow" numbers are too high?  It's frustrating that this is held up by a possible regression that I can't measure meaningfully.
Comment 25 Luke Wagner [:luke] 2013-02-11 09:18:12 PST
(In reply to Nicholas Nethercote [:njn] from comment #24)
> That's not true if we have inline storage and are converting to heap
> storage.

I know; I was basing that statement on the inductive assumption that the previous heap allocation fit the length to a power-of-2 number of bytes.  The base case is handled by the other part of my comment where I stated that, for the convertToHeapStorage case, we can statically compute the target heap capacity.

> > I think it would be a good idea to write a new "growStorageByOne" path that is
> > called by append.
> 
> If we do that we end up having two versions of each of growStorageBy,
> convertToHeapStorage, growHeapStorageBy, and calculateNewCapacity.  Given
> how many times this template is instantiated, I'm worried about code bloat.

Come on, work with me a little; of course we shouldn't duplicate all that stuff (for maintenance as well as code bloat reasons).  IIUC, we can simply:
 - change growHeapStorageBy/convertToHeapStorage from taking a lengthInc to taking 'newCap'
 - have growStorageBy/growStorageByOne calculate the new capacity before calling the above

> It's frustrating that this is
> held up by a possible regression that I can't measure meaningfully.

With the above, I'm fairly certain this patch would have a net positive impact; at least certain enough to land and see what happens :)
Comment 26 Nicholas Nethercote [:njn] 2013-02-11 14:36:33 PST
> I was basing that statement on the inductive assumption that the
> previous heap allocation fit the length to a power-of-2 number of bytes. 
> The base case is handled by the other part of my comment where I stated
> that, for the convertToHeapStorage case, we can statically compute the
> target heap capacity.

By "the other part of my comment" do you mean "when we do the first inline-storage-to-heap allocation (length = 0 being a subcase of this) AND we know the increment is 1, we can statically compute newCap (see tl::RoundUpPow2<N>)!"?

That provides the base case when the increment is 1, but not otherwise.  Am I missing something?
Comment 27 Luke Wagner [:luke] 2013-02-11 15:00:28 PST
(In reply to Nicholas Nethercote [:njn] from comment #26)
> By "the other part of my comment" do you mean "when we do the first
> inline-storage-to-heap allocation (length = 0 being a subcase of this) AND
> we know the increment is 1, we can statically compute newCap (see
> tl::RoundUpPow2<N>)!"?

Correct

> That provides the base case when the increment is 1, but not otherwise.  Am
> I missing something?

If the increment isn't 1, then we'd take the existing growStorageBy path which calls calculateNewCapacity which (with your changes) will inflate length to fit a power of 2.  In either inline-to-heap transition path, we'd ensure the base case that there were < sizeof(T) slop bytes.
Comment 28 Nicholas Nethercote [:njn] 2013-02-11 19:50:52 PST
Created attachment 712781 [details] [diff] [review]
Avoid slop in js::Vector when the element size is not a power of two.

I think this does everything you want.

- I inlined growHeapStorageBy(), because it was trivial after removing the
  calculateNewCapacity call.

- I inlined calculateNewCapacity() because I had to.  This removes the
  STATIC_POSTCONDITION, oh well.

- growStorageBy()'s control flow is ugly, but I couldn't see how to improve it
  further.
Comment 29 Luke Wagner [:luke] 2013-02-11 21:44:05 PST
Comment on attachment 712781 [details] [diff] [review]
Avoid slop in js::Vector when the element size is not a power of two.

Review of attachment 712781 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks!

::: js/public/Vector.h
@@ +35,5 @@
> + * power-of-two as possible.  growStorageBy() is responsible for ensuring
> + * this.
> + */
> +template <typename T>
> +static bool CapacityHasExcessSpace(size_t cap) {

{ on newline

@@ +643,5 @@
> +
> +        /* This case occurs in ~15--20% of the calls to this function. */
> +
> +        /* Will newCap*sizeof(T) overflow? */
> +        if ((mLength + 1) & tl::MulOverflowMask<2 * sizeof(T)>::result) {

I think you can s/(mLength + 1)/mLength/ here.

@@ +647,5 @@
> +        if ((mLength + 1) & tl::MulOverflowMask<2 * sizeof(T)>::result) {
> +            this->reportAllocOverflow();
> +            return false;
> +        }
> +        /*

Newline after }

@@ +677,5 @@
> +    /*
> +     * Do not allow a buffer large enough that the expression ((char *)end() -
> +     * (char *)begin()) overflows ptrdiff_t. See Bug 510319.
> +     */
> +    if (newCap & tl::UnsafeRangeSizeMask<T>::result) {

I keep thinking we could fold this branch into the preceding tests (e.g., using tl::MulOverflowMask<4 * sizeof(T)>::result), but I haven't proven it to myself.  Feel free to do that, if you want, or not.

@@ +683,5 @@
> +        return false;
> +    }
> +
> +    if (usingInlineStorage())
> +        goto convert;

You're right, the control flow is hairy.  I guess your reasoning is that you don't want growTo/convertToHeapStorage to be inlined multiple times?

I think I'd rather read:

    if (usingInlineStorage()) {
      convert:
        return convertToHeapStorage(newCap);
    }

  grow:
    return Impl::growTo(*this, newCap);

although I don't feel too strongly.
Comment 30 Nicholas Nethercote [:njn] 2013-02-12 17:16:52 PST
> > +        /* This case occurs in ~15--20% of the calls to this function. */
> > +
> > +        /* Will newCap*sizeof(T) overflow? */
> > +        if ((mLength + 1) & tl::MulOverflowMask<2 * sizeof(T)>::result) {
> 
> I think you can s/(mLength + 1)/mLength/ here.

I don't think so:  we do |newCap = mLength*2| and then possibly increment |newCap|.  So it's possible that |mLength*2*sizeof(T)| is safe but |(mLength*2+1)*sizeof(T)| overflows.


> > +     * (char *)begin()) overflows ptrdiff_t. See Bug 510319.
> > +     */
> > +    if (newCap & tl::UnsafeRangeSizeMask<T>::result) {
> 
> I keep thinking we could fold this branch into the preceding tests (e.g.,
> using tl::MulOverflowMask<4 * sizeof(T)>::result), but I haven't proven it
> to myself.  Feel free to do that, if you want, or not.

I'm pretty sure it would be safe, but it'd reduce the max size of a Vector with byte-size elements.  I'll leave it.
Comment 31 Luke Wagner [:luke] 2013-02-13 12:27:26 PST
(In reply to Nicholas Nethercote [:njn] from comment #30)
> I don't think so:  we do |newCap = mLength*2| and then possibly increment
> |newCap|.  So it's possible that |mLength*2*sizeof(T)| is safe but
> |(mLength*2+1)*sizeof(T)| overflows.

Given that SIZET_MAX is odd and mLength*2 is even, I still don't think an overflow is possible.  (Maybe this reasoning is too hacky :)

> I'm pretty sure it would be safe, but it'd reduce the max size of a Vector
> with byte-size elements.  I'll leave it.

It's fine if you leave it in, but, in general, I'm not aware of any cases where it is a good idea to allow a >1GB Vector on a 32-bit system.  64-bit will OOM long before we hit this limit.
Comment 32 Nicholas Nethercote [:njn] 2013-02-14 21:01:06 PST
I ended up going with |MulOverflowMask<4 * sizeof(T)>|.

https://hg.mozilla.org/integration/mozilla-inbound/rev/77bd010e0e62
Comment 33 Ryan VanderMeulen [:RyanVM] 2013-02-15 06:48:35 PST
https://hg.mozilla.org/mozilla-central/rev/77bd010e0e62
Comment 34 Luke Wagner [:luke] 2013-02-15 09:00:05 PST
(In reply to Nicholas Nethercote [:njn] from comment #32)
Great!  Does the 4* need to be added to the MulOverflowMask on the 'else' branch as well?
Comment 35 Nicholas Nethercote [:njn] 2013-02-15 12:10:11 PST
> Does the 4* need to be added to the MulOverflowMask on the 'else'
> branch as well?

D'oh, yes.  I'll fix it on Monday, or rs=me if you want to do it today.
