Closed Bug 610587 Opened 14 years ago Closed 14 years ago

improve jsvector.h

Categories

(Core :: JavaScript Engine, defect)

Platform: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file)

Vector can store elements inline, which is nice.  Except that it means for
every Vector operation we have to check if we're using inline storage or
non-inline storage, viz:

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        if (usingInlineStorage()) {
            if (inlineLength() < sInlineCapacity) {
                new(inlineEnd()) T(t);
                ++inlineLength();
                JS_ASSERT(usingInlineStorage());
                return true;
            }
            if (!convertToHeapStorage(1))
                return false;
        } else {
            if (heapLength() == heapCapacity() && !growHeapStorageBy(1))
                return false;
        }   
                
        /* We are !usingInlineStorage(). Initialize new elements. */
        JS_ASSERT(heapLength() <= heapCapacity() && heapCapacity() - heapLength() >= 1);
        new(heapEnd()++) T(t);
        return true;
    }

Cachegrind tells me these checks are hot.  I think it should be possible to
avoid a lot of them by doing something like this:

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        if (length() < capacity()) {
            new(end()) T(t);
            ++length();
            return true;
        }

        if (usingInlineStorage()) {
            JS_ASSERT(inlineLength() == sInlineCapacity);
            if (!convertToHeapStorage(1))
                return false;
        } else {
            JS_ASSERT(heapLength() == heapCapacity());
            if (!growHeapStorageBy(1))
                return false;
        }
                
        /* We are !usingInlineStorage(). Initialize new elements. */
        JS_ASSERT(heapLength() <= heapCapacity() && heapCapacity() - heapLength() >= 1);
        new(heapEnd()++) T(t);
        return true;
    }

append() is the most important case, but I'm sure similar things can be done
in a lot of the other cases.  Some minor futzing of the length()/begin()/end()-style
functions may be required.

Luke, can you see any problems with this idea?
Sounds good!
On the subject, one thing I've been thinking about for a while with js::Vector is moving the 'grow' path out-of-line.  It seems like this could remove a lot of cold code and thus improve code density.
(In reply to comment #2)
> On the subject, one thing I've been thinking about for a while with js::Vector
> is moving the 'grow' path out-of-line.  It seems like this could remove a lot
> of cold code and thus improve code density.

Do you mean splitting functions like append() into two, with the fast path in one function and the slow path in another?
Yes, with the fast path JS_ALWAYS_INLINE (as it is now) and the slow path just 'inline'.  Even better, for pod types (tl::IsPodType<T>::result == true), there can be a single non-templated non-inline grow function.

I don't mean to hijack this bug, though, perhaps this is best filed as a follow-up bug.
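
A minimal sketch of the split being described, assuming a hypothetical
appendSlow() helper and the accessors from comment 0 (the patch that
eventually landed uses growStorageBy() instead):

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        /* Hot path: room available; stays always-inlined and branch-light. */
        if (length() < capacity()) {
            new(end()) T(t);
            ++length();   /* assumes length() returns a mutable reference, as in comment 0's sketch */
            return true;
        }
        /* Cold path: merely 'inline', so the compiler can keep it out of the hot caller. */
        return appendSlow(t);
    }

    template <class T, size_t N, class AP>
    inline bool
    Vector<T,N,AP>::appendSlow(const T &t)
    {
        if (usingInlineStorage() ? !convertToHeapStorage(1) : !growHeapStorageBy(1))
            return false;
        new(heapEnd()++) T(t);
        return true;
    }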
Inlining decls probably don't matter for PGO builds. I guess Mac still doesn't do those, bleh.
(In reply to comment #5)
> Inlining decls probably don't matter for PGO builds. I guess Mac still doesn't
> do those, bleh.

PGO would move the code off the hot path, true, but unless PGO causes __forceinline to be ignored at compile time, there would still be a lot of extra code being generated.  Probably not a lot of bytes on an absolute scale, but unless PGO moves cold code in a function not only off the hot path but out of hot code pages, this could still have a negative effect.  (Lots of PGO unknowns here...)  (Also Linux ;-)
(In reply to comment #3)

> Do you mean splitting functions like append() into two, with the fast path in
> one function and the slow path in another?

gcc, at least, doesn't handle the following idiom well

  ... function_which_is_called_very_frequently ( ... )
  {
     if (test_for_common_case) {
        // simple common case handler
        // with minimal register pressure
     } else {
        // complex case handler, uses lots of regs
     }
  }

because it generates prologue and epilogue code which dumps all
the callee saved registers on the stack and then restores them,
as is required by the complex case.  But that causes lots of
pointless stores/loads in the common, fast case.

Hence the splitting Luke proposes is good for more than
just increasing code density.  It means the fast case fn can
have minimal prologue/epilogue overhead; we only take that hit
when calling the slow function.  Need to guarantee though that the
slow function doesn't get inlined back in as that would obviously
make the splitting pointless.  Parking "__attribute__((noinline))"
on it works in gcc-world.  I don't know if there's a MSVC equivalent.
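
(For the record, MSVC's equivalent is __declspec(noinline).  A sketch of a
wrapper macro follows; the tree's actual JS_NEVER_INLINE, used later in this
bug, may be defined differently.)

    #include <stddef.h>

    /* Sketch of a never-inline annotation; the real macro may differ. */
    #if defined(_MSC_VER)
    # define JS_NEVER_INLINE __declspec(noinline)
    #elif defined(__GNUC__)
    # define JS_NEVER_INLINE __attribute__((noinline))
    #else
    # define JS_NEVER_INLINE
    #endif

    /* Usage: keep the slow path from being inlined back into its hot caller. */
    JS_NEVER_INLINE bool growStorageBy(size_t incr);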
(In reply to comment #7)
> gcc, at least, doesn't handle the following idiom well

I wonder if __builtin_expect((cond), 1) (our JS_LIKELY macro) fixes that.
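
For reference, a sketch of how such a macro is typically defined (the tree's
actual JS_LIKELY may differ slightly):

    /* Sketch: wrap gcc's branch-prediction hint; degrade to a plain test elsewhere. */
    #if defined(__GNUC__)
    # define JS_LIKELY(x)   (__builtin_expect(!!(x), 1))
    # define JS_UNLIKELY(x) (__builtin_expect(!!(x), 0))
    #else
    # define JS_LIKELY(x)   (!!(x))
    # define JS_UNLIKELY(x) (!!(x))
    #endif

    /* e.g. hinting the common case in the idiom from comment #7:      */
    /*   if (JS_LIKELY(test_for_common_case)) { ... } else { ... }     */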
Summary: Avoid frequent usingInlineStorage() tests in jsvector.h → improve jsvector.h
This patch makes Vector simpler and faster.

- The main change was to make the handling of the buffer identical where
  possible, regardless of whether it is inline or heap-allocated.  We now
  always store the buffer begin pointer, the length and the capacity.  The
  end pointer is never stored; it can be computed from the length.
  
  This has the following benefits:

  - Conceptually simpler -- no need to mentally switch between two different
    ways of storing the length/capacity/begin/end values.

  - More concise code -- no need for inlineFoo() and heapFoo() versions of
    various functions.  Also, functions like append() became shorter (and
    their slow path now goes via a never-inlined call to growStorageBy();
    see the sketch after this list).  This cut about 120 lines from
    jsvector.h.  The size of the 32-bit JS shell binary on my Linux box
    dropped by 2.7%, from 3,157,442 to 3,075,528 bytes.

  - Faster -- many fewer calls to usingInlineStorage() are required.
    Instruction count results are below.  Measurements indicate roughly a
    3ms win on my MacBook Pro which gives pretty similar results to AWFY.
    The in-browser result won't be as good because at least some of the
    speed-up is in jsscan.cpp which isn't included in browser measurements,
    AIUI.  But hey, faster parsing is a win for day-to-day browser usage.

- The only downside is that each Vector is two words larger than it used to
  be.  Previously we had one size_t member plus the AlignedStorage.  Now we
  have one pointer, two size_t members, plus the AlignedStorage.

- I preserved the prior checking behaviour of begin() et al by introducing
  beginNoCheck() and endNoCheck().  This is a bit ugly but makes it clearer
  than it was before when the checks are occurring.

- I added a handful of extra assertions.

- replaceRawBuffer() now doesn't have to convert to inline storage if
  length < sInlineCapacity, but the code still does that.  Perhaps this
  should be rethought.  I guess it depends on whether the cost of the
  copying and freeing is greater than the cost of using the extra memory.
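
To make the description concrete, here is a rough sketch of the shape append()
and growStorageBy() now take with the unified mBegin/mLength/mCapacity
representation.  It is reconstructed from this description and the review
comments below, so details may differ from the attached patch:

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        /* One representation for both inline and heap-allocated buffers. */
        if (mLength == mCapacity && !growStorageBy(1))
            return false;
        JS_ASSERT(mLength < mCapacity);
        new(endNoCheck()) T(t);      /* endNoCheck() == beginNoCheck() + mLength */
        ++mLength;
        return true;
    }

    /* The slow path is deliberately kept out of line. */
    template <class T, size_t N, class AP>
    JS_NEVER_INLINE bool
    Vector<T,N,AP>::growStorageBy(size_t incr)
    {
        JS_ASSERT(mLength + incr > mCapacity);
        if (usingInlineStorage())
            return convertToHeapStorage(incr);
        return growHeapStorageBy(incr);
    }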


---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    68.822    68.754 (1.001x) |    44.101    44.095 (------) | 3d-cube
|    39.720    39.728 (------) |    24.717    24.716 (------) | 3d-morph
|    68.987    68.690 (1.004x) |    40.447    40.433 (------) | 3d-raytrace
|    25.500    25.506 (------) |    12.116    12.115 (------) | access-binary-
|    88.394    88.390 (------) |    83.307    83.306 (------) | access-fannkuc
|    28.384    28.351 (1.001x) |    16.267    16.265 (------) | access-nbody
|    35.795    35.806 (------) |    29.366    29.366 (------) | access-nsieve
|     7.522     7.536 (0.998x) |     3.256     3.255 (------) | bitops-3bit-bi
|    35.362    35.376 (------) |    30.401    30.400 (------) | bitops-bits-in
|    15.914    15.931 (0.999x) |    12.019    12.019 (------) | bitops-bitwise
|    38.138    38.148 (------) |    32.967    32.966 (------) | bitops-nsieve-
|    17.420    17.430 (0.999x) |    13.169    13.168 (------) | controlflow-re
|    54.029    53.864 (1.003x) |    30.473    30.467 (------) | crypto-aes
|    24.424    24.327 (1.004x) |    12.332    12.328 (------) | crypto-md5
|    20.049    20.012 (1.002x) |     8.694     8.692 (------) | crypto-sha1
|    67.241    67.092 (1.002x) |    21.027    21.006 (1.001x) | date-format-to
|    63.267    62.635 (1.010x) |    10.200    10.198 (------) | date-format-xp
|    43.645    43.657 (------) |    29.513    29.513 (------) | math-cordic
|    22.928    22.931 (------) |     6.310     6.310 (------) | math-partial-s
|    32.347    32.350 (------) |    27.176    27.175 (------) | math-spectral-
|    49.202    48.269 (1.019x) |    34.581    34.580 (------) | regexp-dna
|    28.795    28.770 (1.001x) |     9.278     9.277 (------) | string-base64
|    64.396    63.902 (1.008x) |    32.213    32.102 (1.003x) | string-fasta
|   104.222   102.168 (1.020x) |    17.271    17.260 (1.001x) | string-tagclou
|   102.728    96.995 (1.059x) |    12.809    12.807 (------) | string-unpack-
|    43.198    42.531 (1.016x) |     8.574     8.573 (------) | string-validat
-------
|  1190.442  1179.160 (1.010x) |   602.594   602.404 (------) | all
Attachment #489761 - Flags: review?(lw)
Fantastic!  I look forward to reading this.  In the meantime, in the spirit of science and testing hypotheses in comments 2 through 7, would it be possible with a reasonable amount of effort to measure the effects of not inlining the cold path?  (Perhaps comparing your usingInlineStorage optimization with the cold path in the header and JS_ALWAYS_INLINEd vs. the posted patch?)  Regardless, I should bump the priority of the long-standing todo to un-inline more of js::HashTable.
(In reply to comment #10)
> Fantastic!  I look forward to reading this.  In the meantime, in the spirit of
> science and testing hypotheses in comments 2 through 7, would it be possible
> with a reasonable amount of effort to measure the effects of not inlining the
> cold path? 

The patch has JS_NEVER_INLINE on growStorageBy(), so it's already done.  I could undo it by changing it to JS_ALWAYS_INLINE or just 'inline' and re-measure, would that be sufficient?
Yes.  (By the way, I'm surprised you don't get multiple-definition linker errors if you have a memfun definition in a header that isn't marked inline.  You might want to test building on all platforms.)
Moving growStorageBy() back into the hot path is bad.  The binary size is 3,153,234 bytes, almost as big as the original.  And the instruction count reduction is less:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    68.619    68.495 (1.002x) |    43.773    43.772 (------) | 3d-cube
|    39.728    39.728 (------) |    24.716    24.717 (------) | 3d-morph
|    65.660    65.526 (1.002x) |    37.051    37.050 (------) | 3d-raytrace
|    24.395    24.395 (------) |    11.010    11.010 (------) | access-binary-
|    88.404    88.395 (------) |    83.307    83.307 (------) | access-fannkuc
|    28.431    28.392 (1.001x) |    16.266    16.266 (------) | access-nbody
|    35.167    35.169 (------) |    28.735    28.736 (------) | access-nsieve
|     7.522     7.525 (------) |     3.255     3.256 (------) | bitops-3bit-bi
|    35.362    35.366 (------) |    30.401    30.401 (------) | bitops-bits-in
|    15.912    15.918 (------) |    12.019    12.019 (------) | bitops-bitwise
|    38.142    38.144 (------) |    32.966    32.966 (------) | bitops-nsieve-
|    17.138    17.138 (------) |    12.876    12.876 (------) | controlflow-re
|    53.730    54.088 (0.993x) |    30.042    30.082 (0.999x) | crypto-aes
|    24.032    23.958 (1.003x) |    11.807    11.807 (------) | crypto-md5
|    19.902    19.864 (1.002x) |     8.499     8.499 (------) | crypto-sha1
|    67.345    67.271 (1.001x) |    21.027    21.027 (------) | date-format-to
|    63.034    63.577 (0.991x) |     9.946     9.946 (------) | date-format-xp
|    43.644    43.649 (------) |    29.513    29.513 (------) | math-cordic
|    22.933    22.938 (------) |     6.310     6.311 (------) | math-partial-s
|    31.374    31.370 (------) |    26.189    26.190 (------) | math-spectral-
|    49.215    48.497 (1.015x) |    34.581    34.581 (------) | regexp-dna
|    28.830    28.806 (1.001x) |     9.278     9.278 (------) | string-base64
|    64.261    63.814 (1.007x) |    32.065    32.065 (------) | string-fasta
|   104.619   103.250 (1.013x) |    17.206    17.224 (0.999x) | string-tagclou
|   102.923    98.010 (1.050x) |    12.831    12.831 (------) | string-unpack-
|    43.223    43.177 (1.001x) |     8.574     8.574 (------) | string-validat
-------
|  1183.559  1176.473 (1.006x) |   594.257   594.316 (------) | all


So the patch submitted is good to review, I think!  (I will try-server it before landing to check the thing about the non-inline function being in the header.)
(In reply to comment #13)
Awesome, that's useful to know.

I looked at the patch.  Trading space for removing-usingInlineStorage is pure win.  I'm not sure why I was so hung up on minimizing sizeof(js::Vector)... I was young and naive and it was my third day at Mozilla after grad school :)

Another change I noticed was representing 'end' via mBegin+mLength instead of T *mEnd.  I think the reason for this change is that mCapacity is a size_t so this makes mLength/mCapacity more efficient to compare.  Now, the hottest operations for Vector seem to be append(), end(), and back().  FWIW, the hot path for these operations would be strictly faster representing both 'end' and 'capacity' via pointers.  E.g., on append(), this would avoid the need to load mBegin and the addition mBegin+mLength.
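
For concreteness, the two layouts under discussion look roughly like this (a
sketch; mBegin/mLength/mCapacity are the patch's names, the pointer-based
names are hypothetical):

    #include <stddef.h>

    /* Layout in the patch: sizes stored as integers. */
    template <class T>
    struct SizeBasedLayout {
        T      *mBegin;
        size_t  mLength;      /* end() is computed as mBegin + mLength */
        size_t  mCapacity;    /* full when mLength == mCapacity        */
    };

    /* Alternative described above: end and capacity stored as pointers. */
    template <class T>
    struct PointerBasedLayout {
        T *mBegin;
        T *mEnd;              /* append() can write through *mEnd++ directly */
        T *mCapacityEnd;      /* full when mEnd == mCapacityEnd              */
    };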
(In reply to comment #14)
> 
> Another change I noticed was representing 'end' via mBegin+mLength instead of T
> *mEnd.  I think the reason for this change is that mCapacity is a size_t so
> this makes mLength/mCapacity more efficient to compare.  Now, the hottest
> operations for Vector seem to be append(), end(), and back().  FWIW, the hot
> path for these operations would be strictly faster representing both 'end' and
> 'capacity' via pointers.  E.g., on append(), this would avoid the need to load
> mBegin and the addition mBegin+mLength.

I tried adding mEnd.  In append(), it makes the endNoCheck() slightly faster, but you have to increment mEnd when you increment mLength, so it ends up a wash;  the instruction counts improved by a factor of 1.00028x.  We're well into diminishing returns territory.

Also, making the change was a pain, it took me numerous tries to find all the places where mLength and mBegin were updated and add a corresponding mEnd update as well.

So I suggest we keep the mBegin/mLength/mCapacity configuration we have in the patch :)
I'm sorry I wasn't more explicit: mEnd *instead* of mLength (and changing mCapacity to a pointer).
Another wash:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    68.495    68.437 (1.001x) |    43.772    43.768 (------) | 3d-cube
|    39.728    39.726 (------) |    24.717    24.716 (------) | 3d-morph
|    65.526    65.504 (------) |    37.050    37.044 (------) | 3d-raytrace
|    24.395    24.391 (------) |    11.010    11.010 (------) | access-binary-
|    88.395    88.391 (------) |    83.307    83.306 (------) | access-fannkuc
|    28.392    28.370 (1.001x) |    16.266    16.264 (------) | access-nbody
|    35.169    35.167 (------) |    28.736    28.735 (------) | access-nsieve
|     7.525     7.524 (------) |     3.256     3.255 (------) | bitops-3bit-bi
|    35.366    35.364 (------) |    30.401    30.400 (------) | bitops-bits-in
|    15.918    15.916 (------) |    12.019    12.019 (------) | bitops-bitwise
|    38.144    38.142 (------) |    32.966    32.966 (------) | bitops-nsieve-
|    17.138    17.136 (------) |    12.876    12.875 (------) | controlflow-re
|    54.088    54.092 (------) |    30.082    30.080 (------) | crypto-aes
|    23.958    23.932 (1.001x) |    11.807    11.804 (------) | crypto-md5
|    19.864    19.859 (------) |     8.499     8.498 (------) | crypto-sha1
|    67.270    67.225 (1.001x) |    21.027    21.026 (------) | date-format-to
|    63.577    63.564 (------) |     9.946     9.945 (------) | date-format-xp
|    43.649    43.638 (------) |    29.513    29.513 (------) | math-cordic
|    22.938    22.928 (------) |     6.311     6.310 (------) | math-partial-s
|    31.370    31.368 (------) |    26.190    26.189 (------) | math-spectral-
|    48.497    48.436 (1.001x) |    34.581    34.580 (------) | regexp-dna
|    28.806    28.801 (------) |     9.278     9.277 (------) | string-base64
|    63.814    63.824 (------) |    32.065    32.064 (------) | string-fasta
|   103.250   102.828 (1.004x) |    17.224    17.211 (1.001x) | string-tagclou
|    98.010    98.696 (0.993x) |    12.831    12.829 (------) | string-unpack-
|    43.179    43.200 (------) |     8.574     8.573 (------) | string-validat
-------
|  1176.473  1176.473 (------) |   594.316   594.269 (------) | all

I like the original formulation better; tracking length and capacity as numbers makes the most sense to me.
Comment on attachment 489761 [details] [diff] [review]
patch (against TM 57083:eaaae3775c00)

Fair enough, thanks for trying!

>-    /* If reserve(length() + N) succeeds, the N next appends are guaranteed to succeed. */
>+    /* If reserve(mLength + N) succeeds, the N next appends are guaranteed to succeed. */

>-    /* Call shrinkBy or growBy based on whether newSize > length(). */
>+    /* Call shrinkBy or growBy based on whether newSize > mLength. */

>-     * N.B. Although a T*, only the range [0, length()) is constructed.
>+     * N.B. Although a T*, only the range [0, mLength) is constructed.

Could you keep these 'length()': they are public API comments and mLength is a private member.

>+    JS_ASSERT(mLength + incr > mCapacity);
>+    if (usingInlineStorage()) {
>+        return convertToHeapStorage(incr);
>+    } else {
>+        return growHeapStorageBy(incr);
>+    }

SM style is:

  if (....)
      return ...
  return ...

>+    REENTRANCY_GUARD_ET_AL;
>+    if (incr > mCapacity - mLength)
>+        if (!growStorageBy(incr))
>             return false;

SM style is:

  if (incr > mCapacity - mLength && !growStorageBy(incr))
       return false;

This happens in: growByImpl, append(scalar), appendN, append(range).

r+ with nits fixed
Attachment #489761 - Flags: review?(lw) → review+
(In reply to comment #18)
> SM style is:
> 
>   if (....)
>       return ...
>   return ...

or return a ? b : c; if that isn't too complicated.

> >+    REENTRANCY_GUARD_ET_AL;
> >+    if (incr > mCapacity - mLength)
> >+        if (!growStorageBy(incr))
> >             return false;
> 
> SM style is:
> 
>   if (incr > mCapacity - mLength && !growStorageBy(incr))
>        return false;

Generally true, but if the fallible call can profitably be given its own if (because it's overlong, or commented separately, or for parallelism with other code that has to nest ifs for other reasons), then the outer if's then-clause needs bracing.

/be
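
Concretely, with the appendN-style check quoted above, the braced nesting
looks like:

    if (incr > mCapacity - mLength) {
        /* Fallible call kept on its own line, e.g. to carry a separate comment. */
        if (!growStorageBy(incr))
            return false;
    }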
Pushed with nits fixed:

http://hg.mozilla.org/tracemonkey/rev/b60ac564e39d
Whiteboard: fixed-in-tracemonkey
Turns out I pushed a version where growStorageBy() was marked JS_ALWAYS_INLINE :(
I'll change it to JS_NEVER_INLINE tomorrow.  BTW, I try-server'd it and there were no complaints about the non-inline memfun being defined in the header.
http://hg.mozilla.org/mozilla-central/rev/b60ac564e39d
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED