Closed
Bug 610587
Opened 14 years ago
Closed 14 years ago
improve jsvector.h
Categories
(Core :: JavaScript Engine, defect)
Tracking
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
Details
(Whiteboard: fixed-in-tracemonkey)
Attachments
(1 file)
25.35 KB, patch
luke: review+
Vector can store elements inline, which is nice. Except that it means that for every Vector operation we have to check whether we're using inline storage or non-inline storage, viz:

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        if (usingInlineStorage()) {
            if (inlineLength() < sInlineCapacity) {
                new(inlineEnd()) T(t);
                ++inlineLength();
                JS_ASSERT(usingInlineStorage());
                return true;
            }
            if (!convertToHeapStorage(1))
                return false;
        } else {
            if (heapLength() == heapCapacity() && !growHeapStorageBy(1))
                return false;
        }

        /* We are !usingInlineStorage(). Initialize new elements. */
        JS_ASSERT(heapLength() <= heapCapacity() && heapCapacity() - heapLength() >= 1);
        new(heapEnd()++) T(t);
        return true;
    }

Cachegrind tells me these checks are hot. I think it should be possible to avoid a lot of them by doing something like this:

    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        ReentrancyGuard g(*this);
        if (length() < capacity()) {
            new(end()) T(t);
            ++length();
            return true;
        }

        if (usingInlineStorage()) {
            JS_ASSERT(inlineLength() == sInlineCapacity);
            if (!convertToHeapStorage(1))
                return false;
        } else {
            JS_ASSERT(heapLength() == heapCapacity());
            if (!growHeapStorageBy(1))
                return false;
        }

        /* We are !usingInlineStorage(). Initialize new elements. */
        JS_ASSERT(heapLength() <= heapCapacity() && heapCapacity() - heapLength() >= 1);
        new(heapEnd()++) T(t);
        return true;
    }

append() is the most important case, but I'm sure similar things can be done in a lot of the other cases. Some minor futzing of the length()/begin()/end()-style functions may be required.

Luke, can you see any problems with this idea?
Comment 1•14 years ago
Sounds good!
Comment 2•14 years ago
On the subject, one thing I've been thinking about for a while with js::Vector is moving the 'grow' path out-of-line. It seems like this could remove a lot of cold code and thus improve code density.
Assignee
Comment 3•14 years ago
(In reply to comment #2)
> On the subject, one thing I've been thinking about for a while with js::Vector
> is moving the 'grow' path out-of-line. It seems like this could remove a lot
> of cold code and thus improve code density.

Do you mean splitting functions like append() into two, with the fast path in one function and the slow path in another?
Comment 4•14 years ago
Yes, with the fast path JS_ALWAYS_INLINE (as it is now) and the slow path just 'inline'. Even better, for pod types (tl::IsPodType<T>::result == true), there can be a single non-templated non-inline grow function. I don't mean to hijack this bug, though, perhaps this is best filed as a follow-up bug.
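[To illustrate the split being discussed, a minimal sketch assuming hypothetical helper names (appendSlow, incLength) and a simplified Vector; this is not the code that eventually landed, and the POD-specialized grow function mentioned above is omitted.]

    /* Hot path: always inlined, minimal work and register pressure. */
    template <class T, size_t N, class AP>
    JS_ALWAYS_INLINE bool
    Vector<T,N,AP>::append(const T &t)
    {
        if (length() < capacity()) {
            new(end()) T(t);
            incLength(1);
            return true;
        }
        return appendSlow(t);   /* cold path lives in a separate function */
    }

    /* Cold path: grows storage; deliberately just 'inline', not JS_ALWAYS_INLINE. */
    template <class T, size_t N, class AP>
    inline bool
    Vector<T,N,AP>::appendSlow(const T &t)
    {
        if (!growStorageBy(1))
            return false;
        new(end()) T(t);
        incLength(1);
        return true;
    }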
Comment 5•14 years ago
Inlining decls probably don't matter for PGO builds. I guess Mac still doesn't do those, bleh.
Comment 6•14 years ago
(In reply to comment #5)
> Inlining decls probably don't matter for PGO builds. I guess Mac still doesn't
> do those, bleh.

PGO would move the code off the hot path, true, but unless PGO causes __forceinline to be ignored at compile time, there would still be a lot of extra code being generated. Probably not a lot of bytes on an absolute scale, but unless PGO moves cold code in a function not only off the hot path but also out of hot code pages, this could still have a negative effect. (Lots of PGO unknowns here...)

(Also Linux ;-)
Comment 7•14 years ago
(In reply to comment #3)
> Do you mean splitting functions like append() into two, with the fast path in
> one function and the slow path in another?

gcc, at least, doesn't handle the following idiom well:

    function_which_is_called_very_frequently ( ... )
    {
       if (test_for_common_case) {
          // simple common case handler
          // with minimal register pressure
       } else {
          // complex case handler, uses lots of regs
       }
    }

because it generates prologue and epilogue code which dumps all the callee-saved registers on the stack and then restores them, as is required by the complex case. But that causes lots of pointless stores/loads in the common, fast case.

Hence the splitting Luke proposes is good for more than just increasing code density. It means the fast-case function can have minimal prologue/epilogue overhead; we only take that hit when calling the slow function. We need to guarantee, though, that the slow function doesn't get inlined back in, as that would obviously make the splitting pointless. Parking "__attribute__((noinline))" on it works in gcc-world. I don't know if there's an MSVC equivalent.
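[For reference, a minimal sketch of a portable no-inline annotation; the GCC and MSVC spellings below are the standard compiler attributes, but SpiderMonkey's own JS_NEVER_INLINE macro may be defined differently.]

    #if defined(_MSC_VER)
    # define NEVER_INLINE __declspec(noinline)       /* MSVC equivalent */
    #elif defined(__GNUC__)
    # define NEVER_INLINE __attribute__((noinline))  /* gcc */
    #else
    # define NEVER_INLINE
    #endif

    /* Keep the slow path out of the caller so the fast path keeps a small frame. */
    NEVER_INLINE bool growSlow(size_t incr);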
Comment 8•14 years ago
(In reply to comment #7)
> gcc, at least, doesn't handle the following idiom well

I wonder if __builtin_expect((cond), 1) (our JS_LIKELY macro) fixes that.
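[A small sketch of what such a hint looks like; the macro definition shown is an assumption based on the __builtin_expect form quoted above, not necessarily how JS_LIKELY is actually defined.]

    #define MY_LIKELY(cond) (__builtin_expect(!!(cond), 1))

    if (MY_LIKELY(mLength < mCapacity)) {
        /* common case: laid out as the fall-through path */
    } else {
        /* rare case: grow the storage */
    }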
Assignee
Updated•14 years ago
Summary: Avoid frequent usingInlineStorage() tests in jsvector.h → improve jsvector.h
Assignee
Comment 9•14 years ago
This patch makes Vector simpler and faster.

- The main change was to make the handling of the buffer identical where possible, regardless of whether it is inline or heap-allocated. We now always store the buffer begin pointer, the length and the capacity. The end pointer is never stored; it can be computed from the length. This has the following benefits:

  - Conceptually simpler -- no need to mentally switch between two different ways of storing the length/capacity/begin/end values.

  - More concise code -- no need for inlineFoo() and heapFoo() versions of various functions. Also, functions like append() became shorter (and their slow path now goes via a never-inlined call to growStorageBy()). This cut about 120 lines from jsvector.h. The size of the 32-bit JS shell binary on my Linux box dropped by 2.7%, from 3,157,442 to 3,075,528 bytes.

  - Faster -- many fewer calls to usingInlineStorage() are required. Instruction count results are below. Measurements indicate roughly a 3ms win on my MacBook Pro, which gives pretty similar results to AWFY. The in-browser result won't be as good because at least some of the speed-up is in jsscan.cpp, which isn't included in browser measurements, AIUI. But hey, faster parsing is a win for day-to-day browser usage.

  - The only downside is that each Vector is two words larger than it used to be. Previously we had one size_t member plus the AlignedStorage. Now we have one pointer, two size_t members, plus the AlignedStorage.

- I preserved the prior checking behaviour of begin() et al by introducing beginNoCheck() and endNoCheck(). This is a bit ugly, but it makes it clearer than before when the checks are occurring.

- I added a handful of extra assertions.

- replaceRawBuffer() now doesn't have to convert to inline storage if length < sInlineCapacity, but the code still does that. Perhaps this should be rethought; I guess it depends on whether the cost of the copying and freeing is greater than the cost of using the extra memory.
---------------------------------------------------------------
|            millions of instructions executed                 |
|            total            | on-trace (may overestimate)    |
---------------------------------------------------------------
|   68.822    68.754 (1.001x) |   44.101    44.095 (------) | 3d-cube
|   39.720    39.728 (------) |   24.717    24.716 (------) | 3d-morph
|   68.987    68.690 (1.004x) |   40.447    40.433 (------) | 3d-raytrace
|   25.500    25.506 (------) |   12.116    12.115 (------) | access-binary-
|   88.394    88.390 (------) |   83.307    83.306 (------) | access-fannkuc
|   28.384    28.351 (1.001x) |   16.267    16.265 (------) | access-nbody
|   35.795    35.806 (------) |   29.366    29.366 (------) | access-nsieve
|    7.522     7.536 (0.998x) |    3.256     3.255 (------) | bitops-3bit-bi
|   35.362    35.376 (------) |   30.401    30.400 (------) | bitops-bits-in
|   15.914    15.931 (0.999x) |   12.019    12.019 (------) | bitops-bitwise
|   38.138    38.148 (------) |   32.967    32.966 (------) | bitops-nsieve-
|   17.420    17.430 (0.999x) |   13.169    13.168 (------) | controlflow-re
|   54.029    53.864 (1.003x) |   30.473    30.467 (------) | crypto-aes
|   24.424    24.327 (1.004x) |   12.332    12.328 (------) | crypto-md5
|   20.049    20.012 (1.002x) |    8.694     8.692 (------) | crypto-sha1
|   67.241    67.092 (1.002x) |   21.027    21.006 (1.001x) | date-format-to
|   63.267    62.635 (1.010x) |   10.200    10.198 (------) | date-format-xp
|   43.645    43.657 (------) |   29.513    29.513 (------) | math-cordic
|   22.928    22.931 (------) |    6.310     6.310 (------) | math-partial-s
|   32.347    32.350 (------) |   27.176    27.175 (------) | math-spectral-
|   49.202    48.269 (1.019x) |   34.581    34.580 (------) | regexp-dna
|   28.795    28.770 (1.001x) |    9.278     9.277 (------) | string-base64
|   64.396    63.902 (1.008x) |   32.213    32.102 (1.003x) | string-fasta
|  104.222   102.168 (1.020x) |   17.271    17.260 (1.001x) | string-tagclou
|  102.728    96.995 (1.059x) |   12.809    12.807 (------) | string-unpack-
|   43.198    42.531 (1.016x) |    8.574     8.573 (------) | string-validat
-------
| 1190.442  1179.160 (1.010x) |  602.594   602.404 (------) | all
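[To make the new layout concrete, a simplified sketch of the unified representation described in this comment; member and function names follow the description above, but the exact code in the attached patch may differ.]

    /* Sketch only: one set of members regardless of inline vs. heap storage. */
    T *mBegin;        /* points at the inline buffer or the heap buffer */
    size_t mLength;   /* number of constructed elements */
    size_t mCapacity; /* number of allocated slots */

    T *endNoCheck() { return mBegin + mLength; }   /* end is computed, never stored */

    JS_ALWAYS_INLINE bool append(const T &t) {
        if (mLength == mCapacity && !growStorageBy(1))  /* growStorageBy is JS_NEVER_INLINE */
            return false;
        new(endNoCheck()) T(t);
        ++mLength;
        return true;
    }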
Attachment #489761 - Flags: review?(lw)
Comment 10•14 years ago
Fantastic! I look forward to reading this. In the meantime, in the spirit of science and testing hypotheses in comments 2 through 7, would it be possible with a reasonable amount of effort to measure the effects of not inlining the cold path? (Perhaps comparing your usingInlineStorage optimization with the cold path in the header and JS_ALWAYS_INLINEd vs. the posted patch?) Regardless, I should bump the priority of the long-standing todo to un-inline more of js::HashTable.
Assignee
Comment 11•14 years ago
(In reply to comment #10)
> Fantastic! I look forward to reading this. In the meantime, in the spirit of
> science and testing hypotheses in comments 2 through 7, would it be possible
> with a reasonable amount of effort to measure the effects of not inlining the
> cold path?

The patch has JS_NEVER_INLINE on growStorageBy(), so it's already done. I could undo it by changing it to JS_ALWAYS_INLINE or just 'inline' and re-measure, would that be sufficient?
Comment 12•14 years ago
Yes. (By the way, I'm surprised you don't get multiple-definition linker errors if you have a memfun definition in a header that isn't marked inline. You might want to test building on all platforms.)
Assignee
Comment 13•14 years ago
Moving growStorageBy() back into the hot path is bad. The binary size is 3,153,234 bytes, almost as big as the original. And the instruction count reduction is less:

---------------------------------------------------------------
|            millions of instructions executed                 |
|            total            | on-trace (may overestimate)    |
---------------------------------------------------------------
|   68.619    68.495 (1.002x) |   43.773    43.772 (------) | 3d-cube
|   39.728    39.728 (------) |   24.716    24.717 (------) | 3d-morph
|   65.660    65.526 (1.002x) |   37.051    37.050 (------) | 3d-raytrace
|   24.395    24.395 (------) |   11.010    11.010 (------) | access-binary-
|   88.404    88.395 (------) |   83.307    83.307 (------) | access-fannkuc
|   28.431    28.392 (1.001x) |   16.266    16.266 (------) | access-nbody
|   35.167    35.169 (------) |   28.735    28.736 (------) | access-nsieve
|    7.522     7.525 (------) |    3.255     3.256 (------) | bitops-3bit-bi
|   35.362    35.366 (------) |   30.401    30.401 (------) | bitops-bits-in
|   15.912    15.918 (------) |   12.019    12.019 (------) | bitops-bitwise
|   38.142    38.144 (------) |   32.966    32.966 (------) | bitops-nsieve-
|   17.138    17.138 (------) |   12.876    12.876 (------) | controlflow-re
|   53.730    54.088 (0.993x) |   30.042    30.082 (0.999x) | crypto-aes
|   24.032    23.958 (1.003x) |   11.807    11.807 (------) | crypto-md5
|   19.902    19.864 (1.002x) |    8.499     8.499 (------) | crypto-sha1
|   67.345    67.271 (1.001x) |   21.027    21.027 (------) | date-format-to
|   63.034    63.577 (0.991x) |    9.946     9.946 (------) | date-format-xp
|   43.644    43.649 (------) |   29.513    29.513 (------) | math-cordic
|   22.933    22.938 (------) |    6.310     6.311 (------) | math-partial-s
|   31.374    31.370 (------) |   26.189    26.190 (------) | math-spectral-
|   49.215    48.497 (1.015x) |   34.581    34.581 (------) | regexp-dna
|   28.830    28.806 (1.001x) |    9.278     9.278 (------) | string-base64
|   64.261    63.814 (1.007x) |   32.065    32.065 (------) | string-fasta
|  104.619   103.250 (1.013x) |   17.206    17.224 (0.999x) | string-tagclou
|  102.923    98.010 (1.050x) |   12.831    12.831 (------) | string-unpack-
|   43.223    43.177 (1.001x) |    8.574     8.574 (------) | string-validat
-------
| 1183.559  1176.473 (1.006x) |  594.257   594.316 (------) | all

So the patch as submitted is good to review, I think! (I will try-server it before landing to check the thing about the non-inline function being in the header.)
Comment 14•14 years ago
(In reply to comment #13)

Awesome, that's useful to know.

I looked at the patch. Trading space for removing usingInlineStorage() is a pure win. I'm not sure why I was so hung up on minimizing sizeof(js::Vector)... I was young and naive and it was my third day at Mozilla after grad school :)

Another change I noticed was representing 'end' via mBegin+mLength instead of T *mEnd. I think the reason for this change is that mCapacity is a size_t, so this makes mLength/mCapacity more efficient to compare. Now, the hottest operations for Vector seem to be append(), end(), and back(). FWIW, the hot path for these operations would be strictly faster representing both 'end' and 'capacity' via pointers. E.g., on append(), this would avoid the need to load mBegin and compute mBegin+mLength.
Assignee
Comment 15•14 years ago
(In reply to comment #14)
> Another change I noticed was representing 'end' via mBegin+mLength instead of
> T *mEnd. I think the reason for this change is that mCapacity is a size_t, so
> this makes mLength/mCapacity more efficient to compare. Now, the hottest
> operations for Vector seem to be append(), end(), and back(). FWIW, the hot
> path for these operations would be strictly faster representing both 'end' and
> 'capacity' via pointers. E.g., on append(), this would avoid the need to load
> mBegin and compute mBegin+mLength.

I tried adding mEnd. In append(), it makes endNoCheck() slightly faster, but you have to increment mEnd when you increment mLength, so it ends up a wash; the instruction counts improved by a factor of 1.00028x. We're well into diminishing returns territory.

Also, making the change was a pain; it took me numerous tries to find all the places where mLength and mBegin were updated and add a corresponding mEnd update as well.

So I suggest we keep the mBegin/mLength/mCapacity configuration we have in the patch :)
Comment 16•14 years ago
I'm sorry I wasn't more explicit: mEnd *instead* of mLength (and changing mCapacity to a pointer).
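[To make that suggestion concrete, a sketch of the pointer-based layout being described, with hypothetical member names; this is not the representation that landed.]

    T *mBegin;        /* start of the buffer */
    T *mEnd;          /* one past the last constructed element */
    T *mCapacityEnd;  /* one past the last allocated slot */

    size_t length() const { return mEnd - mBegin; }

    JS_ALWAYS_INLINE bool append(const T &t) {
        if (mEnd == mCapacityEnd && !growStorageBy(1))  /* one compare, no mBegin + mLength */
            return false;
        new(mEnd++) T(t);   /* no mBegin load needed on the hot path */
        return true;
    }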
Assignee
Comment 17•14 years ago
Another wash:

---------------------------------------------------------------
|            millions of instructions executed                 |
|            total            | on-trace (may overestimate)    |
---------------------------------------------------------------
|   68.495    68.437 (1.001x) |   43.772    43.768 (------) | 3d-cube
|   39.728    39.726 (------) |   24.717    24.716 (------) | 3d-morph
|   65.526    65.504 (------) |   37.050    37.044 (------) | 3d-raytrace
|   24.395    24.391 (------) |   11.010    11.010 (------) | access-binary-
|   88.395    88.391 (------) |   83.307    83.306 (------) | access-fannkuc
|   28.392    28.370 (1.001x) |   16.266    16.264 (------) | access-nbody
|   35.169    35.167 (------) |   28.736    28.735 (------) | access-nsieve
|    7.525     7.524 (------) |    3.256     3.255 (------) | bitops-3bit-bi
|   35.366    35.364 (------) |   30.401    30.400 (------) | bitops-bits-in
|   15.918    15.916 (------) |   12.019    12.019 (------) | bitops-bitwise
|   38.144    38.142 (------) |   32.966    32.966 (------) | bitops-nsieve-
|   17.138    17.136 (------) |   12.876    12.875 (------) | controlflow-re
|   54.088    54.092 (------) |   30.082    30.080 (------) | crypto-aes
|   23.958    23.932 (1.001x) |   11.807    11.804 (------) | crypto-md5
|   19.864    19.859 (------) |    8.499     8.498 (------) | crypto-sha1
|   67.270    67.225 (1.001x) |   21.027    21.026 (------) | date-format-to
|   63.577    63.564 (------) |    9.946     9.945 (------) | date-format-xp
|   43.649    43.638 (------) |   29.513    29.513 (------) | math-cordic
|   22.938    22.928 (------) |    6.311     6.310 (------) | math-partial-s
|   31.370    31.368 (------) |   26.190    26.189 (------) | math-spectral-
|   48.497    48.436 (1.001x) |   34.581    34.580 (------) | regexp-dna
|   28.806    28.801 (------) |    9.278     9.277 (------) | string-base64
|   63.814    63.824 (------) |   32.065    32.064 (------) | string-fasta
|  103.250   102.828 (1.004x) |   17.224    17.211 (1.001x) | string-tagclou
|   98.010    98.696 (0.993x) |   12.831    12.829 (------) | string-unpack-
|   43.179    43.200 (------) |    8.574     8.573 (------) | string-validat
-------
| 1176.473  1176.473 (------) |  594.316   594.269 (------) | all

I like the original formulation better; tracking length and capacity as numbers makes the most sense to me.
Comment 18•14 years ago
Comment on attachment 489761 [details] [diff] [review]
patch (against TM 57083:eaaae3775c00)

Fair enough, thanks for trying!

>-    /* If reserve(length() + N) succeeds, the N next appends are guaranteed to succeed. */
>+    /* If reserve(mLength + N) succeeds, the N next appends are guaranteed to succeed. */

>-    /* Call shrinkBy or growBy based on whether newSize > length(). */
>+    /* Call shrinkBy or growBy based on whether newSize > mLength. */

>-     * N.B. Although a T*, only the range [0, length()) is constructed.
>+     * N.B. Although a T*, only the range [0, mLength) is constructed.

Could you keep these 'length()': they are public API comments and mLength is a private member.

>+    JS_ASSERT(mLength + incr > mCapacity);
>+    if (usingInlineStorage()) {
>+        return convertToHeapStorage(incr);
>+    } else {
>+        return growHeapStorageBy(incr);
>+    }

SM style is:

    if (....)
        return ...
    return ...

>+    REENTRANCY_GUARD_ET_AL;
>+    if (incr > mCapacity - mLength)
>+        if (!growStorageBy(incr))
>            return false;

SM style is:

    if (incr > mCapacity - mLength && !growStorageBy(incr))
        return false;

This happens in: growByImpl, append(scalar), appendN, append(range).

r+ with nits fixed
Attachment #489761 - Flags: review?(lw) → review+
Comment 19•14 years ago
(In reply to comment #18)
> SM style is:
>
>     if (....)
>         return ...
>     return ...

or

    return a ? b : c;

if that isn't too complicated.

> >+    REENTRANCY_GUARD_ET_AL;
> >+    if (incr > mCapacity - mLength)
> >+        if (!growStorageBy(incr))
> >            return false;
>
> SM style is:
>
>     if (incr > mCapacity - mLength && !growStorageBy(incr))
>         return false;

Generally true, but if the fallible call can be profitably given its own if (if overlong, or commented separately, or for parallelism with other code that has to nest if's for other reasons), then the outer if's then clause needs bracing.

/be
Assignee
Comment 20•14 years ago
Pushed with nits fixed: http://hg.mozilla.org/tracemonkey/rev/b60ac564e39d
Whiteboard: fixed-in-tracemonkey
Assignee
Comment 21•14 years ago
Turns out I pushed a version where growStorageBy() was marked JS_ALWAYS_INLINE :( I'll change it to JS_NEVER_INLINE tomorrow. BTW, I try-server'd it and there were no complaints about the non-inline memfun being defined in the header.
Assignee
Comment 22•14 years ago
The JS_NEVER_INLINE change: http://hg.mozilla.org/tracemonkey/rev/7b8898c9b54c
Comment 23•14 years ago
http://hg.mozilla.org/mozilla-central/rev/b60ac564e39d
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED