701560 - template version of merge sort

Assignee

Description

•

14 years ago

For the bug 669245 and other GC-related patches I need a sort function. Currently jsarray.cpp provides an implementation of the merge sort. However, that uses a function pointer for the compare operation. That is be expensive especially for simple compare operations. It would be nice to have a template version of the sort implementation to avoid the overhead of indirect function calls.

Igor Bukanov

Assignee

Comment 1

•

14 years ago

Attached patch v1 (obsolete) — Details — Splinter Review

The patch provides the templated version of the merge sort and switches jsarray.cpp to use it. To measure its effects I use the following benchmark that sorts an array with 2e6 strings: var a = []; for (var i = 2e6; i != 0; --i) a.push(""+i); var t = Date.now(); a.sort(); t = Date.now() - t; print("Time: "+t); Without the patch the running time is 724 ms, with the patch that drops to 451 ms.

Assignee: general → igor

Attachment #573685 - Flags: review?(luke)

Emanuel Hoogeveen [:ehoogeveen]

Comment 2

•

14 years ago

Iterative merge sort, cool :) 1) Since the scratch space is callee-allocated it doesn't matter for performance, but you could drop the bool * parameter if you adjusted the insertion sort limit (and the starting run size) based on whether ceil(log2(nelem)) (or its integer equivalent) is even or odd. That might introduce a little bit of slowdown by making INS_SORT_LIMIT dynamic though. 2) Could C++11 move semantics (e.g. *dst++ = move(*a++)) be used to avoid deep copying for sorting strings? I don't know if Mozilla's build environments all support those. 3) If comparisons can be expensive (e.g. almost-equal strings), have you considered using binary merging for cases where e.g. run1 > 8 * run2? If the majority of overhead is in the sorting function itself (e.g. sorting POD types) this won't help.

Emanuel Hoogeveen [:ehoogeveen]

Comment 3

•

14 years ago

Ah, I see you do copy the result back over to src a bit further down. So yes, you can avoid that copy by doing what I proposed in 1). (sorry about the bugspam)

Igor Bukanov

Assignee

Comment 4

•

14 years ago

(In reply to Emanuel Hoogeveen from comment #2) > 1) Since the scratch space is callee-allocated it doesn't matter for > performance, but you could drop the bool * parameter if you adjusted the > insertion sort limit (and the starting run size) based on whether > ceil(log2(nelem)) (or its integer equivalent) is even or odd. That might > introduce a little bit of slowdown by making INS_SORT_LIMIT dynamic though. The goal of this bug is to replace in the existing implementation of the merge sort the function pointer comparator with a template compare operation to make things faster especially with a trivial compare operation. Optimizations like the above suggestions is orthogonal to that especially since they may slowdown things. For example, I have remember from benchmarks that the increasing the insertion sort size can make things slower, see bug 224128 comment 25 and later. So I will file a bug to consider tuning the sort including investigating decreasing the extra storage requirement to n/2. > 2) Could C++11 move semantics (e.g. *dst++ = move(*a++)) be used to avoid > deep copying for sorting strings? I don't know if Mozilla's build > environments all support those. AFAIC we cannot so far use C++11 freely. > 3) If comparisons can be expensive (e.g. almost-equal strings), have you > considered using binary merging for cases where e.g. run1 > 8 * run2? If the > majority of overhead is in the sorting function itself (e.g. sorting POD > types) this won't help. A compare operation is rather expensive if it is done using user-supplied JS function. So anything that decrease their number should be helpful. But that should not harm other cases.

Igor Bukanov

Assignee

Comment 5

•

14 years ago

Attached patch v2 (obsolete) — Details — Splinter Review

The new version replaces in the comparator class operator()(a, b) with lessOrEqual(a, b) to make clearer the meaning of the compare operation.

Attachment #573685 - Attachment is obsolete: true

Attachment #573685 - Flags: review?(luke)

Attachment #574244 - Flags: review?(luke)

Igor Bukanov

Assignee

Comment 6

•

14 years ago

Attached patch v2 for real (obsolete) — Details — Splinter Review

The previous attachment had a wrong patch.

Attachment #574244 - Attachment is obsolete: true

Attachment #574244 - Flags: review?(luke)

Attachment #574245 - Flags: review?(luke)

Emanuel Hoogeveen [:ehoogeveen]

Comment 7

•

14 years ago

(In reply to Igor Bukanov from comment #4) > The goal of this bug is to replace in the existing implementation of the > merge sort the function pointer comparator with a template compare operation > to make things faster especially with a trivial compare operation. > Optimizations like the above suggestions is orthogonal to that You're right, I don't mean to hold up this bug for it (which is clearly an improvement over the status quo). I'll do some testing at home, see if I can setup a build environment, and file a follow-up if I can demonstrate improvements after this lands :)

Luke Wagner [:luke]

Comment 8

•

14 years ago

Comment on attachment 574245 [details] [diff] [review] v2 for real Great, the patch will make things much nicer. Can we avoid this whole "error()" business and have the return value of MergeSort and the comparator indicate success/fail? Thus lessOrEqual would return its success-value through an outparam.

Attachment #574245 - Flags: review?(luke)

Igor Bukanov

Assignee

Comment 9

•

14 years ago

(In reply to Luke Wagner [:luke] from comment #8) > Comment on attachment 574245 [details] [diff] [review] [diff] [details] [review] > v2 for real > > Can we avoid this whole > "error()" business and have the return value of MergeSort and the comparator > indicate success/fail? In that case if the compare function is not available at the point of template instantiation or it cannot be inlined, then the compiler cannot see that the compare function is infallible and would generate useless error checking code. With separated error function that can be inlined this is avoided. An alternative to the error() is an extra template parameter that specifies the return type of the compare operation. It can be either bool to specify an infallible operation or some 3-state type. I tried that, but it quickly became rather awkward with the need to map that extra type to void in case of infallible compare or to bool in case of fallible operation.

Luke Wagner [:luke]

Comment 10

•

14 years ago

(In reply to Igor Bukanov from comment #9) > In that case if the compare function is not available at the point of > template instantiation or it cannot be inlined, then the compiler cannot see > that the compare function is infallible and would generate useless error > checking code. For all the cases where this actually matters it should be easy (and already done) to ensure that the definition of the comparator is visible and inlined; that's the whole point, right? It would be crazy to gunk up the whole code with this or the other proposed non-standard error-handling strategy for such a miniscule and improbable situation.

Igor Bukanov

Assignee

Comment 11

•

14 years ago

(In reply to Luke Wagner [:luke] from comment #10) > For all the cases where this actually matters it should be easy (and already > done) to ensure that the definition of the comparator is visible and > inlined; that's the whole point, right? It would be crazy to gunk up the > whole code with this or the other proposed non-standard error-handling > strategy for such a miniscule and improbable situation. I tried to implement a patch that would use a stl style comparator: bool operator()(const T &a, const T &b, bool *lessOrEqualp); However, that made a benchmark that similar to one in the comment 1 to run in about 569 ms, not 548 ms. as it does with the last patch. I also tried a version that uses 3-state result, like error, false, true: int operator()(const T &a, const T &b); That improved the running time slightly to 560 ms, but still this is slower than 548. So for some reasons the GCC 4.6 on 32 bit can optimize the error() method better. So is 10% reduction in speed is acceptable price for the following a standard error-handling convention?

Luke Wagner [:luke]

Comment 12

•

14 years ago

(In reply to Igor Bukanov from comment #11) > So is 10% reduction in speed is acceptable price for the following a > standard error-handling convention? I don't see why there would be a speed difference if everything was being inlined. Did you try JS_ALWAYS_INLINE'ing the comparator? Failing that, it would be interesting to see the asm diff.

Igor Bukanov

Assignee

Comment 13

•

14 years ago

Attached patch using standard error convention (obsolete) — Details — Splinter Review

This is the patch that I use to compare with v2. It requires for the comparator to implement: bool operator()(const T &a, const T &b, bool *lessOrEqualp); If I add JS_ALWAYS_INLINE on top of it, then the test timing decreases from 569 to 558. That matches the version that uses 3-state return value. But that does not match 548 that I got from that error function. Assembly diff shows AFAICS that with that error() function GCC uses less stack variables and keeps more stuff in registers.

Attachment #574668 - Flags: review?

Igor Bukanov

Assignee

Comment 14

•

14 years ago

Attached patch v4 — Details — Splinter Review

The new version uses the standard style of the error handling in the comparator signature but tunes the merge function implementation to minimize the amount of generated code. With that change the sorting time for the test case drops from 569 to 522 ms, which is significantly better than for the previous patch with non-standard error propagation (548 ms). The assembly shows that with GCC 4.6 on 32-bit x86 code everything is inlined even without JS_ALWAYS_INLINE on the compare functions (just inline is enough). For references, without the patch the sorting time is 839 ms.

Attachment #574245 - Attachment is obsolete: true

Attachment #574668 - Attachment is obsolete: true

Attachment #574668 - Flags: review?

Attachment #574887 - Flags: review?(luke)

Luke Wagner [:luke]

Comment 15

•

14 years ago

Comment on attachment 574887 [details] [diff] [review] v4 Review of attachment 574887 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/ds/Sort.h @@ +135,5 @@ > + if (lessOrEqual) > + break; > + T tmp = array[j - 1]; > + array[j - 1] = array[j]; > + array[j] = tmp; Incidentally, gcc seems really bad manipulating uint64 by value on 32-bit. I just looked with -O3 and got something awful. When I broke it down into two manual uint32 swaps, the resulting code was 6 memory movs shorter. This suggests replacing all these inlined swaps with a call to js::Swap(a,b) (in jsutil.h) which has an overload for js::Value that does the uint32 swaps. If you're tired of tweaking this, feel free to ignore; just a suggestion. Adding a reusable Swap(a,b) does seem nice, though. @@ +157,5 @@ > + return false; > + } > + T *swap = vec1; > + vec1 = vec2; > + vec2 = swap; ... and this could use Swap() as well. ::: js/src/jsarray.cpp @@ +1988,5 @@ > + *lessOrEqualp = (result <= 0); > + return true; > +} > + > +struct SortComparatorStrings { Can all the {'s go on a newline? It matches function definitions. @@ +2129,2 @@ > ~AutoFreeVector() { > + Foreground::free_(vec); Since you are already touching this, can you just replace this with a js::Vector? @@ +2216,5 @@ > vec[2 * i].setString(str); > } while (i != 0); > > + Value *extraScratch = static_cast<Value *>(cx->malloc_(newlen * 2 * sizeof(Value))); > + if (!extraScratch) Also could you use a Vector here?

Attachment #574887 - Flags: review?(luke) → review+

Igor Bukanov

Assignee

Comment 16

•

14 years ago

(In reply to Luke Wagner [:luke] from comment #15) > This > suggests replacing all these inlined swaps with a call to js::Swap(a,b) (in > jsutil.h) which has an overload for js::Value that does the uint32 swaps. > If you're tired of tweaking this, feel free to ignore; just a suggestion. > Adding a reusable Swap(a,b) does seem nice, though. I file a separated bug about it and other sorting improvements/fixes. > > +struct SortComparatorStrings { > > Can all the {'s go on a newline? It matches function definitions. > > @@ +2129,2 @@ > > ~AutoFreeVector() { > > + Foreground::free_(vec); > > Since you are already touching this, can you just replace this with a > js::Vector? I will replace it with AutoVectorRooter.

Igor Bukanov

Assignee

Updated

•

14 years ago

Blocks: 703446

Igor Bukanov

Assignee

Updated

•

14 years ago

Blocks: 703451

Igor Bukanov

Assignee

Comment 17

•

14 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/e1587f23d2f0

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 18

•

14 years ago

(In reply to Igor Bukanov from comment #17) > https://hg.mozilla.org/integration/mozilla-inbound/rev/e1587f23d2f0 Burning.

Igor Bukanov

Assignee

Comment 19

•

14 years ago

(In reply to Ms2ger from comment #18) > (In reply to Igor Bukanov from comment #17) > > https://hg.mozilla.org/integration/mozilla-inbound/rev/e1587f23d2f0 > > Burning. I backed out that and landed a version with trivial fixes for 64 bits: https://hg.mozilla.org/integration/mozilla-inbound/rev/9f758d9c051e

Ed Morley [:emorley]

Comment 20

•

14 years ago

Talos Regression :( Dromaeo (CSS) decrease 1.46% on XP Mozilla-Inbound-Non-PGO https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/EM2bCJ0Rk9M

Ed Morley [:emorley]

Comment 21

•

14 years ago

https://hg.mozilla.org/mozilla-central/rev/9f758d9c051e (Although is the regression in comment 20 expected?)

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla11

Igor Bukanov

Assignee

Comment 22

•

14 years ago

(In reply to Ed Morley [:edmorley] from comment #20) > Talos Regression :( Dromaeo (CSS) decrease 1.46% on XP > Mozilla-Inbound-Non-PGO > > https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/ > EM2bCJ0Rk9M Dromaeo (CSS) does not call Array.sort. I suppose the extra code bloat coming from the template expansion could made things slower, via, for example, moving the array methods that the benchmark calls into different pages.

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 23

•

14 years ago

Comment on attachment 574887 [details] [diff] [review] v4 >--- /dev/null >+++ b/js/src/ds/Sort.h >+ * The Initial Developer of the Original Code is >+ * Mozilla Corporation. "the Mozilla Foundation."

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

14 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jeff Walden [:Waldo]

Comment 24

•

14 years ago

How about you just hop on #jsapi, find someone to rubber-stamp the change, and push it without passive-aggressively reopening the bug seemingly just for that? I mean, really. :-)

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Resolution: --- → FIXED

v1 14 years ago Igor Bukanov 30.32 KB, patch		Details \| Diff \| Splinter Review
v2 14 years ago Igor Bukanov 29.69 KB, patch		Details \| Diff \| Splinter Review
v2 for real 14 years ago Igor Bukanov 30.40 KB, patch		Details \| Diff \| Splinter Review
using standard error convention 14 years ago Igor Bukanov 29.44 KB, patch		Details \| Diff \| Splinter Review
v4 14 years ago Igor Bukanov 33.21 KB, patch	luke : review+	Details \| Diff \| Splinter Review