Last Comment Bug 701560 - template version of merge sort
: template version of merge sort
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla11
Assigned To: Igor Bukanov
:
Mentors:
Depends on:
Blocks: 703446 703451
  Show dependency treegraph
 
Reported: 2011-11-10 15:46 PST by Igor Bukanov
Modified: 2012-02-01 13:56 PST (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
v1 (30.32 KB, patch)
2011-11-10 16:09 PST, Igor Bukanov
no flags Details | Diff | Review
v2 (29.69 KB, patch)
2011-11-14 01:20 PST, Igor Bukanov
no flags Details | Diff | Review
v2 for real (30.40 KB, patch)
2011-11-14 01:27 PST, Igor Bukanov
no flags Details | Diff | Review
using standard error convention (29.44 KB, patch)
2011-11-15 13:29 PST, Igor Bukanov
no flags Details | Diff | Review
v4 (33.21 KB, patch)
2011-11-16 05:45 PST, Igor Bukanov
luke: review+
Details | Diff | Review

Description Igor Bukanov 2011-11-10 15:46:23 PST
For the bug 669245 and other GC-related patches I need a sort function. Currently jsarray.cpp provides an implementation of the merge sort. However, that uses a function pointer for the compare operation. That is be expensive especially for simple compare operations. It would be nice to have a template version of the sort implementation to avoid the overhead of indirect function calls.
Comment 1 Igor Bukanov 2011-11-10 16:09:01 PST
Created attachment 573685 [details] [diff] [review]
v1

The patch provides the templated version of the merge sort and switches jsarray.cpp to use it. 

To measure its effects I use the following benchmark that sorts an array with 2e6 strings:

var a = [];
for (var i = 2e6; i != 0; --i)
    a.push(""+i);

var t = Date.now();
a.sort();
t = Date.now() - t;
print("Time: "+t);

Without the patch the running time is 724 ms, with the patch that drops to 451 ms.
Comment 2 Emanuel Hoogeveen [:ehoogeveen] 2011-11-12 08:24:39 PST
Iterative merge sort, cool :)

1) Since the scratch space is callee-allocated it doesn't matter for performance, but you could drop the bool * parameter if you adjusted the insertion sort limit (and the starting run size) based on whether ceil(log2(nelem)) (or its integer equivalent) is even or odd. That might introduce a little bit of slowdown by making INS_SORT_LIMIT dynamic though.
2) Could C++11 move semantics (e.g. *dst++ = move(*a++)) be used to avoid deep copying for sorting strings? I don't know if Mozilla's build environments all support those.
3) If comparisons can be expensive (e.g. almost-equal strings), have you considered using binary merging for cases where e.g. run1 > 8 * run2? If the majority of overhead is in the sorting function itself (e.g. sorting POD types) this won't help.
Comment 3 Emanuel Hoogeveen [:ehoogeveen] 2011-11-12 08:32:06 PST
Ah, I see you do copy the result back over to src a bit further down. So yes, you can avoid that copy by doing what I proposed in 1). (sorry about the bugspam)
Comment 4 Igor Bukanov 2011-11-14 01:03:36 PST
(In reply to Emanuel Hoogeveen from comment #2)

> 1) Since the scratch space is callee-allocated it doesn't matter for
> performance, but you could drop the bool * parameter if you adjusted the
> insertion sort limit (and the starting run size) based on whether
> ceil(log2(nelem)) (or its integer equivalent) is even or odd. That might
> introduce a little bit of slowdown by making INS_SORT_LIMIT dynamic though.

The goal of this bug is to replace in the existing implementation of the merge sort the function pointer comparator with a template compare operation to make things faster especially with a trivial compare operation. Optimizations like the above suggestions is orthogonal to that especially since they may slowdown things. For example, I have remember from benchmarks that the increasing the insertion sort size can make things slower, see bug 224128 comment 25 and later. 

So I will file a bug to consider tuning the sort including investigating decreasing the extra storage requirement to n/2.

> 2) Could C++11 move semantics (e.g. *dst++ = move(*a++)) be used to avoid
> deep copying for sorting strings? I don't know if Mozilla's build
> environments all support those.

AFAIC we cannot so far use C++11 freely.

> 3) If comparisons can be expensive (e.g. almost-equal strings), have you
> considered using binary merging for cases where e.g. run1 > 8 * run2? If the
> majority of overhead is in the sorting function itself (e.g. sorting POD
> types) this won't help.

A compare operation is rather expensive if it is done using user-supplied JS function. So anything that decrease their number should be helpful. But that should not harm other cases.
Comment 5 Igor Bukanov 2011-11-14 01:20:46 PST
Created attachment 574244 [details] [diff] [review]
v2

The new version replaces in the comparator class operator()(a, b) with lessOrEqual(a, b) to make clearer the meaning of the compare operation.
Comment 6 Igor Bukanov 2011-11-14 01:27:00 PST
Created attachment 574245 [details] [diff] [review]
v2 for real

The previous attachment had a wrong patch.
Comment 7 Emanuel Hoogeveen [:ehoogeveen] 2011-11-14 02:45:08 PST
(In reply to Igor Bukanov from comment #4)
> The goal of this bug is to replace in the existing implementation of the
> merge sort the function pointer comparator with a template compare operation
> to make things faster especially with a trivial compare operation.
> Optimizations like the above suggestions is orthogonal to that
You're right, I don't mean to hold up this bug for it (which is clearly an improvement over the status quo).

I'll do some testing at home, see if I can setup a build environment, and file a follow-up if I can demonstrate improvements after this lands :)
Comment 8 Luke Wagner [:luke] 2011-11-14 09:21:58 PST
Comment on attachment 574245 [details] [diff] [review]
v2 for real

Great, the patch will make things much nicer.  Can we avoid this whole "error()" business and have the return value of MergeSort and the comparator indicate success/fail?  Thus lessOrEqual would return its success-value through an outparam.
Comment 9 Igor Bukanov 2011-11-14 13:39:33 PST
(In reply to Luke Wagner [:luke] from comment #8)
> Comment on attachment 574245 [details] [diff] [review] [diff] [details] [review]
> v2 for real
> 
>  Can we avoid this whole
> "error()" business and have the return value of MergeSort and the comparator
> indicate success/fail?

In that case if the compare function is not available at the point of template instantiation or it cannot be inlined, then the compiler cannot see that the compare function is infallible and would generate useless error checking code. With separated error function that can be inlined  this is avoided.

An alternative to the error() is an extra template parameter that specifies the return type of the compare operation. It can be either bool to specify an infallible operation or some 3-state type. I tried that, but it quickly became rather awkward with the need to map that extra type to void in case of infallible compare or to bool in case of fallible operation.
Comment 10 Luke Wagner [:luke] 2011-11-14 14:46:18 PST
(In reply to Igor Bukanov from comment #9)
> In that case if the compare function is not available at the point of
> template instantiation or it cannot be inlined, then the compiler cannot see
> that the compare function is infallible and would generate useless error
> checking code.

For all the cases where this actually matters it should be easy (and already done) to ensure that the definition of the comparator is visible and inlined; that's the whole point, right?  It would be crazy to gunk up the whole code with this or the other proposed non-standard error-handling strategy for such a miniscule and improbable situation.
Comment 11 Igor Bukanov 2011-11-15 08:44:35 PST
(In reply to Luke Wagner [:luke] from comment #10)
> For all the cases where this actually matters it should be easy (and already
> done) to ensure that the definition of the comparator is visible and
> inlined; that's the whole point, right?  It would be crazy to gunk up the
> whole code with this or the other proposed non-standard error-handling
> strategy for such a miniscule and improbable situation.

I tried to implement a patch that would use a stl style comparator:

bool operator()(const T &a, const T &b, bool *lessOrEqualp);

However, that made a benchmark that similar to one in the comment 1 to run in about 569 ms, not 548 ms. as it does with the last patch. I also tried a version that uses 3-state result, like error, false, true:

int operator()(const T &a, const T &b);

That improved the running time slightly to 560 ms, but still this is slower than 548.

So for some reasons the GCC 4.6 on 32 bit can optimize the error() method better.

So is 10% reduction in speed is acceptable price for the following a standard error-handling convention?
Comment 12 Luke Wagner [:luke] 2011-11-15 08:53:16 PST
(In reply to Igor Bukanov from comment #11)
> So is 10% reduction in speed is acceptable price for the following a
> standard error-handling convention?

I don't see why there would be a speed difference if everything was being inlined.  Did you try JS_ALWAYS_INLINE'ing the comparator?  Failing that, it would be interesting to see the asm diff.
Comment 13 Igor Bukanov 2011-11-15 13:29:24 PST
Created attachment 574668 [details] [diff] [review]
using standard error convention

This is the patch that I use to compare with v2. It requires for the comparator to implement:

bool operator()(const T &a, const T &b, bool *lessOrEqualp);

If I add JS_ALWAYS_INLINE on top of it, then the test timing decreases from 569 to 558. That matches the version that uses 3-state return value. But that does not match 548 that I got from that error function.

Assembly diff shows AFAICS that with that error() function GCC uses less stack variables and keeps more stuff in registers.
Comment 14 Igor Bukanov 2011-11-16 05:45:36 PST
Created attachment 574887 [details] [diff] [review]
v4

The new version uses the standard style of the error handling in the comparator signature but tunes the merge function implementation to minimize the amount of generated code. With that change the sorting time for the test case drops from 569 to 522 ms, which is significantly better than for the previous patch with non-standard error propagation (548 ms). The assembly shows that with GCC 4.6 on 32-bit x86 code everything is inlined even without JS_ALWAYS_INLINE on the compare functions (just inline is enough).

For references, without the patch the sorting time is 839 ms.
Comment 15 Luke Wagner [:luke] 2011-11-16 18:13:34 PST
Comment on attachment 574887 [details] [diff] [review]
v4

Review of attachment 574887 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ds/Sort.h
@@ +135,5 @@
> +                if (lessOrEqual)
> +                    break;
> +                T tmp = array[j - 1];
> +                array[j - 1] = array[j];
> +                array[j] = tmp;

Incidentally, gcc seems really bad manipulating uint64 by value on 32-bit.  I just looked with -O3 and got something awful.  When I broke it down into two manual uint32 swaps, the resulting code was 6 memory movs shorter.  This suggests replacing all these inlined swaps with a call to js::Swap(a,b) (in jsutil.h) which has an overload for js::Value that does the uint32 swaps.  If you're tired of tweaking this, feel free to ignore; just a suggestion.  Adding a reusable Swap(a,b) does seem nice, though.

@@ +157,5 @@
> +                return false;
> +        }
> +        T *swap = vec1;
> +        vec1 = vec2;
> +        vec2 = swap;

... and this could use Swap() as well.

::: js/src/jsarray.cpp
@@ +1988,5 @@
> +    *lessOrEqualp = (result <= 0);
> +    return true;
> +}
> +
> +struct SortComparatorStrings {

Can all the {'s go on a newline?  It matches function definitions.

@@ +2129,2 @@
>              ~AutoFreeVector() {
> +                Foreground::free_(vec);

Since you are already touching this, can you just replace this with a js::Vector?

@@ +2216,5 @@
>                      vec[2 * i].setString(str);
>                  } while (i != 0);
>  
> +                Value *extraScratch = static_cast<Value *>(cx->malloc_(newlen * 2 * sizeof(Value)));
> +                if (!extraScratch)

Also could you use a Vector here?
Comment 16 Igor Bukanov 2011-11-17 15:34:08 PST
(In reply to Luke Wagner [:luke] from comment #15)
> This
> suggests replacing all these inlined swaps with a call to js::Swap(a,b) (in
> jsutil.h) which has an overload for js::Value that does the uint32 swaps. 
> If you're tired of tweaking this, feel free to ignore; just a suggestion. 
> Adding a reusable Swap(a,b) does seem nice, though.

I file a separated bug about it and other sorting improvements/fixes.


> > +struct SortComparatorStrings {
> 
> Can all the {'s go on a newline?  It matches function definitions.
> 
> @@ +2129,2 @@
> >              ~AutoFreeVector() {
> > +                Foreground::free_(vec);
> 
> Since you are already touching this, can you just replace this with a
> js::Vector?

I will replace it with AutoVectorRooter.
Comment 18 :Ms2ger 2011-11-18 06:38:23 PST
(In reply to Igor Bukanov from comment #17)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/e1587f23d2f0

Burning.
Comment 19 Igor Bukanov 2011-11-18 07:11:30 PST
(In reply to Ms2ger from comment #18)
> (In reply to Igor Bukanov from comment #17)
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/e1587f23d2f0
> 
> Burning.

I backed out that and landed a version with trivial fixes for 64 bits:

https://hg.mozilla.org/integration/mozilla-inbound/rev/9f758d9c051e
Comment 20 Ed Morley [:emorley] 2011-11-18 18:25:56 PST
Talos Regression :( Dromaeo (CSS) decrease 1.46% on XP Mozilla-Inbound-Non-PGO

https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/EM2bCJ0Rk9M
Comment 21 Ed Morley [:emorley] 2011-11-19 05:13:59 PST
https://hg.mozilla.org/mozilla-central/rev/9f758d9c051e

(Although is the regression in comment 20 expected?)
Comment 22 Igor Bukanov 2011-11-19 14:43:54 PST
(In reply to Ed Morley [:edmorley] from comment #20)
> Talos Regression :( Dromaeo (CSS) decrease 1.46% on XP
> Mozilla-Inbound-Non-PGO
> 
> https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/
> EM2bCJ0Rk9M

Dromaeo (CSS) does not call Array.sort. I suppose the extra code bloat coming from the template expansion could made things slower, via, for example, moving the array methods that the benchmark calls into different pages.
Comment 23 :Ms2ger 2011-11-20 11:26:56 PST
Comment on attachment 574887 [details] [diff] [review]
v4

>--- /dev/null
>+++ b/js/src/ds/Sort.h
>+ * The Initial Developer of the Original Code is
>+ * Mozilla Corporation.

"the Mozilla Foundation."
Comment 24 Jeff Walden [:Waldo] (remove +bmo to email) 2011-11-20 12:19:19 PST
How about you just hop on #jsapi, find someone to rubber-stamp the change, and push it without passive-aggressively reopening the bug seemingly just for that?  I mean, really.  :-)

Note You need to log in before you can comment on or make changes to this bug.