Closed Bug 715181 Opened 13 years ago Closed 8 years ago

Self-host Array.sort

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED FIXED
mozilla46
Tracking Status
firefox46 --- fixed

People

(Reporter: mingyiliu, Assigned: mrrrgn, Mentored)

References

(Blocks 3 open bugs)

Details

Attachments

(6 files, 12 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Build ID: 20111220165912

Steps to reproduce:

Please refer to http://jsfiddle.net/LzUmL/6/

I created a simple array with 200k random integers. Ran array.sort() or array.sort(function(a,b){return a-b}) on it in Firefox vs. Chrome. In my example, I also provided an option to run function(a,b){return a-b} 200K*log(200K) times.
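For reference, the fiddle boils down to roughly the following (a simplified sketch, not the exact fiddle code):

    // Build 200k random integers, then time the default sort and the
    // comparator-based numeric sort on copies of the same data.
    var n = 200000;
    var arr1 = [];
    for (var i = 0; i < n; i++)
      arr1.push(Math.floor(Math.random() * n));
    var arr2 = arr1.slice();

    var t0 = Date.now();
    arr1.sort();                                   // default (lexicographic) sort
    var t1 = Date.now();
    arr2.sort(function (a, b) { return a - b; });  // numeric sort via comparator
    var t2 = Date.now();
    console.log("default:", t1 - t0, "ms; comparator:", t2 - t1, "ms");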


Actual results:

Firefox is consistently several fold slower in array.sort(function(a,b){return a-b}), but only moderately slower in array.sort().  Calling function(a,b){return a-b} 200K*log(200K) times took almost no time, however.  Therefore this suggests that it's not function call overhead (on this particular simple function) that makes FF so much slower at sorting numerically.


Expected results:

I expect FF to perform similarly to Chrome for array.sort in all situations, instead of being several fold slower at sorting numerical arrays, which affects our data-intensive JS app a lot. BTW, in a stackoverflow user's test (http://jsfiddle.net/u8t2a/122/), the user's implementation of quicksort in pure JS ran 10+ fold faster than FF's native array.sort (even without using a typed array it's ~5 fold faster, see http://jsfiddle.net/u8t2a/123/). It's shocking and suggests room for improvement in FF's array.sort.
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Depends on: 462300
It mainly seems to suggest that we should consider self-hosting sort(), as Tom noted, because calling from C++ back into JS is just slow.

In particular, looking at a profile of the testcase I'm about to attach (percentages rounded to nearest integer; may not add up to 100):

    4% self time in array_sort
   20% of the time is in (not just under!) SortComparatorFunction::operator()
   12% is in (not under) js::InvokeKernel
    5% is in js::RunScript
   18% is in js::types::TypeMonitorCallSlow
    6% is in JaegerShot (or mjit code)
    6% is in EnterMethodJIT (or mjit code)
    5% is in JaegerTrampoline
    5% is in PushActiveVMFrame
    2% is in SetVMFrameRegs
    4% is in PopActiveVMFrame
    8% is in js::ContextStack::pushInvokeFrame
    5% is in js::ContextStack::popFrame

almost all of this is overhead junk; the relevant part is something like 16% at most (the mjit code and array_sort).  With self-hosting, all that would go away.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Firefox sorts numerical arrays much slower than Chrome → Firefox sorts numerical arrays much slower than Chrome (self-host Array.sort?)
Self-hosting sort() when given a comparator function sounds great (for replace() too!).  I'd just like to point out that sort() w/o a comparator is still much faster in C++.  But what about comment 0?  In the testcase, the array is full of small ints for which v8 has a special array representation AND a special case in the comparator which does a lexicographic compare on the ints rather than calling ToString (which does a GC-thing allocation for the temporary value!).  If I add the analogous comparator special case to SM, we go from slightly slower than v8 to 10x faster (and that is using IntToCString+strcmp, we could do much better).  For doubles (where both SM and V8 convert to strings), we are 4.4x faster.
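To make the fast path concrete, here is a hedged JS sketch of what "lexicographic compare on the ints" means, i.e. comparing two non-negative integers by their decimal-string order without actually allocating the strings (the actual change Luke describes lives in C++):

    // Compare the decimal representations of two non-negative integers
    // lexicographically, without building the strings.
    function numDigits(n) {
      var d = 1;
      while (n >= 10) { n = Math.floor(n / 10); d++; }
      return d;
    }
    function lexCompareInts(a, b) {
      var da = numDigits(a), db = numDigits(b);
      while (da > 0 && db > 0) {
        var pa = Math.pow(10, da - 1), pb = Math.pow(10, db - 1);
        var ma = Math.floor(a / pa), mb = Math.floor(b / pb);  // leading digits
        if (ma !== mb) return ma < mb ? -1 : 1;
        a -= ma * pa; b -= mb * pb;
        da--; db--;
      }
      // One number's representation is a prefix of the other's:
      // the shorter string sorts first.
      return da === db ? 0 : (da === 0 ? -1 : 1);
    }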
That sounds really promising! I'd love to see the proposed hybrid solution make its way into an FF release soon, that'd be awesome!

I did a quick test (http://jsfiddle.net/u8t2a/124/): the C++ sort is about 7% faster than the pure JS sort on average on my laptop when no comparator is used. So SM is at least optimized enough for JS execution that the C++ version is not winning by a large margin. But still, even a 7% speed difference is a good enough reason to use the C++ version when no comparator is provided.
It's sort of doubtful that it'll be soon; there's no existing infrastructure for self-hosting methods.
(In reply to Mingyi Liu from comment #4)
> I did a quick test (http://jsfiddle.net/u8t2a/124/), the C++ sort is about
> 7% faster than pure JS sort on average on my laptop when no comparator's
> used.

I'm glad to hear that, but it's also not an apples-to-apples comparison since if I am reading it correctly, quickSortIP is sorting the array arithmetically.  Tweaking the C++ to use an arithmetic sort goes 10x faster.  (As noted above, and filed today in bug 715265, we should still be able to go about 10x faster with lexicographic sort on an all-int array.)
Depends on: 715419
Depends on: 715422
A few more optimizations jump out: bug 715419 and bug 715422.  Although self-hosting should really help the general case, these two bugs would handle the cases we've seen here and go significantly faster than anything self-hosted.
(In reply to Luke Wagner [:luke] from comment #6)
> I'm glad to hear that, but it's also not an apples-to-apples comparison
> since if I am reading it correctly, quickSortIP is sorting the array
> arithmetically.  Tweaking the C++ to use an arithmetic sort goes 10x faster.
> (As noted above, and filed today in bug 715265, we should still be able to
> go about 10x faster with lexicographic sort on an all-int array.)

Not exactly. quickSortIP did the comparisons using '<' which is overloaded for numerical or alphabetical sort. Basically out of the examples I listed:

http://jsfiddle.net/u8t2a/122/ compares quickSortIP sorting typed Int32 array vs. FF's native array.sort + integer comparator on a normal integer array (the former is 10+ fold faster); 
http://jsfiddle.net/u8t2a/123/ compares quickSortIP sorting a normal integer array vs. FF's native array.sort + integer comparator on a normal integer array (the former is still 5+ fold faster); 
http://jsfiddle.net/u8t2a/124/ compares quickSortIP sorting a normal String array vs. FF's native array.sort on a normal String array (the former is 7% slower).

The quickSortIP could be modified to take a second comparator function ref argument to deal with non-integer and non-string sorting. So in the case that self-hosting is not going to get into release any time soon, one could use quickSortIP.
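For illustration, an in-place quicksort that takes a comparator could look roughly like this (a hedged sketch, not the actual quickSortIP from the fiddles; note it is not stable, unlike SpiderMonkey's current merge sort):

    // Sketch: in-place quicksort of array[lo..hi] using an explicit comparator.
    function quickSortInPlace(array, lo, hi, cmp) {
      while (lo < hi) {
        var pivot = array[lo + ((hi - lo) >> 1)];
        var i = lo, j = hi;
        while (i <= j) {
          while (cmp(array[i], pivot) < 0) i++;
          while (cmp(array[j], pivot) > 0) j--;
          if (i <= j) {
            var t = array[i]; array[i] = array[j]; array[j] = t;
            i++; j--;
          }
        }
        // Recurse into the smaller half, loop on the larger one to bound stack depth.
        if (j - lo < hi - i) {
          quickSortInPlace(array, lo, j, cmp);
          lo = i;
        } else {
          quickSortInPlace(array, i, hi, cmp);
          hi = j;
        }
      }
    }
    // usage: quickSortInPlace(arr, 0, arr.length - 1, function (a, b) { return a - b; });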

BTW, I thought some FF features, such as text search on a page, are implemented in pure JS? Of course, that is a somewhat different case, so array.sort can't be done the same way, I reckon?
(In reply to Mingyi Liu from comment #8)
Ah, I see what you were saying.  Yes, that is interesting!  Even more interesting: I tried to dig into this a bit with a shell testcase but the results are mystifying:
  - on an opt trunk js shell (-m -n), the builtin sort is 7% *slower* than quickSortIP (both on a MBP and Linux Core i7)
  - shark shows all of the builtin's time is in js::CompareStrings yet instrumenting js::CompareStrings shows that it gets called *less* for builtin than quickSortIP
  - cachegrind shows quickSortIP executes almost twice as many instructions, twice as many data refs, and equal branch mispredicts

So.. perhaps there is some micro-architectural pitfall the MergeSort is hitting?  Very weird.
Blocks: 739016
Depends on: 784288
Depends on: 784294
No longer depends on: 784288
Blocks: 784288
No longer depends on: 784294
As discussed on IRC, we should add a generic self-hosted implementation of Array.sort that gets invoked by the native version if a sorting function is supplied that we don't successfully pattern-match as a numeric sort.

A first step towards this is to implement a proper, spec-compliant Array.sort and measure its performance on various micro-benchmarks.
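For reference, the spec's SortCompare boils down to roughly the following once holes have been dealt with (a sketch of the spec steps in self-hosted style, not the eventual patch; ToString here stands for the self-hosting intrinsic):

    function SortCompare(x, y, comparefn) {
      // undefined values always sort to the end, without calling comparefn
      if (x === undefined && y === undefined) return 0;
      if (x === undefined) return 1;
      if (y === undefined) return -1;
      if (comparefn !== undefined)
        return comparefn(x, y);
      // default comparison: order by the ToString() results
      var xs = ToString(x), ys = ToString(y);
      if (xs < ys) return -1;
      if (xs > ys) return 1;
      return 0;
    }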
Assignee: general → sankha93
Status: NEW → ASSIGNED
OS: Windows XP → All
Hardware: x86 → All
Summary: Firefox sorts numerical arrays much slower than Chrome (self-host Array.sort?) → Self-host Array.sort
Whiteboard: [mentor=tschneidereit]
Version: 9 Branch → Trunk
Attached patch WIP Patch v0 (obsolete) — Splinter Review
This is a WIP patch; I can't even call it v1.

I tried to implement TimSort while following the spec. The majority of the TimSort code is from https://github.com/bellbind/stepbystep-timsort/blob/master/30timsort.js.

Can you take a look at the ArraySort function and check whether I am doing the correct thing according to the spec?
Attachment #748663 - Flags: feedback?(tschneidereit)
(In reply to Sankha Narayan Guria [:sankha93] from comment #12)
> I tried to implement TimSort while following the spec. Majority of the
> TimSort code is from
> https://github.com/bellbind/stepbystep-timsort/blob/master/30timsort.js.

I quickly looked at the github project and, sadly, can't see any license for the code. Without a compatible license (Public Domain, MIT, 3-clause BSD or MPL2, I guess), we can't use the code at all.

Can you check with bellbind and ask him whether he's willing to license the code in a way that's usable for us?
bellbind verified that the code was written by him and added language to the README specifically putting it into the public domain. So we're good to use this as we see fit.

We should still give credit to him, but we don't need to adhere to any specific format.

Can you take a look at bug 720677 and see if there's interesting information to be gleaned from that implementation, too?

Also, that bug has attachments that provide extremely thorough performance tests for variously sized Arrays with different degrees of pre-sortedness. Running those would be very helpful.
Comment on attachment 748663 [details] [diff] [review]
WIP Patch v0

Review of attachment 748663 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing f? to clean up my dashboard.
Attachment #748663 - Flags: feedback?(tschneidereit)
Attached patch WIP Patch v1 (obsolete) — Splinter Review
This has a somewhat working TimSort, in the sense that it does not follow the spec completely and a large number of test cases are failing; it also needs a comparator function to run. I ran the benchmarks on http://jsperf.com/javascript-sort/37

Without patch: 9190 ops/sec
With patch: 16027 ops/sec
Attachment #748663 - Attachment is obsolete: true
Attachment #765738 - Flags: feedback?(tschneidereit)
Assignee: sankha93 → till
Comment on attachment 765738 [details] [diff] [review]
WIP Patch v1

Review of attachment 765738 [details] [diff] [review]:
-----------------------------------------------------------------

Finally! Sorry this took so long.

I have attached a patch that changes js::array_sort to invoke your self-hosted implementation if we know that the sorting process has to invoke the compare function.

I verified that that works, but quickly got stuck while testing things. There seem to be some issues with how the compare function is made available to the functions contained within ArraySort, causing this input to fail:
"[1,4,3, 20].sort(function(a, b){return a > b});" //ignore the bogus results this would give

It seems as if the closure property `c` isn't available to the contained functions, for reasons unclear to me.

However, instead of fixing that, please do something entirely different: move all the functions out of ArraySort and explicitly pass all the required arguments to them. That way, the compiler will be able to optimize the entire thing much better. Please give all those functions names like ArraySort_binarySort.

You can find explanations of why that is in [1] and [2].

Style nit: no space between function name and left brace. Other than that, looks very decent.

For now, I'd ask you to fix these issues and get tests working. We can deal with algorithm efficiency and edge case-correctness later.


[1]: http://en.wikipedia.org/wiki/Lambda_lifting
[2]: http://stackoverflow.com/questions/592584/what-is-lambda-lifting/592634
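To illustrate the lifting with made-up names (a hypothetical example, not the patch's code):

    // Before: the helper closes over the comparator `c`.
    function ArraySort(c) {
      function lessThan(a, b) { return c(a, b) < 0; }   // closure over c
      // ... uses lessThan(x, y) ...
    }

    // After lifting: the helper is a top-level function and `c` is passed
    // explicitly, so the compiler can optimize it independently of ArraySort.
    function ArraySort_lessThan(c, a, b) { return c(a, b) < 0; }
    function ArraySort(c) {
      // ... uses ArraySort_lessThan(c, x, y) ...
    }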
Attachment #765738 - Flags: feedback?(till)
Assignee: till → sankha93
Attached patch benchmark (obsolete) — Splinter Review
This is a simple test I wrote to measure the performance in the JS Shell. It is giving the results:

Array Sort Avg: 1.4
Tim Sort Avg: 8.39
Attached patch WIP Patch v2 (obsolete) — Splinter Review
I have made the changes that you suggested. This patch is easier to read, and has fewer failures in the jit-test and js-test suites.

I later moved it to Array.prototype.timsort so that I could compare it directly to Array.prototype.sort. The JS shell benchmark is showing one story, while a test I also wrote on jsPerf tells the opposite: http://jsperf.com/arrtimsort

On jsPerf I am getting:

Array Sort: 9,766 ±0.88%, 17% slower
Tim Sort: 11,738 ±0.71%, fastest
Attachment #765738 - Attachment is obsolete: true
Attachment #775280 - Flags: feedback?(till)
Comment on attachment 775280 [details] [diff] [review]
WIP Patch v2

Review of attachment 775280 [details] [diff] [review]:
-----------------------------------------------------------------

Looking much better, but still a ways off from being finished.

I've added some inline comments below, but the more important overriding one is:

This implementation creates lots of temporary objects. Both by copying values into new arrays using Array#slice (example: line 173) and, probably less importantly, by creating temporary state objects (example: line 166).

There are formulations of timsort (well, of mergesort, at least) that don't need a temporary array at all. These are very annoying and likely to be so much slower that it's not worth it anyway. At the very least, however, you should do an implementation that creates the required scratch array at the beginning and operates by using different ranges of that space. And you should create as few temporary state objects as possible, either by re-using them or by passing all required variables explicitly.
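A minimal sketch of the scratch-array idea (my illustration, not part of the patch): allocate the scratch space once, then every merge of two adjacent runs only copies the left run into it.

    // Merge the adjacent runs [lo, mid) and [mid, hi) back into `array`,
    // using a scratch array that was allocated once before the merge passes.
    function mergeWithScratch(array, lo, mid, hi, scratch, lessThanEqual) {
      for (var i = lo; i < mid; i++)       // copy the left run out of the way
        scratch[i - lo] = array[i];
      var left = 0, leftEnd = mid - lo;
      var right = mid, out = lo;
      while (left < leftEnd && right < hi) {
        // taking from the left run on ties keeps the sort stable
        array[out++] = lessThanEqual(scratch[left], array[right])
                       ? scratch[left++]
                       : array[right++];
      }
      while (left < leftEnd)
        array[out++] = scratch[left++];
      // any remaining right-run elements are already in their final positions
    }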

Also, make sure that all variables you're using are actually defined. It looks like you forgot to properly lift some of them. That can happen easily and is yet another reason why we should activate strict mode for all self-hosted code. I should do the necessary perf testing and land bug 784295 if that turns out well.

Sorry this is getting more complicated by the day. (Efficient) sorting is hard, sadly. And I didn't even start with the spec-lawyering. ;)

::: js/src/builtin/Array.js
@@ +3,5 @@
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  
> +#define MIN_GALLOP 7
> +
> +var ArraySort_timSort = function (array, first, last, lessThan) {

Just use `function ArraySort_*` here and for all the other ArraySort_* functions.

@@ +37,5 @@
> +};
> +
> +// cut array to monotonic chunks named (natural) "run"
> +var ArraySort_nextRun = function (state) {
> +    if (state.remain >= state.last) return false;

Nit: `return false` on new line

@@ +71,5 @@
> +    if (last - state.remain < state.minrun) {
> +        // replace binary sorted minrun
> +        var minrun = state.remain + state.minrun;
> +        var sortStart = last;
> +        last = (minrun > state.last) ? state.last : minrun;

Nit: std_Math_min

@@ +92,5 @@
> +
> +// reverse elements in array range
> +var ArraySort_reverse = function (array, first, last) {
> +    last--;
> +    while (first < last) {

Nit: no braces needed for single-line loop body

@@ +125,5 @@
> +// merge two chunks
> +var ArraySort_mergeTwoRuns = function (state) {
> +    if (state.runStack.length > 2 &&
> +        state.runStack[state.runStack.length - 3].length <
> +        state.runStack[state.runStack.length - 1].length) {

Nit: brace on new line for multi-line condition

@@ +257,5 @@
> +    // packed states of the function
> +    var m = {};
> +    // escape shorter buffer only
> +    m.left = array
> +    m.lcur = connect; m.lfirst = first;

Nit: Don't pack statements on one line like this

@@ +382,5 @@
> +    return array;
> +};
> +
> +var ArraySort_builtinLessThan = function (a, b) {
> +    if (ArraySort_comparefn(a, b) < 0)

Nit: simplify to `return ArraySort_comparefn(a, b) < 0;`

@@ +390,5 @@
> +
> +var ArraySort_comparefn = function (j, k) {
> +    var jString = ToString(j);
> +    var kString = ToString(k);
> +    var hasj = HasProperty(obj, jString)

Where does 'obj' come from? Did you forget to lift it?

No need for the `ToString` and `HasProperty` stuff. Use instead:
var hasJ = j in obj;
var hasK = k in obj;

The engine will do the right thing, here. In a faster way, even.

@@ +391,5 @@
> +var ArraySort_comparefn = function (j, k) {
> +    var jString = ToString(j);
> +    var kString = ToString(k);
> +    var hasj = HasProperty(obj, jString)
> +    if (!hasj && !hask)

Where does `hask` come from?

@@ +398,5 @@
> +        return 1;
> +    if (!hask)
> +        return -1;
> +    var x = obj[jString];
> +    var y = obj[kString];

Nit: `x` and `y` aren't very descriptive. How about `jVal` and `kVal`?

@@ +405,5 @@
> +    if (x === undefined)
> +        return 1;
> +    if (y === undefined)
> +        return -1;
> +    if (compareFn !== undefined) {

Where does `compareFn` come from, here? Looks like you didn't properly lift this access, right?

@@ +406,5 @@
> +        return 1;
> +    if (y === undefined)
> +        return -1;
> +    if (compareFn !== undefined) {
> +        if (!IsCallable(compareFn))

Hoist this check into `ArraySort`. No need to call it more than once. Here, you should probably use state.compareFn, propagated in whichever way. (i.e., either give the entire state to ArraySort_comparefn, or add the predicate to its arguments explicitly.)

@@ +430,5 @@
> +        lessThan = function (a, b) {
> +            if (comparefn(a, b) < 0)
> +                return true;
> +            return false;
> +            };

I think wrapping this into a new function just for convenience is very heavy-handed. Moreso for the lessThanEqual variant, which contains two indirections. Ideally, the JIT should inline through both of those, but I wouldn't rely on it.

Instead, pass comparefn directly to ArraySort_timSort and add it to the state object. Replace all callsites with `comparefn() < 0` for lessThan, and `compareFn() <= 0` for lessThanEqual.
Attachment #775280 - Flags: feedback?(till) → feedback-
Attached file Shell benchmark
Oh, and I forgot to comment on the benchmarks.

Never (as in NEVER) use jsperf.com. It doesn't work. It makes you believe it does, but behind your back, it tries cheating at poker while you're building your benchmarks. I'm trying to build an alternative, but until that's ready, use shell benchmarks.

Of those, I have attached a more extensive version of yours. It's probably still way too specific, but it shows one important flaw of your benchmark: the predicate you used is exactly the one that SpiderMonkey detects and turns into a C++-implemented numeric sort. As such, it's no wonder that Array.sort is so much faster than Array.timsort.

A slight modification of the predicate to trip up the pattern detection shows that the race is much closer. Given that the timsort implementation currently isn't likely to do what one would think it does (because of the non-lifted and thus undefined local vars), this probably doesn't mean much.
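Concretely (as also noted later in this bug), a comparator of exactly this shape is recognized and replaced with the native numeric sort:

    arr.sort(function (a, b) { return a - b; });       // pattern-matched, C++ numeric sort

while a trivially different predicate defeats the pattern match and exercises the generic (and, with this patch, self-hosted) path:

    arr.sort(function (a, b) { return 0 | (a - b); }); // not matched, comparator called per comparison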
Attachment #775279 - Attachment is obsolete: true
> Never (as in NEVER) use jsperf.com. It doesn't work.

ccing Mathias, because if there are bugs in jsperf.com he'd probably like to know...  And we would too, since web developers _will_ use jsperf.com to measure stuff.

> use shell benchmarks.

Prefer browser benchmarks if you can, actually.
Flags: needinfo?(mathias)
Also, if you have specific information on the jsperf issue you mention, I would like to know what it is...
Thanks for CC’ing me, Boris.

(In reply to Boris Zbarsky (:bz) from comment #25)
> Also, if you have specific information on the jsperf issue you mention, I
> would like to know what it is...

Same here! jsPerf/Benchmark.js is supposed to yield accurate and fair results. No “cheating behind your back” should be happening. If you found a bug, I’d appreciate a reduced test case. Thanks!

Disclaimer: I’m mostly AFK the next few days but I’ll be able to properly look in to this next week.
Flags: needinfo?(mathias)
(In reply to Mathias Bynens from comment #26)
> (In reply to Boris Zbarsky (:bz) from comment #25)
> > Also, if you have specific information on the jsperf issue you mention, I
> > would like to know what it is...
> 
> Same here! jsPerf/Benchmark.js is supposed to yield accurate and fair
> results. No “cheating behind your back” should be happening. If you found a
> bug, I’d appreciate a reduced test case. Thanks!

First of all, apologies for the rude comments! I should know better than that, really.

As for actual feedback, I'll write up what I think is problematic about how jsperf currently works and send it to you, directly. Obviously, that's much better than to sound off somewhere you wouldn't even be aware of it, without bz kindly getting you involved. (@bz: I'll cc you, since you asked for such a writeup on IRC.)
Attached patch patch v3 (obsolete) — Splinter Review
The current benchmark results:

Sorting Array of size 10000 with Array.sort(optimized) took 1.446ms (avg of 100 iterations)
Sorting Array of size 10000 with Array.sort(JS) took 9.462ms (avg of 100 iterations)
Sorting Array of size 10000 with Array.timsort took 6.94ms (avg of 100 iterations)

This still has the .slice()s; I will be removing them.
Attachment #775280 - Attachment is obsolete: true
Attachment #775343 - Flags: feedback?(till)
Comment on attachment 775343 [details] [diff] [review]
patch v3

Review of attachment 775343 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing the f? as we have long since talked about the patch on IRC.
Attachment #775343 - Flags: feedback?(till)
Given https://github.com/rwldrn/tc39-notes/blob/master/es6/2013-07/july-23.md#48-stable-arrayprototypesort, getting this in would be pretty cool: we'd most likely have the fastest current implementation of Array.sort and could thus demonstrate that a stable sort wouldn't be a performance issue.
Attached patch patch v4 (obsolete) — Splinter Review
Removed all new array allocation via the .slice method. Now the current benchmark results are:

Sorting Array of size 10000 with Array.sort(optimized) took 1.399ms (avg of 100 iterations)
Sorting Array of size 10000 with Array.sort(JS) took 9.677ms (avg of 100 iterations)
Sorting Array of size 10000 with Array.timsort took 5.175ms (avg of 100 iterations)
Attachment #775343 - Attachment is obsolete: true
Attachment #782648 - Flags: feedback?(till)
Gentle review ping! :)
Till? We need you, man! Don't let us die!!
>> if (compareFn !== undefined) {
compareFn is undefined
also,
this implementation adds some "undefined" elements to the array somehow at the end.
>>Removed all new array allocation via the .slice method. Now the current benchmark results are:
this causes some new bugs

>>There are formulations of timsort (well, of mergesort, at least) that don't need a temporary array at all. These are very annoying and likely to be so much slower that its not worth it any way. At the very least, however, you should do an implementation that creates the required scratch array at the beginning and operates by using different ranges of that space. And you should create as few temporary state objects as possible. Either by re-using them, or by passing all required variables explicitly.
Could someone assign this bug to me?

I see that this bug is still not closed, even though there was no work done on it this year.

I intend to create a Quicksort In Place and TimSort, check their correctness, and see which is fastest.
I second Mark's comment. It's been a while, even though this is a very important performance issue. Why spend all the effort on a controversial cosmetic redesign like Australis (which BTW I do like, while also realizing its issues) but make very slow progress on core issues like performance?  I really feel that this is low-hanging fruit for fixing some performance problems.  The other area I see Firefox really needing to put effort into is Canvas performance.

If no one takes this or continues to make progress on it, could someone give it to Mark? Or even myself.  After all, every CS student has learned how to implement QuickSort at some point.

Thanks.
Till, what's the status here?  It looks like there's a patch that's been waiting on your feedback since July.
Flags: needinfo?(till)
BTW, would the dual-pivot qsort be worth checking out? It sounds fairly promising and is generally faster than the classic qsort. URL is http://permalink.gmane.org/gmane.comp.java.openjdk.core-libs.devel/2628  Not sure how it compares to Timsort, which might perform rather well on real-life data.
* markvp and Mingyi: You don't need to be assigned to a bug to work on it. Bug assignment is just a formality. Please feel free to test and fix Sankha's proposed patch or write your own. :)
Status: ASSIGNED → NEW
> Why spend all the effort on controversial
> cosmetic redesigns that is Australis (which BTW I do like, but also do
> realize its issues) but making very slow progress on core issues like
> performance?

Many people work on Firefox. Some of them are UI experts. Some of them are JS engine performance experts. Few (if any) of these people are interchangeable. Let's not get distracted, please.
True, I got carried away by the intense controversy surrounding Australis. I apologize. From my own perspective, I frequently defended FF against non-tech users who blindly go for Chrome and can't utter any real criticism other than "my friend told me to use Chrome". But among my tech-savvy colleagues and friends, one infuriated by Australis said that Mozilla seems to pay more attention to UI than performance, and that was hard to defend given the lack of progress on this issue, for example. Thus I posed the question as well.

Anyhow, back to this issue. To validate whether this issue still exists - before jumping in to take up the work on it - I went back to the test cases I posted 2 years ago.  Guess what - FF now beats Chrome by a significant margin in practically all sorting!  So it is just a myth that FF wasn't working on performance but rather spent too much effort on cosmetic issues.  I went back to FF9 (the version I used when reporting the issue), FF17 and FF26.  FF26 had vast improvements on array sorting all around. Please see the attached image, where a shorter bar is better.

The only two things I want to note is that:
1. The qsortIP (in-place quicksort implemented in pure JS) is still significantly faster than the native array.sort for integers across the board - although that might just be the typed array usage in the non-native sorting.
2. Very strangely, the s=0.2M functional overhead (calling the comparator func s*log(s) times) is much less time consuming in FF than in Chrome. BUT in FF26 and above, if you run all 3 tests in http://jsfiddle.net/LzUmL/6/ again and again, you'd find that the 2nd or 3rd time you run the func overhead test, the running time shoots up 10-20 fold compared to the first run!  This is reproducible in FF26 and above, but this issue didn't exist in FF9 and FF17, nor does it exist in the current Chrome, the v34 used in the attached graphs.
Sorry, I was not reading bugmail closely.

It was quite some time ago that I worked on this, so I will try to write down whatever I recall.

1. Using a quicksort would change the behaviour of the current sorting that Firefox has. It currently implements a merge sort, which is a stable sort, but a quicksort is not stable. Someone more experienced from the JS engine team can say whether that would be desirable or not.

2. The standard cases are very easy to trap with the self-hosted code. But the spec has much more detailed steps, which are implemented properly in C++. It would take a bit of time to get them into the self-hosted version.

3. The test case provided in the previous comment appears really fast because the comparator function uses just the standard `return a - b;` form. SpiderMonkey has an optimized C++ comparator for that, so it is really fast - we never have calls from C++ back into JS land to run the comparator function. The intention of self-hosting was to get rid of the C++-to-JS calls for the comparator function. So the above testcase would be more meaningful with something like `return 0 | (a - b);`, because then we'd be using the JS function for each comparison operation.

So the plan is to keep the C++ version of the sort as well, with the self-hosted version only as a fallback when we have an unusual comparator function.
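In rough pseudo-JS, the intended dispatch looks like this (a sketch of the plan only; the real check is MatchNumericComparator in jsarray.cpp, and the other names here are made up):

    // sketch: decide between the native fast path and the self-hosted fallback
    function ArraySortEntry(obj, comparefn) {
      if (comparefn === undefined || IsRecognizedNumericComparator(comparefn))
        return NativeSort(obj, comparefn);     // C++ path, no per-comparison JS calls
      return SelfHostedSort(obj, comparefn);   // generic path, comparator invoked from JS
    }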

I could start working on this bug again, and go forward with a standard sorting algorithm if that's alright, or if anybody is willing to pick it up feel free to do so. :)
Sankha, did you mean that the reason the numerical sort in FF26+ is much faster now is the special treatment of the f(a,b){return a-b} comparator function (no JS call needed, all done in C++)?  If so, then yes, it might still be worth trying all-JS implementations for complex comparators (potentially with closures).

I'm surprised that the C++ array.sort in SpiderMonkey is a merge sort - I personally never trust that sorting is stable, since in most languages, AFAIK, qsort is the standard.  Then timsort should be used over mergesort in C++, shouldn't it?
I examined the different algorithms here: http://jsfiddle.net/GedqT/
I also noticed that FF is faster than Chrome. The Timsort that I took from Sankha's patch is a bit slower than quicksort IP, but it has a bug (the array is bigger after sorting), so it may be faster once this problem is solved. It also uses a merge sort that creates a temporary array, which can be avoided, as Till noticed. I have no idea where the bug is in the code; I'll look around for other code.

PS: it's ok if this bug is not officially assigned to me, but does this still count towards my two bugs for getting editing rights?
(In reply to Mark Van Peteghem [:markvp] from comment #47)
> PS: ok if this bug is not officially assigned to me, but does this still
> count for my two bugs to get editing rights?

Yes. It's your patches that count. :)

* Sankha: Do you plan to resume working on this bug? Or can Mark continue from where you left off?
Flags: needinfo?(sankha93)
I'm truly sorry for letting the ball drop so badly on this bug. :(

I've thought long and hard about what to do here and came to the conclusion that I made several mistakes in mentoring this bug. Critically, I'm probably not the right person to mentor it in the first place. I know a lot about our self-hosting infrastructure and JS as a language, so I can competently comment on those parts. I'm not, however, an algorithms expert, and I think that the person mentoring the work here should be that first and foremost. The second mistake is directly connected to that: it's just flat out the wrong approach to implement a fairly complicated algorithm such as Timsort from the get-go.

Sankha, I owe you an apology for having you do all this work and then not following up on it. :( I'm glad that at least that didn't discourage you from contributing lots of other patches.


Now having said all this, here's what I think should happen in this bug next:

- rebase attachment 770750 [details] [diff] [review] to make the selective usage of a self-hosted sort only for complex comparators possible again
- implement a very simple sort function, probably mergesort
- extensively test the resulting performance characteristics using the scripts from bug 720677
- if the performance is superior to the C++ implementation we have now:
 - get reviews from me for the infrastructure parts and maybe luke, sunfish, jorendorff, waldo, jandem, nbp for the algorithm part
 - land the result
 - optionally experiment with further optimizations such as implementing Timsort
- if the performance isn't great, get feedback on the algorithm and then proceed with further optimizations immediately


I can facilitate the proper mentoring on this, but will liberally forward feedback request to people more qualified than me.

Mark, I would still like to give Sankha the option to work on this first. That being said, implementing a highly-optimized, validated sort function is a great exercise in any case, so I don't think parallelization is all too bad here, either.
Flags: needinfo?(till)
A quick further note here:

(In reply to Mingyi Liu from comment #44)
> The only two things I want to note is that:
> 1. The qsortIP (in-place quicksort implemented in pure JS) is still
> significantly faster than the native array.sort for integers across board -
> although it might just be typed array usage in non-native sorting.

Note that, while I'm absolutely sure that we can get substantial speedups here, the comparison isn't entirely fair for several reasons:
- quicksort isn't stable (I assume this isn't a stable variant, as those are usually slower)
- JS implementations of standard functions are frequently faster than the builtin versions, because they don't have to be spec-compliant. Frequently, builtins have to contain checks for corner cases that can't be hoisted out of tight loops and that slow the operation down. (For what it's worth, the specification for Array#sort can be found here: http://people.mozilla.org/~jorendorff/es6-draft.html#sec-array.prototype.sort)

> 2. Very strangely, the s=0.2M functional overhead (calling comparator func
> s*log(s) times) is much less time consuming in FF than Chrome., BUT in FF26
> and above, if you run all 3 tests in http://jsfiddle.net/LzUmL/6/ again and
> again, you'd find the 2nd or 3rd time running the func overhead test, the
> running time shoots up 10-20 fold compared to the first run!  This is
> reproducible in FF26 and above, but this issue didn't exist in FF9 and FF17,
> neither does it exist in the current Chrome, the v34 used in the attached
> graphs.

That is indeed interesting, and worth further investigation if it's not resolved by a self-hosted implementation in this bug.
Till, as it happens I just finished implementing and testing Timsort in this JSFiddle: http://jsfiddle.net/mark_vp/GedqT/4/

It works well; I don't see how I could improve its performance further. I did not use an in-place sort, because on http://svn.python.org/projects/python/trunk/Objects/listsort.txt Tim himself explains that this is slower (and more complex to implement).

But it seems that there is no advantage in self-hosting, the current native sort method of Array is about as fast as my implementation on random data (using 0|a-b as the comparison function).

On nearly sorted data my implementation is faster than the current sort method, as expected because timsort was designed to be faster on such data. But when a-b is used for the comparison, the native sort method is a lot faster on both random and nearly sorted data, because of the optimization with MatchNumericComparator in jsarray.cpp, which we should keep. So the best of both worlds would be to replace the use of the MergeSort function in jsarray.cpp with a timsort. I could probably try this next week.
Mark, thanks for the investigation work here!

> But it seems that there is no advantage in self-hosting, the current native
> sort method of Array is about as fast as my implementation on random data
> (using 0|a-b as the comparison function).
> 
> On nearly sorted data my implementation is faster than the current sort
> method, as expected because timsort was designed to be faster on such data.
> But when a-b is used for the comparison, the native sort method is a lot
> faster on both random and nearly sorted data, because of the optimization
> with MatchNumericComparator in jsarray.cpp, which we should keep. So the
> best of both worlds would be to replace the use of the MergeSort function in
> jsarray.cpp with a timsort. I could probably try this next week.

The idea is to only call the self-hosted version if the MatchNumericComparator optimization fails. My patch in attachment 770750 [details] [diff] [review] does that.

I'm getting slightly different results from you, btw: for me, QuickSort is quite a bit faster in all cases except for the completely sorted ones. For those, TimSort is ~10x as fast. In any case, QuickSort can't be used (in this form, at least) because it's not stable.

> PS: I based my implementation on
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/timsort/raw_files/new/
> src/share/classes/java/util/TimSort.java which I trust to be quite good.

That, sadly, means that we can't use it at all: since it's a straight port of the Java code, it is a derived work, and thus covered by the GPL v2, which is incompatible with MPL 2, the license that SpiderMonkey uses.

I just discussed this with gerv, our licensing expert, and it might be that just having you write a new implementation that's not based on the Java code is not enough. I'll file a bug requesting a more in-depth analysis on that and have it block this one.

I'm sorry this is such a problem, but I guess it's obvious that we can't take the risk of falling afoul of another open source project's licensing terms. :(


In the meantime, we could absolutely use a merge sort implementation that's not copied from somewhere else (or the source of which has a license that is MPL 2-compatible), so that might be a good next step.
(In reply to Till Schneidereit [:till] from comment #53)
> Mark, thanks for the investigation work here!
> 
> > But it seems that there is no advantage in self-hosting, the current native
> > sort method of Array is about as fast as my implementation on random data
> > (using 0|a-b as the comparison function).
> > 
> > On nearly sorted data my implementation is faster than the current sort
> > method, as expected because timsort was designed to be faster on such data.
> > But when a-b is used for the comparison, the native sort method is a lot
> > faster on both random and nearly sorted data, because of the optimization
> > with MatchNumericComparator in jsarray.cpp, which we should keep. So the
> > best of both worlds would be to replace the use of the MergeSort function in
> > jsarray.cpp with a timsort. I could probably try this next week.
> 
> The idea is to only call the self-hosted version if the
> MatchNumericComparator optimization fails. My patch in attachment 770750 [details] [diff] [review]
> [details] [diff] [review] does that.

I see, but still the advantage would be small. On my laptop it takes 11 seconds to sort 1,000,000 random numbers with the present Array.sort and 10.5 seconds with the self-hosted timsort, only a small advantage.

> I'm getting slightly different results from you, btw: for me, QuickSort is
> quite a bit faster in all cases except for the completely sorted ones. For
> those, TimSort is ~10x as fast. In any case, QuickSort can't be used (in
> this form, at least) because it's not stable.

Strange, I just checked again and quicksort is still a little slower on random data. Must be a difference in CPUs, I have an i5. But anyway, I only left that in out of curiosity, since it is not stable.

> > PS: I based my implementation on
> > http://cr.openjdk.java.net/~martin/webrevs/openjdk7/timsort/raw_files/new/
> > src/share/classes/java/util/TimSort.java which I trust to be quite good.
> 
> That, sadly, means that we can't use it at all: since it's a straight port
> of the Java code, it is a derived work, and thus covered by the GPL v2,
> which is incompatible with MPL 2, the license that SpiderMonkey uses.
> 
> I just discussed this with gerv, our licensing expert, and it might be that
> just having you write a new implementation that's not based on the Java code
> is not enough. I'll file a bug requesting a more in-depth analysis on that
> and have it block this one.
> 
> I'm sorry this is such a problem, but I guess it's obvious that we can't
> take the risk of falling afoul of another open source project's licensing
> terms. :(

I see. I'm not that familiar with licenses, I thought GPL would be ok.

> In the meantime, we could absolutely use a merge sort implementation that's
> not copied from somewhere else (or the source of which has a license that is
> MPL 2-compatible), so that might be a good next step.

I'm of the opinion that timsort is the best algorithm for this, since it is a lot faster on partially sorted data. So should I await the result of the in-depth analysis of the licensing? I don't feel like writing a merge sort if it is discarded when I'm allowed to write another timsort implementation, or someone else writes a timsort soon.
(In reply to Mark Van Peteghem [:markvp] from comment #54)
> > The idea is to only call the self-hosted version if the
> > MatchNumericComparator optimization fails. My patch in attachment 770750 [details] [diff] [review]
> > [details] [diff] [review] does that.
> 
> I see, but still the advantage would be small. On my laptop it takes 11
> seconds to sort 1,000,000 random numbers with the present Array.sort and
> 10.5 seconds with the self-hosted timsort, only a small advantage.

That is strange: my numbers (in current Firefox Nightly, on a 2013 rMBP with a 2.7Ghz i7) are substantially different:

"a-b":
default on 10000000 random using fdiff: 1.717
timsort on 10000000 random using fdiff: 2.004

default on 10000000 ascending using fdiff: 0.512
timsort on 10000000 ascending using fdiff: 0.056

default on 10000000 nearly_ascending using fdiff: 0.932
timsort on 10000000 nearly_ascending using fdiff: 1.196

default on 10000000 descending using fdiff: 0.629
timsort on 10000000 descending using fdiff: 0.059

"0|a-b":
default on 10000000 random using f0ordiff: 11.247
timsort on 10000000 random using f0ordiff: 1.963

default on 10000000 ascending using f0ordiff: 0.919
timsort on 10000000 ascending using f0ordiff: 0.056

default on 10000000 nearly_ascending using f0ordiff: 9.866
timsort on 10000000 nearly_ascending using f0ordiff: 1.176

default on 10000000 descending using f0ordiff: 5.77
timsort on 10000000 descending using f0ordiff: 0.086

So in the case where our optimization hits, the native implementation is a bit faster for the non-fully sorted cases, but when it fails or for the fully sorted cases, Timsort is between ~5x and ~10x (or 67x for the non-optimized descending case) faster.

I wonder what's so different about your setup: I very, very much doubt that it's just the i5 vs my i7.

> > That, sadly, means that we can't use it at all: since it's a straight port
> > of the Java code, it is a derived work, and thus covered by the GPL v2,
> > which is incompatible with MPL 2, the license that SpiderMonkey uses.
> > 
> > I just discussed this with gerv, our licensing expert, and it might be that
> > just having you write a new implementation that's not based on the Java code
> > is not enough. I'll file a bug requesting a more in-depth analysis on that
> > and have it block this one.
> 
> I see. I'm not that familiar with licenses, I thought GPL would be ok.

Yeah, it's a real bummer. Licenses are a complicated mess in general, but in this case it's somewhat straight-forward: GPL'd code can only be incorporated into code that's also under the GPL. Since all Firefox code is under the MPL 2 (except for some code under compatible licenses that we imported), that means we can't use GPL'd code.

> > In the meantime, we could absolutely use a merge sort implementation that's
> > not copied from somewhere else (or the source of which has a license that is
> > MPL 2-compatible), so that might be a good next step.
> 
> I'm of the opinion that timsort is the best algorithm for this, since it is
> a lot faster on partially sorted data. So should I await the result of the
> in-depth analysis of the licensing? I don't feel like writing a merge sort
> if it is discarded when I'm allowed to write another timsort implementation,
> or someone else writes a timsort soon.

I definitely agree: Timsort is really nice for this. Let's wait for a bit until the licensing people get back to me. I filed a bug about that but cannot CC you: legal bugs aren't open for ... legal reasons. :(

Until then it'd still be nice to figure out the disparity between our results.
This is strange. I thought it was because you used a nightly, so I tried it on a nightly too, and got results similar to yours, just a bit slower. But when I tried again with 29.0.1 I got about the same results as with the nightly. So I suspect it is just because I restarted FF.

"a-b":
default on 10000000 random using fdiff: 3.025
timsort on 10000000 random using fdiff: 3.279

default on 10000000 ascending using fdiff: 0.786
timsort on 10000000 ascending using fdiff: 0.042

default on 10000000 nearly_ascending using fdiff: 1.639
timsort on 10000000 nearly_ascending using fdiff: 1.875

default on 10000000 descending using fdiff: 1.143
timsort on 10000000 descending using fdiff: 0.06

"0|a-b":
default on 10000000 random using f0ordiff: 16.623
timsort on 10000000 random using f0ordiff: 3.888

default on 10000000 ascending using f0ordiff: 1.368
timsort on 10000000 ascending using f0ordiff: 0.05

default on 10000000 nearly_ascending using f0ordiff: 15.366
timsort on 10000000 nearly_ascending using f0ordiff: 2.022

default on 10000000 descending using f0ordiff: 8.182
timsort on 10000000 descending using f0ordiff: 0.068
(In reply to Till Schneidereit [:till] from comment #55)
> So in the case where our optimization hits, the native implementation is a
> bit faster for the non-fully sorted cases, but when it fails or for the
> fully sorted cases, Timsort is between ~5x and ~10x (or 67x for the
> non-optimized descending case) faster.

This is what I expected. The native implementation has the advantage of avoiding the callback into JS; timsort has the advantage of being better at partially sorted data; both advantages happen to give a similar speedup. If the native implementation loses its advantage, timsort is of course a lot faster. Timsort checks for ascending and descending ranges, while the native implementation only checks whether all data are ascending, so that explains the 67x speedup.
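For context, the run detection described here looks roughly like this in timsort (a generic sketch of the standard technique, not Mark's implementation):

    // Return the length of the natural run starting at `lo`; a strictly
    // descending run is reversed in place so that every run ends up ascending
    // (only strictly descending runs may be reversed, to keep the sort stable).
    function nextRunLength(array, lo, hi, lessThan) {
      var runHi = lo + 1;
      if (runHi === hi)
        return 1;
      if (lessThan(array[runHi++], array[lo])) {        // descending
        while (runHi < hi && lessThan(array[runHi], array[runHi - 1]))
          runHi++;
        reverseRange(array, lo, runHi);
      } else {                                          // ascending
        while (runHi < hi && !lessThan(array[runHi], array[runHi - 1]))
          runHi++;
      }
      return runHi - lo;
    }
    function reverseRange(array, lo, hi) {
      for (hi--; lo < hi; lo++, hi--) {
        var t = array[lo]; array[lo] = array[hi]; array[hi] = t;
      }
    }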
(In reply to Mark Van Peteghem [:markvp] from comment #57)
> (In reply to Till Schneidereit [:till] from comment #55)
> > So in the case where our optimization hits, the native implementation is a
> > bit faster for the non-fully sorted cases, but when it fails or for the
> > fully sorted cases, Timsort is between ~5x and ~10x (or 67x for the
> > non-optimized descending case) faster.
> 
> This is what I expected. The native implementation has the advantage of
> avoiding the callback to JS, timsort has the advantage of being better at
> partially sorted data; both advantages happen to be similar in speedup. If
> the native implementation loses its advantage, timsort is of course a lot
> faster. Timsort checks for ascending and descending ranges, while the native
> implementation only checks if all data are ascending, so that explains the
> 67x speedup.

Exactly. Based on these numbers, we could *almost* drop the native version entirely. Not quite, though, as it doesn't make sense to take the 20% performance hit for the common case.

Nice to see that your numbers line up pretty much precisely with mine. I always do performance measurements in a clean profile to prevent interference by one of the many tabs I have open in my main profile, or some addon. The easiest way to do that is to just create a throwaway profile in /tmp like so:
[path to firefox executable] -profile /tmp/whatever

In my case, that means
/Applications/FirefoxNightly.app/Contents/MacOS/firefox -profile /tmp/whatever
Flags: needinfo?(sankha93)
Attached patch patchv4Splinter Review
I took another look into this. This patch has the necessary code in jsarray.cpp to call the self-hosted version. I made some changes to the patch that Till posted previously. Till, can you take a look and check whether the current changes are correct?

This patch correctly sorts the array (unlike the previous one), but it still uses array.slice calls. I tried sorting a 100,000-element array; these are the results:

With patch:
Sorting Array of size 100000 with Array.sort(optimized) took 51.245ms (avg of 100 iterations)
Sorting Array of size 100000 with Array.sort(JS) took 241.732ms (avg of 100 iterations)

Without patch:
Sorting Array of size 100000 with Array.sort(optimized) took 53.665ms (avg of 100 iterations)
Sorting Array of size 100000 with Array.sort(JS) took 484.976ms (avg of 100 iterations)

This fails one jit-test related to self-hosting. Till can you help me with it?
self-hosting/invoke-self-hosted-with-primitive-this.js

Instead of the error message being "0 is not a function", I am getting "lastVal is not a function". This lastVal is coming from the self-hosted code.

I did not run it through the complete js-tests suite, but from what I have tested it fails the test:
js1_5/Array/regress-255555.js
This is because the spec doesn't say anything about undefined values not being passed to the comparator function. But this test checks that.
Attachment #782648 - Attachment is obsolete: true
Attachment #782648 - Flags: feedback?(till)
Attachment #8436304 - Flags: feedback?(till)
I happened to receive this link in email today:

http://research.microsoft.com/apps/mobile/publication.aspx?id=209622

If the Timsort implementation can't go in for legal reasons, perhaps an interesting new algorithm would be an appealing way around the problem...
That looks interesting, I'll have a look at it this weekend.
Mentor: till
Whiteboard: [mentor=tschneidereit]
Comment on attachment 8436304 [details] [diff] [review]
patchv4

Cancelling feedback after talking with Sankha in person. We agreed that we're not a good team to work on this, at least not using a complex algorithm such as Timsort.

Should legal not get back to us on the Timsort implementation options and should a P3 implementation be out of reach for too long, it might make sense to land a basic mergesort implementation as an interim solution. Sankha might work on that.
Attachment #8436304 - Flags: feedback?(till)
I have had a working P3 sort since last weekend; I just need to examine a few points where I might improve performance. This week was too busy; I'll probably show some results tomorrow.
Amazing! I'm very much looking forward to seeing your results.

As mentioned earlier, bug 720677 contains scripts to generate exhaustive sets of test data. Those might be a good way to verify that there aren't any bad corner cases somewhere in the algorithm or implementation.
I'd argue that extracting the P3 sort + exhaustive test cases and sharing them with the public (perhaps in a github repo on moz's account) is a good move to encourage people to share and refine the implementation.

Way too many half-baked/subtly broken sorts out there, and if P3 really is a bleeding-edge, high quality/high performance sort, it'd be good to get people using it.
Extracting a standalone version of the sort function should be straight-forward, so I don't see why we couldn't do that after landing it. It does indeed make sense to do something like this, yes.
I've made some last improvements, you can test it on http://jsfiddle.net/GedqT/7/

Timsort is still faster than P3 sort for now, but perhaps more improvements to P3 sort are possible. Advice is welcome; most of the time in P3 sort is spent in the first for loop. One thing from the article I did not do is unrolling loops, but I'm not sure whether that makes a difference in JavaScript.

I wrote this code based only on the article, so there should be no copyright issues.
A couple guesses on why P3 might be slower: The P3 main function is a *lot* bigger than the timsort main function, and much more complex, so this probably hinders attempts to optimize it. Splitting the various stages into their own functions would probably help here, since you're only calling them once anyway. You're also allocating arrays inside the main body and I feel like that probably has a fixed additional cost per run, especially with a large number of values involved. You also call 'new Date' a few times, and call Array.sort inside (????) and those both seem like they could have a negative impact.

The performance of P3 here seems consistently good, but also consistently behind timsort to a small extent, so it's possible that addressing some of those minor nits/possible issues would close the gap or even make it closer than timsort. It's nice to see that it is already competitive.

It would also be great to see a test scenario that timsort actually fails on. The cases available in this fiddle all seem to be things that timsort devours quite easily.
I reimplemented the P3 algorithm from the jsfiddle in a statically typed language (to rule out various JIT optimization quirks) and optimized some of the heap allocations out, with the goal of looking at the performance with some of that stuff out of the picture.

The test results are pretty bad. In some ideal cases it manages to come close to timsort, but other than scenarios where quicksort falls over, quicksort demolishes it. Profiles show that a significant amount of time is spent performing the binary searches in the first run-building pass - this is a cost timsort doesn't pay since it uses a much simpler, naive run-building algorithm. It's possible that a denser, more cache-friendly representation of the run information would help here - I'm using the four-arrays setup that the JS uses, because it makes the binary searches easier to perform, but I may experiment with switching to a denser data structure.

The existence of the variable-length run lists is also a problem. In my profiles the cost of growing the run arrays (since it's not really possible to predict their size well or carve their space out of a shared buffer) and doing the appends is significant. This is another problem timsort doesn't have because it can represent runs as ranges instead of actual sequences. Maybe you don't actually need to maintain run lists containing actual values to make this algorithm work? I can't really tell; from reading the paper it seems like it might be possible to implement without actually maintaining lists of items, but the algorithm as specified in pseudocode definitely states that you need to maintain the run lists.

I think finding a way to optimize out the run lists will be a big improvement by itself.

Along with the core problem that this algorithm uses an *absurd* amount of temporary space, there's the issue that it allocates in phases, so you hit the allocator more often and have more chances to trigger a GC. First it allocates space for the list of runs, then it allocates space for each run as it constructs it, then once it's built the list of runs it allocates additional space for the phase where it builds run sizes and merges the runs into a temporary array. Each one of these allocations seems likely to flush some or all of the cache, which defeats a lot of the 'cache-friendliness' this algorithm purports to offer.

The use of the Array.sort builtin in the JS P3 implementation probably hurts its performance, since we've already established that the builtin runs slowly. If you replace those Array.sort calls with calls to your JS quicksort that will probably give P3 another small boost.
(In reply to K. Gadd (:kael) from comment #68)
> A couple guesses on why P3 might be slower: The P3 main function is a *lot*
> bigger than the timsort main function, and much more complex, so this
> probably hinders attempts to optimize it. Splitting the various stages into
> their own functions would probably help here, since you're only calling them
> once anyway.

I'll try that, thanks.

> You're also allocating arrays inside the main body and I feel
> like that probably has a fixed additional cost per run, especially with a
> large number of values involved. You also call 'new Date' a few times, and
> call Array.sort inside (????) and those both seem like they could have a
> negative impact.

Calling 'new Date' is just there to see which part of the algorithm consumes the most time. Calling Array.sort is a leftover from when the algorithm wasn't complete yet and I still needed to sort at that stage.

> The performance of P3 here seems consistently good, but also consistently
> behind timsort to a small extent, so it's possible that addressing some of
> those minor nits/possible issues would close the gap or even make it closer
> than timsort. It's nice to see that it is already competitive.
> 
> It would also be great to see a test scenario that timsort actually fails
> on. The cases available in this fiddle all seem to be things that timsort
> devours quite easily.

What other cases are you thinking of besides the four I tested? Timsort is best when there are long ranges that are ascending or descending, but even on random data it is still fast.
(In reply to K. Gadd (:kael) from comment #69)
> The existence of the variable-length run lists is also a problem. In my
> profiles the cost of growing the run arrays (since it's not really possible
> to predict their size well or carve their space out of a shared buffer) and
> doing the appends is significant. This is another problem timsort doesn't
> have because it can represent runs as ranges instead of actual sequences.
> Maybe you don't actually need to maintain run lists containing actual values
> to make this algorithm work? I can't really tell; from reading the paper it
> seems like it might be possible to implement without actually maintaining
> lists of items, but the algorithm as specified in pseudocode definitely
> states that you need to maintain the run lists.
>
> I think finding a way to optimize out the run lists will be a big
> improvement by itself.

The run lists seem essential to me; they are a quick way to create sorted parts of the sequence.

> Along with the core problem that this algorithm uses an *absurd* amount of
> temporary space, the algorithm also allocates in phases, so you hit the
> allocator more often and have more chances to trigger a GC. First it
> allocates space for the list of runs, then it allocates space for each run
> as it constructs it, then once it's built the list of runs it allocates
> additional space for the phase where it builds run sizes and merges the runs
> into a temporary array. Each one of these allocations seems likely to flush
> some or all of the cache, which defeats a lot of the 'cache-friendliness'
> this algorithm purports to offer.

I see that I forgot to implement the allocation optimization that the article mentions (right under figure 4). I'll try to implement that; it's a bit harder but doable.
I've made several changes to the P3Sort algorithm, which made it a bit faster, but it is still slower than the alternatives.

I split the main body in smaller parts, this had little impact on speed.

I then switched to P3Sort for sorting the runs by size, but this made it a lot slower. I assume that this hinders optimization. I now use Quick Sort IP, which made no improvement, probably because the array of runs is a lot smaller than the array of data.

I then tried binary search on the runs themselves instead of the separate arrays with the last element, but this made it a little slower, so I switched back (change the if (true) to if (false) in the main body to try it).

I then gave the runs an initial size of the square root of the array, which gave a small improvement.

I then replaced something like b=a; c=a; by b=c=a; (lines 943 and 948) which gave another small improvement.

Lastly, I changed the order in which the data in the head runs is copied to the array to make it more cache friendly, which showed no improvement.

The memory usage is indeed quite big. With 10,000,000 elements Timsort uses about 140 MB, while P3Sort uses between 250 and 300 MB. Quick sort IP uses about 100 MB.

It was fun to implement, but if no further improvements are possible, I'd recommend not using it.
Mark, thanks a lot for your fantastic work on this!

After looking into the implementation, I agree that P3 seems to be a worse fit than expected. Mostly, the memory usage is a serious problem, and I don't think we can meaningfully reduce it.

One final thing that might be interesting to try is to use typed arrays for the metadata used during sorting. It'd even be possible to use a Uint16Array for cases where that's enough to represent all indices, but that might make the speed worse. If you want to try this, note that uint32 values aren't as well optimized in SpiderMonkey as int32 values, so an Int32Array would be preferable if it's large enough.

I don't expect any of this to make enough of a difference to make P3 a viable algorithm for our use case, though.

I haven't yet heard back from our legal team regarding the Timsort licensing issue, but will get back to you once I have.
Dense memory (typed arrays etc) for P3 isn't likely to make a big difference; the real problem is that the allocations occur in multiple stages and their sizes aren't predictable.

For timsort you can pick a reasonably large 'maximum' stack depth and be able to trivially assert that you have enough stack to do a sort. A reasonably-sized Int32Array for stack storage should improve the performance and memory usage for sure (in particular, the better locality will probably provide a big speed boost).

QuickSortIP would also benefit from a dense Int32Array representation of its stack - it seems like it would never get larger than, say, 2*(log2 N) + 1? Having both of those algorithms to choose from would be nice, especially for short arrays where (IIRC) the overhead of timsort is not justified.
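To make the dense-stack idea concrete, here is a minimal sketch of an iterative quicksort that keeps its pending [start, end] ranges in a preallocated Int32Array (illustrative only - the names and the partition scheme are not from any attachment). Always pushing the larger partition and continuing with the smaller one keeps the number of stacked ranges at or below about log2(n), which is what makes the fixed-size allocation safe:

// Sketch: quicksort with an explicit, fixed-size Int32Array stack.
function quickSortDenseStack(array, comparefn) {
    var n = array.length;
    if (n < 2)
        return array;
    // Two int32 slots per pending range; log2(n) + 1 ranges suffice because we
    // always push the larger partition and keep working on the smaller one.
    var stack = new Int32Array(2 * (Math.ceil(Math.log2(n)) + 1));
    var top = 0;
    var start = 0, end = n - 1;
    for (;;) {
        if (end > start) {
            // Lomuto partition around the last element.
            var pivot = array[end], p = start;
            for (var i = start; i < end; i++) {
                if (comparefn(array[i], pivot) < 0) {
                    var t = array[i]; array[i] = array[p]; array[p] = t;
                    p++;
                }
            }
            var u = array[end]; array[end] = array[p]; array[p] = u;
            if (p - start > end - p) {
                // Left side is larger: stash it, keep sorting the right side.
                stack[top++] = start; stack[top++] = p - 1;
                start = p + 1;
            } else {
                // Right side is larger (or equal): stash it, keep the left side.
                stack[top++] = p + 1; stack[top++] = end;
                end = p - 1;
            }
        } else if (top > 0) {
            end = stack[--top];
            start = stack[--top];
        } else {
            break;
        }
    }
    return array;
}

Processing the smaller side first is what gives the roughly 2*log2(N) bound mentioned above; pushing the smaller side instead would let the stack degenerate toward O(N) entries on adversarial input.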
Hey Mark,

I finally heard back from our legal team, and it's good news: they don't see any reason why you couldn't clean-room implement the algorithm based on the textual explanation, as long as you don't copy things over from the current implementation or another incompatibly-licensed implementation.

If you're still interested in working on this, that'd be fantastic.

Note that I'm on PTO for the next almost four weeks, so should you have an implementation during that time, please ask Luke Wagner (luke on IRC) for feedback.
Assignee: sankha93 → nobody
Flags: needinfo?(Mark.Van.Peteghem-moz)
Hi Till,

That is great to hear. I'll make some time free to work on it this week.
Flags: needinfo?(Mark.Van.Peteghem-moz)
I finally finished my own implementation of Timsort, based solely on the description at http://bugs.python.org/file4451/timsort.txt. I'm sorry it took so long; Timsort is harder to implement than P3sort, and I was busy with some other stuff in the past month.

You can test it on http://jsfiddle.net/GedqT/12/
For arrays of a million elements it is the fastest, with 10 million elements Quicksort IP is faster.

I think it could be made faster by implementing some stuff in C++, especially functions like arraycopy and moveinarray, and perhaps reversing part of an array.
I just saw that Array has a method copyWithin that could replace my function moveinarray. I'll try tomorrow with that method.
I tried it, but copyWithin doesn't seem to be available in FF (at least in the one I'm using). That is really a shame; I think it could make it a lot faster.
Mark, which version of Firefox are you using?  copyWithin was added in Firefox 32.
Mark, thank you so much for your continued work on this!

As Boris says, copyWithin was added in FF32. However, it is self-hosted, because that turns out to be faster for the common case. We might at some point add fast paths for dense arrays where we get away with memcpying, but I'm not sure it'd make too much of a difference. What *could* make a difference is unrolling loops, which is worked on in bug 1039458. See comment 2 in that bug about initial results. For now, this optimization is hidden behind a flag, but I think it'll be activated relatively soon.

In any case, the numbers I get from your current version look great! If we can get optimizations based on that, even better, but we should get this landed with or without.

I'll work on doing at least a preliminary integration into the engine over the next few days and do a first review pass of the implementation. Based on that, we'll see how best to proceed. Probably by asking someone on the team for an in-depth review of the algorithmic parts of the implementation.
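To make the copyWithin point concrete, here is roughly what replacing a hand-written move helper with it looks like (moveinarray's actual signature isn't visible in the fiddle, so the one below is an assumption):

// Hypothetical manual helper: move array[start..end) to positions beginning at
// target within the same array, handling overlap in both directions.
function moveInArray(array, target, start, end) {
    if (target < start) {
        for (var i = start; i < end; i++)
            array[target + (i - start)] = array[i];
    } else {
        for (var i = end - 1; i >= start; i--)
            array[target + (i - start)] = array[i];
    }
}

// With the builtin, the helper collapses to a single call that already handles
// overlapping ranges itself:
// array.copyWithin(target, start, end);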
Hi, you can add one more test, if you like - "descending with equal pairs":

        var start = 1000*1000*1000;
        for (var i = 0; i < length; ++i)
            array[i] = start + length - i + (i % 2);

This should affect timsort performance.
see http://stackoverflow.com/questions/20311706/how-does-timsort-perform-on-data-in-descending-order
Boris, yesterday FF prompted me to upgrade to v. 32. Did you do that to help me upgrade? :-)

So I could use copyWithin now, but Till said it's self-hosted, so it would not matter much.

Till, I'll unroll some loops and see how well that goes.

4esn0k, I'll add these tests too.
(In reply to Mark Van Peteghem [:markvp] from comment #83)
> Boris, yesterday FF prompted me to upgrade to v. 32. Did you do that to help
> me upgrade? :-)

Heh :) We release a new version every six weeks.

> 
> So I could use copyWithin now, but Till said it's hosted, so it would not
> matter much.

It's somewhat unlikely that a non-selfhosted version would be much, if at all, faster for the usual cases, really. The problem is that a spec-compliant implementation has to deal with lots of corner cases like, e.g., arrays with holes in them (such as [1,,3,4]) where the prototype has a getter.

> 
> Till, I'll unroll some loops and see how well that goes.

That'll be a helpful experiment, yes. Eventually, it should be handled internally by the engine, though, and thus it shouldn't be necessary to manually unroll these loops.
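As a purely illustrative example of the kind of manual unrolling under discussion (nothing here is from the fiddle), an element-copy loop unrolled four ways might look like this; the automatic unrolling tracked in bug 1039458 is meant to make writing this by hand unnecessary:

// Hypothetical helper: copy `count` elements, four assignments per iteration,
// with a scalar loop to mop up the remainder.
function copyRangeUnrolled(src, srcStart, dst, dstStart, count) {
    var i = 0;
    var limit = count - (count % 4);
    for (; i < limit; i += 4) {
        dst[dstStart + i]     = src[srcStart + i];
        dst[dstStart + i + 1] = src[srcStart + i + 1];
        dst[dstStart + i + 2] = src[srcStart + i + 2];
        dst[dstStart + i + 3] = src[srcStart + i + 3];
    }
    for (; i < count; i++)
        dst[dstStart + i] = src[srcStart + i];
}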
I unrolled the loops of the functions that move and copy data in arrays. There was no measurable difference in speed. It's at http://jsfiddle.net/GedqT/13/
I added a test for descending pairs, and the proposed improvement for this case on SO. The improvement makes it significantly faster for descending pairs, but for random data it seems to be about 5% slower. The code is at http://jsfiddle.net/GedqT/15/
It's been a month now since I made my last adjustments. Any comments?
Till, can we get some eyes on this?
Flags: needinfo?(till)
(In reply to Mingyi Liu from 
> I expect FF to perform similarly to Chrome for array.sort in all situations
Will tests include situations with options equivalent to, for example:

sort(function(a,b){return a.localeCompare(b, undefined, {sensitivity:'base'})})
Flags: needinfo?(Mark.Van.Peteghem-moz)
I am a newbie; can anyone please guide me on how to fix bugs?
Right now we're detecting common custom comparators and optimizing them, which helps with the issues that created this bug; however, that still leaves an infinite number of custom comparators that will behave like dogs. ;)

Over in bug 1121937 I implemented self-hosted merge sort that performed well on TypedArrays. I would like to see us keep our default std_Array_sort as it is, but switch to that self-hosted merge sort when we encounter custom comparators.

If no one takes issue with that I'll grab this bug and throw up a patch in the next week or so.
Assignee: nobody → winter2718
Attached patch arraysort.diff (obsolete) — Splinter Review
Attaching this WIP; switching to a self-hosted mergesort when we can't optimize the comparator gives a very nice performance boost. There are two problems with this patch: 1) the mergesort works, but I fudged the splitting logic a bit (feeble human brain); 2) there are no tests.

Testing for comparator optimization in self-hosted code may seem a bit much, but on very large arrays std_Array_sort with optimized comparator is 1.5 to 2 times as fast as the self-hosted sort.

Without Patch =>
Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.189
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.12
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 4.218

With Patch =>
Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.195
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.123
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 0.145
^ The times given above are in seconds.
@Morgan, 
Hi, I have two suggestions:

1) What about using only one "buffer"?
2) What about using "InsertionSort" for small parts?

Here is your implementation with those changes:

// A helper function for MergeSort.
function Merge(source, start, mid, end, destination, comparefn) {
    var i, j, k;
    var sizeLeft = mid - start + 1;
    var sizeRight =  end - mid;

    i = 0;
    j = 0;
    k = start;
    while (i < sizeLeft && j < sizeRight) {
        if (comparefn(source[start + i], source[mid + 1 + j]) <= 0) {
            destination[k] = source[start + i];
            i++;
        }
        else {
            destination[k] = source[mid + 1 + j];
            j++;
        }
        k++;
    }

    // Empty out any remaining elements in the buffer.
    while (i < sizeLeft) {
        destination[k] = source[start + i];
        i++;
        k++;
    }

    while (j < sizeRight) {
        destination[k] = source[mid + 1 + j];
        j++;
        k++;
    }
}

// Iterative, bottom up, mergesort
function MergeSort(array, len, comparefn) {

    // We do all of our allocating up front
    var source = array;
    var destination = new List(len);
    var mid, end, endOne, endTwo;

    for (var i = 0; i < len; i += 16) {
        InsertionSort(source, i, (i + 16 < len ? i + 16 : len) - 1, comparefn);
    }

    for (var window_size = 16; window_size < len; window_size = 2*window_size) {
        for (var start = 0; start < len; start += 2*window_size) {
            // The midpoint between the two subarrays
            mid = start + window_size - 1;
            mid = mid < len - 2 ? mid : len - 2;
            // To keep from going over the edge
            end = start + 2 * window_size - 1;
            end = end < len - 1 ? end : len - 1;
            Merge(source, start, mid, end, destination, comparefn);
        }
        var tmp = source;
        source = destination;
        destination = tmp;
    }
    
    if (array !== source) {
        for (var i = 0; i < len; i += 1) {
            array[i] = source[i];
        }
    }
    
    return array;
}
The implementation in ds/Sort.h also uses basically that implementation (wow, has it really been 4 years?) [1]. In C++, the overhead from recursion tends to make the standard approach slower. The trade-off might be different in JS though.

I wanted to add that merge sort is very efficient when it comes to comparisons (at least in general - you can specially handle near reverse-sorted lists and so on), with only Ford-Johnson sort being more efficient, and then only for small numbers of elements - but if objects are large, copying them around all over the place may not be efficient. One possibility would be to sort a list of pointers or indices instead, then use them as keys to cycle sort the actual elements at the end (at most N moves). Apologies if this is preaching to the choir :)

[1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/Sort.h#82
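For reference, a rough sketch of that indirection (nothing here is from an attachment; the names are made up): sort an array of indices with the user's comparator, then apply the resulting permutation cycle by cycle so that each element is written at most once:

function sortViaIndices(array, comparefn) {
    var n = array.length;
    var indices = new Array(n);
    for (var i = 0; i < n; i++)
        indices[i] = i;
    // Sort the small integer indices; the (possibly large) elements don't move yet.
    indices.sort(function (a, b) { return comparefn(array[a], array[b]); });
    // indices[k] is now the original position of the element that belongs at k.
    // Walk each cycle of the permutation, moving every element at most once.
    for (var start = 0; start < n; start++) {
        if (indices[start] === start)
            continue;
        var displaced = array[start];
        var current = start;
        while (indices[current] !== start) {
            array[current] = array[indices[current]];
            var next = indices[current];
            indices[current] = current;    // mark this slot as settled
            current = next;
        }
        array[current] = displaced;
        indices[current] = current;
    }
    return array;
}

In the context of this bug the index sort itself would of course still go through the custom-comparator path; the point is only that the expensive-to-copy values stay put until the final pass.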
Er, by "basically that implementation" I meant 4esn0k's suggestion, sorry.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #97)
>In C++, the overhead from recursion tends to make the standard approach slower. The trade-off might be different in JS though.
>if objects are large...
How would you create a large object in JS that is not stored on the heap (larger than SIMD.Float64x2)?
(In reply to Morgan Phillips [:mrrrgn] from comment #94)

Another thing:
    `for (var window_size = 1; window_size < len - 1; window_size = 2*window_size) {`
Should there be `window_size < len` ?
Test case:
[1,2,3,4,0].sort(function (a, b) {
  return a - b;
});

And:
    `comparefn(lBuffer[i], rBuffer[j]) <= 0`
What about `NaN` ?
(In reply to 4esn0k from comment #99)
> How to create a large object in JS stored not in the heap? (larger than
> SIMD.Float64x2)

Well, JSDate objects are currently 224 bytes (bug 1240283), so it depends on your definition of "large" - but it's a fair point :)
@4esn0k
Replacing the implementation from the patch above with a copy/paste of your suggested change (plus the InsertionSort from builtin/TypedArray.js) appears to slow things down quite a lot. Looking into the reasons. 

Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.192 Seconds
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.118 Seconds
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 5.789 Seconds
The script used to generate the above ^
https://pastebin.mozilla.org/8856961
I replaced the single buffer implementation above with my own and also saw slowdowns.
Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.191 Seconds
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.118 Seconds
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 0.386 Seconds

InsertionSort doesn't yield an improvement outside of the range of normal variation between runs. :(

I'm not a very clever person, so my implementation is dumb, but also probably better than nothing at all. Moving forward with the previous implementation unless there's a good performance gain to be had. Then, once something exists, it will be awesome to see the folks who are smarter at algorithms make neat optimizations.
(In reply to 4esn0k from comment #100)
> (In reply to Morgan Phillips [:mrrrgn] from comment #94)
> 
> Another thing:
>     `for (var window_size = 1; window_size < len - 1; window_size =
> 2*window_size) {`
> Should there be `window_size < len` ?
> Test case:
> [1,2,3,4,0].sort(function (a, b) {
>   return a - b; 
> });
> 

Assertion added. Thanks.

> And:
>     `comparefn(lBuffer[i], rBuffer[j]) <= 0`
> What about `NaN` ?

The spec says: 
If comparefn is not undefined, it should be a function that accepts two arguments x and y and returns a negative value if x < y, zero if x = y, or a positive value if x > y.

My take on this is that handling NaN is up to the user who supplies us a comparefn. Happy to be corrected.
I was off base about InsertionSort; the following implementation seems to have shaved more time off:

// Iterative, bottom up, mergesort
function MergeSort(array, len, comparefn) {

    // We do all of our allocating up front
    var lBuffer = new List();
    var rBuffer = new List();
    var mid, end, endOne, endTwo;

    for (var window_size = 1; window_size <= len - 1; window_size = 2*window_size) {
        for (var start = 0; start < len - 1; start += 2*window_size) {
            assert(window_size < len, "The window size is larger than the array length!");
            // The midpoint between the two subarrays
            mid = start + window_size - 1;
            mid = mid < len - 2 ? mid : len - 2;
            // To keep from going over the edge
            end = start + 2 * window_size - 1;
            end = end < len - 1 ? end : len - 1;
            // Use insertion sort on small arrays, where "small" is defined by
            // performance testing
            if ((end - start) < 23) {
                InsertionSort(array, start, end, comparefn);
            } else {
                Merge(array, start, mid, end, lBuffer, rBuffer, comparefn);
            }
        }
    }
    return array;
}

Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.192 Seconds
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.12 Seconds
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 0.097 Seconds

This would make it worthwhile to remove the optimized numeric comparator unless I've done something silly that voids the results.
Attached patch arraysort-conservative.diff (obsolete) — Splinter Review
This patch is something like what I would submit for final review; however, it looks like the self-hosted sort performs better than optimized comparators like [].sort((x, y) => x - y).

If I verify that this is correct I'll remove the optimization code and submit a patch for review.
Attached patch arraysort.diff (obsolete) — Splinter Review
Here's a stab at moving all user-supplied comparators to a self-hosted sort, because that turned out to be slightly faster than the optimized-comparator approach.

Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.192 Seconds
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.12 Seconds
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 0.097 Seconds

There is a bit of extra clutter in this patch due to moving all sorting algorithms to a single file (builtin/Sorts.js) and a couple of testing utilities into testing/shell.js. Having the single file for storing sorting algorithms also seems like a nice way to organize for the eventual introduction of specialized TypedArray sorts.
Attachment #8709242 - Flags: review?(till)
Comment on attachment 8709242 [details] [diff] [review]
arraysort.diff

Review of attachment 8709242 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/Sorts.js
@@ +76,5 @@
> +    var lBuffer = new List();
> +    var rBuffer = new List();
> +    var mid, end, endOne, endTwo;
> +
> +    for (var window_size = 1; window_size <= len - 1; window_size = 2*window_size) {

I have made this `window_size < len` in my local copy.

::: js/src/tests/ecma_6/Array/sort_small.js
@@ +18,5 @@
> +sortAllPermutations(lex1);
> +sortAllPermutations(lex2);
> +
> +sortAllPermutations(num1, (x, y) => x - y);
> +sortAllPermutations(num1, (x, y) => (1*x - 1*y));

The `num2` counterparts are in my local patch as well.
Comment on attachment 8709242 [details] [diff] [review]
arraysort.diff

Review of attachment 8709242 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/tests/ecma_6/Array/sort_small.js
@@ +12,5 @@
> +let lex1  = [2112, "bob", "is", "my", "name"];
> +let lex2  = [1/undefined, NaN, Number.NaN]
> +
> +let num1  = [-11, 0, 0, 100, 101];
> +let num2  = [-11, 0, 100, 201234.23, ];

My local patch has a comment: "// the hole at the end of num2 is intentional"
(In reply to Morgan Phillips [:mrrrgn] from comment #110)
> > +let num2  = [-11, 0, 100, 201234.23, ];
> 
> My local patch has a comment: "// the hole at the end of num2 is intentional"

That's not a hole -- trailing commas are simply ignored in array literals.  You want

  let num2 = [-11, 0, 100, 201234.23, /* hole */, ];
  assertEq(4 in num2, false);

so that the trailing comma gets stripped, but there's still an empty space following it.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #111)
> (In reply to Morgan Phillips [:mrrrgn] from comment #110)
> > > +let num2  = [-11, 0, 100, 201234.23, ];
> > 
> > My local patch has a comment: "// the hole at the end of num2 is intentional"
> 
> That's not a hole -- trailing commas are simply ignored in array literals. 
> You want
> 
>   let num2 = [-11, 0, 100, 201234.23, /* hole */, ];
>   assertEq(4 in num2, false);
> 
> so that the trailing comma gets stripped, but there's still an empty space
> following it.

Aha, thanks !
Attached patch arraysort.diff (obsolete) — Splinter Review
Bug spam ahoy !
Attachment #8709242 - Attachment is obsolete: true
Attachment #8709242 - Flags: review?(till)
Attachment #8709503 - Flags: review?(till)
(In reply to Morgan Phillips [:mrrrgn] from comment #113)
> Created attachment 8709503 [details] [diff] [review]
> arraysort.diff
> 
> Bug spam ahoy !

1.
> What about `NaN` ?
The ES6 spec requires NaN to be treated as if it were `0` - https://tc39.github.io/ecma262/#sec-sortcompare
I do not know why that edge case was added to the spec (it seems unnecessary to me). Probably, you could replace `comparefn(a, b) <= 0` with `!(comparefn(a, b) > 0)`, or change the branches in the "if-then-else"...
2. comparefn should not be called for undefined values according to spec.
3. "Array.from" fills holes in sparse arrays... and comparefn should not be called for them too.

Although, those things are useless for me.
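A quick illustration of why that rewrite handles NaN without a separate check (this only demonstrates the comparison semantics, it is not code from any patch):

var v = NaN;   // what a misbehaving comparefn might return
v <= 0;        // false: NaN fails every relational comparison, so a "<= 0"
               // branch would treat it like "greater than"
!(v > 0);      // true: negating ">" makes NaN behave like 0, which is what
               // the spec's SortCompare requires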
(In reply to 4esn0k from comment #114)
> (In reply to Morgan Phillips [:mrrrgn] from comment #113)
> > Created attachment 8709503 [details] [diff] [review]
> > arraysort.diff
> > 
> > Bug spam ahoy !
> 
> 1.
> > What about `NaN` ?
> ES6 spec requires to work with NaN like with it was `0` -
> https://tc39.github.io/ecma262/#sec-sortcompare
> I do not know why that edge case (for me) was added to spec.  Probably, you
> could replace `comparefn(a, b) <= 0` with `!(comparefn(a, b) > 0)` or change
> branches in "if-then-else"...
> 2. comparefn should not be called for undefined values according to spec.
> 3. "Array.from" fills holes in sparse arrays... and comparefn should not be
> called for them too.
> 
> Although, those things are useless for me.

Ah, derp, I see, thank you! +100
Attachment #8709503 - Flags: review?(till)
Attached patch arraysort2.diff (obsolete) — Splinter Review
The speed boost seen in the self-hosted sort before came in large part because I'd cheated by not including checks for undefined and NaNs, as was awesomely pointed out. Optimized comparators are still faster, which makes good sense. Given that, the optimization check is back in.
Attachment #8709503 - Attachment is obsolete: true
Attached patch arraysort2.diff (obsolete) — Splinter Review
Hokay, let's see what comes of this. Thanks again to everyone who's chimed in to help reduce the derpiness.

micro-benchmark https://pastebin.mozilla.org/8856961:
Morgans-MacBook-Pro:_DBG.OBJ mrrrgn$ dist/bin/js ../../../../test-scripts/arraysort.js
SORTING 200000 ITEMS WITH DEFAULT COMPARATOR 0.203 Seconds
SORTING 200000 ITEMS WITH OPTIMIZED COMPARATOR 0.13 Seconds
SORTING 200000 ITEMS WITH NON-OPTIMIZED COMPARATOR 0.24 Seconds
Attachment #8709600 - Attachment is obsolete: true
Attachment #8709603 - Flags: review?(till)
(In reply to Morgan Phillips [:mrrrgn] from comment #117)

Your QuickSort seems very fast. Maybe you could just use it; what do you think? Personally I only need a stable sort, but I have to implement it in JS anyway because other JS engines do not provide one.

Some JS implementations move undefined values to the end of the array before the actual sorting.
Something like this:

var moveUndefinedValuesAndHoles = function (array, len) {
  var defineds = 0;
  var holes = 0;
  for (var i = 0; i < len; i += 1) {
    var item = array[i];
    if (item !== undefined) {
      if (defineds < i) {
        array[defineds] = item;
      }
      defineds += 1;
    } else {
      if (!(i in array)) {
        holes += 1;
      }
    }
  }
  for (var j = defineds; j < len - holes; j += 1) {
    array[j] = undefined;
  }
  for (var k = len - holes; k < len; k += 1) {
    delete array[k];
  }
  return defineds;
};

This way it is possible to avoid "wrappedComparefn".
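A sketch of how such a pre-pass could be wired up (the names and the sort being called are assumptions, not part of any attachment): compact the defined values to the front once, then hand only that prefix to the sort, so the comparator never sees undefined and needs no wrapper:

function sortWithPrepass(array, comparefn) {
    var len = array.length;
    // Defined values end up in [0, defineds); undefineds and holes trail behind.
    var defineds = moveUndefinedValuesAndHoles(array, len);
    // Whatever sort is chosen only ever touches real values from here on.
    MergeSort(array, defineds, comparefn);
    return array;
}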
Comment on attachment 8709603 [details] [diff] [review]
arraysort2.diff

Review of attachment 8709603 [details] [diff] [review]:
-----------------------------------------------------------------

\o/, thank you, this looks great!

I left a bunch of comments to address and would like to see another version, but really there's nothing major blocking this.


I think it'd be better to first invoke the C++ implementation and only forward to the self-hosted implementation if the comparator optimizations fail. That's what attachment 770750 [details] [diff] [review] does.

There are two reasons for that: because the CanOptimizeComparator intrinsic isn't inlined, you have a native call anyway, so in the case of successful optimizations, you now have two native calls. More importantly, the check in CanOptimizeComparator is now run twice, once to know whether it's possible at all, and once to actually select which one to use. If the check is done first and the fallback only applied if it fails, we avoid all of this.

::: js/src/builtin/Array.js
@@ +206,5 @@
> +       return callFunction(std_Array_sort, O, comparefn);
> +
> +    /* 22.1.3.25.1 Runtime Semantics: SortCompare( x, y ) */
> +    var wrappedCompareFn = comparefn;
> +    comparefn = function(x, y) {

It's slightly unfortunate that this is an inner function in a non-singleton outer function. It might not matter in the end, because it'll all but certainly be inlined in all cases that matter.

@@ +208,5 @@
> +    /* 22.1.3.25.1 Runtime Semantics: SortCompare( x, y ) */
> +    var wrappedCompareFn = comparefn;
> +    comparefn = function(x, y) {
> +        /* Step 1. */
> +        if (x === undefined && y === undefined)

As this is called for every step of the sorting algorithm, we should make sure it's as well optimized as it can be. To that end, change it to be as lazy and repeat itself as little as possible:

// Steps 1-3, reordered for performance.
if (x === undefined) {
  if (y === undefined)
    return 0;
  return 1;
}
if (y === undefined)
  return -1;

@@ +213,5 @@
> +            return 0;
> +        /* Step 2-3. */
> +        if (x === undefined || y === undefined)
> +            return x === undefined ? 1 : -1;
> +        /* Step 4a */

Nit: "4.a.", same below.

@@ +216,5 @@
> +            return x === undefined ? 1 : -1;
> +        /* Step 4a */
> +        var v = ToNumber(wrappedCompareFn(x, y));
> +        /* Step 4b-4c. */
> +        return Number_isNaN(v) ? 0 : v;

Inline the NaN check here:
return v !== v ? 0 : v;
(The typeof check is redundant because of step 4.a.)

In theory this entire check could be avoided by careful use of the results in the sorting algorithms. I somewhat doubt that that's worth the footgun potential, though. Easy to test: just remove the check, feed in data that doesn't contain NaNs, and do perf testing.

::: js/src/builtin/Sorts.js
@@ +1,1 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public

I'd call this file "Sorting.js".

@@ +1,5 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +// We use varying sorts across the self-hosted codebase, all sorts are

Nit: make this "," either a ";" or a ".".

@@ +2,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +// We use varying sorts across the self-hosted codebase, all sorts are
> +// consolodated here to avoid confusion and re-implementation of existing

Nit: "consolidated"

@@ +5,5 @@
> +// We use varying sorts across the self-hosted codebase, all sorts are
> +// consolodated here to avoid confusion and re-implementation of existing
> +// algorithms.
> +
> +// For sorting small arrays

Nit: I know it's more of a fragment than a proper sentence, but still, it should have a "." at the end.

@@ +68,5 @@
> +        k++;
> +    }
> +}
> +
> +// Iterative, bottom up, mergesort

Nit: "."

@@ +71,5 @@
> +
> +// Iterative, bottom up, mergesort
> +function MergeSort(array, len, comparefn) {
> +
> +    // We do all of our allocating up front

Nit: "."

@@ +79,5 @@
> +
> +    for (var window_size = 1; window_size < len; window_size = 2*window_size) {
> +        for (var start = 0; start < len - 1; start += 2*window_size) {
> +            assert(window_size < len, "The window size is larger than the array length!");
> +            // The midpoint between the two subarrays

Nit: "."

@@ +86,5 @@
> +            // To keep from going over the edge
> +            end = start + 2 * window_size - 1;
> +            end = end < len - 1 ? end : len - 1;
> +            // Use insertion sort on small arrays, where "small" is defined by
> +            // performance testing

Nit: "."

Also, it'd actually be mildly interesting to test this on slow hardware, such as a mid-range Android device.

@@ +105,5 @@
> +function Partition(array, from, to, comparefn) {
> +    assert(to - from >= 3,
> +           "Partition will not work with less than three elements");
> +
> +    var median_i = (from + to) >> 1;

No hungarian notation please. Here and below.

@@ +139,5 @@
> +}
> +
> +// In-place QuickSort
> +function QuickSort(array, len, comparefn) {
> +    // Managing the stack ourselves seems to provide a small performance boost

Nit: "."

@@ +146,5 @@
> +
> +    var start = 0;
> +    var end   = len - 1;
> +
> +    var pivot_i, i, j, l_len, r_len;

Nit: no mixed naming conventions, please. "leftLength", "rightLength" or "lenLeft", "lenRight" or something like that seem fine to me.

@@ +163,5 @@
> +
> +            // Calculate the left and right sub-array lengths and save
> +            // stack space by directly modifying start/end so that
> +            // we sort the longest of the two during the next iteration.
> +            // This reduces the maximum stack size to log2(len)

Nit: "."

::: js/src/jsarray.cpp
@@ +3153,5 @@
>      JS_SELF_HOSTED_FN("filter",      "ArrayFilter",      1,0),
>      JS_SELF_HOSTED_FN("reduce",      "ArrayReduce",      1,0),
>      JS_SELF_HOSTED_FN("reduceRight", "ArrayReduceRight", 1,0),
>      JS_SELF_HOSTED_FN("some",        "ArraySome",        1,0),
> +    JS_SELF_HOSTED_FN("sort",        "ArraySort",        1,0),

Move this up to where "sort" is currently defined, replacing the C++ implementation instead of overwriting it here. (It's surprising that this even works; I'd have expected an assertion to be triggered.)

This comment isn't relevant if you follow my earlier suggestion and keep the initial C++ invocation.

::: js/src/tests/ecma_6/Array/sort_small.js
@@ +1,1 @@
> +// Sort every possible permutation of some arrays

Nit: "." at the end.

@@ +1,5 @@
> +// Sort every possible permutation of some arrays
> +function sortAllPermutations(data, comparefn) {
> +    for (let permutation of Permutations(Array.from(data))) {
> +        let sorted = (Array.from(permutation)).sort(comparefn);
> +        for (let i in sorted)

Nit: braces around multi-line bodies.
Attachment #8709603 - Flags: review?(till) → feedback+
(In reply to 4esn0k from comment #118)
> (In reply to Morgan Phillips [:mrrrgn] from comment #117)
> 
> Your QuickSort seems very fast. May be you could just use it, what do you
> think? Personally I only need stable sort, but I have to implement it in js
> anyway because other js engines do not provide it.
> 
> Some JS implementations move undefined values to the end of array before the
> actual sorting.
> Something like this:
> 
> var moveUndefinedValuesAndHoles = function (array, len) {
>   var defineds = 0;
>   var holes = 0;
>   for (var i = 0; i < len; i += 1) {
>     var item = array[i];
>     if (item !== undefined) {
>       if (defineds < i) {
>         array[defineds] = item;
>       }
>       defineds += 1;
>     } else {
>       if (!(i in array)) {
>         holes += 1;
>       }
>     }
>   }
>   for (var j = defineds; j < len - holes; j += 1) {
>     array[j] = undefined;
>   }
>   for (var k = len - holes; k < len; k += 1) {
>     delete array[k];
>   }
>   return defineds;
> };
> 
> This way it is possible to avoid "wrappedComparefn".

Oh nice, I'm going to give moveUndefinedValuesAndHoles a shot. WRT QuickSort, my only worry is that, because we're currently using MergeSort for the default and optimized-comparator cases, it would be super strange to switch from a stable to a non-stable sort. :( I'll see what perf testing yields.
Attached patch arraysort.diffSplinter Review
So, I tried getting rid of the wrappedCompareFn steps; however, in the end the JIT optimizes away any gains.

Here's a perf test where the comparefn is running with _no_ extra steps (undefined/nan checks). The DBG build sees improvements, but they are all lost in the OPT build.

https://pastebin.mozilla.org/8857174
Attachment #8708583 - Attachment is obsolete: true
Attachment #8709241 - Attachment is obsolete: true
Attachment #8709603 - Attachment is obsolete: true
Attachment #8710138 - Flags: review?(till)
Comment on attachment 8710138 [details] [diff] [review]
arraysort.diff

Review of attachment 8710138 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice! r=me with nits addressed.

::: js/src/builtin/Array.js
@@ +205,5 @@
> +
> +    /* 22.1.3.25.1 Runtime Semantics: SortCompare( x, y ) */
> +    var wrappedCompareFn = comparefn;
> +    comparefn = function(x, y) {
> +        /* Step 1-3. */

"Steps"

@@ +214,5 @@
> +        }
> +        if (y === undefined)
> +            return -1;
> +
> +        /* Step 4a */

"4.a."

@@ +217,5 @@
> +
> +        /* Step 4a */
> +        var v = ToNumber(wrappedCompareFn(x, y));
> +
> +        /* Step 4b-4c. */

"Steps 4.b-c.

::: js/src/builtin/Sorting.js
@@ +13,5 @@
> +        item = array[i];
> +        for (var j = i - 1; j >= from; j--) {
> +            swap = array[j];
> +            if (comparefn(swap, item) <= 0)
> +                break

Nit: missing ;

@@ +47,5 @@
> +        if (comparefn(lBuffer[i], rBuffer[j]) <= 0) {
> +            array[k] = lBuffer[i];
> +            i++;
> +        }
> +        else {

Nit: } and else should cuddle.

::: js/src/jsarray.cpp
@@ +1824,5 @@
> +    if (!fval.isNull() && comp == Match_None) {
> +        /*
> +         * Non-optimized user supplied comparators perform much better when
> +         * called from within a self-hosted sorting function.
> +         */

Please add some whitespace to the code in this block. Something similar to CallSelfHostedNonGenericMethod (in SelfHosting.cpp).

@@ +1828,5 @@
> +         */
> +        RootedAtom selfHostedSortAtom(cx, Atomize(cx, "ArraySort", 9));
> +        RootedPropertyName selfHostedSortName(cx, selfHostedSortAtom->asPropertyName());
> +        RootedValue selfHostedSortValue(cx);
> +        if (!GlobalObject::getIntrinsicValue(cx, cx->global(), selfHostedSortName, &selfHostedSortValue))

Nit: sadly, this doesn't fit in one line, so it needs to be wrapped, and hence the body needs braces.

::: js/src/vm/SelfHosting.cpp
@@ -1600,5 @@
>      JS_INLINABLE_FN("std_Array_shift",           array_shift,                  0,0, ArrayShift),
>      JS_FN("std_Array_unshift",                   array_unshift,                1,0),
>      JS_INLINABLE_FN("std_Array_slice",           array_slice,                  2,0, ArraySlice),
>      JS_FN("std_Array_sort",                      array_sort,                   1,0),
> -

Please revert these two changes - no need to make this list even less readable than it already is.
Attachment #8710138 - Flags: review?(till) → review+
https://hg.mozilla.org/mozilla-central/rev/1c4b0a89fd5b
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
This patch caused a small regression on ss-tagcloud and dromaeo-sunspider-string-tagcloud on AWFY.
(In reply to Guilherme Lima from comment #125)
> This patch caused a small regression on ss-tagcloud and
> dromaeo-sunspider-string-tagcloud on AWFY.

This makes some sense to me; the changes added here add overhead to calling sorts with custom comparators. The speed improvement becomes more significant as the array size grows, so for small arrays a little dip is expected.
`std_Array(len)` seems much better than `new List()`.
And `Merge` with only one buffer does fewer moves.

// A helper function for MergeSort.
function Merge(source, start, mid, end, destination, comparefn) {
    var i, j, k;

    i = start;
    j = mid;
    k = start;
    while (i < mid && j < end) {
        if (comparefn(source[j], source[i]) < 0) {
            destination[k] = source[j];
            j++;
        } else {
            destination[k] = source[i];
            i++;
        }
        k++;
    }

    // Empty out any remaining elements in the buffer.
    while (i < mid) {
        destination[k] = source[i];
        i++;
        k++;
    }

    while (j < end) {
        destination[k] = source[j];
        j++;
        k++;
    }
}

// Iterative, bottom up, mergesort.
function MergeSort(array, len, comparefn) {
    // To save effort we will do all of our work on a dense list,
    // then create holes at the end.
    var denseList = std_Array(len);
    var denseLen = 0;

    for (var i = 0; i < len; i++) {
        if (i in array)
            denseList[denseLen++] = array[i];
    }

    if (denseLen < 1)
        return array;

    // Insertion sort for small arrays, where "small" is defined by performance
    // testing.
    if (len < 24) {
        InsertionSort(denseList, 0, denseLen - 1, comparefn);
        MoveHoles(array, len, denseList, denseLen);
        return array;
    }

    var source = denseList;
    // We do all of our allocating up front
    var destination = std_Array(len);

    var mid, end, endOne, endTwo;
    for (var windowSize = 1; windowSize < denseLen; windowSize = 2 * windowSize) {
        for (var start = 0; start < denseLen; start += 2 * windowSize) {
            assert(windowSize < denseLen, "The window size is larger than denseLen!");
            // The midpoint between the two subarrays.
            mid = start + windowSize;
            mid = mid < denseLen ? mid : denseLen;
            // To keep from going over the edge.
            end = start + 2 * windowSize;
            end = end < denseLen ? end : denseLen;
            Merge(source, start, mid, end, destination, comparefn);
        }
        var tmp = source;
        source = destination;
        destination = tmp;
    }
    MoveHoles(array, len, source, denseLen);
    return array;
}
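For anyone who wants to try this variant outside the self-hosting environment, the intrinsics it leans on can be stubbed out roughly like this (these stand-ins are assumptions; the real std_Array, assert, InsertionSort, and MoveHoles live inside the engine):

// Hypothetical shell shims for the sketch above.
var std_Array = Array;
function assert(cond, msg) { if (!cond) throw new Error(msg); }
function InsertionSort(array, from, to, comparefn) {
    // Sorts array[from..to] (inclusive) in place.
    for (var i = from + 1; i <= to; i++) {
        var item = array[i], j;
        for (j = i - 1; j >= from && comparefn(array[j], item) > 0; j--)
            array[j + 1] = array[j];
        array[j + 1] = item;
    }
}
function MoveHoles(sparse, sparseLen, dense, denseLen) {
    // Copy the sorted dense values back and re-create the trailing holes.
    for (var i = 0; i < denseLen; i++)
        sparse[i] = dense[i];
    for (var j = denseLen; j < sparseLen; j++)
        delete sparse[j];
}

// Example: MergeSort([3, 1, , 2], 4, function (a, b) { return a - b; })
//          yields [1, 2, 3, <one trailing hole>].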
Depends on: 1259600
Flags: needinfo?(till)
Flags: needinfo?(Mark.Van.Peteghem-moz)
No longer depends on: 1266242