Last Comment Bug 872497 - O(n^2) performance in NS_QuickSort
: O(n^2) performance in NS_QuickSort
Status: RESOLVED FIXED
: perf
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Trunk
: x86 Linux
: -- normal (vote)
: mozilla24
Assigned To: Derrick Moser
:
Mentors:
Depends on: 917955
Blocks: 1058026
  Show dependency treegraph
 
Reported: 2013-05-15 05:38 PDT by Derrick Moser
Modified: 2015-03-13 22:01 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Avoid O(n^2) performance from insertion sort (2.74 KB, patch)
2013-05-15 05:47 PDT, Derrick Moser
no flags Details | Diff | Splinter Review
Eliminate O(n^2) performance issue in NS_QuickSort (2.91 KB, patch)
2013-05-18 10:14 PDT, Derrick Moser
justin.lebar+bug: review+
Details | Diff | Splinter Review

Description Derrick Moser 2013-05-15 05:38:53 PDT
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) Gecko/20100101 Firefox/20.0
Build ID: 20130329043827

Steps to reproduce:

I added my mail account to Thunderbird (version 17.0.5).  The mail account uses IMAP.  One of the folders on the IMAP server has over 350000 email messages.


Actual results:

Periodically Thunderbird attempts to download messages and freezes for several minutes.  It encountered two O(n^2) performance problems.  One is described in Bug 870556.  The other occurred in xpcom/glue/NS_QuickSort.cpp.  NS_QuickSort spent a long time performing an insertion sort.


Expected results:

NS_QuickSort should avoid producing O(n^2) behaviour.
Comment 1 Derrick Moser 2013-05-15 05:47:35 PDT
Created attachment 749836 [details] [diff] [review]
Avoid O(n^2) performance from insertion sort

Attached is my attempt to fix the O(n^2) performance problem in NS_QuickSort.

The existing insertion sort is used to speed up sorting when the input is mostly pre-sorted (making NS_QuickSort faster than std::sort for these situations).  I have moved it after the code for restoring the pivot values so the quick sort algorithm can continue if the insertion sort is aborted.

Below is an example of the slow behaviour I saw happening in NS_QuickSort.  Increase "n" a bit more to see even worse behaviour from NS_QuickSort.

int my_cmp(const void *a, const void *b, void *data)
{
    int aa = *(int *)a;
    int bb = *(int *)b;
    return (aa < bb) ? -1 : (aa == bb ? 0 : 1);
}

int main(int argc, const char *argv[])
{
    int n = 100000;
    int *a = new int[n];

    // create some partially sorted data
    srand(1);
    int mid = n / 2;
    for(int i = 0; i < mid; ++i)
        a[i] = -(rand() % n) - 1;
    a[mid] = 0;
    for(int i = mid + 1; i < n; ++i)
        a[i] = rand() % n + 1;

    printf("Sorting %d items...\n", n);
    NS_QuickSort(a, n, sizeof(int), my_cmp, 0);
    printf("Done!\n");
    delete [] a;
    return 0;
}
Comment 2 Josh Matthews [:jdm] 2013-05-15 06:57:21 PDT
Comment on attachment 749836 [details] [diff] [review]
Avoid O(n^2) performance from insertion sort

Thanks Derrick! Let's see what Benjamin thinks of this change.
Comment 3 Benjamin Smedberg [:bsmedberg] 2013-05-15 07:06:44 PDT
Comment on attachment 749836 [details] [diff] [review]
Avoid O(n^2) performance from insertion sort

Delegating review to the resident algorithm expert.
Comment 4 Derrick Moser 2013-05-15 10:14:23 PDT
I suspect the data that exposed this problem was produced one of the times I forcibly killed Thunderbird.  It may have left the data in a partially sorted ordering.  A later attempt to sort the data with NS_QuickSort would have resulted in it naively assuming the mostly sorted data could be better handled with an insertion sort.  This scenario means it is easy for others to encounter similar pathological cases.

I can tidy up the comments a bit.  One nagging thought is the "[a,b]" style ranges should be more accurately described as "[a,b)".
Comment 5 Justin Lebar (not reading bugmail) 2013-05-15 21:27:16 PDT
Thanks a lot for taking this bug (and for adding comments to this crufty code).

What if we just got rid of the insertion sort business?  We use median-of-three pivots, so you'd have to try pretty hard to get qsort into an n^2 case.

It would certainly be a lot easier to have confidence in the correctness of such a patch.  :)
Comment 6 WADA 2013-05-16 10:02:00 PDT
I could observe one of worst case of NS_QuickSort() what Derrick Moser presented at last, using my quick/lazy JavaScript port of NS_QuickSort().

Test data : array of 20000 whole numbers.
- first  half of array = descending : 20000 to 10001 by -1 (lengh=10000)
- second half of array = ascending  : 40000 to 49999 by +1 (lengh=10000)
- any data in first half of array < any data in second half of array

(A) Without swap_cnt check in insertion sort logic.
    => Sort execution time = 8922 msec
(B) At same place as current code for insrtion sort,
    with simply going to loop: if swap_cnt is larger than 2*n,
    without pivot change, expecting swap at mid of data in next loop.
    => Sort execution time = 22 msec
Note:
Sort time by my JavaScript port of original NS_QuickSort().
- If 20000 random data,
  sort time by my JavaScript port of NS_QuickSort() == around 180 msec.
- If 20000 pre-sorted/ascending data, or same data, sort time=12 msec.
- If 20000 pre-sorted/descending data, sort time == 18 msec.

In above case (A), because of first half < second half, swap doesn't happen upon first pivot selection which is mid of data, then insertion sort is applied to all 20000 data. And, because first half of the data is reversed order, O(2^n) problem is exposed. 

"Reason why O(2^n) issue doesn't occur when all data is pre-sorted in ascending order, or when all data is pre-sorted in descendin order, or when all data is same value" was;
  Insetion sort logic of NS_QuickSort() is fortunately applied on
  long array data which is already sorted in ascending order.

"Reason why I couldn't see O(2^n) problem in my previous duplication test" was;
  - Value in "descending first half" is larger than
    value in "ascending second half".
  - "first half==ascending order" and "second half==descending order"
In these cases, even when insertion sort is applied to long array, it's applied on "already sorted in ascending order" part of data, or is applied on "not so long sub array" part of data after recursive call for smaller part of data.
So, even in worst case, sort time is similar to random data case( == O(n*log(n) ).

To  Derrick Moser(bug opener):

"Worst case == O(2^n)" itself is attribute of algorithm called "quicksort".
(Q1) Do you see such severe performance problem of standard qsort() of C++ in following case?
  - "first half of data == pre-sorted data in descending order"
    is smaller than 
    "second harf of data == pre-sorted data in ascending order"
(Q2) How about std::sort() or std::stable_sort() of C++ with such data?
(Q3) Is this bug for general "Worst case==O(2^n)" issue of sort algorithm called "quicksort"? Or current NS_QuickTime() specific issue?
Comment 7 Derrick Moser 2013-05-16 18:28:15 PDT
I think there is some confusion on what this bug is about.  I'll try to describe what is going wrong, why my patch fixes the problem, its correctness, and confidence in its performance characteristics.  Much of this could be added as comments to the code.

NS_QuickSort basically implements the standard quick sort algorithm with an additional optimisation that can produce O(n) performance in some situations (the standard quick sort algorithm produces O(n*log(n)) performance in the best case).  Unfortunately, there is a bug in the optimisation that causes it to produce O(n^2) behaviour in some situations instead of improving on quick sort's O(n*log(n)) expected behaviour.  Thunderbird encountered one of these situations while it was fetching my email from an IMAP server.

This issue is not about potential O(n^2) behaviour from the quick sort algorithm.  NS_QuickSort could encounter such a pathological situation but that is not what my patch addresses.  My patch only addresses a bug in the optimisation aimed at producing O(n) best time behaviour.

When given a large array of data, NS_QuickSort basically performs a step of the quick sort algorithm: it chooses a pivot value then moves everything less than the pivot value to the beginning of the array, moves everything greater than the pivot value to the end of the array, and items equal to the pivot value are placed between the two other sections.  This takes O(n) time.  If items were moved around when performing the step of the quick sort algorithm, it will recursively process the two interesting sections.  However, if no items needed to be moved around when performing the step of the quick sort algorithm, NS_QuickSort will use an insertion sort to finish the sorting.  This can result in O(n) behaviour if the data was mostly sorted but will likely result in O(n^2) behaviour in other situations.

The goal of achieving O(n) behaviour on the somewhat common case of mostly sorted input is obtainable without introducing potential O(n^2) behaviour -- this is what my patch achieves.  By limiting the number of swaps the insertion sort is allowed to perform, we can abort the insertion sort before it exceeds O(n) running time.  Since the step of the quick sort algorithm performed immediately before our insertion sort attempt needed O(n) time, our insertion sort attempt will not destroy the O(n*log(n)) expected performance of the overall algorithm.  Furthermore, we know our insertion sort attempt could not have destroyed the partitioning of items from the previous step of the quick sort algorithm (everything less than the pivot value will still be at the beginning of the array and everything greater than the pivot value will still be at the end of the array).  This allows us to safely resume the quick sort algorithm after aborting the insertion sort attempt.

All of the above ultimately means we can achieve O(n) best case behaviour without destroying our O(n*log(n)) expected performance or making it any more likely we will encounter pathological cases that cause the quick sort algorithm to produce O(n^2) behaviour (I leave this as an exercise for the reader).

But what is the cost?  When doing the pivoting we just need to set a bit when swapping items.  This is negligible compared to the cost of calling a user provided comparison function and swapping two items that had to be done before setting the bit.  After performing a step of the quick sort algorithm, we test our bit.  This too is very cheap considering we must have already done at least 9 calls to the user provided comparison function (at least 3 for the pivot choice, then at least 6 more to test items against the pivot value).  Also, the insertion sort attempt would rarely occur given unsorted data as it is very unlikely for unsorted data to match the quick sort partitioning.  Even if the we have a pathological case where the insertion sort is attempted and aborted after every step of the quick sort algorithm, we will only have wasted O(n*log(n)) time on our insertion sort attempts.

What is the benefit?  Why not simply eliminate the insertion sort optimisation?  On my system with the patched NS_QuickSort implementation, it processes sorted (and reverse sorted) data almost 10 times faster than without the insertion sort optimisation.  For unsorted data the difference is too small to measure.

Why not just switch to qsort, std::sort, or std::stable_sort?  All of these are slower when given sorted and reverse sorted data as inputs.  The don't appear to contain any optimisation for pre-sorted input.  These methods probably do address the pathological cases that cause the basic quick sort algorithm to produce O(n^2) behaviour.  However, someone could easily further patch NS_QuickSort to address this problem too (likely by switching to a heap sort when a recursion limit is reached).
Comment 8 Justin Lebar (not reading bugmail) 2013-05-16 23:57:04 PDT
> Why not simply eliminate the insertion sort optimisation?  On my system with the patched 
> NS_QuickSort implementation, it processes sorted (and reverse sorted) data almost 10 times faster 
> than without the insertion sort optimisation.

Okay, that's good enough for me.  Thanks.

FWIW I didn't have any confusion as to what was going on here.  :)
Comment 9 Justin Lebar (not reading bugmail) 2013-05-17 10:36:59 PDT
Thinking about this more...

There's a trade-off to this optimization.  While I agree with your analysis
that it doesn't change the asymptotic runtime of the algorithm, it does make
some cases faster and others slower.

In particular, if we're sorting a list that's partially sorted, I think this
optimization could hurt us significantly, since we could make two passes over
the first half of the data, in the first recursive qsort call.  If the list is
long, both iterations will read out of L2, not L1 cache.  That might be OK if
we got something in return for running the insertion sort, but we don't, in
this case.

When we sort a sorted list with this patch, all we do is one nop partition over
all the data and then a second nop insertion sort over the whole list.  I'd
expect this should be extremely fast compared to a full qsort (even with the
cache misses).  Performing 10x worse than that (asymptotically, a factor of
log(n) worse) in this special case doesn't seem bad at all.  And the upside is
that other cases will run faster, because they don't do useless work.

So I still lean towards getting rid of this optimization.

But if we keep it, I think we can get more out of it.

After you bail on the insertion sort, you know that part of the list is
properly sorted, so there's no need to recurse on that portion of the list,
right?  Depending on how far the insertion sort got, we might still have to
make two qsort calls, or we might be able to make just one.

With this change, the insertion sort in the partially-sorted case would
actually win us something, I think.

What do you think?
Comment 10 Derrick Moser 2013-05-18 10:14:45 PDT
Created attachment 751370 [details] [diff] [review]
Eliminate O(n^2) performance issue in NS_QuickSort

I've given this some more thought and done some experiments.  I've come to the conclusion this optimisation should be eliminated and, if desired, any optimisations to aim for O(n) performance should be the responsibility of the caller.  Attached is an updated patch.

Quick sort partitioning has a good chance of destroying any ordering that existed in the input data.  Searching for special cases deep within quick sort processing on average will cost more than it will save.  It would be more efficient to simply check the input for potential cases that could be processed in O(n) time before attempting any steps of the quick sort algorithm.  If we can't find any, just use the standard quick sort algorithm and don't look back.  Since the caller is in a better position to determine the cost/benefit of looking for special cases that can be processed quicker, doing so should be their responsibility.
Comment 11 Justin Lebar (not reading bugmail) 2013-05-18 12:18:10 PDT
Comment on attachment 751370 [details] [diff] [review]
Eliminate O(n^2) performance issue in NS_QuickSort

r=me.

I've pushed this with s/invarients/invariants/ in the comment.  Thanks for the patch!

https://hg.mozilla.org/integration/mozilla-inbound/rev/aad29aa89237
Comment 12 Justin Lebar (not reading bugmail) 2013-05-18 12:19:08 PDT
> Quick sort partitioning has a good chance of destroying any ordering that existed in the input data.

That's a very good point.
Comment 13 WADA 2013-05-19 03:18:38 PDT
(In reply to Derrick Moser from comment #7)
> I think there is some confusion on what this bug is about.

I'm not confused on your proposal of change, nor reason why "reducing insertion sort" is needed for special cases such as following case, nor reason why removing "falling back to insertion sort" is needed.
- All data before pivot is smaller than any data after pivot.
- If swap doesn't happen by first pivot selection,
  current NS_QuickSort() does apply insertion sort to entire data.
- Insertion sort requires loop of O(2^n) times.
- If entire data is not sorted in ascending order, swap(move A to X,
  move B to A, move X to A) occurs in each step of loop, 
  so explosion of sort time is exposed by not so large number of data
  such as 10,000 data.

What I couldn't understand was;
  Why code change of NS_QuickSort() is mandatory.
  Why std:sort() or std::statble_sort() is not usable. 

See Bug 738533 and following, please. 
(mozilla-central)
std::sort use in mozilla-central
> http://mxr.mozilla.org/mozilla-central/search?string=std::sort
qsort use in mozilla-central
> http://mxr.mozilla.org/mozilla-central/search?string=qsort&case=on
QuickSort use in mozilla-central
> http://mxr.mozilla.org/mozilla-central/search?string=QuickSort&case=on
(comm-central)
std::sort use in comm-central
> http://mxr.mozilla.org/comm-central/search?string=std::sort
qsort() use in comm-central
> http://mxr.mozilla.org/comm-central/search?string=qsort&case=on
QuickSort use in comm-central
> http://mxr.mozilla.org/comm-central/search?string=QuickSort&case=on

(In reply to Derrick Moser from comment #7)
> Why not just switch to qsort, std::sort, or std::stable_sort?
> All of these are slower when given sorted and reverse sorted data as inputs.

Sorry but I can't find Web site describes about such severe slowness in std::sort() of GNU C++, and super quickness of quick sort algorithm than std::sort() of C++. (std::sort() seems to guarantee O(n*log(n) by switching algorithm internally.)
AFAIK, "Major reason why NS_QUickSort() was used in the past" was "qsort() of Soraris was broken in the past", instead of that NS_QuickSort() is far faster than qsort(). 

Is std:sort() of C++ actually so slow in case you presented or in reversed order case?
If the "slowness of qsort() or std::sort() than NSQuickSort()" is actual problem, why browser code uses qsort()? Why was change from NS_QuickSort() to qsort() done by Bug 738533? Why newer components use std::sort()?

Please note that I'm never opposite to your proposal on NS_QuickSort().
C++ qsort() implementation was "recursive call for both left side and right side". "Loop for right side of NS_QuickSort() made by 'Bentley & McIlroy's "Engineering a Sort Function"' is surely advantage of NS_QuickSort() to qsort().
> http://stackoverflow.com/questions/7624843/quicksort-implementation

If possible, please add parameter to NS_QuickSort().
- opt=1 : same as current
- opt=2 : prohibit "insertion sort" for less than 7. (latest proposal)
- opt=3 : break "insertion sort" when swap_cnt exceeds thresholds
          with expecting "split of array" in next loop for same data.
          (but swap(a,pm) is already done on the array.)
          (your initial proposal)
Comment 14 Justin Lebar (not reading bugmail) 2013-05-19 08:06:26 PDT
Hi, WADA.

Derrick fixed an important bug here.  If you'd like to do something else, like switch us to std::sort, please file a new bug.  We can consider your idea there.

Thanks.
Comment 15 Phil Ringnalda (:philor) 2013-05-19 17:50:05 PDT
https://hg.mozilla.org/mozilla-central/rev/aad29aa89237
Comment 16 Philip Chee 2013-09-23 03:32:35 PDT
Can this bug be re-opened and/or backed out? This is the cause of a topcrash for Thunderbird.
Comment 17 Andrew McCreight [:mccr8] 2013-09-23 08:57:54 PDT
With jlebar gone, bsmedberg is probably the one to make the call there.
Comment 18 Benjamin Smedberg [:bsmedberg] 2013-09-30 13:15:44 PDT
It seems that bug 917955 has a patch for an incorrect case-comparison function in mailnews code. I don't see any reason for this to be backed out, given the information in the bugs.
Comment 19 WADA 2015-03-13 22:01:48 PDT
(In reply to Justin Lebar (not reading bugmail) from comment #14)
> Hi, WADA.
> Derrick fixed an important bug here.  If you'd like to do something else,
> like switch us to std::sort, please file a new bug.  We can consider your idea there.
> Thanks.

I still believe that fault is in caller side of original NS_QuickSort(). It's never fault of original NS_QuickSort(). "Insertion sort if no swap" was done on purpose in original NS_QuickSort().
I still believe "Stop using NS_QuickSort(), use qsort() or std::sort() instead" is correct solution.

I've opened Bug 1141431.

Note You need to log in before you can comment on or make changes to this bug.