Closed Bug 1082580 Opened 10 years ago Closed 11 months ago

nsTArray::Sort uses non-templated NS_QuickSort function where each comparison is a non-inlined function call

Categories

(Core :: XPCOM, defect)

Other Branch
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: bjacob, Unassigned)

References

Details

Attachments

(1 file)

This is sort of the school example of where C++ can be 2x faster than C - basically the reason why templates were added to C++. NS_QuickSort is a plain C function; every time it needs to compare two array entries, it calls a comparison function that takes 3 parameters and returns a return value. Want to bet that NS_QuickSort spends more time accomoding this, than doing useful work? Anyway, I'll just use the C++ STL for now.
(In reply to Benoit Jacob [:bjacob] from comment #0) > Want to bet that NS_QuickSort spends more time accomoding this, than doing > useful work? Not without a profile, I don't. > Anyway, I'll just use the C++ STL for now. Of course, NS_QuickSort has the notable codesize benefit that there's not N sort implementations for N different datatypes...
$ nm -C obj-firefox-debug/dist/bin/libxul.so | grep 'nsTArray_Impl' | grep '[>]::Sort' | wc -l 41 So we have 41 instantiations of nsTArray_Impl::Sort. With a standalone program, one can check that libstdc++ generates roughly 1.7k of code per std::sort instantiation, with -O3 -DNDEBUG, with both clang and gcc. So we can expect that switching nsTArray_Impl::Sort to std::sort would increase libxul's code size by about 41 * 1.7k ~= 70k.
Results are just as expected... sometimes, the wisdom of "always ask for a profile before talking about performance" breaks down. $ clang++ -O3 -DNDEBUG a.cpp -o a $ ./a sorting an array of 1000000 elements (templated) std::sort took 0.0582 s (non-templated) std::qsort took 0.111 s (templated) std::sort took 0.0584 s (non-templated) std::qsort took 0.112 s (templated) std::sort took 0.0589 s (non-templated) std::qsort took 0.109 s (templated) std::sort took 0.0593 s (non-templated) std::qsort took 0.111 s
Thank you for providing the numbers. The next questions would be: 1) Is this a realistic simulation of Gecko's sorting workloads? (My guess is that virtually nothing in Gecko sorts a million elements in one go.) 2) Are we seeing profiles where NS_QuickSort is the bottleneck? I'm all for faster code, but I'm not keen on introducing ~70K of code just so a few 10-element sorts can go a microsecond faster.
(In reply to Nathan Froyd (:froydnj) from comment #4) > Thank you for providing the numbers. > > The next questions would be: > > 1) Is this a realistic simulation of Gecko's sorting workloads? (My guess > is that virtually nothing in Gecko sorts a million elements in one go.) Actually, changing the size in this benchmark, the 2x performance difference stays roughly constant across _all_ array sizes all the way down to size 2. (Yes, clock_gettime(CLOCK_THREAD_CPUTIME_ID) is that precise). > > 2) Are we seeing profiles where NS_QuickSort is the bottleneck? That's a good question indeed :) I pushed to try a patch switching to std::sort, and Talos didn't see any significant performance difference. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9d816742dea2 http://perf.snarkfest.net/compare-talos/?oldRevs=54217864bae9,155b84a1d18a,c7f5a7b46fcd&newRev=9d816742dea2&submit=true > > I'm all for faster code, but I'm not keen on introducing ~70K of code just > so a few 10-element sorts can go a microsecond faster. Of course. All sorts are 2x faster, even 2-element sorts, when using a templated sort, but that doesn't mean that sorting is important for gecko performance.

I'd like to give that a heads up, given that the modern libstdc++ implementation uses introsort, which has optimal worst-case performance as opposed to quicksort?

(I have seen arguments that don't want to use Sort but rather a loop doing InsertElementSorted, because that might be faster for small N, which doesn't seem like a good situation.)

Can we just see how changing from NS_QuickSort to std::sort affects the code size, and then think again if the effect is so bad that is warrants to keep using the non-templated sort? It's not even necessarily the case that every instance of the template generates distinct code, even for different template parameters.

It turns out that replacing NS_QuickSort by std::sort in nsTArray_Impl::Sort is not that trivial, since apparently that's used with types that are not (move-)assignable, which is a requirement for std::sort. OTOH, that lead me to the insight that nsTArray_Impl::Sort uses memutils even for types that use MOZ_DECLARE_RELOCATE_USING_MOVE_CONSTRUCTOR, which is a defect. (But I am not yet sure if sorting is done for any such element type)

Severity: normal → S3
See Also: → 1839051
Status: NEW → RESOLVED
Closed: 11 months ago
Depends on: 1839051
Resolution: --- → WORKSFORME
See Also: 1839051
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: