What Makes the Gcc Std::List Sort Implementation So Fast

What makes the gcc std::list sort implementation so fast?

I've been taking a look at the interesting GLibC implementation for list::sort (source code) and it doesn't seem to implement a traditional merge sort algorithm (at least not one I've ever seen before).

Basically what it does is:

  1. Creates a series of buckets (64 total).
  2. Removes the first element of the list to sort and merges it with the first (i=0th) bucket.
  3. If, before the merge, the ith bucket is not empty, merge the ith bucket with the i+1th bucket.
  4. Repeat step 3 until we merge with an empty bucket.
  5. Repeat step 2 and 3 until the list to sort is empty.
  6. Merge all the remaining non-empty buckets together starting from smallest to largest.

Small note: merging a bucket X with a bucket Y will remove all the elements from bucket X and add them to bucket Y while keeping everything sorted. Also note that the number of elements within a bucket is either 0 or 2^i.

Now why is this faster then a traditionnal merge sort? Well I can't say for sure but here are a few things that comes to mind:

  • It never traverses the list to find a mid-point which also makes the algorithm more cache friendly.
  • Because the earlier buckets are small and used more frequently, the calls to merge trash the cache less frequently.
  • The compiler is able to optimize this implementation better. Would need to compare the generated assembly to be sure about this.

I'm pretty sure the folks who implemented this algorithm tested it thoroughly so if you want a definitive answer you'll probably have to ask on the GCC mailing list.

`std::list ::sort()` - why the sudden switch to top-down strategy?

Note this answer has been updated to address all of the issues mentioned in the comments below and after the question, by making the same change from an array of lists to an array of iterators, while retaining the faster bottom up merge sort algorithm, and eliminating the small chance of stack overflow due to recursion with the top down merge sort algorithm.

The reason I didn't originally consider iterators was due to the VS2015 change to top down, leading me to believe there was some issue with trying to change the existing bottom up algorithm to use iterators, requiring a switch to the slower top down algorithm. It was only when I tried to analyze the switch to iterators myself that I realized there was a solution for bottom up algorithm.

In @sbi's comment, he asked the author of the top down approach, Stephan T. Lavavej, why the change was made. Stephan's response was "to avoid memory allocation and default constructing allocators". VS2015 introduced non-default-constructible and stateful allocators, which presents an issue when using the prior version's array of lists, as each instance of a list allocates a dummy node, and a change would be needed to handle no default allocator.

Lavavej's solution was to switch to using iterators to keep track of run boundaries within the original list instead of an internal array of lists. The merge logic was changed to use 3 iterator parameters, 1st parameter is iterator to start of left run, 2nd parameter is iterator to end of left run == iterator to start of right run, 3rd parameter is iterator to end of right run. The merge process uses std::list::splice to move nodes within the original list during merge operations. This has the added benefit of being exception safe. If a caller's compare function throws an exception, the list will be re-ordered, but no loss of data will occur (assuming splice can't fail). With the prior scheme, some (or most) of the data would be in the internal array of lists if an exception occurred, and data would be lost from the original list.

However the switch to top down merge sort was not needed. Initially, thinking there was some unknown to me reason for VS2015 switch to top down, I focused on using the internal interfaces in the same manner as std::list::splice. I later decided to investigate switching bottom up to use an array of iterators. I realized the order of runs stored in the internal array was newest (array[0] = rightmost) to oldest (array[last] = leftmost), and that it could use the same iterator based merge logic as VS2015's top down approach.

For bottom up merge sort, array[i] is an iterator to the start of a sorted sub-list with 2^i nodes, or it is empty (using std::list::end to indicate empty). The end of each sorted sub-list will be the start of a sorted sub-list in the next prior non-empty entry in the array, or if at the start of the array, in a local iterator (it points to end of newest run). Similar to the top down approach, the array of iterators is only used to keep track of sorted run boundaries within the original linked list, while the merge process uses std::list::splice to move nodes within the original linked list.

If a linked list is large and the nodes scattered, there will be a lot of cache misses. Bottom up will be about 30% faster than top down (equivalent to stating top down is about 42% slower than bottom up ). Then again, if there's enough memory, it would usually be faster to move the list to an array or vector, sort the array or vector, then create a new list from the sorted array or vector.

Example C++ code:

#define ASZ 32

template <typename T>
void SortList(std::list<T> &ll)
{
if (ll.size() < 2) // return if nothing to do
return;
std::list<T>::iterator ai[ASZ]; // array of iterators
std::list<T>::iterator mi; // middle iterator (end lft, bgn rgt)
std::list<T>::iterator ei; // end iterator
size_t i;
for (i = 0; i < ASZ; i++) // "clear" array
ai[i] = ll.end();
// merge nodes into array
for (ei = ll.begin(); ei != ll.end();) {
mi = ei++;
for (i = 0; (i < ASZ) && ai[i] != ll.end(); i++) {
mi = Merge(ll, ai[i], mi, ei);
ai[i] = ll.end();
}
if(i == ASZ)
i--;
ai[i] = mi;
}
// merge array into single list
ei = ll.end();
for(i = 0; (i < ASZ) && ai[i] == ei; i++);
mi = ai[i++];
while(1){
for( ; (i < ASZ) && ai[i] == ei; i++);
if (i == ASZ)
break;
mi = Merge(ll, ai[i++], mi, ei);
}
}

template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
typename std::list<T>::iterator li,
typename std::list<T>::iterator mi,
typename std::list<T>::iterator ei)
{
std::list<T>::iterator ni;
(*mi < *li) ? ni = mi : ni = li;
while(1){
if(*mi < *li){
ll.splice(li, ll, mi++);
if(mi == ei)
return ni;
} else {
if(++li == mi)
return ni;
}
}
}

Example replacement code for VS2019's std::list::sort() (the merge logic was made into a separate internal function, since it's now used in two places).

private:
template <class _Pr2>
iterator _Merge(_Pr2 _Pred, iterator _First, iterator _Mid, iterator _Last){
iterator _Newfirst = _First;
for (bool _Initial_loop = true;;
_Initial_loop = false) { // [_First, _Mid) and [_Mid, _Last) are sorted and non-empty
if (_DEBUG_LT_PRED(_Pred, *_Mid, *_First)) { // consume _Mid
if (_Initial_loop) {
_Newfirst = _Mid; // update return value
}
splice(_First, *this, _Mid++);
if (_Mid == _Last) {
return _Newfirst; // exhausted [_Mid, _Last); done
}
}
else { // consume _First
++_First;
if (_First == _Mid) {
return _Newfirst; // exhausted [_First, _Mid); done
}
}
}
}

template <class _Pr2>
void _Sort(iterator _First, iterator _Last, _Pr2 _Pred,
size_type _Size) { // order [_First, _Last), using _Pred, return new first
// _Size must be distance from _First to _Last
if (_Size < 2) {
return; // nothing to do
}
const size_t _ASZ = 32; // array size
iterator _Ai[_ASZ]; // array of iterators to runs
iterator _Mi; // middle iterator
iterator _Li; // last (end) iterator
size_t _I; // index to _Ai
for (_I = 0; _I < _ASZ; _I++) // "empty" array
_Ai[_I] = _Last; // _Ai[] == _Last => empty entry
// merge nodes into array
for (_Li = _First; _Li != _Last;) {
_Mi = _Li++;
for (_I = 0; (_I < _ASZ) && _Ai[_I] != _Last; _I++) {
_Mi = _Merge(_Pass_fn(_Pred), _Ai[_I], _Mi, _Li);
_Ai[_I] = _Last;
}
if (_I == _ASZ)
_I--;
_Ai[_I] = _Mi;
}
// merge array runs into single run
for (_I = 0; _I < _ASZ && _Ai[_I] == _Last; _I++);
_Mi = _Ai[_I++];
while (1) {
for (; _I < _ASZ && _Ai[_I] == _Last; _I++);
if (_I == _ASZ)
break;
_Mi = _Merge(_Pass_fn(_Pred), _Ai[_I++], _Mi, _Last);
}
}

The remainder of this answer is historical, and only left for the historical comments, otherwise it is no longer relevant.


I was able to reproduce the issue (old sort fails to compile, new one works) based on a demo from @IgorTandetnik:

#include <iostream>
#include <list>
#include <memory>

template <typename T>
class MyAlloc : public std::allocator<T> {
public:
MyAlloc(T) {} // suppress default constructor

template <typename U>
MyAlloc(const MyAlloc<U>& other) : std::allocator<T>(other) {}

template< class U > struct rebind { typedef MyAlloc<U> other; };
};

int main()
{
std::list<int, MyAlloc<int>> l(MyAlloc<int>(0));
l.push_back(3);
l.push_back(0);
l.push_back(2);
l.push_back(1);
l.sort();
return 0;
}

I noticed this change back in July, 2016 and emailed P.J. Plauger about this change on August 1, 2016. A snippet of his reply:

Interestingly enough, our change log doesn't reflect this change. That
probably means it was "suggested" by one of our larger customers and
got by me on the code review. All I know now is that the change came
in around the autumn of 2015. When I reviewed the code, the first
thing that struck me was the line:

    iterator _Mid = _STD next(_First, _Size / 2);

which, of course, can take a very long time for a large list.

The code looks a bit more elegant than what I wrote in early 1995(!),
but definitely has worse time complexity. That version was modeled
after the approach by Stepanov, Lee, and Musser in the original STL.
They are seldom found to be wrong in their choice of algorithms.

I'm now reverting to our latest known good version of the original code.

I don't know if P.J. Plauger's reversion to the original code dealt with the new allocator issue, or if or how Microsoft interacts with Dinkumware.

For a comparison of the top down versus bottom up methods, I created a linked list with 4 million elements, each consisting of one 64 bit unsigned integer, assuming I would end up with a doubly linked list of nearly sequentially ordered nodes (even though they would be dynamically allocated), filled them with random numbers, then sorted them. The nodes don't move, only the linkage is changed, but now traversing the list accesses the nodes in random order. I then filled those randomly ordered nodes with another set of random numbers and sorted them again. I compared the 2015 top down approach with the prior bottom up approach modified to match the other changes made for 2015 (sort() now calls sort() with a predicate compare function, rather than having two separate functions). These are the results. update - I added a node pointer based version and also noted the time for simply creating a vector from list, sorting vector, copy back.

sequential nodes: 2015 version 1.6 seconds, prior version 1.5  seconds
random nodes: 2015 version 4.0 seconds, prior version 2.8 seconds
random nodes: node pointer based version 2.6 seconds
random nodes: create vector from list, sort, copy back 1.25 seconds

For sequential nodes, the prior version is only a bit faster, but for random nodes, the prior version is 30% faster, and the node pointer version 35% faster, and creating a vector from the list, sorting the vector, then copying back is 69% faster.

Below is the first replacement code for std::list::sort() I used to compare the prior bottom up with small array (_BinList[]) method versus VS2015's top down approach I wanted the comparison to be fair, so I modified a copy of < list >.

    void sort()
{ // order sequence, using operator<
sort(less<>());
}

template<class _Pr2>
void sort(_Pr2 _Pred)
{ // order sequence, using _Pred
if (2 > this->_Mysize())
return;
const size_t _MAXBINS = 25;
_Myt _Templist, _Binlist[_MAXBINS];
while (!empty())
{
// _Templist = next element
_Templist._Splice_same(_Templist.begin(), *this, begin(),
++begin(), 1);
// merge with array of ever larger bins
size_t _Bin;
for (_Bin = 0; _Bin < _MAXBINS && !_Binlist[_Bin].empty();
++_Bin)
_Templist.merge(_Binlist[_Bin], _Pred);
// don't go past end of array
if (_Bin == _MAXBINS)
_Bin--;
// update bin with merged list, empty _Templist
_Binlist[_Bin].swap(_Templist);
}
// merge bins back into caller's list
for (size_t _Bin = 0; _Bin < _MAXBINS; _Bin++)
if(!_Binlist[_Bin].empty())
this->merge(_Binlist[_Bin], _Pred);
}

I made some minor changes. The original code kept track of the actual maximum bin in a variable named _Maxbin, but the overhead in the final merge is small enough that I removed the code associated with _Maxbin. During the array build, the original code's inner loop merged into a _Binlist[] element, followed by a swap into _Templist, which seemed pointless. I changed the inner loop to just merge into _Templist, only swapping once an empty _Binlist[] element is found.

Below is a node pointer based replacement for std::list::sort() I used for yet another comparison. This eliminates allocation related issues. If a compare exception is possible and occurred, all the nodes in the array and temp list (pNode) would have to be appended back to the original list, or possibly a compare exception could be treated as a less than compare.

    void sort()
{ // order sequence, using operator<
sort(less<>());
}

template<class _Pr2>
void sort(_Pr2 _Pred)
{ // order sequence, using _Pred
const size_t _NUMBINS = 25;
_Nodeptr aList[_NUMBINS]; // array of lists
_Nodeptr pNode;
_Nodeptr pNext;
_Nodeptr pPrev;
if (this->size() < 2) // return if nothing to do
return;
this->_Myhead()->_Prev->_Next = 0; // set last node ->_Next = 0
pNode = this->_Myhead()->_Next; // set ptr to start of list
size_t i;
for (i = 0; i < _NUMBINS; i++) // zero array
aList[i] = 0;
while (pNode != 0) // merge nodes into array
{
pNext = pNode->_Next;
pNode->_Next = 0;
for (i = 0; (i < _NUMBINS) && (aList[i] != 0); i++)
{
pNode = _MergeN(_Pred, aList[i], pNode);
aList[i] = 0;
}
if (i == _NUMBINS)
i--;
aList[i] = pNode;
pNode = pNext;
}
pNode = 0; // merge array into one list
for (i = 0; i < _NUMBINS; i++)
pNode = _MergeN(_Pred, aList[i], pNode);
this->_Myhead()->_Next = pNode; // update sentinel node links
pPrev = this->_Myhead(); // and _Prev pointers
while (pNode)
{
pNode->_Prev = pPrev;
pPrev = pNode;
pNode = pNode->_Next;
}
pPrev->_Next = this->_Myhead();
this->_Myhead()->_Prev = pPrev;
}

template<class _Pr2>
_Nodeptr _MergeN(_Pr2 &_Pred, _Nodeptr pSrc1, _Nodeptr pSrc2)
{
_Nodeptr pDst = 0; // destination head ptr
_Nodeptr *ppDst = &pDst; // ptr to head or prev->_Next
if (pSrc1 == 0)
return pSrc2;
if (pSrc2 == 0)
return pSrc1;
while (1)
{
if (_DEBUG_LT_PRED(_Pred, pSrc2->_Myval, pSrc1->_Myval))
{
*ppDst = pSrc2;
pSrc2 = *(ppDst = &pSrc2->_Next);
if (pSrc2 == 0)
{
*ppDst = pSrc1;
break;
}
}
else
{
*ppDst = pSrc1;
pSrc1 = *(ppDst = &pSrc1->_Next);
if (pSrc1 == 0)
{
*ppDst = pSrc2;
break;
}
}
}
return pDst;
}

Which sorting algorithm is used in GCC?

GCC uses a variation of Musser’s introsort. This guarantees a worst-case running time of O(n log n):

It begins with quicksort and switches to heapsort when the recursion depth exceeds a level based on … the number of elements being sorted.

The implementation can be found in the stl_algo.h header in the __introsort_loop function.

Performance gap between sorting a list and a vector of structs. C++

No a std::vector is not sorted using merge sort (in most implementations; the standard doesn't specify the algorithm).

std::list does not have O(1) random access, so it cannot use algorithms like Quick sort* which requires O(1) random access to be fast (this is also why std::sort doesn't work on std::list.)

With this, you'll have to use algorithms that forward iteration is enough, such as the Merge sort**.

And merge sort is typically slower [1][2].

See also: what's the difference between list.sort and std::sort?

*: libstdc++ actually uses introsort.

**: libstdc++ actually uses a variant of merge sort

How big is the performance gap between std::sort and std::stable_sort in practice?

There are good answers that compared the algorithms theoretically. I benchmarked std::sort and std::stable_sort with google/benchmark for curiosity's sake.

It is useful to point out ahead of time that;

  • Benchmark machine has 1 X 2500 MHz CPU and 1 GB RAM
  • Benchmark OS Arch Linux 2015.08 x86-64
  • Benchmark compiled with g++ 5.3.0 and clang++ 3.7.0 (-std=c++11, -O3 and -pthread)
  • BM_Base* benchmark tries to measure the time populating std::vector<>. That time should be subtracted from the sorting results for better comparison.

First benchmark sorts std::vector<int> with 512k size.

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:37:43
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 24730499 24726189 28
BM_BaseInt/512k_stddev 293107 310668 0
...
BM_SortInt/512k_mean 70967679 70799990 10
BM_SortInt/512k_stddev 1300811 1301295 0
...
BM_StableSortInt/512k_mean 73487904 73481467 9
BM_StableSortInt/512k_stddev 979966 925172 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:39:07
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 26198558 26197526 27
BM_BaseInt/512k_stddev 320971 348314 0
...
BM_SortInt/512k_mean 70648019 70666660 10
BM_SortInt/512k_stddev 2030727 2033062 0
...
BM_StableSortInt/512k_mean 82004375 81999989 9
BM_StableSortInt/512k_stddev 197309 181453 0

Second benchmark sorts std::vector<S> with 512k size (sizeof(Struct S) = 20).

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:49:32
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 26485063 26410254 26
BM_BaseStruct/512k_stddev 270355 128200 0
...
BM_SortStruct/512k_mean 81844178 81833325 8
BM_SortStruct/512k_stddev 240868 204088 0
...
BM_StableSortStruct/512k_mean 106945879 106857114 7
BM_StableSortStruct/512k_stddev 10446119 10341548 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:53:01
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 27327329 27280000 25
BM_BaseStruct/512k_stddev 488318 333059 0
...
BM_SortStruct/512k_mean 78611207 78407400 9
BM_SortStruct/512k_stddev 690207 372230 0
...
BM_StableSortStruct/512k_mean 109477231 109333325 8
BM_StableSortStruct/512k_stddev 11697084 11506626 0

Anyone who likes to run the benchmark, here is the code,

#include <vector>
#include <random>
#include <algorithm>

#include "benchmark/benchmark_api.h"

#define SIZE 1024 << 9

static void BM_BaseInt(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<int> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back(dist(mt));
}
}
}
BENCHMARK(BM_BaseInt)->Arg(SIZE);

static void BM_SortInt(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<int> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back(dist(mt));
}

std::sort(v.begin(), v.end());
}
}
BENCHMARK(BM_SortInt)->Arg(SIZE);

static void BM_StableSortInt(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<int> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back(dist(mt));
}

std::stable_sort(v.begin(), v.end());
}
}
BENCHMARK(BM_StableSortInt)->Arg(SIZE);


struct S {
int key;
int arr[4];
};

static void BM_BaseStruct(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<S> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back({dist(mt)});
}
}
}
BENCHMARK(BM_BaseStruct)->Arg(SIZE);

static void BM_SortStruct(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<S> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back({dist(mt)});
}

std::sort(v.begin(), v.end(),
[](const S &a, const S &b) { return a.key < b.key; });
}
}
BENCHMARK(BM_SortStruct)->Arg(SIZE);

static void BM_StableSortStruct(benchmark::State &state) {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist;

while (state.KeepRunning()) {
std::vector<S> v;
v.reserve(state.range_x());
for (int i = 0; i < state.range_x(); i++) {
v.push_back({dist(mt)});
}

std::stable_sort(v.begin(), v.end(),
[](const S &a, const S &b) { return a.key < b.key; });
}
}
BENCHMARK(BM_StableSortStruct)->Arg(SIZE);


BENCHMARK_MAIN();

Why std::sort is faster than introsort coded by hand?

Both gcc and Microsoft's VisualC++ provide source code for std::sort (in header file algorithm). So, you can take a look yourself. I have investigated similar issues before. My conclusion was that the code was optimized for the general code path even to the extent of making the code more complex and difficult to maintain. Trade-offs that make sense to me.

Why does sorting a vector with execution::par take longer than normal sort (gcc 10.1.0)?

running perf on your code, it looks like it spends a tiny bit longer trying to partition the data.

This is just one example, but i ran it several times and consistently the parallel version was taking longer to partition the data at multiple levels of the sort. Since its recursive, its hard to get an exact picture of how much extra overhead it end up adding overall.

sort1 is the non-parallel sort.

sort2 is the parallel sort.

Sample Image

That aside, the answer to the underlying question is that you need intel thread building blocks installed in order for gcc to use anything other than the serial algorithms.

This can be installed quick simply on linux with sudo apt install libtbb-dev and then you link against it with -ltbb

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.

Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.

The rough C++ equivalent would be

sum += data[c] & -(data[c] >= 128);

The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.

Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.



Related Topics



Leave a reply



Submit