Performance of qsort vs std::sort

Performance of qsort vs std::sort?

std::clock() is not a viable timing clock. You should use a platform-specific higher-resolution timer, such as the Windows High Performance Counter. Worse, the way you call clock() means that text is first output to the console, and that output is included in the timed interval. That alone invalidates the test. In addition, make sure that you compile with all optimizations enabled.

Finally, I copied and pasted your code, and got 0.016 for qsort and 0.008 for std::sort.

Comparing qsort with std::sort

The signature of std::sort() is:

template <class RandomAccessIterator, class StrictWeakOrdering>
void sort(RandomAccessIterator first, RandomAccessIterator last,
          StrictWeakOrdering comp);

So it requires two iterators and a comparator, not a pointer and a length. To fix your code, call

std::sort(pcount, pcount+structs_len, cmp_port_count);

assuming that cmp_port_count is your comparator function, which takes two port_count_t objects by reference and returns true when the first argument should be ordered before the second, and false otherwise.

Should I use qsort over sort in C++?

Probably not: std::sort is type-safe, and it can inline the comparator function. The only distinctive strong point of qsort is that it is available in plain C.

Why in this code is the C++ sort much slower than qsort in C (for a string array)?

The problem was found: as @manni66 pointed out, I made a big mistake. I had built in Debug mode; when I changed to Release mode, the results were reasonable:

[1] vector<char*>(qsort) Elapsed time: 0.559000
[2] vector<char*>(sort) Elapsed time: 0.530000
Other test:
[3] char array(qsort) Elapsed time: 0.512000
[3] vector<string>(sort) Elapsed time: 0.620000

Why does this quicksort appear to be faster than std::sort?

Visual Studio's std::sort has some overhead and some optimizations that your program does not. Your program is based on the Lomuto partition scheme, while std::sort is a single-pivot, 3-partition, Hoare-like quicksort plus insertion sort for small partitions. The three partitions are: elements < pivot, elements == pivot, elements > pivot. If there are no duplicate values, the 3-partition sort is just overhead. If there are duplicates, then as their number increases, Lomuto gets worse while Hoare (and std::sort) get better. Try using fillRandom(myints, size, 10); and you should see a large performance hit with the Lomuto method, and an increase in performance with std::sort().

Visual Studio's std::sort uses median of nine if >= 40 elements and median of three for 33 to 39 elements, which reduces the probability of worst-case scenarios, and switches to insertion sort for <= 32 elements (which speeds it up). To reduce stack overhead, it recurses on the smaller partition and loops back to handle the larger one. It has a check to avoid worst-case O(n^2) time complexity, switching to heap sort if the depth of partitioning ("recursions") becomes excessive. It uses iterators instead of plain pointers, but in release mode my impression is that the iterators are unchecked, and effectively pointers. It also uses a function pointer for the less-than compare, defaulting to std::less, which I don't know whether it gets optimized away.

Is there a sorting routine faster than qsort?

The data set is huge compared to the cache, so the sort will be limited by cache-to-memory bandwidth.

Using indirection will make this worse, because the pointers themselves consume cache and memory is accessed in a more random order, i.e. comparisons aren't with neighbours. The program is working against any prefetch mechanisms in the CPU.

Consider splitting the struct into two structs, in two arrays.

As an experiment, compare pass 1 as written with a version of pass 1 where the struct is only { float val; int idx; };

If it is cache and bandwidth bound, it should make a significant difference.

If cache locality is a key issue, it might be worth considering multi-way merges, or Shell sort; anything to improve locality.

Try sorting cache-sized subsets of the records, then do multi-way merge sorts. (It might be worth looking at the processor's cache documentation to see whether it is clear about how many prefetch streams it tries to anticipate.) Again, reducing the size of the data set, by reducing the size of the structs streaming in from RAM, may be a winner.

How is the idx field derived? It sounds like it is the original position of the record in the array.

If that is the case, just allocate a second array, and copy the first into the second:

struct { float val; float val2; int idx; } sortedByVal[40000000];
struct { float val; float val2; } sortedByIdx[40000000];

for (int i = 0; i < 40000000; ++i) {
    sortedByIdx[sortedByVal[i].idx].val  = sortedByVal[i].val;
    sortedByIdx[sortedByVal[i].idx].val2 = sortedByVal[i].val2;
}

There is no second sort. If that is the case, merge the assignment of the val2 value into this pass.

Edit

I was curious about relative performance, so I wrote a program to compare the 'library' C sort functions qsort, mergesort, and heapsort, and also to compare sorting to idx with copying to idx. It also re-sorts the already-sorted values, to get some handle on that. This is quite interesting too. I did not implement and test Shell sort, which often beats qsort in practice.

The program uses command line parameters to choose which sort, and whether to sort by idx, or just copy. Code: http://pastebin.com/Ckc4ixNp

The jitter in run-time is quite clear. I should have used CPU clocks, done many runs, and presented better statistics, but that is left as an 'exercise for the reader'.

I ran this on an old-ish MacBook Pro with a 2.2 GHz Intel Core 2 Duo. Some of the timing is OS X specific.

Timing (reformatted slightly):

qsort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 16.304194
Re-order to idx by copying - duration = 2.904821
Sort in-order data - duration = 2.013237
Total duration = 21.222251
User Time: 20.754574
System Time: 0.402959

mergesort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 25.948651
Re-order to idx by copying - duration = 2.907766
Sort in-order data - duration = 0.593022
Total duration = 29.449438
User Time: 28.428954
System Time: 0.973349

heapsort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 72.236463
Re-order to idx by copying - duration = 2.899309
Sort in-order data - duration = 28.619173
Total duration = 103.754945
User Time: 103.107129
System Time: 0.564034

WARNING: Those are single runs. Many runs would be needed to get reasonable statistics.

The code at pastebin actually sorts the reduced-size, 8-byte array. On the first pass, only val and idx are needed, and since the array gets copied when val2 is added, there is no need for val2 in the first array. This optimisation lets the sort functions copy a smaller struct and also fit more structs in the cache, both of which are good. I was disappointed that this gives only a few percent improvement with qsort. I interpret this as qsort quickly getting the chunks being sorted down to a size that fits in the cache.

The same reduced-size strategy gives more than 25% improvement on heapsort.

Timing for 8 byte structs, without val2:

qsort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 16.087761
Re-order to idx by copying - duration = 2.858881
Sort in-order data - duration = 1.888554
Total duration = 20.835196
User Time: 20.417285
System Time: 0.402756

mergesort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 22.590726
Re-order to idx by copying - duration = 2.860935
Sort in-order data - duration = 0.577589
Total duration = 26.029249
User Time: 25.234369
System Time: 0.779115

heapsort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 52.835870
Re-order to idx by copying - duration = 2.858543
Sort in-order data - duration = 24.660178
Total duration = 80.354592
User Time: 79.696220
System Time: 0.549068

WARNING: Those are single runs. Many runs would be needed to get reasonable statistics.

How big is the performance gap between std::sort and std::stable_sort in practice?

There are good answers that compared the algorithms theoretically. I benchmarked std::sort and std::stable_sort with google/benchmark for curiosity's sake.

It is useful to point out ahead of time that:

  • Benchmark machine has 1 X 2500 MHz CPU and 1 GB RAM
  • Benchmark OS Arch Linux 2015.08 x86-64
  • Benchmark compiled with g++ 5.3.0 and clang++ 3.7.0 (-std=c++11, -O3 and -pthread)
  • BM_Base* benchmark tries to measure the time populating std::vector<>. That time should be subtracted from the sorting results for better comparison.

First benchmark sorts std::vector<int> with 512k size.

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:37:43
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 24730499 24726189 28
BM_BaseInt/512k_stddev 293107 310668 0
...
BM_SortInt/512k_mean 70967679 70799990 10
BM_SortInt/512k_stddev 1300811 1301295 0
...
BM_StableSortInt/512k_mean 73487904 73481467 9
BM_StableSortInt/512k_stddev 979966 925172 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:39:07
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 26198558 26197526 27
BM_BaseInt/512k_stddev 320971 348314 0
...
BM_SortInt/512k_mean 70648019 70666660 10
BM_SortInt/512k_stddev 2030727 2033062 0
...
BM_StableSortInt/512k_mean 82004375 81999989 9
BM_StableSortInt/512k_stddev 197309 181453 0

Second benchmark sorts std::vector<S> with 512k size (sizeof(S) == 20).

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:49:32
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 26485063 26410254 26
BM_BaseStruct/512k_stddev 270355 128200 0
...
BM_SortStruct/512k_mean 81844178 81833325 8
BM_SortStruct/512k_stddev 240868 204088 0
...
BM_StableSortStruct/512k_mean 106945879 106857114 7
BM_StableSortStruct/512k_stddev 10446119 10341548 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:53:01
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 27327329 27280000 25
BM_BaseStruct/512k_stddev 488318 333059 0
...
BM_SortStruct/512k_mean 78611207 78407400 9
BM_SortStruct/512k_stddev 690207 372230 0
...
BM_StableSortStruct/512k_mean 109477231 109333325 8
BM_StableSortStruct/512k_stddev 11697084 11506626 0

For anyone who would like to run the benchmark, here is the code:

#include <vector>
#include <random>
#include <algorithm>

#include "benchmark/benchmark_api.h"

#define SIZE (1024 << 9)

static void BM_BaseInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }
  }
}
BENCHMARK(BM_BaseInt)->Arg(SIZE);

static void BM_SortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_SortInt)->Arg(SIZE);

static void BM_StableSortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::stable_sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_StableSortInt)->Arg(SIZE);

struct S {
  int key;
  int arr[4];
};

static void BM_BaseStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }
  }
}
BENCHMARK(BM_BaseStruct)->Arg(SIZE);

static void BM_SortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::sort(v.begin(), v.end(),
              [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_SortStruct)->Arg(SIZE);

static void BM_StableSortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::stable_sort(v.begin(), v.end(),
                     [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_StableSortStruct)->Arg(SIZE);

BENCHMARK_MAIN();

QuickSort is slower than std::sort

Your QuickSort implementation is pretty vanilla. It does use random pivot selection, which ensures that there are no "killer" inputs that cause performance degradation, so that's better than the most basic QuickSort.

There are a number of optimizations that might be used, among them:

  • It is typical for "Quick Sort" to in fact be implemented as a hybrid sort that falls back to (say) Insertion Sort for partitions smaller than some fixed threshold. Once you get to small partitions, the overhead of Quick Sort tends to overcome its asymptotic complexity advantages.

  • Maximum recursion depth can be minimized and function call overhead can be reduced by switching to a hybrid recursive / iterative approach, wherein upon each partitioning, the smaller sub-array is sorted recursively, but the code just loops to sort the larger one.

  • When partitioning, you can reduce the number of swaps performed by finding pairs of elements for which a swap puts both in the correct sub-partition, and swapping those, instead of alternating between swapping into one sub-partition and swapping into the other.

  • It would probably help to come up with a way to reuse the same random number source throughout the sort, instead of instantiating a new one upon every partitioning.

C++ sorting classes faster than qsort

Three things:

  • Use std::sort instead of std::qsort, it’s faster because it can inline calls to the comparison operator (if you define it in the header or enable link-time optimisations).

  • Override swap for your class so that it can be swapped efficiently instead of copying via a temporary variable. However, that requires changing the header (because you need access to the private variables).

  • Since your sorted strings are of fixed length 4, a different sorting algorithm will be beneficial. The classical choice, which is fairly easy to implement, is radix sort. Judging from some of your comments it seems likely that your professor wants you to implement this.

qsort() vs std::sort, comparison function philosophical difference

Others have pointed out the equivalence of the two ways of doing comparisons; here's why the two approaches are followed.

In C, the comparison needs to be a function pointer. That means you always get function call and pointer indirection overhead. When qsort was designed back in the 1970s on a PDP-11 computer, the amount of overhead had to be reduced so comparison functions such as strcmp did a three-way comparison in a single function call. (Note that qsort is generally an unstable sort, so the equality case may seem useless, but it can be made stable with the appropriate pointer comparisons.)

In C++, the comparison can be inlined at template instantiation time, so much of the overhead is gone (there need not even be a function call) and the simpler convention can be used. This also means that std::sort can work with operator< overloads by default.


