Performance of qsort vs std::sort

Performance of qsort vs std::sort?

std::clock() is not a viable timing clock. You should use a platform-specific higher-resolution timer, such as the Windows High Performance Counter. Worse, the way you call clock() means that text is first output to the console, and that output is included in the timed interval. That alone invalidates the test. In addition, make sure that you compile with all optimizations enabled.

Finally, I copied and pasted your code, and got 0.016 for qsort and 0.008 for std::sort.

Comparing qsort with std::sort

The signature of std::sort() is:

template <class RandomAccessIterator, class StrictWeakOrdering>
void sort(RandomAccessIterator first, RandomAccessIterator last,
          StrictWeakOrdering comp);

So it requires two iterators and a comparator, not a pointer and a length. To fix your code, call

std::sort(pcount, pcount+structs_len, cmp_port_count);

assuming that cmp_port_count is your comparator function, which takes two port_count_t objects by reference and returns true when the first argument should be ordered before the second, and false otherwise.

Should I use qsort over sort in C++?

Probably not: std::sort is type-safe, and it can inline the comparator function. The only distinctive strong point of qsort is that it is available in plain C.

Why in this code is the C++ sort much slower than qsort in C (for a string array)?

The problem was found: as @manni66 pointed out, I made a big mistake. I had built in Debug mode; when I changed to Release mode, the results were reasonable:

[1] vector<char*>(qsort) Elapsed time: 0.559000
[2] vector<char*>(sort) Elapsed time: 0.530000
Other test:
[3] char array(qsort) Elapsed time: 0.512000
[3] vector<string>(sort) Elapsed time: 0.620000

Why does this quicksort appear to be faster than std::sort?

Visual Studio's std::sort has some overhead and some optimizations that your program does not. Your program is based on the Lomuto partition scheme, while std::sort is a single-pivot, 3-partition, Hoare-like quicksort plus insertion sort for small partitions. The three partitions are: elements < pivot, elements == pivot, elements > pivot. If there are no duplicate values, the 3-partition sort is just overhead. If there are duplicates, then as their number increases, Lomuto gets worse while Hoare (and std::sort) get better. Try using fillRandom(myints, size, 10); and you should see a large performance hit with the Lomuto method, and an increase in performance with std::sort().

Visual Studio's std::sort uses median of nine if >= 40 elements and median of three for 33 to 39 elements, which reduces the probability of worst-case scenarios, and switches to insertion sort for <= 32 elements (which speeds it up). To reduce stack overhead, it recurses on the smaller partition and loops back to handle the larger one. It has a check to avoid worst-case O(n^2) time complexity, switching to heap sort if the depth of partitioning ("recursions") becomes excessive. It uses iterators instead of plain pointers, but in release mode my impression is that the iterators are unchecked, and effectively pointers. It also uses a function pointer for the less-than compare, defaulting to std::less, which I don't know whether it gets optimized away.

Is there a sorting routine faster than qsort?

The data set is huge compared to the cache, so the sort will be limited by cache-to-memory bandwidth.

Using indirection will make this worse, because the pointers themselves consume cache and memory is accessed in a more random order, i.e. comparisons aren't with neighbours. The program is working against any prefetch mechanisms in the CPU.

Consider splitting the struct into two structs, in two arrays.

As an experiment, compare pass 1 as written with a version of pass 1 where the struct is only { float val; int idx; };

If it is cache and bandwidth bound, it should make a significant difference.

If cache locality is a key issue, it might be worth considering multi-way merges, or Shell sort; anything to improve locality.

Try sorting cache-sized subsets of the records, then do multi-way merge sorts. (It might be worth looking at the processor's cache documentation to see whether it is clear about how many prefetch streams it tries to anticipate.) Again, reducing the size of the data set, by reducing the size of the structs streaming in from RAM, may be a winner.

How is the idx field derived? It sounds like it is the original position of the record in the array.

If that is the case, just allocate a second array, and copy the first into the second:

struct { float val; float val2; int idx; } sortedByVal[40000000];
struct { float val; float val2; } sortedByIdx[40000000];

for (int i = 0; i < 40000000; ++i) {
    sortedByIdx[sortedByVal[i].idx].val  = sortedByVal[i].val;
    sortedByIdx[sortedByVal[i].idx].val2 = sortedByVal[i].val2;
}

There is no second sort. If that is the case, merge the assignment of the val2 value into this pass.

Edit

I was curious about relative performance, so I wrote a program to compare the 'library' C sort functions qsort, mergesort, and heapsort, and also to compare sorting to idx with copying to idx. It also re-sorts the already-sorted values, to get some handle on that. This is quite interesting too. I did not implement and test Shell sort, which often beats qsort in practice.

The program uses command line parameters to choose which sort, and whether to sort by idx, or just copy. Code: http://pastebin.com/Ckc4ixNp

The jitter in run-time is quite clear. I should have used CPU clocks, done many runs, and presented better statistics, but that is left as an 'exercise for the reader'.

I ran this on an old-ish MacBook Pro with a 2.2 GHz Intel Core 2 Duo. Some of the timing is OS X specific.

Timing (reformatted slightly):

qsort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 16.304194
Re-order to idx by copying - duration = 2.904821
Sort in-order data - duration = 2.013237
Total duration = 21.222251
User Time: 20.754574
System Time: 0.402959

mergesort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 25.948651
Re-order to idx by copying - duration = 2.907766
Sort in-order data - duration = 0.593022
Total duration = 29.449438
User Time: 28.428954
System Time: 0.973349

heapsort(data, number-of-elements=40000000, element-size=12)
Sorting by val - duration = 72.236463
Re-order to idx by copying - duration = 2.899309
Sort in-order data - duration = 28.619173
Total duration = 103.754945
User Time: 103.107129
System Time: 0.564034

WARNING: Those are single runs. Many runs would be needed to get reasonable statistics.

The code at pastebin actually sorts the reduced-size, 8-byte array. On the first pass, only val and idx are needed, and since the array gets copied when val2 is added, there is no need for val2 in the first array. This optimisation lets the sort functions copy a smaller struct and also fit more structs in the cache, both of which are good. I was disappointed that this gives only a few percent improvement with qsort. I interpret this as qsort quickly getting the chunks being sorted down to a size that fits in the cache.

The same reduced-size strategy gives more than 25% improvement on heapsort.

Timing for 8 byte structs, without val2:

qsort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 16.087761
Re-order to idx by copying - duration = 2.858881
Sort in-order data - duration = 1.888554
Total duration = 20.835196
User Time: 20.417285
System Time: 0.402756

mergesort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 22.590726
Re-order to idx by copying - duration = 2.860935
Sort in-order data - duration = 0.577589
Total duration = 26.029249
User Time: 25.234369
System Time: 0.779115

heapsort(data, number-of-elements=40000000, element-size=8)
Sorting by val - duration = 52.835870
Re-order to idx by copying - duration = 2.858543
Sort in-order data - duration = 24.660178
Total duration = 80.354592
User Time: 79.696220
System Time: 0.549068

WARNING: Those are single runs. Many runs would be needed to get reasonable statistics.

How big is the performance gap between std::sort and std::stable_sort in practice?

There are good answers that compared the algorithms theoretically. I benchmarked std::sort and std::stable_sort with google/benchmark for curiosity's sake.

It is useful to point out ahead of time that:

  • Benchmark machine has 1 X 2500 MHz CPU and 1 GB RAM
  • Benchmark OS Arch Linux 2015.08 x86-64
  • Benchmark compiled with g++ 5.3.0 and clang++ 3.7.0 (-std=c++11, -O3 and -pthread)
  • BM_Base* benchmark tries to measure the time populating std::vector<>. That time should be subtracted from the sorting results for better comparison.

First benchmark sorts std::vector<int> with 512k size.

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:37:43
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 24730499 24726189 28
BM_BaseInt/512k_stddev 293107 310668 0
...
BM_SortInt/512k_mean 70967679 70799990 10
BM_SortInt/512k_stddev 1300811 1301295 0
...
BM_StableSortInt/512k_mean 73487904 73481467 9
BM_StableSortInt/512k_stddev 979966 925172 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:39:07
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean 26198558 26197526 27
BM_BaseInt/512k_stddev 320971 348314 0
...
BM_SortInt/512k_mean 70648019 70666660 10
BM_SortInt/512k_stddev 2030727 2033062 0
...
BM_StableSortInt/512k_mean 82004375 81999989 9
BM_StableSortInt/512k_stddev 197309 181453 0

Second benchmark sorts std::vector<S> with 512k size (sizeof(S) == 20).

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:49:32
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 26485063 26410254 26
BM_BaseStruct/512k_stddev 270355 128200 0
...
BM_SortStruct/512k_mean 81844178 81833325 8
BM_SortStruct/512k_stddev 240868 204088 0
...
BM_StableSortStruct/512k_mean 106945879 106857114 7
BM_StableSortStruct/512k_stddev 10446119 10341548 0
[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:53:01
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean 27327329 27280000 25
BM_BaseStruct/512k_stddev 488318 333059 0
...
BM_SortStruct/512k_mean 78611207 78407400 9
BM_SortStruct/512k_stddev 690207 372230 0
...
BM_StableSortStruct/512k_mean 109477231 109333325 8
BM_StableSortStruct/512k_stddev 11697084 11506626 0

For anyone who would like to run the benchmark, here is the code:

#include <vector>
#include <random>
#include <algorithm>

#include "benchmark/benchmark_api.h"

#define SIZE (1024 << 9)

static void BM_BaseInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }
  }
}
BENCHMARK(BM_BaseInt)->Arg(SIZE);

static void BM_SortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_SortInt)->Arg(SIZE);

static void BM_StableSortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::stable_sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_StableSortInt)->Arg(SIZE);

struct S {
  int key;
  int arr[4];
};

static void BM_BaseStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }
  }
}
BENCHMARK(BM_BaseStruct)->Arg(SIZE);

static void BM_SortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::sort(v.begin(), v.end(),
              [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_SortStruct)->Arg(SIZE);

static void BM_StableSortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::stable_sort(v.begin(), v.end(),
                     [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_StableSortStruct)->Arg(SIZE);

BENCHMARK_MAIN();

QuickSort is slower than std::sort

Your QuickSort implementation is pretty vanilla. It does use random pivot selection, which ensures that there are no "killer" inputs that cause performance degradation, so that's better than the most basic QuickSort.

There are a number of optimizations that might be used, among them:

  • It is typical for "Quick Sort" to in fact be implemented as a hybrid sort that falls back to (say) Insertion Sort for partitions smaller than some fixed threshold. Once you get to small partitions, the overhead of Quick Sort tends to overcome its asymptotic complexity advantages.

  • Maximum recursion depth can be minimized and function call overhead can be reduced by switching to a hybrid recursive / iterative approach, wherein upon each partitioning, the smaller sub-array is sorted recursively, but the code just loops to sort the larger one.

  • When partitioning, you can reduce the number of swaps performed by finding pairs of elements for which a swap puts both in the correct sub-partition, and swapping those, instead of alternating between swapping into one sub-partition and swapping into the other.

  • It would probably help to come up with a way to reuse the same random number source throughout the sort, instead of instantiating a new one upon every partitioning.

C++ sorting classes faster than qsort

Three things:

  • Use std::sort instead of std::qsort, it’s faster because it can inline calls to the comparison operator (if you define it in the header or enable link-time optimisations).

  • Override swap for your class so that it can be swapped efficiently instead of copying via a temporary variable. However, that requires changing the header (because you need access to the private variables).

  • Since your sorted strings are of fixed length 4, a different sorting algorithm will be beneficial. The classical choice, which is fairly easy to implement, is radix sort. Judging from some of your comments it seems likely that your professor wants you to implement this.

qsort() vs std::sort, comparison function philosophical difference

Others have pointed out the equivalence of the two ways of doing comparisons; here's why the two approaches are followed.

In C, the comparison needs to be a function pointer. That means you always get function call and pointer indirection overhead. When qsort was designed back in the 1970s on a PDP-11 computer, the amount of overhead had to be reduced so comparison functions such as strcmp did a three-way comparison in a single function call. (Note that qsort is generally an unstable sort, so the equality case may seem useless, but it can be made stable with the appropriate pointer comparisons.)

In C++, the comparison can be inlined at template instantiation time, so much of the overhead is gone (there need not even be a function call) and the simpler convention can be used. This also means that std::sort can work with operator< overloads by default.


