Parallel Sort Algorithm

Which parallel sorting algorithm has the best average case performance?

The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:

Parallel sorting algorithms on various architectures

According to the article, sample sort seems to be best on many parallel architecture types.

Update to address Mark's concern about the age of these results:

Here are more recent articles (from 2007) introducing more novel approaches, which, by the way, are still compared against sample sort:

Improvements on sample sort

AA-Sort

The bleeding edge (circa 2010, some only a couple months old):

Parallel sorting pattern

Many-core GPU based parallel sorting

Hybrid CPU/GPU parallel sort

Randomized Parallel Sorting Algorithm with an Experimental Study

Highly scalable parallel sorting

Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach

Update for 2013:
Here is the bleeding edge circa January 2013. (Note: a few of the links are to papers at Citeseer and require registration, which is free.)

University lectures:

Parallel Partitioning for Selection and Sorting

Parallel Sorting Algorithms Lecture

Parallel Sorting Algorithms Lecture 2

Parallel Sorting Algorithms Lecture 3



Other sources and papers:


A novel sorting algorithm for many-core architectures based on adaptive bitonic sort

Highly Scalable Parallel Sorting 2

Parallel Merging

Parallel Merging 2

Parallel Self-Sorting System for Objects

Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms

Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs

Various parallel algorithms (sorting et al) including implementations



GPU and CPU/GPU hybrid sources and papers:


An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture

Data Sorting Using Graphics Processing Units

Efficient Algorithms for Sorting on GPUs

Designing efficient sorting algorithms for manycore GPUs

Deterministic Sample Sort For GPUs

Fast in-place sorting with CUDA based on bitonic sort

Fast parallel GPU-sorting using a hybrid algorithm

Fast Parallel Sorting Algorithms on GPUs

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

GPU sample sort

GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures

GPUTeraSort: high performance graphics co-processor sorting for large database management

High performance comparison-based sorting algorithm on many-core GPUs

Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead

Sorting on GPUs for large scale datasets: A thorough comparison

Update for 2022: I have not forgotten this answer, and like all things computer-related, it has not aged well. I will do my best to update and refresh it for current trends as well as the state of the art at some point before the end of this year (2022). If you are interested in this topic and would like to see the update sooner, please reply to, or better yet upvote, the comment I made below this answer, so that I can gauge interest in this topic over others that also need an update.

Parallel Sort Algorithm

From "The Darkside" in his article Parallel Extensions to the .Net Framework we have this parallel extensions version of quicksort:

(Edit: Since the link is now dead, interested readers may find an archive of it at the Wayback Machine)

private void QuicksortSequential<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int pivot = Partition(arr, left, right);
        QuicksortSequential(arr, left, pivot - 1);
        QuicksortSequential(arr, pivot + 1, right);
    }
}

private void QuicksortParallelOptimised<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    const int SEQUENTIAL_THRESHOLD = 2048;
    if (right > left)
    {
        if (right - left < SEQUENTIAL_THRESHOLD)
        {
            QuicksortSequential(arr, left, right);
        }
        else
        {
            int pivot = Partition(arr, left, right);
            // Parallel.Do comes from the Parallel Extensions CTP the article
            // targets; in the released Task Parallel Library the equivalent
            // call is Parallel.Invoke.
            Parallel.Do(
                () => QuicksortParallelOptimised(arr, left, pivot - 1),
                () => QuicksortParallelOptimised(arr, pivot + 1, right));
        }
    }
}

// Partition is referenced but not shown in the article; a minimal
// Lomuto-style implementation (an assumption, not the author's original):
private int Partition<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    T pivot = arr[right];
    int store = left;
    for (int i = left; i < right; i++)
    {
        if (arr[i].CompareTo(pivot) < 0)
        {
            T tmp = arr[i]; arr[i] = arr[store]; arr[store] = tmp;
            store++;
        }
    }
    T last = arr[right]; arr[right] = arr[store]; arr[store] = last;
    return store;
}

Notice that he reverts to a sequential sort once the number of items is less than 2048.

C++ parallel sort

If you are using libstdc++ (g++'s standard library implementation), you can rely on its built-in "Parallel Mode".

To use it, you need to compile with -fopenmp and have _GLIBCXX_PARALLEL defined during compilation. Here you can find more information on the usage as well as a list of the algorithms that gcc will consider for parallelization.

Be aware of the following warning from the usage site:

Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.

Each individual parallel algorithm can also be called explicitly. Compile with -fopenmp only (without defining _GLIBCXX_PARALLEL), and include parallel/numeric or parallel/algorithm, depending on the function, as listed in this subsection of the documentation. Be aware that the explicitly parallel algorithms live in the __gnu_parallel namespace.
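
For illustration, here is a minimal sketch of the explicit form (the headers, namespace, and compiler flags are as documented; the file name and random data are made up):

// Compile with: g++ -fopenmp explicit_sort.cpp
// If you instead define _GLIBCXX_PARALLEL, a plain std::sort call is
// dispatched to the parallel version automatically.
#include <parallel/algorithm>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<int> v(1000000);
    for (int& x : v) x = std::rand();         // unsorted input
    __gnu_parallel::sort(v.begin(), v.end()); // explicitly parallel sort
}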

Which sorting method is most suitable for parallel processing?

Like merge sort, quicksort can also be easily parallelized due to its divide-and-conquer nature. Individual in-place partition operations are difficult to parallelize, but once divided, different sections of the list can be sorted in parallel.

One advantage of parallel quicksort over many other parallel sorting algorithms is that it needs almost no synchronization: a new thread is started as soon as a sublist is available for it to work on, and it does not communicate with other threads. When all threads complete, the sort is done.

http://en.wikipedia.org/wiki/Quicksort
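
A minimal sketch of this idea (illustrative C++ that is not from the quoted answer; std::async stands in for "a new thread per sublist", and a production version would add a sequential cutoff like the C# code earlier):

#include <future>
#include <utility>
#include <vector>

// Hypothetical sketch: each sublist is handed to its own task after
// partitioning. Tasks never touch each other's sublists, so the only
// coordination is waiting for them to finish.
void quicksort_parallel(std::vector<int>& a, int left, int right) {
    if (right <= left) return;
    int pivot = a[right];                    // last element as pivot
    int store = left;
    for (int j = left; j < right; ++j)       // Lomuto partition
        if (a[j] < pivot) std::swap(a[store++], a[j]);
    std::swap(a[store], a[right]);
    auto task = std::async(std::launch::async, quicksort_parallel,
                           std::ref(a), left, store - 1);
    quicksort_parallel(a, store + 1, right); // other half on this thread
    task.wait();                             // join; no data is shared between halves
}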

Good choice of a parallelized sorting algorithm to implement as homework?

Quick sort can split the unsorted list into two halves, but unfortunately, the halves aren't guaranteed to be anywhere near even. So one machine (or half of a cluster of machines) could get 20 entries, and the other half could get 20 billion.

I can't think of a good way to make heapsort work in parallel. It can be done, but man, that feels really counterintuitive.

Merge sort is the one I think you want; a minimal sketch follows the list.

  • Each split is exactly 50% of the list, so it's easy to divide the work between processors.
  • You can implement merge sort on two sets of tape drives, which means it doesn't require the whole list to be in memory at one time. For large lists, especially those larger than the memory you have available, that's a must-have.
  • Merge sort is also stable in parallel implementations, if that matters.
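
Here is that sketch (illustrative C++, not from the original answer; depth caps the number of spawned tasks, and std::stable_sort at the leaves preserves the stability property from the last bullet):

#include <algorithm>
#include <future>
#include <vector>

// Hypothetical sketch: sort each half on its own task, then merge.
// Both halves are exactly n/2 elements, so work divides evenly.
void mergesort_parallel(std::vector<int>& a, size_t lo, size_t hi, int depth) {
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;   // exact 50% split
    if (depth > 0) {
        auto left = std::async(std::launch::async, mergesort_parallel,
                               std::ref(a), lo, mid, depth - 1);
        mergesort_parallel(a, mid, hi, depth - 1);
        left.wait();
    } else {
        std::stable_sort(a.begin() + lo, a.begin() + mid);
        std::stable_sort(a.begin() + mid, a.begin() + hi);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}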

Linear sort on GPU. Does parallel processing change Big-O?

The answer depends on whether you regard the number of processors as a relevant parameter of your complexity analysis. If you do, you have to introduce an extra parameter, say p, for the number of processors.

If your algorithm is scalable, the time complexity scales inversely with the number of processors, so instead of O(n) you get O(n/p) in the ideal case. But that really is the ideal case, and it's called perfect linear speedup. (See here for more details.)
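
In symbols (a standard textbook definition, not something from the linked article): writing T_p(n) for the running time on p processors, the speedup is

S_p(n) = \frac{T_1(n)}{T_p(n)}, \qquad \text{perfect linear speedup: } S_p(n) = p \iff T_p(n) = \frac{T_1(n)}{p}

so a sequential O(n) algorithm improves to O(n/p) only when the speedup is perfectly linear.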

But it's definitely wrong to say that an O(n^2) algorithm would run in O(n) on a parallel machine since it's unreasonable to assume that the number of processors automatically grows with the size of the input.

If you regard the number of processors as a constant, then nothing changes.

std::sort is much faster than a custom OpenMP parallel sorting algorithm

Your algorithm is O(n^2) in total work done. Using k threads, this at most makes it O(n^2/k).

std::sort is O(n lg n). Unless k is O(n / lg n), adding more threads won't keep up. For example, with n = 100,000, n^2 is 10^10 while n lg n is about 1.7 million, so you would need on the order of 6,000 threads just to break even.

And even if you did have piles of threads, NUMA on most many-threaded systems means that memory access is going to be a serious pain: the threads don't access the same memory on each cycle, and in fact constantly pass data back and forth.

An example of a way that might speed up your work compared to a simple std::sort would be to split the data into K pieces, std::sort each of the K pieces, then do a parallel merge of those pieces.

// Fragment: needs <algorithm>, <vector>, and <omp.h>; vec_ is assumed to be
// a std::vector<int> holding data_count elements.
int data_count = 100000;
// First index of block i when the data is split into n equal blocks.
auto get_block_edge = [data_count](int i, int n) {
    return data_count * i / n;
};
int blocks = 0;
#pragma omp parallel
{
    blocks = omp_get_num_threads(); // every thread writes the same value
    int block = omp_get_thread_num();
    int start = get_block_edge(block, blocks);
    int finish = get_block_edge(block + 1, blocks);
    // Each thread sorts its own block independently.
    std::sort(std::begin(vec_) + start, std::begin(vec_) + finish);
}

Now we have a bunch of sorted blocks; you just need to merge them.

// Pairwise merges: after each pass, sorted runs double in length.
// The ceiling division and the std::min clamps handle a non-power-of-two
// number of blocks (leftover blocks are carried into a later pass).
for (int merge_step = 1; merge_step < blocks; merge_step *= 2)
{
    int pairs = (blocks + 2 * merge_step - 1) / (2 * merge_step);
    #pragma omp parallel for
    for (int i = 0; i < pairs; ++i) {
        int lo_block = i * 2 * merge_step;
        int mid_block = std::min(lo_block + merge_step, blocks);
        int hi_block = std::min(lo_block + 2 * merge_step, blocks);
        int start = get_block_edge(lo_block, blocks);
        int mid = get_block_edge(mid_block, blocks);
        int finish = std::min(get_block_edge(hi_block, blocks), data_count);
        auto b = std::begin(vec_);
        std::inplace_merge(b + start, b + mid, b + finish);
    }
}

I think that should be a highly parallel merge (the block-boundary arithmetic is where off-by-one errors would creep in).

Now, this is still O(n) with an infinite number of threads, as the very last merge has to be single-threaded. Getting around that is, to put it mildly, tricky.

Parallelize sorting algorithm using pyspark

I figured out how to solve the TypeError: 'float' object is not iterable problem.

It can be solved by flattening the data with flatMap(lambda x: x), then calling glom() to regroup each partition into a list that execute_merge_sort_rdd can operate on via mapPartitions.
Executing the following line returns a list containing sorted lists:

sc.parallelize(random_list_of_lists).flatMap(lambda x: x).glom().mapPartitions(execute_merge_sort_rdd).collect()

