Why Does Java's Arrays.sort Method Use Two Different Sorting Algorithms for Different Types?

Why does Java's Arrays.sort method use two different sorting algorithms for different types?

The most likely reason: quicksort is not stable, i.e. equal entries can change their relative position during the sort; among other things, this means that if you sort an already sorted array, it may not stay unchanged.

Since primitive types have no identity (there is no way to distinguish two ints with the same value), this does not matter for them. But for reference types, it could cause problems for some applications. Therefore, a stable merge sort is used for those.

OTOH, a reason not to use the (guaranteed n*log(n)) stable merge sort for primitive types might be that it requires making a clone of the array. For reference types, where the referred objects usually take up far more memory than the array of references, this generally does not matter. But for primitive types, cloning the array outright doubles the memory usage.
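To make the stability point concrete, here is a minimal sketch (the Person type and the data are invented for the example, and it assumes Java 16+ for the record syntax): sorting by one key with the Object[] overload keeps equal-keyed elements in their original order, because that overload is documented to be stable.

    import java.util.Arrays;
    import java.util.Comparator;

    public class StableSortDemo {
        // Hypothetical record: two people can have the same age yet remain distinguishable.
        record Person(String name, int age) {}

        public static void main(String[] args) {
            Person[] people = {
                new Person("Alice", 30),
                new Person("Bob", 25),
                new Person("Carol", 30),
                new Person("Dave", 25)
            };

            // Arrays.sort(Object[], Comparator) is documented to be stable:
            // Bob stays before Dave, and Alice stays before Carol.
            Arrays.sort(people, Comparator.comparingInt(Person::age));

            System.out.println(Arrays.toString(people));
            // [Person[name=Bob, age=25], Person[name=Dave, age=25],
            //  Person[name=Alice, age=30], Person[name=Carol, age=30]]
        }
    }

With an unstable sort, Dave could legally end up before Bob, since both compare as equal by age.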

Why does java.util.Arrays.sort(Object[]) use 2 kinds of sorting algorithms?

It's important to note that an algorithm that is O(N log N) is not always faster in practice than an O(N^2) algorithm. It depends on the constants, and the range of N involved. (Remember that asymptotic notation measures relative growth rate, not absolute speed).

For small N, insertion sort in fact does beat merge sort. It's also faster for almost-sorted arrays.

Here's a quote:

Although it is one of the elementary sorting algorithms with O(N^2) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).

For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.

Here's another quote, from the Best sorting algorithm for nearly sorted lists paper:

straight insertion sort is best for small or very nearly sorted lists

What this means is that, in practice:

  • Some algorithm A1 with a higher asymptotic upper bound may be preferable to another known algorithm A2 with a lower asymptotic upper bound
    • Perhaps A2 is just too complicated to implement
    • Or perhaps it doesn't matter in the range of N considered
      • See e.g. Coppersmith–Winograd algorithm
  • Some hybrid algorithms may switch between different algorithms depending on the input size

Related questions

  • Which sorting algorithm is best suited to re-sort an almost fully sorted list?
  • Is there ever a good reason to use Insertion Sort?


A numerical example

Let's consider these two functions:

  • f(x) = 2x^2; this function has a quadratic growth rate, i.e. "O(N^2)"
  • g(x) = 10x; this function has a linear growth rate, i.e. "O(N)"

Now let's plot the two functions together:

[Plot of f(x) = 2x^2 and g(x) = 10x for x from 0 to 10]

Source: WolframAlpha: plot 2x^2 and 10x for x from 0 to 10

Note that 2x^2 <= 10x exactly when x <= 5, so between x=0..5 we have f(x) <= g(x), but for any larger x, f(x) quickly outgrows g(x).

Analogously, if A1 is a quadratic algorithm with a low overhead, and A2 is a linear algorithm with a high overhead, for smaller input, A1 may be faster than A2.

Thus, you can, should you choose to do so, create a hybrid algorithm A3 which simply selects one of the two algorithms depending on the size of the input. Whether or not this is worth the effort depends on the actual parameters involved.
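As a rough sketch of such a hybrid A3 (illustration only; the class name HybridSort and the cutoff of 32 elements are arbitrary choices, and this is not the JDK's implementation), one can switch to insertion sort for small ranges and use merge sort otherwise:

    import java.util.Arrays;

    public class HybridSort {
        // Arbitrary cutoff chosen for illustration; real libraries tune this empirically.
        private static final int INSERTION_CUTOFF = 32;

        public static void sort(int[] a) {
            sort(a, 0, a.length);
        }

        // Sorts the half-open range [from, to).
        private static void sort(int[] a, int from, int to) {
            if (to - from <= INSERTION_CUTOFF) {
                insertionSort(a, from, to);   // low overhead wins on small ranges
                return;
            }
            int mid = from + (to - from) / 2;
            sort(a, from, mid);
            sort(a, mid, to);
            merge(a, from, mid, to);
        }

        private static void insertionSort(int[] a, int from, int to) {
            for (int i = from + 1; i < to; i++) {
                int key = a[i];
                int j = i - 1;
                while (j >= from && a[j] > key) {
                    a[j + 1] = a[j];
                    j--;
                }
                a[j + 1] = key;
            }
        }

        private static void merge(int[] a, int from, int mid, int to) {
            int[] left = Arrays.copyOfRange(a, from, mid);
            int[] right = Arrays.copyOfRange(a, mid, to);
            int i = 0, j = 0, k = from;
            while (i < left.length && j < right.length) {
                // <= keeps the left element first, so the merge step is stable
                a[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
            }
            while (i < left.length) a[k++] = left[i++];
            while (j < right.length) a[k++] = right[j++];
        }
    }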

Many tests and comparisons of sorting algorithms have been made, and it was decided that because insertion sort beats merge sort for small arrays, it was worth it to implement both for Arrays.sort.

Why does java.util.Arrays use two sorting algorithms?

A good explanation:

Quicksort is faster in both cases. Mergesort is stable in both cases. But for primitive types quicksort is stable too! That’s because primitive types in Java are like elementary particles in quantum mechanics. You can’t tell the difference between one 7 and another 7. Their value is all that defines them. Sort the array [7, 6, 6, 7, 6, 5, 4, 6, 0] into [0, 4, 5, 6, 6, 6, 6, 7, 7]. Not only do you not care which 6 ended up in which position; it’s a meaningless question. The array positions don’t hold pointers to the objects. They hold the actual values of the objects. We might as well say that all the original values were thrown away and replaced with new ones. Or not. It just doesn’t matter at all. There is no possible way you can tell the difference between the output of a stable and an unstable sorting algorithm when all that’s sorted are primitive types. Stability is irrelevant with primitive types in Java.

Why does Collections.sort use Mergesort but Arrays.sort does not?

The API guarantees a stable sort, which Quicksort doesn’t offer. However, when sorting primitive values by their natural order you won’t notice a difference, as primitive values have no identity. Therefore, Quicksort can be used for primitive arrays, and will be used when it is considered more efficient¹.

For objects, you may notice a difference when objects with distinct identities that are deemed equal according to their equals implementation or the provided Comparator change their relative order. Therefore, Quicksort is not an option, and a variant of Merge Sort is used; current Java versions use TimSort. This applies to both Arrays.sort and Collections.sort, though with Java 8, the List itself may override the sort algorithm.
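For example (a small sketch with made-up list contents), both of the following calls end up in the stable object sort, and since Java 8 Collections.sort(List, Comparator) simply delegates to List.sort:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class ListSortDemo {
        public static void main(String[] args) {
            List<String> words = new ArrayList<>(List.of("pear", "fig", "apple", "kiwi"));

            // Since Java 8, List defines sort(Comparator) itself...
            words.sort(Comparator.comparingInt(String::length));

            // ...and Collections.sort(list, cmp) delegates to it.
            Collections.sort(words, Comparator.naturalOrder());

            System.out.println(words); // [apple, fig, kiwi, pear]
        }
    }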


¹ The efficiency advantage of Quicksort is that it needs less memory when done in place. But it has a dramatically bad worst-case performance and cannot exploit runs of pre-sorted data in an array, which TimSort does.

Therefore, the sorting algorithms were reworked from version to version, while staying in the now misleadingly named class DualPivotQuicksort. Also, the documentation didn’t catch up, which shows that it is, in general, a bad idea to name an internally used algorithm in a specification when it is not necessary.

The current situation (including Java 8 to Java 11) is as follows:

  • Generally, the sorting methods for primitive arrays will use Quicksort only under certain circumstances. For larger arrays, they will first try to identify runs of pre-sorted data, like TimSort does, and will merge them when the number of runs does not exceed a certain threshold. Otherwise, they will fall back to Quicksort, but with an implementation that falls back to Insertion sort for small ranges, which affects not only small arrays, but also Quicksort’s recursion.
  • sort(char[],…) and sort(short[],…) add another special case: they use Counting sort for arrays whose length exceeds a certain threshold
  • Likewise, sort(byte[],…) will use Counting sort, but with a much smaller threshold, which creates the biggest contrast to the documentation, as sort(byte[],…) never uses Quicksort. It only uses Insertion sort for small arrays and Counting sort otherwise.
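To make the Counting sort idea concrete, here is a simplified sketch of the technique for byte[] (the class name ByteCountingSort is invented, the JDK's thresholds are omitted, and this is not the JDK's actual code): it just counts how often each of the 256 possible values occurs and rewrites the array.

    public class ByteCountingSort {
        // Simplified counting sort for byte[]: O(n + 256) time, one small count array.
        public static void sort(byte[] a) {
            int[] counts = new int[256];
            for (byte b : a) {
                counts[b + 128]++;            // shift -128..127 into the index range 0..255
            }
            int k = 0;
            for (int v = 0; v < 256; v++) {
                int c = counts[v];
                byte value = (byte) (v - 128);
                while (c-- > 0) {
                    a[k++] = value;           // rewrite the array in ascending value order
                }
            }
        }
    }

No comparisons are made at all, which is why this beats any comparison sort once the array is large enough.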

Drawback of using Arrays.sort() method in Java instead of QuickSort or MergeSort

Unless you really need (proven by measurements) a more efficient implementation, and can come up with one, it always makes much more sense to use Arrays.sort. It is efficient and well-tested. It rarely makes any sense at all to implement Quicksort or Merge Sort, as these are generic sorting algorithms. It might make sense to implement a custom sorting algorithm that uses some properties of your data that Java has no way of using (counting sort and bucket sort are nice examples).

As for Arrays.sort, the Oracle implementation works as follows:

  1. For int, long, float and double there are essentially three algorithms: Insertion Sort, Dual-Pivot Quicksort and Merge Sort.

1.1. If the array is large enough (currently larger than 286 elements) and it looks “almost sorted” (determined dynamically by counting runs of pre-sorted data), then Merge Sort is used.

1.2. Otherwise, if the array is very small (currently less than 47 elements), then Insertion Sort is used.

1.3. Otherwise (if the array is not that small, but is either smaller than 286 elements or not nearly sorted), Dual-Pivot Quicksort is used.


  2. For short and char, there are Insertion Sort, Dual-Pivot Quicksort and Counting Sort.

2.1. If the array is larger than a certain threshold (currently 3200 elements), then Counting Sort is used.

2.2. Otherwise, similarly to the larger types, Insertion Sort or Dual-Pivot Quicksort is used (using the threshold of 47 elements).


  3. For bytes, it's like for shorts, but Counting Sort is used when there are more than 29 elements (since it doesn't require that much memory).

  4. For reference types, things are complicated.

4.1. If the java.util.Arrays.useLegacyMergeSort property is set, then some sort of legacy Merge Sort is used (surprisingly), which falls back to Insertion Sort on really small arrays (less than 7 elements).

4.2. Otherwise, TimSort is used (Tim Peters's list sort for Python), which is something similar to Merge Sort, but for arrays of fewer than 32 objects no merging is performed.
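As an aside, the property mentioned in 4.1 is a real compatibility switch. It is read when the class initializes, so it is normally set on the command line, roughly like this (com.example.MyApp is just a placeholder main class):

    java -Djava.util.Arrays.useLegacyMergeSort=true com.example.MyApp

Its main purpose is to keep code running that relies on the old merge sort silently tolerating comparators that violate their contract, which TimSort instead rejects with an IllegalArgumentException.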

The point of the above is that Java people really did their research instead of blindly implementing a random sorting algorithm, hoping that everyone will be happy with it.

In my experience, I find Arrays.sort to be extremely efficient. I can think of two reasons to implement a custom algorithm: the aforementioned case using certain data properties, and sorting data that comes in different “parallel” arrays that we can't merge into one array of composite objects for whatever reason (performance, or perhaps we just don't have control over the code that produced those arrays).
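One common workaround for the parallel-arrays case (a sketch with invented data) is to sort an array of indices with Arrays.sort and a comparator over the key array, then read every parallel array through that index order; the boxing of the indices is exactly the overhead that might eventually justify a hand-rolled sort:

    import java.util.Arrays;
    import java.util.Comparator;

    public class ParallelArraySort {
        public static void main(String[] args) {
            // Two "parallel" arrays we cannot merge into one array of objects.
            String[] names = {"Carol", "Alice", "Bob"};
            int[] scores  = {70, 90, 80};

            // Sort an array of indices by the key array, then read both arrays through it.
            Integer[] order = {0, 1, 2};
            Arrays.sort(order, Comparator.comparingInt((Integer i) -> scores[i]));

            for (int i : order) {
                System.out.println(names[i] + " " + scores[i]);
            }
            // Carol 70
            // Bob 80
            // Alice 90
        }
    }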

Arrays.sort() -- two different strategies for primitive & complex data types to be sorted

Because sorting reference types is guaranteed to be stable, whereas sorting primitives does not need to be (so quicksort, a non-stable sorting algorithm, can be used; merge sort, on the other hand, is in fact stable). Also note that quicksort is generally faster than mergesort in practice, which explains why it is taken advantage of when sorting primitives.

From the Arrays.sort(Object[]) documentation:

This sort is guaranteed to be stable: equal elements will not be reordered as a result of the sort.

Why does Arrays.sort use the quicksort algorithm, and not another sorting algorithm?

Quicksort has O(n log n) average-case and O(n^2) worst-case performance. O(n log n) is the best average case a comparison-based sort can achieve; other sorting algorithms reach it too, but quicksort tends to perform better than most in practice.

See: http://en.wikipedia.org/wiki/Quicksort

Arrays.sort() vs. Sorting Algorithms

Most language-provided implementations of sort are really good. I doubt, for most cases -- unless you have extra knowledge about the type of data that might help you sort -- that you will ever be able to beat it.

However, using them doesn't provide you with the 'knowledge' about how they work. If your goal is to get stuff done, or to provide production-ready code, use the pre-built methods. If your goal is to gain knowledge or demonstrate knowledge, write your own.

If you are asked to sort something at an interview or an assignment, be very clear with the interviewer/teacher: "Do you want me to implement sorting myself, or use a pre-built sorting function?" If you don't ask, you could be docked either way. Some interviewers or teachers might think "oh, she doesn't know how to implement her own sort" and others might think "why isn't she using the sort that is part of the language?"

Role of sorting algorithms in spite of having array.sort

Why do we need so many sorting algorithms to sort an array of integers when we have array.sort which already does the same?

Most programming languages ship with some sort of sorting algorithm in their standard libraries. Most of the time, you can just use this default sorting algorithm and you'll be fine.

That said, different sorting algorithms have different performance characteristics and different tradeoffs. Most library implementations of sorting algorithms use comparison sorts like quicksort, timsort, or introsort. These algorithms can be used to sort any objects that can be compared to one another, so they're good generic sorts. For certain cases - such as sorting integers - there are specialized algorithms that can take advantage of the fact that you're sorting integer data. If, for example, you're sorting an array with lots of small numbers in it, it might be faster to use counting sort or radix sort than quicksort, both in theory and in practice.
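For illustration, here is a compact LSD radix sort sketch for non-negative int values (the class name LsdRadixSort is invented; it processes 8 bits per pass, and handling negative numbers would need an extra step such as flipping the sign bit, which is omitted here). It is not tuned code, just the idea of exploiting integer structure instead of comparisons:

    import java.util.Arrays;

    public class LsdRadixSort {
        // LSD radix sort for non-negative ints: four stable counting passes, 8 bits each.
        public static void sort(int[] a) {
            int[] aux = new int[a.length];
            for (int shift = 0; shift < 32; shift += 8) {
                int[] counts = new int[257];
                for (int x : a) {
                    counts[((x >>> shift) & 0xFF) + 1]++;   // histogram of the current byte
                }
                for (int i = 0; i < 256; i++) {
                    counts[i + 1] += counts[i];             // prefix sums give start positions
                }
                for (int x : a) {
                    aux[counts[(x >>> shift) & 0xFF]++] = x; // stable distribution pass
                }
                System.arraycopy(aux, 0, a, 0, a.length);
            }
        }

        public static void main(String[] args) {
            int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
            sort(data);
            System.out.println(Arrays.toString(data)); // [2, 24, 45, 66, 75, 90, 170, 802]
        }
    }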

There are other considerations to take into account as well. Quicksort, for example, is fast on average but has degenerate cases. You might have applications where you absolutely cannot hit these cases, in which case using a custom sorting algorithm might be appropriate.

Are sorting and finding the largest number in an array different problems? I mean, if an array is sorted, obviously the largest number will end up at the end.

Yes, these are different problems. You can find the largest element of an array in one pass over the array by simply recording the largest value that you've seen so far. This takes time O(n), uses space O(1), and is extremely efficient. Sorting algorithms need to spend more time and effort than this because they have to also find the second-largest element, the third-largest element, etc. In that sense, sorting is a fundamentally "harder" problem than the problem of finding the largest element of an array.
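For instance, a one-pass maximum looks like this (trivial sketch; it assumes a non-empty array):

    public class MaxElement {
        // One pass, O(n) time, O(1) extra space; assumes the array is non-empty.
        public static int max(int[] a) {
            int best = a[0];
            for (int i = 1; i < a.length; i++) {
                if (a[i] > best) {
                    best = a[i];
                }
            }
            return best;
        }
    }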

Which is the best sorting algorithm to sort an array?

As I alluded to earlier, there is no one "best" algorithm to sort an array. It depends on many factors, such as what sorts of elements are stored in the array, what the distribution of those elements are, whether you want worst-case or average-case efficiency, how much memory you want to use, etc. As with most aspects of software engineering, there are lots of different tradeoffs to make, and it's up to you to choose which one you think is best.

Hope this helps!


