Radix Sort Implemented in C++

Radix Sort implemented in C++

I think you're severely overcomplicating your solution. You can implement radix using the single array received in the input, with the buckets in each step represented by an array of indices that mark the starting index of each bucket in the input array.

In fact, you could even do it recursively:

// Sort 'size' number of integers starting at 'input' according to the 'digit'th digit
// For the parameter 'digit', 0 denotes the least significant digit and increases as significance does
void radixSort(int* input, int size, int digit)
{
if (size == 0)
return;

int[10] buckets; // assuming decimal numbers

// Sort the array in place while keeping track of bucket starting indices.
// If bucket[i] is meant to be empty (no numbers with i at the specified digit),
// then let bucket[i+1] = bucket[i]

for (int i = 0; i < 10; ++i)
{
radixSort(input + buckets[i], buckets[i+1] - buckets[i], digit+1);
}
}

Of course buckets[i+1] - buckets[i] will cause a buffer overflow when i is 9, but I omitted the extra check or readability's sake; I trust you know how to handle that.

With that, you just have to call radixSort(testcases, sizeof(testcases) / sizeof(testcases[0]), 0) and your array should be sorted.

Explanation on Radix Sort Algorithm

This is a neat example of unnecessary compact and, hence, hard to read code.

To analyze it it helps to separate it a bit:

// what a mess...
c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;

First the argument of cs subscription is taken out:

// determine index for c
const int iC = k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0;
// post-increment c (as it is it could become a pre-increment as well)
c[iC]++;

The index calculation contains a condition:

// determine index for c
const int iC
// check whether k is (not) exceeding the size of a
= k < a[j].size()
// then
? (int)(unsigned char)a[j][k] + 1
// else
: 0;

The array a is an array of std::strings where std::string contains itself an array of char. So, a[j][k] results in a single char. A char might be signed or unsigned – that's left to the compiler. So, (unsigned char)a[j][k] doesn't change the bits of that char but interprets them as unsigned number. Then (int)(unsigned char)a[j][k] promotes that to an int.

Please, note that this may be different to (int)a[j][k] if the current compiler has signed chars because in that case the possible sign of a value would be kept. (This is called sign extension.) So, the whole thing is just responsible to convert the current character into a (positive) index and adds 1 finally.


Actually, I intended to leave the rest as practice for the reader but then I saw this:

b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];

Separating it like above, it results in:

const int iC = k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0;
const int iB = c[iC - 1]; // What?
b[iB] = a[r];

Considering that iC might result in 0 (while I didn't check the whole code whether this is possible at all), iC - 1 might result in -1. So, c[-1] would be accessed.

This might be correct if e.g. c where a pointer into a bigger array but not at its begin. So, a negative index would access valid storage. This seems not to be the case here:

c = new int[257];

and I couldn't see any other assignment to c.

This all doesn't look much trustworthy. At best, the condition is over-pessimistic and 0 is never assigned.


I'm quite sure that I could demonstrate that less compact code may improve readability if not help to easier uncover possible issues in it.

So, is the non-compact code slower?
According to my experience, it is not on modern compilers with its amazing optimization capabilities.

I once read an article about optimization and Static single assignment form.
As well I see all the funny $$ variables in Visual Studios debugger watch window from time to time when I debug my C++ code (which definitely doesn't contain any variable named $$).
So, I believe the compiler will do internally something similar as well. – Doing this explicitly to improve readability shouldn't have the least impact on to performance.

If I really would be in doubt I still could check the assembler output.
(Compiler Explorer is a good place for example.)


Btw. c = new int[257];?

Why not int c[257];?

257 int values are not that much that I would afraid to exceed the stack size immediately.

Not to mention that, arrays and especially arrays allocated with new are really bad C++ style and asking for U.B.. As if std::vector was not yet invented…


I somehow missed the lessons about Radix sorting when I was a student (while I must admit that I didn't yet miss this knowledge in daily business).
So, out of curiosity, I had a look into Wikipedia and re-implemented the descriptions found there.
This is intended to provide a (hopefully better) replacement for what OP found and exposed in the question.

Thereby, I implemented

  1. a naïve approach according to the description on en.wikipedia.org: Radix sort – History
  2. and then OPs shown approach (with counting sort) which I found on de.wikipedia.org: Countingsort – Algorithmus.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

/* helper to find max. length in data strings
*/
size_t maxLength(const std::vector<std::string> &data)
{
size_t lenMax = 0;
for (const std::string &value : data) {
if (lenMax < value.size()) lenMax = value.size();
}
return lenMax;
}

/* a naive implementation of radix sort
* like described in https://en.wikipedia.org/wiki/Radix_sort
*/
void radixSort(std::vector<std::string> &data)
{
/* A char has 8 bits - which encode (unsigned) the numbers of [0, 255].
* Hence, 256 buckets are used for sorting.
*/
std::vector<std::string> buckets[256];
// determine max. length of input data:
const size_t len = maxLength(data);
/* iterate over data for according to max. length
*/
for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
// sort data into buckets according to the current "digit":
for (std::string &value : data) {
/* digits after end of string are considered as '\0'
* because 0 is the usual end-marker of C strings
* and the least possible value of an unsigned char.
* This shall ensure that an string goes before a longer
* string with same prefix.
*/
const unsigned char digit = i < value.size() ? value[i] : '\0';
// move current string into the corresponding bucket
buckets[digit].push_back(std::move(value));
}
// store buckets back into data (preserving current order)
data.clear();
for (std::vector<std::string> &bucket : buckets) {
// append bucket to the data
data.insert(data.end(),
std::make_move_iterator(bucket.begin()),
std::make_move_iterator(bucket.end()));
bucket.clear();
}
}
}

/* counting sort as helper for the not so naive radix sort
*/
void countSort(std::vector<std::string> &data, size_t i)
{
/* There are 256 possible values for an unsigned char
* (which may have a value in [0, 255]).
*/
size_t counts[256] = { 0 }; // initialize all counters with 0.
// count how often a certain charater appears at the place i
for (const std::string &value : data) {
/* digits after end of string are considered as '\0'
* because 0 is the usual end-marker of C strings
* and the least possible value of an unsigned char.
* This shall ensure that an string goes before a longer
* string with same prefix.
*/
const unsigned char digit = i < value.size() ? value[i] : '\0';
// count the resp. bucket counter
++counts[digit];
}
// turn counts of digits into offsets in data
size_t total = 0;
for (size_t &count : counts) {
#if 0 // could be compact (and, maybe, confusing):
total = count += total; // as C++ assignment is right-associative
#else // but is the same as:
count += total; // add previous total sum to count
total = count; // remember new total
#endif // 0
}
// an auxiliary buffer to sort the input data into.
std::vector<std::string> buffer(data.size());
/* Move input into aux. buffer
* while using the bucket offsets (the former counts)
* for addressing of new positions.
* This is done backwards intentionally as the offsets
* are decremented from end to begin of partitions.
*/
for (size_t j = data.size(); j--;) { // j-- -> check for 0 and post-decrement
std::string &value = data[j];
// see comment for digit above...
const unsigned char digit = i < value.size() ? value[i] : '\0';
/* decrement offset and use as index
* Arrays (and vectors) in C++ are 0-based.
* Hence, this is adjusted respectively (compared to the source of algorithm).
*/
const size_t k = --counts[digit];
// move input element into auxiliary buffer at the determined offset
buffer[k] = std::move(value);
}
/* That's it.
* Move aux. buffer back into data.
*/
data = std::move(buffer);
}

/* radix sort using count sort internally
*/
void radixCountSort(std::vector<std::string> &data)
{
// determine max. length of input data:
const size_t len = maxLength(data);
/* iterate over data according to max. length
*/
for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
countSort(data, i);
}
}

/* output of vector with strings
*/
std::ostream& operator<<(std::ostream &out, const std::vector<std::string> &data)
{
const char *sep = " ";
for (const std::string &value : data) {
out << sep << '"' << value << '"';
sep = ", ";
}
return out;
}

/* do a test for certain data
*/
void test(const std::vector<std::string> &data)
{
std::cout << "Data: {" << data << " }\n";
std::vector<std::string> data1 = data;
radixSort(data1);
std::cout << "Radix Sorted: {" << data1 << " }\n";
std::vector<std::string> data2 = data;
radixCountSort(data2);
std::cout << "Radix Count Sorted: {" << data2 << " }\n";
}

/* helper to turn a text into a vector of strings
* (by separating at white spaces)
*/
std::vector<std::string> tokenize(const char *text)
{
std::istringstream in(text);
std::vector<std::string> tokens;
for (std::string token; in >> token;) tokens.push_back(token);
return tokens;
}

/* main program
*/
int main()
{
// do some tests:
test({ "Hi", "He", "Hello", "World", "Wide", "Web" });
test({ });
test(
tokenize(
"Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.\n"
"Radix sorting algorithms came into common use as a way to sort punched cards as early as 1923.\n"
"The first memory-efficient computer algorithm was developed in 1954 at MIT by Harold H. Seward.\n"
"Computerized radix sorts had previously been dismissed as impractical "
"because of the perceived need for variable allocation of buckets of unknown size.\n"
"Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, "
"allowing for a single static allocation of auxiliary memory.\n"
"The linear scan is closely related to Seward's other algorithm - counting sort."));
}

Output:

Data: { "Hi", "He", "Hello", "World", "Wide", "Web" }
Radix Sorted: { "He", "Hello", "Hi", "Web", "Wide", "World" }
Radix Count Sorted: { "He", "Hello", "Hi", "Web", "Wide", "World" }
Data: { }
Radix Sorted: { }
Radix Count Sorted: { }
Data: { "Radix", "sort", "dates", "back", "as", "far", "as", "1887", "to", "the", "work", "of", "Herman", "Hollerith", "on", "tabulating", "machines.", "Radix", "sorting", "algorithms", "came", "into", "common", "use", "as", "a", "way", "to", "sort", "punched", "cards", "as", "early", "as", "1923.", "The", "first", "memory-efficient", "computer", "algorithm", "was", "developed", "in", "1954", "at", "MIT", "by", "Harold", "H.", "Seward.", "Computerized", "radix", "sorts", "had", "previously", "been", "dismissed", "as", "impractical", "because", "of", "the", "perceived", "need", "for", "variable", "allocation", "of", "buckets", "of", "unknown", "size.", "Seward's", "innovation", "was", "to", "use", "a", "linear", "scan", "to", "determine", "the", "required", "bucket", "sizes", "and", "offsets", "beforehand,", "allowing", "for", "a", "single", "static", "allocation", "of", "auxiliary", "memory.", "The", "linear", "scan", "is", "closely", "related", "to", "Seward's", "other", "algorithm", "-", "counting", "sort." }
Radix Sorted: { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }
Radix Count Sorted: { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }

Live Demo on coliru

Please, note that the strings are sorted interpreting the numerical values of the characters.
If instead an English-dictionary sorting would be intended than the digit to bucket mapping had to be modified. Thereby, the order of character values might be changed as well as mapping corresponding uppercase and lowercase characters to the same bucket.

Frequently copying strings (or other containers) is space and time consuming and something, I would prevent at best in productive code.
The move semantics is one option to lower the stress for the CPU while keeping the code quite clean and comparable to the algorithm behind.
This is what I tried to consider (to my best knowledge) in the sample code.

Using C to implement radix sort

You have problem with code:

for(i=1;i<size;i++){
bucket[i] += bucket[i-1];
}

Because size can be greater then bucket_size.

And probably you have problem with:

for(i=size-1;i>=0;i--){
semi_sort[--bucket[(arr[i]/significant_num)%10]] = arr[i];
}

Because --bucket[(arr[i]/significant_num)%10] can be greater then 11 or less then 0.

And find_the_largest does not work correct on negative numbers.

You can dynamically allocate memory for your buffers, like this:

semi_sort = malloc(size * (sizeof *semi_sort));

And don't forget to free memory on end (free(semi_sort)).



Related Topics



Leave a reply



Submit