Hashtable in C++

What is a hash table and how do you make it in C?

Prerequisites

For this answer I'm going to assume you know how to use pointers, structs, and have a basic understanding of the C language.

Also if you don't know. When talking about the speed of algorithms and data structures you should know the terms:

O() = (it's pronounced "Big-oh") Big-oh or O() refers to the "worst-case-scenario" runtime. Similarly, in math, it's big O notation and describes the limiting behavior of a function. If somethings O(1) that's constant time "really good". If somethings O(n) that means if the list is a million long. It is at worst going to run a million time. O() is generally the one used to determine how fast something runs because that's how fast it'll run in it's worst case.

Ω = (greek letter Omega) refers to it's best case scenario. It's not used that as much as O() so I won't go into too much detail about it. But just know that if somethings Ω(1), in it's best case scenario it'll take just one time.

Θ = (greek letter theta) is unique in that it is only used when the O() and Ω() runtime are the same. So like in the case of the recursive sorting algorithm merge sort. It's run time is Θ(n(log(n))). Which means that it's O(n(log(n))) and it's Ω(n(log(n))).

What is a Hash table?

A hash table or associative array is a popular data structure used in programming. A hash table is just a linked list (I'll get to what a linked list is later on) with a hash function. A hash function basically just takes things and puts them in different "baskets". Each "basket" is just another linked list or something else depending on how you implement it. I'll explain more details on hash tables when I show you how to implement one.

Why would I want to use a hash table rather than an array?

An array is very easy to use and simple to make, but it also has its downsides. For this example, let's say we have a program and in which we want to keep all its users in an array.

That's pretty simple. Let's just say we plan on this program having no more than 100 users and fill that array with our users

char* users[100];

// iterate over every user and "store" their name
for (int i = 0; i < userCount; i++)
{
users[i] = "New username here";
}

So that works all well and fine and really fast too. That's O(1) right there. We can access any user in constant time.

But let's now assume that our program gets really popular. It now has over 80 users. Uh-Oh! We better increase the size of that array or else we'll get a buffer overflow.

So how do we do that? Well we're gonna have to make a new array that's bigger and copy over the contents of the old array into the new array.

That's very costly and we don't want to do that. We want to think cleverly and not use a something that has a fixed size. Well we already know how to use pointers to our advantage and we can bundle information into a struct if we wanted to.

So we could create a struct to store the username and then have it point (via a pointer) to a new struct. Voila! We now have a data structure that is expandable. It's a list of bundled information that's linked together by pointers. Thus the name linked list.

Linked Lists

So let's create that linked list. First we're gonna need a struct

typedef struct node
{
char* name;
struct node* next;
}
node;

Alright so we have a string name and a... Wait a sec... I've never heard of a data type called a struct node. Well for our convenience I typedef a new "data type" called node that also happens to be our struct called node.

So now that we have our node for our list, what do we need next? Well we need to create a "root" to our list so we can traverse it (I'll explain what I mean by traverse later). So let's assign a root. (remember that node data type I typdefed earlier)

node* first = NULL;

So now that we have our root all we need to do is make a function to insert new usernames into our list.

/*
* inserts a name called buffer into
* our linked list
*/
void insert(char* buffer)
{
// try to instantiate node for number
node* newptr = malloc(sizeof(node));
if (newptr == NULL)
{
return;
}

// make a new ponter
newptr->name = buffer;
newptr->next = NULL;

// check for empty list
if (first == NULL)
{
first = newptr;
}
// check for insertion at tail
else
{
// keep track of the previous spot in list
node* predptr = first;

// because we don't know how long this list is
// we must induce a forever loop until we find the end
while (true)
{
// check if it is the end of the list
if (predptr->next == NULL)
{
// add new node to end of list
predptr->next = newptr;

// break out of forever loop
break;
}

// update pointer
predptr = predptr->next;
}
}
}

So there you go. We have a basic linked list and now we can keep adding users all we want and we don't have to worry about running out of room. But this does come with down sides. The big problem with this is that every node or "user" in our list is "anonymous". We don't know were they are at or even how many users we have with this. (of course there are ways of making this much better -- I just want to show a very basic linked list) We have to traverse the entire list to add a user because we cannot access the end directly.

It's like we are in a huge dust storm and you can't see anything and we need to get to our barn. We can't see where our barn is but we have a solution. There are people standing our there (our nodes) and they are all holding two ropes (our pointers). Each person only owns one rope but that rope is being held at the other end by someone else. Just like our struct, the rope acts as a pointer to where they are. So how do we get to our barn? (for this example the barn is the last "person" in the list). Well we have no idea how big our line of people are or where they go. In fact, all we see is a fence post with a rope tied to it. (Our root!) that fence post will never change so we can grab the post and start moving along until we see our first person. That person is holding two ropes (the post's pointer and their pointer).

So we keep traveling along the rope until we reach a new person and grab onto their rope. Eventually, we get to the end and find our barn!

So that is a linked list in a nutshell. Its benefits are that it can expand as much as you want but its runtime depends on how big the list is, namely O(n). So if there are 1 million users, it would have to run 1 million times to insert a new name! Wow that seems really wasteful just to insert 1 name.

Luckily, we are clever and can create a better solution. Why don't we, instead of having just one linked list, have a few linked lists. An array of linked lists if you will. Why don't we make an array of size 26. So we can have a unique linked list for every letter of the alphabet. Now instead of a run time of n. We can reasonably say that our new run time will be n/26. Now that won't make much of a difference at all if you have a list 1 million big. But we're just going to keep it simple for this example.

So we have an array of linked lists but how are we going to sort our users into the array. Well... why don't we make a function that decides which user should go where. This function will "hash" the users if you will into an array or "table". So let's create this "hashed" linked list. Thus the name hash table

Hash Table

As I just said, our hash table will be an array of linked lists and will be hashed by the first letter of their username. A will go to position 0, B to 1, and so on.

The struct for the this hash table will be the same as the struct for our previous linked list

typedef struct node
{
char* name;
struct node* next;
}
node;

Now just like our linked list, we need a root for our hash table

node* first[26] = {NULL};

The root will be an array the size of the alphabet and all positions in it will be initialized to NULL. (Remember: the last element in a linked list always has to point to NULL or else we wouldn't know it was the end)

Let's make a main function. It takes a username we are going to hash then insert.

int main(char* name)
{
// hash the name into a spot
int hashedValue = hash(name);

// insert the name in table with hashed value
insert(hashedValue, name);
}

So here's our hash function. It's pretty simple. All we want to do is look at the first letter in the word and give a value from 0 - 25 based on what letter it is

/*
* takes a string and hashes it into the correct bucket
*/
int hash(const char* buffer)
{
// assign a number to the first char of buffer from 0-25
return tolower(buffer[0]) - 'a';
}

So now all we need is to create our insert function. It's going to look just like our insert function before except every time we reference our root, we're going to reference it as an array.

/*
* takes a string and inserts it into a linked list at a part of the hash table
*/
void insert(int key, const char* buffer)
{
// try to instantiate node to insert word
node* newptr = malloc(sizeof(node));
if (newptr == NULL)
{
return;
}

// make a new pointer
strcpy(newptr->name, buffer);
newptr->next = NULL;

// check for empty list
if (first[key] == NULL)
{
first[key] = newptr;
}
// check for insertion at tail
else
{
node* predptr = first[key];
while (true)
{
// insert at tail
if (predptr->next == NULL)
{
predptr->next = newptr;
break;
}

// update pointer
predptr = predptr->next;
}
}
}

So that's basics of a hash table. It's pretty simple if you know how to use pointers and structs. I know that was a pretty simple example of a hash table with only an insert function, but you can make it a lot better and get more creative with your hashing function. You can also make the array as big as you want or even a use multi-dimensional array.

how to initializing a hash table in C

What you are dealing with is Undefined Behavior.

See, struct node_t *hash_table[HSZ];
So, hash_table is an array of HSZ (127) pointers of the data type struct node_t.

When you do,

for(i=0; i<HSZ; i++){
hash_table[i]->val = 0;
hash_table[i]->next = NULL;
}

hash_table[0] to hash_table[126] pointers are not pointing to anything.
So, each of them (or all of them) should be initialized first to point to an object of the type struct node_t and then you can initialize them. For that matter, Using a memset does not cause a problem because memset is filling the contents of the pointers with all zeros. There is difference between filling the pointers with all zeros and filling all zeros to the memory pointed by pointers.

Trying this,

for(i=0; i<HSZ; i++){
hash_table[i].val = 0;
hash_table[i]->next = NULL;
}

is plain wrong.

To fix the issue you are facing, you need to allocate memory dynamically using malloc. You can do the in your for loop.

for(i = 0; i < HSZ; i++) 
{
//Allocate memory of the size struct_node_t
hash_table[i] = malloc(sizeof(struct node_t)); //Do not cast!
//Check if memory is allocated
if(hash_table[i] == NULL)
{
//Memory not allocated, set some error state to handle and break
break;
}
//Initialize to zero
hash_table[i]->val = 0;
hash_table[i]->next = NULL;
}

Why are there no hashtables in the C standard library?

There is no hashtable in the standard C library because either:

  • no-one has submitted a proposal to the working group; or
  • the working group has deemed it unnecessary.

That's the way ISO works. Proposals are put forward and accepted or rejected.

You have to be careful what you add to the standard library since you have two conflicting groups. As a user, you might want every data structure under the sun to be added to the standard to make the language more useful.

But, as a language implementor (as an aside, these are probably the people that tend to make up most of the various working groups so their view is likely to have more impact), you don't really want the hassle of having to implement stuff that may not be used by everyone. All the stuff that was there when C89 appeared was to do with the fact that the primary purpose was to codify existing practice rather than introduce new practices. All iterations of the standards since then have been a little freer in what they can do but backwards compatibility is still an important issue.

Myself, I also have conflicts. I'd love to have all the features of the Java, C++ or Python libraries at my disposal in C. Of course, that would make it so much harder to learn everything for newcomers and, as one commenter stated, probably make it so any code monkey can pump out useful code, reducing my value in the process :-)

And I pretty much have all the data structures I'll ever need, from my long and (mostly) illustrious career. You're not limited to the standard library for this sort of stuff. There are plenty of third-party tools you can get to do the job and (like me) you can also roll your own.

If you want to know why certain decisions were made in each iteration, ISO (and ANSI originally, before ISO took over) usually publish rationale documents. The C89 one from ANSI can be found here. It contains this little beauty in the scope:

This Rationale focuses primarily on additions, clarifications, and changes made to the language as described in the Base Documents. It is not a rationale for the C language as a whole: the Committee was charged with codifying an existing language, not designing a new one. No attempt is made in this Rationale to defend the pre-existing syntax of the language, such as the syntax of declarations or the binding of operators.

I especially enjoy the admission that they're not responsible for any unholy mess that may have predated their attempts to standardise.

But, perhaps the real answer to your question lies in this bit, one of the guiding principles:


Keep the spirit of C. The Committee kept as a major goal to preserve the traditional spirit of C. There are many facets of the spirit of C, but the essence is a community sentiment of the underlying principles upon which the C language is based. Some of the facets of the spirit of C can be summarized in phrases like:

  • Trust the programmer.
  • Don't prevent the programmer from doing what needs to be done.
  • Keep the language small and simple.
  • Provide only one way to do an operation.
  • Make it fast, even if it is not guaranteed to be portable.

That third one is probably the main reason why the library wasn't massively expanded with the initial standardisation effort - that, and the fact that such an expansion from a committee would probably have resulted in ANSI C being labeled C2038 rather than C89.

Creating Hash tables in C

the basic concept is easy:

  1. use a very large array to store the data
  2. use the hash function to index the array
  3. you will have collisions- so figure out a mechanism to deal with those ( a linked list is probably the easiest)

So, when you want to write data : index = hash_function(); array[index] = data. Similar for read.

You will figure the rest out once you read more about hash tables.



Related Topics



Leave a reply



Submit