Good Hash Function for Strings

Good Hash Function for Strings

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash = 7;
for (int i = 0; i < strlen; i++) {
hash = hash*31 + charAt(i);
}

hash function for string

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}

Hash function for a string

First, it usually does not matter that much in practice. Most hash functions are "good enough".

But if you really care, you should know that it is a research subject by itself. There are thousand of papers about that. You can still get a PhD today by studying & designing hashing algorithms.

Your second hash function might be slightly better, because it probably should separate the string "ab" from the string "ba". On the other hand, it is probably less quick than the first hash function. It may, or may not, be relevant for your application.

I'll guess that hash functions used for genome strings are quite different than those used to hash family names in telephone databases. Perhaps even some string hash functions are better suited for German, than for English or French words.

Many software libraries give you good enough hash functions, e.g. Qt has qhash, and C++11 has std::hash in <functional>, Glib has several hash functions in C, and POCO has some hash function.

I quite often have hashing functions involving primes (see Bézout's identity) and xor, like e.g.

#define A 54059 /* a prime */
#define B 76963 /* another prime */
#define C 86969 /* yet another prime */
#define FIRSTH 37 /* also prime */
unsigned hash_str(const char* s)
{
unsigned h = FIRSTH;
while (*s) {
h = (h * A) ^ (s[0] * B);
s++;
}
return h; // or return h % C;
}

But I don't claim to be an hash expert. Of course, the values of A, B, C, FIRSTH should preferably be primes, but you could have chosen other prime numbers.

Look at some MD5 implementation to get a feeling of what hash functions can be.

Most good books on algorithmics have at least a whole chapter dedicated to hashing. Start with wikipages on hash function & hash table.

What is a good Hash Function?

For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.

What is a good 64bit hash function in Java for textual strings?

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
long h = 1125899906842597L; // prime
int len = string.length();

for (int i = 0; i < len; i++) {
h = 31*h + string.charAt(i);
}
return h;
}

If you're looking for even more bits, you could probably use a BigInteger
Edit:

As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bit allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bit (18,446,744,073 billion different keys) your certainly save, regardless of whatever crazy scenario you need it for. Thinking of usecases for 128 bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hash for several fields, simply do an XOR multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.



Related Topics



Leave a reply



Submit