What's a Good Hash Function for English Words

What's a good hash function for English words?

Simply summing the letters is not a good strategy, because a permutation of the same letters gives the same result.
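To see why, here is a minimal C sketch (the function name and the sample strings are just for illustration): "stop" and "pots" contain exactly the same letters, so a letter-sum hash cannot tell them apart.

#include <stdio.h>

/* Naive letter-sum "hash": the order of the characters is lost. */
static unsigned long sum_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

int main(void)
{
    /* Prints the same value twice, because "pots" is a permutation of "stop". */
    printf("%lu %lu\n", sum_hash("stop"), sum_hash("pots"));
    return 0;
}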

This one (djb2) is quite popular and works nicely with ASCII strings.

unsigned long hashstring(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
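In a hash table you would typically reduce the result modulo the table size; a small usage sketch (the table size and the sample key are just illustrative):

#include <stdio.h>

unsigned long hashstring(unsigned char *str);   /* the djb2 function above */

#define TABLE_SIZE 1024                         /* illustrative bucket count */

int main(void)
{
    unsigned long h = hashstring((unsigned char *)"example");
    unsigned long bucket = h % TABLE_SIZE;      /* index into the bucket array */
    printf("hash=%lu bucket=%lu\n", h, bucket);
    return 0;
}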

More info here.

If you need more alternatives and some performance measures, read here.

Added: These are general hashing functions, where the input domain is not known in advance (except perhaps for some very general assumptions: e.g. the one above works slightly better with ASCII input), which is the most usual scenario. If you have a known, restricted domain (a fixed set of inputs), you can do better; see Fionn's answer.

Good Hash Function for Strings

Usually hashes wouldn't just sum the character values, because otherwise stop and pots would have the same hash.

And you wouldn't limit it to the first n characters, because otherwise house and houses would have the same hash.

Generally, hashes take values and multiply them by a prime number (which makes it more likely to generate unique hashes), so you could do something like:

// s is the String being hashed
int hash = 7;
for (int i = 0; i < s.length(); i++) {
    hash = hash * 31 + s.charAt(i);
}

Hash algorithm for English phrases

Yes, it seems impossible to find a perfectly collision-free hashing algorithm, so I might end up maintaining a mapping file.

I also found a great question and answer here.


Actually, I don't mind the performance of this algorithm, since it all runs on the server, and only once at startup. All I want is for the id of every word/phrase to be unique and as short as possible, like a fingerprint. I wonder if I can take advantage of prime numbers...

Finally, I decided to use a long for my id:

(8 bit) first letter of the first word

(8 bit) last letter of the last word

(4 bit) word count

(4 bit) the index of the longest word in the phrase

(8 bit) character count, space included

(32 bit) MurmurHash3 result

You can find a MurmurHash3 C# implementation here:

https://gist.github.com/automatonic/3725443

I think this approach will generate unique ids for any existing words and phrases with no collision.
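For what it's worth, a rough C sketch of that packing might look like the following; the field order and the murmur3_32 prototype are my own assumptions, standing in for whichever MurmurHash3 implementation you actually link in (the gist above is C#).

#include <stdint.h>
#include <string.h>

/* Assumed prototype for a 32-bit MurmurHash3; substitute your real implementation. */
uint32_t murmur3_32(const void *key, size_t len, uint32_t seed);

/* Packs the six fields listed above into a single 64-bit id.
   The exact bit layout is an assumption; any fixed layout works as long as it is consistent. */
uint64_t phrase_id(const char *phrase,
                   uint8_t first_letter, uint8_t last_letter,
                   uint8_t word_count, uint8_t longest_word_index)
{
    size_t len = strlen(phrase);                         /* character count, spaces included */
    uint64_t id = 0;

    id |= (uint64_t)first_letter << 56;                  /* 8 bits */
    id |= (uint64_t)last_letter << 48;                   /* 8 bits */
    id |= (uint64_t)(word_count & 0x0F) << 44;           /* 4 bits */
    id |= (uint64_t)(longest_word_index & 0x0F) << 40;   /* 4 bits */
    id |= (uint64_t)(len & 0xFF) << 32;                  /* 8 bits */
    id |= murmur3_32(phrase, len, 0);                    /* low 32 bits */

    return id;
}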

hash function for string

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

What is a good Hash Function?

For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographic security or anything else more advanced, then YMMV. If you just want a kick-ass general-purpose hash function for a hash table lookup, then this is what you're looking for.

What is the best 32bit hash function for short strings (tag names)?

If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.
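As a sketch, here is one way to do that in C with OpenSSL's SHA-1 (the library choice and taking the first four digest bytes are just assumptions for illustration; link with -lcrypto):

#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

/* 32-bit hash obtained by truncating the 160-bit SHA-1 digest to its first four bytes. */
uint32_t hash32(const char *s)
{
    unsigned char digest[SHA_DIGEST_LENGTH];

    SHA1((const unsigned char *)s, strlen(s), digest);
    return ((uint32_t)digest[0] << 24) |
           ((uint32_t)digest[1] << 16) |
           ((uint32_t)digest[2] << 8)  |
            (uint32_t)digest[3];
}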

What is a good 64bit hash function in Java for textual strings?

Why don't you use a long variant of the default String.hashCode() (where some really smart people certainly put effort into making it efficient, not to mention the thousands of developer eyes that have already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
    long h = 1125899906842597L; // prime
    int len = string.length();

    for (int i = 0; i < len; i++) {
        h = 31 * h + string.charAt(i);
    }
    return h;
}

If you're looking for even more bits, you could probably use a BigInteger.

Edit:

As I mentioned in a comment on the answer of @brianegge, there are not many use cases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have very few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible, though.

To combine the hashes of several fields, multiply one by a small prime and add them (rather than XORing them):

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad, as it returns 0 if both values are equal; therefore {'foo','foo'} and {'bar','bar'} would have the same hash code.

Have a good hash function for a C++ hash table?

Now, assuming you want a hash and want something blazing fast that would work in your case, because your strings are just 6 chars long, you could use this magic:

size_t precision = 2; // change the precision with this

size_t hash(const char* str)
{
    return (*(size_t*)str) >> precision;
}

CRC is for slowpokes ;)

Explanation:
This works by casting the contents of the string pointer to "look like" a size_t (int32 or int64, based on the optimal match for your hardware). So the contents of the string are interpreted as a raw number, no worries about characters anymore, and you then bit-shift it by the precision needed (you tweak this number for the best performance; I've found 2 works well for hashing strings in sets of a few thousand).

Also, the really neat part is that any decent compiler on modern hardware will hash a string like this in one assembly instruction, hard to beat that ;)
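As a usage sketch (the tag literal and bucket count are just illustrative), you would call it like any other string hash and reduce the result into your table; padding the buffer to a full word keeps the word-sized read inside the string:

#include <stdio.h>

size_t hash(const char *str);              /* the function from the snippet above */

int main(void)
{
    char tag[8] = "height";                /* a 6-character tag, padded so the word-sized read stays in the buffer */
    size_t buckets = 4096;                 /* illustrative table size */
    size_t index = hash(tag) % buckets;    /* bucket index for this tag */
    printf("%lu\n", (unsigned long)index);
    return 0;
}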


