Consistent String#Hash Based Only on the String's Content

Ruby's Digest module provides plenty of content-based hash functions: http://ruby-doc.org/stdlib/libdoc/digest/rdoc/index.html

A simple example:

require 'digest/sha1'
Digest::SHA1.hexdigest("some string")
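Python's standard hashlib module offers the same kind of content-only digest, if you need identical behaviour outside Ruby; a minimal sketch:

```python
import hashlib

# SHA-1 depends only on the bytes of the input, so the digest is
# stable across processes, machines, and interpreter versions.
digest = hashlib.sha1("some string".encode("utf-8")).hexdigest()

# Recomputing from equal content always yields the same digest.
assert digest == hashlib.sha1(b"some string").hexdigest()
print(digest)  # a 40-character hex string
```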

Can I be sure the built-in hash for a given string is always the same?

Just to add some detail as to where the idea of a changing hashcode may have come from.

As the other answers have rightly said, the hash code for a specific string will always be the same for a specific runtime version. There is no guarantee that a newer runtime won't use a different algorithm, perhaps for performance reasons.

The String class overrides the default GetHashCode implementation in Object.

The default implementation for a reference type in .NET is to allocate a sequential ID (held internally by .NET) and assign it to the object (the object's heap storage has a slot for storing this hash code; it is only assigned on the first call to GetHashCode for that object).

Hence creating an instance of a class, assigning it some values and then retrieving the hash code, followed by doing the exact same sequence with the same set of values, will yield different hash codes. This may be why some have been led to believe that hash codes can change. In fact, it is the instance of a class which is allocated a hash code; once allocated, that hash code does not change for that instance.
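CPython behaves analogously: the default object hash is identity-based, while str overrides it to depend only on content. A small illustration (plain CPython semantics, not .NET):

```python
class Point:
    """No __hash__ override, so the default identity-based hash applies."""
    def __init__(self, x, y):
        self.x, self.y = x, y

a = Point(1, 2)
b = Point(1, 2)

# Same field values, but distinct instances: the hashes normally differ,
# because the default hash is derived from each object's identity.
print(hash(a) == hash(b))  # normally False

# A str's hash, by contrast, depends only on its content within one process.
s1 = "a" + "b"
s2 = "ab"
assert hash(s1) == hash(s2)
```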

Edit: I've just noticed that none of the answers directly references each of your questions (although I think the answers to them are clear), so just to tidy up:

Can I be sure that the hash for a given string ("a very long string") will be always the same?

In your usage, yes.

Can I be sure that two different strings won't have the same hash?

No. Two different strings may have the same hash.

Also, if possible, how likely is it to get the same hash for different strings?

The probability is quite low: the resulting hash is spread fairly uniformly over a 32-bit domain (about 4 billion values).
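The birthday bound gives a quick way to estimate that collision risk for a 32-bit hash; a small sketch (the figures are approximations for uniformly random hashes, not properties of any particular hash function):

```python
import math

def collision_probability(k, bits=32):
    """Approximate probability that at least two of k uniformly random
    hash values collide in a 2**bits domain (birthday approximation)."""
    n = 2 ** bits
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n))

# A thousand strings are very unlikely to collide...
print(collision_probability(1_000))    # roughly 0.0001
# ...but a hundred thousand make a collision quite plausible.
print(collision_probability(100_000))  # roughly 0.69
```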

Persistent Hashing of Strings in Python

If a hash function really won't work for you, you can turn the string into a number.

my_string = 'my string'

def string_to_int(s):
    # Encode each character as a zero-padded three-digit ordinal.
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In [10]: string_to_int(my_string)
Out[10]: 109121032115116114105110103

This is invertible, by mapping each triplet back through chr (note that it relies on the first character's ordinal having three digits; a leading zero would be lost when the number is parsed).

def int_to_string(n):
    s = str(n)
    # Decode each three-digit chunk back into a character.
    return ''.join(chr(int(s[i:i+3])) for i in range(0, len(s), 3))

In [12]: int_to_string(109121032115116114105110103)
Out[12]: 'my string'
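If you want a fixed-width persistent value rather than a number that grows with the string, a digest from hashlib can be truncated to an integer; a minimal sketch (the 64-bit truncation is an arbitrary choice for illustration):

```python
import hashlib

def persistent_hash(s):
    """Content-based hash that is stable across runs and machines,
    unlike built-in hash(), which is salted per process by default."""
    digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 64)  # truncate to a 64-bit value

print(persistent_hash("my string") == persistent_hash("my string"))  # True
```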

Good Hash Function for Strings

Usually hashes wouldn't just sum the character values; otherwise stop and pots would have the same hash.

And you wouldn't limit it to the first n characters, because otherwise house and houses would have the same hash.

Generally, hashes take each value and multiply the accumulator by a prime number (which makes it more likely to generate unique hashes). So you could do something like:

int hash = 7;
for (int i = 0; i < s.length(); i++) {
    hash = hash * 31 + s.charAt(i);
}
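A direct Python transcription of that loop (the seed 7 and multiplier 31 are taken from the snippet above) shows that the anagram and prefix cases no longer collide:

```python
def poly_hash(s, seed=7, prime=31):
    """Polynomial rolling hash: position-sensitive, so anagrams differ."""
    h = seed
    for ch in s:
        h = h * prime + ord(ch)
    return h

# Anagrams no longer collide, because character order affects the result...
print(poly_hash("stop") != poly_hash("pots"))     # True
# ...and prefixes differ too, because every character contributes.
print(poly_hash("house") != poly_hash("houses"))  # True
```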

Why is the hash function O(1)

Big-O notation is a way of describing how the execution time of an algorithm will grow with the size of the data-set it is working on. In order for that definition to be applied, we have to specify what the data-set is that will be growing.

In most cases that's obvious, but sometimes it's a bit ambiguous, and this is one of those cases. Does "data set" in this case refer to an individual string? If so, then the algorithm is O(N), since the sequence of operations it has to perform grows linearly with the size of the string. But if by "data set" you mean the number of items in the associated hash table, then the algorithm can be considered O(1), since it will operate just as quickly (for any given string) when dealing with a million-entry Hashtable as it would for a two-entry Hashtable.

Regarding using hashing to compare two strings: once you have computed the hash values for the two strings, and stored those hash values somewhere, you can compare the two hash values in O(1) time (since comparing two integers is a fixed-cost operation, regardless of the sizes of the strings they were calculated from). It's important to note, however, that this comparison can only tell you if the two strings are different -- if the two hash values are equal, you still can't be 100% certain that the two strings are equal, since (due to the pigeonhole principle) two different strings can hash to the same value. So in that case you'd still need to compare the strings the regular O(N)/char-by-char way.
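That two-step check, an O(1) hash comparison falling back to an O(N) character comparison only when the hashes match, can be sketched as follows; the precomputed-hash wrapper class is just an illustrative structure:

```python
class HashedString:
    """Pairs a string with its precomputed hash for cheap inequality tests."""
    def __init__(self, s):
        self.s = s
        self.h = hash(s)  # O(N) once, then reused for every comparison

    def equals(self, other):
        # Different hashes prove the strings differ: O(1) rejection.
        if self.h != other.h:
            return False
        # Equal hashes are only a hint (pigeonhole principle), so a
        # full O(N) comparison is still required to confirm equality.
        return self.s == other.s

a = HashedString("a very long string")
b = HashedString("a very long string")
c = HashedString("a different string")
print(a.equals(b), a.equals(c))  # True False
```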

Constant-time hash for strings?

In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However, I think that for practical purposes you can use approximate hashes that can easily be O(1).

Consider a string hash that always uses min(n, 20) characters to compute a standard hash. Obviously this is O(1) as the string grows. Will it work reliably? It depends on your domain...
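A sketch of such an approximate hash, which looks at no more than the first 20 characters (the cap of 20 comes from the paragraph above; mixing in the length is an extra assumption to help distinguish same-prefix strings of different sizes):

```python
def approx_hash(s, cap=20):
    """O(1) approximate string hash: examines at most `cap` characters.
    Strings sharing a long common prefix and the same length collide."""
    h = len(s)  # mix in the length so 'house'/'houses' still differ
    for ch in s[:cap]:
        h = h * 31 + ord(ch)
    return h

long_a = "x" * 100 + "a"
long_b = "x" * 100 + "b"
# Same first 20 characters and same length: the tails are ignored.
print(approx_hash(long_a) == approx_hash(long_b))  # True
# Short strings are hashed in full, so small edits are detected.
print(approx_hash("hello") != approx_hash("hello!"))  # True
```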

Consistency of hashCode() on a Java string

I can see that documentation as far back as Java 1.2.

While it's true that in general you shouldn't rely on a hash code implementation remaining the same, it's now documented behaviour for java.lang.String, so changing it would count as breaking existing contracts.

Wherever possible, you shouldn't rely on hash codes staying the same across versions etc - but in my mind java.lang.String is a special case simply because the algorithm has been specified... so long as you're willing to abandon compatibility with releases before the algorithm was specified, of course.
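The documented algorithm, s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] evaluated in 32-bit signed arithmetic, is simple enough to reproduce outside Java; a sketch in Python:

```python
def java_string_hashcode(s):
    """Reimplementation of java.lang.String.hashCode() as documented:
    h = s[0]*31^(n-1) + ... + s[n-1], wrapped to a 32-bit signed int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # 32-bit wraparound
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hashcode("Hello"))  # 69609650, matching "Hello".hashCode()
```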

Why does the hash() function in python take constant time to operate on strings of variable length?

Since strings are immutable, the hashcode of a string is computed only once and cached thereafter.

A better way to benchmark would be to generate different (unique) strings of length k and average their hash times, instead of calling hash on the same string multiple times.
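A rough benchmarking sketch along those lines (the timings are machine-dependent; the point is the methodology of hashing each distinct string only once, so the cached hash is never reused):

```python
import timeit

def avg_hash_time(k, n=1000):
    """Average time (seconds) to hash n distinct strings of length k."""
    # Unique strings, so no cached hash can be reused across calls.
    strings = [("x" * (k - len(str(i)))) + str(i) for i in range(n)]
    total = timeit.timeit(lambda: [hash(s) for s in strings], number=1)
    return total / n

# Timing the SAME string repeatedly mostly measures the cache, not the
# algorithm: CPython computes a str's hash once and stores it.
print(avg_hash_time(100) > 0)  # True
```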


