Why Do -1 and -2 Both Hash to -2 in Cpython

Why do -1 and -2 both hash to -2 in CPython?

-1 is a reserved value at the C level of CPython which prevents hash functions from being able to produce a hash value of -1. As noted by DSM, the same is not true in IronPython and PyPy where hash(-1) != hash(-2).

See this Quora answer:

If you write a type in a C extension module and provide a tp_hash
method, you have to avoid -1 — if you return -1, Python will assume
you meant to throw an error.

If you write a class in pure Python and provide a __hash__ method,
there's no such requirement, thankfully. But that's because the C code
that invokes your __hash__ method does that for you — if your
__hash__ returns -1, then hash() applied to your object will actually return -2.

Which really just repackages the information from effbot:

The hash value -1 is reserved (it’s used to flag errors in the C
implementation). If the hash algorithm generates this value, we simply
use -2 instead.

You can also see this in the source. For example for Python 3’s int object, this is at the end of the hash implementation:

if (x == (Py_uhash_t)-1)
x = (Py_uhash_t)-2;
return (Py_hash_t)x;

Since they do, how does Python tell these two numbers apart?

Since all hash functions map a large input space to a smaller input space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good you are. Add Unicode characters and strings of unlimited length to that!

Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.

This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).


Indeed, the Python release today address exactly this, with a security patch that addresses the efficiency issue when this (identical hash values, but on a massive scale) is used as a denial of service attack - http://mail.python.org/pipermail/python-list/2012-April/1290792.html

When is a python object's hash computed and why is the hash of -1 different?

The hash is generally computed each time it's used, as you can quite easily check yourself (see below).
Of course, any particular object is free to cache its hash. For example, CPython strings do this, but tuples don't (see e.g. this rejected bug report for reasons).

The hash value -1 signals an error in CPython. This is because C doesn't have exceptions, so it needs to use the return value. When a Python object's __hash__ returns -1, CPython will actually silently change it to -2.

See for yourself:

class HashTest(object):
def __hash__(self):
print('Yes! __hash__ was called!')
return -1

hash_test = HashTest()

# All of these will print out 'Yes! __hash__ was called!':

print('__hash__ call #1')
hash_test.__hash__()

print('__hash__ call #2')
hash_test.__hash__()

print('hash call #1')
hash(hash_test)

print('hash call #2')
hash(hash_test)

print('Dict creation')
dct = {hash_test: 0}

print('Dict get')
dct[hash_test]

print('Dict set')
dct[hash_test] = 0

print('__hash__ return value:')
print(hash_test.__hash__()) # prints -1
print('Actual hash value:')
print(hash(hash_test)) # prints -2

Why aren't CPython Dicts affected by Hash Values for Negative one and negative two

The fact that x has different hash value than y implies that x != y. But the converse is not true! So when x has hash value equal to y they are still checked for equality explicitly.

A situation when hash(x) == hash(y) and x != y is called a collision in the context of hash functions and is something that's likely to happen from time to time. You want to avoid it as much as possible but in general it is inevitable. You can read more about hash functions and collisions here.

hash( (-2,2) ) == hash( (2,-2) ) returns True (Python)

Before looking at the hashing logic, lets see few tidbits.

  1. Hash value of the python integers will be the numbers themselves.

  2. Only exception to point 1 is, -1. Because -1 is internally used to indicate error cases. So, hash value of -1 will be -2. (For -2 also, it will be -2 only.)

With this understanding, we can answer your examples 2, 3 and 5.

Example 2: Hash value of -1 == Hash value of -2. So, it is true.

Example 3: Same as the reason for Example 2

Example 5: Hash value of -1 and 1 are different (-2 and 1 respectively). That's why the result is False.

The case, Example 4, hash value of floating point numbers are not the same as the numbers themselves, because hash values should be integers. And the numbers are not represented as they are in the memory. (2.01 is NOT stored as 2.01 itself in the memory). So, hash values of them being different is acceptable.

To answer the Example 1, lets see some code. As per the tuple hashing implementation, lets calculate the hash value using a Python program

x, hash_mult = int("0x345678", 16), 1000003
x, hash_mult = (x ^ 2) * hash_mult, hash_mult + 82522
x = (x ^ -2) * hash_mult
print(x + 97531) # -3713082714466793269

x, hash_mult = int("0x345678", 16), 1000003
x, hash_mult = (x ^ -2) * hash_mult, hash_mult + 82522
x = (x ^ 2) * hash_mult
print(x + 97531) # -3713082714466793269

print(hash((-2, 2))) # -3713082714466793269

Note: This is not only true for -2, 2, it is true for all integers, except the above seen case and all the multiples of 8.

Why can a Python dict have multiple keys with the same hash?

For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?

Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.

If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).

The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.

Why is CPython's hash(-1) != -1

From http://effbot.org/zone/python-hash.htm:

The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash algorithm generates this value, we simply use -2 instead.

In hash collision, how does CPython know which value is stored at index HASHVALUE and which value is stored at RESOLUTIONINDEX

Keys are stored with their associated values

Each index into a CPython hash table is associated with a structure that includes three fields: the hash value, a key pointer, and a value pointer:

typedef struct {
Py_ssize_t me_hash;
PyObject *me_key;
PyObject *me_value;
} PyDictEntry;

How it works for a hash collision

In your scenario, {hash5, key5, value5} are stored at index 12, and {hash13, key13, value13} are stored at an alternate index chosen by open-addressing collision resolution.

When looking up key5 at index 12, Python verifies that the key matches and then returns the associated value5.

Contrariwise, when first looking for key13 at index 12, Python notices that key13 != key5 and will not return value5. Instead, it jumps to the alternate index where key13 matches and then returns the associated value13.

Conclusion

You asked how does "CPython know which value is stored at index HASHVALUE and which value is stored at RESOLUTIONINDEX". The answer is that values are stored with the associated keys, making it possible to know which value is associated with which key.

Why doesn't Python hash function give the same values when run on Android implementation?

for old python (at least, my Python 2.7), it seems that

hash(<some type>) = id(<type>) / 16

and for CPython id() is the address in memory - http://docs.python.org/2/library/functions.html#id

>>> id(int) / hash(int)                                                     
16
>>> id(int) % hash(int)
0

so my guess is that the Android port has some strange convention for memory addresses?

anyway, given the above, hashes for types (and other built-ins i guess) will differ across installs because functions are at different addresses.

in contrast, hashes for values (what i think you mean by "non-internal objects") (before the random stuff was added) are calculated from their values and so likely repeatable.

PS but there's at least one more CPython wrinkle:

>>> for i in range(-1000,1000):
... if hash(i) != i: print(i)
...
-1

there's an answer here somewhere explaining that one...



Related Topics



Leave a reply



Submit