hash function in Python 3.3 returns different results between sessions
Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. Because the hash is offset by a random seed (set once at startup), attackers can no longer predict which keys will collide.
You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random, but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.
Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.
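To see the randomization in action, here is a small sketch that spawns fresh interpreters (so each one picks, or is given, its own seed) and compares the resulting hashes; the helper name hash_of is made up for illustration:

```python
import os
import subprocess
import sys

def hash_of(text, seed):
    # Launch a fresh interpreter: each process picks (or is given) its own seed.
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With a fixed seed, the hash of a str is reproducible across runs.
fixed_a = hash_of("example", "0")
fixed_b = hash_of("example", "0")
print(fixed_a == fixed_b)  # True

# With PYTHONHASHSEED=random (the 3.3+ default), separate runs disagree.
rand_a = hash_of("example", "random")
rand_b = hash_of("example", "random")
print(rand_a == rand_b)
```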
If you were relying on the order of keys in a Python set, then don't. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.
Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.
If you need a stable hash implementation, you probably want to look at the hashlib module; it implements cryptographic hash functions. The pybloom project uses this approach.
Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.
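As a sketch of the hashlib approach, the following builds a run-stable 64-bit integer hash from a SHA-256 digest (the function name stable_hash is invented here):

```python
import hashlib

def stable_hash(text: str) -> int:
    # SHA-256 depends only on the input bytes, so the result is identical
    # across processes, machines, and Python versions.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Fold the 32-byte digest down to a hash()-sized 64-bit int.
    return int.from_bytes(digest[:8], "big")

print(stable_hash("example") == stable_hash("example"))  # True
```

Unlike hash(), this is unseeded by design, so it offers no collision-DoS protection; use it only where stability matters more than that protection.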
Does Python's hash function remain identical across different versions?
No. Apart from long-standing differences between 32- and 64-bit versions of Python, the hashing algorithm was changed in Python 3.3 to resolve a security issue:
By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
As a result, from 3.3 onwards hash() is not even guaranteed to return the same result across different invocations of the same Python version.
hash() returning different values on different OSs
There are no guarantees about the value hash returns in Python. It appears that you're using a 32-bit Python on Windows (that's a guess), and a 64-bit Python on Linux (again, a guess). IIRC (and I haven't checked), the default hash(item) is derived from the address of item.
If you want to have values you can compare across operating systems, look at hashlib.
Python consistent hash replacement
Short answer to a broad question: There are no explicit guarantees made about hashing stability aside from the overall guarantee that x == y requires that hash(x) == hash(y). There is an implication that x and y are both defined in the same run of the program (you can't perform x == y where one of them doesn't exist in that program, obviously, so no guarantees are needed about the hash across runs).
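The x == y guarantee can be seen directly with Python's numeric types, which are deliberately hashed so that equal values of different types agree:

```python
from decimal import Decimal
from fractions import Fraction

# Equal values must hash equal, even across numeric types:
print(1 == 1.0 == Fraction(1, 1) == Decimal(1))                          # True
print(hash(1) == hash(1.0) == hash(Fraction(1, 1)) == hash(Decimal(1)))  # True
```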
Longer answers to specific questions:
Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?
It's true of numeric types, with the mechanism being officially documented, but the mechanism is only guaranteed for a particular interpreter for a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64-bit build vs. 32-bit build, even 3.n vs. 3.n+1) they can differ (documented to differ in the case of 64- vs. 32-bit CPython), so the hashes won't be portable across machines with different interpreters.
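The documented numeric mechanism can be checked against sys.hash_info; the following assumes a CPython interpreter:

```python
import sys

# For a nonnegative int n, the documented rule on CPython is
# hash(n) == n % sys.hash_info.modulus, where the modulus is
# 2**61 - 1 on a 64-bit build and 2**31 - 1 on a 32-bit build.
M = sys.hash_info.modulus
n = 10**30
print(hash(n) == n % M)              # True
print(M in (2**61 - 1, 2**31 - 1))   # True on CPython
```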
No guarantees on algorithm are made for tuple and frozenset; I can't think of any reason they'd change it between runs (if the underlying types are seeded, tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 they made a change to reduce the number of hash collisions in short tuples of ints and floats), so if you store off the hashes of tuples from, say, 3.7, and then compute hashes of the same tuples in 3.8+, they won't match (even though they'd match between runs on 3.7 or between runs on 3.8).
If so, is that expected to stay that way?
Expected to, yes. Guaranteed, no. I could easily see seeded hashes for ints (and, by extension, for all numeric types, to preserve the numeric hash/equality guarantees) for the same reason they seeded hashes for str/bytes, etc. The main hurdles would be:
- It would almost certainly be slower than the current, very simple algorithm.
- By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
- It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).
Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
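The memoryview behavior is easy to demonstrate within a single process:

```python
data = b"payload"
view = memoryview(data)  # read-only view with byte-sized format 'B'

# A hashable memoryview hashes exactly like the bytes it exposes,
# so it inherits the seeded str/bytes hashing too.
print(hash(view) == hash(view.tobytes()) == hash(data))  # True
```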
What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?
The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function, using it the same way you do hash itself. This avoids having one single function defined in a single place that must handle all types; you can have the const_hash default implementation (which just relies on hash for those things with known consistent hashes) in a central location, and provide additional definitions there for the built-in types you know aren't consistent (or which might contain inconsistent stuff), while still allowing people to extend the set of things it covers seamlessly by registering their own single-dispatch handlers: they import your const_hash and decorate the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable.
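A minimal sketch of that single-dispatch design, assuming SHA-256 digests for the seeded types (the combining constants and the name const_hash are illustrative choices, not a standard):

```python
import functools
import hashlib

@functools.singledispatch
def const_hash(obj) -> int:
    # Default: trust hash() for types whose hash is already run-stable
    # (ints, floats, None, ...). Register overrides for seeded types.
    return hash(obj)

@const_hash.register
def _(obj: str) -> int:
    # str hashing is seeded, so substitute a content-only digest.
    digest = hashlib.sha256(obj.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

@const_hash.register
def _(obj: bytes) -> int:
    digest = hashlib.sha256(obj).digest()
    return int.from_bytes(digest[:8], "big")

@const_hash.register
def _(obj: tuple) -> int:
    # Order-sensitive combination of the element hashes.
    result = len(obj)
    for item in obj:
        result = (result * 1000003 + const_hash(item)) % (2**61 - 1)
    return result

@const_hash.register
def _(obj: frozenset) -> int:
    # XOR is order-independent, matching frozenset semantics.
    result = 0
    for item in obj:
        result ^= const_hash(item)
    return result

d = {"a": 1, "b": 2.5}
print(const_hash(frozenset(d.items())))  # same value on every run
```

Third-party code can then add support for its own types with @const_hash.register without touching the central definition.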
Sort by value then by key returns different results upon multiple runs?
By default, tuples are compared in field order. That is, tuples are sorted by their first fields, and in the case of ties the second fields are compared, etc. So, if your challenge is how to sort by score followed by name, it may be as simple as leveraging this inherent feature of tuples, with one wrinkle: you want the numeric sort to be from high to low, while you want the lexicographical sort to be from low to high. The following example does that, albeit in a somewhat tricky way.
Example:
word_scores = [('parish', 7.427144133408616), ('saar', 4.406719247264253), ('saaremaa', 4.406719247264253), ('jõe', 4.406719247264253), ('villag', 4.208268308540415)]
sorted_word_scores = sorted(word_scores, key=lambda ws: (-ws[1], ws[0]))
print(sorted_word_scores)
Output:
[('parish', 7.427144133408616), ('jõe', 4.406719247264253), ('saar', 4.406719247264253), ('saaremaa', 4.406719247264253), ('villag', 4.208268308540415)]
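If the descending key weren't numeric (so it couldn't simply be negated), the same ordering can be produced with two passes that exploit the stability of Python's sort:

```python
word_scores = [('parish', 7.427144133408616), ('saar', 4.406719247264253),
               ('saaremaa', 4.406719247264253), ('jõe', 4.406719247264253),
               ('villag', 4.208268308540415)]

# Pass 1: order by the secondary key (name, ascending).
by_name = sorted(word_scores, key=lambda ws: ws[0])
# Pass 2: order by the primary key (score, descending); the sort is stable,
# so entries with equal scores keep their name ordering from pass 1.
result = sorted(by_name, key=lambda ws: ws[1], reverse=True)
print([word for word, _ in result])
# ['parish', 'jõe', 'saar', 'saaremaa', 'villag']
```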