Most Efficient Property to Hash for a Numpy Array

You can simply hash the underlying buffer if you make it read-only:

>>> a = np.random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop

For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.

>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
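
For what it's worth, a.tostring() has since been renamed a.tobytes(), and on Python 3 hashing a.data (a memoryview) only works for read-only byte-format buffers, so a sketch of the same comparison on current NumPy looks like this:

import numpy as np

a = np.random.randint(10, 100, size=100_000)

# Hashes the complete raw buffer; tobytes() is the modern spelling of tostring()
full_hash = hash(a.tobytes())

# Much faster for large arrays, but str(a) elides the middle, so most of the
# data never enters the hash
preview_hash = hash(str(a))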

How to make a tuple including a numpy array hashable?

The fastest way to hash a numpy array is likely via its raw bytes, e.g. tostring (renamed tobytes in newer NumPy).

In [11]: %timeit hash(y.tostring())

What you could do is rather than use a tuple define a class:

class KeySet(object):
    def __init__(self, i, arr):
        self.i = i
        self.arr = arr

    def __hash__(self):
        # Combine the scalar with a hash of the array's raw bytes
        return hash((self.i, hash(self.arr.tostring())))

Now you can use it in a dict:

In [21]: ks = KeySet(0, npArray)

In [22]: myDict = {ks: 1}

In [23]: myDict[ks]
Out[23]: 1
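
One caveat: the lookup above succeeds because it reuses the very same ks object, and without an __eq__ method dictionaries fall back on identity comparison. If two KeySet instances with equal contents should find the same entry, a sketch along these lines would be needed (the __eq__ shown here is my addition, not part of the original answer):

import numpy as np

class KeySet(object):
    def __init__(self, i, arr):
        self.i = i
        self.arr = arr

    def __hash__(self):
        return hash((self.i, self.arr.tobytes()))

    def __eq__(self, other):
        # Keys that compare equal must also hash equal for dict lookups to work
        return (isinstance(other, KeySet)
                and self.i == other.i
                and np.array_equal(self.arr, other.arr))

myDict = {KeySet(0, np.arange(3)): 1}
myDict[KeySet(0, np.arange(3))]   # 1, even though this is a different object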

Why is md5 hashing so much faster on strings than on numpy arrays in python?

str(random_matrix) will not include all of the matrix, because numpy elides the middle of large arrays with "...":

>>> x = np.ones((1000, 1000))
>>> print(str(x))
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

So when you hash str(random_matrix), you aren't really hashing all the data.
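
To digest all of the data, feed md5 the raw buffer instead of the string form; a minimal sketch:

import hashlib
import numpy as np

x = np.ones((1000, 1000))

# Covers every element, unlike str(x), which elides the middle of large arrays
digest = hashlib.md5(x.tobytes()).hexdigest()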

See the earlier questions above about how to hash numpy arrays.

numpy - most efficient way of calculating f(v1,v2) for every pair of rows in a 2d array

If you have an arbitrary function of two rows, a more generic approach is np.fromiter, which is usually faster than an explicit for loop:

import itertools
import numpy as np

n = 4
data = np.random.random((n, n))

def func(tup):
    # Cosine similarity between one pair of row vectors
    v1, v2 = tup
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    return np.dot(v1, v2) / (n1 * n2)

# np.float was removed in newer NumPy; plain float works here
out = np.fromiter(map(func, itertools.product(data, data)), float).reshape(n, n)

print(out)
[[1.         0.57588563 0.44980109 0.93490176]
 [0.57588563 1.         0.71004626 0.6908402 ]
 [0.44980109 0.71004626 1.         0.68118222]
 [0.93490176 0.6908402  0.68118222 1.        ]]
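
For this particular func (cosine similarity of row pairs), the Python-level loop can be avoided entirely; a vectorized sketch that should produce the same matrix:

import numpy as np

n = 4
data = np.random.random((n, n))

# Normalize each row once; all pairwise cosine similarities
# then reduce to a single matrix product
norms = np.linalg.norm(data, axis=1, keepdims=True)
unit = data / norms
out = unit @ unit.T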

Using scalar ndarrays as keys

Python dictionaries require their keys to implement both __eq__ and __hash__ methods, and Python's data model requires that:

  1. The hash of an object does not change during its lifetime
  2. If x == y then hash(x) == hash(y)

Numpy's ndarray class overrides __eq__ to support elementwise comparison and broadcasting. This means that for numpy arrays x and y, x == y is not a boolean but another array. This in itself probably rules out ndarrays functioning correctly as dictionary keys.

Even ignoring this quirk of ndarray.__eq__, it would be tricky to come up with a (useful) implementation of ndarray.__hash__. Since the data in a numpy array is mutable, we could not use that data to calculate __hash__ without violating the requirement that the hash of an object does not change during its lifetime.

There is nothing wrong with defining __hash__ for mutable objects provided that the hash itself does not change during the object's lifetime. Similarly, dictionary keys can be mutable provided they implement __hash__ and the hash is immutable. E.g. simple user-defined classes are mutable but can still be used as dictionary keys.
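
Putting those requirements together, one common workaround is a wrapper that freezes a private copy of the array so the hash cannot be invalidated by later mutation. A minimal sketch (HashableArray is a made-up name, not a numpy class):

import numpy as np

class HashableArray:
    def __init__(self, arr):
        # A private read-only copy: mutating the original cannot change the hash
        self._arr = np.array(arr)
        self._arr.flags.writeable = False

    def __hash__(self):
        return hash(self._arr.tobytes())

    def __eq__(self, other):
        return (isinstance(other, HashableArray)
                and np.array_equal(self._arr, other._arr))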

Find all positions in a NumPy array that contain a substring (most efficient?)

np.char is a group of functions that apply string methods to the elements of an array like yours. So using the find function:

In [311]: np.char.find(A, 'contig')
Out[311]:
array([[-1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1],
       [-1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]])

-1 marks elements where the substring was not found; 0 or larger is the index where the match starts.

In [312]: np.where(np.char.find(A, 'contig')>=0)
Out[312]: (array([0, 1, 1, 2], dtype=int32), array([2, 2, 9, 2], dtype=int32))

In [313]: A[_]
Out[313]:
array(['contig_1758_2278_4341_-', 'contig_1758_2278_4341_-',
       'contig_1297_3232_198298_+', 'contig_1281_415_1704_-'],
      dtype='<U149')

Functions like this have to iterate over the elements and apply the corresponding string method, so they aren't as fast as the usual numpy numeric code, but they are a lot easier than writing your own iteration.


np.vectorize or np.frompyfunc can also be used to apply a function to each element of an array. They iterate too, so they aren't a significant speedup over your own loop; still, I have found that frompyfunc often provides a roughly 30% speedup.

In [331]: f = np.frompyfunc(lambda x: x.find('contig'), 1, 1)  # like char.find

In [332]: f = np.frompyfunc(lambda x: 'contig' in x, 1, 1)     # your 'in' test

In [333]: f(A)
Out[333]:
array([[False, False, True, False, False, False, False, False, False,
        False, False, False, False, False],
       [False, False, True, False, False, False, False, False, False,
        True, False, False, False, False],
       [False, False, True, False, False, False, False, False, False,
        False, False, False, False, False]], dtype=object)

In [334]: np.where(f(A))
Out[334]: (array([0, 1, 1, 2], dtype=int32), array([2, 2, 9, 2], dtype=int32))
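
The array A above comes from the original question and is never defined in the transcript; a self-contained version of the np.char.find approach with a small made-up A:

import numpy as np

A = np.array([['contig_1758_2278_4341_-', 'other'],
              ['other', 'contig_1297_3232_198298_+']])

mask = np.char.find(A, 'contig') >= 0   # True wherever the substring occurs
rows, cols = np.where(mask)
print(A[mask])                          # the matching elements themselves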

