Most efficient property to hash for numpy array
You can simply hash the underlying buffer, if you make it read-only:
>>> a = np.random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
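On Python 3, hashing the buffer via a.data only works for byte-format memoryviews, and tostring() is deprecated; a portable sketch of the same idea uses tobytes():

```python
import numpy as np

a = np.random.randint(10, 100, 100000)

# tobytes() copies the buffer, but it works for any dtype and
# needs no writeable-flag tricks.
h = hash(a.tobytes())

# Equal contents give equal hashes.
b = a.copy()
assert hash(b.tobytes()) == h
```

The copy made by tobytes() costs memory for huge arrays, but it sidesteps the memoryview format restriction entirely.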
How to make a tuple including a numpy array hashable?
The fastest way to hash a numpy array is likely tostring (tobytes in modern NumPy, where tostring is deprecated).
In [11]: %timeit hash(y.tostring())
What you could do is, rather than use a tuple, define a class (adding __eq__ so that two equal keys find the same dict entry, not just the identical object):
class KeySet(object):
    def __init__(self, i, arr):
        self.i = i
        self.arr = arr
    def __eq__(self, other):
        return self.i == other.i and np.array_equal(self.arr, other.arr)
    def __hash__(self):
        return hash((self.i, self.arr.tostring()))
Now you can use it in a dict:
In [21]: ks = KeySet(0, npArray)
In [22]: myDict = {ks: 1}
In [23]: myDict[ks]
Out[23]: 1
Why is md5 hashing so much faster on strings than on numpy arrays in python?
str(random_matrix)
will not include all of the matrix due to numpy's eliding things with "...":
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
...,
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]]
So when you hash str(random_matrix), you aren't really hashing all the data. See the earlier answers above about how to hash numpy arrays properly.
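To make the md5 cover every element rather than just the printed corners, hash the raw bytes instead of the string form; a sketch:

```python
import hashlib
import numpy as np

x = np.ones((1000, 1000))
digest = hashlib.md5(x.tobytes()).hexdigest()

# Change one element in the elided middle: the summarized str()
# output is unchanged, but the byte-level digest catches it.
y = np.ones((1000, 1000))
y[500, 500] = 2.0
assert str(x) == str(y)
assert hashlib.md5(y.tobytes()).hexdigest() != digest
```

This is also why the byte-level hash is slower: it actually reads all 8 MB of data instead of a few dozen formatted characters.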
numpy - most efficient way of calculating f(v1,v2) for every pair of rows in a 2d array
A more generic approach, if you have an arbitrary function of two rows, is to use np.fromiter (which is generally faster than an explicit for loop):
import itertools
import numpy as np

n = 4
data = np.random.random((n, n))

def func(tup):
    v1, v2 = tup
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    return np.dot(v1, v2) / (n1 * n2)

out = np.fromiter(map(func, itertools.product(data, data)), float).reshape(n, n)
print(out)
[[1.         0.57588563 0.44980109 0.93490176]
 [0.57588563 1.         0.71004626 0.6908402 ]
 [0.44980109 0.71004626 1.         0.68118222]
 [0.93490176 0.6908402  0.68118222 1.        ]]
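For this particular func (cosine similarity), the pairwise loop can be replaced by pure array operations; a sketch of that fully vectorized alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
data = rng.random((n, n))

# Normalize each row, then a single matrix product yields every
# pairwise cosine similarity at once -- no Python-level loop.
norms = np.linalg.norm(data, axis=1, keepdims=True)
unit = data / norms
out = unit @ unit.T

# Sanity check against the explicit definition for one pair.
i, j = 0, 2
expected = np.dot(data[i], data[j]) / (np.linalg.norm(data[i]) * np.linalg.norm(data[j]))
assert np.isclose(out[i, j], expected)
assert np.allclose(np.diag(out), 1.0)
```

For large n this avoids both the Python-level function calls and the materialized itertools.product pairs.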
Using scalar ndarrays as keys
Python dictionaries require their keys to implement both __eq__ and __hash__ methods, and Python's data model requires that:
- The hash of an object does not change during its lifetime
- If x == y then hash(x) == hash(y)
Numpy's ndarray class overrides __eq__ to support elementwise comparison and broadcasting. This means that for numpy arrays x and y, x == y is not a boolean but another array. This in itself probably rules out ndarrays functioning correctly as dictionary keys.
Even ignoring this quirk of ndarray.__eq__, it would be tricky to come up with a (useful) implementation of ndarray.__hash__. Since the data in a numpy array is mutable, we could not use that data to calculate __hash__ without violating the requirement that the hash of an object does not change during its lifetime.
There is nothing wrong with defining __hash__ for mutable objects provided that the hash itself does not change during the object's lifetime. Similarly, dictionary keys can be mutable provided they implement __hash__ and the hash is immutable. E.g. simple user-defined classes are mutable but can still be used as dictionary keys.
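A sketch of that last point: a small wrapper can freeze a copy of the array's bytes at construction time, giving a stable hash and a consistent __eq__ (the class name HashableArray is made up for illustration):

```python
import numpy as np

class HashableArray:
    """Wraps an ndarray so it can serve as a dict key.

    The data is copied once at construction, so later mutation of
    the original array cannot change the hash (or the key).
    """
    def __init__(self, arr):
        self._arr = np.array(arr)          # private copy
        self._arr.flags.writeable = False  # freeze it
        self._hash = hash(self._arr.tobytes())

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return (isinstance(other, HashableArray)
                and np.array_equal(self._arr, other._arr))

a = np.array([1, 2, 3])
d = {HashableArray(a): "value"}
a[0] = 99  # mutating the original does not disturb the key
assert d[HashableArray(np.array([1, 2, 3]))] == "value"
```

The copy is what makes this safe: the wrapped data can never change during the key's lifetime, so both data-model requirements hold.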
Find all positions in a NumPy array that contain a substring (most efficient?)
np.char is a group of functions that apply string methods to the elements of an array like yours. So using the find function:
In [311]: np.char.find(A, 'contig')
Out[311]:
array([[-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[-1, -1, 0, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1],
[-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]])
-1 marks elements where the substring was not found; 0 or larger is the index of the first match.
In [312]: np.where(np.char.find(A, 'contig')>=0)
Out[312]: (array([0, 1, 1, 2], dtype=int32), array([2, 2, 9, 2], dtype=int32))
In [313]: A[_]
Out[313]:
array(['contig_1758_2278_4341_-', 'contig_1758_2278_4341_-',
'contig_1297_3232_198298_+', 'contig_1281_415_1704_-'],
dtype='<U149')
Functions like this have to iterate over the elements and apply the corresponding string method, so they aren't as fast as the usual numpy numeric code, but they are a lot easier than writing your own iteration.
np.vectorize or np.frompyfunc can also be used to apply a function to each element of an array. They too iterate, so they aren't a significant speedup over your own loop; still, I have found that frompyfunc often provides a ~30% speedup.
In [331]: f=np.frompyfunc(lambda x: x.find('contig'), 1,1) # like char.find
In [332]: f=np.frompyfunc(lambda x: 'contig' in x, 1,1) # your 'in'
In [333]: f(A)
Out[333]:
array([[False, False, True, False, False, False, False, False, False,
False, False, False, False, False],
[False, False, True, False, False, False, False, False, False, True,
False, False, False, False],
[False, False, True, False, False, False, False, False, False,
False, False, False, False, False]], dtype=object)
In [334]: np.where(f(A))
Out[334]: (array([0, 1, 1, 2], dtype=int32), array([2, 2, 9, 2], dtype=int32))
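A self-contained version of the np.char.find pattern above (the array A here is invented for illustration, since the original one isn't shown):

```python
import numpy as np

# A small stand-in for the string array of the question.
A = np.array([
    ["chr1", "gene_a", "contig_1758_2278_4341_-"],
    ["chr2", "contig_1297_3232_198298_+", "gene_b"],
])

# np.char.find returns -1 where the substring is absent,
# otherwise the index of the first occurrence.
pos = np.char.find(A, "contig")
rows, cols = np.where(pos >= 0)

# Every reported position really contains the substring.
assert list(zip(rows.tolist(), cols.tolist())) == [(0, 2), (1, 1)]
assert all("contig" in s for s in A[rows, cols])
```

The (rows, cols) pair from np.where can be used directly to index A and pull out the matching strings, as In [313] does with A[_].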