Matrix Multiplication in Pure Python

Python 3: Multiply a vector by a matrix without NumPy

The Numpythonic approach (using numpy.dot to get the dot product of the vector and the matrix):

In [1]: import numpy as np

In [3]: np.dot([1,0,0,1,0,0], [[0,1],[1,1],[1,0],[1,0],[1,1],[0,1]])
Out[3]: array([1, 1])

The Pythonic approach:

The length of your second for loop is len(v), and you attempt to index v based on that, so you get an IndexError. As a more Pythonic way, you can use the zip function to get the columns of the list, then use starmap and mul within a list comprehension:

In [13]: first,second=[1,0,0,1,0,0], [[0,1],[1,1],[1,0],[1,0],[1,1],[0,1]]

In [14]: from itertools import starmap

In [15]: from operator import mul

In [16]: [sum(starmap(mul, zip(first, col))) for col in zip(*second)]
Out[16]: [1, 1]
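
If imports are also off the table, the same computation can be written as a plain nested comprehension; a minimal sketch, continuing the session above:

In [17]: [sum(f * c for f, c in zip(first, col)) for col in zip(*second)]
Out[17]: [1, 1]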

Matrix multiplication with a scalar without numpy

Since the OP says that he is not allowed to use imports, the following alternative has its own implementation of deepcopy:

class Matrix:
    def __init__(self, rows):
        self.rows = rows[:]

    def deepcopy(self):
        # Copy every element so later mutation cannot touch the original rows.
        rows = [[elem for elem in row]
                for row in self.rows]
        return Matrix(rows)

    def scale(self, w):
        copy = self.deepcopy()
        copyr = copy.rows
        for i in range(len(copyr)):
            for j in range(len(copyr[i])):  # row length, not row count, for non-square matrices
                copyr[i][j] = copyr[i][j] * w
        return copyr

c = Matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(c.scale(10))
print(c.scale(10))
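
If you don't need the class, the same scaling can be done with a nested list comprehension; a minimal sketch (the scale function name here is illustrative):

def scale(rows, w):
    # Build a new scaled matrix; the input rows are left untouched.
    return [[elem * w for elem in row] for row in rows]

print(scale([[1, 0, 0], [0, 1, 0], [0, 0, 1]], 10))
# [[10, 0, 0], [0, 10, 0], [0, 0, 10]]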

3D matrix multiplication in numpy

For the 3-dimensional array (a rank-3 tensor) that you have, you can use np.einsum (see its documentation) for more complex matrix multiplications. In your particular case, you can use the following:

>>> import numpy as np
>>> x = np.random.randint(0, 3, (3, 3, 3)) # shape (3, 3, 3)
>>> y = np.random.randint(0, 3, (3, 3, 3)) # shape (3, 3, 3)
>>> np.einsum('ijk,ikl->ijl', x, y) # still shape (3, 3, 3)

In particular, the einsum expression 'ijk,ikl->ijl' means: for each i-th matrix, do a regular matrix multiplication jk,kl->jl and put the result in the i-th entry of the resulting tensor (ndarray). A more general form of this process could be

np.einsum('...jk,...kl->...jl', x, y)

where you can have an arbitrary number of leading dimensions in front of each tensor (ndarray).

See the following for a complete example:

>>> import numpy as np
>>> x = np.random.randint(0, 3, (3, 3, 3)) # shape (3, 3, 3)
>>> x
array([[[0, 0, 1],
        [2, 2, 1],
        [2, 1, 1]],

       [[2, 0, 2],
        [2, 2, 1],
        [2, 2, 2]],

       [[2, 2, 2],
        [1, 1, 2],
        [0, 2, 2]]])
>>> y = np.random.randint(0, 3, (3, 3, 3))  # shape (3, 3, 3)
>>> y
array([[[0, 0, 1],
        [2, 1, 0],
        [0, 0, 2]],

       [[1, 2, 0],
        [2, 0, 1],
        [2, 2, 1]],

       [[0, 2, 1],
        [0, 1, 0],
        [0, 2, 1]]])
>>> np.einsum('ijk,ikl->ijl', x, y)
array([[[ 0,  0,  2],
        [ 4,  2,  4],
        [ 2,  1,  4]],

       [[ 6,  8,  2],
        [ 8,  6,  3],
        [10,  8,  4]],

       [[ 0, 10,  4],
        [ 0,  7,  3],
        [ 0,  6,  2]]])
>>> np.einsum('...ij,...jk->...ik', x, y)
array([[[ 0,  0,  2],
        [ 4,  2,  4],
        [ 2,  1,  4]],

       [[ 6,  8,  2],
        [ 8,  6,  3],
        [10,  8,  4]],

       [[ 0, 10,  4],
        [ 0,  7,  3],
        [ 0,  6,  2]]])
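
For this stacked case, np.matmul (the @ operator) computes the same thing, since it broadcasts over the leading axes; a quick sketch to verify the equivalence:

import numpy as np

x = np.random.randint(0, 3, (3, 3, 3))
y = np.random.randint(0, 3, (3, 3, 3))

# np.matmul treats >2-D inputs as stacks of matrices and broadcasts over
# the leading axes, so it matches the einsum expressions above.
assert (np.einsum('ijk,ikl->ijl', x, y) == x @ y).all()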

Why is matrix multiplication faster with numpy than with ctypes in Python?

I'm not too familiar with Numpy, but the source is on GitHub. Part of the dot product is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I assume is translated into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of GitHub browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your innermost loop.

One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
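
You can inspect that stride machinery from Python; a small sketch (shapes chosen arbitrarily):

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
# strides = bytes to step along each axis: one row down is 4 * 8 bytes,
# one element across is 8 bytes.
print(a.strides)    # (32, 8)
# Transposing swaps the strides instead of copying data, which is why
# traversing one operand "by columns" takes large, cache-unfriendly steps.
print(a.T.strides)  # (8, 32)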

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and try to iterate over the contiguous parts first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major orders). But it can at least do that for the result elements.

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used, translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage of it is that more arithmetic operations are done between each branch point, so the compiler and CPU have more choice about how to schedule them to get as much instruction pipelining as possible.
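
Incidentally, you can check from Python which BLAS/LAPACK libraries your NumPy build was linked against (a quick sanity check, not something the benchmark requires):

import numpy as np

# Prints the build-time configuration, including any BLAS/LAPACK libraries
# this NumPy installation was linked against.
np.show_config()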

Matrix multiplication dimensions

You can use broadcasting:

ret = (M @ v[...,None]).reshape(v.shape)

### check
loop = np.array([M[i] @ v[i] for i in range(n)])
(ret==loop).all()
# True
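
A self-contained version of the same trick; the shapes below are assumptions, chosen so there is one matrix per vector:

import numpy as np

n, d = 4, 3                  # assumed: n stacked d x d matrices and n vectors
M = np.random.rand(n, d, d)
v = np.random.rand(n, d)

# v[..., None] turns each vector into a d x 1 column, so the stacked matmul
# broadcasts over the first axis; reshape drops the trailing length-1 axis.
ret = (M @ v[..., None]).reshape(v.shape)

loop = np.array([M[i] @ v[i] for i in range(n)])
assert np.allclose(ret, loop)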

Most efficient way to multiply a small matrix with a scalar in numpy

large_dimension = 1000

a = np.random.random((1,))
B = np.random.random((1, large_dimension))

%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Use just a plain Python float:

%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To avoid memory allocation, use a preallocated "buffer":

buffer = np.empty_like(B)

%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To avoid the unnecessary attribute lookup, use an "alias":

mul = np.multiply

%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And I don't recommend using numpy scalars at all; if you avoid them, the computation will be faster:

a_float = float(a[0])

%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Furthermore, if possible, initialize the buffer outside the loop, once (assuming, of course, that you have something like a loop):

rng = range(1000)

%%timeit
for i in rng:
    pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
for i in rng:
    mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So,

"best_iteration_time" = (1.91 ms - 0.02 ms) / 1000 => 1.89 µs

"speedup" = 5.43 µs / 1.89 µs => 2.87

Multiple matrix multiplication

Use np.einsum -

np.einsum('ijk,ik->ij',matrices,vectors)

Steps:

1) Keep the first axes aligned.

2) Sum-reduce the last axes of the input arrays against each other.

3) Let the remaining axes (the second axis of matrices) be element-wise multiplied.
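
A runnable sketch of this, with assumed shapes (a stack of n 3x3 matrices and n length-3 vectors):

import numpy as np

n = 4
matrices = np.random.rand(n, 3, 3)   # assumed shape (n, 3, 3)
vectors = np.random.rand(n, 3)       # assumed shape (n, 3)

out = np.einsum('ijk,ik->ij', matrices, vectors)

# Same result as multiplying each matrix by its own vector in a loop.
loop = np.array([matrices[i] @ vectors[i] for i in range(n)])
assert np.allclose(out, loop)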


