How to Normalize a Numpy Array to a Unit Vector

How to normalize a NumPy array to a unit vector?

If you're using scikit-learn you can use sklearn.preprocessing.normalize:

import numpy as np
from sklearn.preprocessing import normalize

x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
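The same idea extends to 2-D arrays; a minimal sketch, assuming you want each row scaled to unit L2 norm (the array X is just an example):

import numpy as np
from sklearn.preprocessing import normalize

X = np.random.rand(5, 3) * 10

# Pure-NumPy version: divide each row by its L2 norm (keepdims keeps the shape broadcastable)
norm_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

# scikit-learn version: axis=1 normalizes rows
norm_rows_sk = normalize(X, axis=1)

print(np.allclose(norm_rows, norm_rows_sk))
# True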

Normalization VS. numpy way to normalize?

There are different types of normalization. You are using min-max normalization. The min-max normalization from scikit-learn is as follows.

import numpy as np
from sklearn.preprocessing import minmax_scale

# your function
def normalize_list(list_normal):
    max_value = max(list_normal)
    min_value = min(list_normal)
    for i in range(len(list_normal)):
        list_normal[i] = (list_normal[i] - min_value) / (max_value - min_value)
    return list_normal

# scikit-learn version
def normalize_list_numpy(list_numpy):
    normalized_list = minmax_scale(list_numpy)
    return normalized_list

test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test_array_numpy = np.array(test_array)

print(normalize_list(test_array))
print(normalize_list_numpy(test_array_numpy))

Output:

[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]    
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]

minmax_scale uses exactly your formula for normalization/scaling:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
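A minimal sketch to check the equivalence yourself (the array is just an example):

import numpy as np
from sklearn.preprocessing import minmax_scale

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# The min-max formula applied directly
manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(manual, minmax_scale(x)))
# True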

@OuuGiii: Note: it is not a good idea to use Python built-in function names as variable names. list() is a Python built-in function, so its use as a variable name should be avoided.

NumPy: how to quickly normalize many vectors?

Well, unless I missed something, this does work:

vectors / norms

The problem in your suggestion is the broadcasting rules.

vectors  # shape (2, 10)
norms    # shape (10,)

The shapes do not have the same length! So the rule is to first extend the smaller shape with a 1 on the left:

norms    # shape (1, 10)

You can do that manually by calling:

vectors / norms.reshape(1,-1)  # same as vectors/norms

If you wanted to compute vectors.T/norms, you would have to do the reshaping manually, as follows:

vectors.T / norms.reshape(-1,1)  # this works
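A minimal, self-contained sketch of both cases (the shapes follow the example above):

import numpy as np

vectors = np.random.rand(2, 10)               # 10 vectors of dimension 2, stored as columns
norms = np.linalg.norm(vectors, axis=0)       # shape (10,)

unit_cols = vectors / norms                   # broadcasting: (2, 10) / (1, 10)
unit_rows = vectors.T / norms.reshape(-1, 1)  # explicit reshape needed: (10, 2) / (10, 1)

print(np.allclose(np.linalg.norm(unit_cols, axis=0), 1.0))
# True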

Vectorize calculating all unit vectors for a set of points in numpy

The norm is calculated as the square root of the sum of squares. You can implement your own norm calculation as follows, and then vectorize your solution with broadcasting:

diff = (in_a[:, None] - in_b).reshape(-1, 3)
norm = ((in_a[:, None] ** 2 + in_b ** 2).sum(2) ** 0.5).reshape(-1, 1)

diff / norm

gives:

[[-0.36851098  0.01759667  0.01684932]
[-0.3777128 0.02035861 0.00997706]
[-0.47964868 0.03250422 -0.02628129]
[-0.4851439 0.03115273 -0.03235091]
[-0.35515452 0.01449887 0.01145004]
[-0.36444229 0.01727756 0.00463872]
[-0.46762047 0.02971985 -0.03098581]
[-0.4732132 0.02839518 -0.03700341]
[-0.17814297 0.00926242 -0.00805704]
[-0.18821243 0.01190899 -0.01430339]
[-0.30440561 0.02430441 -0.04632135]
[-0.31113513 0.0230996 -0.05193153]
[-0.16408845 0.0103343 -0.01190965]
[-0.1741932 0.01295652 -0.01808905]
[-0.29113461 0.02519489 -0.04959355]
[-0.29793917 0.02399093 -0.05515092]]

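If what you need are unit vectors for each pairwise difference, a minimal alternative sketch (an assumption about the intent, not part of the original answer) is to normalize diff by its own norm:

import numpy as np

in_a = np.random.rand(4, 3)   # example point sets; shapes assumed (n, 3) and (m, 3)
in_b = np.random.rand(4, 3)

diff = (in_a[:, None] - in_b).reshape(-1, 3)
unit = diff / np.linalg.norm(diff, axis=1, keepdims=True)

print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))
# True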

How to normalize a NumPy array to within a certain range?

# Normalize audio channels to between -1.0 and +1.0
audio /= np.max(np.abs(audio),axis=0)
# Normalize image to between 0 and 255
image *= (255.0/image.max())

Using /= and *= allows you to eliminate an intermediate temporary array, thus saving some memory. Multiplication is less expensive than division, so

image *= 255.0/image.max()    # Uses 1 division and image.size multiplications

is marginally faster than

image /= image.max()/255.0    # Uses 1+image.size divisions

Since we are using basic numpy methods here, I think this is about as efficient a solution in numpy as can be.


In-place operations do not change the dtype of the container array. Since the desired normalized values are floats, the audio and image arrays need to have a floating-point dtype before the in-place operations are performed.
If they are not already of floating-point dtype, you'll need to convert them using astype. For example,

image = image.astype('float64')
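A minimal end-to-end sketch, assuming image starts out as an integer array:

import numpy as np

image = np.random.randint(0, 1000, size=(4, 4))  # example integer data

image = image.astype('float64')   # convert first so the in-place ops keep float results
image *= 255.0 / image.max()      # now scaled to the range [0, 255]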

How to normalize a 2-dimensional numpy array in python less verbose?

Broadcasting is really good for this:

row_sums = a.sum(axis=1)
new_matrix = a / row_sums[:, numpy.newaxis]

row_sums[:, numpy.newaxis] reshapes row_sums from being (3,) to being (3, 1). When you do a / b, a and b are broadcast against each other.

You can learn more about broadcasting in the NumPy documentation on broadcasting.
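A minimal sketch of the idea with concrete values (the matrix is just an example):

import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

row_sums = a.sum(axis=1)                   # shape (3,)
new_matrix = a / row_sums[:, np.newaxis]   # shape (3, 3); each row now sums to 1

print(new_matrix.sum(axis=1))
# [1. 1. 1.]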

Normalize a Numpy array of 2D vector by a Pandas column of norms

If both arrays are in NumPy, you just need to transpose it:

(uv.T/L).T

In the case of the question, as L is a Series, then:

(uv.T/L.to_numpy()).T
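A minimal sketch, assuming uv is an (n, 2) array of vectors and L is a pandas Series holding their norms:

import numpy as np
import pandas as pd

uv = np.array([[3.0, 4.0],
               [6.0, 8.0]])
L = pd.Series(np.linalg.norm(uv, axis=1))   # norms: [5.0, 10.0]

unit = (uv.T / L.to_numpy()).T
print(unit)
# [[0.6 0.8]
#  [0.6 0.8]]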

Fast inverse square root in python to normalize a vector

Shouldn’t there be a numpy function to do that? Which uses the fast inverse square root algorithm. Or is it outside the numpy's scope and there shouldn't be a function like that?

I am not aware of any function in NumPy doing exactly that. Multiple function calls are needed in pure NumPy. sklearn.preprocessing.normalize is indeed a good alternative (and AFAIK not the only package to provide it).

The thing is, NumPy is not designed to operate on small arrays efficiently. The overhead of NumPy calls is huge for small arrays (like those with only 3 values), and combining multiple function calls only makes it worse. The overhead mainly comes from type/shape/value checking, internal function calls, the CPython interpreter, and the allocation of new arrays. Thus, even if NumPy provided exactly the function you wanted, it would still be slow for an array with only 3 items.
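A minimal sketch to measure that per-call overhead yourself (timings vary by machine):

import timeit
import numpy as np

v = np.random.rand(3)

# Each call pays NumPy's fixed per-call overhead, which dominates for 3 elements
print(timeit.timeit(lambda: v / np.linalg.norm(v), number=100_000))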

Should I implement my function own in cython/numba?

This is a good idea since Numba can do that with a much smaller overhead. Note that Numba function calls still have a small overhead, but calling them from within a Numba context is very cheap (native calls).

For example, you can use:

import numpy as np
import numba as nb

# Note:
# - The signature causes an eager compilation
# - ::1 means the axis is contiguous (generates faster code)
@nb.njit('(float64[::1],)')
def normalize(v):
    s = 0.0
    for i in range(v.size):
        s += v[i] * v[i]
    inv_norm = 1.0 / np.sqrt(s)
    for i in range(v.size):
        v[i] *= inv_norm

This function does not allocate any new array as it works in-place. Moreover, Numba only needs to perform a minimal amount of checks in the wrapping function. The loops are very fast, but they can be made even faster if you replace v.size with the actual size (e.g. 3), because the JIT can then unroll the loops and generate nearly optimal code. np.sqrt will be inlined and should compile down to a fast square-root FP instruction. If you use the flag fastmath=True, the JIT may even compute the reciprocal square root using a dedicated, faster instruction on x86-64 platforms (note that fastmath is unsafe if you use special values like NaN or care about FP associativity).

Still, the overhead of calling this function is likely 100-300 ns on mainstream machines for very small vectors: the CPython wrapping function has a significant overhead. The only way to remove it is to call it from Numba/Cython code as well. If you need to do this throughout most of your project, then writing the code directly in C/C++ is certainly better.
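A minimal usage sketch, calling the normalize function defined above (the signature forces eager compilation, so the first call pays no extra JIT cost):

v = np.array([1.0, 2.0, 3.0])
normalize(v)                  # modifies v in-place
print(np.linalg.norm(v))
# ~1.0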

Or if I'm so worried about performance, should I give up python and start making code in C/C++?

It depends on your overall project, but if you want to manipulate many small vectors like this, it is much more efficient to use C/C++ directly. The alternative is to use Numba or Cython for the kernels that are currently slow.

The performance of well-optimized Numba or Cython code can be very close to that of natively compiled C/C++ code. For example, I once succeeded in outperforming the heavily-optimized OpenBLAS code with Numba (thanks to specialization). One of the main sources of overhead in Numba is array bounds checking (which can often be optimized out, depending on the loops). C/C++ are lower-level, so you do not pay any hidden cost, but the code can be harder to maintain. Additionally, you can apply lower-level optimizations that are not even possible in Numba/Cython (e.g. directly using SIMD intrinsics or assembly instructions, or generating specialized code with templates).


