np.reshape(x, (-1,1)) vs x[:, np.newaxis]
Both ways return views of the same data, so 'data contiguity' is likely a non-issue: the data is not changed, only the view of it. See Numpy: use reshape or newaxis to add dimensions.
However, there may be a practical advantage to using .reshape((-1, 1)): it reshapes the array into a 2D column regardless of the original shape. With [:, np.newaxis], the result depends on the original shape of the array. Consider these:
In [3]: a1 = np.array([0, 1, 2])
In [4]: a2 = np.array([[0, 1, 2]])
In [5]: a1.reshape((-1, 1))
Out[5]:
array([[0],
       [1],
       [2]])
In [6]: a2.reshape((-1, 1))
Out[6]:
array([[0],
       [1],
       [2]])
In [7]: a1[:, np.newaxis]
Out[7]:
array([[0],
       [1],
       [2]])
In [8]: a2[:, np.newaxis]
Out[8]: array([[[0, 1, 2]]])
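To confirm the first claim above, both spellings can be checked to be views onto the same buffer; a minimal sketch (variable names are just for illustration):

```python
import numpy as np

a = np.array([0, 1, 2])

col_reshape = a.reshape((-1, 1))
col_newaxis = a[:, np.newaxis]

# Both are views onto a's buffer, not copies
assert np.shares_memory(a, col_reshape)
assert np.shares_memory(a, col_newaxis)

# Writing through one view is visible in the original
col_reshape[0, 0] = 99
print(a)  # [99  1  2]
```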
Numpy: use reshape or newaxis to add dimensions
I don't see evidence of much difference. You could do a time test on very large arrays. Basically both fiddle with the shape, and possibly the strides. __array_interface__ is a nice way of accessing this information. For example:
In [94]: b.__array_interface__
Out[94]:
{'data': (162400368, False),
'descr': [('', '<f8')],
'shape': (5,),
'strides': None,
'typestr': '<f8',
'version': 3}
In [95]: b[None,:].__array_interface__
Out[95]:
{'data': (162400368, False),
'descr': [('', '<f8')],
'shape': (1, 5),
'strides': (0, 8),
'typestr': '<f8',
'version': 3}
In [96]: b.reshape(1,5).__array_interface__
Out[96]:
{'data': (162400368, False),
'descr': [('', '<f8')],
'shape': (1, 5),
'strides': None,
'typestr': '<f8',
'version': 3}
Both create a view, using the same data buffer as the original. Same shape, but reshape doesn't change the strides. reshape lets you specify the order. And .flags shows differences in the C_CONTIGUOUS flag.
reshape may be faster because it is making fewer changes. But either way the operation shouldn't affect the time of larger calculations much. E.g. for large b:
In [123]: timeit np.outer(b.reshape(1,-1),b)
1 loops, best of 3: 288 ms per loop
In [124]: timeit np.outer(b[None,:],b)
1 loops, best of 3: 287 ms per loop
An interesting observation: b.reshape(1,4).strides -> (32, 8)
Here's my guess: __array_interface__ is displaying an underlying attribute, and .strides is more like a property (though it may all be buried in C code). The default underlying value is None, and when needed for calculation (or display with .strides) it is calculated from the shape and item size. 32 is the distance in bytes to the end of the 1st row (4 x 8). np.ones((2,4)).strides has the same (32, 8) (and None in __array_interface__).
b[None,:], on the other hand, is preparing the array for broadcasting. When broadcast, existing values are used repeatedly. That's what the 0 in (0, 8) does.
In [147]: b1=np.broadcast_arrays(b,np.zeros((2,1)))[0]
In [148]: b1.shape
Out[148]: (2, 5)
In [149]: b1.strides
Out[149]: (0, 8)
In [150]: b1.__array_interface__
Out[150]:
{'data': (3023336880L, False),
'descr': [('', '<f8')],
'shape': (2, 5),
'strides': (0, 8),
'typestr': '<f8',
'version': 3}
b1 displays the same as np.ones((2, 5)), but has only 5 items.
np.broadcast_arrays is a function in numpy/lib/stride_tricks.py. It uses as_strided from the same file. These functions directly play with the shape and strides attributes.
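The same stride-0 trick can be reproduced directly with as_strided; a sketch (setting the leading stride to 0 makes every "row" reuse the same five values without copying):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

b = np.arange(5.0)                      # shape (5,), strides (8,)
b2 = as_strided(b, shape=(2, 5), strides=(0, b.itemsize))

print(b2.strides)                       # (0, 8)
assert np.shares_memory(b, b2)
assert (b2[0] == b2[1]).all()           # both rows are the same memory
```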
Add multiple np.newaxis as needed?
There's a builtin for that:
np.less_equal.outer(A, B)
Another way would be with reshaping to accommodate the new axes:
A.reshape(list(A.shape) + [1]*B.ndim) <= B
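The two spellings should agree; a small check under assumed toy shapes (A 1D, B 2D here, chosen only for illustration):

```python
import numpy as np

A = np.arange(3)                 # shape (3,)
B = np.arange(6).reshape(2, 3)   # shape (2, 3)

# ufunc.outer compares every element of A with every element of B,
# giving a result of shape A.shape + B.shape
out_builtin = np.less_equal.outer(A, B)

# Same thing via reshaping: append one length-1 axis per dimension of B,
# then let broadcasting expand the comparison
out_reshape = A.reshape(list(A.shape) + [1] * B.ndim) <= B

assert out_builtin.shape == (3, 2, 3)
assert (out_builtin == out_reshape).all()
```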
Using np.newaxis to compute sum of squared differences
Why is a third axis created? What is the best way to visualize what is going on?
Adding new dimensions before adding/subtracting is a relatively common trick to generate all pairs, by using broadcasting (None is the same as np.newaxis here):
>>> a = np.arange(10)
>>> a[:,None]
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> a[:,None] + 100*a[None,:]
array([[  0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
       [  1, 101, 201, 301, 401, 501, 601, 701, 801, 901],
       [  2, 102, 202, 302, 402, 502, 602, 702, 802, 902],
       [  3, 103, 203, 303, 403, 503, 603, 703, 803, 903],
       [  4, 104, 204, 304, 404, 504, 604, 704, 804, 904],
       [  5, 105, 205, 305, 405, 505, 605, 705, 805, 905],
       [  6, 106, 206, 306, 406, 506, 606, 706, 806, 906],
       [  7, 107, 207, 307, 407, 507, 607, 707, 807, 907],
       [  8, 108, 208, 308, 408, 508, 608, 708, 808, 908],
       [  9, 109, 209, 309, 409, 509, 609, 709, 809, 909]])
Your example does the same, just with 2-vectors instead of scalars at the innermost level:
>>> X[:,np.newaxis,:].shape
(10, 1, 2)
>>> X[np.newaxis,:,:].shape
(1, 10, 2)
>>> (X[:,np.newaxis,:] - X[np.newaxis,:,:]).shape
(10, 10, 2)
Thus we find that the 'magical subtraction' is just all combinations of the coordinates X subtracted from each other.
Is there a more intuitive way to perform this calculation?
Yes, use scipy.spatial.distance.pdist for pairwise distances. To get an equivalent result to your example:
from scipy.spatial.distance import pdist, squareform
dist_sq = squareform(pdist(X))**2
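The two routes give the same matrix; a quick consistency check, assuming X is an (N, 2) coordinate array as in the question (the random data here is made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((10, 2))

# Broadcasting version: all pairwise differences, squared and summed per pair
dist_sq_bcast = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(axis=-1)

# pdist gives the condensed upper-triangle distances;
# squareform expands them to a full (10, 10) matrix
dist_sq_pdist = squareform(pdist(X)) ** 2

assert np.allclose(dist_sq_bcast, dist_sq_pdist)
```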
Numpy np.newaxis
df_train['SalePrice'] is a pandas Series (vector / 1D array) of shape (N,).
Modern (version 0.17+) scikit-learn methods don't like 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:, np.newaxis] transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1)).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))
In [22]: df
Out[22]:
a b c
0 4 3 8
1 7 5 6
2 1 3 9
3 7 5 7
4 7 0 6
In [23]: from sklearn.preprocessing import StandardScaler
In [24]: df['a'].shape
Out[24]: (5,) # <--- 1D array
In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1) # <--- 2D array
There is a pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1) # <--- 2D array
In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 , 0.75, -1.75, 0.75, 0.75])
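Note that recent pandas versions no longer allow multi-dimensional indexing like [:, np.newaxis] directly on a Series. A sketch of the equivalents that still work (the column name 'a' is just illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, 1, 7, 7], name='a')

# Go through NumPy first, then add the axis; this avoids multi-dimensional
# indexing on the Series itself, which newer pandas versions reject
col = s.to_numpy()[:, np.newaxis]
assert col.shape == (5, 1)

# The pandas-native equivalent: select with a list of columns to stay 2D
df = s.to_frame()
assert df[['a']].shape == (5, 1)
```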
Why is arr[:][np.newaxis].shape = (1, n) instead of (n, 1)?
In arr[:][np.newaxis] and arr[np.newaxis][:] the indexing is done sequentially, so arr2 = arr[:][np.newaxis] is equivalent to:
arr_temp = arr[:]
arr2 = arr_temp[np.newaxis]
del arr_temp
The same logic applies with the indexing operators ordered the other way round, for arr2 = arr[np.newaxis][:]:
arr_temp = arr[np.newaxis]
arr2 = arr_temp[:]
del arr_temp
Now, to quote https://numpy.org/doc/1.19/reference/arrays.indexing.html:
Each newaxis object in the selection tuple serves to expand the dimensions of the resulting selection by one unit-length dimension. The added dimension is the position of the newaxis object in the selection tuple.
Since np.newaxis is at the first position (there is only one position) in the indexing selection tuple in both arr[np.newaxis] and arr_temp[np.newaxis], it creates the new dimension as the first dimension, and thus the resulting shape is (1, 4) in both cases.
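The whole contrast can be summarized in a few asserts (a minimal sketch, using a 4-element array):

```python
import numpy as np

arr = np.arange(4)

assert arr[np.newaxis].shape == (1, 4)     # newaxis alone -> new leading axis
assert arr[:][np.newaxis].shape == (1, 4)  # sequential indexing: same result
assert arr[np.newaxis][:].shape == (1, 4)

# One selection tuple with newaxis in the second position -> trailing axis
assert arr[:, np.newaxis].shape == (4, 1)
```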