Convert array of indices to one-hot encoded array in NumPy
Create a zeroed array b with enough columns, i.e. a.max() + 1. Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
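The two steps above can be wrapped in a small helper. This is a minimal sketch: the function name `one_hot_from_indices` and the optional `num_classes` argument are assumptions for illustration, not part of the original answer. Passing the width explicitly keeps the number of columns stable when the largest class happens to be absent from `a`.

```python
import numpy as np

def one_hot_from_indices(a, num_classes=None):
    """One-hot encode a 1-D array of non-negative integer labels (hypothetical helper)."""
    a = np.asarray(a)
    if num_classes is None:
        num_classes = a.max() + 1  # infer the width from the data
    b = np.zeros((a.size, num_classes))
    b[np.arange(a.size), a] = 1  # row i gets a 1 in column a[i]
    return b

encoded = one_hot_from_indices([1, 0, 3])
padded = one_hot_from_indices([1, 0, 3], num_classes=6)
```

Here `encoded` has 4 columns (inferred from the data) and `padded` has 6, with the extra columns left at zero.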
One Hot Encoding using numpy
Usually, when you want to get a one-hot encoding for classification in machine learning, you have an array of indices.
import numpy as np
nb_classes = 6
targets = np.array([[2, 3, 4, 0]]).reshape(-1)
one_hot_targets = np.eye(nb_classes)[targets]
The one_hot_targets is now
array([[ 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0.]])
The .reshape(-1) is there to make sure you have the right label format (you might also have [[2], [3], [4], [0]]). The -1 is a special value which means "put all remaining stuff in this dimension". As there is only one dimension, it flattens the array.
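To see what the flattening changes, here is a short sketch comparing the two cases for column-shaped labels. The variable names `column_labels`, `nested`, and `flat` are chosen for illustration only.

```python
import numpy as np

nb_classes = 6
column_labels = np.array([[2], [3], [4], [0]])  # shape (4, 1)

# Indexing with the 2-D labels keeps the extra axis: shape (4, 1, 6).
nested = np.eye(nb_classes)[column_labels]

# Flattening first gives one row per label: shape (4, 6).
flat = np.eye(nb_classes)[column_labels.reshape(-1)]
```

Without the reshape, the result carries a singleton axis that most loss functions and classifiers would not expect.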
Copy-Paste solution
def get_one_hot(targets, nb_classes):
    targets = np.asarray(targets)  # accept lists as well as arrays
    res = np.eye(nb_classes)[targets.reshape(-1)]
    return res.reshape(list(targets.shape) + [nb_classes])
Package
You can use mpu.ml.indices2one_hot. It's tested and simple to use:
import mpu.ml
one_hot = mpu.ml.indices2one_hot([1, 3, 0], nb_classes=5)
One hot encoding from numpy
The line Y_one_hot[Y.flatten(), np.arange(m)] = 1 is setting values of the array with lists of integer indices (documented under "Integer Array Indexing" in the NumPy docs).
The arrays of indices are broadcast together, and the result for 1D arrays is essentially an efficient way to do this:
for i, j in zip(Y.flatten(), np.arange(m)):
    Y_one_hot[i, j] = 1
In words, each column of Y_one_hot corresponds to an entry of Y.flatten(), and has a single nonzero value in the row given by the entry.
It may be easier to see with a smaller array:
Y_onehot = np.zeros((2, 3), dtype=int)
Y = np.array([0, 1, 0])
Y_onehot[Y.flatten(), np.arange(3)] = 1
print(Y_onehot)
# [[1 0 1]
# [0 1 0]]
Three entries map to three columns, and each column has a single nonzero entry in the row corresponding to the value.
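The equivalence between the indexed assignment and the explicit loop can be checked directly. This is a small sketch; the row-vector layout of `Y` and the value of `num_classes` are assumptions made for the example.

```python
import numpy as np

Y = np.array([[0, 1, 0, 2]])  # labels stored as a row vector (assumed layout)
m = Y.size                    # number of examples
num_classes = 3               # assumed number of classes

# Vectorized assignment with integer array indexing.
vectorized = np.zeros((num_classes, m), dtype=int)
vectorized[Y.flatten(), np.arange(m)] = 1

# Equivalent explicit loop over (row, column) pairs.
looped = np.zeros((num_classes, m), dtype=int)
for i, j in zip(Y.flatten(), np.arange(m)):
    looped[i, j] = 1

same = np.array_equal(vectorized, looped)
print(same)
```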
One-hot encode a column of integers into a NumPy matrix, including missing indices
Advanced indexing is your answer! Assuming you know your desired final shape (here, (5, 7)):
In [3]: desired_shape = (5, 7)
In [4]: z = np.zeros(desired_shape, dtype="uint8")
In [5]: z
Out[5]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
In [6]: idxs = [5, 2, 4, 6, 3]
In [7]: z[range(len(z)), idxs] = 1
In [8]: z
Out[8]:
array([[0, 0, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0]], dtype=uint8)
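Because the width is fixed up front, it can help to validate the indices before assigning: an index at or beyond the width raises an IndexError, while a negative index silently wraps around to the end of the row. The following is a minimal sketch; the helper name `one_hot_fixed_width` and its parameters are hypothetical.

```python
import numpy as np

def one_hot_fixed_width(idxs, num_columns, dtype="uint8"):
    """One-hot encode idxs into a matrix with a known number of columns (hypothetical helper)."""
    idxs = np.asarray(idxs)
    # Reject values that would either fail or wrap around silently.
    if idxs.min() < 0 or idxs.max() >= num_columns:
        raise ValueError("index out of range for the requested width")
    z = np.zeros((idxs.size, num_columns), dtype=dtype)
    z[np.arange(idxs.size), idxs] = 1
    return z

z = one_hot_fixed_width([5, 2, 4, 6, 3], num_columns=7)
```

Columns 0 and 1 never occur in the input, so they stay all zero but are still present in the output.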
One-Hot Encode numpy array with 2 dims
You can do this in a single indexing operation if you know the max. Given an array a and m = a.max() + 1:
out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
out[(*np.indices(a.shape[:-1], sparse=True), a[..., 0])] = True
It's easier if you remove the unnecessary trailing dimension:
a = np.squeeze(a)
out = np.zeros(a.shape + (m,), bool)
out[(*np.indices(a.shape, sparse=True), a)] = True
The explicit tuple in the index is necessary to do star expansion.
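As a sanity check, the single indexing operation can be compared against indexing an identity matrix. This is a sketch with made-up input data; the `reference` array is only there for comparison and is not part of the answer.

```python
import numpy as np

# Example input with a trailing singleton dimension: shape (3, 2, 1).
a = np.array([[[2], [3]], [[3], [1]], [[4], [0]]])
m = a.max() + 1

out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
# Sparse index grids for the leading axes, plus the labels for the last axis.
out[(*np.indices(a.shape[:-1], sparse=True), a[..., 0])] = True

# Reference result built by indexing an identity matrix.
reference = np.eye(m, dtype=bool)[a[..., 0]]
```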
If you want to extend this to an arbitrary dimension, you can do that too. The following will insert a new dimension into the squeezed array at axis. Here axis is the position in the final array of the new axis, which is consistent with, say, np.stack, but not consistent with list.insert:
def onehot(a, axis=-1, dtype=bool):
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, a.max() + 1)
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out
If you have a singleton dimension to expand, the generalized solution can find the first available singleton dimension:
def onehot2(a, axis=None, dtype=bool):
    shape = np.array(a.shape)
    if axis is None:
        axis = (shape == 1).argmax()
    if shape[axis] != 1:
        raise ValueError(f'Dimension at {axis} is non-singleton')
    shape[axis] = a.max() + 1
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind[axis] = a
    out[tuple(ind)] = True
    return out
To use the last available singleton, replace axis = (shape == 1).argmax() with
axis = a.ndim - 1 - (shape[::-1] == 1).argmax()
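The two axis-selection expressions can be tried in isolation. This is a small sketch on a hand-picked shape; the variable names are illustrative.

```python
import numpy as np

shape = np.array((3, 1, 2, 1))
ndim = shape.size

# argmax on a boolean array returns the position of the first True.
first_singleton = int((shape == 1).argmax())

# Reversing the shape turns "first" into "last"; map the position back.
last_singleton = int(ndim - 1 - (shape[::-1] == 1).argmax())

# With no singleton at all, argmax still returns 0.
no_singleton = int((np.array((3, 2)) == 1).argmax())
```

The last case is why the function still needs the explicit shape[axis] != 1 check after picking an axis: argmax alone cannot signal that nothing was found.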
Here are some example usages:
>>> np.random.seed(0x111)
>>> x = np.random.randint(5, size=(3, 2))
>>> x
array([[2, 3],
[3, 1],
[4, 0]])
>>> a = onehot(x, axis=-1, dtype=int)
>>> a.shape
(3, 2, 5)
>>> a
array([[[0, 0, 1, 0, 0], # 2
[0, 0, 0, 1, 0]], # 3
[[0, 0, 0, 1, 0], # 3
[0, 1, 0, 0, 0]], # 1
[[0, 0, 0, 0, 1], # 4
[1, 0, 0, 0, 0]]]) # 0
>>> b = onehot(x, axis=-2, dtype=int)
>>> b.shape
(3, 5, 2)
>>> b
array([[[0, 0],
[0, 0],
[1, 0],
[0, 1],
[0, 0]],
[[0, 0],
[0, 1],
[0, 0],
[1, 0],
[0, 0]],
[[0, 1],
[0, 0],
[0, 0],
[0, 0],
[1, 0]]])
onehot2 requires you to mark the dimension you want to add as a singleton:
>>> np.random.seed(0x111)
>>> y = np.random.randint(5, size=(3, 1, 2, 1))
>>> y
array([[[[2],
[3]]],
[[[3],
[1]]],
[[[4],
[0]]]])
>>> c = onehot2(y, axis=-1, dtype=int)
>>> c.shape
(3, 1, 2, 5)
>>> c
array([[[[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]]],
[[[0, 0, 0, 1, 0],
[0, 1, 0, 0, 0]]],
[[[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0]]]])
>>> d = onehot2(y, axis=-2, dtype=int)
ValueError: Dimension at -2 is non-singleton
>>> e = onehot2(y, dtype=int)
>>> e.shape
(3, 5, 2, 1)
>>> e.squeeze()
array([[[0, 0],
[0, 0],
[1, 0],
[0, 1],
[0, 0]],
[[0, 0],
[0, 1],
[0, 0],
[1, 0],
[0, 0]],
[[0, 1],
[0, 0],
[0, 0],
[0, 0],
[1, 0]]])
How to convert 2D numpy array to One Hot Encoding?
You need to one-hot encode each column separately so you will get 4 new columns for each column in your ndarray:
X = np.array(X)
# Get unique classes.
classes = np.unique(X)
# Replace classes with integers.
X = np.searchsorted(classes, X)
# Get an identity matrix.
eye = np.eye(classes.shape[0])
# Iterate over all columns
# and get one-hot encoding for each column.
X = np.concatenate([eye[i] for i in X.T], axis=1)
X.shape
# (5, 40)
Consider the following example:
[['A', 'G'],
['C', 'C'],
['T', 'A']]
You will get 8 (2 x 4) columns in your one-hot encoded ndarray:
Column 0   Column 1
A C G T    A C G T
1 0 0 0    0 0 1 0
0 1 0 0    0 1 0 0
0 0 0 1    1 0 0 0
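The small example can be reproduced end to end with the same steps. This is a sketch; the intermediate name `codes` is introduced here for readability and is not in the original answer.

```python
import numpy as np

X = np.array([["A", "G"],
              ["C", "C"],
              ["T", "A"]])

classes = np.unique(X)                     # sorted unique labels
codes = np.searchsorted(classes, X)        # integer code for every cell
eye = np.eye(classes.shape[0], dtype=int)

# One block of len(classes) columns per original column, side by side.
encoded = np.concatenate([eye[col] for col in codes.T], axis=1)
print(encoded)
```

Note that the classes are derived from the data, so a letter that never occurs in X would not get a column.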
How do I one-hot encode an array of strings with Numpy?
Got it. This will work with arrays of any number of unique values.
import numpy as np
target = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog',
'cat', 'cat', 'hamster', 'hamster'])
def one_hot(array):
    unique, inverse = np.unique(array, return_inverse=True)
    onehot = np.eye(unique.shape[0])[inverse]
    return onehot
print(one_hot(target))
[[0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]
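Since np.unique returns the labels in sorted order, the columns correspond to 'cat', 'dog', 'hamster'. A variant that also returns the labels makes it possible to name the columns or map an encoding back to strings. This is a sketch for 1-D input; the function name `one_hot_with_labels` is hypothetical.

```python
import numpy as np

def one_hot_with_labels(array):
    """Return the one-hot matrix and the label for each column (hypothetical helper)."""
    unique, inverse = np.unique(array, return_inverse=True)
    return np.eye(unique.shape[0])[inverse], unique

target = np.array(["dog", "cat", "hamster", "cat"])
encoded, labels = one_hot_with_labels(target)

# The position of the 1 in each row indexes back into the sorted labels.
decoded = labels[encoded.argmax(axis=1)]
```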
How to convert a numpy array to one hot encoding?
If you don't mind going to pandas for data handling, you could use pd.Categorical along with pd.get_dummies to achieve the result. Here's a code snippet that should work for you:
import numpy as np
import pandas as pd
sex_list = [
    "male",
    "female"
]
type_list = [
    "histo",
    "follow_up",
    "consensus",
    "confocal"
]
localization_list = [
    "back",
    "lower extremity",
    "trunk",
    "upper extremity",
    "abdomen"
]
values = np.array([
    ["male", "follow_up", "trunk"]
])
values = pd.DataFrame(values, columns=["sex", "type", "localization"]).assign(
    sex=lambda row: pd.Categorical(row.sex, sex_list),
    type=lambda row: pd.Categorical(row.type, type_list),
    localization=lambda row: pd.Categorical(row.localization, localization_list)
)
encoded_array = pd.get_dummies(values).values
If you want to be particular about the number that is used to represent the different values, you can simply replace the different lists with dicts (sex_list -> sex_dict and so on) in the pd.Categorical calls.
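If pandas is not an option, the same fixed-category idea can be sketched in plain NumPy. This is not the pandas approach from the answer but a swapped-in alternative; the helper `encode_column` is hypothetical, and it assumes every value appears in its category list (list.index raises ValueError otherwise).

```python
import numpy as np

sex_list = ["male", "female"]
type_list = ["histo", "follow_up", "consensus", "confocal"]
localization_list = ["back", "lower extremity", "trunk", "upper extremity", "abdomen"]

def encode_column(column, categories):
    """One-hot encode one column against a fixed, ordered category list."""
    codes = np.array([categories.index(value) for value in column])
    return np.eye(len(categories), dtype=int)[codes]

values = np.array([["male", "follow_up", "trunk"]])

# Encode each column with its own category list and place the blocks side by side.
encoded_array = np.concatenate(
    [encode_column(values[:, k], cats)
     for k, cats in enumerate([sex_list, type_list, localization_list])],
    axis=1,
)
```

Because the category lists are fixed, every category gets a column even when it does not occur in the data.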
Convert a 2D numpy array into a hot-encoded 3D numpy array, with same values in the same plane
Create an array of the possible values (np.arange(arr.max()+1) here) and then reshape it so that it broadcasts against the original array in a comparison:
Setup:
arr = np.array([
[0, 1, 0],
[0, 1, 4],
[2, 0, 0],
])
u = np.arange(arr.max()+1)
(u[:,np.newaxis,np.newaxis]==arr).astype(int)
array([[[1, 0, 1],
[1, 0, 0],
[0, 1, 1]],
[[0, 1, 0],
[0, 1, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[1, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 1],
[0, 0, 0]]])
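The result puts the class axis first, one plane per value. Moving that axis to the end should give the more common trailing-axis one-hot layout, which can be checked against identity-matrix indexing. This is a sketch; the names `planes` and `trailing` are illustrative.

```python
import numpy as np

arr = np.array([
    [0, 1, 0],
    [0, 1, 4],
    [2, 0, 0],
])
m = arr.max() + 1

# Class axis first: planes[k] marks where arr == k. Shape (5, 3, 3).
planes = (np.arange(m)[:, np.newaxis, np.newaxis] == arr).astype(int)

# Class axis last, built by indexing an identity matrix. Shape (3, 3, 5).
trailing = np.eye(m, dtype=int)[arr]

same = np.array_equal(np.moveaxis(planes, 0, -1), trailing)
```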