Convert Array of Indices to One-Hot Encoded Array in Numpy

Create a zeroed array b with enough columns, i.e. a.max() + 1.

Then, for each row i, set the a[i]th column to 1.

>>> import numpy as np
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1

>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
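
The same two steps can be wrapped in a small helper. This is only a sketch: the name one_hot_from_indices and the optional num_classes parameter are illustrative assumptions, not part of the answer above. Passing num_classes explicitly avoids relying on a.max() when the largest class happens to be absent from the data.

```python
import numpy as np

def one_hot_from_indices(a, num_classes=None):
    """One-hot encode a 1-D array of non-negative integer indices."""
    a = np.asarray(a)
    if num_classes is None:
        num_classes = a.max() + 1          # enough columns for the largest index
    b = np.zeros((a.size, num_classes))    # one row per entry of a
    b[np.arange(a.size), a] = 1            # row i gets a 1 in column a[i]
    return b

encoded = one_hot_from_indices([1, 0, 3])
# argmax along each row recovers the original indices
decoded = encoded.argmax(axis=1)
```

Taking argmax over the last axis is a quick sanity check that the encoding round-trips.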

One Hot Encoding using numpy

Usually, when you want to get a one-hot encoding for classification in machine learning, you have an array of indices.

import numpy as np
nb_classes = 6
targets = np.array([[2, 3, 4, 0]]).reshape(-1)
one_hot_targets = np.eye(nb_classes)[targets]

The one_hot_targets is now

array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]). The -1 is a special value which means "put all remaining stuff in this dimension". As there is only one, it flattens the array.
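
To make the effect of .reshape(-1) concrete, here is a short sketch comparing a row-shaped and a column-shaped label array. The variable names are illustrative only.

```python
import numpy as np

nb_classes = 6
row_labels = np.array([[2, 3, 4, 0]])        # shape (1, 4)
col_labels = np.array([[2], [3], [4], [0]])  # shape (4, 1)

# Both layouts flatten to the same 1-D array of shape (4,)
flat_row = row_labels.reshape(-1)
flat_col = col_labels.reshape(-1)

# Indexing the identity matrix with a 1-D array selects one row per label
one_hot_row = np.eye(nb_classes)[flat_row]
one_hot_col = np.eye(nb_classes)[flat_col]

# Without the reshape, the extra dimension is carried into the result
unflattened = np.eye(nb_classes)[col_labels]  # shape (4, 1, 6)
```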

Copy-Paste solution

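def get_one_hot(targets, nb_classes):
    targets = np.asarray(targets)  # accept lists as well as arrays
    res = np.eye(nb_classes)[targets.reshape(-1)]
    # restore the input shape, with a trailing axis of length nb_classes
    return res.reshape(list(targets.shape) + [nb_classes])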
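
A usage sketch for inputs with more than one dimension. The function is repeated here, with the input converted by np.asarray, so that the snippet runs on its own and also accepts plain lists; labels_2d is a made-up example.

```python
import numpy as np

def get_one_hot(targets, nb_classes):
    targets = np.asarray(targets)
    res = np.eye(nb_classes)[targets.reshape(-1)]
    return res.reshape(list(targets.shape) + [nb_classes])

labels_2d = np.array([[0, 2], [1, 1]])      # hypothetical 2 x 2 label grid
encoded = get_one_hot(labels_2d, nb_classes=3)
# encoded has shape (2, 2, 3): the class axis is appended last
```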

Package

You can use mpu.ml.indices2one_hot. It's tested and simple to use:

import mpu.ml
one_hot = mpu.ml.indices2one_hot([1, 3, 0], nb_classes=5)

One hot encoding from numpy

The line Y_one_hot[Y.flatten(), np.arange(m)] = 1 sets values of the array using arrays of integer indices (documented under "Integer Array Indexing" in the NumPy docs).

The arrays of indices are broadcast together, and the result for 1D arrays is essentially an efficient way to do this:

for i, j in zip(Y.flatten(), np.arange(m)):
    Y_one_hot[i, j] = 1

In words, each column of Y_one_hot corresponds to an entry of Y.flatten(), and has a single nonzero value in the row given by the entry.

It may be easier to see with a smaller array:

Y_onehot = np.zeros((2, 3), dtype=int)
Y = np.array([0, 1, 0])

Y_onehot[Y.flatten(), np.arange(3)] = 1

print(Y_onehot)
# [[1 0 1]
#  [0 1 0]]

Three entries map to three columns, and each column has a single nonzero entry in the row corresponding to the value.
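
The equivalence between the indexed assignment and the explicit loop can be checked directly. In this sketch n_classes is assumed to be known in advance, and the names vectorized and looped are illustrative.

```python
import numpy as np

Y = np.array([[0, 1, 0]])        # labels stored as a row vector, shape (1, 3)
m = Y.size                       # number of examples
n_classes = 2                    # assumed known for this sketch

# Integer array indexing: one (row, column) pair per example
vectorized = np.zeros((n_classes, m), dtype=int)
vectorized[Y.flatten(), np.arange(m)] = 1

# The same assignment written as an explicit loop
looped = np.zeros((n_classes, m), dtype=int)
for i, j in zip(Y.flatten(), np.arange(m)):
    looped[i, j] = 1
```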

One-hot encode a column of integers into a NumPy matrix, including missing indices

Advanced indexing is your answer! Assuming you know your desired final shape (here, (5, 7)):

In [5]: desired_shape = (5, 7)

In [6]: z = np.zeros(desired_shape, dtype="uint8")

In [7]: z
Out[7]:
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In [8]: idxs = [5, 2, 4, 6, 3]

In [9]: z[range(len(z)), idxs] = 1

In [10]: z
Out[10]:
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0]], dtype=uint8)
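
Because the width is fixed in advance, classes that never occur simply show up as all-zero columns. A short sketch of how one might list them; the names n_classes and missing_classes are illustrative.

```python
import numpy as np

n_classes = 7
idxs = np.array([5, 2, 4, 6, 3])

z = np.zeros((len(idxs), n_classes), dtype="uint8")
z[np.arange(len(idxs)), idxs] = 1

# Columns that sum to zero correspond to classes absent from idxs
missing_classes = np.flatnonzero(z.sum(axis=0) == 0)
```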

One-Hot Encode numpy array with 2 dims

You can do this in a single indexing operation if you know the max. Given an array a and m = a.max() + 1:

out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
out[(*np.indices(a.shape[:-1], sparse=True), a[..., 0])] = True

It's easier if you remove the unnecessary trailing dimension:

a = np.squeeze(a)
out = np.zeros(a.shape + (m,), bool)
out[(*np.indices(a.shape, sparse=True), a)] = True

The explicit tuple in the index is necessary to do star expansion.
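
To see what np.indices(..., sparse=True) contributes, the sketch below unpacks the index arrays for a small, made-up 2-D label array and cross-checks the result against indexing an identity matrix.

```python
import numpy as np

a = np.array([[2, 3], [3, 1], [4, 0]])   # hypothetical labels, shape (3, 2)
m = a.max() + 1

# With sparse=True, np.indices returns one broadcastable array per axis:
# rows has shape (3, 1) and cols has shape (1, 2)
rows, cols = np.indices(a.shape, sparse=True)

out = np.zeros(a.shape + (m,), dtype=bool)
out[rows, cols, a] = True   # same as out[(*np.indices(a.shape, sparse=True), a)]

# Cross-check against indexing an identity matrix with the labels
reference = np.eye(m, dtype=bool)[a]
```

The three index arrays broadcast to shape (3, 2), so each label position receives exactly one True along the last axis.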

If you want to extend this to an arbitrary dimension, you can do that too. The following inserts a new dimension into the squeezed array at axis. Here axis is the position of the new axis in the final array, which is consistent with, say, np.stack, but not with list.insert:

def onehot(a, axis=-1, dtype=bool):
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, a.max() + 1)
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out

If you have a singleton dimension to expand, the generalized solution can find the first available singleton dimension:

def onehot2(a, axis=None, dtype=bool):
    shape = np.array(a.shape)
    if axis is None:
        axis = (shape == 1).argmax()
    if shape[axis] != 1:
        raise ValueError(f'Dimension at {axis} is non-singleton')
    shape[axis] = a.max() + 1
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind[axis] = a
    out[tuple(ind)] = True
    return out

To use the last available singleton, replace axis = (shape == 1).argmax() with

axis = a.ndim - 1 - (shape[::-1] == 1).argmax()
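
The two argmax expressions can be tried on a bare shape. This sketch uses the shape (3, 1, 2, 1) of the example array further down; the variable names are illustrative.

```python
import numpy as np

shape = np.array((3, 1, 2, 1))
ndim = len(shape)

# argmax returns the position of the first True
first_singleton = (shape == 1).argmax()
# Reversing the shape finds the last True; convert back to a forward index
last_singleton = ndim - 1 - (shape[::-1] == 1).argmax()

# Caveat: with no singleton axis, argmax of an all-False array is 0,
# which is why onehot2 still checks shape[axis] != 1 afterwards
no_singleton = (np.array((3, 2)) == 1).argmax()
```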

Here are some example usages:

>>> np.random.seed(0x111)
>>> x = np.random.randint(5, size=(3, 2))
>>> x
array([[2, 3],
       [3, 1],
       [4, 0]])

>>> a = onehot(x, axis=-1, dtype=int)
>>> a.shape
(3, 2, 5)
>>> a
array([[[0, 0, 1, 0, 0],    # 2
        [0, 0, 0, 1, 0]],   # 3

       [[0, 0, 0, 1, 0],    # 3
        [0, 1, 0, 0, 0]],   # 1

       [[0, 0, 0, 0, 1],    # 4
        [1, 0, 0, 0, 0]]])  # 0

>>> b = onehot(x, axis=-2, dtype=int)
>>> b.shape
(3, 5, 2)
>>> b
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])

onehot2 requires you to mark the dimension you want to add as a singleton:

>>> np.random.seed(0x111)
>>> y = np.random.randint(5, size=(3, 1, 2, 1))
>>> y
array([[[[2],
         [3]]],
       [[[3],
         [1]]],
       [[[4],
         [0]]]])

>>> c = onehot2(y, axis=-1, dtype=int)
>>> c.shape
(3, 1, 2, 5)
>>> c
array([[[[0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0]]],

       [[[0, 0, 0, 1, 0],
         [0, 1, 0, 0, 0]]],

       [[[0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0]]]])

>>> d = onehot2(y, axis=-2, dtype=int)
ValueError: Dimension at -2 is non-singleton

>>> e = onehot2(y, dtype=int)
>>> e.shape
(3, 5, 2, 1)
>>> e.squeeze()
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])

How to convert 2D numpy array to One Hot Encoding?

You need to one-hot encode each column separately, so you get 4 new columns (one per class) for each column in your ndarray:

X = np.array(X)

# Get unique classes.
classes = np.unique(X)

# Replace classes with integers.
X = np.searchsorted(classes, X)

# Get an identity matrix.
eye = np.eye(classes.shape[0])

# Iterate over all columns
# and get one-hot encoding for each column.
X = np.concatenate([eye[i] for i in X.T], axis=1)

X.shape
# (5, 40)

Consider the following example:

[['A', 'G'],
 ['C', 'C'],
 ['T', 'A']]

You will get 8 (2 x 4) columns in your one-hot encoded ndarray:

  Column 0      Column 1         
 A  C  G  T    A  C  G  T

 1  0  0  0    0  0  1  0
 0  1  0  0    0  1  0  0
 0  0  0  1    1  0  0  0
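
The small example can be reproduced with the same steps. This is a sketch with illustrative names (codes, encoded) and an integer identity matrix so that the output matches the table.

```python
import numpy as np

X = np.array([['A', 'G'],
              ['C', 'C'],
              ['T', 'A']])

classes = np.unique(X)                # sorted unique values: A, C, G, T
codes = np.searchsorted(classes, X)   # integer code of each entry
eye = np.eye(classes.shape[0], dtype=int)

# One block of len(classes) columns per original column
encoded = np.concatenate([eye[col] for col in codes.T], axis=1)
```

Note that the classes are collected over the whole array, so every column block uses the same A, C, G, T ordering.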

How do I one-hot encode an array of strings with Numpy?

Got it. This will work with arrays of any number of unique values.

import numpy as np

target = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 
    'cat', 'cat', 'hamster', 'hamster'])

def one_hot(array):
    unique, inverse = np.unique(array, return_inverse=True)
    onehot = np.eye(unique.shape[0])[inverse]
    return onehot

print(one_hot(target))

Out[9]:

[[0., 1., 0.],
 [0., 1., 0.],
 [1., 0., 0.],
 [1., 0., 0.],
 [1., 0., 0.],
 [0., 1., 0.],
 [0., 1., 0.],
 [1., 0., 0.],
 [1., 0., 0.],
 [0., 0., 1.],
 [0., 0., 1.]]
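
Since np.unique returns the labels in sorted order, column k of the result corresponds to unique[k]. A short sketch of decoding the rows back to strings, using a shortened, made-up target array:

```python
import numpy as np

target = np.array(['dog', 'cat', 'hamster', 'cat'])

# unique is sorted, and inverse holds each entry's position in unique
unique, inverse = np.unique(target, return_inverse=True)
onehot = np.eye(unique.shape[0])[inverse]

# Column k of onehot corresponds to unique[k], so argmax decodes each row
decoded = unique[onehot.argmax(axis=1)]
```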

How to convert a numpy array to one hot encoding?

If you don't mind going to pandas for data handling, you could use pd.Categorical along with pd.get_dummies to achieve the result. Here's a code snippet that should work for you:

import numpy as np
import pandas as pd

sex_list = [
  "male",
  "female"
]
type_list = [
  "histo",
  "follow_up",
  "consensus",
  "confocal"
]
localization_list = [
  "back",
  "lower extremity",
  "trunk",
  "upper extremity",
  "abdomen"
]

values = np.array([
  ["male", "follow_up", "trunk"]
])
values = pd.DataFrame(values, columns=["sex", "type", "localization"]).assign(
  # each callable receives the whole DataFrame, not a single row
  sex=lambda df: pd.Categorical(df.sex, sex_list),
  type=lambda df: pd.Categorical(df.type, type_list),
  localization=lambda df: pd.Categorical(df.localization, localization_list)
)
encoded_array = pd.get_dummies(values).values

If you want to be particular about the number used to represent each value, you can replace the lists with dicts (sex_list -> sex_dict, and so on) in the pd.Categorical calls.
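
For comparison, the same fixed-category idea can be sketched without pandas. This is a NumPy-only alternative, not the pandas method above; encode_row is a hypothetical helper, and it assumes every value appears in its category list.

```python
import numpy as np

# Category lists copied from the snippet above; their order fixes the column order
sex_list = ["male", "female"]
type_list = ["histo", "follow_up", "consensus", "confocal"]
localization_list = ["back", "lower extremity", "trunk", "upper extremity", "abdomen"]

def encode_row(row, category_lists):
    """Concatenate one one-hot block per field, in the given category order."""
    blocks = []
    for value, categories in zip(row, category_lists):
        block = np.zeros(len(categories), dtype=int)
        block[categories.index(value)] = 1   # raises ValueError for unknown values
        blocks.append(block)
    return np.concatenate(blocks)

encoded = encode_row(["male", "follow_up", "trunk"],
                     [sex_list, type_list, localization_list])
```

The result has 2 + 4 + 5 = 11 entries, one block per field.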

Convert a 2D numpy array into a hot-encoded 3D numpy array, with same values in the same plane

Create an array of all class values (np.arange(arr.max()+1) here), then reshape it so that comparing it with the original array broadcasts to one plane per class:

Setup:

arr = np.array([
  [0, 1, 0],
  [0, 1, 4],
  [2, 0, 0],
])

u = np.arange(arr.max()+1)
(u[:,np.newaxis,np.newaxis]==arr).astype(int)

array([[[1, 0, 1],
        [1, 0, 0],
        [0, 1, 1]],

       [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [1, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]]])
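
The broadcasting can be made explicit by looking at the shapes involved. The sketch below also shows a variant that puts the class axis last instead of first; the names planes_first and planes_last are illustrative.

```python
import numpy as np

arr = np.array([
    [0, 1, 0],
    [0, 1, 4],
    [2, 0, 0],
])
u = np.arange(arr.max() + 1)

# u reshaped to (5, 1, 1) broadcasts against arr (3, 3): class axis first
planes_first = (u[:, np.newaxis, np.newaxis] == arr).astype(int)

# arr reshaped to (3, 3, 1) broadcasts against u (5,): class axis last
planes_last = (arr[..., np.newaxis] == u).astype(int)
```

Both results hold the same information; np.moveaxis(planes_first, 0, -1) gives planes_last.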
