Lookup Values by Corresponding Column Header in Pandas 1.2.0 or Newer

Standard LookUp Values With Any Index

The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

factorize is used to encode the values of the column as an "enumerated type".

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')

Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that columns appear in the same order as the enumeration:

df.reindex(columns=col)

   B  A  # B appears first (location 0), A appears second (location 1)
0  5  1
2  6  2
8  7  3
9  8  4

We need to create an appropriate range indexer compatible with NumPy indexing.

The standard approach is to use np.arange based on the length of the DataFrame:

np.arange(len(df))

[0 1 2 3]

Now NumPy indexing will work to select values from the DataFrame:

df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

[5 2 3 8]

*Note: This approach will always work regardless of type of index.
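
The steps above can be wrapped in a small helper. This is only a minimal sketch: the function name lookup_by_header and its parameters are hypothetical (not part of pandas), and it assumes the key column contains no missing values.

```python
import numpy as np
import pandas as pd


def lookup_by_header(frame, key_col):
    """Return, for each row, the value in the column named by frame[key_col].

    Hypothetical helper (not a pandas API); it repeats the
    factorize/reindex steps described above.
    """
    # Encode the keys as integer codes plus the unique labels they stand for.
    idx, col = pd.factorize(frame[key_col])
    # Order the columns like the codes, then index by position.
    values = frame.reindex(columns=col).to_numpy()
    return values[np.arange(len(frame)), idx]


df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
df['Val'] = lookup_by_header(df, 'Col')
```

Because the row indexer is built from len(frame), the helper does not depend on the index labels at all.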

MultiIndex

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8

Why use np.arange and not df.index directly?

Standard Contiguous Range Index

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

In this case only, there is no error as the result from np.arange is the same as the df.index.
df

  Col  A  B  Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8

Non-Contiguous Range Index Error

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError:

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: index 8 is out of bounds for axis 0 with size 4

MultiIndex Error

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError:

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
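
The contrast between the two row indexers can be checked side by side. The following is a small sketch using the non-contiguous frame from above; the variable names by_position and label_indexing_failed are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
values = df.reindex(columns=col).to_numpy()

# Positional row indexer: always 0..len(df)-1, whatever the index labels are.
by_position = values[np.arange(len(df)), idx]

# Label-based row indexer: labels 8 and 9 are not valid positions
# in an array with only 4 rows.
try:
    values[df.index, idx]
    label_indexing_failed = False
except IndexError:
    label_indexing_failed = True
```

NumPy only knows about positions, so the index labels are interpreted as row numbers. That is why df.index happens to work for a default RangeIndex and fails otherwise.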


LookUp with Default For Unmatched/Not-Found Values

There are a few approaches.

First let's look at what happens by default if there is a non-corresponding value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
#   Col  A  B
# 0   B  1  5
# 1   A  2  6
# 2   A  3  7
# 3   C  4  8

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

  Col  A  B  Val
0   B  1  5  5.0
1   A  2  6  2.0
2   A  3  7  3.0
3   C  4  8  NaN  # NaN represents the missing value for C

If we look at why the NaN values are introduced, we will find that when factorize goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.

For this reason, when we reindex the DataFrame we will end up with the following result:

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')

df.reindex(columns=col)

   B  A   C
0  5  1 NaN
1  6  2 NaN
2  7  3 NaN
3  8  4 NaN  # Reindex adds the missing column with the default `NaN`

If we want to specify a default value, we can specify the fill_value argument of reindex which allows us to modify the behaviour as it relates to missing column values:

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')

df.reindex(columns=col, fill_value=0)

   B  A  C
0  5  1  0
1  6  2  0
2  7  3  0
3  8  4  0  # Notice reindex adds the missing column with the specified value `0`

This means that we can do:

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
columns=col,
fill_value=0 # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]

df:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0

*Notice the dtype of the column is int, since NaN was never introduced, and, therefore, the column type was not changed.
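
The dtype difference can be verified directly. The sketch below compares the two variants on the frame from this section; with_nan and with_default are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
rows = np.arange(len(df))

# Default: the missing column 'C' is filled with NaN, which forces floats.
with_nan = df.reindex(columns=col).to_numpy()[rows, idx]

# fill_value=0: the missing column is filled with an integer, so the
# resulting array stays integer typed.
with_default = df.reindex(columns=col, fill_value=0).to_numpy()[rows, idx]
```

Any fill_value that does not fit the existing dtype (for example a string) would change the dtype of the result in the same way NaN does.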



LookUp with Missing Values in the lookup Col

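By default, factorize encodes missing values with the sentinel -1 (na_sentinel=-1 in older releases; controlled by use_na_sentinel=True since pandas 1.5), meaning that when NaN values appear in the column being factorized, the resulting idx value is -1.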

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
#    Col  A  B
# 0    B  1  5
# 1    A  2  6
# 2    A  3  7
# 3  NaN  4  8  # <- Missing Lookup Key

idx, col = pd.factorize(df['Col'])
# idx = array([ 0,  1,  1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
#    Col  A  B  Val
# 0    B  1  5    5
# 1    A  2  6    2
# 2    A  3  7    3
# 3  NaN  4  8    4  # <- Value From A

This -1 means that, by default, we pull from the last column of the reindexed DataFrame. Notice that col still only contains the values B and A, so the last row ends up with the value from A in Val.

The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.

Here I use the empty string '':

idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')

Now when I reindex, the '' column will contain NaN values meaning that the lookup produces the desired result:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # Missing as expected
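
A different way to get the same result, not the fillna approach shown above, is to leave Col untouched and mask the rows where factorize returned the -1 sentinel. This is a sketch under the assumption that NaN is the desired result for a missing key; the name picked is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])

# Rows with idx == -1 still read the last reindexed column here ...
picked = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

# ... so they are replaced with NaN explicitly afterwards.
df['Val'] = np.where(idx == -1, np.nan, picked)
```

The mask does not rely on choosing a placeholder that is guaranteed not to be a column header, which is the one thing the fillna approach has to get right.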

Reference DataFrame value corresponding to column header

We can use numpy indexing as recommended by the documentation on Looking up values by index/column labels as the replacement for the deprecated DataFrame.lookup.

Factorize Select, then reindex:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
)
  • Notice 1: If there is a value in the factorized column that does not correspond to a column header, the resulting value will be NaN (which indicates missing Data).

  • Notice 2: Both indexers need to be 0 based range indexes (compatible with numpy indexing). np.arange(len(df)) creates the range index based on the length of the DataFrame and therefore works in all cases.

However, if the DataFrame already has a compatible index (like in this example) df.index can be used directly.

idx, cols = pd.factorize(df['Select'])
df['value'] = (
df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43

Another option is Index.get_indexer:

df['value'] = df.to_numpy()[
df.index.get_indexer(df.index),
df.columns.get_indexer(df['Select'])
]
  • Notice: The same condition as above applies. If df.index is already a contiguous 0-based index (compatible with numpy indexing), we can use df.index directly instead of processing it with Index.get_indexer:

df['value'] = df.to_numpy()[
df.index,
df.columns.get_indexer(df['Select'])
]

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43
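
To illustrate what Index.get_indexer contributes on the row side, here is a small sketch with a made-up frame whose index is not 0-based. The data and the names row_pos and col_pos are assumptions for illustration only.

```python
import pandas as pd

# Made-up example data with a non-contiguous index.
df = pd.DataFrame({'Select': ['B', 'A', 'B'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=[10, 20, 30])

# Translate index labels into row positions (0, 1, 2 for a unique index).
row_pos = df.index.get_indexer(df.index)
# Translate the header names stored in Select into column positions.
col_pos = df.columns.get_indexer(df['Select'])

values = df.to_numpy()[row_pos, col_pos]
```

For a unique index, row_pos is the same array that np.arange(len(df)) would produce, so either can serve as the row indexer.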

Warning for get_indexer: if a value in Select does not correspond to a column header, get_indexer returns -1, and NumPy treats -1 as the last position (negative indices count from the end), so the lookup silently returns the value from the last column of the DataFrame. This is less safe than NaN: in the example above the last column is the numeric Select column, so the result is a plausible-looking number and it may be hard to tell at a glance that the data is invalid.

Sample Program:

import pandas as pd

df = pd.DataFrame({
'Select': ['B', 'A', 'C', 'D'],
'A': [47, 2, 51, 95],
'B': [56, 88, 10, 56],
'C': [70, 73, 59, 56]
})

df['value'] = df.to_numpy()[
df.index,
df.columns.get_indexer(df['Select'])
]

print(df)

Notice that in the last row the Select column is D, but the value is pulled from C, the last column in the DataFrame (position -1). Nothing in the output makes it apparent that the lookup failed.

  Select   A   B   C value
0      B  47  56  70    56
1      A   2  88  73     2
2      C  51  10  59    59
3      D  95  56  56    56  # <- Value from C

Compared with factorize:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

Notice in the last row the Select column is D and the corresponding value is NaN which is used in pandas to indicate missing data.

  Select   A   B   C  value
0      B  47  56  70   56.0
1      A   2  88  73    2.0
2      C  51  10  59   59.0
3      D  95  56  56    NaN  # <- Missing Data
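
If get_indexer is preferred anyway, the -1 results can be masked explicitly. The following is a sketch of that idea rather than an established recipe; found and raw are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})

# Column positions; -1 marks keys with no matching header ('D' here).
col_pos = df.columns.get_indexer(df['Select'])
found = col_pos != -1

# The raw lookup still reads the last column wherever col_pos is -1 ...
raw = df.to_numpy()[np.arange(len(df)), col_pos]
# ... so those rows are overwritten with NaN afterwards.
df['value'] = np.where(found, raw, np.nan)
```

Since df mixes strings and integers, to_numpy() returns an object array here, and the resulting column keeps object dtype.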

Setup and imports:

import numpy as np  # (Only needed if using np.arange)
import pandas as pd

df = pd.DataFrame({
'Area': [22, 45, 67, 54, 21],
1: [54, 36, 34, 35, 27],
2: [33, 54, 29, 19, 39],
3: [46, 32, 11, 22, 43],
4: [23, 14, 14, 45, 22],
'Select': [4, 1, 3, 2, 3]
})

Pandas get column value based on row value

Per this page:

idx, cols = pd.factorize(df['flag'])
df['COl_VAL'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Output:

>>> df
       flag  col1  col2  col3  col4  COl_VAL
index
A      col3     1     5     6     0        6
B      col2     3     2     3     4        2
C      col2     2     4     6     4        4

Python Pandas Match Vlookup columns based on header values

Deprecation Notice: DataFrame.lookup was deprecated in pandas 1.2.0 and removed in 2.0, so the code below only runs on older versions.

Use pd.DataFrame.lookup

Keep in mind that I'm assuming Customer_ID is the index.

df.lookup(df.index, df.Year_joined_mailing)

array([5, 7, 5, 7])


df.assign(
    Purchases_1st_year=df.lookup(df.index, df.Year_joined_mailing)
)

             2015  2016  2017  Year_joined_mailing  Purchases_1st_year
Customer_ID
ABC             5     6    10                 2015                   5
BCD             6     7     3                 2016                   7
DEF            10     4     5                 2017                   5
GHI             8     7    10                 2016                   7

However, you have to be careful with comparing possible strings in the column names and integers in the first year column...

Nuclear option to ensure type comparisons are respected.

df.assign(
    Purchases_1st_year=df.rename(columns=str).lookup(
        df.index, df.Year_joined_mailing.astype(str)
    )
)

             2015  2016  2017  Year_joined_mailing  Purchases_1st_year
Customer_ID
ABC             5     6    10                 2015                   5
BCD             6     7     3                 2016                   7
DEF            10     4     5                 2017                   5
GHI             8     7    10                 2016                   7
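
On pandas versions where lookup is unavailable, the same result can be obtained with factorize and reindex. This is a sketch, not part of the original answer: the frame is rebuilt from the output shown above, and the names as_str and first_year are illustrative.

```python
import numpy as np
import pandas as pd

# Data copied from the output table above; Customer_ID is the index.
df = pd.DataFrame(
    {2015: [5, 6, 10, 8],
     2016: [6, 7, 4, 7],
     2017: [10, 3, 5, 10],
     'Year_joined_mailing': [2015, 2016, 2017, 2016]},
    index=pd.Index(['ABC', 'BCD', 'DEF', 'GHI'], name='Customer_ID'),
)

# Compare everything as strings so that int and str labels cannot mismatch.
as_str = df.rename(columns=str)
idx, cols = pd.factorize(df['Year_joined_mailing'].astype(str))

first_year = as_str.reindex(columns=cols).to_numpy()[np.arange(len(df)), idx]
result = df.assign(Purchases_1st_year=first_year)
```

The string conversion mirrors the "nuclear option" above; if the column labels and the year values already share a type, it can be dropped.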

Pandas: Use column value to select the value from a different column to populate a new column

You can use DataFrame.apply

def label_score(row):
    col_num = int(row['true_label'])
    return row[f'{col_num}_score']

quest['true_label_score'] = quest.apply(label_score, axis=1)

If you want a solution based on the scores list you can do

scores = ['0_score', '1_score', '2_score', '3_score', '4_score','5_score']

def label_score(row, scores):
    col_num = int(row['true_label'])
    col_label = scores[col_num]
    return row[col_label]

quest['true_label_score'] = quest.apply(label_score, scores=scores, axis=1)

However, assuming that the columns are in the right order (i.e. 0_score is the first column, 1_score is the second, etc.), a faster option is NumPy fancy indexing, as @mozway suggested.

quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), quest['true_label']]

Output:

>>> quest 

0_score 1_score 2_score 3_score 4_score 5_score true_label true_label_score
0 0.007512 0.264500 0.273147 0.218029 0.233726 0.003084 1 0.264500
1 0.130695 0.289085 0.173402 0.144897 0.238129 0.023792 1 0.289085
2 0.006896 0.130070 0.289822 0.210133 0.219567 0.143512 4 0.219567
3 0.006819 0.178320 0.259109 0.041048 0.316587 0.198118 1 0.178320
4 0.011121 0.058437 0.182823 0.317847 0.123521 0.306250 3 0.317847
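
If the column order cannot be relied on, the positions can be derived from the header names instead. The sketch below uses a small made-up frame (not the data above) whose columns are deliberately out of order; wanted and col_pos are illustrative names.

```python
import numpy as np
import pandas as pd

# Small made-up frame; the columns are deliberately NOT in label order.
quest = pd.DataFrame({
    '1_score': [0.7, 0.2, 0.1],
    '0_score': [0.1, 0.5, 0.3],
    '2_score': [0.2, 0.3, 0.6],
    'true_label': [1, 0, 2],
})

# Build the target header for each row, then translate headers to positions.
wanted = quest['true_label'].astype(int).astype(str) + '_score'
col_pos = quest.columns.get_indexer(wanted)

# get_indexer returns -1 for unknown headers; fail loudly instead of
# silently reading the last column.
if (col_pos == -1).any():
    raise KeyError('some true_label values have no matching *_score column')

quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), col_pos]
```

This keeps the speed of fancy indexing while matching columns by name, at the cost of one extra get_indexer call.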

Pandas select a specific column from each row

pandas has a lot of numpy behind it, and so the workaround from the pandas docs is very easy to plug right back into your DataFrame:

In [27]: df = pd.DataFrame({'select': ['a', 'b', 'c', 'b', 'c', 'a'], 'a': range(6), 'b': range(6, 12), 'c': range(12, 18)})

In [28]: idx, cols = pd.factorize(df['select'])

In [29]: df['chosen'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

In [30]: df
Out[30]:
  select  a   b   c  chosen
0      a  0   6  12       0
1      b  1   7  13       7
2      c  2   8  14      14
3      b  3   9  15       9
4      c  4  10  16      16
5      a  5  11  17       5

