Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer
Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
   Col  A  B  Val
0    B  1  5    5
2    A  2  6    2
8    A  3  7    3
9    B  4  8    8
factorize is used to encode the values of the column as an "enumerated type".
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that the columns appear in the same order as the enumeration:
df.reindex(columns=col)
   B  A   # B appears first (location 0), A appears second (location 1)
0  5  1
2  6  2
8  7  3
9  8  4
We need to create an appropriate range indexer compatible with NumPy indexing. The standard approach is to use np.arange based on the length of the DataFrame:
np.arange(len(df))
[0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
[5 2 3 8]
*Note: This approach will always work regardless of the type of index.
MultiIndex
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
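Since this pattern recurs, it can be wrapped in a small helper. The function name lookup_by_column below is my own, not part of pandas; it is simply the factorize/reindex recipe from above in reusable form:

```python
import numpy as np
import pandas as pd

def lookup_by_column(df, key_col):
    """For each row, return the value from the column named in df[key_col]."""
    idx, cols = pd.factorize(df[key_col])
    return df.reindex(columns=cols).to_numpy()[np.arange(len(df)), idx]

# Works with any index, including a MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
df['Val'] = lookup_by_column(df, 'Col')  # [5, 2, 3, 8]
```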
Why use np.arange and not df.index directly?
Standard Contiguous Range Index
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error, because the result of np.arange(len(df)) is identical to df.index.
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
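The equivalence can be checked directly: with a default RangeIndex the labels coincide with the positions 0..n-1, which is the only reason positional NumPy indexing happens to accept df.index here. A quick sanity-check sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

# Default RangeIndex: labels coincide with positions 0..n-1
same = np.array_equal(np.arange(len(df)), df.index.to_numpy())
print(same)  # True
```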
Non-Contiguous Range Index Error
Raises IndexError:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches. First, let's look at what happens by default when there is a non-corresponding value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 C 4 8
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN values are introduced, we will find that when factorize goes through the column it enumerates all groups present, regardless of whether they correspond to a column or not. For this reason, when we reindex the DataFrame we end up with the following result:
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
B A C
0 5 1 NaN
1 6 2 NaN
2 7 3 NaN
3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can use the fill_value argument of reindex, which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
B A C
0 5 1 0
1 6 2 0
2 7 3 0
3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
    columns=col,
    fill_value=0  # Default value for missing column values
).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0
*Notice the dtype of the column is int, since NaN was never introduced and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
factorize has a default na_sentinel=-1, meaning that when NaN values appear in the column being factorized the resulting idx value is -1:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 NaN 4 8 # <- Missing Lookup Key
idx, col = pd.factorize(df['Col'])
# idx = array([ 0, 1, 1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
# Col A B Val
# 0 B 1 5 5
# 1 A 2 6 2
# 2 A 3 7 3
# 3 NaN 4 8 4 <- Value From A
This -1 means that, by default, we'll be pulling from the last column when we reindex. Notice that col still only contains the values B and A, meaning that we will end up with the value from A in Val for the last row. The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers. Here I use the empty string '':
idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the '' column will contain NaN values, meaning that the lookup produces the desired result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # Missing as expected
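The two safeguards above (fillna on the lookup column and fill_value in reindex) can be combined. The helper below is a sketch of my own, assuming the empty string '' never appears as a real column header:

```python
import numpy as np
import pandas as pd

def safe_lookup(df, key_col, default=np.nan):
    """Row-wise lookup with a default for NaN keys and unmatched keys."""
    idx, cols = pd.factorize(df[key_col].fillna(''))  # '' is a sentinel key
    return df.reindex(columns=cols, fill_value=default).to_numpy()[
        np.arange(len(df)), idx
    ]

df = pd.DataFrame({'Col': ['B', 'A', np.nan, 'C'],  # NaN key and unmatched 'C'
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
df['Val'] = safe_lookup(df, 'Col', default=0)  # [5, 2, 0, 0]
```

Both the missing key (NaN) and the unmatched key ('C') map to the same default here; if they need to be distinguished, the idx == -1 positions would have to be handled separately.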
Reference DataFrame value corresponding to column header
We can use NumPy indexing, as recommended by the documentation on Looking up values by index/column labels, as the replacement for the deprecated DataFrame.lookup.
With factorize on the Select column and reindex:
idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
)
Notice 1: If there is a value in the factorized column that does not correspond to a column header, the resulting value will be NaN (which indicates missing data).
Notice 2: Both indexers need to be 0-based range indexes (compatible with NumPy indexing). np.arange(len(df)) creates the range index based on the length of the DataFrame and therefore works in all cases, but if the DataFrame already has a contiguous 0-based index, df.index can be used directly:
idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)
df
Area 1 2 3 4 Select value
0 22 54 33 46 23 4 23
1 45 36 54 32 14 1 36
2 67 34 29 11 14 3 11
3 54 35 19 22 45 2 19
4 21 27 39 43 22 3 43
Another option is Index.get_indexer:
df['value'] = df.to_numpy()[
    df.index.get_indexer(df.index),
    df.columns.get_indexer(df['Select'])
]
- Notice: The same condition as above applies; if df.index is already a contiguous 0-based index (compatible with NumPy indexing) we can use df.index directly, instead of processing it with Index.get_indexer:
df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]
df
Area 1 2 3 4 Select value
0 22 54 33 46 23 4 23
1 45 36 54 32 14 1 36
2 67 34 29 11 14 3 11
3 54 35 19 22 45 2 19
4 21 27 39 43 22 3 43
Warning for get_indexer: if there is a value in Select that does not correspond to a column header, the return value is -1, which will return the value from the last column in the DataFrame (since Python supports negative indexing relative to the end). This is not as safe as NaN, since it returns a plausible-looking numeric value, and it may be difficult to tell immediately that the data is invalid.
Sample Program:
import pandas as pd
df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})
df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]
print(df)
Notice in the last row the Select column is D but it pulls the value from C, which is the last column in the DataFrame (-1). It is not immediately apparent that the lookup failed/is incorrect.
Select A B C value
0 B 47 56 70 56
1 A 2 88 73 2
2 C 51 10 59 59
3 D 95 56 56 56 # <- Value from C
Compared with factorize:
idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)
Notice in the last row the Select column is D and the corresponding value is NaN, which is used in pandas to indicate missing data.
Select A B C value
0 B 47 56 70 56.0
1 A 2 88 73 2.0
2 C 51 10 59 59.0
3 D 95 56 56 NaN # <- Missing Data
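If get_indexer is still preferred, one way to avoid the silent last-column pull is to mask the -1 sentinel afterwards. This guard is my own sketch, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],   # 'D' has no matching column header
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})

pos = df.columns.get_indexer(df['Select'])  # -1 where no header matches
vals = df.to_numpy()[np.arange(len(df)), pos].astype('float64')
vals[pos == -1] = np.nan                    # overwrite the bogus last-column hits
df['value'] = vals
# value: [56.0, 2.0, 59.0, nan]
```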
Setup and imports:
import numpy as np  # (Only needed if using np.arange)
import pandas as pd
df = pd.DataFrame({
    'Area': [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    'Select': [4, 1, 3, 2, 3]
})
Pandas get column value based on row value
Per this page:
idx, cols = pd.factorize(df['flag'])
df['COl_VAL'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
>>> df
flag col1 col2 col3 col4 COl_VAL
index
A col3 1 5 6 0 6
B col2 3 2 3 4 2
C col2 2 4 6 4 4
Python Pandas Match Vlookup columns based on header values
Deprecation Notice: lookup was deprecated in v1.2.0.
Use pd.DataFrame.lookup
Keep in mind that I'm assuming Customer_ID is the index.
df.lookup(df.index, df.Year_joined_mailing)
array([5, 7, 5, 7])
df.assign(
    Purchases_1st_year=df.lookup(df.index, df.Year_joined_mailing)
)
2015 2016 2017 Year_joined_mailing Purchases_1st_year
Customer_ID
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
However, you have to be careful with comparing possible strings in the column names and integers in the first year column...
Nuclear option to ensure type comparisons are respected.
df.assign(
    Purchases_1st_year=df.rename(columns=str).lookup(
        df.index, df.Year_joined_mailing.astype(str)
    )
)
2015 2016 2017 Year_joined_mailing Purchases_1st_year
Customer_ID
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
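Since DataFrame.lookup was removed entirely in pandas 2.0, the same vlookup can be reproduced with the factorize/reindex pattern described earlier. A sketch against the sample data shown (I'm reconstructing the frame from the printed output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {2015: [5, 6, 10, 8],
     2016: [6, 7, 4, 7],
     2017: [10, 3, 5, 10],
     'Year_joined_mailing': [2015, 2016, 2017, 2016]},
    index=pd.Index(['ABC', 'BCD', 'DEF', 'GHI'], name='Customer_ID'))

idx, cols = pd.factorize(df['Year_joined_mailing'])
df['Purchases_1st_year'] = (
    df.reindex(columns=cols).to_numpy()[np.arange(len(df)), idx]
)
# Purchases_1st_year: [5, 7, 5, 7]
```

Because reindex matches the factorized keys against the column labels directly, the integer-vs-string header concern above does not arise here, as long as the keys and headers share a type.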
Pandas: Use column value to select the value from a different column to populate a new column
You can use DataFrame.apply:
def label_score(row):
    col_num = int(row['true_label'])
    return row[f'{col_num}_score']

quest['true_label_score'] = quest.apply(label_score, axis=1)
If you want a solution based on the scores list, you can do:
scores = ['0_score', '1_score', '2_score', '3_score', '4_score', '5_score']

def label_score(row, scores):
    col_num = int(row['true_label'])
    col_label = scores[col_num]
    return row[col_label]

quest['true_label_score'] = quest.apply(label_score, scores=scores, axis=1)
However, assuming that the columns are in the right order (i.e. 0_score is the first column, 1_score is the second, etc.), a faster approach is numpy fancy indexing, as @mozway suggested:
quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), quest['true_label']]
Output:
>>> quest
0_score 1_score 2_score 3_score 4_score 5_score true_label true_label_score
0 0.007512 0.264500 0.273147 0.218029 0.233726 0.003084 1 0.264500
1 0.130695 0.289085 0.173402 0.144897 0.238129 0.023792 1 0.289085
2 0.006896 0.130070 0.289822 0.210133 0.219567 0.143512 4 0.219567
3 0.006819 0.178320 0.259109 0.041048 0.316587 0.198118 1 0.178320
4 0.011121 0.058437 0.182823 0.317847 0.123521 0.306250 3 0.317847
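Note that the fancy-indexing one-liner relies on true_label matching the physical column position. If the column order cannot be guaranteed, the position can be derived from the header names instead; a sketch using a small made-up frame:

```python
import numpy as np
import pandas as pd

quest = pd.DataFrame({
    '0_score': [0.1, 0.2],
    '1_score': [0.3, 0.4],
    'true_label': [1, 0]
})

# Build the '<n>_score' label for each row, then look its position up
# by name, independent of the physical column order
labels = quest['true_label'].astype(str) + '_score'
pos = quest.columns.get_indexer(labels)
quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), pos]
# true_label_score: [0.3, 0.2]
```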
Pandas select a specific column from each row
pandas has a lot of numpy behind it, and so the workaround from the pandas docs is very easy to plug right back into your DataFrame:
In [27]: df = pd.DataFrame({'select': ['a', 'b', 'c', 'b', 'c', 'a'], 'a': range(6), 'b': range(6, 12), 'c': range(12, 18)})
In [28]: idx, cols = pd.factorize(df['select'])
In [29]: df['chosen'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
In [30]: df
Out[30]:
select a b c chosen
0 a 0 6 12 0
1 b 1 7 13 7
2 c 2 8 14 14
3 b 3 9 15 9
4 c 4 10 16 16
5 a 5 11 17 5