Replace values in a pandas series via dictionary efficiently
A simple approach is to choose a method based on an estimate of how completely the series values are covered by the dictionary keys.
General case
- Use df['A'].map(d) if all values are mapped; or
- Use df['A'].map(d).fillna(df['A']).astype(int) if more than ~5% of values are mapped.
Few (e.g. < 5%) values in d
- Use df['A'].replace(d)
The "crossover point" of ~5% is specific to the Benchmarking below. Interestingly, a simple list comprehension generally underperforms map in either scenario.
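The decision rule above can be sketched as a small helper. This is a hypothetical function (remap is not a pandas API); the ~5% threshold comes from the benchmarks below and will vary with your data:

```python
import pandas as pd

def remap(s: pd.Series, d: dict) -> pd.Series:
    # Fraction of values that have a key in d
    coverage = s.isin(list(d)).mean()
    if coverage == 1.0:
        return s.map(d)                 # every value is mapped
    if coverage > 0.05:
        # map, then restore unmapped values (dtype may widen to float)
        return s.map(d).fillna(s)
    return s.replace(d)                 # few keys: replace is cheaper

s = pd.Series([0, 1, 2, 100])
print(remap(s, {0: 10, 1: 11, 2: 12}).tolist())  # [10.0, 11.0, 12.0, 100.0]
```

Note that computing the coverage is itself a full scan, so this only pays off when the remapping is done repeatedly or the coverage is known in advance.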
Benchmarking
import pandas as pd, numpy as np
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()
##### TEST 1 - Full Map #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d) # 1.98s
%timeit df['A'].map(d) # 84.3ms
%timeit [d[i] for i in lst] # 134ms
##### TEST 2 - Partial Map #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d) # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int) # 111ms
%timeit [d.get(i, i) for i in lst] # 243ms
Explanation
The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case. This is an excerpt from replace() in pandas/generic.py:
items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values
    return self.replace(to_replace, value, inplace=inplace,
                        limit=limit, regex=regex)
There appear to be many steps involved:
- Converting dictionary to a list.
- Iterating through list and checking for nested dictionaries.
- Feeding an iterator of keys and values into a replace function.
This can be compared to the much leaner code from map() in pandas/series.py:
if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
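The fast path above can be imitated with public API calls. This is a rough sketch, using Index.get_indexer in place of the private algos.take_1d helper:

```python
import pandas as pd
import numpy as np

d = {10: 'a', 20: 'b'}
s = pd.Series([10, 20, 30])

arg = pd.Series(d)                         # dict keys become the index
indexer = arg.index.get_indexer(s)         # position of each value, -1 if absent
new_values = arg.to_numpy().take(indexer)  # gather mapped values by position
new_values[indexer == -1] = np.nan         # take() wraps -1 around, so patch to NaN
print(new_values)  # ['a' 'b' nan]
```

The whole lookup is one hash-table probe per element plus a vectorized gather, which is why map stays fast even for large dictionaries.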
How to efficiently replace values in a dataframe by iterating through a dictionary?
- A convenient way is to use the Series apply function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
- apply calls the given function on each element of the series.
- Here we are mapping each element of df['Salary'] to its equivalent value in the dictionary.
- If you don't understand the part lambda x: salary_dict.get(x, x), look into Python lambdas.
- The get method on the dictionary is used as a safeguard in case a key is not in the dictionary.
df['Salary'] = df['Salary'].apply(lambda x: salary_dict.get(x, x))
print(df)
output:
Salary
0 12
1 6
2 23
3 5
4 15
5 8
6 17
7 1
8 3
9 16
10 18
11 20
12 1
13 1
14 13
15 10
16 20
17 1
18 8
19 9
20 10
21 19
22 1
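The snippet above assumes df and salary_dict already exist; a self-contained version with made-up data (the original question's inputs are not shown) could look like:

```python
import pandas as pd

# Hypothetical inputs; the question's actual df and salary_dict are not shown
df = pd.DataFrame({'Salary': [100, 200, 300]})
salary_dict = {100: 1, 200: 2}   # 300 deliberately has no mapping

# get(x, x) falls back to the original value when x is not a key
df['Salary'] = df['Salary'].apply(lambda x: salary_dict.get(x, x))
print(df['Salary'].tolist())  # [1, 2, 300]
```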
Replacing text with dictionary keys (having multiple values) in Python - more efficiency
You can build a reverse index of product to type, by creating a dictionary where the keys are the values of the sublists
product_to_type = {}
for typ, product_lists in CountryList.items():
for product_list in product_lists:
for product in product_list:
product_to_type[product] = typ
A dict comprehension compresses this step into a single expression that builds the dict
product_to_type = {product:typ for typ, product_lists in CountryList.items()
for product_list in product_lists for product in product_list}
Then you can create a function that splits the ingredients and maps them to type and apply that to the dataframe.
import pandas as pd
CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
'CEREAL': [['oat'], ['wheat'], ['corn']],
'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}
product_to_type = {product:typ for typ, product_lists in CountryList.items()
for product_list in product_lists for product in product_list}
def convert_product_to_type(products):
return " ".join(product_to_type.get(product, product)
for product in products.split(" "))
df = pd.DataFrame({'Dish': ['A', 'B','C'],
'Price': [15,8,20],
'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
})
df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)
print(df)
Note: This solution splits the ingredient list on word boundaries which assumes that ingredients themselves don't have spaces in them.
Remap values in pandas column with a dict, preserve NaNs
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
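A quick contrast of the two methods' NaN behaviour, as a small sketch; values absent from the dict are the interesting case:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, np.nan])
di = {1: "A", 2: "B"}

# replace leaves unmapped values (including NaN) untouched
print(s.replace(di).tolist())  # ['A', 'B', 3.0, nan]

# map turns every unmapped value into NaN
print(s.map(di).tolist())      # ['A', 'B', nan, nan]
```

This is why replace is the natural choice when NaNs and unmapped values must be preserved as-is.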
How to replace pandas dataframe column values based on dictionary key and values?
You can create a dict with the keys and values swapped, then use .replace as in your attempt:
>>> df = pd.DataFrame({'col1': {0: 'w', 1: 'A', 2: 'B'}, 'col2': {0: 'a', 1: 2, 2: np.nan}})
>>> df
col1 col2
0 w a
1 A 2
2 B NaN
>>> di = {1: "A", 2: "B"}
>>> di2 = {v:k for k,v in di.items()}
>>> df.replace({"col1": di2})
col1 col2
0 w a
1 1 2
2 2 NaN
Using replace efficiently in pandas
Use map to perform a lookup:
In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
1st 2nd
0 a 2
1 b 4
2 c 6
You can pass na_action='ignore' so that NaN values are passed through without being looked up; note that values missing from the dict still become NaN.
You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:
timings
In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop
In [70]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 3.25 ms per loop
So using map is over 3x faster than the loop here. On a larger dataset:
In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out[3]:
(30000, 2)
In [4]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop
In [5]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 18.2 ms per loop
For a 30K-row df, map is ~4x faster, so it scales better than replace or looping.
Better way to replace values in DataFrame from large dictionary
I think you can use map with the Series converted by to_dict - you get NaN if a value does not exist in df2:
df1['id'] = df1.id.map(df2.set_index('id')['name'].to_dict())
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Or replace, which keeps the original values from df1 when they do not exist in df2:
df1['id'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Sample:
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
print (df1)
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id6', 'id7', 'id8', 'id9', 'id10']})
print (df2)
df1['new_map'] = df1.id.map(df2.set_index('id')['name'].to_dict())
df1['new_replace'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values new_map new_replace
0 1 12 id1 id1
1 2 32 id2 id2
2 3 42 id3 id3
3 4 51 id4 id4
4 5 23 NaN 5
5 3 14 id3 id3
6 5 111 NaN 5
7 9 134 id9 id9
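As a side note, Series.map also accepts a Series directly, so the to_dict() call is optional. A minimal sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 5], 'values': [12, 32, 23]})
df2 = pd.DataFrame({'id': [1, 2], 'name': ['id1', 'id2']})

# Passing a Series to map looks each value up against that Series' index
mapped = df1['id'].map(df2.set_index('id')['name'])
print(mapped.tolist())  # ['id1', 'id2', nan]
```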
Python/Pandas - What is the most efficient way to replace values in specific columns
According to the documentation on replace, you can specify a dictionary for each column:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})
lookup = {col : {2: 20, 4: 40} for col in ['a', 'c']}
df.replace(lookup, inplace=True)
print(df)
Output
a b c
0 1 2 20
1 20 4 40
2 3 6 5
3 40 8 6