Replace Values in a Pandas Series via Dictionary Efficiently


One trivial solution is to choose a method based on an estimate of how completely the values are covered by the dictionary keys.

General case

  • Use df['A'].map(d) if all values are mapped; or
  • Use df['A'].map(d).fillna(df['A']).astype(int) if more than ~5% of values are mapped.

Few (e.g. < 5%) values mapped in d

  • Use df['A'].replace(d)

The "crossover point" of ~5% is specific to Benchmarking below.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
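To make the trade-off concrete, here is a minimal sketch (with a hypothetical 3-element series) of how the two map-based variants behave on unmapped values:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
d = {1: 10, 2: 20}  # 3 is deliberately not covered by the dictionary

# map alone turns unmapped values into NaN (only safe if coverage is complete)
full = s.map(d)

# map + fillna falls back to the original values, matching replace's semantics
partial = s.map(d).fillna(s).astype(int)
print(partial.tolist())  # [10, 20, 3]
```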

Benchmarking

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d) # 1.98s
%timeit df['A'].map(d) # 84.3ms
%timeit [d[i] for i in lst] # 134ms

##### TEST 2 - Partial Map #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d) # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int) # 111ms
%timeit [d.get(i, i) for i in lst] # 243ms

Explanation

The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)

There appear to be many steps involved:

  • Converting dictionary to a list.
  • Iterating through list and checking for nested dictionaries.
  • Feeding an iterator of keys and values into a replace function.

This can be compared to much leaner code from map() in pandas\series.py:

if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
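In rough terms, the dict case of map reduces to one vectorized hash-table lookup. A simplified reconstruction of those two lines using public APIs (take instead of the internal algos.take_1d; the data here is hypothetical):

```python
import numpy as np
import pandas as pd

d = {'a': 1, 'b': 2}
values = np.array(['a', 'b', 'c'], dtype=object)

# what map(dict) boils down to: build a Series from the dict, then do a
# single vectorized positional lookup of every value in its index
mapper = pd.Series(d)
indexer = mapper.index.get_indexer(values)   # [0, 1, -1]; -1 means "not found"
taken = mapper.to_numpy().take(indexer)      # -1 wraps around, so mask it after
result = np.where(indexer == -1, np.nan, taken)
```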

How to efficiently replace values in a dataframe by iterating through a dictionary?

  • A simple option is the Series apply function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
  • apply calls the given function on each element of the Series.
  • Here we map each element of df['Salary'] to its equivalent value in the dictionary.
  • If the lambda x: salary_dict.get(x, x) part is unclear, look into Python lambdas.
  • The dictionary's get method is used as a safeguard in case a key is not in the dictionary.
df['Salary'] = df['Salary'].apply(lambda x: salary_dict.get(x, x))
print(df)

output:

    Salary
0       12
1        6
2       23
3        5
4       15
5        8
6       17
7        1
8        3
9       16
10      18
11      20
12       1
13       1
14      13
15      10
16      20
17       1
18       8
19       9
20      10
21      19
22       1

Replacing text with dictionary keys (having multiple values) in Python - more efficiency

You can build a reverse index of product to type by creating a dictionary whose keys are the products in the sublists.

product_to_type = {}
for typ, product_lists in CountryList.items():
    for product_list in product_lists:
        for product in product_list:
            product_to_type[product] = typ

A little Python magic lets you compress this step into a dict comprehension that builds the same mapping:

product_to_type = {product: typ for typ, product_lists in CountryList.items()
                   for product_list in product_lists for product in product_list}

Then you can create a function that splits the ingredients and maps them to their types, and apply that to the dataframe.

import pandas as pd

CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
               'CEREAL': [['oat'], ['wheat'], ['corn']],
               'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}

product_to_type = {product: typ for typ, product_lists in CountryList.items()
                   for product_list in product_lists for product in product_list}

def convert_product_to_type(products):
    return " ".join(product_to_type.get(product, product)
                    for product in products.split(" "))

df = pd.DataFrame({'Dish': ['A', 'B', 'C'],
                   'Price': [15, 8, 20],
                   'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']})

df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)

print(df)

Note: This solution splits the ingredient list on word boundaries which assumes that ingredients themselves don't have spaces in them.
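If ingredients may contain spaces, one alternative (a sketch, not part of the original answer; the data here is hypothetical) is a single regex pass with str.replace and a callable replacement, built from the same product_to_type lookup:

```python
import re
import pandas as pd

product_to_type = {'apple': 'FRUIT', 'lamb': 'MEAT', 'wheat': 'CEREAL'}

# one alternation of all known products, longest first so multi-word
# names would win over their single-word prefixes
pattern = '|'.join(sorted(map(re.escape, product_to_type), key=len, reverse=True))

df = pd.DataFrame({'Ingredient': ['apple banana lamb', 'wheat pork']})
df['Ingredient'] = df['Ingredient'].str.replace(
    r'\b(?:%s)\b' % pattern, lambda m: product_to_type[m.group(0)], regex=True)
print(df['Ingredient'].tolist())  # ['FRUIT banana MEAT', 'CEREAL pork']
```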

Remap values in pandas column with a dict, preserve NaNs

You can use .replace. For example:

>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN

or directly on the Series, i.e. df["col1"] = df["col1"].replace(di) (the inplace=True form on a selected column is discouraged in newer pandas versions).
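map can preserve NaNs as well: a NaN key is absent from the dict, so it maps to NaN, and a fillna with the original column refills it with NaN. A small sketch with the same example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 1, 2, np.nan]})
di = {1: "A", 2: "B"}

# unmapped values ('w') fall back to themselves; NaN stays NaN
df['col1'] = df['col1'].map(di).fillna(df['col1'])
print(df['col1'].tolist())  # ['w', 'A', 'B', nan]
```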

How to replace pandas dataframe column values based on dictionary key and values?

You can create a dict with the keys and values swapped, then use .replace as in your attempt:

>>> df = pd.DataFrame({'col1': {0: 'w', 1: 'A', 2: 'B'}, 'col2': {0: 'a', 1: 2, 2: np.nan}})
>>> df
  col1 col2
0    w    a
1    A    2
2    B  NaN

>>> di = {1: "A", 2: "B"}
>>> di2 = {v: k for k, v in di.items()}
>>> df.replace({"col1": di2})
  col1 col2
0    w    a
1    1    2
2    2  NaN

Using replace efficiently in pandas

use map to perform a lookup:

In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6

To avoid the situation where there is no valid key, you can pass na_action='ignore'.
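For example (a sketch with hypothetical data): na_action='ignore' keeps map from passing NaN to the mapping, which matters when the mapper is a callable that would raise on a missing key:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b'])
d = {'a': 1, 'b': 2}

# without na_action='ignore' the lambda would be called with NaN and
# raise a KeyError; with it, NaN entries are passed through untouched
out = s.map(lambda x: d[x], na_action='ignore')
```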

You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:

timings

In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop

In [70]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

100 loops, best of 3: 3.25 ms per loop

So using map is over 3x faster here

on a larger dataset:

In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out[3]:
(30000, 2)

In [4]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop

In [5]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

100 loops, best of 3: 18.2 ms per loop

For the 30K-row df, map is ~4x faster, so it scales better than replace or looping.

Better way to replace values in DataFrame from large dictionary

I think you can use map with a Series converted via to_dict; you get NaN where the value does not exist in df2:

df1['id'] = df1.id.map(df2.set_index('id')['name'].to_dict())
print (df1)
    id  values
0  id1      12
1  id2      32
2  id3      42
3  id4      51
4  NaN      23
5  id3      14
6  NaN     111
7  id9     134

Or use replace; values that don't exist in df2 keep their original values from df1:

df1['id'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
    id  values
0  id1      12
1  id2      32
2  id3      42
3  id4      51
4    5      23
5  id3      14
6    5     111
7  id9     134

Sample:

#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
print (df1)
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id6', 'id7', 'id8', 'id9', 'id10']})
print (df2)

df1['new_map'] = df1.id.map(df2.set_index('id')['name'].to_dict())
df1['new_replace'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
   id  values new_map new_replace
0   1      12     id1         id1
1   2      32     id2         id2
2   3      42     id3         id3
3   4      51     id4         id4
4   5      23     NaN           5
5   3      14     id3         id3
6   5     111     NaN           5
7   9     134     id9         id9
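A hybrid that combines map's speed with replace's keep-if-missing semantics is a sketch like the following, using the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 3, 5, 9],
                    'values': [12, 32, 42, 51, 23, 14, 111, 134]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6, 7, 8, 9, 10],
                    'name': ['id1', 'id2', 'id3', 'id4', 'id6',
                             'id7', 'id8', 'id9', 'id10']})

s = df2.set_index('id')['name']
# fast lookup via map, then fall back to the original ids where df2 has no match
df1['id'] = df1['id'].map(s).fillna(df1['id'])
print(df1['id'].tolist())  # ['id1', 'id2', 'id3', 'id4', 5, 'id3', 5, 'id9']
```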

Python/Pandas - What is the most efficient way to replace values in specific columns

According to the documentation on replace, you can specify a dictionary for each column:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})
lookup = {col : {2: 20, 4: 40} for col in ['a', 'c']}
df.replace(lookup, inplace=True)
print(df)

Output

    a  b   c
0   1  2  20
1  20  4  40
2   3  6   5
3  40  8   6

