Looping in Python: Modify One Column Based on Values in Other Columns

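For reference, the example frame is not constructed anywhere in this excerpt; a minimal sketch inferred from the printed output below:

import pandas as pd

# Example data inferred from the output shown further down
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E', 'C'],
                   'col2': [1, 3, 3, 7, 4, 3]})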

# Copy the existing column over
df['col2_corrected'] = df.col2

# Increment the values of only those items where col1 is A, C, or E
df.loc[df.col1.isin(['A', 'C', 'E']), 'col2_corrected'] += 1

df
Out[]:
  col1  col2  col2_corrected
0    A     1               2
1    B     3               3
2    C     3               4
3    D     7               7
4    E     4               5
5    C     3               4

The error you are getting comes from the line if df.col1.isin(add_one_to_me):

If we take a look at: df.col1.isin(add_one_to_me)

Out[]:
0     True
1    False
2     True
3    False
4     True
5     True

And that doesn't play well with the if statement: a boolean Series has no single truth value, so if raises an error. What you could do instead is check each item in col1 and increment col2_corrected by one where the condition holds. This can be done with df.apply(...) or with for index, row in df.iterrows():
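A minimal sketch of the iterrows variant, assuming col2_corrected starts out as a plain copy of col2 (this is slower than the vectorised .loc approach above and is shown only to illustrate the loop):

# Walk the frame row by row and bump col2_corrected where col1 matches
for index, row in df.iterrows():
    if row['col1'] in ['A', 'C', 'E']:
        df.loc[index, 'col2_corrected'] += 1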

Modify values in a column based on a condition from another

The problem with your current method is that the output of each iteration overwrites the output of the one before it, so you'd end up with output for just the last item and nothing more.

Select all rows where col1 is in items and assign, the same way you did before.

df['math'] = df.loc[df.col1.isin(items), 'col3'] * 10

Or,

df['math'] = df.query("col1 in @items").col3 * 10

Or even,

df['math'] = df.col3.where(df.col1.isin(items)) * 10

df

  col1  col2  col3  math
0    A     2     0   0.0
1    A     1     1  10.0
2    B     9     9   NaN
3  NaN     8     4   NaN
4    D     7     2  20.0
5    C     4     3   NaN
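Note that the query and where variants leave NaN where the condition is not met. If you would rather keep the original col3 values in those rows, one option (a small follow-up, not part of the original answer) is:

# Multiply by 10 only where col1 is in items, otherwise keep col3 unchanged
df['math'] = (df.col3 * 10).where(df.col1.isin(items), df.col3)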

Modify one column's values based on multiple conditions on another column

For better performance, use numpy.select instead of apply; it also makes it possible to set a default value when no condition matches:

import numpy as np

masks = [(df['A'] >= 0) & (df['A'] < 50),
         (df['A'] >= 50) & (df['A'] < 70),
         (df['A'] >= 70) & (df['A'] <= 100)]

vals = [df['B'], df['B'] / 3, df['B'] / df['C'] / 3]

df['B'] = np.select(masks, vals, default=0)

Performance - It is about 1000 times faster:

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10000, 3)), columns=list('ABC'))
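The apply-based Standard function being timed below comes from another answer and isn't reproduced in this excerpt; a hypothetical row-wise equivalent of the masks/vals above might look like this (an assumption, not the original code):

# Hypothetical row-wise version of the masks/vals logic (assumed, not the original Standard)
def Standard(row):
    if 0 <= row['A'] < 50:
        return row['B']
    elif 50 <= row['A'] < 70:
        return row['B'] / 3
    elif 70 <= row['A'] <= 100:
        return row['B'] / row['C'] / 3
    return 0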

#Jeril solution
In [74]: %timeit df['B1'] = df.apply(Standard, axis=1)
__main__:18: RuntimeWarning: divide by zero encountered in double_scalars
424 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [75]: %timeit df['B'] = np.select(masks, vals, default=0)
468 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Iterate over values in a DataFrame column, and update the value if a condition is met

If you really want to do it with a loop, you can simply use the indices like arrays:

import pandas as pd

df = pd.DataFrame()
df['date'] = [-3000,3000,1000,5000]

# Loop over the positional index and write via .loc
# (df['date'][i] = ... is chained assignment and may not write back reliably)
for i in range(len(df['date'])):
    if df.loc[i, 'date'] > 2000:
        df.loc[i, 'date'] -= 2400
    elif df.loc[i, 'date'] < -2000:
        df.loc[i, 'date'] += 2400

df

But I would use a simpler method using .loc:

df.loc[df['date'] > 2000, 'date'] = df['date'] - 2400
df.loc[df['date'] < -2000, 'date'] = df['date'] + 2400

df
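As a variation (not part of the original answer), both adjustments can also be made in a single vectorised pass with numpy.select, starting from the unmodified date column:

import numpy as np

# Apply both corrections at once; rows matching neither condition keep their value
df['date'] = np.select([df['date'] > 2000, df['date'] < -2000],
                       [df['date'] - 2400, df['date'] + 2400],
                       default=df['date'])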

Update Pandas Cells based on Column Values and Other Columns

The most elegant approach is definitely CountVectorizer from sklearn.

I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.

First, we'll do it step by step:

Let's create some data:

raw = ['ABC', 'AAA', 'BA', 'DD']

things = [list(s) for s in raw]
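which gives a list of character lists:

things
Out[]: [['A', 'B', 'C'], ['A', 'A', 'A'], ['B', 'A'], ['D', 'D']]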

Then import the needed packages and initialize the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

Next we generate a matrix of counts

matrix = cv.fit_transform(things)

names = ["count_"+n for n in cv.get_feature_names()]

And save as a data frame

df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)

Generating a data frame like this:

     count_A  count_B  count_C  count_D
ABC        1        1        1        0
AAA        3        0        0        0
BA         1        1        0        0
DD         0        0        0        2

Elegant version:

Everything above in one line

df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
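Note that newer scikit-learn releases have removed get_feature_names() in favour of get_feature_names_out(); on those versions the one-liner becomes:

df = pd.DataFrame(data=cv.fit_transform(things).toarray(),
                  columns=["count_" + n for n in cv.get_feature_names_out()],
                  index=raw)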

Timing:

You mentioned that you're working with a rather large dataset, so I used the %%timeit magic to get a time estimate.

Previous response by @piRSquared (which otherwise looks very good!)

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

100 loops, best of 3: 3.27 ms per loop

My answer:

pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)

1000 loops, best of 3: 1.08 ms per loop

According to my testing, CountVectorizer is about 3x faster.

Update dataframe values based on conditions without a for loop

Do you want something like this?

mask = (~df.index.isin(values))
df.loc[mask, 'a2'] = df.loc[mask].index
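A minimal, self-contained illustration under assumed data (the original frame and values list aren't shown in this excerpt):

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'a2': [10, 20, 30, 40]})
values = [1, 3]                           # index labels to leave untouched (assumed)

mask = ~df.index.isin(values)             # rows whose index is NOT in values
df.loc[mask, 'a2'] = df.loc[mask].index   # overwrite a2 with the index for those rows

df
#    a2
# 0   0
# 1  20
# 2   2
# 3  40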

