Looping in Python: modify one column based on values in other columns
# Copy the existing column over
df['col2_corrected'] = df.col2
# Increment the values of only those items where col1 is A C or E
df.loc[df.col1.isin(['A', 'C', 'E']), 'col2_corrected'] += 1
df
Out[]:
  col1  col2  col2_corrected
0    A     1               2
1    B     3               3
2    C     3               4
3    D     7               7
4    E     4               5
5    C     3               4
You get that error because of the line if df.col1.isin(add_one_to_me):
If we take a look at df.col1.isin(add_one_to_me):
Out[]:
0 True
1 False
2 True
3 False
4 True
5 True
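This is a boolean Series with one value per row, and it does not work with the if statement: if needs a single True or False, so pandas raises an error because the truth value of a Series is ambiguous. What you could have done is check each item in col1 one at a time and increment col2_corrected by one where it matches. This could be done with df.apply(...) or with for index, row in df.iterrows():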
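To make the explanation above concrete, here is a minimal sketch. The data is rebuilt from the output table, and the add_one_to_me list is assumed to be ['A', 'C', 'E'] as in the first snippet. It shows the failing if, then the row-by-row alternatives.

```python
import pandas as pd

# Data rebuilt from the table above; add_one_to_me is an assumed name/value
df = pd.DataFrame({"col1": list("ABCDEC"), "col2": [1, 3, 3, 7, 4, 3]})
add_one_to_me = ["A", "C", "E"]

# A whole boolean Series cannot be used as an `if` condition
try:
    if df.col1.isin(add_one_to_me):
        pass
    raised = False
except ValueError:
    raised = True

# Row-by-row alternative with iterrows(): test one value at a time
df["col2_corrected"] = df.col2
for index, row in df.iterrows():
    if row.col1 in add_one_to_me:
        df.loc[index, "col2_corrected"] += 1

# Same idea with apply(axis=1): the function receives one row per call
via_apply = df.apply(
    lambda row: row.col2 + 1 if row.col1 in add_one_to_me else row.col2,
    axis=1,
)
```

Both row-by-row versions give the same column as the .loc one-liner, but they run Python code per row, so the vectorized form is usually preferable.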
Modify values in a column based on condition from another
The problem with your current method is the output of each subsequent iteration overwrites the output of the one before it. So you'd end up with output for just the last item and nothing more.
Instead, select all rows whose col1 value is in items and assign once, the same way you did before.
df['math'] = df.loc[df.col1.isin(items), 'col3'] * 10
Or,
df['math'] = df.query("col1 in @items").col3 * 10
Or even,
df['math'] = df.col3.where(df.col1.isin(items)) * 10
df
  col1  col2  col3  math
0    A     2     0   0.0
1    A     1     1  10.0
2    B     9     9   NaN
3  NaN     8     4   NaN
4    D     7     2  20.0
5    C     4     3   NaN
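The NaN rows in that output come from index alignment: the right-hand side only contains the selected rows, and the assignment fills every other row with NaN. A small sketch of this, with the frame rebuilt from the table above and items assumed to be ['A', 'D'] (the original list is not shown):

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the output above; `items` is an assumption
df = pd.DataFrame({
    "col1": ["A", "A", "B", np.nan, "D", "C"],
    "col2": [2, 1, 9, 8, 7, 4],
    "col3": [0, 1, 9, 4, 2, 3],
})
items = ["A", "D"]

# Only the matching rows survive the selection (index 0, 1 and 4)
selected = df.loc[df.col1.isin(items), "col3"] * 10

# Assignment aligns on the index; unmatched rows become NaN
df["math"] = selected
```

Because of the NaN values, the math column ends up as float even though col3 holds integers.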
Modify one column values based on multiple conditions of another column
For better performance, use numpy.select instead of apply. It is also possible to set a default value for rows that match none of the conditions:
masks = [(df['A'] >= 0) & (df['A'] < 50),
         (df['A'] >= 50) & (df['A'] < 70),
         (df['A'] >= 70) & (df['A'] <= 100)]
vals = [df['B'], df['B'] / 3, df['B']/df['C']/3]
df['B'] = np.select(masks, vals, default=0)
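As a small self-contained illustration of the numpy.select call above, the sketch below uses a hypothetical four-row frame (the values are made up) with one row per condition plus one row that matches nothing and therefore gets the default:

```python
import numpy as np
import pandas as pd

# Hypothetical data: A falls in each of the three ranges, then outside all of them
df = pd.DataFrame({
    "A": [10, 60, 80, 150],
    "B": [30.0, 30.0, 30.0, 30.0],
    "C": [2, 2, 2, 2],
})

masks = [(df["A"] >= 0) & (df["A"] < 50),
         (df["A"] >= 50) & (df["A"] < 70),
         (df["A"] >= 70) & (df["A"] <= 100)]
vals = [df["B"], df["B"] / 3, df["B"] / df["C"] / 3]

# For each row, take the value paired with the first mask that is True;
# rows where no mask is True get the default
df["B"] = np.select(masks, vals, default=0)
```

Note that every entry in vals is computed for all rows before the selection happens, which is why a division by zero in an unused branch can still emit a warning.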
Performance: in the timings below it is roughly 900 times faster (424 ms vs. 468 µs):
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10000, 3)), columns=list('ABC'))
# Jeril's solution, using apply
In [74]: %timeit df['B1'] = df.apply(Standard, axis=1)
__main__:18: RuntimeWarning: divide by zero encountered in double_scalars
424 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [75]: %timeit df['B'] = np.select(masks, vals, default=0)
468 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Iterate over values in DataFrame column, and update said value if condition is met
If you really want to do it with a loop, you can simply index the rows by position, like an array:
import pandas as pd
df = pd.DataFrame()
df['date'] = [-3000, 3000, 1000, 5000]
for i in range(len(df['date'])):
    # Write through df.loc[i, 'date']; chained df['date'][i] = ... may modify a copy
    if df.loc[i, 'date'] > 2000:
        df.loc[i, 'date'] -= 2400
    elif df.loc[i, 'date'] < -2000:
        df.loc[i, 'date'] += 2400
df
But I would use a simpler method with .loc, passing the row mask and the column name together so the assignment is applied to df itself:
df.loc[df['date'] > 2000, 'date'] -= 2400
df.loc[df['date'] < -2000, 'date'] += 2400
df
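The same update can also be written as a single expression. This is a sketch using nested numpy.where, which is a different technique from the two shown above, on the same sample values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": [-3000, 3000, 1000, 5000]})

d = df["date"]
# First condition that holds wins; values inside [-2000, 2000] are left unchanged
df["date"] = np.where(d > 2000, d - 2400,
                      np.where(d < -2000, d + 2400, d))
```

Both conditions are evaluated against the original values here, so each row is adjusted at most once.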
Update Pandas Cells based on Column Values and Other Columns
The most elegant solution is definitely CountVectorizer from sklearn.
I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.
First, we'll do it step by step.
Let's create some data:
raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]
Then import the packages and initialize the count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
Next we generate a matrix of counts
matrix = cv.fit_transform(things)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
names = ["count_"+n for n in cv.get_feature_names_out()]
And save as a data frame
df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)
Generating a data frame like this:
     count_A  count_B  count_C  count_D
ABC        1        1        1        0
AAA        3        0        0        0
BA         1        1        0        0
DD         0        0        0        2
Elegant version:
Everything above in one line
df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)
Timing:
You mentioned that you're working with a rather large dataset, so I used the %%timeit magic to give a time estimate.
Previous response by @piRSquared (which otherwise looks very good!)
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
100 loops, best of 3: 3.27 ms per loop
My answer:
pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)
1000 loops, best of 3: 1.08 ms per loop
According to my testing, CountVectorizer is about 3x faster.
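If you want to sanity-check the counts table without installing sklearn, one option is the sketch below. It swaps CountVectorizer for collections.Counter from the standard library, so it is a cross-check of the result rather than the method timed above, and it makes no claim about relative speed.

```python
from collections import Counter

import pandas as pd

raw = ['ABC', 'AAA', 'BA', 'DD']

# One Counter per string; letters missing from a string become NaN, then 0
counts = pd.DataFrame([Counter(s) for s in raw], index=raw)
counts = counts.fillna(0).astype(int).sort_index(axis=1).add_prefix("count_")
```

The resulting frame has the same rows, columns and values as the table produced by the CountVectorizer version.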
Update dataframe values based on conditions without for loop
Do you want something like this?
mask = (~df.index.isin(values))
df.loc[mask, 'a2'] = df.loc[mask].index
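Since the original frame is not shown, here is a sketch on made-up data to show what those two lines do. The index labels, the a2 values and the values list are all hypothetical:

```python
import pandas as pd

# Hypothetical frame: index labels 10, 20, 30 and a placeholder column a2
df = pd.DataFrame({"a2": [0, 0, 0]}, index=[10, 20, 30])
values = [20]  # index labels that should be left untouched

# True for every row whose index label is NOT in `values`
mask = ~df.index.isin(values)

# Overwrite a2 with the row's own index label on the masked rows only
df.loc[mask, "a2"] = df.loc[mask].index
```

Rows whose label appears in values keep their old a2, and every other row gets its index label, with no explicit loop.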