Pandas: Calculate the Percentage Between Two Rows and Add the Value as a Column

Percentage change between two rows in pandas based on certain criteria

Just to clarify:
In your second output example, the value after 'B' is 10.8, not 11, am I right? Also, the value corresponding to price 11.1 is null? If so, then this primitive loop may work:

import pandas as pd

df = pd.DataFrame({
    'price': [10, 10.3, 11, 11.5, 11.1, 11, 10.8, 10],
    'value': ['B', 'H', 'H', 'S', None, 'B', 'H', 'SL']
})

# Indices of the 'B' rows and of the 'S'/'SL' rows
B_list = df[df['value'] == 'B'].index
S_list = df[(df['value'] == 'S') | (df['value'] == 'SL')].index

df['change'] = None
for el in range(len(B_list)):
    # Percentage change between the row right after each 'B' and the matching 'S'/'SL' row
    x = (df['price'].iloc[B_list[el] + 1] - df['price'].iloc[S_list[el]]) / df['price'].iloc[B_list[el] + 1]
    df.loc[[B_list[el] + 1], 'change'] = x

But it won't work if you have consecutive 'B's in the column (that's not the case in the example above).
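For reference, this is what the loop produces on the sample frame above (the numbers are worked out by hand from the formula, so treat them as approximate):

print(df[['price', 'value', 'change']])
# Only the rows right after each 'B' get a value:
#   row 1: (10.3 - 11.5) / 10.3 ≈ -0.1165
#   row 6: (10.8 - 10.0) / 10.8 ≈  0.0741
# every other row keeps change = None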

Pandas Calculate percentage by column values

Use Series.value_counts with normalize=True, then multiply by 100 and convert to a DataFrame:
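The question's frame isn't shown here; a minimal stand-in with an age column (hypothetical data, chosen so it reproduces the output below):

import pandas as pd

# Hypothetical input: three rows in '30-40', two in '40-50'
df = pd.DataFrame({'age': ['30-40', '30-40', '30-40', '40-50', '40-50']})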

df1 = (df['age'].value_counts(normalize=True)
                .mul(100)
                .rename_axis('age')
                .reset_index(name='percentage'))
print(df1)
     age  percentage
0  30-40        60.0
1  40-50        40.0

Calculating percentages for multiple columns

You can group by 'Page Name' and 'candidato' and sum each of 'Total Interactions', 'Likes', 'Comments', 'Shares', 'Love' and 'Angry' per page name and candidate; call this totals.

Then group totals by the first index level ('Page Name') and transform with sum, which broadcasts each page's overall total back onto its rows; divide totals by it and multiply by 100 to get the percentages.

Finally, join the two DataFrames for the final outcome.

totals = df.groupby(['Page Name','candidato'])[['Total Interactions','Likes','Comments','Shares','Love','Angry']].sum()
percentages = totals.groupby(level=0).transform('sum').rdiv(totals).mul(100).round(2)

out = totals.join(percentages, lsuffix='', rsuffix='_Percentages').reset_index()

This produces a DataFrame that can be used to draw the plot in the question.
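For a self-contained check, here is a minimal sketch with made-up numbers (the column names follow the question, the values do not):

import pandas as pd

# Made-up sample with the same column layout as the question
df = pd.DataFrame({
    'Page Name':          ['P1', 'P1', 'P2', 'P2'],
    'candidato':          ['A',  'B',  'A',  'B'],
    'Total Interactions': [10,   30,   5,    15],
    'Likes':              [4,    6,    2,    8],
    'Comments':           [3,    9,    1,    3],
    'Shares':             [1,    5,    1,    2],
    'Love':               [1,    8,    1,    1],
    'Angry':              [1,    2,    0,    1],
})

totals = df.groupby(['Page Name', 'candidato'])[['Total Interactions', 'Likes', 'Comments', 'Shares', 'Love', 'Angry']].sum()
percentages = totals.groupby(level=0).transform('sum').rdiv(totals).mul(100).round(2)
out = totals.join(percentages, lsuffix='', rsuffix='_Percentages').reset_index()
print(out)  # e.g. 'Total Interactions_Percentages' for P1/A is 10 / (10 + 30) * 100 = 25.0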

Python/DataFrame: Calculate percentage of occurrences/rows when value is greater than zero

You can use loc here. The previous code returned counts for the whole DataFrame, but in your case you need the count of a single Series (column):

a = df.loc[df['A'] > 0,'A'].count()/df['A'].count()
a
Out[58]: 0.5
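An equivalent one-liner is to take the mean of the boolean mask, since True counts as 1 (assuming column A has no NaN values, which df['A'].count() would otherwise exclude from the denominator):

a = (df['A'] > 0).mean()   # fraction of rows where A > 0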

Compare two columns based on last N rows in a pandas DataFrame

Grouping by 'ts_code' is just an ordinary groupby() call. DataFrame.rolling(), however, operates on a single column at a time, so it is tricky to apply when you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. Instead, I first wrote a function that manually slices the DataFrame into n-length sub-frames and applies the calculation to each one. idxmax() finds the index of the peak of the high column, then we take the min() of the low values that follow it. The rest is pretty straightforward.

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 20, 10],
                   ['A', 30, 5],
                   ['A', 40, 20],
                   ['A', 50, 10],
                   ['A', 20, 30],
                   ['B', 50, 10],
                   ['B', 30, 5],
                   ['B', 40, 20],
                   ['B', 10, 10],
                   ['B', 20, 30]],
                  columns=['ts_code', 'high', 'low'])


def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        # Index of the 'high' peak, then the lowest 'low' from that peak onward
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    # Slide an n-row window over the group and write the result on the window's last row
    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)

    return s


df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values

print(df)

If you prefer to use the rolling function, this method gives the same output:

def rolling_f(rolling_df):
    # The rolling window only hands us one column at a time, so look the
    # window's rows back up in the full frame via its index
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high


# rolling().apply() runs once per column, so both result columns hold the
# same values; keep the first one
df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]

print(df)

Finally, if you want to do a true rolling-window calculation that avoids any index lookup, you can use the numpy_ext library (https://pypi.org/project/numpy-ext/):

from numpy_ext import rolling_apply


def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values),
                         index=rolling_df.index)
    except ValueError:
        # Group shorter than the window: return all-NaN for that group
        return pd.Series(np.nan, index=rolling_df.index)


df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values

print(df)

output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75
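To see where one of these numbers comes from, take the l3 value on row 3 (group A, window = rows 1 to 3), reproduced by hand below:

# Window rows 1-3 of group A: high = [30, 40, 50], low = [5, 20, 10].
# The 'high' peak is the last row of the window, and the lowest 'low' from
# that peak onward is 10, so the value is 1 - 10/50 = 0.80, as in the table.
print(1 - 10 / 50)  # 0.8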

For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:

import time

def timeit(f):

    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result

    return timed

Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:

df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()

Finally, we run the three methods under the timing decorator:

@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()

Which gives us this output:

func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec

So, the fastest method is numpy_ext's rolling_apply, which makes sense because it is optimized for vectorized calculations. The second fastest is the custom function, which is reasonably efficient because it does some vectorized work while still performing Pandas lookups. The slowest method by far is Pandas' rolling function.


