Python Compounded Return Group By
Assume we have dataframe:
DATE      STOCK  RETURN  Year  Month
1/1/2020  A      0.02    2020  1
2/1/2020  A      0.01    2020  1
3/1/2020  A      0.04    2020  1
1/1/2020  B      0.03    2020  1
2/1/2020  B      0.01    2020  1
3/1/2020  B      0.04    2020  1
1/2/2020  A      0.05    2020  2
2/2/2020  A      0.02    2020  2
3/2/2020  A      0.01    2020  2
4/2/2020  A      0.04    2020  2
1/2/2020  B      0.01    2020  2
2/2/2020  B      0.03    2020  2
3/2/2020  B      0.02    2020  2
4/2/2020  B      0.05    2020  2
To calculate the geometric mean of daily returns for every month and stock, use:
df.groupby(['STOCK','Year','Month'])[['RETURN']].apply(lambda g: ((g+1).cumprod()**(1/len(g))-1).iloc[-1])
result:
                    RETURN
STOCK Year Month
A     2020 1      0.023258
           2      0.029879
B     2020 1      0.026591
           2      0.027394
You can verify the calculation manually; for example, for stock A in January:
(1.02*1.01*1.04)**(1/3) - 1 = 0.023258
Note that iloc[-1] in the lambda selects the last cumulative value of each group.
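The recipe above can be sketched end to end. This rebuilds the example frame and computes the same per-group geometric mean with prod(), which is equivalent to taking the last element of the cumulative product:

```python
import pandas as pd

# Rebuild the example frame from the table above
df = pd.DataFrame({
    "STOCK":  ["A"] * 3 + ["B"] * 3 + ["A"] * 4 + ["B"] * 4,
    "RETURN": [0.02, 0.01, 0.04, 0.03, 0.01, 0.04,
               0.05, 0.02, 0.01, 0.04, 0.01, 0.03, 0.02, 0.05],
    "Year":   [2020] * 14,
    "Month":  [1] * 6 + [2] * 8,
})

# Geometric mean of daily returns per stock and month:
# compound the gross returns, then take the n-th root.
geo = df.groupby(["STOCK", "Year", "Month"])["RETURN"].apply(
    lambda g: (1 + g).prod() ** (1 / len(g)) - 1
)
print(geo.round(6))
```

Using prod() instead of cumprod()...iloc[-1] avoids materializing the whole running product when only the final value is needed.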
Calculating cumulative returns with pandas dataframe
If performance is important, use numpy.cumprod directly on the underlying array:
np.cumprod(1 + df['Daily_rets'].values) - 1
Timings:
#7k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [191]: %timeit np.cumprod(1 + df['Daily_rets'].values) - 1
41 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [192]: %timeit (1 + df.Daily_rets).cumprod() - 1
554 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
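The two timed expressions compute identical values; the NumPy version is faster only because it skips the index machinery. A minimal sketch (assuming a small frame with the Daily_rets column from above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Daily_rets": [0.02, -0.01, 0.03, 0.005]})

# NumPy: operate on the raw ndarray, skipping index alignment
fast = np.cumprod(1 + df["Daily_rets"].values) - 1

# Pandas: same math, but returns a Series with the original index
slow = (1 + df["Daily_rets"]).cumprod() - 1

# Both yield the same cumulative return path
assert np.allclose(fast, slow.values)
print(fast)
```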
Calculate Daily Returns with Pandas DataFrame
Because pandas operations align on the index, convert one of the Series to a NumPy array first:
prices[:-1].values / prices[1:] - 1
or
prices[:-1] / prices[1:].values - 1
depending on which index you want the result to have.
Or use the shift() method:
prices.shift(1) / prices - 1
which matches the slicing versions above, while the conventional daily return (today's price over yesterday's) is:
prices / prices.shift(1) - 1
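A minimal sketch of the conventional form with a hypothetical price series; the first row is NaN because there is no prior price to compare against:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0],
                   index=pd.date_range("2020-01-01", periods=4))

# Conventional daily return: today's price over yesterday's, minus 1
rets = prices / prices.shift(1) - 1
print(rets)
```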
Compute a compounded return series in Python
There is a fantastic module called pandas, written by a guy at AQR (a hedge fund), that excels at calculations like this... what you need is a way to handle missing data... as someone mentioned above, the basics are the nan (not a number) capabilities of scipy or numpy; however, even those libraries don't make financial calculations much easier. If you use pandas, you can mark the data you don't want to consider as nan, and any future calculations will reject it while performing normal operations on the other data.
I have been using pandas on my trading platform for about 8 months... I wish I had started using it sooner.
Wes (the author) gave a talk at PyCon 2010 about the capabilities of the module... see the slides and video on the PyCon 2010 webpage. In that video, he demonstrates how to get daily returns, run thousands of linear regressions on a matrix of returns (in a fraction of a second), timestamp / graph data... all done with this module. Combined with psyco, this is a beast of a financial analysis tool.
The other great thing it handles is cross-sectional data... so you could grab daily close prices, their rolling means, etc., then timestamp every calculation, and get all of this stored in something similar to a python dictionary (see the pandas.DataFrame class)... then you access slices of the data as simply as:
close_prices['stdev_5d']
See the pandas rolling moments doc for more information on how to calculate the rolling stdev (it's a one-liner).
Wes has gone out of his way to speed the module up with cython, although I'll concede that I'm considering upgrading my server (an older Xeon), due to my analysis requirements.
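In modern pandas the rolling moments mentioned above live on the .rolling() accessor. A minimal sketch, assuming a hypothetical frame with a close column, that produces a stdev_5d column like the one accessed above:

```python
import pandas as pd

close_prices = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3]}
)

# 5-day rolling standard deviation of the close; the first four
# rows are NaN because the window is not yet full.
close_prices["stdev_5d"] = close_prices["close"].rolling(5).std()
print(close_prices)
```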
EDIT FOR STRIMP's QUESTION:
After you converted your code to use pandas data structures, it's still unclear to me how you're indexing your data in a pandas dataframe, and what the compounding function requires for handling missing data (or, for that matter, the conditions for a 0.0 return... or whether you are using NaN in pandas). I will demonstrate using my data indexing... a day was picked at random... df is a dataframe with ES futures quotes in it, indexed per second... missing quotes are filled in with numpy.nan. DataFrame indexes are datetime objects, localized by the pytz module's timezone objects.
>>> df.info
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Index: 86400 entries, 2011-03-21 00:00:00-04:00 to 2011-03-21 23:59:59-04:00
etf 18390 non-null values
etfvol 18390 non-null values
fut 29446 non-null values
futvol 23446 non-null values
...
>>> # ET is a pytz object...
>>> et
<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>
>>> # To get the futures quote at 9:45, eastern time...
>>> df.xs(et.localize(dt.datetime(2011,3,21,9,45,0)))['fut']
1291.75
>>>
To give a simple example of how to calculate a column of continuous returns (in a pandas.TimeSeries) that references the quote 10 minutes ago (filling in for missing ticks), I would do this:
>>> df['fut'].ffill() / df['fut'].ffill().shift(600)
No lambda is required in that case, just dividing the column of values by itself 600 seconds ago. The .shift(600) is because my data is indexed per second.
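The pad-then-shift idea can be sketched on a toy per-second series. This is an illustration, not the author's actual data; it uses a 3-second lag instead of the 600-second (10-minute) one so the effect is visible on a few rows:

```python
import numpy as np
import pandas as pd

# Hypothetical per-second quote series with missing ticks (NaN)
idx = pd.date_range("2011-03-21 09:30:00", periods=8, freq="s")
fut = pd.Series([1291.00, np.nan, 1291.25, np.nan,
                 np.nan, 1291.75, 1292.00, np.nan], index=idx)

# Forward-fill missing quotes, then compare each quote to the one
# 3 seconds earlier (the answer uses shift(600) for a 10-minute lag)
padded = fut.ffill()
ratio = padded / padded.shift(3)
print(ratio)
```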
HTH,
\mike
Pandas: Cumulative return function
There is a pandas cumprod() method for that, and it works on every column at once. (Note that the older .ix indexer used in the timings below has since been removed from pandas; .loc works the same way here.)
df.loc["Cumulative"] = ((df+1).cumprod()-1).iloc[-1]
This is about twice as fast as the other solutions on a large dataset:
In[106]: %timeit df.ix["Cumulative"] = ((df+1).cumprod()-1).iloc[-1]
10 loops, best of 3: 18.4 ms per loop
In[107]: %timeit df.ix['Cummulative'] = df.apply(lambda x: (x+1).prod()-1)
10 loops, best of 3: 32.9 ms per loop
In[110]: %timeit df.append(df.iloc[:,1:].apply(lambda col: (col + 1).prod() - 1), ignore_index=True)
10 loops, best of 3: 37.1 ms per loop
In[113]: %timeit df.append(df.apply(lambda col: prod([(1+c) for c in col]) - 1), ignore_index=True)
1 loop, best of 3: 262 ms per loop
I would suggest never using apply if you can find a built-in method, since apply loops over the dataframe, which makes it slow. Built-in methods are highly efficient, and you are normally not going to beat them with apply.
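A self-contained sketch of the cumprod() approach, using hypothetical per-stock return columns:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.02, 0.01, 0.04],
    "B": [0.03, 0.01, 0.04],
})

# Total compounded return per column: compound the gross returns,
# then take the last row of the cumulative product.
df.loc["Cumulative"] = ((df + 1).cumprod() - 1).iloc[-1]
print(df)
```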
How do I calculate a compounding return in a vector indexed to 100?
We can take the cumulative product of the returns expressed as coefficients, i.e. 1 + pct_monthly_return
dat$index = 100 * cumprod(1+dat$pct_monthly_return)
Then to add the header row you described:
rbind(data.frame(date = "2020-12-31", index = 100, pct_monthly_return = 0),
dat)
date index pct_monthly_return
1 2020-12-31 100.0000 0.0000
2 2021-01-31 100.2300 0.0023
3 2021-02-28 101.2323 0.0100
4 2021-03-31 104.7754 0.0350
Sample data
dat <- tibble::tribble( ~date , ~pct_monthly_return,
"2021-01-31", 0.0023,
"2021-02-28", 0.01,
"2021-03-31", 0.035)