Pandas Timedelta in Days
This requires pandas 0.11 or later (0.11rc1 is out; the final release is probably due next week).
In [8]: from pandas import DataFrame, Timestamp
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ], columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
                  age               today                diff      years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00  12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00   8.887671
The odd-looking apply at the end is needed because pandas 0.11 does not yet fully support timedelta64[ns] scalars (in the way Timestamp is now used for datetime64[ns]); that support is coming in 0.12.
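For readers on a recent pandas release, the same calculation can probably be written without apply. The following is a minimal sketch, assuming a pandas version in which timedelta columns expose the .dt accessor; the column names mirror the session above, and dividing by 365 ignores leap years just as the original does.

```python
import pandas as pd

# Same two dates as in the session above
df = pd.DataFrame({"age": pd.to_datetime(["2001-01-01", "2004-06-01"])})
df["today"] = pd.Timestamp("2013-04-19")

# Subtracting two datetime columns yields a timedelta column
df["diff"] = df["today"] - df["age"]

# .dt.days extracts the whole-day component; dividing by 365 is an approximation
df["years"] = df["diff"].dt.days / 365

print(df)
```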
Extracting number of days from timedelta column in pandas
IMO, a better idea would be to convert to timedelta and extract the days component.
pd.to_timedelta(df.Aging, errors='coerce').dt.days
0 -84
1 -46
2 -131
3 -131
4 -130
5 -80
Name: Aging, dtype: int64
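The original Aging column is not shown, so the following is only a sketch on made-up data. It illustrates what errors='coerce' does: values that cannot be parsed become NaT, and their day count comes back as NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample data; the real Aging column is not shown in the question
aging = pd.Series(["-84 days", "-46 days", "not a duration"], name="Aging")

# Unparseable entries are coerced to NaT instead of raising an error
as_timedelta = pd.to_timedelta(aging, errors="coerce")

# .dt.days gives the day component; NaT rows become NaN
days = as_timedelta.dt.days
print(days)
```

Because of the NaN, the result is a float column here; with fully valid input it stays integer, as in the output above.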
If you insist on using string methods, you can use str.extract:
pd.to_numeric(
df.Aging.str.extract('(.*?) days', expand=False),
errors='coerce')
0 -84
1 -46
2 -131
3 -131
4 -130
5 -80
Name: Aging, dtype: int32
Or, using str.split:
pd.to_numeric(df.Aging.str.split(' days').str[0], errors='coerce')
0 -84
1 -46
2 -131
3 -131
4 -130
5 -80
Name: Aging, dtype: int64
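As a quick sanity check, the three approaches can be run side by side. This is a sketch on assumed sample strings (the original data is not shown), and it only confirms that they agree when every value has the plain "N days" form.

```python
import pandas as pd

# Hypothetical sample data in the "N days" form
aging = pd.Series(["-84 days", "-46 days", "-131 days"], name="Aging")

# 1. Parse as timedelta and take the day component
via_timedelta = pd.to_timedelta(aging, errors="coerce").dt.days

# 2. Capture everything before " days" with a regex
via_extract = pd.to_numeric(
    aging.str.extract(r"(.*?) days", expand=False), errors="coerce")

# 3. Split on " days" and keep the first piece
via_split = pd.to_numeric(aging.str.split(" days").str[0], errors="coerce")

print(via_timedelta.tolist(), via_extract.tolist(), via_split.tolist())
```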
Remove the days in the timedelta object
I think you can subtract the days component, converted to timedeltas:
td = pd.to_timedelta(['-1 days +02:45:00','1 days +02:45:00','0 days +02:45:00'])
df = pd.DataFrame({'td': td})
df['td'] = df['td'] - pd.to_timedelta(df['td'].dt.days, unit='d')
print (df.head())
td
0 02:45:00
1 02:45:00
2 02:45:00
print (type(df.loc[0, 'td']))
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
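It may help to see why this also works for negative values. In the sketch below (variable names are illustrative), the day component of '-1 days +02:45:00' is -1, so subtracting it adds one day back and leaves only the time part.

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["-1 days +02:45:00", "1 days +02:45:00"]))

# Day components: -1 for the negative value, 1 for the positive one
days = td.dt.days

# Subtracting the day component leaves the time-of-day part in both cases
time_part = td - pd.to_timedelta(days, unit="D")
print(time_part)
```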
Or convert the timedeltas to strings and extract the text between "days" and ".":
df['td'] = df['td'].astype(str).str.extract(r'days (.*?)\.', expand=False)
print (df.head())
td
0 +02:45:00
1 02:45:00
2 02:45:00
print (type(df.loc[0, 'td']))
<class 'str'>
Pandas dataframe Timedelta format: with days or with cumulative hours
Why does this change occur?
All the time strings in the list are shorter than 24 hours, which means they all have day = 0. Therefore, when you print the df, pandas doesn't display the days. If you change one value, say 12:05:00 to 25:05:00, you get the following output:
Duration Cumulative
0 0 days 01:07:37 0 days 01:07:37
1 0 days 13:16:44 0 days 14:24:21
2 0 days 11:09:56 1 days 01:34:17
3 1 days 01:05:00 2 days 02:39:17
4 0 days 01:33:01 2 days 04:12:18
Now that the Duration column contains differing day values, pandas displays them.
How can I control it?
You don't have to worry about the difference in output. When you need the individual values, use the components attribute, which returns a namedtuple:
print(df['Duration'].iloc[0].components)
output:
Components(days=0, hours=1, minutes=7, seconds=37, milliseconds=0, microseconds=0, nanoseconds=0)
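If the goal is a display with cumulative hours instead of days, one option is to format the values yourself. The sketch below uses a hypothetical helper, format_cumulative_hours, which is not part of pandas, and assumes non-negative durations.

```python
import pandas as pd

def format_cumulative_hours(td):
    """Hypothetical helper: render a non-negative Timedelta as HH:MM:SS
    with hours allowed to exceed 24."""
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

durations = pd.to_timedelta(["01:07:37", "25:05:00"])

# components splits the value into days, hours, minutes, ...
parts = durations[1].components
print(parts.days, parts.hours, parts.minutes)

# The helper folds the days back into the hour count
formatted = [format_cumulative_hours(td) for td in durations]
print(formatted)
```

Applied to a column, this would look like df['Duration'].apply(format_cumulative_hours), which produces strings, so keep the original timedelta column for any further arithmetic.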
Convert timedelta of days into years
Take the days attribute: timedelta(days=5511).days returns the number of days as an int. Divide that by 365 to get an approximate number of years: timedelta(days=5511).days/365.
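A small sketch of that calculation with the standard library follows. Dividing by 365 ignores leap years; the second divisor, 365.2425 (the mean Gregorian year length), is an alternative added here for comparison and is not part of the original answer.

```python
from datetime import timedelta

delta = timedelta(days=5511)

# .days is a plain int
days = delta.days

# Approximate years, ignoring leap years
years_365 = days / 365

# Alternative assumption: average Gregorian year length
years_avg = days / 365.2425

print(days, round(years_365, 2), round(years_avg, 2))
```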
Grouping by date range (timedelta) with Pandas
You can use a groupby with a custom group:
# convert to datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# set up groups of consecutive dates within ± 3 days
# group_keys=False keeps the original index so the result aligns with df
group = (s.groupby(df['user_id'], group_keys=False)
          .apply(lambda s: s.diff().abs().gt('3days').cumsum())
        )
# group by ID and new group and aggregate
out = (df.groupby(['user_id', group], as_index=False)
         .agg({'date': 'last', 'val': 'sum'})
      )
output:
user_id date val
0 1 1-2-17 3
1 2 1-2-17 2
2 2 1-10-17 1
3 3 1-1-17 1
4 3 2-5-17 8
intermediates (sorted by user_id for clarity):
user_id date val datetime diff abs >3days cumsum
0 1 1-1-17 1 2017-01-01 NaT NaT False 0
3 1 1-1-17 1 2017-01-01 0 days 0 days False 0
4 1 1-2-17 1 2017-01-02 1 days 1 days False 0
1 2 1-1-17 1 2017-01-01 NaT NaT False 0
5 2 1-2-17 1 2017-01-02 1 days 1 days False 0
6 2 1-10-17 1 2017-01-10 8 days 8 days True 1
2 3 1-1-17 1 2017-01-01 NaT NaT False 0
7 3 2-1-17 1 2017-02-01 31 days 31 days True 1
8 3 2-2-17 1 2017-02-02 1 days 1 days False 1
9 3 2-3-17 2 2017-02-03 1 days 1 days False 1
10 3 2-4-17 3 2017-02-04 1 days 1 days False 1
11 3 2-5-17 1 2017-02-05 1 days 1 days False 1
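The core of the trick in the table above is the diff, compare, cumsum chain. Here is a minimal sketch for a single user with made-up dates, using an explicit pd.Timedelta for the threshold instead of the '3days' string:

```python
import pandas as pd

# Hypothetical dates for one user, already sorted
dates = pd.Series(pd.to_datetime(
    ["2017-01-01", "2017-01-02", "2017-01-10", "2017-01-11"]))

# Gap to the previous row (NaT for the first row)
gap = dates.diff().abs()

# True wherever the gap exceeds 3 days, i.e. where a new group starts
# (the NaT in the first row compares as False)
new_group = gap > pd.Timedelta(days=3)

# Running count of group starts gives each row its group number
group_id = new_group.cumsum()
print(group_id.tolist())
```

Each True marks the start of a new run of nearby dates, so the cumulative sum stays constant within a run and increases by one at each large gap.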