How to do (or work around) a conditional join in Python Pandas?
Well, I can think of a few ways:
- Essentially blow up the dataframe by just merging on the exact field (company)... then filter on the 30-day windows after the merge.
  - Should be fast but could use lots of memory.
- Move the merging and filtering on the 30-day window into a groupby().
  - Results in a merge for each group, so slower but should use less memory.
Option #1
Suppose your data looks like the following (I expanded your sample data):
print df
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print windows
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
print windows
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select the rows where date falls between beg_date and end_date:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30-day window sums by grouping on company and end_date:
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
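For reference, here is Option #1 end-to-end as a self-contained Python 3 sketch of the same steps, with the sample data reconstructed from the tables above:

```python
import pandas as pd

df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15",
                                "2010-04-01", "2010-05-15"]),
})

# beginning of each 30-day window
windows["beg_date"] = windows["end_date"] - pd.Timedelta(days=30)

# blow up the dataframe by merging on company, then filter to the windows
merged = df.merge(windows, on="company", how="left")
merged = merged[(merged.date >= merged.beg_date) & (merged.date <= merged.end_date)]

# 30-day window sums per (company, end_date)
sums = merged.groupby(["company", "end_date"])["measure"].sum()
print(sums)
```

Selecting the "measure" column before summing avoids summing the date columns, which older pandas silently dropped as nuisance columns but newer versions do not.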
Option #2
Move all the merging into a groupby(). This should be better on memory, but I would expect it to be much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
def cond_merge(g, windows):
    # merge one company's rows with its windows, keep rows inside a window
    g = g.merge(windows, on='company', how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()
print df.groupby('company').apply(cond_merge,windows)
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another option
Now if your windows never overlap (as in the example data), you could do the following as an alternative that doesn't blow up a dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe; backfilling the end dates (by group) then gives you a structure that makes it easy to create your summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                  np.timedelta64(30,'D'))
print df
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select the end dates you are interested in. This could be quite memory intensive too.
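A sketch of that resample-plus-rolling approach on the sample data above. Note the 31-day rolling window: the inclusive span [end_date − 30 days, end_date] covers 31 calendar days, and pandas' time-based rolling windows exclude the left edge by default, so '30D' would drop rows landing exactly on beg_date:

```python
import pandas as pd

df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15",
                                "2010-04-01", "2010-05-15"]),
})

results = {}
for company, g in df.groupby("company"):
    # fill a daily grid so the time-based rolling window sees every day
    daily = g.set_index("date")["measure"].resample("D").sum()
    # rolling sum over the trailing 31 calendar days (see note above)
    roll = daily.rolling("31D").sum()
    # pick out only the window end dates we care about
    ends = windows.loc[windows["company"] == company, "end_date"]
    results[company] = roll.reindex(ends)
```

This stays memory-heavy when the date range is long relative to the number of observations, since the daily grid materializes every day in between.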
Pandas: how to make a join based on a condition between two columns of 2 separate dataframes
One-liner:
df3 = pd.merge(df1, df2, on="B")\
        .where((df1.D - df2.D) == 1)\
        .dropna()\
        .reset_index(drop=True)
out:
A B C D_x E F Y D_y
0 1.0 2015-02-27 1.0 5.0 train foo 1.0 4.0
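One caveat: the .where((df1.D - df2.D) == 1) step relies on index alignment between the merged result and the original frames, which only works when the merge preserves row order and length. Filtering on the suffixed columns of the merged frame itself is more robust. A sketch with hypothetical sample frames shaped like the output above:

```python
import pandas as pd

# hypothetical data matching the column layout of the output above
df1 = pd.DataFrame({"A": [1.0], "B": pd.to_datetime(["2015-02-27"]),
                    "C": [1.0], "D": [5.0]})
df2 = pd.DataFrame({"E": ["train", "test"], "F": ["foo", "bar"],
                    "Y": [1.0, 2.0],
                    "B": pd.to_datetime(["2015-02-27", "2015-02-27"]),
                    "D": [4.0, 2.0]})

# merge on the exact key, then filter using the suffixed D columns
merged = pd.merge(df1, df2, on="B", suffixes=("_x", "_y"))
df3 = merged[merged["D_x"] - merged["D_y"] == 1].reset_index(drop=True)
```

Here only the df2 row with D = 4.0 survives the condition D_x − D_y == 1.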
Join in Pandas Dataframe using conditional join statement
The idea is to rename df2's columns and left-join on both columns first, then fill the missing values by matching on column1 alone. It is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the col1 values used for matching are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
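A minimal runnable demonstration of this approach. The column names follow the answer; the data itself is made up, with one row matching on both columns and one row falling back to the column1 match:

```python
import pandas as pd

# hypothetical frames: ('a', 1) matches both columns; ('b', 9) only column1
df1 = pd.DataFrame({"column1": ["a", "b"], "column2": [1, 9]})
df2 = pd.DataFrame({"col1": ["a", "b"], "col2": [1, 2], "col3": ["x", "y"]})

# rename and left-join on both columns first
df22 = df2.rename(columns={"col1": "column1", "col2": "column2"})
df = df1.merge(df22, on=["column1", "column2"], how="left")

# fill the misses by mapping on column1 alone (deduplicated lookup)
s = df2.drop_duplicates("col1").set_index("col1")["col3"]
df["col3"] = df["col3"].fillna(df["column1"].map(s))
```

The ('b', 9) row gets NaN from the two-column join and is then filled with 'y' via the column1 lookup.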
EDIT: A general solution that works with multiple columns. The first part is the same left join; the second part merges by one column and uses DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
How to do a conditional join (like a SQL WHERE join) of DataFrames in Python
If you wish to compute the distance for all row combinations across the tables, you can: 1) create a cartesian product of the tables, 2) compute the distance, and 3) filter on your threshold. This is memory-inefficient because you expand out all row combinations, but it is at least straightforward to compute:
import pandas as pd
from geopy.distance import geodesic
# create a dummy key to join all rows from df1 to df2:
df1["dummy_key"] = 0
df2["dummy_key"] = 0
# create cartesian product table
df3 = pd.merge(left=df1, right=df2, on="dummy_key").drop(columns=["dummy_key"])
# apply geodesic (newer version of geopy.distance.vincenty) to get the distance in meters for each row
dist = df3.apply(lambda row: geodesic((row["SG_lat"], row["SG_long"]), (row["lat_dgr"], row["long_dgr"])).m, axis=1)
# filter for rows that you desire:
df3 = df3[dist < 100]
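On pandas ≥ 1.2 the dummy key is unnecessary: merge(how="cross") builds the cartesian product directly. And if a spherical approximation is acceptable (slightly less accurate than geopy's geodesic), a vectorized haversine avoids the row-wise apply entirely. A sketch with made-up coordinates, reusing the column names from the answer:

```python
import numpy as np
import pandas as pd

# hypothetical sample frames; column names follow the answer above
df1 = pd.DataFrame({"SG_lat": [1.3521], "SG_long": [103.8198]})
df2 = pd.DataFrame({"lat_dgr": [1.3521, 1.3600],
                    "long_dgr": [103.8198, 103.8200]})

def haversine_m(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in meters (spherical approximation)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * np.arcsin(np.sqrt(a))

# cartesian product without the dummy key (pandas >= 1.2)
df3 = df1.merge(df2, how="cross")
dist = haversine_m(df3["SG_lat"], df3["SG_long"],
                   df3["lat_dgr"], df3["long_dgr"])
df3 = df3[dist < 100]
```

The second df2 point is roughly 880 m away and is filtered out; only the identical coordinate pair survives the 100 m threshold.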
Python Pandas - Conditional Join
Even if you try that and it works, you will not get the right answer: the key is going to be duplicated four times.
d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')
Out[8]:
Name_x dept Name_y
0 maria 1 maria
1 maria 1 anna
2 anna 1 maria
3 anna 1 anna
4 joao 2 joao
5 jorge 3 jorge
The best way around it is to groupby:
dd = df.groupby('dept').agg(list)
Out[10]:
Name
dept
1 [maria, anna]
2 [joao]
3 [jorge]
Then apply pd.Series:
dd['Name'].apply(pd.Series)
Out[21]:
0 1
dept
1 maria anna
2 joao NaN
3 jorge NaN
Pandas dataframe conditional inner join with itself
itertools.combinations will give you the tuple pairs you are looking for. Once you have those, you can explode the tuple combinations into rows. Then your Weight is the group size of each pair; in this case 1, because each pair exists in only one group.
import pandas as pd
import numpy as np
from itertools import combinations
df = pd.DataFrame({'Group': ['a1', 'a1', 'a1', 'a1', 'a2'],
'Person': ['p1', 'p2', 'p3', 'p4', 'p1']})
df = (
df.groupby('Group')['Person']
.apply(lambda x: tuple(combinations(x,2)))
.explode()
.dropna()
.reset_index()
)
df['Weight'] = df.groupby('Person').transform(np.size)
df[['Person1','Person2']] = df['Person'].apply(pd.Series)
df = df[['Person1','Person2','Weight']]
print(df)
Output
Person1 Person2 Weight
0 p1 p2 1
1 p1 p3 1
2 p1 p4 1
3 p2 p3 1
4 p2 p4 1
5 p3 p4 1
Conditional Merge Pandas Dataframe with Dataframe to add column
You can simplify your solution a lot:
s = traits_df['Trait_Count'].value_counts(normalize=True, dropna=False, ascending=True)
traits_df = traits_df.join(s.rename('Trait_Count_Rarity'), on='Trait_Count')
instead of:
traitcount_df = traits_df['Trait_Count'].value_counts(normalize=True, dropna=False, ascending=True).to_frame()
rarity_conditions = [
(traits_df['Trait_Count'] == 4),
(traits_df['Trait_Count'] == 5),
(traits_df['Trait_Count'] == 6),
(traits_df['Trait_Count'] == 7)
]
rarity_values = [.0254, .2540, .5323, .1883]
traits_df['Trait_Count_Rarity'] = np.select(rarity_conditions, rarity_values)
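The simplified version in a runnable form, with hypothetical trait counts. value_counts(normalize=True) yields each count's frequency, and join(..., on='Trait_Count') looks each row's count up in that frequency table:

```python
import pandas as pd

# hypothetical data: counts 4, 5, 6, 7 appearing 1, 2, 3, 4 times
traits_df = pd.DataFrame({"Trait_Count": [4, 5, 5, 6, 6, 6, 7, 7, 7, 7]})

# frequency of each Trait_Count value, as a proportion of all rows
s = traits_df["Trait_Count"].value_counts(normalize=True, dropna=False)

# attach each row's frequency by looking its count up in s
traits_df = traits_df.join(s.rename("Trait_Count_Rarity"), on="Trait_Count")
```

Unlike the np.select version, this needs no hand-maintained list of conditions and values, and it stays correct when the data changes.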