How to Do/Workaround a Conditional Join in Python Pandas

Well, I can think of a few ways:

  1. Essentially blow up the dataframe by just merging on the exact field (company), then filter on the 30-day windows after the merge.
  • Should be fast but could use lots of memory.

  2. Move the merging and filtering on the 30-day window into a groupby().
  • Results in a merge for each group, so slower, but should use less memory.

Option #1

Suppose your data looks like the following (I expanded your sample data):

print df

    company        date  measure
0         0  2010-01-01       10
1         0  2010-01-15       10
2         0  2010-02-01       10
3         0  2010-02-15       10
4         0  2010-03-01       10
5         0  2010-03-15       10
6         0  2010-04-01       10
7         1  2010-03-01        5
8         1  2010-03-15        5
9         1  2010-04-01        5
10        1  2010-04-15        5
11        1  2010-05-01        5
12        1  2010-05-15        5

print windows

   company    end_date
0        0  2010-02-01
1        0  2010-03-15
2        1  2010-04-01
3        1  2010-05-15

Create a beginning date for the 30-day windows:

import numpy as np

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30, 'D'))
print windows

   company    end_date    beg_date
0        0  2010-02-01  2010-01-02
1        0  2010-03-15  2010-02-13
2        1  2010-04-01  2010-03-02
3        1  2010-05-15  2010-04-15

Now do a merge and then select the rows whose date falls between beg_date and end_date:

df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df

    company        date  measure    end_date    beg_date
2         0  2010-01-15       10  2010-02-01  2010-01-02
4         0  2010-02-01       10  2010-02-01  2010-01-02
7         0  2010-02-15       10  2010-03-15  2010-02-13
9         0  2010-03-01       10  2010-03-15  2010-02-13
11        0  2010-03-15       10  2010-03-15  2010-02-13
16        1  2010-03-15        5  2010-04-01  2010-03-02
18        1  2010-04-01        5  2010-04-01  2010-03-02
21        1  2010-04-15        5  2010-05-15  2010-04-15
23        1  2010-05-01        5  2010-05-15  2010-04-15
25        1  2010-05-15        5  2010-05-15  2010-04-15

You can then compute the 30-day window sums by grouping on company and end_date:

print df.groupby(['company','end_date']).sum()

                     measure
company end_date
0       2010-02-01        20
        2010-03-15        30
1       2010-04-01        10
        2010-05-15        15

Option #2

Move all the merging into a groupby. This should be better on memory, but I would expect it to be much slower:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30, 'D'))

def cond_merge(g, windows):
    g = g.merge(windows, on='company', how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge,windows)

company  end_date
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15

Another option

Now, if your windows never overlap (as in the example data), you could do something like the following as an alternative that doesn't blow up the dataframe and is still pretty fast:

windows['date'] = windows['end_date']

df = df.merge(windows,on=['company','date'],how='outer')
print df

    company        date  measure    end_date
0         0  2010-01-01       10         NaT
1         0  2010-01-15       10         NaT
2         0  2010-02-01       10  2010-02-01
3         0  2010-02-15       10         NaT
4         0  2010-03-01       10         NaT
5         0  2010-03-15       10  2010-03-15
6         0  2010-04-01       10         NaT
7         1  2010-03-01        5         NaT
8         1  2010-03-15        5         NaT
9         1  2010-04-01        5  2010-04-01
10        1  2010-04-15        5         NaT
11        1  2010-05-01        5         NaT
12        1  2010-05-15        5  2010-05-15

This merge essentially inserts your window end dates into the dataframe; backfilling the end dates (by group) then gives you a structure from which you can easily build your summation windows:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())

print df

    company        date  measure    end_date
0         0  2010-01-01       10  2010-02-01
1         0  2010-01-15       10  2010-02-01
2         0  2010-02-01       10  2010-02-01
3         0  2010-02-15       10  2010-03-15
4         0  2010-03-01       10  2010-03-15
5         0  2010-03-15       10  2010-03-15
6         0  2010-04-01       10         NaT
7         1  2010-03-01        5  2010-04-01
8         1  2010-03-15        5  2010-04-01
9         1  2010-04-01        5  2010-04-01
10        1  2010-04-15        5  2010-05-15
11        1  2010-05-01        5  2010-05-15
12        1  2010-05-15        5  2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                  np.timedelta64(30, 'D'))

print df

    company        date  measure    end_date    beg_date
0         0  2010-01-01       10  2010-02-01  2010-01-02
1         0  2010-01-15       10  2010-02-01  2010-01-02
2         0  2010-02-01       10  2010-02-01  2010-01-02
3         0  2010-02-15       10  2010-03-15  2010-02-13
4         0  2010-03-01       10  2010-03-15  2010-02-13
5         0  2010-03-15       10  2010-03-15  2010-02-13
7         1  2010-03-01        5  2010-04-01  2010-03-02
8         1  2010-03-15        5  2010-04-01  2010-03-02
9         1  2010-04-01        5  2010-04-01  2010-03-02
10        1  2010-04-15        5  2010-05-15  2010-04-15
11        1  2010-05-01        5  2010-05-15  2010-04-15
12        1  2010-05-15        5  2010-05-15  2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()

                     measure
company end_date
0       2010-02-01        20
        2010-03-15        30
1       2010-04-01        10
        2010-05-15        15
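For what it's worth, on more recent pandas the same non-overlapping-window trick can be expressed with pd.merge_asof, which attaches to each row the nearest following end_date within a tolerance. This is only a sketch, assuming the original df and windows frames and, again, windows that never overlap; check the boundary behaviour (whether a date exactly 30 days before end_date should count) against your own convention:

import pandas as pd

# attach each measurement to the next window end within ~30 days
res = pd.merge_asof(
    df.sort_values('date'),
    windows[['company', 'end_date']].sort_values('end_date'),
    by='company',
    left_on='date',
    right_on='end_date',
    direction='forward',
    tolerance=pd.Timedelta('30D'))

print(res.groupby(['company', 'end_date'])['measure'].sum())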

Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select the dates you are interested in at the end. This could be quite memory intensive too.
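Here is a rough sketch of that idea using the modern .rolling API (rather than the old pd.rolling_sum); it assumes the original df and windows from the top of this answer, and the window length (31 daily rows here, so that both endpoints count) may need adjusting to match the beg_date <= date <= end_date convention above:

import pandas as pd

# daily totals per company, then trailing ~30-day sums
daily = (df.set_index('date')
           .groupby('company')['measure']
           .resample('D').sum())

rolled = daily.groupby(level='company').transform(
    lambda s: s.rolling(31, min_periods=1).sum())

# pick out the window end dates of interest
keys = list(zip(windows['company'], windows['end_date']))
print(rolled.loc[keys])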

Pandas: how to make a join based on a condition between two columns of 2 separate dataframes

One-liner:

df3 = pd.merge(df1, df2, on="B")\
        .where((df1.D - df2.D) == 1)\
        .dropna()\
        .reset_index(drop=True)

Out:

     A           B    C  D_x      E    F    Y  D_y
0  1.0  2015-02-27  1.0  5.0  train  foo  1.0  4.0
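Note that the .where call above relies on index alignment between the merged result and the original df1/df2, so it only behaves as intended when the merged rows line up one-to-one with the original rows. A sketch of a variant that filters on the merged frame's own columns instead (same df1/df2 assumed):

merged = pd.merge(df1, df2, on="B", suffixes=("_x", "_y"))
df3 = merged[merged["D_x"] - merged["D_y"] == 1].reset_index(drop=True)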

Join in Pandas Dataframe using conditional join statement

The idea is to rename df2's columns and do a left join on both columns first, then fill the remaining missing values by matching on column1 alone. Removing duplicates with DataFrame.drop_duplicates before Series.map is necessary here so that values in col1 are unique:

df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
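To make the fallback concrete, here is a small hypothetical example run through the snippet above (the data is made up purely for illustration; the column names follow the answer):

import pandas as pd

df1 = pd.DataFrame({'column1': ['a', 'a', 'b'],
                    'column2': ['x', 'y', 'z']})
df2 = pd.DataFrame({'col1': ['a', 'b'],
                    'col2': ['x', 'q'],
                    'col3': [1, 2]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')   # only ('a', 'x') matches on both columns
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))          # ('a', 'y') and ('b', 'z') fall back to column1
print(df)   # col3 becomes 1.0, 1.0, 2.0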

EDIT: A more general solution that works with multiple columns. The first part is the same (a left join); in the second part, a merge on one column combined with DataFrame.combine_first replaces the missing values:

df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]

df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)

How to do a conditional join (like a SQL WHERE join) of DataFrames in Python

If you wish to compute the distance for all row combinations across the tables, you can: 1) create a cartesian product of the tables, 2) compute the distance, and 3) filter on your threshold. This is memory inefficient, as you are expanding out all row combinations, but it is at least straightforward to compute:

import pandas as pd
from geopy.distance import geodesic

# create a dummy key to join all rows from df1 to df2:
df1["dummy_key"] = 0
df2["dummy_key"] = 0

# create cartesian product table
df3 = pd.merge(left=df1, right=df2, on="dummy_key").drop(columns=["dummy_key"])

# apply geodesic (newer version of geopy.distance.vincenty) to get the distance in meters for each row
dist = df3.apply(lambda row: geodesic((row["SG_lat"], row["SG_long"]), (row["lat_dgr"], row["long_dgr"])).m, axis=1)

# filter for rows that you desire:
df3 = df3[dist < 100]
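As an aside, pandas 1.2+ can build the same cartesian product without the dummy key via a cross merge, so the first two steps above collapse into one line:

# equivalent cartesian product, no dummy_key columns needed (requires pandas >= 1.2)
df3 = pd.merge(left=df1, right=df2, how="cross")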

Python Pandas - Conditional Join

Even if you try that and it works, you will not get the right answer. In fact, the key is going to be duplicated four times.

import pandas as pd

d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')

Out[8]:
  Name_x  dept Name_y
0  maria     1  maria
1  maria     1   anna
2   anna     1  maria
3   anna     1   anna
4   joao     2   joao
5  jorge     3  jorge

The best way around this is to use groupby:

dd = df.groupby('dept').agg(list)

Out[10]:
               Name
dept
1     [maria, anna]
2            [joao]
3           [jorge]

Then apply pd.Series:

dd['Name'].apply(pd.Series)

Out[21]:
          0     1
dept
1     maria  anna
2      joao   NaN
3     jorge   NaN

Pandas dataframe conditional inner join with itself

itertools.combinations will give you the tuple pairs you are looking for. Once you have those, you can explode the tuple combinations into rows. The weight is then the group size of each pair - in this case 1, because each pair exists in only one group.

import pandas as pd
import numpy as np
from itertools import combinations

df = pd.DataFrame({'Group': ['a1', 'a1', 'a1', 'a1', 'a2'],
                   'Person': ['p1', 'p2', 'p3', 'p4', 'p1']})

df = (
    df.groupby('Group')['Person']
      .apply(lambda x: tuple(combinations(x, 2)))
      .explode()
      .dropna()
      .reset_index()
)

df['Weight'] = df.groupby('Person').transform(np.size)
df[['Person1', 'Person2']] = df['Person'].apply(pd.Series)

df = df[['Person1', 'Person2', 'Weight']]

print(df)

Output

  Person1 Person2  Weight
0      p1      p2       1
1      p1      p3       1
2      p1      p4       1
3      p2      p3       1
4      p2      p4       1
5      p3      p4       1
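As a minor aside, the df['Person'].apply(pd.Series) line in the snippet above can be swapped for a plain DataFrame constructor, which tends to be faster on large frames (a drop-in replacement for that single line):

# split each ('p1', 'p2') pair into two columns without apply(pd.Series)
df[['Person1', 'Person2']] = pd.DataFrame(df['Person'].tolist(), index=df.index)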

Conditional Merge Pandas Dataframe with Dataframe to add column

You can simplify your solution a lot:

s = traits_df['Trait_Count'].value_counts(normalize=True, dropna=False, ascending=True)
traits_df = traits_df.join(s.rename('Trait_Count_Rarity'), on='Trait_Count')

instead of:

traitcount_df = traits_df['Trait_Count'].value_counts(normalize=True, dropna=False, ascending=True).to_frame()

rarity_conditions = [
    (traits_df['Trait_Count'] == 4),
    (traits_df['Trait_Count'] == 5),
    (traits_df['Trait_Count'] == 6),
    (traits_df['Trait_Count'] == 7)
]

rarity_values = [.0254, .2540, .5323, .1883]

traits_df['Trait_Count_Rarity'] = np.select(rarity_conditions, rarity_values)

