Merge two data frames based on common column values in Pandas
Two data frames can be merged in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
If the common column has a different name in each dataframe, say 'movie_title' in one and 'movie_name' in the other, specify the left and right column names explicitly:
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
For more control over the join, see the documentation of the pandas merge operation.
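The two calls above can be tried end to end with a minimal sketch. The sample data (titles, years, ratings) is hypothetical and only illustrates how `on` differs from `left_on`/`right_on`:

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df1 = pd.DataFrame({"movie_title": ["Alien", "Heat", "Up"],
                    "year": [1979, 1995, 2009]})
df2 = pd.DataFrame({"movie_title": ["Heat", "Up", "Jaws"],
                    "rating": [8.3, 8.3, 8.1]})

# Same key name on both sides: use `on`.
same_name = df1.merge(df2, on="movie_title", how="inner")

# Different key names: use left_on / right_on. Both key columns are kept.
df2_renamed = df2.rename(columns={"movie_title": "movie_name"})
diff_name = df1.merge(df2_renamed, how="inner",
                      left_on="movie_title", right_on="movie_name")

print(same_name)
print(diff_name.columns.tolist())
```

Note that with left_on/right_on the result carries both key columns, so you may want to drop one of them afterwards.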
JOIN two dataframes on common column in python
Use merge:
print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left').drop('id1', axis=1))
id name count price rating
0 1 a 10 100.0 1.0
1 2 b 20 200.0 2.0
2 3 c 30 300.0 3.0
3 4 d 40 NaN NaN
4 5 e 50 500.0 5.0
Another solution is to simply rename the column:
print (pd.merge(df1, df2.rename(columns={'id1':'id'}), on='id', how='left'))
id name count price rating
0 1 a 10 100.0 1.0
1 2 b 20 200.0 2.0
2 3 c 30 300.0 3.0
3 4 d 40 NaN NaN
4 5 e 50 500.0 5.0
If you need only the price column, the simplest approach is map:
df1['price'] = df1.id.map(df2.set_index('id1')['price'])
print (df1)
id name count price
0 1 a 10 100.0
1 2 b 20 200.0
2 3 c 30 300.0
3 4 d 40 NaN
4 5 e 50 500.0
Two more solutions:
print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left')
.drop(['id1', 'rating'], axis=1))
id name count price
0 1 a 10 100.0
1 2 b 20 200.0
2 3 c 30 300.0
3 4 d 40 NaN
4 5 e 50 500.0
print (pd.merge(df1, df2[['id1','price']], left_on='id', right_on='id1', how='left')
.drop('id1', axis=1))
id name count price
0 1 a 10 100.0
1 2 b 20 200.0
2 3 c 30 300.0
3 4 d 40 NaN
4 5 e 50 500.0
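The input frames are not shown in this answer, so here is a rough sketch that reconstructs them from the printed results (an assumption) and compares the merge and map approaches:

```python
import pandas as pd

# Sample frames inferred from the printed results; the original inputs are not shown.
df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                    "name": list("abcde"),
                    "count": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"id1": [1, 2, 3, 5],
                    "price": [100, 200, 300, 500],
                    "rating": [1, 2, 3, 5]})

# Left join on differently named keys, then drop the redundant key column.
merged = pd.merge(df1, df2, left_on="id", right_on="id1", how="left").drop("id1", axis=1)

# Single-column lookup with map; assign() returns a new frame and leaves df1 untouched.
price_lookup = df2.set_index("id1")["price"]
with_price = df1.assign(price=df1["id"].map(price_lookup))

print(merged)
print(with_price)
```

Both variants leave NaN for id 4, which has no match in df2. The map variant expects the lookup keys (id1 here) to be unique.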
Merging 2 dataframes by common column values under a common column name in R
library(data.table)
# convert both data frames to data.table by reference
setDT(df1)
setDT(df2)
# inner join on ID
df1[df2, on=.(ID), nomatch=0]
Merge multiple dataframes based on a common column
Use merge and reduce:
In [86]: from functools import reduce
In [87]: reduce(lambda x,y: pd.merge(x,y, on='Col1', how='outer'), [df1, df2, df3])
Out[87]:
Col1 Col2 Col3 Col4 Col5 Col6 Col7
0 data1 3 4 7.0 4.0 NaN NaN
1 data2 4 3 6.0 9.0 5.0 8.0
2 data3 2 3 1.0 4.0 2.0 7.0
3 data4 2 4 NaN NaN NaN NaN
4 data5 1 4 NaN NaN 5.0 3.0
Details
In [88]: df1
Out[88]:
Col1 Col2 Col3
0 data1 3 4
1 data2 4 3
2 data3 2 3
3 data4 2 4
4 data5 1 4
In [89]: df2
Out[89]:
Col1 Col4 Col5
0 data1 7 4
1 data2 6 9
2 data3 1 4
In [90]: df3
Out[90]:
Col1 Col6 Col7
0 data2 5 8
1 data3 2 7
2 data5 5 3
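The same steps as a standalone script, using the three frames printed above. The helper name merge_all is made up for this sketch:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"Col1": ["data1", "data2", "data3", "data4", "data5"],
                    "Col2": [3, 4, 2, 2, 1],
                    "Col3": [4, 3, 3, 4, 4]})
df2 = pd.DataFrame({"Col1": ["data1", "data2", "data3"],
                    "Col4": [7, 6, 1],
                    "Col5": [4, 9, 4]})
df3 = pd.DataFrame({"Col1": ["data2", "data3", "data5"],
                    "Col6": [5, 2, 5],
                    "Col7": [8, 7, 3]})


def merge_all(frames, key, how="outer"):
    """Merge a list of frames pairwise, left to right, on one shared key."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how=how), frames)


result = merge_all([df1, df2, df3], key="Col1")
print(result)
```

reduce folds the list from the left, so this is equivalent to merging df1 with df2 first and then merging that result with df3.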
Joining two dataframes on matching values of two common columns in R
We can use left_join:
library(dplyr)
library(tidyr)
A %>%
mutate(week = as.character(week)) %>%
left_join(B) %>%
mutate(fill = replace_na(fill, 0))
merge two dataframes with some common columns where the combining of the common needs to be a custom function
You can concatenate the dataframes and then group by the column names to apply an operation to the similarly named columns. In this case you can get away with taking the sum and then casting back to bool to get the or operation.
import pandas as pd
# axis must be passed by keyword in pandas 2.0+; groupby(axis=1) is deprecated as of pandas 2.1
df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).sum().astype(bool)
Output:
0.0 0.5 0.7
12.5 True True True
14.0 True True False
15.5 False True False
If you need to do this in a less case-specific manner, again group by the columns and apply a function to the grouped object over axis=1:
df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).apply(lambda x: x.any(axis=1))
# 0.0 0.5 0.7
#12.5 True True True
#14.0 True True False
#15.5 False True False
Further, you can define a custom combining function. Here's one which adds twice the left Frame to 4 times the right Frame. If there is only one column, it returns 2x the left frame.
Sample Data
left:
0.0 0.5
12.5 1 11
14.0 2 17
15.5 3 17
right:
0.7 0.5
12.5 4 2
14.0 4 -1
15.5 5 5
Code
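def my_func(x):
    # x holds the columns that share one name: left first, then right (if present)
    try:
        res = x.iloc[:, 0]*2 + x.iloc[:, 1]*4
    except IndexError:
        # only one column with this name
        res = x.iloc[:, 0]*2
    return res

df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).apply(my_func)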
Output:
0.0 0.5 0.7
12.5 2 30 8
14.0 4 30 8
15.5 6 54 10
Finally, if you want to do this consecutively, make use of reduce. Here I'll combine 5 DataFrames with the above function. (I'll just repeat the right Frame 4x for the example.)
from functools import reduce
def my_comb(df_l, df_r, func):
    """Concatenate df_l and df_r along axis=1, then apply the
    specified function to each group of same-named columns.
    """
    df = pd.concat([df_l, df_r], axis=1)
    return df.groupby(df.columns, axis=1).apply(func)

reduce(lambda dfl, dfr: my_comb(dfl, dfr, func=my_func), [left, right, right, right, right])
# 0.0 0.5 0.7
#12.5 16 296 176
#14.0 32 212 176
#15.5 48 572 220
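Since grouping along axis=1 is deprecated in recent pandas releases, here is a rough sketch of the same idea that avoids it. It uses the sample left and right frames from above; the names weighted and summed are made up for this sketch, and the dictionary comprehension is a swapped-in technique, not the approach used in the answer:

```python
import pandas as pd

idx = [12.5, 14.0, 15.5]
left = pd.DataFrame({0.0: [1, 2, 3], 0.5: [11, 17, 17]}, index=idx)
right = pd.DataFrame({0.7: [4, 4, 5], 0.5: [2, -1, 5]}, index=idx)

df = pd.concat([left, right], axis=1)

# Plain sum of same-named columns: transpose, group the (now row) labels, transpose back.
summed = df.T.groupby(level=0).sum().T


def weighted(group):
    """2x the first column plus 4x the second one, if there is a second one."""
    res = group.iloc[:, 0] * 2
    if group.shape[1] > 1:
        res = res + group.iloc[:, 1] * 4
    return res


# Custom combination: select the columns for each distinct label and combine them.
result = pd.DataFrame({label: weighted(df.loc[:, df.columns == label])
                       for label in df.columns.unique()})

print(summed)
print(result)
```

The result frame should match the output of my_func shown earlier, with columns in order of first appearance.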
Combine two pandas Data Frames (join on a common column)
You can use merge to combine two dataframes into one:
import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')
where on specifies the field name that exists in both dataframes to join on, and how defines whether it's an inner/outer/left/right join, with outer using 'union of keys from both frames (SQL: full outer join).' Since you have a 'star' column in both dataframes, this will by default create two columns, star_x and star_y, in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing them as a kwarg. The default is suffixes=('_x', '_y'). If you wanted something like star_restaurant_id and star_restaurant_review, you can do:
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))
The parameters are explained in detail in this link.
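To see the effect of suffixes on the overlapping 'star' column, here is a small sketch. The frame contents and the short variable names are hypothetical; only business_id and star come from the question:

```python
import pandas as pd

# Hypothetical data; only the column names business_id and star come from the text above.
restaurants = pd.DataFrame({"business_id": ["a1", "b2"], "star": [4.0, 3.5]})
reviews = pd.DataFrame({"business_id": ["a1", "a1", "c3"], "star": [5, 3, 4]})

# Default suffixes: the overlapping column becomes star_x / star_y.
default = pd.merge(restaurants, reviews, on="business_id", how="outer")

# Custom suffixes give the overlapping columns more descriptive names.
custom = pd.merge(restaurants, reviews, on="business_id", how="outer",
                  suffixes=("_restaurant_id", "_restaurant_review"))

print(default.columns.tolist())
print(custom.columns.tolist())
```

Suffixes are only applied to overlapping non-key columns, so business_id keeps its name in both results.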
Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]
We can merge, then perform a query to drop rows not within the range:
(df1.merge(df2, on=['Date', 'BillNo.'])
.query('abs(Amount_x - Amount_y) <= 5')
.drop('Amount_x', axis=1))
Date BillNo. Amount_y
0 10/08/2020 ABBCSQ1ZA 876
1 10/16/2020 AA171E1Z0 5491
This works well as long as there is only one row that corresponds to a specific (Date, BillNo) combination in each frame.
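A minimal sketch of the same filter as a reusable function. The amounts and the third bill number are hypothetical, and merge_within is a made-up helper name. It uses a boolean mask instead of query, and passes validate="one_to_one" so that pandas raises an error if a (Date, BillNo.) combination is duplicated in either frame:

```python
import pandas as pd

# Hypothetical amounts; the original input frames are not shown.
df1 = pd.DataFrame({"Date": ["10/08/2020", "10/16/2020", "10/20/2020"],
                    "BillNo.": ["ABBCSQ1ZA", "AA171E1Z0", "ZZ000X1Y2"],
                    "Amount": [878, 5490, 100]})
df2 = pd.DataFrame({"Date": ["10/08/2020", "10/16/2020", "10/20/2020"],
                    "BillNo.": ["ABBCSQ1ZA", "AA171E1Z0", "ZZ000X1Y2"],
                    "Amount": [876, 5491, 250]})


def merge_within(left, right, keys, col, n):
    """Inner-join on keys and keep rows where |left col - right col| <= n."""
    # validate="one_to_one" fails fast if the key combination is not unique.
    merged = left.merge(right, on=keys, validate="one_to_one")
    diff = (merged[f"{col}_x"] - merged[f"{col}_y"]).abs()
    return merged[diff <= n]


close = merge_within(df1, df2, keys=["Date", "BillNo."], col="Amount", n=5)
print(close.drop("Amount_x", axis=1))
```

The third row is dropped because its amounts differ by more than 5.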
Merging two dataframes with one common column name
You could accomplish this using an outer join. Here is the code for it:
import pandas as pd
train_id = pd.read_csv("train_id.csv")
train_up = pd.read_csv("train_up")
train_merged = train_id.merge(train_up, on=["ID"], how="outer")