Pandas: how to merge two dataframes on a column by keeping the information of the first one?
Sample:
df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
'Age': [34, 18, 44, 27, 30]})
#print (df1)
df3 = df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)
Use map with a Series created by set_index:
df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
Name Age Sex
0 Tom 34 M
1 Sara 18 NaN
2 Eva 44 F
3 Jack 27 M
4 Laura 30 NaN
Alternative solution using merge with a left join:
df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
Name Age Sex
0 Tom 34 M
1 Sara 18 NaN
2 Eva 44 F
3 Jack 27 M
4 Laura 30 NaN
If you need to map by multiple columns (e.g. Year and Code), you need merge with a left join:
df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
'Year':[2000,2003,2003,2004,2007],
'Code':[1,2,3,4,4],
'Age': [34, 18, 44, 27, 30]})
print (df1)
Name Year Code Age
0 Tom 2000 1 34
1 Sara 2003 2 18
2 Eva 2003 3 44
3 Jack 2004 4 27
4 Laura 2007 4 30
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
'Sex': ['M', 'M', 'F', 'M', 'F'],
'Year':[2001,2003,2003,2004,2007],
'Code':[1,2,3,5,3],
'Val':[21,34,23,44,67]})
print (df2)
Name Sex Year Code Val
0 Tom M 2001 1 21
1 Paul M 2003 2 34
2 Eva F 2003 3 23
3 Jack M 2004 5 44
4 Michelle F 2007 3 67
#merge on Year and Code, bringing in all columns of df2
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
Name_x Year Code Age Name_y Sex Val
0 Tom 2000 1 34 NaN NaN NaN
1 Sara 2003 2 18 Paul M 34.0
2 Eva 2003 3 44 Eva F 23.0
3 Jack 2004 4 27 NaN NaN NaN
4 Laura 2007 4 30 NaN NaN NaN
#select columns first - the join columns (Year, Code) are always needed, plus the columns to append (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
Name Year Code Age Val
0 Tom 2000 1 34 NaN
1 Sara 2003 2 18 34.0
2 Eva 2003 3 44 23.0
3 Jack 2004 4 27 NaN
4 Laura 2007 4 30 NaN
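To see which rows of df1 actually found a partner, merge also accepts indicator=True, which adds a _merge column. A minimal sketch reusing the Year/Code data from above (the names checked and matched are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year': [2000, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 4, 4],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Year': [2001, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 5, 3],
                    'Val': [21, 34, 23, 44, 67]})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for rows of df1 without a match in df2
checked = df1.merge(df2, on=['Year', 'Code'], how='left', indicator=True)
matched = checked['_merge'].astype(str).eq('both')
print(checked[['Name', 'Val', '_merge']])
```

Rows where matched is False are exactly the ones that get NaN in Val.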
If you get an error with map, it means there are duplicates in the join column, here Name:
df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
'Age': [34, 18, 44, 27, 30]})
print (df1)
Name Age
0 Tom 34
1 Sara 18
2 Eva 44
3 Jack 27
4 Laura 30
df3, df4 = df1.copy(), df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
'Val': [1,2,3,4,5]})
print (df2)
Name Val
0 Tom 1 <-duplicated name Tom
1 Tom 2 <-duplicated name Tom
2 Eva 3
3 Jack 4
4 Michelle 5
s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
The solutions are to remove the duplicates with DataFrame.drop_duplicates, or to use map with a dict, which keeps the last duplicate match:
#default keep first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom 1
Eva 3
Jack 4
Michelle 5
Name: Val, dtype: int64
df1['New'] = df1['Name'].map(s)
print (df1)
Name Age New
0 Tom 34 1.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN
#add parameter for keep last value
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom 2
Eva 3
Jack 4
Michelle 5
Name: Val, dtype: int64
df3['New'] = df3['Name'].map(s)
print (df3)
Name Age New
0 Tom 34 2.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN
#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
df4['New'] = df4['Name'].map(d)
print (df4)
Name Age New
0 Tom 34 2.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN
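Before mapping, it can help to check whether the lookup column has duplicates at all. A small sketch, assuming the same df1 and the df2 with the duplicated Tom from above: Series.duplicated flags the repeated keys, and merge with validate='many_to_one' raises pandas.errors.MergeError when the keys on the right side are not unique. The names dupes and merge_ok are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# keep=False marks every occurrence of a repeated key, not only the later ones
dupes = df2[df2['Name'].duplicated(keep=False)]
print(dupes)

# validate='many_to_one' checks that the join keys are unique in df2
try:
    df1.merge(df2, on='Name', how='left', validate='many_to_one')
    merge_ok = True
except pd.errors.MergeError as exc:
    merge_ok = False
    print('merge refused:', exc)
```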
Merge two DataFrames on two columns and keep the order of the original indexes in the result
When constructing the merged DataFrame, call reset_index on each DataFrame so that their original index values are carried along as the columns index_x and index_y.
merged_df = pd.merge(df1.reset_index(), df2.reset_index(), how="outer", on=['key1', 'key2'])
Use combine_first to combine index_x and index_y:
merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)
Sort by combined_index and index_x, then drop the columns that are no longer needed and reset the index.
output = merged_df.sort_values(
['combined_index', 'index_x']
).drop(
['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)
This results in the following output:
key1 key2 Value1 Value2
0 K a5 apple NaN
1 K a9 NaN apple
2 K a4 guava NaN
3 A1 a7 kiwi kiwi
4 A3 a9 NaN grape
5 A2 a9 grape NaN
6 B1 b2 banana banana
7 C2 c7 NaN guava
8 B9 b8 peach NaN
9 C3 c1 berry orange
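The steps of this answer can be sketched end to end. The two small frames below are made up for illustration (the original df1 and df2 are not shown here), and reset_index is used so that index_x and index_y exist as columns after the merge:

```python
import pandas as pd

# hypothetical inputs, not the data from the question
df1 = pd.DataFrame({'key1': ['K', 'A1', 'B1'],
                    'key2': ['a5', 'a7', 'b2'],
                    'Value1': ['apple', 'kiwi', 'banana']})
df2 = pd.DataFrame({'key1': ['K', 'A1', 'C2'],
                    'key2': ['a9', 'a7', 'c7'],
                    'Value2': ['apple', 'kiwi', 'guava']})

# reset_index turns each original index into a column named 'index';
# the merge then suffixes them to index_x (left) and index_y (right)
merged_df = pd.merge(df1.reset_index(), df2.reset_index(),
                     how='outer', on=['key1', 'key2'])

# original position of each row, preferring the left frame's position
merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)

# NaN in index_x sorts last, so left rows come before right-only rows
output = (merged_df.sort_values(['combined_index', 'index_x'])
                   .drop(['index_x', 'index_y', 'combined_index'], axis=1)
                   .reset_index(drop=True))
print(output)
```

Rows that share an original position are interleaved, with the row from df1 placed first.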
Merge dataframes based on column, only keeping first match
Use drop_duplicates to keep only the first row per key:
df = df_1.merge(df_2.drop_duplicates('Fruit'),how='left',on='Fruit')
print (df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
If you want to add only one column, map is faster:
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
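A self-contained sketch of both variants. The input frames are assumptions chosen to reproduce the output shown above (the question's data is not included here), and the names merged and mapped are only for illustration:

```python
import pandas as pd

# hypothetical inputs: df_2 lists some fruits more than once
df_1 = pd.DataFrame({'Index': [1, 2, 3],
                     'Fruit': ['Apple', 'Banana', 'Peach']})
df_2 = pd.DataFrame({'Fruit': ['Apple', 'Apple', 'Banana', 'Peach', 'Peach'],
                     'Taste': ['Tasty', 'Rotten', 'Tasty', 'Rotten', 'Tasty']})

# variant 1: left merge against the de-duplicated right frame
merged = df_1.merge(df_2.drop_duplicates('Fruit'), how='left', on='Fruit')

# variant 2: map through a Series indexed by the (now unique) key
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
mapped = df_1.assign(Taste=df_1['Fruit'].map(s))

print(merged)
print(mapped)
```

Both variants keep the first Taste seen for each Fruit, because drop_duplicates defaults to keep='first'.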
How to merge two dataframes with preserving the same order of one of them?
Try DataFrame.stack first, then DataFrame.merge:
df2_stack = df2.stack().reset_index(level=1)
level_1 0
0 A txt1
0 B txt1
0 C txt1
1 A txt2
1 C txt2
2 A txt3
2 C txt3
# rename columns after stack
df2_stack.columns = ["name", "text"]
name text
0 A txt1
0 B txt1
0 C txt1
1 A txt2
1 C txt2
2 A txt3
2 C txt3
df.merge(df2_stack, on=['name','text'])
name text desc
0 A txt2 text2
1 A txt1 text1
2 A txt3 text3
3 B txt1 text1
4 C txt2 text2
5 C txt3 text3
6 C txt1 text1
Pandas left merge keeping data in right dataframe on duplicate columns
Frankenstein Answer
df[['ser', 'no']].merge(df2, 'left').set_axis(df.index).fillna(df)
ser no c d
0 0 0 1.0 NaN
1 0 1 1.0 NaN
2 0 2 1.0 NaN
3 1 0 1.0 NaN
4 1 1 1.0 NaN
5 1 2 88.0 90.0
6 2 0 1.0 NaN
7 2 1 1.0 NaN
8 2 2 1.0 NaN
Explanation
I'm going to merge on the columns ['ser', 'no'] and don't want to specify them in the merge call. Also, I don't want goofy duplicate column names like 'c_x' and 'c_y', so I slice only the columns that I want in common and then merge:
df[['ser', 'no']].merge(df2, 'left')
When I merge, I want only rows from the left dataframe. However, merge usually produces a number of rows vastly different from the original dataframes and therefore produces a new index. NOTE: assuming the right dataframe (df2) has NO DUPLICATES with respect to ['ser', 'no'], a 'left' merge should produce exactly the same number of rows as the left dataframe (df). But it won't necessarily have the same index. It turns out that in this example it does, but I don't want to take chances, so I use set_axis:
set_axis(df.index)
Finally, since the resulting dataframe has the same index and columns as df, I can fill in the missing bits with:
fillna(df)
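To try the whole chain, here is a sketch with inputs that are assumptions reconstructed from the output shown above (the original df and df2 are not included here): df holds a 3x3 grid of ser/no with c equal to 1, and df2 holds one overriding row.

```python
import pandas as pd

# hypothetical inputs reconstructed from the printed result
df = pd.DataFrame({'ser': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'no': [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   'c': 1})
df2 = pd.DataFrame({'ser': [1], 'no': [2], 'c': [88], 'd': [90]})

result = (df[['ser', 'no']]
          .merge(df2, how='left')   # joins on the shared columns ser and no
          .set_axis(df.index)       # put the left frame's index back
          .fillna(df))              # fill gaps from df, aligned on index and columns
print(result)
```

Column d stays NaN outside the matched row because df has no d column to fill from.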
Merge two data frames based on common column values in Pandas
We can merge two DataFrames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
To merge on columns that have different names in the two DataFrames, specify the left and right column names explicitly, for example 'movie_title' in df1 and 'movie_name' in df2.
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
For more options, see the pandas documentation for the merge operation.
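A small sketch of the left_on/right_on case. The frames and their values are made up for illustration; only the column names movie_title and movie_name come from the answer above.

```python
import pandas as pd

# hypothetical data: the key column is named differently in each frame
df1 = pd.DataFrame({'movie_title': ['Film A', 'Film B', 'Film C'],
                    'year': [2001, 2002, 2003]})
df2 = pd.DataFrame({'movie_name': ['Film B', 'Film C', 'Film D'],
                    'score': [1, 2, 3]})

dfinal = df1.merge(df2, how='inner',
                   left_on='movie_title', right_on='movie_name')

# with left_on/right_on both key columns are kept, so drop the redundant one
dfinal = dfinal.drop(columns='movie_name')
print(dfinal)
```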
Merging two dataframes by keeping certain column values in R
We may use rows_update from dplyr:
library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))
Output:
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
How to merge two pandas DataFrames and keep repeated values?
It is as simple as:
df1 = pd.DataFrame({'Name':['John','John','John','Paul','Paul','Jimmy'], 'Book':['B1','B2','B1','B3','B4','B3']})
df2 = pd.DataFrame({'Name':['John','Paul','Jimmy'], 'Age':[25,18,28]})
df1.merge(df2)
Out[22]:
Name Book Age
0 John B1 25
1 John B2 25
2 John B1 25
3 Paul B3 18
4 Paul B4 18
5 Jimmy B3 28
Python Pandas merge only certain columns
You could merge the sub-DataFrame (with just those columns):
df2[list('xab')] # df2 but only with columns x, a, and b
df1.merge(df2[list('xab')])
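A runnable sketch of the same idea, with made-up frames in which x is the shared key and c is a column that should stay out of the result (all data here is an assumption for illustration):

```python
import pandas as pd

# hypothetical data: only columns a and b of df2 are wanted
df1 = pd.DataFrame({'x': [1, 2, 3], 'y': ['p', 'q', 'r']})
df2 = pd.DataFrame({'x': [1, 2, 4],
                    'a': [10, 20, 40],
                    'b': [0.1, 0.2, 0.4],
                    'c': ['skip', 'skip', 'skip']})

# select the key plus the wanted columns before merging
result = df1.merge(df2[['x', 'a', 'b']], on='x')
print(result)
```

The key column has to be part of the selection, otherwise merge has nothing to join on.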