Pandas: How to Merge Two Dataframes on a Column by Keeping the Information of the First One

Sample:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})

# keep an untouched copy of df1 for the merge example further down
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

Use map with a Series created by set_index, so that Name becomes the lookup index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
Name Age Sex
0 Tom 34 M
1 Sara 18 NaN
2 Eva 44 F
3 Jack 27 M
4 Laura 30 NaN

An alternative solution is merge with a left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
Name Age Sex
0 Tom 34 M
1 Sara 18 NaN
2 Eva 44 F
3 Jack 27 M
4 Laura 30 NaN
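
To see which rows of the first DataFrame found no partner, the indicator parameter of merge can help. A minimal sketch, reusing the sample data above; the names checked and unmatched are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
checked = df1.merge(df2[['Name', 'Sex']], on='Name', how='left', indicator=True)

# names from df1 that have no match in df2
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```

The _merge column can be dropped afterwards with drop(columns='_merge') if it is not needed.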

If you need to match on multiple columns (e.g. Year and Code), use merge with a left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
'Year':[2000,2003,2003,2004,2007],
'Code':[1,2,3,4,4],
'Age': [34, 18, 44, 27, 30]})

print (df1)
Name Year Code Age
0 Tom 2000 1 34
1 Sara 2003 2 18
2 Eva 2003 3 44
3 Jack 2004 4 27
4 Laura 2007 4 30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
'Sex': ['M', 'M', 'F', 'M', 'F'],
'Year':[2001,2003,2003,2004,2007],
'Code':[1,2,3,5,3],
'Val':[21,34,23,44,67]})
print (df2)
Name Sex Year Code Val
0 Tom M 2001 1 21
1 Paul M 2003 2 34
2 Eva F 2003 3 23
3 Jack M 2004 5 44
4 Michelle F 2007 3 67
# merge on Year and Code, bringing in all columns of df2 (Name exists in both, so it gets the _x/_y suffixes)
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
Name_x Year Code Age Name_y Sex Val
0 Tom 2000 1 34 NaN NaN NaN
1 Sara 2003 2 18 Paul M 34.0
2 Eva 2003 3 44 Eva F 23.0
3 Jack 2004 4 27 NaN NaN NaN
4 Laura 2007 4 30 NaN NaN NaN

# select only the needed columns: the join keys (Year, Code) are always required, plus the columns to append (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
Name Year Code Age Val
0 Tom 2000 1 34 NaN
1 Sara 2003 2 18 34.0
2 Eva 2003 3 44 23.0
3 Jack 2004 4 27 NaN
4 Laura 2007 4 30 NaN

If map raises an error, it means the join column (here Name) contains duplicates in the lookup DataFrame:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
'Age': [34, 18, 44, 27, 30]})

print (df1)
Name Age
0 Tom 34
1 Sara 18
2 Eva 44
3 Jack 27
4 Laura 30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
'Val': [1,2,3,4,5]})
print (df2)
Name Val
0 Tom 1 <-duplicated name Tom
1 Tom 2 <-duplicated name Tom
2 Eva 3
3 Jack 4
4 Michelle 5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
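
Before mapping, it may help to confirm that the lookup column really contains duplicates. A minimal sketch, reusing the df2 above; has_dupes and dupes are illustrative names:

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# True if any Name occurs more than once
has_dupes = not df2['Name'].is_unique

# keep=False marks every occurrence of a duplicated Name, not just the repeats
dupes = df2[df2['Name'].duplicated(keep=False)]
print(has_dupes)
print(dupes)
```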

The solutions are to remove the duplicates with DataFrame.drop_duplicates, or to map with a dict, which keeps the last duplicate match:

# by default, drop_duplicates keeps the first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom 1
Eva 3
Jack 4
Michelle 5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
Name Age New
0 Tom 34 1.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN
# pass keep='last' to keep the last value instead
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom 2
Eva 3
Jack 4
Michelle 5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
Name Age New
0 Tom 34 2.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN
# map by dictionary (a later duplicate key overwrites an earlier one)
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
Name Age New
0 Tom 34 2.0
1 Sara 18 NaN
2 Eva 44 3.0
3 Jack 27 4.0
4 Laura 30 NaN

Merge two DataFrames on two columns and keep the original index order in the result

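When constructing the merged DataFrame, carry the index values of each DataFrame along as columns. Assuming the index is not already a column, reset_index() does this, and the merge then suffixes the two index columns to index_x and index_y:

merged_df = pd.merge(df1.reset_index(), df2.reset_index(), how="outer", on=['key1', 'key2'])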

Use combine_first to combine index_x and index_y:

merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)

Sort by combined_index and index_x, drop the helper columns that are no longer needed, and reset the index:

output = merged_df.sort_values(
    ['combined_index', 'index_x']
).drop(
    ['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)

This results in the following output:

  key1 key2  Value1  Value2
0 K a5 apple NaN
1 K a9 NaN apple
2 K a4 guava NaN
3 A1 a7 kiwi kiwi
4 A3 a9 NaN grape
5 A2 a9 grape NaN
6 B1 b2 banana banana
7 C2 c7 NaN guava
8 B9 b8 peach NaN
9 C3 c1 berry orange
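
The steps above can be tried end to end on small frames. The sketch below uses made-up data (not the frames from the question) and assumes both inputs have a default integer index, which reset_index() turns into a column named index:

```python
import pandas as pd

# Hypothetical inputs; only the column names follow the example above
df1 = pd.DataFrame({'key1': ['K', 'A1', 'B1'],
                    'key2': ['a5', 'a7', 'b2'],
                    'Value1': ['apple', 'kiwi', 'banana']})
df2 = pd.DataFrame({'key1': ['K', 'A1', 'C2'],
                    'key2': ['a9', 'a7', 'c7'],
                    'Value2': ['apple', 'kiwi', 'guava']})

# expose each original index as a column; the merge names them index_x / index_y
merged_df = pd.merge(df1.reset_index(), df2.reset_index(),
                     how='outer', on=['key1', 'key2'])

# position of each row in df1 if it has one, otherwise its position in df2
merged_df['combined_index'] = merged_df['index_x'].combine_first(merged_df['index_y'])

# within the same position, rows that exist in df1 come first (NaN sorts last)
output = (merged_df
          .sort_values(['combined_index', 'index_x'])
          .drop(columns=['index_x', 'index_y', 'combined_index'])
          .reset_index(drop=True))
print(output)
```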

Merge dataframes based on column, only keeping first match

Use drop_duplicates to keep only the first matching row of df_2:

df = df_1.merge(df_2.drop_duplicates('Fruit'), how='left', on='Fruit')
print (df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten

If you only want to add one column, map is faster:

s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten

How to merge two dataframes with preserving the same order of one of them?

Try DataFrame.stack first, then DataFrame.merge:

df2_stack = df2.stack().reset_index(level=1)

level_1 0
0 A txt1
0 B txt1
0 C txt1
1 A txt2
1 C txt2
2 A txt3
2 C txt3

# rename columns after stack
df2_stack.columns = ["name", "text"]

name text
0 A txt1
0 B txt1
0 C txt1
1 A txt2
1 C txt2
2 A txt3
2 C txt3

df.merge(df2_stack, on=['name','text'])


  name  text   desc
0 A txt2 text2
1 A txt1 text1
2 A txt3 text3
3 B txt1 text1
4 C txt2 text2
5 C txt3 text3
6 C txt1 text1

Pandas left merge keeping data in right dataframe on duplicate columns

Frankenstein Answer

df[['ser', 'no']].merge(df2, 'left').set_axis(df.index).fillna(df)

ser no c d
0 0 0 1.0 NaN
1 0 1 1.0 NaN
2 0 2 1.0 NaN
3 1 0 1.0 NaN
4 1 1 1.0 NaN
5 1 2 88.0 90.0
6 2 0 1.0 NaN
7 2 1 1.0 NaN
8 2 2 1.0 NaN


Explanation

  1. I'm going to merge on the columns ['ser', 'no'] and don't want to specify them in the merge call. I also don't want goofy duplicate column names like 'c_x' and 'c_y', so I slice out only the columns I want the two frames to have in common and then merge:

     df[['ser', 'no']].merge(df2, 'left')
  2. When I merge, I want only rows from the left dataframe. In general, merge can produce a different number of rows than either input, so it builds a new index. NOTE: this assumes the right dataframe (df2) has NO DUPLICATES with respect to ['ser', 'no']. In that case a 'left' merge produces exactly as many rows as the left dataframe (df), but not necessarily with the same index. In this example the index happens to match, but I don't want to take chances, so I use set_axis:

     set_axis(df.index)
  3. Finally, since the resulting dataframe now has the same index as df and overlapping columns, I can fill in the missing bits with:

     fillna(df)
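
The three steps can also be run one at a time. A minimal sketch, assuming made-up df and df2 shaped like the output above (the original frames are not shown in this answer); step1, step2 and result are illustrative names:

```python
import pandas as pd

# Hypothetical frames: df holds c for every (ser, no); df2 overrides one row and adds d
df = pd.DataFrame({'ser': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'no':  [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   'c':   [1] * 9})
df2 = pd.DataFrame({'ser': [1], 'no': [2], 'c': [88], 'd': [90]})

# 1. left merge on the shared columns ser and no; c and d come from df2 only
step1 = df[['ser', 'no']].merge(df2, how='left')

# 2. restore the index of df (safe only because df2 has no duplicate keys)
step2 = step1.set_axis(df.index, axis=0)

# 3. fill the gaps in c from df; d has no counterpart in df and stays NaN
result = step2.fillna(df)
print(result)
```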

Merge two data frames based on common column values in Pandas

We can merge two DataFrames in several ways. The most common way in Python is the merge operation in pandas.

import pandas
dfinal = df1.merge(df2, on="movie_title", how='inner')

If the key column has a different name in each DataFrame, specify the left and right column names explicitly. For example, when the column is called 'movie_title' in df1 and 'movie_name' in df2:

dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
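
With left_on and right_on, both key columns are kept in the result, so one of them is usually redundant. A minimal sketch with made-up frames (the rating and year columns are hypothetical) that drops the right-hand key afterwards:

```python
import pandas as pd

# Hypothetical data; only the key column names come from the text above
df1 = pd.DataFrame({'movie_title': ['Alien', 'Heat', 'Up'],
                    'rating': [8.5, 8.3, 8.3]})
df2 = pd.DataFrame({'movie_name': ['Heat', 'Up', 'Jaws'],
                    'year': [1995, 2009, 1975]})

# inner join keeps only the titles present in both frames
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')

# movie_name duplicates movie_title after the join, so drop it
dfinal = dfinal.drop(columns='movie_name')
print(dfinal)
```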

For more options, see the pandas documentation for the merge operation.

Merging two dataframes by keeping certain column values in R

We can use rows_update from dplyr:

library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))

-output

  id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>

How to merge two pandas DataFrames and keep repeated values?

It is as simple as:

df1 = pd.DataFrame({'Name':['John','John','John','Paul','Paul','Jimmy'], 'Book':['B1','B2','B1','B3','B4','B3']})

df2 = pd.DataFrame({'Name':['John','Paul','Jimmy'], 'Age':[25,18,28]})

df1.merge(df2)

Out[22]:
Name Book Age
0 John B1 25
1 John B2 25
2 John B1 25
3 Paul B3 18
4 Paul B4 18
5 Jimmy B3 28
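
This works because each Name appears only once in df2. If that assumption matters, the validate parameter of merge can check it. A minimal sketch reusing the frames above; bad is a hypothetical frame with a duplicated name added for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'John', 'John', 'Paul', 'Paul', 'Jimmy'],
                    'Book': ['B1', 'B2', 'B1', 'B3', 'B4', 'B3']})
df2 = pd.DataFrame({'Name': ['John', 'Paul', 'Jimmy'], 'Age': [25, 18, 28]})

# many_to_one: keys may repeat in df1 but must be unique in df2
out = df1.merge(df2, on='Name', how='left', validate='many_to_one')

# hypothetical lookup table where John appears twice
bad = pd.concat([df2, pd.DataFrame({'Name': ['John'], 'Age': [99]})],
                ignore_index=True)
try:
    df1.merge(bad, on='Name', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    # the merge refuses to run instead of silently multiplying rows
    raised = True
print(raised)
```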

Python Pandas merge only certain columns

You could merge the sub-DataFrame (with just those columns):

df2[list('xab')]  # df2 but only with columns x, a, and b

df1.merge(df2[list('xab')])

