Pandas: How to Merge Two Dataframes on a Column by Keeping the Information of the First One

Pandas: how to merge two dataframes on a column by keeping the information of the first one?

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

#print (df1)
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map by Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

Alternative solution with merge with left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

If need map by multiple columns (e.g. Year and Code) need merge with left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67

#merge by all columns
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN

#specified columns - columns for join (Year, Code) need always + appended columns (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN

If get error with map it means duplicates by columns of join, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Solutions are removed duplicates by DataFrame.drop_duplicates, or use map by dict for last dupe match:

#default keep first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

#add parameter for keep last value 
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

merge two DataFrame with two columns and keep the same order with original indexes in the result

when constructing the merged dataframe, get the index values from each dataframe.

merged_df = pd.merge(df1, df2, how="outer", on=['key1', 'key2'])

use combine_first to combine index_x & index_y

merged_df['combined_index'] =merged_df.index_x.combine_first(merged_df.index_y)

sort using combined_index & index_x dropping columns which are not needed & resetting index.

output = merged_df.sort_values(
    ['combined_index', 'index_x']
).drop(
    ['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)

This results in the following output:

  key1 key2  Value1  Value2
0    K   a5   apple     NaN
1    K   a9     NaN   apple
2    K   a4   guava     NaN
3   A1   a7    kiwi    kiwi
4   A3   a9     NaN   grape
5   A2   a9   grape     NaN
6   B1   b2  banana  banana
7   C2   c7     NaN   guava
8   B9   b8   peach     NaN
9   C3   c1   berry  orange

Merge dataframes based on column, only keeping first match

Use drop_duplicates for first rows:

df = df_1.merge(df_2.drop_duplicates('Fruit'),how='left',on='Fruit')
print (df)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten

If want add only one column faster is use map:

s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten

How to merge two dataframes with preserving the same order of one of them?

try first DataFrame.stack then DataFrame.merge,

df2_stack = df2.stack().reset_index(level=1)

  level_1     0
0       A  txt1
0       B  txt1
0       C  txt1
1       A  txt2
1       C  txt2
2       A  txt3
2       C  txt3

# rename columns after stack
df2_stack.columns = ["name", "text"] 

  name  text
0    A  txt1
0    B  txt1
0    C  txt1
1    A  txt2
1    C  txt2
2    A  txt3
2    C  txt3

df.merge(df2_stack, on=['name','text'])

  name  text   desc
0    A  txt2  text2
1    A  txt1  text1
2    A  txt3  text3
3    B  txt1  text1
4    C  txt2  text2
5    C  txt3  text3
6    C  txt1  text1

Pandas left merge keeping data in right dataframe on duplicte columns

Frankenstein Answer

df[['ser', 'no']].merge(df2, 'left').set_axis(df.index).fillna(df)

   ser  no     c     d
0    0   0   1.0   NaN
1    0   1   1.0   NaN
2    0   2   1.0   NaN
3    1   0   1.0   NaN
4    1   1   1.0   NaN
5    1   2  88.0  90.0
6    2   0   1.0   NaN
7    2   1   1.0   NaN
8    2   2   1.0   NaN

Explanation

I'm going to merge on the columns ['ser', 'no'] and don't want to specify in the merge call. Also, I don't want goofy duplicate column names like 'c_x' and 'c_y' so I slice only columns that I want in common then merge
```
 df[['ser', 'no']].merge(df2, 'left')
```
When I merge, I want only rows from the left dataframe. However, merge usually produces a number of rows vastly different from the original dataframes and therefore produces a new index. However, NOTE this is assuming the right dataframe (df2) has NO DUPLICATES with respect ['ser', 'no'] then a 'left' merge should produce the same exact number of rows as the left dataframe (df). But it won't have the same index necessarily. It turns out that in this example it does. But I don't want to take chances. So I use set_axis
```
  set_axis(df.index)
```
Finally, since the resulting dataframe has the same index and columns as df. I can fill in the missing bits with:
```
fillna(df)
```

Merge two data frames based on common column values in Pandas

We can merge two Data frames in several ways. Most common way in python is using merge operation in Pandas.

import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')

For merging based on columns of different dataframe, you may specify left and right common column names specially in case of ambiguity of two different names of same column, lets say - 'movie_title' as 'movie_name'.

dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')

If you want to be even more specific, you may read the documentation of pandas merge operation.

Merging two dataframes by keeping certain column values in r

We may use rows_update

library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))

-output

  id item score cat.a cat.b
1  1   11     1     A     a
2  2   22     0     B     a
3  3   33     1     C     b
4  4   44     1     D     b
5  5   55     1     E     c
6  6   66     0     F     f
7  7   77     1  <NA>  <NA>
8  8   88     1  <NA>  <NA>

How to merge two pandas DataFrames and keep repeated values?

It is as simple as:

df1 = pd.DataFrame({'Name':['John','John','John','Paul','Paul','Jimmy'], 'Book':['B1','B2','B1','B3','B4','B3']})

df2 = pd.DataFrame({'Name':['John','Paul','Jimmy'], 'Age':[25,18,28]})

df1.merge(df2)

Out[22]: 
    Name Book  Age
0   John   B1   25
1   John   B2   25
2   John   B1   25
3   Paul   B3   18
4   Paul   B4   18
5  Jimmy   B3   28

Python Pandas merge only certain columns

You could merge the sub-DataFrame (with just those columns):

df2[list('xab')]  # df2 but only with columns x, a, and b

df1.merge(df2[list('xab')])

Pandas: How to Merge Two Dataframes on a Column by Keeping the Information of the First One