Differencebetween Join and Merge in Pandas

What is the difference between join and merge in Pandas?

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

val_l val_r
key
foo 1 4
bar 2 5

The same functionality can be had by using merge on the columns follows:

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))

key val_l val_r
0 foo 1 4
1 bar 2 5

Pandas Merging 101

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • merging with multiple columns
    • avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.





Enough talk - just show me how to use merge!

Setup & Basics

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893

right

key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by


Note
This, along with the forthcoming figures all follow this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D).

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how='right':

left.merge(right, on='key', how='right')

key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

left.merge(right, on='key', how='outer')

key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely:

Sample Image



Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),

(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))

key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both

And similarly, for a RIGHT-Excluding JOIN,


(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))

key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—

(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))

key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357


Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893

right2

keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278


Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

key value newcol
0 B 0.400157 0
1 D 2.240893 1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0


Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


Other useful merge* operations and functions

  • Merging a DataFrame with Series on index: See this answer.

  • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

  • pd.merge_ordered is a useful function for ordered JOINs.

  • pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.





Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

  • Merging basics - basic types of joins *

  • Index-based joins

  • Generalizing to multiple DataFrames

  • Cross join

*You are here.

pandas join DF - merge vs. join different semantics

join joins per default along the indexes, and merge along the columns with the same names.

Check this:

In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
subject_id_l name_l nationality_l age_group_l subject_id_r name_r nationality_r age_group_r average_returns_per_group
0 1 A DE 1 1 Billy DE 1.0 NaN
1 2 B AUT 2 2 Brian US 1.0 NaN
2 3 C US 1 3 Bran US 3.0 NaN
3 4 D US 3 NaN NaN NaN NaN NaN
4 5 E US 1 NaN NaN NaN NaN NaN

let's set ['a','b','c'] as an index in df_b and try to join it again - you'll see only NaN's in all *_r columns:

In [116]: df_a.join(df_b.set_index(pd.Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
subject_id_l name_l nationality_l age_group_l subject_id_r name_r nationality_r age_group_r average_returns_per_group
0 1 A DE 1 NaN NaN NaN NaN NaN
1 2 B AUT 2 NaN NaN NaN NaN NaN
2 3 C US 1 NaN NaN NaN NaN NaN
3 4 D US 3 NaN NaN NaN NaN NaN
4 5 E US 1 NaN NaN NaN NaN NaN

In [117]: df_b.set_index(pd.Index(['a','b','c']))
Out[117]:
subject_id name nationality age_group average_returns_per_group
a 1 Billy DE 1 NaN
b 2 Brian US 1 NaN
c 3 Bran US 3 NaN

UPDATE: IMO merge works as expected (described in docs)

In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
subject_id_l name_l nationality age_group subject_id_r name_r average_return_per_group
0 1 A DE 1 1 Billy 1.5
1 2 B AUT 2 NaN NaN NaN
2 3 C US 1 2 Brian 2.3
3 4 D US 3 3 Bran 1.4
4 5 E US 1 2 Brian 2.3

pandas join two json columns and combine them but results in a key error

For vectorized join dont use apply, only use + with Series:

df = pd.merge(students_df_json, courses_df_json, on="Name", how='left')
s = df["students_json"] + df["courses_json"]

Pandas merge stop at first match like vlookup instead of duplicating

pandas' merge behaves (mostly) like a SQL merge and will provide all combinations of matching keys. If you only want the first item, simply remove it from the data you feed to merge.

Use drop_duplicates on mat_grp:

merged = pd.merge(pos, mat_grp.drop_duplicates('matgrp'), how='left', on='matgrp')

output:

       PO matgrp      commodity
0 123456 1001 foo - 10001
1 654321 803A spam - 100003
2 971358 803B eggs - 10003


Related Topics



Leave a reply



Submit