How to Keep Index When Using Pandas Merge

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that a left merge can return more rows than a has when a key in a matches several rows in b. The restored index then contains repeated labels, and you may need to drop the duplicates.
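The duplicate-row case can be sketched as follows, assuming small made-up frames a and b (the values are illustrative and are not the ones from the session above). The sketch restores the index after the merge and then keeps only the first match for each original label:

```python
import pandas as pd

# Hypothetical data: key 3 appears twice in b, so row 'b' of a matches two rows.
a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]}, index=list("abc"))
b = pd.DataFrame({"to_merge_on": [1, 3, 3], "col2": [10, 20, 30]})

merged = (
    a.reset_index()                          # index becomes a column named 'index'
     .merge(b, how="left", on="to_merge_on")
     .set_index("index")                     # restore the original labels
)
print(merged.index.tolist())                 # label 'b' now appears twice

# Keep only the first match for each original row of a.
deduped = merged[~merged.index.duplicated(keep="first")]
print(deduped)
```

Whether keeping the first match is the right rule depends on the data; deduplicating b on the key before merging is another option.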

Keep index of first DataFrame when doing inner merge on columns

Use reset_index() to turn the index of ClientFileDf into a column before merging, then set that column back as the index afterwards:

df2 = pd.merge(ClientFileDf.reset_index(), df_CPCodeDF, how='inner',
               left_on=['CPCode', 'CPPAN'],
               right_on=['HEDGE_CP_CODE', 'HEDGE_PAN_NO']).set_index('index')
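One caveat: reset_index() only produces a column called 'index' when the index is unnamed. The following is a minimal sketch, with made-up data and a hypothetical helper column name orig_index, that names the index explicitly first so the round trip works either way:

```python
import pandas as pd

# Hypothetical stand-ins for ClientFileDf and df_CPCodeDF.
ClientFileDf = pd.DataFrame(
    {"CPCode": ["A", "B", "C"], "CPPAN": ["P1", "P2", "P3"]},
    index=[100, 101, 102],
)
df_CPCodeDF = pd.DataFrame(
    {"HEDGE_CP_CODE": ["A", "C", "D"],
     "HEDGE_PAN_NO": ["P1", "P3", "P4"],
     "limit": [1, 2, 3]}
)

# Name the index so reset_index() yields a predictable column name.
left = ClientFileDf.rename_axis("orig_index").reset_index()

df2 = pd.merge(
    left, df_CPCodeDF, how="inner",
    left_on=["CPCode", "CPPAN"],
    right_on=["HEDGE_CP_CODE", "HEDGE_PAN_NO"],
).set_index("orig_index")

print(df2)  # only the rows of ClientFileDf that matched, under their original labels
```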

Setting the index after merging with pandas?

Here's what happens:

  1. The output index is the intersection of the index/column merge keys [0, 1].
  2. Missing keys are replaced with NaN.
  3. The NaNs cause the index dtype to be upcast to float.

To set the index, just assign to it:

s2 = pd.merge(s, df, how='left', left_index=True, right_on='id')
s2.index = s.index

score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN

You can also merge on s (just because I dislike calling pd.merge directly):

(s.to_frame()
  .merge(df, how='left', left_index=True, right_on='id')
  .set_axis(s.index, axis=0))

score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
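For reference, here is a self-contained sketch of the same approach. The inputs s and df are reconstructed from the output shown above, so treat them as assumptions rather than the original data:

```python
import pandas as pd

# Assumed inputs, inferred from the printed result above.
s = pd.Series([5, 6, 7, 8, 9], index=[10, 11, 12, 13, 14], name="score")
df = pd.DataFrame({"id": [11, 13], "value": ["a", "b"]})

s2 = s.to_frame().merge(df, how="left", left_index=True, right_on="id")

# A left merge keeps the rows of the left side in order, and the keys in df
# are unique here, so s2 has one row per element of s and the original
# labels can simply be assigned back.
s2.index = s.index
print(s2)
```

Assigning the index back is only safe when the merge returns exactly one row per left row; with duplicate keys in df the lengths would no longer match.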

Pandas merge and retain the index

Provisional solution:

In [255]: a = a.reset_index()

In [256]: a
Out[256]:
id1 id2 col1 to_merge_on
0 1 a 1 2
1 1 b 3 4
2 2 a 1 2
3 2 b 3 4

In [271]: c = pd.merge(a, b, how="left")

In [272]: c
Out[272]:
id1 id2 col1 to_merge_on col2
0 1 a 1 2 NaN
1 2 a 1 2 NaN
2 2 b 3 3 2
3 1 b 3 4 NaN

In [273]: c = c.set_index(['id1','id2'])

In [274]: c
Out[274]:
         col1  to_merge_on  col2
id1 id2
1   a       1            2   NaN
2   a       1            2   NaN
    b       3            3     2
1   b       3            4   NaN
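The same round trip can be written without hard-coding the level names. This is a sketch with made-up frames a and b (not the data from the session above), reading the names from a.index.names:

```python
import pandas as pd

# Hypothetical data with a two-level index (id1, id2).
idx = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["id1", "id2"])
a = pd.DataFrame({"col1": [1, 3, 1, 3], "to_merge_on": [2, 4, 2, 4]}, index=idx)
b = pd.DataFrame({"to_merge_on": [4], "col2": [9]})

levels = list(a.index.names)               # ['id1', 'id2']

c = (
    a.reset_index()                        # levels become ordinary columns
     .merge(b, how="left", on="to_merge_on")
     .set_index(levels)                    # rebuild the MultiIndex
)
print(c)
```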

Merge two DataFrames on two columns and keep the original index order in the result

When constructing the merged DataFrame, call reset_index() on both inputs so that each one's original index is carried into the result as the columns index_x and index_y:

merged_df = pd.merge(df1.reset_index(), df2.reset_index(), how="outer", on=['key1', 'key2'])

Use combine_first to combine index_x and index_y:

merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)

Sort by combined_index and index_x, drop the helper columns that are no longer needed, and reset the index:

output = merged_df.sort_values(
['combined_index', 'index_x']
).drop(
['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)

This results in the following output:

  key1 key2  Value1  Value2
0 K a5 apple NaN
1 K a9 NaN apple
2 K a4 guava NaN
3 A1 a7 kiwi kiwi
4 A3 a9 NaN grape
5 A2 a9 grape NaN
6 B1 b2 banana banana
7 C2 c7 NaN guava
8 B9 b8 peach NaN
9 C3 c1 berry orange
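The three steps can be put together in one runnable sketch. The inputs below are a shortened, made-up subset in the spirit of the table above, not the original df1 and df2:

```python
import pandas as pd

# Hypothetical inputs with default integer indexes 0, 1, 2.
df1 = pd.DataFrame({"key1": ["K", "A1", "B1"],
                    "key2": ["a5", "a7", "b2"],
                    "Value1": ["apple", "kiwi", "banana"]})
df2 = pd.DataFrame({"key1": ["K", "A1", "C2"],
                    "key2": ["a9", "a7", "c7"],
                    "Value2": ["apple", "kiwi", "guava"]})

# reset_index() on both sides yields the columns index_x and index_y.
merged_df = pd.merge(df1.reset_index(), df2.reset_index(),
                     how="outer", on=["key1", "key2"])

# Use the left position where available, otherwise the right one.
merged_df["combined_index"] = merged_df.index_x.combine_first(merged_df.index_y)

# Rows that exist only in df2 have NaN in index_x and therefore sort after
# the df1 row that shares the same combined_index.
output = (
    merged_df.sort_values(["combined_index", "index_x"])
             .drop(["index_x", "index_y", "combined_index"], axis=1)
             .reset_index(drop=True)
)
print(output)
```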

Merge two dataframes by index

Use merge, which is an inner join by default:

pd.merge(df1, df2, left_index=True, right_index=True)

Or join, which is a left join by default:

df1.join(df2)

Or concat, which is an outer join by default:

pd.concat([df1, df2], axis=1)

Samples:

df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))

print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4

df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))

print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40


# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20

# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN

# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0

How to merge two dataframes according to their indexes?

Starting from your two DataFrames:

>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

Turn the index of each one into a regular column:

>>> df1=df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
value_df1 col_a
0 a/aa/aaa 1
1 b/bb/bbb 2
2 c/cc/ccc 3

>>> df2=df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
value_df2 col_b
0 bb/bbb 4
1 ccc 5
2 hello 6

Merge both DataFrames on a constant join column, which pairs every row of df1 with every row of df2:

>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
value_df1 col_a value_df2 col_b
0 a/aa/aaa 1 bb/bbb 4
1 a/aa/aaa 1 ccc 5
2 a/aa/aaa 1 hello 6
3 b/bb/bbb 2 bb/bbb 4
4 b/bb/bbb 2 ccc 5
5 b/bb/bbb 2 hello 6
6 c/cc/ccc 3 bb/bbb 4
7 c/cc/ccc 3 ccc 5
8 c/cc/ccc 3 hello 6

Then use apply to flag the rows where the df2 value occurs inside the df1 value:

>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
value_df1 col_a value_df2 col_b match
0 a/aa/aaa 1 bb/bbb 4 False
1 a/aa/aaa 1 ccc 5 False
2 a/aa/aaa 1 hello 6 False
3 b/bb/bbb 2 bb/bbb 4 True
4 b/bb/bbb 2 ccc 5 False
5 b/bb/bbb 2 hello 6 False
6 c/cc/ccc 3 bb/bbb 4 False
7 c/cc/ccc 3 ccc 5 True
8 c/cc/ccc 3 hello 6 False

Filtering on the rows where match is True and dropping the match column gives the expected result:

>>> dfFull[dfFull['match']].drop(['match'], axis=1)
value_df1 col_a value_df2 col_b
3 b/bb/bbb 2 bb/bbb 4
7 c/cc/ccc 3 ccc 5

This solution is inspired by this post.
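As a variation on the steps above, pandas 1.2 and later accept how='cross' in merge, which removes the need for the temporary join column. The sketch below assumes that version is available and swaps the apply call for a plain Python substring test:

```python
import pandas as pd

df1 = pd.DataFrame({"col_a": [1, 2, 3]},
                   index=["a/aa/aaa", "b/bb/bbb", "c/cc/ccc"])
df2 = pd.DataFrame({"col_b": [4, 5, 6]},
                   index=["bb/bbb", "ccc", "hello"])

# Move each index into a named column.
left = df1.rename_axis("value_df1").reset_index()
right = df2.rename_axis("value_df2").reset_index()

# Cartesian product of the two frames (requires pandas >= 1.2).
pairs = left.merge(right, how="cross")

# Keep the pairs where the df2 label is a substring of the df1 label.
mask = [sub in full for full, sub in zip(pairs["value_df1"], pairs["value_df2"])]
result = pairs[mask].reset_index(drop=True)
print(result)
```

Like the original approach, this builds every pair of rows first, so memory use grows with the product of the two frame lengths.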

pandas not matching initial index when I try to join/merge/loc

You can try the merge() method:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

It would help a lot if you could provide a snippet of dataframes you are working on.

Is there a way to merge on Interval Index and another Column Value in pandas?

Merge your DataFrames on the UniqueID column, then check whether Trip_Date falls between Start_Date and End_Date. Finally, set Trip_Date and Value to NaN in every row where the condition is not met:

# Only needed if these columns are not already of datetime64 dtype
df1['Start_Date'] = pd.to_datetime(df1['Start_Date'], dayfirst=True)
df1['End_Date'] = pd.to_datetime(df1['End_Date'], dayfirst=True)
df2['Trip_Date'] = pd.to_datetime(df2['Trip_Date'], dayfirst=True)

out = pd.merge(df1, df2, on='UniqueID', how='left')
m = out['Trip_Date'].between(out['Start_Date'], out['End_Date'])

out.loc[~m, ['Trip_Date', 'Value']] = np.nan

Output:

>>> out
UniqueID Start_Date End_Date Trip_Date Value
0 ID1 2020-01-01 2020-08-01 2020-02-10 1.0
1 ID1 2020-01-01 2020-08-01 2020-02-15 207.0
2 ID2 2020-02-01 2020-04-01 2020-03-06 10.0
3 ID3 2020-03-01 2020-05-01 NaT NaN
4 ID4 2020-04-01 2020-09-01 NaT NaN
5 ID5 2020-05-01 2020-10-01 NaT NaN
6 ID6 2020-06-01 2020-11-01 NaT NaN
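A self-contained version of this approach might look like the sketch below. The data is made up (ISO-formatted dates, a float Value column), and the masking step uses Series.where per column instead of the .loc assignment, which has the same effect here:

```python
import pandas as pd

# Hypothetical data; ISO dates, so no dayfirst parsing is needed.
df1 = pd.DataFrame({
    "UniqueID": ["ID1", "ID2", "ID3"],
    "Start_Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "End_Date": pd.to_datetime(["2020-08-01", "2020-04-01", "2020-05-01"]),
})
df2 = pd.DataFrame({
    "UniqueID": ["ID1", "ID1", "ID2"],
    "Trip_Date": pd.to_datetime(["2020-02-10", "2020-09-15", "2020-03-06"]),
    "Value": [1.0, 5.0, 10.0],
})

out = pd.merge(df1, df2, on="UniqueID", how="left")

# True where the trip falls inside the interval; rows without a trip (NaT)
# compare as False.
m = out["Trip_Date"].between(out["Start_Date"], out["End_Date"])

# Blank out the trip columns wherever the condition is not met.
out["Trip_Date"] = out["Trip_Date"].where(m)
out["Value"] = out["Value"].where(m)
print(out)
```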

