How to keep index when using pandas merge
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
col1 to_merge_on col2
index
a 1 1 1
b 2 3 2
c 3 4 NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
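As a minimal sketch of the duplicate-match case, the frames below are made-up stand-ins (the names left and right and their values are illustrative, not taken from the question). The sketch shows the row count growing after a left merge and one possible way to keep only the first match per original index label.

```python
import pandas as pd

# Hypothetical data: key 1 appears twice in `right`, so row "a" matches twice.
left = pd.DataFrame({"to_merge_on": [1, 3, 4]}, index=["a", "b", "c"])
right = pd.DataFrame({"to_merge_on": [1, 1, 3], "col2": [10, 11, 20]})

# Same pattern as above: move the index into a column, merge, restore it.
merged = left.reset_index().merge(right, how="left").set_index("index")
print(len(left), len(merged))  # the merged frame has one more row than `left`

# Keep only the first match for each original index label.
deduped = merged[~merged.index.duplicated(keep="first")]
print(deduped)
```

If duplicate keys on the right side are unexpected, passing validate="many_to_one" to merge should raise an error instead of silently adding rows.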
Keep index of First dataframe when doing inner merge on columns
Use reset_index() to keep the index of ClientFileDf and then set that index:
df2 = pd.merge(ClientFileDf.reset_index(), df_CPCodeDF, how='inner', \
left_on=['CPCode','CPPAN'], \
right_on = ['HEDGE_CP_CODE','HEDGE_PAN_NO']).set_index('index')
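A self-contained sketch of the same pattern, assuming small invented frames in place of ClientFileDf and df_CPCodeDF (the data and the extra limit column are hypothetical). It only illustrates that the surviving rows of an inner merge keep their original index labels.

```python
import pandas as pd

# Hypothetical stand-ins for ClientFileDf and df_CPCodeDF.
client = pd.DataFrame(
    {"CPCode": ["A", "B", "C"], "CPPAN": [1, 2, 3]},
    index=[10, 20, 30],
)
codes = pd.DataFrame(
    {"HEDGE_CP_CODE": ["A", "C"], "HEDGE_PAN_NO": [1, 3], "limit": [100, 300]}
)

# reset_index() turns the unnamed index into a column called "index",
# which survives the merge and can be restored afterwards.
merged = (
    client.reset_index()
    .merge(
        codes,
        how="inner",
        left_on=["CPCode", "CPPAN"],
        right_on=["HEDGE_CP_CODE", "HEDGE_PAN_NO"],
    )
    .set_index("index")
)
print(merged)
```

If the index has a name, reset_index() uses that name for the new column, so set_index would need that name rather than 'index'.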
Setting the index after merging with pandas?
Here's what happens:
- The output index is the intersection of the index/column merge keys, [0, 1].
- Missing keys are replaced with NaN.
- NaNs result in the index type being upcast to float.
To set the index, just assign to it:
s2 = pd.merge(s, df, how='left', left_index=True, right_on='id')
s2.index = s.index
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
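For reference, here is a runnable sketch of the index assignment, assuming inputs that resemble the output above (the exact s and df from the question are not shown, so these values are a guess). The length check is an added safeguard, since assigning s.index only makes sense when the merge did not add or drop rows.

```python
import pandas as pd

# Hypothetical inputs chosen to resemble the output shown above.
s = pd.Series([5, 6, 7, 8, 9], index=[10, 11, 12, 13, 14], name="score")
df = pd.DataFrame({"id": [11, 13], "value": ["a", "b"]})

s2 = pd.merge(s, df, how="left", left_index=True, right_on="id")

# A left merge against unique right keys keeps one row per left row,
# in the left order, so the original index can be assigned back directly.
assert len(s2) == len(s)
s2.index = s.index
print(s2)
```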
You can also merge on s (just because I dislike calling pd.merge directly):
(s.to_frame()
.merge(df, how='left', left_index=True, right_on='id')
.set_axis(s.index, axis=0))
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
Pandas merge and retain the index
Provisional solution:
In [255]: a = a.reset_index()
In [256]: a
Out[256]:
id1 id2 col1 to_merge_on
0 1 a 1 2
1 1 b 3 4
2 2 a 1 2
3 2 b 3 4
In [271]: c = pd.merge(a, b, how="left")
In [272]: c
Out[272]:
id1 id2 col1 to_merge_on col2
0 1 a 1 2 NaN
1 2 a 1 2 NaN
2 2 b 3 3 2
3 1 b 3 4 NaN
In [273]: c = c.set_index(['id1','id2'])
In [274]: c
Out[274]:
col1 to_merge_on col2
id1 id2
1 a 1 2 NaN
2 a 1 2 NaN
b 3 3 2
1 b 3 4 NaN
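The steps above can be condensed into one chain. This is a sketch with invented frames (the values of a and b are assumptions, not the data from the question); it reads the level names from the index instead of hard-coding them.

```python
import pandas as pd

# Hypothetical frames; `a` has a two-level index named id1/id2.
a = pd.DataFrame(
    {"col1": [1, 3, 1, 3], "to_merge_on": [2, 4, 2, 4]},
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["id1", "id2"]),
)
b = pd.DataFrame({"to_merge_on": [4], "col2": [2]})

# Remember the level names, merge on the shared column, then restore the index.
index_names = list(a.index.names)
c = a.reset_index().merge(b, how="left", on="to_merge_on").set_index(index_names)
print(c)
```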
merge two DataFrame with two columns and keep the same order with original indexes in the result
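When constructing the merged DataFrame, keep the index values from each DataFrame by resetting both indexes first, so that the merge produces index_x and index_y columns:
merged_df = pd.merge(df1.reset_index(), df2.reset_index(), how="outer", on=['key1', 'key2'])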
Use combine_first to combine index_x and index_y:
merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)
Sort by combined_index and index_x, drop the columns that are no longer needed, and reset the index:
output = merged_df.sort_values(
['combined_index', 'index_x']
).drop(
['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)
This results in the following output:
key1 key2 Value1 Value2
0 K a5 apple NaN
1 K a9 NaN apple
2 K a4 guava NaN
3 A1 a7 kiwi kiwi
4 A3 a9 NaN grape
5 A2 a9 grape NaN
6 B1 b2 banana banana
7 C2 c7 NaN guava
8 B9 b8 peach NaN
9 C3 c1 berry orange
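A smaller end-to-end sketch of the same idea, assuming two short made-up frames (a subset loosely modeled on the data above, not the original inputs). Both frames are passed through reset_index() so that the merge yields the index_x and index_y columns used for sorting.

```python
import pandas as pd

# Hypothetical inputs with default integer indexes.
df1 = pd.DataFrame({
    "key1": ["K", "A1", "B1"],
    "key2": ["a5", "a7", "b2"],
    "Value1": ["apple", "kiwi", "banana"],
})
df2 = pd.DataFrame({
    "key1": ["K", "A1", "C2"],
    "key2": ["a9", "a7", "c7"],
    "Value2": ["apple", "kiwi", "guava"],
})

# Carry each frame's original position through the merge as index_x / index_y.
merged_df = pd.merge(
    df1.reset_index(), df2.reset_index(), how="outer", on=["key1", "key2"]
)

# Prefer the left position; fall back to the right position for right-only rows.
merged_df["combined_index"] = merged_df["index_x"].combine_first(merged_df["index_y"])

output = (
    merged_df.sort_values(["combined_index", "index_x"])
    .drop(["index_x", "index_y", "combined_index"], axis=1)
    .reset_index(drop=True)
)
print(output)
```

Rows that exist only in df2 have NaN in index_x, and sort_values places NaN last by default, so at equal positions the df1 row comes first.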
Merge two dataframes by index
Use merge, which is an inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join, which is a left join by default:
df1.join(df2)
Or concat, which is an outer join by default:
pd.concat([df1, df2], axis=1)
Samples:
df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
How to merge two dataframes according to their indexes?
Given your DataFrames:
>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])
And changing the index to a column:
>>> df1=df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
value_df1 col_a
0 a/aa/aaa 1
1 b/bb/bbb 2
2 c/cc/ccc 3
>>> df2=df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
value_df2 col_b
0 bb/bbb 4
1 ccc 5
2 hello 6
We merge both DataFrames on a constant join column, which pairs every row of df1 with every row of df2:
>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
value_df1 col_a value_df2 col_b
0 a/aa/aaa 1 bb/bbb 4
1 a/aa/aaa 1 ccc 5
2 a/aa/aaa 1 hello 6
3 b/bb/bbb 2 bb/bbb 4
4 b/bb/bbb 2 ccc 5
5 b/bb/bbb 2 hello 6
6 c/cc/ccc 3 bb/bbb 4
7 c/cc/ccc 3 ccc 5
8 c/cc/ccc 3 hello 6
Then we use an apply to match the initial index values:
>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
value_df1 col_a value_df2 col_b match
0 a/aa/aaa 1 bb/bbb 4 False
1 a/aa/aaa 1 ccc 5 False
2 a/aa/aaa 1 hello 6 False
3 b/bb/bbb 2 bb/bbb 4 True
4 b/bb/bbb 2 ccc 5 False
5 b/bb/bbb 2 hello 6 False
6 c/cc/ccc 3 bb/bbb 4 False
7 c/cc/ccc 3 ccc 5 True
8 c/cc/ccc 3 hello 6 False
Filtering on the rows where the match column is True and dropping the match column, we get the expected result:
>>> dfFull[dfFull['match']].drop(['match'], axis=1)
value_df1 col_a value_df2 col_b
3 b/bb/bbb 2 bb/bbb 4
7 c/cc/ccc 3 ccc 5
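As an alternative sketch (not the method used above), pandas 1.2 and later accept how="cross" in merge, which avoids the temporary join column. The variable names left, right, pairs and is_match are introduced here for illustration, and a plain Python substring test replaces the apply/find step.

```python
import pandas as pd

df1 = pd.DataFrame({"col_a": [1, 2, 3]}, index=["a/aa/aaa", "b/bb/bbb", "c/cc/ccc"])
df2 = pd.DataFrame({"col_b": [4, 5, 6]}, index=["bb/bbb", "ccc", "hello"])

# Name each index and move it into a regular column.
left = df1.rename_axis("value_df1").reset_index()
right = df2.rename_axis("value_df2").reset_index()

# Cartesian product of the two frames (requires pandas >= 1.2).
pairs = left.merge(right, how="cross")

# Keep the pairs where the df2 label is a substring of the df1 label.
is_match = [v2 in v1 for v1, v2 in zip(pairs["value_df1"], pairs["value_df2"])]
result = pairs.loc[is_match]
print(result)
```

Like the approach above, this builds every pair of rows, so memory use grows with len(df1) * len(df2).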
This solution is inspired by this post.
pandas not matching initial index when I try to join/merge/loc
You can try with merge() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would help a lot if you could provide a snippet of dataframes you are working on.
Is there a way to merge on Interval Index and another Column Value in pandas?
Merge your dataframes on the UniqueID column, then check whether Trip_Date is between Start_Date and End_Date. Finally, set Trip_Date and Value to NaN in all rows where the condition is not met:
# Only needed if these columns do not already have datetime64 dtype
df1['Start_Date'] = pd.to_datetime(df1['Start_Date'], dayfirst=True)
df1['End_Date'] = pd.to_datetime(df1['End_Date'], dayfirst=True)
df2['Trip_Date'] = pd.to_datetime(df2['Trip_Date'], dayfirst=True)
out = pd.merge(df1, df2, on='UniqueID', how='left')
m = out['Trip_Date'].between(out['Start_Date'], out['End_Date'])
out.loc[~m, ['Trip_Date', 'Value']] = np.nan
Output:
>>> out
UniqueID Start_Date End_Date Trip_Date Value
0 ID1 2020-01-01 2020-08-01 2020-02-10 1.0
1 ID1 2020-01-01 2020-08-01 2020-02-15 207.0
2 ID2 2020-02-01 2020-04-01 2020-03-06 10.0
3 ID3 2020-03-01 2020-05-01 NaT NaN
4 ID4 2020-04-01 2020-09-01 NaT NaN
5 ID5 2020-05-01 2020-10-01 NaT NaN
6 ID6 2020-06-01 2020-11-01 NaT NaN
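A compact, self-contained sketch of the same steps, assuming two tiny invented frames (the dates and values are hypothetical). It assigns NaT and NaN to the two columns separately, which is a small deviation from the code above chosen to keep each column's dtype explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs, already in datetime64 dtype.
df1 = pd.DataFrame({
    "UniqueID": ["ID1", "ID2"],
    "Start_Date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "End_Date": pd.to_datetime(["2020-08-01", "2020-04-01"]),
})
df2 = pd.DataFrame({
    "UniqueID": ["ID1", "ID2"],
    "Trip_Date": pd.to_datetime(["2020-02-10", "2020-06-15"]),
    "Value": [1.0, 10.0],
})

out = pd.merge(df1, df2, on="UniqueID", how="left")

# True where the trip falls inside that row's [Start_Date, End_Date] range.
in_range = out["Trip_Date"].between(out["Start_Date"], out["End_Date"])

# Blank out trips that fall outside the range.
out.loc[~in_range, "Trip_Date"] = pd.NaT
out.loc[~in_range, "Value"] = np.nan
print(out)
```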