Remove Pandas Rows with Duplicate Indices

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to 'last'.
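To make the keep behavior concrete, here is a minimal sketch on a small made-up frame (this df3 is a hypothetical stand-in, not the original sample data):

```python
import pandas as pd

# Hypothetical stand-in for the sample data: index label 'a' repeats
df3 = pd.DataFrame({'val': [1, 2, 3, 4]}, index=['a', 'a', 'b', 'c'])

# keep='first' flags later occurrences, so the first 'a' row survives
first = df3[~df3.index.duplicated(keep='first')]

# keep='last' flags earlier occurrences, so the last 'a' row survives
last = df3[~df3.index.duplicated(keep='last')]
```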

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
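Since df1 from Paul's example is not reproduced here, a small hypothetical MultiIndex frame illustrates the same call:

```python
import pandas as pd

# Hypothetical stand-in for df1: the ('x', 1) index tuple repeats
idx = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 1), ('y', 2)], names=['outer', 'inner']
)
df1 = pd.DataFrame({'val': [10, 20, 30]}, index=idx)

# duplicated() compares full index tuples, so keep='last' retains the
# final row for each (outer, inner) pair
deduped = df1[~df1.index.duplicated(keep='last')]
```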

Fastest Way to Drop Duplicated Index in a Pandas DataFrame

Simply: DF.groupby(DF.index).first()
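A minimal sketch of that one-liner, using a made-up frame with a repeated index label:

```python
import pandas as pd

# Hypothetical frame: index label 'a' appears twice
DF = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'a', 'b'])

# Grouping on the index collapses duplicate labels; first() takes the
# first non-NA value of each column within a group
result = DF.groupby(DF.index).first()
```

Note that because first() works per column and skips NAs, it can combine values from different rows of a group, whereas the index.duplicated mask always keeps whole rows.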

how could i delete rows with repeating/duplicate index from dataframe

Without resetting the index:

df[~df.index.duplicated()]

pandas: removing duplicate values in rows with same index in two columns

Your syntax is not correct; have a look at the documentation of numpy.where.
Check for equality between your two columns, and replace the values in your label column:

import numpy as np
df['label'] = np.where(df['text'].eq(df['label']), 'same', df['label'])

prints:

          text        label
0  she is good         same
1   she is bad  she is good

Consider duplicate index in drop_duplicates method of a pandas DataFrame

Call reset_index and duplicated, and then index the original:

df = df[~df.reset_index().duplicated().values]
print(df)
   A  B
a  0  1
b  0  0
c  0  0
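In other words, resetting the index lets duplicated() treat the index label as just another column, so a row is dropped only when both its label and its values repeat. A sketch with hypothetical data consistent with the output above:

```python
import pandas as pd

# Hypothetical data: rows 'b' and 'c' share values but have distinct
# index labels, while the two 'a' rows are true duplicates
df = pd.DataFrame({'A': [0, 0, 0, 0], 'B': [1, 1, 0, 0]},
                  index=['a', 'a', 'b', 'c'])

# reset_index() makes the label a column, so duplicated() only flags rows
# where the label and every value repeat together
deduped = df[~df.reset_index().duplicated().values]
```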

How to remove mirror duplicate pair rows in pandas?

df.loc[pd.DataFrame(np.sort(df[['a','b','c','d']],1),index=df.index).drop_duplicates(keep='first').index]

You can use np.sort to sort the columns in ascending order and then use .drop_duplicates to get rid of the duplicate rows.
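A sketch with made-up data, where the second row is a reordered (mirror) copy of the first:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is a mirror (reordered copy) of row 0
df = pd.DataFrame({'a': [1, 4, 5],
                   'b': [2, 3, 6],
                   'c': [3, 2, 7],
                   'd': [4, 1, 8]})

# Sorting each row's values makes mirror pairs identical, so
# drop_duplicates on the sorted copy tells us which labels to keep
sorted_vals = pd.DataFrame(np.sort(df[['a', 'b', 'c', 'd']], axis=1),
                           index=df.index)
result = df.loc[sorted_vals.drop_duplicates(keep='first').index]
```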

How to remove duplicate values in one column but keep the rows pandas?

You could use the pd.Series.duplicated method:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'CN', 'Yantian'],
        ['China', 'CN', 'Shekou'],
        ['China', 'CN', 'Quanzhou'],
        ['United Kingdom', 'UK', 'Plymouth'],
        ['United Kingdom', 'UK', 'Cardiff'],
        ['United Kingdom', 'UK', 'Bird port']
    ],
    columns=['Country', 'Country code', 'Port Name']
)

for col in ['Country', 'Country code']:
    # .loc avoids chained assignment, which does not reliably write back
    df.loc[df[col].duplicated(), col] = np.nan
print(df)

prints

          Country Country code  Port Name
0           China           CN    Yantian
1             NaN          NaN     Shekou
2             NaN          NaN   Quanzhou
3  United Kingdom           UK   Plymouth
4             NaN          NaN    Cardiff
5             NaN          NaN  Bird port