Remove pandas rows with duplicate indices
I would suggest using the duplicated method on the Pandas Index itself:
df3 = df3[~df3.index.duplicated(keep='first')]
While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop
>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop
>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
Note that you can keep the last element by changing the keep argument to 'last'.
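As a minimal sketch of how the keep argument changes the result, here is a small made-up frame (the data and the variable names are illustrative, not the question's sample data):

```python
import pandas as pd

# Hypothetical data: index label 'a' appears twice.
df = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'a'])

# keep='first' flags later occurrences as duplicates, so the first 'a' survives.
first = df[~df.index.duplicated(keep='first')]

# keep='last' flags earlier occurrences, so the last 'a' survives.
last = df[~df.index.duplicated(keep='last')]

# keep=False flags every occurrence of a repeated label, so only 'b' survives.
unique_only = df[~df.index.duplicated(keep=False)]
```

In all three cases the surviving rows keep their original order; nothing is sorted.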
It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop
>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
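For readers without the original df1 at hand, a small stand-in shows the MultiIndex behaviour. The frame below (df_multi, with made-up level names and values) is an assumption for illustration only:

```python
import pandas as pd

# Hypothetical two-level index in which the pair ('x', 1) appears twice.
idx = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('x', 1), ('y', 1)],
    names=['outer', 'inner'],
)
df_multi = pd.DataFrame({'value': [10, 20, 30, 40]}, index=idx)

# A row counts as a duplicate only when every index level matches.
# keep='last' drops the earlier ('x', 1) row and keeps the later one.
deduped = df_multi[~df_multi.index.duplicated(keep='last')]
```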
Fastest Way to Drop Duplicated Index in a Pandas DataFrame
Simply: DF.groupby(DF.index).first()
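One caveat worth checking against your own data: groupby(...).first() returns the first non-null value of each column within a group and sorts the group keys by default, so it is not always equivalent to keeping the first row. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first 'x' row has a missing value in column A.
df = pd.DataFrame(
    {'A': [np.nan, 1.0, 2.0], 'B': [10, 20, 30]},
    index=['x', 'x', 'w'],
)

# Takes the first non-null entry per column, so A for 'x' comes from the
# second row while B comes from the first; the result is sorted by key.
by_group = df.groupby(df.index).first()

# Keeps the first 'x' row exactly as it is, NaN included, in original order.
by_mask = df[~df.index.duplicated(keep='first')]
```

If the rows contain no missing values and the order of the result does not matter, the two approaches select the same values.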
How could I delete rows with a repeating/duplicate index from a DataFrame
Without resetting the index:
df[~df.index.duplicated()]
pandas: removing duplicate values in rows with same index in two columns
Your syntax is not correct; have a look at the documentation of numpy.where.
Check for equality between your two columns, and replace the values in your label column:
import numpy as np
df['label'] = np.where(df['text'].eq(df['label']),'same',df['label'])
prints:
          text        label
0  she is good         same
1   she is bad  she is good
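For a version that runs on its own, the input below is reconstructed from the printed output (the original frame is not shown in the answer, so treat it as an assumption):

```python
import numpy as np
import pandas as pd

# Input reconstructed from the printed result; not the asker's actual frame.
df = pd.DataFrame({
    'text': ['she is good', 'she is bad'],
    'label': ['she is good', 'she is good'],
})

# np.where(condition, value_if_true, value_if_false) works element-wise:
# where text equals label, write 'same'; otherwise keep the existing label.
df['label'] = np.where(df['text'].eq(df['label']), 'same', df['label'])
```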
Consider duplicate index in drop_duplicates method of a pandas DataFrame
Call reset_index and duplicated, and then index the original:
df = df[~df.reset_index().duplicated().values]
print(df)
   A  B
a  0  1
b  0  0
c  0  0
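To see why this differs from de-duplicating on the index alone, here is a sketch on made-up data where the same label appears with both identical and different values:

```python
import pandas as pd

# Hypothetical data: label 'a' repeats, twice with identical values and once not.
df = pd.DataFrame(
    {'A': [0, 0, 0, 0], 'B': [1, 1, 2, 0]},
    index=['a', 'a', 'a', 'b'],
)

# reset_index() turns the index into an ordinary column, so duplicated()
# compares the label together with the row values.
mask = ~df.reset_index().duplicated().values

# Only the exact repeat (second row) is dropped; ('a', 0, 2) is kept.
result = df[mask]
```

Using .values matters here: the mask comes from a frame with a fresh integer index, so it has to be applied by position rather than by label.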
How to remove mirror duplicate pair rows in pandas?
df.loc[pd.DataFrame(np.sort(df[['a','b','c','d']],1),index=df.index).drop_duplicates(keep='first').index]
You can use np.sort to sort the values within each row in ascending order, and then use .drop_duplicates to get rid of the rows that become identical.
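Broken into steps on a two-column frame, the idea looks roughly like this (the data and the names sorted_rows and result are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 1 hold the same pair in opposite order.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 4]})

# Sort the values within each row so that mirrored pairs become identical.
sorted_rows = pd.DataFrame(
    np.sort(df[['a', 'b']].to_numpy(), axis=1),
    index=df.index,
)

# Drop the repeats in the sorted copy, then use the surviving index labels
# to select the untouched rows from the original frame.
result = df.loc[sorted_rows.drop_duplicates(keep='first').index]
```

Selecting by label with .loc assumes the original index is unique; with a repeated index, a positional boolean mask would be the safer choice.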
How to remove duplicate values in one column but keep the rows pandas?
You could use the pd.Series.duplicated method:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'CN', 'Yantian'],
        ['China', 'CN', 'Shekou'],
        ['China', 'CN', 'Quanzhou'],
        ['United Kingdom', 'UK', 'Plymouth'],
        ['United Kingdom', 'UK', 'Cardiff'],
        ['United Kingdom', 'UK', 'Bird port']
    ],
    columns=['Country', 'Country code', 'Port Name']
)

# Blank out repeated values, keeping the first occurrence in each column.
# .loc avoids chained assignment, and np.nan replaces the removed np.NaN alias.
for col in ['Country', 'Country code']:
    df.loc[df[col].duplicated(), col] = np.nan

print(df)
prints
index | Country | Country code | Port Name |
---|---|---|---|
0 | China | CN | Yantian |
1 | NaN | NaN | Shekou |
2 | NaN | NaN | Quanzhou |
3 | United Kingdom | UK | Plymouth |
4 | NaN | NaN | Cardiff |
5 | NaN | NaN | Bird port |