How to Get a List of All the Duplicate Items Using Pandas in Python

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
      ID ENROLLMENT_DATE       TRAINER_MANAGING       TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795       27-Feb-12     0643D38-Hanover NH     0643D38-Hanover NH        19-Jun-12
 6 11795        3-Jul-12 0649597-White River VT 0649597-White River VT        30-Mar-12
18  8096       19-Dec-11 0649597-White River VT 0649597-White River VT         9-Apr-12
 2  8096        8-Aug-12     0643D38-Hanover NH     0643D38-Hanover NH        25-Jun-12
12  A036       30-Nov-11    063B208-Randolph VT    063B208-Randolph VT              NaN
 3  A036        1-Apr-12     06CB8CF-Hanover NH     06CB8CF-Hanover NH         9-Aug-12
26  A036       11-Aug-12     06D3206-Hanover NH                    NaN        19-Jun-12
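
For reference, duplicated with keep=False on the ID column should select the same rows in one step (a sketch assuming the same df; keep=False marks every occurrence rather than only the repeats):

>>> df[df.duplicated(subset=["ID"], keep=False)].sort_values("ID")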

But I couldn't think of a nice way to avoid repeating ids so many times in method #1, so I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
      ID ENROLLMENT_DATE       TRAINER_MANAGING       TRAINER_OPERATOR FIRST_VISIT_DATE
 6 11795        3-Jul-12 0649597-White River VT 0649597-White River VT        30-Mar-12
24 11795       27-Feb-12     0643D38-Hanover NH     0643D38-Hanover NH        19-Jun-12
 2  8096        8-Aug-12     0643D38-Hanover NH     0643D38-Hanover NH        25-Jun-12
18  8096       19-Dec-11 0649597-White River VT 0649597-White River VT         9-Apr-12
 3  A036        1-Apr-12     06CB8CF-Hanover NH     06CB8CF-Hanover NH         9-Aug-12
12  A036       30-Nov-11    063B208-Randolph VT    063B208-Randolph VT              NaN
26  A036       11-Aug-12     06D3206-Hanover NH                    NaN        19-Jun-12
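
GroupBy.filter expresses the same idea without the generator expression (a sketch assuming the same df; the rows come back in their original order rather than grouped by ID):

>>> df.groupby("ID").filter(lambda g: len(g) > 1)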

How do I get a list of the duplicate rows in pandas?

Group by all the columns, find the groups with more than one row, and collect their indices in a dictionary (first index as the key, remaining indices as the value). This version uses a for loop.

>>> gb = df.groupby(df.columns.to_list())
>>> d = {}
>>> for a, b in gb:
...     if len(b) > 1:
...         d[b.index[0]] = b.index[1:].to_list()
...


>>> d
{1000084: [1000092, 1000116], 1000096: [1000110]}
>>>

Using the same groupby as above, write a function to return the index for a group and construct a dictionary using the aggregate method.

def f(thing):
    return thing.index.to_list()

>>> {key:val for key,*val in gb.aggregate(f) if val}
{1000084: [1000092, 1000116], 1000096: [1000110]}

Looks like the execution time for this scales linearly with the number of columns and rows (i.e. the total number of items).
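
An alternative sketch that skips aggregate and reads the row labels straight off the groups attribute of the same gb (same layout assumption as above: first index becomes the key, the rest the value); it should produce the same dictionary:

>>> {idx[0]: idx[1:].to_list() for idx in gb.groups.values() if len(idx) > 1}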


Here is a large DataFrame for testing. Unfortunately it is very unlikely to produce duplicate rows - maybe that is the worst case for the groupby-then-iterate approach?

import itertools, string
import numpy as np
import pandas as pd

nrows, ncols = 100000, 300

a = np.random.randint(1, 3, (nrows, ncols))
# or using the new random Generator API
# from numpy.random import default_rng
# rng = default_rng()
# a = rng.integers(1, 3, (nrows, ncols))

index = np.arange(1000000, 1000000 + nrows, dtype=np.int64)
cols = [''.join(thing) for thing in itertools.combinations(string.ascii_letters, 3)]
df2 = pd.DataFrame(data=a, index=index, columns=cols[:ncols])
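
Random integers across 300 columns essentially never collide, so one way to exercise the duplicate-handling code anyway (purely a testing hack; which rows get copied is arbitrary) is to overwrite a few rows with copies of others and then time the groupby-then-iterate dictionary building from above:

import timeit

# force a few duplicate rows into the otherwise unique test data
df2.iloc[10] = df2.iloc[0]
df2.iloc[25] = df2.iloc[0]
df2.iloc[40] = df2.iloc[17]

def dupes_by_groupby(frame):
    d = {}
    for _, g in frame.groupby(frame.columns.to_list()):
        if len(g) > 1:
            d[g.index[0]] = g.index[1:].to_list()
    return d

print(timeit.timeit(lambda: dupes_by_groupby(df2), number=1))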

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep=False:

df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)
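
A tiny made-up frame (the name, month and year columns are assumptions mirroring the question) shows the difference: with the default keep='first' the first row of each duplicated pair would be hidden, while keep=False keeps them all:

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "month": [1, 1, 2, 3],
    "year": [2020, 2020, 2020, 2021],
})

# every row of a duplicated (month, year) pair
print(toy[toy.duplicated(subset=["month", "year"], keep=False)])

# default keep='first' skips the first occurrence of each pair
print(toy[toy.duplicated(subset=["month", "year"])])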

How do I find the duplicates in a list and create another list with them?

To remove duplicates use set(a). To print duplicates, something like:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

Note that Counter is not particularly efficient and is probably overkill here; a set will perform better. This code computes a list of unique elements in source order:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

or, more concisely:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]

I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).
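
On Python 3.7+, where plain dicts preserve insertion order, dict.fromkeys gives the same ordered de-duplication in one readable line:

uniq = list(dict.fromkeys(a))   # unique elements, first-seen order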

To compute the list of duplicated elements without libraries:

seen = set()
dupes = []

for x in a:
    if x in seen:
        dupes.append(x)
    else:
        seen.add(x)

or, more concisely:

seen = set()
dupes = [x for x in a if x in seen or seen.add(x)]

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print(no_dupes)  # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes)  # [[1], [3]]
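
If the unhashable elements happen to all be lists of hashable items, as in this example, a common workaround (an assumption about the data, not a general fix) is to use a tuple of each element as a hashable stand-in so the linear set-based approach still applies:

seen = set()
dupes = []
for x in a:
    key = tuple(x)          # hashable proxy for the list
    if key in seen:
        dupes.append(x)
    else:
        seen.add(key)

print(dupes)  # [[1], [3]]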

Searching for duplicate values in the rows of a pandas DataFrame in Python

We can use nunique to count the distinct values in each row:

cond = df[["Open", "High", "Low", "Close"]].apply(pd.Series.nunique, axis=1).eq(1)

cond
Out[344]:
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

# select the rows where all four prices are the same value
row = df[cond]
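
On reasonably recent pandas versions, DataFrame.nunique(axis=1) does the same per-row count a little more directly (same OHLC columns assumed):

cond = df[["Open", "High", "Low", "Close"]].nunique(axis=1).eq(1)
dup_rows = df[cond]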

One-liner to identify duplicates using pandas?

Just use duplicated:

>>> df[df.duplicated()]
email
3 a
4 b

Or if you want a list:

>>> df[df["email"].duplicated()]["email"].tolist()
['a', 'b']
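
If you want each duplicated value listed only once, one sketch (same df assumed) is to mark every occurrence with keep=False and then take the unique values:

>>> df.loc[df["email"].duplicated(keep=False), "email"].unique().tolist()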

How to select duplicate rows with pandas?

You can use Series.duplicated with the parameter keep=False to create a mask of all duplicates and select them with boolean indexing; ~ inverts the mask to select the non-duplicated rows:

mask = df.B.duplicated(keep=False)
print(mask)
0     True
1     True
2    False
3    False
Name: B, dtype: bool

print(df[mask])
     A   B  C
0  100  ci  s
1  200  ci  t

print(df[~mask])
     A   B  C
2  250  po  p
3  300  pa  w
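
drop_duplicates with keep=False is the mirror-image shortcut: it returns only the rows whose B value never repeats, i.e. the same result as df[~mask] above (same df assumed):

print(df.drop_duplicates(subset="B", keep=False))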

