How do I get a list of all the duplicate items using pandas in python?
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
This works, but I couldn't think of a nice way to avoid repeating ids so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
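Another option, offered as a sketch rather than a third official method, is groupby followed by filter, which keeps only the groups with more than one row and avoids repeating ids. The frame below is made-up stand-in data, not the contents of dup.csv.

```python
import pandas as pd

# Hypothetical stand-in for dup.csv (only two columns, invented values)
df = pd.DataFrame(
    {
        "ID": ["11795", "8096", "A036", "11795", "B001", "A036", "A036"],
        "ENROLLMENT_DATE": [
            "27-Feb-12", "19-Dec-11", "30-Nov-11", "3-Jul-12",
            "1-Jan-12", "1-Apr-12", "11-Aug-12",
        ],
    }
)

# Keep every row whose ID occurs more than once, in original row order
dupes = df.groupby("ID").filter(lambda g: len(g) > 1)

# Stable sort so the rows of each duplicated ID sit together
dupes = dupes.sort_values("ID", kind="stable")
print(dupes)
```

For very large frames a boolean mask from duplicated(keep=False) is likely faster than calling a Python lambda per group, but I have not timed it here.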
How do I get a list of the duplicate rows in pandas?
Group by all the columns, find the groups with more than one row, and map the first index label of each such group to the labels of its repeats. This version uses a for loop.
>>> gb = df.groupby(df.columns.to_list())
>>> d = {}
>>> for a,b in gb:
...     if len(b) > 1:
...         d[b.index[0]] = b.index[1:].to_list()
...
>>> d
{1000084: [1000092, 1000116], 1000096: [1000110]}
Using the same groupby as above, write a function to return the index for a group and construct a dictionary using the aggregate method.
def f(thing):
    return thing.index.to_list()
>>> {key:val for key,*val in gb.aggregate(f) if val}
{1000084: [1000092, 1000116], 1000096: [1000110]}
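To try the for-loop version without the original data, here is a minimal self-contained sketch. The column names and values are invented; the index labels are chosen only so that the result matches the dictionary shown above.

```python
import pandas as pd

# Hypothetical data: rows 1000084/1000092/1000116 are identical,
# and so are rows 1000096/1000110
df = pd.DataFrame(
    {"a": [1, 2, 1, 3, 2, 1], "b": [1, 2, 1, 3, 2, 1]},
    index=[1000084, 1000096, 1000092, 1000100, 1000110, 1000116],
)

# Group on every column so each group holds one set of identical rows
gb = df.groupby(df.columns.to_list())

# Map the first index label of each duplicated group to the remaining labels
d = {}
for _, group in gb:
    if len(group) > 1:
        d[group.index[0]] = group.index[1:].to_list()

print(d)
```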
It looks like the execution time for this scales linearly with the number of rows and columns (the number of items).
Here is a large DataFrame for testing. Unfortunately, with this many random columns it is very unlikely to contain any duplicate rows; maybe that is the worst case for the groupby-then-iterate approach?
import itertools,string
import numpy as np
nrows,ncols = 100000,300
a = np.random.randint(1,3,(nrows,ncols))
# or using the new random stuff
#from numpy.random import default_rng
#rng = default_rng()
#a = rng.integers(1,3,(nrows,ncols))
index = np.arange(1000000,1000000+nrows,dtype=np.int64)
cols = [''.join(thing) for thing in itertools.combinations(string.ascii_letters,3)]
df2 = pd.DataFrame(data=a,index=index,columns=cols[:ncols])
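If you want a test frame that is guaranteed to contain duplicates, one possible approach is to overwrite a few rows with copies of others. This is only a sketch: the sizes, the seed, the column names and the choice of rows to copy are all arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Smaller than the frame above so the sketch runs quickly; sizes are arbitrary
nrows, ncols = 1000, 30
rng = np.random.default_rng(0)
a = rng.integers(1, 3, (nrows, ncols))

# Overwrite a few rows with copies of earlier rows to guarantee duplicates
a[10] = a[5]
a[20] = a[5]
a[30] = a[7]

index = np.arange(1000000, 1000000 + nrows, dtype=np.int64)
cols = [f"c{i}" for i in range(ncols)]  # hypothetical column names
df2 = pd.DataFrame(data=a, index=index, columns=cols)

# Index labels of every row that has at least one identical twin
dup_labels = df2.index[df2.duplicated(keep=False)].to_list()
print(dup_labels)
```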
Filter and display all duplicated rows based on multiple columns in Pandas
The following code works, by adding keep=False:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
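To see what keep=False changes, here is a small sketch on invented data; only the column names are taken from the snippet above.

```python
import pandas as pd

# Hypothetical data; only the column names come from the snippet above
df = pd.DataFrame(
    {
        "name": ["ann", "bob", "cid", "dee"],
        "month": [1, 1, 2, 3],
        "year": [2020, 2020, 2020, 2021],
    }
)

# keep=False marks every member of a duplicated (month, year) pair,
# while the default keep="first" leaves the first occurrence unmarked
all_dupes = df[df.duplicated(subset=["month", "year"], keep=False)]
later_only = df[df.duplicated(subset=["month", "year"])]

print(all_dupes)
print(later_only)
```

Here all_dupes holds both the ann and bob rows, while later_only holds just the bob row.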
How do I find the duplicates in a list and create another list with them?
To remove duplicates use set(a). To print duplicates, something like:
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print([item for item, count in collections.Counter(a).items() if count > 1])
## [1, 2, 5]
Note that Counter is not particularly efficient and is probably overkill here; set will perform better. This code computes a list of unique elements in the source order:
seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)
or, more concisely:
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).
To compute the list of duplicated elements without libraries:
seen = set()
dupes = []
for x in a:
    if x in seen:
        dupes.append(x)
    else:
        seen.add(x)
or, more concisely:
seen = set()
dupes = [x for x in a if x in seen or seen.add(x)]
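Note that both versions above append an element every time it repeats, so for the sample list they give [2, 1, 5, 5, 5]. If you want each duplicate reported only once, a second set can track what has already been reported. The function name find_duplicates is my own, not a library function.

```python
def find_duplicates(items):
    """Return each duplicated item once, in the order its first repeat appears."""
    seen = set()      # everything encountered so far
    reported = set()  # duplicates already added to the result
    dupes = []
    for x in items:
        if x in seen and x not in reported:
            dupes.append(x)
            reported.add(x)
        seen.add(x)
    return dupes


a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
print(find_duplicates(a))  # [2, 1, 5]
```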
If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:
a = [[1], [2], [3], [1], [5], [3]]
no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print(no_dupes) # [[1], [2], [3], [5]]
dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes) # [[1], [3]]
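If the unhashable elements happen to be lists of hashable values, as in this example, you may be able to avoid the quadratic scan by using a hashable stand-in (here a tuple) as the set key. This is a sketch under that assumption; it does not work for arbitrary unhashable objects.

```python
a = [[1], [2], [3], [1], [5], [3]]

seen = set()
dupes = []
for x in a:
    key = tuple(x)  # assumes each inner list holds only hashable values
    if key in seen:
        dupes.append(x)
    else:
        seen.add(key)

print(dupes)  # [[1], [3]]
```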
Searching for duplicate values in rows of pandas dataframes Python
We can use nunique: a row whose selected columns all hold the same value has exactly one unique value.
cond = df[["Open", "High","Low", "Close"]].apply(pd.Series.nunique, axis=1).eq(1)
cond
Out[344]:
0 False
1 False
2 False
3 False
4 False
5 True
dtype: bool
#row = df['cond']
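Here is a self-contained sketch of the same idea on invented prices; only the column names are taken from the snippet above. It also shows a looser test for rows in which at least two of the values coincide, in case that is what "duplicate values in a row" means for you.

```python
import pandas as pd

# Hypothetical prices; only the column names come from the snippet above
df = pd.DataFrame(
    {
        "Open": [1.0, 2.0, 3.0],
        "High": [1.5, 2.0, 3.5],
        "Low": [0.5, 2.0, 2.5],
        "Close": [1.2, 2.0, 3.0],
    }
)

cols = ["Open", "High", "Low", "Close"]

# True where all four values in the row are equal
cond = df[cols].nunique(axis=1).eq(1)

# True where at least two of the four values in the row coincide
any_repeat = df[cols].nunique(axis=1).lt(len(cols))

print(cond.tolist())
print(any_repeat.tolist())
```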
One-liner to identify duplicates using pandas?
Just use duplicated:
>>> df[df.duplicated()]
email
3 a
4 b
Or if you want a list:
>>> df[df["email"].duplicated()]["email"].tolist()
['a', 'b']
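The original frame is not shown, so the sketch below assumes a single email column with values chosen to be consistent with the output above. It also shows keep=False, which marks every copy rather than only the later ones.

```python
import pandas as pd

# Assumed data, chosen to be consistent with the output shown above
df = pd.DataFrame({"email": ["a", "b", "c", "a", "b"]})

# Default keep="first": only the second and later copies are flagged
later = df[df.duplicated()]

# keep=False: every row that has a twin is flagged
every = df[df.duplicated(keep=False)]

# Each duplicated value once, as a plain list
values = df.loc[df["email"].duplicated(), "email"].unique().tolist()

print(later)
print(every)
print(values)
```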
How to select duplicate rows with pandas?
You can use Series.duplicated with parameter keep=False to create a mask for all duplicates, then filter with boolean indexing; use ~ to invert the mask:
mask = df.B.duplicated(keep=False)
print (mask)
0 True
1 True
2 False
3 False
Name: B, dtype: bool
print (df[mask])
A B C
0 100 ci s
1 200 ci t
print (df[~mask])
A B C
2 250 po p
3 300 pa w