How to Output Duplicated Rows

You can do this with duplicated, which flags duplicated rows when passed a matrix or data frame. Since you're only checking the first three columns, pass dat[,-4] to the function.

dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]
#   x1 x2 x3 x4
# 1 34 14 45 53
# 2  2  8 18 17
# 3 34 14 45 20
# 5  2  8 18  5

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

This works, but I couldn't think of a nice way to avoid repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
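
As a quick check that the two methods select the same rows, here is a minimal sketch on a made-up frame. The column name visit and all the values are illustrative assumptions, not the contents of dup.csv.

```python
import pandas as pd

# Hypothetical data: "A036" and "8096" each appear twice, the other IDs once.
df = pd.DataFrame({
    "ID": ["A036", "8096", "11795", "A036", "5555", "8096"],
    "visit": [1, 2, 3, 4, 5, 6],
})

ids = df["ID"]

# Method #1: keep rows whose ID is among the IDs flagged as duplicated.
method_1 = df[ids.isin(ids[ids.duplicated()])].sort_values("ID")

# Method #2: keep every group that has more than one row.
method_2 = pd.concat([g for _, g in df.groupby("ID") if len(g) > 1])

print(method_1)
print(method_2)
```

Both results contain the same rows; only the order of rows within each ID may differ, because sort_values does not promise a stable sort by default.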

how to find duplicated rows of data and output

Use duplicated with keep=False to get a boolean mask of the duplicated rows, then use that mask to select them:

import pandas as pd

# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)

# increment index to match line number
df.index += 1

# keep duplicate entries
out = df[df[0].duplicated(keep=False)]

# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)

Content of output file:

53,KOH KANG RI,89943392
56,KOH KANG RI,89943392

One-line version:

pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
  .str.split('\t', expand=True) \
  .assign(index=lambda x: x.index+1) \
  .set_index('index') \
  [lambda x: x[0].duplicated(keep=False)] \
  .to_csv('duplicated_data.csv', header=False)
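
To see what keep=False changes, here is a small sketch on a made-up Series (the values are illustrative and unrelated to names_dup2.csv). It compares the three keep options and shows that keep=False matches the "duplicated forwards OR duplicated backwards" idiom.

```python
import pandas as pd

# Hypothetical values: "A" occurs three times, "B" and "C" once each.
names = pd.Series(["A", "B", "A", "C", "A"])

first = names.duplicated()             # flags every occurrence except the first
last = names.duplicated(keep="last")   # flags every occurrence except the last
every = names.duplicated(keep=False)   # flags all members of a duplicate group

print(pd.DataFrame({"value": names, "first": first, "last": last, "every": every}))

# keep=False is the union of the two one-sided masks.
assert (first | last).equals(every)
```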

Find duplicate lines in a file and count how many times each line was duplicated?

Assuming there is one number per line:

sort <file> | uniq -c

You can use the more verbose --count flag too with the GNU version, e.g., on Linux:

sort <file> | uniq --count
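
If a shell is not available, roughly the same counts can be produced in Python. This is a sketch using collections.Counter; the function name count_lines and the sample lines are made up for illustration, and the sort order is by code point, which may differ from sort under some locales.

```python
from collections import Counter


def count_lines(lines):
    """Return (count, line) pairs ordered by line, similar to `sort | uniq -c`."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return [(n, line) for line, n in sorted(counts.items())]


# Hypothetical input; with a real file, pass an open file object instead.
sample = ["3\n", "1\n", "3\n", "2\n", "3\n", "1\n"]

for n, line in count_lines(sample):
    print(f"{n:7d} {line}")
```

To list only the lines that occur more than once, filter the pairs on n > 1.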

Pandas Dataframe: Show duplicate rows - with exact duplicates

To keep the solution readable and general, so that it works for more or fewer than three columns, I'd write a dedicated function that uses pandas' built-in functionality for finding duplicates and apply it to the dataframe rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1-233', '2-766g', '6-455', '4-356', '5-253', '2-122', '5-531', '8- 345', '1-505', '3-127', '3-622'],
                   'col2': ['6-998', '2-766g', '5-955', '7-236', '5-253', '7-258', '8-987t', '7-567', '1-505', '6-876', 'NaN'],
                   'col3': ['3-957', 'NaN', 'NaN', '3-602m', '1-266', '2-122', '7-834', '8-345', '2-858', '7-984g', 'NaN']})

def get_duplicate_value(row):
    """If row has duplicates, return that value, else NaN."""
    duplicate_locations = row.duplicated()
    if duplicate_locations.any():
        # idxmax returns the label of the first True, i.e. the first repeated value
        dup_index = duplicate_locations.idxmax()
        return row[dup_index]
    return np.nan

df["solution"] = df.apply(get_duplicate_value, axis=1)

Check out the docs of pd.DataFrame.apply, pd.Series.duplicated, pd.Series.any and pd.Series.idxmax to see exactly how this works.


      col1    col2    col3 solution
0    1-233   6-998   3-957      NaN
1   2-766g  2-766g     NaN   2-766g
2    6-455   5-955     NaN      NaN
3    4-356   7-236  3-602m      NaN
4    5-253   5-253   1-266    5-253
5    2-122   7-258   2-122    2-122
6    5-531  8-987t   7-834      NaN
7   8- 345   7-567   8-345      NaN
8    1-505   1-505   2-858    1-505
9    3-127   6-876  7-984g      NaN
10   3-622     NaN     NaN      NaN

Display duplicate records in data.frame and omit single ones

A solution using duplicated twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]

   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8

An alternative solution with by:

village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) > 1) x)), ]

pandas finding duplicate rows with different label

In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:

g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup())  # number each group
   .loc[g.transform('size').gt(1)]    # keep only groups with more than one row
   # line below only to have a nice cluster_index range (0,1…)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)


   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
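
For a runnable version of the idea, here is a sketch with made-up input. Rows 1 to 4 mirror the output above; row 0 is an assumed non-duplicate added only so there is something to filter out.

```python
import pandas as pd

# Hypothetical input: row 0 has a unique (feature_1, feature_2) pair.
df = pd.DataFrame({
    "feature_1": [3, 0, 0, 4, 4],
    "feature_2": [2, 5, 5, 1, 1],
    "label": ["C", "A", "B", "B", "D"],
})

g = df.groupby(["feature_1", "feature_2"])

# Number every group, then keep only groups with more than one row.
out = df.assign(cluster_index=g.ngroup())
out = out[g["label"].transform("size") > 1]

# Renumber the surviving groups so the index runs 0, 1, ... without gaps.
out = out.assign(cluster_index=out["cluster_index"].factorize()[0])

print(out)
```

The renumbering step matters here because ngroup numbers all groups, including the one that is filtered out afterwards.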

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works. Passing keep=False marks every row of a duplicated (month, year) pair, rather than all but the first:

df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
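
Here is the same pattern on a small made-up frame, as a sketch. The names and dates are illustrative assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical data: (1, 2020) and (2, 2021) each occur twice, (3, 2021) once.
df = pd.DataFrame({
    "name": ["ann", "bob", "cat", "dan", "eve"],
    "month": [1, 1, 2, 3, 2],
    "year": [2020, 2020, 2021, 2021, 2021],
})

# Keep every row whose (month, year) pair appears more than once.
dups = df[df.duplicated(subset=["month", "year"], keep=False)]
dups = dups.sort_values(by=["name", "month", "year"], ascending=False)

print(dups)
```

The row for dan is dropped because its (month, year) pair is unique.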

remove duplicate rows based on specific criteria with pandas

First create a mask that separates duplicate from non-duplicate rows based on Id, then concatenate the duplicate slice, excluding rows whose values are all 0, with the non-duplicate slice.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
...            df.loc[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
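
The same selection can also be written as a single boolean mask, which keeps the original row order instead of placing the two slices one after the other. This is an alternative sketch, not the concat approach above, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: Id 1 and Id 3 are duplicated, Id 2 is unique.
df = pd.DataFrame({
    "Id": [1, 1, 2, 3, 3],
    "Name": ["a", "b", "c", "d", "e"],
    "Sales": [0, 5, 0, 0, 0],
    "Rent": [0, 0, 0, 7, 0],
    "Rate": [0, 1, 0, 0, 0],
})

value_cols = ["Sales", "Rent", "Rate"]

is_dup = df.duplicated("Id", keep=False)        # True for every repeated Id
has_value = df[value_cols].ne(0).any(axis=1)    # True if any value is non-zero

# Keep unique Ids as they are; keep duplicated Ids only when they carry a value.
result = df[~is_dup | has_value]

print(result)
```

Row 2 survives even though all its values are 0, because its Id is not duplicated.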

