How to Output Duplicated Rows

How to output duplicated rows

You can do this with duplicated, which checks for duplicated rows when passed a matrix or data frame. Since you only want to compare the first three columns, pass dat[,-4] to the function. Calling it a second time with fromLast = TRUE also flags the first occurrence of each duplicate, so every copy shows up in the output:

dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast = TRUE), ]
#   x1 x2 x3 x4
# 1 34 14 45 53
# 2  2  8 18 17
# 3 34 14 45 20
# 5  2  8 18  5
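For what it's worth, pandas expresses the same both-directions idea in a single call via keep=False. A minimal sketch, with the frame rebuilt from the output above (the values in row 3 are invented, since the non-duplicate row never appears in the output):

import pandas as pd

# rebuilt from the R output above; the row with x1=7 is a made-up non-duplicate
dat = pd.DataFrame({'x1': [34, 2, 34, 7, 2],
                    'x2': [14, 8, 14, 3, 8],
                    'x3': [45, 18, 45, 52, 18],
                    'x4': [53, 17, 20, 46, 5]})

# keep=False flags every occurrence, like the two duplicated() calls combined
print(dat[dat.duplicated(subset=['x1', 'x2', 'x3'], keep=False)])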

How do I get a list of all the duplicate items using pandas in Python?

Method #1: print all rows where the ID appears among the duplicated IDs:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to avoid repeating ids three times in that expression. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
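As a side note, the isin dance in method #1 can be collapsed: Series.duplicated with keep=False marks every occurrence directly, the same trick several later answers on this page rely on:

df[df["ID"].duplicated(keep=False)].sort_values("ID")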

How to find duplicated rows of data and output them

Use df.duplicated with keep=False to get a boolean mask of your duplicated rows, then use it to extract those rows:

import pandas as pd

# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)

# increment index to match line number
df.index += 1

# keep duplicate entries
out = df[df[0].duplicated(keep=False)]

# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)

Content of output file:

15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554
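For context, this assumes each line of names_dup2.csv is a single quoted field containing a tab-separated name and number. A hypothetical reconstruction, just to make the snippet reproducible (the exact input layout is an assumption; only the output above is known):

# hypothetical input writer, inferred from the output above
with open('names_dup2.csv', 'w') as f:
    f.write('"ANDREW ZHAO CHONG\t83091746"\n'
            '"KOH KANG RI\t89943392"\n'
            '"ANDREW ZHAO CHONG\t83091746"\n')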

One-line version (the assign/set_index pair reproduces the index shift, and the lambda inside the square brackets applies the duplicated mask):

pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
  .str.split('\t', expand=True) \
  .assign(index=lambda x: x.index + 1) \
  .set_index('index') \
  [lambda x: x[0].duplicated(keep=False)] \
  .to_csv('duplicated_data.csv', header=False)

Find duplicate lines in a file and count how many time each line was duplicated?

Assuming there is one number per line:

sort <file> | uniq -c

With the GNU version (e.g., on Linux), you can use the more verbose --count flag instead:

sort <file> | uniq --count
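If you would rather stay in Python, collections.Counter gives roughly the same tally; a small sketch, with file.txt standing in for your file:

from collections import Counter

# count identical lines, like `sort <file> | uniq -c` (output order differs)
with open('file.txt') as f:
    counts = Counter(line.rstrip('\n') for line in f)

for line, n in counts.items():
    print(n, line)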

Pandas Dataframe: Show duplicate rows - with exact duplicates

To keep things readable and general, so it works for more or fewer than three columns, I'd write a dedicated function that uses pandas' built-in duplicate detection and apply it to the dataframe rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['1-233', '2-766g', '6-455', '4-356', '5-253', '2-122',
             '5-531', '8- 345', '1-505', '3-127', '3-622'],
    'col2': ['6-998', '2-766g', '5-955', '7-236', '5-253', '7-258',
             '8-987t', '7-567', '1-505', '6-876', 'NaN'],
    'col3': ['3-957', 'NaN', 'NaN', '3-602m', '1-266', '2-122',
             '7-834', '8-345', '2-858', '7-984g', 'NaN'],
})

def get_duplicate_value(row):
    """If the row contains a duplicated value, return it, else NaN."""
    duplicate_locations = row.duplicated()
    if duplicate_locations.any():
        dup_index = duplicate_locations.idxmax()
        return row[dup_index]
    return np.nan

df["solution"] = df.apply(get_duplicate_value, axis=1)

Check out the docs for pd.DataFrame.apply, pd.Series.duplicated, pd.Series.any and pd.Series.idxmax to figure out how this works exactly.

Output:

      col1    col2    col3 solution
0    1-233   6-998   3-957      NaN
1   2-766g  2-766g     NaN   2-766g
2    6-455   5-955     NaN      NaN
3    4-356   7-236  3-602m      NaN
4    5-253   5-253   1-266    5-253
5    2-122   7-258   2-122    2-122
6    5-531  8-987t   7-834      NaN
7   8- 345   7-567   8-345      NaN
8    1-505   1-505   2-858    1-505
9    3-127   6-876  7-984g      NaN
10   3-622     NaN     NaN      NaN
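To see the mechanics on a single row, here is a minimal sketch using the values from index 4 above:

import pandas as pd

row = pd.Series(['5-253', '5-253', '1-266'], index=['col1', 'col2', 'col3'])
print(row.duplicated())                # col1 False, col2 True, col3 False
print(row.duplicated().any())          # True: the row contains a duplicate
print(row.duplicated().idxmax())       # 'col2', the label of the first True
print(row[row.duplicated().idxmax()])  # '5-253', the duplicated value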

Display duplicate records in data.frame and omit single ones

A solution using duplicated twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]

   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8

An alternative solution with by (length(x) - 1 is zero, hence falsy, for names that occur only once, so only groups with duplicates contribute row indices):

village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) - 1) x)), ]

pandas finding duplicate rows with different label

In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:

g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup())   # get group number
   .loc[g.transform('size').gt(1)]     # filter out the non-duplicates
   # line below only to get a compact cluster_index range (0, 1, ...)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)

output:

   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
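For a runnable version, here is a sketch with an assumed input (row 0 is a made-up singleton, which the size filter drops):

import pandas as pd

df = pd.DataFrame({'feature_1': [3, 0, 0, 4, 4],
                   'feature_2': [9, 5, 5, 1, 1],
                   'label':     ['C', 'A', 'B', 'B', 'D']})

g = df.groupby(['feature_1', 'feature_2'])['label']
out = (df.assign(cluster_index=g.ngroup())
         .loc[g.transform('size').gt(1)]
         .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0]))
print(out)  # matches the output above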

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep=False, which marks every occurrence of a duplicate instead of skipping the first:

df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)
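A quick sketch with invented data, to show what the subset filter keeps:

import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({'name':  ['a', 'b', 'c', 'd'],
                   'month': [1, 1, 2, 3],
                   'year':  [2020, 2020, 2020, 2021]})

dupes = df[df.duplicated(subset=['month', 'year'], keep=False)]
print(dupes)  # rows 'a' and 'b' share month=1, year=2020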

Remove duplicate rows based on specific criteria with pandas

First create a mask separating duplicate and non-duplicate rows based on Id, then concatenate the non-duplicate slice with the duplicate rows whose Sales, Rent and Rate values are not all 0.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
...            df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
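Equivalently, the two slices can be combined into a single boolean mask, which also preserves the original row order:

# keep a row if it is not a duplicate, or if any of the three values is nonzero
mask = ~duplicateMask | df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)
df[mask]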

