How to Output Duplicated Rows

How to output duplicated rows

You can do this with duplicated, which checks for duplicated rows when passed a matrix or data frame. Since you only want to compare the first three columns, pass dat[,-4] to the function. Calling it a second time with fromLast = TRUE also flags the first occurrence of each duplicate, so every copy shows up in the output:

dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast = TRUE), ]
#   x1 x2 x3 x4
# 1 34 14 45 53
# 2  2  8 18 17
# 3 34 14 45 20
# 5  2  8 18  5
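For what it's worth, pandas expresses the same both-directions idea in a single call via keep=False. A minimal sketch, with the frame rebuilt from the output above (the values in row 3 are invented, since the non-duplicate row never appears in the output):

import pandas as pd

# rebuilt from the R output above; the row with x1=7 is a made-up non-duplicate
dat = pd.DataFrame({'x1': [34, 2, 34, 7, 2],
                    'x2': [14, 8, 14, 3, 8],
                    'x3': [45, 18, 45, 52, 18],
                    'x4': [53, 17, 20, 46, 5]})

# keep=False flags every occurrence, like the two duplicated() calls combined
print(dat[dat.duplicated(subset=['x1', 'x2', 'x3'], keep=False)])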

How do I get a list of all the duplicate items using pandas in Python?

Method #1: print all rows where the ID appears among the duplicated IDs:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to avoid repeating ids three times in that expression. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
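As a side note, the isin dance in method #1 can be collapsed: Series.duplicated with keep=False marks every occurrence directly, the same trick several later answers on this page rely on:

df[df["ID"].duplicated(keep=False)].sort_values("ID")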

How to find duplicated rows of data and output them

Use df.duplicated with keep=False to get a boolean mask of your duplicated rows, then use it to extract those rows:

import pandas as pd

# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)

# increment index to match line number
df.index += 1

# keep duplicate entries
out = df[df[0].duplicated(keep=False)]

# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)

Content of output file:

15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554
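For context, this assumes each line of names_dup2.csv is a single quoted field containing a tab-separated name and number. A hypothetical reconstruction, just to make the snippet reproducible (the exact input layout is an assumption; only the output above is known):

# hypothetical input writer, inferred from the output above
with open('names_dup2.csv', 'w') as f:
    f.write('"ANDREW ZHAO CHONG\t83091746"\n'
            '"KOH KANG RI\t89943392"\n'
            '"ANDREW ZHAO CHONG\t83091746"\n')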

One-line version (the assign/set_index pair reproduces the index shift, and the lambda inside the square brackets applies the duplicated mask):

pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
  .str.split('\t', expand=True) \
  .assign(index=lambda x: x.index + 1) \
  .set_index('index') \
  [lambda x: x[0].duplicated(keep=False)] \
  .to_csv('duplicated_data.csv', header=False)

Find duplicate lines in a file and count how many time each line was duplicated?

Assuming there is one number per line:

sort <file> | uniq -c

With the GNU version (e.g., on Linux), you can use the more verbose --count flag instead:

sort <file> | uniq --count
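If you would rather stay in Python, collections.Counter gives roughly the same tally; a small sketch, with file.txt standing in for your file:

from collections import Counter

# count identical lines, like `sort <file> | uniq -c` (output order differs)
with open('file.txt') as f:
    counts = Counter(line.rstrip('\n') for line in f)

for line, n in counts.items():
    print(n, line)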

Pandas Dataframe: Show duplicate rows - with exact duplicates

To keep things readable and general, so it works for more or fewer than three columns, I'd write a dedicated function that uses pandas' built-in duplicate detection and apply it to the dataframe rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['1-233', '2-766g', '6-455', '4-356', '5-253', '2-122',
             '5-531', '8- 345', '1-505', '3-127', '3-622'],
    'col2': ['6-998', '2-766g', '5-955', '7-236', '5-253', '7-258',
             '8-987t', '7-567', '1-505', '6-876', 'NaN'],
    'col3': ['3-957', 'NaN', 'NaN', '3-602m', '1-266', '2-122',
             '7-834', '8-345', '2-858', '7-984g', 'NaN'],
})

def get_duplicate_value(row):
    """If the row contains a duplicated value, return it, else NaN."""
    duplicate_locations = row.duplicated()
    if duplicate_locations.any():
        dup_index = duplicate_locations.idxmax()
        return row[dup_index]
    return np.nan

df["solution"] = df.apply(get_duplicate_value, axis=1)

Check out the docs for pd.DataFrame.apply, pd.Series.duplicated, pd.Series.any and pd.Series.idxmax to figure out how this works exactly.

Output:

      col1    col2    col3 solution
0    1-233   6-998   3-957      NaN
1   2-766g  2-766g     NaN   2-766g
2    6-455   5-955     NaN      NaN
3    4-356   7-236  3-602m      NaN
4    5-253   5-253   1-266    5-253
5    2-122   7-258   2-122    2-122
6    5-531  8-987t   7-834      NaN
7   8- 345   7-567   8-345      NaN
8    1-505   1-505   2-858    1-505
9    3-127   6-876  7-984g      NaN
10   3-622     NaN     NaN      NaN
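To see the mechanics on a single row, here is a minimal sketch using the values from index 4 above:

import pandas as pd

row = pd.Series(['5-253', '5-253', '1-266'], index=['col1', 'col2', 'col3'])
print(row.duplicated())                # col1 False, col2 True, col3 False
print(row.duplicated().any())          # True: the row contains a duplicate
print(row.duplicated().idxmax())       # 'col2', the label of the first True
print(row[row.duplicated().idxmax()])  # '5-253', the duplicated value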

Display duplicate records in data.frame and omit single ones

A solution using duplicated twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]

   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8

An alternative solution with by (length(x) - 1 is zero, hence falsy, for names that occur only once, so only groups with duplicates contribute row indices):

village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) - 1) x)), ]

pandas finding duplicate rows with different label

In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:

g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup())   # get group number
   .loc[g.transform('size').gt(1)]     # filter out the non-duplicates
   # line below only to get a compact cluster_index range (0, 1, ...)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)

output:

   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
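For a runnable version, here is a sketch with an assumed input (row 0 is a made-up singleton, which the size filter drops):

import pandas as pd

df = pd.DataFrame({'feature_1': [3, 0, 0, 4, 4],
                   'feature_2': [9, 5, 5, 1, 1],
                   'label':     ['C', 'A', 'B', 'B', 'D']})

g = df.groupby(['feature_1', 'feature_2'])['label']
out = (df.assign(cluster_index=g.ngroup())
         .loc[g.transform('size').gt(1)]
         .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0]))
print(out)  # matches the output above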

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep=False, which marks every occurrence of a duplicate instead of skipping the first:

df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)
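A quick sketch with invented data, to show what the subset filter keeps:

import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({'name':  ['a', 'b', 'c', 'd'],
                   'month': [1, 1, 2, 3],
                   'year':  [2020, 2020, 2020, 2021]})

dupes = df[df.duplicated(subset=['month', 'year'], keep=False)]
print(dupes)  # rows 'a' and 'b' share month=1, year=2020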

Remove duplicate rows based on specific criteria with pandas

First create a mask separating duplicate and non-duplicate rows based on Id, then concatenate the non-duplicate slice with the duplicate rows whose Sales, Rent and Rate values are not all 0.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
...            df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
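Equivalently, the two slices can be combined into a single boolean mask, which also preserves the original row order:

# keep a row if it is not a duplicate, or if any of the three values is nonzero
mask = ~duplicateMask | df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)
df[mask]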

