Identify Duplicates and Mark First Occurrence and All Others

Identify duplicates and mark first occurrence and all others

When I saw this question I asked myself, "What would Jim Holtman or Bill Dunlap advise on R-help?" I haven't looked in the archives, but I think they might have advised using two "parallel" applications of duplicated, one with the defaults and one with fromLast = TRUE, and joining them with the vectorized OR operator (|).
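The matrix m isn't shown in the question; a toy matrix consistent with the output below (an assumption for illustration) is:

m <- matrix(c(1, 2, 1, 3, 1,        # first column: the value 1 repeats
              10, 20, 30, 40, 50),  # second column: arbitrary payload
            ncol = 2)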

duplicated(m[,1]) | duplicated(m[,1], fromLast = TRUE)
[1]  TRUE FALSE  TRUE FALSE  TRUE

Identify duplicate together with original observation in R (maybe by clustering)

Using the dplyr package:

library(dplyr)

# filter on n(), without creating a new column
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

# filter on n, creating a new column
df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1)
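The question's df and its grouping columns v1, v2, v3 aren't shown; a toy data frame (an assumption for illustration) makes the first variant concrete:

df <- data.frame(v1 = c(1, 1, 2, 3),
                 v2 = c("a", "a", "b", "c"),
                 v3 = c(10, 10, 20, 30))
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)
# returns both copies of the duplicated (1, "a", 10) row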

Sum duplicates then remove all but first occurrence

I got different sums at first, but that was only because I forgot to set the seed:

> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
    id year    month     x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December  7.16
6 8503 2007  January 13.23
7 2961 2007  January  7.38
9 8564 2007  January  7.21

(To be safer, it would be better to work on a copy, and you might need to add an ordering step; a sketch follows below.)
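A minimal sketch of that safer variant, assuming the same dat1: order an explicit copy by id first, so that "first occurrence" is well defined, then collapse as above:

dat2 <- dat1[order(dat1$id), ]             # ordered copy; dat1 stays untouched
dat2$x <- ave(dat2$x, dat2$id, FUN = sum)  # per-id sums
dat2[!duplicated(dat2$id), ]               # keep the first row of each id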

Flag duplicates in R

We can use duplicated with and without fromLast = TRUE to mark all the values that are repeated as 1.
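To make the snippet below self-contained, dataset can be reconstructed from the printed output:

dataset <- data.frame(id    = rep(c("A", "B"), each = 4),
                      value = c(1, 1, 2, 3, 5, 6, 6, 7))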

dataset$flag <- as.integer(duplicated(dataset$value) |
                           duplicated(dataset$value, fromLast = TRUE))
dataset

#  id value flag
#1  A     1    1
#2  A     1    1
#3  A     2    0
#4  A     3    0
#5  B     5    0
#6  B     6    1
#7  B     6    1
#8  B     7    0

How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

There is a DataFrame method, duplicated, for the first part:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])

To do the second part, you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]:
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Identifying which values are duplicates in R

You could try a table:

x <- c(1,2,3,4,5,7,5,7)
tab <- table(x) > 1
x[x %in% names(which(tab))]
# [1] 5 7 5 7

Another method, inspired by @rawr's comment, is:

x %in% x[duplicated(x)]
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[ x %in% x[duplicated(x)] ]
# [1] 5 7 5 7
which(x %in% x[duplicated(x)])
# [1] 5 6 7 8

MS Access Mark Duplicates in order of appearance

Assuming you have a unique ID field, you might say:

SELECT dups.Fields, dups.ID,
       (SELECT Count(*)
        FROM dups AS a
        WHERE a.Fields = dups.Fields AND a.ID <= dups.ID) AS RankOfDup
FROM dups
ORDER BY dups.Fields, dups.ID;

To simply get a count of duplicates, you can say:

SELECT dups.ID, Count(dups.ID)
FROM dups
GROUP BY dups.ID
HAVING Count(dups.ID) > 1;

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

This works, but I couldn't think of a nice way to prevent repeating the IDs so many times, so I prefer method #2: a groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

