Identify Duplicates and Mark First Occurrence and All Others

Identify duplicates and mark first occurrence and all others

When I saw this question I asked myself, "What would Jim Holtman or Bill Dunlap advise on R-help?" I haven't looked in the archives, but I think they might have advised using two "parallel" applications of duplicated, one with the defaults and one with fromLast = TRUE, and joining them with the vectorized OR operator (|).
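The matrix m isn't shown in the question; a toy matrix consistent with the output below (an assumption for illustration) is:

m <- matrix(c(1, 2, 1, 3, 1,        # first column: the value 1 repeats
              10, 20, 30, 40, 50),  # second column: arbitrary payload
            ncol = 2)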

duplicated(m[,1]) | duplicated(m[,1], fromLast = TRUE)
[1]  TRUE FALSE  TRUE FALSE  TRUE

Identify duplicate together with original observation in R (maybe by clustering)

Using the dplyr package:

library(dplyr)

# filter on n(), without creating a new column
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

# filter on n, creating a new column
df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1)
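The question's df and its grouping columns v1, v2, v3 aren't shown; a toy data frame (an assumption for illustration) makes the first variant concrete:

df <- data.frame(v1 = c(1, 1, 2, 3),
                 v2 = c("a", "a", "b", "c"),
                 v3 = c(10, 10, 20, 30))
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)
# returns both copies of the duplicated (1, "a", 10) row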

Sum duplicates then remove all but first occurrence

I got different sums at first, but that was only because I forgot to set the seed:

> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
    id year    month     x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December  7.16
6 8503 2007  January 13.23
7 2961 2007  January  7.38
9 8564 2007  January  7.21

(To be safer, it would be better to work on a copy, and you might need to add an ordering step; a sketch follows below.)
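A minimal sketch of that safer variant, assuming the same dat1: order an explicit copy by id first, so that "first occurrence" is well defined, then collapse as above:

dat2 <- dat1[order(dat1$id), ]             # ordered copy; dat1 stays untouched
dat2$x <- ave(dat2$x, dat2$id, FUN = sum)  # per-id sums
dat2[!duplicated(dat2$id), ]               # keep the first row of each id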

Flag duplicates in R

We can use duplicated with and without fromLast = TRUE to mark all the values that are repeated as 1.
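To make the snippet below self-contained, dataset can be reconstructed from the printed output:

dataset <- data.frame(id    = rep(c("A", "B"), each = 4),
                      value = c(1, 1, 2, 3, 5, 6, 6, 7))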

dataset$flag <- as.integer(duplicated(dataset$value) |
                           duplicated(dataset$value, fromLast = TRUE))
dataset

#  id value flag
#1  A     1    1
#2  A     1    1
#3  A     2    0
#4  A     3    0
#5  B     5    0
#6  B     6    1
#7  B     6    1
#8  B     7    0

How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

There is a DataFrame method, duplicated, for the first part:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])

To do the second part, you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]:
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Identifying which values are duplicates in R

You could try a table:

x <- c(1,2,3,4,5,7,5,7)
tab <- table(x) > 1
x[x %in% names(which(tab))]
# [1] 5 7 5 7

Another method, inspired by @rawr's comment, is:

x %in% x[duplicated(x)]
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[ x %in% x[duplicated(x)] ]
# [1] 5 7 5 7
which(x %in% x[duplicated(x)])
# [1] 5 6 7 8

MS Access Mark Duplicates in order of appearance

Assuming you have a unique ID field, you might say:

SELECT dups.Fields, dups.ID,
       (SELECT Count(*)
        FROM dups AS a
        WHERE a.Fields = dups.Fields AND a.ID <= dups.ID) AS RankOfDup
FROM dups
ORDER BY dups.Fields, dups.ID;

To simply get a count of duplicates, you can say:

SELECT dups.ID, Count(dups.ID)
FROM dups
GROUP BY dups.ID
HAVING Count(dups.ID) > 1;

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

This works, but I couldn't think of a nice way to prevent repeating the IDs so many times, so I prefer method #2: a groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

