Identify duplicates and mark first occurrence and all others
When I saw this question I asked myself "what would Jim Holtman or Bill Dunlap advise on R-help?". I haven't looked in the archives, but I think they might have advised two "parallel" applications of duplicated, one with the defaults and one with fromLast = TRUE, conjoined with the vectorized OR (|) operator.
duplicated(m[,1]) | duplicated(m[,1], fromLast=TRUE)
[1] TRUE FALSE TRUE FALSE TRUE
Identify duplicate together with original observation in R (maybe by clustering)
Using dplyr package:
library(dplyr)
#filter on n, do not create new column
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)
#filter on n, create new column
df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1)
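The same filter-on-group-size idea carries over to pandas; a minimal sketch, with made-up data and the column names v1, v2, v3 borrowed from the dplyr call above:

```python
import pandas as pd

# Hypothetical data: the first two rows share the same (v1, v2, v3) key
df = pd.DataFrame({
    "v1": [1, 1, 2, 3],
    "v2": ["a", "a", "b", "c"],
    "v3": [10, 10, 20, 30],
})

# Keep only rows whose (v1, v2, v3) combination occurs more than once,
# the analogue of group_by(...) %>% filter(n() > 1)
dupes = df.groupby(["v1", "v2", "v3"]).filter(lambda g: len(g) > 1)
```

As in dplyr, the filtered frame keeps its original row order and index.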
Sum duplicates then remove all but first occurrence
I got different sums at first, but that was because I forgot to set the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safe, it would be better to work on a copy, and you might need to add an ordering step first.)
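The ave-then-deduplicate pattern above maps cleanly onto pandas; a sketch on a small hypothetical frame shaped like dat1:

```python
import pandas as pd

# Hypothetical data standing in for dat1 in the answer above
dat1 = pd.DataFrame({
    "id":   [1234, 1234, 1321],
    "year": [2006, 2006, 2006],
    "x":    [10.00, 15.18, 15.06],
})

# Analogue of ave(dat1$x, dat1$id, FUN=sum): replace x by its per-id total
dat1["x"] = dat1.groupby("id")["x"].transform("sum")

# Analogue of dat1[!duplicated(dat1$id), ]: keep the first row of each id
result = dat1.drop_duplicates("id")
```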
Flag duplicates in R
We can use duplicated both with and without fromLast = TRUE to mark every repeated value with a 1.
dataset$flag <- as.integer(duplicated(dataset$value) |
duplicated(dataset$value, fromLast = TRUE))
dataset
# id value flag
#1 A 1 1
#2 A 1 1
#3 A 2 0
#4 A 3 0
#5 B 5 0
#6 B 6 1
#7 B 6 1
#8 B 7 0
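For reference, pandas collapses the two-call R idiom into one argument: duplicated(keep=False) flags every member of a duplicate group. A sketch on the same hypothetical values:

```python
import pandas as pd

# Hypothetical data matching the R example above
dataset = pd.DataFrame({"id": list("AAAABBBB"),
                        "value": [1, 1, 2, 3, 5, 6, 6, 7]})

# keep=False plays the role of duplicated() | duplicated(fromLast = TRUE)
dataset["flag"] = dataset["value"].duplicated(keep=False).astype(int)
```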
How to identify the first occurrence of duplicate rows in Python pandas Dataframe
For the first part, there is a DataFrame method, duplicated:
In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0 False
1 False
2 True
In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
To do the second you could try something like this:
In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])
In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])
In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])
In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
In [17]: df
Out[17]:
Column1 Column2 Column3 Column4 is_duplicated dup_index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
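A shorter route to the same dup_index column, assuming the three-key frame above: group the row labels themselves and take each group's first label with transform, which avoids building a separate GroupBy and indexing into g.indices.

```python
import pandas as pd

# Toy frame mirroring the example above
df = pd.DataFrame({
    "Column1": [1, 2, 3],
    "Column2": ["ABC", "XYZ", "ABC"],
    "Column3": ["DEF", "DEF", "DEF"],
    "Column4": [10, 40, 10],
})

keys = ["Column2", "Column3", "Column4"]
df["is_duplicated"] = df.duplicated(keys)

# For each row, the index of the first row sharing its key
df["dup_index"] = df.index.to_series().groupby([df[k] for k in keys]).transform("first")
```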
Identifying which values are duplicates in R
You could build a frequency table with table
x <- c(1,2,3,4,5,7,5,7)
tab <- table(x) > 1
x[x %in% names(which(tab))]
# [1] 5 7 5 7
Another method inspired by @rawr's comment is
x %in% x[duplicated(x)]
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[ x %in% x[duplicated(x)] ]
# [1] 5 7 5 7
which(x %in% x[duplicated(x)])
# [1] 5 6 7 8
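The x %in% x[duplicated(x)] idiom has a direct pandas analogue via Series.isin; a sketch on the same values:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 7, 5, 7])

# x[x.duplicated()] holds the second-and-later occurrences (5 and 7);
# isin then marks every occurrence of those values, first ones included
mask = x.isin(x[x.duplicated()])
dup_values = x[mask]
```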
MS Access Mark Duplicates in order of appearance
Assuming you have a unique ID field, you might say:
SELECT dups.FIELDS, dups.ID, (
SELECT Count(*)
FROM dups a
WHERE a.Fields=dups.Fields And a.ID <= dups.ID) AS RankOfDup
FROM dups
ORDER BY dups.FIELDS, dups.ID;
To simply get a count for each duplicated ID, you can say:
SELECT dups.ID, Count(dups.ID) FROM dups
GROUP BY dups.ID
HAVING Count(dups.ID) > 1
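The correlated-subquery ranking above is standard SQL, not Access-specific; here is a sketch running it against an in-memory SQLite database (the table and sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dups (ID INTEGER PRIMARY KEY, Fields TEXT)")
conn.executemany("INSERT INTO dups (ID, Fields) VALUES (?, ?)",
                 [(1, "a"), (2, "a"), (3, "b")])

# Rank each row within its Fields group by ID, as in the Access query
rows = conn.execute("""
    SELECT dups.Fields, dups.ID,
           (SELECT COUNT(*) FROM dups a
            WHERE a.Fields = dups.Fields AND a.ID <= dups.ID) AS RankOfDup
    FROM dups
    ORDER BY dups.Fields, dups.ID
""").fetchall()
```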
How do I get a list of all the duplicate items using pandas in python?
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
but I couldn't think of a nice way to avoid repeating ids so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
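Current pandas also accepts keep=False in duplicated, which marks every row of a duplicate group and reduces method #1 to a one-liner; a sketch on a toy frame standing in for the CSV above:

```python
import pandas as pd

# Toy data: 11795 appears twice, the other IDs once
df = pd.DataFrame({"ID": [11795, 8096, 11795, 555],
                   "site": ["NH", "VT", "VT", "NH"]})

# keep=False flags every member of a duplicate group, not just the repeats
all_dups = df[df.duplicated("ID", keep=False)].sort_values("ID")
```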