Filtering Out Duplicated/Non-Unique Rows in Data.Table

Filtering out duplicated/non-unique rows in data.table

For v1.9.8+ (released November 2016)

From ?unique.data.table
By default all columns are being used (which is consistent with ?unique.data.frame)

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or using the by argument in order to get unique combinations of specific columns (like previously keys were used for)

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

Prior v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

Extracting unique rows from a data table in R

Before data.table v1.9.8, the default behavior of unique.data.table method was to use the keys in order to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in OPs situation).

As of data.table 1.9.8+, unique.data.table method uses all columns by default which is consistent with the unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of the data.table).

Hence, old behavior would be something like

library(data.table) v1.9.7-
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b)) 
## [1] 8 3

While for data.table v1.9.8+, just

b <- data.table(a) 
dim(unique(b)) 
## [1] 8 3
## or dim(unique(b, by = key(b)) # in case you have keys you want to use them

Or without a copy

setDT(a)
dim(unique(a))
## [1] 8 3

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Here some data.table-based solutions.

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type"
        )[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE), 
          on=c('id', 'year'), 
          mult="last"]

#       id  year   type
#    <num> <num> <char>
# 1:     1  2002      A
# 2:     2  2008      B
# 3:     3  2010      D
# 4:     3  2013   <NA>
# 5:     4  2020      C
# 6:     5  2009      A
# 7:     6  2010      B
# 8:     6  2012   <NA>

method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of [ overhead)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]

How to remove all duplicated rows in data.table in r

We group by 'ID', get a logical index with duplicated on the 'Date', and negate so that all the unique elements are now TRUE, use .I to get the row index, extract the index column 'V1' and use that to subset 'dt'.

dt[dt[, .I[!(duplicated(Date)|duplicated(Date, fromLast=TRUE))], ID]$V1]
#      Date  ID INC
#1: 201505 500  80
#2: 201504 600  50

Or another option would be to group by 'Date', 'ID' and if the nrow is equal to 1 (.N==1), we get the Subset of Data.table (.SD).

dt[, if(.N==1) .SD, .(Date, ID)]
#     Date  ID INC
#1: 201504 600  50
#2: 201505 500  80

Or as @Frank mentioned, we can use a data.table/base R combo

DT[ave(seq(.N), Date, ID, FUN = function(x) length(x) == 1L)]

R data.table - only keep rows with duplicate ID (most efficient solution)

We can use .I to get the index of groups with frequency count greater than 1, extract the column and subset the data.table

dt[dt[, .I[.N >1], .(x, y)]$V1]

NOTE: It should be faster than .SD

Extracting unique rows in R data table based on another column

Subset in the j part :

library(data.table)
setDT(df)
df[, .SD[!duplicated(Color)], Year]

#   Year  Color X Y
#1: 2014    red 1 3
#2: 2014   blue 1 3
#3: 2015    red 1 3
#4: 2015   blue 1 3
#5: 2015 yellow 1 3

Another approach is to group by Year and Color and select the first row.

df[, .SD[seq_len(.N) == 1], .(Year, Color)]

Or the most easy one is to select unique rows and specify by :

unique(df, by = c('Year', 'Color'))

data

df <- structure(list(Year = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L
), Color = c("red", "red", "blue", "red", "blue", "yellow"), 
X = c(1L, 1L, 1L, 1L, 1L, 1L), Y = c(3L, 3L, 3L, 3L, 3L, 
3L)), class = "data.frame", row.names = c(NA, -6L))

Removing rows in R only if they are duplicated in direct succession

Using rleid from data.table we create a dummy-grouping variable, and with distinct from dplyr we remove the duplicates. In your data you may want to include Transponder in the rleid function, if it does vary in your real data.

library(tidyverse)
library(data.table)

df %>% 
  mutate(dummy = rleid(Units)) %>% 
  distinct(dummy, .keep_all = T) %>% 
  select(-dummy)

        Date           TimeStamp Transponder Units
1 2021-08-15 2021-08-15-14:11:13    DA2C614E    M2
2 2021-08-15 2021-08-15-14:12:40    DA2C614E   HM2
3 2021-08-15 2021-08-15-14:12:49    DA2C614E    H2
4 2021-08-15 2021-08-15-14:18:02    DA2C614E    H1
5 2021-08-15 2021-08-15-14:25:29    DA2C614E   HM2

Using just data.table and no temporary variable you could do the following: dt[!duplicated(rleid(Units)),], based on comments.

Best way to remove duplicate entries from a data table

Remove Duplicates

public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
   Hashtable hTable = new Hashtable();
   ArrayList duplicateList = new ArrayList();

   //Add list of all the unique item value to hashtable, which stores combination of key, value pair.
   //And add duplicate item value in arraylist.
   foreach (DataRow drow in dTable.Rows)
   {
      if (hTable.Contains(drow[colName]))
         duplicateList.Add(drow);
      else
         hTable.Add(drow[colName], string.Empty); 
   }

   //Removing a list of duplicate items from datatable.
   foreach (DataRow dRow in duplicateList)
      dTable.Rows.Remove(dRow);

   //Datatable which contains unique records will be return as output.
      return dTable;
}

Here Links below

http://www.dotnetspider.com/resources/4535-Remove-duplicate-records-from-table.aspx

http://www.dotnetspark.com/kb/94-remove-duplicate-rows-value-from-datatable.aspx

For remove duplicates in column

http://dotnetguts.blogspot.com/2007/02/removing-duplicate-records-from.html

Filtering Out Duplicated/Non-Unique Rows in Data.Table