Filtering out duplicated/non-unique rows in data.table
For v1.9.8+ (released November 2016)
From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame).
unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G
Or use the by argument to get unique combinations of specific columns (as keys were previously used for):
unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G
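As a small sketch of how the two calls above relate to other data.table helpers (using the same dt defined further down in this answer), unique() with by should match subsetting with !duplicated(), and uniqueN() counts the distinct rows without materialising them. The variable names u_all, u_v2 and via_dup are only illustrative.

```r
library(data.table)

# Same data as the dt used throughout this answer
dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

u_all <- unique(dt)             # distinct rows over all columns
u_v2  <- unique(dt, by = "V2")  # first row for each distinct V2

# duplicated() accepts the same by argument, so this is another way
# to write the by = "V2" call
via_dup <- dt[!duplicated(dt, by = "V2")]

# uniqueN() only counts the distinct rows
n_all <- uniqueN(dt)
n_v2  <- uniqueN(dt, by = "V2")
```

Here n_all and n_v2 agree with the row counts of the two printed results above (7 and 6).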
Prior to v1.9.8
From ?unique.data.table, it is clear that calling unique on a data.table only works on the key. This means you have to reset the key to all columns before calling unique.
library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
Calling unique with one column as key:
setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G
Extracting unique rows from a data table in R
Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).
As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) to unique (replacing DT in the call to key with the name of the data.table).
Hence, the old behavior would be something like
library(data.table) # v1.9.7 or earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # if you have keys and want to use them
Or without a copy
setDT(a)
dim(unique(a))
## [1] 8 3
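To make the by = key(DT) form mentioned above concrete, here is a minimal sketch on a small hypothetical table (DT and its columns id, grp and val are made up for illustration), assuming data.table v1.9.8 or later.

```r
library(data.table)

# Hypothetical keyed table: every row is distinct, but id repeats
DT <- data.table(
  id  = c(1, 1, 2, 2),
  grp = c("a", "a", "b", "c"),
  val = 1:4,
  key = "id"
)

all_cols <- unique(DT)                # v1.9.8+: compares all columns
by_key   <- unique(DT, by = key(DT))  # pre-1.9.8 style: key columns only
```

all_cols keeps all four rows because val differs on each one, while by_key keeps only the first row for each id.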
Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable
Here are some data.table-based solutions.
setDT(df_id_year_and_type)
method 1
na.omit(df_id_year_and_type, cols="type") drops the rows where column type is NA.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
By joining them (using the last match: mult="last"), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
method 3 (likely slower because of the overhead of calling [ on .SD for every group)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
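The original data is not shown here, so the following is only a sketch on a small hypothetical table (toy, with made-up values) of what which.max(cumsum(!is.na(type))) selects: within each (id, year) group it picks the position of the last non-NA type, and falls back to the first row when every type in the group is NA.

```r
library(data.table)

# Hypothetical stand-in for df_id_year_and_type
toy <- data.table(
  id   = c(1, 1, 2, 2, 3),
  year = c(2002, 2002, 2008, 2008, 2013),
  type = c("A", NA, NA, "B", NA)
)

# cumsum(!is.na(type)) increases at every non-NA value, and which.max()
# returns the first position of the maximum, i.e. the last non-NA row
idx <- toy[, .I[which.max(cumsum(!is.na(type)))], by = .(id, year)]$V1

res_index <- toy[idx]                                                  # method 2
res_sd    <- toy[, .SD[which.max(cumsum(!is.na(type)))], by = .(id, year)]  # method 3
```

Both results keep one row per group: type "A" for (1, 2002), "B" for (2, 2008), and NA for (3, 2013), which has no non-NA value to prefer.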
How to remove all duplicated rows in data.table in r
We group by 'ID', get a logical index with duplicated on 'Date' (in both directions), and negate it so that only the elements that occur exactly once are TRUE. We then use .I to get the row index, extract the index column 'V1', and use that to subset 'dt'.
dt[dt[, .I[!(duplicated(Date)|duplicated(Date, fromLast=TRUE))], ID]$V1]
# Date ID INC
#1: 201505 500 80
#2: 201504 600 50
Another option is to group by 'Date' and 'ID', and if the number of rows in the group is 1 (.N==1), return the Subset of Data.table (.SD).
dt[, if(.N==1) .SD, .(Date, ID)]
# Date ID INC
#1: 201504 600 50
#2: 201505 500 80
Or, as @Frank mentioned, we can use a data.table/base R combo
dt[ave(seq(.N), Date, ID, FUN = length) == 1L]
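Since the question's data is not reproduced here, the following sketch runs the first two approaches on a small hypothetical dt (values invented for illustration) in which the (201504, 500) combination appears twice and should therefore disappear completely.

```r
library(data.table)

# Hypothetical data: rows 1 and 2 share the same Date and ID
dt <- data.table(
  Date = c(201504, 201504, 201505, 201504),
  ID   = c(500, 500, 500, 600),
  INC  = c(10, 20, 80, 50)
)

# Row-index approach: flag a Date as duplicated from both directions,
# so every copy of a repeated value is dropped, not just the later ones
idx <- dt[, .I[!(duplicated(Date) | duplicated(Date, fromLast = TRUE))], by = ID]$V1
res_index <- dt[idx]

# Group-size approach: keep a group only when it has exactly one row
res_group <- dt[, if (.N == 1L) .SD, by = .(Date, ID)]
```

Both results contain the same two rows (INC 80 and INC 50); row order can differ between the approaches because the grouping columns differ.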
R data.table - only keep rows with duplicate ID (most efficient solution)
We can use .I to get the row indices of groups with a frequency count greater than 1, extract the index column, and subset the data.table
dt[dt[, .I[.N >1], .(x, y)]$V1]
NOTE: It should be faster than .SD
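A minimal sketch of the .I idiom on hypothetical data (the columns x, y and v are invented for illustration), next to the equivalent .SD form it is compared against. No timing claim is made here; the sketch only shows that the two select the same rows.

```r
library(data.table)

# Hypothetical data: (1, "a") and (3, "c") occur twice, the rest once
dt <- data.table(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "b", "c", "c", "d"),
  v = 1:6
)

# .I holds the row numbers of the current group; .N > 1 is a single
# TRUE/FALSE per group, so each group yields all of its rows or none
keep <- dt[, .I[.N > 1], by = .(x, y)]$V1
res_index <- dt[keep]

# Equivalent .SD form, which builds a sub-table for every group
res_sd <- dt[, if (.N > 1) .SD, by = .(x, y)]
```

Both keep the four rows with v equal to 1, 2, 4 and 5.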
Extracting unique rows in R data table based on another column
Subset in the j part:
library(data.table)
setDT(df)
df[, .SD[!duplicated(Color)], Year]
# Year Color X Y
#1: 2014 red 1 3
#2: 2014 blue 1 3
#3: 2015 red 1 3
#4: 2015 blue 1 3
#5: 2015 yellow 1 3
Another approach is to group by Year and Color and select the first row.
df[, .SD[seq_len(.N) == 1], .(Year, Color)]
Or, easiest of all, select unique rows and specify by:
unique(df, by = c('Year', 'Color'))
data
df <- structure(list(Year = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L
), Color = c("red", "red", "blue", "red", "blue", "yellow"),
X = c(1L, 1L, 1L, 1L, 1L, 1L), Y = c(3L, 3L, 3L, 3L, 3L,
3L)), class = "data.frame", row.names = c(NA, -6L))
Removing rows in R only if they are duplicated in direct succession
Using rleid from data.table we create a dummy grouping variable, and with distinct from dplyr we remove the duplicates. In your data you may want to include Transponder in the rleid call if it varies in your real data.
library(tidyverse)
library(data.table)
df %>%
  mutate(dummy = rleid(Units)) %>%
  distinct(dummy, .keep_all = TRUE) %>%
  select(-dummy)
Date TimeStamp Transponder Units
1 2021-08-15 2021-08-15-14:11:13 DA2C614E M2
2 2021-08-15 2021-08-15-14:12:40 DA2C614E HM2
3 2021-08-15 2021-08-15-14:12:49 DA2C614E H2
4 2021-08-15 2021-08-15-14:18:02 DA2C614E H1
5 2021-08-15 2021-08-15-14:25:29 DA2C614E HM2
Using just data.table and no temporary variable, you could do the following (based on comments): dt[!duplicated(rleid(Units)),]
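A short sketch of why rleid matters here, on a hypothetical table (the Units values echo the output above, while obs is an invented row counter): rleid assigns a new id each time the value changes, so only repeats in direct succession share an id, whereas unique by Units would also drop a later, non-adjacent repeat.

```r
library(data.table)

# Hypothetical data: "HM2" appears twice, but not in direct succession
dt <- data.table(
  Units = c("M2", "M2", "HM2", "H2", "H2", "HM2"),
  obs   = 1:6
)

run_id <- rleid(dt$Units)  # one id per run of identical consecutive values

# Drops only repeats within a run
consecutive <- dt[!duplicated(rleid(Units))]

# Drops every later repeat, adjacent or not
global <- unique(dt, by = "Units")
```

consecutive keeps the second "HM2" (obs 6) because it starts a new run, while global does not.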
Best way to remove duplicate entries from a data table
Remove Duplicates
// Requires: using System.Collections; using System.Data;
public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
    Hashtable hTable = new Hashtable();
    ArrayList duplicateList = new ArrayList();

    // Add each unique value of the column to the hashtable (as a key);
    // collect rows whose value was already seen in the duplicate list.
    foreach (DataRow drow in dTable.Rows)
    {
        if (hTable.Contains(drow[colName]))
            duplicateList.Add(drow);
        else
            hTable.Add(drow[colName], string.Empty);
    }

    // Remove the collected duplicate rows from the DataTable.
    foreach (DataRow dRow in duplicateList)
        dTable.Rows.Remove(dRow);

    // The DataTable now contains only unique records and is returned as output.
    return dTable;
}
Links:
http://www.dotnetspider.com/resources/4535-Remove-duplicate-records-from-table.aspx
http://www.dotnetspark.com/kb/94-remove-duplicate-rows-value-from-datatable.aspx
To remove duplicates in a column:
http://dotnetguts.blogspot.com/2007/02/removing-duplicate-records-from.html