Extracting Unique Rows from a Data Table in R

Before data.table v1.9.8, the unique.data.table method used the key columns by default to decide which combinations counted as unique. If the key was NULL (the default), you would get the original data set back (as in the OP's situation).

As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) to unique (replacing DT in the call to key with the name of your data.table).

Hence, the old behavior would be something like

library(data.table) # v1.9.7 or earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b)) 
## [1] 8 3

While for data.table v1.9.8+, just

b <- data.table(a) 
dim(unique(b)) 
## [1] 8 3
## or dim(unique(b, by = key(b))) # if b has a key and you want to use it

Or without a copy

setDT(a)
dim(unique(a))
## [1] 8 3
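
To make the difference between the all-columns default and by = key(DT) concrete, here is a minimal sketch for v1.9.8+. The table DT and its columns id and val are made up for illustration and do not come from the question:

```r
library(data.table)

# Hypothetical toy table, keyed on a single column
DT <- data.table(
  id  = c(1L, 1L, 2L, 2L),
  val = c("a", "b", "c", "c"),
  key = "id"
)

all_cols <- unique(DT)                # v1.9.8+: compares every column
key_only <- unique(DT, by = key(DT))  # compares the key column (id) only

nrow(all_cols)  # 3
nrow(key_only)  # 2
```

With by = key(DT), only the first row for each id is kept, so the (1, "b") row is dropped even though it is not a full-row duplicate.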

Extracting unique rows in R data table based on another column

Subset in the j part:

library(data.table)
setDT(df)
df[, .SD[!duplicated(Color)], Year]

#   Year  Color X Y
#1: 2014    red 1 3
#2: 2014   blue 1 3
#3: 2015    red 1 3
#4: 2015   blue 1 3
#5: 2015 yellow 1 3

Another approach is to group by Year and Color and select the first row.

df[, .SD[seq_len(.N) == 1], .(Year, Color)]

Or, easiest of all, select the unique rows and specify by:

unique(df, by = c('Year', 'Color'))

data

df <- structure(list(Year = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L
), Color = c("red", "red", "blue", "red", "blue", "yellow"), 
X = c(1L, 1L, 1L, 1L, 1L, 1L), Y = c(3L, 3L, 3L, 3L, 3L, 
3L)), class = "data.frame", row.names = c(NA, -6L))
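
As a quick sanity check, the three approaches can be run side by side. This is only a sketch: it rebuilds the same values as the data block directly as a data.table, and the names by_j, by_group and by_unique are introduced here for illustration:

```r
library(data.table)

# Same values as the data block above, built directly as a data.table
df <- data.table(
  Year  = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L),
  Color = c("red", "red", "blue", "red", "blue", "yellow"),
  X     = 1L,
  Y     = 3L
)

by_j      <- df[, .SD[!duplicated(Color)], by = Year]
by_group  <- df[, .SD[seq_len(.N) == 1], by = .(Year, Color)]
by_unique <- unique(df, by = c("Year", "Color"))

# Compare the results, ignoring attributes such as keys
all.equal(by_j, by_unique, check.attributes = FALSE)
all.equal(by_group, by_unique, check.attributes = FALSE)
```

The unique(..., by = ) form avoids building .SD for each group, which is one reason to prefer it on larger tables.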

R data.table get unique rows dropping some columns as well

How about this:

R> unique(tbl, by=c("reader_id", "book_id"))[,-4]
#    reader_id book_id date
# 1:        10       1   d1
# 2:        20       2   d2
# 3:        30       4   d4
# 4:        50       5   d5

Or if you prefer to drop by name,

unique(tbl,by=c("reader_id", "book_id"))[,!"inf"]

How to extract unique rows from a data frame with an index column?

You could use duplicated() on every column except the index.

> df1[!duplicated(df1[,-1]), ]
  Index  a  b  c
1     1 12 12 14
3     3 11 12 13

Data

df1 <- structure(list(Index = 1:3, a = c(12L, 12L, 11L),
                      b = c(12L, 12L, 12L), c = c(14L, 14L, 13L)),
                 class = "data.frame", row.names = c(NA, -3L))

Filtering out duplicated/non-unique rows in data.table

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or use the by argument to get unique combinations of specific columns (which is what keys were previously used for)

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only considers the key columns. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G
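
The "reset the key to all columns" step mentioned above could look roughly like the sketch below, which reuses the dt table from this answer. On v1.9.8+ the key is not needed for this, but setting it is harmless:

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Key on every column so that a key-based unique() compares full rows
setkeyv(dt, names(dt))
key(dt)

res <- unique(dt)
nrow(res)  # 7 distinct (V1, V2) pairs
```

Note that setkeyv sorts dt by reference, so the rows come back ordered by V1 and then V2 rather than in their original order.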

How do I extract the unique rows from a subset of columns in a data table?

The most straightforward, to me at least, would be either unique(jk[c4 >= 10, list(c1, c2)]) as suggested by @Justin, or unique(jk[c4 >= 10, c("c1", "c2")]). The latter of these is the quickest of the four suggestions so far, at least on my laptop:

microbenchmark(
a=jk[c4 >= 10, list(c1,c2), keyby = list(c1,c2)][,c("c1","c2")],
b=jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")],
c=unique(jk[c4>=10,list(c1,c2)]),
d=unique(jk[c4>=10,c("c1","c2")])
)

Unit: microseconds
 expr      min       lq    median        uq      max neval
    a 1378.742 1456.676 1494.9380 1531.1395 2515.796   100
    b  906.404  943.072  963.7790  997.4930 3805.846   100
    c 1167.125 1201.988 1232.3500 1272.2250 2077.047   100
    d  627.768  653.314  669.8625  683.8045  739.808   100
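
The jk table is not shown in the answer, so here is a small made-up stand-in to illustrate what the two recommended calls return. The column values are hypothetical; only the column names c1, c2 and c4 come from the code above:

```r
library(data.table)

# Hypothetical stand-in for jk
jk <- data.table(
  c1 = c("a", "a", "b", "b", "a"),
  c2 = c(1L, 1L, 2L, 2L, 3L),
  c3 = 1:5,
  c4 = c(10L, 12L, 5L, 11L, 9L)
)

# Filter rows in i, pick columns in j, then de-duplicate
res_list  <- unique(jk[c4 >= 10, list(c1, c2)])
res_names <- unique(jk[c4 >= 10, c("c1", "c2")])  # character j needs v1.9.8+

nrow(res_list)   # 2
nrow(res_names)  # 2
```

Rows 1, 2 and 4 pass the c4 >= 10 filter, and the first two share the same (c1, c2) pair, so two rows remain.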

Extract all the unique values for a given substring in the column names

With sapply:

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE))

Or use data.table::transpose, which returns plain character vectors, so the unlist step is not needed:

sapply(data.table::transpose(strsplit(col, "\\.")), unique)

Finally, use setNames to set the names:

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE)) |>
  setNames(c("City", "Type", "Year", "Active"))

output:

$City
[1] "Barcelona" "Berlin"    "London"   

$Type
[1] "Standard" "One"     

$Year
[1] "2012" "2013" "2014" "2015" "2016"

$Active
[1] "True"

data

col <- c("Barcelona.Standard.2012.True",
  "Berlin.One.2013.True",
  "London.One.2014.True",
  "Barcelona.Standard.2015.True",
  "Berlin.One.2016.True")

Find unique rows

Check for duplicates from both the beginning and the end of the data frame; a row is selected only if neither check flags it:

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]

#     x   y
#5  115 215
#10 521 151
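
The original df is not shown, so the following sketch uses a small made-up data frame to show why both directions are needed. duplicated() alone flags only the later copies, while fromLast = TRUE flags the earlier ones:

```r
# Hypothetical data: rows 1 and 2 are identical, rows 3 to 5 occur once
df <- data.frame(x = c(1, 1, 2, 3, 3), y = c(10, 10, 20, 30, 31))

dup_forward  <- duplicated(df)                   # flags row 2
dup_backward <- duplicated(df, fromLast = TRUE)  # flags row 1

# Keep only rows that are not flagged in either direction
singles <- df[!(dup_forward | dup_backward), ]
nrow(singles)  # 3
```

Unlike unique(df), which keeps one copy of each repeated row, this drops every row that has a duplicate anywhere in the data frame.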
