Extracting Unique Rows from a Data Table in R

Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns over which unique combinations were returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).

As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) to unique (replacing DT in the call to key with the name of your data.table).

Hence, the old behavior would be something like

library(data.table) # v1.9.7 and earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3

While for data.table v1.9.8+, just

b <- data.table(a) 
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # if you have keys and want to use them

Or without a copy

setDT(a)
dim(unique(a))
## [1] 8 3
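
To see the two behaviors side by side, here is a minimal sketch assuming data.table v1.9.8 or later. The table DT and its columns id, grp and val are made up for illustration; the key covers only id, so the default call and by = key(DT) give different answers.

```r
library(data.table)

# Hypothetical table, keyed on `id` only
DT <- data.table(
  id  = c(1L, 1L, 2L, 2L),
  grp = c("a", "a", "b", "b"),
  val = c(10, 10, 20, 30),
  key = "id"
)

# v1.9.8+ default: all columns are compared, so only the exact duplicate row is dropped
n_all <- nrow(unique(DT))

# Old-style behavior: compare the key columns only, keeping the first row per id
n_key <- nrow(unique(DT, by = key(DT)))

print(c(all_columns = n_all, key_only = n_key))
```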

Extracting unique rows in R data table based on another column

Subset in the j part:

library(data.table)
setDT(df)
df[, .SD[!duplicated(Color)], Year]

#    Year  Color X Y
# 1: 2014    red 1 3
# 2: 2014   blue 1 3
# 3: 2015    red 1 3
# 4: 2015   blue 1 3
# 5: 2015 yellow 1 3

Another approach is to group by Year and Color and select the first row.

df[, .SD[seq_len(.N) == 1], .(Year, Color)]

Or, simplest of all, select unique rows and specify by:

unique(df, by = c('Year', 'Color'))

data

df <- structure(list(Year = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L
), Color = c("red", "red", "blue", "red", "blue", "yellow"),
X = c(1L, 1L, 1L, 1L, 1L, 1L), Y = c(3L, 3L, 3L, 3L, 3L,
3L)), class = "data.frame", row.names = c(NA, -6L))
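
As a quick check, the three approaches above can be run on the sample data and compared. This is a sketch rather than part of the original answers; for brevity it builds the same values directly as a data.table instead of calling setDT on the data frame.

```r
library(data.table)

# Same values as the sample data, built directly as a data.table
df <- data.table(
  Year  = c(2014L, 2014L, 2014L, 2015L, 2015L, 2015L),
  Color = c("red", "red", "blue", "red", "blue", "yellow"),
  X     = 1L,
  Y     = 3L
)

# 1) Within each Year, keep the first row for every Color
res_sd <- df[, .SD[!duplicated(Color)], by = Year]

# 2) Group by both columns and keep the first row of each group
res_first <- df[, .SD[seq_len(.N) == 1], by = .(Year, Color)]

# 3) Let unique() do the work
res_uniq <- unique(df, by = c("Year", "Color"))

# Compare the three results
same <- isTRUE(all.equal(res_sd, res_uniq)) && isTRUE(all.equal(res_first, res_uniq))
print(same)
```

The unique(df, by = ...) form avoids materialising .SD for every group, which is one reason it tends to be the preferred choice on larger tables.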

R data.table get unique rows dropping some columns as well

How about this:

R> unique(tbl, by=c("reader_id", "book_id"))[,-4]
#    reader_id book_id date
# 1:        10       1   d1
# 2:        20       2   d2
# 3:        30       4   d4
# 4:        50       5   d5

Or if you prefer to drop by name,

unique(tbl, by = c("reader_id", "book_id"))[, !"inf"]
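
Since tbl itself is not shown, here is a self-contained sketch with made-up data. The column names reader_id, book_id, date and inf follow the output above, but all values are assumptions. It also shows a third way to get the same result, by naming the columns to keep rather than the one to drop.

```r
library(data.table)

# Hypothetical data: values are invented for illustration
tbl <- data.table(
  reader_id = c(10L, 10L, 20L, 30L, 50L),
  book_id   = c(1L, 1L, 2L, 4L, 5L),
  date      = c("d1", "d1", "d2", "d4", "d5"),
  inf       = c("i1", "i2", "i3", "i4", "i5")
)

# First row per (reader_id, book_id), then drop `inf` by name
res_drop <- unique(tbl, by = c("reader_id", "book_id"))[, !"inf"]

# Same rows, but naming the columns to keep instead
res_keep <- unique(tbl, by = c("reader_id", "book_id"))[, .(reader_id, book_id, date)]

print(res_drop)
```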

How to extract unique rows from a data frame with an index column?

You could use duplicated() as a logical mask.

> df1[!duplicated(df1[,-1]), ]
Index a b c
1 1 12 12 14
3 3 11 12 13

Data

df1 <- structure(list(Index = 1:3, a = c(12L, 12L, 11L), b = c(12L, 
12L, 12L), c = c(14L, 14L, 13L)), class = "data.frame", row.names = c(NA,
-3L))
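
One subtlety worth knowing when filtering with duplicated(): a logical mask (!duplicated(...)) is safer than negative indexing through which(). When there are no duplicates at all, which() returns an empty vector, and negating an empty index selects zero rows instead of all of them. The sketch below shows both cases; df2 is a made-up variant of df1 with the duplicate already removed.

```r
df1 <- data.frame(
  Index = 1:3,
  a = c(12L, 12L, 11L),
  b = c(12L, 12L, 12L),
  c = c(14L, 14L, 13L)
)

# Logical mask: TRUE for the first occurrence of each (a, b, c) combination
res <- df1[!duplicated(df1[, -1]), ]

# Hypothetical variant with no duplicates left in columns a, b, c
df2 <- df1[c(1, 3), ]

n_mask  <- nrow(df2[!duplicated(df2[, -1]), ])         # mask keeps every row
n_which <- nrow(df2[-which(duplicated(df2[, -1])), ])  # -integer(0) selects nothing

print(c(mask = n_mask, which = n_which))
```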

Filtering out duplicated/non-unique rows in data.table

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default, all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G

Or use the by argument to get unique combinations of specific columns (which is what keys were previously used for):

unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only considers the key columns. This means you have to set the key to all columns before calling unique if you want whole-row uniqueness.

library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G
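
If the same script has to run on both sides of the v1.9.8 change, one option is to stop relying on the default and always pass by explicitly. The sketch below assumes a version of data.table whose unique method accepts a by argument; it reuses the dt from this answer, keyed on V2.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)
setkey(dt, V2)

# Whole-row uniqueness, independent of whatever key is set
u_rows <- unique(dt, by = names(dt))

# Uniqueness on the key columns only
u_key <- unique(dt, by = key(dt))

print(c(whole_rows = nrow(u_rows), key_only = nrow(u_key)))
```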

How do I extract the unique rows from a subset of columns in a data table?

The most straightforward, to me at least, would be either unique(jk[c4 >= 10, list(c1, c2)]), as suggested by @Justin, or unique(jk[c4 >= 10, c("c1", "c2")]). The latter is the quickest of the four suggestions so far, at least on my laptop:

microbenchmark(
a=jk[c4 >= 10, list(c1,c2), keyby = list(c1,c2)][,c("c1","c2")],
b=jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")],
c=unique(jk[c4>=10,list(c1,c2)]),
d=unique(jk[c4>=10,c("c1","c2")])
)

Unit: microseconds
expr min lq median uq max neval
a 1378.742 1456.676 1494.9380 1531.1395 2515.796 100
b 906.404 943.072 963.7790 997.4930 3805.846 100
c 1167.125 1201.988 1232.3500 1272.2250 2077.047 100
d 627.768 653.314 669.8625 683.8045 739.808 100
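
The benchmark depends on jk, which is not shown. A minimal sketch with made-up data (the column names c1 to c4 match the calls above, but their contents are assumptions) can be used to check that expressions b, c and d return the same unique pairs, so the choice among them is about speed only. Expression a uses keyby and therefore returns the pairs sorted. The sketch assumes a data.table version in which a character vector in j selects columns.

```r
library(data.table)

# Hypothetical data standing in for the original `jk`
set.seed(42)
jk <- data.table(
  c1 = sample(3L, 50, replace = TRUE),
  c2 = sample(2L, 50, replace = TRUE),
  c3 = runif(50),
  c4 = sample(5:15, 50, replace = TRUE)
)

res_b <- jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")]
res_c <- unique(jk[c4 >= 10, list(c1, c2)])
res_d <- unique(jk[c4 >= 10, c("c1", "c2")])

# Check that the list() and character-vector forms agree
print(isTRUE(all.equal(res_c, res_d)))
```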

Extract all the unique values for a given substring in the column names

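With sapply (transpose here appears to be purrr::transpose, which returns a list of lists, hence the unlist):

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE))

Or use data.table::transpose instead, which returns plain character vectors and makes the call simpler: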

sapply(data.table::transpose(strsplit(col, "\\.")), unique)

Finally, use setNames to set the names:

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE)) |>
  setNames(c("City", "Type", "Year", "Active"))

output:

$City
[1] "Barcelona" "Berlin" "London"

$Type
[1] "Standard" "One"

$Year
[1] "2012" "2013" "2014" "2015" "2016"

$Active
[1] "True"

data

col <- c("Barcelona.Standard.2012.True",
"Berlin.One.2013.True",
"London.One.2014.True",
"Barcelona.Standard.2015.True",
"Berlin.One.2016.True")
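
For comparison, the same result can be reached in base R alone. This sketch is not from the answer above; it assumes every name splits into exactly four parts, so the pieces can be stacked into a matrix and unique applied column by column.

```r
col <- c("Barcelona.Standard.2012.True",
         "Berlin.One.2013.True",
         "London.One.2014.True",
         "Barcelona.Standard.2015.True",
         "Berlin.One.2016.True")

# One row per name, one column per dot-separated part
parts <- do.call(rbind, strsplit(col, ".", fixed = TRUE))

# Unique values per column, collected in a named list
res <- lapply(seq_len(ncol(parts)), function(j) unique(parts[, j]))
names(res) <- c("City", "Type", "Year", "Active")

print(res)
```

Using fixed = TRUE avoids having to escape the dot as a regular expression.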

Find unique rows

Check for duplicates from both the beginning and the end of the data frame. A row is selected only if neither check returns TRUE for it:

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]

# x y
#5 115 215
#10 521 151
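
Note that this is different from unique(df): unique keeps one copy of every distinct row, while the expression above keeps only rows that occur exactly once. A small sketch with made-up data (the df from the question is not shown) makes the contrast visible.

```r
# Hypothetical data: rows 1 and 2 are identical
df <- data.frame(
  x = c(1, 1, 2, 3, 3),
  y = c("a", "a", "b", "c", "d")
)

# TRUE for every row that has a duplicate anywhere in the data frame
has_dup <- duplicated(df) | duplicated(df, fromLast = TRUE)

only_once <- df[!has_dup, ]  # rows that occur exactly once
one_each  <- unique(df)      # one copy of every distinct row

print(c(only_once = nrow(only_once), one_each = nrow(one_each)))
```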

