## R data.table duplicate rows with a pair of columns

The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:

```r
dt[!duplicated(data.table(pmin(Gene1, Gene2), pmax(Gene1, Gene2)))]
#   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
#1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
```
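To see why this works, here is a minimal sketch on a made-up table (the gene pairs and `CORR` values are hypothetical, not the asker's data, and `key_dt` and `res` are just illustrative names). `pmin` and `pmax` put each pair into a canonical order, so `A-B` and `B-A` collapse to the same key:

```r
library(data.table)

# Hypothetical toy data: rows 1/2 and rows 3/4 are the same pair in swapped order
dt <- data.table(
  Gene1 = c("FOXA1", "MYC", "EGFR", "CD4", "TP53"),
  Gene2 = c("MYC", "FOXA1", "CD4", "EGFR", "BRCA1"),
  CORR  = c(0.99, 0.99, 0.98, 0.98, 0.50)
)

# Canonical key: element-wise smaller name first, larger name second
key_dt <- dt[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep the first occurrence of each unordered pair (rows 1, 3 and 5 here)
res <- dt[!duplicated(key_dt)]
print(res)
```

Note that the surviving rows keep their original `Gene1`/`Gene2` orientation; only the comparison key is reordered.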

If you have more than two key columns to de-duplicate by, you are probably best off converting to long format, sorting the values within each row, casting back to wide format and then de-duplicating. Like so:

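```r
dupvars <- c("Gene1","Gene2")
sel <- !duplicated(
  dcast(
    # long format with a row id; rank the values within each original row
    melt(dt[, c(.SD,id=.(.I)), .SDcols=dupvars], id.vars="id")[
      order(id,value), grp := seq_len(.N), by=id],
    # back to wide: one column per rank, so each row holds its sorted values
    id ~ grp
  )[,-1])
dt[sel,]
```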
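The same idea can be spelled out step by step. The following is only a sketch on hypothetical data with three key columns (`k1`, `k2`, `k3` and `val` are made-up names); it swaps the chained `order(...)`/`:=` call for an explicit `setorder()` and uses the `by` argument of `duplicated` instead of dropping the id column, which should be equivalent but is easier to follow:

```r
library(data.table)

# Hypothetical toy data: rows 1 and 2 hold the same set of values in a different order
dt <- data.table(
  k1  = c("a", "c", "a", "x"),
  k2  = c("b", "a", "b", "y"),
  k3  = c("c", "b", "d", "z"),
  val = 1:4
)
dupvars <- c("k1", "k2", "k3")

# 1. Long format: one row per (original row id, key value)
long <- melt(dt[, c(.SD, list(id = .I)), .SDcols = dupvars], id.vars = "id")

# 2. Sort the values within each original row and number them
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# 3. Back to wide: column "1" is the smallest value of each row, "2" the next, ...
wide <- dcast(long, id ~ grp, value.var = "value")

# 4. Flag rows whose sorted values were already seen (id itself is ignored)
sel <- !duplicated(wide, by = setdiff(names(wide), "id"))
res <- dt[sel]
print(res)
```

Because `dcast` returns one row per `id` in increasing order, `sel` lines up with the rows of `dt`.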

## Keep first row by multiple columns in an R data.table

`data.table` provides S3 methods for `unique`, `duplicated` and `anyDuplicated`.

`unique(dt, by = c('x','y'))` will give you what you want.
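As a quick illustration, here is a sketch on a hypothetical table (the column names `x`, `y`, `z` and the result names are made up for the example) showing all three methods with the same `by` columns:

```r
library(data.table)

# Hypothetical toy data: rows 1 and 3 share the same (x, y) combination
dt <- data.table(x = c(1, 1, 1, 2), y = c("a", "b", "a", "a"), z = 1:4)

# First row of each (x, y) combination; all columns are returned
first_rows <- unique(dt, by = c("x", "y"))

# Logical flag per row: TRUE where (x, y) repeats an earlier row
dup_flag <- duplicated(dt, by = c("x", "y"))

# Index of the first duplicated row, or 0 if there is none
first_dup <- anyDuplicated(dt, by = c("x", "y"))

print(first_rows)
```

`dt[!dup_flag]` should give the same rows as `first_rows`, which is handy when you need the mask for something else as well.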

## Extracting unique rows from a data table in R

Before data.table v1.9.8, the default behavior of the `unique.data.table` method was to use the key columns to determine which combinations count as unique. If the `key` was `NULL` (the default), one would get the original data set back (as in the OP's situation).

As of data.table 1.9.8+, the `unique.data.table` method uses all columns by default, which is consistent with `unique.data.frame` in base R. To have it use the key columns, explicitly pass `by = key(DT)` into `unique` (replacing `DT` in the call to `key` with the name of the data.table).

Hence, the old behavior would be something like

```r
library(data.table) # v1.9.7 and earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
```

While for data.table v1.9.8+, just

```r
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them
```

Or without a copy

```r
setDT(a)
dim(unique(a))
## [1] 8 3
```
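In the example above the key covers every column, so both calls agree. The difference only shows up when the key is a subset of the columns. A small sketch, assuming data.table 1.9.8 or later and a made-up table `DT` keyed on one column:

```r
library(data.table)  # assumes data.table >= 1.9.8

# Hypothetical toy data keyed on the single column g
DT <- data.table(g = c("a", "a", "b", "b"), v = c(1, 2, 3, 3), key = "g")

# Default: every column is compared, so only the repeated (b, 3) row is dropped
all_cols <- unique(DT)

# Explicit by = key(DT): only g is compared, so one row per value of g remains
key_cols <- unique(DT, by = key(DT))

print(all_cols)
print(key_cols)
```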

## Removing duplicate rows from data frame in R

We can use `data.table`. Convert the 'data.frame' to a 'data.table' (`setDT(df1)`) and group by `pmin(A, B)` and `pmax(A, B)`; `if` the number of rows in a group is greater than 1, we take the first row, or `else` we return the rows as they are.

```r
library(data.table)
setDT(df1)[, if (.N > 1) head(.SD, 1) else .SD, .(A = pmin(A, B), B = pmax(A, B))]
#   A B prob
#1: 1 2  0.1
#2: 1 3  0.2
#3: 1 4  0.3
#4: 2 3  0.1
#5: 2 4  0.4
```

Or we can just use `duplicated` on the `pmax`, `pmin` output to return a logical index and subset the data based on that.

```r
setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
#   A B prob
#1: 1 2  0.1
#2: 1 3  0.2
#3: 1 4  0.3
#4: 2 3  0.1
#5: 2 4  0.4
```
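The two approaches keep the same rows but can differ in what ends up in `A` and `B`. A small sketch on hypothetical data (not the asker's `df1`) makes this visible; it uses `as.data.table` so the input is not modified, and plain `head(.SD, 1)`, which for this purpose behaves the same as the `if`/`else` form:

```r
library(data.table)

# Hypothetical toy data: row 2 is row 1 with A and B swapped
df1 <- data.frame(A = c(2, 1, 1), B = c(1, 2, 3), prob = c(0.1, 0.1, 0.2))
dt1 <- as.data.table(df1)  # a copy, so df1 itself is left untouched

# Grouping: the output columns A and B are the pmin/pmax values
by_group <- dt1[, head(.SD, 1), by = .(A = pmin(A, B), B = pmax(A, B))]

# Logical index: the surviving rows keep their original A and B
by_index <- dt1[!duplicated(cbind(pmax(A, B), pmin(A, B)))]

print(by_group)  # first row shows A = 1, B = 2
print(by_index)  # first row shows A = 2, B = 1
```

If the original orientation of each pair matters, the `duplicated` version is the one to use.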

## Unique rows, considering two columns, in R, without order

There are lots of ways to do this; here is one:

```r
unique(t(apply(df, 1, sort)))
duplicated(t(apply(df, 1, sort)))
```

One gives the unique rows, the other gives the mask.
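For instance, on a small made-up data frame (the names `sorted`, `uniq`, `mask` and `kept` are just for illustration), the mask can be used to pull the original rows back out:

```r
# Hypothetical toy data frame: row 2 is row 1 with the columns swapped
df <- data.frame(a = c(1, 2, 1), b = c(2, 1, 3))

# Sort the values within each row so the order inside a row is ignored;
# apply() returns the rows as columns, hence the t()
sorted <- t(apply(df, 1, sort))

uniq <- unique(sorted)      # matrix of unique unordered pairs
mask <- duplicated(sorted)  # TRUE for rows that repeat an earlier pair

kept <- df[!mask, ]         # original rows, first occurrence of each pair
print(kept)
```

Keep in mind that `apply` converts the data frame to a matrix first, so this sketch assumes all columns are of the same type.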

