Porting Set Operations from R's Data Frames to Data Tables: How to Identify Duplicated Rows

Porting set operations from R's data frames to data tables: How to identify duplicated rows?

duplicated.data.table needs the same fix that unique.data.table got [EDIT: now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching: you're already using v1.6.7 from R-Forge, not v1.6.6 from CRAN.
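
For context, here's a minimal sketch of the behavior the two methods should share on a keyed table (toy data, not from the original question):

library(data.table)
DT <- data.table(a = c(1, 1, 2), b = c(3, 3, 4), key = c("a", "b"))
duplicated(DT)  # FALSE TRUE FALSE : row 2 repeats row 1 on the key columns
unique(DT)      # drops the duplicated row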

But, on Note 1, there's a 'not join' idiom:

x[-x[y,which=TRUE]]

See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.
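
To make the idiom concrete, here's a hedged sketch on toy keyed tables (names and values are illustrative):

library(data.table)
x <- data.table(id = letters[1:5], v = 1:5, key = "id")
y <- data.table(id = c("b", "d"), key = "id")
x[x[y, which = TRUE]]   # the join: rows of x that match y
x[-x[y, which = TRUE]]  # the 'not join': rows of x that do not match y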


Update. Now in v1.8.3, not-join has been implemented.

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...] # same result, now preferred.

Extracting unique rows from a data table in R

Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key in order to determine the columns by which unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).

As of data.table v1.9.8+, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of the data.table).

Hence, the old behavior would be something like:

library(data.table) # v1.9.7-
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3

While for data.table v1.9.8+, it's just:

b <- data.table(a) 
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them

Or, without a copy:

setDT(a)
dim(unique(a))
## [1] 8 3

Efficient Combination and Operating on Large Data Frames

It's pretty unclear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly combination of two ff vectors, to get all combinations, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions of 160 million rows x 2 columns in a few seconds.

require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
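
For a self-contained illustration at toy scale (a hedged sketch; k1 and k2 stand in for myDF1$key and myDF2$key):

library(ffbase)                # loads ff as well
k1 <- ff(1:400)                # stand-in for myDF1$key
k2 <- ff(1:400)                # stand-in for myDF2$key
grid <- expand.ffgrid(k1, k2)  # ffdf holding all 400 * 400 combinations on disk
dim(grid)
## [1] 160000      2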

R: spread function on data frame with duplicates

You could use dcast from the devel version of data.table, i.e., v1.9.5. Instructions to install it are here.

library(data.table) # v1.9.5+
dcast(setDT(df), Dimension ~ Date, value.var = 'Metric',
      fun.aggregate = function(x) toString(unique(x)))
# Dimension Fri Mon Tue Wed
#1: A 7, 8 23 25
#2: B 7 9

Or

library(dplyr)
library(tidyr)
df %>%
  group_by(Dimension, Date) %>%
  summarise(Metric = toString(unique(Metric))) %>%
  spread(Date, Metric, fill = '')
# Dimension Fri Mon Tue Wed
#1 A 7, 8 23 25
#2 B 7 9

Update

Using the new dataset from the OP's post:

setDF(df2)
df2 %>%
  group_by(Dimension, Date) %>%
  summarise(Metric = toString(unique(Metric))) %>%
  spread(Date, Metric, fill = '') %>%
  head(2) %>%
  select(1:3)
# Dimension 16 analog tuner
#1 10994030020 9
#2 12300245685 NTSC

non-joins with data.tables

As far as I know, this is part of base R.

# This works
(1:4)[c(-2,-3)]

# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts

The textual error message indicates that it is intended behavior.

Here's my best guess as to why that is the intended behavior:

From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information, and are loath to drop that without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)
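
A small sketch of that instruction in action (toy table; names are illustrative):

library(data.table)
DT <- data.table(x = c("a", "b", "c"), v = 1:3, key = "x")
DT[c("a", "z"), which = TRUE]                    # 1 NA : the non-match becomes NA
DT[c("a", "z"), which = TRUE, nomatch = 0]       # 1    : the NA is explicitly dropped
DT[-DT[c("a", "z"), which = TRUE, nomatch = 0]]  # negative indexing is now safe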

In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:

# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]

How can I subset the negation of a key value using R's data.table package?
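
For reference, the df1 used below can be reconstructed along these lines (a hedged sketch: 100 rows, with a keyed factor column group holding levels a through j, 10 rows each):

library(data.table)
df1 <- data.table(group = factor(rep(letters[1:10], each = 10)),
                  value = rnorm(100), key = "group")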

I think you answered your own question:

> nrow(df1[group != "a"])
[1] 90
> table(df1[group != "a", group])

 a  b  c  d  e  f  g  h  i  j
 0 10 10 10 10 10 10 10 10 10

Seems pretty concise to me?

EDIT FROM MATTHEW: As per the comments, this is a vector scan. There is a not-join idiom here and here, and feature request #1384 to make it easier.

EDIT: feature request #1384 is implemented in data.table 1.8.3

df1[!'a']

# and to avoid the character-to-factor coercion warning in this example (where
# the key column happens to be a factor) :
df1[!J(factor('a'))]

How do I do a negative / nomatch / inverse search in data.table?

The idiom is this:

DT[-DT["a", which=TRUE]]

   x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9

Inspiration from:

  • The mailing list posting Return Select/Join that does NOT match?
  • The previous question non-joins with data.tables
  • Matthew Dowle's answer to Porting set operations from R's data frames to data tables: How to identify duplicated rows?

Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented:

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...] # same result, now preferred.

See the NEWS item for more detailed info and example.
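
Using the usual example table (a hedged sketch; DT keyed on its x column), the two forms can be run side by side:

library(data.table)
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3), v = 1:9, key = "x")
DT[!"a"]                                 # not-join syntax, v1.8.3+
DT[-DT["a", which = TRUE, nomatch = 0]]  # old idiom, same rows returned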

Generate a sequence of data frames from a function

Here's how to create an empty data.frame (and it's not what you are trying to do):
Create an empty data.frame

And you should not be creating 100 separate data frames but rather a list of data frames. I would not do it with rbind, since that would be very slow. Instead, I would create them with a function that returns a data frame of the required structure:

make_df <- function(n, var) {
  data.frame(a = (1:n) + var, b = (1:n) - var, c = (1:n) / var)
}

mylist <- setNames(
  lapply(1:100, function(n) make_df(n, n)),  # the dataframes
  paste0("d_", 1:100))                       # the names for access

head(mylist,3)
#---------------
$d_1
  a b c
1 2 0 1

$d_2
  a  b   c
1 3 -1 0.5
2 4  0 1.0

$d_3
  a  b         c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6  0 1.0000000

Then if you want the "d_40" dataframe it's just:

 mylist[[ "d_40" ]]

Or

 mylist$d_40

If you want to perform the same operation or get a result from all of them at once, just use lapply:

 lapply(mylist, nrow)  # will be a list

Or:

 sapply(mylist, nrow)  # will be a vector, because each value is the same length.

Unpacking and merging lists in a column in data.frame
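
The answer below assumes dat looks something like this (a hedged reconstruction from the output shown; altNames is a list column of alternate names):

dat <- data.frame(id = c(1001, 1002, 1003, 1004, 1005),
                  name = c("Joan", "Jane", "John", "Bill", "Tom"),
                  stringsAsFactors = FALSE)
dat$altNames <- list(character(0), c("Janie", "Janet", "Jan"),
                     "Jon", "Will", character(0))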

Here's a possible data.table approach:

library(data.table)
setDT(dat)[, .(name = c(name, unlist(altNames))), by = id]
#       id  name
#  1: 1001  Joan
#  2: 1002  Jane
#  3: 1002 Janie
#  4: 1002 Janet
#  5: 1002   Jan
#  6: 1003  John
#  7: 1003   Jon
#  8: 1004  Bill
#  9: 1004  Will
# 10: 1005   Tom

Why is running unique faster on a data frame than a matrix in R?

  1. In this implementation, unique.matrix is the same as unique.array

    > identical(unique.array, unique.matrix)

    [1] TRUE

  2. unique.array has to handle multi-dimensional arrays which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) which are not needed in the 2-dimensional case. The key section of code is:

    collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

    temp <- if (collapse)
        apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

  3. unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation (a quick benchmark sketch follows this list).
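
Here's a quick, hedged benchmark sketch of that last point (timings are machine-dependent; the gap grows with the number of rows):

m  <- matrix(sample(2, 2e5, replace = TRUE), ncol = 2)
df <- as.data.frame(m)
system.time(unique(m))   # slower: goes through the general array code
system.time(unique(df))  # faster: the optimised 2D path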

Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 decimal digits, so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?


