Porting set operations from R's data frames to data tables: How to identify duplicated rows?
duplicated.data.table needs the same fix that unique.data.table got [EDIT: now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching: you're already using v1.6.7 from R-Forge, not v1.6.6 on CRAN.
But, on Note 1, there's a 'not join' idiom:
x[-x[y,which=TRUE]]
See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.
Update. Now in v1.8.3, not-join has been implemented.
DT[-DT["a",which=TRUE,nomatch=0],...] # old idiom
DT[!"a",...] # same result, now preferred.
Extracting unique rows from a data table in R
Before data.table v1.9.8, the default behaviour of the unique.data.table method was to use the key in order to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).

As of data.table v1.9.8+, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of your data.table).
Hence, the old behavior would be something like
library(data.table) # v1.9.7-
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them
Or, without making a copy:
setDT(a)
dim(unique(a))
## [1] 8 3
Efficient Combination and Operating on Large Data Frames
It's unclear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly combination of two ff vectors, to get all combinations, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions of 160 million rows x 2 columns in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
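For comparison, expand.ffgrid is the ff-backed analogue of base R's expand.grid, which materialises the full grid in RAM; a small in-memory sketch of the same operation (illustrative vectors, not the OP's data):

```r
# In-memory analogue of expand.ffgrid: all combinations of two key vectors.
# Fine for small inputs; for 160 million rows you'd want the ff version.
key1 <- 1:4
key2 <- c(10, 20)
x <- expand.grid(Var1 = key1, Var2 = key2)
nrow(x)  # 8 = length(key1) * length(key2)
```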
R: spread function on data frame with duplicates
You could use dcast from the devel version of data.table, i.e. v1.9.5. Instructions to install it are here.
library(data.table)#v1.9.5+
dcast(setDT(df), Dimension~Date, value.var='Metric',
fun.aggregate=function(x) toString(unique(x)))
# Dimension Fri Mon Tue Wed
#1: A 7, 8 23 25
#2: B 7 9
Or
library(dplyr)
library(tidyr)
df %>%
group_by(Dimension, Date) %>%
summarise(Metric=toString(unique(Metric))) %>%
spread(Date, Metric, fill='')
# Dimension Fri Mon Tue Wed
#1 A 7, 8 23 25
#2 B 7 9
Update
Using the new dataset from the OP's post:
setDF(df2)
df2 %>%
group_by(Dimension, Date) %>%
summarise(Metric=toString(unique(Metric))) %>%
spread(Date, Metric, fill='') %>%
head(2) %>%
select(1:3)
# Dimension 16 analog tuner
#1 10994030020 9
#2 12300245685 NTSC
non-joins with data.tables
As far as I know, this is a part of base R.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
The textual error message indicates that it is intended behavior.
Here's my best guess as to why that is the intended behavior:
From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information, and are loath to drop that without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)

In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]
# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
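In data.table terms, nomatch = 0 drops the NA that an unmatched key would otherwise contribute, which keeps the negative-index idiom legal. A small hypothetical keyed table for illustration:

```r
library(data.table)
DT <- data.table(x = c("a", "b", "c"), v = 1:3, key = "x")

# Without nomatch = 0, a key with no match contributes an NA,
# which cannot be mixed with negative subscripts:
DT["z", which = TRUE]               # NA
# DT[-DT["z", which = TRUE]]        # fails like the base-R case above

# With nomatch = 0 the NA is dropped instead:
DT["z", which = TRUE, nomatch = 0]  # integer(0)
```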
How can I subset the negation of a key value using R's data.table package?
I think you answered your own question:
> nrow(df1[group != "a"])
[1] 90
> table(df1[group != "a", group])
a b c d e f g h i j
0 10 10 10 10 10 10 10 10 10
Seems pretty concise to me?
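For reference, a df1 consistent with the counts above (a hypothetical reconstruction: 10 groups of 10 rows, with group as a factor key as the later edit mentions):

```r
library(data.table)
# hypothetical reconstruction of the OP's df1: groups "a".."j", 10 rows each
df1 <- data.table(group = factor(rep(letters[1:10], each = 10)), key = "group")
nrow(df1[group != "a"])  # 90
```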
EDIT FROM MATTHEW: As per the comments, this is a vector scan. There is a 'not join' idiom here and here, and feature request #1384 to make it easier.
EDIT: feature request #1384 is implemented in data.table 1.8.3
df1[!'a']
# and to avoid the character-to-factor coercion warning in this example (where
# the key column happens to be a factor) :
df1[!J(factor('a'))]
How do I do a negative / nomatch / inverse search in data.table?
The idiom is this:
DT[-DT["a", which=TRUE]]
x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
Inspiration from:
- The mailing list posting Return Select/Join that does NOT match?
- The previous question non-joins with data.tables
- Matthew Dowle's answer to Porting set operations from R's data frames to data tables: How to identify duplicated rows?
Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented:
DT[-DT["a",which=TRUE,nomatch=0],...] # old idiom
DT[!"a",...] # same result, now preferred.
See the NEWS item for more detailed info and example.
Generate a sequence of Data frame from function
Here's how to create an empty data.frame (though it's not what you are trying to do): Create an empty data.frame.
And you should not be creating 100 separate dataframes, but rather a list of dataframes. I would not do it with rbind, since that would be very slow. Instead, I would create them with a function that returns a dataframe of the required structure:
make_df <- function(n, var) {
  data.frame(a = (1:n) + var, b = (1:n) - var, c = (1:n) / var)
}
mylist <- setNames(
  lapply(1:100, function(n) make_df(n, n)),  # the dataframes
  paste0("d_", 1:100))                       # the names for access
head(mylist,3)
#---------------
$d_1
a b c
1 2 0 1
$d_2
a b c
1 3 -1 0.5
2 4 0 1.0
$d_3
a b c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6 0 1.0000000
Then if you want the "d_40" dataframe it's just:
mylist[[ "d_40" ]]
Or
mylist$d_40
If you want to perform the same operation or get a result from all of them at once, just use lapply:
lapply(mylist, nrow) # will be a list
Or:
sapply(mylist, nrow) # will be a vector, because each result has length one
Unpacking and merging lists in a column in data.frame
Here's a possible data.table approach:
library(data.table)
setDT(dat)[, .(name = c(name, unlist(altNames))), by = id]
# id name
# 1: 1001 Joan
# 2: 1002 Jane
# 3: 1002 Janie
# 4: 1002 Janet
# 5: 1002 Jan
# 6: 1003 John
# 7: 1003 Jon
# 8: 1004 Bill
# 9: 1004 Will
# 10: 1005 Tom
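The output above assumes a dat of roughly this shape, with altNames as a list column (reconstructed here from the printed result; the original data may differ):

```r
library(data.table)
# hypothetical reconstruction of dat, with altNames as a list column
dat <- data.frame(id   = c(1001, 1002, 1003, 1004, 1005),
                  name = c("Joan", "Jane", "John", "Bill", "Tom"),
                  stringsAsFactors = FALSE)
dat$altNames <- list(character(0),                 # Joan: no alternatives
                     c("Janie", "Janet", "Jan"),
                     "Jon",
                     "Will",
                     character(0))                 # Tom: no alternatives

# one row per name, primary and alternative alike
setDT(dat)[, .(name = c(name, unlist(altNames))), by = id]
```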
Why is running unique faster on a data frame than a matrix in R?
In this implementation, unique.matrix is the same as unique.array:

> identical(unique.array, unique.matrix)
[1] TRUE

unique.array has to handle multi-dimensional arrays, which requires additional processing to 'collapse' the extra dimensions (those extra calls to paste()) that are not needed in the 2-dimensional case. The key section of code is:

collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
    apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 decimal digits, so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1
while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?