R: Selecting Subset Without Copying

Try the ref package, specifically its refdata class.

What you might be missing about data.table is that when grouping (the by= argument), the subsets of data are not copied, which is why grouping is fast. [Well, technically they are, but into a shared area of memory that is reused for each group, and they are copied with memcpy in C, which is much faster than copying via loops at the R level.]
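
For example, a grouped aggregation materializes each group's subset internally, but the working memory is reused across groups (a minimal sketch; the table and column names are made up):

library(data.table)
dt <- data.table(grp = rep(1:1000, each = 100), x = rnorm(100000))
# each group's x is staged in a reused working buffer rather than freshly
# allocated, so this stays fast even with many groups
dt[, mean(x), by = grp]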

:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copied-on-write: the user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
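
A minimal sketch of the by-reference behaviour (DT, f, and g are illustrative names):

library(data.table)
DT <- data.table(x = 1:5)
f <- function(d) d[, y := x * 2]          # := modifies the caller's table in place
g <- function(d) copy(d)[, z := x * 3]    # copy() protects the caller's table
f(DT); "y" %in% names(DT)  # TRUE: DT gained column y, no copy was made
g(DT); "z" %in% names(DT)  # FALSE: only the explicit copy gained column z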

You're right that there isn't a mechanism like refdata built into data.table. I see what you mean, and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with a plain data.frame (but be sure to monitor copies with tracemem(DF)).
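
A quick way to watch for copies with tracemem() (a sketch; DF stands for whatever data.frame you want to monitor):

DF <- data.frame(a = 1:10)
tracemem(DF)        # from now on, R prints a message whenever this object is duplicated
DF2 <- DF           # no message: plain assignment only creates another reference
DF2$a[1] <- 0L      # message: modifying the shared data triggers the actual copy
untracemem(DF)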

There is also idata.frame (an immutable data.frame) in the plyr package you could try.
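
For example (a sketch; note that idata.frame supports only a subset of normal data.frame operations, so more complex manipulations may not work):

library(plyr)
idf <- idata.frame(mtcars)     # an immutable view: subsets store indexes, not copied data
idf[which(idf$mpg > 25), ]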

Subset a data.table without using <-

What you are asking for is deleting rows by reference.

It is not yet possible, but there is a feature request for that, #635.

Until then, you need to copy your data.table subset in memory. The copy is made by <- (or =) when it is combined with subsetting (the i argument), so for now you cannot avoid that.

If it helps, you can operate on language objects to predefine the operation and delay its evaluation, and you can reuse the predefined objects multiple times:

library(data.table)
mtcars <- as.data.table(mtcars, keep.rownames = "rn")  # like() and rn require a data.table

mtcars_sub <- quote(mtcars[like(rn, "Mer")])
mtcars_sub2 <- quote(eval(mtcars_sub)[mpg > 20])
eval(mtcars_sub2)
#           rn  mpg cyl  disp hp drat   wt qsec vs am gear carb
# 1: Merc 240D 24.4   4 146.7 62 3.69 3.19 20.0  1  0    4    2
# 2:  Merc 230 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

By the way, when subsetting a data.table you don't need the trailing comma: instead of dt[x == 1, ] you can write dt[x == 1].

In R, subsetting without using subset(): using [ in a more concise manner to prevent typos?

After some thought, I wrote a super simple function called given:

given <- function(., ...) { with(., ...) }

This way, I don't have to repeat the name of the data.frame. I also found it to be about 14 times faster than filter(). See below:

library(microbenchmark)

adf <- data.frame(a = 1:10, b = 11:20)
given <- function(., ...) { with(., ...) }
with(adf, adf[a > 5 & b < 18, ])    ## adf mentioned twice :(
given(adf, .[a > 5 & b < 18, ])     ## adf mentioned once :)
dplyr::filter(adf, a > 5, b < 18)   ## adf mentioned once...
microbenchmark(with(adf, adf[a > 5 & b < 18, ]), times = 1000)
microbenchmark(given(adf, .[a > 5 & b < 18, ]), times = 1000)
microbenchmark(dplyr::filter(adf, a > 5, b < 18), times = 1000)

Using microbenchmark:

> adf <- data.frame(a = 1:10, b = 11:20)
> given <- function(., ...) { with(., ...) }
> with(adf, adf[a > 5 & b < 18, ])   ## adf mentioned twice :(
  a  b
6 6 16
7 7 17
> given(adf, .[a > 5 & b < 18, ])    ## adf mentioned once :)
  a  b
6 6 16
7 7 17
> dplyr::filter(adf, a > 5, b < 18)  ## adf mentioned once...
  a  b
1 6 16
2 7 17
> microbenchmark(with(adf, adf[a > 5 & b < 18, ]), times = 1000)
Unit: microseconds
                             expr    min     lq     mean median     uq     max neval
 with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
> microbenchmark(given(adf, .[a > 5 & b < 18, ]), times = 1000)
Unit: microseconds
                            expr    min     lq     mean median    uq     max neval
 given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
> microbenchmark(dplyr::filter(adf, a > 5, b < 18), times = 1000)
Unit: microseconds
                              expr     min       lq     mean   median       uq      max neval
 dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

I noticed that given() is actually a tad faster than with(), presumably due to the length of the variable name.

The neat thing about given() is that you can do some things inline, without an assignment:

given(data.frame(a = 1:10, b = 11:20), .[a > 5 & b < 18, ])

Copying values from one subset to all others for selected columns using dplyr

I think you could calculate the correct median using only the first record for each user, and then left_join the result back on.

library(dplyr)

df <-
  tibble(
    UserId = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
    DFFU   = c(0, 1, 2, 3, 4, 0, 2, 4, 5, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
    Q1     = c(3, 1, 1, 0, 1, 4, 2, 5, 4, 5, 2, 5, 6, 6, 5, 5, 4, 0, 1),
    Q2     = c(2, 0, 1, 2, 1, 8, 2, 6, 5, 7, 2, 5, 5, 6, 3, 3, 2, 0, 1),
    Q3     = c(1, 0, 0, 0, 1, 2, 1, 5, 5, 2, 2, 4, 3, 4, 5, 4, 6, 1, 1)
  )

df <- df %>%
  group_by(UserId) %>%
  mutate(across(all_of(c("Q1", "Q2", "Q3")), sd, .names = "Sigma_{.col}")) %>%
  ungroup()

df %>%
  filter(DFFU == 0) %>%
  transmute(
    UserId = UserId,
    across(all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))), median, .names = "Median_{.col}")
  ) %>%
  {left_join(df, .)}

Yielding:

> df
# A tibble: 19 x 11
   UserId  DFFU    Q1    Q2    Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
   <chr>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>           <dbl>           <dbl>           <dbl>
 1 A          0     3     2     1     1.10    0.837    0.548            1.45            1.59            1.53
 2 A          1     1     0     0     1.10    0.837    0.548            1.45            1.59            1.53
 3 A          2     1     1     0     1.10    0.837    0.548            1.45            1.59            1.53
 4 A          3     0     2     0     1.10    0.837    0.548            1.45            1.59            1.53
 5 A          4     1     1     1     1.10    0.837    0.548            1.45            1.59            1.53
 6 B          0     4     8     2     1.26    2.5      2.06             1.45            1.59            1.53
 7 B          2     2     2     1     1.26    2.5      2.06             1.45            1.59            1.53
 8 B          4     5     6     5     1.26    2.5      2.06             1.45            1.59            1.53
 9 B          5     4     5     5     1.26    2.5      2.06             1.45            1.59            1.53
10 C          0     5     7     2     1.64    1.87     1                1.45            1.59            1.53
11 C          1     2     2     2     1.64    1.87     1                1.45            1.59            1.53
12 C          2     5     5     4     1.64    1.87     1                1.45            1.59            1.53
13 C          3     6     5     3     1.64    1.87     1                1.45            1.59            1.53
14 C          4     6     6     4     1.64    1.87     1                1.45            1.59            1.53
15 D          0     5     3     5     2.35    1.30     2.30             1.45            1.59            1.53
16 D          1     5     3     4     2.35    1.30     2.30             1.45            1.59            1.53
17 D          2     4     2     6     2.35    1.30     2.30             1.45            1.59            1.53
18 D          3     0     0     1     2.35    1.30     2.30             1.45            1.59            1.53
19 D          4     1     1     1     2.35    1.30     2.30             1.45            1.59            1.53

One of the reasons your analysis is getting so weird, though, is that you are breaking tidy data principles. In your original data set, each row represents one survey, but the standard deviation applies to each student, not to each survey. So the standard deviation values should appear in a table with one row per student. The median then describes the population of students; there is only one population, so there should be only one row. Therefore, I'd recommend:

sd_df <-
  df %>%
  group_by(UserId) %>%
  summarize(
    across(
      all_of(c("Q1", "Q2", "Q3")),
      .fns = sd,
      .names = "Sigma_{.col}"
    ),
    n = n()
  )

median_sd_df <-
  sd_df %>%
  summarize(
    across(
      all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))),
      .fns = median,
      .names = "Median_{.col}"
    )
  )

which gives you:

> sd_df
# A tibble: 4 x 5
  UserId Sigma_Q1 Sigma_Q2 Sigma_Q3     n
  <chr>     <dbl>    <dbl>    <dbl> <int>
1 A          1.10    0.837    0.548     5
2 B          1.26    2.5      2.06      4
3 C          1.64    1.87     1         5
4 D          2.35    1.30     2.30      5

> median_sd_df
# A tibble: 1 x 3
  Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
            <dbl>           <dbl>           <dbl>
1            1.45            1.59            1.53

Efficient way to subset data.table based on value in any of selected columns

One option is to specify the columns of interest in .SDcols, loop through the subset of the data.table (.SD), generate a list of logical vectors, Reduce them to a single logical vector with |, and use that to subset the rows.
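
For a reproducible run, here is an assumed stand-in for the question's objects (dt, cols, and the hand-written reference result test1 are all illustrative):

library(data.table)
set.seed(1)
dt <- data.table(a = sample(10, 20, TRUE),
                 b = sample(10, 20, TRUE),
                 c = sample(10, 20, TRUE))
cols <- c("a", "b")
test1 <- dt[a == 10 | b == 10]  # the condition written out column by column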

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE

Select data till the end based on a pattern in one column

You can use the which() function to return the indices where str_detect() is TRUE.

library(stringr)

which(str_detect(df1$V2, "Descripción"))
[1] 2

If you instead save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly takes the first value of x, in case str_detect() returns TRUE in more than one place.

x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
                             V1                                               V2         V3                      V4       V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.

When should I use which for subsetting?

Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset, as brought up in the linked question.

When you want to extract the entire subset, there is not much difference in processing speed, but using which() needs to allocate less memory. However, if you only want a part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage: it lets you avoid subsetting the data frame twice, because you can subset the result of which() instead.

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940 10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).

Copy a subset of a column, based on conditions, to another dataframe in R

The issue is that the numbers of rows in 'df' and 'sub1' are different: 'df' is created with one row. Instead, we can create 'df' directly from 'sub1' itself:

df <- sub1['Red']
names(df) <- cls[1]

Another way to create the data.frame would be to specify the nrow as well (note that dimnames is an argument of matrix(), not of as.data.frame()):

df <- as.data.frame(matrix(nrow = nrow(sub1), ncol = length(cls),
                           dimnames = list(NULL, cls)))

Regarding the second error: cut() needs breaks. Either we specify the number of breaks:

cut(sub1$Red, breaks = 3)

Or a vector of break points:

cut(sub1$Red, breaks = c(-Inf, 100, 500, 1000, Inf))

If there are many 'tabn' objects, get them into a list with mget() and loop over the list with lapply():

lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >=3 | Co == 3)$Red)

It is possible that after subsetting and selecting the 'Red' column, the numbers of elements differ. If the lengths are different, an option is to pad NA at the end of the shorter ones before cbinding:

mx <- max(lengths(out_lst))
# `length<-` pads each shorter vector with NA up to length mx
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))

Selecting a subset of columns in R data.table using a vector with a constant column

Combine .SDcols with c(.SD, .(new=10)):

library(data.table)

dt <- data.table(one = 1:3, two = 2:4, three = 3:5)
parlist <- c("one", "three")

dt[, c(.SD, .(new = 10)), .SDcols = parlist]
#      one three   new
#    <int> <int> <num>
# 1:     1     3    10
# 2:     2     4    10
# 3:     3     5    10

