Is Data Really Copied Four Times in R's Replacement Functions

NOTE: Unless otherwise specified, all explanations below are valid for R versions < 3.1.0. Great improvements were made in R v3.1.0, which are also briefly touched upon here.

To answer your first question, "why four copies and shouldn't one be enough?", we'll begin by quoting the relevant part from R-internals first:

A 'named' value of 2, NAM(2), means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered.

A value of 1 is used for situations like dim(a) <- c(7, 2) where in principle two copies of a exist for the duration of the computation as (in principle)
a <- dim<-(a, c(7, 2)) but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.
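
The desugaring in that quote can be checked directly; a minimal sketch:

```r
a <- matrix(1:14, nrow = 2)     # 2 x 7
b <- a

dim(b) <- c(7, 2)               # replacement-function form
a <- `dim<-`(a, c(7, 2))        # equivalent functional form

identical(a, b)                 # TRUE
```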

NAM(1):

Let's start with NAM(1) objects. Here's an example:

x <- 1:5 # (1)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
# [1] "<0x10374ecc8>"

x[2L] <- 10L # (2)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(1),TR] (len=5, tl=0) 1,10,3,4,5

What's happening here? We created an integer vector with `:`; since `:` is a primitive, the result is a NAM(1) object. When we then used [<- on that object, the value was changed in place (note that the pointers in (1) and (2) are identical). This is because [<-, being a primitive, knows quite well how to handle its inputs and is optimised to avoid a copy in this scenario.

y = x # (3)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(2),TR] (len=5, tl=0) 1,10,3,4,5

x[2L] <- 20L # (4)
.Internal(inspect(x))
# tracemem[0x10374ecc8 -> 0x10372f328]:
# @10372f328 13 INTSXP g0c3 [NAM(1),TR] (len=5, tl=0) 1,20,3,4,5

Now the same assignment results in a copy. Why? By doing (3), the 'named' field was incremented to NAM(2), as more than one object now points to the same data. Even though [<- is optimised, a NAM(2) object must be duplicated before it is changed. That's why it's again a NAM(1) object after the assignment: the call to duplicate sets 'named' to 0, and the new assignment bumps it back to 1.

Note: Peter Dalgaard explains this case nicely in this link, namely why x = 2L results in a NAM(2) object.



NAM(2):

Now let's return to your question on calling a replacement function f<- on a data.frame, which is a NAM(2) object.

The first question then is, why is data.frame() a NAM(2) object? Why not a NAM(1) like the earlier case x <- 1:5? Duncan Murdoch answers this very nicely on the same post:

data.frame() is a plain R function, so it is treated no differently than any user-written function. On the other hand, the internal function that implements the : operator is a primitive, so it has complete control over its return value, and it can set NAMED in the most efficient way.

This means any attempt to change the value would result in triggering a duplicate (a deep copy). From ?tracemem:

... any copying of the object by the C function duplicate produces a message to standard output.

So a message from tracemem helps understand the number of copies. To understand the first line of your tracemem output, let's construct the function f<-, which does no actual replacement. Also, let's construct a data.frame big enough so that we can measure the time taken for a single copy of that data.frame.

## R v 3.0.3
`f<-` = function(x, value) {
    return(x) ## no actual replacement
}

df <- data.frame(x=1:1e8, y=1:1e8) # 762.9 Mb
tracemem(df) # [1] "<0x7fbccd2f4ae8>"

require(data.table)
system.time(copy(df))
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4ff0]: copy system.time
# user system elapsed
# 0.609 0.484 1.106

system.time(f(df) <- 3)
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4f10]: system.time
# user system elapsed
# 0.608 0.480 1.101

I've used the function copy() from data.table (which basically calls the C duplicate function). The copying times are more or less identical. So the first step is clearly a deep copy, even though f<- did no actual replacement.

This explains the first two verbose messages from tracemem in your post:

(1) From the global environment we called f(df) <- 3. Here's one copy.

(2) From within the function f<-, the assignment x[1,1] <- 3 calls [<- (and hence the [<-.data.frame method). That makes the second copy immediately.

Finding the rest of the copies is easy with a debugonce() on [<-.data.frame. That is, doing:

debugonce(`[<-.data.frame`)
df <- data.frame(x=1:1e8, y=1:1e8)
`f<-` = function(x, value) {
    x[1,1] = value
    return(x)
}
tracemem(df)
f(df) = 3

# first three lines:

# tracemem[0x7f8ba33d8a08 -> 0x7f8ba33d8d50]: (1)
# tracemem[0x7f8ba33d8d50 -> 0x7f8ba33d8a78]: f<- (2)
# debugging in: `[<-.data.frame`(`*tmp*`, 1L, 1L, value = 3L)

By hitting enter, you'll find the other two copies to be inside this function:

# debug: class(x) <- NULL
# tracemem[0x7f8ba33d8a78 -> 0x7f8ba3cd6078]: [<-.data.frame [<- f<- (3)

# debug: x[[jj]][iseq] <- vjj
# tracemem[0x7f8ba3cd6078 -> 0x7f882c35ed40]: [<-.data.frame [<- f<- (4)

Note that class<- is a primitive, but it's being called on a NAM(2) object; I suspect that's the reason for the copy there. And the last copy is inevitable, as it modifies the column.

So, there you go.


Now a small note on R v3.1.0:

I also tested the same in R v3.1.0. tracemem provides all four lines; however, the only time-consuming step is (4). IIUC, the remaining cases, all due to [<- / class<-, should be triggering a shallow copy instead of a deep copy. What's awesome is that, even in (4), only the column being modified seems to be deep-copied. R v3.1.0 has great improvements!

This means tracemem reports shallow copies too, which is a bit confusing since the documentation doesn't explicitly state that, and it makes it hard to distinguish a shallow copy from a deep copy except by measuring time. Perhaps it's my (incorrect) understanding. Feel free to correct me.


On your part 2, I'll quote Luke Tierney from here:

Calling a foo<- function directly is not a good idea unless you really understand what is going on in the assignment mechanism in general and in the particular foo<- function. It is definitely not something to be done in routine programming unless you like unpleasant surprises.

But I am unable to tell whether these unpleasant surprises extend to an object that's already NAM(2). Matt was calling it on a list, which is created by a primitive and is therefore NAM(1), so calling foo<- directly wouldn't increment its 'named' value.

But, the fact that R v3.1.0 has great improvements should already convince you that such a function call is not necessary anymore.

HTH.

PS: Feel free to correct me (and help me shorten this answer if possible) :).


Edit: I seem to have missed the point about a copy being reduced when calling f<- directly, as spotted in the comments. It's pretty easy to see by using the function Simon Urbanek used in the post (that's linked multiple times now):

# rm(list=ls()) # to make sure there's no other object in your workspace
`f<-` <- function(x, value) {
    print(ls(env = parent.frame()))
}

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce01a65358>"

f(df) = 3
# tracemem[0x7fce0359b2a0 -> 0x7fce0359ae08]:
# [1] "*tmp*" "df" "f<-"

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce03c505c0>"
df <- `f<-`(df, 3)
# [1] "df" "f<-"

As you can see, in the first method an object *tmp* is created, which is not the case in the second. It seems that this creation of the *tmp* object for a NAM(2) input triggers a copy of the input before *tmp* is assigned to the function argument. But that's as far as my understanding goes.

Why does the extract method for data frames make two copies?

This isn't really a complete answer to your question, but it's a start.

If you look in the R Language Definition, you'll see that df[["name"]] <- 3.2 is implemented as

`*tmp*` <- df
df <- `[[<-.data.frame`(`*tmp*`, "name", value = 3.2)
rm(`*tmp*`)

So one copy gets put into *tmp*. If you call debug("[[<-.data.frame"), you'll see that it really does get called with an argument called *tmp*, and
tracemem() will show that the first duplication happens before you enter.

The function [[<-.data.frame is a regular function with a header like this:

function (x, i, j, value)  

That function gets called as

`[[<-.data.frame`(`*tmp*`, "name", value = 3.2)

Now there are three references to the dataframe: df in the global environment, *tmp* in the internal code, and x in that function. (Actually, there's an intermediate step where the generic is called, but it is a primitive, so it doesn't need to make a new reference.)

The class of x gets changed in the function; that triggers a copy. Then one of the components of x is changed; that's another copy. So that makes 3.

Just guessing, I'd say the reason for the first duplication is that a complicated replacement might refer to the original value, and it's avoiding the possibility of retrieving a partially modified value.
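
For instance, the right-hand side of a replacement can itself read the object being replaced; the copy held in `*tmp*` guarantees the RHS sees the untouched original. A minimal sketch:

```r
df <- data.frame(x = 1:3, y = 4:6)
# The RHS reads df while the LHS is being replaced; it is evaluated
# against the original value, not a partially modified one.
df[["x"]] <- df[["y"]] - df[["x"]]
df[["x"]]   # 3 3 3
```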

Writing efficient functions for data.tables that replace by reference

This is similar to Frank's answer, but it lets the arguments be passed to a function that builds the translation vector and returns the translation. You don't need a loop inside the function, since lapply, :=, and .SDcols do the looping inside [.data.table.

recode_dt <- function(datacol, oldval, newval) {
    trans <- setNames(newval, oldval)
    trans[datacol]
}

dt[, (bincols) := lapply(.SD, recode_dt, oldval = c("u", "n", "y"),
                         newval = c(NA_real_, 0, 1)),
   .SDcols = bincols]
dt
#===============
id fruit mydate eaten present sex
1: 1 apple 2015-09-01 1 0 m
2: 2 orange 2015-09-02 1 0 f
3: 3 banana 2015-11-15 0 1 f
4: 4 strawbery 2016-02-24 1 1 m
5: 5 rasberry 2016-03-08 NA 1 f

Note that your columns were not actually factors, as you appeared to think from one of your comments. They might have been, had you built a data.frame as an intermediate step.

fast replacement of data.table values by labels stored in another data.table

I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, variables with a starting '0' label cannot be handled in the same loop as other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.

The answer below:

library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)

#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
    which(intersect(names(repex_DT), names(labels_DT)) %fin%
        names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
            sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
            use.names=FALSE) >= 1)])]

same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
    which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]

labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)

#Update joins via matching IDs (credit to @det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..

Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.

Now, with our copied newDT we can modify it by reference :

newDT
# a b
# [1,] 1 11
# [2,] 2 200

newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference, with no copies at all.

Or, we can modify the original DT by reference :

DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
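
A minimal sketch of that pattern (the function and column names here are illustrative, not from the original):

```r
library(data.table)

add_col <- function(x) {
    x <- copy(x)      # detach from the caller's data.table
    x[, b := a * 2]   # := now modifies only the local copy
    x
}

DT <- data.table(a = 1:3)
res <- add_col(DT)
names(DT)   # still just "a"; the caller's DT is untouched
```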

Writing to data frame with many lines is very slow

Your code is slow because the function [<-.data.frame makes a copy of the underlying object each time you modify it.

If you trace the memory usage it becomes clear:

tracemem(toto.big)
system.time({
for(i in 1:100) { toto.big[i,2] <- 3 }
})

tracemem[0x000000001d416b58 -> 0x000000001e08e9f8]: system.time
tracemem[0x000000001e08e9f8 -> 0x000000001e08eb10]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eb10 -> 0x000000001e08ebb8]: [<-.data.frame [<- system.time
tracemem[0x000000001e08ebb8 -> 0x000000001e08e7c8]: system.time
tracemem[0x000000001e08e7c8 -> 0x000000001e08e758]: [<-.data.frame [<- system.time
tracemem[0x000000001e08e758 -> 0x000000001e08e800]: [<-.data.frame [<- system.time
....
tracemem[0x000000001e08e790 -> 0x000000001e08e838]: system.time
tracemem[0x000000001e08e838 -> 0x000000001e08eaa0]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eaa0 -> 0x000000001e08e790]: [<-.data.frame [<- system.time
user system elapsed
4.31 1.01 5.29

To resolve this, the best approach is to modify the data frame only once:

untracemem(toto.big)

system.time({
toto.big[1:100, 2] <- 5
})

user system elapsed
0.02 0.00 0.02

In cases where it is more convenient to calculate the values in a loop (or lapply), you can perform the calculation on a vector inside the loop, then assign into the data frame in one vectorised assignment:

system.time({
    newvalues <- numeric(100)
    for (i in 1:100) newvalues[i] <- rnorm(1)
    toto.big[1:100, 2] <- newvalues
})

user system elapsed
0.02 0.00 0.02

You can view the code for [<-.data.frame by typing `[<-.data.frame` into your console.

R: Creating a Function to Randomly Replace Data from a Data Frame

Here's a solution (I think). The following function implements the 5 step process you outlined above.

random_drop <- function(x) {
    # Randomly select variables
    which_vars <- names(x[, sort(sample(ncol(x), sample(ncol(x), 1)))])
    # Randomly select factor levels subset or generate continuous cutoff value
    cutoff_vals <- lapply(
        which_vars,
        function(i) {
            if (is.factor(x[[i]])) {
                return(sample(levels(x[[i]]), sample(nlevels(x[[i]]), 1)))
            }
            runif(1, min(x[[i]], na.rm = TRUE), max(x[[i]], na.rm = TRUE))
        }
    )
    names(cutoff_vals) <- which_vars
    # Create random prob value
    r <- runif(1, 0, 1)
    # Generate idx for which rows to select
    row_idx <- Reduce(
        `&`,
        lapply(
            which_vars,
            function(i) {
                if (is.factor(x[[i]])) {
                    return(x[[i]] %in% cutoff_vals[[i]])
                }
                x[[i]] > cutoff_vals[[i]]
            }
        )
    )
    x_sub <- x[row_idx, !colnames(x) %in% which_vars, drop = FALSE]
    # With prob. 'r' fill row values in with '0'
    r_mat <- matrix(
        sample(
            c(TRUE, FALSE),
            ncol(x_sub) * nrow(x_sub),
            replace = TRUE,
            prob = c(r, 1 - r)
        ),
        nrow = nrow(x_sub),
        ncol = ncol(x_sub)
    )
    x_sub[r_mat] <- 0
    x[row_idx, !colnames(x) %in% which_vars] <- x_sub
    return(x)
}

Then this function will apply random_drop recursively as many times as you wish.

random_drop_recurse <- function(x, n = 10) {
    if (n == 1) return(random_drop(x))
    random_drop_recurse(random_drop(x), n = n - 1)
}

Note: 0 is not a valid factor level, so this function will generate warnings when it tries to replace factor values with 0, and it will replace those values with NA instead.
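
If you'd rather keep zeros in factor columns than get NA, one workaround (not part of the function above) is to add "0" as an explicit level before replacing:

```r
f <- factor(c("A", "B", "A"))
levels(f) <- c(levels(f), "0")  # make "0" a legal level
f[2] <- "0"                     # replaces without coercing to NA
f
# [1] A 0 A
# Levels: A B 0
```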

Using a subset of your data supplied above, this is what it looks like running the function 10 and 100 times, respectively:

set.seed(123)

num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

my_data = data.frame(num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)


suppressWarnings(
head(
random_drop_recurse(my_data[, c(1:3, 6:8)], 10),
20
)
)
#> num_var_1 num_var_2 num_var_3 factor_var_1 factor_var_2 factor_var_3
#> 1 9.439524 5.021006 4.883963 B AA AAA
#> 2 9.769823 4.800225 12.369379 B AA AAA
#> 3 11.558708 9.910099 0.000000 C AA BBB
#> 4 10.070508 9.339124 22.192276 B CC DDD
#> 5 10.129288 -2.746714 11.741359 B AA AAA
#> 6 11.715065 15.202867 3.847317 <NA> AA CCC
#> 7 10.460916 11.248629 -8.068930 C CC <NA>
#> 8 8.734939 22.081037 0.000000 C AA BBB
#> 9 9.313147 13.425991 30.460189 C AA BBB
#> 10 9.554338 7.765203 4.392376 B AA AAA
#> 11 11.224082 23.986956 1.640007 A <NA> AAA
#> 12 10.359814 24.161130 16.529475 A AA AAA
#> 13 0.000000 3.906441 0.000000 A CC <NA>
#> 14 10.110683 12.345160 17.516291 B CC AAA
#> 15 9.444159 8.943765 7.220249 A AA DDD
#> 16 11.786913 10.935256 21.226542 B CC DDD
#> 17 10.497850 11.137714 -1.726089 B AA AAA
#> 18 8.033383 3.690498 9.511232 B CC CCC
#> 19 10.701356 11.427948 2.958597 B BB AAA
#> 20 9.527209 18.746237 16.807586 C AA BBB

suppressWarnings(
head(
random_drop_recurse(my_data[, c(1:3, 6:8)], 100),
20
)
)
#> num_var_1 num_var_2 num_var_3 factor_var_1 factor_var_2 factor_var_3
#> 1 9.439524 0.00000 0.000000 B <NA> <NA>
#> 2 9.769823 0.00000 12.369379 B <NA> <NA>
#> 3 11.558708 0.00000 0.000000 <NA> <NA> BBB
#> 4 10.070508 0.00000 0.000000 B <NA> <NA>
#> 5 10.129288 0.00000 0.000000 B <NA> <NA>
#> 6 11.715065 0.00000 0.000000 B <NA> <NA>
#> 7 10.460916 0.00000 0.000000 C <NA> <NA>
#> 8 0.000000 22.08104 0.000000 <NA> AA <NA>
#> 9 9.313147 0.00000 0.000000 C <NA> <NA>
#> 10 0.000000 0.00000 0.000000 B AA AAA
#> 11 11.224082 0.00000 0.000000 <NA> <NA> AAA
#> 12 10.359814 0.00000 0.000000 A <NA> <NA>
#> 13 10.400771 0.00000 0.000000 A <NA> <NA>
#> 14 10.110683 0.00000 0.000000 B <NA> <NA>
#> 15 9.444159 0.00000 0.000000 A <NA> <NA>
#> 16 11.786913 0.00000 0.000000 B <NA> <NA>
#> 17 10.497850 0.00000 0.000000 B <NA> <NA>
#> 18 8.033383 0.00000 0.000000 B <NA> <NA>
#> 19 0.000000 0.00000 2.958597 B BB AAA
#> 20 9.527209 0.00000 0.000000 C <NA> BBB

R: Randomly Replacing Elements of a Data Frame with 0

Here's a version that allows you to specify a vector of probabilities pnul of becoming 0, one per column, using Map. The length of each split string is multiplied by the corresponding element of pnul to get the number of entries set to zero. You may also set pnul to a scalar for the same probability in all columns.

pnul <- c(.0, .2, .5, .8, 1)

res <- Map(\(x, a) {
    S <- strsplit(x, ',')
    sapply(S, \(s) {
        s[sample(seq_along(s), length(s) * a)] <- '0'
        paste(s, collapse = ',')
    })
}, my_data, pnul) |> as.data.frame()

head(res)
# var_1 var_2 var_3 var_4 var_5
# 1 1,2,3,4,5,6,7,8,9,10 0,0,3,4,5,6,7,8,9,10 1,2,0,4,0,0,7,8,0,0 0,0,0,0,0,0,0,8,9,0 0,0,0,0,0,0,0,0,0,0
# 2 1,2,3,4,5,6,7,8,9,10 1,0,3,4,5,6,7,8,9,0 1,0,3,0,5,0,0,0,9,10 0,0,0,0,0,0,7,8,0,0 0,0,0,0,0,0,0,0,0,0
# 3 1,2,3,4,5,6,7,8,9,10 1,0,0,4,5,6,7,8,9,10 1,0,0,0,0,6,7,0,9,10 0,0,0,0,5,0,0,0,0,10 0,0,0,0,0,0,0,0,0,0
# 4 1,2,3,4,5,6,7,8,9,10 1,2,3,0,5,6,7,0,9,10 0,0,3,0,5,0,7,0,9,10 0,0,0,4,0,0,7,0,0,0 0,0,0,0,0,0,0,0,0,0
# 5 1,2,3,4,5,6,7,8,9,10 1,0,3,4,5,6,7,8,9,0 0,2,0,4,5,0,7,0,0,10 1,0,0,0,0,0,0,8,0,0 0,0,0,0,0,0,0,0,0,0
# 6 1,2,3,4,5,6,7,8,9,10 0,2,3,4,5,6,0,8,9,10 1,2,3,0,5,0,7,0,0,0 0,0,0,4,5,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0

Writing a function or loop to replace data based on two conditions, one of which is time

With data.table and dplyr:

We may use data.table::rleid for grouping and .N (or dplyr's n()) for the size of every group, then use replace to replace, within every group, all values that meet the condition (Pressure > 1000 in a run shorter than 60 rows).
The following will only work if there is strictly one observation per second. If there are missing rows or duplicate DateTime values, it may yield inconsistent results.
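
A minimal sketch of those steps, assuming a data.table dat with one row per second and columns DateTime and Pressure (all names here are assumptions):

```r
library(data.table)

# Group consecutive runs of the condition, then NA out short runs
dat[, run := rleid(Pressure > 1000)]
dat[, Pressure := replace(Pressure,
                          Pressure > 1000 & .N < 60,  # run shorter than 60 s
                          NA_real_),
    by = run]
dat[, run := NULL]
```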


