Is data really copied four times in R's replacement functions?
NOTE: Unless otherwise specified, all explanations below are valid for R versions < 3.1.0. R v3.1.0 brought great improvements, which are also briefly touched upon here.
To answer your first question, "why four copies and shouldn't one be enough?", let's begin by quoting the relevant part from R Internals:
A 'named' value of 2, NAM(2), means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered.
A value of 1 is used for situations like
dim(a) <- c(7, 2)
where in principle two copies of a exist for the duration of the computation as (in principle)
a <- `dim<-`(a, c(7, 2))
but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.
NAM(1):
Let's start with NAM(1) objects. Here's an example:
x <- 1:5 # (1)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
# [1] "<0x10374ecc8>"
x[2L] <- 10L # (2)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(1),TR] (len=5, tl=0) 1,10,3,4,5
What's happening here? We created an integer vector using `:`, which, being a primitive, results in a NAM(1) object. And when we used `[<-` on that object, the value got changed in place (note that the pointers at (1) and (2) are identical). This is because `[<-`, being a primitive, knows quite well how to handle its inputs and is optimised to avoid a copy in this scenario.
y = x # (3)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(2),TR] (len=5, tl=0) 1,10,3,4,5
x[2L] <- 20L # (4)
.Internal(inspect(x))
# tracemem[0x10374ecc8 -> 0x10372f328]:
# @10372f328 13 INTSXP g0c3 [NAM(1),TR] (len=5, tl=0) 1,20,3,4,5
Now the same assignment results in a copy. Why? By doing (3), the 'named' field gets incremented to NAM(2), as more than one object now points to the same data. Even though `[<-` is optimised, the fact that the object is NAM(2) means it must be duplicated. That's why it's again a NAM(1) object after the assignment: the call to duplicate sets named to 0, and the new assignment bumps it back to 1.
Note: Peter Dalgaard explains nicely in this link why x = 2L results in a NAM(2) object.
NAM(2):
Now let's return to your question on calling *<- on a data.frame, which is a NAM(2) object.
The first question then is: why is the result of data.frame() a NAM(2) object, and not NAM(1) like x <- 1:5 in the earlier case? Duncan Murdoch answers this very nicely in the same post:
data.frame() is a plain R function, so it is treated no differently than any user-written function. On the other hand, the internal function that implements the : operator is a primitive, so it has complete control over its return value, and it can set NAMED in the most efficient way.
This means any attempt to change the value would trigger a duplicate (a deep copy). From ?tracemem:
... any copying of the object by the C function duplicate produces a message to standard output.
So the messages from tracemem help us count the copies. To understand the first line of your tracemem output, let's construct a function f<- that does no actual replacement, and a data.frame big enough that we can measure the time taken for a single copy of it.
## R v 3.0.3
`f<-` = function(x, value) {
return(x) ## no actual replacement
}
df <- data.frame(x=1:1e8, y=1:1e8) # 762.9 Mb
tracemem(df) # [1] "<0x7fbccd2f4ae8>"
require(data.table)
system.time(copy(df))
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4ff0]: copy system.time
# user system elapsed
# 0.609 0.484 1.106
system.time(f(df) <- 3)
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4f10]: system.time
# user system elapsed
# 0.608 0.480 1.101
I've used the function copy() from data.table (which basically calls the C function duplicate). The times for copying are more or less identical. So the first step is clearly a deep copy, even though f<- did nothing.
This explains the first two verbose messages from tracemem in your post:
(1) From the global environment we called f(df) <- 3. Here's one copy.
(2) From within the function f<-, another assignment x[1,1] <- 3, which calls `[<-` (and hence `[<-.data.frame`). That makes the second copy immediately.
Finding the rest of the copies is easy with a debugonce() on `[<-.data.frame`. That is, doing:
debugonce(`[<-.data.frame`)
df <- data.frame(x=1:1e8, y=1:1e8)
`f<-` = function(x, value) {
x[1,1] = value
return(x)
}
tracemem(df)
f(df) = 3
# first three lines:
# tracemem[0x7f8ba33d8a08 -> 0x7f8ba33d8d50]: (1)
# tracemem[0x7f8ba33d8d50 -> 0x7f8ba33d8a78]: f<- (2)
# debugging in: `[<-.data.frame`(`*tmp*`, 1L, 1L, value = 3L)
By hitting enter, you'll find the other two copies to be inside this function:
# debug: class(x) <- NULL
# tracemem[0x7f8ba33d8a78 -> 0x7f8ba3cd6078]: [<-.data.frame [<- f<- (3)
# debug: x[[jj]][iseq] <- vjj
# tracemem[0x7f8ba3cd6078 -> 0x7f882c35ed40]: [<-.data.frame [<- f<- (4)
Note that class<- is primitive, but here it's being called on a NAM(2) object; I suspect that's the reason for the copy there. And the last copy is inevitable, as it modifies the column.
So, there you go.
Now a small note on R v3.1.0:
I also tested the same in R v3.1.0. tracemem still shows all four lines, but the only time-consuming step is (4). IIUC, the remaining copies, all due to `[<-`/`class<-`, now trigger a shallow copy instead of a deep copy. What's awesome is that, even in (4), only the column being modified seems to be deep copied. R 3.1.0 has great improvements!
This means tracemem reports shallow copies too, which is a bit confusing, since the documentation doesn't explicitly state that, and it makes it hard to tell a shallow copy from a deep copy except by measuring time. Perhaps it's my (incorrect) understanding. Feel free to correct me.
On your part 2, I'll quote Luke Tierney from here:
Calling a foo<- function directly is not a good idea unless you really understand what is going on in the assignment mechanism in general and in the particular foo<- function. It is definitely not something to be done in routine programming unless you like unpleasant surprises.
But I am unable to tell whether these unpleasant surprises extend to an object that's already NAM(2). Matt was calling it on a list created by a primitive (and therefore NAM(1)), and calling foo<- directly wouldn't increment its 'named' value.
In any case, the fact that R v3.1.0 has great improvements should already convince you that such a function call is no longer necessary.
HTH.
PS: Feel free to correct me (and help me shorten this answer if possible) :).
Edit: I seem to have missed the point about a copy being avoided when calling f<- directly, as spotted in the comments. It's pretty easy to see using the function Simon Urbanek used in the post (linked multiple times now):
# rm(list=ls()) # to make sure there's no other object in your workspace
`f<-` <- function(x, value) {
print(ls(env = parent.frame()))
}
df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce01a65358>"
f(df) = 3
# tracemem[0x7fce0359b2a0 -> 0x7fce0359ae08]:
# [1] "*tmp*" "df" "f<-"
df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce03c505c0>"
df <- `f<-`(df, 3)
# [1] "df" "f<-"
As you can see, with the first method an object *tmp* is created, which is not the case with the second. And it seems this creation of a *tmp* object for a NAM(2) input triggers a copy of the input before *tmp* gets assigned to the function argument. But that's as far as my understanding goes.
Why does the extract method for data frames make two copies?
This isn't really a complete answer to your question, but it's a start.
If you look in the R Language Definition, you'll see that df[["name"]] <- 3.2 is implemented as
`*tmp*` <- df
df <- "[[<-.data.frame"(`*tmp*`, "name", value=3.2)
rm(`*tmp*`)
So one copy gets put into *tmp*. If you call debug("[[<-.data.frame"), you'll see that it really does get called with an argument called *tmp*, and tracemem() will show that the first duplication happens before you enter.
The function `[[<-.data.frame` is a regular function with a header like this:
function (x, i, j, value)
That function gets called as
`[[<-.data.frame`(`*tmp*`, "name", value = 3.2)
Now there are three references to the data frame: df in the global environment, *tmp* in the internal code, and x in that function. (Actually, there's an intermediate step where the generic is called, but it's a primitive, so it doesn't need to make a new reference.)
The class of x gets changed in the function; that triggers a copy. Then one of the components of x is changed; that's another copy. So that makes three.
Just guessing, I'd say the reason for the first duplication is that a complicated replacement might refer to the original value, and it's avoiding the possibility of retrieving a partially modified value.
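You can watch these duplications happen yourself with tracemem() (a minimal sketch; the exact number of tracemem lines reported depends on your R version and build):

```r
df <- data.frame(name = 1.1)
tracemem(df)           # start tracing; each "tracemem[... -> ...]" line is one duplicate()
df[["name"]] <- 3.2    # goes through `*tmp*` and `[[<-.data.frame` as described above
untracemem(df)
df$name                # 3.2 - the replacement itself behaves as expected
```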
Writing efficient functions for data.tables that replace by reference
This is similar to Frank's, but lets the arguments be passed to a function that builds the translation vector and returns the translation. You don't need to loop inside the function, since lapply, :=, and .SDcols already do the looping inside [.data.table.
recode_dt <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[datacol]
}
dt[, (bincols) := lapply(.SD, recode_dt, oldval = c("u", "n", "y"),
newval = c(NA_real_, 0, 1)),
.SDcols = bincols]
dt
#===============
id fruit mydate eaten present sex
1: 1 apple 2015-09-01 1 0 m
2: 2 orange 2015-09-02 1 0 f
3: 3 banana 2015-11-15 0 1 f
4: 4 strawbery 2016-02-24 1 1 m
5: 5 rasberry 2016-03-08 NA 1 f
Note that your columns were not actually factors, as you appeared to think from one of your comments. They might have been had you built a data.frame as an intermediate step.
fast replacement of data.table values by labels stored in another data.table
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify the labels to update.
As pointed out by @det, variables with a starting '0' label cannot be handled in the same loop as other standard categorical variables, so the instruction is basically repeated twice. Still, this is much faster than my initial for-loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to @det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)), then it's the copy that gets modified by reference.
DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"
newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..
Notice how even the a vector was copied (the different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just the elements that needed to change. That's important to avoid for large data, and it's why := and set() were introduced to data.table.
Now, with our copied newDT, we can modify it by reference:
newDT
# a b
# [1,] 1 11
# [2,] 2 200
newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..
Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
Or, we can modify the original DT by reference:
DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..
Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.
Btw, if you tracemem(DT) and then DT[2, b := 600], you'll see one copy reported. That is a copy of the first 10 rows made by the print method. When wrapped with invisible(), or when called within a function or script, the print method isn't called.
All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, call x = copy(x) at the start of the function. But remember that data.table is for large data (as well as for faster programming with small data). We deliberately don't want to copy large objects, ever. As a result, we don't need to allow for the usual 3x working-memory rule of thumb; we try to need working memory only as large as one column (a working-memory factor of 1/ncol rather than 3).
Writing to data frame with many lines is very slow
Your code is slow because the function `[<-.data.frame` makes a copy of the underlying object each time you modify it.
If you trace the memory usage it becomes clear:
tracemem(toto.big)
system.time({
for(i in 1:100) { toto.big[i,2] <- 3 }
})
tracemem[0x000000001d416b58 -> 0x000000001e08e9f8]: system.time
tracemem[0x000000001e08e9f8 -> 0x000000001e08eb10]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eb10 -> 0x000000001e08ebb8]: [<-.data.frame [<- system.time
tracemem[0x000000001e08ebb8 -> 0x000000001e08e7c8]: system.time
tracemem[0x000000001e08e7c8 -> 0x000000001e08e758]: [<-.data.frame [<- system.time
tracemem[0x000000001e08e758 -> 0x000000001e08e800]: [<-.data.frame [<- system.time
....
tracemem[0x000000001e08e790 -> 0x000000001e08e838]: system.time
tracemem[0x000000001e08e838 -> 0x000000001e08eaa0]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eaa0 -> 0x000000001e08e790]: [<-.data.frame [<- system.time
user system elapsed
4.31 1.01 5.29
To resolve this, your best option is to modify the data frame only once:
untracemem(toto.big)
system.time({
toto.big[1:100, 2] <- 5
})
user system elapsed
0.02 0.00 0.02
In cases where it is more convenient to calculate values in a loop (or lapply), you can perform the calculation on a vector in the loop, then assign it into the data frame in one vectorised assignment:
system.time({
newvalues <- numeric(100)
for(i in 1:100)newvalues[i] <- rnorm(1)
toto.big[1:100, 2] <- newvalues
})
user system elapsed
0.02 0.00 0.02
You can view the code for `[<-.data.frame` by typing `[<-.data.frame` into your console.
R: Creating a Function to Randomly Replace Data from a Data Frame
Here's a solution (I think). The following function implements the five-step process you outlined above.
random_drop <- function(x) {
# Randomly select variables
which_vars <- names(x[, sort(sample(ncol(x), sample(ncol(x), 1)))])
# Randomly select factor levels subset or generate continuous cutoff value
cutoff_vals <- lapply(
which_vars,
function(i) {
if (is.factor(x[[i]])) {
return(sample(levels(x[[i]]), sample(nlevels(x[[i]]), 1)))
}
runif(1, min(x[[i]], na.rm = TRUE), max(x[[i]], na.rm = TRUE))
}
)
names(cutoff_vals) <- which_vars
# Create random prob value
r <- runif(1,0,1)
# Generate idx for which rows to select
row_idx <- Reduce(
`&`,
lapply(
which_vars,
function(i) {
if (is.factor(x[[i]])) {
return(x[[i]] %in% cutoff_vals[[i]])
}
x[[i]] > cutoff_vals[[i]]
}
)
)
x_sub <- x[row_idx, !colnames(x) %in% which_vars, drop = FALSE]
# With prob. 'r' fill row values in with '0'
r_mat <- matrix(
sample(
c(TRUE, FALSE),
ncol(x_sub)*nrow(x_sub),
replace = TRUE,
prob = c(r, 1 - r)
),
nrow = nrow(x_sub),
ncol = ncol(x_sub)
)
x_sub[r_mat] <- 0
x[row_idx, !colnames(x) %in% which_vars] <- x_sub
return(x)
}
This function will then apply random_drop() recursively as many times as you wish:
random_drop_recurse <- function(x, n = 10) {
if (n == 1) return(random_drop(x))
random_drop_recurse(random_drop(x), n = n - 1)
}
Note: 0 is not a valid factor level, so this function will generate warnings when it tries to replace factor values with 0; those factor values are replaced with NA instead.
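The warning mentioned in that note is easy to see in isolation (a tiny illustration):

```r
f <- factor(c("A", "B"))
f[1] <- 0   # warning: invalid factor level, NA generated
f
# [1] <NA> B
# Levels: A B
```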
Using a subset of your data supplied above, this is what it looks like running the function 10 and 100 times, respectively:
set.seed(123)
num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)
factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
my_data = data.frame(num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)
random_drop <- function(x) {
# Randomly select variables
which_vars <- names(x[, sort(sample(ncol(x), sample(ncol(x), 1)))])
# Randomly select factor levels subset or generate continuous cutoff value
cutoff_vals <- lapply(
which_vars,
function(i) {
if (is.factor(x[[i]])) {
return(sample(levels(x[[i]]), sample(nlevels(x[[i]]), 1)))
}
runif(1, min(x[[i]], na.rm = TRUE), max(x[[i]], na.rm = TRUE))
}
)
names(cutoff_vals) <- which_vars
# Create random prob value
r <- runif(1,0,1)
# Generate idx for which rows to select
row_idx <- Reduce(
`&`,
lapply(
which_vars,
function(i) {
if (is.factor(x[[i]])) {
return(x[[i]] %in% cutoff_vals[[i]])
}
x[[i]] > cutoff_vals[[i]]
}
)
)
x_sub <- x[row_idx, !colnames(x) %in% which_vars, drop = FALSE]
# With prob. 'r' fill row values in with '0'
r_mat <- matrix(
sample(
c(TRUE, FALSE),
ncol(x_sub)*nrow(x_sub),
replace = TRUE,
prob = c(r, 1 - r)
),
nrow = nrow(x_sub),
ncol = ncol(x_sub)
)
x_sub[r_mat] <- 0
x[row_idx, !colnames(x) %in% which_vars] <- x_sub
return(x)
}
random_drop_recurse <- function(x, n = 10) {
if (n == 1) return(random_drop(x))
random_drop_recurse(random_drop(x), n = n - 1)
}
suppressWarnings(
head(
random_drop_recurse(my_data[, c(1:3, 6:8)], 10),
20
)
)
#> num_var_1 num_var_2 num_var_3 factor_var_1 factor_var_2 factor_var_3
#> 1 9.439524 5.021006 4.883963 B AA AAA
#> 2 9.769823 4.800225 12.369379 B AA AAA
#> 3 11.558708 9.910099 0.000000 C AA BBB
#> 4 10.070508 9.339124 22.192276 B CC DDD
#> 5 10.129288 -2.746714 11.741359 B AA AAA
#> 6 11.715065 15.202867 3.847317 <NA> AA CCC
#> 7 10.460916 11.248629 -8.068930 C CC <NA>
#> 8 8.734939 22.081037 0.000000 C AA BBB
#> 9 9.313147 13.425991 30.460189 C AA BBB
#> 10 9.554338 7.765203 4.392376 B AA AAA
#> 11 11.224082 23.986956 1.640007 A <NA> AAA
#> 12 10.359814 24.161130 16.529475 A AA AAA
#> 13 0.000000 3.906441 0.000000 A CC <NA>
#> 14 10.110683 12.345160 17.516291 B CC AAA
#> 15 9.444159 8.943765 7.220249 A AA DDD
#> 16 11.786913 10.935256 21.226542 B CC DDD
#> 17 10.497850 11.137714 -1.726089 B AA AAA
#> 18 8.033383 3.690498 9.511232 B CC CCC
#> 19 10.701356 11.427948 2.958597 B BB AAA
#> 20 9.527209 18.746237 16.807586 C AA BBB
suppressWarnings(
head(
random_drop_recurse(my_data[, c(1:3, 6:8)], 100),
20
)
)
#> num_var_1 num_var_2 num_var_3 factor_var_1 factor_var_2 factor_var_3
#> 1 9.439524 0.00000 0.000000 B <NA> <NA>
#> 2 9.769823 0.00000 12.369379 B <NA> <NA>
#> 3 11.558708 0.00000 0.000000 <NA> <NA> BBB
#> 4 10.070508 0.00000 0.000000 B <NA> <NA>
#> 5 10.129288 0.00000 0.000000 B <NA> <NA>
#> 6 11.715065 0.00000 0.000000 B <NA> <NA>
#> 7 10.460916 0.00000 0.000000 C <NA> <NA>
#> 8 0.000000 22.08104 0.000000 <NA> AA <NA>
#> 9 9.313147 0.00000 0.000000 C <NA> <NA>
#> 10 0.000000 0.00000 0.000000 B AA AAA
#> 11 11.224082 0.00000 0.000000 <NA> <NA> AAA
#> 12 10.359814 0.00000 0.000000 A <NA> <NA>
#> 13 10.400771 0.00000 0.000000 A <NA> <NA>
#> 14 10.110683 0.00000 0.000000 B <NA> <NA>
#> 15 9.444159 0.00000 0.000000 A <NA> <NA>
#> 16 11.786913 0.00000 0.000000 B <NA> <NA>
#> 17 10.497850 0.00000 0.000000 B <NA> <NA>
#> 18 8.033383 0.00000 0.000000 B <NA> <NA>
#> 19 0.000000 0.00000 2.958597 B BB AAA
#> 20 9.527209 0.00000 0.000000 C <NA> BBB
R: Randomly Replacing Elements of a Data Frame with 0
Here's a version that uses Map to let you specify a vector pnul of probabilities of becoming 0 for every column separately. The length of each split string is multiplied by the corresponding element of pnul to get the number of sampled positions set to zero. You may also set pnul to a scalar for the same probability in all columns.
pnul <- c(.0, .2, .5, .8, 1)
res <- Map(\(x, a) {
S <- strsplit(x, ',')
sapply(S, \(s) {
s[sample(seq_along(s), length(s)*a)] <- '0'
paste(s, collapse=',')
})
}, my_data, pnul) |> as.data.frame()
head(res)
# var_1 var_2 var_3 var_4 var_5
# 1 1,2,3,4,5,6,7,8,9,10 0,0,3,4,5,6,7,8,9,10 1,2,0,4,0,0,7,8,0,0 0,0,0,0,0,0,0,8,9,0 0,0,0,0,0,0,0,0,0,0
# 2 1,2,3,4,5,6,7,8,9,10 1,0,3,4,5,6,7,8,9,0 1,0,3,0,5,0,0,0,9,10 0,0,0,0,0,0,7,8,0,0 0,0,0,0,0,0,0,0,0,0
# 3 1,2,3,4,5,6,7,8,9,10 1,0,0,4,5,6,7,8,9,10 1,0,0,0,0,6,7,0,9,10 0,0,0,0,5,0,0,0,0,10 0,0,0,0,0,0,0,0,0,0
# 4 1,2,3,4,5,6,7,8,9,10 1,2,3,0,5,6,7,0,9,10 0,0,3,0,5,0,7,0,9,10 0,0,0,4,0,0,7,0,0,0 0,0,0,0,0,0,0,0,0,0
# 5 1,2,3,4,5,6,7,8,9,10 1,0,3,4,5,6,7,8,9,0 0,2,0,4,5,0,7,0,0,10 1,0,0,0,0,0,0,8,0,0 0,0,0,0,0,0,0,0,0,0
# 6 1,2,3,4,5,6,7,8,9,10 0,2,3,4,5,6,0,8,9,10 1,2,3,0,5,0,7,0,0,0 0,0,0,4,5,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0
Writing a function or loop to replace data based on two conditions, one of which is time
With data.table and dplyr:
We may use data.table::rleid for grouping, together with the group size (n() in dplyr, .N in data.table). Then use replace to replace, within every group, all values that meet the condition (Pressure > 1000 and fewer than 60 rows).
Note that this approach only works if there is strictly one observation per second; if there are missing rows or duplicate DateTime values, it may yield inconsistent results.
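A minimal sketch of that approach in data.table (the column names DateTime and Pressure and the 60-row threshold are assumed from the question; toy runs of 3 rows stand in for real ones):

```r
library(data.table)

# Toy data: one row per second, two short bursts of Pressure > 1000
dt <- data.table(
  DateTime = seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
                 by = "1 sec", length.out = 10),
  Pressure = c(rep(1500, 3), rep(900, 4), rep(1200, 3))
)

# rleid() gives one group id per consecutive run of the condition;
# within each group .N is the run length, so high-pressure runs
# shorter than 60 rows get replaced with NA.
dt[, Pressure := replace(Pressure, Pressure > 1000 & .N < 60, NA_real_),
   by = rleid(Pressure > 1000)]
```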