Automate for loop to increase samples selected
You can use lapply
and save the output as nested list.
select_rows <- 2:4
n_times <- 5
inds <- nrow(iris)
result <- lapply(select_rows, function(x)
replicate(n_times, iris[sample(inds, x), ], simplify = FALSE))
Change select_rows
and n_times
as per your choice.
How to resample and remodel n times by vectorization?
(Assuming butterfly$inv.alt <- 1/butterfly$alt
)
You get the error because resample
is not a list of resampled data.frames, which you can obtain with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
The the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef
does work. The correct path to coefficients are: re.model[[1]]$coef
, re.model[[2]]$coef
, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combined it with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate
:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
Resample and looping over dplyr functions in R
I don't think you need a loop. You can do this faster by sampling 3*1000
values per group at once, assign sample_id
and add it to grouping variables, and finaly summarize
to get desired values. This way you are calling all functions only once. -
dat %>%
group_by(fertilizer, crop, level) %>%
sample_n(3*1000, replace = T) %>%
mutate(sample_id = rep(1:1000, each = 3)) %>%
group_by(sample_id, add = TRUE) %>%
summarise(
mean = mean(growth, na.rm = T),
var = sd(growth)^2
) %>%
ungroup()
# A tibble: 8,000 x 6
fertilizer crop level sample_id mean var
<chr> <chr> <chr> <int> <dbl> <dbl>
1 N alone high 1 30.7 2640.
2 N alone high 2 1 0
3 N alone high 3 60.3 2640.
4 N alone high 4 1.33 0.333
5 N alone high 5 1.33 0.333
6 N alone high 6 60.3 2640.
7 N alone high 7 1.33 0.333
8 N alone high 8 30.3 2670.
9 N alone high 9 1.33 0.333
10 N alone high 10 60.7 2581.
# ... with 7,990 more rows
How to create a dataframe with repeated columns created from randomly sampling another dataframe?
Here is a data.table
option that might help
dt <- as.data.table(df)
dt2 <- as.data.table(df2)
n <- 1000
res <- cbind(
dt2[, .(obs)],
dt2[
,
replicate(n, sample(dt[.BY, x, on = "cat"], .N, replace = TRUE), simplify = FALSE),
cat
]
)
How to run a function multiple times and write the results to a list?
Combination of above answer, comment, and my own answer. Naturally, I like mine better. Also, I think there is a mistake in the above answer for base
R.
n <- 10
# give 1 to gen_mat n-times
lapply(rep(1, n), gen_mat)
# replicate n-times
replicate(n, gen_mat(1), simplify=FALSE)
# lapply returns error if FUN is not function or
# the function is not taking an argument. Hence a dummy function.
lapply(seq_len(n), function(x) gen_mat(1))
microbenchmarking the three methods
I used a larger value for n
, but the results are similar in my desktop with smaller n
as well. For this, replicate
takes longer than the other two methods.
set.seed(1)
gen_mat <- function(x) matrix(c(1, 1, 1, x + rnorm(1)), nrow = 2)
n <- 1000
library(microbenchmark)
library(ggplot2)
mb <- microbenchmark(
lap1 = {lapply(rep(1, n), gen_mat)},
repl = {replicate(n, gen_mat(1), simplify=FALSE)},
lap2 = {lapply(seq_len(n), function(x) gen_mat(1))},
times = 10000L
)
mb
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lap1 2.839435 3.494157 4.806954 3.854269 5.611413 103.0111 10000 a
# repl 3.091829 3.777199 5.140789 4.165856 6.013591 101.4318 10000 b
# lap2 3.131491 3.761274 5.089170 4.140316 5.939075 101.1983 10000 b
autoplot(mb)
In R,Computing average from 1000 replications
Here's a more R
-ish solution that doesn't use a loop:
x <- replicate(1000, sample.int(500,30,replace=F))
y <- abs(2*colMeans(x)-501)
sumValue = sum(y)
divValue = mean(y)
replicate
will give us the desired output as a matrix, and we can then use the vectorized function colMeans
.
This method will perform much better as your number of replicates increases:
loopSoln <- function(N) {
value = list()
for(i in 1:N) {
x=sample(1:500,30,replace=F)
mhat=stimateM=2*mean(x)-1
y=abs(mhat-500)
value[i]=y
}
sumValue = sum(unlist(value))
divValue = sumValue/N
}
replicateSoln <- function(N) {
x <- replicate(1000, sample.int(500,30,replace=F))
y <- abs(2*colMeans(x)-501)
sumValue = sum(y)
divValue = mean(y)
}
(ltimes <- sapply(c(1e2,1e3,1e4,1e5), function(N) system.time(loopSoln(N))[3]))
## elapsed elapsed elapsed elapsed
## 0.002 0.014 0.158 2.009
(rtimes <- sapply(c(1e2,1e3,1e4,1e5), function(N) system.time(replicateSoln(N))[3]))
## elapsed elapsed elapsed elapsed
## 0.007 0.011 0.007 0.010
plot(ltimes~I(2:5), type = 'l', col = 2, xlab = "log_10 number of replicates", ylab = "elapsed time (sec)")
lines(rtimes~I(2:5), col = 4)
legend("topleft", lty = 1, col = c(2,4),
legend = c("Loop", "replicate + rowMeans"))
Repeat simulation of test scores 1000 times
At the end of the day you are testing whether at least 22 students score higher than 78, which can be compactly computed with:
sum(rnorm(75, 70, 10) > 78) >= 22
Breaking this down a bit, rnorm(75, 70, 10)
returns the 75 scores, which are normally distributed with mean 70 and standard deviation 10. rnorm(75, 70, 10) > 78
is a vector of length 75 that indicates whether or not each of these scores is above 78. sum(rnorm(75, 70, 10) > 78)
converts each true to a 1 and each false to a 0 and sums these values up, meaning it counts the number of the 75 scores that exceed 78. Lastly we test whether the sum is 22 or higher with the full expression above.
replicate
can be used to replicate this any number of times. So to see the breakdown of 1000 simulations, you can use the following 1-liner (after setting your random seed, of course):
set.seed(144)
table(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# FALSE TRUE
# 936 64
In 64 of the replicates, at least 22 students scored above a 78, so we estimate the probability to be 6.4%.
Related Topics
Implementation of Skyline Query or Efficient Frontier
Why Does Is.Vector() Return True for List
How to See All Rows of a Data Frame in a Jupyter Notebook with an R Kernel
Missing Data When Supplying a Dual-Axis--Multiple-Traces to Subplot
Group Vector on Conditional Sum
R Specify Function Environment
Why Are the Colors Wrong on This Ggplot
Ggplot2: Problem with X Axis When Adding Regression Line Equation on Each Facet
R Shiny Widgetfunc() Warning Messages with Eventreactive(Warning 1) and Renderdatatable (Warning 2)
Compute Projection/Hat Matrix via Qr Factorization, Svd (And Cholesky Factorization)
Nas Are Not Allowed in Subscripted Assignments
How to Ignore Na in Ifelse Statement
R Shiny Loop to Display Multiple Plots
In R, How to Plot into a Memory Buffer Instead of a File
How to Convert Unix Timestamp (Milliseconds) and Timezone in R