How to sample from a dataset based on dataset size criteria
I think cut will be helpful here in determining the group and then sampling an appropriate number of rows:
# example data:
dat <- data.frame(row=seq_len(10000),id=seq_len(10000))
# sample away!
dat[sample(seq_len(nrow(dat)), c(nrow(dat),1.5e4,2e4)[cut(nrow(dat), c(0,1e4,5e4,Inf))]),]
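The one-liner above does the interval lookup and the sampling in a single expression. Here is a more explicit sketch of the same idea. The helper name sample_by_size and the min() guard are assumptions added for illustration, not part of the original answer; the guard is there because sampling without replacement fails if you ask for more rows than exist.

```r
# Hypothetical helper: choose a sample size from the row count, then sample rows.
sample_by_size <- function(dat) {
  n <- nrow(dat)
  # cut() returns a factor; as.integer() gives the interval index:
  # 1 for (0, 1e4], 2 for (1e4, 5e4], 3 for (5e4, Inf]
  grp <- as.integer(cut(n, c(0, 1e4, 5e4, Inf)))
  size <- c(n, 1.5e4, 2e4)[grp]
  # Assumption: cap the request at the number of available rows
  size <- min(size, n)
  dat[sample(seq_len(n), size), , drop = FALSE]
}

dat <- data.frame(row = seq_len(10000), id = seq_len(10000))
nrow(sample_by_size(dat))
```

With the thresholds from the answer, a 10,000-row data frame falls in the first interval, so all rows are returned in random order.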
Random Sample with multiple probabilities in R
It sounds like you are interested in taking a random stratified sample. You could do this using the stratsample() function from the survey package.
In the example below, I create some fake data to mimic what you have, then I define a function to take a proportional stratified random sample, then I apply the function to the fake data.
library(survey)  # provides stratsample()
# example data
ndf <- 1000
df <- data.frame(ID=sample(ndf), Name=sample(ndf),
  Campaign=sample(c("D2D", "F2F", "TM", "WW"), ndf, prob=c(0.25, 0.38, 0.17, 0.21), replace=TRUE),
  Gender=sample(c("Male", "Female"), ndf, prob=c(0.54, 0.46), replace=TRUE))
# function to take a random proportional stratified sample of size n
rpss <- function(stratum, n) {
  # share of rows in each stratum
  props <- table(stratum)/length(stratum)
  # rows to draw per stratum, with at least one from each
  nstrat <- as.vector(round(n*props))
  nstrat[nstrat==0] <- 1
  names(nstrat) <- names(props)
  stratsample(stratum, nstrat)
}
# take a random proportional stratified sample of size 10
selrows <- rpss(stratum=interaction(df$Campaign, df$Gender, drop=TRUE), n=10)
df[selrows, ]
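If you would rather not depend on a package, the allocation step can be reproduced in base R. This is only a sketch: the stratum labels below are made up, and the per-stratum draw uses sample.int() in place of stratsample(). Note that because of the rounding (and the minimum of one row per stratum), the total drawn can differ slightly from n.

```r
set.seed(42)
# Hypothetical stratum labels
stratum <- sample(c("A", "B", "C"), 200, replace = TRUE, prob = c(0.5, 0.3, 0.2))
n <- 20

# Proportional allocation, with at least one row per stratum
props <- table(stratum) / length(stratum)
alloc <- as.vector(round(n * props))
alloc[alloc == 0] <- 1
names(alloc) <- names(props)

# Draw the allocated number of row indices within each stratum
rows <- unlist(lapply(names(alloc), function(s) {
  idx <- which(stratum == s)
  idx[sample.int(length(idx), alloc[[s]])]
}))

alloc
table(stratum[rows])
```

The two printed tables should agree, since each stratum contributes exactly its allocated count.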
Sampling from a data.frame while controlling for a proportion [stratified sampling]
You can try the stratified function from my "splitstackshape" package:
library(splitstackshape)
stratified(df, "status", 10/nrow(df))
# id1 status
# 1: 5 1
# 2: 12 1
# 3: 2 1
# 4: 1 1
# 5: 6 1
# 6: 9 1
# 7: 16 2
# 8: 17 2
# 9: 18 2
# 10: 15 2
Alternatively, using sample_frac from "dplyr":
library(dplyr)
df %>%
group_by(status) %>%
sample_frac(10/nrow(df))
Both of these would take a stratified sample proportional to the original grouping variable (hence the use of 10/nrow(df), or, equivalently, 0.5).
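To see where the 0.5 comes from, here is a base R sketch of the same per-group fraction. The 20-row data frame is an assumption (the question's data is not shown here); it is built with 12 rows of status 1 and 8 of status 2, so sampling half of each group should give 6 and 4 rows.

```r
# Hypothetical 20-row data frame: 12 rows with status 1, 8 with status 2
df <- data.frame(id1 = 1:20, status = rep(c(1, 2), times = c(12, 8)))
frac <- 10 / nrow(df)  # 0.5 for 20 rows

# Sample the same fraction of rows within each status group
picked <- do.call(rbind, lapply(split(df, df$status), function(g) {
  g[sample.int(nrow(g), round(nrow(g) * frac)), , drop = FALSE]
}))

table(picked$status)
```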
Randomly sample with if-else condition in R
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), then, grouped by 'group', apply sample to 'val' (with no size given, this shuffles the values within each group):
library(data.table)
setDT(df)[, .(val=sample(val)), by = group]
To add a condition, so that groups with more than 3 rows are sampled down to 3 values and smaller groups keep all of their values:
setDT(df)[, if(.N >3 ) sample(val, 3, replace=FALSE) else sample(val), by = group]
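One thing to watch for: when sample() is given a single number x >= 1, it samples from 1:x rather than returning x, so a group with exactly one numeric value can give surprising results. A sketch that avoids this by sampling positions instead of values is below. The helper name sample_up_to and the example data are assumptions for illustration, and it is written in base R so it runs without data.table.

```r
# Hypothetical helper: return up to k values from x, sampled by position
sample_up_to <- function(x, k = 3) {
  x[sample.int(length(x), min(length(x), k))]
}

# Hypothetical data: group "c" has a single value
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "c"),
                 val   = c(11, 12, 13, 14, 15, 21, 22, 7))

res <- lapply(split(df$val, df$group), sample_up_to)
lengths(res)
res$c
```

Here group "c" keeps its single value, whereas sample(7) would have returned a permutation of 1:7.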
How do I sub sample data by group using ddply?
It looks like it should work once you remove ", subset" from your call.
stratified sampling or proportional sampling in R
You can use my stratified function, specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the counts in the final group with what we would have expected.
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6
Random sampling from a dataset, while preserving original probability distribution
It works as you want. The order of the data is irrelevant.
MATLAB: Taking sample with same number of values from each class
It shouldn't be too hard. Let's say that the observations are in a vector observations. Then you can do
fraction = 0.7;
classes = unique(observations);
nObs = length(observations);
nClasses = length(classes);
nSamples = round(nObs * fraction / nClasses);
for ii = 1:nClasses
    idx = observations == classes(ii);
    samples((ii-1)*nSamples+1:ii*nSamples) = randsample(observations(idx), nSamples);
end
Now samples is a vector of length nClasses * nSamples that contains your sampled observations, with an equal number from each class.
At the moment it will fail if one of the classes doesn't contain at least nSamples observations. The simplest fix is to pass true as the third argument to randsample, which tells it to sample with replacement, so the same observation can be picked more than once.