Taking a Disproportionate Sample from a Dataset in R

How to sample from a dataset based on dataset size criteria

I think cut() will be helpful here: it determines which size bracket the dataset falls into, and that bracket then picks the number of rows to sample:

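# example data:
dat <- data.frame(row = seq_len(10000), id = seq_len(10000))
# sample size by bracket: all rows up to 1e4, 15000 up to 5e4, 20000 beyond that
n <- nrow(dat)
size <- c(n, 1.5e4, 2e4)[cut(n, c(0, 1e4, 5e4, Inf))]
# sample away! (min() guards against requesting more rows than exist)
dat[sample(seq_len(n), min(n, size)), ]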
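
The size rule can also be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the name sample_by_size and its arguments are hypothetical, and findInterval() stands in for cut() with the same breakpoints.

```r
# Hypothetical helper: pick a sample size from the size bracket of the data.
# Brackets mirror the answer above: (0, 1e4] -> all rows,
# (1e4, 5e4] -> 15000 rows, (5e4, Inf) -> 20000 rows.
sample_by_size <- function(dat, breaks = c(1e4, 5e4), sizes = c(NA, 1.5e4, 2e4)) {
  n <- nrow(dat)
  # left.open = TRUE makes the intervals right-closed, like cut()'s default
  bracket <- findInterval(n, breaks, left.open = TRUE) + 1
  size <- sizes[bracket]
  if (is.na(size)) size <- n   # NA means "keep every row"
  size <- min(n, size)         # never ask for more rows than exist
  dat[sample.int(n, size), , drop = FALSE]
}

small <- data.frame(id = seq_len(500))
big   <- data.frame(id = seq_len(60000))
nrow(sample_by_size(small))  # 500
nrow(sample_by_size(big))    # 20000
```

Keeping the breakpoints and sizes as arguments makes it easy to change the rule without touching the sampling line.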

Random Sample with multiple probabilities in R

It sounds like you are interested in taking a stratified random sample. You could do this using the stratsample() function from the survey package.

In the example below, I create some fake data to mimic what you have, define a function that takes a proportional stratified random sample, and then apply that function to the fake data.

# example data
ndf <- 1000
df <- data.frame(ID=sample(ndf), Name=sample(ndf),
  Campaign=sample(c("D2D", "F2F", "TM", "WW"), ndf, prob=c(0.25, 0.38, 0.17, 0.21), replace=TRUE),
  Gender=sample(c("Male", "Female"), ndf, prob=c(0.54, 0.46), replace=TRUE))

# function to take a random proportional stratified sample of size n
# (stratsample() comes from the survey package)
library(survey)
rpss <- function(stratum, n) {
  props <- table(stratum)/length(stratum)
  nstrat <- as.vector(round(n*props))
  nstrat[nstrat==0] <- 1
  names(nstrat) <- names(props)
  stratsample(stratum, nstrat)
}

# take a random proportional stratified sample of size 10 (per-stratum rounding can shift the total slightly)
selrows <- rpss(stratum=interaction(df$Campaign, df$Gender, drop=TRUE), n=10)
df[selrows, ]
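
Because each stratum count is rounded separately, the total can differ from n. The following base-R sketch, which does not use the survey package, shows the allocation step on its own. The stratum labels and sizes are made up for illustration.

```r
# Made-up stratum labels, for illustration only
stratum <- rep(c("A", "B", "C", "D"), times = c(460, 260, 240, 40))
n <- 10

props  <- table(stratum) / length(stratum)  # population share of each stratum
nstrat <- round(n * props)                  # proportional allocation, rounded
nstrat[nstrat == 0] <- 1                    # keep at least one unit per stratum

nstrat       # A = 5, B = 3, C = 2, D = 1
sum(nstrat)  # 11, not 10: rounding plus the one-unit floor can overshoot n
```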

Sampling from a data.frame while controlling for a proportion [stratified sampling]

You can try the stratified function from my "splitstackshape" package:

library(splitstackshape)
stratified(df, "status", 10/nrow(df))
#     id1 status
#  1:   5      1
#  2:  12      1
#  3:   2      1
#  4:   1      1
#  5:   6      1
#  6:   9      1
#  7:  16      2
#  8:  17      2
#  9:  18      2
# 10:  15      2

Alternatively, using sample_frac from "dplyr" (superseded by slice_sample(prop = ...) in dplyr 1.0.0 and later):

library(dplyr)

df %>%
  group_by(status) %>%
  sample_frac(10/nrow(df))

Both of these take a stratified sample proportional to the original grouping variable: every group is sampled at the same fraction, 10/nrow(df), which equals 0.5 for this example data.

Randomly sample with if-else condition in R

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), then, grouped by 'group', shuffle the values of 'val' within each group:

library(data.table)
# index by position so that a one-row group is not expanded to 1:val
setDT(df)[, .(val = val[sample(.N)]), by = group]

If we need a condition such that a group with more than 3 rows yields a sample of 3 values, and any other group yields all of its values:

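# index by position in the else branch: sample(val) on a single number would draw from 1:val
setDT(df)[, if (.N > 3) sample(val, 3, replace = FALSE) else val[sample(.N)], by = group]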
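
One caveat applies to any per-group sample() call: when x is a single number of at least 1, sample(x) draws from 1:x rather than returning x itself. Below is a minimal base-R sketch of a guard. The helper name safe_sample and the toy data are hypothetical; the same indexing pattern appears under the name resample in the examples of ?sample.

```r
# Hypothetical helper: sample up to `k` values of x by index, so a
# length-one x is returned as is instead of being expanded to 1:x
safe_sample <- function(x, k = length(x)) {
  x[sample.int(length(x), min(k, length(x)))]
}

set.seed(42)
length(sample(10))                      # 10: a permutation of 1:10, not the value 10
safe_sample(10)                         # 10
safe_sample(c(7, 8, 9, 10, 11), k = 3)  # three of the five values

# Applied per group in base R: at most 3 values from each group
toy <- data.frame(group = c("a", "a", "a", "a", "a", "b"),
                  val   = c(1, 2, 3, 4, 5, 10))
picked <- lapply(split(toy$val, toy$group), safe_sample, k = 3)
lengths(picked)                         # a = 3, b = 1
```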

How do I sub sample data by group using ddply?

It looks like it should work once you remove the ", subset" argument from your ddply call.

Stratified sampling or proportional sampling in R

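You can use my stratified function (available in the "splitstackshape" package), specifying a value < 1 as your proportion, like this:

## Sample data. Seed for reproducibility
## stratified() ships with splitstackshape; output formatting may differ by version
library(splitstackshape)
set.seed(1)
N <- 50
myData <- data.frame(a = 1:N, b = round(rnorm(N), 2), group = round(rnorm(N, 4), 0))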

## Taking the sample
out <- stratified(myData, "group", .3)
out
#     a     b group
# 17 17 -0.02     2
# 8   8  0.74     3
# 25 25  0.62     3
# 49 49 -0.11     3
# 4   4  1.60     3
# 26 26 -0.06     4
# 27 27 -0.16     4
# 7   7  0.49     4
# 12 12  0.39     4
# 40 40  0.76     4
# 32 32 -0.10     4
# 9   9  0.58     5
# 42 42 -0.25     5
# 43 43  0.70     5
# 37 37 -0.39     5
# 11 11  1.51     6

Compare the group counts in the final sample with what we would have expected.

round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1

You can also easily take a fixed number of samples per group, like this:

stratified(myData, "group", 2)
#     a     b group
# 34 34 -0.05     2
# 17 17 -0.02     2
# 49 49 -0.11     3
# 22 22  0.78     3
# 12 12  0.39     4
# 7   7  0.49     4
# 18 18  0.94     5
# 33 33  0.39     5
# 45 45 -0.69     6
# 11 11  1.51     6
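
For readers without the package, the two modes (a proportion below 1, or a fixed count per group) can be approximated in base R. This is a rough sketch under stated assumptions, not the implementation of stratified(): the function name strat_sample and the toy data are hypothetical, and a group smaller than the requested count is simply returned whole.

```r
# Hypothetical base-R stand-in: size < 1 is a per-group proportion,
# size >= 1 is a fixed number of rows per group
strat_sample <- function(data, group, size) {
  rows <- split(seq_len(nrow(data)), data[[group]])
  keep <- lapply(rows, function(idx) {
    k <- if (size < 1) round(length(idx) * size) else min(size, length(idx))
    idx[sample.int(length(idx), k)]
  })
  data[unlist(keep, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
toy <- data.frame(a = 1:50, group = rep(c("x", "y", "z"), times = c(10, 30, 10)))
table(strat_sample(toy, "group", 0.3)$group)  # x = 3, y = 9, z = 3
table(strat_sample(toy, "group", 2)$group)    # x = 2, y = 2, z = 2
```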

Random sampling from a dataset, while preserving original probability distribution

It works as you want. A simple random sample does not depend on how the rows are arranged, so the order of the data is irrelevant.
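
To see why row order does not matter, here is a small sketch with simulated data (the category labels and probabilities are invented for illustration). A simple random sample drawn from sorted data and one drawn from shuffled data both reproduce the original proportions approximately.

```r
set.seed(123)
# Invented population: three categories with unequal shares
pop <- sample(c("low", "mid", "high"), 1e5, replace = TRUE,
              prob = c(0.6, 0.3, 0.1))

sorted   <- sort(pop)    # rows grouped by category
shuffled <- sample(pop)  # rows in random order

# Draw 10% of the positions at random from each ordering
s1 <- sorted[sample.int(length(sorted), 1e4)]
s2 <- shuffled[sample.int(length(shuffled), 1e4)]

# The three rows of proportions should be close to one another
round(rbind(population    = prop.table(table(pop)),
            from_sorted   = prop.table(table(s1)),
            from_shuffled = prop.table(table(s2))), 3)
```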

MATLAB: Taking sample with same number of values from each class

It shouldn't be too hard. Let's say that the observations are in a vector named observations. Then you can do:

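fraction = 0.7;

classes = unique(observations);
nObs = length(observations);
nClasses = length(classes);
% samples per class, so that the total is roughly fraction * nObs
nSamples = round(nObs * fraction / nClasses);

% preallocate the output (assumes numeric observations)
samples = zeros(1, nClasses * nSamples);
for ii = 1:nClasses
    idx = observations == classes(ii);
    samples((ii-1)*nSamples+1:ii*nSamples) = randsample(observations(idx), nSamples);
end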

Now samples is a vector of length nClasses * nSamples that contains your sampled observations, with an equal number from each class.

At the moment it will fail if one of the classes doesn't contain at least nSamples observations. The simplest fix is to pass true as the third argument to randsample, i.e. randsample(observations(idx), nSamples, true), which makes it sample with replacement, so the same observation can be picked more than once.
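
For comparison with the R answers above, here is a rough R sketch of the same equal-allocation idea. It samples positions rather than values and falls back to sampling with replacement only for classes that are too small. The toy data and variable names are assumptions made for illustration.

```r
set.seed(7)
# Toy class labels: class "c" is deliberately rare
observations <- rep(c("a", "b", "c"), times = c(60, 35, 5))
fraction <- 0.7

classes  <- unique(observations)
nSamples <- round(length(observations) * fraction / length(classes))  # 23 per class

idx <- lapply(classes, function(cl) {
  members <- which(observations == cl)
  # sample with replacement only when the class has fewer members than requested
  members[sample.int(length(members), nSamples,
                     replace = length(members) < nSamples)]
})
samples <- observations[unlist(idx)]
table(samples)  # 23 of each class
```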


