Using Sample() with Sample Space Size = 1

The documentation for sample() recommends this wrapper:

resample <- function(x, ...) x[sample.int(length(x), ...)]

Sample with only 1 number

?sample supplies an answer in its examples:

set.seed(47)

resample <- function(x, ...) x[sample.int(length(x), ...)]

# infers 100 means 1:100
sample(100, 1)
#> [1] 98

# stricter
resample(100, 1)
#> [1] 100

# still works normally if explicit
resample(1:100, 1)
#> [1] 77

Why does sample() not work as expected for a single number?

Or do I just need to include an if statement to avoid this?

Yeah, unfortunately. Something like this:

result <- if (length(x) == 1) x else sample(x, ...)
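The same guard can be written in Python: sample indices rather than values, so a length-1 input can never be reinterpreted as a range (a sketch, not from the original thread):

```python
import random

def resample(x, k=1):
    # Like the R wrapper: draw indices without replacement, then index
    # into x, so a length-1 sequence always returns its single element.
    return [x[i] for i in random.sample(range(len(x)), k)]

print(resample([100]))                   # always [100], never a draw from 1..100
print(resample(list(range(1, 101)), 3))  # three distinct values from 1..100
```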

Sample with more samples at the beginning and end of sample space

This will give you more samples towards the end of the interval:

import numpy as np

np.sqrt(np.linspace(0, 100, 5))
array([ 0.        ,  5.        ,  7.07106781,  8.66025404, 10.        ])

You can take a higher-order root to get more closely spaced points towards the end.

To get more samples towards both the beginning and end of the interval, make the original linspace symmetrical around 0 and then shift it.

General function:

import numpy as np

def nonlinspace(xmin, xmax, n=50, power=2):
    '''Interval from xmin to xmax with n points; the higher the power, the denser towards the ends'''
    xm = (xmax - xmin) / 2
    x = np.linspace(-xm**power, xm**power, n)
    return np.sign(x)*abs(x)**(1/power) + xm + xmin

Examples:

>>> nonlinspace(0,10,5,2).round(2)
array([ 0. , 1.46, 5. , 8.54, 10. ])
>>> nonlinspace(0,10,5,3).round(2)
array([ 0. , 1.03, 5. , 8.97, 10. ])
>>> nonlinspace(0,10,5,4).round(2)
array([ 0. , 0.8, 5. , 9.2, 10. ])
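As a quick numeric check (the function is repeated here so the snippet runs standalone), the gaps between consecutive points should be smallest at the two ends and widest in the middle:

```python
import numpy as np

def nonlinspace(xmin, xmax, n=50, power=2):
    # Same function as above, reproduced so this check is self-contained.
    xm = (xmax - xmin) / 2
    x = np.linspace(-xm**power, xm**power, n)
    return np.sign(x) * abs(x)**(1 / power) + xm + xmin

gaps = np.diff(nonlinspace(0, 10, 7, power=2))
print(gaps.round(2))  # end gaps are smaller than the middle gaps
```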

Generate all samples of given size with replacement in R

To get samples of size n with replacement:

dataset = 1:100

sample(dataset, size = 2, replace = TRUE)

To get means for N samples

N = 1000

means = replicate(N, mean(sample(dataset, 2, replace = TRUE)))

To plot means

hist(means)
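The same simulation can be sketched in Python with only the standard library (variable names mirror the R code above):

```python
import random
import statistics

random.seed(1)
dataset = range(1, 101)
N = 1000

# Mean of a size-2 sample drawn with replacement, repeated N times.
means = [statistics.mean(random.choices(dataset, k=2)) for _ in range(N)]
print(min(means), max(means))  # every mean lies between 1 and 100
```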

Ok, I see from your comment you want all possible n=2 permutations of the data. This can be achieved with:

library(gtools)
x = permutations(n=3, r=2, v=1:3, repeats.allowed=T)
# n = size of sampling vector
# r = size of samples
# v = vector to sample from

This gives you a matrix with each possible permutation including repeats:

      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    2    1
 [5,]    2    2
 [6,]    2    3
 [7,]    3    1
 [8,]    3    2
 [9,]    3    3

To calculate the mean of each row of this matrix, you can use:

rowMeans(x)
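In Python, the same enumeration of all size-2 samples with replacement can be sketched with itertools.product (this is an analogue, not part of the original R answer):

```python
from itertools import product

# All ordered samples of size 2, with replacement, from 1..3 --
# the analogue of gtools::permutations(3, 2, 1:3, repeats.allowed=TRUE).
samples = list(product(range(1, 4), repeat=2))
means = [sum(s) / len(s) for s in samples]

print(len(samples))  # 9 = 3^2 possible samples
print(means)         # [1.0, 1.5, 2.0, 1.5, 2.0, 2.5, 2.0, 2.5, 3.0]
```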

How can I sample equally from a dataframe?

More elegantly, you can do this:

df.groupby('classes').apply(lambda x: x.sample(sample_size))

Extension:

You can make the sample_size a function of group size to sample with equal probabilities (or proportionately):

nrows = len(df)
total_sample_size = 1e4
df.groupby('classes').apply(
    lambda x: x.sample(int((len(x) / nrows) * total_sample_size)))

It won't give exactly total_sample_size rows, but the sampling will be more proportional than with the naive method.
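A runnable sketch of the proportional version, on a made-up frame (column names and group sizes are illustrative, not from the original question):

```python
import pandas as pd

# Unbalanced toy data: 70 rows of class 'a', 30 rows of class 'b'.
df = pd.DataFrame({"classes": ["a"] * 70 + ["b"] * 30,
                   "value": range(100)})

nrows = len(df)
total_sample_size = 20

# Each group contributes rows in proportion to its share of the frame.
prop = df.groupby("classes", group_keys=False).apply(
    lambda x: x.sample(int((len(x) / nrows) * total_sample_size)))

print(prop["classes"].value_counts())  # 14 'a' rows, 6 'b' rows
```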

How to join data with a weighted sampling?

You can split index.dat according to the zip, to give a list of data frames for each individual zip code. If you use test.sample$zip to subset this list, you will get a list of 50 data frames with the appropriate zip codes. You can then sample the fip using the weights in the prob column of each data frame.

In your case, that would look like this:

sample_space <- split(index.dat, index.dat$zip)[test.sample$zip]

test.sample$fips <- sapply(sample_space,
                           function(x) sample(x$fips, 1, prob = x$prob))

Now test.sample$fips will have a random fips chosen from the appropriate zip code, with the sampling done according to the relative weights. If we make a table of test.sample$fips, we can see that the proportions are about right:

table(test.sample$fips)

#> A1 A2  B C1 C2 
#> 13  5 19 10  3 

The 18 members of zip 1 have been assigned to A1 and A2 with an (almost) 75:25 split. All members of zip 2 are given a B, as expected, and the 13 members of zip 3 have been assigned appropriately (though by chance no C3s were selected, due to its low probability).

If test.sample had 5000 rows, we would see that the proportions are much closer to the expected weightings due to the law of large numbers:

#>   A1   A2    B   C1   C2   C3 
#> 1257  419 1687 1153  325  159 
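The same per-group weighted draw can be sketched in Python with random.choices (the zip-to-fips table below is a made-up stand-in for index.dat, with hypothetical weights):

```python
import random

# Hypothetical lookup: each zip maps to candidate fips values and weights.
index = {
    1: (["A1", "A2"], [0.75, 0.25]),
    2: (["B"], [1.0]),
    3: (["C1", "C2", "C3"], [0.70, 0.25, 0.05]),
}

random.seed(47)
zips = [1] * 18 + [2] * 19 + [3] * 13  # stand-in for test.sample$zip

# One weighted draw per row, like sample(x$fips, 1, prob = x$prob) in R.
fips = [random.choices(*index[z], k=1)[0] for z in zips]
print(len(fips), fips.count("B"))  # 50 rows; every zip-2 row gets "B"
```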

