Using sample() with sample space size = 1
The sample() documentation recommends this:
resample <- function(x, ...) x[sample.int(length(x), ...)]
Sample with only 1 number
?sample supplies an answer in its examples:
set.seed(47)
resample <- function(x, ...) x[sample.int(length(x), ...)]
# infers 100 means 1:100
sample(100, 1)
#> [1] 98
# stricter
resample(100, 1)
#> [1] 100
# still works normally if explicit
resample(1:100, 1)
#> [1] 77
Why does sample() not work for a single number?
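Or do I just need to include an if statement to avoid this?
Yes, unfortunately. When x is a single number of at least 1, sample(x) samples from 1:x instead of returning x, so that case has to be handled separately. Something like this: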
result = if(length(x) == 1) {x} else {sample(x, ...)}
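For readers working in Python: NumPy's Generator.choice follows a similar convention and treats a bare integer n as the range 0..n-1. Below is a minimal sketch of a resample-style wrapper that samples positions and then indexes, so a length-1 input is returned as is. The helper name resample only mirrors the R function above and is not part of NumPy, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(47)  # arbitrary seed, for reproducibility only


def resample(x, size=None, replace=False):
    """Sample from the elements of x by position (hypothetical helper)."""
    x = np.atleast_1d(x)              # a scalar becomes a length-1 array
    n = len(x)
    k = n if size is None else size   # default: a permutation of x
    idx = rng.choice(n, size=k, replace=replace)
    return x[idx]


single = resample(100, 1)        # always the element 100 itself
from_range = rng.choice(100)     # an integer drawn from 0..99
shuffled = resample([3, 1, 2])   # a permutation of the three elements
```

Passing the data as a sequence, even one of length 1, keeps the call unambiguous; only a bare integer is read as a range.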
Sample with more samples at the beginning and end of sample space
This will give you more samples towards the end of the interval:
np.sqrt(np.linspace(0,100,5))
array([ 0. , 5. , 7.07106781, 8.66025404, 10. ])
You can choose a higher exponent to concentrate the points even more towards the end.
To get more samples towards both the beginning and the end of the interval, make the original linspace symmetric about 0 and then just shift it.
General function:
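import numpy as np

def nonlinspace(xmin, xmax, n=50, power=2):
    '''Interval from xmin to xmax with n points; the higher the power, the denser the points towards the ends'''
    xm = (xmax - xmin) / 2  # half-width of the interval
    x = np.linspace(-xm**power, xm**power, n)
    # signed root compresses the spacing near +/-xm, then shift back to [xmin, xmax]
    return np.sign(x)*abs(x)**(1/power) + xm + xmin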
Examples:
>>> nonlinspace(0,10,5,2).round(2)
array([ 0. , 1.46, 5. , 8.54, 10. ])
>>> nonlinspace(0,10,5,3).round(2)
array([ 0. , 1.03, 5. , 8.97, 10. ])
>>> nonlinspace(0,10,5,4).round(2)
array([ 0. , 0.8, 5. , 9.2, 10. ])
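One way to check the effect is to look at the gaps between neighbouring points. The sketch below repeats the function so that it runs on its own and compares the spacing at the ends with the spacing in the middle. The variable names pts and gaps are only illustrative.

```python
import numpy as np


def nonlinspace(xmin, xmax, n=50, power=2):
    """n points from xmin to xmax, denser towards both ends for power > 1."""
    xm = (xmax - xmin) / 2
    x = np.linspace(-xm**power, xm**power, n)
    return np.sign(x) * np.abs(x)**(1 / power) + xm + xmin


pts = nonlinspace(0, 10, n=11, power=2)
gaps = np.diff(pts)  # distance between consecutive points

# The gaps are mirror-symmetric: smallest at both ends, largest around the midpoint.
print(gaps.round(2))
```

With power=1 the function reduces to an ordinary linspace, so all gaps are equal.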
Generate all samples of given size with replacement in R
To get a sample of size n (here n = 2) with replacement:
dataset = 1:100
sample(dataset, size = 2, replace = TRUE)
To get the means of N such samples:
N = 1000
means = replicate(N, mean(sample(dataset, 2, replace = TRUE)))
To plot the means:
hist(means)
Ok, I see from your comment you want all possible n=2 permutations of the data. This can be achieved with:
library(gtools)
x = permutations(n=3, r=2, v=1:3, repeats.allowed=TRUE)
# n = size of sampling vector
# r = size of samples
# v = vector to sample from
This gives you a matrix with each possible permutation including repeats:
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    3
[4,]    2    1
[5,]    2    2
[6,]    2    3
[7,]    3    1
[8,]    3    2
[9,]    3    3
To calculate the mean of each sample (each row of this matrix) you can use:
rowMeans(x)
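For comparison, the same enumeration can be sketched in Python with the standard library. itertools.product with repeat=2 yields every ordered pair with repeats, in the same order as the matrix above. This is an equivalent shown for illustration, not part of the R answer, and the variable names are hypothetical.

```python
from itertools import product
from statistics import mean

values = [1, 2, 3]

# All ordered samples of size 2 drawn with replacement: 3**2 = 9 tuples
samples = list(product(values, repeat=2))

# Mean of each sample, the counterpart of rowMeans(x)
means = [mean(s) for s in samples]

for s, m in zip(samples, means):
    print(s, m)
```

The number of samples grows as len(values)**r, so full enumeration is only practical for small inputs.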
How can I sample equally from a dataframe?
For more elegance you can do this:
df.groupby('classes').apply(lambda x: x.sample(sample_size))
Extension:
You can make sample_size a function of group size, so that every row has an equal probability of being sampled (that is, groups are sampled proportionately):
nrows = len(df)
total_sample_size = 10000
# len(x) is the number of rows in the group; x.count() would give per-column counts
df.groupby('classes').apply(
    lambda x: x.sample(int(len(x) / nrows * total_sample_size)))
It won't result in exactly total_sample_size rows, because each group's share is truncated to an integer, but the sampling will be more proportional than the naive method.
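As an alternative to the apply-based approach, recent pandas versions also provide a sample method directly on the grouped object. The sketch below assumes pandas 1.1 or later and uses a small made-up data frame; the column name value and the class sizes are hypothetical.

```python
import pandas as pd

# Hypothetical data: three classes with 60, 30 and 10 rows
df = pd.DataFrame({
    "classes": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

# Equal allocation: the same number of rows from every class
equal = df.groupby("classes").sample(n=5, random_state=0)

# Proportional allocation: the same fraction of every class
proportional = df.groupby("classes").sample(frac=0.5, random_state=0)

equal_counts = equal["classes"].value_counts().to_dict()
prop_counts = proportional["classes"].value_counts().to_dict()
print(equal_counts)
print(prop_counts)
```

With frac, each group contributes a share of rows in proportion to its size, which is the same idea as scaling sample_size by group size.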
How to join data with a weighted sampling?
You can split index.dat according to zip, to give a list of data frames, one for each individual zip code. If you use test.sample$zip to subset this list, you will get a list of 50 data frames with the appropriate zip codes. You can then sample the fips using the weights in the prob column of each data frame.
In your case, that would look like this:
sample_space <- split(index.dat, index.dat$zip)[test.sample$zip]
test.sample$fips <- sapply(sample_space,
function(x) sample(x$fips, 1, prob = x$prob))
Now test.sample$fips will have a random fips chosen from the appropriate zip code, with the sampling done according to the relative weights. If we tabulate test.sample$fips, we can see that the proportions are about right:
table(test.sample$fips)
#>  A1  A2   B  C1  C2
#>  13   5  19  10   3
The 18 members of zip 1 have been assigned to A1 and A2 with an (almost) 75:25 split. All members of zip 2 are given a B, as expected, and the 13 members of zip 3 have been assigned appropriately (though by chance no C3s have been selected, due to its low probability).
If test.sample had 5000 rows, we would see that the proportions are much closer to the expected weightings due to the law of large numbers:
#>   A1   A2    B   C1   C2   C3
#> 1257  419 1687 1153  325  159
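The same idea, looking up the candidate codes for a zip and drawing one according to its weights, can be sketched in Python with the standard library. The lookup table, its weights and the seed are made-up assumptions for illustration, not the data from the question.

```python
import random
from collections import Counter

# Hypothetical lookup: zip -> (candidate fips codes, sampling weights)
index = {
    1: (["A1", "A2"], [0.75, 0.25]),
    2: (["B"], [1.0]),
    3: (["C1", "C2", "C3"], [0.7, 0.2, 0.1]),
}


def sample_fips(zip_code, rng):
    """Draw one fips code for zip_code according to its weights."""
    codes, weights = index[zip_code]
    return rng.choices(codes, weights=weights, k=1)[0]


rng = random.Random(47)  # arbitrary seed
zips = [rng.choice([1, 2, 3]) for _ in range(5000)]
fips = [sample_fips(z, rng) for z in zips]

counts = Counter(fips)
print(sorted(counts.items()))
```

random.choices returns a list, which also sidesteps the single-candidate case: a zip with only one code simply gets that code.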