What happens when prob argument in sample sums to less/greater than 1?
Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.
If you look at the R code, sample always calls another R function, sample.int. If you pass in a single number x to sample, it will use sample.int to create a vector of integers less than or equal to that number, whereas if x is a vector, it uses sample.int to generate a sample of integers less than or equal to length(x), then uses that to subset x.
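The relationship between the two functions can be checked directly. This is a minimal sketch using only base R; the vector x and the seed are arbitrary choices for illustration.

```r
x <- c("a", "b", "c", "d", "e")

# sample() on a vector: draw indices with sample.int(), then subset x
set.seed(42)
via_sample <- sample(x, 3)
set.seed(42)
via_sample_int <- x[sample.int(length(x), 3)]
identical(via_sample, via_sample_int)

# sample() on a single number n behaves like sample.int(n)
set.seed(42)
from_number <- sample(5)
set.seed(42)
from_int <- sample.int(5)
identical(from_number, from_int)
```

Resetting the seed before each call is what makes the two routes comparable.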
Now, if you examine the function sample.int, it looks like this:
function (n, size = n, replace = FALSE, prob = NULL, useHash = (!replace &&
    is.null(prob) && size <= n/2 && n > 1e+07))
{
    if (useHash)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
The .Internal means any sampling is done by calling compiled code written in C: in this case, it's the function do_sample, defined in src/main/random.c.
If you look at this C code, do_sample checks whether it has been passed a prob vector. If not, it samples on the assumption of equal weights. If prob exists, the function ensures that it is numeric and not NA. If prob passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called FixUpProbs.
This function examines each member of prob and throws an error if any element is negative or not a finite double. It then normalises the numbers by dividing each by the sum of all. There is therefore no preference at all for prob summing to 1 inherent in the code. That is, even if prob sums to 1 in your input, the function will still calculate the sum and divide each number by it.
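The checks and the normalisation step can be restated in R. The function normalise_prob below is a hypothetical illustration written for this answer; it is not the C code, and its error messages are invented.

```r
# Hypothetical R restatement of the checks and normalisation described above.
# The real work happens in C; this only mirrors the behaviour as described.
normalise_prob <- function(prob) {
  if (!is.numeric(prob)) stop("prob must be numeric")
  if (any(!is.finite(prob))) stop("NA or non-finite value in prob")
  if (any(prob < 0)) stop("negative value in prob")
  if (all(prob == 0)) stop("no positive values in prob")
  prob / sum(prob)  # always divide by the sum, even if it is already 1
}

less_than_one <- normalise_prob(c(0.2, 0.2))   # input sums to 0.4
more_than_one <- normalise_prob(c(5, 5, 10))   # input sums to 20
less_than_one
more_than_one
```

Whatever the input sums to, the returned weights sum to 1.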
Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that prob should be a vector of weights, not absolute probabilities.
So the behaviour of the prob parameter from my reading of the code should be:
- prob can be absent altogether, in which case sampling defaults to equal weights.
- If any of prob's numbers are less than zero, or are infinite, or NA, the function will throw.
- An error should be thrown if any of the prob values are non-numeric, as they will be interpreted as NA in the SEXP passed to the C code.
- prob must have the same length as x or the C code throws.
- You can pass a zero probability as one or more elements of prob if you have specified replace=T, as long as you have at least one non-zero probability.
- If you specify replace=F, the number of samples you request must be less than or equal to the number of non-zero elements in prob. Essentially, FixUpProbs will throw if you ask it to sample with a zero probability.
- A valid prob vector will be normalised to sum to 1 and used as sampling weights.
As an interesting side effect of this behaviour, you can use odds instead of probabilities when choosing between 2 alternatives, by setting prob = c(1, odds).
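The rules listed above can be exercised against sample() itself. In this sketch the helper throws() is a hypothetical convenience function, and the seed, the labels and the value of odds are arbitrary.

```r
# TRUE if evaluating the expression signals an error (hypothetical helper)
throws <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

# Unnormalised weights and their normalised equivalents give the same draws
set.seed(1)
a <- sample(c("x", "y"), 20, replace = TRUE, prob = c(1, 3))
set.seed(1)
b <- sample(c("x", "y"), 20, replace = TRUE, prob = c(0.25, 0.75))
identical(a, b)

# Cases from the list that are expected to fail
throws(sample(1:3, 1, prob = c(-1, 1, 1)))                   # negative weight
throws(sample(1:3, 1, prob = c(NA, 1, 1)))                   # NA weight
throws(sample(1:3, 1, prob = c(1, 1)))                       # wrong length
throws(sample(1:3, 3, replace = FALSE, prob = c(1, 0, 1)))   # too few non-zero weights

# Zero weights are accepted with replacement if one weight is non-zero
only_three <- sample(1:3, 5, replace = TRUE, prob = c(0, 0, 1))
only_three

# Odds of 4 to 1 in favour of "win" (illustrative value)
odds <- 4
sample(c("lose", "win"), 10, replace = TRUE, prob = c(1, odds))
```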
How should I specify argument prob when using sample() for resampling?
Overthinking is the devil.
You want to resample these samples, following the original distribution or an empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical PDF is just
plot(sort(x), rep(1/length(x), length(x)))
So, we want prob = rep(1/length(x), length(x)) or simply, prob = rep(1, length(x)) as sample normalizes prob internally. Or, just leave it unspecified as equal probability is default.
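A short sketch of the three equivalent ways to resample, assuming that "resampling" here means drawing with replacement (a bootstrap-style resample). The data in x and the seed are made up for illustration.

```r
# Illustrative data; any numeric vector would do
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7)
n <- length(x)

# Explicit equal probabilities
set.seed(123)
r1 <- sample(x, n, replace = TRUE, prob = rep(1 / n, n))

# Equal unnormalised weights: normalised internally to 1/n each
set.seed(123)
r2 <- sample(x, n, replace = TRUE, prob = rep(1, n))
identical(r1, r2)

# Leaving prob unspecified also samples uniformly
r3 <- sample(x, n, replace = TRUE)
r3
```

The first two calls follow the same weighted code path, so with the same seed they return the same values. The third call samples from the same distribution, but its draws for a given seed may differ from the first two.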
How does prob argument in rbinom work when prob is a vector?
The vector is recycled over the 17 generated values:
> rbinom(17, 1, c(0,.999))
[1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
Often R will generate a warning if you try recycling two vectors whose lengths don't fit into each other:
> (1:10) + (1:3)
[1] 2 4 6 5 7 9 8 10 12 11
Warning message:
In (1:10) + (1:3) :
longer object length is not a multiple of shorter object length
but not in this case.
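The recycling is easiest to see with probabilities of exactly 0 and 1, which make the output deterministic. This is only a sketch; the lengths 6 and 5 are arbitrary.

```r
# prob alternates 0, 1, 0, 1, ... so the result alternates between 0 and 1
even_case <- rbinom(6, 1, c(0, 1))
even_case

# 5 is not a multiple of 2, yet prob is still recycled across the draws
odd_case <- rbinom(5, 1, c(0, 1))
odd_case
```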
R_Sample with probabilities
sample(...) takes a random sample with probabilities given in prob=..., so you will not get exactly that proportion every time. On the other hand, the proportions get closer to those specified in prob as n increases:
f <- function(n)sample(1:4,n,replace=T,prob=(1:4)/10)
samples <- lapply(10^(2:6),f)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
# n 1 2 3 4
# [1,] 1e+02 0.090000 0.220000 0.260000 0.430000
# [2,] 1e+03 0.076000 0.191000 0.309000 0.424000
# [3,] 1e+04 0.095300 0.200200 0.310100 0.394400
# [4,] 1e+05 0.099720 0.199800 0.302250 0.398230
# [5,] 1e+06 0.099661 0.199995 0.300223 0.400121
If you need a random sample with exactly those proportions, use rep(...) and randomize the order.
g <- function(n) rep(1:4,n*(1:4)/10)[sample(1:n,n)]
samples <- lapply(10^(2:6),g)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
# n 1 2 3 4
# [1,] 1e+02 0.1 0.2 0.3 0.4
# [2,] 1e+03 0.1 0.2 0.3 0.4
# [3,] 1e+04 0.1 0.2 0.3 0.4
# [4,] 1e+05 0.1 0.2 0.3 0.4
# [5,] 1e+06 0.1 0.2 0.3 0.4
Sample 'randomly' but ensure final sample representative of population?
How about creating copies of all 14 image indices with rep(1:14, 4) and then shuffling that array: sample(rep(1:14, 4)).
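A quick check that the shuffled vector stays balanced. The name trial_order and the seed are arbitrary choices for illustration.

```r
set.seed(2024)                        # arbitrary seed, only for reproducibility
trial_order <- sample(rep(1:14, 4))   # 56 entries in random order
length(trial_order)
table(trial_order)                    # every index appears exactly 4 times
```

Because sample() without a size argument only permutes its input, the counts per image are fixed and only the order is random.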
sample and rbinom functions in R
From the documentation for rbinom:
The numerical arguments other than n are recycled to the length of the result.
This means that in your example the prob vector you pass in will be recycled until it reaches the required length (presumably 5). So the vector which will be used is:
c(0.9, 0.2, 0.3, 0.9, 0.2)
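The recycled vector can be reproduced with rep_len(). This sketch assumes the original prob was c(0.9, 0.2, 0.3), as the recycled values imply, and uses size = 1 and an arbitrary seed because the question's actual call is not shown.

```r
prob <- c(0.9, 0.2, 0.3)      # assumed input, inferred from the recycled vector
recycled <- rep_len(prob, 5)  # what rbinom effectively uses for 5 draws
recycled

# Draw i uses recycled[i] as its success probability
set.seed(7)
rbinom(5, 1, prob)
```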
As for the sample function, as @thelatemail pointed out the probabilities do not have to sum to 1. It appears that the prob vector gets normalized to 1 internally.
Understanding code about inverse transform sampling R
It looks like you need to do some basic research on R and programming in general. Here are short answers to your simple questions, but please read on afterward for some broader advice.
- Where is the 1 value when returned? Wherever it is assigned. Here, namely in samples[i] for whichever i that branch is reached.
- Where is this state allocated? In the line for(state in 2:length(p.vec)).
- Why is this line names(p.vec)<-1:4 for? Good question. names()<- just assigns names to an object, and I'm not sure why in your context it's useful to have names that are equal to the vector indices, though I could imagine it to be so in some contexts.
- What seq_len means? seq_len(x) creates an integer vector with all the numbers from 1 to x inclusive. See help("seq_len").
- Why samples[i] is not used anymore in the code? Because it's only useful in the for loop.
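The pieces mentioned in these answers can be tried in isolation. The values in p.vec are hypothetical, since the question's code is not reproduced here, and the loop body is only a stand-in.

```r
p.vec <- c(0.1, 0.4, 0.2, 0.3)   # hypothetical probabilities
names(p.vec) <- 1:4              # names are stored as the strings "1" to "4"
names(p.vec)

seq_len(4)                       # the integers 1, 2, 3, 4
seq_len(0)                       # empty vector, so a loop over it runs zero times
1:0                              # counts down instead: 1, 0

# samples[i] is assigned inside the loop, one element per iteration
samples <- numeric(3)
for (i in seq_len(3)) samples[i] <- i^2
samples
```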
All this points to a bigger problem though: You don't understand the basics of R. We all started out there, but it means you need to read some basic info and work through some basic tutorials. RStudio provides some resources for learning.