Random sampling - matrix
There is a very easy way to sample a matrix that works if you understand that R represents a matrix internally as a vector. This means you can use sample directly on your matrix. For example, let's assume you want to sample 10 points with replacement:
n <- 10
replace <- TRUE
Now just use sample on your matrix:
set.seed(1)
sample(dataset, n, replace=replace)
[1] 1 0 0 1 0 1 1 0 0 1
To demonstrate how this works, let's decompose it into two steps. Step 1 is to generate an index of sampling positions, and step 2 is to find those positions in your matrix:
set.seed(1)
mysample <- sample(length(dataset), n, replace=replace)
mysample
[1] 8 12 18 28 7 27 29 20 19 2
dataset[mysample]
[1] 1 0 0 1 0 1 1 0 0 1
And, hey presto, the results of the two methods are identical.
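The snippets above assume an existing 0/1 matrix called dataset. A minimal reproducible sketch (the matrix here is made up to stand in for dataset) confirming that the one-step and two-step approaches agree:

```r
set.seed(1)
# hypothetical 5 x 6 binary matrix standing in for `dataset`
dataset <- matrix(rbinom(30, size = 1, prob = 0.5), nrow = 5)

n <- 10
set.seed(2)
direct <- sample(dataset, n, replace = TRUE)        # one-step sampling
set.seed(2)
idx <- sample(length(dataset), n, replace = TRUE)   # two-step: positions first...
indirect <- dataset[idx]                            # ...then extract them

identical(direct, indirect)  # TRUE
```

Because sample draws the same internal index stream in both cases, resetting the seed makes the two results match exactly.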
Sampling with replacement a matrix element from a list in R
If we want to sample from each element of the list 'v', use sapply to loop over the list:
out <- sapply(v, function(x) sample(x, size = 11, replace = TRUE))
or, if you want to sample 11 list elements, sample from the sequence of the list and use that index to extract the list elements:
out <- v[sample(seq_along(v), 11, replace = TRUE)]
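A minimal sketch with a made-up list v showing both calls side by side:

```r
set.seed(42)
# hypothetical list of numeric vectors
v <- list(a = 1:5, b = 10:20, c = c(7, 7, 9))

# 11 draws (with replacement) from *within* each list element -> 11 x 3 matrix
out1 <- sapply(v, function(x) sample(x, size = 11, replace = TRUE))
dim(out1)  # 11 3

# 11 whole elements drawn (with replacement) from the list itself
out2 <- v[sample(seq_along(v), 11, replace = TRUE)]
length(out2)  # 11
```

The first call samples values inside each element; the second samples entire elements, so repeats of the same vector are possible.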
How do I select a sample of rows at random with repetition from a matrix in R?
Use sample on the rows with replace=TRUE or replace=FALSE.
If X is your original matrix, then
X[sample(nrow(X),size=5,replace=TRUE),]
or
X[sample(nrow(X),size=5,replace=FALSE),]
should work. (It may be more readable if you choose the sample first, s <- sample(...), and then subset: newmat <- X[s,].)
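A quick sketch with a small made-up matrix; note that drawing more rows than the matrix has only works with replace=TRUE:

```r
set.seed(7)
X <- matrix(1:20, nrow = 4)   # hypothetical 4 x 5 matrix

s <- sample(nrow(X), size = 5, replace = TRUE)  # row indices, repeats allowed
newmat <- X[s, ]
dim(newmat)  # 5 5
```

With replace=FALSE the size argument must not exceed nrow(X), or sample will throw an error.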
Loops with random sampling from a matrix and distance calculation
Not EXACTLY sure what you are looking for, but this code may help. It's not extremely fast: instead of using a while loop to stop after hitting your total_capacity, it does a cumsum on the full node list and finds the place where you exceed 120.
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
And then I prefer using a function for the distance; here A and B are two separate coordinate vectors of the form c(x, y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
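As a quick sanity check of the distance helper, the classic 3-4-5 right triangle:

```r
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances

euc.dist(A = c(0, 0), B = c(3, 4))  # 5
```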
The Loop:
for(i in 1:iterations){
# random hub selection
hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
for(h in 1:hubs_required){
# sample the nodes into a random order
random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE),]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
min(rowSums(track_sums))
EDIT: as a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
# initialize results matrices
track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
colnames(node_count) <- paste0("nodes_at_hub",1:hubs_required)
# user defined distance function (only exists within the hubnode() function)
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
for(i in 1:iterations){
# random hub selection
assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
hubs <- nodes[assigned_hubs,]
assigned_nodes <- NULL
for(h in 1:hubs_required){
# sample the nodes into a random order
assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
random_nodes <- nodes[assigned_nodes,]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
# if there are none
if(is.na(last)) last = nrow(random_nodes)
node_count[i,h] <- last
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count),
rowSums(track_sums), xlab = "Node Count", ylab = "Total Distance", main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))
R - Randomly sample from a matrix using a distribution to denote the number of zeros in each column - stratified sampling
The problem was to down-sample a large matrix to the size of a smaller matrix while also ensuring that the proportion of zeros for each variable in the down-sampled matrix is equal to that of the smaller matrix.
So after days of research I have found a method to solve the problem without using any predefined function to do the heavy lifting. As it turned out, this was a problem of permutations and, as I painfully figured out, as the number of rows in the matrix I needed to sample from (b) and the number of rows I needed to down-sample to (the size of a) increased, the problem became computationally intractable. For example, if the matrix I'm sampling from contains 200 rows and the target size is 20, this gives n!/(n-r)! permutations, or in R:
> factorial(200)/factorial(200-20)
[1] NaN
This explosion in the number of permutations eats computational power and memory and has been well covered on Stack Overflow. So, in the interests of time and memory, I obviously couldn't check every permutation. I decided to do two things to get around this: firstly, I would only check enough permutations to give me a 95% chance of getting one in the top 5% of solutions (top here meaning the best approximation of a's zero distribution); secondly, I would stop early if I found a solution where every variable's zero proportion was within 0.05 of a's. The code below walks through the solution.
First, let's create the matrix to sample from (b) and the matrix whose size and zero distribution we want to match (a):
set.seed(1234)
# matrix a is the matrix that holds the distribution of zeros I want to match
a <- matrix(as.integer(rexp(200, rate=.1)), ncol=20)
# matrix b is the matrix to be sampled from
b <- matrix(as.integer(rexp(2000, rate=.1)), ncol=20)
Next I work out the distribution of zeros I am trying to replicate
zero_dist_to_replicate <- apply(a, 2, function(x) sum(x != 0)/length(x))
> zero_dist_to_replicate
[1] 0.8 0.8 0.9 0.6 0.7 0.9 1.0 0.8 1.0 1.0 0.8 0.9 1.0 1.0 0.9 0.9 0.9 0.9 1.0 0.9
Next I am creating variables to keep a count of permutations checked and their errors
perms_used <- list()
error <- vector()
answer <- matrix()
To work out how many random samples from b I need to take to get a permutation in the top 5%, 95% of the time, I use
ceiling(log(1-0.95)/log(1-0.05))
[1] 59
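The 59 comes from a standard argument: if each random permutation independently lands in the top p = 5% with probability p, then k draws all miss with probability (1 - p)^k, so we need the smallest k with (1 - p)^k <= 1 - 0.95. A small sketch generalizing the calculation (the helper name is mine):

```r
# smallest k such that (1 - top_frac)^k <= 1 - conf
n_random_trials <- function(conf = 0.95, top_frac = 0.05)
  ceiling(log(1 - conf) / log(1 - top_frac))

n_random_trials(0.95, 0.05)  # 59
n_random_trials(0.99, 0.05)  # 90
```

Demanding 99% confidence rather than 95% raises the number of required trials from 59 to 90.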
Now I run a while loop that randomly samples from b and checks whether my second condition from above is met. If it isn't, I store the permutation and its associated error and continue until I either find one that meets the condition or have tried 59 permutations. If I reach 59, I return the one whose zero distribution is closest to a's.
n <- nrow(b)        # rows available to sample from
num_vars <- nrow(a) # rows required in the down-sampled matrix
counter <- 1
while(counter < 59){
perm <- NULL
#Keep picking random permutations until you find one that hasn't been checked before
while(is.null(perm) || list(perm) %in% perms_used){
#sample row indices of b (sample() never returns 0, so no offset is needed)
perm <- sample(n, num_vars, replace = TRUE)
}
subsample_set <- b[perm,]
#check distribution of zeros of this permutation
subsample_set_dist <- apply(subsample_set,2, function(x) sum(x!=0)/length(x))
#if the permuted subsample's distribution of zeros is within .05
#for each variable of other matrix end early
diff <- abs(subsample_set_dist-zero_dist_to_replicate)
if(all(diff <= 0.05)){
answer <- subsample_set
break
}
#getting the sum of the error across all variables
error[counter]<-sum(diff)
perms_used[[counter]]<-perm
counter = counter+1
}
if(all(is.na(answer))){
#return the permutation with the minimum error
best_subsample <- perms_used[which(error == min(error))]
answer <- b[best_subsample[[1]], ]
}
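The search above can also be packaged as a self-contained function; the name and signature here are mine, not from the original answer, but the logic is the same (random row draws, early stop on tolerance, fall back to the best candidate seen):

```r
# Down-sample rows of `big` so the per-column proportion of non-zero entries
# approximates that of `small`; returns the sampled rows of `big`.
downsample_match_zeros <- function(small, big, tol = 0.05, max_tries = 59) {
  target <- apply(small, 2, function(x) sum(x != 0) / length(x))
  best_err <- Inf
  best <- NULL
  for (i in seq_len(max_tries)) {
    rows <- sample(nrow(big), nrow(small), replace = TRUE)
    cand <- big[rows, , drop = FALSE]
    diff <- abs(apply(cand, 2, function(x) sum(x != 0) / length(x)) - target)
    if (all(diff <= tol)) return(cand)  # good enough: stop early
    if (sum(diff) < best_err) { best_err <- sum(diff); best <- cand }
  }
  best  # fall back to the closest candidate seen
}

set.seed(1234)
a <- matrix(as.integer(rexp(200, rate = .1)), ncol = 20)
b <- matrix(as.integer(rexp(2000, rate = .1)), ncol = 20)
ans <- downsample_match_zeros(a, b)
dim(ans)  # 10 20
```

The result always has the same dimensions as a, whether the loop stopped early or exhausted its tries.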
How to randomly pick a % of observations from a matrix in R?
You can use sample on a vector from seq_along, get the row and column using arrayInd, and cbind this with the corresponding values of the matrix.
i <- sample(seq_along(M), length(M) %/% 10)
cbind(arrayInd(i, dim(M)), M[i])
#cbind(arrayInd(i, dim(M), c("row", "column"), TRUE), value = M[i]) #Alternative with names
# [,1] [,2] [,3]
#[1,] 5 1 -0.72818419
#[2,] 9 1 1.14609041
#[3,] 2 2 0.01162598
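A reproducible sketch; M here is a made-up 10 x 3 matrix, so length(M) %/% 10 gives 3 sampled cells:

```r
set.seed(1)
M <- matrix(rnorm(30), nrow = 10)  # hypothetical matrix

i <- sample(seq_along(M), length(M) %/% 10)  # 3 random cell positions
res <- cbind(arrayInd(i, dim(M)), M[i])      # row index, column index, value
colnames(res) <- c("row", "col", "value")
nrow(res)  # 3
```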
R: Sample a matrix for cells close to a specified position
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities are higher for grids closer to the station: distances
# are scaled to lie in [0, 1) and subtracted from 1. The m=1 offset stops the
# maximum distance from scaling to exactly 1 (which would give it probability 0).
prob <- 1 - scale_prop(D, m=1); range(prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are your (**MY!**) 5 samples
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, you can rescale the probabilities. I think this is fine to do to save travelling costs, but the weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
Sampling columns in a matrix within R
When we are working with a matrix, the indexing should be row, column (similar to a data.frame; but in a data.frame, if we don't specify the comma, it extracts columns by default, because there the individual units are columns). In a matrix, the individual unit is an element, so without the row, column form it extracts the element at that position. To avoid that, use the comma. One more thing: when we select a single column/row, the dimensions are dropped, since drop = TRUE is the default (see ?Extract).
x[,sample(ncol(x), size = 1), drop = FALSE]
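A small sketch contrasting the default drop behaviour with drop = FALSE, using a made-up matrix:

```r
x <- matrix(1:12, nrow = 3)   # hypothetical 3 x 4 matrix

set.seed(5)
j <- sample(ncol(x), size = 1)      # one random column index
v <- x[, j]                         # dimensions dropped: a plain vector
m <- x[, j, drop = FALSE]           # stays a 3 x 1 matrix

is.matrix(v)  # FALSE
is.matrix(m)  # TRUE
```

Keeping the matrix class matters when downstream code expects two-dimensional input (e.g. anything calling dim or apply on the result).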