Random sampling - matrix
There is a very easy way to sample a matrix that works if you understand that R represents a matrix internally as a vector. This means you can use sample directly on your matrix. For example, let's assume you want to sample 10 points with replacement:
n <- 10
replace <- TRUE
Now just use sample on your matrix:
set.seed(1)
sample(dataset, n, replace=replace)
[1] 1 0 0 1 0 1 1 0 0 1
To demonstrate how this works, let's decompose it into two steps. Step 1 is to generate an index of sampling positions, and step 2 is to find those positions in your matrix:
set.seed(1)
mysample <- sample(length(dataset), n, replace=replace)
mysample
[1] 8 12 18 28 7 27 29 20 19 2
dataset[mysample]
[1] 1 0 0 1 0 1 1 0 0 1
And, hey presto, the results of the two methods are identical.
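The snippets above assume an existing 0/1 matrix called dataset. A minimal reproducible sketch (the matrix here is made up to stand in for dataset) confirming that the one-step and two-step approaches agree:

```r
set.seed(1)
# hypothetical 5 x 6 binary matrix standing in for `dataset`
dataset <- matrix(rbinom(30, size = 1, prob = 0.5), nrow = 5)

n <- 10
set.seed(2)
direct <- sample(dataset, n, replace = TRUE)        # one-step sampling
set.seed(2)
idx <- sample(length(dataset), n, replace = TRUE)   # two-step: positions first...
indirect <- dataset[idx]                            # ...then extract them

identical(direct, indirect)  # TRUE
```

Because sample draws the same internal index stream in both cases, resetting the seed makes the two results match exactly.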
Sampling with replacement a matrix element from a list in R
If we want to sample from each element of the list 'v', use sapply to loop over the list:
out <- sapply(v, function(x) sample(x, size = 11, replace = TRUE))
or, if you want to sample 11 list elements, sample from the sequence of the list and use that index to extract the list elements:
out <- v[sample(seq_along(v), 11, replace = TRUE)]
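A minimal sketch with a made-up list v showing both calls side by side:

```r
set.seed(42)
# hypothetical list of numeric vectors
v <- list(a = 1:5, b = 10:20, c = c(7, 7, 9))

# 11 draws (with replacement) from *within* each list element -> 11 x 3 matrix
out1 <- sapply(v, function(x) sample(x, size = 11, replace = TRUE))
dim(out1)  # 11 3

# 11 whole elements drawn (with replacement) from the list itself
out2 <- v[sample(seq_along(v), 11, replace = TRUE)]
length(out2)  # 11
```

The first call samples values inside each element; the second samples entire elements, so repeats of the same vector are possible.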
How do I select a sample of rows at random with repetition from a matrix in R?
Use sample on the rows with replace=TRUE or replace=FALSE.
If X is your original matrix, then
X[sample(nrow(X),size=5,replace=TRUE),]
or
X[sample(nrow(X),size=5,replace=FALSE),]
should work. (It may be more readable if you choose the sample first, s <- sample(...), and then subset: newmat <- X[s,].)
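A quick sketch with a small made-up matrix; note that drawing more rows than the matrix has only works with replace=TRUE:

```r
set.seed(7)
X <- matrix(1:20, nrow = 4)   # hypothetical 4 x 5 matrix

s <- sample(nrow(X), size = 5, replace = TRUE)  # row indices, repeats allowed
newmat <- X[s, ]
dim(newmat)  # 5 5
```

With replace=FALSE the size argument must not exceed nrow(X), or sample will throw an error.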
Loops with random sampling from a matrix and distance calculation
Not EXACTLY sure what you are looking for, but this code may help. It's not extremely fast: instead of using a while loop to stop after hitting your total_capacity, it does a cumsum on the full node list and finds the place where you exceed 120.
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
And then I prefer using a function for the distance; here A and B are two separate coordinate vectors of the form c(x, y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
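As a quick sanity check of the distance helper, the classic 3-4-5 right triangle:

```r
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances

euc.dist(A = c(0, 0), B = c(3, 4))  # 5
```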
The Loop:
for(i in 1:iterations){
# random hub selection
hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
for(h in 1:hubs_required){
# sample the nodes into a random order
random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE),]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
min(rowSums(track_sums))
EDIT: as a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
# initialize results matrices
track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
colnames(node_count) <- paste0("nodes_at_hub",1:hubs_required)
# user defined distance function (only exists within the hubnode() function)
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
for(i in 1:iterations){
# random hub selection
assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
hubs <- nodes[assigned_hubs,]
assigned_nodes <- NULL
for(h in 1:hubs_required){
# sample the nodes into a random order
assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
random_nodes <- nodes[assigned_nodes,]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
# if there are none
if(is.na(last)) last = nrow(random_nodes)
node_count[i,h] <- last
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count),
rowSums(track_sums), xlab = "Node Count", ylab = "Total Distance", main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))
R - Randomly sample from a matrix using a distribution to denote the number of zeros in each column - stratified sampling
The problem was to down-sample a large matrix to the size of a smaller matrix while also ensuring that the proportion of zeros for each variable in the down-sampled matrix is equal to that of the smaller matrix.
So after days of research I have found a method to solve the problem without using any predefined function to do the heavy lifting. As it turned out, this was a problem of permutations and, as I painfully figured out, as the number of rows in the matrix I needed to sample from (b) and the number of rows I needed to down-sample to (the size of a) increased, the problem became computationally intractable. For example, if the matrix I'm sampling from contains 200 rows and the target size is 20, this gives n!/(n-r)! permutations, or in R:
> factorial(200)/factorial(200-20)
[1] NaN
This explosion in the number of permutations eats computational power and memory and has been well covered on Stack Overflow. So, in the interests of time and memory, I obviously couldn't check every permutation. I decided to do two things to get around this: firstly, I would only check enough permutations to give me a 95% chance of getting one in the top 5% of solutions (top here meaning the best approximation of a's zero distribution); secondly, I would stop early if I found a solution where every variable's zero proportion was within 0.05 of a's. The code below walks through the solution.
First, let's create the matrix to sample from (b) and the matrix whose size and zero distribution we want to match (a):
set.seed(1234)
# matrix a is the matrix that holds the distribution of zeros I want to match
a <- matrix(as.integer(rexp(200, rate=.1)), ncol=20)
# matrix b is the matrix to be sampled from
b <- matrix(as.integer(rexp(2000, rate=.1)), ncol=20)
Next I work out the distribution of zeros I am trying to replicate
zero_dist_to_replicate <- apply(a, 2, function(x) sum(x != 0)/length(x))
> zero_dist_to_replicate
[1] 0.8 0.8 0.9 0.6 0.7 0.9 1.0 0.8 1.0 1.0 0.8 0.9 1.0 1.0 0.9 0.9 0.9 0.9 1.0 0.9
Next I am creating variables to keep a count of permutations checked and their errors
perms_used <- list()
error <- vector()
answer <- matrix()
To work out how many random samples from b I need to take to get a permutation in the top 5%, 95% of the time, I use
ceiling(log(1-0.95)/log(1-0.05))
[1] 59
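The 59 comes from a standard argument: if each random permutation independently lands in the top p = 5% with probability p, then k draws all miss with probability (1 - p)^k, so we need the smallest k with (1 - p)^k <= 1 - 0.95. A small sketch generalizing the calculation (the helper name is mine):

```r
# smallest k such that (1 - top_frac)^k <= 1 - conf
n_random_trials <- function(conf = 0.95, top_frac = 0.05)
  ceiling(log(1 - conf) / log(1 - top_frac))

n_random_trials(0.95, 0.05)  # 59
n_random_trials(0.99, 0.05)  # 90
```

Demanding 99% confidence rather than 95% raises the number of required trials from 59 to 90.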
Now I run a while loop that randomly samples from b and checks whether my second condition from above is met. If it isn't, I store the permutation and its associated error and continue until I either find one that meets the condition or have tried 59 permutations. If I reach 59, I return the one whose zero distribution is closest to a's.
n <- nrow(b)        # rows available to sample from
num_vars <- nrow(a) # rows required in the down-sampled matrix
counter <- 1
while(counter < 59){
perm <- NULL
#Keep picking random permutations until you find one that hasn't been checked before
while(is.null(perm) || list(perm) %in% perms_used){
#sample row indices of b (sample() never returns 0, so no offset is needed)
perm <- sample(n, num_vars, replace = TRUE)
}
subsample_set <- b[perm,]
#check distribution of zeros of this permutation
subsample_set_dist <- apply(subsample_set,2, function(x) sum(x!=0)/length(x))
#if the permuted subsample's distribution of zeros is within .05
#for each variable of other matrix end early
diff <- abs(subsample_set_dist-zero_dist_to_replicate)
if(all(diff <= 0.05)){
answer <- subsample_set
break
}
#getting the sum of the error across all variables
error[counter]<-sum(diff)
perms_used[[counter]]<-perm
counter = counter+1
}
if(all(is.na(answer))){
#return the permutation with the minimum error
best_subsample <- perms_used[which(error == min(error))]
answer <- b[best_subsample[[1]], ]
}
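The search above can also be packaged as a self-contained function; the name and signature here are mine, not from the original answer, but the logic is the same (random row draws, early stop on tolerance, fall back to the best candidate seen):

```r
# Down-sample rows of `big` so the per-column proportion of non-zero entries
# approximates that of `small`; returns the sampled rows of `big`.
downsample_match_zeros <- function(small, big, tol = 0.05, max_tries = 59) {
  target <- apply(small, 2, function(x) sum(x != 0) / length(x))
  best_err <- Inf
  best <- NULL
  for (i in seq_len(max_tries)) {
    rows <- sample(nrow(big), nrow(small), replace = TRUE)
    cand <- big[rows, , drop = FALSE]
    diff <- abs(apply(cand, 2, function(x) sum(x != 0) / length(x)) - target)
    if (all(diff <= tol)) return(cand)  # good enough: stop early
    if (sum(diff) < best_err) { best_err <- sum(diff); best <- cand }
  }
  best  # fall back to the closest candidate seen
}

set.seed(1234)
a <- matrix(as.integer(rexp(200, rate = .1)), ncol = 20)
b <- matrix(as.integer(rexp(2000, rate = .1)), ncol = 20)
ans <- downsample_match_zeros(a, b)
dim(ans)  # 10 20
```

The result always has the same dimensions as a, whether the loop stopped early or exhausted its tries.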
How to randomly pick a % of observations from a matrix in R?
You can use sample on a vector from seq_along, get the row and column using arrayInd, and cbind this with the corresponding values of the matrix.
i <- sample(seq_along(M), length(M) %/% 10)
cbind(arrayInd(i, dim(M)), M[i])
#cbind(arrayInd(i, dim(M), c("row", "column"), TRUE), value = M[i]) #Alternative with names
# [,1] [,2] [,3]
#[1,] 5 1 -0.72818419
#[2,] 9 1 1.14609041
#[3,] 2 2 0.01162598
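A reproducible sketch; M here is a made-up 10 x 3 matrix, so length(M) %/% 10 gives 3 sampled cells:

```r
set.seed(1)
M <- matrix(rnorm(30), nrow = 10)  # hypothetical matrix

i <- sample(seq_along(M), length(M) %/% 10)  # 3 random cell positions
res <- cbind(arrayInd(i, dim(M)), M[i])      # row index, column index, value
colnames(res) <- c("row", "col", "value")
nrow(res)  # 3
```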
R: Sample a matrix for cells close to a specified position
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities are higher for grids closer to the station: distances
# are scaled to lie in [0, 1) and subtracted from 1. The m=1 offset stops the
# maximum distance from scaling to exactly 1 (which would give it probability 0).
prob <- 1 - scale_prop(D, m=1); range(prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are your (**MY!**) 5 samples
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, you can rescale the probabilities. I think this is fine to do to save travelling costs, but the weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
Sampling columns in a matrix within R
When we are working with a matrix, the indexing should be row, column (similar to a data.frame; but in a data.frame, if we don't specify the comma, it extracts columns by default, because there the individual units are columns). In a matrix, the individual unit is an element, so without the row, column form it extracts the element at that position. To avoid that, use the comma. One more thing: when we select a single column/row, the dimensions are dropped, since drop = TRUE is the default (see ?Extract).
x[,sample(ncol(x), size = 1), drop = FALSE]
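A small sketch contrasting the default drop behaviour with drop = FALSE, using a made-up matrix:

```r
x <- matrix(1:12, nrow = 3)   # hypothetical 3 x 4 matrix

set.seed(5)
j <- sample(ncol(x), size = 1)      # one random column index
v <- x[, j]                         # dimensions dropped: a plain vector
m <- x[, j, drop = FALSE]           # stays a 3 x 1 matrix

is.matrix(v)  # FALSE
is.matrix(m)  # TRUE
```

Keeping the matrix class matters when downstream code expects two-dimensional input (e.g. anything calling dim or apply on the result).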