Randomly Sample Groups

Randomly sample groups

Just use sample() to choose some number of groups:

library(dplyr)
iris %>% filter(Species %in% sample(levels(Species), 2))
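
This works because Species is a factor; for a character or numeric grouping column, the same idea works with unique() in place of levels(). A minimal sketch using the built-in mtcars data:

library(dplyr)
# keep all rows belonging to 2 randomly chosen values of cyl
mtcars %>% filter(cyl %in% sample(unique(cyl), 2))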

Take random sample by group

Try this:

library(plyr)
# sample 500 rows within each ID; errors if any ID has fewer than 500 rows
ddply(df, .(ID), function(x) x[sample(nrow(x), 500), ])
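
Note that plyr is retired; with dplyr the rough equivalent is slice_sample(), which (unlike the ddply version above) silently returns the whole group when it has fewer than 500 rows:

library(dplyr)
df %>%
  group_by(ID) %>%
  slice_sample(n = 500) %>%
  ungroup()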

Randomly sample groups of rows linked by unique column values with the same # of observations

Should be pretty straightforward. The strategy I’ve used here is to get the unique IDs for sampling, then subset based on that. Below is a base R solution.

I’ve expanded the number of IDs so you can see the sampling a little better.

df <- data.frame(
  ID = rep(1:10, each = 6),
  status = rep(c("present", rep("absent", 5)), 10),
  Cov1 = "numbers",
  Cov2 = "numbers"
)

head(df)
#>   ID  status    Cov1    Cov2
#> 1  1 present numbers numbers
#> 2  1  absent numbers numbers
#> 3  1  absent numbers numbers
#> 4  1  absent numbers numbers
#> 5  1  absent numbers numbers
#> 6  1  absent numbers numbers
ids <- unique(df$ID)
sample.size <- floor(0.6 * length(ids))

set.seed(1080)
train.1.ids <- sample(ids, sample.size)
train.1.ids
#> [1] 1 8 3 4 7 5
train.1 <- df[df$ID %in% train.1.ids, ]
tail(train.1)
#>    ID  status    Cov1    Cov2
#> 43  8 present numbers numbers
#> 44  8  absent numbers numbers
#> 45  8  absent numbers numbers
#> 46  8  absent numbers numbers
#> 47  8  absent numbers numbers
#> 48  8  absent numbers numbers

set.seed(1081)
train.2.ids <- sample(ids, sample.size)
train.2.ids
#> [1] 10 7 4 8 1 3
train.2 <- df[df$ID %in% train.2.ids, ]
tail(train.2)
#>    ID  status    Cov1    Cov2
#> 55 10 present numbers numbers
#> 56 10  absent numbers numbers
#> 57 10  absent numbers numbers
#> 58 10  absent numbers numbers
#> 59 10  absent numbers numbers
#> 60 10  absent numbers numbers

You could also turn this into a function:

get_sample <- function(df, prop = 0.6, seed){
  ids <- sort(unique(df$ID))  # sort(), not order(): order() would return positions rather than IDs
  sample.size <- floor(prop * length(ids))

  set.seed(seed)
  train.1.ids <- sample(ids, sample.size)
  ## just to illustrate the IDs sampled
  print(train.1.ids)
  train.1 <- df[df$ID %in% train.1.ids, ]

  return(train.1)
}

train.3 <- get_sample(df, prop = 0.6, seed = 1)
#> [1] 9 4 7 1 2 5
train.4 <- get_sample(df, prop = 0.4, seed = 2)
#> [1] 5 6 9 1

Created on 2022-01-13 by the reprex package (v2.0.1)
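
Note that samples drawn with different seeds can overlap, as train.1.ids and train.2.ids do above. If you instead need disjoint train/test ID sets, one option is to shuffle the IDs once and split the result:

set.seed(1080)
shuffled <- sample(ids)               # random permutation of the IDs
train.ids <- shuffled[1:sample.size]  # first 60% of the shuffled IDs
test.ids <- setdiff(ids, train.ids)   # the remaining 40%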

How do I randomly sample a set of groups in a Python Data Frame, keeping all records of each selected group, but WITHOUT merging?

If I understand correctly, you can take the unique values of the specific index level, draw a random sample of them, and then use .loc[] directly:

import numpy as np

# unique values of the 'PT' index level
arr = mydf.index.get_level_values(level='PT').unique()
n = 0.3  # fraction of groups to keep
choice = np.random.choice(arr, round(len(arr)*n), replace=False)
output = mydf.loc[choice]

Sample Output:

              measurement
PT Encounter
B  1                   48
   2                    1
   3                   19
   4                   36
   5                   25
D  1                   33
   2                    2
   3                   10
   4                   33
   5                   32

Randomly sampling groups, followed by sampling within these sampled groups

I think this is what you're looking for. Let's start with your data in a reproducible format:

data1 <- structure(list(id = structure(1:14, .Label = c("1001", "1002", 
"1003", "3002", "3003", "3005", "3006", "3007", "4001", "4002",
"5006", "5007", "5009", "5010"), class = "factor"), group = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), y = structure(c(1L, 4L, 8L,
7L, 4L, 10L, 9L, 2L, 3L, 4L, 11L, 12L, 6L, 5L), .Label = c("10",
"11", "12", "15", "19", "22", "24", "3", "32", "37", "7", "9"
), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))

And just to make sure:

data1
#>      id group  y
#> 1  1001     1 10
#> 2  1002     1 15
#> 3  1003     1  3
#> 4  3002     2 24
#> 5  3003     2 15
#> 6  3005     2 37
#> 7  3006     2 32
#> 8  3007     2 11
#> 9  4001     3 12
#> 10 4002     3 15
#> 11 5006     4  7
#> 12 5007     4  9
#> 13 5009     4 22
#> 14 5010     4 19

We start by splitting the data frame by group into smaller data frames, using the split function. This gives us a list with four data frames, each one containing all the members of its respective group. (The set.seed is there purely to make this example reproducible).

set.seed(69)
split_dfs <- split(data1, data1$group)

Now we can sample this list, giving us a new list of four data frames drawn with replacement from split_dfs. Each one will again contain all the members of its respective group, though of course some whole groups might be sampled more than once, and other whole groups not sampled at all.

sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]

Now we can sample within each group by sampling with replacement from the rows of each data frame in our new list. We do this for every data frame in the list using lapply:

all_sampled <- lapply(sampled_group_dfs, function(x) x[sample(nrow(x), replace = TRUE), ])

All that remains is to bind the resulting data frames in this list back together to get our result:

result <- do.call(rbind, all_sampled)

As you can see from the final result, it just so happens that each of the four groups was sampled exactly once (this is pure chance; alter set.seed to get different results). Within the groups, however, there have clearly been some duplicates drawn. In fact, since R mandates unique row names in a data frame, these are easy to pick out by the .1 (or .2) appended to the duplicated row names. If you don't like this, you can reset the row names with rownames(result) <- seq(nrow(result)).

result
#>          id group  y
#> 4.14   5010     4 19
#> 4.14.1 5010     4 19
#> 4.11   5006     4  7
#> 4.13   5009     4 22
#> 1.3    1003     1  3
#> 1.3.1  1003     1  3
#> 1.2    1002     1 15
#> 3.9    4001     3 12
#> 3.9.1  4001     3 12
#> 2.5    3003     2 15
#> 2.5.1  3003     2 15
#> 2.6    3005     2 37
#> 2.7    3006     2 32
#> 2.5.2  3003     2 15

Created on 2020-02-15 by the reprex package (v0.3.0)
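
If you need to repeat this two-stage resampling many times (it is essentially a cluster bootstrap), the steps above can be wrapped in a helper. A minimal sketch, assuming the grouping column is named group; the function name is just for illustration:

cluster_resample <- function(data, group_col = "group") {
  # one data frame per group
  split_dfs <- split(data, data[[group_col]])
  # stage 1: sample whole groups with replacement
  sampled <- split_dfs[sample(length(split_dfs), replace = TRUE)]
  # stage 2: sample rows with replacement within each sampled group
  resampled <- lapply(sampled, function(x) x[sample(nrow(x), replace = TRUE), ])
  out <- do.call(rbind, resampled)
  rownames(out) <- NULL  # drop the .1/.2 row-name suffixes
  out
}

result <- cluster_resample(data1)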

Random sampling of groups after pandas groupby

Using iterrows from pandas, you can iterate over the DataFrame rows as (index, Series) pairs and get what you want:

# one row per (Nationality, Sex) combination
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()

for _, row in new_df.iterrows():
    # draw 20 random rows from each group
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))

Random Sample by Group and Specific Probability Distribution for Groups

I believe this might work. Join the frequency tibble to your date tibble. After filtering for the month of interest, compute a revised frequency: the frequency for each day of the week, divided by the number of times that day of the week appears in that month. Finally, use slice_sample() with this new frequency as weight_by (the weights here add up to 1, though they would otherwise be standardized to sum to 1 anyway).

library(tidyverse)

set.seed(123)

dt2021 %>%
  filter(month == 1) %>%
  left_join(dtWkDays) %>%
  group_by(wkDay) %>%
  mutate(newFreq = Freq / n()) %>%  # adjust for how often each weekday occurs in the month
  ungroup() %>%
  slice_sample(n = 1000, weight_by = newFreq, replace = TRUE) %>%
  count(wkDay)

Output

  wkDay         n
  <chr>     <int>
1 Friday      312
2 Monday       81
3 Saturday    320
4 Sunday       10
5 Thursday    120
6 Tuesday      62
7 Wednesday    95
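
Since dt2021 and dtWkDays come from the question and aren't shown above, here is a self-contained toy illustration of weight_by; the data is made up purely for illustration:

library(dplyr)

toy <- data.frame(
  day = c("Sat", "Sun", "Mon"),
  freq = c(0.6, 0.3, 0.1)
)

set.seed(1)
toy %>%
  slice_sample(n = 1000, weight_by = freq, replace = TRUE) %>%
  count(day)  # counts land in roughly a 6:3:1 ratio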

How can I sample removing some groups randomly and some individuals within group randomly?

You can e.g. do:

using DataFrames, StatsBase

# gb is assumed to be a GroupedDataFrame produced by groupby()
gb2 = gb[sample(1:length(gb), 2, replace=false)]  # sample 2 groups
combine(gb2, sdf -> sdf[sample(1:nrow(sdf), 2, replace=false), :])  # sample 2 observations per group
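
For comparison, a rough dplyr translation of the same two-stage idea (sample 2 groups, then 2 rows within each sampled group), assuming a data frame df with grouping column g:

library(dplyr)

df %>%
  filter(g %in% sample(unique(g), 2)) %>%  # keep 2 randomly chosen groups
  group_by(g) %>%
  slice_sample(n = 2) %>%                  # 2 random rows per kept group
  ungroup()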

Random sample a vector multiple times to make groups and conduct ANOVA

Since you are repeating the random sampling, you should start by making a function that does what you want:

SimAnova <- function() {
  Groups <- rep(LETTERS[1:10], each = 5)
  Values <- rnorm(50, 3.47, 0.0189)
  AnovaResults <- anova(lm(Values ~ Groups))
  F <- AnovaResults[1, 4]            # observed F statistic
  df <- AnovaResults[, 1]            # numerator and denominator degrees of freedom
  Crit <- qf(1 - .05, df[1], df[2])  # critical F at alpha = 0.05
  P <- AnovaResults[1, 5]
  c("F-Value" = F, "Critical F-Value" = Crit, "P-Value" = P)
}
SimAnova()
#          F-Value Critical F-Value          P-Value
#        1.7350592        2.1240293        0.1126789
SimAnova()
#          F-Value Critical F-Value          P-Value
#       2.04024282       2.12402926       0.05965209
SimAnova()
#          F-Value Critical F-Value          P-Value
#         1.635386         2.124029         0.138158

Now just repeat it 1000 times:

result <- t(replicate(1000, SimAnova()))
head(result)
#        F-Value Critical F-Value   P-Value
# [1,] 0.5659946         2.124029 0.8164247
# [2,] 0.7717596         2.124029 0.6427732
# [3,] 0.8377358         2.124029 0.5862101
# [4,] 1.6284143         2.124029 0.1401280
# [5,] 0.2191311         2.124029 0.9899751
# [6,] 0.2744286         2.124029 0.9780476

Notice that you don't really need to save the Critical F-Value because it is the same for every sample.
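
A natural follow-up, assuming the point of the simulation is to check the test's behaviour under the null (all groups share the same mean), is the empirical rejection rate:

# proportion of the 1000 simulated ANOVAs significant at alpha = 0.05;
# since the null hypothesis is true by construction, this should be close to 0.05
mean(result[, "P-Value"] < 0.05)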


