Randomly Sample Groups

Randomly sample groups

Just use sample() to choose some number of groups:

library(dplyr)
iris %>% filter(Species %in% sample(levels(Species), 2))
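
This works because Species is a factor; for a character or numeric grouping column, the same idea works with unique() in place of levels(). A minimal sketch using the built-in mtcars data:

library(dplyr)
# keep all rows belonging to 2 randomly chosen values of cyl
mtcars %>% filter(cyl %in% sample(unique(cyl), 2))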

Take random sample by group

Try this:

library(plyr)
# sample 500 rows within each ID; errors if any ID has fewer than 500 rows
ddply(df, .(ID), function(x) x[sample(nrow(x), 500), ])
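
Note that plyr is retired; with dplyr the rough equivalent is slice_sample(), which (unlike the ddply version above) silently returns the whole group when it has fewer than 500 rows:

library(dplyr)
df %>%
  group_by(ID) %>%
  slice_sample(n = 500) %>%
  ungroup()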

Randomly sample groups of rows linked by unique column values with the same # of observations

Should be pretty straightforward. The strategy I’ve used here is to get the unique IDs for sampling, then subset based on that. Below is a base R solution.

I’ve expanded the number of IDs so you can see the sampling a little better.

df <- data.frame(
  ID = rep(1:10, each = 6),
  status = rep(c("present", rep("absent", 5)), 10),
  Cov1 = "numbers",
  Cov2 = "numbers"
)

head(df)
#>   ID  status    Cov1    Cov2
#> 1  1 present numbers numbers
#> 2  1  absent numbers numbers
#> 3  1  absent numbers numbers
#> 4  1  absent numbers numbers
#> 5  1  absent numbers numbers
#> 6  1  absent numbers numbers
ids <- unique(df$ID)
sample.size <- floor(0.6 * length(ids))

set.seed(1080)
train.1.ids <- sample(ids, sample.size)
train.1.ids
#> [1] 1 8 3 4 7 5
train.1 <- df[df$ID %in% train.1.ids, ]
tail(train.1)
#>    ID  status    Cov1    Cov2
#> 43  8 present numbers numbers
#> 44  8  absent numbers numbers
#> 45  8  absent numbers numbers
#> 46  8  absent numbers numbers
#> 47  8  absent numbers numbers
#> 48  8  absent numbers numbers

set.seed(1081)
train.2.ids <- sample(ids, sample.size)
train.2.ids
#> [1] 10 7 4 8 1 3
train.2 <- df[df$ID %in% train.2.ids, ]
tail(train.2)
#>    ID  status    Cov1    Cov2
#> 55 10 present numbers numbers
#> 56 10  absent numbers numbers
#> 57 10  absent numbers numbers
#> 58 10  absent numbers numbers
#> 59 10  absent numbers numbers
#> 60 10  absent numbers numbers

You could also turn this into a function:

get_sample <- function(df, prop = 0.6, seed){
  ids <- sort(unique(df$ID))  # sort(), not order(): order() would return positions rather than IDs
  sample.size <- floor(prop * length(ids))

  set.seed(seed)
  train.1.ids <- sample(ids, sample.size)
  ## just to illustrate the IDs sampled
  print(train.1.ids)
  train.1 <- df[df$ID %in% train.1.ids, ]

  return(train.1)
}

train.3 <- get_sample(df, prop = 0.6, seed = 1)
#> [1] 9 4 7 1 2 5
train.4 <- get_sample(df, prop = 0.4, seed = 2)
#> [1] 5 6 9 1

Created on 2022-01-13 by the reprex package (v2.0.1)
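
Note that samples drawn with different seeds can overlap, as train.1.ids and train.2.ids do above. If you instead need disjoint train/test ID sets, one option is to shuffle the IDs once and split the result:

set.seed(1080)
shuffled <- sample(ids)               # random permutation of the IDs
train.ids <- shuffled[1:sample.size]  # first 60% of the shuffled IDs
test.ids <- setdiff(ids, train.ids)   # the remaining 40%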

How do I randomly sample a set of groups in a Python Data Frame, keeping all records of each selected group, but WITHOUT merging?

If I understand correctly, you can take the unique values of the specific index level, draw a random sample of them, and then use .loc[] directly:

import numpy as np

# unique values of the 'PT' index level
arr = mydf.index.get_level_values(level='PT').unique()
n = 0.3  # fraction of groups to keep
choice = np.random.choice(arr, round(len(arr)*n), replace=False)
output = mydf.loc[choice]

Sample Output:

              measurement
PT Encounter
B  1                   48
   2                    1
   3                   19
   4                   36
   5                   25
D  1                   33
   2                    2
   3                   10
   4                   33
   5                   32

Randomly sampling groups, followed by sampling within these sampled groups

I think this is what you're looking for. Let's start with your data in a reproducible format:

data1 <- structure(list(id = structure(1:14, .Label = c("1001", "1002", 
"1003", "3002", "3003", "3005", "3006", "3007", "4001", "4002",
"5006", "5007", "5009", "5010"), class = "factor"), group = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), y = structure(c(1L, 4L, 8L,
7L, 4L, 10L, 9L, 2L, 3L, 4L, 11L, 12L, 6L, 5L), .Label = c("10",
"11", "12", "15", "19", "22", "24", "3", "32", "37", "7", "9"
), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))

And just to make sure:

data1
#>      id group  y
#> 1  1001     1 10
#> 2  1002     1 15
#> 3  1003     1  3
#> 4  3002     2 24
#> 5  3003     2 15
#> 6  3005     2 37
#> 7  3006     2 32
#> 8  3007     2 11
#> 9  4001     3 12
#> 10 4002     3 15
#> 11 5006     4  7
#> 12 5007     4  9
#> 13 5009     4 22
#> 14 5010     4 19

We start by splitting the data frame by group into smaller data frames, using the split function. This gives us a list with four data frames, each one containing all the members of its respective group. (The set.seed is there purely to make this example reproducible).

set.seed(69)
split_dfs <- split(data1, data1$group)

Now we can sample this list, giving us a new list of four data frames drawn with replacement from split_dfs. Each one will again contain all the members of its respective group, though of course some whole groups might be sampled more than once, and other whole groups not sampled at all.

sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]

Now we can sample within each group by sampling with replacement from the rows of each data frame in our new list. We do this for every data frame in the list using lapply:

all_sampled <- lapply(sampled_group_dfs, function(x) x[sample(nrow(x), replace = TRUE), ])

All that remains is to bind the resulting data frames in this list back together to get our result:

result <- do.call(rbind, all_sampled)

As you can see from the final result, it just so happens that each of the four groups was sampled exactly once (this is pure chance; alter set.seed to get different results). Within the groups, however, there have clearly been some duplicates drawn. In fact, since R mandates unique row names in a data frame, these are easy to pick out by the .1 (or .2) appended to the duplicated row names. If you don't like this, you can reset the row names with rownames(result) <- seq(nrow(result)).

result
#>          id group  y
#> 4.14   5010     4 19
#> 4.14.1 5010     4 19
#> 4.11   5006     4  7
#> 4.13   5009     4 22
#> 1.3    1003     1  3
#> 1.3.1  1003     1  3
#> 1.2    1002     1 15
#> 3.9    4001     3 12
#> 3.9.1  4001     3 12
#> 2.5    3003     2 15
#> 2.5.1  3003     2 15
#> 2.6    3005     2 37
#> 2.7    3006     2 32
#> 2.5.2  3003     2 15

Created on 2020-02-15 by the reprex package (v0.3.0)
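
If you need to repeat this two-stage resampling many times (it is essentially a cluster bootstrap), the steps above can be wrapped in a helper. A minimal sketch, assuming the grouping column is named group; the function name is just for illustration:

cluster_resample <- function(data, group_col = "group") {
  # one data frame per group
  split_dfs <- split(data, data[[group_col]])
  # stage 1: sample whole groups with replacement
  sampled <- split_dfs[sample(length(split_dfs), replace = TRUE)]
  # stage 2: sample rows with replacement within each sampled group
  resampled <- lapply(sampled, function(x) x[sample(nrow(x), replace = TRUE), ])
  out <- do.call(rbind, resampled)
  rownames(out) <- NULL  # drop the .1/.2 row-name suffixes
  out
}

result <- cluster_resample(data1)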

Random sampling of groups after pandas groupby

Using iterrows from pandas, you can iterate over the DataFrame rows as (index, Series) pairs and get what you want:

# one row per (Nationality, Sex) combination
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()

for _, row in new_df.iterrows():
    # draw 20 random rows from each group
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))

Random Sample by Group and Specific Probability Distribution for Groups

I believe this might work. Join the frequency tibble to your date tibble. After filtering for the month of interest, compute a revised frequency: the frequency for each day of the week, divided by the number of times that day of the week appears in that month. Finally, use slice_sample() with this new frequency as weight_by (the weights here add up to 1, though they would otherwise be standardized to sum to 1 anyway).

library(tidyverse)

set.seed(123)

dt2021 %>%
  filter(month == 1) %>%
  left_join(dtWkDays) %>%
  group_by(wkDay) %>%
  mutate(newFreq = Freq / n()) %>%  # adjust for how often each weekday occurs in the month
  ungroup() %>%
  slice_sample(n = 1000, weight_by = newFreq, replace = TRUE) %>%
  count(wkDay)

Output

  wkDay         n
  <chr>     <int>
1 Friday      312
2 Monday       81
3 Saturday    320
4 Sunday       10
5 Thursday    120
6 Tuesday      62
7 Wednesday    95
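
Since dt2021 and dtWkDays come from the question and aren't shown above, here is a self-contained toy illustration of weight_by; the data is made up purely for illustration:

library(dplyr)

toy <- data.frame(
  day = c("Sat", "Sun", "Mon"),
  freq = c(0.6, 0.3, 0.1)
)

set.seed(1)
toy %>%
  slice_sample(n = 1000, weight_by = freq, replace = TRUE) %>%
  count(day)  # counts land in roughly a 6:3:1 ratio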

How can I sample removing some groups randomly and some individuals within group randomly?

You can e.g. do:

using DataFrames, StatsBase

# gb is assumed to be a GroupedDataFrame produced by groupby()
gb2 = gb[sample(1:length(gb), 2, replace=false)]  # sample 2 groups
combine(gb2, sdf -> sdf[sample(1:nrow(sdf), 2, replace=false), :])  # sample 2 observations per group
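
For comparison, a rough dplyr translation of the same two-stage idea (sample 2 groups, then 2 rows within each sampled group), assuming a data frame df with grouping column g:

library(dplyr)

df %>%
  filter(g %in% sample(unique(g), 2)) %>%  # keep 2 randomly chosen groups
  group_by(g) %>%
  slice_sample(n = 2) %>%                  # 2 random rows per kept group
  ungroup()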

Random sample a vector multiple times to make groups and conduct ANOVA

Since you are repeating the random sampling, you should start by making a function that does what you want:

SimAnova <- function() {
  Groups <- rep(LETTERS[1:10], each = 5)
  Values <- rnorm(50, 3.47, 0.0189)
  AnovaResults <- anova(lm(Values ~ Groups))
  F <- AnovaResults[1, 4]            # observed F statistic
  df <- AnovaResults[, 1]            # numerator and denominator degrees of freedom
  Crit <- qf(1 - .05, df[1], df[2])  # critical F at alpha = 0.05
  P <- AnovaResults[1, 5]
  c("F-Value" = F, "Critical F-Value" = Crit, "P-Value" = P)
}
SimAnova()
#          F-Value Critical F-Value          P-Value
#        1.7350592        2.1240293        0.1126789
SimAnova()
#          F-Value Critical F-Value          P-Value
#       2.04024282       2.12402926       0.05965209
SimAnova()
#          F-Value Critical F-Value          P-Value
#         1.635386         2.124029         0.138158

Now just repeat it 1000 times:

result <- t(replicate(1000, SimAnova()))
head(result)
#        F-Value Critical F-Value   P-Value
# [1,] 0.5659946         2.124029 0.8164247
# [2,] 0.7717596         2.124029 0.6427732
# [3,] 0.8377358         2.124029 0.5862101
# [4,] 1.6284143         2.124029 0.1401280
# [5,] 0.2191311         2.124029 0.9899751
# [6,] 0.2744286         2.124029 0.9780476

Notice that you don't really need to save the Critical F-Value because it is the same for every sample.
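
A natural follow-up, assuming the point of the simulation is to check the test's behaviour under the null (all groups share the same mean), is the empirical rejection rate:

# proportion of the 1000 simulated ANOVAs significant at alpha = 0.05;
# since the null hypothesis is true by construction, this should be close to 0.05
mean(result[, "P-Value"] < 0.05)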


