Randomly Sample Data Frame into 3 Groups in R

Do you need the partitioning to be exact? If not,

set.seed(7)
ss <- sample(1:3,size=nrow(mtcars),replace=TRUE,prob=c(0.6,0.2,0.2))
train <- mtcars[ss==1,]
test <- mtcars[ss==2,]
cvr <- mtcars[ss==3,]

should do it.

Or, as @Frank says in comments, you can split() the original data to keep them as elements of a list:

mycars <- setNames(split(mtcars,ss), c("train","test","cvr"))
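
If you do need the partition to be exact, one option is to build a label vector with the desired counts and shuffle it. A minimal sketch, assuming a 60/20/20 split of mtcars where the last group absorbs the rounding remainder (the names sizes, labels and parts are just for illustration):

```r
set.seed(7)
n <- nrow(mtcars)                       # 32 rows

# Fixed group sizes; the last group takes whatever rounding leaves over
sizes <- c(train = round(0.6 * n), test = round(0.2 * n))
sizes["cvr"] <- n - sum(sizes)

# One label per row in exactly those counts, then shuffle the labels
labels <- sample(rep(names(sizes), times = sizes))
parts <- split(mtcars, factor(labels, levels = names(sizes)))

sapply(parts, nrow)
# train  test   cvr
#    19     6     7
```

Unlike the prob= approach, the group sizes here are the same on every run; only the assignment of rows to groups changes with the seed.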

How to randomly split data into three equal sizes?

IMO it should be sufficient to just assign random project names: rep() builds a balanced vector of labels and sample() shuffles it.

dat$ProjectName <- sample(factor(rep(1:3, length.out=nrow(dat)), 
labels=paste0("Project", 1:3)))

Result

head(dat)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ProjectName
# 1 1 1 0 1 1 1 1 0 1 0 Project1
# 2 1 1 1 1 1 1 0 0 1 0 Project1
# 3 0 0 1 1 0 0 0 1 1 1 Project1
# 4 1 1 1 0 1 0 1 1 0 1 Project3
# 5 1 0 0 1 1 1 1 0 0 1 Project1
# 6 1 0 0 0 0 1 0 1 1 1 Project3

table(dat$ProjectName)
# Project1 Project2 Project3
# 3186 3186 3186

Data

set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, rep=TRUE)))
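
When the number of rows is not divisible by three, the same one-liner still works; the group sizes then differ by at most one. A small sketch on a made-up 10-row data frame (small is a hypothetical name, not part of the question):

```r
set.seed(42)
small <- data.frame(x = 1:10)

# rep(..., length.out = 10) gives 1,2,3,1,2,3,1,2,3,1: four 1s, three 2s, three 3s
groups <- rep(1:3, length.out = nrow(small))

# Shuffling the labels changes who gets which label, not how many of each exist
small$ProjectName <- sample(factor(groups, labels = paste0("Project", 1:3)))

table(small$ProjectName)
# Project1 Project2 Project3
#        4        3        3
```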

How can I randomly sample a subgroup with multiple rows from within a larger group?

Like so perhaps:


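library(dplyr)

set.seed( 100 )
df %>% group_by( ID, Group ) %>%
  sample_n(1) %>%                               # one random row per ID/Group
  select( -Score ) %>%                          # keep the keys and the sampled Color
  left_join( df, by=c("ID","Group","Color") )   # bring back every row with that Color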

Think I misunderstood you at first, but this sounds like it could be it.

Output:


ID Group Color Score
1 Bravo 1 yellow 0.65
2 Bravo 1 yellow 0.70
3 Bravo 1 yellow 0.90
4 Charlie 1 red 0.55
5 Charlie 2 red 0.60
6 Charlie 3 red 0.80
7 Charlie 4 red 0.90
8 Delta 1 red 0.85
9 Delta 2 red 0.63
10 Delta 2 red 0.51
11 Echo 1 yellow 0.85
12 Echo 1 yellow 0.89
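
For comparison, the same idea can be written without the join. This is only a base R sketch on made-up data (the rows below are hypothetical, not the asker's data): pick one Color per ID/Group by sampling a row, then keep all rows of that combination that share the Color.

```r
set.seed(100)

# Hypothetical toy data with the same columns as in the question
df <- data.frame(
  ID    = c("Bravo", "Bravo", "Bravo", "Bravo", "Charlie", "Charlie", "Charlie"),
  Group = c(1, 1, 1, 1, 1, 1, 2),
  Color = c("yellow", "yellow", "blue", "blue", "red", "green", "red"),
  Score = c(0.65, 0.70, 0.40, 0.30, 0.55, 0.20, 0.60)
)

# One key per ID/Group combination
key <- paste(df$ID, df$Group)

# Sample one row per key and remember its Color
# (a Color with more rows is more likely to be chosen, as with sample_n(1))
chosen <- tapply(df$Color, key, function(col) col[sample.int(length(col), 1)])

# Keep every row whose Color matches the one chosen for its key
result <- df[df$Color == chosen[key], ]
result
```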

How to randomly sample entire group based on multiple grouping conditions

You can use lubridate::floor_date to create groups and then filter one randomly sampled frame per group. You can manually set the interval you need in floor_date, here it's "1 minute".

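library(dplyr)
library(lubridate)

df %>% 
  mutate(datetime = ymd_hms(datetime),
         fl = floor_date(datetime, "1 minute")) %>%
  group_by(uniquename, fl) %>%
  # index into unique(frame): sample(x, 1) on a single number x would draw from 1:x
  filter(frame == unique(frame)[sample.int(n_distinct(frame), 1)])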

output:

# A tibble: 11 × 5
# Groups: uniquename, fl [4]
uniquename frame id datetime fl
<chr> <dbl> <chr> <dttm> <dttm>
1 unique1 2 b1 2021-05-05 07:05:03 2021-05-05 07:05:00
2 unique1 2 b2 2021-05-05 07:05:03 2021-05-05 07:05:00
3 unique1 2 b3 2021-05-05 07:05:03 2021-05-05 07:05:00
4 unique1 3 b2 2021-05-05 07:07:03 2021-05-05 07:07:00
5 unique1 3 b4 2021-05-05 07:07:03 2021-05-05 07:07:00
6 unique2 1 b3 2021-06-06 09:17:25 2021-06-06 09:17:00
7 unique2 1 b4 2021-06-06 09:17:25 2021-06-06 09:17:00
8 unique2 16 b1 2021-06-06 09:20:59 2021-06-06 09:20:00
9 unique2 16 b2 2021-06-06 09:20:59 2021-06-06 09:20:00
10 unique2 16 b3 2021-06-06 09:20:59 2021-06-06 09:20:00
11 unique2 16 b4 2021-06-06 09:20:59 2021-06-06 09:20:00
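
If lubridate is not available, the grouping step can be approximated in base R. The following is only a sketch on made-up rows (the data are hypothetical), using trunc() to floor timestamps to the minute instead of floor_date, and ave() to pick one frame per group:

```r
set.seed(1)

# Hypothetical toy data shaped like the question's data
df <- data.frame(
  uniquename = c("unique1", "unique1", "unique1", "unique1", "unique2", "unique2"),
  frame      = c(1, 1, 2, 2, 1, 1),
  id         = c("b1", "b2", "b1", "b3", "b3", "b4"),
  datetime   = as.POSIXct(c("2021-05-05 07:05:01", "2021-05-05 07:05:01",
                            "2021-05-05 07:05:03", "2021-05-05 07:05:03",
                            "2021-06-06 09:17:25", "2021-06-06 09:17:25"),
                          tz = "UTC")
)

# Floor each timestamp to the start of its minute
df$fl <- as.POSIXct(trunc(df$datetime, units = "mins"))

# One group per uniquename and minute
key <- paste(df$uniquename, format(df$fl))

# Within each group, choose one of the distinct frames and flag its rows with 1
keep <- ave(df$frame, key, FUN = function(f) {
  u <- unique(f)
  pick <- u[sample.int(length(u), 1)]  # safe even when only one frame exists
  as.numeric(f == pick)
})

result <- df[keep == 1, ]
result
```

Each group keeps all rows of exactly one frame, so with this toy data the result always has four rows.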

Sample n rows from a data frame by group using another data frame

We can inner_join() with 'pick_df', group by 'manufacturer' and 'year', and then call sample_n() using the first value of 'pick' in each group as the sample size.

library(dplyr)
library(ggplot2)  # provides the mpg data set
mpg %>%
inner_join(pick_df) %>%
group_by(manufacturer, year) %>%
sample_n(first(pick))
# A tibble: 27 x 12
# Groups: manufacturer, year [5]
# manufacturer model displ year cyl trans drv cty hwy fl class pick
# <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
# 1 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 6
# 2 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize 6
# 3 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 6
# 4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact 6
# 5 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 6
# 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 6
# 7 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact 3
# 8 honda civic 2 2008 4 manual(m6) f 21 29 p subcompact 3
# 9 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact 3
#10 land rover range rover 4.2 2008 8 auto(s6) 4 12 18 r suv 2
# … with 17 more rows
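
The same pattern can be sketched in base R. This example assumes a made-up lookup table and uses mtcars grouped by cyl rather than mpg, so pick_df here is hypothetical and not the one from the question:

```r
set.seed(1)

# Hypothetical lookup table: how many rows to draw for each cyl value
pick_df <- data.frame(cyl = c(4, 8), pick = c(3, 2))

# merge() with default settings acts as an inner join, so cyl == 6 is dropped
merged <- merge(mtcars, pick_df, by = "cyl")

# For each group, draw as many rows as the group's 'pick' value says
parts <- lapply(split(merged, merged$cyl), function(g) {
  g[sample.int(nrow(g), g$pick[1]), ]
})
result <- do.call(rbind, parts)

table(result$cyl)
# 4 8
# 3 2
```

As in the dplyr version, a group with fewer rows than its 'pick' value would raise an error, because sampling is done without replacement.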

Sample n random rows per group in a dataframe

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different sizes fairly easily.
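
A self-contained sketch of that re-use, on a made-up data frame with the same column names (X1 and color) as in the question:

```r
set.seed(1)

# Hypothetical toy data: 4 rows for each of 3 colors
df <- data.frame(X1 = 1:12,
                 color = rep(c("red", "green", "blue"), each = 4))

# Within each color, every row gets a distinct random rank from 1 to group size
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

three_each <- df[rndid <= 3, ]   # 3 random rows per color
two_each   <- df[rndid <= 2, ]   # 2 random rows per color, a subset of the above

table(three_each$color)
```

Because both subsets come from the same rndid vector, the smaller sample is always nested inside the larger one.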

Take random sample by group

Try this:

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
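
Note that sample(nrow(x), 500) stops with an error for any ID that has fewer than 500 rows, since sampling is without replacement. One way around this is to cap the sample size at the group size. A base R sketch on made-up data, with a sample size of 3 instead of 500 and a hypothetical helper named take:

```r
set.seed(1)

# Hypothetical toy data: ID "a" has 5 rows, ID "b" has only 2
df <- data.frame(ID = rep(c("a", "b"), times = c(5, 2)), value = 1:7)

# Draw up to n rows from a data frame, never more than it contains
take <- function(x, n) {
  x[sample.int(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

result <- do.call(rbind, lapply(split(df, df$ID), take, n = 3))

table(result$ID)
# a b
# 3 2
```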

Randomly assign sample into groups in R

Using toString.

df$class <- factor(apply(df[c("City", "Age_Group")], 1, toString))
levels(df$class)
# [1] "City 1, 0-9" "City 1, 10-19" "City 1, 20-29" "City 1, 30-39"
# [5] "City 1, 40-49" "City 1, 50-59" "City 1, 60-69" "City 1, 70-79"
# [9] "City 1, 80-89" "City 1, 90+" "City 10, 0-9" "City 10, 10-19"
# [13] "City 10, 20-29" [...]

To get random samples, you could split the data set by "class" into subsets, say s, and work out how many groups of about 20 individuals each subset needs: divide nrow(s) by 20 and take the ceiling of the (probably non-integer) result, say x. Then exploit R's recycling: bind 1:x to s$class using cbind and let it recycle to nrow(s), where we may safely use suppressWarnings to silence the recycling warning. We only want column [,2] of that matrix, and we shuffle it with sample so the group numbers are assigned at random. Finally, use do.call(rbind, .) to unsplit the data set, and delete the row names if we want.

set.seed(1)  ## for sake of reproducibility
df <- `rownames<-`(do.call(rbind, by(df, df$class, function(s)
transform(s, SAMP=suppressWarnings(
sample(cbind(s$class, SAMP=1:ceiling(nrow(s)/20))[,2])
)))), NULL)

Result:

This yields a "SAMP" column that divides each "class" into approximately equal-sized groups of at most 20 members.

df[60:70, ]  ##example rows
# ID City Age_Group class SAMP
# 60 8766 City 01 0-9 City 01, 0-9 4
# 61 8775 City 01 0-9 City 01, 0-9 1
# 62 9021 City 01 0-9 City 01, 0-9 3
# 63 9041 City 01 0-9 City 01, 0-9 3
# 64 9482 City 01 0-9 City 01, 0-9 1
# 65 9622 City 01 0-9 City 01, 0-9 1
# 66 47 City 01 10-19 City 01, 10-19 4
# 67 698 City 01 10-19 City 01, 10-19 3
# 68 833 City 01 10-19 City 01, 10-19 1
# 69 1166 City 01 10-19 City 01, 10-19 1
# 70 1221 City 01 10-19 City 01, 10-19 2

Check the tables of "SAMP" counts for the first ten classes:

by(df$SAMP, df$class, table)[1:10]
# $`City 01, 0-9`
#
# 1 2 3 4
# 17 16 16 16
#
# $`City 01, 10-19`
#
# 1 2 3 4
# 18 17 17 17
#
# $`City 01, 20-29`
#
# 1 2 3 4
# 18 18 17 17
#
# $`City 01, 30-39`
#
# 1 2 3 4
# 19 19 19 19
#
# $`City 01, 40-49`
#
# 1 2 3 4
# 19 19 19 18
#
# $`City 01, 50-59`
#
# 1 2 3 4 5
# 18 17 17 17 17
#
# $`City 01, 60-69`
#
# 1 2 3 4
# 16 16 16 16
#
# $`City 01, 70-79`
#
# 1 2 3 4
# 19 19 19 19
#
# $`City 01, 80-89`
#
# 1 2 3 4
# 20 19 19 19
#
# $`City 01, 90+`
#
# 1 2 3 4
# 18 17 17 17
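
The recycling step described earlier can also be written with rep_len(), which avoids the cbind and the suppressed warning. This is only a sketch on made-up data; toy and assign_groups are hypothetical names:

```r
set.seed(1)

# Hypothetical toy data: two classes with 65 members each
toy <- data.frame(ID = 1:130, class = rep(c("A", "B"), each = 65))

# Shuffled group numbers for n members, aiming for at most 'size' per group
assign_groups <- function(n, size = 20) {
  k <- ceiling(n / size)            # number of groups needed
  sample(rep_len(seq_len(k), n))    # recycle 1:k to length n, then shuffle
}

# Apply the helper within each class
toy$SAMP <- ave(seq_len(nrow(toy)), toy$class,
                FUN = function(i) assign_groups(length(i)))

table(toy$class, toy$SAMP)
```

With 65 members, ceiling(65 / 20) gives 4 groups, so each class is split into groups of 17, 16, 16 and 16.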

If you want group labels that are unique across the whole data set rather than restarting within each class, just paste "class" (as numeric) and "SAMP" together.

df <- transform(df, SAMP2=paste(as.numeric(class), SAMP, sep="."))
head(df)
# ID City Age_Group class SAMP SAMP2
# 1 193 City 01 0-9 City 01, 0-9 3 1.3
# 2 480 City 01 0-9 City 01, 0-9 1 1.1
# 3 742 City 01 0-9 City 01, 0-9 2 1.2
# 4 757 City 01 0-9 City 01, 0-9 1 1.1
# 5 811 City 01 0-9 City 01, 0-9 3 1.3
# 6 870 City 01 0-9 City 01, 0-9 3 1.3

