Randomly Sample a Percentage of Rows Within a Data Frame

Randomly sample a percentage of rows within a data frame

How about this:

idx <- which(mydf$gender == 'F')
mydf[sample(idx, round(0.2 * length(idx))), ]

Here 0.2 is the 20% sampling fraction and length(idx) is the total number of rows where gender is 'F', so the second argument to sample is the number of rows to draw.
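For readers working in pandas rather than R, the same idea can be sketched as below. This is only an illustration and not part of the original answer: the toy mydf, its score column, and the fixed seed are assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for mydf (an assumption for illustration).
mydf = pd.DataFrame({
    "gender": ["F", "M"] * 10,
    "score": range(20),
})

# Keep only the rows where gender is 'F', then draw 20% of them at random.
females = mydf[mydf["gender"] == "F"]
sampled = females.sample(frac=0.2, random_state=42)

print(len(females), len(sampled))
```

Passing random_state makes the draw reproducible; drop it to get a different sample on each run.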

How to randomly sample two columns based on percentages and assign labels?

You can use sample and set the probability of occurrence of each group with the prob argument.

library(dplyr)

df <- df %>%
  mutate(group = sample(c('group1', 'group2'), n(),
                        replace = TRUE, prob = c(0.3, 0.7)))

Because sample draws each label at random, a df with 100 rows will not necessarily have exactly 70 rows assigned to 'group2'. As the number of rows increases, the observed share gets closer to 70%.

If you want an exact 70%-30% partition, use rep instead.

n <- round(nrow(df) * 0.7)
df <- df %>% mutate(group = sample(rep(c('group1', 'group2'), c(n() - n, n))))
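The difference between the two approaches can also be sketched in plain Python. The function names assign_exact and assign_random are hypothetical helpers introduced here for illustration; they are not taken from the answer above.

```python
import random

def assign_exact(n_rows, share_group2=0.7, seed=None):
    """Return shuffled labels whose group2 share is exact (up to rounding)."""
    rng = random.Random(seed)
    n_group2 = round(n_rows * share_group2)
    # Build the labels with fixed counts, then shuffle their order.
    labels = ["group1"] * (n_rows - n_group2) + ["group2"] * n_group2
    rng.shuffle(labels)
    return labels

def assign_random(n_rows, share_group2=0.7, seed=None):
    """Return labels drawn independently; the share only approximates the target."""
    rng = random.Random(seed)
    return rng.choices(["group1", "group2"],
                       weights=[1 - share_group2, share_group2],
                       k=n_rows)

exact = assign_exact(100, seed=1)
approx = assign_random(100, seed=1)
print(exact.count("group2"))   # fixed by construction
print(approx.count("group2"))  # varies with the seed
```

The first function mirrors the rep-then-shuffle idea, the second mirrors sampling with prob.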

Randomly selecting different percentages of data in Python

Here is one approach: wrap the sampling in a function so that you can reuse it every time.

import pandas as pd 
df = pd.read_csv(r'R_100.csv', encoding='cp1252')

After reading the DataFrame, define a helper:

def frac(dataframe, fraction, other_info=None):
    """Return a random sample containing the given fraction of rows."""
    return dataframe.sample(frac=fraction)

Here other_info is a placeholder (it could be a specific column name, for example) and is not used yet. Call the function as many times as you want:

df_1 = frac(df, 0.3)

It returns a new DataFrame that you can use however you like. Since I infer from your example that you are taking the sum of a column, you could use it like this:

import random

def random_gen():
    """Generate a random fraction between 0 and 1."""
    # random.random() returns a float in [0.0, 1.0), suitable as a sampling fraction
    return random.random()

def print_sum(column_name):
    """Print the sum of a column over a random fraction of the rows."""
    # call random_gen() to get a random fraction
    rand_num = random_gen()

    # pass the number as the fraction parameter to frac()
    df_tmp = frac(df, rand_num)

    print(df_tmp[str(column_name)].sum())

As for this part of your question:

"but I'm not sure how to extend this to select different percentages at the same time."

just change print_sum as follows:

def print_sum(column_name):
    """Return the column sums for 10 random fractions."""
    # list to store all the results
    results = []

    # select a different random fraction on each of 10 iterations;
    # alternatively, keep a list of the fractions you want and loop over that list
    for i in range(10):
        # generate a random fraction
        fracr = random_gen()
        # pass the number as the fraction parameter to frac()
        df_tmp = frac(df, fracr)
        results.append(df_tmp[str(column_name)].sum())

    return results
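The alternative mentioned in the comments, looping over a fixed list of fractions, might look like the sketch below. The function name sums_for_fractions, the demo DataFrame, and the seed handling are assumptions made for illustration.

```python
import pandas as pd

def sums_for_fractions(dataframe, column_name, fractions, seed=None):
    """Return {fraction: sum of the column over a random sample of that size}."""
    results = {}
    for i, fraction in enumerate(fractions):
        # Offset the seed so each sample is drawn separately but reproducibly.
        state = None if seed is None else seed + i
        sample = dataframe.sample(frac=fraction, random_state=state)
        results[fraction] = sample[column_name].sum()
    return results

# Hypothetical data standing in for the CSV file read earlier.
demo = pd.DataFrame({"value": range(1, 101)})
sums = sums_for_fractions(demo, "value", [0.1, 0.3, 0.5, 1.0], seed=0)
print(sums)
```

Returning a dictionary keyed by fraction keeps each result tied to the percentage that produced it.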

Hope this helps! Feedback is much appreciated :)

How to randomly filter rows to achieve desired proportions of a grouping variable

EDITed answer in view of EDIT-3

# desired sample proportions
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
               prop = c(0.70, 0.15, 0.15))

arrange(count(my_df, animal), n) %>%
  left_join(samp, by = "animal") %>%
  mutate(n1 = first(n) / first(prop),
         n = prop * n1) %>%
  select(-prop, -n1) %>%
  right_join(my_df, by = "animal") %>%
  group_split(animal) %>%
  map_df(~sample_n(.x, size = first(n))) %>%
  select(-n)
# A tibble: 1,000 x 2
   animal weight
   <chr>   <int>
 1 cat        19
 2 cat         7
 3 cat        17
 4 cat        11
 5 cat        22
 6 cat         8
 7 cat        22
 8 cat        14
 9 cat        22
10 cat        18
# ... with 990 more rows

Try this out on a different data frame:

set.seed(123)
my_df <-
  data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(1500, 4100, 220))),
             weight = sample(5:25, size = 5820, replace = TRUE))

library(tidyverse)
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
               prop = c(0.70, 0.15, 0.15))

arrange(count(my_df, animal), n) %>%
  left_join(samp, by = "animal") %>%
  mutate(n1 = first(n) / first(prop),
         n = prop * n1) %>%
  select(-prop, -n1) %>%
  right_join(my_df, by = "animal") %>%
  group_split(animal) %>%
  map_df(~sample_n(.x, size = first(n))) %>%
  select(-n) -> sampled

library(janitor)
tabyl(sampled$animal)

 sampled$animal    n   percent
            cat 1026 0.6998636
            dog  220 0.1500682
         rabbit  220 0.1500682
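A pandas sketch of the same downsampling idea follows. It is not a translation of the pipeline above: it picks the limiting group as the minimum of count / proportion rather than relying on sort order, and it rounds group sizes, so counts may differ by one from the R output. The function name sample_to_proportions and the demo data are assumptions.

```python
import pandas as pd

def sample_to_proportions(df, group_col, props, seed=None):
    """Downsample df so that group shares match props as closely as possible."""
    counts = df[group_col].value_counts()
    # Largest total size for which every group can still supply its share.
    total = min(counts[g] / p for g, p in props.items())
    parts = []
    for g, p in props.items():
        # Round to a whole number of rows, never exceeding what the group has.
        size = min(int(counts[g]), int(round(float(total * p))))
        parts.append(df[df[group_col] == g].sample(n=size, random_state=seed))
    return pd.concat(parts)

# Hypothetical data with the same group sizes as the example above.
animals = ["dog"] * 1500 + ["cat"] * 4100 + ["rabbit"] * 220
demo = pd.DataFrame({"animal": animals, "weight": range(len(animals))})

out = sample_to_proportions(
    demo, "animal", {"cat": 0.70, "dog": 0.15, "rabbit": 0.15}, seed=123
)
print(out["animal"].value_counts(normalize=True))
```

Here rabbit is the limiting group, so all of its rows are kept and the other groups are cut down to match.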

Sample n random rows per group in a dataframe

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus you can reuse the rndid vector to create subsets of different sizes fairly easily.
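The random-ID trick carries over to pandas roughly as follows. This is a sketch under assumptions: the toy data, the noise helper Series, and the seed are made up for illustration, with the column names color and X1 borrowed from the R snippet.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a grouping column 'color' and a value column 'X1'.
df = pd.DataFrame({
    "color": ["red", "blue", "green"] * 5,
    "X1": range(15),
})

rng = np.random.default_rng(0)
# Give every row a random number, then rank those numbers within each color.
noise = pd.Series(rng.random(len(df)), index=df.index)
rndid = noise.groupby(df["color"]).rank(method="first")

# Rows ranked 1 to 3 within their group form a random sample of 3 per group,
# returned in the original row order.
picked = df[rndid <= 3]
print(picked)
```

As in the R version, rndid can be reused with a different threshold to get a larger or smaller sample per group.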

Sampling a proportion from a population data frame in R (random sampling in stratified sampling)

You can create a dataframe with category and respective proportions, join it with pop and use sample_n to select rows in each group by its respective proportion.

library(dplyr)

prop_table <- data.frame(category = c('a','b', 'c'), prop = c(0.005, 0.001, 0.2))

pop %>%
  left_join(prop_table, by = 'category') %>%
  group_by(category) %>%
  sample_n(n() * first(prop)) %>%
  ungroup %>%
  select(-prop)

Note that sample_n has been superseded by slice_sample, but slice_sample needs a fixed prop value for each category and does not allow something like first(prop).
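In pandas, a per-category proportion can be applied with an explicit loop over the groups. The sketch below is an illustration only: the function name stratified_sample, the toy pop data, and the choice to truncate fractional sizes toward zero are all assumptions.

```python
import pandas as pd

def stratified_sample(pop, props, group_col="category", seed=None):
    """Sample each category at its own proportion, given as a dict."""
    parts = []
    for category, group in pop.groupby(group_col):
        prop = props.get(category, 0.0)
        # Number of rows for this stratum; fractional sizes are truncated.
        n = int(len(group) * prop)
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

# Hypothetical population with three categories of different sizes.
pop = pd.DataFrame({
    "category": ["a"] * 1000 + ["b"] * 2000 + ["c"] * 50,
    "value": range(3050),
})

sampled = stratified_sample(pop, {"a": 0.005, "b": 0.001, "c": 0.2}, seed=0)
print(sampled["category"].value_counts())
```

Categories missing from the dict get a proportion of zero and contribute no rows.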


