Randomly sample a percentage of rows within a data frame
How about this:
mydf[ sample( which(mydf$gender=='F'), round(0.2*length(which(mydf$gender=='F')))), ]
Here 0.2 is your 20%, and length(which(mydf$gender=='F')) is the total number of rows where gender is F.
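For readers who think in Python, the same idea (collect the indices that match a condition, then draw a fixed share of them without replacement) can be sketched with the standard library alone. The function name sample_matching_rows, the list-of-dicts row layout, and the seed argument are assumptions made for this illustration; they do not come from the original answer.

```python
import random

def sample_matching_rows(rows, predicate, fraction, seed=None):
    """Return a random `fraction` of the rows for which predicate(row) is true."""
    rng = random.Random(seed)
    # indices of matching rows, like which(mydf$gender == 'F')
    matching = [i for i, row in enumerate(rows) if predicate(row)]
    # number of rows to draw, like round(0.2 * length(which(...)))
    k = round(fraction * len(matching))
    # draw k distinct indices without replacement
    chosen = rng.sample(matching, k)
    return [rows[i] for i in chosen]

# hypothetical data: 100 rows, half of them with gender "F"
rows = [{"id": i, "gender": "F" if i % 2 == 0 else "M"} for i in range(100)]
picked = sample_matching_rows(rows, lambda r: r["gender"] == "F", 0.2, seed=1)
print(len(picked))
```

Only rows that satisfy the condition can be drawn, and no row is drawn twice.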
How to randomly sample two columns based on percentages and assign labels?
You can use sample and set each group's probability of occurrence with the prob argument.
library(dplyr)
df <- df %>%
mutate(group = sample(c('group1', 'group2'), n(),
replace = TRUE, prob = c(0.3, 0.7)))
Because sample draws each label at random according to those probabilities, a df with 100 rows will not necessarily get exactly 70 rows assigned to 'group2'. As the number of rows increases, the observed share gets closer to 70%.
If you want an exact 70%-30% partition, use rep instead.
n <- round(nrow(df) * 0.7)
df <- df %>% mutate(group = sample(rep(c('group1', 'group2'), c(n() - n, n))))
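The difference between the two approaches can also be sketched in plain Python. This is only an illustration under assumptions: the helper names assign_probabilistic and assign_exact are made up here, and giving the rounding remainder to the last label is a choice of this sketch.

```python
import random

def assign_probabilistic(n_rows, labels, probs, rng):
    """Draw each label independently; group sizes vary from run to run."""
    return rng.choices(labels, weights=probs, k=n_rows)

def assign_exact(n_rows, labels, fractions, rng):
    """Build a pool with fixed counts per label, then shuffle it."""
    counts = [round(n_rows * f) for f in fractions[:-1]]
    # the last label takes whatever is left so the counts sum to n_rows
    counts.append(n_rows - sum(counts))
    pool = [lab for lab, c in zip(labels, counts) for _ in range(c)]
    rng.shuffle(pool)
    return pool

rng = random.Random(0)
labels = ["group1", "group2"]
loose = assign_probabilistic(100, labels, [0.3, 0.7], rng)
exact = assign_exact(100, labels, [0.3, 0.7], rng)
print(loose.count("group2"), exact.count("group2"))
```

The first count changes with the seed, while the second is fixed at 70 of 100 rows.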
Randomly selecting different percentages of data in Python
Here is something you can do: wrap the sampling in a function so you can reuse it every time.
import pandas as pd
df = pd.read_csv(r'R_100.csv', encoding='cp1252')
After you read the dataframe
def frac(dataframe, fraction, other_info=None):
    """Return a random fraction of the rows of dataframe."""
    return dataframe.sample(frac=fraction)
Here other_info can be a specific column name (it is not used in this version). Then call the function as many times as you want:
df_1 = frac(df, 0.3)
It returns a new dataframe that you can use for anything you want. For example, since I infer from your example that you are taking the sum of a column, you could do something like this:
import random
def random_gen():
    """Generate a random fraction between 0 and 1."""
    # random.randint(0, 1) would only ever return 0 or 1, so use random.random()
    return random.random()
def print_sum(column_name):
    """Print the sum of a column over a random fraction of df."""
    # call random_gen() to get a fraction
    rand_num = random_gen()
    # pass the number as the fraction parameter to frac()
    df_tmp = frac(df, rand_num)
    print(df_tmp[str(column_name)].sum())
Or, to address this part of your question:
"but I'm not sure how to extend this to select different percentages at the same time."
just change print_sum as follows:
def print_sum(column_name):
    """Return the column sums for 10 random fractions."""
    # list to store all the results
    results = []
    # select a different percentage each time: here 10 random fractions,
    # or keep a list of the fractions you want and loop over that list
    for i in range(10):
        # generate a random fraction
        fracr = random_gen()
        # pass the number as the fraction parameter to frac()
        df_tmp = frac(df, fracr)
        results.append(df_tmp[str(column_name)].sum())
    return results
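The comment about looping over a list of the fractions you want can be sketched as follows. To keep it runnable without pandas or the CSV file, this version works on a list of dicts; sample_fraction, sums_for_fractions, and the "value" column are hypothetical names for this illustration, and rounding fraction * len(rows) to get the sample size is an assumption of the sketch.

```python
import random

def sample_fraction(rows, fraction, rng):
    """Return a random fraction of rows, drawn without replacement."""
    k = round(fraction * len(rows))
    return rng.sample(rows, k)

def sums_for_fractions(rows, column, fractions, seed=None):
    """Map each requested fraction to the column sum of one random sample."""
    rng = random.Random(seed)
    return {
        f: sum(r[column] for r in sample_fraction(rows, f, rng))
        for f in fractions
    }

# hypothetical data: 200 rows with value 1, so each sum equals the sample size
rows = [{"value": 1} for _ in range(200)]
result = sums_for_fractions(rows, "value", [0.1, 0.25, 0.5], seed=0)
print(result)
```

Keying the results by fraction makes it easy to tell which sum belongs to which percentage.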
Hope this helps! Feedback is much appreciated :)
How to randomly filter rows to achieve desired proportions of a grouping variable
EDITed answer in view of EDIT-3
#desired sample sizes
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
prop = c(0.70, 0.15, 0.15))
arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
mutate(n1 = first(n)/first(prop),
n = prop * n1) %>% select(-prop, -n1) %>%
right_join(my_df, by = "animal") %>%
group_split(animal) %>%
map_df(~sample_n(.x, size = first(n))) %>%
select(-n)
# A tibble: 1,000 x 2
animal weight
<chr> <int>
1 cat 19
2 cat 7
3 cat 17
4 cat 11
5 cat 22
6 cat 8
7 cat 22
8 cat 14
9 cat 22
10 cat 18
# ... with 990 more rows
Try this out on a different df:
set.seed(123)
my_df <-
data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(1500, 4100, 220))),
weight = sample(5:25, size = 5820, replace = TRUE))
library(tidyverse)
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
prop = c(0.70, 0.15, 0.15))
arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
mutate(n1 = first(n)/first(prop),
n = prop * n1) %>% select(-prop, -n1) %>%
right_join(my_df, by = "animal") %>%
group_split(animal) %>%
map_df(~sample_n(.x, size = first(n))) %>%
select(-n) -> sampled
library(janitor)
tabyl(sampled$animal)
sampled$animal n percent
cat 1026 0.6998636
dog 220 0.1500682
rabbit 220 0.1500682
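The sizing logic behind that pipeline can be sketched in Python. Note the differences, which are choices of this sketch and not part of the answer: it picks the limiting group as the one with the smallest count-to-weight ratio (the R code uses the group with the smallest count), it takes integer weights such as 70/15/15 to avoid floating-point surprises, and it rounds sizes down. The names target_sizes and downsample are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from fractions import Fraction

def target_sizes(counts, weights):
    """Largest per-group sample sizes that respect the integer weights."""
    # the group with the smallest count/weight ratio limits the total
    scale = min(Fraction(counts[g], weights[g]) for g in weights)
    return {g: math.floor(weights[g] * scale) for g in weights}

def downsample(rows, key, weights, seed=None):
    """Sample rows per group so group shares match the weights."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    counts = {g: len(groups[g]) for g in weights}
    sampled = []
    for g, size in target_sizes(counts, weights).items():
        sampled.extend(rng.sample(groups[g], size))
    return sampled

# same group counts as the my_df example above
rows = [{"animal": a} for a, n in [("dog", 1500), ("cat", 4100), ("rabbit", 220)]
        for _ in range(n)]
weights = {"cat": 70, "dog": 15, "rabbit": 15}
sampled = downsample(rows, "animal", weights, seed=123)
print(Counter(r["animal"] for r in sampled))
```

With these counts the rabbit group is the limiting one, so all 220 rabbits are kept and the other groups are sampled down to match.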
Sample n random rows per group in a dataframe
You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.
rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]
This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different lengths fairly easily.
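The random-ID-per-group idea can be sketched in Python as well. This is an illustration only: random_rank_within_group and the colors list are hypothetical, and the ranks start at 1 to mirror sample.int.

```python
import random
from collections import defaultdict

def random_rank_within_group(group_labels, rng):
    """Give each element a random rank 1..size within its own group."""
    positions = defaultdict(list)
    for i, g in enumerate(group_labels):
        positions[g].append(i)
    ranks = [0] * len(group_labels)
    for idx in positions.values():
        # a random permutation of 1..len(idx), like sample.int(length(x))
        perm = list(range(1, len(idx) + 1))
        rng.shuffle(perm)
        for i, r in zip(idx, perm):
            ranks[i] = r
    return ranks

colors = ["red", "blue", "red", "green", "blue",
          "red", "red", "blue", "green", "blue"]
ranks = random_rank_within_group(colors, random.Random(42))

# keep up to 3 rows per group; indices stay in their original order
keep3 = [i for i, r in enumerate(ranks) if r <= 3]
# the same ranks can be reused for a smaller subset
keep2 = [i for i, r in enumerate(ranks) if r <= 2]
print(keep3, keep2)
```

Because the ranks are computed once, the 2-per-group selection is always a subset of the 3-per-group selection.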
Sampling a proportion from a population data frame in R (random sampling in stratified sampling)
You can create a dataframe with each category and its respective proportion, join it with pop, and use sample_n to select rows in each group according to its proportion.
library(dplyr)
prop_table <- data.frame(category = c('a','b', 'c'), prop = c(0.005, 0.001, 0.2))
pop %>%
left_join(prop_table, by = 'category') %>%
group_by(category) %>%
sample_n(n() * first(prop)) %>%
ungroup %>%
select(-prop)
Note that sample_n has been superseded by slice_sample, but slice_sample needs a fixed prop value for each category and does not allow using something like first(prop).
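A minimal Python sketch of the same stratified draw, with a different proportion per category, might look like this. The function name stratified_sample and the example population are hypothetical, and truncating the computed size to a whole number is an assumption of this sketch.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, proportions, seed=None):
    """Sample each category at its own proportion, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for category, members in strata.items():
        # categories missing from `proportions` contribute no rows
        size = int(len(members) * proportions.get(category, 0))
        sample.extend(rng.sample(members, size))
    return sample

# hypothetical population: 1000 rows of 'a', 2000 of 'b', 50 of 'c'
pop = [{"category": c} for c, n in [("a", 1000), ("b", 2000), ("c", 50)]
       for _ in range(n)]
proportions = {"a": 0.005, "b": 0.001, "c": 0.2}
picked = stratified_sample(pop, "category", proportions, seed=7)
print(Counter(r["category"] for r in picked))
```

Each category is sampled independently, so a small category can be sampled at a much higher rate than a large one.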